Validating the results

import os
from glob import glob
from collections import Counter
from typing import List, Dict, Tuple
import numpy as np
import pandas as pd
from fastcore.basics import patch
from fastcore.foundation import L
from sleepstagingidal.data import *
from sleepstagingidal.dataa import *
from sleepstagingidal.dataa import swap_dict
from sleepstagingidal.feature_extraction import *
import matplotlib.pyplot as plt
import mne
import yasa
from rich.progress import track
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier
path_files = glob(os.path.join(path_data, "*.edf"))
= ["C3", "C4", "A1", "A2", "O1", "O2", "LOC", "ROC", "LAT1", "LAT2", "ECGL", "ECGR", "CHIN1", "CHIN2"] channels
Patient-Fold
Before trying a lot of different configurations for the models or different feature extraction techniques, it’s crucial to set up a reliable way of knowing how these changes affect our results. Because of that, we’re going to lay out the foundation of our validation pipeline: the Patient-Fold.
By analogy with traditional K-Fold, we are going to split all the recordings we have and, iteratively, train on some of them while testing on a different set. This way of performing cross-validation will give us a good estimate of the inter-patient generalization capability of the model.
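The idea can be sketched with scikit-learn’s KFold applied at the recording level, so that no patient contributes epochs to both the training and the test side of a fold (the file names below are hypothetical placeholders):

```python
from sklearn.model_selection import KFold

# Hypothetical list of .edf recordings, one per patient
path_files = ["patient_0.edf", "patient_1.edf", "patient_2.edf", "patient_3.edf"]

# Split at the recording level, not the epoch level, so the test
# patients are completely unseen during training
kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(path_files):
    train_files = [path_files[i] for i in train_idx]
    test_files = [path_files[i] for i in test_idx]
    # train on epochs from train_files, evaluate on epochs from test_files
```

Splitting file paths rather than epochs is what makes this a patient-level estimate: shuffling epochs directly would leak samples from the same night into both sides.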
from sklearn.model_selection import KFold
PatientFold
PatientFold (path_files:List[str], n_splits:int, random_state:int)
Manager to perform the so-called PatientFold.
| | Type | Details |
|---|---|---|
| path_files | List | Path to the .edf files we want to use. |
| n_splits | int | Number of folds to use. |
| random_state | int | Random seed for reproducibility. |
Loading and preprocessing the raw .edf files takes quite a lot of time, so it is very convenient to separate that part from the cross-validation itself. Keep in mind that we can do this without exhausting the server’s memory because the loaded files read their data lazily. The best way to ensure that the loading and preprocessing are done only once is to use a property:
PatientFold.patients
PatientFold.patients ()
Ensures that the .edf files are only loaded and preprocessed once.
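A minimal sketch of this caching pattern, using functools.cached_property; the class name and the load_count attribute below are illustrative, not the actual implementation:

```python
from functools import cached_property

class PatientFoldSketch:
    """Sketch of caching expensive .edf loading behind a property."""

    def __init__(self, path_files):
        self.path_files = path_files
        self.load_count = 0  # only here to demonstrate single loading

    @cached_property
    def patients(self):
        # The real class would load each file lazily (e.g. with mne);
        # here we just simulate the expensive step once.
        self.load_count += 1
        return [f"loaded:{p}" for p in self.path_files]

pf = PatientFoldSketch(["a.edf", "b.edf"])
_ = pf.patients
_ = pf.patients  # second access reuses the cached result
```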
We know that different recordings may have different encodings for the same sleep stage, so we should unify them before joining data from different recordings. The easiest way to do it is to turn them into their human-readable representation, and then encode all of them together to ensure that every recording is encoded in the same way.
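A minimal sketch of that unification step, assuming hypothetical per-recording integer codes and mappings to human-readable stage names, with a single LabelEncoder fitted on the union of all labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical annotations: two recordings use different integer codes
# for the same stages, so we first map each to human-readable names.
rec_a_codes = [0, 1, 2]
rec_a_mapping = {0: "Sleep stage W", 1: "Sleep stage N1", 2: "Sleep stage N2"}
rec_b_codes = [5, 7]
rec_b_mapping = {5: "Sleep stage W", 7: "Sleep stage N3"}

labels_a = [rec_a_mapping[c] for c in rec_a_codes]
labels_b = [rec_b_mapping[c] for c in rec_b_codes]

# Fit one encoder on the union so every recording shares the same
# integer encoding, regardless of its original codes.
le = LabelEncoder().fit(labels_a + labels_b)
y_a = le.transform(labels_a)
y_b = le.transform(labels_b)
```

Because the encoder sees every stage name up front, "Sleep stage W" maps to the same integer in both recordings.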
And finally, we can write a simple function to build the appropriate input data and its labels from a set of loaded patients:
We want the process to be as streamlined as possible, so we can implement a .fit() method to quickly perform the Patient-Fold with any estimator:
from sklearn.preprocessing import LabelEncoder
PatientFold.fit
PatientFold.fit (estimator, **kwargs)
Performs the cross-validation loop by training the estimator on the different folds and returns the results.
| | Details |
|---|---|
| estimator | Any object implementing a .fit() method to be cross-validated. Must not be instantiated. |
| kwargs | |
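A sketch of what such a cross-validation loop might look like, with toy random data standing in for the extracted features; the helper name and the data below are illustrative, not the actual implementation. Note how the estimator class is passed uninstantiated and constructed fresh for every fold:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Toy stand-in: one (features, labels) pair per recording; in the real
# pipeline these would come from the loaded .edf epochs.
rng = np.random.default_rng(0)
recordings = [(rng.normal(size=(20, 4)), rng.integers(0, 3, size=20))
              for _ in range(4)]

def patient_fold_fit(estimator_cls, recordings, n_splits, **kwargs):
    """Train a fresh estimator per fold of recordings, return test scores."""
    scores = []
    kf = KFold(n_splits=n_splits)
    for train_idx, test_idx in kf.split(np.arange(len(recordings))):
        X_tr = np.vstack([recordings[i][0] for i in train_idx])
        y_tr = np.concatenate([recordings[i][1] for i in train_idx])
        X_te = np.vstack([recordings[i][0] for i in test_idx])
        y_te = np.concatenate([recordings[i][1] for i in test_idx])
        model = estimator_cls(**kwargs)  # estimator passed uninstantiated
        model.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return scores

scores = patient_fold_fit(RandomForestClassifier, recordings,
                          n_splits=4, n_estimators=10)
```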
pf = PatientFold(path_files=path_files[:2],
                 n_splits=len(path_files[:2]),
                 random_state=42)
pf.fit(RandomForestClassifier)
Using data from preloaded Raw for 765 events and 3000 original time points ...
1 bad epochs dropped
Using data from preloaded Raw for 719 events and 3000 original time points ...
1 bad epochs dropped
Using data from preloaded Raw for 719 events and 3000 original time points ...
1 bad epochs dropped
Using data from preloaded Raw for 765 events and 3000 original time points ...
1 bad epochs dropped
ValueError: y contains previously unseen labels: 'Sleep stage N3'
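The error above is what happens when labels are encoded per fold: if the encoder is fitted only on the training recordings and a test recording contains a stage absent from them, transform raises. A minimal reproduction with illustrative labels:

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["Sleep stage W", "Sleep stage N1", "Sleep stage N2"]
test_labels = ["Sleep stage W", "Sleep stage N3"]  # N3 never seen in training

le = LabelEncoder().fit(train_labels)
raised = False
try:
    le.transform(test_labels)  # raises ValueError: unseen label
except ValueError:
    raised = True
```

This is exactly why the labels from all recordings should be turned into their human-readable form and encoded together, as described above, rather than fold by fold.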