Validating the results

Often overlooked, preparing a good validation pipeline is crucial to getting a good model.
import os
from glob import glob
from collections import Counter
from typing import List, Dict, Tuple

import numpy as np
import pandas as pd
from fastcore.basics import patch
from fastcore.foundation import L

from sleepstagingidal.data import *
from sleepstagingidal.dataa import *
from sleepstagingidal.dataa import swap_dict
from sleepstagingidal.feature_extraction import *
import matplotlib.pyplot as plt
import mne
import yasa
from rich.progress import track

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier
path_files = glob(os.path.join(path_data, "*.edf"))
channels = ["C3", "C4", "A1", "A2", "O1", "O2", "LOC", "ROC", "LAT1", "LAT2", "ECGL", "ECGR", "CHIN1", "CHIN2"]

Patient-Fold

Before trying a lot of different configurations for the models or different feature extraction techniques, it’s crucial to set up a truthful way of knowing how these changes affect our results. Because of that, we’re going to lay out the foundation of our validation pipeline: the Patient-Fold.

By analogy with traditional K-Fold, we are going to split the recordings we have and, iteratively, train on some of them while testing on a different set. Performing cross-validation this way gives us a good estimate of the model’s inter-patient generalization capability.
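The core idea can be sketched with scikit-learn’s `KFold` applied to the list of recording paths rather than to individual epochs (the file names below are illustrative):

```python
from sklearn.model_selection import KFold

# Illustrative list of recordings; in practice these are the .edf paths.
path_files = [f"patient_{i}.edf" for i in range(4)]

kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(path_files):
    train_files = [path_files[i] for i in train_idx]
    test_files = [path_files[i] for i in test_idx]
    # Every recording ends up in the test set exactly once, so no patient
    # contributes epochs to both the train and the test split of a fold.
    print(train_files, "->", test_files)
```

Splitting at the recording level is what makes the estimate "inter-patient": a patient never appears on both sides of a fold.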

from sklearn.model_selection import KFold

source

PatientFold

 PatientFold (path_files:List[str], n_splits:int, random_state:int)

Manager to perform the so-called PatientFold.

             Type  Details
path_files   List  Paths to the .edf files we want to use.
n_splits     int   Number of folds to use.
random_state int   Random seed for reproducibility.

Loading and preprocessing the raw .edf files takes quite a lot of time, so it is very convenient to separate that part from the cross-validation itself. Keep in mind that we can do this without exhausting the server’s memory, because the loaded files read their data lazily. The best way to ensure that the loading and preprocessing is done only once is to use a property:


source

PatientFold.patients

 PatientFold.patients ()

Ensures that the .edf files are only loaded and preprocessed once.
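The caching pattern behind such a property can be sketched as follows; the attribute names and the `_load` placeholder are illustrative, not the actual implementation:

```python
class PatientFold:
    def __init__(self, path_files):
        self.path_files = path_files
        self._patients = None  # cache, filled on first access

    @property
    def patients(self):
        # Load and preprocess only on the first access; later accesses
        # return the cached list, so the expensive step runs once.
        if self._patients is None:
            self._patients = [self._load(p) for p in self.path_files]
        return self._patients

    def _load(self, path):
        # Stand-in for the real mne.io.read_raw_edf + preprocessing.
        print(f"loading {path}")
        return path

pf = PatientFold(["a.edf", "b.edf"])
pf.patients  # prints "loading ..." for each file
pf.patients  # cached: prints nothing
```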

We know that different recordings may use different encodings for the same sleep stage, so we should unify them before joining data from different recordings. The easiest way to do it is to turn them into their human-readable representation, and then encode all of them together so that every recording ends up with the same encoding.
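A minimal sketch of that idea, with made-up per-recording mappings for illustration:

```python
from sklearn.preprocessing import LabelEncoder

# Two recordings that encode the same stages with different integers.
mapping_a = {0: "Sleep stage W", 1: "Sleep stage N2"}
mapping_b = {5: "Sleep stage W", 2: "Sleep stage N2"}

labels_a = [0, 1, 1]
labels_b = [5, 2, 5]

# Turn both into their human-readable names first...
names = [mapping_a[l] for l in labels_a] + [mapping_b[l] for l in labels_b]

# ...then encode everything together so the codes are consistent.
le = LabelEncoder().fit(names)
y = le.transform(names)
```

After this step, "Sleep stage W" maps to the same integer no matter which recording it came from.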

And finally, we can write a simple function that builds the appropriate input data and its labels from a set of loaded patients:
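Such a helper might look like the following sketch, where `extract_features` and `build_X_y` are hypothetical names standing in for the real per-epoch feature extractor:

```python
import numpy as np

def extract_features(epoch):
    # Hypothetical per-epoch features: mean and std of each channel.
    return np.concatenate([epoch.mean(axis=1), epoch.std(axis=1)])

def build_X_y(patients):
    """Stack features and labels from a list of (epochs, labels) pairs."""
    X, y = [], []
    for epochs, labels in patients:
        for epoch, label in zip(epochs, labels):
            X.append(extract_features(epoch))
            y.append(label)
    return np.stack(X), np.array(y)

# Two fake patients with 3 epochs of 2 channels x 10 samples each.
rng = np.random.default_rng(0)
patients = [(rng.normal(size=(3, 2, 10)), ["W", "N2", "N2"]) for _ in range(2)]
X, y = build_X_y(patients)
print(X.shape, y.shape)  # (6, 4) (6,)
```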

We want the process to be as streamlined as possible, so we can implement a .fit() method to quickly perform the Patient-Fold with any estimator:

from sklearn.preprocessing import LabelEncoder

source

PatientFold.fit

 PatientFold.fit (estimator, **kwargs)

Performs the cross-validation loop by training the estimator on the different folds and returns the results.

Details
estimator Any class implementing a .fit() method to be cross-validated. Must be passed uninstantiated (the class itself, not an instance).
kwargs
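A simplified sketch of what such a loop might do (the real method also handles loading, feature extraction, and label encoding; `patient_fold_fit` and the fake data below are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def patient_fold_fit(estimator, data_per_patient, n_splits, random_state=42, **kwargs):
    """Cross-validate `estimator`, holding out whole patients per fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, test_idx in kf.split(data_per_patient):
        X_tr = np.concatenate([data_per_patient[i][0] for i in train_idx])
        y_tr = np.concatenate([data_per_patient[i][1] for i in train_idx])
        X_te = np.concatenate([data_per_patient[i][0] for i in test_idx])
        y_te = np.concatenate([data_per_patient[i][1] for i in test_idx])
        model = estimator(**kwargs)  # a fresh model is instantiated per fold
        model.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return scores

# Three fake patients, each with 6 epochs of 4 features.
rng = np.random.default_rng(0)
data = [(rng.normal(size=(6, 4)), rng.integers(0, 2, size=6)) for _ in range(3)]
scores = patient_fold_fit(RandomForestClassifier, data, n_splits=3, n_estimators=10)
```

Passing the class rather than an instance is what lets the loop re-instantiate a clean model for every fold.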
pf = PatientFold(path_files=path_files[:2],
                 n_splits=len(path_files[:2]),
                 random_state=42)
pf.fit(RandomForestClassifier)


Using data from preloaded Raw for 765 events and 3000 original time points ...
1 bad epochs dropped


Using data from preloaded Raw for 719 events and 3000 original time points ...
1 bad epochs dropped


Using data from preloaded Raw for 719 events and 3000 original time points ...
1 bad epochs dropped


Using data from preloaded Raw for 765 events and 3000 original time points ...
1 bad epochs dropped


ValueError: y contains previously unseen labels: 'Sleep stage N3'
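The traceback above is exactly the pitfall discussed earlier: if the label encoder is fitted only on the labels seen during training, a stage that appears only in a held-out patient (here 'Sleep stage N3') breaks the transform. A small sketch of the failure and its remedy, with illustrative labels:

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["Sleep stage W", "Sleep stage N2"]
test_labels = ["Sleep stage W", "Sleep stage N3"]  # N3 unseen in training

le = LabelEncoder().fit(train_labels)
try:
    le.transform(test_labels)
except ValueError as e:
    print(e)  # complains about the previously unseen label

# Remedy: fit the encoder on the labels from *all* recordings up front.
le_all = LabelEncoder().fit(train_labels + test_labels)
codes = le_all.transform(test_labels)
```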