```python
import os
from glob import glob
from collections import Counter
from typing import List, Dict, Tuple
from rich.progress import track
import numpy as np
import pandas as pd
import mne
import yasa
from sleepstagingidal.data import *
from sleepstagingidal.dataa import *
from sleepstagingidal.dataa import swap_dict
```
# Feature extraction

## Amplitude-independent features
In our case, we have downsampled our data and have 3000 data points per epoch (a 30 s fragment of the recording), but this is still a huge number of features to feed into a model. Going even further, we have the problem of calibration and normalization, which is aggravated when working with medical data (each hospital may work slightly differently). Because of this, we are going to employ features that are independent of the amplitude of the signal: mainly frequency-related features.
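To see why frequency-based features sidestep the calibration problem, note that *relative* band power (power in a band divided by total power) is unchanged when the whole signal is rescaled. A minimal sketch using `scipy.signal.welch` on a synthetic signal (the `relative_bandpower` helper is illustrative, not part of the project):

```python
import numpy as np
from scipy.signal import welch

def relative_bandpower(signal, sf, bands):
    """Relative power per band: band power divided by total power."""
    freqs, psd = welch(signal, sf, nperseg=sf * 4)
    total = psd.sum()
    return np.array([psd[(freqs >= lo) & (freqs < hi)].sum() / total
                     for lo, hi in bands])

sf = 100
t = np.arange(30 * sf) / sf  # one 30 s epoch
rng = np.random.default_rng(0)
# 2 Hz (delta) + weaker 10 Hz (alpha) component plus noise
x = np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 10 * t) \
    + 0.1 * rng.standard_normal(t.size)

bands = [(0.5, 4), (4, 8), (8, 12)]      # delta, theta, alpha
p1 = relative_bandpower(x, sf, bands)
p2 = relative_bandpower(5 * x, sf, bands)  # same signal, 5x amplitude
print(np.allclose(p1, p2))  # True: relative powers ignore amplitude scaling
```

Scaling the signal multiplies every PSD bin by the same constant, so the ratios are identical, which is exactly the invariance we want across differently calibrated recordings.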
We can use the library `yasa` to perform a basic feature extraction:
```python
path_files = glob(os.path.join(path_data, "*.edf"))
channels = ["C3", "C4", "A1", "A2", "O1", "O2", "LOC", "ROC", "LAT1", "LAT2", "ECGL", "ECGR", "CHIN1", "CHIN2"]
```
Before extracting the features, we are going to perform downsampling and apply a bandpass filter to keep only the low frequencies:
### read_clean_edf

> read_clean_edf (path, resample:int=None, bandpass:Tuple[float,float]=None, verbose:bool=False)

Loads and (potentially) cleans an `.edf` file.

|  | Type | Default | Details |
|---|---|---|---|
| path |  |  | Path to an `.edf` file. |
| resample | int | None | Frequency (Hz) we want to resample to. If `None`, no resampling is applied. |
| bandpass | Tuple | None | Tuple `(min_freq, max_freq)` for the bandpass filter. If `None`, no bandpass is applied. |
| verbose | bool | False | Quantity of information to be shown. |
| **Returns** | **RawEDF** |  | **Raw object with the corresponding cleaning.** |
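`read_clean_edf` belongs to the project's own codebase (presumably wrapping MNE's filtering and resampling); as a rough standalone sketch of the same two cleaning steps, here is a hypothetical `clean_signal` helper built on `scipy`:

```python
import numpy as np
from math import gcd
from scipy.signal import butter, sosfiltfilt, resample_poly

def clean_signal(x, sf_orig, sf_new=100, band=(0.3, 49.0)):
    """Bandpass-filter, then downsample a 1-D signal.
    Filtering first keeps the content below the new Nyquist
    frequency, avoiding aliasing during the rate change."""
    sos = butter(4, band, btype="bandpass", fs=sf_orig, output="sos")
    x = sosfiltfilt(sos, x)  # zero-phase filtering
    g = gcd(int(sf_new), int(sf_orig))
    return resample_poly(x, int(sf_new) // g, int(sf_orig) // g)

sf_orig = 200
x = np.random.default_rng(0).standard_normal(60 * sf_orig)  # 60 s of noise
y = clean_signal(x, sf_orig)
print(y.shape)  # 60 s at 100 Hz -> (6000,)
```

The order matters: bandpassing to 0.3–49 Hz before resampling to 100 Hz guarantees nothing above the new Nyquist frequency (50 Hz) folds back into the band of interest.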
```python
raw = read_clean_edf(path_files[0], resample=100, bandpass=(0.3, 49))
raw
```
Once it’s done, we can use the function `yasa.bandpower()` to extract frequency-related features from the data. Keep in mind that this function expects the data to be fed in epoch form, so we have to transform it first:
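The project's `get_epochs` works on an MNE `Raw` object (and, as the output below shows, also attaches sleep-stage events and drops bad epochs). For a plain array, the core epoching step is just a reshape; `to_epochs` here is a simplified, hypothetical stand-in:

```python
import numpy as np

def to_epochs(data, sf, epoch_len=30):
    """Split continuous data [channels, samples] into non-overlapping
    epochs of shape [epochs, channels, samples_per_epoch]."""
    spe = int(sf * epoch_len)           # samples per epoch
    n_epochs = data.shape[1] // spe     # drop the trailing partial epoch
    data = data[:, : n_epochs * spe]
    return data.reshape(data.shape[0], n_epochs, spe).swapaxes(0, 1)

sf = 100
data = np.random.default_rng(0).standard_normal((14, 719 * 30 * sf + 123))
epochs_arr = to_epochs(data, sf)
print(epochs_arr.shape)  # (719, 14, 3000): 719 epochs of 30 s at 100 Hz
```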
```python
epochs, sr = get_epochs(raw, channels=channels, verbose=False)
epochs
```
```
Using data from preloaded Raw for 719 events and 3000 original time points ...
1 bad epochs dropped
```

| Number of events | 718 |
|---|---|
| Events | Sleep stage N1: 29, Sleep stage N2: 323, Sleep stage W: 366 |
| Time range | 0.000 – 29.990 sec |
| Baseline | off |
### calculate_bandpower

> calculate_bandpower (epochs, sf=100)

Extracts the bandpower per epoch and returns it as an array ready to be fed into a model.

|  | Type | Default | Details |
|---|---|---|---|
| epochs |  |  | Epochs object or 3D array [Epochs, Channels, Data]. |
| sf | int | 100 | Sampling frequency of the data. |
| **Returns** | **array** |  | **Numpy array of shape [Epochs, 6*Channels] representing 6 bands.** |
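The six bands in the output shape match `yasa.bandpower()`'s defaults (delta, theta, alpha, sigma, beta, gamma). As a rough sketch of how such a `[Epochs, 6*Channels]` matrix can be assembled (using `scipy.signal.welch` directly; the exact numbers and flattening order of the project's `calculate_bandpower` may differ):

```python
import numpy as np
from scipy.signal import welch

# yasa's six default bands (Hz): delta, theta, alpha, sigma, beta, gamma
BANDS = [(0.5, 4), (4, 8), (8, 12), (12, 16), (16, 30), (30, 40)]

def bandpower_features(epochs, sf=100):
    """Relative power per band, per channel, per epoch,
    flattened to shape [n_epochs, n_bands * n_channels]."""
    n_epochs, n_channels, n_samples = epochs.shape
    freqs, psd = welch(epochs, sf, nperseg=min(4 * sf, n_samples), axis=-1)
    total = psd.sum(axis=-1)
    feats = np.empty((n_epochs, n_channels, len(BANDS)))
    for i, (lo, hi) in enumerate(BANDS):
        mask = (freqs >= lo) & (freqs < hi)
        feats[..., i] = psd[..., mask].sum(axis=-1) / total
    return feats.reshape(n_epochs, -1)

epochs_arr = np.random.default_rng(0).standard_normal((32, 14, 3000))
X = bandpower_features(epochs_arr, sf=100)
print(X.shape)  # (32, 84): 6 bands x 14 channels per epoch
```

With 14 channels this yields 6 × 14 = 84 features per epoch, matching the shape reported below.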
```python
bandpowers = calculate_bandpower(epochs, sf=sr)
```

```
CPU times: user 4.34 s, sys: 0 ns, total: 4.34 s
Wall time: 4.34 s
```
If we check the shape of the produced array, we find that we have greatly reduced the dimensionality of the problem. But are we still able to classify the different sleep stages with these features?
```python
bandpowers.shape
```

```
(718, 84)
```
## Simple model
To check the usability of these features, we will train a very simple model, performing a random split within the same recording. As we are building our project sequentially, we want to make sure that everything we do is usable as a building block for what comes next.
First of all, we can extract the labels from the `epochs` object we have already created:
```python
labels = epochs.events[:, -1]
labels.shape
```

```
(718,)
```
Next, we can randomly split our data and train a simple default random forest:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(bandpowers, labels, test_size=0.2, random_state=42)

assert X_train.shape[0] == Y_train.shape[0]
assert X_test.shape[0] == Y_test.shape[0]
assert X_train.shape[1] == X_test.shape[1]
```
```python
model = RandomForestClassifier(random_state=42)
model.fit(X_train, Y_train)
```

```
CPU times: user 423 ms, sys: 3.37 ms, total: 426 ms
Wall time: 444 ms
RandomForestClassifier(random_state=42)
```
```python
model.score(X_train, Y_train), model.score(X_test, Y_test)
```

```
(1.0, 0.5972222222222222)
```
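Is a test accuracy of ~0.60 actually informative? A quick sanity check is to compare it against the majority-class baseline implied by the label counts reported in the epochs summary above (366 "W" epochs out of 718):

```python
# Label counts from the Epochs summary above
counts = {"W": 366, "N2": 323, "N1": 29}
n_total = sum(counts.values())
baseline = max(counts.values()) / n_total  # always predict the majority class
print(f"majority-class baseline: {baseline:.3f}")  # ~0.510
```

The model clearly beats always predicting "W", although the gap between train (1.0) and test (~0.60) accuracy also shows the default random forest is overfitting heavily.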
Having performed this simple check, we know that the extracted features hold at least some useful information that can be used to predict the different sleep stages. We can continue adding complexity to our pipeline.