Feature extraction

Extracting meaningful features from the data is always important, but it plays a key role when the data is of very high dimensionality.
import os
from glob import glob
from collections import Counter
from typing import List, Dict, Tuple

from rich.progress import track
import numpy as np
import pandas as pd
import mne
import yasa

from sleepstagingidal.data import *
from sleepstagingidal.dataa import *
from sleepstagingidal.dataa import swap_dict

Amplitude-independent features

In our case, we have downsampled our data and have 3000 data points per epoch (a 30 s fragment of the recording), but this is still a huge number of features to feed into a model. On top of that, we have the problem of calibration and normalization, which is aggravated when working with medical data (each hospital may work slightly differently). Because of this, we are going to employ features that are independent of the amplitude of the signal: mainly frequency-related features.
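As a minimal illustration of why relative (frequency-band) power sidesteps amplitude differences between recordings, here is a sketch using scipy; the synthetic signal and band edges are made up for the example:

```python
import numpy as np
from scipy.signal import welch

# Hypothetical 30 s epoch at 100 Hz: a 10 Hz (alpha) plus 2 Hz (delta) mix.
sf = 100
t = np.arange(30 * sf) / sf
epoch = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 2 * t)

def relative_bandpower(x, sf, band, total=(0.5, 40)):
    """Power in `band` divided by power in `total`: unitless, amplitude-independent."""
    freqs, psd = welch(x, sf, nperseg=4 * sf)
    in_band = (freqs >= band[0]) & (freqs < band[1])
    in_total = (freqs >= total[0]) & (freqs < total[1])
    return psd[in_band].sum() / psd[in_total].sum()

alpha = relative_bandpower(epoch, sf, (8, 12))
alpha_scaled = relative_bandpower(10 * epoch, sf, (8, 12))  # 10x the amplitude
# Rescaling the signal leaves the relative power untouched.
assert np.isclose(alpha, alpha_scaled)
```

Multiplying the signal by any gain scales the whole spectrum by the same factor, so the ratio cancels it out.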

We can use the yasa library to perform a basic feature extraction:

path_files = glob(os.path.join(path_data, "*.edf"))
channels = ["C3", "C4", "A1", "A2", "O1", "O2", "LOC", "ROC", "LAT1", "LAT2", "ECGL", "ECGR", "CHIN1", "CHIN2"]

Before extracting the features, we are going to perform downsampling and a bandpass filter to keep only the low frequencies:


source

read_clean_edf

 read_clean_edf (path, resample:int=None,
                 bandpass:Tuple[float,float]=None, verbose:bool=False)

Loads and (potentially) cleans an .edf file.

Type Default Details
path Path to an .edf file.
resample int None Frequency (Hz) we want to resample to. If None, no resampling is applied.
bandpass Tuple None Tuple (min_freq, max_freq) for the bandpass filter. If None, no bandpass is applied.
verbose bool False Quantity of information to be shown.
Returns RawEDF Raw object with the corresponding cleaning applied.
raw = read_clean_edf(path_files[0], resample=100, bandpass=(0.3, 49))
raw

Once that’s done, we can use the function yasa.bandpower() to extract frequency-related features from the data. Keep in mind that this function expects the data in epoch form, so we have to transform it first:

epochs, sr = get_epochs(raw, channels=channels, verbose=False)
epochs
Using data from preloaded Raw for 719 events and 3000 original time points ...
1 bad epochs dropped
Number of events 718
Events Sleep stage N1: 29
Sleep stage N2: 323
Sleep stage W: 366
Time range 0.000 – 29.990 sec
Baseline off

source

calculate_bandpower

 calculate_bandpower (epochs, sf=100)

Extracts the bandpower per epoch and returns it as an array ready to be fed into a model.

Type Default Details
epochs Epochs object or 3D array [Epochs, Channels, Data].
sf int 100 Sampling frequency of the data.
Returns array Numpy array of shape [Epochs, 6*Channels] representing 6 bands.
bandpowers = calculate_bandpower(epochs, sf=sr)
CPU times: user 4.34 s, sys: 0 ns, total: 4.34 s
Wall time: 4.34 s

If we check the shape of the resulting array, we see that we have considerably reduced the dimensionality of the problem (from 14 channels × 3000 samples = 42,000 values per epoch down to 84 features), but are we still able to classify the different sleep stages with these features?

bandpowers.shape
(718, 84)

Simple model

To check the usefulness of these features we will train a very simple model, performing a random split within the same recording. As we are building our project sequentially, we want to make sure that each step is usable as a building block for the next.

First of all, we can extract the labels from the epochs object we have already created:

labels = epochs.events[:,-1]
labels.shape
(718,)
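Before splitting, it is worth noting how imbalanced the stages are (29 N1 vs. 323 N2 vs. 366 W, per the epochs log above). A quick check with numpy, using hypothetical placeholder codes for the actual event ids:

```python
import numpy as np

# Hypothetical event codes mirroring the counts reported when the epochs
# were created: 29 x N1, 323 x N2, 366 x W.
labels = np.array([0] * 29 + [1] * 323 + [2] * 366)
values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # {0: 29, 1: 323, 2: 366}
```

With so few N1 epochs, a stratified split (train_test_split's stratify argument) would be a reasonable refinement.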

Next, we can randomly split our data and train a simple default random forest:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(bandpowers, labels, test_size=0.2, random_state=42)

assert X_train.shape[0] == Y_train.shape[0]
assert X_test.shape[0] == Y_test.shape[0]
assert X_train.shape[1] == X_test.shape[1]
model = RandomForestClassifier(random_state=42)
model.fit(X_train, Y_train)
CPU times: user 423 ms, sys: 3.37 ms, total: 426 ms
Wall time: 444 ms
RandomForestClassifier(random_state=42)
model.score(X_train, Y_train), model.score(X_test, Y_test)
(1.0, 0.5972222222222222)
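As a sanity check, assuming the class counts above, a majority-class predictor (always "W") would score about 0.51 on this recording, so the ~0.60 test accuracy is a modest but real improvement:

```python
import numpy as np

counts = np.array([29, 323, 366])  # N1, N2, W epochs in this recording
majority_baseline = counts.max() / counts.sum()
print(round(majority_baseline, 3))  # 0.51
```

The perfect train score, on the other hand, suggests the untuned random forest is overfitting the 718 epochs.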

Having performed this simple check, we know that the extracted features hold at least some useful information for predicting the different sleep stages. We can continue adding complexity to our pipeline.