Reducing Commercial Aviation Fatalities — A novel case of using physiological data for real time awareness detection

Mar 30, 2021

Overview:

Shortly before midnight on December 29, 1972, Eastern Air Lines Flight 401, scheduled to fly from New York to Miami, crashed into the Florida Everglades, causing 101 fatalities. The crash occurred while the entire cockpit crew was preoccupied with a landing gear light malfunction and failed to notice that the plane was no longer on autopilot. This incident shows how important it is for a pilot to stay alert to everything that is going on with the plane.

Business Problem:

Pilots in commercial airlines have to manage several important tasks at once, like communication among crew or radio communication, programming the flight monitoring system, dealing with equipment malfunction and responding to abnormal situations. Even though they are well trained, it is human nature to get distracted or panic under stress. However, if we could have a system that alerts the rest of the crew when there is some trouble with the cognitive state of the pilot, it would help prevent many accidents that happen due to human error.

To solve this problem, Booz Allen Hamilton designed a Kaggle challenge in which the participants were asked to build a model that detects troubling events in the cockpit from the aircrew’s physiological data. The model must also be able to detect these events in real time, in order to prevent accidents and save lives.

Machine Learning formulation:

The given challenge is a classification task where the cognitive state of the pilot has to be classified as one of the following:

Channelised Attention (CA): This is the state when the pilot is completely focused on one task, with no attention being paid to any other task.
Diverted Attention (DA): This is the state when the pilot is being distracted by a secondary task that requires some kind of decision making.
Startled/Surprised (SS): This is the condition where the pilot is in a state of shock or panic when confronted with an abnormal situation.
Baseline: This is the condition where the pilot is not experiencing any of the above conditions and is in a normal resting state.

The training dataset consists of physiological data obtained after subjecting the pilots to three different types of experiments in a controlled environment, outside a flight simulator. The three types of experiments are denoted as follows:

CA: To simulate channelised attention, the pilots were made to play an engaging puzzle-based video game and the induced response was used as a benchmark.
DA: To induce diverted attention, the pilots were made to monitor a display continuously and, from time to time, solve a math problem.
SS: To induce the startled/surprised state, the pilots were made to watch movie clips with jump scares in them.

During an experiment, a pilot experienced either the baseline state or the state corresponding to the experiment. The test dataset consists of physiological data obtained during a full flight in a flight simulator. This experiment is denoted as LOFT (Line Oriented Flight Training). During LOFT, the pilot could experience any of the states, but not more than one at a time. The goal was to predict the probability of the pilot belonging to a state at each instance of time.

Data Fields:

id(int): A unique identifier for crew + time combination. This is present only in the test dataset
crew(int): Primary key to identify a pair of pilots
seat(int): Indicator of whether the pilot is in left seat (0) or right seat (1)
experiment(str): The type of experiment the pilots are subjected to. One of CA, DA, SS or LOFT.
time(float): Time elapsed since the experiment started, in seconds

The following fields beginning with eeg are electroencephalogram readings. They measure the electrical activity at different points on the brain.

eeg_fp1 (float)
eeg_f7 (float)
eeg_f8 (float)
eeg_t4 (float)
eeg_t6 (float)
eeg_t5 (float)
eeg_t3 (float)
eeg_fp2 (float)
eeg_o1 (float)
eeg_p3 (float)
eeg_pz (float)
eeg_f3 (float)
eeg_fz (float)
eeg_f4 (float)
eeg_c4 (float)
eeg_p4 (float)
eeg_poz (float)
eeg_c3 (float)
eeg_cz (float)
eeg_o2 (float)

These readings correspond to the location of 20 electrodes on the scalp during EEG and are named according to the international 10–20 system. Their location on the scalp is as given in the figure below:

The remaining data fields are:

ecg (float): 3-point Electrocardiogram signal. It is a measure of the voltage of the electrical activity of the heart, using electrodes placed on the skin. The sensor had a resolution of .012215 µV and a range of -100mV to +100mV. The data is provided in microvolts.
r (float): Respiration, a measure of the rise and fall of the chest, recorded with a pneumograph. The sensor had a resolution of .2384186 µV and a range of -2.0V to +2.0V. The data is provided in microvolts.
gsr (float): Galvanic Skin Response, a measure of electrical activity on the skin. The sensor had a resolution of .2384186 µV and a range of -2.0V to +2.0V. The data is provided in microvolts.
event (str): The state of the pilot, indicated by one of ‘A’ (Baseline state), ‘B’ (Startled/Surprised state), ‘C’ (Channelised Attention state) or ‘D’ (Diverted Attention state). This is the target field and is present only in the training dataset.

The signals from all the sensors were sampled at 256 Hz. The data also contains noise and artefacts

Performance metric:

This is a multi-class classification problem, with the possibility of class imbalance, as the events may not be uniformly distributed across time. Hence the metric used here is multi-class log loss:

log loss = -(1/N) · Σᵢ Σⱼ yᵢⱼ · log(pᵢⱼ), with i = 1, …, N and j = 1, …, M

where N is the total number of samples in the data, M is the number of classes, yᵢⱼ is 1 if the i’th sample belongs to the j’th class and 0 otherwise, and pᵢⱼ is the predicted probability that the i’th sample belongs to the j’th class.
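To make the metric concrete, here is a minimal NumPy sketch of multi-class log loss. The function name, the clipping constant and the toy probabilities are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Multi-class log loss for integer labels and an (N, M) probability matrix."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    # Renormalise rows so they still sum to 1 after clipping.
    proba = proba / proba.sum(axis=1, keepdims=True)
    n = proba.shape[0]
    # y_ij is 1 only for the true class, so the double sum reduces to the
    # log-probability assigned to the true class of each sample.
    return -np.log(proba[np.arange(n), np.asarray(y_true)]).mean()

# Toy example: 3 samples, 4 classes (A, B, C, D encoded as 0..3).
y_true = [0, 2, 3]
proba = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.20, 0.50, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
loss = multiclass_log_loss(y_true, proba)
```

A model that always predicts 0.25 for each of the four classes scores log(4) ≈ 1.386, which is a useful reference point when reading leaderboard scores.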

Along with multi-class log loss, we can measure the performance of the model by checking the confusion matrix, along with precision and recall scores for each class. We need a high recall for the classes other than ‘A’ (baseline state), since missing a troubling event (‘B’, ‘C’ or ‘D’) is costlier than falsely reporting the baseline state as one of the troubling states.
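The per-class checks can be sketched with scikit-learn's metrics. The labels and predictions here are made-up toy values, used only to show how the confusion matrix, precision and recall are read per class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

labels = ["A", "B", "C", "D"]
# Toy ground truth and predictions (not taken from the competition data).
y_true = ["A", "A", "A", "A", "B", "B", "C", "C", "D", "D"]
y_pred = ["A", "A", "A", "C", "B", "A", "C", "C", "D", "A"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# average=None returns one score per class instead of a single aggregate.
recall = recall_score(y_true, y_pred, labels=labels, average=None)
precision = precision_score(y_true, y_pred, labels=labels, average=None)
```

In this toy case the recall for ‘B’ and ‘D’ is only 0.5 because one event of each was predicted as ‘A’, which is exactly the kind of miss the recall requirement is meant to expose.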

Exploratory Data Analysis:

Let’s look at the distribution of occurrence of each event for all the pilots in the train data. We use countplot for this

The above plot clearly shows that event ‘A’ (baseline state) has the highest number of occurrences (more than 2.5 million), whereas event ‘B’ (startled/surprised state) has the fewest (less than 0.5 million). This shows that there is a clear imbalance amongst classes in the given train data. This is along expected lines, since a pilot spends most of the time in the baseline state and troubling events are comparatively rare.

Univariate Analysis:

The boxplots for the ‘ecg’, ‘r’ and ‘gsr’ signals against the events show that the values of these signals are distributed similarly across events (even though gsr shows some variation, it is not enough to separate classes by value alone).

Next, we look at the Phi-K correlation coefficient between all other features and the target variable

ECG, R and GSR are weakly correlated with the event, and none of the features is strongly correlated with it.

Multivariate analysis:

We’ll use the t-distributed stochastic neighbour embedding (t-SNE) technique to represent the multi-dimensional data points in a two-dimensional plot.
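The projection step can be sketched with scikit-learn's TSNE. The random matrix stands in for a subsample of the 23 raw signal columns, and the perplexity value is an assumption, since the settings used for the original plot are not stated.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for a random subsample of the raw features
# (20 EEG channels + ecg + r + gsr = 23 columns).
X_sample = rng.normal(size=(150, 23))

# Project to two dimensions; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X_sample)
```

Each row of `embedding` can then be scattered on a plane and coloured by its event label. t-SNE is slow on millions of rows, so working on a subsample is the practical choice.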

In the two-dimensional plot obtained by applying t-SNE to the raw features, no class forms a distinct cluster.

After both univariate and multivariate analysis, we can conclude that the given raw features are not enough to perform classification, and we need new features along with them for better results.

Feature Engineering:

Before we generate new features, we should know that every unique combination of ‘crew’, ‘seat’ and ‘experiment’ corresponds to an instance of a pilot in an experiment. In the given train data, samples belonging to different instances may be interleaved. In order to separate the samples belonging to different instances, we’ll create a new feature called ‘pilot’ as follows:

First, we label-encode the categorical feature ‘experiment’ to convert it to a numerical feature.

Next, we create a new column called ‘pilot’ as a numerical combination of ‘crew’, ‘seat’ and ‘experiment’. Every unique value of ‘pilot’ corresponds to an instance of a pilot in an experiment.

The samples corresponding to each value of ‘pilot’ are then isolated, and new features are generated for those samples.
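A minimal pandas sketch of these three steps follows. The exact formula used to combine the keys is not given in the post, so the multipliers here are an assumption chosen only so that distinct (crew, experiment, seat) triples never collide. The toy frame is likewise illustrative.

```python
import pandas as pd

# Tiny stand-in for the training frame; only the key columns matter here.
df = pd.DataFrame({
    "crew": [1, 1, 1, 2, 2, 1],
    "seat": [0, 1, 0, 0, 1, 0],
    "experiment": ["CA", "CA", "DA", "SS", "SS", "CA"],
    "time": [0.012, 0.012, 0.020, 0.004, 0.004, 0.008],
})

# Step 1: label-encode 'experiment' (alphabetical order: CA=0, DA=1, SS=2).
codes = {name: i for i, name in enumerate(sorted(df["experiment"].unique()))}
df["experiment_code"] = df["experiment"].map(codes)

# Step 2: combine the three keys into one integer (assumed formula).
df["pilot"] = df["crew"] * 100 + df["experiment_code"] * 10 + df["seat"]

# Step 3: isolate each instance and put its samples in time order.
sessions = {
    pilot: group.sort_values("time").reset_index(drop=True)
    for pilot, group in df.groupby("pilot")
}
```

Sorting each group by ‘time’ matters because the signal-processing steps that follow assume consecutive rows are consecutive samples of one recording.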

1. Domain-specific features:

Heart rate:

A single heartbeat is characterised by the following features in the ECG signal: the P wave, which represents the depolarisation of the atria; the QRS complex, which represents the depolarisation of the ventricles; and the T wave, which represents the repolarisation of the ventricles of the heart. The QRS complex is characterised by an R-peak between Q and S.

Heart rate (number of beats per minute) is known to be useful in determining the cognitive state of a person. Heart rate can be determined from the ECG signal by identifying the R-peaks and measuring the interval between two consecutive R-peaks. For this purpose, we make use of the biosppy library, which provides R-peak detection algorithms such as Christov segmentation and Engzee segmentation, followed by R-peak correction. The library also filters the signal internally, which removes noise and other artefacts. The heart rates are obtained only at a small subset of data points in an experiment, so these values are extended to the entire duration of the experiment using cubic interpolation.
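The step from R-peaks to a per-sample heart rate can be sketched as follows. This is not biosppy's internal code: the R-peak indices are taken as given (for example, from a peak detector), and the function name and the choice to hold the edge values outside the first and last beat are assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

FS = 256  # sampling rate of the dataset, in Hz

def heart_rate_per_sample(rpeak_idx, n_samples, fs=FS):
    """Per-sample heart rate (bpm) from R-peak sample indices."""
    rpeak_idx = np.asarray(rpeak_idx)
    rr_seconds = np.diff(rpeak_idx) / fs   # R-R intervals in seconds
    bpm = 60.0 / rr_seconds                # instantaneous heart rate
    anchor = rpeak_idx[1:]                 # each value sits at the later peak
    spline = CubicSpline(anchor, bpm)      # cubic interpolation between beats
    # Clip the query positions so samples before the first anchor and after
    # the last one reuse the edge values instead of extrapolating.
    query = np.clip(np.arange(n_samples), anchor[0], anchor[-1])
    return spline(query)

# Synthetic check: one beat per second should give 60 bpm everywhere.
rpeaks = np.arange(0, 8 * FS, FS)
hr = heart_rate_per_sample(rpeaks, n_samples=8 * FS)
```

Holding the edge values avoids the wild swings a cubic polynomial can produce when it is extrapolated beyond the data.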

Respiratory rate:

The respiratory rate of a person is known to increase during moments of panic or under stress. From the given pneumographic signal ‘r’, which is a measure of the rise and fall of the chest, we can extract the respiratory rate of the pilot. We again use the biosppy library and the same interpolation technique as for heart rate.

Galvanic Skin Response:

The GSR signal can be divided into skin conductance level (SCL) and skin conductance response (SCR). SCL is a slowly changing baseline signal (f < 0.2 Hz) that is not related to the stimulus, whereas SCR (f < 0.5 Hz) occurs either after a stimulus or during normal regulatory activity of the sympathetic nervous system.

SCR is composed of a rise zone and a decay zone. The instant at which the stimulus is presented and the signal starts to rise is referred to here as the offset (t₀), although it is more commonly called the onset, and the instant at which the signal reaches a local maximum is known as the peak (tₘₐₓ). The peak and offset points of the GSR signal can be determined using first and second-order differentials of the filtered GSR signal. The amplitude of the GSR signal is the local maximum value of the signal relative to the value at offset.

The biosppy library is used to obtain offsets, peaks and amplitude of the GSR signal. The offset and peak points in a GSR signal can be used to add new features ‘last offset’ and ‘last peak’ to measure the time since those events occurred.
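The ‘last offset’ and ‘last peak’ features can be sketched as a time-since-last-event computation. The function name and the choice of NaN for samples that precede the first event are assumptions. The event indices would come from the GSR analysis described above.

```python
import numpy as np

def time_since_last_event(time, event_idx):
    """For every sample, seconds elapsed since the most recent event.

    Samples that precede the first event get NaN.
    """
    time = np.asarray(time, dtype=float)
    event_idx = np.sort(np.asarray(event_idx))
    # For each sample i, count how many events occurred at or before i.
    n_before = np.searchsorted(event_idx, np.arange(len(time)), side="right")
    out = np.full(len(time), np.nan)
    seen = n_before > 0
    out[seen] = time[seen] - time[event_idx[n_before[seen] - 1]]
    return out

time = np.arange(10) / 2.0  # 0.0, 0.5, ..., 4.5 seconds
last_peak = time_since_last_event(time, event_idx=[2, 6])
```

The NaN values at the start can later be handled by the same median imputation that is applied to the other features.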

Power spectral analysis of EEG:

The electroencephalogram (EEG) readings come from 20 different electrodes on the scalp. Power spectrum analysis treats each EEG signal as a linear combination of simple oscillations, each at a specific frequency, and decomposes the signal into these frequency components to obtain the magnitude of each. The frequencies are grouped into bands as follows:

Delta band (0.2 Hz to 4 Hz): The delta band is prominent mainly during deep sleep in healthy adults and in newborns. When the delta band stands out in an awake, healthy person, the cause is in most cases eye blinks or large body movements during the measurement. The frequency range of the artefacts caused by these movements is almost identical to that of delta waves, so it may appear as if the delta wave has increased. Since eye and body movements are unavoidable when an EEG experiment runs for a long time, the power spectrum of delta waves is usually not considered as an analysis factor.
Theta band (4 Hz to 8 Hz): Theta band has been reported to be related to many different conditions such as memory, superpower, creativity, concentration, and anxiety
Alpha band (8 Hz to 13 Hz): The alpha band usually becomes prominent in a relaxed state, and its amplitude increases as the person becomes more stable and comfortable. In particular, a stable alpha wave appears when a person closes their eyes and is at rest. When the person opens their eyes, looks at objects, or becomes emotionally excited, the alpha wave is suppressed.
Beta band (13 Hz to 25 Hz): Beta band becomes prominent when the person is awake and doing all conscious activities, such as speaking. In particular, it may appear predominantly in anxiety, tension, and complicated calculations.
Gamma band (25 Hz to 40 Hz): Gamma band becomes predominant when the person is more emotionally irritated or is involved in advanced cognitive information processing, such as reasoning and judgement.

As mentioned, we are not using the delta band since it is distorted by eye movements and body movements. The power distribution of each EEG signal in the remaining frequency bands is obtained using the biosppy library. A window size of 40 seconds and an overlap of 0.99375 were found to be the optimal parameters for power spectral density analysis on the given train data. If the overlap is read as a fraction of the window, consecutive windows are shifted by 0.25 seconds, which is 64 samples at 256 Hz.
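As an illustration of band power, here is a sketch that uses SciPy's Welch estimator on a single stretch of signal. It is a stand-in for the biosppy routine, not a reproduction of it. The function name, the 4-second window and the 50% overlap are assumptions chosen to keep the example small, and the input is a synthetic 10 Hz sine with noise.

```python
import numpy as np
from scipy.signal import welch

FS = 256  # sampling rate of the dataset, in Hz
# Band edges as listed above; the delta band is deliberately left out.
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 25), "gamma": (25, 40)}

def band_powers(signal, fs=FS, window_sec=4.0, overlap=0.5):
    """Power in each EEG band, integrated from a Welch PSD estimate."""
    nperseg = int(window_sec * fs)
    noverlap = int(nperseg * overlap)
    freqs, psd = welch(signal, fs=fs, nperseg=nperseg, noverlap=noverlap)
    df = freqs[1] - freqs[0]  # frequency resolution in Hz
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = psd[mask].sum() * df  # rectangle-rule integral
    return powers

# Synthetic "EEG": a 10 Hz oscillation plus a little white noise.
t = np.arange(0, 20, 1 / FS)
rng = np.random.default_rng(0)
eeg = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)
powers = band_powers(eeg)
```

Because the test signal oscillates at 10 Hz, almost all of its power lands in the alpha band. To get one value per time step, as in the article, the same computation would be repeated over a sliding window.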

2. Generic features:

Apart from the domain-specific features above, we add features that are generic to any time-series data. These are:

First-order differential for heart rate and respiratory rate
First-order and second-order differentials for GSR signals.
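A sketch of the differential features with pandas follows. Whether the original code used simple differences or another gradient estimate is not stated, so plain sample-to-sample differences are assumed here. The ‘pilot’ column is the session key introduced earlier, and the numbers are toy values.

```python
import pandas as pd

df = pd.DataFrame({
    "pilot": [100, 100, 100, 100, 101, 101, 101],
    "gsr":   [1.0, 2.0, 4.0, 7.0, 5.0, 5.0, 3.0],
})

# Differences are taken within each 'pilot' group so that the last sample of
# one session is never subtracted from the first sample of the next.
d1 = df.groupby("pilot")["gsr"].diff()    # first-order differential
d2 = d1.groupby(df["pilot"]).diff()       # second-order differential

# The first one or two rows of each session have no predecessor; fill with 0.
df["gsr_d1"] = d1.fillna(0.0)
df["gsr_d2"] = d2.fillna(0.0)
```

The same pattern applies to the heart rate and respiratory rate columns for their first-order differentials.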

Along with the above features, ‘crew’ is one-hot encoded since it is a categorical feature.

Feature importance:

The importance of the top 25 features among all the features used, obtained using a Random Forest Classifier model, is as follows:

The domain-specific features seem to be more important for classification than the generic features. The amplitude of the Galvanic Skin Response signal seems to be the most important feature in determining the cognitive state of the pilot.

Preprocessing:

The given train data from Kaggle is split into train and cross-validation data in the ratio of 75:25. The numerical features, except the one-hot encoded features and ‘seat’, are standardised using sklearn’s StandardScaler. Any null values present in the data are imputed with the median value of the respective feature.
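These preprocessing steps can be sketched with scikit-learn. The stratified split and fitting the imputer and scaler on the train part only are assumptions, as the post does not say how these were done. The toy matrix contains only columns that would be scaled.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X[rng.random(X.shape) < 0.05] = np.nan   # sprinkle some missing values
y = rng.integers(0, 4, size=200)          # four classes encoded as 0..3

# 75:25 split into train and cross-validation data.
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit on the train part only, then apply the same transform to the CV part,
# so no statistics leak from the cross-validation data.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_train_p = scaler.fit_transform(imputer.fit_transform(X_train))
X_cv_p = scaler.transform(imputer.transform(X_cv))
```

In the real pipeline, the one-hot encoded ‘crew’ columns and ‘seat’ would bypass the scaler and be joined back afterwards.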

Modelling:

Approach 1:

The newly generated features are used to train different machine learning models. All the models were tuned for the best hyperparameters using cross-validation data. The models were adjusted for class weights since the given data is imbalanced.
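One common way to adjust for class weights is scikit-learn's "balanced" heuristic, sketched below on toy labels. The post does not state which weighting scheme was used, so this particular choice is an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy label vector with 'A' as the dominant class.
y = np.array(["A"] * 6 + ["B"] * 1 + ["C"] * 2 + ["D"] * 1)
classes = np.unique(y)

# "balanced" weight = n_samples / (n_classes * class_count),
# so rare classes receive proportionally larger weights.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = {str(c): float(w) for c, w in zip(classes, weights)}
```

A dictionary like this can be passed through the `class_weight` parameter of scikit-learn classifiers such as RandomForestClassifier, which also accept the string "balanced" directly.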

These trained models were then used to predict the probability of the pilot being in each of the four cognitive states for each instance in the test data. These probabilities were submitted to Kaggle for each model and the results were obtained as shown below:

The Random Forest Classifier model gave the best private score, whereas the LightGBM Classifier model gave the best public score. Overall, Random Forest Classifier gave a better result than any other model.

Approach 2:

After trying generic models, a custom stacking classifier was built and trained, with the following design:

The whole data was split into train and cross-validation set in the ratio of 80:20
The train data was split into D1 and D2 in the ratio 50:50. On D1, sampling with replacement was performed to obtain d1,d2,d3,…dk (k samples). k different base models were trained with the k samples.
D2 was passed to these k base models and k different predictions were obtained for each point in D2.
A new dataset was created with k predictions from base models for D2 as input and actual output values for D2 as target values. A meta-classifier was trained with this dataset.
Model evaluation was done using cross-validation data, by feeding it as input to each base model, obtaining k predictions, and then feeding those predictions as input to the meta classifier to obtain final predicted probabilities. Using these final predicted probabilities and the actual target values for the cross-validation set, we can calculate the multi-class log loss score to evaluate the model
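The design above can be sketched in scikit-learn as follows. The synthetic data, the value of K, the choice of shallow decision trees as base models and logistic regression as meta-classifier, and the use of predicted probabilities (rather than class labels) as meta-features are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

K = 5  # number of bootstrap samples / base models (hypothetical value)
rng = np.random.default_rng(0)

# Synthetic 4-class problem standing in for the engineered features.
X, y = make_classification(n_samples=1200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# 80:20 split into train and cross-validation sets.
X_tr, X_cv, y_tr, y_cv = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# 50:50 split of the train set into D1 and D2.
X_d1, X_d2, y_d1, y_d2 = train_test_split(
    X_tr, y_tr, test_size=0.5, stratify=y_tr, random_state=0)

# Train K base models, each on a bootstrap sample (with replacement) of D1.
base_models = []
for _ in range(K):
    idx = rng.integers(0, len(X_d1), size=len(X_d1))
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    base_models.append(model.fit(X_d1[idx], y_d1[idx]))

def meta_features(models, X):
    """Stack the class probabilities of every base model side by side."""
    return np.hstack([m.predict_proba(X) for m in models])

# The meta-classifier learns from base-model predictions on D2.
meta = LogisticRegression(max_iter=1000)
meta.fit(meta_features(base_models, X_d2), y_d2)

# Evaluate on the cross-validation set with multi-class log loss.
cv_proba = meta.predict_proba(meta_features(base_models, X_cv))
cv_loss = log_loss(y_cv, cv_proba, labels=meta.classes_)
```

Training the meta-classifier on D2 rather than D1 is the key design choice: the base models have never seen D2, so their predictions there resemble what they will produce on unseen data.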

For the base models, combinations of models like Logistic Regression, Naive Bayes, Random Forest Classifier, XGBoost Classifier, etc. were tried and the best combination was selected based on the cross-validation score.

However, it was found that the stacking classifier which we built did not improve the private and public score on Kaggle for the test dataset.

Final submission:

The Random Forest Classifier was finally selected as the best model, considering both the public score and the private score.

Conclusion:

On the given data, both the bagging model (Random Forest Classifier) and the boosting model (LightGBM Classifier) perform better than the stacking model.
With a good amount of feature engineering, a simple ensemble model like Random Forest Classifier can give good results.

Future work:

Data balancing can be done using the Borderline SMOTE technique to generate samples belonging to the startled/surprised (SS) state and the diverted attention (DA) state. This balanced dataset can be used to train the models mentioned above.
For Electroencephalogram(EEG) signals, Phase Locking Factor (PLF) features can be generated and used along with power bands features.

You can find the complete code for this problem here. If you have any suggestions, you can contact me via LinkedIn

Thank you for taking the time to read!