Reducing Commercial Aviation Fatalities — A novel case of using physiological data for real time awareness detection

Table of contents:

  1. Overview


Shortly before midnight on December 29, 1972, Eastern Air Lines Flight 401, scheduled to fly from New York to Miami, crashed into the Florida Everglades, causing 101 fatalities. The crash occurred while the entire cockpit crew was preoccupied with a malfunctioning landing gear indicator light and failed to notice that the plane was no longer on autopilot. The incident shows how important it is for a pilot to stay alert to everything that is going on with the plane.

Business Problem:

Pilots in commercial airlines have to manage several important tasks at once, like communication among crew or radio communication, programming the flight monitoring system, dealing with equipment malfunction and responding to abnormal situations. Even though they are well trained, it is human nature to get distracted or panic under stress. However, if we could have a system that alerts the rest of the crew when there is some trouble with the cognitive state of the pilot, it would help prevent many accidents that happen due to human error.

To solve this problem, Booz Allen Hamilton designed a Kaggle challenge in which participants were asked to build a model that detects troubling events in the cockpit from the aircrew’s physiological data. The model must also be able to detect these events in real time, in order to prevent accidents and save lives.

Machine Learning formulation:

The given challenge is a classification task where the cognitive state of the pilot has to be classified as one of the following:

  • Channelised Attention (CA): This is the state when the pilot is completely focused on one task, with no attention being paid to any other task

The training dataset consists of physiological data obtained after subjecting the pilots to three different types of experiments in a controlled environment, outside a flight simulator. The three types of experiments are denoted as follows:

  • CA: To simulate channelised attention, the pilots were made to play an engaging puzzle-based video game and the induced response was used as a benchmark

During an experiment, a pilot experienced either the baseline state or the state corresponding to that experiment. The test dataset consists of physiological data obtained during a full flight in a flight simulator. This experiment is denoted as LOFT (Line Oriented Flight Training). During LOFT, the pilot could experience any of the states, but not more than one at a time. The goal is to predict, for each instant of time, the probability of the pilot being in each state.

Data Fields:

  • id (int): A unique identifier for a crew + time combination. This is present only in the test dataset

The following fields beginning with eeg are electroencephalogram readings. They measure the electrical activity at different points on the brain.

  • eeg_fp1 (float)

These readings correspond to the location of 20 electrodes on the scalp during EEG and are named according to the international 10–20 system. Their location on the scalp is as given in the figure below:

EEG electrode placement

The remaining data fields are:

  • ecg (float) : 3-point Electrocardiogram signal. It is a measure of the voltage of the electrical activity of the heart, using electrodes placed on the skin. The resolution of the electrodes is .012215 µV and the range is -100mV to +100mV. The readings are provided in microvolts

The signals from all the sensors were sampled at 256 Hz. The data also contains noise and artefacts.

Performance metric:

This is a multi-class classification problem in which class imbalance is likely, since the events need not be uniformly distributed across time. The metric used here is the multi-class log loss:

  log loss = −(1/N) Σᵢ₌₁ᴺ Σⱼ₌₁ᴹ yᵢⱼ log(pᵢⱼ)

where N is the total number of samples in data and M is the number of classes, yᵢⱼ is 1 if i’th sample belongs to j’th class, else 0. pᵢⱼ is the probability that i’th sample belongs to j’th class.
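The metric can be computed directly from its definition. The snippet below is a minimal sketch using made-up labels and probabilities; the function name and the clipping constant `eps` are our own choices for illustration, not part of the competition code.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class indices (0..M-1), one per sample.
    probs:  list of per-sample probability lists, each of length M.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Only the true class contributes, because y_ij is 0 elsewhere.
        # Clip the probability so that log(0) never occurs.
        p = min(max(row[label], eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)

# Hypothetical predictions for three samples and four classes.
y_true = [0, 2, 1]
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.20, 0.50, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
loss = multiclass_log_loss(y_true, probs)
```

A confident wrong prediction is penalised much more heavily than an uncertain one, which is why a model that outputs well-calibrated probabilities scores better on this metric.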

Along with multi-class log loss, we can assess the model using the confusion matrix and the precision and recall scores for each class. We need a high recall for the classes other than ‘A’ (baseline state), since missing a troubling event (‘B’, ‘C’ or ‘D’) is costlier than falsely reporting the baseline state as one of the troubling states.
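Per-class recall is simply the fraction of samples of a class that the model labels correctly. Here is a small sketch with hypothetical true and predicted labels; the helper name is ours.

```python
from collections import Counter

def per_class_recall(y_true, y_pred, classes):
    """Recall per class: correctly predicted samples / actual samples of that class."""
    actual = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / actual[c] for c in classes if actual[c] > 0}

# Hypothetical labels: 'A' is the baseline, 'B', 'C' and 'D' are troubling events.
y_true = ["A", "A", "A", "B", "B", "C", "D", "D"]
y_pred = ["A", "A", "B", "B", "A", "C", "D", "A"]
recall = per_class_recall(y_true, y_pred, classes="ABCD")
```

In this toy example half of the ‘B’ and ‘D’ events are reported as baseline, which is exactly the kind of error we want to keep low.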

Exploratory Data Analysis:

Let’s look at how often each event occurs across all the pilots in the train data. We use a countplot for this.

The above plot clearly shows that event ‘A’ (baseline state) has the highest number of occurrences (more than 2.5 million), whereas event ‘B’ (startled/surprised state) has the fewest (less than 0.5 million). The classes in the train data are therefore clearly imbalanced. This is expected, since troubling events are rare compared to the pilot’s normal, aware state.

Univariate Analysis:

The boxplots of the ‘ecg’, ‘r’ and ‘gsr’ signals against the events show that the values of these signals are distributed similarly across events (gsr shows some variation, but not enough to separate the classes by value alone).

Next, we look at the Phi-K correlation coefficient between all other features and the target variable

ECG, R and GSR are weakly correlated with the event, and none of the features is strongly correlated with it.

Multivariate analysis:

We’ll use the t-distributed stochastic neighbour embedding (t-SNE) technique to represent the multi-dimensional data points in a two-dimensional plot.

After applying t-SNE to the raw features, the two-dimensional plot shows no cluster that belongs to a single class.

After both univariate and multivariate analysis, we can conclude that the given raw features are not enough to perform classification, and we need new features along with them for better results.

Feature Engineering:

Before we generate new features, note that every unique combination of ‘crew’, ‘seat’ and ‘experiment’ corresponds to one instance of a pilot in an experiment. In the train data, samples belonging to different instances may be interleaved. To separate them, we create a new feature called ‘pilot’ as follows:

First, we label-encode the categorical feature ‘experiment’ to convert it into a numerical feature.

Next, we create a new column called ‘pilot’ as a numerical combination of ‘crew’, ‘seat’ and ‘experiment’. Every unique value of ‘pilot’ corresponds to one instance of a pilot in an experiment.

The samples corresponding to each value of ‘pilot’ are then isolated, and new features are generated for those samples.
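The steps above can be sketched as follows. The rows are made up, and the exact formula used to combine the three columns is an assumption on our part: any encoding works as long as it gives each (crew, seat, experiment) combination its own value.

```python
from collections import defaultdict

# Hypothetical rows: (crew, seat, experiment, time)
rows = [
    (1, 0, "CA", 0.011),
    (1, 1, "CA", 0.011),
    (1, 0, "DA", 0.004),
    (1, 0, "CA", 0.015),
    (2, 1, "SS", 0.008),
]

# Label-encode 'experiment': sorted unique names are mapped to 0, 1, 2, ...
experiment_codes = {
    name: code for code, name in enumerate(sorted({r[2] for r in rows}))
}

def pilot_id(crew, seat, experiment):
    # Assumed encoding (not necessarily the one used in the original code):
    # unique as long as seat is 0 or 1 and there are fewer than 10 experiments.
    return crew * 100 + seat * 10 + experiment_codes[experiment]

# Collect the samples of each pilot instance separately.
groups = defaultdict(list)
for crew, seat, experiment, time in rows:
    groups[pilot_id(crew, seat, experiment)].append(time)

# Sort each group by time so that time-series features are computed in order.
for times in groups.values():
    times.sort()
```

Grouping first matters because features such as heart rate or "time since last peak" only make sense within one continuous recording.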

  1. Domain-specific features

Heart rate:

A single heartbeat is characterised by the following features in the ECG signal: the P wave, which represents the depolarisation of the atria; the QRS complex, which represents the depolarisation of the ventricles; and the T wave, which represents the repolarisation of the ventricles of the heart. The QRS complex is characterised by an R-peak between Q and S.

Heart rate (number of beats per minute) is known to be useful in determining the cognitive state of a person. It can be determined from the ECG signal by identifying the R-peaks and measuring the interval between two consecutive R-peaks. For this purpose we use the biosppy library, which offers different R-peak detection algorithms, such as Christov segmenting and Engzee segmenting, and returns corrected R-peak locations. The library also filters the signal internally to remove noise and other artefacts. Heart rate values are obtained only at a small subset of the data points in an experiment, so they are interpolated over the entire duration of the experiment using cubic interpolation.
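To make the R-peak arithmetic concrete, here is a minimal sketch that turns R-peak positions into a heart rate value for every sample. The R-peak indices are made up (in practice they would come from a detector such as the ones in biosppy), and linear interpolation via `np.interp` stands in for the cubic interpolation described above to keep the example dependency-light.

```python
import numpy as np

FS = 256  # sampling rate in Hz, as stated for this dataset

def heart_rate_series(rpeak_idx, n_samples, fs=FS):
    """Instantaneous heart rate (bpm) interpolated to every sample."""
    rpeak_idx = np.asarray(rpeak_idx)
    rr_seconds = np.diff(rpeak_idx) / fs   # interval between consecutive R-peaks
    bpm = 60.0 / rr_seconds                # beats per minute for each interval
    bpm_at = rpeak_idx[1:]                 # each value is placed at the later peak
    # Simplification: linear interpolation instead of cubic.
    # Samples outside the first/last value are held constant by np.interp.
    return np.interp(np.arange(n_samples), bpm_at, bpm)

# Hypothetical R-peak positions: beats about 0.8 s apart, then exactly 1.0 s apart.
rpeaks = [0, 205, 410, 666, 922]
hr = heart_rate_series(rpeaks, n_samples=1000)
```

An R-R interval of 1.0 s corresponds to 60 bpm, and shorter intervals give proportionally higher rates.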

Respiratory rate:

The respiratory rate of a person is known to increase during moments of panic or under stress. From the given pneumatographic signal ‘r’, which measures the rise and fall of the chest, we can extract the respiratory rate of the pilot. We again use the biosppy library and interpolate the result, as we did for heart rate.

Galvanic Skin Response:

The GSR signal can be divided into the skin conductance level (SCL) and the skin conductance response (SCR). The SCL is a slowly changing base signal (f < 0.2 Hz) that is not related to the stimulus. The SCR (f < 0.5 Hz) occurs either after a stimulus or during normal regulatory activity of the sympathetic nervous system.

An SCR is composed of a rise zone and a decay zone. The instant at which the stimulus is presented and the signal starts to rise is known as the onset (t₀; it is called the ‘offset’ in the feature names used here), and the instant at which the signal reaches a local maximum is known as the peak (tₘₐₓ). The onset and peak points of the GSR signal can be determined using the first and second-order differentials of the filtered GSR signal. The amplitude of the GSR signal is the local maximum value of the signal relative to its value at the onset.

The biosppy library is used to obtain the onsets, peaks and amplitudes of the GSR signal. The onset and peak points are then used to add two new features, ‘last offset’ and ‘last peak’, which measure the time elapsed since the most recent onset and the most recent peak respectively.
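A "time since the last event" feature can be sketched as below, given the sample indices of the detected events (however they were obtained). The function name and the use of `None` before the first event are our assumptions for illustration.

```python
def time_since_last(event_idx, n_samples, fs=256):
    """Seconds elapsed since the most recent event at each sample.

    Samples before the first event get None (no event seen yet).
    """
    events = set(event_idx)
    out, last = [], None
    for i in range(n_samples):
        if i in events:
            last = i
        out.append(None if last is None else (i - last) / fs)
    return out

# Hypothetical event positions (sample indices) in a 1024-sample recording.
last_offset = time_since_last([256, 768], n_samples=1024)
```

The value resets to zero at every event and then grows linearly, so the model can tell how recently the skin conductance reacted.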

Power spectral analysis of EEG:

The electroencephalogram (EEG) readings come from 20 different electrodes on the scalp. Power spectrum analysis treats each EEG signal as a linear combination of simple oscillations, each at a specific frequency, and decomposes the signal into these frequency components to show the magnitude of each. The frequencies are grouped into bands as follows:

  • Delta band (0.2 Hz to 4 Hz): The delta band is prominent during deep sleep in healthy people and in newborns. When it stands out in an awake, healthy person, the cause is usually eye blinks or large body movements during the measurement. The artefacts caused by these movements occupy almost the same frequency range as delta waves, so the delta power can appear to have increased. Since eye and body movements are unavoidable in a long EEG experiment, the power spectrum of delta waves is usually not used as an analysis factor.

As mentioned, we do not use the delta band, since it is distorted by eye and body movements. The power of each EEG signal in the remaining frequency bands is obtained using the biosppy library. A window size of 40 seconds and an overlap of 0.99375 were found to be the optimal parameters for power spectral density analysis on the given train data.
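With a 256 Hz sampling rate, a 40-second window spans 10240 samples, and an overlap of 0.99375 means each window starts 64 samples (0.25 s) after the previous one. The sketch below shows this arithmetic and a simple sliding-window band power estimate. It uses a plain FFT with a Hann taper rather than biosppy, and the band edges in the example (8 to 13 Hz and 13 to 25 Hz) are illustrative choices, not necessarily the ones used in the original code.

```python
import numpy as np

FS = 256            # Hz, sampling rate of the dataset
WINDOW_S = 40       # window length in seconds (from the text)
OVERLAP = 0.99375   # fractional overlap between windows (from the text)

window = int(WINDOW_S * FS)                 # 10240 samples
step = int(round(window * (1 - OVERLAP)))   # 64 samples, one value every 0.25 s

def band_power(signal, low, high, fs=FS, window=window, step=step):
    """Average spectral power in [low, high) Hz for each sliding window."""
    signal = np.asarray(signal, dtype=float)
    freqs = np.fft.rfftfreq(window, d=1 / fs)
    in_band = (freqs >= low) & (freqs < high)
    taper = np.hanning(window)              # reduces spectral leakage
    powers = []
    for start in range(0, len(signal) - window + 1, step):
        segment = signal[start:start + window] * taper
        spectrum = np.abs(np.fft.rfft(segment)) ** 2
        powers.append(spectrum[in_band].mean())
    return np.array(powers)

# Synthetic check: a 10 Hz sine should carry far more power
# in the 8-13 Hz band than in the 13-25 Hz band.
t = np.arange(60 * FS) / FS
eeg = np.sin(2 * np.pi * 10 * t)
p_8_13 = band_power(eeg, 8, 13)
p_13_25 = band_power(eeg, 13, 25)
```

The long window gives a fine frequency resolution (1/40 Hz), while the very high overlap keeps the time resolution of the resulting feature at a quarter of a second.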

  2. Generic features

In addition to the domain-specific features above, we add features that are generic to any time-series data. These are:

  1. First-order differential for heart rate and respiratory rate
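A first-order differential is just the change between consecutive samples. A minimal sketch with made-up heart rate values follows; setting the first value to 0.0 is our assumption, since the first sample has no predecessor.

```python
def first_difference(values):
    """Change between consecutive samples; the first entry is set to 0.0."""
    diffs = [0.0]
    for prev, curr in zip(values, values[1:]):
        diffs.append(curr - prev)
    return diffs

heart_rate = [60.0, 60.5, 61.5, 61.0]   # hypothetical bpm values
d_heart_rate = first_difference(heart_rate)
```

As with the other features, the difference should be taken within the samples of one ‘pilot’ value, so that the last sample of one recording is never compared with the first sample of another.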

Along with the above features, ‘crew’ is one-hot encoded since it is a categorical feature.

Feature importance:

The importance of the top 25 features, obtained using a Random Forest Classifier model, is as follows:

The domain-specific features appear to be more important for classification than the generic features. The amplitude of the Galvanic Skin Response signal appears to be the most important feature in determining the cognitive state of the pilot.


The train data from Kaggle is split into train and cross-validation sets in the ratio 75:25. The numerical features, except the one-hot encoded features and ‘seat’, are standardised using sklearn’s StandardScaler. Any null values present in the data are imputed with the median of the respective feature.
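The preprocessing can be sketched in plain NumPy as below. The feature matrix is random data, a shuffled row-level split is assumed (the split strategy is not spelled out here), and the order of imputing before scaling is our choice. The standardisation is the same computation StandardScaler performs: subtract the mean and divide by the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))   # hypothetical feature matrix
X[rng.random(X.shape) < 0.05] = np.nan               # sprinkle in some missing values

# 75:25 split on shuffled row indices.
idx = rng.permutation(len(X))
cut = int(0.75 * len(X))
X_train, X_cv = X[idx[:cut]], X[idx[cut:]]

# All statistics come from the training split only,
# so nothing leaks from the cross-validation split.
median = np.nanmedian(X_train, axis=0)
X_train = np.where(np.isnan(X_train), median, X_train)
X_cv = np.where(np.isnan(X_cv), median, X_cv)

mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_cv = (X_cv - mean) / std
```

After this step every training feature has zero mean and unit variance, while the cross-validation features are only approximately standardised, as they should be.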


Approach 1:

The newly generated features are used to train different machine learning models. All the models were tuned for the best hyperparameters using cross-validation data. The models were adjusted for class weights since the given data is imbalanced.
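One common way to derive class weights for imbalanced data is the "balanced" heuristic, n_samples / (n_classes × class_count), which is also what sklearn applies when `class_weight="balanced"` is passed. The sketch below computes it for a made-up label column; whether the original models used exactly these weights is an assumption.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical label column dominated by the baseline class 'A'.
labels = ["A"] * 6 + ["B"] * 1 + ["C"] * 2 + ["D"] * 1
weights = balanced_class_weights(labels)
```

Rare classes receive larger weights, so misclassifying a troubling event costs the model more during training than misclassifying a baseline sample.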

These trained models were then used to predict probability of the pilot being in one of the four cognitive states for each instance in the test data. These probabilities were submitted to Kaggle for each model and the results were obtained as shown below:

The Random Forest Classifier model gave the best private score, whereas the LightGBM Classifier model gave the best public score. Overall, Random Forest Classifier gave a better result than any other model.

Approach 2:

After trying generic models, a custom stacking classifier was built and trained, with the following design:

  1. The whole data was split into train and cross-validation set in the ratio of 80:20

For the base models, all possible combinations of models such as Logistic Regression, Naive Bayes, Random Forest Classifier and XGBoost Classifier were tried, and the best combination was selected based on the cross-validation score.

However, it was found that the stacking classifier which we built did not improve the private and public score on Kaggle for the test dataset.

Final submission:

The Random Forest Classifier was selected as the final model based on both the public and the private score.


  • On the given data, both the bagging model (Random Forest Classifier) and the boosting model (LightGBM Classifier) perform better than the stacking model

Future work:

  • Data balancing can be done using the Borderline SMOTE technique to generate samples belonging to the surprised/startled (SS) state and the distracted attention (DA) state. This balanced dataset can be used to train the models mentioned above

You can find the complete code for this problem here. If you have any suggestions, you can contact me via LinkedIn

Thank you for taking your time out to read!


Data Scientist | Former Software Engineer
