
Table of Contents:

1. Introduction

2. Data Overview

3. Data Preprocessing

4. Feature Extraction

5. Machine Learning Models

6. Feature Selection

7. Results

Introduction

Predicting the time remaining before laboratory earthquakes occur from real-time seismic data.

Description

Forecasting earthquakes is one of the most important problems in Earth science because of their devastating consequences. Current scientific studies related to earthquake forecasting focus on three key points: when the event will occur, where it will occur, and how large it will be.

The goal of the challenge is to capture the physical state of the laboratory fault and how close it is to failure from a snapshot of the seismic data it is emitting. We have to build a model that predicts the time remaining before failure from a chunk of seismic data.

Problem Statement:

To predict the time remaining before laboratory earthquakes occur from real-time seismic data.

Sources:
Kaggle
Kaggle Discussion

Data

train.csv - A single, continuous training segment of experimental data.

Data Overview

train.csv contains 2 columns:

acoustic_data - the seismic signal recorded in the laboratory experiment.
time_to_failure - the time (in seconds) remaining until the next laboratory earthquake.

Type of Machine Learning Problem

It is a regression problem: for a given chunk of seismic data, we need to predict the time remaining before the laboratory earthquake occurs.

Performance Metric

Source: https://www.kaggle.com/c/LANL-Earthquake-Prediction#evaluation
Metric(s): Mean Absolute Error

It is the average absolute difference between the actual and predicted values.
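As a quick illustration (a minimal sketch, assuming NumPy), the metric can be computed as:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute difference between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

# toy example with time_to_failure values in seconds
print(mae([4.1, 2.0, 0.5], [3.9, 2.5, 0.2]))  # (0.2 + 0.5 + 0.3) / 3 ≈ 0.333
```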

I have used several Kaggle kernels and ideas from the discussion threads:

vettejeep kernel
allunia kernel
gpreda kernel

Exploratory Data Analysis

  1. It is given that an earthquake occurs when time_to_failure hits 0, so we can count 16 earthquake occurrences in the whole training data (see the sketch after this list).

  2. If we zoom into the data, we can see that the acoustic data has a peak just before the earthquake occurs, and the whole training data follows the same pattern.

  3. If we plot the data for 1,000,000 points, the graph appears to decrease continuously, but if we zoom in we can see that time_to_failure stops decreasing for a while roughly every ~4000 samples. This is because the data is recorded in bins of 4096 samples and the recording device pauses for 12 microseconds after each bin.
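One way to reproduce the earthquake count above is to look for the points where time_to_failure jumps back up after reaching ~0. A minimal sketch (column names as in the competition data; the repo's exact EDA code may differ):

```python
import numpy as np
import pandas as pd

train = pd.read_csv('train.csv',
                    dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})

ttf = train['time_to_failure'].values
# an earthquake corresponds to time_to_failure resetting (jumping up) after hitting ~0
quake_idx = np.where(np.diff(ttf) > 0)[0]
print('earthquakes in the training data:', len(quake_idx))  # 16, as noted above
```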

Data Preprocessing

Since each test segment has 150,000 samples, we convert the training data into samples of size 150,000, which gives 4,194 samples.
Since this dataset is too small, we split the full training series (~629 million rows) into 6 slices and take 4,000 random samples, each of size 150,000, from each slice. We now have 24,000 training samples. Building the features takes a long time, so we use multiprocessing.

We divide the raw data into 6 slices using the following function and save each slice to a CSV file for further processing.

def split_raw_data():
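A minimal sketch of what this could look like (file names and slice layout are illustrative, not the repo's exact implementation):

```python
import numpy as np
import pandas as pd

NUM_SLICES = 6

def split_raw_data(train_csv='train.csv', out_dir='.'):
    """Split the raw training file into 6 roughly equal slices and save each as a CSV."""
    df = pd.read_csv(train_csv,
                     dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
    slice_len = len(df) // NUM_SLICES
    for i in range(NUM_SLICES):
        df.iloc[i * slice_len:(i + 1) * slice_len].to_csv(
            f'{out_dir}/raw_slice_{i}.csv', index=False)
```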

We then create 4,000 random indices for each slice using random sampling and save the indices to a file.

def build_rnd_idxs(): 
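A sketch of the idea, assuming one index file per slice (the actual file format in the repo may differ):

```python
import numpy as np

SEG_LEN = 150_000   # segment length, matching the test segments
N_SAMPLES = 4_000   # random segments drawn per slice

def build_rnd_idxs(slice_len, slice_id, seed=0):
    """Draw 4000 random segment start positions within one slice and save them."""
    rng = np.random.RandomState(seed + slice_id)
    starts = rng.randint(0, slice_len - SEG_LEN, size=N_SAMPLES)
    np.save(f'rnd_idxs_{slice_id}.npy', starts)
    return starts
```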

The function below creates the features for each of the 4,000 data points in a slice; proc_id is the process id. It might take a day to run even with multiprocessing.

def build_fields(proc_id):
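A simplified sketch of such a worker, reusing the illustrative file names from the sketches above and a create_features helper (sketched under Feature Extraction below); the real function computes many more features:

```python
import numpy as np
import pandas as pd

SEG_LEN = 150_000

def build_fields(proc_id):
    """Build the feature matrix for one slice; proc_id doubles as the slice id."""
    df = pd.read_csv(f'raw_slice_{proc_id}.csv')
    starts = np.load(f'rnd_idxs_{proc_id}.npy')
    rows, targets = [], []
    for s in starts:
        seg = df.iloc[s:s + SEG_LEN]
        rows.append(create_features(seg['acoustic_data'].values))
        targets.append(seg['time_to_failure'].values[-1])  # label = time to failure at segment end
    out = pd.DataFrame(rows)
    out['time_to_failure'] = targets
    out.to_csv(f'features_slice_{proc_id}.csv', index=False)
```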

We now use Python's multiprocessing Pool. run_mp_build uses 6 workers to create the features in parallel.

def run_mp_build():
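A minimal sketch with one process per slice:

```python
from multiprocessing import Pool

NUM_WORKERS = 6  # one worker per slice

def run_mp_build():
    """Build the per-slice feature files in parallel."""
    with Pool(NUM_WORKERS) as pool:
        pool.map(build_fields, range(NUM_WORKERS))

if __name__ == '__main__':
    run_mp_build()
```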

We join the 6 slices into one training set, which now has 24,000 data points along with the corresponding 24,000 target values.

def join_mp_build():
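A sketch, again using the illustrative per-slice file names:

```python
import pandas as pd

def join_mp_build():
    """Concatenate the 6 per-slice feature files into one 24000-row training set."""
    full = pd.concat(
        [pd.read_csv(f'features_slice_{i}.csv') for i in range(6)],
        ignore_index=True)
    full.drop(columns='time_to_failure').to_csv('train_x.csv', index=False)
    full[['time_to_failure']].to_csv('train_y.csv', index=False)
```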

We scale the features using standardisation.

def scale_fields():
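A sketch using scikit-learn's StandardScaler (the fitted scaler would also be applied to the test segments):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_fields():
    """Standardise every feature to zero mean and unit variance."""
    X = pd.read_csv('train_x.csv')
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
    X_scaled.to_csv('train_x_scaled.csv', index=False)
    return scaler
```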

Feature Extraction

1. Statistical features

2. Signal processing features
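The project computes a large number of features per 150,000-sample segment. The sketch below shows a small, illustrative create_features helper covering both kinds; the exact feature set used in the project is much larger:

```python
import numpy as np
import pandas as pd
from scipy import signal, stats

def create_features(x):
    """A few illustrative per-segment features; not the project's full feature set."""
    x = pd.Series(x.astype(np.float64))
    return {
        # statistical features
        'mean': x.mean(), 'std': x.std(), 'min': x.min(), 'max': x.max(),
        'q01': x.quantile(0.01), 'q99': x.quantile(0.99),
        'skew': stats.skew(x), 'kurtosis': stats.kurtosis(x),
        # rolling-window features (noted later as among the most important)
        'roll_std_100_mean': x.rolling(100).std().mean(),
        'roll_mean_100_std': x.rolling(100).mean().std(),
        # signal processing features
        'n_peaks_h10': len(signal.find_peaks(x.values, height=10)[0]),
        'fft_mag_mean': np.abs(np.fft.rfft(x.values)).mean(),
    }
```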

Machine Learning Models

LGBM

We use default and approximate parameter values, since we found that cross-validation is not reliable for this data. LightGBM handles the large dataset well, needs less memory, trains much faster than other ensemble models, and remains interpretable. We get a score of 1.340. We can see below how well the model is predicting.
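A hedged sketch of such a model (hyperparameter values are illustrative, not the exact ones used; the training files come from the illustrative preprocessing sketches above):

```python
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error

X_train = pd.read_csv('train_x_scaled.csv')
y_train = pd.read_csv('train_y.csv')['time_to_failure']

lgbm = lgb.LGBMRegressor(
    n_estimators=1000,          # illustrative values only
    learning_rate=0.01,
    num_leaves=31,
    objective='regression_l1',  # optimise MAE, the competition metric
    random_state=42,
)
lgbm.fit(X_train, y_train)
print('train MAE:', mean_absolute_error(y_train, lgbm.predict(X_train)))
```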

XGBOOST

Since we want to maximize the score and XGBoost has proven successful in many Kaggle competitions, we train an XGBoost model as well. We reach an MAE of 1.314. We can see below that the model is not overfitting to the training data and generalises well.
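A similar sketch for XGBoost (again with illustrative parameters, reusing X_train and y_train from above):

```python
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

xgb_model = xgb.XGBRegressor(
    n_estimators=1000,              # illustrative values only
    learning_rate=0.01,
    max_depth=6,
    objective='reg:squarederror',
    random_state=42,
)
xgb_model.fit(X_train, y_train)
print('train MAE:', mean_absolute_error(y_train, xgb_model.predict(X_train)))
```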

Stacking

Stacked models are very powerful, and since interpretability is not important here, we stack the LGBM and XGBoost models with Linear Regression as the meta-regressor. We find that the MAE is 1.379.
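One way to wire this up is scikit-learn's StackingRegressor with LinearRegression as the final estimator (the project may implement the stacking differently, e.g. with out-of-fold predictions or mlxtend):

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression

stack = StackingRegressor(
    estimators=[('lgbm', lgbm), ('xgb', xgb_model)],  # the two boosters from above
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
```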

Feature Selection

Sklearn SelectKBest: SelectKBest selects the top features and gives us the feature scores. We keep the top 300 features, apply LGBM, and compare the results. We also tried autoencoders and Pearson correlation to reduce the dimensionality, but performance dropped. We can see that rolling features and peak counts are the most important features.
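A minimal sketch of the SelectKBest step (using f_regression as the score function is an assumption; the repo may use a different one):

```python
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=300)
X_top = selector.fit_transform(X_train, y_train)

scores = selector.scores_                        # per-feature scores, useful for ranking
kept = X_train.columns[selector.get_support()]   # names of the 300 selected features
```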

Results

XGBoost gives the best score of 1.314, which currently sits at the 27th position on the Kaggle public leaderboard.

At the time of submission, the score was in the top 1% of the Kaggle public leaderboard. Since this is the public leaderboard, positions may change.

Future Work

  1. We could use newer, more powerful models such as CatBoost.
  2. Use the KS test and PCA as feature selection techniques.

References

vettejeep kernel
allunia kernel
gpreda kernel
kaggle discussion
kaggle kernels
AppliedAICourse
Siraj Raval Youtube