The given data does not fit any known probability density function.
The next step is to identify the independent and correlated features.
- This ensures that the optimal number of features is included in the training data, avoiding information sparsity and reducing dimensionality.
- The following figure shows a correlation graph between all the features, color coded by strength of correlation.
Fig: Red indicates the highest correlation and grey the least
As the plot suggests, there are no significant correlations between any two features in the data.
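A correlation heatmap of this kind can be generated along the following lines (a minimal sketch; the file path and colormap are assumptions):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("train.csv")              # hypothetical path to the feature file
    corr = df.select_dtypes("number").corr()   # pairwise correlations between numeric features

    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, cmap="Reds", square=True)  # darker red = stronger correlation
    plt.title("Feature correlation matrix")
    plt.show()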
Principal Component Analysis:
PCA is a method to reduce the dimensionality of the data. It helps visualize whether any features in the given dataset do not contribute towards predicting the target variable.
The number of components, i.e. features, can be determined by plotting the variance of the given dataset against the number of components.
The above plot indicates that the cumulative variance of the dataset is explained by almost all the features.
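The cumulative explained-variance plot can be produced roughly as follows (a sketch; the file path and the scaling step are assumptions):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("train.csv")                                   # hypothetical path
    X = StandardScaler().fit_transform(df.select_dtypes("number"))

    pca = PCA().fit(X)
    cum_var = np.cumsum(pca.explained_variance_ratio_)              # cumulative explained variance

    plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
    plt.xlabel("Number of components")
    plt.ylabel("Cumulative explained variance")
    plt.show()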
Initially we approached the problem with classification models, applying various supervised learning (classification) algorithms to the individual dataset. Three different models were built, one for each year's target variable.
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piece-wise constant approximation.
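As a minimal illustration of such a classifier (a sketch on synthetic stand-in data, not the exact setup used here):

    from sklearn.datasets import make_classification      # stand-in for the real features
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import classification_report

    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.7, 0.3], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    dt = DecisionTreeClassifier(max_depth=5, random_state=42)   # depth is an arbitrary choice
    dt.fit(X_train, y_train)
    print(classification_report(y_test, dt.predict(X_test)))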
The Brier score is given by BS = (1/N) ∑ (ft − ot)², in which ft is the probability that was forecast, ot the actual outcome of the event at instance t (0 if it does not happen and 1 if it does happen), and N the number of forecasting instances. In effect, it is the mean squared error of the forecast.
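The same quantity is available directly in scikit-learn (a small illustrative example with made-up numbers):

    from sklearn.metrics import brier_score_loss

    y_true = [0, 1, 1, 0]           # actual outcomes ot
    y_prob = [0.1, 0.8, 0.6, 0.3]   # forecast probabilities ft

    print(brier_score_loss(y_true, y_prob))   # mean of (ft - ot)^2, here 0.075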
The given data is highly imbalanced with respect to the target variable.
Note: Here class-1 means recidivated and class-0 means not recidivated.
I. Logistic Regression Summary:
Recall and precision scores for class-1 are poor for both the Logistic Regression and Random Forest models due to the imbalanced data.
III. Neural Network Summary:
Loss: 9.9847e-05 , Accuracy: 0.7047
Precision: 0.5126 , Recall: 0.7946
Test_loss: 0.8334 , Test_accuracy: 0.6121
Test_precision: 0.4125 , Test_recall: 0.6169
In contrast to LR and DT, the Neural Network results are far more promising: the test recall score improved from 0.13 to about 0.62.
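A network of this kind can be set up in Keras roughly as follows (a sketch; the layer sizes, training settings, and stand-in data are assumptions, not the exact configuration used):

    import numpy as np
    import tensorflow as tf

    # Stand-in arrays; in practice these are the prepared train/test feature matrices and labels
    rng = np.random.default_rng(42)
    X_train, y_train = rng.normal(size=(4000, 20)), rng.integers(0, 2, 4000)
    X_test, y_test = rng.normal(size=(1000, 20)), rng.integers(0, 2, 1000)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy",
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])
    model.fit(X_train, y_train, epochs=20, batch_size=64,
              validation_data=(X_test, y_test))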
The problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary. Perhaps the most widely used approach to synthesizing new examples is the ‘Synthetic Minority Oversampling Technique’, or ‘SMOTE’ for short.
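With the imbalanced-learn library, SMOTE can be applied to the training split along these lines (a sketch on synthetic data; only the training data should be resampled):

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification   # stand-in for the real training features

    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.85, 0.15], random_state=42)
    print("before:", Counter(y))

    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    print("after: ", Counter(y_res))                    # minority class oversampled to balance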
IV. Survival Analysis:
Survival analysis is a branch of statistics for analyzing the expected duration of time until one event occurs, such as death in biological organisms and failure in mechanical systems.
Often we regress covariates (e.g., age, country, etc.) against another variable, in this case durations; here we cannot use traditional methods like linear regression because of censoring.
There are a few popular models in survival regression: Cox’s model, Weibull accelerated failure time models, log-normal models, etc. One can run through all the models and finalize the one with the lowest Brier score (error) and/or the highest accuracy.
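With the lifelines library, the candidate models can be fit and compared on the same data (a sketch using the Rossi recidivism dataset bundled with lifelines as a stand-in, and the concordance index rather than the Brier score as the comparison metric):

    from lifelines import CoxPHFitter, WeibullAFTFitter, LogNormalAFTFitter
    from lifelines.datasets import load_rossi   # stand-in duration/event dataset

    df = load_rossi()   # 'week' = duration, 'arrest' = event indicator, plus covariates

    models = {
        "Cox PH": CoxPHFitter(),
        "Weibull AFT": WeibullAFTFitter(),
        "Log-normal AFT": LogNormalAFTFitter(),
    }
    for name, m in models.items():
        m.fit(df, duration_col="week", event_col="arrest")
        print(name, "concordance:", round(m.concordance_index_, 3))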
The exponential distribution is based on the Poisson process, where events occur continuously and independently at a constant event rate 𝜆. Exponential models describe how much time is needed until an event occurs.
The exponential distribution is a special case of the Weibull distribution:
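One standard way to state this, using the rate parametrization with shape parameter k and rate 𝜆, is:

    h(t) = k \lambda (\lambda t)^{k-1}, \qquad k = 1 \;\Rightarrow\; h(t) = \lambda \ \text{(constant hazard)}, \quad S(t) = e^{-\lambda t}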
Cox’s proportional hazard model
Cox’s proportional hazard model arises when 𝑏0 becomes 𝑏0(𝑡), which means the baseline hazard is a function of time.
ℎ(𝑡|𝑥) = 𝑏0(𝑡) * 𝑒𝑥𝑝(∑𝑏𝑖𝑥𝑖)
- ‘ℎ(𝑡|𝑥)’ hazard
- ‘𝑏0(𝑡)’ baseline hazard, time-varying
- ‘𝑒𝑥𝑝(∑𝑏𝑖𝑥𝑖)’ partial hazard, time-invariant
- ‘𝑥𝑖’ is often centered at the mean
- Can fit survival models without knowing the distribution
- With censored data, inspecting distributional assumptions can be difficult
- Does not require specifying the underlying hazard function; great for estimating covariate effects and hazard ratios.
- Estimates 𝑏1,…,𝑏𝑘 without having to specify 𝑏0(𝑡).
- Cannot estimate S(t) and h(t) directly.
The Cox model relies on the following assumptions, each with ways to check for and fix violations:
1. Non-informative censoring.
   1.1 Check: try predicting the censoring indicator from the Xs.
2. Survival times (t) are independent.
3. ln(hazard) is a linear function of the numeric Xs.
   3.1 Check: residual plots
   3.2 Fix: transformations
4. Values of the Xs do not change over time.
   4.1 Fix: add time-varying covariates
5. Hazards are proportional, i.e. the hazard ratio between any two subjects is constant over time.
   5.1 Check: Schoenfeld residuals, proportional hazards test
   5.2 Fix: add a non-linear term, bin the variable, add an interaction term with time, stratify (run the model on subgroups), add time-varying covariates.
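In lifelines, fitting the Cox model and checking the proportional-hazards assumption looks roughly like this (a sketch, again using the bundled Rossi recidivism dataset as a stand-in):

    from lifelines import CoxPHFitter
    from lifelines.datasets import load_rossi

    df = load_rossi()                        # 'week' = duration, 'arrest' = event indicator

    cph = CoxPHFitter()
    cph.fit(df, duration_col="week", event_col="arrest")
    cph.print_summary()                      # coefficients, hazard ratios, p-values

    # Schoenfeld-residual based proportional hazards test (assumption 5 above)
    cph.check_assumptions(df, p_value_threshold=0.05)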
- Amongst all the models, we can say that the Cox proportional hazards model does well when we are unsure about the underlying distribution.
- When the dataset is imbalanced, it is good to use semi-parametric models like the Cox proportional hazards model.
- Emphasis on additional sources of data and on feature engineering would yield better results.
To Sum it Up:
Participating in the recidivism challenge has been testing on various fronts. As we unfolded each leaf in the book, trying to model something, the complexity grew two-fold. As growing data-science professionals, that made it a great venue for diving deeper. Although our results are not very successful, the journey has definitely been worth it. Getting hands-on with real-world data and attempting to solve an impact-driven use case has been revealing and a great deal of learning.