Recidivism Forecasting Challenge by the National Institute of Justice – Georgia

 2nd August, 2021




Challenge Overview:

Community service is one of the largest government-run practices in law enforcement and criminal justice. It came into effect in the states inspired by ancient legal systems based on restitution, in which offenders labored to compensate their victims for injury or loss. It is considered a “productive form of punishment”, and rightfully so. Over time, robust systems were built to ensure that the appropriate formats for community service, which vary from state to state, were in place. This allowed offenders to take a genuine shot at redemption and lead a civil life.
The current state of affairs, however, shows that community supervision is proving counterproductive: studies show that the ratio of individuals on parole who successfully reintegrate into society to those who fail to clear the probation period is very low. According to justice authorities, more than 40% of prison admissions are the result of parole violations. This is unfortunate for both the individual and the community, as the loss incurred in terms of effort and time is substantial. The failures create a loop, referred to as “revolving door” admissions, in which the individual is detained and continues to serve a prison sentence. This phenomenon costs the states a staggering 9.3 billion dollars annually.
Community supervision officials fulfill a complex set of duties: applying the right amount of surveillance to individuals on parole to safeguard the community, while simultaneously directing them to the most suitable rehabilitative programs to help them overcome challenges. Any factor in the parole routine that is not scaled appropriately can put re-entry at risk and fail to prevent recidivism. Evidently, it's a sensitive balance that needs to be supported by careful design to promote re-entry. Hence, the critical question boils down to determining the recidivism risk of an individual. This opens a window for data analytics to provide the necessary insights.
The Recidivism Forecasting Challenge conducted by the NIJ is an open challenge to the citizens and businesses of the country. The aim of the NIJ is to safeguard and improve communities by reducing recidivism. The results from the challenge are expected to provide critical information to community corrections departments that may help facilitate more successful reintegration into society for people previously incarcerated and on parole.


The data for this challenge is released by the NIJ and state authorities. It is released under terms and conditions for research purposes only, and the identity of each individual is anonymized.
The data contains 53 fields for every prisoner, corresponding to various aspects of the prison sentence, personal details, and probation. Download the data here:
The challenge is to predict in which of the 3 years of community supervision the individual recidivates. The target variable in the training data is the year in which the individual recidivated.

Exploratory Data Analysis:

Structural details of the data are as follows:

  • The data comprises a total of 53 variables, of which 3 are the target variables (Recidivism Year 1, 2 & 3)
  • The data types present in the data are: Boolean: 20, Float: 8, Integer: 2, Object: 23
  • The following figure shows the percentage of missing values in each column.

Data imputation has been done to fill the missing values.
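One plausible imputation strategy (the exact one used is not specified above) is to fill numeric columns with the median and categorical columns with the mode. A minimal pandas sketch on a hypothetical two-column frame; the column names are invented for illustration:

```python
import pandas as pd

# Hypothetical toy frame standing in for the challenge data: one numeric
# and one categorical column, both with missing values.
df = pd.DataFrame({
    "Supervision_Risk_Score": [3.0, None, 7.0, 5.0, None],
    "Education_Level": ["HS", None, "HS", "College", "HS"],
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # Numeric: fill with the median (robust to skewed distributions).
        df[col] = df[col].fillna(df[col].median())
    else:
        # Categorical: fill with the mode (most frequent value).
        df[col] = df[col].fillna(df[col].mode()[0])

print(df.isna().sum().sum())  # 0 — no missing values remain
```

For the real data one would apply the same loop after separating out the three target columns.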

  • Race: [‘BLACK’ ‘WHITE’], Gender: [‘M’ ‘F’]
  • Count: 18028
    Black frequency: 10313
    White frequency: 7715
    Male frequency: 15811
    Female frequency: 2217
    Approximately 57% of the inmates are Black and 43% are White.
    There is no extreme skew in the racial distribution of the data, whereas there is a significant gender bias (roughly 88% of the inmates are male).
  • Another important factor to look into is the type of distribution the data follows, or the closest match, to help us understand the data characteristics.

The given data does not fit any well-known probability distribution.

The next step is to identify the independent and correlated features.

  • This is to ensure the optimal number of features is included in the training data, to avoid information sparsity and reduce dimensionality.
  • The following figure shows a correlation graph between all the features, color-coded by strength.


Fig: Red indicates the highest correlation, grey the least

As the plot suggests, there aren’t any significant correlations between any two given features from the data.
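The correlation matrix behind such a plot can be computed directly with pandas. A sketch on a hypothetical numeric feature matrix (the column names are invented; with the real data one would first encode the object columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical numeric stand-in for the encoded feature matrix.
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["age", "prior_arrests", "gang_affiliated", "jobs_per_year"])

corr = df.corr()  # Pearson correlation, values in [-1, 1]

# Flag any off-diagonal pair whose |r| exceeds a chosen threshold, e.g. 0.8.
strong = (corr.abs() > 0.8) & (corr.abs() < 1.0)
print(strong.any().any())  # independent random features: no strong pairs
```

The boolean mask makes the "no significant correlations" claim checkable rather than purely visual.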

Principal Component Analysis:

PCA is a method to reduce the dimensionality of the data. It helps visualize whether any features in the given dataset do not contribute towards predicting the target variable.
The number of components, i.e. features, can be determined by plotting the variance of the given dataset with respect to the number of components.

The above plot indicates that the cumulative variance of the dataset is explained only by almost all the features together; no small subset of components dominates.
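The cumulative-variance curve behind that plot can be reproduced from first principles: PCA component variances are the eigenvalues of the feature covariance matrix. A sketch on a random stand-in matrix (the real pipeline would use the scaled challenge features):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))  # stand-in for the scaled feature matrix

# PCA via the covariance matrix: eigenvalues give the variance captured
# along each principal component.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending

cum_var = np.cumsum(eigvals) / eigvals.sum()
# If cum_var rises almost linearly (as in our plot), every component
# contributes comparably and dropping features would lose information.
print(np.round(cum_var, 2))
```

With scikit-learn, `PCA().explained_variance_ratio_` yields the same curve.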


Initially, we approached this as a classification problem and applied various supervised learning (classification) algorithms to the dataset. Three different models were built, one for each year as the target variable.
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piece-wise constant approximation.
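A minimal sketch of fitting such a tree with scikit-learn, shown here on synthetic stand-in data rather than the challenge set (`max_depth=4` is an illustrative choice, not the tuned value):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))            # stand-in features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in binary target (recidivated or not)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Limiting the depth keeps the piece-wise constant approximation from
# memorising the training data.
clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

`clf.predict_proba(X_te)[:, 1]` then yields the class-1 probabilities needed for the Brier score below.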

Brier Score:



In which ft is the probability that was forecast, to the actual outcome of the event at instance t( if it does not happen and 1 if it does happen) and N is the number of forecasting instances. In effect, it is the mean squared error of the forecast.
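The score is straightforward to implement; a small sketch:

```python
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((forecasts - outcomes) ** 2)

# A confident forecast of 0.9 for an event that happened is penalised
# lightly (0.01); the same forecast for one that did not is penalised
# heavily (0.81).
print(brier_score([0.9, 0.9], [1, 0]))  # ≈ 0.41
```

Lower is better; a constant forecast of 0.5 scores 0.25 regardless of outcomes, which is a useful baseline.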

The given data is highly imbalanced in the target variable.





Note: Here class-1 means recidivated and class-0 means not recidivated.


I. Logistic Regression Summary:


II. Random Forest Summary:


Recall and precision scores for class-1 are poor in both the logistic regression and random forest models due to the imbalanced data.

III. Neural Network Summary:

Loss: 9.9847e-05 , Accuracy: 0.7047
Precision: 0.5126 , Recall: 0.7946
Test_loss: 0.8334 , Test_accuracy: 0.6121
Test_precision: 0.4125 , Test_recall: 0.6169

In contrast to LR and DT, the Neural Network results are far more promising. Test recall scores improved from 0.13 to 0.6.


The problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary. Perhaps the most widely used approach to synthesizing new examples is the ‘Synthetic Minority Oversampling Technique’, or ‘SMOTE’ for short.
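SMOTE is usually applied via the imbalanced-learn library (`imblearn.over_sampling.SMOTE`); to make the idea concrete, here is a minimal hand-rolled sketch of just the interpolation step, not the full algorithm:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating between a random minority sample and one of its k
    nearest minority-class neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to every minority sample.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.default_rng(3)
X_min = rng.normal(size=(20, 4))  # minority class ("recidivated")
X_new = smote_oversample(X_min, n_new=30, rng=rng)
```

Each synthetic point lies on the segment between two real minority samples, so the oversampled class stays inside the region the minority actually occupies.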

IV. Survival Analysis:

Survival analysis is a branch of statistics for analyzing the expected duration of time until an event occurs, such as death in biological organisms or failure in mechanical systems.
Often we regress covariates (e.g., age, country, etc.) against another variable – in this case durations – but we cannot use traditional methods like linear regression because of censoring.
There are a few popular models in survival regression: Cox's model, Weibull accelerated failure time models, log-normal models, etc. One can run through all the models and select the one with the least Brier score (error) and/or highest accuracy.

Exponential model

The exponential distribution is based on the Poisson process, where events occur continuously and independently at a constant rate λ. The exponential distribution models how much time is needed until an event occurs.
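Under this model with right-censored follow-up, the maximum-likelihood estimate of the rate is simply the number of events divided by the total observed time. A sketch on hypothetical durations (the values are invented for illustration):

```python
import numpy as np

# Hypothetical follow-up data: durations in months and an event flag
# (1 = recidivated, 0 = censored at the end of the 3-year window).
durations = np.array([4.0, 12.0, 36.0, 7.0, 36.0, 20.0, 36.0, 2.0])
events    = np.array([1,   1,    0,    1,   0,    1,    0,    1])

# MLE of the exponential rate under right-censoring:
# lambda_hat = (number of events) / (total time at risk)
lam = events.sum() / durations.sum()

mean_time = 1 / lam               # expected time until an event
surv_24m = np.exp(-lam * 24)      # S(t) = exp(-lambda * t), here at t = 24 months
```

Censored subjects contribute their time at risk to the denominator but no event to the numerator, which is exactly what naive "event rate" calculations get wrong.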


Weibull model

The exponential distribution is a special case of the Weibull distribution:


Cox’s proportional hazard model

Cox's proportional hazard model arises when the baseline hazard h₀ becomes h₀(t), which means the baseline hazard is a function of time.


h(t|x) = h₀(t) · exp(Σᵢ βᵢxᵢ)

  • ‘h(t|x)’ hazard
  • ‘h₀(t)’ baseline hazard, time-varying
  • ‘exp(Σᵢ βᵢxᵢ)’ partial hazard, time-invariant
  • ‘xᵢ’ is often centered to the mean
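Because the baseline hazard cancels in any ratio of two subjects' hazards, the hazard ratio depends only on the covariates. A small numeric illustration with invented coefficients (a library such as lifelines would estimate β from data):

```python
import numpy as np

# Hypothetical fitted coefficients and two subjects' (mean-centred) covariates.
beta = np.array([0.4, -0.2, 0.1])
x_a = np.array([1.0, 0.0, 2.0])
x_b = np.array([0.0, 1.0, 0.0])

partial_a = np.exp(beta @ x_a)  # exp(0.6)
partial_b = np.exp(beta @ x_b)  # exp(-0.2)

# h0(t) cancels in the ratio, so the hazard ratio between the two
# subjects is constant over time — the proportional hazards property.
hazard_ratio = partial_a / partial_b  # exp(0.6 - (-0.2)) = exp(0.8)
```

This time-invariance is precisely assumption 5 in the list below, and why violating it calls for time-varying covariates or stratification.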


  • Can fit survival models without knowing the distribution
  • With censored data, inspecting distributional assumptions can be difficult
  • Does not require specifying the underlying hazard function; great for estimating covariate effects and hazard ratios


  • Estimates β₁, …, βₚ without having to specify h₀(t)
  • Cannot estimate S(t) and h(t)

Core Assumptions:

  • 1. Non-informative censoring
    1.1 Check: predicting censoring from the Xs
  • 2. Survival times (t) are independent
  • 3. ln(hazard) is a linear function of the numeric Xs
    3.1 Check: residual plots
    3.2 Fix: transformations
  • 4. Values of the Xs don't change over time
    4.1 Fix: add time-varying covariates
  • 5. Hazards are proportional: the hazard ratio between two subjects is constant
    5.1 Check: Schoenfeld residuals, proportional hazard test
    5.2 Fix: add a non-linear term, bin the variable, add an interaction term with time, stratify (run the model on subgroups), add time-varying covariates


  • Amongst all the models, the Cox proportional hazards model does well when we are unsure about the distribution.
  • When the dataset is imbalanced, it is good to use semi-parametric models like Cox proportional hazards.
  • Emphasis on additional sources of data and feature engineering would yield better results.

To Sum it Up:

Participating in the recidivism challenge has been testing on various fronts. As we unfolded each leaf in the book, trying to model something, the complexity grew two-fold. As growing data-science professionals, that made it a great venue for diving deeper. Although our results are not very successful, the journey has definitely been worth it. Having hands-on experience with real-world data, attempting to solve an impact-driven use case, has been revealing and a great deal of learning.