Ensemble Recursive Binary Partitioning

Team Leader

David Slate
United States

Team Members

Peter Frey


Supplementary online material

Provide a URL to a web page, technical memorandum, or a paper.

No response.


Provide a general summary with relevant background information: Where does the method come from? Is it novel? Name the prior art.

We used a home-grown version of Ensemble Recursive Binary Partitioning based on prior work with Recursive Binary Partitioning.
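The fact sheet does not describe the algorithm itself, so as a rough illustration only: the sketch below builds a small ensemble of bootstrap-resampled trees, each grown by recursive binary splits that minimize squared error, with predictions averaged across trees. All names and parameters here are our own illustrative choices, not the team's actual implementation (which was written in C, Tcl/Tk, and Lua).

```python
import random

def sse(labels, idx):
    """Sum of squared errors of the subset's labels around their mean."""
    m = sum(labels[i] for i in idx) / len(idx)
    return sum((labels[i] - m) ** 2 for i in idx)

def build_tree(rows, labels, depth=0, max_depth=3, min_size=4):
    """Recursively split on the (feature, threshold) pair that minimizes
    the squared error of predicting the mean outcome in each half."""
    mean = sum(labels) / len(labels)
    if depth >= max_depth or len(rows) < min_size:
        return mean                      # leaf: predict the mean outcome
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = sse(labels, left) + sse(labels, right)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return mean
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       depth + 1, max_depth, min_size),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       depth + 1, max_depth, min_size))

def predict(tree, row):
    while isinstance(tree, tuple):       # descend until we reach a leaf value
        f, t, lo, hi = tree
        tree = lo if row[f] <= t else hi
    return tree

def build_ensemble(rows, labels, n_trees=10, seed=0):
    """Grow each tree on a bootstrap resample of the training rows."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        trees.append(build_tree([rows[i] for i in idx],
                                [labels[i] for i in idx]))
    return trees

def ensemble_predict(trees, row):
    """Average the individual tree predictions."""
    return sum(predict(t, row) for t in trees) / len(trees)
```

Averaging many trees grown on resampled data is the generic idea behind ensembling recursive partitioning; the team's actual splitting criterion and ensembling scheme may differ.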


Summarize the algorithms you used so that those skilled in the art can understand what to do. Profile your methods as follows:

Data exploration and understanding

Did you use data exploration techniques to

  • [not checked]  Identify selection biases
  • [checked]  Identify temporal effects (e.g. students getting better over time)
  • [checked]  Understand the variables
  • [checked]  Explore the usefulness of the KC models
  • [checked]  Understand the relationships between the different KC types

Please describe your data understanding efforts, and interesting observations:

We explored various combinations of the available predictors.


Feature generation

  • [not checked]  Features designed to capture the step type (e.g. enter given, or ... )
  • [not checked]  Features based on the textual step name
  • [checked]  Features designed to capture the KC type
  • [checked]  Features based on the textual KC name
  • [checked]  Features derived from opportunity counts
  • [not checked]  Features derived from the problem name
  • [checked]  Features based on student ID
  • [checked]  Other features

Details on feature generation:

We used the raw Student ID, Unit, Section, and KC name strings as categorical variables. We did not use the problem or step names that way because they had too many distinct values. We also used the raw opportunity counts as features.

We tried many features based on averages of performance per student, per student/unit, per KC, and others, but they did not prove useful. However, two other averages did turn out to be usefully predictive for Bridge to Algebra: the mean number of hints per step for each student/unit, and the mean number of incorrects per step for each student.

Several features related to step ordinals and counts turned out to be useful: the step ordinal within the student/unit, problem, and problem/view; the total number of steps in the problem/view; and the remaining step count in the problem/view. We also used the count of problem/views within the student/unit up to the current step as a feature, although it is not clear whether that was helpful.

We created some features based on initial substrings of the KC names (a crude first attempt at textual analysis) that appeared useful in our tests; in particular, the first 10 characters of KC_Rul1 and KC_Sub1 for Algebra, and the first 30 characters of KC_Sub1 for Bridge to Algebra. Here KC_Sub1 refers to the first component of KC(SubSkills).
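The positional features described above (step ordinal within a problem/view, total steps in the view, remaining steps) can be derived in one pass over the ordered rows of a student/unit. The sketch below is a minimal illustration with hypothetical field layout, not the team's code:

```python
from collections import defaultdict

def step_position_features(steps):
    """steps: ordered list of (problem_view_id, step_name) pairs for one
    student/unit. Returns, per step: its ordinal within its problem/view,
    the total number of steps in that view, and the remaining count."""
    totals = defaultdict(int)
    for pv, _ in steps:                 # first pass: size of each view
        totals[pv] += 1
    seen = defaultdict(int)
    feats = []
    for pv, _ in steps:                 # second pass: ordinals and remainders
        seen[pv] += 1
        ordinal, total = seen[pv], totals[pv]
        feats.append({"step_ordinal": ordinal,
                      "steps_in_view": total,
                      "steps_remaining": total - ordinal})
    return feats
```

The KC-name substring features are simpler still: slicing the first 10 or 30 characters of the relevant KC component string and treating the result as a categorical value.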

Feature selection

  • [not checked]  Feature ranking with correlation or other criterion (specify below)
  • [not checked]  Filter method (other than feature ranking)
  • [not checked]  Wrapper with forward or backward selection (nested subset method)
  • [not checked]  Wrapper with intensive search (subsets not nested)
  • [checked]  Embedded method
  • [not checked]  Other method not listed above (specify below)

Details on feature selection:

Several features that our modeling algorithm found to be poor predictors were eliminated from our later model-building runs.

Did you attempt to identify latent factors?

  • [not checked]  Cluster students
  • [not checked]  Cluster knowledge components
  • [not checked]  Cluster steps
  • [not checked]  Latent feature discovery was performed jointly with learning

Details on latent factor discovery (techniques used, useful student/step features, how were the factors used, etc.):


Other preprocessing

  • [not checked]  Filling missing values (for KC)
  • [not checked]  Principal component analysis

More details on preprocessing:

We separated out the components of the KC model strings and retained up to the first 7 components of KC(SubSkills) and the first 3 of KC(KTracedSkills) and KC(Rules), plus their corresponding opportunity counts.
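As an illustration of this component splitting: the sketch below assumes the "~~" separator used in the challenge data files to join multiple KC components (and their opportunity counts) in a single cell; the function name and the padding convention are our own.

```python
def split_kc(kc_field, opp_field, max_components):
    """Split a multi-valued KC cell and its aligned opportunity counts,
    keeping up to max_components of each and padding with None."""
    kcs = kc_field.split("~~") if kc_field else []
    opps = opp_field.split("~~") if opp_field else []
    kcs = (kcs + [None] * max_components)[:max_components]
    opps = ([int(o) for o in opps] + [None] * max_components)[:max_components]
    return kcs, opps
```

For example, retaining the first 7 components of KC(SubSkills) would be `split_kc(row["KC(SubSkills)"], row["Opportunity(SubSkills)"], 7)` under these assumed column names.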


Base classifier

  • [checked]  Decision tree, stub, or Random Forest
  • [not checked]  Linear classifier (Fisher's discriminant, SVM, linear regression)
  • [not checked]  Non-linear kernel method (SVM, kernel ridge regression, kernel logistic regression)
  • [not checked]  Naïve Bayes
  • [not checked]  Bayesian Network (other than Naïve Bayes)
  • [not checked]  Neural Network
  • [not checked]  Bayesian Neural Network
  • [not checked]  Nearest neighbors
  • [not checked]  Latent variable models (e.g. matrix factorization)
  • [not checked]  Neighborhood/correlation based collaborative filtering
  • [not checked]  Bayesian Knowledge Tracing
  • [not checked]  Additive Factor Model
  • [not checked]  Item Response Theory
  • [not checked]  Other classifier not listed above (specify below)

Loss Function

  • [not checked]  Hinge loss (like in SVM)
  • [not checked]  Square loss (like in ridge regression)
  • [not checked]  Logistic loss or cross-entropy (like in logistic regression)
  • [not checked]  Exponential loss (like in boosting)
  • [checked]  None
  • [not checked]  Don't know
  • [not checked]  Other loss (specify below)

Regularization

  • [not checked]  One-norm (sum of weight magnitudes, like in Lasso)
  • [not checked]  Two-norm (||w||^2, like in ridge regression and regular SVM)
  • [not checked]  Structured regularizer (like in group lasso)
  • [checked]  None
  • [not checked]  Don't know
  • [not checked]  Other (specify below)

Ensemble Method

  • [not checked]  Boosting
  • [not checked]  Bagging (check this if you use Random Forest)
  • [checked]  Other ensemble method
  • [not checked]  None

Were you able to use information present only in the training set?

  • [checked]  Corrects, incorrects, hints
  • [not checked]  Step start/end times

Did you use post-training calibration to obtain accurate probabilities?

  • [selected]  Yes
  • [not selected]  No

Did you make use of the development data sets for training?

  • [not selected]  Yes
  • [selected]  No

Details on classification:

We found that the forecasts of our models on the holdout set (see below) were generally low, except at the very high end close to 1, so we computed an adjustment to those scores based on a piecewise-linear fit of the differences between the actual outcome (0 or 1) and the forecast. When we built final models from the entire training sets, we applied the same adjustment function to those models' forecasts on the test records.
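A minimal sketch of one way to implement such a piecewise-linear adjustment, assuming equal-width forecast bins with linear interpolation between bin centers (the team's exact fitting procedure is not specified):

```python
import numpy as np

def fit_calibration(forecasts, outcomes, n_bins=10):
    """In each forecast bin, record the mean difference between the actual
    outcome (0/1) and the forecast; return bin centers and adjustments."""
    forecasts = np.asarray(forecasts, float)
    outcomes = np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers, adjustments = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (forecasts >= lo) & (forecasts < hi)
        if mask.any():                  # skip empty bins
            centers.append((lo + hi) / 2)
            adjustments.append((outcomes[mask] - forecasts[mask]).mean())
    return np.array(centers), np.array(adjustments)

def calibrate(forecasts, centers, adjustments):
    """Interpolate the fitted adjustment and clip back into [0, 1]."""
    adj = np.interp(forecasts, centers, adjustments)
    return np.clip(np.asarray(forecasts) + adj, 0.0, 1.0)
```

The adjustment is fit on holdout forecasts and then applied unchanged to the final models' test-set forecasts, mirroring the procedure described above.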

Model selection/hyperparameter selection

  • [checked]  We used the online feedback of the leaderboard.
  • [not checked]  K-fold or leave-one-out cross-validation (using training data)
  • [not checked]  Virtual leave-one-out (closed-form estimation of LOO with a single classifier training)
  • [not checked]  Out-of-bag estimation (for bagging methods)
  • [not checked]  Bootstrap estimation (other than out-of-bag)
  • [not checked]  Other cross-validation method
  • [not checked]  Bayesian model selection
  • [not checked]  Penalty-based method (non-Bayesian)
  • [not checked]  Bi-level optimization
  • [checked]  Other method not listed above (specify below)

Details on model selection:

We created a holdout set for validation from the last problem/view in each student/unit, in an attempt to emulate the process used to select the test set.
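A sketch of this holdout construction, using hypothetical record fields (`student`, `unit`, `problem_view`) and assuming the rows for each student/unit appear in chronological order:

```python
def split_holdout(rows):
    """Hold out every row belonging to the last problem/view seen for each
    student/unit; everything else stays in the training split."""
    last_view = {}
    for r in rows:                      # later rows overwrite earlier ones
        last_view[(r["student"], r["unit"])] = r["problem_view"]
    train, holdout = [], []
    for r in rows:
        if r["problem_view"] == last_view[(r["student"], r["unit"])]:
            holdout.append(r)
        else:
            train.append(r)
    return train, holdout
```

This mirrors, approximately, how the official test rows were drawn from the end of each student/unit sequence.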


Final Team Submission

Scores shown in the table below are Cup scores, not leaderboard scores. The difference between the two is described on the Evaluation page.

A reader should also know from reading the fact sheet what the strength of the method is.

Please comment about the following:

Quantitative advantages (e.g., compact feature subset, simplicity, computational advantages).


Qualitative advantages (e.g. compute posterior probabilities, theoretically motivated, has some elements of novelty).

No response.

Other methods. List other methods you tried.

No response.

How helpful did you find the included KC models?

  • [not selected]  Crucial in getting good predictions
  • [selected]  Somewhat helpful in getting good predictions
  • [not selected]  Neutral
  • [not selected]  Not particularly helpful
  • [not selected]  Irrelevant

If you learned latent factors, how helpful were they?

  • [not selected]  Crucial in getting good predictions
  • [not selected]  Somewhat helpful in getting good predictions
  • [not selected]  Neutral
  • [not selected]  Not particularly helpful
  • [not selected]  Irrelevant

Details on the relevance of the KC models and latent factors:

If we had more time, we might have explored doing more textual analysis of the KC model strings (as well as the problem and step names). We did not learn latent factors.

Software Implementation

Availability

  • [checked]  Proprietary in-house software
  • [not checked]  Commercially available in-house software
  • [not checked]  Freeware or shareware in-house software
  • [not checked]  Off-the-shelf third party commercial software
  • [not checked]  Off-the-shelf third party freeware or shareware

Programming language

  • [checked]  C/C++
  • [not checked]  Java
  • [not checked]  Matlab
  • [not checked]  Python/NumPy/SciPy
  • [checked]  Other (specify below)

Details on software implementation:

We used C, Tcl/Tk, and Lua primarily.

Hardware implementation

Operating system

  • [not checked]  Windows
  • [checked]  Linux or other Unix
  • [not checked]  Mac OS
  • [not checked]  Other (specify below)

Memory

  • [not selected]  <= 2 GB
  • [not selected]  <= 8 GB
  • [selected]  >= 8 GB
  • [not selected]  >= 32 GB

Parallelism

  • [checked]  Multi-processor machine
  • [checked]  Run in parallel different algorithms on different machines
  • [not checked]  Other (specify below)

Details on hardware implementation. Specify whether you provide a self contained-application or libraries.

We used a few home computers and workstations based on Intel or AMD x86_64 multi-core processors running Linux. Because of the large size of the datasets, we probably would have benefited from having more available computing power.

Code URL

Provide a URL for the code (if available):

No response.

Competition Setup

From a performance point of view, the training set was

  • [not selected]  Too big (could have achieved the same performance with significantly less data)
  • [not selected]  Too small (more data would have led to better performance)

From a computational point of view, the training set was

  • [not selected]  Too big (imposed serious computational challenges, limited the types of methods that can be applied)
  • [not selected]  Adequate (the computational load was easy to handle)

Was the time constraint imposed by the challenge a difficulty or did you feel enough time to understand the data, prepare it, and train models?

  • [not selected]  Not enough time
  • [not selected]  Enough time
  • [selected]  It was enough time to do something decent, but there was a lot left to explore. With more time performance could have been significantly improved.

How likely are you to keep working on this problem?

  • [not selected]  It is my main research area.
  • [not selected]  It was a very interesting problem. I'll keep working on it.
  • [not selected]  This data is a good fit for the data mining methods I am using/developing. I will use it in the future for empirical evaluation.
  • [not selected]  Maybe I'll try some ideas, but it is not high priority.
  • [selected]  Not likely to keep working on it.

Comments on the problem (What aspects of the problem you found most interesting? Did it inspire you to develop new techniques?)

This was an interesting and challenging problem, but we didn't understand why it was set up the way it was. For example, perhaps it would have made more sense to try to predict the time required for successful completion of each problem, or the number of steps to completion, or the probability of completion.


List references below.

No response.