Jean-Francois Chevalier Vadis Consulting Belgium
Pierre Gramme Vadis Consulting
Thierry Van de Merkt Vadis Consulting
Libei Chen Vadis Consulting
Philip Smet Vadis Consulting
Provide a URL to a web page, technical memorandum, or a paper.
http://www.vadis.com/product/rank__presentation.html
Provide a general summary with relevant background information: Where does the method come from? Is it novel? Name the prior art.
RANK is a predictive modeling tool designed by analysts for analysts. As a result, it combines powerful techniques with practical modeling experience. It automates many steps of the CRISP-DM methodology (http://www.crisp-dm.org/) for building models. RANK is built to let an analyst quickly build models on very large data sets while retaining full control over model choices and model quality, so that attention can be focused on the most important parts of the modeling process: data quality, overfitting, stability and robustness. Using RANK, the analyst gets support for many modeling phases: audit, variable recoding, variable selection, robustness improvement, result analysis and industrialization.
Summarize the algorithms you used in a way that those skilled in the art can understand what to do. Provide a profile of your methods as follows:
Please describe your data understanding efforts, and interesting observations:
Efforts were made to understand how the test set compares to the training set. As we limited our investment in this competition to 10 man-days, we did not have time to gain a deeper understanding of the KC models.
Details on feature generation:
We created more than 500 variables.
Details on feature selection:
RANK uses a highly optimized implementation of the LARS algorithm with the LASSO modification. This technique is based on Efron, Hastie, Johnstone & Tibshirani [1] and selects the most pertinent variables for the scoring. The backward pruning in RANK iteratively eliminates variables whose removal does not change the ROC by more than a prescribed threshold. Using cross-validation, it ends with a variable selection that maximizes the area under the ROC curve.
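RANK's implementation is proprietary; the following is only a minimal sketch of the same two-step idea using scikit-learn, where LassoLarsCV stands in for the LARS/LASSO step and the AUC threshold (auc_drop_threshold) is an illustrative assumption, not a RANK parameter.

# Sketch: LARS/LASSO selection followed by AUC-driven backward pruning.
# Library choices and the pruning threshold are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LassoLarsCV, LogisticRegression
from sklearn.model_selection import cross_val_score

def select_variables(X, y, auc_drop_threshold=0.001):
    # Step 1: LARS with the LASSO modification keeps the variables
    # that receive a non-zero coefficient.
    lars = LassoLarsCV(cv=5).fit(X, y)
    selected = list(np.flatnonzero(lars.coef_))
    if not selected:
        return selected

    def cv_auc(cols):
        model = LogisticRegression(max_iter=1000)
        return cross_val_score(model, X[:, cols], y,
                               cv=5, scoring="roc_auc").mean()

    # Step 2: backward pruning -- drop a variable whenever its removal
    # costs no more than the prescribed AUC threshold.
    best_auc = cv_auc(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for col in list(selected):
            trial = [c for c in selected if c != col]
            auc = cv_auc(trial)
            if best_auc - auc <= auc_drop_threshold:
                selected, best_auc, improved = trial, auc, True
                break
    return selected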
Details on latent factor discovery (techniques used, useful student/step features, how were the factors used, etc.):
No response.
More details on preprocessing:
Details on classification:
The final variable selection in RANK is based on the ROC curve (using cross-validation). This is not optimal in this context.
Details on model selection:
Scores shown in the table below are Cup scores, not leaderboard scores. The difference between the two is described on the Evaluation page.
A reader should also know from reading the fact sheet what the strength of the method is.
Please comment about the following:
We participated in this contest to validate that our software still provides state-of-the-art results in a very short time (as it did for the last three KDD Cups). According to the results on the sample set, this goal appears to have been achieved. We spent only 10 man-days in total to build the model; of these, 9 were spent on feature creation.
Automatic variable recoding: RANK offers several recoding strategies. The most efficient recoding, initially designed for nominal variables, consists of converting modalities into numeric values according to their relation with the target. RANK extends this recoding to numeric variables by coupling it with an efficient binning technique. The advantage of this recoding is that it solves the problem of non-normal distributions and allows spotting highly non-linear relationships between any variable and the target.

Overfitting: RANK is designed to avoid overfitting. This is achieved through cross-validation, ridge regression [2], regrouping of small modalities and missing-value treatment. Performance is carefully assessed by using a large number of bootstrap samples.
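RANK's recoding is proprietary; the sketch below only illustrates one concrete instantiation of the idea with pandas: each modality of a nominal variable is replaced by a smoothed target mean, and numeric variables are first binned so the same recoding applies to them. The smoothing constant and bin count are assumptions, not RANK parameters.

# Sketch of target-based recoding with binning (illustrative only).
import pandas as pd

def recode_nominal(series, target, smoothing=20.0):
    prior = target.mean()
    stats = target.groupby(series).agg(["mean", "count"])
    # Small modalities are shrunk toward the prior, which also acts as a
    # safeguard against overfitting on rare categories.
    shrunk = ((stats["count"] * stats["mean"] + smoothing * prior)
              / (stats["count"] + smoothing))
    return series.map(shrunk).fillna(prior)

def recode_numeric(series, target, n_bins=20, smoothing=20.0):
    # Equal-frequency binning turns the numeric variable into a nominal
    # one, so non-normal and non-linear relations with the target can be
    # captured by the same recoding.
    binned = pd.qcut(series, q=n_bins, duplicates="drop")
    return recode_nominal(binned, target, smoothing)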
Details on the relevance of the KC models and latent factors:
Details on software implementation:
One unique feature of RANK is its ability to store all the data required for the computation of the model in the computer's RAM. RANK uses an advanced proprietary compression technique that allows storing 15 GB of data in barely 240 MB of RAM. With this strategy, the time required to access the data is minimized, which allows RANK to easily and reliably perform multi-pass model computations.
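The compression technique itself is proprietary; purely as an illustration of the general idea, the sketch below shrinks a pandas DataFrame in memory by dictionary-encoding repeated string values and downcasting numeric columns.

# Sketch: reduce the in-memory footprint of a table (illustrative only).
import pandas as pd

def compress_in_memory(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == "object":
            # Repeated string values are stored once; rows keep small codes.
            out[col] = out[col].astype("category")
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

Comparing df.memory_usage(deep=True).sum() before and after gives the actual reduction on a given data set.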
Details on hardware implementation. Specify whether you provide a self-contained application or libraries.
Provide a URL for the code (if available):
Avoid overfitting by selecting a build set similar to the test set from the complete training set.
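The fact sheet does not spell out how such a build set is selected; a commonly used approach, sketched below under that assumption, is to train a classifier to separate training rows from test rows and keep the training rows that look most like the test set. The model choice and cutoff are assumptions.

# Sketch: keep training rows that resemble the test set (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def select_build_set(X_train, X_test, keep_fraction=0.5):
    X = np.vstack([X_train, X_test])
    is_test = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = GradientBoostingClassifier().fit(X, is_test)
    # Probability that a training row could belong to the test set.
    test_likeness = clf.predict_proba(X_train)[:, 1]
    cutoff = np.quantile(test_likeness, 1.0 - keep_fraction)
    return np.flatnonzero(test_likeness >= cutoff)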
List references below.
1. B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Least Angle Regression. The Annals of Statistics, Vol. 32, No. 2, 2004, pp. 407-499.
2. A. E. Hoerl and R. Kennard. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12: 55-67, 1970.
3. E. P. Smith, I. Lipkovich and K. Ye. Weight of Evidence (WOE): Quantitative Estimation of Probability of Impact. Blacksburg, VA: Virginia Tech, Department of Statistics, 2002.