Jean-Francois Chevalier Vadis Consulting Belgium
Pierre Gramme Vadis Consulting
Thierry Van de Merkt Vadis Consulting
Libei Chen Vadis Consulting
Philip Smet Vadis Consulting
Provide a URL to a web page, technical memorandum, or a paper.
http://www.vadis.com/product/rank__presentation.html
Provide a general summary with relevant background information: Where does the method come from? Is it novel? Name the prior art.
RANK is a predictive modeling tool designed by analysts for analysts. As a result, it combines powerful techniques with practical modeling experience. It automates many steps of the CRISP-DM methodology (http://www.crisp-dm.org/) for building models. RANK is built to let an analyst quickly build models on very large data sets while retaining full control over model choices and model quality, so that attention can be focused on the most important parts of the modeling process: data quality, overfitting, stability and robustness. Using RANK, the analyst gets support for many modeling phases: audit, variable recoding, variable selection, robustness improvement, result analysis and industrialization.
Summarize the algorithms you used in a way that those skilled in the art can understand what to do. Provide a profile of your methods as follows:
Please describe your data understanding efforts, and interesting observations:
Efforts were made to understand how the test set compares to the training set. As we limited our investment in this competition to 10 man-days, we did not have time to gain a deeper understanding of the KC models.
Details on feature generation:
We created more than 500 variables.
Details on feature selection:
RANK uses a highly optimized implementation of the LARS algorithm with the LASSO modification. This technique is based on Efron, Hastie, Johnstone & Tibshirani [1] and selects the most pertinent variables for the scoring. The backward pruning in RANK iteratively eliminates variables whose removal does not change the ROC by more than a prescribed threshold. Using cross-validation, it ends with a variable selection that maximizes the area under the ROC curve.
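RANK's implementation is proprietary; the following is only a minimal sketch of the same two-step idea using scikit-learn, where LassoLarsCV stands in for the LARS/LASSO step and the AUC threshold (auc_drop_threshold) is an illustrative assumption, not a RANK parameter.

# Sketch: LARS/LASSO selection followed by AUC-driven backward pruning.
# Library choices and the pruning threshold are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LassoLarsCV, LogisticRegression
from sklearn.model_selection import cross_val_score

def select_variables(X, y, auc_drop_threshold=0.001):
    # Step 1: LARS with the LASSO modification keeps the variables
    # that receive a non-zero coefficient.
    lars = LassoLarsCV(cv=5).fit(X, y)
    selected = list(np.flatnonzero(lars.coef_))
    if not selected:
        return selected

    def cv_auc(cols):
        model = LogisticRegression(max_iter=1000)
        return cross_val_score(model, X[:, cols], y,
                               cv=5, scoring="roc_auc").mean()

    # Step 2: backward pruning -- drop a variable whenever its removal
    # costs no more than the prescribed AUC threshold.
    best_auc = cv_auc(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for col in list(selected):
            trial = [c for c in selected if c != col]
            auc = cv_auc(trial)
            if best_auc - auc <= auc_drop_threshold:
                selected, best_auc, improved = trial, auc, True
                break
    return selected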
Details on latent factor discovery (techniques used, useful student/step features, how were the factors used, etc.):
No response.
More details on preprocessing:
Details on classification:
The final variable selection in RANK is based on the ROC curve (using cross-validation). This is not optimal in this context.
Details on model selection:
Scores shown in the table below are Cup scores, not leaderboard scores. The difference between the two is described on the Evaluation page.
A reader should also know from reading the fact sheet what the strength of the method is.
Please comment about the following:
We participated in this contest to validate that our software still provides state-of-the-art results in a very short time (as it did for the last three KDD Cups). According to the results on the sample set, this goal appears to have been achieved. We spent only 10 man-days in total to build the model; of these, 9 were spent on feature creation.
Automatic variable recoding: RANK offers several recoding strategies. The most efficient recoding, initially designed for nominal variables, consists of converting modalities into numeric values according to their relation with the target. RANK extends this recoding to numeric variables by coupling it with an efficient binning technique. The advantage of this recoding is that it solves the problem of non-normal distributions and allows spotting highly non-linear relationships between any variable and the target.

Overfitting: RANK is designed to avoid overfitting. This is achieved through cross-validation, ridge regression [2], regrouping of small modalities and missing-value treatment. Performance is carefully assessed by using a large number of bootstrap samples.
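RANK's recoding is proprietary; the sketch below only illustrates one concrete instantiation of the idea with pandas: each modality of a nominal variable is replaced by a smoothed target mean, and numeric variables are first binned so the same recoding applies to them. The smoothing constant and bin count are assumptions, not RANK parameters.

# Sketch of target-based recoding with binning (illustrative only).
import pandas as pd

def recode_nominal(series, target, smoothing=20.0):
    prior = target.mean()
    stats = target.groupby(series).agg(["mean", "count"])
    # Small modalities are shrunk toward the prior, which also acts as a
    # safeguard against overfitting on rare categories.
    shrunk = ((stats["count"] * stats["mean"] + smoothing * prior)
              / (stats["count"] + smoothing))
    return series.map(shrunk).fillna(prior)

def recode_numeric(series, target, n_bins=20, smoothing=20.0):
    # Equal-frequency binning turns the numeric variable into a nominal
    # one, so non-normal and non-linear relations with the target can be
    # captured by the same recoding.
    binned = pd.qcut(series, q=n_bins, duplicates="drop")
    return recode_nominal(binned, target, smoothing)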
Details on the relevance of the KC models and latent factors:
Details on software implementation:
One unique feature of RANK is its ability to store all the data required for the computation of the model in the computer's RAM. RANK uses an advanced proprietary compression technique that allows storing 15 GB of data in barely 240 MB of RAM. With this strategy, the time required to access the data is minimized, which allows RANK to easily and reliably perform multi-pass model computations.
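The compression technique itself is proprietary; purely as an illustration of the general idea, the sketch below shrinks a pandas DataFrame in memory by dictionary-encoding repeated string values and downcasting numeric columns.

# Sketch: reduce the in-memory footprint of a table (illustrative only).
import pandas as pd

def compress_in_memory(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == "object":
            # Repeated string values are stored once; rows keep small codes.
            out[col] = out[col].astype("category")
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

Comparing df.memory_usage(deep=True).sum() before and after gives the actual reduction on a given data set.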
Details on hardware implementation. Specify whether you provide a self-contained application or libraries.
Provide a URL for the code (if available):
Avoid overfitting by selecting a build set similar to the test set from the complete training set.
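The fact sheet does not spell out how such a build set is selected; a commonly used approach, sketched below under that assumption, is to train a classifier to separate training rows from test rows and keep the training rows that look most like the test set. The model choice and cutoff are assumptions.

# Sketch: keep training rows that resemble the test set (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def select_build_set(X_train, X_test, keep_fraction=0.5):
    X = np.vstack([X_train, X_test])
    is_test = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = GradientBoostingClassifier().fit(X, is_test)
    # Probability that a training row could belong to the test set.
    test_likeness = clf.predict_proba(X_train)[:, 1]
    cutoff = np.quantile(test_likeness, 1.0 - keep_fraction)
    return np.flatnonzero(test_likeness >= cutoff)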
List references below.
1. B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Least Angle Regression. The Annals of Statistics, Vol. 32, No. 2, 2004, pp. 407-499.
2. A. E. Hoerl and R. Kennard. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12: 55-67, 1970.
3. E. P. Smith, I. Lipkovich and K. Ye. Weight of Evidence (WOE): Quantitative Estimation of Probability of Impact. Blacksburg, VA: Virginia Tech, Department of Statistics, 2002.