GBM with SVD


Team Leader

Zhonghua Zhang
United States

Team Members

Yuchen Su

Overview

Supplementary online material

Provide a URL to a web page, technical memorandum, or a paper.

No response.

Background*

Provide a general summary with relevant background information: Where does the method come from? Is it novel? Name the prior art.

The method is a combination of quite a few well-known techniques, some of which are relatively new (such as Gradient Boosting Machines [2]) and some very old (such as Singular Value Decomposition [1]). We simply stumbled upon our way of combining the results of multiple SVDs and only later figured out that it was called gradient boosting.
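
For concreteness, here is a minimal toy sketch of the idea (our reconstruction for this fact sheet, not the competition code): each boosting stage fits a rank-1 factorization ("SVD") to the current residuals by stochastic gradient descent and is added with shrinkage, as in Friedman's GBM [2]. All constants, names, and data below are illustrative.

    // Toy gradient boosting with rank-1 factorizations as base learners.
    #include <cstdio>
    #include <vector>

    struct Obs { int row, col; double y; };     // e.g. (student, step, correct)

    int main() {
        const int ROWS = 4, COLS = 4, STAGES = 3, EPOCHS = 200;
        const double lr = 0.05, shrink = 0.5;   // shrinkage as in Friedman's GBM
        std::vector<Obs> data = {{0,0,1},{0,1,0},{1,1,1},{1,2,1},{2,2,1},{3,3,0}};
        std::vector<double> pred(data.size(), 0.0);
        for (int m = 0; m < STAGES; ++m) {
            std::vector<double> u(ROWS, 0.1), v(COLS, 0.1);  // rank-1 factors
            for (int e = 0; e < EPOCHS; ++e)
                for (std::size_t k = 0; k < data.size(); ++k) {
                    const Obs& o = data[k];
                    double r = (o.y - pred[k]) - u[o.row] * v[o.col]; // residual fit error
                    double du = r * v[o.col], dv = r * u[o.row];
                    u[o.row] += lr * du;        // SGD step on the squared error
                    v[o.col] += lr * dv;
                }
            for (std::size_t k = 0; k < data.size(); ++k)    // add the shrunken stage
                pred[k] += shrink * u[data[k].row] * v[data[k].col];
        }
        for (std::size_t k = 0; k < data.size(); ++k)
            std::printf("y=%.0f pred=%.3f\n", data[k].y, pred[k]);
        return 0;
    }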

Method

Summarize the algorithms you used in a way that those skilled in the art should understand what to do. Provide a profile of your methods as follows:

Data exploration and understanding

Did you use data exploration techniques to

  • [checked]  Identify selection biases
  • [checked]  Identify temporal effects (e.g. students getting better over time)
  • [checked]  Understand the variables
  • [checked]  Explore the usefulness of the KC models
  • [checked]  Understand the relationships between the different KC types

Please describe your data understanding efforts, and interesting observations:

The most important data exploration task, as we see it, is to come up with our own version of the test data that is as close as possible to the real one. We simply take the question right before the last one for each student/unit combination, and this worked quite well.
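
A hedged sketch of that split (the field names and toy data are ours): for every student/unit pair we find the last question and hold out the one right before it as validation, mirroring how the real test set holds out the last question.

    // Build a validation split: second-to-last question per (student, unit).
    #include <cstdio>
    #include <map>
    #include <utility>
    #include <vector>

    struct Row { int student, unit, question; };   // one transaction row

    int main() {
        std::vector<Row> rows = {{1,1,1},{1,1,2},{1,1,3},
                                 {2,1,1},{2,1,2},{2,1,3},{2,1,4}};
        std::map<std::pair<int,int>, int> last;    // max question per (student, unit)
        for (const Row& r : rows) {
            int& m = last[{r.student, r.unit}];
            if (r.question > m) m = r.question;
        }
        for (const Row& r : rows) {
            bool validation = (r.question == last[{r.student, r.unit}] - 1);
            std::printf("student %d unit %d question %d -> %s\n",
                        r.student, r.unit, r.question,
                        validation ? "validation" : "train");
        }
        return 0;
    }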

Preprocessing

Feature generation

  • [checked]  Features designed to capture the step type (e.g. enter given, or ... )
  • [checked]  Features based on the textual step name
  • [checked]  Features designed to capture the KC type
  • [checked]  Features based on the textual KC name
  • [checked]  Features derived from opportunity counts
  • [checked]  Features derived from the problem name
  • [not checked]  Features based on student ID
  • [checked]  Other features

Details on feature generation:

We looked for patterns in the student IDs but did not find any. So good job anonymizing and randomizing this time! The majority of the effort on this model went into feature generation. We started with the most obvious features, such as student and KC, and as we progressed we looked at more and more "obscure" ones. It is interesting that some features seem to work really well on one dataset but not the other. One example is the reverse position (counting from the last) of each step within its question, which appears quite significant for Algebra but much less so for Bridge to Algebra.
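
As an illustration, the reverse-position feature can be computed in one pass over the step counts (a toy sketch with made-up names, not the actual pipeline):

    // Reverse position of a step: distance from the last step of its question.
    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        // (question, 1-based step position), assumed already in order
        std::vector<std::pair<std::string, int>> steps =
            {{"Q1",1},{"Q1",2},{"Q1",3},{"Q2",1},{"Q2",2}};
        std::map<std::string, int> total;          // number of steps per question
        for (const auto& s : steps)
            total[s.first] = std::max(total[s.first], s.second);
        for (const auto& s : steps)                // 0 means the last step
            std::printf("%s step %d -> reverse position %d\n",
                        s.first.c_str(), s.second, total[s.first] - s.second);
        return 0;
    }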

Feature selection

  • [not checked]  Feature ranking with correlation or other criterion (specify below)
  • [not checked]  Filter method (other than feature ranking)
  • [not checked]  Wrapper with forward or backward selection (nested subset method)
  • [checked]  Wrapper with intensive search (subsets not nested)
  • [not checked]  Embedded method
  • [not checked]  Other method not listed above (specify below)

Details on feature selection:

Our feature set is not very large (35 features in total), so we can afford an exhaustive search over every 2-way combination among them. Our method is entirely based on SVD, so we only consider pairs of features, not single features or higher-order combinations.
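
A sketch of that search (evaluate() is a hypothetical placeholder for fitting the model on one feature pair and scoring it on our held-out set): with 35 features there are only 35*34/2 = 595 pairs, so scoring every pair is affordable.

    // Exhaustive search over all 2-way feature combinations.
    #include <cstdio>

    // Placeholder: in reality this would fit an SVD on the matrix induced by
    // features i and j and return the validation RMSE.
    double evaluate(int i, int j) { return 1.0 / (1.0 + i + 2.0 * j); }

    int main() {
        const int N = 35;                      // size of our feature set
        int best_i = -1, best_j = -1;
        double best = 1e9;
        for (int i = 0; i < N; ++i)
            for (int j = i + 1; j < N; ++j) {  // 595 pairs in total
                double rmse = evaluate(i, j);
                if (rmse < best) { best = rmse; best_i = i; best_j = j; }
            }
        std::printf("best pair: (%d, %d), rmse %.4f\n", best_i, best_j, best);
        return 0;
    }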

Did you attempt to identify latent factors?

  • [not checked]  Cluster students
  • [not checked]  Cluster knowledge components
  • [not checked]  Cluster steps
  • [checked]  Latent feature discovery was performed jointly with learning

Details on latent factor discovery (techniques used, useful student/step features, how were the factors used, etc.):

Steps are clustered simply by their names and occurrence frequencies. We thought about clustering students/KCs/steps by more sophisticated means, such as using the singular vectors resulting from the SVD, but did not have time to do so.
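
A toy version of the step clustering described above (the threshold and step names are illustrative): steps sharing a name fall into one cluster, and names seen fewer than a minimum number of times are merged into a single catch-all cluster.

    // Cluster steps by name, folding rare names into one "rare" cluster.
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        const int MIN_COUNT = 2;               // illustrative frequency cutoff
        std::vector<std::string> names = {"combine", "combine", "isolate", "factor"};
        std::map<std::string, int> freq;
        for (const auto& n : names) ++freq[n]; // occurrence frequency per name
        for (const auto& n : names) {
            std::string cluster = (freq[n] >= MIN_COUNT) ? n : "rare";
            std::printf("%s -> cluster %s\n", n.c_str(), cluster.c_str());
        }
        return 0;
    }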

Other preprocessing

  • [checked]  Filling missing values (for KC)
  • [not checked]  Principal component analysis

More details on preprocessing:

All missing KCs are grouped into a single "None" value. These seem to indicate the easiest of all "actual" steps. SVD is related to PCA but not exactly the same, so we did not check the box above.

Classification

Base classifier

  • [not checked]  Decision tree, stub, or Random Forest
  • [checked]  Linear classifier (Fisher's discriminant, SVM, linear regression)
  • [not checked]  Non-linear kernel method (SVM, kernel ridge regression, kernel logistic regression)
  • [not checked]  Naïve Bayes
  • [not checked]  Bayesian Network (other than Naïve Bayes)
  • [not checked]  Neural Network
  • [not checked]  Bayesian Neural Network
  • [not checked]  Nearest neighbors
  • [not checked]  Latent variable models (e.g. matrix factorization)
  • [not checked]  Neighborhood/correlation based collaborative filtering
  • [not checked]  Bayesian Knowledge Tracing
  • [not checked]  Additive Factor Model
  • [not checked]  Item Response Theory
  • [not checked]  Other classifier not listed above (specify below)

Loss Function

  • [not checked]  Hinge loss (like in SVM)
  • [checked]  Square loss (like in ridge regression)
  • [checked]  Logistic loss or cross-entropy (like in logistic regression)
  • [not checked]  Exponential loss (like in boosting)
  • [not checked]  None
  • [not checked]  Don't know
  • [not checked]  Other loss (specify below)

Regularizer

  • [not checked]  One-norm (sum of weight magnitudes, like in Lasso)
  • [not checked]  Two-norm (||w||^2, like in ridge regression and regular SVM)
  • [not checked]  Structured regularizer (like in group lasso)
  • [not checked]  None
  • [not checked]  Don't know
  • [checked]  Other (specify below)

Ensemble Method

  • [checked]  Boosting
  • [not checked]  Bagging (check this if you use Random Forest)
  • [checked]  Other ensemble method
  • [not checked]  None

Were you able to use information present only in the training set?

  • [checked]  Corrects, incorrects, hints
  • [checked]  Step start/end times

Did you use post-training calibration to obtain accurate probabilities?

  • [selected]  Yes
  • [not selected]  No

Did you make use of the development data sets for training?

  • [not selected]  Yes
  • [selected]  No

Details on classification:

We believe SVD counts as a "linear" classifier, so we checked that box. Our method has a random component, so in the end we employed simple averaging over runs. We regularize by requiring the improvement offered by each feature pair to exceed a certain threshold.
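
The following is a minimal sketch of those two devices (our reconstruction, not the actual code): a boosting stage is kept only if it improves the validation score by more than a threshold, and the whole fit is repeated with different random seeds and averaged. fit_stage_and_score() is a hypothetical placeholder.

    // Threshold-gated boosting stages, averaged over several random runs.
    #include <cstdio>
    #include <cstdlib>

    // Placeholder: fit one more boosting stage (on some feature pair) and
    // return the resulting validation RMSE; the randomness stands in for
    // the random component of the real method.
    double fit_stage_and_score(unsigned seed) {
        std::srand(seed);
        return 0.30 + (std::rand() % 100) / 2000.0;   // 0.30 .. 0.35
    }

    int main() {
        const double EPS = 0.0005;             // illustrative improvement threshold
        const int RUNS = 5, MAX_STAGES = 50;
        double avg = 0.0;
        for (unsigned run = 0; run < RUNS; ++run) {
            double rmse = 1.0;                 // score before any stage is added
            for (int m = 0; m < MAX_STAGES; ++m) {
                double cand = fit_stage_and_score(run * 100 + m);
                if (rmse - cand <= EPS) break; // improvement too small: stop
                rmse = cand;                   // accept the stage
            }
            avg += rmse / RUNS;                // simple averaging across runs
        }
        std::printf("averaged validation rmse: %.4f\n", avg);
        return 0;
    }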

Model selection/hyperparameter selection

  • [checked]  We used the online feedback of the leaderboard.
  • [not checked]  K-fold or leave-one-out cross-validation (using training data)
  • [not checked]  Virtual leave-one-out (closed-form estimation of LOO with a single classifier training)
  • [not checked]  Out-of-bag estimation (for bagging methods)
  • [not checked]  Bootstrap estimation (other than out-of-bag)
  • [not checked]  Other cross-validation method
  • [not checked]  Bayesian model selection
  • [not checked]  Penalty-based method (non-Bayesian)
  • [not checked]  Bi-level optimization
  • [checked]  Other method not listed above (specify below)

Details on model selection:

As stated earlier, we created our own test dataset in the same fashion as the actual one.

Results

Final Team Submission

Scores shown in the table below are Cup scores, not leaderboard scores. The difference between the two is described on the Evaluation page.

A reader should also know from reading the fact sheet what the strength of the method is.

Please comment about the following:

Quantitative advantages (e.g., compact feature subset, simplicity, computational advantages).

It is our opinion that our approach is more computationally efficient than trees; however, we did not build a comparable tree-based classifier, so we are not certain. Conceptually, our approach is very simple.

Qualitative advantages (e.g. compute posterior probabilities, theoretically motivated, has some elements of novelty).

We believe our combination of GBM and SVD has elements of novelty, though we may have just re-invented the wheel. We will find out during the conference.

Other methods. List other methods you tried.

We tried a tree-like approach to improve the result further, with no success.

How helpful did you find the included KC models?

  • [selected]  Crucial in getting good predictions
  • [not selected]  Somewhat helpful in getting good predictions
  • [not selected]  Neutral
  • [not selected]  Not particularly helpful
  • [not selected]  Irrelevant

If you learned latent factors, how helpful were they?

  • [selected]  Crucial in getting good predictions
  • [not selected]  Somewhat helpful in getting good predictions
  • [not selected]  Neutral
  • [not selected]  Not particularly helpful
  • [not selected]  Irrelevant

Details on the relevance of the KC models and latent factors:

KCs are very important. It is worth noting that the third KC model (Rules) on the Algebra dataset is by far the best one. It is a pity that we do not have it on the other dataset.

Software Implementation

Availability

  • [checked]  Proprietary in-house software
  • [not checked]  Commercially available in-house software
  • [not checked]  Freeware or shareware in-house software
  • [not checked]  Off-the-shelf third party commercial software
  • [not checked]  Off-the-shelf third party freeware or shareware

Language

  • [checked]  C/C++
  • [not checked]  Java
  • [not checked]  Matlab
  • [not checked]  Python/NumPy/SciPy
  • [not checked]  Other (specify below)

Details on software implementation:

We did some data exploration in R, but it soon became clear that R was too slow. We wrote the entire program in C++, using freely available math libraries for the SVD part.

Hardware implementation

Platform

  • [checked]  Windows
  • [not checked]  Linux or other Unix
  • [not checked]  Mac OS
  • [not checked]  Other (specify below)

Memory

  • [not selected]  <= 2 GB
  • [not selected]  <= 8 GB
  • [selected]  >= 8 GB
  • [not selected]  >= 32 GB

Parallelism

  • [checked]  Multi-processor machine
  • [checked]  Run in parallel different algorithms on different machines
  • [not checked]  Other (specify below)

Details on hardware implementation. Specify whether you provide a self contained-application or libraries.

We will not be releasing our code.

Code URL

Provide a URL for the code (if available):

No response.

Competition Setup

From a performance point of view, the training set was

  • [not selected]  Too big (could have achieved the same performance with significantly less data)
  • [selected]  Too small (more data would have led to better performance)

From a computational point of view, the training set was

  • [not selected]  Too big (imposed serious computational challenges, limited the types of methods that can be applied)
  • [selected]  Adequate (the computational load was easy to handle)

Was the time constraint imposed by the challenge a difficulty or did you feel enough time to understand the data, prepare it, and train models?

  • [not selected]  Not enough time
  • [not selected]  Enough time
  • [selected]  It was enough time to do something decent, but there was a lot left to explore. With more time performance could have been significantly improved.

How likely are you to keep working on this problem?

  • [not selected]  It is my main research area.
  • [not selected]  It was a very interesting problem. I'll keep working on it.
  • [not selected]  This data is a good fit for the data mining methods I am using/developing. I will use it in the future for empirical evaluation.
  • [not selected]  Maybe I'll try some ideas, but it is not high priority.
  • [selected]  Not likely to keep working on it.

Comments on the problem (What aspects of the problem you found most interesting? Did it inspire you to develop new techniques?)

More data would definitely improve the results. The available dataset is reasonably large but very sparse, as noted in the original description; more data would help significantly with the less frequently encountered problems, steps, etc. The current data already imposes quite a computational burden: we have state-of-the-art "consumer" computers, and the time required to fit the model was quite a struggle. However, participants in academia or industry with access to enterprise/institutional-level hardware will not find these datasets too large at all. Given more time, performance could have been improved further; it is quite obvious that the last few days generated quite a jump in model performance.

References

List references below.

[1] Nathan Srebro and Tommi Jaakkola (2003). Weighted low rank approximation. In Proceedings of the 20th International Conference on Machine Learning (ICML 2003).
[2] Jerome H. Friedman (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.