Zhonghua Zhang, United States
Yuchen Su
Provide a URL to a web page, technical memorandum, or a paper.
No response.
Provide a general summary with relevant background information: Where does the method come from? Is it novel? Name the prior art.
The method is a combination of quite a few well-known techniques, some of which are relatively new (such as Gradient Boosting Machines) and some of which are very old (such as Singular Value Decomposition). We stumbled upon our way of combining the results of multiple SVDs and only later figured out that it was called Gradient Boosting.
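Roughly, each "SVD" here can be thought of as a low-rank factorization fit on the observed cells of a sparse matrix (cf. weighted low rank approximation [1]). Below is a minimal illustrative sketch in Python of one such factorization fit by stochastic gradient descent; our actual implementation is in C++ and differs in detail, and the function name, hyperparameters, and toy data are assumptions, not the code we ran.

```python
import numpy as np

def low_rank_sgd(rows, cols, vals, n_rows, n_cols, rank=5,
                 lr=0.01, reg=0.05, epochs=30, seed=0):
    """Fit a rank-`rank` factorization R ~ U @ V.T using only the observed
    (row, col, value) triples -- a weighted low-rank approximation with 0/1
    weights -- trained by stochastic gradient descent (illustrative only)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_rows, rank))
    V = 0.1 * rng.standard_normal((n_cols, rank))
    for _ in range(epochs):
        for i, j, r in zip(rows, cols, vals):
            err = r - U[i] @ V[j]
            # Simultaneous regularized update of both factor rows.
            U[i], V[j] = (U[i] + lr * (err * V[j] - reg * U[i]),
                          V[j] + lr * (err * U[i] - reg * V[j]))
    return U, V

# Toy usage: a tiny sparse student-by-step matrix of correct-first-attempt values.
rows = [0, 0, 1, 2, 2]
cols = [0, 1, 1, 0, 2]
vals = [1.0, 0.0, 1.0, 1.0, 0.0]
U, V = low_rank_sgd(rows, cols, vals, n_rows=3, n_cols=3, rank=2)
print(U[1] @ V[2])  # prediction for an unobserved (student, step) cell
```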
Summarize the algorithms you used in a way that those skilled in the art should understand what to do. Provide a profile of your methods as follows:
Please describe your data understanding efforts, and interesting observations:
The most important data exploration task, as we see it, is to come up with our own version of the test data that is as close as possible to the real one. We simply take the question right before the last one for each student/unit combination, and this worked quite well.
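A sketch of this split in pandas is shown below. Column names such as "Anon Student Id", "Problem Name", and "Row" follow the competition format, but the "Unit" column (derived from "Problem Hierarchy") and the exact ordering logic are illustrative assumptions.

```python
import pandas as pd

def internal_test_split(df):
    """Hold out the problem right before the last one in each (student, unit)
    group, mimicking how the official test set holds out the last problem."""
    # First appearance (smallest Row) of each problem within its student/unit group.
    first_seen = df.groupby(["Anon Student Id", "Unit", "Problem Name"])["Row"].transform("min")
    # Rank problems from the end: 1 = last problem, 2 = the one right before it.
    pos_from_end = (first_seen.groupby([df["Anon Student Id"], df["Unit"]])
                              .rank(method="dense", ascending=False))
    test_mask = pos_from_end == 2
    return df[~test_mask], df[test_mask]
```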
Details on feature generation:
We looked for patterns in the student IDs but did not find any. So good job anonymizing and randomizing this time! The majority of the effort on this model went into feature generation. We started with the most obvious features, such as student and KC. As we progressed we looked at more and more "obscure" features. It is interesting to see that some features seem to work really well on one dataset but not the other. One example is the reverse position (counting from the last) of each step in each question. This seems to be really significant for Algebra but not as much for Bridge to Algebra.
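For example, that reverse step position could be computed as in the hypothetical snippet below; the column names follow the challenge format, but grouping by ("Anon Student Id", "Problem Name", "Problem View") is an assumption about what constitutes one problem instance.

```python
import pandas as pd

def add_reverse_step_position(df):
    """Position of each step counted from the end of its problem instance
    (1 = last step, 2 = second to last, ...)."""
    grp = ["Anon Student Id", "Problem Name", "Problem View"]
    df = df.sort_values("Row")
    df["reverse_step_position"] = df.groupby(grp).cumcount(ascending=False) + 1
    return df
```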
Details on feature selection:
Our feature set isn't very large (35 features in total), so we can afford an exhaustive search over every 2-way combination among them. Our method is entirely based on SVD, so we only consider pairs of features, not single features or higher-order combinations.
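Schematically, the search looks like the sketch below. The scoring function here is a deliberately crude stand-in that predicts raw per-cell training means; in the real model each pair's matrix is instead smoothed with a low-rank (SVD) fit. Function names and the RMSE criterion are illustrative assumptions.

```python
import itertools
import numpy as np

def score_pair(train, valid, f1, f2, target="Correct First Attempt"):
    """Stand-in for one pair model: predict the training mean of the target
    for each (f1, f2) cell and score on the internal test set."""
    cell = train.groupby([f1, f2])[target].mean().rename("cell_pred").reset_index()
    merged = valid.merge(cell, on=[f1, f2], how="left")
    pred = merged["cell_pred"].fillna(train[target].mean())
    return float(np.sqrt(np.mean((merged[target] - pred) ** 2)))

def rank_feature_pairs(train, valid, features):
    """Exhaustively score every 2-way combination and sort by RMSE (lower is better)."""
    scores = {(a, b): score_pair(train, valid, a, b)
              for a, b in itertools.combinations(features, 2)}
    return sorted(scores.items(), key=lambda kv: kv[1])
```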
Details on latent factor discovery (techniques used, useful student/step features, how were the factors used, etc.):
Steps are clustered simply by their names and occurrence frequencies. We thought about clustering students/KCs/steps with more sophisticated means, such as using the resulting singular vectors from the SVD, but didn't have time to do so.
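One hypothetical way such name- and frequency-based clustering could look is sketched below; the digit normalization, the bucket scheme, and the `min_count` cutoff are assumptions for illustration, not our exact rules.

```python
import numpy as np
import pandas as pd

def cluster_steps(step_names, min_count=50):
    """Cluster steps by a normalized name; pool rare names into buckets by
    the order of magnitude of their occurrence frequency."""
    name = step_names.str.replace(r"\d+", "#", regex=True)   # e.g. "3x+7=25" -> "#x+#=#"
    counts = name.map(name.value_counts())                   # frequency of each normalized name
    bucket = np.floor(np.log10(counts)).astype(int)          # 0 = units, 1 = tens, 2 = hundreds, ...
    return name.where(counts >= min_count, "RARE_" + bucket.astype(str))
```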
More details on preprocessing:
All missing KCs are grouped into a single "None" category. They seem to indicate the easiest of all "actual" steps. SVD is related to PCA but not exactly the same, so I didn't check the box above.
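For example (the KC column name is dataset-dependent and assumed here):

```python
import pandas as pd

def fill_missing_kc(df, kc_col="KC(Default)"):
    """Group every step with no tagged knowledge component under one "None" label."""
    return df.assign(**{kc_col: df[kc_col].fillna("None")})
```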
Details on classification:
I think SVD is considered a "linear" classifier, so I checked that box. Our method has a random component, so we employ simple averaging of multiple runs at the end. We regularize by requiring the improvement offered by each feature pair to exceed a certain threshold.
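A simplified sketch of the acceptance rule and the final averaging is given below. The stage fitter is a stand-in callable; in our model each stage is a feature-pair SVD fit to the current residuals, and the threshold value and toy data are assumptions.

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def boost_with_threshold(candidate_pairs, fit_stage, y_valid, base_pred, min_gain=1e-4):
    """Add one boosting stage per candidate feature pair, keeping a stage only
    if it lowers internal-test RMSE by more than `min_gain` (the regularization)."""
    pred, kept = base_pred.copy(), []
    for pair in candidate_pairs:
        stage = fit_stage(pair, pred)            # stage predicts the current residual
        if rmse(y_valid, pred) - rmse(y_valid, pred + stage) > min_gain:
            pred, kept = pred + stage, kept + [pair]
    return pred, kept

def average_runs(run_predictions):
    """Simple averaging over runs started from different random seeds."""
    return np.mean(np.stack(run_predictions), axis=0)

# Toy usage with a dummy stage fitter standing in for a pairwise SVD fit.
y_valid = np.array([1.0, 0.0, 1.0, 1.0])
base = np.full_like(y_valid, y_valid.mean())
dummy_fit = lambda pair, pred: 0.5 * (y_valid - pred)
final_pred, kept = boost_with_threshold([("student", "step"), ("student", "KC")],
                                        dummy_fit, y_valid, base)
```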
Details on model selection:
As stated earlier, we created our test dataset in the same fashion as the actual one.
Scores shown in the table below are Cup scores, not leaderboard scores. The difference between the two is described on the Evaluation page.
A reader should also know from reading the fact sheet what the strength of the method is.
Please comment about the following:
It is our opinion that our approach is more computationally efficient than trees. However, we didn't build a comparable tree-based classifier, so we are not certain. Conceptually, our approach is very simple.
We believe our combination of GBM and SVD has elements of novelty. Of course, we may have just re-invented the wheel; we will find out during the conference.
We tried some tree-like approaches to improve the result further, with no success.
Details on the relevance of the KC models and latent factors:
KCs are very important. It is worth noting that the third KC model (Rules) on the Algebra dataset is by far the best one. It is a pity that we don't have it on the other dataset.
Details on software implementation:
We did some data exploration in R, but it soon became clear that R was too slow. We wrote our entire program in C++, using freely available math libraries for the SVD part.
Details on hardware implementation. Specify whether you provide a self-contained application or libraries.
We will not be releasing our code.
Provide a URL for the code (if available):
More data will definitely improve the result. The available dataset is reasonably large but very sparse, as noted in the original description. More data would help significantly with the less frequently touched problems/steps, etc. The current data already imposes quite a computational burden. We have state-of-the-art "consumer" computers, and the time required to fit the model has been quite a struggle. However, participants in academia or industry with access to enterprise/institutional-level hardware will not find these datasets too large at all. Given more time, performance could be further improved; it is quite obvious that the last few days generated quite a jump in model performance.
List references below.
[1] Nathan Srebro and Tommi Jaakkola (2003). Weighted low rank approximation. In Proceedings of the 20th International Conference on Machine Learning (ICML 2003).
[2] Jerome H. Friedman (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189-1232.