Vladimir Nikulin University of Queensland Australia
Provide a URL to a web page, technical memorandum, or a paper.
No response.
Provide a general summary with relevant background information: Where does the method come from? Is it novel? Name the prior art.
Our method combines three main approaches: random forests in R, Naive Bayes (or averaging), and matrix factorisation, where different KC opportunities were considered separately. Using the above three models we produced 15 different solutions for the test set and for a randomly selected (labelled) subset of the training set, named S, which was set aside and not used for the primary training. The size of the subset S was about 5% of the whole training data for the algebra set and about 2% for the bridge_to_algebra set. Then, using the above 15 solutions as explanatory variables (or features), we created secondary training and test sets. The final ensemble solution was constructed with the GBM function in R. We consider our method novel. During the last 20 hours of the Challenge we observed a dramatic improvement, and we do not regard our results as complete or final. Our method is very flexible, and there is considerable potential for further improvement and development.
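The two-stage ensemble described above can be sketched as follows. This is a minimal illustration in Python rather than the R/GBM code actually used; the base-model predictions are simulated, and the secondary model is a least-squares blend instead of GBM, purely to show how the 15 solutions on the held-out subset S become features for a second-stage model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the 15 base-model predictions on the held-out labelled
# subset S and on the test set (all sizes here are illustrative, not
# the actual Cup dimensions).
n_holdout, n_test, n_models = 200, 100, 15
true_holdout = rng.integers(0, 2, n_holdout).astype(float)
preds_S = np.clip(
    true_holdout[:, None] * 0.6 + rng.random((n_holdout, n_models)) * 0.4,
    0.0, 1.0)                       # noisy base predictions on S
preds_test = rng.random((n_test, n_models))

# Secondary (stacked) model fitted on S: the submission used GBM in R;
# a linear least-squares blend is used here as a simplification.
w, *_ = np.linalg.lstsq(preds_S, true_holdout, rcond=None)
final_test = np.clip(preds_test @ w, 0.0, 1.0)
```

Because S was never used to train the base models, the secondary fit on S estimates out-of-sample blending weights rather than memorising the base models' training errors.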
Summarize the algorithms you used in a way that those skilled in the art should understand what to do. Profile of your methods as follows:
Please describe your data understanding efforts, and interesting observations:
Details on feature generation:
Details on feature selection:
It was rather feature smoothing based on the method of subintervals.
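One plausible reading of "smoothing by subintervals" is to partition a numeric feature's range into bins and replace each raw value by the mean target rate within its bin. The sketch below (Python, not the team's actual Perl/R code) illustrates that reading; the bin count and equal-width partition are assumptions.

```python
import numpy as np

def smooth_by_subintervals(x, y, n_bins=10):
    """Replace raw feature values by the mean of the target y within
    equal-width subintervals of x (one possible interpretation of the
    smoothing described in the fact sheet)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    means = np.array([y[idx == b].mean() if np.any(idx == b) else y.mean()
                      for b in range(n_bins)])
    return means[idx]

x = np.array([0.1, 0.2, 0.9, 0.95, 0.5])
y = np.array([0.0, 1.0, 1.0, 1.0, 0.0])
smoothed = smooth_by_subintervals(x, y, n_bins=2)
```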
Details on latent factor discovery (techniques used, useful student/step features, how were the factors used, etc.):
Also, we used gradient-based matrix factorisation.
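A gradient-based matrix factorisation of the kind mentioned above can be sketched as stochastic gradient descent over the observed (student, step, outcome) entries. The sketch below is in Python, not the team's C implementation; the matrix sizes, rank, learning rate and regularisation strength are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy student-by-step "correct first attempt" entries: (student, step, outcome).
observed = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0), (2, 1, 1.0)]
n_students, n_steps, rank = 3, 2, 2

U = 0.1 * rng.standard_normal((n_students, rank))  # student factors
V = 0.1 * rng.standard_normal((n_steps, rank))     # step factors
lr, reg = 0.05, 0.01

def sse():
    # Squared reconstruction error over the observed entries only.
    return sum((r - U[i] @ V[j]) ** 2 for i, j, r in observed)

initial_error = sse()
for _ in range(300):                     # SGD passes over observed entries
    for i, j, r in observed:
        err = r - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])
final_error = sse()
```

The learned factors can then reconstruct unobserved entries (`U[i] @ V[j]`) or serve as latent features for downstream models, e.g. the clustering described below.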
More details on preprocessing:
Missing values were treated as special values.
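Treating missings as special values can be done by replacing them with a sentinel outside the feature's normal range, optionally alongside a missingness indicator. The sentinel value and indicator column below are assumptions for illustration; the fact sheet does not specify the encoding used.

```python
import numpy as np

MISSING = -1.0                       # assumed sentinel, outside [0, 1]
x = np.array([0.3, np.nan, 0.7, np.nan])

x_filled = np.where(np.isnan(x), MISSING, x)   # missing -> special value
is_missing = np.isnan(x).astype(float)         # optional companion indicator
```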
Details on classification:
We used k-means clustering applied to the latent factors, with regularisation. The goal of the regularisation is to ensure that all clusters are sufficiently large.
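One way to regularise k-means toward large clusters is to add a penalty, during assignment, that grows with a cluster's current size, discouraging any single cluster from absorbing most points. The sketch below implements that idea in Python; the penalty form, online assignment order, and all parameters are assumptions, not the authors' exact formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.standard_normal((60, 4))     # toy latent-factor vectors
k, lam, n_iter = 3, 0.05, 10
centers = X[rng.choice(len(X), k, replace=False)].copy()

for _ in range(n_iter):
    counts = np.zeros(k)
    labels = np.empty(len(X), dtype=int)
    for n in rng.permutation(len(X)):            # online assignment pass
        d2 = ((X[n] - centers) ** 2).sum(axis=1)
        j = int(np.argmin(d2 + lam * counts))    # size penalty balances clusters
        labels[n] = j
        counts[j] += 1
    for j in range(k):                           # standard centroid update
        if counts[j] > 0:
            centers[j] = X[labels == j].mean(axis=0)
```

With `lam = 0`, this reduces to ordinary Lloyd-style k-means; increasing `lam` trades cluster compactness for more balanced (hence sufficiently large) cluster sizes.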
Details on model selection:
Scores shown in the table below are Cup scores, not leaderboard scores. The difference between the two is described on the Evaluation page.
A reader should also know from reading the fact sheet what the strength of the method is.
Please comment about the following:
Simplicity of the model is very important when data are large and complex.
We do believe that our method has quite interesting and novel theoretical grounds.
Details on the relevance of the KC models and latent factors:
Details on software implementation:
R and Perl
Details on hardware implementation. Specify whether you provide a self-contained application or libraries.
Provide a URL for the code (if available):
Yes, the problem is very interesting, and it inspired us to develop a lot of new software (written in C). However, we had many problems with the pre-processing, which was conducted in Perl, and we cannot be sure that all of them have been resolved. About one week before the deadline we even asked the Organisers of the Cup to extend the deadline by one week.
List references below.