Using HMMs and bagged decision trees to leverage rich features of user and skill


Team Leader

Zachary A. Pardos
Worcester Polytechnic Institute
United States

Team Members

Neil Heffernan
Worcester Polytechnic Institute

Overview

Supplementary online material

Provide a URL to a web page, technical memorandum, or a paper.

http://users.wpi.edu/~zpardos/

Background*

Provide a general summary with relevant background information: Where does the method come from? Is it novel? Name the prior art.

The HMM used is a novel Bayesian HMM developed by Pardos & Heffernan [1,2] that predicts each student's probability of knowledge at each opportunity as well as the probability of a correct response on each step. The model learns individualized, student-specific parameters (prior, learn rate, guess and slip) and then uses these parameters to train skill-specific models. The resulting model, which combines user and skill parameters, outperforms models that take into account only parameters of the skill; skill-only models are the predominant prediction methods in the intelligent tutoring systems field. The Bayesian model was used in ensemble selection [3] and also to generate extra features for the decision tree classifier. Bagged decision trees [4,5], developed by Leo Breiman and Adele Cutler, were the primary classifier.
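
As a point of reference, the sketch below shows the standard knowledge tracing forward pass that the individualized model builds on: given prior, learn, guess and slip parameters, it updates the probability of knowledge after each observed response and predicts the probability of a correct response at each opportunity. The function and variable names are illustrative only; the actual model in [1,2] additionally learns per-student parameters and combines them with per-skill parameters in a Bayesian network.

    % Minimal sketch of a standard knowledge tracing forward pass
    % (illustrative, not the individualized Bayesian HMM of [1,2]).
    function [pCorrect, pKnow] = kt_forward(obs, prior, learn, guess, slip)
    % obs   : 1xN vector of 0/1 responses for one student on one skill
    % prior : P(L0), probability the skill is known before the first response
    % learn : P(T), probability of moving from unknown to known between steps
    % guess : P(G), probability of a correct response while unknown
    % slip  : P(S), probability of an incorrect response while known
    N = numel(obs);
    pKnow = zeros(1, N + 1);
    pCorrect = zeros(1, N);
    pKnow(1) = prior;
    for t = 1:N
        % predicted probability of a correct response at this opportunity
        pCorrect(t) = pKnow(t) * (1 - slip) + (1 - pKnow(t)) * guess;
        % posterior probability of knowledge given the observed response
        if obs(t) == 1
            post = pKnow(t) * (1 - slip) / pCorrect(t);
        else
            post = pKnow(t) * slip / (1 - pCorrect(t));
        end
        % transition: the student may learn the skill before the next step
        pKnow(t + 1) = post + (1 - post) * learn;
    end
    end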

Method

Summarize the algorithms you used in a way that those skilled in the art should understand what to do. Profile your methods as follows:

Data exploration and understanding

Did you use data exploration techniques to

  • [not checked]  Identify selection biases
  • [checked]  Identify temporal effects (e.g. students getting better over time)
  • [checked]  Understand the variables
  • [checked]  Explore the usefulness of the KC models
  • [checked]  Understand the relationships between the different KC types

Please describe your data understanding efforts, and interesting observations:

A large percentage (>20%) of steps had no skill tagging in any KC model; these steps were all assigned a separate, null skill id. When running cross-fold validation on the training set, rows with this null skill id were predicted with an RMSE of around 0.20. Another obstacle to a skill-centric analysis was the lack of prior responses to a skill for some users. Around 5-8% of the test rows required predicting a user's response on a step tagged with a skill the user had never encountered before. The RMSE was dramatically higher for these rows: ~0.271 for rows with prior skill responses versus ~0.323 for rows without.
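
The split-RMSE comparison above can be reproduced with a few lines of MATLAB. This is only an illustrative sketch: the vectors pred, actual and hasPriorSkill are assumed to come from an internal train/test split and are not part of the actual analysis code.

    % Compare RMSE for rows where the user has prior responses on the
    % step's skill versus rows where they do not (illustrative sketch).
    rmse = @(p, y) sqrt(mean((p - y).^2));
    rmseSeen   = rmse(pred(hasPriorSkill),  actual(hasPriorSkill));
    rmseUnseen = rmse(pred(~hasPriorSkill), actual(~hasPriorSkill));
    fprintf('RMSE with prior skill data: %.3f, without: %.3f\n', ...
            rmseSeen, rmseUnseen);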

Preprocessing

Feature generation

  • [not checked]  Features designed to capture the step type (e.g. enter given, or ... )
  • [not checked]  Features based on the textual step name
  • [checked]  Features designed to capture the KC type
  • [not checked]  Features based on the textual KC name
  • [checked]  Features derived from opportunity counts
  • [not checked]  Features derived from the problem name
  • [checked]  Features based on student ID
  • [not checked]  Other features

Details on feature generation:

Several feature sets of the user and skill were created, as well as features of the step, problem, section and unit. Inferences of student skill knowledge and the predicted probability of correct from the top user model were also used as features. Several internal test sets were created, and features were generated for the rows of those test sets from their respective training sets. In addition to percent-correct features, features indicative of users gaming the system [6] were also generated, such as the standard deviation of the user's step duration for a given skill. I believe the creation of internal test and training sets was a key decision: it allowed prediction methods to be explored on a manageable dataset size and provided feedback without using the leaderboard. The internal fit measures were often very close to the leaderboard submission results.
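
As an illustration of two of the feature types mentioned above, the sketch below computes a per-user, per-skill percent correct and the standard deviation of step duration from a training table. The table and column names (T, studentId, skillId, correct, stepDuration) are assumptions made for this sketch, not the actual feature pipeline.

    % Illustrative sketch: per user-skill percent correct and step duration
    % spread (a gaming-related feature in the spirit of [6]).
    [grp, studentId, skillId] = findgroups(T.studentId, T.skillId);
    pctCorrect = splitapply(@mean, T.correct, grp);       % percent correct per user-skill
    durStd     = splitapply(@std,  T.stepDuration, grp);  % step duration spread per user-skill
    features   = table(studentId, skillId, pctCorrect, durStd);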

Feature selection

  • [not checked]  Feature ranking with correlation or other criterion (specify below)
  • [not checked]  Filter method (other than feature ranking)
  • [not checked]  Wrapper with forward or backward selection (nested subset method)
  • [not checked]  Wrapper with intensive search (subsets not nested)
  • [not checked]  Embedded method
  • [not checked]  Other method not listed above (specify below)

Details on feature selection:

No response.

Did you attempt to identify latent factors?

  • [not checked]  Cluster students
  • [not checked]  Cluster knowledge components
  • [not checked]  Cluster steps
  • [not checked]  Latent feature discovery was performed jointly with learning

Details on latent factor discovery (techniques used, useful student/step features, how were the factors used, etc.):

No response.

Other preprocessing

  • [checked]  Filling missing values (for KC)
  • [not checked]  Principal component analysis

More details on preprocessing:

Steps often had multiple KCs associated with them. To convert this multi-tagging to single tagging for easier processing, two alternative KC taggings were created for each KC model. The first alternative kept the KC with the lowest percent correct; the second assigned a new KC id to each unique set of KCs.
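
A minimal sketch of the two alternative taggings, assuming kcList holds the set of KC names for each step and kcPctCorrect maps each KC to its training-set percent correct (both names are hypothetical):

    % Alternative 1: keep the KC with the lowest percent correct.
    % Alternative 2: assign one combined id per unique set of KCs.
    nSteps = numel(kcList);
    kcHardest = cell(nSteps, 1);
    kcCombo   = cell(nSteps, 1);
    for i = 1:nSteps
        kcs = kcList{i};
        pc  = cellfun(@(k) kcPctCorrect(k), kcs);
        [~, j] = min(pc);
        kcHardest{i} = kcs{j};
        kcCombo{i}   = strjoin(sort(kcs), '~~');
    end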

Classification

Base classifier

  • [checked]  Decision tree, stub, or Random Forest
  • [not checked]  Linear classifier (Fisher's discriminant, SVM, linear regression)
  • [not checked]  Non-linear kernel method (SVM, kernel ridge regression, kernel logistic regression)
  • [not checked]  Naïve Bayes
  • [checked]  Bayesian Network (other than Naïve Bayes)
  • [not checked]  Neural Network
  • [not checked]  Bayesian Neural Network
  • [not checked]  Nearest neighbors
  • [not checked]  Latent variable models (e.g. matrix factorization)
  • [not checked]  Neighborhood/correlation based collaborative filtering
  • [not checked]  Bayesian Knowledge Tracing
  • [not checked]  Additive Factor Model
  • [not checked]  Item Response Theory
  • [not checked]  Other classifier not listed above (specify below)

Loss Function

  • [not checked]  Hinge loss (like in SVM)
  • [not checked]  Square loss (like in ridge regression)
  • [not checked]  Logistic loss or cross-entropy (like in logistic regression)
  • [not checked]  Exponential loss (like in boosting)
  • [not checked]  None
  • [not checked]  Don't know
  • [not checked]  Other loss (specify below)

Regularizer

  • [not checked]  One-norm (sum of weight magnitudes, like in Lasso)
  • [not checked]  Two-norm (||w||^2, like in ridge regression and regular SVM)
  • [not checked]  Structured regularizer (like in group lasso)
  • [not checked]  None
  • [not checked]  Don't know
  • [not checked]  Other (specify below)

Ensemble Method

  • [not checked]  Boosting
  • [checked]  Bagging (check this if you use Random Forest)
  • [checked]  Other ensemble method
  • [not checked]  None

Were you able to use information present only in the training set?

  • [checked]  Corrects, incorrects, hints
  • [checked]  Step start/end times

Did you use post-training calibration to obtain accurate probabilities?

  • [not selected]  Yes
  • [not selected]  No

Did you make use of the development data sets for training?

  • [not selected]  Yes
  • [not selected]  No

Details on classification:

The Bayesian models and bagged decision trees were run on a number of dataset and feature set variations. Some features required previous skill information for the user being predicted; if this information did not exist, the feature was not created for that test row. As a result, some feature sets did not cover the entire test set, and different models produced predictions covering different portions of it. Ensemble selection was used to blend the various model predictions. Because the models made varying numbers of predictions, a special ensemble initialization technique was created: the best model was chosen first, and subsequent models were chosen based on their RMSE on the rows they predicted, excluding rows already added to the initialized ensemble. This allowed models to be used for the portions of the test set in which they excelled. For instance, the test rows involving skills not yet seen by the user were best predicted by a model that was not a top predictor overall.
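
The coverage-aware initialization can be sketched as follows, where preds is assumed to be an nRows-by-nModels matrix of predictions with NaN wherever a model made no prediction, and y is the target vector. This illustrates only the initialization step; the subsequent blending follows ensemble selection as in [3].

    % Greedy, coverage-aware initialization (illustrative sketch).
    rmseOn = @(p, mask) sqrt(mean((p(mask) - y(mask)).^2));
    covered   = false(size(y));
    blend     = nan(size(y));
    remaining = 1:size(preds, 2);
    while any(~covered) && ~isempty(remaining)
        best = NaN; bestErr = Inf;
        for m = remaining
            mask = ~isnan(preds(:, m)) & ~covered;   % score only still-uncovered rows
            if any(mask)
                err = rmseOn(preds(:, m), mask);
                if err < bestErr, bestErr = err; best = m; end
            end
        end
        if isnan(best), break; end
        mask = ~isnan(preds(:, best)) & ~covered;
        blend(mask) = preds(mask, best);             % use the winner on the rows it covers
        covered = covered | mask;
        remaining(remaining == best) = [];
    end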

Model selection/hyperparameter selection

  • [checked]  We used the online feedback of the leaderboard.
  • [not checked]  K-fold or leave-one-out cross-validation (using training data)
  • [not checked]  Virtual leave-one-out (closed-form estimation of LOO with a single classifier training)
  • [not checked]  Out-of-bag estimation (for bagging methods)
  • [not checked]  Bootstrap estimation (other than out-of-bag)
  • [checked]  Other cross-validation method
  • [not checked]  Bayesian model selection
  • [not checked]  Penalty-based method (non-Bayesian)
  • [not checked]  Bi-level optimization
  • [not checked]  Other method not listed above (specify below)

Details on model selection:

No response.

Results

Final Team Submission

Scores shown in the table below are Cup scores, not leaderboard scores. The difference between the two is described on the Evaluation page.

A reader should also know from reading the fact sheet what the strength of the method is.

Please comment about the following:

Quantitative advantages (e.g., compact feature subset, simplicity, computational advantages).

The training of the individual decision tree classifiers that make up the random forest could be parallelized, which resulted in a lower training time than the neural network approach that was tried. Random forests were also preferable because of their small parameter space: minleaf and the number of trees were the only significant parameters, and both take discrete values with reasonable ranges. Because of this, a large portion of the parameter space can be reasonably explored.
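
A hedged sketch of how this small parameter space might be swept with MATLAB's TreeBagger; the grid values and validation split are illustrative, not the settings used for the final submission.

    % Sweep the two main random forest parameters (illustrative values).
    bestErr = Inf;
    for nTrees = [50 100 200]
        for minLeaf = [1 5 15 50]
            % 'MinLeaf' was the 2010-era parameter name; newer releases use 'MinLeafSize'.
            model = TreeBagger(nTrees, Xtrain, ytrain, ...
                               'Method', 'regression', 'MinLeaf', minLeaf);
            p   = predict(model, Xval);
            err = sqrt(mean((p - yval).^2));
            if err < bestErr
                bestErr = err;
                best = struct('nTrees', nTrees, 'minLeaf', minLeaf);
            end
        end
    end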

Qualitative advantages (e.g. compute posterior probabilities, theoretically motivated, has some elements of novelty).

No response.

Other methods. List other methods you tried.

Neural networks with 1-3 hidden layers were tried, but their predictive performance was far below that of bagged decision trees. SVMs were also tried with both linear and non-linear kernels. The linear-kernel SVM parameters were explored using a coarse grid search followed by a higher-resolution search around the areas of low RMSE found in the first pass. This approach resulted in prediction accuracy comparable to the neural network predictions.
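
The coarse-to-fine search can be illustrated as below; trainAndScore is a hypothetical helper that trains a linear SVM for a given regularization value C and returns its validation RMSE.

    % Coarse logarithmic grid, then a finer grid around the best coarse value.
    coarse   = 10.^(-3:3);
    errs     = arrayfun(@(c) trainAndScore(c), coarse);
    [~, i]   = min(errs);
    fine     = coarse(i) * 2.^(-2:0.5:2);
    errsFine = arrayfun(@(c) trainAndScore(c), fine);
    [~, j]   = min(errsFine);
    bestC    = fine(j);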

How helpful did you find the included KC models?

  • [selected]  Crucial in getting good predictions
  • [not selected]  Somewhat helpful in getting good predictions
  • [not selected]  Neutral
  • [not selected]  Not particularly helpful
  • [not selected]  Irrelevant

If you learned latent factors, how helpful were they?

  • [not selected]  Crucial in getting good predictions
  • [not selected]  Somewhat helpful in getting good predictions
  • [not selected]  Neutral
  • [not selected]  Not particularly helpful
  • [not selected]  Irrelevant

Details on the relevance of the KC models and latent factors:

No response.

Software Implementation

Availability

  • [not checked]  Proprietary in-house software
  • [not checked]  Commercially available in-house software
  • [not checked]  Freeware or shareware in-house software
  • [checked]  Off-the-shelf third party commercial software
  • [not checked]  Off-the-shelf third party freeware or shareware

Language

  • [not checked]  C/C++
  • [not checked]  Java
  • [checked]  Matlab
  • [not checked]  Python/NumPy/SciPy
  • [not checked]  Other (specify below)

Details on software implementation:

Perl was used to hash the dataset features into numeric values. The Parallel Computing Toolbox from MathWorks was used with MATLAB.

Hardware implementation

Platform

  • [not checked]  Windows
  • [checked]  Linux or other Unix
  • [not checked]  Mac OS
  • [not checked]  Other (specify below)

Memory

  • [not selected]  <= 2 GB
  • [not selected]  <= 8 GB
  • [not selected]  >= 8 GB
  • [selected]  >= 32 GB

Parallelism

  • [checked]  Multi-processor machine
  • [checked]  Run in parallel different algorithms on different machines
  • [not checked]  Other (specify below)

Details on hardware implementation. Specify whether you provide a self contained-application or libraries.

A 30-node Rocks cluster was used to train the ~1,500 Bayesian skill models for each dataset. One 16-core and one 8-core machine were used to run the bagged decision tree classification.

Code URL

Provide a URL for the code (if available):

No response.

Competition Setup

From a performance point of view, the training set was

  • [not selected]  Too big (could have achieved the same performance with significantly less data)
  • [selected]  Too small (more data would have led to better performance)

From a computational point of view, the training set was

  • [not selected]  Too big (imposed serious computational challenges, limited the types of methods that can be applied)
  • [selected]  Adequate (the computational load was easy to handle)

Was the time constraint imposed by the challenge a difficulty or did you feel enough time to understand the data, prepare it, and train models?

  • [not selected]  Not enough time
  • [not selected]  Enough time
  • [selected]  It was enough time to do something decent, but there was a lot left to explore. With more time performance could have been significantly improved.

How likely are you to keep working on this problem?

  • [selected]  It is my main research area.
  • [not selected]  It was a very interesting problem. I'll keep working on it.
  • [not selected]  This data is a good fit for the data mining methods I am using/developing. I will use it in the future for empirical evaluation.
  • [not selected]  Maybe I'll try some ideas, but it is not high priority.
  • [not selected]  Not likely to keep working on it.

Comments on the problem (What aspects of the problem you found most interesting? Did it inspire you to develop new techniques?)

Combining user features with skill features was very powerful in both the modeling and classification approaches. Prediction error was very low for rows that had sufficient data to compile a complete user and skill feature set; however, error was very high for rows where the user did not have sufficient skill data. Imputing the missing features could be very beneficial for increasing prediction accuracy on these rows. What to do with them is a worthy area of future study since, while few in number, they significantly impacted overall RMSE.

References

List references below.

[1] Pardos, Z. A., Heffernan, N. T. (2010, in press) Modeling Individualization in a Bayesian Networks Implementation of Knowledge Tracing. In Proceedings of the 18th International Conference on User Modeling, Adaptation and Personalization. Hawaii.
[2] Pardos, Z. A., Heffernan, N. T. (2010, in press) Navigating the parameter space of Bayesian Knowledge Tracing models: Visualizations of the convergence of the Expectation Maximization algorithm. In Proceedings of the 3rd International Conference on Educational Data Mining. Pittsburgh.
[3] Caruana, R., Niculescu-Mizil, A., Crew, G., Ksikes, A. (2004) Ensemble selection from libraries of models. In Proceedings of the International Conference on Machine Learning (ICML).
[4] Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984) Classification and Regression Trees. Wadsworth International.
[5] Breiman, L. (2001) Random forests. Machine Learning, 45(1): 5-32.
[6] Baker, R.S.J.d., Corbett, A.T., Roll, I., Koedinger, K.R. (2008) Developing a Generalizable Detector of When Students Game the System. User Modeling and User-Adapted Interaction, 18(3), 287-314.