Hsiang-Fu Yu National Taiwan University Taiwan
Chih-Jen Lin National Taiwan University
Hsuan-Tien Lin National Taiwan University
Shou-De Lin National Taiwan University
Yin-Hsuan Wei National Taiwan University
Jui-Yu Weng National Taiwan University
Chun-Fu Chang National Taiwan University
En-Syu Yan National Taiwan University
Todd McKenzie National Taiwan University
Jing-Kai Lou National Taiwan University
Hsun-Ping Hsieh National Taiwan University
Jung-Wei Chou National Taiwan University
Chia-Hua Ho National Taiwan University
Po-Han Chung National Taiwan University
Hung-Yi Lo National Taiwan University
Che-Wei Chang National Taiwan University
Tsung-Ting Kuo National Taiwan University
Yi-Chen Lo National Taiwan University
Chieh Po National Taiwan University
Chien-Yuan Wang National Taiwan University
Po Tzu Chang National Taiwan University
Yu-Shi Lin National Taiwan University
Yi-Hung Huang National Taiwan University
Chen-Wei Hung National Taiwan University
Yu-Xun Ruan National Taiwan University
Provide a URL to a web page, technical memorandum, or a paper.
Provide a general summary with relevant background information: Where does the method come from? Is it novel? Name the prior art.
At National Taiwan University, we designed a course targeting KDD Cup 2010. Nineteen students and one non-registered RA were split into seven groups. Six groups expanded features using various binarization and discretization techniques; the resulting sparse feature sets were trained by logistic regression (using LIBLINEAR). One group condensed the features to fewer than 20 and then applied random forests (using Weka). Initial development was conducted on an internal split of the training data for training and validation, during which we identified some useful feature combinations. For the final submission, each group submitted a few results, and the TAs ensembled them by linear regression.
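The pipeline above (sparse binarized features fed to logistic regression via LIBLINEAR) can be sketched as follows. This is a minimal illustration, not the team's actual code: the field names and toy records are assumptions, and scikit-learn's `liblinear` solver stands in for the LIBLINEAR package itself.

```python
# Illustrative sketch: binarize categorical fields into a sparse matrix,
# then train L2-regularized logistic regression with the liblinear solver.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy interaction records (hypothetical fields, not the real KDD Cup data).
records = [
    {"student": "s1", "step": "x+3=5", "unit": "EQ01"},
    {"student": "s1", "step": "2x=4",  "unit": "EQ01"},
    {"student": "s2", "step": "x+3=5", "unit": "EQ02"},
    {"student": "s2", "step": "2x=4",  "unit": "EQ02"},
]
labels = [1, 0, 1, 0]  # correct on first attempt (1) or not (0)

vec = DictVectorizer()            # one-hot encodes into a scipy sparse matrix
X = vec.fit_transform(records)

# solver="liblinear" calls the same LIBLINEAR library named in the fact sheet.
clf = LogisticRegression(solver="liblinear")
clf.fit(X, labels)

# Posterior probabilities, later combined by the linear-regression ensemble.
probs = clf.predict_proba(X)[:, 1]
```
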
Summarize the algorithms you used in a way that those skilled in the art should understand what to do. Profile of your methods as follows:
Please describe your data understanding efforts, and interesting observations:
No response.
Details on feature generation:
- Features derived from the step name
- Features based on unit ID
- Features based on section ID
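One way to realize the binarization described above is to map each (field, value) pair to its own binary feature index. The sketch below is illustrative only; the field names follow the fact sheet, but the indexing scheme is an assumption.

```python
# Hypothetical sketch: expand categorical fields into sparse binary features
# by assigning each distinct (field, value) pair a unique column index.
def binarize(rows, fields=("step_name", "unit_id", "section_id")):
    """Return per-row lists of active feature indices and the feature index map."""
    index = {}
    encoded = []
    for row in rows:
        cols = []
        for f in fields:
            key = (f, row[f])
            if key not in index:
                index[key] = len(index)  # allocate a new binary feature
            cols.append(index[key])
        encoded.append(sorted(cols))
    return encoded, index

rows = [
    {"step_name": "x+3=5", "unit_id": "U1", "section_id": "S1"},
    {"step_name": "2x=4",  "unit_id": "U1", "section_id": "S2"},
]
enc, idx = binarize(rows)
# enc holds the active columns per row; len(idx) is the feature count.
```

Each row touches only a handful of columns, which is why the resulting feature sets stay sparse even as the feature count grows.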
Details on feature selection:
Details on latent factor discovery (techniques used, useful student/step features, how were the factors used, etc.):
More details on preprocessing:
Details on classification:
Details on model selection:
Scores shown in the table below are Cup scores, not leaderboard scores. The difference between the two is described on the Evaluation page.
A reader should also know from reading the fact sheet what the strength of the method is.
Please comment about the following:
- Sparse feature set
- Fast linear classifier
- Effective ensemble of classifiers
- Posterior probabilities from logistic regression models
Details on the relevance of the KC models and latent factors:
Details on software implementation:
We use two packages to conduct classification:
- LIBLINEAR: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
- Weka: http://www.cs.waikato.ac.nz/ml/weka
We use linear regression for ensembling classifiers.
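The linear-regression ensemble of the groups' posterior probabilities can be sketched as below. The numbers are made up for illustration; the fact sheet does not specify how the weights were constrained, so plain least squares is assumed.

```python
import numpy as np

# Illustrative sketch: stack each model's posterior probabilities as columns
# and fit least-squares weights against the validation labels.
P = np.array([
    [0.9, 0.8],   # two models' predictions on four validation examples
    [0.2, 0.4],
    [0.7, 0.6],
    [0.1, 0.3],
])
y = np.array([1.0, 0.0, 1.0, 0.0])

# Linear-regression weights for combining the models' probabilities.
w, *_ = np.linalg.lstsq(P, y, rcond=None)
blended = P @ w   # ensembled predictions
```

In practice the blended values may fall slightly outside [0, 1] and would be clipped before scoring under RMSE.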
Details on hardware implementation. Specify whether you provide a self-contained application or libraries.
Provide a URL for the code (if available):
List references below.