Project: LearnLab Certificate Courses (private)

Dataset: Data-driven knowledge tracing to improve learning outcomes (from POL v1.7) DKT-001


Sample Selector

Sample Selector is a tool for creating and editing samples, or groups of data you compare across. They're not "samples" in the statistical sense; they behave more like saved filters.

By default, a single sample exists: "All Data". With the Sample Selector, you can create new samples to organize your data.

You can use samples to:

  • Compare across conditions
  • Narrow the scope of data analysis to a specific time range, set of students, problem category, or unit of a curriculum (for example)

A sample is composed of one or more filters, specific conditions that narrow down your sample.

Creating a sample

The general process for creating a sample is to:

  • Add a filter from the categories at the left to the composition area at the right
  • Modify the filter to select the subset of data you're interested in, saving it when done
  • View the sample preview table to see the effect of adding your filter, making sure you don't have an empty set (i.e., a filter or combination of filters that excludes all transactions).
  • Name and describe the sample
  • Decide whether to share the sample with others who can view the dataset
  • Save the sample

The effect of multiple filters

DataShop interprets each filter after the first as an additional restriction on the data that is included in the sample. This is also known as a logical "AND". You can see the results of multiple filters in the sample preview as soon as all filters are "saved".
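The AND semantics described above can be sketched in a few lines of Python. This is a toy illustration only: the transaction fields and filter conditions below are made up, not DataShop's actual schema.

```python
# Each filter is an additional predicate; a transaction stays in the
# sample only if it satisfies ALL of them (logical AND).

transactions = [
    {"student": "s1", "unit": "Unit A", "correct": True},
    {"student": "s2", "unit": "Unit A", "correct": False},
    {"student": "s1", "unit": "Unit B", "correct": True},
]

filters = [
    lambda t: t["unit"] == "Unit A",   # filter 1: restrict to Unit A
    lambda t: t["correct"],            # filter 2: restrict to correct attempts
]

# A transaction is included only when every filter accepts it.
sample = [t for t in transactions if all(f(t) for f in filters)]
print(sample)  # only s1's correct Unit A transaction remains
```

Adding a third filter could only shrink (never grow) the sample, which is why an over-restrictive combination can produce an empty set.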



Dataset Info / Overview

This page provides both an overview and context for the current dataset. It may answer questions such as:

  • How, when, and where were these data collected?
  • What's the scope of the dataset?
  • If this was an experiment, what were the research goals?
  • How should I cite this dataset for secondary analysis?

If you are a project admin for this project, you can edit some of the fields in the Overview table—click a field to edit it. You can help other researchers by describing the dataset and the context in which it was created.

Have you or someone you know published about these data? Attach a paper to this dataset on the Files tab.

Dataset statistics at a glance

You can gauge the size of the dataset by looking at the numbers in the Statistics table, particularly the Total Number of Students, Transactions, and Student Hours.

The Knowledge Component Models, or step-to-knowledge-component mappings, are listed at the bottom of the table. If you see a few Knowledge Component Models listed, researchers have likely thought about different ways of attributing skills to steps, and potentially new ways of categorizing knowledge in this domain. You can learn more about these models and create new ones by clicking the KC Models subtab.

Dataset Info / Samples

A sample is a proper subset of a dataset and is composed of one or more filters, specific conditions that narrow down your sample. This page lists samples shared by others, as well as those owned by you.

You can use samples to:

  • Compare across conditions
  • Narrow the scope of data analysis to a specific time range, set of students, problem category, or unit of a curriculum (for example)

Creating a new dataset from an existing sample Save

A new dataset can be created from an existing sample by clicking the Save as Dataset icon next to a sample. The new dataset is placed in the same project as the source dataset, so it inherits the same permissions, IRB attributes, Principal Investigator, and Data Provider as the parent project.

The general process for creating a new dataset from an existing sample is to:

  • Choose a unique name for the new dataset
  • Decide whether or not to include user-created KC models in your new dataset. If you choose to include them, they will be copied to the new dataset. If you choose to exclude them, your new dataset will still contain the 'default' KC model, if one was included in the original data.
  • Save the Dataset
  • Your new dataset will be added to the Import Queue. The system will send an email once the new dataset has been loaded.

Creating a new sample Edit

The general process for creating a sample is to:

  • Click the edit sample icon next to the All Data sample.
  • Choose a unique sample name.
  • Add or modify an existing filter to select the subset of data you're interested in, saving the filter when done.
  • View the sample preview table to see the effect of adding your filter, making sure you don't have an empty set (i.e., a filter or combination of filters that excludes all transactions).
  • Decide whether to share the sample with others who can view the dataset
  • Save as New

Modifying an existing sample Edit

The general process for modifying a sample is to:

  • Click the edit sample icon next to the desired sample.
  • Choose a unique sample name.
  • Add or modify an existing filter to select the subset of data you're interested in, saving the filter when done.
  • View the sample preview table to see the effect of adding your filter, making sure you don't have an empty set (i.e., a filter or combination of filters that excludes all transactions).
  • Decide whether to share the sample with others who can view the dataset
  • Save the sample

Deleting a sample Delete

Once a sample has been deleted, it cannot be recovered.

The effect of multiple filters on samples

DataShop interprets each filter after the first as an additional restriction on the data that is included in the sample. This is also known as a logical "AND". You can see the results of multiple filters in the sample preview as soon as all filters are "saved".

Dataset Info / KC Models

A KC (Knowledge Component) model is a mapping between steps and knowledge components. In DataShop, each unique step can map to zero or more knowledge components.

From the KC Models page, you can compare existing KC models, export an existing model or template for creating a new KC model, or import a new model that you've created.

Comparing KC models

On the KC Models page, each model is described by:

  • a number of KCs
  • a number of observations labeled with KCs
  • five statistical measures of goodness of fit for the model: AIC, BIC, and three Cross Validation RMSE values. These model fit values are described in more detail on the Model Values help page.

Models are grouped by the number of observations, sorted in ascending order. The secondary sort defaults to AIC (lowest to highest, i.e., from models that fit best with the fewest parameters to those that fit worse or use more parameters), and then model name.

One general goal of KC modeling is to determine the "best" model for representing knowledge by fitting the model to the data. The "best" model would not only account for most of the data (having the highest number of observations labeled with KCs) and fit the data well, but would do so with the fewest parameters (KCs). The BIC value that DataShop calculates tells you how well the model fits the data (lower values are better), and it penalizes models for overfitting (having additional parameters). This penalty for additional parameters is stronger than AIC's penalty, so it is used in DataShop for sorting models.
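The standard AIC and BIC formulas make the difference in penalty concrete. This is textbook math, not DataShop's internal code; the log-likelihood and counts below are illustrative numbers only.

```python
import math

# AIC = 2k - 2*ln(L); BIC = k*ln(n) - 2*ln(L).
# Both reward fit (higher log-likelihood L) and penalize parameters k,
# but BIC's per-parameter penalty ln(n) exceeds AIC's constant 2 once
# n > e^2 (about 7.4), so on realistic datasets BIC punishes extra KCs
# much more heavily.

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood

log_l = -1200.0   # illustrative log-likelihood
n = 10_000        # illustrative number of observations

# Going from 20 KCs to 40 KCs at the same fit costs 40 points of AIC
# but roughly 184 points of BIC (20 * ln(10000)).
print(aic(log_l, 20), bic(log_l, 20, n))
print(aic(log_l, 40), bic(log_l, 40, n))
```

In other words, a larger KC model must improve the fit far more under BIC than under AIC to come out ahead of a smaller one.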

Why create additional KC models and import them to DataShop?

A primary reason for creating a new KC model is that an existing model is insufficient in some way—it may model some knowledge components too coarsely, producing learning curves that spike or dip, or it may be too fine-grained (too many knowledge components), producing curves that end after one or two opportunities. Or perhaps the model fails to model the domain sufficiently or with the right terminology. In any case, you may find value in creating a new KC model.

By importing the KC model you created back into DataShop, you can use DataShop tools to assess your new model. Most reports in DataShop support analysis by knowledge component model, while some currently support comparing values from two KC models simultaneously—see the predicted values on the error rate Learning Curve, for example. We plan to create new features in DataShop that support more direct knowledge component model comparison.

Auto-generated KC models

DataShop creates two knowledge component models in addition to the model that was logged or imported when the dataset was created:

  • single-KC model: the same knowledge component is applied to every transaction in the dataset, producing a very general model
  • unique-step model: a unique knowledge component is applied to each unique step in the dataset, producing a very precise (likely too much so) model.
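A quick sketch of what these two auto-generated mappings look like, using made-up step names:

```python
# Toy list of unique steps (names are illustrative, not from a real dataset)
steps = ["PROB1 enter-numerator", "PROB1 enter-denominator", "PROB2 enter-numerator"]

# single-KC model: every step maps to one and the same KC
single_kc = {step: "Single-KC" for step in steps}

# unique-step model: every distinct step gets its own KC
unique_step = {step: f"KC-{i}" for i, step in enumerate(sorted(set(steps)), start=1)}

print(single_kc)
print(unique_step)
```

These two models bracket the space of possible KC models: any hand-built model falls somewhere between maximally general and maximally specific.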

Creating a new KC model

Step 1: Export an existing model or blank template

  • To get started, click Export at the top of the KC Models page.
  • Select one or more existing KC models to use as a template for the new one, or choose "(new only)" to download a blank template.
  • Click the Export button to download your file.

Step 2: Edit the KC model file in Excel or other text-file/spreadsheet editor

  • Define the KC model by filling in the cells in the column KC (model_name), replacing "model_name" with a name for your new model.
  • Assign multiple KCs to a step by adding additional KC (model_name) columns, placing one KC in each column. Replace "model_name" with the same model name you used for your new model; you will have multiple columns with the same header.
  • Add additional KC models by creating a new KC (model_name) column for each KC model, replacing "model_name" with the name of your new model.
  • Delete any KC model columns that duplicate existing KC models already in the dataset (unless you want to overwrite these).
  • Do not change the values or headers of any other columns.
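The column edits above can also be scripted. Below is a hedged sketch, using only the standard library, that appends a single KC (model_name) column to an exported tab-delimited template; the file names, model name, and KC-assignment rule are illustrative, not part of DataShop itself.

```python
import csv

def add_kc_column(in_path, out_path, model_name, assign_kc):
    """Append a 'KC (model_name)' column to an exported tab-delimited
    KC model template, leaving all other columns untouched.

    assign_kc maps a row (list of cells) to a KC label; a real mapping
    would typically inspect the Step Name cell to choose the label.
    """
    with open(in_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="\t"))

    header, data = rows[0], rows[1:]
    header.append(f"KC ({model_name})")   # new model column
    for row in data:
        row.append(assign_kc(row))

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(data)
```

For example, `add_kc_column("kcm_export.txt", "kcm_new.txt", "my-model", lambda row: "my-kc")` would label every step with the same hypothetical KC; to assign multiple KCs to a step, you would append additional columns with the same `KC (my-model)` header.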

Step 3: Import a KC model file

  • Start the import process by clicking Import at the top of the KC Models page.
  • Click Choose File to browse for the KC model file you edited.
  • Click Verify to start file verification. If errors are found in your file, fix them and re-verify the file. When DataShop successfully verifies the file, you can then import it by clicking the Import button.

Dataset Info / Custom Fields

A custom field is a new column you define for annotating transaction data. DataShop currently supports adding and modifying custom fields at the transaction level.

You can add or modify a custom field's metadata from this page, but to set the data in that custom field, you need to use web services, which is a way to interact with DataShop through a program you write. You can also add custom fields when logging or importing new data.
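As a purely illustrative sketch, a web services call might be constructed as below. The host, URL path, request body, and authentication header are placeholders invented for illustration, not DataShop's actual API; consult the DataShop Web Services documentation for the real endpoints and authentication scheme.

```python
from urllib.request import Request

def build_set_custom_field_request(base_url, dataset_id, cf_id, payload, token):
    """Construct (but do not send) an HTTP request that would set custom
    field values. Every detail here is a placeholder, not DataShop's API."""
    return Request(
        url=f"{base_url}/datasets/{dataset_id}/customfields/{cf_id}/set",
        data=payload.encode("utf-8"),
        headers={"Authorization": f"Bearer {token}"},  # placeholder auth scheme
        method="POST",
    )

req = build_set_custom_field_request(
    "https://datashop.example.org/services",          # placeholder host
    123, 45,                                          # placeholder ids
    "<custom_field_data>...</custom_field_data>",     # hypothetical body format
    "<api-token>",
)
print(req.get_method(), req.full_url)
```

Sending the request (e.g. with `urllib.request.urlopen`) is the part your program would do once you have the real endpoint and credentials.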

Permissions

A custom field has an owner, the user who created it. Users who have edit or admin permission for a project can create custom fields for a dataset in it. Only the owner or a DataShop administrator can delete or modify the custom field. Only DataShop administrators can delete custom fields that were logged with the data.

Custom Field Metadata

The following fields describe a custom field:

  • name—descriptive name for the new custom field. Must be unique across all custom fields for the dataset. Must be no more than 255 characters.
  • description—description for the new custom field. Must be no more than 500 characters.
  • level—the level of aggregation that the custom field describes. Currently, the only accepted value is transaction. Future versions may support other levels such as step or student. Cannot be modified later.

Data Types

A custom field value is classified as one or more of the following data types assigned internally by DataShop:

  • number—must be no more than 65,000 characters
  • string—must be no more than 65,000 characters
  • date—see format suggestions; must be no more than 255 characters
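To make the limits concrete, here is a hedged sketch of this kind of classification; DataShop's actual internal rules and accepted date formats may differ, and the length limits used come from the list above.

```python
from datetime import datetime

# Documented maximum lengths, per data type
LIMITS = {"number": 65_000, "string": 65_000, "date": 255}

def classify(value: str) -> str:
    """Return the data type a custom field value would likely be assigned."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        # one illustrative format; DataShop suggests specific date formats
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return "date"
    except ValueError:
        return "string"

value = "2014-06-01 09:30:00"
kind = classify(value)
assert len(value) <= LIMITS[kind]   # enforce the documented length limit
print(kind)
```

Anything that parses as neither a number nor a date falls back to a string, which mirrors the catch-all role strings play in the list above.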

The Custom Fields page indicates the types of custom fields, what percentage of those custom fields fall into the aforementioned categories, and what percentage of transactions are associated with each custom field.

Caveat: Very large custom field values may cause unexpected behavior in some applications. Excel handles exports with very large custom field values correctly if you use its text-import feature. Some text editors may incorrectly wrap very large values, while programs like vim, jEdit, and Notepad++ handle the maximum lengths correctly. Additionally, when viewing custom fields in the web interface, values are truncated to 255 characters to prevent issues with browsers. To get the full custom field value, use the transaction export feature.

Read our recommended method for opening exported data in Excel

Dataset Info / Problem List

The Problem List page lists all problems in the dataset, grouped by problem hierarchy, which is the unique hierarchy of curriculum levels containing the problem (e.g., a problem might be contained in a Unit A, Section B hierarchy).

This page is most useful for seeing which particular problems have problem content stored: any problem name shown as a hyperlink will link to the content that students saw when they interacted with that problem. You can also filter on problems with or without problem content, and search those lists.

Download all of the problem content associated with the dataset by clicking the Download Problem Content button. The format of the download is a single .zip file containing a hierarchy of .html and web content files (e.g., images, videos, audio). The exact hierarchy of this file differs depending on the source of the problem content.
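Once downloaded, the archive can be inspected programmatically. Here is a small sketch; the archive name is illustrative, and as noted above the internal hierarchy varies by content source.

```python
import zipfile

def summarize_content(zip_path):
    """Split a problem-content archive's entries into .html problem pages
    and other web assets (images, video, audio, etc.)."""
    with zipfile.ZipFile(zip_path) as zf:
        # ignore bare directory entries, keep only files
        names = [n for n in zf.namelist() if not n.endswith("/")]
    pages = [n for n in names if n.endswith(".html")]
    assets = [n for n in names if not n.endswith(".html")]
    return pages, assets
```

For example, `summarize_content("problem_content.zip")` returns the list of problem pages alongside the supporting media files they reference.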

Dataset Info / Step List

The Step List table lists and decomposes all of the problems in the dataset. It details the problem hierarchy (the unit, section, or other divisions that contain the problem) and composition (the steps that make up a problem).

A unique problem-solving step is shown on each row.

Export the step list table by clicking the Export button.

Dataset Info / Citation

This page displays dataset-specific citation guidance. This information is taken from the Dataset Info fields "Acknowledgement for Secondary Analysis" and "Preferred Citation for Secondary Analysis", which are settable by researchers who have edit access to the dataset.

More general citation guidance is available at the link below.

Dataset Info / Terms of Use

This page displays the terms of use that apply to your use of this dataset and other datasets in this project.

Some projects have a terms of use that apply in addition to the DataShop terms of use. If that is the case for this project, those terms are shown here.

If you are interested in specifying a terms of use for your project, please contact us.

Dataset Info / Problem Content

The Problem Content tool allows admins to map problem content to datasets.

Problem content refers to a representation (text, images, html, etc.) of the content that students interacted with in the system that generated the dataset's data. Note that the word "problem" is used in the sense of any activity the user did that was named in the problem column of the data.

When problem content is mapped to a dataset and its problems, users can jump from DataShop reports to the problem content by clicking one of the "View Problem" buttons throughout the interface (often in tooltips on problem or step name), allowing them to better understand the activities that correspond with the data.

With problem content, you can:

  • Learn more about the system that students used
  • Inspect the interface and problem to explain student difficulties suggested by data
  • Use machine learning on an export of problem content from the Problem List page

Datasets with problem content are marked in the list of datasets with a problem content icon.

Adding problem content to your dataset

Please contact us, and we will consult with you on the format DataShop expects for problem content. For a faster solution, consider attaching files documenting your system on the Files tab of your dataset.

If you are a project admin for a dataset with problem content that has already been uploaded to the DataShop server, you can use the Problem Content page to map problem content to problems within the dataset. Select the Conversion Tool and Content Version to see a list of content items that can be mapped to the dataset, then click add to perform the mapping.

To see a list of all problems in the dataset and which have problem content, or to download all problem content for a dataset, visit the Problem List page.
