Data Versioning

You might already be familiar with versioning from working with version control or source control for your code. The basic idea is that, as you work, you create static copies of your work that you can refer back to later. (This is what the "commit" button in kernels does.) This is particularly important when you’re collaborating because if you and a collaborator are working on the same file, you can compare each of your versions and decide which changes to make.

So version control for code is a good idea… but what about data?

The idea that you should version your data is actually somewhat controversial.

Lots of people will recommend that you don’t version your data at all. And there are some good reasons for this:

  • Most version control software is designed for files with code in them, and these files generally aren’t very big. Trying to use version control software for large files (more than a couple of MB) means that many of its most helpful features, like showing line-by-line differences, stop being useful.

  • Version control can mean storing multiple copies of files, and if you have a large dataset this can quickly get very expensive.

  • If you’re routinely backing up your data and are using version control for the queries or scripts you’re using to extract data, then it may be redundant to store specific subsets of your data separately.

I agree that you don’t need to save separate versions of your data for every single task you do. However, if you’re training machine learning models, I believe you should version both the exact data you used to train your model and your code. This is because, without the code, data and environment, you won’t be able to reproduce your model. (I wrote a whole paper about it if you’re interested.)

This becomes a really thorny problem if you want to run experiments and compare models to each other. Does this new model architecture outperform the one you’re currently using? Without knowing what data you trained the original model on, it’s hard to tell.
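One lightweight way to keep that comparison possible is to record a fingerprint of the exact training file alongside each experiment. Here’s a minimal sketch in Python; the file paths and the JSON filename are just placeholders for illustration, not part of any particular tool:

import hashlib
import json
from datetime import datetime, timezone

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths -- swap in your own training file and experiment log.
train_path = "data/train.csv"
experiment_record = {
    "data_file": train_path,
    "data_sha256": file_sha256(train_path),
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

# Commit this JSON next to your model code so every experiment
# points at the exact bytes it was trained on.
with open("experiment_data_version.json", "w") as f:
    json.dump(experiment_record, f, indent=2)

Even if you don’t adopt a dedicated versioning tool, a hash like this is enough to tell whether two experiments were trained on the same data.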

When is data versioning appropriate?

When should you version your data?

  • When making schema/metadata changes, like adding or deleting columns or changing the units that information is stored in.

  • When you’re training experimental machine learning models. The smallest reproducible unit for machine learning models is training data + model specification code. (I talk more about this in this blog post.)

When should you consider not versioning your data?

  • When your data isn’t being used to train models. For example, it’s more space efficient to just save the SQL query you used to make a chart than it is to save all the transformed data.

  • When your data is large enough that storing a versioned copy would be prohibitively expensive. In this case, I’d recommend versioning both the scripts you used to extract the data and enough descriptive statistics that you could re-generate a very similar dataset. (There’s a small sketch of this right after this list.)

  • When your project lives entirely on GitHub. Versioning large datasets via GitHub can quickly become unwieldy. (Git LFS can help with this, but if you’re storing very large datasets, GitHub probably isn’t the best tool for the job. A database or blob storage service specifically designed for large data will give you fewer headaches, and most cloud services already have some form of versioning built in.)
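For the “too large to copy” case above, here’s a minimal sketch of what versioning the extraction recipe plus some descriptive statistics might look like with pandas. The query string and file names are hypothetical placeholders; the point is that a few kilobytes of metadata stand in for data you can’t afford to duplicate:

import json

import pandas as pd

# Hypothetical extraction query -- version this string in Git instead of the data.
EXTRACTION_QUERY = (
    "SELECT user_id, event_time, amount FROM events "
    "WHERE event_time >= '2019-01-01'"
)

# df would normally come from running the query; here we assume it was
# already exported to a (placeholder) CSV file.
df = pd.read_csv("extracted_sample.csv")

summary = {
    "extraction_query": EXTRACTION_QUERY,
    "row_count": int(len(df)),
    "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
    "numeric_summary": df.describe().to_dict(),
}

# Small enough to commit alongside your code.
with open("dataset_summary.json", "w") as f:
    json.dump(summary, f, indent=2, default=str)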

Of course, whether or not you should version data eventually comes down to a judgement call on your part.

Tools for data versioning when working locally

If your data is on Kaggle, we already take care of the data versioning for you; you can scroll to the bottom of your dataset landing page to see the "History" tab and check out previous versions of the dataset, what changes were made between versions and any updates that have been made to the metadata. This dataset of bike counters in Ottawa is a good example of a dataset that's gone through multiple iterations.

But some data shouldn't be put on Kaggle (we're not HIPAA compliant right now, for example) and sometimes you might prefer to work locally. What tools can you use in that case?

If you're already using cloud tools to store your data, most platforms will have versioning built in. You should probably use these rather than setting up your own system, if for no other reason than that it will be someone else’s problem when it inevitably breaks. 😂
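As one example, Amazon S3 lets you turn on object versioning for a bucket and then look up older versions of a file. A minimal sketch with boto3; the bucket name, key and file are made up, and this assumes your AWS credentials are already configured:

import boto3

s3 = boto3.client("s3")

# Turn on versioning for a (hypothetical) bucket.
s3.put_bucket_versioning(
    Bucket="my-data-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Every subsequent upload of the same key gets its own version ID.
s3.upload_file("train.csv", "my-data-bucket", "datasets/train.csv")

# List the versions stored for that key.
response = s3.list_object_versions(Bucket="my-data-bucket", Prefix="datasets/train.csv")
for version in response.get("Versions", []):
    print(version["VersionId"], version["LastModified"], version["Size"])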

The ecosystem of data versioning tools is still pretty young, but you do have some options. Right now, two of the more popular are DVC, which is short for Data Version Control, and Pachyderm. These aren't coding environments like Kaggle Kernels are. Rather, they're similar to Git: they let you save specific versions of your code and data along with comments on them. This lets you revert your work later, track what you did, and collaborate with others on the same code and data.

The biggest difference between these tools and Git is that they’re built to version data and whole pipelines, not just code. Here’s how DVC and Pachyderm compare to each other:

Similarities:

  • Both borrow heavily from Git’s model and are designed to interface well with existing Git toolchains.

  • They focus on versioning whole pipelines, versioning the data, code and trained models together.

Differences:

  • DVC is primarily a command line tool (it also exposes a small Python API). Pachyderm also has a graphical user interface (GUI).

  • DVC is open source and free. Pachyderm does have an open source version, but to get all the bells and whistles you'll need to shell out for the enterprise edition.

  • Pachyderm is fully containerized (Docker and Kubernetes). You can use containers with DVC but they're not the default.
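To make the workflow a little more concrete, DVC’s Python API can pull a specific version of a file out of a DVC-tracked Git repository. A minimal sketch; the repository URL, file path, and tag are placeholders:

import dvc.api

# Read the copy of data/train.csv that was committed at the Git tag "v1.0"
# in a (hypothetical) DVC-tracked repository.
data = dvc.api.read(
    "data/train.csv",
    repo="https://github.com/example-org/example-project",
    rev="v1.0",
)

print(data[:200])  # first few hundred characters of that exact data version

Under the hood, the rev argument is just a Git revision, which is what makes these tools play nicely with existing Git workflows.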

Other options for versioning data and pipelines include, in alphabetical order: Dataiku, Datalad, Datmo, Git LFS, qri and Quilt. These are a mix of closed and open source, and many of them are still under active development, so it's a good idea to shop around before you commit to a tooling system.

Exercise

This section is a little bit theoretical, so I've got some discussion questions for you.

For each of these datasets, consider whether it makes sense to version the data.

  • You have a streaming dataset of sensor data with more than five billion columns. It's updated every five seconds. You’re using it to create a dashboard to let stakeholders monitor anomalies.

  • You’ve got a CSV file with a few thousand rows of student data. You’re storing it on a computer that complies with your country's data privacy laws. You want to build a model to see whether the time at which tests are administered has an effect on test scores.

  • You’re working with a customer database of one million pet owners who have ordered custom dog food through your startup. You want to create a slide deck with visualizations that summarize information about your customers to help the marketing team decide where to buy newspaper ads.

In [1]:

# Your notes can go here if you like. :)
# You can also answer/discuss on the comments of this notebook.

Creating a Kaggle dataset from GitHub

OK, now that we’ve got some theory out of the way, you should have a better idea of when versioning is appropriate. Let’s make some new datasets!

Today, we’re going to be making datasets from GitHub repos (short for “repositories”). First, you’ll need to pick a repo.

Currently, we only support creating datasets from public repos. If you don’t have any public data of your own, you can check out this list of public GitHub datasets to see if something tickles your fancy.

Once you’ve picked a repo, it’s fairly easy to create your dataset.

Create a Kaggle Dataset from a GitHub repo

  • Go to www.kaggle.com/datasets (or just click on the "Datasets" tab up near the search bar).

  • Click on "New Dataset".

  • In the modal that pops up, click on the circle with a silhouette of an Octocat (the GitHub mascot) in it.

  • Enter a dataset title and the URL of the GitHub repository you're interested in. If the URL is valid, you should see a list of all the files that will be included in your dataset.

  • Hit "Create".

  • Success!
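Once the dataset exists, you can pull it into a local or notebook environment with the Kaggle API. Here’s a minimal sketch that calls the command line tool from Python; it assumes the kaggle package is installed and authenticated, and "your-username/your-dataset-slug" is a placeholder for your own dataset:

import subprocess

# Download and unzip the newly created (hypothetical) dataset into ./my_dataset.
subprocess.run(
    ["kaggle", "datasets", "download",
     "-d", "your-username/your-dataset-slug",
     "--unzip", "-p", "my_dataset"],
    check=True,
)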

Modify versioning

  • Go to the page for your dataset. (You can find it by clicking on the "Datasets" tab next to the search bar and then clicking on the "Your Datasets" tab.)

  • Click on the "Settings" tab. (The end of the URL will be /settings.)

  • To turn off versioning, select "Latest version only" in the "Versioning" drop down.

  • To automatically update your dataset, choose your preferred frequency from the "Update" dropdown.

  • That's all there is to it! 😎
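If you’d rather push new versions yourself (instead of, or in addition to, the automatic updates), the Kaggle API can do that too. A minimal sketch, assuming the dataset's files and its dataset-metadata.json live in a local folder called my_dataset; the version message is just an example:

import subprocess

# Create a new version of an existing dataset from the files in ./my_dataset.
# Requires dataset-metadata.json in that folder ("kaggle datasets init" can create one).
subprocess.run(
    ["kaggle", "datasets", "version",
     "-p", "my_dataset",
     "-m", "Refreshed data from the upstream GitHub repo"],
    check=True,
)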
