Data Versioning


You might already be familiar with versioning from working with version control or source control for your code. The basic idea is that, as you work, you create static copies of your work that you can refer back to later. (This is what the "commit" button in kernels does.) This is particularly important when you’re collaborating because if you and a collaborator are working on the same file, you can compare each of your versions and decide which changes to make.

So version control for code is a good idea… but what about data?

The idea that you should version your data is actually somewhat controversial. Lots of people will recommend that you don't version your data at all. And there are some good reasons for this:

  • Most version control software is designed for files that contain code, and those files generally aren't very big. Trying to use version control software for large files (more than a couple of megabytes) means that many of its most helpful features, like showing line-by-line differences, stop being useful.

  • Version control can mean storing multiple copies of files, and if you have a large dataset this can quickly get very expensive.

  • If you’re routinely backing up your data and are using version control for the queries or scripts you’re using to extract data, then it may be redundant to store specific subsets of your data separately.

I agree that you don’t need to save separate versions of your data for every single task you do. However, if you’re training machine learning models, I believe you should version both the exact data you used to train your model and your code. This is because, without the code, data, and environment, you won’t be able to reproduce your model. (I wrote a whole paper about it, if you’re interested.)

This becomes a really thorny problem if you want to run experiments and compare models to each other. Does this new model architecture outperform the one you’re currently using? Without knowing what data you trained the original model on, it’s hard to tell.
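To make that concrete, here's a minimal sketch of one way to pin the exact training data to a model run: fingerprint the data file and record it next to the code version. The file names here are hypothetical, and it assumes your project lives in a Git repo.

```python
# Minimal sketch: fingerprint the exact training data used for a model run.
# File names ("train.csv", "run_manifest.json") are hypothetical.
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large datasets don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

manifest = {
    "data_sha256": sha256_of("train.csv"),
    # Record which code produced the model (assumes a Git repo).
    "code_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip(),
    "created_at": datetime.now(timezone.utc).isoformat(),
}

with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Comparing two experiments then starts with diffing their manifests: if the data hashes differ, you know the models weren't trained on the same data.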

When is data versioning appropriate?

When should you version your data?

  • When making schema/metadata changes, like adding or deleting columns or changing the units that information is stored in.

  • When you’re training experimental machine learning models. The smallest reproducible unit for machine learning models is training data + model specification code. (I talk more about this in this blog post.)

When should you consider not versioning your data?

  • When your data isn’t being used to train models. For example, it’s more space efficient to just save the SQL query you used to make a chart than it is to save all the transformed data.

  • When your data is large enough that storing a versioned copy would be prohibitively expensive. In this case, I’d recommend versioning both the scripts you used to extract the data and enough descriptive statistics that you could re-generate a very similar dataset (there's a sketch of this after the list).

  • When your project lives entirely on GitHub. Versioning large datasets via GitHub can quickly become unwieldy. (GitLFS can help with this, but if you’re storing very large datasets, GitHub probably isn’t the best tool for the job. A database or blob storage service specifically designed for large data will generally give you fewer headaches, and most cloud services already have some form of versioning built in.)
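As a concrete version of the descriptive-statistics idea from the list above, here's a hedged sketch using pandas; "big_table.csv" and "stats_snapshot.json" are made-up names:

```python
# Sketch: save a statistical fingerprint of a dataset that's too big to copy.
# "big_table.csv" and "stats_snapshot.json" are hypothetical names.
import json
import pandas as pd

df = pd.read_csv("big_table.csv")

snapshot = {
    "n_rows": int(len(df)),
    # Column names and dtypes catch schema changes between extractions.
    "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    # describe() gives count/mean/std/min/quartiles/max for numeric columns.
    "numeric_summary": df.describe().to_dict(),
}

with open("stats_snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2, default=str)
```

A snapshot like this is tiny compared to the data itself, so it's cheap to commit alongside your extraction scripts.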

Of course, whether or not you should version data eventually comes down to a judgement call on your part.

Tools for data versioning when working locally

If you're already using cloud tools to store your data, most platforms will have versioning built in. You should probably use these rather than setting up your own system, if for no other reason than that it will be someone else’s problem when it inevitably breaks. 😂

If your data is on Kaggle, we already take care of the data versioning for you: you can scroll to the bottom of your dataset landing page to the "History" tab and check out previous versions of the dataset, what changes were made between versions, and any updates that have been made to the metadata. This dataset of bike counters in Ottawa is a good example of a dataset that's gone through multiple iterations.

But some data shouldn't be put on Kaggle (we're not HIPAA compliant right now, for example) and sometimes you might prefer to work locally. What tools can you use in that case?

The ecosystem of data versioning tools is still pretty young, but you do have some options. Right now, two of the more popular are DVC, which is short for Data Version Control, and Pachyderm. These aren't coding environments like Kaggle Kernels are. Rather, they're similar to Git: they let you save specific versions of your code and data along with comments on them, so you can revert your work later, track what you did, and collaborate with others on the same code and data.

The biggest difference between these tools and Git is that they're designed to version large data files and whole pipelines, not just code.

Similarities:

  • Both are based on Git and are designed to interface well with existing Git toolchains.

  • They focus on versioning whole pipelines: the data, code, and trained models together.

Differences:

  • DVC is only available as a command line tool, while Pachyderm also has a graphical user interface (GUI).

  • DVC is open source and free. Pachyderm does have an open source version, but to get all the bells and whistles you'll need to shell out for the enterprise edition.

  • Pachyderm is fully containerized (Docker and Kubernetes). You can use containers with DVC, but they're not the default.

Other options for versioning data and pipelines include, in alphabetical order: Dataiku, Datalad, Datmo, GitLFS, qri, and Quilt. These are a mix of closed and open source, and a lot of them are still in development, so it's a good idea to do some shopping around before you commit to a tooling system.
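Beyond the command line, recent DVC releases also ship a small Python API (dvc.api). As a rough sketch (the repo URL, file path, and tag below are all hypothetical), pulling a pinned version of a tracked file looks something like this:

```python
# Sketch: read a specific, pinned version of a DVC-tracked file.
# The repo URL, file path, and revision below are hypothetical.
import dvc.api

# 'rev' can be any Git revision: a tag, a branch, or a commit hash,
# which is what ties a data version back to a code version.
data_bytes = dvc.api.read(
    "data/train.csv",
    repo="https://github.com/your-org/your-project",
    rev="v1.0",
    mode="rb",
)

print(f"Fetched {len(data_bytes)} bytes of training data at rev v1.0")
```

Because the revision is just a Git ref, checking out an old experiment means asking for the same tag you recorded when you trained the model.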

Exercise

This section is a little bit theoretical, so I've got some discussion questions for you.

For each of these datasets, consider whether it makes sense to version the data.

  • You have a streaming dataset of sensor data with more than five billion columns. It's updated every five seconds. You’re using it to create a dashboard to let stakeholders monitor anomalies.

  • You’ve got a .CSV with a few thousand rows of student data. You’re storing it on a computer that complies with your country's data privacy laws. You want to build a model to see whether the time at which tests are administered affects test scores.

  • You’re working with a customer database of one million pet owners who have ordered custom dog food through your startup. You want to create a slide deck with visualizations that summarize information about your customers to help the marketing team decide where to buy newspaper ads.

In [1]:

# Your notes can go here if you like. :)
# You can also answer/discuss on the comments of this notebook.

Creating a Kaggle dataset from GitHub

OK, now that we’ve got some theory out of the way and you have a better idea of when versioning is appropriate, let’s make some new datasets!

Today, we’re going to be making datasets from GitHub repos (short for “repositories”). First, you’ll need to pick a repo. Currently, we only support creating datasets from public repos; if you don’t have any public data of your own, you can check out this list of public GitHub datasets to see if something tickles your fancy.

Once you’ve picked a repo, it’s fairly easy to create your dataset.

Create a Kaggle Dataset from a GitHub repo

  • Go to www.kaggle.com/datasets (or just click on the "Datasets" tab up near the search bar).

  • Click on "New Dataset".

  • In the modal that pops up, click on the circle with a silhouette of an Octocat (the GitHub mascot) in it.

  • Enter a dataset title and the URL of the GitHub repository you're interested in. If the URL is valid, you should see a list of all the files that will be included in your dataset.

  • Hit create.

  • Success!
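If you'd rather script these steps than click through the UI, the official kaggle Python package exposes the same operations. This is only a sketch, assuming you've installed the package, saved an API token to ~/.kaggle/kaggle.json, and that "my-dataset-folder" (a hypothetical name) holds your files:

```python
# Sketch: create and later version a Kaggle dataset from a local folder.
# Assumes the kaggle package is installed and an API token is configured.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# The folder needs a dataset-metadata.json; dataset_initialize() writes
# a template you should edit (title, slug) before the first upload.
api.dataset_initialize("my-dataset-folder")

# The first upload creates the dataset...
api.dataset_create_new("my-dataset-folder")

# ...and each later upload creates a new version, which is Kaggle's
# built-in data versioning in action.
api.dataset_create_version("my-dataset-folder", version_notes="Refreshed data")
```

Each call to dataset_create_version adds an entry to the dataset's "History" tab, so your collaborators can see exactly what changed between versions.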

Modify versioning

  • Go to the page for your dataset. (You can find it by clicking on the "Datasets" tab next to the search bar and then clicking on the "Your Datasets" tab.)

  • Click on the "Settings" tab. (The end of the URL will be /settings.)

  • To turn off versioning, select "Latest version only" in the "Versioning" drop down.

  • To automatically update your dataset, choose your preferred frequency from the "Update" dropdown.

  • That's all there is to it! 😎

