The Open-Source Data Science Masters
https://github.com/datasciencemasters/go#python-learning
The open-source curriculum for learning Data Science. Foundational in both theory and technologies, the OSDSM breaks down the core competencies necessary for making use of data.
With Coursera, ebooks, Stack Overflow, and GitHub -- all free and open -- how can you afford not to take advantage of an open source education?
We need more Data Scientists.
...by 2018 the United States will experience a shortage of 190,000 skilled data scientists, and 1.5 million managers and analysts capable of reaping actionable insights from the big data deluge.
-- 23 July 2013
There are little to no Data Scientists with 5 years experience, because the job simply did not exist.
-- David Hardtke "How To Hire A Data Scientist" 13 Nov 2012
Classic academic conduits aren't providing Data Scientists -- this talent gap will be closed differently.
Academic credentials are important but not necessary for high-quality data science. The core aptitudes – curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature – that distinguish the best data scientists are widely distributed throughout the population.
We’re likely to see more uncredentialed, inexperienced individuals try their hands at data science, bootstrapping their skills on the open-source ecosystem and using the diversity of modeling tools available. Just as data-science platforms and tools are proliferating through the magic of open source, big data’s data-scientist pool will as well.
And there’s yet another trend that will alleviate any talent gap: the democratization of data science. While I agree wholeheartedly with Raden’s statement that “the crème-de-la-crème of data scientists will fill roles in academia, technology vendors, Wall Street, research and government,” I think he’s understating the extent to which autodidacts – the self-taught, uncredentialed, data-passionate people – will come to play a significant role in many organizations’ data science initiatives.
Start here.
Topics: Python NLP on Twitter API, Distributed Computing Paradigm, MapReduce/Hadoop & Pig Script, SQL/NoSQL, Relational Algebra, Experiment design, Statistics, Graphs, Amazon EC2, Visualization.
Topics: Data wrangling, data management, exploratory data analysis to generate hypotheses and intuition, prediction based on statistical methods such as regression and classification, communication of results through visualization, stories, and summaries.
Topics: Visualizing Data, Estimation, Models from Scaling Arguments, Arguments from Probability Models, What you Really Need to Know about Classical Statistics, Data Mining, Clustering, PCA, Map/Reduce, Predictive Analytics
Example Code in: R, Python, Sage, C, Gnu Scientific Library
Human impact is a first-class concern when building machine intelligence technology. When we build products, we deduce patterns and then reinforce them in the world. Ethics in any engineering discipline means understanding the sociotechnological impact of the products and services we bring to bear in the human world -- and whether they reinforce a future we all want to live in.
How does the real world get translated into data? How should one structure that data to make it understandable and usable? Extends beyond database design to usability of schemas and models.
Foundational & Theoretical
Practical
One of the "unteachable" skills of data science is an intuition for analysis. What constitutes valuable, achievable, and well-designed analysis depends heavily on the context and the ends at hand.
in Python
Visualization
Data Visualization and Communication
Theoretical Design of Information
Applied Design of Information
Theoretical Courses / Design & Visualization
Practical Visualization Resources
Networks Packages
-- James Kobielus, 17 Jan 2013
This is an introduction geared toward those with at least a minimal understanding of programming and (perhaps obviously) an interest in the components of Data Science, such as statistics and distributed computing. Out of personal preference and a need for focus, I geared the original curriculum toward Python tools and resources. R resources can be found .
Linear Algebra
Linear Algebra / Levandosky
Linear Programming (Math 407)
The Manga Guide to Linear Algebra
An Intuitive Guide to Linear Algebra
A Programmer's Intuition for Matrix Multiplication
Vector Calculus: Understanding the Cross Product
Vector Calculus: Understanding the Dot Product
Convex Optimization / Boyd
Stats in a Nutshell
Think Stats: Probability and Statistics for Programmers
Think Bayes
Differential Equations in Data Science
Problem-Solving Heuristics "How To Solve It"
Get your environment up and running with the
Algorithms Design & Analysis I
Algorithm Design, Kleinberg & Tardos
*See Intro to Data Science
Intro to Hadoop and MapReduce *includes select free excerpts of Hadoop: The Definitive Guide
Introduction to Databases
SQL School
SQL Tutorials
Mining Massive Data Sets / Stanford
Mining The Social Web
Introduction to Information Retrieval / Stanford
OSDSM Specialization:
Machine Learning
A Course in Machine Learning
The Elements of Statistical Learning / Stanford
Machine Learning
Programming Collective Intelligence
Machine Learning for Hackers
Intro to scikit-learn, SciPy2013
Probabilistic Programming and Bayesian Methods for Hackers
Probabilistic Graphical Models
Neural Networks
Neural Networks
Deep Learning for Natural Language Processing CS224d
Social and Economic Networks: Models and Analysis
Social Network Analysis for Startups
From Languages to Information / Stanford CS147
NLP with Python (NLTK library)
How to Write a Spelling Corrector / Norvig (Tutorial)
Big Data Analysis with Twitter
Exploratory Data Analysis
Data Analysis in Python
Python for Data Analysis
An Example Data Science Process
The Truthful Art: Data, Charts, and Maps for Communication
Envisioning Information
The Visual Display of Quantitative Information
Information Dashboard Design: Displaying Data for At-a-Glance Monitoring
Data Visualization
Berkeley's Viz Class
Rice University's Data Viz class
D3 Library / Scott Murray
Interactive Data Visualization for the Web / Scott Murray
OSDSM Specialization:
Learn Python the Hard Way
Python
Think Python
Installing Basic Packages &
for Scientific Python Packages
(data structure library)
More Libraries can be found in the repo & in related
Flexible and powerful data analysis / manipulation library with labeled data structures objects, statistical functions, etc & Tutorials
- Tools for Data Mining & Analysis
- Network Modeling & Viz
- Bayesian Inference & Markov Chain Monte Carlo sampling toolkit
- Python module that allows users to explore data, estimate statistical models, and perform statistical tests
- Multivariate Pattern Analysis in Python
- Natural Language Toolkit
- Python library for topic modeling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
- Python wrapper for the Twitter API
- well-integrated with analysis and data manipulation packages like numpy and pandas
- a high-level statistical visualization package built on top of matplotlib
(Linear Regression, Logistic Regression, Random Forests, K-Means Clustering)
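To give a feel for the first technique named above, here is a minimal sketch of simple (one-feature) linear regression by ordinary least squares, using only the Python standard library. It is not taken from any of the listed resources: the function name `fit_line` and the toy data are assumptions made for illustration, and a library such as scikit-learn would normally be used in practice.

```python
# Simple linear regression (one feature) via ordinary least squares.
# `fit_line` and the sample data below are hypothetical, for illustration only.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations of x, and sum of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On real, noisy data the fitted line will not pass through every point; the same formulas then give the line that minimizes the sum of squared vertical errors.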
Doing Data Science: Straight Talk from the Frontline
The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists
Capstone Analysis of Your Own Design; 's Idea Compendium
Healthcare Twitter Analysis
Analyze your LinkedIn Network
- The "Hacker News" of Data Science
- The free encyclopedia
- Bestseller Pop Sci
- Search for a concept you want to learn
- Online university courses
- The smart number and info cruncher
- High quality, free learning videos