ml-resources

Machine learning & data science educational materials

The list below contains a wide variety of machine learning educational resources aimed at data scientists, software engineers, and even non-technical business and product people. Take some time to peruse the books, articles and videos that appear most appropriate for your background. The more prepared you are coming into the event, the more you'll get out of it!

The Master Algorithm - Pedro Domingos

"...If data-ism is today's philosophy, this book is its bible..."

For all participants

A Friendly Introduction to Deep Learning

Highly visual YouTube video explaining fundamentals of Deep Learning.

For all participants

AI Playbook - Andreessen Horowitz

"...this microsite is intended to help newcomers (both non-technical and technical) begin exploring what's possible with AI"

Non-Technical

The Non-Technical Guide to Machine Learning & Artificial Intelligence

Blog articles with resources for non-technical folks about ML/AI.

Non-Technical

Machine Learning Crash Course: Part 1

"...This (KDNuggets) post, the first in a series of ML tutorials, aims to make machine learning accessible to anyone willing to learn."

For developers, non-technical

Making Sense of Machine Learning

"...Broadly speaking, machine learners are computer algorithms designed for pattern recognition, curve fitting, classification and clustering. The word learning in the term stems from the ability to learn from data..."

For developers, non-technical

A Visual Introduction To Machine Learning

A cool and interactive visualization that explains some fundamentals behind machine learning.

For developers, non-technical

How to Learn Machine Learning in 10 Days

KDNuggets article by best-selling author Sebastian Raschka. Pointers to videos and other resources for an introductory high-level overview of ML and data science.

For developers, non-technical

Data Science from Scratch: First Principles with Python - Joel Grus

This best-selling book by our friend Joel Grus shows how many of the most fundamental data science tools and algorithms work by implementing them from scratch. A great book for software engineers looking to get into data science.

For developers

KDNuggets - 10 Free Must-Read Books for Machine Learning and Data Science

Great curated list of FREE, online and foundational machine learning books

For developers & data scientists

KDNuggets - Top 8 Free Must-Read Books on Deep Learning

Top 8 Free Must-Read Books on Deep Learning

For developers & data scientists

KDNuggets - Top 10 Machine Learning Algorithms for Beginners

A beginner's introduction to the Top 10 Machine Learning (ML) algorithms, complete with figures and examples for easy understanding

For developers

Machine Learning Mastery

"Master machine learning by using it on real life applications, even if you’re starting from scratch." Great site for software engineers who want to understand machine learning through code.

For developers

Numsense! Data Science for the Layman: No Math Added - Annalyn Ng

"This book has been written in layman’s terms as a gentle introduction to data science and its algorithms. Each algorithm has its own dedicated chapter that explains how it works, and shows an example of a real-world application."

For developers, non-technical

Training (and retraining) Deep Image Classifiers (CNNs)

Building powerful image classification models using very little data

For data scientists

Keras and Convolutional Neural Networks (CNNs)

Part two in a three-part series on building a complete end-to-end image classification + deep learning application from PyImageSearch

For developers, data scientists

Understanding Deep Convolutional Neural Networks (CNNs) with a practical use-case

We show how to build a deep neural network that classifies images into many categories with an accuracy of 90%. This was a very hard problem before the rise of deep networks and especially Convolutional Neural Networks.

For data scientists

The Unreasonable Effectiveness of Recurrent Neural Networks

Although a bit dated now, this is a great blog post by Andrej Karpathy on the power and possibilities of RNNs and LSTMs.

For data scientists

7 Steps to Mastering Deep Learning with Keras

Keras is a high-level neural network API, helping lead the way to the commoditization of deep learning and artificial intelligence.

For developers, data scientists

Deep Learning with Python

Written by Keras creator and Google AI researcher François Chollet, this book builds your understanding through intuitive explanations and practical examples

For developers, data scientists

Data sets

Your novel business idea should be grounded in real-world data with plausible machine-learning/analytics on top. We've compiled a collection of hundreds of datasets from which to gain inspiration. Note that you are not restricted to basing your idea on the data sets below. You may discover other open source data sets that inspire your creativity, or you may bring your own proprietary data sets if you wish.

Many of the datasets below are from Kaggle, Figure-Eight (Crowdflower), Data.World, etc. The advantage of these datasets is that many have been cleaned and normalized and are ready to be explored with ML and data science tools. Note that these datasets are often intended for research purposes only. Be sure to read any associated license agreements to understand whether there are commercial restrictions if you plan to continue using the data after the workshop is over.

We have included sample ideas in each category to show how you might use these data sets.
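As a first look at any cleaned CSV dataset of the kind described above, a minimal sketch using only the Python standard library might look like the following. The inline sample data, its column names, and the `summarize` helper are hypothetical stand-ins for illustration; they are not taken from any dataset listed here.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a downloaded CSV file.
# In practice you would pass open("your_file.csv", newline="") instead.
SAMPLE_CSV = """team,play_type,yards
SEA,pass,12
SEA,run,3
NE,pass,7
NE,pass,
"""

def summarize(csv_file):
    """Return the row count, column names, and per-column missing-value counts."""
    reader = csv.DictReader(csv_file)
    missing = Counter()
    rows = 0
    for row in reader:
        rows += 1
        for column, value in row.items():
            # Treat absent or blank cells as missing.
            if value is None or value.strip() == "":
                missing[column] += 1
    return {"rows": rows, "columns": reader.fieldnames, "missing": dict(missing)}

summary = summarize(io.StringIO(SAMPLE_CSV))
print(summary)
```

Checking row counts and missing values before modeling is a cheap way to confirm that a dataset is as clean as advertised.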

Sports

Detailed NFL Play-by-Play Data 2009-2017

Generated NFL dataset with expected points and win probability

Play-by-Play Baseball dataset

Play-by-play data for every Baseball game in 2016

NFLsavant.com

NFLsavant.com is a web site dedicated to providing advanced NFL statistics in a simple to use interface

Idea: How can this play-by-play data be used to tell fantasy football players which of their players they should start on a week-by-week basis?

NFL Draft Outcomes

Consolidated draft data from pro-football-reference.com for all drafts from 1985 to 2015.

Idea: Build a service assigning expected value to future NFL players based on college statistics, height, weight, speed, draft position, etc.

Football Strategy

Contributors were presented a football scenario and asked to note what the best coaching decision would be. (originating page found here)

Idea: How can this data be used to build enticing products that give fantasy sports league players an edge over their competitors?

The History of Baseball

A complete history of major league baseball stats from 1871 to 2015

European Soccer Database

Match, player, and team data for European professional soccer

Idea: What soccer player/team attributes actually lead to wins?

NBA Salaries 1990-2016

NBA Salaries - 1990 to 2016

Crime/Law

Data.World crime data sets

A variety of crime statistics from major US cities

Idea: Build a model for predicting crime probabilities at given intersections over the course of a night/day, thus allowing police to better position assets and prevent crime.

Homicide Reports, 1980-2014

Fatal Police Shootings, 2015-Present

Civilians shot and killed by on-duty police officers in United States

Gun violence database

Archive of U.S. gun violence incidents collected from over 2,000 sources

Idea: Study how crime patterns are correlated with socio-economic trends longitudinally, perhaps utilizing census data as well. Can you predict where there are opportunities for urban renewal and gentrification?

Seattle Police Department

Seattle police incident and crime reports

Patent Litigations

Detailed Patent Litigation Data on 74k Cases, 1963-2015

The Supreme Court Database

The Supreme Court Database is the definitive source for researchers, students, journalists, and citizens interested in the U.S. Supreme Court.

Commerce/Finance

3 Million Instacart Orders, Open Sourced

This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

Idea: Design a smart grocery list application that automatically suggests other relevant grocery items depending on what the user has already entered in their list and what time they are planning to go shopping.

FDIC Failed Bank List

The FDIC is often appointed as receiver for failed banks. This list includes banks which have failed since October 1, 2000

Idea: Build a model that identifies key markers and points to banks that are likely to fail over the next 3 months, thus allowing regulators to better target their resources.

Amazon Reviews

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014

Idea: Can you design a model that learns what a high-quality, specific, helpful review looks like, and gives you real-time suggestions as you write reviews?

Idea 2: Can you mine the review data to pull out the three things people most like and the three things people most dislike about various products?
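One very rough starting point for Idea 2 is to compare the most frequent words in high-rated and low-rated reviews. The sketch below assumes reviews are available as (rating, text) pairs; the sample reviews, the tiny stop-word list, and the `top_terms` helper are all hypothetical, and plain word frequency is only a crude proxy for "what people like".

```python
import re
from collections import Counter

# Hypothetical (rating, text) pairs standing in for real review records.
REVIEWS = [
    (5, "Great battery and great screen"),
    (4, "Battery lasts long, screen is sharp"),
    (1, "Terrible charger, broken charger cable"),
    (2, "Charger failed and the cable frayed"),
]

# Tiny illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {"and", "is", "the", "a", "long", "lasts"}

def top_terms(reviews, min_rating, max_rating, n=3):
    """Return the n most frequent words in reviews rated within [min_rating, max_rating]."""
    counts = Counter()
    for rating, text in reviews:
        if min_rating <= rating <= max_rating:
            words = re.findall(r"[a-z]+", text.lower())
            counts.update(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

liked = top_terms(REVIEWS, 4, 5)
disliked = top_terms(REVIEWS, 1, 2)
print("Most mentioned in positive reviews:", liked)
print("Most mentioned in negative reviews:", disliked)
```

A real product would likely need phrase extraction and sentiment analysis rather than single-word counts, but this is enough to sanity-check whether the ratings and text carry a usable signal.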

Amazon Q/A Data

This dataset contains Question and Answer data from Amazon, totaling around 1.4 million answered questions.

Amazon and Yelp Reviews

A large collection of Amazon and Yelp reviews, plus Yahoo Answers data.

Idea: Design a review summarizer that condenses the positive and negative reviews for a product so that users can quickly understand overall review sentiment.

Judge Emotion About Brands & Products

Labeled tweets about multiple brands and products. (originating page found here)

Daily News for Stock Market Prediction

Using 8 years of daily news headlines to predict stock market movement

Ethereum Historical Data

All Ethereum data from the start to May 2017

Amazon AWS Spot Pricing

AWS Spot Pricing Market - This includes price, region, instance size, and OS for AWS Spot Instances

Health

Medical Appointment No Shows

300k medical appointments with 15 variables each, including whether the patient showed up or not.

Idea: Design a product for doctors' offices that predicts whether a patient will show up, or whether there is a particular time slot for which they are more likely to show up. How would you integrate your technology into existing scheduling systems? What would you have to offer to displace a legacy scheduling system?

SEER Cancer Incidence Database

National Cancer Institute - Cancer Statistics Query Tool

CDC Cause of Death (and other CDC data)

WONDER online databases utilize a rich ad-hoc query system for the analysis of public health data.

Mental Health in Tech Survey

Survey on Mental Health in the Tech Workplace in 2014

Prescription-based prediction

Predicting doctor attributes from prescription behavior

Medicare Data

This site provides direct access to the official data from the Centers for Medicare & Medicaid Services (CMS) that are used on the Medicare.gov Compare Websites and Directories

Human Mortality Database

The Human Mortality Database (HMD) was created to provide detailed mortality and population data to researchers, students, journalists, policy analysts, and others interested in the history of human longevity

Hospital Charges for Inpatients

How inpatient hospital charges can differ among different providers in the US

Idea: A service that allows you to input your medical treatment received during a hospital stay and determine whether the bill you get is within an acceptable range.

US county-level mortality

United States Mortality Rates by County 1980-2014

Exercise Pattern Prediction

What does your exercise pattern fall into?

Idea: Design a product that analyzes not just how much people do a particular activity, but how well they do it. Are they consistent? Are they in danger of hurting themselves?

Idea 2: Is there a way to connect exercise patterns with improved health outcomes or longevity? Can you design the most efficient workout for someone given a desired goal, such as weight loss, lower blood pressure, or lower cholesterol?

Human Activity Recognition

Human Activity Recognition Using Smartphones Data Set. See this blog article, "Predicting physical activity based on smartphone sensor data using CNN + LSTM" for ideas on how to use this dataset

Real estate

Zillow Rent Index, 2010-Present

Which city has the highest median price or price per square foot?

Housing Market Data From Redfin

Housing market data for metropolitan areas, cities, neighborhoods and zip codes across the nation

Idea: Can you build a model that analyzes past trends to determine which local real estate markets are about to heat up and which are likely to cool down?

More Zillow Data

Download a single file with all Zillow metrics

US Census - New Home Sales

National and regional data on the number of new single-family houses sold and for sale

House Sales in King County, WA

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015

AirBnB Data

How is Airbnb really being used in and affecting the neighborhoods of your city?

Idea: How can all of this data be used to build a product that helps real-estate developers choose locations and building types that optimize profits/risk?

Idea 2: How can this data be used to create an addon/extension for AirBnB hosts that helps them make sure they are maximizing profit?

Free GIS Datasets

contains a categorised list of links to over 300 sites providing freely available geographic datasets - all ready for loading into a Geographic Information System

Politics

EveryPolitician

The world’s richest open dataset on politicians

GovTrack

GovTrack is here to help you track legislation being debated in the United States Congress

Idea: Can legislative patterns be used to illuminate where future economic upturns or downturns are likely to occur?

Trump Twitter Archive

Search through all of Trump's tweets

Idea: Is the content of Trump's tweets correlated with financial markets?

Classification of political social media

Thousands of social media messages from US Senators and other American politicians. Originating page can be found here

56 Major Speeches by Donald Trump

56 Major Speeches by Donald Trump (June 2015 - November 2016)

Global Historical Climatology Network

Land and ocean temperature anomalies

Idea: Can this type of data be used to predict and adjust agricultural/planting strategies? How might this data be combined with aerial imagery to show the effects of global warming on local regions and the resultant economic effects?

Employment

Human resources analytics

Why are our best and most experienced employees leaving prematurely?

Idea: Design a product that uses HR data to predict who is likely to leave. What can the data say about why employees "churn"? How can these patterns be interrupted?

Current Employment Statistics

Treasure trove of employment statistics

Employment in Manufacturing

Percent change of employment from the same quarter a year ago

Idea: Discover the real patterns of why the manufacturing sector is losing jobs (not what the politicians say). Can those patterns be used to target groups of workers who might be more successful in retraining programs?

Online Job Postings

Dataset of 19,000 online job posts from 2004 to 2015

US Unemployment Rates by County

US Unemployment Rate by County, 1990-2016

US Jobs on Dice.com

22,000 technology job listings

Media (Music, Movies, Audio)

Million Song Dataset

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

55000+ Song Lyrics

Lyrics for 55000+ songs in English from LyricsFreak

Mozilla Common Voice

An open and publicly available dataset of voices that everyone can use to train speech-enabled applications

FMA: A Dataset For Music Analysis

The dataset is a dump of the Free Music Archive (FMA), an interactive library of high-quality, legal audio downloads.

18,393 Pitchfork Reviews

Pitchfork music reviews from Jan 5, 1999 to Jan 8, 2017

Idea: Can you combine a number of these data sources and build a model that predicts the likelihood of success of an unreleased track or album? Assuming your model is accurate, what do the features with the most predictive power tell you about how record labels should deploy marketing resources?

MovieLens 20M Dataset

Over 20 Million Movie Ratings and Tagging Activities Since 1995

Freesound: Content-Based Audio Retrieval

Find sounds with text-queries based on their acoustic content

Google AudioSet

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.

Google Speech Commands Dataset

The dataset has 65,000 one-second long utterances of 30 short words, by thousands of different people, contributed by members of the public through the AIY website

Text

Tatoeba

Tatoeba is a collection of sentences and translations. It's collaborative, open, free and even addictive

Dataset from the "ChangeMyView" subreddit.

Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions

TriviaQA

A Large Scale Dataset for Reading Comprehension and Question Answering

NewsQA

The purpose of this dataset is to help the research community build algorithms that are capable of answering questions requiring human-level comprehension and reasoning skills

SQuAD

The Stanford Question Answering Dataset

Cornell Movie Dialog Corpus

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts

Frames

Travel booking conversations for chatbot development

Conversational datasets

Conversational datasets to train a chatbot

Fake News Challenge

The goal of the Fake News Challenge is to explore how artificial intelligence technologies, particularly machine learning and natural language processing, might be leveraged to combat the fake news problem.

FakeNewsNet

This is a repository for an ongoing data collection project for fake news research at ASU

Kaggle Fake News Dataset

Getting Real about Fake News - Text & metadata from fake & biased news sources around the web

"Real-life Deception" dataset

A multimodal dataset consisting of real-life deception: deceptive and truthful trial testimonies, manually transcribed and annotated. The dataset includes 121 short videos, along with their transcriptions and gesture annotations

"Open-domain Deception" dataset

This is a crowdsourced deception dataset consisting of short open domain truths and lies from 512 users. Seven lies and seven truths are provided for each user. The dataset also includes user's demographic information, such as gender, age, country of origin, and education level

Images/Video

Chest X-Ray Images (Pneumonia)

Chest X-Ray Images (Pneumonia) 5,863 images, 2 categories

An Image is Worth More than a Thousand Favorites:

Surfacing the Hidden Beauty of Flickr Pictures

Idea: Can you design an app that predicts how many likes a picture will get on social media, thus helping users to select which photo they should upload to gain maximum exposure?

Image sentiment polarity classification

This data set contains over fifteen thousand sentiment-scored images. Originating page can be found here

Idea: Can you train a model on this image sentiment dataset and combine it with NLP to create a service that suggests a photo from my camera roll to post with text I have written, or suggest text to go with a photo I have chosen?

NIH Chest X-rays

NIH Chest X-rays Over 112,000 Chest X-ray images from more than 30,000 unique patients

Epic Kitchens

The largest dataset in first-person (egocentric) vision; multi-faceted non-scripted recordings in native environments - i.e. the wearers' homes, capturing all daily activities in the kitchen over multiple days.

Bedroom Images

LSUN bedroom scene

Cat Dataset

Over 9,000 images of cats with annotated facial features

Cats vs Dogs

Tons of cat and dog images (also see this dataset on cat and dog audio)

Stanford Dogs Dataset

The Stanford Dogs dataset contains images of 120 breeds of dogs from around the world. This dataset has been built using images and annotation from ImageNet for the task of fine-grained image categorization

Food Images

Many images of several categories of food

Flower Image Dataset

Flowers Recognition - This dataset contains 4,242 labeled images of flowers.

Fashion: Dress patterns

This dataset contains links to images of women's dresses, and the corresponding images are categorized into 17 different pattern types

The Infamous ImageNet database

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images

Pascal VOC 2007

The PASCAL Visual Object Classes Challenge 2007

Labeled Faces in the Wild

Yahoo Image Datasets

Flickr Creative Common Images, etc.

State Farm Distracted Driver Detection

Can computer vision spot distracted drivers?

Yelp Restaurant Photo Classification

Predict attribute labels for restaurants using user-submitted photos

YouTube-8M Dataset

YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities.

YouTube-BoundingBoxes Dataset

YouTube-BoundingBoxes is a large-scale data set of video URLs with densely-sampled high-quality single-object bounding box annotations. The data set consists of approximately 380,000 15-20s video segments extracted from 240,000 different publicly visible YouTube videos, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera.

Google Atomic Visual Actions (AVA)

In order to facilitate further research into human action recognition, we have released AVA, coined from “atomic visual actions”, a new dataset that provides multiple action labels for each person in extended video sequences

Google Open Image Dataset (v3)

Open Images is a dataset of ~9 million URLs to images that have been annotated with image-level labels and bounding boxes spanning thousands of classes.

CVonline: Image Databases

This is a collated list of image and video databases that people have found useful for computer vision research and algorithm evaluation.

FigureQA dataset

FigureQA dataset introduces a new visual reasoning task for research, specific to graphical plots and figures. The task comes with an additional twist: all of the questions are relational, requiring the comparison of several or all elements of the underlying plot

Large Data Repositories

Data.gov

Data.gov is managed and hosted by the U.S. General Services Administration (metrics)

Data.world

Building the most meaningful, collaborative, and abundant data resource in the world

Figure Eight (formerly Crowdflower)

Here are some favorite open datasets created on the Figure Eight platform. They’re free for any and everyone to download.

BigML

A wealth of links pointing to free and open datasets that can be used to build predictive models

Open Data Network

Publish data and share. Find data and build. Answer questions. (powered by Socrata)

Luke de Oliveira Blog Post

Fueling the Gold Rush: The Greatest Public Datasets for AI

Data for Democracy

Data for Democracy brings together an active, passionate community of people using data to drive better decisions and improve the world in which we live.

Deep Learning.net

These datasets can be used for benchmarking deep learning algorithms

Kaggle

Data Science Competition Platform

OpenML

An open, collaborative, frictionless, automated machine learning environment

Ai2 - Allen Institute for Artificial Intelligence

Public datasets

USAFacts.org

USAFacts is a new data-driven portrait of the American population, our government’s finances, and government’s impact on society. USAFacts was inspired by a conversation Steve Ballmer had with his wife, Connie.

Awesome Public Data Sources (GitHub)

A topic-centric list of high-quality open datasets in public domains. By everyone, for everyone!

Amazon AWS Public Datasets

Public Datasets on AWS provides a centralized repository of public datasets that can be seamlessly integrated into AWS cloud-based applications

Data.seattle.gov

Data about Seattle!

OpenDataBC

Data about British Columbia

KAPSARC - Energy Data

KAPSARC data portal is available to anyone interested in energy data. Portal is designed to enable users to better understand energy, economy and policies by quickly accessing and analyzing critical data

Financial Market Data

AssetMacro offers Free Historical Data for Leading Indicators of Economies and Market Data for Stocks, Bonds, Commodities and Currencies

KDnuggets Datasets

Datasets for Data Mining and Data Science

GoVertical presents

Machine Learning Startup Creation Weekend

Hosted by Madrona Venture Labs & TiE Seattle
