GoVertical presents

Machine Learning Startup Creation Weekend

Hosted by Madrona Venture Labs & TiE Seattle

As a free benefit for participants, we would like to extend an invitation to the Amazon SageMaker workshop on Tue, Apr 24 from 2:30-4:30pm.

Resources

Welcome to the Resources area for the GoVertical ML Startup Creation Weekend! To help you make the most of your time during the event weekend, we've gathered key educational materials and data sets here.

Be prepared! Start thinking through what types of data could power your business and product ideas. Oftentimes, a combination of multiple, disparate data sets yields the most ingenious ideas and solutions!

Panel videos

The following videos were recorded during the April 19 Panel event. You may wish to reference them in preparation for the weekend ML event.

ML Panel moderated by Dan Weld. Panelists: Xin Luna Dong, Yejin Choi & Kevin Jamieson

VC Panel moderated by Jay Bartot. Panelists: Tim Porter, Mike Miller, Pradeep Rathinam & Ankur Teredesai

Machine learning & data science educational materials

The list below contains a wide variety of machine learning educational resources aimed at data scientists, software engineers, and even non-technical business and product people. Take some time to peruse the books, articles and videos that appear most appropriate for your background. The more prepared you are coming into the event, the more you'll get out of it!

"...If data-ism is today's philosophy, this book is its bible..." 

For all participants

Highly visual YouTube video explaining fundamentals of Deep Learning.

For all participants

"...this microsite is intended to help newcomers (both non-technical and technical) begin exploring what's possible with AI"

Non-Technical

Blog articles with resources for non-technical folks about ML/AI.

Non-Technical

"...This (KDNuggets) post, the first in a series of ML tutorials, aims to make machine learning accessible to anyone willing to learn."

For developers, non-technical

"...Broadly speaking, machine learners are computer algorithms designed for pattern recognition, curve fitting, classification and clustering. The word learning in the term stems from the ability to learn from data..."

For developers, non-technical

A cool and interactive visualization that explains some fundamentals behind machine learning.

For developers, non-technical

KDNuggets article by best-selling author Sebastian Raschka. Pointers to videos and other resources for an introductory high-level overview of ML and data science.

For developers, non-technical

This best-selling book, authored by our friend Joel Grus, shows how many of the most fundamental data science tools and algorithms work by implementing them from scratch. Great book for software engineers looking to get into data science.

For developers

Great curated list of FREE, online and foundational machine learning books

For developers & data scientists

Top 8 Free Must-Read Books on Deep Learning

For developers & data scientists

A beginner's introduction to the Top 10 Machine Learning (ML) algorithms, complete with figures and examples for easy understanding

For developers

"Master machine learning by using it on real life applications, even if you’re starting from scratch." Great site for software engineers who want to understand machine learning through code.

For developers

"This book has been written in layman’s terms as a gentle introduction to data science and its algorithms. Each algorithm has its own dedicated chapter that explains how it works, and shows an example of a real-world application."

For developers, non-technical

Building powerful image classification models using very little data

For data scientists

Part two in a three-part series on building a complete end-to-end image classification + deep learning application from PyImageSearch

For developers, data scientists

We show how to build a deep neural network that classifies images into many categories with an accuracy of 90%. This was a very hard problem before the rise of deep networks and especially Convolutional Neural Networks.

For data scientists

Although a bit dated now, this is a great blog post by Andrej Karpathy on the power and possibilities of RNNs and LSTMs.

For data scientists

Keras is a high-level neural network API, helping lead the way to the commoditization of deep learning and artificial intelligence.

For developers, data scientists

Written by Keras creator and Google AI researcher François Chollet, this book builds your understanding through intuitive explanations and practical examples

For developers, data scientists

Data sets

Your novel business idea should be grounded in real-world data with plausible machine-learning/analytics on top. We've compiled a collection of 100s of datasets from which to gain inspiration. Note that you are not restricted to basing your idea on the data sets below. You may discover other open source data sets that inspire your creativity or you may bring your own proprietary data sets if you wish.

Many of the datasets below are from Kaggle, Figure-Eight (Crowdflower), Data.World, etc. The advantage of these datasets is that many have been cleaned and normalized and are ready to be explored with ML and data science tools. Note that these datasets are often intended for research purposes only. Be sure to read any associated license agreements to understand whether there are commercial restrictions if you plan to continue using the data after the workshop is over.

We've included sample ideas in each category to suggest how you might use these data sets.

Sports

Generated NFL dataset with expected points and win probability

Play-by-play data for every baseball game in 2016

NFLsavant.com is a web site dedicated to providing advanced NFL statistics in a simple to use interface

Idea: How can this play-by-play data be used to tell fantasy football players which of their players they should start on a week-by-week basis?

Consolidated draft data from pro-football-reference.com for all drafts from 1985 to 2015.

Idea: Build a service assigning expected value to future NFL players based on college statistics, height, weight, speed, draft position, etc.

Contributors were presented a football scenario and asked to note what the best coaching decision would be. (originating page found here)

Idea: How can this data be used to build enticing products that give fantasy sports league players an edge over their competitors?

A complete history of major league baseball stats from 1871 to 2015

Idea: What soccer player/team attributes actually lead to wins?

NBA Salaries - 1990 to 2016

Crime/Law

A variety of crime statistics from major US cities

Idea: Build a model for predicting crime probabilities at given intersections over the course of a night/day, thus allowing police to better position assets and prevent crime.

Homicide Reports, 1980-2014

Civilians shot and killed by on-duty police officers in United States

Archive of U.S. gun violence incidents collected from over 2,000 sources

Idea: Study how crime patterns are correlated with socio-economic trends longitudinally, perhaps utilizing census data as well. Can you predict where there are opportunities for urban renewal and gentrification?

Seattle police incident and crime reports

Detailed Patent Litigation Data on 74k Cases, 1963-2015

The Supreme Court Database is the definitive source for researchers, students, journalists, and citizens interested in the U.S. Supreme Court.

Commerce/Finance

This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

Idea: Design a smart grocery list application that automatically suggests other relevant items based on what the user has already entered in their list and when they are planning to go shopping.
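
One possible starting point for the grocery list idea is a simple co-occurrence count: items that often appear in the same basket as the items already on the list get suggested first. The sketch below is only an illustration under stated assumptions. The `orders` list is made-up toy data, not the Instacart dataset, the names `build_cooccurrence` and `suggest` are hypothetical, and the time-of-shopping signal is ignored entirely.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical toy baskets; a real project would load baskets from the order data.
orders = [
    ["milk", "bread", "eggs"],
    ["milk", "cereal"],
    ["bread", "butter", "eggs"],
    ["milk", "bread", "butter"],
    ["cereal", "bananas", "milk"],
]


def build_cooccurrence(orders):
    """Count how often each ordered pair of distinct items shares a basket."""
    counts = defaultdict(Counter)
    for basket in orders:
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return counts


def suggest(current_list, counts, k=3):
    """Rank items not yet on the list by summed co-occurrence with listed items."""
    scores = Counter()
    for item in current_list:
        for other, n in counts.get(item, {}).items():
            if other not in current_list:
                scores[other] += n
    # Sort by score (descending), then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


cooccurrence = build_cooccurrence(orders)
print(suggest(["milk"], cooccurrence))
```

Raw counts favor items that are popular everywhere, so a real product would likely normalize by item frequency and add features such as time of day, but counting is a quick way to check whether the data contains a usable signal.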

The FDIC is often appointed as receiver for failed banks. This list includes banks which have failed since October 1, 2000

Idea: Build a model that identifies key markers and points to banks that are likely to fail over the next 3 months, thus allowing regulators to better target their resources.

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014

Idea: Can you design a model that learns what a high-quality, specific, helpful review looks like, and give you real-time suggestions as you write reviews?

Idea 2: Can you mine the review data to pull out the three things people most like and the three things people most dislike about various products?

This dataset contains Question and Answer data from Amazon, totaling around 1.4 million answered questions.

A large collection of Amazon and Yelp reviews, plus Yahoo Answers data.

Idea: Design a review summarizer that condenses the positive and negative reviews for a product so that shoppers can quickly understand overall review sentiment.

Labeled tweets about multiple brands and products. (originating page found here)

Using 8 years of daily news headlines to predict stock market movement

All Ethereum data from the start to May 2017

AWS Spot Pricing Market - This includes price, region, instance size, and OS for AWS Spot Instances

Health

300k medical appointments with 15 variables each, including whether the patient showed up or not.

Idea: Design a product for doctors' offices that predicts whether a patient will show up, or whether there is a particular time slot for which they are more likely to show up. How would you integrate your technology into existing scheduling systems? What would you have to offer to displace a legacy scheduling system?

National Cancer Institute - Cancer Statistics Query Tool

WONDER online databases utilize a rich ad-hoc query system for the analysis of public health data.

Survey on Mental Health in the Tech Workplace in 2014

Predicting doctor attributes from prescription behavior

This site provides direct access to the official data from the Centers for Medicare & Medicaid Services (CMS) that are used on the Medicare.gov Compare Websites and Directories

The Human Mortality Database (HMD) was created to provide detailed mortality and population data to researchers, students, journalists, policy analysts, and others interested in the history of human longevity

How inpatient hospital charges can differ among different providers in the US

Idea: A service that allows you to input your medical treatment received during a hospital stay and determine whether the bill you get is within an acceptable range.

United States Mortality Rates by County 1980-2014

What does your exercise pattern fall into?

Idea: Design a product that analyzes not just how much people do a particular activity, but how well they do it. Are they consistent? Are they in danger of hurting themselves?

Idea 2: Is there a way to connect exercise patterns with improved health outcomes or longevity? Can you design the most efficient workout for someone given a desired goal, such as weight loss, lower blood pressure, or lower cholesterol?

Human Activity Recognition Using Smartphones Data Set. See this blog article, "Predicting physical activity based on smartphone sensor data using CNN + LSTM" for ideas on how to use this dataset

Real estate

Which city has the highest median price or price per square foot?

Housing market data for metropolitan areas, cities, neighborhoods and zip codes across the nation

Idea: Can you build a model that analyzes past trends to determine which local real estate markets are about to heat up and which are likely to cool down?

Download a single file with all Zillow metrics

National and regional data on the number of new single-family houses sold and for sale

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015

How is Airbnb really being used in and affecting the neighborhoods of your city?

Idea: How can all of this data be used to build a product that helps real-estate developers choose locations and building types that optimize profits/risk?

Idea 2: How can this data be used to create an add-on/extension for Airbnb hosts that helps them make sure they are maximizing profit?

Contains a categorised list of links to over 300 sites providing freely available geographic datasets - all ready for loading into a Geographic Information System

Politics

The world’s richest open dataset on politicians

GovTrack is here to help you track legislation being debated in the United States Congress

Idea: Can legislative patterns be used to illuminate where future economic upturns or downturns are likely to occur?

Search through all of Trump's tweets

Idea: Is the content of Trump's tweets correlated with financial markets?

Thousands of social media messages from US Senators and other American politicians. Originating page can be found here

56 Major Speeches by Donald Trump (June 2015 - November 2016)

Land and ocean temperature anomalies

Idea: Can this type of data be used to predict and adjust agricultural/planting strategies? How might this data be combined with aerial imagery to show the effects of global warming on local regions and the resultant economic effects?

Employment

Why are our best and most experienced employees leaving prematurely?

Idea: Design a product that uses HR data to predict who is likely to leave. What can the data say about why employees "churn"? How can these patterns be interrupted?

Treasure trove of employment statistics

Percent change of employment from the same quarter a year ago

Idea: Discover the real patterns of why the manufacturing sector is losing jobs (not what the politicians say). Can those patterns be used to target groups of workers who might be more successful in retraining programs?

Dataset of 19,000 online job posts from 2004 to 2015

US Unemployment Rate by County, 1990-2016

22,000 technology job listings

Media (Music, Movies, Audio)

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Lyrics for 55000+ songs in English from LyricsFreak

An open and publicly available dataset of voices that everyone can use to train speech-enabled applications

The dataset is a dump of the Free Music Archive (FMA), an interactive library of high-quality, legal audio downloads.

Pitchfork music reviews from Jan 5, 1999 to Jan 8, 2017

Idea: Can you combine a number of these data sources and build a model that predicts the likelihood of success of an unreleased track or album? Assuming your model is accurate, what do the features with the most predictive power tell you about how record labels should deploy marketing resources?

Over 20 Million Movie Ratings and Tagging Activities Since 1995

Find sounds with text-queries based on their acoustic content

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.

The dataset has 65,000 one-second long utterances of 30 short words, by thousands of different people, contributed by members of the public through the AIY website

Text

Tatoeba is a collection of sentences and translations. It's collaborative, open, free and even addictive

Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions

A Large Scale Dataset for Reading Comprehension and Question Answering

The purpose of this dataset is to help the research community build algorithms that are capable of answering questions requiring human-level comprehension and reasoning skills

The Stanford Question Answering Dataset

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts

Travel booking conversations for chatbot development

Conversational datasets to train a chatbot

The goal of the Fake News Challenge is to explore how artificial intelligence technologies, particularly machine learning and natural language processing, might be leveraged to combat the fake news problem.

This is a repository for an ongoing data collection project for fake news research at ASU

Getting Real about Fake News - Text & metadata from fake & biased news sources around the web

A multimodal dataset consisting of real-life deception: deceptive and truthful trial testimonies, manually transcribed and annotated. The dataset includes 121 short videos, along with their transcriptions and gesture annotations

This is a crowdsourced deception dataset consisting of short open domain truths and lies from 512 users. Seven lies and seven truths are provided for each user. The dataset also includes user's demographic information, such as gender, age, country of origin, and education level

Images/Video

Chest X-Ray Images (Pneumonia) 5,863 images, 2 categories

Surfacing the Hidden Beauty of Flickr Pictures

Idea: Can you design an app that predicts how many likes a picture will get on social media, thus helping users to select which photo they should upload to gain maximum exposure?

This data set contains over fifteen thousand sentiment-scored images. Originating page can be found here

Idea: Can you train a model on this image sentiment dataset and combine it with NLP to create a service that suggests a photo from my camera roll to post with text I have written, or suggest text to go with a photo I have chosen?

NIH Chest X-rays Over 112,000 Chest X-ray images from more than 30,000 unique patients

The largest dataset in first-person (egocentric) vision; multi-faceted non-scripted recordings in native environments - i.e. the wearers' homes, capturing all daily activities in the kitchen over multiple days.

LSUN bedroom scene

Over 9,000 images of cats with annotated facial features

Tons of cat and dog images (also see this dataset on cat and dog audio)

The Stanford Dogs dataset contains images of 120 breeds of dogs from around the world. This dataset has been built using images and annotation from ImageNet for the task of fine-grained image categorization

Many images of several categories of food

Flowers Recognition - This dataset contains 4,242 labeled images of flowers.

This dataset contains links to images of women's dresses, and the corresponding images are categorized into 17 different pattern types

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images

The PASCAL Visual Object Classes Challenge 2007

Labeled Faces in the Wild

Flickr Creative Commons Images, etc.

Can computer vision spot distracted drivers?

Predict attribute labels for restaurants using user-submitted photos

YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities.

YouTube-BoundingBoxes is a large-scale data set of video URLs with densely-sampled high-quality single-object bounding box annotations. The data set consists of approximately 380,000 15-20s video segments extracted from 240,000 different publicly visible YouTube videos, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera.

In order to facilitate further research into human action recognition, we have released AVA, coined from “atomic visual actions”, a new dataset that provides multiple action labels for each person in extended video sequences

Open Images is a dataset of ~9 million URLs to images that have been annotated with image-level labels and bounding boxes spanning thousands of classes.

This is a collated list of image and video databases that people have found useful for computer vision research and algorithm evaluation.

FigureQA dataset introduces a new visual reasoning task for research, specific to graphical plots and figures. The task comes with an additional twist: all of the questions are relational, requiring the comparison of several or all elements of the underlying plot

Large Data Repositories

Data.gov is managed and hosted by the U.S. General Services Administration (metrics)

Building the most meaningful, collaborative, and abundant data resource in the world

Here are some favorite open datasets created on the Figure Eight platform. They’re free for any and everyone to download.

Wealth of links pointing out to free and open datasets that can be used to build predictive models

Publish data and share. Find data and build. Answer questions. (powered by Socrata)

Fueling the Gold Rush: The Greatest Public Datasets for AI

Data for Democracy brings together an active, passionate community of people using data to drive better decisions and improve the world in which we live.

These datasets can be used for benchmarking deep learning algorithms

Data Science Competition Platform

An open, collaborative, frictionless, automated machine learning environment

USAFacts is a new data-driven portrait of the American population, our government’s finances, and government’s impact on society. USAFacts was inspired by a conversation Steve Ballmer had with his wife, Connie.

A topic-centric list of high-quality open datasets in public domains. By everyone, for everyone!

Public Datasets on AWS provides a centralized repository of public datasets that can be seamlessly integrated into AWS cloud-based applications

Data about Seattle!

Data about British Columbia

The KAPSARC data portal is available to anyone interested in energy data. The portal is designed to enable users to better understand energy, economy and policies by quickly accessing and analyzing critical data

AssetMacro offers Free Historical Data for Leading Indicators of Economies and Market Data for Stocks, Bonds, Commodities and Currencies

Datasets for Data Mining and Data Science