AlexaDev Tuesday: AWS Public Datasets

Publication Date: 1/16/17 – Did you know Amazon AWS hosts a large library of free, publicly available, dynamic datasets? This resource can open up a whole new world of Alexa Skill possibilities.


More Data


From Amazon AWS:

AWS hosts a variety of public datasets that anyone can access for free.

Previously, large datasets such as the mapping of the Human Genome required hours or days to locate, download, customize, and analyze. Now, anyone can access these datasets via the AWS centralized data repository and analyze them using Amazon EC2 instances or Amazon EMR (Hosted Hadoop) clusters. By hosting this important data where it can be quickly and easily processed with elastic computing resources, AWS hopes to enable more innovation, more quickly.


How It Works

The public datasets are hosted in two possible formats: Amazon Elastic Block Store (Amazon EBS) snapshots and/or Amazon Simple Storage Service (Amazon S3) buckets.

To access a public dataset hosted in Amazon S3: You can make simple HTTP requests, use AWS Command Line Tools and SDKs (Ruby, Java, Python, .NET, PHP, etc.), download the data using Amazon EC2, or use Hadoop to process the data with Amazon EMR.
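As a rough illustration of the "simple HTTP requests" option, here is a minimal Python sketch that lists the first few object keys in a public S3 bucket using only the standard library. It assumes the bucket allows anonymous listing, which not every public dataset does. The helper names (`listing_url`, `parse_keys`, `list_public_keys`) and any bucket name you pass in are illustrative and do not come from the AWS documentation quoted here.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used in S3 REST API listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(bucket, prefix="", max_keys=10):
    """Build the URL for an anonymous ListObjectsV2 request on a bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": max_keys}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_data):
    """Extract the object keys from the XML body of a listing response."""
    root = ET.fromstring(xml_data)
    return [element.text for element in root.iter(f"{S3_NS}Key")]


def list_public_keys(bucket, prefix="", max_keys=10):
    """Fetch and parse one page of keys (assumes anonymous listing is allowed)."""
    url = listing_url(bucket, prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_keys(response.read())
```

Calling `list_public_keys("some-public-bucket", prefix="data/")` with a real bucket name from the catalog should return a list of key strings. Each key can then be downloaded with an ordinary GET request to the same host. For anything beyond a quick look, the AWS SDKs and command line tools mentioned above are likely the more robust choice.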

To access a dataset hosted as an Amazon EBS snapshot: Sign up for an AWS account, launch an Amazon EC2 instance, and create an Amazon EBS volume using the Snapshot ID listed for that dataset in the AWS Public Datasets catalog. Or, see the Amazon EC2 Getting Started Guide.


Here are just a handful of the many available datasets:

Landsat on AWS: An ongoing collection of moderate-resolution satellite imagery of all land on Earth produced by the Landsat 8 satellite

SpaceNet on AWS: A corpus of commercial satellite imagery and labeled training data to foster innovation in the development of computer vision algorithms

Terrain Tiles: A global dataset providing bare-earth terrain heights, tiled for easy usage and provided on S3.

GDELT: Over a quarter-billion records monitoring the world’s broadcast, print, and web news from nearly every corner of every country, updated daily.

NAIP: 1-meter aerial imagery captured during the agricultural growing seasons in the continental U.S.

IRS 990 Filings on AWS: Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to the present

NEXRAD on AWS: Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network

NASA NEX: A collection of Earth science datasets maintained by NASA, including climate change projections and satellite images of the Earth’s surface

Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages

Click here to view the full list of available AWS Public Datasets.

Click here for Amazon’s guide to Using Public Datasets with Linux instances.

Click here for Amazon’s guide to Using Public Datasets with Windows instances.

Click here for Amazon Developer discussion board threads about AWS Public Datasets.