Big Data, Cloud, Hadoop

Whitepaper – Streaming Hadoop Solutions

In this whitepaper, I look at the various options for stream processing on Hadoop.  These include Apache Storm, Apache Spark Streaming and Apache Samza.  I also examine commercial alternatives, such as DataTorrent.  I cover implementation details, including the type of streaming each option performs and the capabilities of the libraries and products included.

You can read this whitepaper online or download it via the included Slideshare link.

Happy streaming!

AWS, Azure, Big Data, Cloud, Data Science, Hadoop, Microsoft

New YouTube Series – Hadoop MapReduce Fundamentals

Hadoop MapReduce

I’ve been working with Hadoop MapReduce in several formats over the past couple of years.  I decided to pull together my experience and record this as a free, multi-part screencast series on YouTube.

The course consists of five screencasts, each 30 to 50 minutes long.  Each part tackles some aspect of Hadoop MapReduce, from a basic conceptual understanding to the most common tuning processes.  Throughout the series, I’ve included screencast demos using a variety of vendor distributions of Hadoop.  These demos include Cloudera CDH4, Windows Azure HDInsight, Amazon Elastic MapReduce (EMR) and more.
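For readers who want the basic concept before watching, here is a minimal sketch of the map, shuffle and reduce steps as a plain in-memory word count in Python. It is not taken from the course demo files and does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key, the step a framework runs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: collapse the list of counts for one word into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["big data on hadoop", "hadoop mapreduce on big clusters"]
    print(word_count(sample))
```

On a real cluster the mapper and reducer run in parallel across many machines and the shuffle moves data over the network, but the shape of the computation is the same.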

Below is the first module of the course.

Here is a link to the entire PowerPoint deck.

Here is a link to the course demo files.

AWS, Big Data, Cloud, Hadoop

First Look ETL in the AWS Cloud – Data Pipelines

I tried creating a data pipeline (an ETL process) in the AWS cloud this morning, using the AWS Data Pipeline service.  It currently works with AWS data sources, such as S3, DynamoDB and RDS.

AWS Services

I found that I did need to read the AWS documentation in order to create even a simple pipeline.  Below is an example of a simple copy job in the data pipeline designer.

AWS copy job data pipeline
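A copy job like this can also be thought of as a small set of linked objects. The Python sketch below builds a dictionary shaped roughly like a pipeline definition for an S3-to-S3 copy and checks that every reference points at an object that exists. Treat it as an assumption-laden illustration: the object types and field names follow my recollection of the AWS Data Pipeline definition format and should be verified against the AWS documentation, the bucket paths are placeholders, and the helper names `build_copy_pipeline` and `unresolved_refs` are hypothetical.

```python
import json


def build_copy_pipeline(source_path, dest_path):
    """Return a dict shaped like a pipeline definition for a single copy job.

    Field names are assumptions to be checked against the AWS documentation.
    """
    return {
        "objects": [
            {"id": "Default", "name": "Default", "scheduleType": "cron",
             "schedule": {"ref": "DailySchedule"}},
            {"id": "DailySchedule", "type": "Schedule", "period": "1 day",
             "startAt": "FIRST_ACTIVATION_DATE_TIME"},
            {"id": "Source", "type": "S3DataNode", "filePath": source_path},
            {"id": "Destination", "type": "S3DataNode", "filePath": dest_path},
            {"id": "Worker", "type": "Ec2Resource", "terminateAfter": "1 Hour"},
            {"id": "CopyJob", "type": "CopyActivity",
             "input": {"ref": "Source"},
             "output": {"ref": "Destination"},
             "runsOn": {"ref": "Worker"}},
        ]
    }


def unresolved_refs(definition):
    """List (object id, field, target) for every ref that names a missing object."""
    known_ids = {obj["id"] for obj in definition["objects"]}
    missing = []
    for obj in definition["objects"]:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in known_ids:
                missing.append((obj["id"], field, value["ref"]))
    return missing


if __name__ == "__main__":
    # Placeholder bucket paths, not real resources.
    definition = build_copy_pipeline(
        "s3://example-source-bucket/input.csv",
        "s3://example-dest-bucket/output.csv",
    )
    print(json.dumps(definition, indent=2))
    print("unresolved refs:", unresolved_refs(definition))
```

The reference check mirrors what the designer does visually: each arrow in the diagram is a `ref` from one object to another, and a dangling arrow means the pipeline is incomplete.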

Enjoy the screencast.