AWS, Big Data, Cloud

Use AWS? Try the og-aws

What is the og-aws? It’s a new kind of book (really a booklet), crowd-sourced and published on GitHub. ‘OG’ stands for Open Guide, and the idea is that people who use AWS, but are NOT employees of AWS, have created a curated crib sheet with links to the material you really need to know. It is organized by category (such as ‘high availability’ or ‘billing’) or by service (EC2, S3, etc.) and well-indexed, so you can quickly scan and find the USEFUL answer you need.

Attention has also been paid to common ‘mistakes’ and ‘gotchas’ when using one or more AWS services, with guidance on how to avoid them.

There is an associated Slack community for the og-aws; click the link at the top of the README.md on the GitHub repo to join the active discussions about how best to use AWS services. The editors of the og-aws (including me) also welcome community contributions via GitHub pull requests, and have written a short guide to contributing, available here.

All in all, this guide is useful, timely, and FREE, so head over to GitHub to check out the og-aws here.

Big Data, Cloud, google, Uncategorized

GCP Data Pipeline Patterns

Here are the slide deck and screencast demos from my YOWNights talks in Australia on Google Cloud Platform services and data pipeline patterns.

The screencast demos are linked in the slide deck (starting after slide 4) and show the following GCP services (a short GCS code sketch follows the list):

  • GCS – Google Cloud Storage
  • GCE – Google Compute Engine, or VMs (Linux, Windows, and SQL Server)
  • BigQuery – managed data warehouse
  • Cloud Spanner – managed, scalable RDBMS (in Beta at the time of this recording)
  • Cloud Vision API – machine learning for image analysis
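
To give a flavor of the GCS demo, here is a minimal sketch of uploading a file to Cloud Storage with the google-cloud-storage Python client. This is my illustration, not code from the talk: the bucket name and file paths are hypothetical, and it assumes application default credentials are already configured.

```python
# Minimal GCS sketch using the google-cloud-storage client library.
# Assumes credentials are set up, e.g. via
# `gcloud auth application-default login`.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-demo-bucket")   # hypothetical bucket name
blob = bucket.blob("data/sample.csv")      # object path within the bucket

blob.upload_from_filename("sample.csv")    # upload a local file
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```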

The architecture patterns cover GCP services for common data pipeline workload scenarios: Data Warehouse, Time Series, IoT, and Bioinformatics. They are taken from Google’s reference architectures, found here.

Big Data, Cloud, google

Galaxy for bioinformatics on GCP

Below are the slides from my talk ‘Scaling Galaxy for GCP’ for the GAME 2017 conference, to be delivered in February 2017 at the University of Melbourne, Australia. Galaxy is a bioinformatics tool used for genomic research. A sample screen from Galaxy is shown below.

[Screenshot: a sample Galaxy screen]

In this talk, I will show demos and patterns for scaling the Galaxy tool (and for creating bioinformatics research data pipelines in general) via the Google Cloud Platform.

Patterns include the following:

  • Scaling UP using GCE virtual machines
  • Scaling OUT using GKE container clusters
  • Advanced scaling using combinations of GCP services, such as Google’s new Genomics API, along with BigQuery to analyze variants and more (a code sketch of this pattern follows below).

[Screenshot: core GCP services used]
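
As a hedged sketch of the ‘advanced scaling’ bullet, the snippet below counts variants per chromosome with the google-cloud-bigquery Python client. It is my illustration, not material from the talk, and the public 1000 Genomes table name is an assumption to verify against the current GCP documentation.

```python
# Sketch: counting variants per chromosome in BigQuery.
# The public 1000 Genomes variants table name below is an
# assumption; check the GCP public datasets docs before use.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT reference_name, COUNT(*) AS variant_count
    FROM `genomics-public-data.1000_genomes.variants`
    GROUP BY reference_name
    ORDER BY variant_count DESC
"""
for row in client.query(query).result():
    print(row.reference_name, row.variant_count)
```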

My particular area of interest is the application of genomic sequencing to personalized medicine for cancer genomics: the study of the totality of DNA sequence and gene expression differences between (cancer) tumor cells and normal host cells.

Building any type of genomics pipeline is true big data work, with EACH whole genome sequencing result producing about 2.9 billion base pairs (A, T, G, C).
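
As a rough back-of-envelope calculation (my assumptions, not figures from the post): at a typical 30x sequencing depth and roughly two bytes per base in uncompressed FASTQ (one for the base call, one for its quality score), a single genome yields on the order of 170 GB of raw data.

```python
# Back-of-envelope: raw data volume for one whole genome sequencing run.
# Coverage depth and bytes-per-base are rough assumptions for illustration.
GENOME_BASES = 2.9e9   # base pairs in one human genome (from the post)
COVERAGE = 30          # assumed typical whole-genome sequencing depth
BYTES_PER_BASE = 2     # uncompressed FASTQ: base call + quality score

raw_bytes = GENOME_BASES * COVERAGE * BYTES_PER_BASE
print(f"~{raw_bytes / 1e9:.0f} GB of raw sequence data per genome")
# Prints: ~174 GB of raw sequence data per genome
```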

[Screenshot: Google’s genome browser]

Google has an interesting genome browser (shown above) that you can use to explore the reference genomic data they host on GCP.

Big Data, Cloud, Hadoop

Whitepaper – Streaming Hadoop Solutions

In this whitepaper, I take a look at the various options for streaming on Hadoop, including Apache Storm, Apache Spark Streaming, and Apache Samza. I also examine commercial alternatives, such as DataTorrent. I cover implementation details, including the type of streaming each provides (record-at-a-time versus micro-batch) and the capabilities of the libraries and products covered.
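
To make the ‘type of streaming’ distinction concrete, here is a minimal Spark Streaming word count in Python. It illustrates the micro-batch model, where Spark groups incoming records into small time-sliced batches, in contrast to Storm’s record-at-a-time processing; the source host, port, and batch interval are placeholder values.

```python
# Minimal Spark Streaming (DStream) word count, illustrating the
# micro-batch model: records are grouped into 5-second batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # print each batch's counts to stdout

ssc.start()
ssc.awaitTermination()
```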

You can read this whitepaper online or download it via the included SlideShare link.

Happy streaming!