Cloud, Uncategorized

2019 Work & Talks

This year my team and I have been working with bioinformatics customers in Australia, the US and the UK. See my GitHub and accounts (linked here) for more detail. I have also written several technical articles on Medium.

I have now created 30 courses in the LinkedIn Learning library, on topics including Cloud, Big Data and more. Over 4 million students have watched these courses to date.

I’ve begun work on a book, ‘Visualizing Cloud Systems’, and am in the process of delivering talks on this subject in the US and in Europe. I am currently in Berlin, Germany, working with these clients remotely.

Also notable in 2019: I have moved to Minneapolis, MN.

AWS, Big Data, Cloud

Use AWS? Try the og-aws

What is the og-aws? It’s a new kind of book (really a booklet), crowd-sourced and published on GitHub. ‘OG’ stands for open guide. The idea is that people who use AWS, but are NOT employees of AWS, have created a curated crib sheet with links to the stuff you really need to know. It is organized by category (such as ‘high availability’ or ‘billing’) or by service (e.g. EC2, S3) and is well-indexed, so you can quickly scan it and get the USEFUL answer that you need.

The guide also pays attention to common ‘mistakes’ or ‘gotchas’ when using one or more AWS services, and documents them alongside the rest of the material.

There is an associated Slack for the og-aws; click the link at the top of the page on the GitHub repo to join in. In the Slack there are active discussions about how best to use AWS services. Also, the editors of the og-aws (including me) welcome additional community contributions (via GitHub pull requests). The editors have written a short guide to contributing, here.

All in all, this guide is useful, timely and FREE, so head over to GitHub to check out the og-aws, here.


Cloud, google

Bioinformatics Code Samples

As I’ve started working with cloud big data in the cancer genomics (bioinformatics) vertical, I’ve ‘collected’ my notes, code and work in a GitHub repo.


I have general information (e.g. terms, file types) at the top level of the repo. Next, I organized tools and libraries (such as Galaxy and Hail) by folder in the repo. I’ve also included sample code when I’ve had time to test it.

Samples and information are presented for either the AWS or the GCP cloud.

Big Data, Cloud, google, Uncategorized

GCP Data Pipeline Patterns

Here are the deck and screencast demos from my talks for YOWNights in Australia on Google Cloud Platform services and data pipeline patterns.

The screencast demos are linked in the slide deck (starting after slide 4) and show the following GCP services:

  • GCS – Google Cloud Storage
  • GCE – Google Compute Engine or VMs (Linux, Windows and SQL Server)
  • BigQuery – Managed data warehouse
  • Cloud Spanner – Managed scalable RDBMS – Beta release at the time of this recording
  • Cloud Vision API – Machine learning for image analysis

The architecture patterns cover GCP services for common data pipeline workload scenarios, including Data Warehouse, Time Series, IoT and Bioinformatics. They are taken from Google’s reference architectures – found here.

Big Data, Cloud, google

Galaxy for bioinformatics on GCP

Below are the slides from my talk for the GAME 2017 conference on ‘Scaling Galaxy for GCP’, to be delivered in Feb 2017 at the University of Melbourne, Australia. Galaxy is a bioinformatics tool used for genomic research. A sample screen from Galaxy is shown below.

[Screenshot: sample Galaxy screen]

In this talk, I will show demos and patterns for scaling the Galaxy tool (and also for creating bioinformatics research data pipelines in general) via the Google Cloud Platform.

Patterns include the following:

  • Scaling UP using GCE virtual machines
  • Scaling OUT using GKE container clusters
  • Advanced scaling using combinations of GCP services, such as Google’s new Genomics API, along with using BigQuery to analyze variants and more

Core GCP services used are shown below.

[Screenshot: core GCP services used]

My particular area of interest is applying the results of genomic sequencing to personalized medicine, specifically cancer genomics. Cancer genomics is the study of the totality of DNA sequence and gene expression differences between (cancer) tumor cells and normal host cells.

Building any type of genomics pipeline is true big data work, with EACH whole genome sequencing result covering 2.9 billion base pairs (T, A, G, C).
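To give a rough sense of scale, here is a back-of-the-envelope sketch of what that base-pair count means for storage. Only the 2.9 billion figure comes from the text above. The 2-bits-per-base packing, the one-byte-per-base text encoding and the 30x coverage value are my own illustrative assumptions, and real sequencing files also carry quality scores and metadata that this ignores.

```python
# Back-of-the-envelope storage estimate for one whole genome.
# Only BASE_PAIRS comes from the text; everything else is an assumption.

BASE_PAIRS = 2_900_000_000   # figure quoted in the text
BITS_PER_BASE = 2            # assumption: 4 symbols (T, A, G, C) fit in 2 bits
ASSUMED_COVERAGE = 30        # hypothetical sequencing depth, for illustration


def packed_size_bytes(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store the bases at a fixed number of bits each."""
    total_bits = base_pairs * bits_per_base
    return (total_bits + 7) // 8  # round up to whole bytes


def text_size_bytes(base_pairs: int, coverage: int = 1) -> int:
    """Bytes needed at one ASCII character per base, repeated per coverage."""
    return base_pairs * coverage


if __name__ == "__main__":
    packed = packed_size_bytes(BASE_PAIRS)
    plain = text_size_bytes(BASE_PAIRS)
    deep = text_size_bytes(BASE_PAIRS, ASSUMED_COVERAGE)
    print(f"2-bit packed:        {packed / 1e9:.3f} GB")
    print(f"1 byte per base:     {plain / 1e9:.1f} GB")
    print(f"{ASSUMED_COVERAGE}x coverage (text):  {deep / 1e9:.0f} GB")
```

Even under these simplified assumptions, a single sample lands in the tens of gigabytes once coverage is taken into account, which is why these pipelines are usually built on scalable cloud storage and compute.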

[Screenshot: Google’s genomic browser]

Google has an interesting genomic browser (shown above) that you can use on the reference genomic data that they host on GCP.

Cloud, google, SQL Server 2012, SQL Server 2014, SQL Server 2016, Uncategorized

SQL Server on GCP

I recently tried out running SQL Server 2016 on a Google Cloud Platform Windows-based Virtual Machine (GCE – Google Compute Engine Service).

This is a quick way to try out new features of the latest version of SQL Server. In this case, I wanted to test out the in-database R language feature (SQL Server R Services).

Although you can certainly ‘click’ in the GCP console to start an instance of SQL Server on GCE, you may want to script the activity instead (using the Google gcloud tool). To that end, I created a simple script to do this. I also added a script to enable and test the in-database R feature. Here’s a link to my GitHub repo.
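For readers who want a feel for what such scripting can look like, here is a minimal sketch in Python. It is not the script from my repo. The instance name, zone, machine type and disk size are hypothetical, and the image family and image project are assumptions that you should check against the current list of public SQL Server images. The T-SQL strings use the standard `sp_configure` option and `sp_execute_external_script` procedure, and assume that R Services is already installed on the instance.

```python
import shlex
import subprocess

# T-SQL to turn on external scripts (needed for in-database R).
# The SQL Server service must be restarted after this for it to take effect.
ENABLE_R_SQL = (
    "EXEC sp_configure 'external scripts enabled', 1;\n"
    "RECONFIGURE WITH OVERRIDE;"
)

# T-SQL smoke test: round-trips one row through the R runtime.
TEST_R_SQL = (
    "EXEC sp_execute_external_script\n"
    "    @language = N'R',\n"
    "    @script = N'OutputDataSet <- InputDataSet;',\n"
    "    @input_data_1 = N'SELECT 1 AS hello'\n"
    "WITH RESULT SETS (([hello] int NOT NULL));"
)


def build_create_command(
    name,
    zone="us-central1-a",                  # hypothetical zone
    machine_type="n1-standard-4",          # hypothetical size
    image_family="sql-std-2016-win-2016",  # assumption: verify before use
    image_project="windows-sql-cloud",     # assumption: verify before use
    boot_disk_size="100GB",                # hypothetical disk size
):
    """Return the gcloud argument list that would create the VM."""
    return [
        "gcloud", "compute", "instances", "create", name,
        f"--zone={zone}",
        f"--machine-type={machine_type}",
        f"--image-family={image_family}",
        f"--image-project={image_project}",
        f"--boot-disk-size={boot_disk_size}",
    ]


def create_instance(name, dry_run=True):
    """Print the command; run it only when dry_run is False."""
    cmd = build_create_command(name)
    print(" ".join(shlex.quote(part) for part in cmd))
    if not dry_run:
        # Requires an installed and authenticated gcloud CLI.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    create_instance("sql2016-r-test")  # hypothetical instance name
    print(ENABLE_R_SQL)
    print(TEST_R_SQL)
```

By default the sketch only prints the command, so you can review it before creating a billable VM. The two T-SQL statements would then be run on the instance itself, for example from SQL Server Management Studio.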

What do you think? Interested in trying this out? Let me know how it goes for you.