Google Cloud Data Pipeline Patterns

Here’s the deck and screencast demos from my talks for YOWNights in Australia on Google Cloud Platform services and data pipeline patterns.

The screencast demos are linked in the slide deck (starting after slide 4) and show the following GCP services:

  • GCS – Google Cloud Storage
  • GCE  – Google Compute Engine or VMs (Linux, Windows and SQL Server)
  • BigQuery – Managed data warehouse
  • Cloud Spanner – Managed scalable RDBMS – Beta release at this time of this recording
  • Cloud Vision API – Machine Learning (Vision API)

The architecture patterns for GCP services for common data pipeline workload scenarios, include the following: Data Warehouse, Time Series, IoT and Bioinformatics.  They are taken from Google’s reference architectures – found here.

Scaling Galaxy on GCP

Below are the slides from my talk for the GAME 2017 conference  on ‘Scaling Galaxy for GCP’ to be delivered in Feb 2017 at the University of Melbourne, Australia.  Galaxy is a bioinformatics tool used for genomic research.  A sample screen from Galaxy is shown below.

Screen Shot 2017-01-26 at 6.27.56 PM.pngIn this talk, I will show demos and patterns for scaling the Galaxy tool (and also for creating bioinformatics research data pipelines in general) via the Google Cloud Platform.

Patterns include the following:

  • Scaling UP using GCE virtual machines
  • Scaling OUT using GKE container clusters
  • Advanced scaling using combinations of GCP services, such as Google’s new Genomics API, along with using Big Query to analyze variants and more.  Core GCP Services used are shown below.Screen Shot 2017-01-26 at 7.10.26 PM.png

My particular area of interest is in the application of the results of using genomic sequencing for personalized medicine for  cancer genomics This is the application of the results of the totality of DNA sequence and gene expression differences between (cancer) tumor cells and normal host cells.

Building any type of genomics pipelines is true big data work, with EACH whole genome sequencing result producing 2.9 Billion base pairs (T,A,G,C).

Screen Shot 2017-01-27 at 10.56.30 PM.png

Google has an interesting genomic browser (shown above) that you can use on the reference genomic data that they host on GCP.

SQL Server on Google Cloud Platform

screen-shot-2016-11-15-at-9-57-49-am

I recently tried out running SQL Server 2016 on a Google Cloud Platform Windows-based Virtual Machine (GCE – Google Compute Engine Service).  This is a quick way to try out new features of the latest version of SQL Server.  In this case, I wanted to test out the R (language)-in database services.

Although you can certainly ‘click’ in the GCP console to start an instance of SQL Server on GCE,  you may want to script activity (for use with the Google gcloud tool).  To that end, I created a simple script to do this.  Also I added a script to enable and test the R-in database feature.  Here’s a link to my GitHub Repo.

What do you think?  Interested to try this out?  Let me know how it goes for you.

#happyExploring

AWS, GCP, Azure Consoles Graded

I work with all three major public cloud vendors for various clients.  I find it interesting to observe the differences in their approaches to the design (and subsequent usability) of their web consoles.

AWS

The AWS console reflects the state of their services (and their market share).  It is consistent, clean and very usable.  It loads very fast on browsers I use (Chrome mostly). This page show exactly the information I need (and no more). Interestingly, it does NOT show any of my security information by default on the main page.  Services are organized in a logical way, service icons ‘make sense’ in color, type and size. The ability to add service shortcuts at the top improves usability.  Also, surfacing resource groups on the first page is great, as this is a feature I use often.

I would like to see my total AWS spend per region per account on this page as well.

Grade A

AWS.png

GCP

The GCP console recently had a major overhaul and the results are very positive.  The amount of improvement from previous version is significant.  GCP uses the concept of one or more GCP projects as containers for billing and a set of GCP service instances.  I do find this convenient because I can easily see my total project costs.  I also like the ‘Billing’ widget on the first page.

Although the list of services available in GCP is easily findable by clicking the ‘hamburger’ (three white lines) menu in the upper left, I do find that this method of showing all possible services does confuse some customers (particularly those who are moving over or adding to AWS).

One feature I particularly like is the integrated command line tool (gcloud) console.  It’s fast, usable and works great!

Although I can’t think of how to do this (I am a UX consumer – rather than designer!), I’d like to see a more intuitive way to see all of the currently enabled GCP services (and all possible services) shown in the main console window.

Grade B+

GCP-1.png

Azure

Azure uses two consoles, both an a ‘classic’ and a ‘current’ console. For the purposes of this review, I am including the ‘current’ console only. As you can see by the image below, Not all of the tiles render in my browser (Chrome). I’ve tweeted about this bug a couple of times, but haven’t seen any improvement.

Azure uses the concept of subscriptions as containers for services and billing. I find the layout of this portal confusing and unintuitive. That coupled with the fact that the main page renders slowly and usually fails to render correctly is very frustrating to me.

Also the default listing of service types (which is some subset of the actual services available – some items are services, others are category names for groups of services) is once again, unintuitive and generally irritating to me.  What does ‘classic’ mean? Is it good, not good, should I use it, etc…?

Also the odd sizing of the tiles (too much blank space) is not helpful.

Generally, this ‘new’ Azure portal is not showing the increasingly more competitive set of Azure service in a positive way to me and my customers.

Grade D

Azure.png

I am interested in your opinion. Do you use any or all these cloud consoles?  If so, how do you find them?  What works well for you?  What doesn’t? What do you wish would be added and/or removed for improved usability?

Happy Programming (in the cloud)!

 

How to: Installing AerospikeDB on Google Compute Engine

Showing Aerospike at BigDataCampLA

Recently, I’ve been doing some work with AerospikeDB.  It is a super-fast in-memory NoSQL Database.  I gave a presentation at the recent BigDataCampLA on ‘Bleeding Edge Databases’ and included it because of impressive benchmarks, such as 1 Million TPS (read-only workload) PER SERVER and 40K TPS (read-write) on that same server.  Here’s the live presentation, also I did a screencast of this presentation.

In this blog post, I’ll detail how you can get started with the FREE community edition of AerospikeDB.  Again I’ll use Google Compute Engine as my platform of choice, due to the speed, ease of use and inexpensive cost for testing.  You’ll note from the screenshot below, that you can install the community edition on your own server, or on other clouds (such as AWS) as well.  I am writing this blog post because Aerospike didn’t have directions to get set up on GCE available prior to this blog post.

Aerospike Community Edition

Here’s a top level list of what you’ll need to do (below, I’ll detail each step) – I did the whole process start-to-finish in < 30 minutes.

    • Set up a Google Cloud project with Google Compute Engine (VM) API access
    • Spin up and configure a GCE instance
    • Install the Aerospike Community Edition, which runs on up to 2-nodes and can use up to 200 GB for your testing purposes
    • Run your tests and (optionally) add other nodes

Next I’ll drill into each of the steps listed above.  I’ll go into more detail and will provide sample commands for the Google Cloud test that I did.

Step One – Setup a Google Cloud project with Google Compute Engine access

If you are new to the Google Cloud, you’ll need to get the Google Cloud SDK for the command line utilities you’ll need to install and to connect to your cloud-hosted virtual machine.  There is a version of the SDK for Linux/Mac and also for Windows.

For this tutorial, I will be using Mac. There are only two steps to using the SDK:
a) From Terminal run

curl https://sdk.cloud.google.com | bash

b) Then restart Terminal and then run the command below from Terminal.  After it runs then a browser window will open, then click on your gmail account and then click on the ‘accept’ button and then login will complete in the terminal window

gcloud auth login

GCloud Authorization

If you already have a Google Cloud Project, then you can proceed to Step Two.  If you do not yet have a Google Cloud Project, then you will need to go to the Google Developer’s Console and create a new Project by clicking on the red ‘Create Project’ button at the top of the console.

Create Project

 

 

Note: Projects are containers for billing on the Google Cloud.  They can contain 1:M instances of each authorized service – in our case that would be 1:M instance of Google Compute Engine Virtual Machines.

To enable access to the GCE API in your project, click on the name of the project in Google Developer Console, then click on ‘APIS & AUTH’>’APIs’>Google Compute Engine “OFF” to turn the service availability to “ON”.  The button should turn green to indicate the service is available.

You will also have to enable billing under the ‘Billing & Settings’ section of the project.  Because you are reading this blog post, you can apply for $ 500 USD in Google Cloud Usage Credit at this URL – use code “gde-in” when you apply for the credit.

Google Cloud Usage Credit

To be complete there are many other types of cloud services available, such as Google App Engine, Google Big Query and many more.  Those services are not directly related to the topic of this article, so I’ll just link more information from the Google Cloud Developer documentation here.

Step Two – Spin up and configure a GCE instance

Note: All of the steps I describe below could be performed in the Terminal via GCloud command line tools (‘gcloud compute’ in this case), for simplicity, I will detail the steps using the web console.  Alternatively, here is a link to creating a GCE instance using those tools.

From within your project in Google Developers Console, click on your Project Name. From the project console page, click on ‘COMPUTE’ menu on the left side to expand it.  Next click on ‘COMPUTE ENGINE’>VM Instances.

Then click on the red button on the top of the page ‘New Instance’ to open the web page with the instance information as shown below.  Also here’s a quick summary of the values I selected: ZONE: US-Central1-b; MACHINE TYPE: n1-standard-1 (1 vCPU, 3.8 GB memory); IMAGE: Debian-7-wheezy-v20140606.

Other notes: You could use a g1-small instance type if you’d prefer, minimum machine requirements for the community edition of Aerospike are at least 1 GB RAM and 1 vCPU. You could use Red Hat and CentOS for the image, however my directions are specific to Debian 7 Linux.

GCE Configuration for Aerospike test

Click the blue ‘Create’ button to start your instance.  After the instance is available (takes less than a minute in my experience!), then you will see it listed in the project console window (COMPUTE ENGINE>VM Instance).  You can now test connectivity to your instance by clicking on the ‘SSH’ button to the right of the instance.

To test connectivity using SSH, open  Terminal, then use the ‘gcloud auth login’ command as described previously, then paste the gcutil command into the terminal, an example is shown below.

Testing Connectivity to GCE

The last configuration step for GCE is set up a firewall rule.  You’ll want to do this so that you can use the Aerospike (web-based) management console.  To create this rule do the following in the Google Developers Console for your project:  Click on COMPUTE>COMPUTE ENGINE>Networks>’default’>Firewall Rules>Create New.  Then add a new firewall rules with these settings: Name: AMC; Source Ranges: 0.0.0.0/0; Allowed Protocols or Ports: tcp: 8081

Step Three – Install the Aerospike Community Edition

To start, I set up a test Aerospike server with a single node.  To do this there are three required steps.  I have added a couple of optional steps as well since I found them to make my test of Aerospike more interesting

a) Connect to GCE via SSH
b) Download the Aerospike Community Edition
c) Extract and install the download (which is the server software plus command line tools)
d) start the service and test inserting data
e) install the node.js client (optional)
f) install the web-based management console

Notes: Be sure to run the scripts below as sudo.  Also my install instructions are based on downloading the version of Aerospike Database Server 3.2.9 than is designed to run on DEBIAN 7.

Here is a Bash script to automate this process:

#!/bin/bash
sudo apt-get -y install wget
wget -qO server.tgz “http://www.aerospike.com/community_downloads/3.2.9/aerospike-community-server-3.2.9-debian7.tgz&#8221;
tar xzf server.tgz
sudo dpkg -i aerospike*/aerospike*

sudo /etc/init.d/aerospike start #start aerospike now

To verify correct functioning of the server:

sudo /etc/init.d/aerospike status

I next used the included command line tool to further verify that the server was working properly by inserting, retrieving and then deleting some values from the server.  The command line tools is called ‘cli’ and is found in /usr/bin/cli.  Here are some sample test commands that I used:

cli -o set -n test -s “test_set” -k first_key -b bin_x -v “first value”
cli -o set -n test -s “test_set” -k first_key -b bin_y -v “second value”
cli -o set -n test -s “test_set” -k first_key -b bin_z -v “third value”

Next I retrieved these values with the following command:

cli -o get -n test -s “test_set” -k first_key
> {‘bin_z’: ‘third value’, ‘bin_y’: ‘second value’, ‘bin_x’: ‘first value’}

Last I deleted the key:

cli -o delete -n test -s “test_set” -k first_key
cli -o get -n test -s “test_set” -k first_key
> no data retrieved

Tip: (Current Aerospike version: 3.2.9) After an instance reboot, the Aerospike service may fail to start.  To fix try creating an Aerospike directory in /var/run:

sudo mkdir /var/run/aerospike

Optional Step — Client Setup (Node.js)

Next I setup a simple Node.js Client on the same instance as the server. The process is as follows:  install Node.js and the Node Package Manager (NPM) and then install the Node.js Aerospike client package.

Note: There are a number of Aerospike clients available for different languages.  These are outside the scope of this document.  For more information go here: http://www.aerospike.com/aerospike-3-client-sdk/

This script automates this process:

#!/bin/bash
CWD=$(pwd)
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:chris-lea/node.js # requires human interaction: ‘PRESS ENTER’
sudo apt-get install -y software-properties-common
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get install -y python g++ make nodejs
curl https://www.npmjs.org/install.js | sudo sh
sudo npm install aerospike -g # -g installs to /usr/lib, current dir otherwise

cd ${CWD}

 You will need to acknowledge the addition of the Node.js repository to your software repositories list. Once this completes, navigate to the examples directory:

 cd /usr/lib/node_modules/aerospike/examples

 Install the prerequisite packages:

 sudo npm install ../
 sudo npm update

These examples insert dummy data to a specified location in a similar fashion to the cli tool.

node put.js -n test -s “test_set” first_key
OK –  { ns: ‘test’, set: ‘test_set’, key: ‘first_key’ }
put: 3ms

node get.js -n test -s “test_set” first_key
OK –  { ns: ‘test’, set: ‘test_set’, key: ‘first_key’ } { ttl: 9827, gen: 2 } { bin_z: ‘third value’, i: 123, s: ‘abc’, arr: [ 1, 2, 3 ],map: { num: 3, str: ‘g3’, buff: <SlowBuffer 0a 0b 0c> }, b: <SlowBuffer 0a 0b 0c>,b2: <SlowBuffer 0a 0b 0c> }
get: 3ms 

node remove.js -n test -s “test_set” first_key 
OK –  { ns: ‘test’, set: ‘test_set’, key: ‘first_key’ }
remove: 7ms

Additionally there is a benchmarking tool I used to get a rough idea of the transactions per second available from my instance:

cd /usr/lib/node_modules/aerospike/benchmarks
npm install
node inspect.js

Management Console Setup

The Aerospike Management Console is a web based monitoring tool that will report all kinds of status information about your Aerospike deployment. Whether its a single instance or a large multi-datacenter cluster. To install the AMC I used the following script as superuser (e.g. sudo script.sh).

#!/bin/bash
apt-get -y install python-pip python-dev ansible
pip install markupsafe paramiko ecdsa pycrypto
wget -qO amc.deb ‘http://aerospike.com/amc/3.3.1/aerospike-management-console-3.3.1.all.x86_64.deb&#8217;
dpkg -i amc.deb

sudo /etc/init.d/amc start #start amc now

Once deployed I pointed my browser to port 8081 of the instance.  There will be a dialog asking for the hostname and port of an Aerospike instance.  Since I installed the server on the same instance as the amc I just used localhost and port 3000.

Aerospike Console

Step Four – Run tests and (optionally) add other nodes

As mentioned, you can test Aerospike on up to 2 nodes.  The next step I took in testing was to add another server node.  Here are the steps I took to do this.

First I added a firewall rule for TCP ports 3000-3004.  I did this using the same process (i.e. in the Google Developers Console) described previously.  Get to the ‘Create new firewall rule’ panel: Compute> Compute Engine> Networks> ‘default’> Firewall Rules> Create New.  Configure the rule by changing these values:Name: aerospike; Source Ranges: 0.0.0.0/0; Allowed Protocols or Ports: tcp:3000-3004

Next I opened the Aerospike configuration file located at /etc/aerospike/aerospike.conf. Inside the ‘networks’ section is a section called ‘heartbeat’ that looks like the following:

heartbeat {

mode multicast
address 239.1.99.222
port 9918

 # To use unicast-mesh heartbeats, comment out the 3 lines above and
# use the following 4 lines instead.
#mode mesh

#port 3002
#mesh-address 10.0.0.48
#mesh-port 3002

interval 150
timeout 10
}

I commented out the first three lines inside this section and then uncommented the four lines starting with mode mesh. I then replaced the ip address after mesh-address with the ip of my other node.  Next I save my changes then restarted the aerospike service:

sudo /etc/init.d/aerospike restart.

Next I repeated these changes on my second server instance, setting the mesh-address for this server to the ip address of the first server instance.  Each server instance will only need to know about one other server instance to connect to my Aerospike cluster. Everything else is handled automatically. To verify that the cluster is working correctly I checked the log file for ‘CLUSTER SIZE = 2’ like this:

sudo cat /var/log/aerospike/aerospike.log | grep CLUSTER
May 14 2014 23:42:48 GMT: INFO (partition): (fabric/partition.c::2876) CLUSTER SIZE = 2

Tip: If you are testing this out yourself, ensure that your instances can communicate with each other over the default ports 3000-3004.  To test connectivity use telnet for example: ‘telnet <remote ip> <port>’

Conclusions

In conclusion, I find Aerospike to be a superior performing database in its category.  I am curious about your experience with databases of this type (i.e. In-memory NoSQL).  Which vendors are you working with now?  What has been your experience?  Which type of setup works best for you – on premise (bare metal or virtualized) or in the cloud?  If in the cloud, which vendor’s cloud.

Also on the horizon, I am exploring up-and-coming light-weight application virtualization technologies, such as Docker.  Are you working with anything like this?  I will be posting more on using Docker with NoSQL and NewSQL databases over the next couple of months.

Bleeding Edge Databases – Aerospike, Algebraix and Google Big Query

I gave a talk called ‘Bleeding Edge Databases’ at this weekend’s BigDataCampLA.  Several attendees asked me to record the talk, so I did (and will link that post below).

Keynoting at BigDataCampLA

 

 

 

 

 

 

 

 

Here’s my favorite tweet from the event.

I am regularly invited to preview many, new database technologies, so you may be wondering why I chose these three solutions to highlight.  The first aspect of all three solutions is independently benchmarked noticeably better performance in their particular areas.  I also take usability into account.

Aerospike – 1 Million TPS (read-only workload) PER SERVER and 40K TPS (read-write) on that same server. Very good integration with client tools and libraries.

1M TPS Aerospike

AlgebraixData – 1 Billion Triples on ONE NODE and substantially better query response times than any competitor for the core benchmark queries for RDF databases.  Their core engine, which is optimized using patented mathematical algorithms interests me because of the solid performance benchmarks they’ve been able to achieve.

The Math of Algebraix Data

Google Big Query – 750 Millions Rows in 10 seconds, solidly ‘productized’ with usable integration points both in/out, reduced pricing and increased streaming – also nearly ANSI SQL-like querability.

TPC-H Benchmark for Big Query

 

 

 

 

 

 

 

 

My screencast includes demos of the first two products using Google Compute Engine (VMs on the Google Cloud).  I chose a Linux GCE instance for Aerospike and a Windows GCE instance for AlgebraixData.  I prefer to use the Google Cloud for this type of testing for a couple of reasons:

1) Fast, easy VM spin-up
2) Generous free tier, clear pricing information (i.e. no surprises)
3) As a GDE (Google Developer Expert), I have usage credits for the Google Cloud beyond typical, so I rarely encounter any fees for quick POC testing

If you are interesting in trying out these databases, the instructions on how to get access are in my screencast – enjoy!