Google Cloud Data Pipeline Patterns

Here’s the deck and screencast demos from my talks for YOWNights in Australia on Google Cloud Platform services and data pipeline patterns.

The screencast demos are linked in the slide deck (starting after slide 4) and show the following GCP services:

  • GCS – Google Cloud Storage
  • GCE  – Google Compute Engine or VMs (Linux, Windows and SQL Server)
  • BigQuery – Managed data warehouse
  • Cloud Spanner – Managed scalable RDBMS – Beta release at this time of this recording
  • Cloud Vision API – Machine Learning (Vision API)

The architecture patterns for GCP services for common data pipeline workload scenarios, include the following: Data Warehouse, Time Series, IoT and Bioinformatics.  They are taken from Google’s reference architectures – found here.

Scaling Galaxy on GCP

Below are the slides from my talk for the GAME 2017 conference  on ‘Scaling Galaxy for GCP’ to be delivered in Feb 2017 at the University of Melbourne, Australia.  Galaxy is a bioinformatics tool used for genomic research.  A sample screen from Galaxy is shown below.

Screen Shot 2017-01-26 at 6.27.56 PM.pngIn this talk, I will show demos and patterns for scaling the Galaxy tool (and also for creating bioinformatics research data pipelines in general) via the Google Cloud Platform.

Patterns include the following:

  • Scaling UP using GCE virtual machines
  • Scaling OUT using GKE container clusters
  • Advanced scaling using combinations of GCP services, such as Google’s new Genomics API, along with using Big Query to analyze variants and more.  Core GCP Services used are shown below.Screen Shot 2017-01-26 at 7.10.26 PM.png

My particular area of interest is in the application of the results of using genomic sequencing for personalized medicine for  cancer genomics This is the application of the results of the totality of DNA sequence and gene expression differences between (cancer) tumor cells and normal host cells.

Building any type of genomics pipelines is true big data work, with EACH whole genome sequencing result producing 2.9 Billion base pairs (T,A,G,C).

Screen Shot 2017-01-27 at 10.56.30 PM.png

Google has an interesting genomic browser (shown above) that you can use on the reference genomic data that they host on GCP.

SQL Server on Google Cloud Platform

screen-shot-2016-11-15-at-9-57-49-am

I recently tried out running SQL Server 2016 on a Google Cloud Platform Windows-based Virtual Machine (GCE – Google Compute Engine Service).  This is a quick way to try out new features of the latest version of SQL Server.  In this case, I wanted to test out the R (language)-in database services.

Although you can certainly ‘click’ in the GCP console to start an instance of SQL Server on GCE,  you may want to script activity (for use with the Google gcloud tool).  To that end, I created a simple script to do this.  Also I added a script to enable and test the R-in database feature.  Here’s a link to my GitHub Repo.

What do you think?  Interested to try this out?  Let me know how it goes for you.

#happyExploring

Speed Essay / Abstract Writing

A great teacher taught me a technique to use when I felt ‘blocked’ trying to write an essay.  I have used and refined this technique for years when I am doing professional writing, particularly as I write/submit abstracts to conference as a potential speaker. People often tell me ‘you write so fast!’.  Since I am now teaching my daughter this technique as she is writing a number of essays for college applications, I thought I’d share it here.

A side benefit of doing this is that when you learn to ‘submit faster’ then you become more inclined to apply more frequently.  This applies, of course, not only to college applications for high school seniors, but for many types of written submissions throughout your life.

Because I’ve learned to make the cost of writing lower (by using this time-saving process), I tend to write more in general and also to submit to more technical conferences as a potential speaker. More submissions result in more acceptances (more rejections too) but that’s a topic for another blog post.

I hope you find this process useful!

#happyWriting


Speed Essay / Abstract Writing

Take a one hour block of time and…

  • Set a timer for 10 minutes
  • Write out the first question
  • Write the first bullet point
  • Write a sentence using the bullet point
  • Read the sentence out loud
  • List the next bullet point
  • Write a sentence using the next bullet point
  • Continue until time runs out or done

NOTE: Take a break after every 10 minutes for 5 minutes – get up and move around

When done writing out all bullets into sentences

  • Put the sentences in a logical order
  • Write words (or new sentences) to connect the existing sentences
  • Read each new paragraph out loud, update as needed
  • Write a concluding sentence for your essay
  • Read the entire essay out loud
  • Check the word count (for limits)

Sleep on it

  • Read the entire essay out loud
  • Update as needed
  • Input the essay into the application
  • Submit it, verify that your submission was accepted

Getting Started with SQL Server & Redis

When is SQL Server ‘not enough’ and/or ‘too expensive’ for your application performance scalability needs?  Caching is a popular mechanism used to accelerate response times and help applications scale while reducing the load on RDBMS systems, and save on resources.

Redis is a widely used in-memory NoSQL database, which can be used as a database, cache and message broker. This blog post explains how to get started with Microsoft SQL Server and enterprise grade Redis from Redis Labs.

WHAT IS AN APPLICATION CACHE?
Application caching is a technique used to speed up application response times, to help applications scale and to keep critical data highly available in the event of system failure. It works by placing frequently needed data close to the application, most often placing this data in memory using a NoSQL database such as Redis for the fastest application access.

WHAT IS REDIS?
Redis is a fast, lightweight and powerful open source NoSQL database.  Redis is written in C and it compiles into extremely efficient machine code which requires very little overhead. It runs entirely in-memory, and is optimized to deliver more than 1.5M ops/sec with less than 1 ms latency with a single standard server. This allows for really fast application response times, increases throughput, reduces database traffic, with the lowest spend! This approach also reduces the number of RDBMS licenses you need.

Redis is used as a database, a message broker, or very commonly as a caching layer because it enables true statelessness for an application’s processes, while reducing duplication of data or requests to external data sources. Although entire data sets are most commonly served from RAM, Redis also can be enabled to support data persistence and backups. With Redis Labs Enterprise Cluster, you get a natively distributed architecture, that scales as the application workload scales and reduces your database workload.

WHAT IS REDIS LABS?
Redis Labs is both the open source home and commercial provider of Redis. Redis Labs products provide an additional technology layer that encapsulates open-source Redis. This layer provides an enhanced deployment architecture for enterprises. They offer two enterprise Redis products:

  • Redis Labs Enterprise Cluster  – downloadable software, for on premises deployments or in the environment of your choice
  • Redis Cloud – fully managed service for cloud-based deployments in any of these public cloud providers – Azure, AWS, Google Cloud Platform, Softlayer, Heroku, Open Shift, Cloud Foundry and more…

Redis Labs offers 24/7 access to top-notch Redis experts with enterprise-class support. There’s zero downtime and zero performance impact while scaling Redis up or down. Redis Labs technology includes constant monitoring and rebalancing of shards to meet throughput goals, optimizations to ensure consistent high performance and automation for low operational overhead while ensuring high availability and scalability.

HOW TO USE REDIS CLOUD WITH SQL SERVER?
As an example, I’ll show how to use Redis Cloud as a application cache for a web application that uses SQL Server as it’s primary database.  Although the caching paradigm is a well-known application design pattern, I find that many of my customers are not familiar with using external NoSQL databases, such as Redis, to address this need. Shown below is an example architecture and also a link to a screencast with a demo of the application and even more information about using Redis Cloud.  The business scenario is to cache user login information.

Application Architecture shown below.

 

Arch.png

NOTE: This design pattern could be used by applications which are hosted on premises, by using Redis Labs Enterprise Cluster for the caching layer. In this case, the Redis Labs Enterprise Cluster would be installed locally, and would be used with an application that runs in house which uses an on premises SQL Server.

HOW DO I START?
The simplest way to start is to quickly setup a demo of Redis Labs using the Redis Cloud evaluation cluster using the Redis Labs website.  From there you’ll want to add Redis Cloud caching to your applications.  Shown below is a the Redis Labs portal which lets you get up and running with a RedisCloud instance with just a few clicks.  There is a free tier for development and evaluation.

RedisLabsConsole.png

Next, as an example, I’ve set up a node.js web application which uses Azure SQL as a it’s primary database.  This website includes a user login functionality.  The user information is cached using Redis Cloud.  Complete application source code is on Github at this link –  https://github.com/lynnlangit/RedisLabsDemo

In this demo, a simple node.js website uses Azure SQL for the user login information. The application login screen is shown below.

website.png

Also shown below is the structure (column names and datatypes) of the Azure SQL User Table.

sql-table.png

The application uses Windows Azure website hosting along with Azure SQL for the database tables.  It is a standard node.js application.  In addition to the standard setup, I’ve added a call to the Redis Labs’ Redis Cloud instance which I created earlier (in this case the Redis Cloud instance is running on Windows Azure).

Why don’t we just use Azure cache? The answer is simple. Azure cache is not persistent and does not come with built-in high availability at the same level as Redis cloud. Redis Cloud runs in Azure, so if your application is hosted in Azure, then you don’t have to worry about latencies between Azure and other environments. Operationally, scaling Redis Cloud is also a lot simpler since it happens behind the scenes, without you needing to select larger instances or having to change your application in any way.

In this demo, the use case is to retrieve the user login credentials from cache (Redis Cloud) rather than from Azure SQL.  The business reason for this is to accelerate the website login performance.  The data types used are simple strings presented as a username and password.  This is obviously a POC application, just meant to demonstrate core functionality and is no way shape or form anywhere a production quality application,

To set up the demo, configure the ‘env’ file to use your values and rename it to ‘.env’.  Run the ‘deploy-site.sh’ in the azure-scripts folder to execute the setup scripts for the Azure artifacts (website and SQL database).  See detailed setup instructions at the end of this article.  If you’d prefer to create the Azure website and Azure SQL instance by clicking on the Azure portal, you can do that as an alternative to running the setup scripts that I have supplied for your setup.

To test the demo, create a new user on the webpage, and then login with the new user account.  Log out and log back in with that new user account.  You’ll then be able to see the traffic on the Redis Labs website for your Redis Cloud cluster.  An example is shown below.

RedisLabs-Console.png

The code that makes this call to a Redis Labs session store possible is straightforward. It is found in the /services/ folder of the GitHub project in the ‘redis-cloud.js’ file.  I’ve detailed the key lines and provided a copy of this code below.

— Lines 1-4 setup the objects you’ll need.
— Line 6 creates the ‘createSessionStore’ method.
— Lines 8-11 setup and configure a Redis Labs, Redis Cloud client.
— Lines 13-15 setup the call to connect to the Redis Cloud instance.
— Lines 17-19 authenticate the connection with your server password.
— Lines 24-29, we set up the Redis Cloud instance to be used for session store for our node.js application.

code1.png

The users.routes.js’ & users.service.js’ files in the /users folder of the sample makes the actual calls to the session store for the username and password information (stored as key/value pairs), which has been set to use the Redis Cloud instance in our demo. Since we have already set Redis as our session store, any calls to session in our node.js code will access Redis.  Shown below on lines 28-33 from ‘users.routes.js’ is the call to get the session key.

code2.png

This demo application demonstrates how quick and easy it is to add Redis Cloud to your SQL Server-backed web application for the user / password caching use case using the standard session pattern.

MORE ABOUT REDIS

Redis provides pre-built data structures including strings, lists, sets, sorted sets with range queries, hashes, hyperloglogs, bitmaps and geospatial indexes with radius queries. Its ability to allow data objects to be stored in their original format, data can be processed on the database level rather than the application level. Using this architecture, an application can retrieve only discrete elements from the object, as opposed to other key / value data stores that require the application to retrieve an entire object value (blob), then de-serialize it and parse it in order to get the desired part. This enables blazingly fast access and analysis on stored data.

Redis’ true value is providing advanced data structures and operations to application developers. “Intelligent caching” is more than leveraging these data structures via the GET / SET operations; it’s the act of exploiting their unique properties to efficiently and optimally manipulate data. Two examples of intelligent caching are its commands that modify the data in the server and its ability to execute embedded Lua scripts.

Every operation in Redis is atomic, ensuring the integrity of the cached data and provides a consistent view to the processes sharing it. Additionally, Redis cache contributes to an application’s availability. External data sources can experience failures resulting in degraded or terminated service. During these outages, the cache can still serve data to the application, retaining availability. Redis cache has data persistence, in-memory cross-region / data center / cloud replication, instant automatic failover, backups, and disaster recovery.

WHEN TO USE REDIS CACHE?
A cache’s main purpose is to reduce the time needed to access data stored outside of the application’s main memory space, reducing bottlenecked performance and freeing up application resources for other uses.

Redis is a distributed shared cache providing very fast performance, a seamlessly scalable architecture and extremely low-latency data delivery. When you have very complex programming problems and/or very demanding data processing tasks, Redis’ data structures provide simple commands executed within the data store – not on the application-level – that results in cleaner, more elegant code with fewer lines, faster execution time, better application performance, and better CPU, I/O and network utilization.

Redis cache is a great alternative caching solution vs. caching available in SQL Server because the SQL Server’s main memory – RAM – is finite, sometimes volatile, normally non-sharable and comparatively expensive. Redis cache’s ability to store its data on disk allows for greater scalability, durability, concurrency, and cost-effectiveness. Redis cache is a great replacement for session storage such as for applications running on web farms or on Microsoft Azure where maintaining state is difficult.

When you have multiple data structures in various formats and sizes and need a flexible schema-less design, Redis data structures allows you to easily scale your database as if it was a simple key / value data store. Several of Redis’ commands operate on multiple keys. Multi-key operations provide better overall performance because there’s substantially less communication and administration. It is binary safe, and any single element in Redis can range in size from 0 bytes to 0.5 GB.  Also if complex data-structure is used, such as hash, list, set, (this limitation refers to a per element), then the entire object can be much larger.

Redis use cases include real-time analytics, high speed transactions, gaming, online advertising, IoT, message queues, high speed data ingest, session storage, in-app social functionality, in-database analytics, application job management, time-series processing, geo-searching and high-speed caching.

CONCLUSION
Redis Labs products, (Redis Cloud for cloud-based applications or Redis Enterprise for on premise applications) are simple, powerful and a natural choice for application caching with SQL Server.

REFERENCES
–Source code for caching demo – https://github.com/lynnlangit/RedisLabsDemo
–Comparison of Commercial Redis offerings — https://redislabs.com/redis-comparison
–SQL Server caching mechanisms – https://msdn.microsoft.com/en-us/library/cc293623.aspx

DEMO SETUP INSTRUCTIONS
Redis Labs RedisCloud as a user caching store for a node.js app with SQL Azure.  NOTE: Instructions are for OSX

  1. Prerequisites
    • install node.js
    • install azure cli
    • use a text editor (Visual Studio Code, Sublime, etc…)
  2. Redis Cloud
    • go to RedisLabs.com and sign up
    • login to RedisLabs.com and click ‘New Redis Subscription’
    • next to ‘Cloud’ click the drop down, select ‘Azure/west-us’ and 30MB/Free
    • go to ‘My Resources’ >’Manage Resources’, wait for the green checkmark
    • note your Redis Cloud endpoint address and Redis password
    • fill in these values on the Redis variables in your ‘env’ file
  3. Azure Setup
    • fill in your desired values in your ‘azure-scripts/redis-lab-demo-sql-server/parameters.json’ file
    • open ‘azure-scripts/redis-lab-demo-sql-server/deploy.sh’
    • run ‘azure-scripts/redis-lab-demo-sql-server/deploy.sh’ from a bash shell
    • connect to your SQL Azure instance with a client (i.e. Navicat), run scripts
      • run ‘user-login.sql’ and ‘user-status.sql’ to create tables
    • fill in your SQL values in your ‘env’ file
  4. Test the results
    • update ‘redis-cloud.js’ line 25 to use your value for the ‘secret’
    • rename the ‘env’ file to ‘.env’
    • run ‘npm install’ to install node dependencies on your local machine (if you wish to test locally before deplying to Azure)
    • test with localhost
    • get your ip address and set a firewall rule in azure to test remotely
    • run the commands in ‘azure-scripts/deploy-site.sh’
    • push your code to azure using the following commands
      • git add .
      • git commit – ‘{your commit message}’
      • git push azure master
    • test via your azure website endpoint (using a browser)