Slides from my keynote at YOWData in Sydney, Australia.
To try out the VariantSpark HipsterIndex notebook go here.
Here’s the .pdf of my full-day workshop from QCon London ‘Beyond Relational: Cloud Big Data Design Patterns’
What is the og-aws? It’s a new kind of book (really a booklet), crowd-sourced and published on GitHub. ‘OG’ stands for open guide. The idea is that people who use AWS, but are NOT employees of AWS, have created a curated crib sheet with links to the stuff you really need to know. It is organized by category (such as ‘high availability’ or ‘billing’…) or by service (e.g. EC2, S3, etc…) and is well-indexed, so that you can quickly scan and get the USEFUL answer that you need.
The guide also calls out common ‘mistakes’ or ‘gotchas’ that come up when using one or more AWS services.
There is an associated Slack for the og-aws; click the link at the top of the README.md page on the GitHub repo to join in. In the Slack there are active discussions about how best to use AWS services. Also, the editors of the og-aws (including me) welcome additional community contributions (via GitHub pull requests). The editors have written a short guide to contributing, linked here.
All-in-all, this guide is useful, timely and FREE, so head over to GitHub to check out the og-aws — here.
The screencast demos are linked in the slide deck (starting after slide 4) and show the following GCP services:
The architecture patterns for GCP services cover common data pipeline workload scenarios, including the following: Data Warehouse, Time Series, IoT and Bioinformatics. They are taken from Google’s reference architectures – found here.
Below are the slides from my talk for the GAME 2017 conference on ‘Scaling Galaxy for GCP’ to be delivered in Feb 2017 at the University of Melbourne, Australia. Galaxy is a bioinformatics tool used for genomic research. A sample screen from Galaxy is shown below.
In this talk, I will show demos and patterns for scaling the Galaxy tool (and also for creating bioinformatics research data pipelines in general) via the Google Cloud Platform.
Patterns include the following:
My particular area of interest is applying the results of genomic sequencing to personalized medicine, specifically cancer genomics. Cancer genomics looks at the totality of DNA sequence and gene expression differences between (cancer) tumor cells and normal host cells.
Google has an interesting genomic browser (shown above) that you can use on the reference genomic data that they host on GCP.
Here’s a link to my slides from the workshop I delivered for QCon Sao Paulo, Brazil, “Real-world Cloud Big Data Patterns”
In this whitepaper, I take a look at the various options for Hadoop Streaming. These include Apache Storm, Apache Spark Streaming and Apache Samza. Also I examine commercial alternatives, such as Data Torrent. I cover implementation details of streaming, including type of streaming and capacities of libraries and products included.
You can read this whitepaper online or download it via the included Slideshare link.
Here’s a whitepaper I wrote on the ‘state of Machine Learning’. It includes information about implementation via various cloud-based ML services (AWS, Azure, IBM) as well as category information (for architects). You are welcome to read this whitepaper online or to download it if you prefer (linked to Slideshare source).
In this post I’ll summarize what I learned from running benchmark tests on virtual machines on the AWS Cloud with the Aerospike team and also as I validated their test results independently. I’ll also discuss benchmarking techniques & results for this particular set of test databases. In the process of validating benchmarks, I learned many broadly applicable AWS-specific EC2 benchmarking practices that I will include.
I tested two NoSQL databases – Aerospike and Redis. Both databases are known for speed and are often used for caching or as fast key value stores via in-memory implementation. Aerospike is built to be extremely fast by leveraging SSDs for persistence and to be very easy to scale. By contrast, Redis is built primarily as a fast in memory store.
Aerospike is multithreaded and Redis is single threaded. For the benchmark tests, I compared both as simple key-value stores. To fairly compare, I needed to scale out Redis so that it uses multiple cores on each AWS EC2 instance. The way to do this is to launch several Redis servers and shard the data among these servers.
Benchmark Results — TL; DR – at scale Aerospike wins
As I compared both databases at scale, I found a key differentiator to be manageability of sharding or scaling for each type of database solution.
About Redis Scaling:
About Aerospike Scaling:
Benchmark Testing on AWS — TL; DR – the devil is in the details
Although AWS is convenient and inexpensive to use for testing, cloud platforms like AWS typically demonstrate greater variability of results. The network throughput, disk speeds, etc. are more variable, and this may result in different throughput results for the tests when conducted in a different availability zone, at a different time of day or even within the same run of the test. Using AWS boundary containers, such as an AWS VPC and an AWS Placement Group, reduces this variability by a significant amount.
That being said, I found that reproducing vendor benchmarks on any public cloud requires quite a bit of attention to detail. The environment is obviously different than on premises. Also, beyond basic setup, performance-tuning techniques vary from those I’ve used on premises and also from cloud to cloud. In addition to covering the steps to do this, I’ve also included a long list of technical links at the end of this blog post.
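One simple way to put a number on this variability is to repeat a test several times and compare the spread of the results. Here is a minimal Python sketch, assuming you have collected one throughput figure per run. The sample values are hypothetical placeholders, not measured results.

```python
import statistics


def summarize_runs(tps_samples):
    """Summarize repeated benchmark runs (transactions per second)."""
    mean = statistics.mean(tps_samples)
    stdev = statistics.stdev(tps_samples)  # sample standard deviation
    # Coefficient of variation: spread relative to the mean throughput.
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Hypothetical throughput figures from three runs of the same test.
runs = [900_000, 1_000_000, 1_100_000]
summary = summarize_runs(runs)
print(f"mean={summary['mean']:.0f} TPS, cv={summary['cv']:.1%}")
```

A lower coefficient of variation across runs suggests the environment (VPC, placement group, time of day) is giving more repeatable results.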
Part 1: Getting Setup to Test on AWS – the Basics
Step 1 – Create an IAM AWS User account. I performed all of my tests as an authorized AWS IAM (non-root) user. It is of course a best practice for all use of any cloud to run as a least-privileged user, rather than root. On AWS via IAM there are permission templates, which make the creation of users and assignment of permissions quick and easy, so there is really no excuse to perform benchmark testing as a root user.
Step 2 – Select your EC2 AMI. For the first, most basic type of test, you’ll need to select, start and configure 3 AWS EC2 instances. There are a number of considerations here. In this post, the term “node” means a single EC2 instance and “shard” will mean a single Redis process acting as a part of a larger database service.
To get started, I used three of the same Amazon Linux AMIs. Each instance should use an HVM (hardware virtual machine) AMI, which is required for enhanced networking and maximum network throughput. Enhanced networking uses single root I/O virtualization (SR-IOV) and results in higher network performance (packets per second), lower latency and lower jitter.
I used Amazon Linux AMI version 2014.09, as shown below:
Step 3 – Select your AWS EC2 Machine Types. I chose the AWS R3 series of instances, since these were designed to be optimized for memory-intensive applications. Specifically, I used R3.8xlarge, which has 32 CPUs and 244 GB RAM, for the servers. On this instance type, enhanced networking (via HVM) should be enabled by default so long as you spin up your instances in an AWS VPC.
|Resource||Instance type||CPUs||RAM (GB)||SSD storage (GB)||ENIs||Network performance||Role|
|EC2 instance||R3.8xlarge||32||244||2 x 320||4||10 Gigabit||Redis server|
|EC2 instance||R3.8xlarge||32||244||2 x 320||4||10 Gigabit||Aerospike server|
|EC2 instance||R3.2xlarge||8||61||1 x 160||2||“High”||Database client|
Step 4 – Create an AWS Placement Group. As you prepare to spin up each EC2 instance, be sure to use AWS containers to simulate the ‘in the same rack’ proximity that you’d have if you were performing tests on premises. To simulate this as closely as possible and to minimize network latency on the AWS Cloud, I was careful to place the first set of EC2 instances in the same VPC, availability zone and placement group. The AWS documentation describes placement groups as follows: “A placement group is a logical grouping of instances within a single Availability Zone. Using placement groups enables applications to participate in a low-latency, 10 Gbps network.”
Step 5 – Start up your 3 EC2 instances. Be sure to place them in the same VPC, availability zone and placement group. Take note of both their external and internal IP addresses.
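If you prefer to script Steps 4 and 5 rather than use the console, here is a minimal Python sketch using boto3. The AMI ID, subnet ID, availability zone and group name are hypothetical placeholders, and the sketch assumes boto3 is installed and your IAM user's credentials are configured. The launch() function is defined but not called here.

```python
def build_launch_params(ami_id, instance_type, subnet_id, az, group_name, count=1):
    """Build run_instances arguments that pin instances to one subnet (VPC),
    one availability zone and one placement group."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "SubnetId": subnet_id,
        "Placement": {"AvailabilityZone": az, "GroupName": group_name},
    }


def launch(params, group_name):
    """Create a cluster placement group, then launch instances into it."""
    import boto3  # third-party: pip install boto3

    ec2 = boto3.client("ec2")
    ec2.create_placement_group(GroupName=group_name, Strategy="cluster")
    return ec2.run_instances(**params)


# Placeholder values; substitute your own AMI, subnet and zone.
server_params = build_launch_params(
    ami_id="ami-xxxxxxxx",
    instance_type="r3.8xlarge",
    subnet_id="subnet-xxxxxxxx",
    az="us-east-1a",
    group_name="benchmark-group",
    count=2,
)
```

Calling launch(server_params, "benchmark-group") would start both server instances in the same placement group; the client instance can be launched the same way with "r3.2xlarge".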
Step 6 – Connect to each of your instances. When you connect, you may also want to verify enhanced networking on each one by checking that the ixgbevf driver has been properly installed, as shown below:
Step 7 – Add more AWS ENIs: Even with Enhanced Networking, the network throughput is not enough to drive Aerospike and Redis to their capacity. To increase the network throughput, I added more network interfaces (ENIs) to each server. By using 4 ENIs on each r3.8xlarge EC2 instance, I reached high network throughputs where the database engines load the CPU cores to a significant degree (around 40%-60%).
Although you will add these ENIs (and also associate them with the EC2 instances for the servers and client), you will also need to perform additional configuration steps to get maximum throughput. These steps are described in the ‘AWS Performance Networking Tuning’ section of this post.
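One way to script this check is to read the driver name reported by `ethtool -i`. This is a minimal Python sketch, assuming ethtool is installed on the instance and the primary interface is named eth0. The sample output string below is illustrative, not captured from my test instances.

```python
import subprocess


def parse_driver(ethtool_output):
    """Extract the value of the 'driver:' line from `ethtool -i` output."""
    for line in ethtool_output.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "driver":
            return value.strip()
    return None


def interface_driver(interface="eth0"):
    """Run `ethtool -i <interface>` on the instance and return its driver."""
    result = subprocess.run(
        ["ethtool", "-i", interface],
        capture_output=True, text=True, check=True,
    )
    return parse_driver(result.stdout)


# Illustrative output only; the version shown is a placeholder.
sample_output = "driver: ixgbevf\nversion: 2.14.2\n"
print(parse_driver(sample_output) == "ixgbevf")
```

On an instance, `interface_driver("eth0") == "ixgbevf"` indicates that the enhanced networking driver is in use for that interface.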
Also when connecting from the client to the server for testing, I used the internal IP address to utilize the containment that I had so carefully set up. Shown below is a simple diagram of this process.
Part 2: Installing the Databases and Testing the Benchmark Tools
This first ‘test’ is purposefully simple and isn’t really designed to test either database at capacity. Rather, it’s a kind of “Hello World” or “smoke test” designed to test your testing environment. Benchmark 1 tests with a single node for each database server and keeps all data in memory only, i.e. no data is persisted to disk. To proceed you perform the following steps:
Shown below is sample output from the Aerospike benchmark tool:
Shown below is sample output from the native Redis benchmark tool:
Part 3a: Run Tests -> Benchmark Test 1 – Single node, no persistence
For this first benchmark, I tested the performance of both Aerospike and Redis as a completely RAM-based store. To get a more realistic result than just running the benchmark in a ‘plain vanilla’ configuration, you will want to compensate for architectural differences in the products. Aerospike is multithreaded and will use all available cores (which in our case is 32 per server instance), while Redis is single-threaded. To fairly compare, I launched multiple instances of Redis and sharded the data manually. Shown below is a visualization of this process.
The first diagram shows this process for Aerospike:
All clients must run with all the shards configured. Otherwise, the partitioning of keys will break down. Because of this, all the benchmark clients should send traffic to all the Redis servers.
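To see why every client needs the full shard list, here is a minimal Python sketch of client-side key partitioning. The CRC32-modulo scheme and the host address are my own illustrative assumptions, not necessarily how the benchmark tool partitions keys. It assumes 12 shards on one host, with ports increasing serially from the Redis default of 6379.

```python
import zlib


def shard_endpoints(host, base_port, shard_count):
    """List (host, port) pairs for shards on serially increasing ports."""
    return [(host, base_port + i) for i in range(shard_count)]


def shard_for_key(key, endpoints):
    """Pick a shard from the key alone, so every client that has the same
    endpoint list routes a given key to the same Redis process."""
    index = zlib.crc32(key.encode("utf-8")) % len(endpoints)
    return endpoints[index]


# Hypothetical internal IP address for the Redis server instance.
endpoints = shard_endpoints("10.0.0.10", 6379, 12)
print(shard_for_key("user:1001", endpoints))
```

If one client were configured with only some of the shards, len(endpoints) would differ and the same key would map to a different process, which is the breakdown described above.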
The diagram below shows this process for Redis:
Here is the process to add Redis shards:
The next set of considerations is around mitigating the network bottleneck that you will encounter when testing these high performance databases with the default number of ENIs (network interfaces). Here is where you will want to further ‘tune’ those additional ENIs that we created when we set up the instances by configuring IRQ and Process affinity manually. The next section details this process.
AWS Networking Performance Tuning
To configure your environment you perform the following steps:
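As a rough illustration of what setting IRQ and process affinity involves on Linux, here is a minimal Python sketch. It assumes root access on the instance. Any IRQ numbers, process IDs and core assignments you pass in are your own choices, which you would look up in /proc/interrupts and your process list; none are taken from my test setup.

```python
import os


def cpu_mask(cpus):
    """Return the hex bitmask format that /proc/irq/<n>/smp_affinity expects."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def pin_irq(irq, cpus):
    """Restrict a network interrupt to the given cores (requires root)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as handle:
        handle.write(cpu_mask(cpus))


def pin_process(pid, cpus):
    """Restrict a database process to the given cores (Linux only)."""
    os.sched_setaffinity(pid, set(cpus))


# Example mask: cores 0 through 3 reserved for interrupt handling.
print(cpu_mask([0, 1, 2, 3]))
```

The general idea is to keep interrupt handling for each ENI on its own cores and pin the database processes to the remaining cores, so the two do not compete.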
Benchmark Tool Parameters
To test against sharded Redis servers, you must pass multiple hosts in the “-h” option of the benchmark tool. The ports are assumed to be serially increasing from the number specified in the “-p” option. Benchmark options used in the current tests were:
Aerospike is as fast as Redis with close to 1 MTPS for 100% read workloads on a single node on AWS R3.8xlarge with no persistence.
The default bottleneck in both cases is the network throughput of the instances. Adding ENIs helps to increase the TPS for both Aerospike and Redis. With proper network IRQ affinity and process affinity set, both reach close to 1 MTPS in the 100% read workload. The chart below shows the benchmark test 1 results.
Part 3b: Run Tests -> Benchmark Test 2 – Single Node, with Persistence
In this scenario, persistent storage was introduced. All of the data was still in memory but was also persisted on EBS SSD (gp2) storage.
For Aerospike, a new namespace was configured for this case using the “data-in-memory” config parameter, which specifies that the storage files will only be read when restarting the instance. To avoid the bottleneck caused by writing to a single file, Aerospike was configured to write to 12 different data file locations (to create the same environment as the 12 files written by the 12 Redis shards).
The append-only file persistence option (AOF) was used to test with Redis. When the AOF file reaches a certain size, Redis compacts the file by reading the data from memory (background rewriting of the AOF). While this was taking place, there were periods when Redis throughput dropped. To avoid these outlier numbers, I set the auto-aof-rewrite-min-size parameter to a large size so that the rewrites were not triggered while the benchmark was being run. These changes favorably overstate Redis performance.
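For reference, here is a minimal Python sketch that renders the AOF-related settings for one Redis shard as redis.conf text. The 100gb threshold and the per-port file name are illustrative assumptions, not the values from my tests. The only requirement is that the threshold is larger than the AOF can grow during a run.

```python
def redis_shard_conf(port, rewrite_min_size="100gb"):
    """Render AOF settings for one Redis shard as redis.conf directives."""
    lines = [
        f"port {port}",
        "appendonly yes",  # turn on append-only file persistence
        f"appendfilename appendonly-{port}.aof",  # one AOF file per shard
        # Keep the threshold above any size reached during the benchmark,
        # so no automatic background rewrite starts mid-run.
        f"auto-aof-rewrite-min-size {rewrite_min_size}",
    ]
    return "\n".join(lines) + "\n"


conf_text = redis_shard_conf(6379)
print(conf_text)
```

Generating one such file per shard port keeps the 12 shards consistent with each other.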
As shown in the chart above, Aerospike is slightly faster than Redis for 100/0 and 80/20 read/write workloads against a single node backed by EBS SSD (gp2) storage for persistence.
I ran the test against 12 Redis shards on a single machine with 4 ENIs. In this scenario, it was the disk writes which were the bottleneck. The number of client threads was reduced for both Aerospike and Redis, to keep write errors to zero.
It is important to note that Aerospike handles rewrites of the data using a block interface, rather than appending to a file. It uses a background job to rewrite the data. The throughput numbers presented above are a good representation of the overall performance. However, when using a persistence file, Redis must occasionally rewrite the data from RAM to disk in an AOF rewrite. During these times peak throughput is reduced. The throughput results above do not take AOF rewrites into account.
The effects of AOF rewrites should not be underestimated. In the above charts, I configured Redis to not do this, since it is difficult to measure the steady state performance of the database during this time. However, it is important to understand its effects since this may impact your production system. The chart below shows how Redis performs during one example of an AOF rewrite. Notice that both the read and write performance vary during the rewrite.
I’ve been doing some work lately with the super fast in-memory database, Aerospike. See previous blog posts here about the speed of this product. Since I started work with Aerospike, the team there has announced that their core product is now open source.
In this blog post, I’ll be covering how to get started developing with Aerospike. There are a couple of considerations as you begin. The first consideration is where you want to host your Aerospike cluster (which can be a single node for initial testing). Aerospike itself runs only on Linux at this time, i.e. not Windows, Mac, etc… So you have two options for hosting the server – either on the cloud or using virtualization software on your local development machine. I did extensive testing on both methods and prefer to use a cloud-hosted instance at this time. I will cover the process to do both types in this blog post.
The second consideration is which client language/library you prefer to use. Because there is a good amount of information already available on using Java with Aerospike, I will cover using both Python and .NET (C#) here. As of this writing, Aerospike has client libraries for C/C++, Java, C#, Node.js, C libevent, PHP, Erlang, Python and Perl.
So let’s get started….
I’ve tested two cloud configurations. The first is Google Cloud using Google Compute Engine (I documented the install steps for Aerospike when using this method in a previous blog post – here). The Google Developers Console with a GCE instance hosting Aerospike is shown below. You’ll note that I used an ‘n1-standard-2’ machine type for my testing – 2 CPUs and 7.5 GB RAM. If you are new to the Google Cloud, you can use the code ‘gde-in‘ at this link to get $500 usage credit as well.
I also tested using an AWS Amazon Machine Image (EC2 service) linked here. This AMI has Aerospike pre-installed. Both methods are simple and quick for initial testing. To use this method, simply spin-up the image linked (and shown below in the screenshot) via AWS.
Tip: Be sure to include a firewall rule to open port 3000 for testing your client connectivity for either cloud configuration.
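A quick way to confirm the firewall rule is working before you involve the client library is to attempt a plain TCP connection to port 3000. Here is a minimal Python sketch using only the standard library. The host you pass in is your own instance's address.

```python
import socket


def port_is_open(host, port=3000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Usage: port_is_open("<external IP of your instance>")
```

A False result usually points at the firewall rule (or security group) rather than at Aerospike itself, so it is worth checking this first.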
If you prefer to install Aerospike locally, you can use the instructions found on Aerospike’s web site to do so. They have instructions and links to Vagrant files (wrappers) for Oracle’s Virtual Box so that you can quickly download and start an Aerospike image. Because you are using virtualization technology to host Aerospike itself, you can install this on your local machine with any OS.
If you choose this route, be sure to follow the instructions exactly as listed on the Aerospike site, as there are a number of configuration steps and each must be done in the order listed. I tested both the Mac and Windows instructions. There are 9 install steps for each type, the Mac install steps are shown below.
Part One of install instructions for Mac
Part Two of install instructions for Mac
If you are attempting to install on a Windows machine, be sure to verify that your installation of Virtual Box is able to use hardware virtualization, VT-x or AMD-V (look in the Settings tab for Virtual Box), as some Windows machines may need BIOS settings updated in order for this to be possible.
After you’ve set up and tested your Aerospike server, the next step is to select your client library. First I’ll cover using the Python client library.
The instructions provided by Aerospike worked just fine for me — with one small exception as I tested on my Mac. The exception is for the last command (‘sudo pip install aerospike‘) – if you do not have ‘pip’ installed on your Mac, then just run ‘sudo easy_install pip‘ from Terminal to install it.
On the first page of the Aerospike Python client manual there is a complete sample Python file that you can use to quickly test your client connectivity by inserting a record via the put() method. You can see this file in Sublime in my environment in the screenshot below.
This sample is designed to connect to a local installation of Aerospike. If you are using a cloud-hosted installation, then just change the IP address (shown in line 6) to the external IP address for your hosted instance. Also in the config (line 6) is the default port of 3000.
You may also notice that Aerospike records are addressed via the pattern of namespace, set and key (shown on line 13). You call the put() method (line 16) to write a record and, optionally, the get(key) method (line 22) to read a record.
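For readers who can't see the screenshot, here is a minimal Python sketch of the same connect, put and get flow. It is not the sample file from the Aerospike manual, so the line numbers mentioned above do not apply to it. The namespace ‘test’, set ‘demo’, key and bin values are illustrative assumptions, and it assumes the aerospike package is installed and a server is reachable.

```python
def make_config(host="127.0.0.1", port=3000):
    """Client config: a list of (address, port) seed hosts."""
    return {"hosts": [(host, port)]}


def make_key(namespace, set_name, user_key):
    """Records are addressed by a (namespace, set, key) tuple."""
    return (namespace, set_name, user_key)


def write_and_read(config, key, bins):
    """Write one record with put(), then read it back with get()."""
    import aerospike  # third-party: pip install aerospike

    client = aerospike.client(config).connect()
    try:
        client.put(key, bins)
        _key, _meta, record = client.get(key)
        return record
    finally:
        client.close()


# For a cloud-hosted instance, pass its external IP to make_config().
config = make_config()
key = make_key("test", "demo", "user1")
# write_and_read(config, key, {"name": "Ada", "visits": 1})
```

The call on the last line is commented out so the file can be loaded without a running server; uncomment it once your instance is up and port 3000 is reachable.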
I found the easiest way to verify record insertion while testing was to use the web console. A sample of the output (with several test records inserted) is shown below on the Definitions tab of the console. The URL format for the web console for a remote installation (including cloud) is as follows:
The example above shows the first IP address being the external IP address for the remotely hosted instance and the second IP address being the internal IP address for that instance.
If you are using a local instance, then the default URL for the console is shown below:
There are also more complete sample files for working with Aerospike using Python to be found on Github – here. Support for Python in Aerospike is relatively new and the team is also asking for your feedback if you use this library.
The C# client library (here) is quite rich and conveniently includes a test harness (Windows Form application) that allows you to easily connect and test the Aerospike API. Additionally, the sample code includes a benchmark test harness that I found useful.
I tested the library on a Windows 8 machine with Visual Studio 2012 and it built with no issues. I then connected to an instance (local shown in the screenshot, but in reality, I connected to a cloud-based instance) and was happy to explore the API via this well-written test harness shown below. You’ll notice that the sample includes code for all types of activity, i.e. put, get, append, prepend, batch, etc., and also that the sample includes code for asynchronous processes.
In addition the sample includes a benchmarking tool, which makes it simple for me to test and benchmark on various vendor clouds (in this case AWS vs Google Cloud). An example of the benchmark application that is part of the C# sample client is also shown below.
Just for completeness, I’ll include a screenshot of C# sample code in Visual Studio. You can see there are two projects, AerospikeClient and AerospikeDemo (the latter is set to be the start-up project). The AerospikeDemo project contains the code for the test harness (Windows Form) shown in the previous screenshot. Shown below is the source file ‘Operation.cs‘ from the /Main directory. Here you can get a sense of core database operations, put, get, etc… which take Bin objects (each Bin is a column name/value pair).
I’ll close by reminding you why I am so interested in trying out Aerospike. You’ll remember, the core product is now free and open source. Also, it has the potential for relevant BigData scenarios to cost much less than other storage methods that can scale to this speed and size. I have found a graphic from Aerospike’s site to be compelling and accurate.
What’s your experience like? If you try out Aerospike, let me know (in the comments section below) how it goes for you.
Fun addition for those of you who have read this far…