Spring and Summer Speaking Schedule for Lynn Langit

Spring and Summer 2015 speaking schedule

Gigaom Structure Data

In March

  • Lynda.com – recording a new course – here’s a link to my current courses (‘Hadoop Fundamentals’ and ‘Learning Visual Programming with Kodu’)
  • GigaOm Structure Data conference in NYC (March 17-19) – link here, speaking on the SaaS BI Solution Panel
  • TKP in NYC (March 19-20) – for private teacher training event, teaching teachers to teach kids Java

In April 

  • SQL Saturday 389 (April 11) in Huntington Beach, CA – link here
  • SQL PASS Business Analytics (April 20-22) in Santa Clara, CA – link here – speaking on ‘Premium Data for Analysts’ (Dun & Bradstreet is currently surveying potential customers – take survey here ) and also presenting a preCon ‘3 Tools per Hour – 24 FREE Tools for Analysts’

In May

  • DevIntersection Conference (May 19-21) in Scottsdale, AZ – link here- speaking on ‘The Azure Data Story’ and ‘AWS vs Azure – for the Architect’ (w/Michele Bustamante)

In June

  • Norwegian Developer’s Conference (June 15-19) in Oslo, Norway – link here-  speaking on ‘AWS vs. Azure for the Architect’ (w/ Michele Bustamante) and on other cloud topic TBD.

In July

  • Launching new TKP courseware (all month) in San Diego, CA – training teachers and teachers on core #TKPJava and #IoTDataScience
Posted in Uncategorized | Tagged , , , , | Leave a comment

Lessons Learned – Benchmarking NoSQL on the AWS Cloud (AerospikeDB and Redis)

AWS EC2

AWS EC2

In this post I’ll summarize what I learned from running benchmark tests on virtual machines on the AWS Cloud with the Aerospike team and also as I validated their test results independently. I’ll also discuss benchmarking techniques & results for this particular set of test databases. In the process of validating benchmarks, I learned many broadly applicable AWS-specific EC2 benchmarking practices that I will include.

I tested two NoSQL databases – Aerospike and Redis. Both databases are known for speed and are often used for caching or as fast key value stores via in-memory implementation. Aerospike is built to be extremely fast by leveraging SSDs for persistence and to be very easy to scale. By contrast, Redis is built primarily as a fast in memory store.

Aerospike is multithreaded and Redis is single threaded. For the benchmark tests, I compared both as simple key-value stores. To fairly compare, I needed to scale out Redis so that it uses multiple cores on each AWS EC2 instance. The way to do this is to launch several Redis servers and shard the data among these servers.

Benchmark Results — TL; DR – at scale Aerospike wins

As I compared both databases at scale, I found a key differentiator to be manageability of sharding or scaling for each type of database solution.

Redis

Redis

About Redis Scaling:

  • You must manage sharding yourself, by coming up with a sharding algorithm that evenly balances data between the shards.
  • Some of the Java clients (such as Jedis) do this for you, but you must check to make sure that the data is properly balanced.
  • If you wish to increase the number of Redis shards in order to increase throughput or data volume, you will have to refactor the sharding algorithm and rebalance the data. This usually results in a downtime.
  • If you need replication and failover, you will need to synchronize the data yourself, at the application layer.
Aerospike

Aerospike

About Aerospike Scaling:

  • Aerospike handles the equivalent of sharding automatically.
  • There is no downtime. You can add new capacity (data volume and throughput) dynamically. Simply add additional nodes and Aerospike will automatically rebalance the data and traffic.
  • You set the replication factor to 2 or more and configure Aerospike to operate with synchronous replication for immediate consistency.
  • If a server goes down, Aerospike handles fail-over transparently; the Aerospike client makes sure that your application can read or write replicated data automatically.
  • You can run purely in RAM, or take advantage of Flash/SSDs for storage. Using Flash/SSDs results in a slight penalty on latency (~0.1ms longer on average), but allows you to scale with significantly smaller clusters and at a fraction of the cost.

Benchmark Testing on AWS — TL; DR  – the devil is in the details

Although AWS is convenient and inexpensive to use for testing, cloud platforms like AWS typically, demonstrate greater variability of results. The network throughput, disk speeds, etc are more variable and this may result in different throughput results for the tests when conducted in a different availability zone, at a different time of day or even within the same run of the test. Using AWS boundary containers, such as an AWS VPC and an AWS Placement Group reduces this variability by a significant amount.

That being said, I found that reproducing vendor benchmarks on any public cloud requires quite a bit of attention to detail. The environment is obviously different that on premises. Also beyond basic set up, performance-tuning techniques vary from those I’ve used for on premise and also from cloud-to-cloud solutions. In addition to covering the steps to do this, I’ve also included a long list of technical links at the end of this blog post.

Part 1: Getting Setup to Test on AWS – the Basics

Step 1 –  Create an IAM AWS User account. I performed all of my tests as an authorized AWS IAM (non root) user. It is of course a best practice for all use of any cloud to run as least privileged user, rather than root. On AWS via IAM there are permission templates, which make the creation of users and assignment of permission quick and easy, and there is really no excuse to perform benchmark testing as a root user.

Step 2 – Select your EC2 AMI. For the first, most basic type of test, you’ll need to select, start and configure 3 AWS EC2 instances. There are a number of considerations here. In this post, the term “node” means a single EC2 instance and “shard” will mean a single Redis process acting as a part of a larger database service.

To get started, I used three of the same Amazon Linux AMIs. Each instance should be capable of having HVM enabled for maximum network throughput. HVM provides enhanced networking, it uses single root I/O virtualization (SR-IOV) and results in higher network performance (packets per second), lower latency and lower jitter.

I used Amazon Linux AMI version 2014.09, as shown below:

AWS EC2 Linux instance

AWS EC2 Linux instance

Step 3 – Select your AWS EC2 Machine Types. I chose the AWS R3 series of instances, since these were designed to be optimized for memory intensive applications. Specifically I used R3.8xlarge which has 32 CPUs and 244 GB RAM for the servers. On this instance type, HVM should be enabled by default so long as you spin up your instances in an AWS VPC.

AWS Component Type CPUs RAM SSD ENIs Network Use
EC2 instance R3.8xlarge 32 244 2 x 320 4 10 Gigabit Redis server
EC2 instance R3.8xlarge 32 244 2 x 320 4 10 Gigabit Aerospike server
EC2 instance R3.2xlarge 8   61 1 x 160 2 “High” Database client

Step 4 – Create an AWS Placement Group. As you prepare to spin up each EC2 instance be sure to use AWS containers to simulate the ‘in the same rack’ proximity that you’d have if you were performing tests on premise. In order to exactly simulate this and to minimize network latency on the AWS Cloud, I was careful to place the first set of EC2 instances in the same VPC, availability zone and placement group. About AWS Placement groups from AWS documentation “A placement group is a logical grouping of instances within a single Availability Zone. Using placement groups enables applications to participate in a low-latency, 10 Gbps network.”

Step 5 –  Startup your 3 EC2 instances. Be sure to place them in the same VPC, availability zone and placement group. Take note of both their external and internal IP addresses.

Step 6 –  Connect to each of your instances. When you connect you may also want to verify HVM for each one, to do so run this command to verify that the ixgbevf driver has been properly installed as shown below:

Ethtool

Ethtool

Step 7 – Add more AWS ENIs: Even with Enhanced Networking, the network throughput is not enough to drive Aerospike and Redis to their capacity. To increase the network throughput I added more network interfaces or ENIs to each server. By using 4 ENIs on each r3.8xlarge EC2 instance I reached high network throughputs where the database engines load the CPU cores to a significant amount (around 40%-60%).

Although you will add these ENIs (and also associate them with the EC2 instances for the servers and client, you will also need to perform additional configuration steps to get maximum throughput. These steps are described in the ‘AWS Performance Networking Tuning’ section of this post.

Also when connecting from the client to the server for testing, I used the internal IP address to utilize the containment that I had so carefully set up. Shown below is a simple diagram of this process.

AWS Architecture

AWS Architecture 

Part 2: Installing the Databases and Testing the Benchmark Tools

This first ‘test’ is purposefully simple and isn’t really designed to test either database at capacity, rather it’s a kind of “Hello World” or “smoke test” designed to test your testing environment. Benchmark 1 tests with a single node for each database server and keep all data in memory only, i.e. no data is persisted to disk. To proceed you perform the following steps:

  • Step 1 – Install Redis (2.8.17) on one EC2 R3.8xlarge instance
  • Step 2 – Install Aerospike (3.3.21 Community) on another EC2 R3.8xlarge instance
  • Step 3 – Install Client Tools – Aerospike Java client (3.0.30) and Redis Jedis client (2.6.0) on a EC2 R3.2xlarge instance
  • Step 4 – Install the Benchmark Tools on the client – I used 3 tools – the Aerospike and Redis-version (of Aerospike) benchmarking tool and also the native Redis benchmarking tool. Benchmark test results for Redis were roughly the same using either the Aerospike benchmark tool for Redis or the native Redis tool.
  • Step 5 – Test run the benchmarks – at this point you are only trying to verify that you’ve installed the server(s), client and benchmarking tools correctly. You will not get highest level benchmark results until AFTER you perform the additional AWS performance tweaks, listed in the next section. You can see the type of activity (read or write) in the first column, then going across, and the latency for operations in ms by percentage of operations. I’ve highlighted the total transactions per second in the red circle in the first sample output below.

Shown below is sample output from the Aerospike benchmark tool:

Aerospike Performance Tool

Aerospike Performance Tool 

Shown below is sample output from the native Redis benchmark tool:

Redis Performance Tool

Redis Performance Tool

Part 3a: Run Tests -> Benchmark Test 1 – Single node, no persistence

For this first benchmark, I tested the performance of both Aerospike and Redis as a completely RAM-based store. To get a more realistic result that just running the benchmark in a ‘plain vanilla’ configuration, you will want to compensate for architectural differences in the products. Aerospike is multithreaded and will use all available cores (which in our case is 32 per server instance), while Redis is single-threaded. To fairly compare, I launched multiple instances of Redis and sharded the data manually. Shown below is a visualization of this process.

The first diagram shows this process for Aerospike:

Aerospike Architecture

Aerospike Architecture

All clients must run with all the shards configured. Otherwise, the partitioning of keys will break down. Because of this, all the benchmark clients should send traffic to all the redis servers.

The diagram below shows this process for Redis:

Redis Architecture

Redis Architecture

Here is the process to add Redis shards:

  • Run redis-*/utils/install-server.sh and use sequentially increasing port numbers (up from 6379)
  • Change conf file for every server (/etc/redis/6380.conf) and comment out the “save” lines to disable persistence
    for i in {6387..6394}; do sed -i ‘s/^save/#save/’ $i.conf; done
  • Restart the redis instances
  • Make sure you “FLUSHALL” in redis-cli for all redis instances
    for i in {6379..6386}; do redis-cli -p $i FLUSHALL; done
  • After keys have been inserted, confirm that they are equally distributed among the shards using “INFO” command in redis-cli. The last line in the output shows the number of records.
    for i in {6379..6386}; do redis-cli -p $i INFO | grep “keys=”; done

The next set of considerations is around mitigating the network bottleneck that you will encounter when testing these high performance databases with the default number of ENIs (network interfaces). Here is where you will want to further ‘tune’ those additional ENIs that we created when we set up the instances by configuring IRQ and Process affinity manually. The next section details this process.

AWS Networking Performance Tuning

  • IRQ affinity: Both Aerospike and Redis are extremely fast in their in-process operations of getting and setting data items. To this end, it is beneficial to dedicate CPU cores to handle just the network IRQs. Each network interface has 2 IRQs, which can be found by:

    for i in {0..3}; do echo eth$i; grep eth$i-TxRx /proc/interrupts | awk ‘{printf ” %s\n”, $1}'; done

    Now, we can assign one CPU core for processing interrupts on each IRQ by changing the smp_affinity values of the IRQs.

    echo 1 > /proc/irq/259/smp_affinity
    echo 2 > /proc/irq/260/smp_affinity

  • Process affinity: The kernel on any of the CPU cores may schedule the Aerospike and Redis processes and indeed they may switch around different cores in the course of a single run. It has been empirically found that pinning the processes to a set of CPU cores usually results in better performance. Specifically, keeping the CPU cores handling the network IRQ isolated from the database processes is beneficial. This is accomplished by using the taskset command. To make Aerospike use CPU cores 8 to 31:

    # taskset -ap FFFFFF00 <PID of asd>

    In the case of the many sharded Redis processes, this takes many taskset invocations – one for each Redis server. As Redis is single threaded, assigning one Redis server for each CPU core is a good idea.

To configure your environment you perform the following steps:

  • Step 1 – Install multiple instances of Redis on your Redis server instance
  • Step 2 – Verify that you have the correct number of ENIs associated to your server(s)
  • Step 3 – Spin up additional client instances – for this test, I used 4 instances
  • Step 4 – Install Client Tools and Benchmark Tool on each client instance
  • Step 5 – Assign CPUs to IRQs as per the directions in the previous section (smp_affinity)
  • Step 6 – Pin processes to CPUs as per the directions in the previous section (taskset)
  • Step 7 – Set the default RAM for the “test” Aerospike namespace to 10 GB
  • Step 8 – Set each Redis instance to disable snapshotting by commenting out the “save” parameters in the config file.
  • Step 9 – Run the benchmark tool (using parameters) and compare the tuned benchmarks

Benchmark Tool Parameters

The multiple hosts in the “-h” option of the benchmark tool must be used to test against sharded Redis servers. The ports are assumed to be serially increasing from the number specified in the “-p” option. Benchmark options used in the current tests were:

  • 10 million keys (-k 10000000)
  • 100 byte string values for each key (-o S: 100)
  • Three different read-write loads:
    • 50%-50% read/write (-w RU, 50)
    • 80%-20% read/write (-w RU, 80)
    • 100% read (-w RU, 100)
  • Before every test run, data was populated using the insert workload option (-w I)
  • For each client instance, 90 threads give the maximum throughput in the in-memory database case (-z 90). This was reduced to a lower number in case of persistence tests (Benchmark Test 2) to avoid write errors due to disk choking.

Benchmark Test 1 – Results

Aerospike is as fast as Redis with close to 1 MTPS for 100% read workloads on a single node on AWS R3.8xlarge with no persistence.

The default bottleneck in both cases is the network throughput of the instances. Adding ENIs helps to increase the TPS for both Aerospike and Redis. With proper network IRQ affinity and process affinity set, both reach close to 1 MTPS in the 100% read workload. The chart below shows the benchmark test 1 results.

Benchmark Test 1 Results

Benchmark Test 1 Results

Part 3b: Run Tests ->Benchmark Test 2 – Single Node, with Persistence

In this scenario, persistent storage was introduced. All of the data was still in memory but was also persisted on EBS SSD (gp2) storage.

For Aerospike a new namespace was configured for this case. The “data-in-memory” config parameter was used. To avoid the bottleneck caused by writing to single file, Aerospike was configured to write to 12 different data file locations (to create the same environment as the 12 files written by the 12 Redis shards.) This configuration specifies that the storage files will only be read when restarting the instance.

The append-only file persistence option (AOF) was used to test with Redis. When a certain size of the AOF file is reached, Redis compacts the file by reading the data from memory (background rewriting AOF). When this was taking place, there are periods when Redis throughput dropped. To avoid these outlier numbers, I kept the auto-aof-rewrite-min-size parameter to a large size so that the rewrites were not triggered while the benchmark was being run. These changes favorably overstate Redis performance.

Benchmark Test 2 Results

Benchmark Test 2 Results

Benchmark Test 2 – Results

As shown in the chart above, Aerospike is slightly faster than Redis for 100/0 and 80/20 read/write workloads against a single node backed by EBS SSD (gp2) storage for persistence.

I ran the test against 12 Redis shards on a single machine with 4 ENIs.. In this scenario, it was the disk writes which were the bottleneck. The number of client threads was reduced for both Aerospike and Redis, to keep write errors to zero.

It is important to note that Aerospike handles rewrites of the data using a block interface, rather than appending to a file. It uses a background job to rewrite the data. The throughput numbers presented above are a good representation of the overall performance. However, when using a persistence file, Redis must occasionally rewrite the data from RAM to disk in an AOF rewrite. During these times peak throughput is reduced. The throughput results above do not take AOF rewrites into account.

The effects of AOF Rewrites should not be underestimated. In the above charts, I configured Redis to not do this, since it is difficult to measure the steady state performance of the database during this time. However, it is important to understand its effects since this may impact your production system. The chart below shows how Redis performs during one example of an AOF rewrite. Notice that both the read and write performance varies during the rewrite.

Redis AOF

Redis AOF

References

Posted in Big Data, Cloud, noSQL | Tagged , , | Leave a comment

Learning GitHub – A Multi-part Screencast series

Commit Yourself - Learning GitHub

I struggled through various aspects of GitHub when I first started using it about 2 years ago.  Now it’s become part of my daily routine.  Lately, I’ve gotten more and more requests to answer “What am I sure is a stupid question” from friends.  To that end, I decided to create a short screencast series in which I will show the practicalities of using GitHub.

Commit Yourself and Enjoy!

Part 1 – What is GitHub?

Part 2 – Getting Started with GitHub – Users and Repositories

Part 3 – Working with Repositories – working with push, pull and clone; also performing basic commits to both local and remote repos

Part 4 – Handling Conflict – working with repo branches and forks; performing merges

Part 5 – Bonus – understanding and using data from your Repositories

Posted in Uncategorized | Leave a comment

Real-world Predictive Analytics with Power BI and Predixion Software

Here’s the deck from my talk on this topic at VSLive in Redmond this week.

While there I demo’d Predixion Software’s next release (which is in private beta at this time).  This release includes many new features, the most interesting of which to me are the new visualization dashboard and enhancements to the Insight Workbench around Data Visualization.  I snipped a couple of screens from these feature sets and am adding them below.

First  – Data Exploration in Insight Workbench

Predixion Data Exploration

Next  – New Visualization Dashboard for the browser

Predixion Browser Visualizer

happy programming!

 

Posted in Microsoft | Tagged , | Leave a comment

Updated: AWS for the SQL Server Professional

Updated core deck (includes video at end with demos) on AWS services of interest for the SQL Server professional – enjoy!

Posted in AWS, Cloud, Data Science | Tagged | Leave a comment

How to: Developing for Aerospike with Python or C#

I’ve been doing some work with the super fast in-memory database,  Aerospike lately.  See previous blog posts here about the speed of this product.  Since I’ve started work w/Aerospike, the team there has announced that their core product is now open source.

In this blog post, I’ll be covering how to get started developing with Aerospike.  There are a couple of considerations as you begin.  The first consideration is where you want to host your Aerospike cluster (which can be a single node for initial testing).  Aerospike itself runs only on Linux at this time, i.e not Windows, Mac, etc…  So you have two options for hosting the server – either on the cloud or using virtualization software on your local development machine.  I did extensive testing on both methods and prefer to use a cloud-hosted instance at this time.  I will cover the process to do both types in this blog post.

Developing with AerospikeThe second consideration is which client language/library you prefer to use.  Because there is a good amount of information already available on using Java with Aerospike, I will cover using both Python or .NET (C#) here.  As of this writing, Aerospike has client libraries for C/C++, Java, C#, Node.js, C libevent, PHP, Erlang, Python and Perl.

So let’s get started….

Installing Aerospike on the Cloud

I’ve tested two configurations – Google Cloud using Google Compute Engine (I documented the install steps for Aerospike when using this method in a previous blog post – here), the Google Developers Console with a GCE instance hosting Aerospike is shown below.  You’ll note that I used a ‘n1-standard-2′ image for my testing – 2 CPUs and 7.5 GB RAM.  If you are new to the Google Cloud, you can use the code ‘gde-in‘ at this link to get $ 500 usage credit as well.

Aerospike on GCE

I also tested using an AWS Amazon Machine Image (EC2 service) linked here.  This AMI has Aerospike pre-installed.  Both methods are simple and quick for initial testing.  To use this method, simply spin-up the image linked (and shown below in the screenshot) via AWS.

Aerospike on AWS

Tip: Be sure to include a firewall rule to open port 3000 for testing your client connectivity for either cloud configuration.

Installing Aerospike locally using Virtual Box

If you prefer to install Aerospike locally, you can use the instructions found on Aerospike’s web site to do so.  They have instructions and links to Vagrant files (wrappers) for Oracle’s Virtual Box so that you can quickly download and start an Aerospike image.  Because you are using virtualization technology to host Aerospike itself, you can install this on your local machine with any OS.

If you choose this route, be sure to follow the instructions exactly as listed on the Aerospike site, as there are a number of configuration steps and each must be done in the order listed.  I tested both the Mac and Windows instructions.  There are 9 install steps for each type, the Mac install steps are shown below.

Part One of install instructions for Mac

Mac install P1

Part Two of install instructions for Mac

Mac install P2

 

If you are attempting to install on a Windows machine, be sure to verify that your installation of Virtual Box is able to use Hardware Virtualization Vt-x and AMD-V (look in the Settings tab) for Virtual Box, as some Windows machines may need BIOS settings updated in order for this to be possible.

Using Client Libraries with Aerospike

After you’ve set up and tested your Aerospike server, the next step is to select your client library.  First I’ll cover using the Python client library.

Using Python

The instructions provided by Aerospike worked just fine for me — with one small exception as I tested on my Mac.  The exception is for the last command (‘sudo pip install aerospike‘) – if you do not have ‘pip’ installed on your Mac, then just run ‘sudo easy_install pip‘ from Terminal to install it.
Python Library for Aerospike

On the first page of the Aerospike Python client manual there  is a complete sample Python file that you can use to quickly test your client connectivity by inserting a record via the put() method.  You can see this file in Sublime in my environment in the screenshot below.

This sample is designed to connect to a local installation of Aerospike.  If you are using a cloud-hosted installation, then just change the IP address (shown in line 6) to the external IP address for your hosted instance.  Also in the config (line 6) is the default port of 3000.

You may also notice that Aerospike records are addresses via the pattern of namespace, set and key (shown on line 13).  You call the put() method (line 16) to write a record and, optionally, the get(key) method (line 22) to read a record.

Simple Python client for Aerospike

 

 

 

 

 

 

 

 

 

 

 

 

I found the easiest way to verify record insertion while testing was to use the web console.  A sample of the output (with several test records inserted) is shown below on the Definitions tab of the console.  The URL format for the web console for a remote installation (including cloud) is as follows:

http://100.118.015.101:8081/#dashboard/100.118.015.101:3000/30/10.215.167.37:3000

The example above shows the first IP address being the external IP address for the remotely hosted instance and the second IP address being the internal IP address for that instance

If you are using a local instance, then the default URL for the console is shown below:
http://localhost:8081/#dashboard/127.0.0.1:3000/30/10.0.2.15:3000

Verify sample record insertion

There are also more complete sample files for working with Aerospike using Python to be found on Github – here.  Support for Python in Aerospike is relatively new and the team is also asking for your feedback if you use this library.

More Python samples

Using C#

The C# client library (here) is quite rich and conveniently includes a test harness (Windows Form application), that allows you to easily connect and test the Aerospike API.  Additionally the sample code, includes a benchmark test harness that I found useful.

CSharp for Aerospike

I tested the library on a Windows 8 machine with Visual Studio 2012 and it built with no issues.  I then connected to an instance (local shown in the screenshot, but in reality, I connected to a cloud-based instance) and was happy to explore the API via this well-written test harness shown below.  You’ll notice that the sample includes code for all types of activity, i.e. put, get, append, prepend, batch, etc.. and also that the sample includes code for asynchronous processes.

CSharp Test Harness Form

In addition the sample includes a benchmarking tool, which makes it simple for me to test and benchmark on various vendor clouds (in this case AWS vs Google Cloud).  An example of the benchmark application that is part of the C# sample client is also shown below.

Benchmark Aerospike AWS

Just for completeness, I’ll include a screenshot of C# sample code in Visual Studio.  You can see there are two projects, AerospikeClient and AerospikeDemo (the latter is set to be the start-up project).  The AerospikeDemo project contains the code for the test harness (Windows Form) shown in the previous screenshot.  Shown below is the source file ‘Operation.cs‘ from the /Main directory.  Here you can get a sense of core database operations, put, get, etc… which take Bin objects (each Bin is a column name/value pair).

Aerospike in Visual Studio

What’s Next

I’ll close by reminding you why I am so interested in trying out Aerospike.  You’ll remember, the core product is now free and open source. Also, it has the potential for relevant BigData scenarios to cost much less than other storage methods that can scale to this speed and size.  I have found a graphic from Aerospike’s site to be compelling and accurate.

Why Aerospike

What’s your experience like?  If you try out Aerospike, let me know (in the comments section below) how it goes for you.

Fun addition for those of you who have read this far…

Soon there will be...

Happy Programming!

 

 

Posted in Big Data, Cloud | Leave a comment

How to: Installing AerospikeDB on Google Compute Engine

Showing Aerospike at BigDataCampLA

Recently, I’ve been doing some work with AerospikeDB.  It is a super-fast in-memory NoSQL Database.  I gave a presentation at the recent BigDataCampLA on ‘Bleeding Edge Databases’ and included it because of impressive benchmarks, such as 1 Million TPS (read-only workload) PER SERVER and 40K TPS (read-write) on that same server.  Here’s the live presentation, also I did a screencast of this presentation.

In this blog post, I’ll detail how you can get started with the FREE community edition of AerospikeDB.  Again I’ll use Google Compute Engine as my platform of choice, due to the speed, ease of use and inexpensive cost for testing.  You’ll note from the screenshot below, that you can install the community edition on your own server, or on other clouds (such as AWS) as well.  I am writing this blog post because Aerospike didn’t have directions to get set up on GCE available prior to this blog post.

Aerospike Community Edition

Here’s a top level list of what you’ll need to do (below, I’ll detail each step) – I did the whole process start-to-finish in < 30 minutes.

    • Set up a Google Cloud project with Google Compute Engine (VM) API access
    • Spin up and configure a GCE instance
    • Install the Aerospike Community Edition, which runs on up to 2-nodes and can use up to 200 GB for your testing purposes
    • Run your tests and (optionally) add other nodes

Next I’ll drill into each of the steps listed above.  I’ll go into more detail and will provide sample commands for the Google Cloud test that I did.

Step One – Setup a Google Cloud project with Google Compute Engine access

If you are new to the Google Cloud, you’ll need to get the Google Cloud SDK for the command line utilities you’ll need to install and to connect to your cloud-hosted virtual machine.  There is a version of the SDK for Linux/Mac and also for Windows.

For this tutorial, I will be using Mac. There are only two steps to using the SDK:
a) From Terminal run

curl https://sdk.cloud.google.com | bash

b) Then restart Terminal and then run the command below from Terminal.  After it runs then a browser window will open, then click on your gmail account and then click on the ‘accept’ button and then login will complete in the terminal window

gcloud auth login

GCloud Authorization

If you already have a Google Cloud Project, then you can proceed to Step Two.  If you do not yet have a Google Cloud Project, then you will need to go to the Google Developer’s Console and create a new Project by clicking on the red ‘Create Project’ button at the top of the console.

Create Project

 

 

Note: Projects are containers for billing on the Google Cloud.  They can contain 1:M instances of each authorized service – in our case that would be 1:M instance of Google Compute Engine Virtual Machines.

To enable access to the GCE API in your project, click on the name of the project in Google Developer Console, then click on ‘APIS & AUTH’>’APIs’>Google Compute Engine “OFF” to turn the service availability to “ON”.  The button should turn green to indicate the service is available.

You will also have to enable billing under the ‘Billing & Settings’ section of the project.  Because you are reading this blog post, you can apply for $ 500 USD in Google Cloud Usage Credit at this URL – use code “gde-in” when you apply for the credit.

Google Cloud Usage Credit

To be complete there are many other types of cloud services available, such as Google App Engine, Google Big Query and many more.  Those services are not directly related to the topic of this article, so I’ll just link more information from the Google Cloud Developer documentation here.

Step Two – Spin up and configure a GCE instance

Note: All of the steps I describe below could be performed in the Terminal via GCloud command line tools (‘gcloud compute’ in this case), for simplicity, I will detail the steps using the web console.  Alternatively, here is a link to creating a GCE instance using those tools.

From within your project in Google Developers Console, click on your Project Name. From the project console page, click on ‘COMPUTE’ menu on the left side to expand it.  Next click on ‘COMPUTE ENGINE’>VM Instances.

Then click on the red button on the top of the page ‘New Instance’ to open the web page with the instance information as shown below.  Also here’s a quick summary of the values I selected: ZONE: US-Central1-b; MACHINE TYPE: n1-standard-1 (1 vCPU, 3.8 GB memory); IMAGE: Debian-7-wheezy-v20140606.

Other notes: You could use a g1-small instance type if you’d prefer, minimum machine requirements for the community edition of Aerospike are at least 1 GB RAM and 1 vCPU. You could use Red Hat and CentOS for the image, however my directions are specific to Debian 7 Linux.

GCE Configuration for Aerospike test

Click the blue ‘Create’ button to start your instance.  After the instance is available (takes less than a minute in my experience!), then you will see it listed in the project console window (COMPUTE ENGINE>VM Instance).  You can now test connectivity to your instance by clicking on the ‘SSH’ button to the right of the instance.

To test connectivity using SSH, open  Terminal, then use the ‘gcloud auth login’ command as described previously, then paste the gcutil command into the terminal, an example is shown below.

Testing Connectivity to GCE

The last configuration step for GCE is set up a firewall rule.  You’ll want to do this so that you can use the Aerospike (web-based) management console.  To create this rule do the following in the Google Developers Console for your project:  Click on COMPUTE>COMPUTE ENGINE>Networks>’default’>Firewall Rules>Create New.  Then add a new firewall rules with these settings: Name: AMC; Source Ranges: 0.0.0.0/0; Allowed Protocols or Ports: tcp: 8081

Step Three – Install the Aerospike Community Edition

To start, I set up a test Aerospike server with a single node.  To do this there are three required steps.  I have added a couple of optional steps as well since I found them to make my test of Aerospike more interesting

a) Connect to GCE via SSH
b) Download the Aerospike Community Edition
c) Extract and install the download (which is the server software plus command line tools)
d) start the service and test inserting data
e) install the node.js client (optional)
f) install the web-based management console

Notes: Be sure to run the scripts below as sudo.  Also my install instructions are based on downloading the version of Aerospike Database Server 3.2.9 than is designed to run on DEBIAN 7.

Here is a Bash script to automate this process:

#!/bin/bash
sudo apt-get -y install wget
wget -qO server.tgz “http://www.aerospike.com/community_downloads/3.2.9/aerospike-community-server-3.2.9-debian7.tgz&#8221;
tar xzf server.tgz
sudo dpkg -i aerospike*/aerospike*

sudo /etc/init.d/aerospike start #start aerospike now

To verify correct functioning of the server:

sudo /etc/init.d/aerospike status

I next used the included command line tool to further verify that the server was working properly by inserting, retrieving and then deleting some values from the server.  The command line tools is called ‘cli’ and is found in /usr/bin/cli.  Here are some sample test commands that I used:

cli -o set -n test -s “test_set” -k first_key -b bin_x -v “first value”
cli -o set -n test -s “test_set” -k first_key -b bin_y -v “second value”
cli -o set -n test -s “test_set” -k first_key -b bin_z -v “third value”

Next I retrieved these values with the following command:

cli -o get -n test -s “test_set” -k first_key
> {‘bin_z': ‘third value’, ‘bin_y': ‘second value’, ‘bin_x': ‘first value’}

Last I deleted the key:

cli -o delete -n test -s “test_set” -k first_key
cli -o get -n test -s “test_set” -k first_key
> no data retrieved

Tip: (Current Aerospike version: 3.2.9) After an instance reboot, the Aerospike service may fail to start.  To fix try creating an Aerospike directory in /var/run:

sudo mkdir /var/run/aerospike

Optional Step — Client Setup (Node.js)

Next I setup a simple Node.js Client on the same instance as the server. The process is as follows:  install Node.js and the Node Package Manager (NPM) and then install the Node.js Aerospike client package.

Note: There are a number of Aerospike clients available for different languages.  These are outside the scope of this document.  For more information go here: http://www.aerospike.com/aerospike-3-client-sdk/

This script automates this process:

#!/bin/bash
CWD=$(pwd)
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:chris-lea/node.js # requires human interaction: ‘PRESS ENTER’
sudo apt-get install -y software-properties-common
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get install -y python g++ make nodejs
curl https://www.npmjs.org/install.js | sudo sh
sudo npm install aerospike -g # -g installs to /usr/lib, current dir otherwise

cd ${CWD}

 You will need to acknowledge the addition of the Node.js repository to your software repositories list. Once this completes, navigate to the examples directory:

 cd /usr/lib/node_modules/aerospike/examples

 Install the prerequisite packages:

 sudo npm install ../
 sudo npm update

These examples insert dummy data to a specified location in a similar fashion to the cli tool.

node put.js -n test -s “test_set” first_key
OK –  { ns: ‘test’, set: ‘test_set’, key: ‘first_key’ }
put: 3ms

node get.js -n test -s “test_set” first_key
OK –  { ns: ‘test’, set: ‘test_set’, key: ‘first_key’ } { ttl: 9827, gen: 2 } { bin_z: ‘third value’, i: 123, s: ‘abc’, arr: [ 1, 2, 3 ],map: { num: 3, str: ‘g3′, buff: <SlowBuffer 0a 0b 0c> }, b: <SlowBuffer 0a 0b 0c>,b2: <SlowBuffer 0a 0b 0c> }
get: 3ms 

node remove.js -n test -s “test_set” first_key 
OK –  { ns: ‘test’, set: ‘test_set’, key: ‘first_key’ }
remove: 7ms

Additionally there is a benchmarking tool I used to get a rough idea of the transactions per second available from my instance:

cd /usr/lib/node_modules/aerospike/benchmarks
npm install
node inspect.js

Management Console Setup

The Aerospike Management Console is a web based monitoring tool that will report all kinds of status information about your Aerospike deployment. Whether its a single instance or a large multi-datacenter cluster. To install the AMC I used the following script as superuser (e.g. sudo script.sh).

#!/bin/bash
apt-get -y install python-pip python-dev ansible
pip install markupsafe paramiko ecdsa pycrypto
wget -qO amc.deb ‘http://aerospike.com/amc/3.3.1/aerospike-management-console-3.3.1.all.x86_64.deb&#8217;
dpkg -i amc.deb

sudo /etc/init.d/amc start #start amc now

Once deployed I pointed my browser to port 8081 of the instance.  There will be a dialog asking for the hostname and port of an Aerospike instance.  Since I installed the server on the same instance as the amc I just used localhost and port 3000.

Aerospike Console

Step Four – Run tests and (optionally) add other nodes

As mentioned, you can test Aerospike on up to 2 nodes.  The next step I took in testing was to add another server node.  Here are the steps I took to do this.

First I added a firewall rule for TCP ports 3000-3004.  I did this using the same process (i.e. in the Google Developers Console) described previously.  Get to the ‘Create new firewall rule’ panel: Compute> Compute Engine> Networks> ‘default’> Firewall Rules> Create New.  Configure the rule by changing these values:Name: aerospike; Source Ranges: 0.0.0.0/0; Allowed Protocols or Ports: tcp:3000-3004

Next I opened the Aerospike configuration file located at /etc/aerospike/aerospike.conf. Inside the ‘networks’ section is a section called ‘heartbeat’ that looks like the following:

heartbeat {

mode multicast
address 239.1.99.222
port 9918

 # To use unicast-mesh heartbeats, comment out the 3 lines above and
# use the following 4 lines instead.
#mode mesh

#port 3002
#mesh-address 10.0.0.48
#mesh-port 3002

interval 150
timeout 10
}

I commented out the first three lines inside this section and then uncommented the four lines starting with mode mesh. I then replaced the ip address after mesh-address with the ip of my other node.  Next I save my changes then restarted the aerospike service:

sudo /etc/init.d/aerospike restart.

Next I repeated these changes on my second server instance, setting the mesh-address for this server to the ip address of the first server instance.  Each server instance will only need to know about one other server instance to connect to my Aerospike cluster. Everything else is handled automatically. To verify that the cluster is working correctly I checked the log file for ‘CLUSTER SIZE = 2’ like this:

sudo cat /var/log/aerospike/aerospike.log | grep CLUSTER
May 14 2014 23:42:48 GMT: INFO (partition): (fabric/partition.c::2876) CLUSTER SIZE = 2

Tip: If you are testing this out yourself, ensure that your instances can communicate with each other over the default ports 3000-3004.  To test connectivity use telnet for example: ‘telnet <remote ip> <port>’

Conclusions

In conclusion, I find Aerospike to be a superior performing database in its category.  I am curious about your experience with databases of this type (i.e. In-memory NoSQL).  Which vendors are you working with now?  What has been your experience?  Which type of setup works best for you – on premise (bare metal or virtualized) or in the cloud?  If in the cloud, which vendor’s cloud.

Also on the horizon, I am exploring up-and-coming light-weight application virtualization technologies, such as Docker.  Are you working with anything like this?  I will be posting more on using Docker with NoSQL and NewSQL databases over the next couple of months.

Posted in Agile, Big Data, Cloud, google, noSQL | Leave a comment