Learning GitHub – A Multi-part Screencast series

Commit Yourself - Learning GitHub

I struggled with various aspects of GitHub when I first started using it about two years ago. Now it's become part of my daily routine. Lately, I've been getting more and more requests from friends to answer "what I'm sure is a stupid question." To that end, I decided to create a short screencast series in which I show the practicalities of using GitHub.

Commit Yourself and Enjoy!

Part 1 – What is GitHub?

Part 2 – Getting Started with GitHub – Users and Repositories

Part 3 – Working with Repositories

Part 4 – Handling Conflict

Part 5 – Bonus – data about your Repo

Posted in Uncategorized

Real-world Predictive Analytics with Power BI and Predixion Software

Here’s the deck from my talk on this topic at VSLive in Redmond this week.

While there, I demoed Predixion Software's next release (currently in private beta). This release includes many new features; the most interesting to me are the new visualization dashboard and the data-visualization enhancements to the Insight Workbench. I snipped a couple of screens from these feature sets and am including them below.

First  – Data Exploration in Insight Workbench

Predixion Data Exploration

Next  – New Visualization Dashboard for the browser

Predixion Browser Visualizer

Happy programming!


Posted in Microsoft

Updated: AWS for the SQL Server Professional

Updated core deck (includes a video with demos at the end) on AWS services of interest to the SQL Server professional – enjoy!

Posted in AWS, Cloud, Data Science

How to: Developing for Aerospike with Python or C#

I've been doing some work lately with Aerospike, a super-fast in-memory database. See my previous blog posts here about the speed of this product. Since I started working with Aerospike, the team there has announced that their core product is now open source.

In this blog post, I'll be covering how to get started developing with Aerospike. There are a couple of considerations as you begin. The first is where you want to host your Aerospike cluster (which can be a single node for initial testing). Aerospike itself runs only on Linux at this time (not Windows or Mac). So you have two options for hosting the server: either in the cloud or using virtualization software on your local development machine. I did extensive testing with both methods and currently prefer a cloud-hosted instance. I will cover both approaches in this blog post.

Developing with Aerospike

The second consideration is which client language/library you prefer to use. Because there is already a good amount of information available on using Java with Aerospike, I will cover both Python and .NET (C#) here. As of this writing, Aerospike has client libraries for C/C++, Java, C#, Node.js, C libevent, PHP, Erlang, Python and Perl.

So let’s get started….

Installing Aerospike on the Cloud

I've tested two configurations. The first is the Google Cloud, using Google Compute Engine (I documented the Aerospike install steps for this method in a previous blog post – here); the Google Developers Console with a GCE instance hosting Aerospike is shown below. You'll note that I used an 'n1-standard-2' machine type for my testing – 2 vCPUs and 7.5 GB RAM. If you are new to the Google Cloud, you can use the code 'gde-in' at this link to get $500 in usage credit as well.

Aerospike on GCE

I also tested using an AWS Amazon Machine Image (EC2 service), linked here. This AMI has Aerospike pre-installed. Both methods are simple and quick for initial testing. To use this method, simply spin up the linked image (shown below in the screenshot) via AWS.

Aerospike on AWS

Tip: Be sure to include a firewall rule to open port 3000 for testing your client connectivity for either cloud configuration.

Installing Aerospike locally using Virtual Box

If you prefer to install Aerospike locally, you can use the instructions found on Aerospike's web site to do so. They have instructions and links to Vagrant files (wrappers) for Oracle's VirtualBox so that you can quickly download and start an Aerospike image. Because you are using virtualization to host Aerospike itself, you can install it on a local machine running any OS.

If you choose this route, be sure to follow the instructions exactly as listed on the Aerospike site, as there are a number of configuration steps and each must be done in the order listed. I tested both the Mac and Windows instructions. There are 9 install steps for each; the Mac install steps are shown below.

Part One of install instructions for Mac

Mac install P1

Part Two of install instructions for Mac

Mac install P2


If you are attempting to install on a Windows machine, be sure to verify that your installation of VirtualBox is able to use hardware virtualization (VT-x/AMD-V – look in the Settings tab), as some Windows machines may need BIOS settings updated in order for this to work.

Using Client Libraries with Aerospike

After you’ve set up and tested your Aerospike server, the next step is to select your client library.  First I’ll cover using the Python client library.

Using Python

The instructions provided by Aerospike worked just fine for me, with one small exception as I tested on my Mac: for the last command ('sudo pip install aerospike'), if you do not have pip installed on your Mac, just run 'sudo easy_install pip' from Terminal to install it first.

Python Library for Aerospike

On the first page of the Aerospike Python client manual there is a complete sample Python file that you can use to quickly test client connectivity by inserting a record via the put() method. You can see this file open in Sublime in my environment in the screenshot below.

This sample is designed to connect to a local installation of Aerospike. If you are using a cloud-hosted installation, just change the IP address (shown in line 6) to the external IP address of your hosted instance. The config (line 6) also specifies the default port of 3000.
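In the Python client, that configuration is just a dictionary containing a list of (address, port) tuples. Here's a minimal sketch of that config (the IP address below is a placeholder; substitute your instance's external IP):

```python
# Aerospike Python client configuration: a dict holding a list of
# (address, port) tuples. Replace '127.0.0.1' with your cloud
# instance's external IP; 3000 is the default Aerospike service port.
config = {
    'hosts': [('127.0.0.1', 3000)]
}

# With the real client installed, you would then connect like:
# client = aerospike.client(config).connect()
```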

You may also notice that Aerospike records are addressed via the pattern of namespace, set and key (shown on line 13). You call the put() method (line 16) to write a record and, optionally, the get(key) method (line 22) to read one back.

Simple Python client for Aerospike

I found the easiest way to verify record insertion while testing was to use the web console.  A sample of the output (with several test records inserted) is shown below on the Definitions tab of the console.  The URL format for the web console for a remote installation (including cloud) is as follows:

http://100.118.015.101:8081/#dashboard/100.118.015.101:3000/30/10.215.167.37:3000

In the example above, the first IP address is the external IP address of the remotely hosted instance and the second is that instance's internal IP address.

If you are using a local instance, then the default URL for the console is shown below:

http://localhost:8081/#dashboard/127.0.0.1:3000/30/10.0.2.15:3000
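Because the dashboard URL simply strings together the console host, the monitored node's address and a middle path segment (30 in both examples – I'm assuming it's a refresh value, so treat that as a guess), you can assemble it with a small helper. A sketch matching the example URLs above:

```python
# Assemble the AMC dashboard URL from its parts, following the pattern
# of the example URLs above. The '30' path segment is kept verbatim;
# its exact meaning is my assumption, not documented here.
def amc_dashboard_url(console_host, node_ip, internal_ip,
                      console_port=8081, node_port=3000):
    return ('http://{c}:{cp}/#dashboard/{n}:{np}/30/{i}:{np}'
            .format(c=console_host, cp=console_port, n=node_ip,
                    np=node_port, i=internal_ip))

# Local install: console and node both on localhost
print(amc_dashboard_url('localhost', '127.0.0.1', '10.0.2.15'))
# → http://localhost:8081/#dashboard/127.0.0.1:3000/30/10.0.2.15:3000
```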

Verify sample record insertion

There are also more complete sample files for working with Aerospike from Python on GitHub – here. Python support in Aerospike is relatively new, and the team is asking for your feedback if you use this library.

More Python samples

Using C#

The C# client library (here) is quite rich and conveniently includes a test harness (a Windows Forms application) that allows you to easily connect and exercise the Aerospike API. Additionally, the sample code includes a benchmark test harness that I found useful.

CSharp for Aerospike

I tested the library on a Windows 8 machine with Visual Studio 2012 and it built with no issues. I then connected to an instance (local in the screenshot, though in reality I connected to a cloud-based instance) and was happy to explore the API via the well-written test harness shown below. You'll notice that the sample includes code for all types of activity (put, get, append, prepend, batch, etc.) and also for asynchronous processing.

CSharp Test Harness Form

In addition, the sample includes a benchmarking tool, which made it simple for me to test and compare various vendor clouds (in this case AWS vs. Google Cloud). An example of the benchmark application that is part of the C# sample client is shown below.

Benchmark Aerospike AWS

For completeness, I'll include a screenshot of the C# sample code in Visual Studio. You can see there are two projects, AerospikeClient and AerospikeDemo (the latter is set as the start-up project). The AerospikeDemo project contains the code for the test harness (Windows Forms) shown in the previous screenshot. Shown below is the source file 'Operation.cs' from the /Main directory. Here you can get a sense of the core database operations – put, get, etc. – which take Bin objects (each Bin is a column name/value pair).
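Since each Bin is just a column name/value pair, the concept translates directly to other languages. Here's a rough Python illustration of the idea only (the names and helper below are mine, not part of the C# library):

```python
from collections import namedtuple

# A Bin pairs a column name with a value; an Aerospike record is a set
# of bins. This is only a Python illustration of the C# concept; the
# names below are mine, not from the Aerospike library.
Bin = namedtuple('Bin', ['name', 'value'])

def record_from_bins(bins):
    """Collapse a list of Bin objects into a record dict."""
    return {b.name: b.value for b in bins}

bins = [Bin('bin_x', 'first value'), Bin('bin_y', 'second value')]
print(record_from_bins(bins))
# → {'bin_x': 'first value', 'bin_y': 'second value'}
```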

Aerospike in Visual Studio

What’s Next

I'll close by reminding you why I am so interested in Aerospike. Remember, the core product is now free and open source. It also has the potential, in relevant BigData scenarios, to cost much less than other storage methods that can scale to this speed and size. I have found the graphic below from Aerospike's site to be compelling and accurate.

Why Aerospike

What’s your experience like?  If you try out Aerospike, let me know (in the comments section below) how it goes for you.

Fun addition for those of you who have read this far…

Soon there will be...

Happy Programming!


Posted in Big Data, Cloud

How to: Installing AerospikeDB on Google Compute Engine

Showing Aerospike at BigDataCampLA

Recently, I've been doing some work with AerospikeDB, a super-fast in-memory NoSQL database. I gave a presentation at the recent BigDataCampLA on 'Bleeding Edge Databases' and included it because of impressive benchmarks, such as 1 million TPS (read-only workload) PER SERVER and 40K TPS (read-write) on that same server. Here's the live presentation; I also recorded a screencast of it.

In this blog post, I'll detail how you can get started with the FREE community edition of AerospikeDB. Again I'll use Google Compute Engine as my platform of choice, due to its speed, ease of use and low cost for testing. You'll note from the screenshot below that you can install the community edition on your own server or on other clouds (such as AWS) as well. I am writing this post because Aerospike didn't have GCE setup directions available before now.

Aerospike Community Edition

Here's a top-level list of what you'll need to do (I'll detail each step below) – I did the whole process start-to-finish in under 30 minutes.

    • Set up a Google Cloud project with Google Compute Engine (VM) API access
    • Spin up and configure a GCE instance
    • Install the Aerospike Community Edition, which runs on up to two nodes and can use up to 200 GB for your testing purposes
    • Run your tests and (optionally) add other nodes

Next I’ll drill into each of the steps listed above.  I’ll go into more detail and will provide sample commands for the Google Cloud test that I did.

Step One – Setup a Google Cloud project with Google Compute Engine access

If you are new to the Google Cloud, you’ll need to get the Google Cloud SDK for the command line utilities you’ll need to install and to connect to your cloud-hosted virtual machine.  There is a version of the SDK for Linux/Mac and also for Windows.

For this tutorial, I will be using a Mac. There are only two steps to using the SDK:
a) From Terminal run

curl https://sdk.cloud.google.com | bash

b) Restart Terminal and run the command below. A browser window will open; click on your Gmail account, then click the 'accept' button, and the login will complete in the terminal window.

gcloud auth login

GCloud Authorization

If you already have a Google Cloud Project, then you can proceed to Step Two.  If you do not yet have a Google Cloud Project, then you will need to go to the Google Developer’s Console and create a new Project by clicking on the red ‘Create Project’ button at the top of the console.

Create Project


Note: Projects are containers for billing on the Google Cloud. They can contain 1:M instances of each authorized service – in our case, 1:M instances of Google Compute Engine virtual machines.

To enable access to the GCE API in your project, click on the name of the project in Google Developer Console, then click on ‘APIS & AUTH’>’APIs’>Google Compute Engine “OFF” to turn the service availability to “ON”.  The button should turn green to indicate the service is available.

You will also have to enable billing under the 'Billing & Settings' section of the project. Because you are reading this blog post, you can apply for $500 USD in Google Cloud usage credit at this URL – use code "gde-in" when you apply for the credit.

Google Cloud Usage Credit

For completeness: there are many other Google Cloud services available, such as Google App Engine, Google BigQuery and more. Those services are not directly related to the topic of this article, so I'll just link to the Google Cloud developer documentation here.

Step Two – Spin up and configure a GCE instance

Note: All of the steps I describe below could be performed in the Terminal via the gcloud command line tools ('gcloud compute' in this case); for simplicity, I will detail the steps using the web console. Alternatively, here is a link to creating a GCE instance using those tools.

From within your project in Google Developers Console, click on your Project Name. From the project console page, click on ‘COMPUTE’ menu on the left side to expand it.  Next click on ‘COMPUTE ENGINE’>VM Instances.

Then click on the red 'New Instance' button at the top of the page to open the instance-configuration page shown below. Here's a quick summary of the values I selected: ZONE: us-central1-b; MACHINE TYPE: n1-standard-1 (1 vCPU, 3.75 GB memory); IMAGE: debian-7-wheezy-v20140606.

Other notes: you could use a g1-small instance type if you'd prefer; the minimum machine requirements for the community edition of Aerospike are 1 GB RAM and 1 vCPU. You could also use Red Hat or CentOS for the image, but my directions are specific to Debian 7 Linux.

GCE Configuration for Aerospike test

Click the blue ‘Create’ button to start your instance.  After the instance is available (takes less than a minute in my experience!), then you will see it listed in the project console window (COMPUTE ENGINE>VM Instance).  You can now test connectivity to your instance by clicking on the ‘SSH’ button to the right of the instance.

To test connectivity using SSH, open Terminal, run the 'gcloud auth login' command as described previously, then paste in the gcutil command; an example is shown below.

Testing Connectivity to GCE

The last GCE configuration step is to set up a firewall rule. You'll want this so that you can use the Aerospike (web-based) management console. To create the rule, do the following in the Google Developers Console for your project: click on COMPUTE > COMPUTE ENGINE > Networks > 'default' > Firewall Rules > Create New. Then add a new firewall rule with these settings: Name: AMC; Source Ranges: 0.0.0.0/0; Allowed Protocols or Ports: tcp:8081.

Step Three – Install the Aerospike Community Edition

To start, I set up a test Aerospike server with a single node. There are three required steps to do this. I have added a couple of optional steps as well, since I found they made my test of Aerospike more interesting.

a) Connect to GCE via SSH
b) Download the Aerospike Community Edition
c) Extract and install the download (which is the server software plus command line tools)
d) Start the service and test inserting data
e) Install the Node.js client (optional)
f) Install the web-based management console

Notes: Be sure to run the scripts below as sudo. My install instructions are based on downloading the version of Aerospike Database Server 3.2.9 that is designed to run on Debian 7.

Here is a Bash script to automate this process:

#!/bin/bash
sudo apt-get -y install wget
wget -qO server.tgz "http://www.aerospike.com/community_downloads/3.2.9/aerospike-community-server-3.2.9-debian7.tgz"
tar xzf server.tgz
sudo dpkg -i aerospike*/aerospike*

sudo /etc/init.d/aerospike start #start aerospike now

To verify correct functioning of the server:

sudo /etc/init.d/aerospike status

I next used the included command line tool to further verify that the server was working properly by inserting, retrieving and then deleting some values from the server. The command line tool is called 'cli' and is found at /usr/bin/cli. Here are some sample test commands that I used:

cli -o set -n test -s "test_set" -k first_key -b bin_x -v "first value"
cli -o set -n test -s "test_set" -k first_key -b bin_y -v "second value"
cli -o set -n test -s "test_set" -k first_key -b bin_z -v "third value"

Next I retrieved these values with the following command:

cli -o get -n test -s "test_set" -k first_key
> {'bin_z': 'third value', 'bin_y': 'second value', 'bin_x': 'first value'}

Last I deleted the key:

cli -o delete -n test -s "test_set" -k first_key
cli -o get -n test -s "test_set" -k first_key
> no data retrieved
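The behavior shown above – bins accumulating under a single key addressed by namespace, set and key, then disappearing on delete – can be mimicked with a plain dictionary. A toy in-memory sketch, purely to illustrate the data model (this is not the real client API):

```python
# Toy store keyed by (namespace, set, key); each value is a dict of bins.
# Illustrative only: mimics the semantics of the cli commands above.
store = {}

def cli_set(ns, s, key, bin_name, value):
    store.setdefault((ns, s, key), {})[bin_name] = value

def cli_get(ns, s, key):
    return store.get((ns, s, key))   # None stands in for 'no data retrieved'

def cli_delete(ns, s, key):
    store.pop((ns, s, key), None)

cli_set('test', 'test_set', 'first_key', 'bin_x', 'first value')
cli_set('test', 'test_set', 'first_key', 'bin_y', 'second value')
cli_set('test', 'test_set', 'first_key', 'bin_z', 'third value')
print(cli_get('test', 'test_set', 'first_key'))
# → {'bin_x': 'first value', 'bin_y': 'second value', 'bin_z': 'third value'}

cli_delete('test', 'test_set', 'first_key')
print(cli_get('test', 'test_set', 'first_key'))  # → None
```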

Tip: (current Aerospike version: 3.2.9) After an instance reboot, the Aerospike service may fail to start. To fix this, try creating an Aerospike directory in /var/run:

sudo mkdir /var/run/aerospike

Optional Step — Client Setup (Node.js)

Next, I set up a simple Node.js client on the same instance as the server. The process is as follows: install Node.js and the Node Package Manager (npm), then install the Node.js Aerospike client package.

Note: There are a number of Aerospike clients available for different languages.  These are outside the scope of this document.  For more information go here: http://www.aerospike.com/aerospike-3-client-sdk/

This script automates this process:

#!/bin/bash
CWD=$(pwd)
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:chris-lea/node.js # requires human interaction: 'PRESS ENTER'
sudo apt-get install -y software-properties-common
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get install -y python g++ make nodejs
curl https://www.npmjs.org/install.js | sudo sh
sudo npm install aerospike -g # -g installs to /usr/lib, current dir otherwise

cd ${CWD}

You will need to acknowledge the addition of the Node.js repository to your software repositories list. Once this completes, navigate to the examples directory:

cd /usr/lib/node_modules/aerospike/examples

Install the prerequisite packages:

sudo npm install ../
sudo npm update

These examples insert dummy data to a specified location in a similar fashion to the cli tool.

node put.js -n test -s "test_set" first_key
OK – { ns: 'test', set: 'test_set', key: 'first_key' }
put: 3ms

node get.js -n test -s "test_set" first_key
OK – { ns: 'test', set: 'test_set', key: 'first_key' } { ttl: 9827, gen: 2 } { bin_z: 'third value', i: 123, s: 'abc', arr: [ 1, 2, 3 ], map: { num: 3, str: 'g3', buff: <SlowBuffer 0a 0b 0c> }, b: <SlowBuffer 0a 0b 0c>, b2: <SlowBuffer 0a 0b 0c> }
get: 3ms

node remove.js -n test -s "test_set" first_key
OK – { ns: 'test', set: 'test_set', key: 'first_key' }
remove: 7ms

Additionally, there is a benchmarking tool I used to get a rough idea of the transactions per second available from my instance:

cd /usr/lib/node_modules/aerospike/benchmarks
npm install
node inspect.js

Management Console Setup

The Aerospike Management Console (AMC) is a web-based monitoring tool that reports all kinds of status information about your Aerospike deployment, whether it's a single instance or a large multi-datacenter cluster. To install the AMC, I used the following script as superuser (e.g. sudo script.sh).

#!/bin/bash
apt-get -y install python-pip python-dev ansible
pip install markupsafe paramiko ecdsa pycrypto
wget -qO amc.deb 'http://aerospike.com/amc/3.3.1/aerospike-management-console-3.3.1.all.x86_64.deb'
dpkg -i amc.deb

sudo /etc/init.d/amc start #start amc now

Once deployed, I pointed my browser to port 8081 of the instance. A dialog will ask for the hostname and port of an Aerospike instance; since I installed the server on the same instance as the AMC, I just used localhost and port 3000.

Aerospike Console

Step Four – Run tests and (optionally) add other nodes

As mentioned, you can test Aerospike on up to two nodes. The next step in my testing was to add another server node. Here are the steps I took.

First I added a firewall rule for TCP ports 3000-3004, using the same process (i.e. in the Google Developers Console) described previously. Get to the 'Create new firewall rule' panel: Compute > Compute Engine > Networks > 'default' > Firewall Rules > Create New. Configure the rule by changing these values: Name: aerospike; Source Ranges: 0.0.0.0/0; Allowed Protocols or Ports: tcp:3000-3004.

Next I opened the Aerospike configuration file located at /etc/aerospike/aerospike.conf. Inside the 'network' section is a subsection called 'heartbeat' that looks like the following:

heartbeat {
    mode multicast
    address 239.1.99.222
    port 9918

    # To use unicast-mesh heartbeats, comment out the 3 lines above and
    # use the following 4 lines instead.
    #mode mesh
    #port 3002
    #mesh-address 10.0.0.48
    #mesh-port 3002

    interval 150
    timeout 10
}

I commented out the first three lines inside this section and uncommented the four lines starting with 'mode mesh'. I then replaced the IP address after mesh-address with the IP of my other node. Next, I saved my changes and restarted the Aerospike service:

sudo /etc/init.d/aerospike restart
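The hand-edit described above is mechanical enough to script. Below is a rough Python sketch that flips the sample heartbeat block from multicast to mesh; it assumes the exact line values shown in the sample config, so treat it as illustrative only (the peer IP is hypothetical):

```python
# Rewrite the sample heartbeat config from multicast to mesh mode.
# Assumes the exact line values from the sample aerospike.conf above.
def to_mesh(conf_text, peer_ip):
    out = []
    for line in conf_text.splitlines():
        s = line.strip()
        if s in ('mode multicast', 'address 239.1.99.222', 'port 9918'):
            out.append('#' + line)                 # disable the multicast lines
        elif s in ('#mode mesh', '#port 3002', '#mesh-port 3002'):
            out.append(s.lstrip('#'))              # enable the mesh lines
        elif s == '#mesh-address 10.0.0.48':
            out.append('mesh-address ' + peer_ip)  # point at the other node
        else:
            out.append(line)
    return '\n'.join(out)

sample = """mode multicast
address 239.1.99.222
port 9918
#mode mesh
#port 3002
#mesh-address 10.0.0.48
#mesh-port 3002"""

print(to_mesh(sample, '10.240.0.5'))  # '10.240.0.5' is a hypothetical peer IP
```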

Next I repeated these changes on my second server instance, setting its mesh-address to the IP address of the first server. Each server instance only needs to know about one other server instance to join the Aerospike cluster; everything else is handled automatically. To verify that the cluster was working correctly, I checked the log file for 'CLUSTER SIZE = 2' like this:

sudo cat /var/log/aerospike/aerospike.log | grep CLUSTER
May 14 2014 23:42:48 GMT: INFO (partition): (fabric/partition.c::2876) CLUSTER SIZE = 2
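If you want to script that health check rather than grep by hand, the same match is easy in Python; a sketch based on the log line format shown above:

```python
import re

# Pull the most recent CLUSTER SIZE value out of aerospike.log text.
def cluster_size(log_text):
    sizes = re.findall(r'CLUSTER SIZE = (\d+)', log_text)
    return int(sizes[-1]) if sizes else None

sample = ("May 14 2014 23:42:48 GMT: INFO (partition): "
          "(fabric/partition.c::2876) CLUSTER SIZE = 2")
print(cluster_size(sample))  # → 2
```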

Tip: If you are testing this yourself, ensure that your instances can communicate with each other over the default ports 3000-3004. To test connectivity, use telnet, for example: 'telnet <remote ip> <port>'.

Conclusions

In conclusion, I find Aerospike to be a superior-performing database in its category. I am curious about your experience with databases of this type (i.e. in-memory NoSQL). Which vendors are you working with now? What has been your experience? Which type of setup works best for you – on-premise (bare metal or virtualized) or in the cloud? If in the cloud, which vendor's?

Also on the horizon: I am exploring up-and-coming lightweight application virtualization technologies, such as Docker. Are you working with anything like this? I will post more on using Docker with NoSQL and NewSQL databases over the next couple of months.

Posted in Agile, Big Data, Cloud, google, noSQL

Bleeding Edge Databases – Aerospike, Algebraix and Google Big Query

I gave a talk called ‘Bleeding Edge Databases’ at this weekend’s BigDataCampLA.  Several attendees asked me to record the talk, so I did (and will link that post below).

Keynoting at BigDataCampLA

Here’s my favorite tweet from the event.

I am regularly invited to preview many new database technologies, so you may be wondering why I chose these three solutions to highlight. The first thing all three have in common is independently benchmarked, noticeably better performance in their particular areas. I also take usability into account.

Aerospike – 1 million TPS (read-only workload) PER SERVER and 40K TPS (read-write) on that same server, with very good integration with client tools and libraries.

1M TPS Aerospike

AlgebraixData – 1 billion triples on ONE NODE and substantially better query response times than any competitor on the core benchmark queries for RDF databases. Their core engine, which is optimized using patented mathematical algorithms, interests me because of the solid performance benchmarks they've been able to achieve.

The Math of Algebraix Data

Google Big Query – 750 million rows in 10 seconds, solidly 'productized' with usable integration points both in and out, reduced pricing and increased streaming – plus nearly ANSI SQL-like queryability.

TPC-H Benchmark for Big Query

My screencast includes demos of the first two products using Google Compute Engine (VMs on the Google Cloud).  I chose a Linux GCE instance for Aerospike and a Windows GCE instance for AlgebraixData.  I prefer to use the Google Cloud for this type of testing for a couple of reasons:

1) Fast, easy VM spin-up
2) Generous free tier, clear pricing information (i.e. no surprises)
3) As a GDE (Google Developer Expert), I have usage credits for the Google Cloud beyond typical, so I rarely encounter any fees for quick POC testing

If you are interested in trying out these databases, the instructions on how to get access are in my screencast – enjoy!

Posted in Big Data, Cloud, google

Intro to the Google Cloud for Developers

Here’s the deck from my talk at #Techorama  – enjoy.

Posted in Cloud, google