Agile, AWS, Azure, Big Data, Cloud, Data Science, google, Hadoop, Microsoft, noSQL, Technical Conference

My SoCalCodeCamp decks – Hadoop, ApprovalTests, BigData and more

I am going to be one busy lady on Saturday, June 23 at SoCalCodeCamp at UCSD in San Diego.  Here’s the schedule.  I am presenting in 5 different sessions, all on Saturday.  Here are the decks and sessions:

1) Harnessing the good intentions of others: increasing contributions to open source projects.  Deck TBD – here’s a video about the session (which we will also present in July at OSCON 2012).

2) Intro to SketchUp – presented by my 13-year-old daughter Samantha (I am just there to smile and say ‘that’s my girl!’).

3) Better Unit Testing with ApprovalTests – presenting with Woody Zuill.  This will also be presented at Agile 2012 (the national Agile conference, in August 2012) on the Testing track.

4) Intro to Hadoop on Azure – article coming in next month’s MSDN Magazine (publishes July 25, 2012) as well

5) BigData Panel – state of the data industry hosted by Stacey Broadwell.

AWS, Azure, Big Data, Cloud, Data Science, google, Hadoop, Microsoft, noSQL

New 1-day Class ‘NoSQL for the SQL Server Pro’

Because I’ve had several requests to build a short class after delivering my talk ‘NoSQL for the SQL Server DBA’ at #SQLSaturday120 last month in Huntington Beach, CA, I’ve decided to write and teach a one-day class around this topic.

I am running the first one in Anaheim on May 22 (from 9am to 5pm).  Join me to get a deeper perspective on the NoSQL data landscape.  Students will also get hands-on experience with NoSQL using MongoDB.

Here’s the link to register (Eventbrite) for the class.  I am looking forward to seeing you there.

AWS, Azure, Big Data, Cloud, Hadoop

Hadoop on Azure – JavaScript MapReduce using AWS S3 data

What? Use a Microsoft Azure product (Hadoop on Azure) to run a MapReduce job (using JavaScript) on data stored on AWS S3? Seems like a great blog topic for April 1, doesn’t it? Enjoy the video.

Here’s the original blog post from Microsoft’s Denny Lee, which inspired me to try this out.
Also, in case you are wondering, here is the source code (from the Samples section of the Hadoop on Azure beta site), in JavaScript, for the ‘WordCount’ MapReduce job.

var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
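If you want to sanity-check map and reduce logic like this locally before submitting it to the cluster, you can run it in Node.js against mock objects. This is just a sketch: the grouping and iterator objects below are my own stand-ins for the console’s real runtime, and the map/reduce functions are repeated here so the snippet runs on its own.

```javascript
// WordCount map/reduce functions (same as the Hadoop on Azure sample).
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Mock map context: collect map output, grouped by key (a stand-in
// for Hadoop's shuffle/sort phase).
var grouped = {};
var mapContext = {
    write: function (k, v) {
        (grouped[k] = grouped[k] || []).push(v);
    }
};
map(null, "Hadoop on Azure, Hadoop on AWS", mapContext);

// Mock reduce context: feed each key's values to reduce via an
// iterator with hasNext()/next(), matching the sample's usage.
var results = {};
var reduceContext = { write: function (k, v) { results[k] = v; } };
Object.keys(grouped).forEach(function (key) {
    var vals = grouped[key], i = 0;
    var iterator = {
        hasNext: function () { return i < vals.length; },
        next: function () { return vals[i++]; }
    };
    reduce(key, iterator, reduceContext);
});

console.log(results); // → { hadoop: 2, on: 2, azure: 1, aws: 1 }
```

Obviously this exercises only the word-counting logic, not the cluster itself, but it is a quick way to catch typos before a 10-minute job submission.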

AWS, Azure, Big Data, Cloud, Data Science, google, Hadoop, Microsoft, noSQL, SQL Server 2012

NoSQL for the SQL Server DBA

Here are the slides and my demos from this presentation, which I will be delivering at SQLSaturday 120.  Here I cover data storage options for NoSQL and RDBMS databases, both on-premises and in the cloud.  I explain when to use which database solution and include some examples of ‘not-only-SQL’ (i.e. combinations of relational and non-relational implementations based on business requirements).


Big Data, Cloud, Hadoop, Microsoft

Trying out Hadoop on Azure Screencast Series

I made some short screencasts today as I tried out some of the functionality of the Hadoop on Azure CTP.  I intend to post more screencasts as I work through the samples and start trying out my own data samples too.  If you’d like to be notified when I add more content, you can either subscribe to this blog’s RSS feed (tags: BigData or Hadoop) or subscribe to my YouTube BigData playlist – here.

First up – getting started with Hadoop on Azure, setting up a cluster.

Next – performing basic administrative tasks via the Hadoop on Azure web-based cluster interface.

Next – Trying out a sample – Using the Mahout Clustering sample

Next – Getting started with the Interactive JavaScript console

Next – Trying out Pig Latin, running a Pig job using the JavaScript console.

In the next set of videos, I plan to try out writing and executing a MapReduce job in JavaScript.  Then I’ll start working with the Hive (HQL) functionality in the Hadoop on Azure console.  Finally, I plan to finish up by showing the Hadoop add-in for Excel.  This last tool also allows you to write and execute Hive queries against your Hadoop on Azure cluster, but uses Excel as the client.


AWS, Azure, Big Data, Cloud, Data Science, Hadoop, Microsoft, noSQL

NoSQL for the DBA – 5 Minute Preview

I am working on a new one-hour talk ‘NoSQL for the DBA’ for SQLSaturday 120 in March in Huntington Beach, CA.  As I work on this, I am thinking about what to focus on – here are my ideas:

1) Understanding CAP
2) Understanding the different ‘flavors’ of NoSQL – i.e. Document databases, Key/Value databases, Graph databases, etc…
3) Translating relational DBA tasks to NoSQL
4) Talking about hosted (cloud) NoSQL
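To make point 2 concrete, here’s an illustrative, product-agnostic sketch (in JavaScript, since that’s what I’ve been using elsewhere on this blog) of how the same customer record might be modeled in a key/value store versus a document store. The record and field names are made up for the example.

```javascript
// Key/value style: the store only understands opaque keys and values,
// so related attributes end up spread across several keys.
var keyValueStore = {
    "customer:42:name": "Contoso",
    "customer:42:city": "San Diego"
};

// Document style: the record is one self-describing document, and
// related (nested) data can live inside it.
var customerDocument = {
    _id: 42,
    name: "Contoso",
    city: "San Diego",
    orders: [{ id: 1, total: 99.5 }]
};

console.log(keyValueStore["customer:42:city"]); // → San Diego
console.log(customerDocument.orders.length);    // → 1
```

A graph database would model the same information differently again: customers and orders as nodes, with the relationships between them stored explicitly.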

Here’s a 5-minute video preview.

Azure, Big Data, Cloud, Hadoop, noSQL

Trying out Hadoop on Azure


Yesterday the team announced the availability of a limited beta on this new functionality.  They sent me a trial code, so I decided to give it a whirl.  Here’s what I found…

The team also announced some of their future plans here.

To start, at the portal you sign in with a Windows Live ID (WLID) and enter your beta code, then you enter a username and password.  The portal is simple to use (Metro-style buttons).  You then request that a cluster be allocated.  I’ll show their screen shot (for security reasons) rather than mine.


After you click ‘Request cluster’, a status screen similar to the one below appears.  It took about 5 minutes for the cluster allocation to complete for me.


Then you are ready to start trying it out.  My cluster is shown below in a couple of screen shots.  The first two options let you create (and run) a MapReduce job and see the status of the most recently run jobs.


If you click ‘Create Job’ then you’ll see the screen shown below.  There you can run a named job by selecting a *.jar file to run the MapReduce job.


Back on the first page, the complete page is shown below.  You can see that the green buttons allow you to work with your cluster in the following ways:

1) You can run queries, either in JavaScript or in Hive directly in an interactive console.  (I’ll show a screenshot later in this post).

2) Next you can download a *.rdp file so that you can access your cluster via remote desktop.  I tested that out and it worked just fine.

3) The third button allows you to open ports.  Nothing is open by default.  So that I could connect via my local Excel instance, I opened port 10000 for ODBC by clicking on this button and dragging the slider from closed to open.

4) The fourth button takes you to some options for importing data; again, I’ll show that below.

On the next (orange) row, the buttons work as follows:

1) The first button informs you that the trial is FREE

2) The second button shows you the history of the jobs that you have run.

3) The third button points you to a sample to start with.  I’d suggest that you start there.

4) The fourth button just links you to documentation.


Trying out the Sample

As mentioned, click the third orange button on the main console, then you’ll see the samples shown below.


I already tried out the Pi Estimator yesterday, and it worked fine.  So today I’ll try out the ‘10GB Gray Sort’.  I clicked that button and saw the screen below.


Next I clicked ‘Deploy to your cluster’ and was taken to the screen shown below.


Note the parameters and the ‘Final Command’ on the screen above.  Next I clicked on ‘Execute Job’, and the screen changed to a status screen that refreshed itself about once per second; this is shown below.  The job took 12 minutes to complete.


Here’s the job history for this and some other jobs that I ran.


Now let’s look at the interactive console.  It defaults to JavaScript, as shown below.  I am going to switch to the ‘Hive’ console by clicking that button.


Here are a couple of simple Hive queries to get us started (shown below):


Next, if you go back to the main console and click on ‘Manage Data’, you’ll see the screen below:


You’ll see that in this beta, you can work with data from the Windows Azure Data Market, Windows Azure Blob storage and Amazon S3.  I tried out importing data from the Windows Azure Data Market and it worked just fine.  Shown below is the query I generated (from some ‘free’ data) from the Data Market.  Notice that I clicked on ‘Develop’ on the right side of the page to generate the query (highlighted), which I then pasted into the dialog box on the Hadoop page.


The blank import page in the Hadoop portal is shown below; you paste the query into the highlighted box, enter your credentials, give the new table a name, and then click the ‘Import Data’ button.


At this time, although you can enter the credentials for your Windows Azure storage account, I did not see a way in the Hadoop portal to actually perform the import.  The documentation (linked) section is also empty.  The same appears to be the case for AWS S3.  In addition, the ‘upload’ and ‘download’ buttons on the ‘Manage Data’ page do not yet appear to be available to try out in this beta.

Connect to Excel

The last thing I tried out was the connectivity to Excel.  I followed the instructions in the documentation linked on this portal, tried to connect to my newly imported UN_samples table (from the Windows Azure Data Market), and it worked great!  Note the ‘Hive’ button and task pane on the right.



I still have more to try out here.  I am reading and learning about MapReduce and Hive and want to put this beta through a few more paces in the weeks to come.  I am interested in hearing from you as well.  Are you working with Hadoop yet? How’s it going?