Big Data, Cloud, Data Science, Microsoft, SQL Server 2012

Working with DnB Company Cleanse Match Data

DnB Cleanse Match

I’ve been working on some data cleansing projects lately, and to that end I’ve tried out the DnB Company Cleanse Match dataset in the Windows Azure Marketplace.  This dataset lets you get more complete information about companies and combine duplicate records.  Shown below is a screenshot that illustrates what you can do with this service.

DnB Company Cleanse Match

To try it out, you can email DnB for a promo code (send mail to ‘‘).  You can use this service in a few different ways: with Excel (via Power Query, or any other tool that supports consuming OData feeds), with SQL Server 2012 Data Quality Services, or programmatically by downloading the C# proxy class from the Azure Data Market (available after you subscribe to the service) and coding against the API.

I’ve made two screencasts to show how this works.  First, here’s the screencast on Power Query / API.

Second, here’s the screencast using the dataset with SQL Server 2012 DQS.

Also here’s the stub code for the API:

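// Requires: using System; using System.Net; plus the DnB proxy class from the Marketplace.
string USER_ID = "<windows live id user id>";
string ACCT_KEY = "<your key>";
var ROOT_URI = "";  // service root URI; must be filled in before running
var serviceClient = new DnB.DnBContainer(new Uri(ROOT_URI));
serviceClient.Credentials = new NetworkCredential(USER_ID, ACCT_KEY);

// Query suggested company details for "Dell" (TX, US).
var matches =
    from d in serviceClient.SuggestCompanyDetails(
        "Dell", null, null, null, "TX", null, "US", null, 3, 0)
    select d;

// Print the DUNS number of each suggested match.
foreach (var match in matches)
    Console.WriteLine("Result " + match.DunsNumber);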
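
As a rough illustration of how matched results might be used to combine duplicate records, here is a minimal Python sketch. It does not call the DnB service. It assumes your own records have already been tagged with the DUNS number returned by a lookup like the one above. The field names (`duns`, `name`, `state`, `phone`) and the sample values are hypothetical, and the merge rule (keep the first non-empty value for each field) is just one simple choice.

```python
from collections import defaultdict


def group_by_duns(records):
    """Group company records that share the same DUNS number."""
    groups = defaultdict(list)
    for record in records:
        groups[record["duns"]].append(record)
    return dict(groups)


def merge_group(group):
    """Merge duplicates, keeping the first non-empty value seen per field."""
    merged = {}
    for record in group:
        for field, value in record.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged


# Hypothetical records, already matched to a DUNS number (made-up values).
records = [
    {"duns": "000000001", "name": "Example Co", "state": "TX", "phone": ""},
    {"duns": "000000001", "name": "Example Company", "state": "", "phone": "555-0100"},
    {"duns": "000000002", "name": "Sample LLC", "state": "CA", "phone": ""},
]

merged = [merge_group(group) for group in group_by_duns(records).values()]
for company in merged:
    print(company)
```

In practice you would likely want a more careful merge rule, for example preferring the values returned by the service over your own.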


AWS, Cloud, Data Science

Understanding AWS Pricing

AWS Console

Because I get asked so regularly, I made a deck and screencast to help you understand AWS pricing and get the best value from it.  Here is a list of useful tools for understanding AWS pricing:

1) AWS Free Tier information – here
2) AWS Pricing Calculator – here
3) About AWS billing – here
4) RightScale Plan for Cloud cross-cloud pricing calculator – here

Here’s the deck

Here’s the screencast

Feel free to share tips and information that you have for understanding AWS pricing in the comments section of this blog post as well.


Big Data, Data Science, facebook

Visualizing Facebook Birthday greetings using Power Query

To see where the many wonderful birthday greetings I got on Facebook yesterday came from, I tried out Power Query for Excel to visualize their locations.  Other tools I used were LINQPad and the Facebook Developer’s Graph API Explorer tool.

PowerQuery with Facebook Data

Below is a screenshot of the results and a screencast.

Are you using Power Query?  Share your feedback here.

AWS, Azure, Big Data, Cloud, Data Science, Hadoop, Microsoft

New YouTube Series – Hadoop MapReduce Fundamentals

Hadoop MapReduce

I’ve been working with Hadoop MapReduce in several formats over the past couple of years.  I decided to pull together my experience and record this as a free, multi-part screencast series on YouTube.

The course consists of 5 screencasts, each 30 to 50 minutes long.  Each part tackles some aspect of Hadoop MapReduce, from basic conceptual understanding to the most common tuning processes.  Throughout the series, I’ve included screencast demos using a variety of vendor distributions of Hadoop.  These demos include Cloudera CDH4, Windows Azure HDInsight, Amazon Elastic MapReduce and more.

Below is the first module of the course.

Here is a link to the entire PowerPoint deck.

Here is a link to the course demo files.
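
For readers who want the basic idea before watching, here is a minimal sketch of the map, shuffle and reduce steps using the classic word count example. This is plain, single-process Python, not the Hadoop API, and the function names are my own for illustration. A real Hadoop job distributes these steps across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data science"])
print(counts)
```

The point of the split is that each map call looks at one line only and each reduce call looks at one key only, which is what lets a framework such as Hadoop run many of them in parallel.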

Big Data, Data Science, noSQL

Quick Look – Trying out Neo4j

I’ve had two consulting clients looking to solve business problems that seem to be a fit for a particular kind of NoSQL database: a graph database.  What is a graph database, you ask?  Wikipedia has a good definition (and picture!), quoting:

A graph database uses graph structures with nodes, edges, and properties to represent and store data. By definition, a graph database is any storage system that provides index-free adjacency. This means that every element contains a direct pointer to its adjacent element and no index lookups are necessary.

I met some of the Neo4j team at Silicon Valley Code Camp last fall and they talked me into learning more.  Since then, I’ve taken some time to work with the Neo4j model and query language (Cypher) and have used their excellent website as a starting point.  I made a quick video to get you started too – enjoy.
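
To make the "index-free adjacency" idea from the definition above more concrete, here is a small Python sketch. It is not Neo4j and not Cypher. The `Node` class, the `KNOWS` label and the sample names are all made up for illustration. Each node holds direct references to its neighbors, so a traversal just follows those references instead of looking anything up in a separate index.

```python
class Node:
    """A graph node with properties and direct references to its neighbors."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties
        self.edges = []  # list of (relationship label, target Node) pairs

    def connect(self, label, other):
        """Add a directed, labeled edge from this node to another node."""
        self.edges.append((label, other))

    def neighbors(self, label):
        """Follow this node's own edge list; no global lookup is involved."""
        return [node for edge_label, node in self.edges if edge_label == label]


def friends_of_friends(person):
    """Names reachable in two KNOWS hops, excluding the person and direct friends."""
    friends = person.neighbors("KNOWS")
    seen = {person.name} | {friend.name for friend in friends}
    result = []
    for friend in friends:
        for candidate in friend.neighbors("KNOWS"):
            if candidate.name not in seen:
                seen.add(candidate.name)
                result.append(candidate.name)
    return result


alice, bob, carol, dave = Node("Alice"), Node("Bob"), Node("Carol"), Node("Dave")
alice.connect("KNOWS", bob)
alice.connect("KNOWS", dave)
bob.connect("KNOWS", carol)
bob.connect("KNOWS", alice)
dave.connect("KNOWS", carol)

print(friends_of_friends(alice))
```

A graph query language like Cypher lets you describe this kind of multi-hop pattern declaratively rather than writing the loops yourself.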

Agile, AWS, Azure, Big Data, Cloud, Data Science, google, Hadoop, Microsoft, noSQL, Technical Conference

My SoCalCodeCamp decks – Hadoop, ApprovalTests, BigData and more

I am going to be one busy lady on Saturday, June 23 at SoCalCodeCamp at UCSD in San Diego.  Here’s the schedule.  I am presenting at 5 different sessions there, all on Saturday.  Here are the decks and sessions:

1) Harnessing the good intentions of others, increasing contributions to open source projects.  Deck TBD – here’s a video talking about the session (which we will also present in July at OSCON 2012)

2) Intro to SketchUp – presented by my 13-year-old daughter Samantha (I am just there to smile and say ‘that’s my girl!’)

3) Better Unit Testing with ApprovalTests – presenting with Woody Zuill.  Will also be presented at Agile 2012 (national Agile conference in August 2012 – on the Testing Tracks)

4) Intro to Hadoop on Azure – article coming in next month’s MSDN Magazine (publishes July 25, 2012) as well

5) BigData Panel – state of the data industry hosted by Stacey Broadwell.