Big Data, Data Science, noSQL

Quick Look – Trying out Neo4j

I’ve had two consulting clients looking to solve business problems that seem to be a fit for a particular kind of NoSQL database – that is a graph database.  What is a graph database, you ask?  Wikipedia has a good definition (and picture!), quoting:

A graph database uses graph structures with nodes, edges, and properties to represent and store data. By definition, a graph database is any storage system that provides index-free adjacency. This means that every element contains a direct pointer to its adjacent element and no index lookups are necessary.
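To make "index-free adjacency" a bit more concrete, here is a minimal Python sketch of the idea. It is only an illustration, not how Neo4j is implemented: the `Node` and `Edge` classes, the relationship names, and the sample people are all hypothetical. The point is that each node keeps direct references to its own edges, so finding neighbors means following those references rather than consulting a global index.

```python
class Edge:
    """A directed, typed relationship with its own properties (hypothetical)."""

    def __init__(self, rel_type, start, end, properties=None):
        self.rel_type = rel_type
        self.start = start            # direct reference to the start node
        self.end = end                # direct reference to the end node
        self.properties = properties or {}


class Node:
    """A node holding properties plus direct references to its outgoing edges."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties
        self.edges = []               # adjacency is stored on the node itself

    def connect(self, rel_type, other, **properties):
        """Create an outgoing edge that points straight at `other`."""
        edge = Edge(rel_type, self, other, properties)
        self.edges.append(edge)
        return edge

    def neighbors(self, rel_type=None):
        """Follow the stored references; no global lookup table is involved."""
        return [e.end for e in self.edges
                if rel_type is None or e.rel_type == rel_type]


# Made-up sample data, purely for illustration.
alice = Node("Alice", role="consultant")
bob = Node("Bob")
neo4j = Node("Neo4j", kind="database")

alice.connect("KNOWS", bob, since=2012)
alice.connect("USES", neo4j)

known_names = [n.name for n in alice.neighbors("KNOWS")]
all_names = [n.name for n in alice.neighbors()]
```

A relational design would typically store the same relationships in a join table and look them up through an index, which is the step the quoted definition says a graph database avoids.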

I met some of the Neo4j team at Silicon Valley Code Camp last fall and they talked to me about learning more.  Since then, I’ve taken some time to work with the Neo4j model and query language (Cypher) and have used their excellent website as a starting point.  I made a quick video to get you started too – enjoy.

google, noSQL, Technical Conference

A former Softie on Google IO -> Mind = blown

As I wait for my “ride home” at SFO, I am trying to process all that is Google IO.  Having recently emerged from many years in the Microsoft ecosystem, I am feeling like I am in the right place at the right time.

Top Line – Google hits all competition – hard.

Even through the lens (the bubble?) of the converted (and Google IO is full of them), it’s a stark reality that all of us former Softies have to face: the world is changing, and we have to change as well if we want to stay relevant (and employed).  What does this mean exactly?

1) Tablets and phones – it’s a war and Google pulled ahead with the release of Jelly Bean for Android here at IO.  Of course they gave every attendee a Nexus tablet and phone on the first day, so that we would use them at the conference.  Contrast this to the old-school PR show that MSFT ran last week to preview Surface.  No one was allowed to touch their device.  Amazon will fire back with a new Kindle Fire shortly and there are rumours that Apple will release a 7″ iPad as well. Shown below is the ‘swag haul’.

2) Android is developer-friendly – more than 50% of this conference content was aimed toward Android developers.  There were additional giveaways for those attendees who were lucky enough to actually get into the Android sessions.  Not being an Android developer myself (yet), I just watched the huge lines for all of the Android sessions with surprise.  I sat in one session for about 15 minutes, but the screens filled with code bored me silly.

3) Look out Amazon – Google has some serious offerings in the cloud.  The big announcement here was the beta of Google Compute Engine (or IaaS).  Also, BigQuery quietly came out of beta shortly before IO.  It’s essentially a massively distributed query processor (Hadoop-like) that allows you to use SQL queries, rather than MapReduce.  Cheap, simple and powerful – are you paying attention yet?

4) Google gets geeks.  From the conference sell-out (28 minutes), to the sky-diving Google glasses stunt (repeated and EXPLAINED the next day by Sergey himself), to the massive amount of swag, Google speaks in a language that attracts the best and brightest technical talent world-wide.  Lest you think that the open source world has dibs on the geek nirvana that is IO, I heard lots of hallway conversations like this one: “This is my first IO, I felt that it was time to look outside of Microsoft, and, I’m stunned.”

5) The keynote stunt – if you didn’t already see it

The Detail – First the Good

SWAG – Holy crap, I have never, in over 100 technical conferences, gotten this much useful swag.  I mean a phone, a tablet, a Chromebook and a home streaming media player.  That’s just the mainstream stuff – I did get a few more things too…

Preview and incredible demo of Google Glasses – You’ve probably seen it (YouTube above), but I was THERE.  And then, they did it again (and explained how it was done), just because they can.  Damn!  In case you are wondering, yes, of course I signed up to give Google my $1,500 for the first technical preview of Google glasses.

The people – really I just spent most of my hallway time listening.  I am utterly convinced based on what I heard that the smartest, most talented developers on earth were all in one place for 3 days.  It was really astounding.  Between what I heard and the NDAs I have with all major companies, I unfortunately can’t share anything more specific here yet.

Quiet announcements – amidst all the Android and Cloud fanfare a couple of smaller releases caught my eye.  Google Drive SDK 2.0 has some interesting application integration features.  Google Now on the phone, in my initial test, works great when combined with the improved voice search.

More Detail – The Bad

The venue, although cool, was too small.  There were many long, long lines and huge crowds.  Given the attendance (I think around 5,000), there are venues that can hold a crowd this size more comfortably.  The crowd, at times, posed an actual safety risk.

Getting a ticket was too difficult.  I was actually invited, but I heard many horror stories of people opening multiple browsers routed through multiple continents and automating requests to try to get a ticket before the conference sold out (about 20 minutes after it opened).  I mean, really, with all the bragging about ‘Google scale’ infrastructure, I can’t help but wonder, was this by design?

Why did Google take down all of the exhibits at the end of day 2 in a 3 day conference? #fail

G+ – I know they have to try.  They released Events and Party Mode, but still, it’s just not working for me…

Even more Detail – The Ugly

What, no Java? If you want to work with all of the rich Google APIs, then you better be prepared to code in Python.  All of the session code demos that I attended were delivered only in Python, and, if asked, the presenters seemed pretty clueless about Java capabilities.  I am trying to figure out if I should just give up and start coding in Python, or if I should continue to give feedback to the Google Developer Advocate groups that Java is the preferred (at least over Python) language of the enterprise, and that they’ll get broader adoption there if they make a bigger effort to improve both their APIs and documentation in Java.

What, no women?  The ratio of men to women was, well, appalling.  There were times when I was actually physically uncomfortable because I was the only woman for as far as the eye could see.  I spoke to one woman attendee, who remarked, ‘It’s actually better this year…but, I came with my boss, as last year, when I came alone, I found myself in uncomfortable situations several times.’

Bottom line – Go if you can get in

Great conference, great products, great crowd – well worth it.

Agile, AWS, Azure, Big Data, Cloud, Data Science, google, Hadoop, Microsoft, noSQL, Technical Conference

My SoCalCodeCamp decks – Hadoop, ApprovalTests, BigData and more

I am going to be one busy lady on Saturday, June 23 at SoCalCodeCamp at UCSD in San Diego.  Here’s the schedule.  I am presenting 5 different talks there, all on Saturday.  Here are the decks and sessions:

1) Harnessing the good intentions of others, increasing contributions to open source projects.  Deck TBD – here’s a video talking about the session (which we also present in July at OSCON 2012)

2) Intro to SketchUp – presented by my 13 year old daughter Samantha (I am just there to smile and say ‘that’s my girl!’)

3) Better Unit Testing with ApprovalTests – presenting with Woody Zuill.  Will also be presented at Agile 2012 (national Agile conference in August 2012 – on the Testing Tracks)

4) Intro to Hadoop on Azure – article coming in next month’s MSDN Magazine (publishes July 25, 2012) as well

5) BigData Panel – state of the data industry hosted by Stacey Broadwell.

Big Data, noSQL, SQL Server 2012

MongoDB MapReduce vs. SQL Server group by – Which is faster?

I decided to try out a well-written sample on the MongoVue site to compare how MapReduce with MongoDB and a good old T-SQL group by really perform.  I had to make a couple of tweaks to what they had written because:

1) There was one error in the MapReduce code example.  I point this out in the screencast.

2) I wanted to use SQL Server rather than MySQL for the RDBMS test.

I made two short screencasts showing exactly how I proceeded.  For completeness on the SQL Server side I did two comparisons.  The first one was over a heap table.  Next I took the recommendations of the query optimizer, which included creating not only a clustered index, but also a non-clustered index with included columns for the lat/long values.  Below is the 5 minute video showing the detail.

Then I ran the MapReduce job as written in the example on their blog using the MongoVue interface.  It took me a bit of fiddling around to get the sample into the MongoVue interface.  I have a tip if you plan to try to get this to work: use the output console (named ‘Learn Shell’) at the bottom of the MongoVue interface to verify that you’ve entered the MapReduce code correctly and into the correct section of the GUI.  Finally, the ‘In & Out’ section of the MapReduce interface in MongoVue wasn’t very well explained on the MongoVue site.  Take a look at my screencast to see how I chose to work with that section.

It was interesting to note a couple of things through this process:

1) The T-SQL execution proved to be the fastest solution for this problem for a data set of this size, even BEFORE I did any optimization.  The T-SQL (group by) query on the view against the heap table ran in 5 seconds on the 37,000+ records.  After optimization (adding the indexes to the base table and recreating the view), that same query ran in 3 seconds.

2) The MapReduce took 11 seconds to run on my instance of MongoDB. My instance has no sharding and no replicas.
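For readers who haven't seen the two approaches side by side, here is a small Python sketch of why a group by and a MapReduce job can answer the same counting question. It is an illustration only, not the MongoVue sample or my T-SQL: the `records` list, the `region` field, and the function names are all made up, and it says nothing about why one engine ran faster than the other.

```python
from collections import defaultdict

# Hypothetical rows standing in for the real data set.
records = [
    {"region": "west", "lat": 32.7, "long": -117.2},
    {"region": "west", "lat": 37.6, "long": -122.4},
    {"region": "east", "lat": 40.7, "long": -74.0},
    {"region": "west", "lat": 34.1, "long": -118.2},
    {"region": "east", "lat": 42.4, "long": -71.1},
]


def group_by_count(rows, key):
    """Roughly what SELECT key, COUNT(*) ... GROUP BY key computes."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)


def map_phase(row, key):
    """Map step: emit one (key, 1) pair per input row."""
    yield row[key], 1


def reduce_phase(values):
    """Reduce step: collapse all values emitted for one key."""
    return sum(values)


def map_reduce_count(rows, key):
    """Run map, group the emitted pairs by key, then reduce each group."""
    emitted = defaultdict(list)
    for row in rows:
        for k, v in map_phase(row, key):
            emitted[k].append(v)
    return {k: reduce_phase(vs) for k, vs in emitted.items()}


sql_style = group_by_count(records, "region")
mr_style = map_reduce_count(records, "region")
```

Both functions return the same counts per region. The MapReduce version simply splits the work into separate emit and combine steps, which is what lets a system spread those steps across many machines.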

If you’d like to try this out as well, I’ve zipped the source data and sample code and made it available for download here.