This week I’m in Chicago studying Cloudera’s version of Hadoop. I decided to take both the Administrator’s and Developer’s courses to be sure we haven’t missed anything along the way and also to guide the team education process.
So far the materials have been good: aimed at the right audience, clear, concise, etc. For Admins, you’ll need a reasonable degree of Linux skills. I started my career in Unix and later adopted Linux, so the OS portion has been easy. The really interesting areas have come from discovering the deeper choices in Hadoop setups and best practices.
Then comes Cloudera Manager. So far the word is ‘slick’. That’s how I’d describe the way its fairly sophisticated installation and management process works. Of course I knew some of that from my own demos. Cloudera Manager makes setting up large clusters a simple task compared to hand-installing all the bits and editing a dozen or more config files for each node.
I’ll be adding more posts on Cloudera and Hadoop in the coming days.
Our son gave me a banjo last January. I play guitar and do my own maintenance. While I idolize Béla Fleck and the Flecktones, Earl Scruggs, Steve Martin and a few others I never saw myself wanting to take up this particular instrument. After some research I found some replacement parts and had it working in no time. Now I can’t seem to put it down! So now I’m an accidental banjoist of sorts.
Along these lines a number of people have self-described as ‘accidental DBAs’: people who had the duties thrust upon them through no fault of their own. That wasn’t my own path, but it’s one I understand since so many areas in my career came by way of a need.
Microsoft has a fairly recent article based on this idea (http://blogs.technet.com/b/accidental_dba/archive/2011/05/23/introducing-the-accidental-dba.aspx), but many others predate it, such as the ones over at Simple Talk (www.simple-talk.com) by Jonathan Kehayias, Ted Krueger and a few more of their folks.
Similar to my banjo fun, the only way to progress beyond the accidental phase is to study, read books, talk with others, try out ideas and purposely work at the craft.
I’ve been a Data Architect, DBA, Data Analyst, ETL guru, Report Hack, code monkey and script jockey, bit twiddler, and whatever else. You don’t always know which thing will be next in your career. Right now I’m working feverishly in Hadoop/HDFS/Hive/Pig. I have a project. It has a scope, a deadline, a promise. That tends to sharpen my focus.
My musical instrument list has included trumpet, ukulele, guitar and now — believe it or not — banjo. And so it is in the data disciplines. Sometimes it seeks you out rather than the other way around.
I’m always amazed when someone in this crazy field hates long hours. I think to myself, “If you were a cop would you hate to go to crime scenes? If you were a fireman would you hate to go into burning buildings?” Long hours are part of the gig. But that doesn’t mean you should work around the clock either. That can lead to serious burnout. I’ve seen people constantly checking their smartphones, available 24×7, responding to emails at 3:30 a.m.
Along these lines I’m amused by the recent Yahoo! announcement on no longer allowing their employees the ability to ‘Work From Home’. The whole working from home phenomenon came to my house in 1990. I was doing work remotely as a requirement for a job. I thought ‘I’m not disciplined enough for this! I need the structure of an office, see my coworkers…’. I came to find out the danger wasn’t from not working hard enough, rather it was knowing when to stop. I began working longer and longer hours to the point I was routinely working 12+ hours a day. It wasn’t a year later I found myself looking for a new job. It’s no wonder that today we see productivity skyrocketing; everyone’s so busy working!
Knowing when to quit became an important skill — one that’s overlooked by millions today. Put down that phone, iPad, mouse. Stop! Close the door to the home office and be part of your family.
Making time for your own life and your family is far more important than learning the latest tech trick, language, what have you. These will come and go. There will always be work but the kids will only be little for just a while.
Update: A friend of mine (Dan English, Microsoft MVP, @denglishbi) notified me that Lara Rubbelke (@SQLGal, also of Microsoft fame) did a similar piece a while back. I’ve seen many of her presentations (not this one) and she’s awesome! Check out her blog posting at:
I’m building a resource list for Hadoop. At first these will be easily found but I hope to grow the list to include more obscure references.
I want to start with Hive because it’s probably one of the most useful pieces of the Hadoop world for experienced data folks. What good is data if you can’t query and analyze it?
Hive DDL Language Manual
I think one of the more useful resources is a quick reference to the language constructs of Hive, which is similar to, but not exactly like, T-SQL or PL/SQL or any other SQL I recall for that matter.
Table Partitioning in Hive
One thing I picked up early on was the way in which we can easily add and delete large amounts of data in Hive. Having done Table Partitioning in a number of RDBMS platforms including SQL Server, it’s fairly easy to spot how this works. Basically the trick is to declare the table using the EXTERNAL keyword, identifying that the data isn’t directly under Hive’s control but rather external. At that point you’re simply describing the shape of the data. Once that’s done, adding a folder under the location (and registering it as a partition when the table is partitioned) adds data, and dropping the partition takes it back out of the table. I’m going to work on a special posting next on just this topic.
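Here is a minimal sketch of that idea, written as a small Python helper that only builds the HiveQL text; it does not talk to Hive. The table name web_logs, its columns, the dt partition column, the tab delimiter and the /data/web_logs path are all made up for illustration, and the assumption is one folder per partition under the table’s location.

```python
def create_external_table_ddl(table, columns, partition_col, location):
    """Return HiveQL declaring an external, partitioned, tab-delimited table."""
    column_lines = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        f"PARTITIONED BY ({partition_col} STRING)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        f"LOCATION '{location}';"
    )


def add_partition_ddl(table, partition_col, value, location):
    """Return HiveQL registering one folder under the location as a partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_col}='{value}') "
        f"LOCATION '{location}/{partition_col}={value}';"
    )


def drop_partition_ddl(table, partition_col, value):
    """Return HiveQL removing one partition from the table definition."""
    return f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({partition_col}='{value}');"


if __name__ == "__main__":
    # Hypothetical table, columns and HDFS path, for illustration only.
    cols = [("ts", "STRING"), ("url", "STRING")]
    print(create_external_table_ddl("web_logs", cols, "dt", "/data/web_logs"))
    print(add_partition_ddl("web_logs", "dt", "2013-03-01", "/data/web_logs"))
    print(drop_partition_ddl("web_logs", "dt", "2013-03-01"))
```

Because the table is external, dropping a partition should only remove it from Hive’s metadata; the files stay where they are unless you delete the folder yourself.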
The Best Book on Hive (so far): Programming Hive
As a third resource I highly recommend the best book on Hive from O’Reilly by Capriolo, Wampler and Rutherglen. It’s a fairly small read with good examples. Given that there aren’t many resources outside of Apache or a vendor site it’s a reasonable attempt to explain Hive.
I’ve been working through the deeper context of Hadoop these last few months and I have to say I’m impressed! I was a skeptic in every sense of the word until I spent a lot of time with Hive, Pig and Impala.
I’m including Cloudera’s CDH-4 in my Big Data strategy because so far it’s been reasonably performant and very stable. Right now it should barely run: we’re running on virtual machines and shared storage. Could it be worse than that for this platform? I doubt it!
Yet it continues to pound out some pretty big queries with relative ease. I’m going to be spending a lot of the space here discussing the various setups I went through and what I think is the final best approach for my project.
I’ve been a huge fan of SQL Server for a very long time now. I got my start in Hutchinson, Kansas, at a well-known salt factory that owned SQL Server 4.3 and needed help. I knew a lot of SQL code back in 1996 but hadn’t worked with the Microsoft variety. I did a few basic queries for them and really enjoyed how it seemed to speed along. By 2000 it was nearly all I did except for an occasional Oracle or DB2 gig.
SQL Server 2012 is another huge leap for the product. I won’t bother enumerating all the benefits here — you can Google that yourself. I will say it’s the best release yet and I really love what they did with the place!
Alas, SQL Server is still SQL Server, and getting tables over the 1 billion row mark gets interesting. I’ve always noticed an odd slowdown at 200 million rows. By 800 million it’s pretty noticeable. By 2 billion it’s predictably worse. One thing I’ve noticed that helps is chopping up big tables using Partitioned Views. Keeping member tables under 500 million rows this way helps in many areas, such as doing Filegroup Backups, dropping whole tables in the scheme for deletes, and indexing member tables specifically to the age of the data. All of these help overcome the size monster.
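To make the Partitioned View idea concrete, here is a rough sketch in Python that only generates the T-SQL text. The Sales tables, their three columns and the one-table-per-year split are hypothetical; the point is the shape: each member table carries a CHECK constraint on the partitioning column, and the view is a UNION ALL over the members.

```python
def member_table_ddl(base, year):
    """Return T-SQL for one member table that holds a single year of rows."""
    return (
        f"CREATE TABLE dbo.{base}_{year} (\n"
        "  order_id BIGINT NOT NULL,\n"
        "  order_date DATE NOT NULL\n"
        f"    CHECK (order_date >= '{year}-01-01' AND order_date < '{year + 1}-01-01'),\n"
        "  amount DECIMAL(18, 2) NOT NULL,\n"
        "  PRIMARY KEY (order_id, order_date)\n"
        ");"
    )


def partitioned_view_ddl(base, years):
    """Return T-SQL for a view that UNION ALLs the member tables, oldest first."""
    selects = "\nUNION ALL\n".join(
        f"SELECT order_id, order_date, amount FROM dbo.{base}_{year}"
        for year in sorted(years)
    )
    return f"CREATE VIEW dbo.{base} AS\n{selects};"


def drop_member_ddl(base, year):
    """Return T-SQL that removes a whole year by dropping its member table."""
    return f"DROP TABLE dbo.{base}_{year};"


if __name__ == "__main__":
    # Hypothetical schema: one member table per calendar year.
    for year in (2011, 2012):
        print(member_table_ddl("Sales", year))
    print(partitioned_view_ddl("Sales", [2011, 2012]))
    print(drop_member_ddl("Sales", 2011))
```

Dropping a member table is what stands in for a huge DELETE here; the view then has to be redefined without that table, which the sketch leaves out.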
At PASS 2011 Microsoft announced several big things:
- A Connector for Sqoop
- Relationship with Hortonworks
- Parallel Data Warehouse
All these things go together predictably since they are part of the same technology stream.
So far I’ve been able to use the Sqoop connector with SQL Server 2008 R2 and SQL Server 2012 without issue. I can query data from SQL Server and upload data to it from Hadoop. It’s not blazingly fast but good enough.
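For anyone who hasn’t seen Sqoop, the two directions look roughly like the sketch below. It is a Python helper that only assembles the command lines; it does not run Sqoop. The host dbhost, the Sales database, the Orders tables, the etl_user login and the HDFS paths are all invented, and I’m assuming a plain JDBC connection string with the password prompted for by -P rather than anything specific to the Microsoft connector.

```python
import shlex


def sqoop_import_args(jdbc_url, username, table, target_dir, mappers=4):
    """Build the argument list for pulling a SQL Server table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of putting it on the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


def sqoop_export_args(jdbc_url, username, table, export_dir, mappers=4):
    """Build the argument list for pushing HDFS files into a SQL Server table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "-P",
        "--table", table,
        "--export-dir", export_dir,
        "--num-mappers", str(mappers),
    ]


if __name__ == "__main__":
    # Hypothetical server, database, login, tables and paths.
    url = "jdbc:sqlserver://dbhost:1433;databaseName=Sales"
    print(shlex.join(sqoop_import_args(url, "etl_user", "Orders", "/data/orders")))
    print(shlex.join(sqoop_export_args(url, "etl_user", "OrdersSummary", "/data/orders_summary")))
```

The mapper count is the main knob for throughput in this sketch; four is just a placeholder, not a recommendation.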
So over the next couple of weeks I intend to share some insights into this world. What Hadoop topics would you like covered? Let me know and I’ll do my best.
I’ve been working with data of every sort for over 25 years. Before that, I had a fun starter career working on the NASA Space Shuttle program. That came to an end about 9 months after the Challenger Accident once all the questions had been asked and answered. There were about 14,500 of us looking for jobs at the same time so I rode my motorcycle to Kansas, met my wife and transitioned my hobby into my dream job — data!
Having worked in a dozen database products spread over probably that many operating systems (depending on how you count ‘em) there are a lot of ideas I notice getting recycled.
First of all, Big Data is not new! While I wasn’t around for the 1890 Census, I saw evidence of it all over my earlier days. The evidence is the invention of a card-driven counting machine by Herman Hollerith that reduced the task from 8 years to 1. That’s what I’m talking about!
The 1890 Census was also the first time women were allowed to be ‘tabulators’. Women and data go back to the beginning: I’m in very good company!
To read more, please see http://en.wikipedia.org/wiki/1890_United_States_Census
Big data is in the eye of the beholder. If you have a mountain of data and not enough machines or process then you need to come up with a better idea. Hollerith did, and in the process planted the seed of what became IBM.
These days I’m working on Hadoop in conjunction with SQL Server 2012. So far it’s very promising. To my SQL Server friends I say come join me. To my other data friends, it’s nice returning home to a *Nix platform.
With the rest of my posts I intend to discover the good and the bad of mashing these two worlds together and share my findings here.