I recently gave a talk at Data Wranglers DC on retrieving big data. The goal of the talk was to introduce Hadoop to business analysts in the barest, no-nonsense form and to cover one Pig hurdle I faced. Read: I gave really shitty explanations of Hadoop core components, like "All you need to know about MapReduce is that it is a way of messing with data in Hadoop," and I pointed out one area where I think the Pig tutorials go wrong and how smarter people than I have fixed it.
In my mind, that is perfectly reasonable. If I'm writing Pig, which is an easier-to-use scripting-ish language for Hadoop, I really don't need to know shit about how to write MapReduce jobs in Java. And while everyone generally agrees that more information is better, I don't believe that applies here.
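To show what I mean, here's word count, the "hello world" of MapReduce that takes dozens of lines of Java, done in a handful of Pig statements. This is just a sketch; the HDFS paths and field names are made up:

```pig
-- classic word count in Pig (made-up HDFS paths)
lines  = LOAD '/data/shakespeare.txt' AS (line:chararray);
-- split each line into a bag of words, then flatten to one word per row
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO '/data/wordcount';
```

No Java, no mappers, no reducers. Pig generates all of that for you.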
If I want to teach an Excel user, someone who's never even used the command line, how to query data in a distributed system like Hadoop, the most important factor in passing some valuable knowledge to them is how much shit I can leave out. Mention Zookeeper, Hive, Flume, Kafka? Fuck no! That will only confuse and bewilder. I want people to know that there is this system called Hadoop that is the predominant software behind the Big Data buzzword. Hadoop is made up of two core components: HDFS and MapReduce. HDFS is a file system where data is stored, and MapReduce is a way of messing with that data. Bam. Core Hadoop explained.
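And if you want to see those two components in action without writing a line of Java, Pig makes the split visible: LOAD points at a file sitting in HDFS, and EXPLAIN prints the MapReduce jobs that Pig compiles your script into. Again, a sketch only; the path and fields are invented:

```pig
-- the file lives in HDFS; Pig turns the script into MapReduce jobs
sales    = LOAD '/data/sales.csv' USING PigStorage(',')
           AS (region:chararray, amount:double);
byregion = GROUP sales BY region;
totals   = FOREACH byregion GENERATE group AS region, SUM(sales.amount) AS total;
-- EXPLAIN prints the MapReduce plan instead of running it
EXPLAIN totals;
```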
When developers build applications on Hadoop, they often use a database like HBase (Hadoop Database) to store data. HBase sits on top of HDFS, so you never have to touch HDFS directly (hence my crappy explanation of it). If an analyst wants to retrieve data from HBase, they can write MapReduce jobs in Java (yuck) or they can write short scripts in Pig (bangerang, Peter Pan!).
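For the HBase case, Pig ships with an HBaseStorage loader, so pulling rows out looks roughly like this. The table name, column family, and columns here are all invented for the example:

```pig
-- read two columns from a (hypothetical) HBase table called 'users'
raw = LOAD 'hbase://users'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'info:name info:age', '-loadKey true')
      AS (id:bytearray, name:chararray, age:int);
adults = FILTER raw BY age >= 21;
DUMP adults;
```

The `-loadKey true` option tacks the HBase row key onto each record as the first field, which is usually what you want when you're poking around.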
So if you just want to get your hands on some data in Hadoop, I think you only need to care about HBase and Pig. But that's just me. Read the slides here.