Getting to Second Base with HBase
What non-coding business analysts need to know to manipulate Big Data
NoSQL databases are a class of databases used in Big Data applications. They provide tremendous advantages to Big Data developers: fault-tolerance, flexible schema, and fast query performance. HBase is one of the most popular examples of a NoSQL database.
However, there is a common misconception that what makes HBase and other NoSQL databases good for developers also makes them good for business analysts. In fact, this is not the case at all. For an analyst used to working with relational databases, writing queries against HBase can be more complicated and much slower than if the same amount of data were stored in a relational database.
This flies in the face of all the hype surrounding NoSQL databases and Hadoop/Big Data in general. What’s the point in using these new technologies if they don’t make things easier? There are a number of deservedly irate business analysts who feel left out of our brave new Big Data world. I will argue, however, that if you are willing to learn a little bit about how HBase is designed, you can achieve the query performance that the marketers promised. If you stick with me I will show you how to get to second base with HBase: far enough to have a good time but not too far to get you in trouble.
You can kick butt with Big Data
First, we need to provide a little background on Big Data as a concept. Big Data is more than a buzzword. It is a complicated set of software technologies that in concert allow developers to store and manipulate large amounts of data in a cost-effective way. One need not be Steve Jobs to see the business potential here, and as such there is an explosion of demand for people with the skills to manipulate large amounts of data. But what if you are not a software developer? Can you still take part in this data renaissance? Of course you can.
Whereas software developers need to concern themselves with difficult tasks like writing MapReduce jobs (which we will briefly touch on later), business people should begin by learning to write queries against the databases that these developers use to store Big Data. For many business tasks, I may want to query lots of data but I only need to return a small portion of it, e.g., the highest selling item among all of the sales in the Midwest region. In general terms, this means that I don’t need to concern myself with much of what makes processing Big Data so cumbersome. Hooray!
So what are these databases I should focus on querying? These databases are collectively known as NoSQL databases and writing queries against them can be downright persnickety. But have no fear as together we will discuss what you need to know to retrieve data effectively. In particular, we will focus on the most popular NoSQL database, HBase. HBase stands for the Hadoop Database, which is the only software project in the entire Hadoop ecosystem with a name that at least tries to make sense (seriously, Hadoop - wtf?). As Hadoop is in HBase’s name, this necessitates a very brief introduction to this thing called Hadoop.
Hadoop – what makes Big Data possible for mortals
From now on, whenever you hear someone say Big Data I want you to think of Hadoop. Hadoop is the software ecosystem that made this whole Big Data craze possible. In the ancient times of pre-2003, it was either very difficult or very expensive to store and process more data than one could fit on a single computer. Your options were either to build your own distributed processing framework, i.e., write custom code to use multiple, inexpensive computers to work in parallel (very difficult), or find your local Oracle, IBM, or Teradata salesperson and spend several million dollars on a custom server (and do not forget to save a few million for the next model when your data outgrows the one you just bought, which it most certainly will).
Obviously neither of these options was very attractive, and as such there wasn’t a lot of big data processing. Then again, we were also not forced to read banal articles about the sexiness of Big Data in Harvard Business Review, so there was a silver lining. Everything changed with the release of Hadoop, an open-source (read: free to use and modify) software project that provided a framework for stringing together cheap commodity servers to store and process large amounts of data.
Hadoop is able to store and process large amounts of information thanks to two key software innovations: HDFS and MapReduce. You don’t need to know much about either in order to use HBase, but a general understanding of HDFS and MapReduce will go a long way.
HDFS, the Hadoop Distributed File System, allows one to store data across multiple machines. It is a file system similar to those on Linux, Mac, and Windows PCs (like what a Windows user would see when they look in their Documents directory). The reason why HDFS is special is that it is smart enough to know where all of the data is stored across the cluster of computers. If one server dies, which happens more often than you would think in large clusters, your data is not lost because HDFS has already replicated it on other computers. This means that HDFS makes distributed storage (storing data on lots of computers) possible.
Whereas HDFS allows for distributed storage, MapReduce is the software paradigm that allows for distributed processing, i.e., using all of those computers to actually do something with your data. MapReduce requires developers to write their processing jobs in a very specific format. For Java developers, this format means every job has a Map class and a Reduce class. The rest of us should think of MapReduce as the software version of Shakespeare’s iambic pentameter: one is still writing in English but the format is very distinctive. If that sounds like a pain, you are correct. One dirty little secret in the Big Data world is that processing large amounts of data isn’t that easy.
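If you are curious what the Map and Reduce format looks like, here is a minimal sketch in Python that runs on a single machine. It is not Hadoop code. The sales records and the function names (map_sale, reduce_sales, top_midwest_item) are made up for illustration, and a real job would spread both steps across many computers.

```python
from collections import defaultdict

# Hypothetical sales records: (region, item, units sold)
SALES = [
    ("Midwest", "snow shovel", 3),
    ("Midwest", "rain boots", 5),
    ("South", "sunscreen", 7),
    ("Midwest", "snow shovel", 4),
]

def map_sale(record):
    """Map step: emit an (item, units) pair for Midwest sales only."""
    region, item, units = record
    if region == "Midwest":
        yield item, units

def reduce_sales(pairs):
    """Reduce step: add up the units emitted for each item."""
    totals = defaultdict(int)
    for item, units in pairs:
        totals[item] += units
    return dict(totals)

def top_midwest_item(records):
    # Run the map step over every record, then reduce the emitted pairs.
    pairs = [pair for record in records for pair in map_sale(record)]
    totals = reduce_sales(pairs)
    return max(totals, key=totals.get)
```

Calling top_midwest_item(SALES) touches every record but hands back a single item name, which is the "query a lot, return a little" pattern from earlier.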
Thankfully for us we do not have to concern ourselves with MapReduce to get started querying Big Data. As business people it is more likely we will want to retrieve data from an existing Big Data application rather than create one ourselves. And where is the data stored in a Big Data application? In a NoSQL database like HBase.
Why use a database at all?
This is a fair question. After all, I told you that HDFS allows for distributed storage. So why can’t we just store our information there? The answer is all about speed. HDFS stores data as large files, which it splits into large blocks. Let’s say I build a wildly successful Tinder clone for people who like Hadoop and Shakespeare. Millions of people sign up for my app. When I log in, I want my profile returned to me with the numerous updates of women vying for my attention.
If that information were stored in HDFS, my application would have to read the entire HDFS file in which my information was stored. This file would contain the information of several other people as well, meaning that my app would then have to parse the file to get just what it wanted. It would be more efficient if my app could go directly to the record that contained my profile information without having to read in the data for anyone else. The easiest way to do this is with a database, where we store data as records for efficient retrieval.
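The difference between reading a whole file and going directly to a record can be sketched in a few lines of Python. This is a toy model under my own assumptions: the profiles and the function names are invented, a Python list stands in for a file, and a dictionary stands in for a database.

```python
# Hypothetical user profiles for our Hadoop-and-Shakespeare dating app.
PROFILES = [
    ("hamlet42", "To code or not to code"),
    ("ophelia", "Loves streams, hates rivers"),
    ("yorick", "Alas, poor cluster"),
]

def find_by_reading_everything(user_id):
    """File-style access: read records one by one until we find a match."""
    records_read = 0
    for stored_id, profile in PROFILES:
        records_read += 1
        if stored_id == user_id:
            return profile, records_read
    return None, records_read

# Database-style access: records are organized by key ahead of time.
PROFILES_BY_KEY = dict(PROFILES)

def find_by_key(user_id):
    """Keyed access: go straight to the one record we asked for."""
    return PROFILES_BY_KEY.get(user_id)
```

With three profiles the difference is invisible. With millions of profiles, the first function reads other people's data over and over while the second does not.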
But why NoSQL databases?
There are three principal reasons why relational databases do not work well in distributed environments. The first has to do with what we discussed earlier, that the magic of HDFS was that it was fault-tolerant, i.e., that if one of the servers in your cluster dies, your data will not be lost. Unfortunately relational databases, like Oracle, SQL Server, or even Access, are designed to work on single servers, not clusters of them. This doesn’t mean you can’t run a relational database on a cluster (this process is called sharding), but it does require a lot of extra code and developer time. And for really large amounts of data this becomes really hard.
The second reason has to do with schema. Relational databases require fixed schemas, i.e., you specify the column headers and the formats of the values. This is fine if you know exactly what kind of data you will be storing. But in the Big Data world, developers wanted a more flexible approach. If they needed to change the type of data they were storing they wanted the flexibility to do so (sounds reasonable enough, right?). NoSQL databases fit the bill well. They offer a more flexible schema, or one could say, do not require a schema at all. What this means is that even if the developer doesn’t know the structure of the data she needs to store, NoSQL databases can accommodate that.
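A flexible schema is easier to see than to describe. The sketch below is a toy model in Python, not HBase's actual API: the put and columns_of helpers and the sample users are hypothetical. The point is only that two records in the same table can carry different columns without anyone redefining the table first.

```python
# Each row is just a dictionary, so rows need not share the same columns.
rows = {}

def put(row_key, column, value):
    """Store a value under a row key and column, creating either as needed."""
    rows.setdefault(row_key, {})[column] = value

def columns_of(row_key):
    """List the columns that a given row actually has."""
    return sorted(rows.get(row_key, {}))

put("user-001", "name", "Beatrice")
put("user-001", "email", "beatrice@example.com")
put("user-002", "name", "Benedick")
put("user-002", "favorite_play", "Much Ado About Nothing")  # a brand-new column
```

Here user-001 has an email and user-002 has a favorite play, and neither row is penalized for lacking the other's column.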
The third reason is performance. NoSQL databases are designed to retrieve individual records from among billions of records extremely quickly. I list this point last because it will be the basis of our future discussion about HBase. NoSQL databases can do these individual lookups wildly fast, but only for very specific types of queries. That is to say you have to follow a predetermined access pattern to the data you want. If NoSQL database queries were a highway with no speed limit, the access pattern would be a toll booth operator at the highway entrance who models himself off of Seinfeld’s Soup Nazi. You will have to make your request in a very specific format if you want the good stuff.
Which format you will need to use is entirely dependent on how the developers built the NoSQL database you would like to query. And just like with the Soup Nazi, you need to know this format before you make your first order, i.e., write your first query. Otherwise your query performance will be dreadful. This nuance is often missed in business writing about NoSQL databases. Writers love to extol the speed but neglect to mention that the speed is a byproduct of a specific, limited query format.
But before we go into how you should format your queries in HBase, let us pause to define what a NoSQL database is. NoSQL, which stands for not-SQL or not-only-SQL, defines itself by what it is not without any explanation as to what it actually is. Imagine if I introduced a new fall sport and called it “not-football”. You might have a tough time understanding what I mean.
What I want you to understand is that by NoSQL we mean the benefits we discussed above: databases that scale on clusters of computers, have flexible schemas, and allow for blazing fast query performance of billions of records provided you write your query correctly. While NoSQL databases provide tremendous benefits to developers, those same benefits can cause a great deal of pain to analysts. Now we will discuss how you can avoid some of that pain with the most common NoSQL database, HBase.
Getting to second base
So now let us talk about HBase. HBase is the Hadoop Database and, as you might infer from the name, is a very popular NoSQL database to use with Hadoop. HBase is an open source implementation of the design behind Google’s Bigtable, which is something nerds care about so mention that to them if you want to sound cool.
Here is what you need to know about HBase: it is all about the row key. In relational databases we use row keys (also known as row ids) to join tables together, but we don’t care at all how they are composed. In fact, if we are writing a query on a single table, we won’t mention them at all. Let’s look at an example.
We might write queries like:
SELECT * FROM SALES WHERE SALES.PRICE > 100
Even if you have never seen SQL before you can probably infer what this query means: select every record from the sales table where the value for sales price is greater than 100.
This type of query, filtering data from a single table, is a basic staple in both relational and NoSQL databases. The big difference is that depending on which value you use to filter in your HBase table, the HBase query will be horribly slow. In fact, there is only one value in any HBase table on which you can filter quickly. And that is the row key. You certainly can filter on fields other than the row key, but those queries will be really slow. And that is not what we signed up for.
So we can only filter on the row key? Pretty much. We want to filter only data in the row key, which is wildly different than filtering in SQL. In SQL we join on row keys and filter on values (column headers) in a table. In the SQL statement above, we didn’t even mention the row key anywhere because we were filtering a single table. It didn’t matter at all. Conversely, in HBase the row key counts for pretty much everything.
This is because with HBase we put the data we want to retrieve in the row key. Feel free to scream at this point. Data in the row key? Yes. I want this to sound shocking because it is. Here’s why developers put the information they want to retrieve in the row key itself. In HBase, the row key is the only indexed field, which means that it is the only field that can be returned quickly.
This necessitates a brief sidebar about indexes.
If you’re unfamiliar with database indexes, just think of the index in a book. Let’s say you are writing a study of Freemasonry in Russian literature. You pick up a copy of Tolstoy’s magnum opus, War and Peace, and because you haven’t read it (you chump, you), you want to know if the novel contains any references to the Freemasons. So what do you do? You open the end of the book and look for the book’s index, keeping your fingers crossed that the term “Freemason” is listed there.
With relational databases you can have lots of indexes. Just like with books, these indexes take up space, so you can’t make indexes for every field. HBase tables are so large (Big Data, remember) that by design HBase builds only one index. That index is the row key.
So we know that if we can find an index entry for Freemason we can quickly go to the page with the data we need. But what happens if there isn’t an entry for Freemason in the index? How would you go about checking?
The scary answer is that you would have to scan the entire book. As you can imagine (or maybe you don’t have to imagine if you’ve ever crammed an entire term paper into one evening) this would take a very long time. Scanning all of War & Peace to look for references to Freemasons is exactly like writing a query against HBase in which you do not filter on the row key. The query will be forced to scan the entire table and that will be horribly slow.
To continue our example, let’s say that you discover a computer-savvy literature buff who has created a massive index of all references to secret societies in world literature. Rather than print a gigantic book, he creates an HBase table to store each secret society and the books in which it is referenced, along with the text from the book in which the reference is made. And he designs the row key to give us the fastest access to the data we want. This may sound a little confusing so let’s use an example below.
Format of our data:
Secret Society – Country of Origin – Book Title – Page Number: text: Lots of free text
Freemasons – Russian – War & Peace – Page 385: text: Pierre meets a Mason while traveling
Freemasons – Russian – War & Peace – Page 456: text: Pierre is asked to join the Masons
Look at the information to the left of the first colon. As we read from left to right we narrow down on the information that we want. Remember that the database is storing references in literature for all secret societies. Thus the outermost layer is the secret society name, which in this case is “Freemasons”. The next rung is the country from which the book originated, which is then followed by the book title and the page number. These items collectively make up our table’s row key.
After the first colon we see the single word, text. This field is known in HBase parlance as a column family. Please just think of it as the column header in a typical relational table (or spreadsheet). Finally, the value to the right of the second colon is the actual data we would like to retrieve. This is our column value.
We can represent that data as the table below.
#Table name: secret_stuff
Row key | Column Family | Value
Freemasons – Russian – War & Peace – Page 385 | Text | Pierre meets a Mason while traveling
Freemasons – Russian – War & Peace – Page 456 | Text | Pierre is asked to join the Masons
Freemasons – American – Angels & Demons – Page 97 | Text | Robert Langdon travels to Rome
Illuminati – American – The DaVinci Code – Page 100 | Text | Robert Langdon reaches New York
Skull & Bones – American – George Bush: My Life – Page 110 | Text | George goes to college
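For the curious, the table above can be modeled in a few lines of Python. This is only a sketch of the idea of a composite row key, not how HBase stores anything internally. The helper names (make_row_key, split_row_key) and the choice of a dictionary are my own assumptions.

```python
SEPARATOR = " – "  # the same separator used in the table above

def make_row_key(society, country, title, page):
    """Join the parts from broadest (society) to narrowest (page)."""
    return SEPARATOR.join([society, country, title, f"Page {page}"])

def split_row_key(row_key):
    """Recover the individual parts of a composite row key."""
    return row_key.split(SEPARATOR)

# Row key -> {column family: value}, mirroring the secret_stuff table.
secret_stuff = {
    make_row_key("Freemasons", "Russian", "War & Peace", 385):
        {"Text": "Pierre meets a Mason while traveling"},
    make_row_key("Freemasons", "Russian", "War & Peace", 456):
        {"Text": "Pierre is asked to join the Masons"},
    make_row_key("Freemasons", "American", "Angels & Demons", 97):
        {"Text": "Robert Langdon travels to Rome"},
    make_row_key("Illuminati", "American", "The DaVinci Code", 100):
        {"Text": "Robert Langdon reaches New York"},
    make_row_key("Skull & Bones", "American", "George Bush: My Life", 110):
        {"Text": "George goes to college"},
}
```

Notice that the society, country, title, and page are not separate columns. They live inside the key itself, which is exactly the part that feels strange if you come from SQL.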
So how would we query for Freemason references in War & Peace? We will step through two examples. The first will perform very slowly, and the second will take advantage of our table’s row key to run much faster.
Let’s take a look at a query that would not perform efficiently in this table. Let’s assume we are lazy and want to filter only the text that references “Freemasons”. If we were to write this query against this table (which we have cleverly named secret_stuff), it might look like this:
#HBase pseudo-code sample
#scan table for references to Freemasons
WHERE 'Freemason' in text
Because we do not provide any row key filters, HBase will begin scanning at the first record in our table and not stop scanning until it reaches the end (just as you would have done with the physical book). As you can imagine, this is extremely slow when you have billions of records.
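Here is what that full scan amounts to, sketched in plain Python rather than real HBase code. The list of rows copies our secret_stuff table, and the function name full_scan_for_text is hypothetical. The thing to watch is the counter: every row gets examined no matter how few rows match.

```python
# (row key, Text value) pairs copied from the secret_stuff table.
SECRET_STUFF = [
    ("Freemasons – Russian – War & Peace – Page 385",
     "Pierre meets a Mason while traveling"),
    ("Freemasons – Russian – War & Peace – Page 456",
     "Pierre is asked to join the Masons"),
    ("Freemasons – American – Angels & Demons – Page 97",
     "Robert Langdon travels to Rome"),
    ("Illuminati – American – The DaVinci Code – Page 100",
     "Robert Langdon reaches New York"),
    ("Skull & Bones – American – George Bush: My Life – Page 110",
     "George goes to college"),
]

def full_scan_for_text(table, word):
    """Check the Text value of every row, because no row key filter was given."""
    rows_examined = 0
    matches = []
    for row_key, text in table:
        rows_examined += 1
        if word in text:
            matches.append(row_key)
    return matches, rows_examined
```

Searching for "Mason" this way finds two rows but still examines all five. Swap five for five billion and you can see the problem.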
If we are able to narrow our query in some fashion, we can make use of the row key as an index (just like the index in the back of the book).
In the pseudo-code below, we instruct HBase to scan only the rows whose keys begin with “Freemasons-Russian-War&Peace”. This allows our query to run extremely quickly, because we are telling HBase where to begin and where to end scanning.
#HBase pseudo-code example
#scan table for War & Peace Freemason text
WHERE row_key LIKE 'Freemasons-Russian-War&Peace%'
Even if we broaden our filter to just “Freemasons”, thus including all possible geographic regions and book titles, we will still return our results much faster than just filtering the “Text” value for “Freemasons”. This is because our row key is designed such that all the Freemason entries are grouped together. We know that any record that does not begin with “Freemasons” will not have the information we want. As such, we can tell HBase to scan only the records we want.
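The reason a row key prefix is fast comes down to sorted order: because rows are kept sorted by key, everything that starts with the same prefix sits together. The Python sketch below imitates that with a sorted list and the standard bisect module. It is a simplified model under my own assumptions, not HBase's implementation, and prefix_scan is a made-up name.

```python
from bisect import bisect_left

# Rows from the secret_stuff table, kept sorted by row key.
ROWS = sorted([
    ("Freemasons – Russian – War & Peace – Page 385",
     "Pierre meets a Mason while traveling"),
    ("Freemasons – Russian – War & Peace – Page 456",
     "Pierre is asked to join the Masons"),
    ("Freemasons – American – Angels & Demons – Page 97",
     "Robert Langdon travels to Rome"),
    ("Illuminati – American – The DaVinci Code – Page 100",
     "Robert Langdon reaches New York"),
    ("Skull & Bones – American – George Bush: My Life – Page 110",
     "George goes to college"),
])
KEYS = [row_key for row_key, _ in ROWS]

def prefix_scan(prefix):
    """Jump to the first key at or after the prefix, then stop at the first miss."""
    start = bisect_left(KEYS, prefix)  # binary search, no row-by-row reading
    matches = []
    rows_examined = 0
    for row_key, text in ROWS[start:]:
        rows_examined += 1  # counts matching rows plus the one row that ends the scan
        if not row_key.startswith(prefix):
            break
        matches.append((row_key, text))
    return matches, rows_examined
```

Scanning for the prefix "Freemasons – Russian – War & Peace" returns the two War & Peace rows after examining three rows, and the broader prefix "Freemasons" returns three rows after examining four. In both cases the Skull & Bones row is never looked at, and on a table with billions of rows that is where the speed comes from.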
HBase, like all of the NoSQL databases, offers tremendous advantages for developers who want to build applications that store and process Big Data. Unfortunately what makes HBase good for building Big Data applications – fault-tolerance, flexible schema, and excellent query performance – is not necessarily helpful for analysts who need to query HBase to do their job. Namely, if analysts do not follow the access pattern constructed by the developers, query performance will be horrendous. However, if you can remember to analyze the row key in your HBase table before you start writing queries, you will be able to query billions of records extremely quickly.