This is a quick example of using Apache Spark, the big data processing engine, and Watson Analytics, a new analysis and visualization tool, to do a basic word count and analyze the results. For fun, we are using Leo Tolstoy's War and Peace. As you're likely aware, Tolstoy's magnum opus is one hefty book, so figuring out which words Tolstoy uses most often is actually interesting. I for one want to know what words are used most often in the novel many consider to be the finest of all time.
Tutorials I stole from
Many thanks to Dave Thomas for the edits
Final version of tutorial (don't cheat, just for your reference)
If Hadoop made Big Data storage and processing accessible to your average software developer, Spark has opened the door for English majors like me. Yes, you still have to code, but writing Spark is faster and easier than writing code for Hadoop/MapReduce. And this ease of use doesn't come with a performance hit. In fact, Spark is way faster than Hadoop/MapReduce. If you are on the fence about this whole coding thing, it's time to get involved.
Why Watson Analytics?
Watson Analytics is a new product from IBM that harnesses the power of Watson and applies it to small scale data processing tasks. It has a pretty nifty user interface, so if you find spreadsheets ugly and difficult to use, Watson Analytics represents an alternative.
Truthfully, I haven't dug too deeply into Watson Analytics. But it is so easy to get started - just upload data and Watson will suggest questions you might want to ask and then graph the results - that there is no reason not to play with it. All we are going to do today is use Watson Analytics to make a treemap of our data, since Excel doesn't offer the treemap as a chart type. Simple enough, right?
- Install Python
- Install Spark
- Sign-up for Watson Analytics (free)
- Download War and Peace
- Run Spark Code to Generate Top 100 Words
- Move Output into Watson Analytics and Visualize as TreeMap
I am using the Anaconda distribution of Python. If you have never installed Python before, this distribution is super helpful because it provides a nice Windows installer (sorry, I'm on Windows here, but there is an installer for Mac as well).
I have installed Spark directly on Windows, which is unusual. Most people will probably run Spark through a VM (virtual machine - a separate computer that runs as software within your computer) or a Docker container (same idea, but a higher level of abstraction). Unlike writing MapReduce or Pig (Pig is the scripting language non-Java developers use for Hadoop), you do not need to have Hadoop installed to use Spark. Because Spark is just a processing engine, it can read data from basically anywhere.
Download Spark from here. I chose version 1.5.2 (latest as of this writing) and "pre-built for Hadoop 2.6 or later". Please note:
- You do need to install Java. Google "install java" and you'll find loads of great tutorials.
If you have successfully installed Java and set the corresponding system variables (your tutorial will walk you through this), you can open up your Windows command shell (cmd.exe) and type java -version. If your Java version is printed to the console, then you are good to go.
- Unzip Spark and note the file path to Spark's bin directory, e.g. C:\spark-1.5.2-bin-hadoop2.6\bin
- Add the above path to your PATH variable.
I won't show you how to do this permanently; you'll need to look that up yourself. To set the path for just this session, i.e., until you close your command shell, type the following into your command shell, using the file path for where you installed Spark, ending with the bin directory.
On my machine, that looks like this:
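(This is a sketch of the standard cmd.exe syntax; I unzipped Spark to C:\spark-1.5.2-bin-hadoop2.6, so adjust the path if yours lives elsewhere.)
set PATH=%PATH%;C:\spark-1.5.2-bin-hadoop2.6\bin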
The version of Spark I'm using requires an environment variable called HADOOP_HOME, which is (I assume) because I am using a version of Spark pre-built for Hadoop. I do not see an option on the Spark downloads page to download a version not pre-built for Hadoop. You can download the source code and build Spark yourself, but I do not know how to do that and it sounds hard.
So we need a little work around. Go here, a Jira for Spark. This is going to sound sketchy, but scroll half-way down and find the link to download winutils.exe (This will download an exe file to your machine when you click on it): http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
Note: I had to do this to get my installation to work, but I have no idea what this executable is. And I know that sounds bad.
Now we set the environment variable for HADOOP_HOME like so:
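Here is a sketch - and the exact value is an assumption on my part, since it depends on where winutils.exe ended up; point HADOOP_HOME at whatever directory holds it (mine is at the root of C:\):
set HADOOP_HOME=C:\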
Again, use the path to winutils on your machine. I moved winutils.exe to the root of my C:\ directory, so if it's still in your downloads folder my command won't work for you.
Voila. Spark is installed. It's go time.
Sign up for Watson Analytics
Easy peasy. Free.
Download War and Peace
Make sure you get just the plain text version (ends in .txt) here.
Let's get the boilerplate stuff out of the way. Open a text file and call it wordcount.py. If you don't have a preference on text editors, you might try Notepad++.
#csv and string are used later on, for writing the output file and stripping punctuation
import csv
import string

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from pyspark import SparkConf, SparkContext

APP_NAME = "My Spark Application"

#Spark code that actually does stuff goes here#

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Execute Main functionality
So let's dive into our Spark code. The first thing we need to do is create a Spark context and read the book into a Spark RDD. And we will do this within the main method, aka the driver program. You don't have to be concerned about what that means just yet, but I want to arm you with some terms.
text = sc.textFile("WarAndPeace.txt")
So we have this RDD thing with our text file. Now we need to split up all of those pesky sentences into individual words. The first example I'll show below is very basic. We won't exclude boring but popular words like is, a, you, and, me (the linguists call these stop words). We'll need some additional code for that and I want you to see the basic version first.
#split our text RDD into words
words = text.flatMap(lambda x: x.split())
The first thing you should notice is the function flatMap. flatMap will apply something we choose (a function) to every line in our text. And it will flatten our lines of text into a single column of data - one long list of words instead of a separate list per line.
I know this sounds a little weird, but the function we pass via flatMap will help clear things up. We are using a Python feature called lambda functions or anonymous functions. The reason they are called anonymous is because we do not have to give them a name. Whereas a typical function begins with a name, like below
def myfunction(arguments go here):
We can skip that and go straight to the arguments. By writing lambda x we are creating an anonymous function (no name) that has one argument, a value called x. Then after the colon we describe what we want to do. In this case, that means calling the split method on x.
#These are the same
def split_up(some_text): return some_text.split()
lambda some_text: some_text.split()
All split will do is split up our sentences into words. So we'll go from "We the people" to ["We", "the", "people"]. Capisce?
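If the flattening part still feels abstract, here is a quick sketch you could try in the PySpark shell (the two-line mini RDD is made up purely for illustration):
lines = sc.parallelize(["We the people", "We hold these truths"])
#map gives us one list of words per line
lines.map(lambda x: x.split()).collect()
#[['We', 'the', 'people'], ['We', 'hold', 'these', 'truths']]
#flatMap flattens everything into one long list of words
lines.flatMap(lambda x: x.split()).collect()
#['We', 'the', 'people', 'We', 'hold', 'these', 'truths']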
So now that we have split all the lines into individual words, we want to make every word lowercase. This way we won't count "We" and "we" as distinct words. To do this we will use map and another lambda function.
lower_words = words.map(lambda x: x.lower())
We have already flattened our lines into one column of words, so we don't need the flat part of flatMap anymore. But we still need to pass a function to every word in the book. That means we will just use map.
Just like with split, we can call a method called lower() on our anonymous value x, which makes every letter in the word lowercase. Sweet. Nothing crazy there.
Now we need to start counting the words. And to do that we are going to use two functions, one map and one reduce (or reduceByKey).
#Assign a number of one to each word
assign_numbers = lower_words.map(lambda x: (x, 1))
#Count all of those numbers
count_numbers = assign_numbers.reduceByKey(lambda x, y: x+y)
So here's what we are doing. We attach the number one to each of those words. ["We", "the", "people"] becomes [("We", 1), ("the", 1), ("people", 1)]. It is important to note that we went from individual strings, e.g., "we", to tuples, e.g., ("we", 1). A tuple is just an ordered list of items. We have two items in our tuple, which Spark thinks of as a key and a value: it views the first item as the key and the second item as the value.
This makes it possible for us to combine all of the words easily. The function that does this is called reduceByKey. It does some nifty magic and combines all of the matching keys, which are just the things in the first position of each tuple. The end result is a tuple with the correct count of the number of occurrences: ("people", 17).
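If you want to see reduceByKey in action on something tiny, here is a made-up example you could try in the PySpark shell:
pairs = sc.parallelize([("we", 1), ("the", 1), ("we", 1), ("people", 1)])
pairs.reduceByKey(lambda x, y: x + y).collect()
#[('we', 2), ('the', 1), ('people', 1)] - the exact order may vary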
The last two lines are just formatting. To make it easier for us to retrieve the top 100 most used words, we need to sort the tuples in descending order. Spark has a sortByKey method, which sorts by the value in the first position of the tuple. However, Spark does not have a sortByValue.
That's okay, all we have to do is flip the order within the tuple - we need to go from ("people", 17) to (17, "people"). And the easiest way to do that is with a lambda function.
#create a new tuple with the second item first and the first item last
ordered = count_numbers.map(lambda x: (x[1], x[0]))
#Now we can use our sortByKey method
wordcounts = ordered.sortByKey(False)
The reason I put False as the argument for sortByKey is just to let it know to sort in descending order. According to the function's documentation, True would indicate ascending order. This is a good time to note that Google is your best friend when it comes to coding. Just type your question, e.g., how do I sort in descending order in Spark?, into Google. More often than not you will find someone who has asked your question and received the correct answer.
Okay! We have successfully read in our file, split up the words, counted the words, and sorted them from greatest to smallest. All that is left for us to do is to grab the top one hundred and write them out to a file. Here is how we do that.
#Use the take method to snag the highest 100 words
top100 = wordcounts.take(100)
#Open up a new file, indicating via "wb" that we plan to write to it
with open("top100.csv", "wb") as f:
    #Initialize writer object - boilerplate, you have to do this every time
    writer = csv.writer(f)
    #Write out our header row
    writer.writerow(["count", "word"])
    #Loop through our 100 words and write out each count and word to a row
    for line in top100:
        writer.writerow(line)
Now it is time to actually run the code. Here is how we do that:
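Assuming Spark's bin directory is still on your PATH from earlier, open your command shell in the folder that holds wordcount.py and WarAndPeace.txt and use the spark-submit script that ships with Spark:
spark-submit wordcount.py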
Voila! We have done it. So let's review our results. You can open up the csv file in Excel.
Oh no - we have all of these useless words in our list. Unlike Bill Clinton, I don't care about the word "is". Let's get rid of those words. And to do that we need to use a library called NLTK.
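One heads-up before we get to the code: NLTK's tokenizer and stop word list rely on data files that don't ship with the library itself. If you have never used NLTK before, a quick one-time download (run from a Python shell, for example) will save you a cryptic error later:
import nltk
nltk.download('punkt')      #tokenizer models used by word_tokenize
nltk.download('stopwords')  #the English stop word list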
Now here's the sweet code from the good people at Mortar Data. You do not have to worry how this works. Just type it and be thankful. That's what I do.
# Module-level global variables for the `tokenize` function below
PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

# Function to break text into "tokens", lowercase them, remove punctuation and stopwords, and stem them
def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if not letter in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if not w in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]
One pretty nifty thing we see here is the use of word stems. This just means that we will reduce words to their root. So "running" and "runs" would both be grouped as "run". That's pretty handy for making sure we don't miss out on a popular word just because the author uses it in multiple tenses or in different parts of speech.
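If you are curious what the stemmer actually does to a word, you can poke at it directly in a Python shell; a quick sketch using NLTK's Porter stemmer:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem("running")  #'run'
stemmer.stem("pierre")   #'pierr' - which is why Pierre will show up as pierr in our results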
Now we can update our main method to reflect the addition of the tokenize function we wrote above.
wordcounts = text.flatMap(lambda x: tokenize(x)) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)
Note here we are using some different syntax in our PySpark code. Instead of creating new variables for each operation, we are chaining the operations together with periods. This just means that each function works on the output of the operation that precedes it. I have split these operations out on individual lines so you can see how the steps work.
Now let's run the code. We do that with the spark-submit command:
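It is the same command as before, since the script name hasn't changed:
spark-submit wordcount.py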
Now let's look at our results in Excel. Hah! "Said" is the most popular word in Tolstoy's work. Take that, pretentious writers who refuse to use said and instead pepper their writing with exclaim, gesticulate, and the like. You think you are better than Tolstoy!? I think not.
Anyway, viewing the results in a spreadsheet is fine, but we can certainly do better. It might be nice to create some pseudo pie chart so we can see not just the total count by word but the proportion of that word's use out of the total. Granted, we just grabbed the top one hundred words, not all of them, but it would still be a disgusting pie chart.
If only there were another way.
Enter Treemap and Watson Analytics
Treemaps provide a nice way of viewing hierarchical data. Unfortunately, Excel does not offer the treemap as a standard chart type. One can write a few lines of Python or R to create treemaps, but it would be nice if we could create them without code.
Luckily for us, Watson Analytics fits the bill. If you haven't created your account by now, head over and do so. After that the steps are really simple. From the Watson Analytics homepage:
- Click the Add button in the middle of the page
- Upload our csv file
- Your file will be displayed as rectangle on your dashboard - click on it
- Watson automatically suggests a question you might ask: what is the breakdown of count by word? Click on that
- Bam - Treemap created. Sweet.
What stands out right away is the popularity of the stem pierr, which is just short for Pierre. Notice that Andrew and Natasha, other main characters, are in the top ten words as well. While this may not sound surprising at first - after all, why wouldn't a novelist use the names of his characters all the time in his book - it is actually interesting given that these are Russian names.
One of the most confusing aspects of Russian literature (for English readers, of course) is the fact that each character has several names. Given the popularity of French at the time, each character has both a Russian and a French name. Pierre is also Pyotr. And he is also Count Bezukhov. The name Tolstoy uses for Pierre depends on those with whom he is speaking. Thus it isn't like a typical English language novel where a character might have one or maybe two names. And yet despite this, Pierre, Andrew, and Natasha rise to the top of Tolstoy's most popular words.
We should compare the frequency of character name references in War and Peace with Tolstoy's other novels, namely Anna Karenina. Then we should compare against other Russian novelists and authors from around the globe.
We ran through a very basic example of processing some text with Apache Spark and visualizing our results with Watson Analytics. With just a little bit of code we were able to process lots of data (and our single file is only scratching the surface) in a way that traditional analytic tools like Excel are poorly suited for. But more importantly, we showed that one can do some data processing with code and then switch over to a shiny analysis and visualization tool like Watson Analytics. Just because certain things can be done with code doesn't mean they necessarily should be. And Watson Analytics offers a nice option for performing some basic exploratory analysis.