I have installed Spark directly on Windows, which is unusual. Most people will probably run Spark through a VM (virtual machine: a separate computer that runs as software within your computer) or a Docker container (same idea, but at a higher level of abstraction). Unlike writing MapReduce or Pig (the scripting language non-Java developers use with Hadoop), you do not need to have Hadoop installed to use Spark. Because Spark is just a processing engine, it can read data from basically anywhere.
Download Spark from the Apache Spark downloads page. I chose "pre-built for Hadoop 2.6 or later". Please note:
- You do need to install Java. Google "install java" and you'll find loads of great tutorials.
If you have successfully installed Java and set the corresponding system variables (your tutorial will walk you through this), open up your Windows command shell (cmd.exe) and run the version check; if your Java version is printed to the console, then you are good to go.
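That check is a single command, standard for any Java install:

```
java -version
```
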
- Unzip Spark and note the file path to Spark's bin directory, e.g. C:\spark-1.5.2-bin-hadoop2.6\bin
- Add the above path to your PATH variable.
I won't show you how to do this permanently; you'll need to look that up yourself. To set the path for just this session (i.e., until you close your command shell), type the following into your command shell, using the file path for where you installed Spark, ending with the bin directory. On my machine, that looks like:
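Since I unzipped Spark to C:\spark-1.5.2-bin-hadoop2.6, the cmd.exe command is (swap in your own install path):

```
set PATH=%PATH%;C:\spark-1.5.2-bin-hadoop2.6\bin
```

This appends Spark's bin directory to whatever is already on your PATH, so existing commands keep working.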
The version of Spark I'm using requires an environment variable called HADOOP_HOME, which is (I assume) because I am using a version of Spark pre-built for Hadoop. I do not see an option on the Spark downloads page to download a version not pre-built for Hadoop. You can download the source code and build Spark yourself, but I do not know how to do that and it sounds hard.
So we need a little workaround. Go to the relevant Spark Jira issue. This is going to sound sketchy, but scroll halfway down and find the link to download winutils.exe: http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
Note: I am not including this as a hyperlink because clicking it will immediately download the exe file. I had to do this to get my installation to work, but I have no idea what this executable is. And I know that sounds bad.
Now we set the environment variable for HADOOP_HOME like so:
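For just the current session, the cmd.exe command takes this shape (a sketch based on where I put winutils.exe; note that some Spark versions look for the file at %HADOOP_HOME%\bin\winutils.exe, so you may need to keep it inside a bin subfolder):

```
set HADOOP_HOME=C:\
```
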
Again, use the path to winutils on your machine. I moved winutils.exe to the root of my C:\ directory, so if it's still in your downloads folder my command won't work for you.
Voila. Spark is installed. It's go time.
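With PATH and HADOOP_HOME set, you can sanity-check the install by launching a Spark shell from any directory in your command prompt (both scripts ship in the bin directory we added to PATH):

```
spark-shell
```

spark-shell starts the Scala REPL; pyspark starts the Python one. If you reach a Spark prompt without errors, the install works.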