Step-by-step process to install Spark 2.0.1 with Python 3.5 on Linux running in a virtual machine:
This blog is for people who want to run Spark on top of Python, with Linux (Ubuntu Desktop 16.04) installed in a virtual machine on a Windows host.
Please follow the steps one by one:
Step 1: Installing VMware in Windows:
On your Windows machine > open an internet browser > paste the following link > https://my.vmware.com/web/vmware/info?slug=desktop_end_user_computing/vmware_workstation_pro/12_0 > press Enter > here you can see Download VMware Workstation Pro 12.5.2 for Windows > click the link > once the download is finished, run the file > click Next until you reach the Finish button > VMware is now installed on your machine.
Here is an image of how it should look:
Step 2: Downloading Linux (Ubuntu Desktop) and installing it in VMware Player:
In Windows > open an internet browser > paste the following link > https://www.ubuntu.com/download/desktop > download Ubuntu Desktop by clicking the download button > an .iso image file of about 1.4 GB will be downloaded > once it is downloaded, open VMware Player by double-clicking the VMware icon > click Create a New Virtual Machine > select the Installer disc image file (iso) option > click the Browse button and navigate to the path where the Ubuntu .iso file is located > select the Ubuntu .iso file > click Open > then click Next > give the machine 20 GB of disk space so it runs smoothly > click Next > this will install Ubuntu in VMware; the installation can take more than an hour.
Once Ubuntu is installed, it will look like this inside VMware.
In the search box, type terminal and click the Terminal icon > type python after the $ symbol on the command line; by default, Ubuntu comes with Python 2.7.
Here is how it looks:
Step 3: Downloading Anaconda 4.2.0 inside Ubuntu:
Inside Ubuntu open an internet browser > paste the following link > https://www.continuum.io/downloads > download Anaconda 4.2.0 for Linux (the 64-bit Python 3.5 installer) > once the download is finished > open the terminal > type the following command after the $ symbol > sudo bash /home/narayana/Downloads/Anaconda3-4.2.0-Linux-x86_64.sh > press Enter >
At this point in the process it looks like this:
now the installation takes place > once installation is done, you get a message on the command line that Anaconda 4.2.0 was successfully installed > close the terminal and open it again > now, to check whether Anaconda 4.2.0 was installed properly > open a terminal > type python at the command prompt and press Enter > you should get a message like the snapshot below.
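Once the Python prompt opens, you can also confirm the interpreter version explicitly. This small check is my addition, not part of the original post:

import sys
print(sys.version)   # should mention Python 3.5 and the Anaconda 4.2.0 build

Type exit() to leave the Python prompt when you are done.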
Step 4: Installing Java inside Ubuntu:
By default, the Ubuntu VMware image does not come with a Java runtime environment, which is essential to run Spark.
Let's see the steps to install the Java runtime environment inside Ubuntu.
Connect to the internet > open a terminal inside Ubuntu > type $ sudo apt-get update > press Enter > this command will update the package index > once it is done successfully > next type $ sudo apt-get install default-jre > press Enter > this command installs Java on your Ubuntu machine > it will take some time to install > then you will get a message that Java was successfully installed.
Step 5: Downloading and installing Spark 2.0.1 inside Ubuntu:
Connect to the internet > inside Ubuntu open an internet browser > paste the following link > http://spark.apache.org/downloads.html > download spark-2.0.2-bin-hadoop2.7.tgz > once it is downloaded > extract the .tgz file > copy the extracted folder to your home directory and rename it spark201 > now open the terminal > type $ gedit .profile > an editor window will open > at the end of the file, after fi, paste the 5 lines of code below >
export SPARK_HOME=/home/narayana/spark201
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=localhost
> Caution: in the first line, give your own home directory details (i.e., /home/narayana/spark201); also check the exact py4j zip name under spark201/python/lib, as the version may differ from py4j-0.9 depending on your Spark download, and adjust the fourth line to match > click the Save button and close the editor.
Here is how the above process looks:
> now run $ source .profile (or paste the same 5 export lines into the terminal) and press Enter so the variables take effect in the current session > Spark is now set up > to start Spark, type the following command in the terminal > $ pyspark > press Enter > to close Spark, type exit().
Now you should see the following image.
If you can see this screen, it means you have successfully installed Spark 2.0.1 on Python 3.5 in your Ubuntu system.
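Before moving on, you can run a quick sanity check inside the pyspark shell. This snippet is a minimal sketch of my own, not from the original post; it relies only on the shell's built-in SparkContext, sc:

sc.version                           # should print the Spark version you installed
sc.parallelize(range(10)).sum()      # distributes 0..9 across Spark and sums them; should print 45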
Let's run some sample programs in Spark to check that everything is working well.
Sample programs in Spark, IPython Notebook, and plain Python:
[1] Word count program in Spark:
Open a terminal in Ubuntu and type $ pyspark > then type the following commands line by line, pressing Enter after each.
Program:
text = sc.textFile("hobbit.txt")     # load the input file as an RDD of lines
print(text)
from operator import add
def tokenize(text):
    return text.split()              # split a line into words
words = text.flatMap(tokenize)       # one RDD element per word
print(words)
wc = words.map(lambda x: (x, 1))     # pair each word with a count of 1
print(wc.toDebugString())
counts = wc.reduceByKey(add)         # sum the counts per word
counts.saveAsTextFile("output-dir")
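Instead of opening the saved part-files in output-dir, you can also inspect the most frequent words directly in the shell. This extra step is my addition, not part of the original program:

top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])   # ten most frequent words
for word, n in top10:
    print(word, n)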
The screenshot of the above program in Spark looks like this:
The output screenshot looks like this:
[2] Word count program in IPython Notebook:
Inside Ubuntu open a terminal > type the following command, which will open an IPython Notebook in the browser:
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython3 PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
Once the above command is executed in the terminal, it looks like the image below.
In the IPython Notebook > click New > Python [conda root] > paste the above code cell by cell and run each cell.
The screenshot of the above process looks like this:
The output will look like this:
[3] Word count program in Python:
Open a terminal in Ubuntu > type the following command at the prompt after the $ symbol > $ spark-submit wordcount3.py hobbit.txt > press Enter > you can see the program execute (a sketch of what wordcount3.py might contain follows the screenshots below).
You will see the following screenshot.
Below is a screenshot of the output of this program.
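The post does not include the contents of wordcount3.py, so here is a minimal sketch of what such a script could look like. The structure below is an assumption on my part, not the author's original file:

# wordcount3.py -- hypothetical reconstruction; the original script is not shown in this post
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount3")
    text = sc.textFile(sys.argv[1])                     # input file, e.g. hobbit.txt
    counts = (text.flatMap(lambda line: line.split())   # split lines into words
                  .map(lambda word: (word, 1))          # pair each word with a count of 1
                  .reduceByKey(add))                    # sum the counts per word
    for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, n)                                  # ten most frequent words
    sc.stop()

Save the script in your home directory next to hobbit.txt before running the spark-submit command above.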
If you can see outputs like the ones above in all three environments, it means your Spark, IPython Notebook, and Python setups are all working well.
If you have any doubts please feel free to contact me.
Thanks & Regards
I.S.L.Narayana
isivanarayana9@gmail.com