Step-by-step process to install Spark 2.0.1 with Python 3.5 on Linux running in a virtual machine:
This blog is for people who want to run Spark on top of Python, with Linux (Ubuntu Desktop 16.04) installed in a virtual machine on a Windows host.
Please follow the steps one by one:
Step 1: Installing VMware in Windows:
On your Windows machine > open an internet browser > paste the following link > https://my.vmware.com/web/vmware/info?slug=desktop_end_user_computing/vmware_workstation_pro/12_0 > press Enter > here you can see Download VMware Workstation Pro 12.5.2 for Windows > click the link > once the download is finished, run the file > click Next until you reach the Finish button > VMware is now installed on your machine.
Here is an image of how it should look:
Step 2: Downloading Linux (Ubuntu Desktop) and installing it in VMware Player:
In Windows > open an internet browser > paste the following link > https://www.ubuntu.com/download/desktop > download Ubuntu Desktop by clicking the download button > an .iso image file of about 1.4 GB will be downloaded > once it is downloaded, open VMware Player by double-clicking the VMware icon > click Create a New Virtual Machine > select the Installer disc image file (iso) option > click the Browse button and navigate to the path where the Ubuntu .iso file is located > select the Ubuntu .iso file > click Open > then click Next > give the machine 20 GB of disk space so it runs smoothly > click Next > this will install Ubuntu in VMware; the installation can take more than an hour.
Once Ubuntu is installed, it will look like this inside VMware.
In the search box, type terminal and click the Terminal icon > type python after the $ symbol on the command line; by default, Ubuntu comes with Python 2.7.
Here is how it looks:
Step 3: Downloading Anaconda 4.2.0 inside Ubuntu:
Inside Ubuntu open an internet browser > paste the following link > https://www.continuum.io/downloads > download Anaconda 4.2.0 for Linux (the 64-bit Python 3.5 installer) > once the download is finished > open the terminal > type the following command after the $ symbol > sudo bash /home/narayana/Downloads/Anaconda3-4.2.0-Linux-x86_64.sh > press Enter >
At this point in the process it looks like this:
now the installation takes place > once installation is done, you get a message on the command line that Anaconda 4.2.0 was successfully installed > close the terminal and open it again > now, to check whether Anaconda 4.2.0 was installed properly > open a terminal > type python at the command prompt and press Enter > you should get a message like the snapshot below.
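Once the Python prompt opens, you can also confirm the interpreter version explicitly. This small check is my addition, not part of the original post:

import sys
print(sys.version)   # should mention Python 3.5 and the Anaconda 4.2.0 build

Type exit() to leave the Python prompt when you are done.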
Step 4: Installing Java inside Ubuntu:
By default, the Ubuntu VMware image does not come with a Java runtime environment, which is essential to run Spark.
Let's see the steps to install the Java runtime environment inside Ubuntu.
Connect to the internet > open a terminal inside Ubuntu > type $ sudo apt-get update > press Enter > this command will update the package index > once it is done successfully > next type $ sudo apt-get install default-jre > press Enter > this command installs Java on your Ubuntu machine > it will take some time to install > then you will get a message that Java was successfully installed.
Step 5: Downloading and installing Spark 2.0.1 inside Ubuntu:
Connect to the internet > inside Ubuntu open an internet browser > paste the following link > http://spark.apache.org/downloads.html > download spark-2.0.2-bin-hadoop2.7.tgz > once it is downloaded > extract the .tgz file > copy the extracted folder to your home directory and rename it spark201 > now open the terminal > type $ gedit .profile > an editor window will open > at the end of the file, after fi, paste the 5 lines of code below >
export SPARK_HOME=/home/narayana/spark201
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=localhost
> Caution: in the first line, give your own home directory details (i.e., /home/narayana/spark201); also check the exact py4j zip name under spark201/python/lib, as the version may differ from py4j-0.9 depending on your Spark download, and adjust the fourth line to match > click the Save button and close the editor.
Here is how the above process looks:
> now run $ source .profile (or paste the same 5 export lines into the terminal) and press Enter so the variables take effect in the current session > Spark is now set up > to start Spark, type the following command in the terminal > $ pyspark > press Enter > to close Spark, type exit().
Now you should see the following image.
If you can see this screen, it means you have successfully installed Spark 2.0.1 on Python 3.5 in your Ubuntu system.
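Before moving on, you can run a quick sanity check inside the pyspark shell. This snippet is a minimal sketch of my own, not from the original post; it relies only on the shell's built-in SparkContext, sc:

sc.version                           # should print the Spark version you installed
sc.parallelize(range(10)).sum()      # distributes 0..9 across Spark and sums them; should print 45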
Let's run some sample programs in Spark to check that everything is working well.
Sample programs in Spark, IPython Notebook, and plain Python:
[1] Word count program in Spark:
Open a terminal in Ubuntu and type $ pyspark > then type the following commands line by line, pressing Enter after each.
Program:
text = sc.textFile("hobbit.txt")     # load the input file as an RDD of lines
print(text)
from operator import add
def tokenize(text):
    return text.split()              # split a line into words
words = text.flatMap(tokenize)       # one RDD element per word
print(words)
wc = words.map(lambda x: (x, 1))     # pair each word with a count of 1
print(wc.toDebugString())
counts = wc.reduceByKey(add)         # sum the counts per word
counts.saveAsTextFile("output-dir")
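Instead of opening the saved part-files in output-dir, you can also inspect the most frequent words directly in the shell. This extra step is my addition, not part of the original program:

top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])   # ten most frequent words
for word, n in top10:
    print(word, n)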
The screenshot of the above program in Spark looks like this:
The output screenshot looks like this:
[2] Word count program in IPython Notebook:
Inside Ubuntu open a terminal > type the following command, which will open an IPython Notebook in the browser:
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython3 PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
Once the above command is executed in the terminal, it looks like the image below.
In the IPython Notebook > click New > Python [conda root] > paste the above code cell by cell and run each cell.
The screenshot of the above process looks like this:
The output will look like this:
[3] Word count program in Python:
Open a terminal in Ubuntu > type the following command at the prompt after the $ symbol > $ spark-submit wordcount3.py hobbit.txt > press Enter > you can see the program execute (a sketch of what wordcount3.py might contain follows the screenshots below).
You will see the following screenshot.
Below is a screenshot of the output of this program.
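The post does not include the contents of wordcount3.py, so here is a minimal sketch of what such a script could look like. The structure below is an assumption on my part, not the author's original file:

# wordcount3.py -- hypothetical reconstruction; the original script is not shown in this post
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount3")
    text = sc.textFile(sys.argv[1])                     # input file, e.g. hobbit.txt
    counts = (text.flatMap(lambda line: line.split())   # split lines into words
                  .map(lambda word: (word, 1))          # pair each word with a count of 1
                  .reduceByKey(add))                    # sum the counts per word
    for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, n)                                  # ten most frequent words
    sc.stop()

Save the script in your home directory next to hobbit.txt before running the spark-submit command above.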
If you can see outputs like the ones above in all three environments, it means your Spark, IPython Notebook, and Python setups are all working well.
If you have any doubts please feel free to contact me.
Thanks & Regards
I.S.L.Narayana
isivanarayana9@gmail.com