Instead, it's best practice to have a separate Python 2.7 installation. And to be completely isolated, best practice is to create a virtualenv, which you will use to install all packages you are going to use with pyspark.
Also, if you plan to run pyspark within Zeppelin, you have to be sure that the virtualenv is accessible to user Zeppelin. This is why I install the whole thing in /etc. Also, make sure to run this on all cluster nodes, otherwise Spark executors cannot launch the local Python processes.
#!/bin/bash
# run as root
# info on python2.7 req's here: http://toomuchdata.com/2014/02/16/how-to-install-python-on-centos/
# info on installing python for spark: http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
# info on python on local environment http://stackoverflow.com/questions/5506110/is-it-possible-to-install-another-version-of-python-to-virtualenv
#install needed system libraries
yum groupinstall "Development tools"
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
#setup local python2.7 installation
mkdir /etc/spark-python
mkdir /etc/spark-python/python
cd /etc/spark-python/python
wget http://www.python.org/ftp/python/2.7.9/Python-2.7.9.tgz
tar -zxvf Python-2.7.9.tgz
cd Python-2.7.9
make clean
./configure --prefix=/etc/spark-python/.localpython
make
make install
#setup local pip installation
cd /etc/spark-python/python
wget https://pypi.python.org/packages/8b/2c/c0d3e47709d0458816167002e1aa3d64d03bdeb2a9d57c5bd18448fd24cd/virtualenv-15.0.3.tar.gz#md5=a5a061ad8a37d973d27eb197d05d99bf
tar -zxvf virtualenv-15.0.3.tar.gz
cd virtualenv-15.0.3/
/etc/spark-python/.localpython/bin/python setup.py install
cd /etc/spark-python
/etc/spark-python/.localpython/bin/virtualenv spark-venv-py2.7 --python=/etc/spark-python/.localpython/bin/python2.7
#activate venv
cd /etc/spark-python/spark-venv-py2.7/bin
source ./activate
#pip install packages of your choice
/etc/spark-python/spark-venv-py2.7/bin/pip install  --upgrade pip
/etc/spark-python/spark-venv-py2.7/bin/pip install py4j
/etc/spark-python/spark-venv-py2.7/bin/pip install py4j
/etc/spark-python/spark-venv-py2.7/bin/pip install numpy
/etc/spark-python/spark-venv-py2.7/bin/pip install scipy
/etc/spark-python/spark-venv-py2.7/bin/pip install scikit-learn
/etc/spark-python/spark-venv-py2.7/bin/pip install pandas
After you did this, make sure to set variable PYSPARK_PYTHON in /etc/spark-env.sh to the path of the new binary, in this case /etc/spark-python/spark-venv-py2.7/bin/python
Also, if you use Zeppelin make sure to set the correct python path in interpreter settings. Simply alter/add property zeppelin.pyspark.python and set it's value to the python binary as above.
Tags: Apache Spark, Python, pyspark, Apache Zeppelin, Ambari, Hortonworks HDP
 
No comments:
Post a Comment