More Hands-on Experience
How to upload data to the Hadoop cluster
To upload data from your personal computer to the Hadoop cluster, you need to follow two steps: first upload the data to the head node, and then move it from the head node to the Hadoop cluster.
Step 1: uploading data to the head node:
scp <file path on the personal machine> <username>@wegdam.ewi.utwente.nl:~
Step 2: Logging in to the head node and putting the data on the Hadoop cluster:
ssh <username>@wegdam.ewi.utwente.nl
ls
hdfs dfs -put <source address on head node>
hdfs dfs -ls
Attention! While you are logged in to the head node you can use normal Unix commands (e.g. ls, cd, rm, ...). To access files on the Hadoop file system, you have to put the prefix hdfs dfs before every command and write the command itself with a leading dash (e.g. hdfs dfs -ls).
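The two upload steps can be combined into one small helper that runs on your personal machine. This is only a sketch: the function name upload_to_hdfs and the DRY_RUN switch are hypothetical, it assumes the file name contains no spaces, and by default it only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: upload a local file to HDFS via the head node.
# The host name comes from the steps above; upload_to_hdfs and DRY_RUN
# are hypothetical names introduced for this example.
HEAD_NODE="wegdam.ewi.utwente.nl"

upload_to_hdfs() {
    local user="$1" file="$2"
    local name run=""
    name="$(basename "$file")"
    # With DRY_RUN=1 (the default) the commands are printed, not executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then run="echo"; fi
    # Step 1: copy the file to your home directory on the head node.
    $run scp "$file" "${user}@${HEAD_NODE}:~"
    # Step 2: on the head node, put the file into HDFS and list the result.
    # Assumes the file name contains no spaces or shell metacharacters.
    $run ssh "${user}@${HEAD_NODE}" "hdfs dfs -put ${name} && hdfs dfs -ls"
}
```

Calling upload_to_hdfs <username> <file path on the personal machine> prints the scp and ssh commands so you can check them; setting DRY_RUN=0 before the call executes them.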
How to copy data from the Hadoop cluster to your local machine
Follow the reverse of the procedure described above: first get the data from the Hadoop cluster to the head node, and then copy it from the head node to your local machine.
Step 1: Getting data to the head node:
Assuming that you are logged in to the head node:
hdfs dfs -get <path of the file on the Hadoop cluster>
ls
Step 2: Copying the file from the head node to your local machine:
scp <file name on the head node> <username>@<your machine IP>:<destination path>
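The download steps can be sketched the same way, this time as a helper meant to run on the head node. The function name download_from_hdfs, its argument order, and the DRY_RUN switch are hypothetical; like the scp step above, it assumes your local machine accepts incoming SSH connections.

```shell
#!/usr/bin/env bash
# Sketch: fetch a file from HDFS and send it on to your own machine.
# Intended to run on the head node. download_from_hdfs and DRY_RUN are
# hypothetical names introduced for this example.
download_from_hdfs() {
    local hdfs_path="$1" target="$2"   # target: <username>@<IP>:<destination path>
    local name run=""
    name="$(basename "$hdfs_path")"
    # With DRY_RUN=1 (the default) the commands are printed, not executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then run="echo"; fi
    # Step 1: copy the file from HDFS into the current directory.
    $run hdfs dfs -get "$hdfs_path"
    # Step 2: copy the file from the head node to the local machine.
    $run scp "$name" "$target"
}
```

For example, download_from_hdfs results/part-00000 alice@192.0.2.10:Downloads/ (a made-up path, user, and address) prints the hdfs dfs -get and scp commands; with DRY_RUN=0 it runs them.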
Attention! Afterwards, please remove everything that you copied to the head node and the Hadoop cluster in order to keep them clean!
Example: Assuming that you are logged in to the head node
Removing files from the head node:
rm filename
Removing files from the Hadoop cluster:
hdfs dfs -rm filename
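Both clean-up commands can be wrapped in one helper to run on the head node. This is a sketch under assumptions: the name cleanup_copies and the DRY_RUN switch are hypothetical, the file is assumed to have the same name on the head node and in your HDFS home directory, and the plain rm for the head-node copy is inferred from the note that normal Unix commands work there.

```shell
#!/usr/bin/env bash
# Sketch: remove a file from both the head node and HDFS.
# Intended to run on the head node. cleanup_copies and DRY_RUN are
# hypothetical names introduced for this example.
cleanup_copies() {
    local name="$1" run=""
    # With DRY_RUN=1 (the default) the commands are printed, not executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then run="echo"; fi
    # Remove the copy in the current directory on the head node.
    $run rm -- "$name"
    # Remove the copy stored on the Hadoop file system.
    $run hdfs dfs -rm "$name"
}
```

Running cleanup_copies filename first shows what would be deleted; repeat the call with DRY_RUN=0 once the two commands look right.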
Running Apache Spark
/usr/lib/spark/bin/spark-shell --master yarn-client --num-executors 5 --driver-memory 4g --executor-memory 2g --executor-cores 1 --conf spark.ui.port=<a number higher than 4050> --queue <your username>
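Because the command is long and two of its values change per user, it may help to generate it with a small function. The sketch below only builds and prints the command line; the name spark_shell_cmd is hypothetical, and the check simply enforces the "higher than 4050" rule for the UI port stated above.

```shell
#!/usr/bin/env bash
# Sketch: print the spark-shell command for a given UI port and queue.
# spark_shell_cmd is a hypothetical name; all flags are taken from the
# command above.
spark_shell_cmd() {
    local port="$1" queue="$2"
    # Reject empty or non-numeric ports.
    case "$port" in
        ''|*[!0-9]*) echo "port must be a number" >&2; return 1 ;;
    esac
    # The UI port has to be higher than 4050.
    if [ "$port" -le 4050 ]; then
        echo "port must be higher than 4050" >&2
        return 1
    fi
    echo /usr/lib/spark/bin/spark-shell --master yarn-client \
        --num-executors 5 --driver-memory 4g \
        --executor-memory 2g --executor-cores 1 \
        --conf "spark.ui.port=${port}" --queue "$queue"
}
```

On the head node, spark_shell_cmd <port> <your username> prints the full command, which you can then copy and run; a port of 4050 or lower produces an error message instead.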