Cloud Computing for Data Analysis
Group Activity 08 - Spark MLlib program
------------------------------

Part 1
----------
The following instructions explain how to work with Spark inside Eclipse, so that you can program and debug there. Follow the instructions given in Exercise04 - Spark SQL - to create a Maven project. Spark programming tutorials are available here: http://spark.apache.org/docs/latest/programming-guide.html

1. Copy the following Maven dependencies into pom.xml in a Maven project:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>

2. Copy the Scala word count program from http://spark.apache.org/examples.html to your project.
   (NOTE: The copied code contains only a word count module. You have to initialize the Hadoop home directory, the Spark Configuration and the Spark Context yourself. Review Exercise 04 to see how to initialize them.)
3. Clean the project to remove any old jar files (Maven Clean) and package the Maven project into a jar file.
4. Run the jar file using the command

    spark-submit --class "ClassName" --master yarn --deploy-mode client

   for:
   4.1 data from http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/ActionRulesList.txt
   4.2 data.txt from http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/MammographicMassData/MammData.zip
5. Save the results in an output text file ( created in ), and upload both output text files to Canvas.

Part 2
----------
6. Install or import the plug-ins and packages needed to use Spark MLlib (Machine Learning Library).
7. Write a program to create a Decision Tree using CarData.txt.
   7.1 Read Chapter 11 of Book 3, Learning Spark, on using MLlib. DecisionTree is covered on page 230.
   7.2 Use the mllib.tree.DecisionTree class and its trainClassifier() and trainRegressor() methods.
   7.3 The training methods take the following parameters:

       data                     RDD of LabeledPoint
       numClasses               (classification only) Number of classes to use
       impurity                 Node impurity measure; can be gini or entropy for classification. Use entropy.
       maxDepth                 Maximum depth of the tree - use default: 5
       maxBins                  Number of bins to split data into when building each node - use value: 32
       categoricalFeaturesInfo  A map specifying which features are categorical, and how many categories each of them has. For example, if feature (attribute) 1 is a binary attribute with labels (values) 0 and 1, and feature (attribute) 2 has three values - 0, 1, and 2 - you would pass {1: 2, 2: 3}. Use an empty map if no features are categorical.

Part 3
----------
Run the following program, which builds a Decision Tree per the instructions in Part 2 above.

10. Background: Read Chapter 1 and Chapter 2 of Book 3, Learning Spark, and refer to the following links:
    http://spark.apache.org/docs/latest/mllib-guide.html
    https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
11. Create a new Maven project.
12. Create a package named org.ML.
13. Copy a Spark decision tree program from http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/NewDecisionTreeProgram.zip
14. Run the project in Eclipse and upload the output file to Canvas - a file named like part-00000.
15. Copy the attached .jar file from the downloaded .zip file to hadoop-dsba.uncc.edu via WinSCP or Cyberduck.
16. Create an EMR cluster with Spark.
17. Download CarData from http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/CarEvaluationData/CarData.zip and upload it to hadoop-dsba.uncc.edu. Also, upload DecisionTree.jar to S3.
18. Log in to the master node of the cluster using PuTTY or Cyberduck.
19. Copy DecisionTree.jar to the master node using the command:

    aws s3 cp s3://DATA_BUCKET_NAME/DecisionTree.jar .

20. Run the .jar file using the command:

    spark-submit --class org.ML.DecisionTreeDriver --master yarn --deploy-mode client DecisionTree.jar s3://DATA_BUCKET_NAME/data.txt s3://DATA_BUCKET_NAME/DecisionTreeOutput

21. Copy all text from the Command Window - including the output showing "TestE Error" and the Decision Tree Model Classifier - and upload it to Canvas along with the 'metadata' directory in the output.
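
Background note on the impurity parameter from Part 2: the two classification impurity measures can be computed by hand from the number of examples of each class that reach a tree node. The plain-Java sketch below is only an illustration under stated assumptions. The class name ImpuritySketch and its methods are hypothetical and are not part of Spark MLlib, and a base-2 logarithm is assumed for entropy. It does not use Spark, so it runs without a cluster.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration only; not part of Spark MLlib.
public class ImpuritySketch {

    // Entropy of a node: -sum(p_i * log2(p_i)) over the classes present.
    // Assumption: base-2 logarithm.
    public static double entropy(long[] classCounts) {
        long total = Arrays.stream(classCounts).sum();
        if (total == 0) {
            return 0.0; // an empty node is treated as pure
        }
        double h = 0.0;
        for (long count : classCounts) {
            if (count == 0) {
                continue; // a class with no examples contributes nothing
            }
            double p = (double) count / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Gini impurity of a node: 1 - sum(p_i^2) over all classes.
    public static double gini(long[] classCounts) {
        long total = Arrays.stream(classCounts).sum();
        if (total == 0) {
            return 0.0; // an empty node is treated as pure
        }
        double sumOfSquares = 0.0;
        for (long count : classCounts) {
            double p = (double) count / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    public static void main(String[] args) {
        long[] evenSplit = {5, 5};  // two classes, equally mixed
        long[] pureNode = {10, 0};  // every example belongs to one class
        System.out.println("entropy(evenSplit) = " + entropy(evenSplit));
        System.out.println("gini(evenSplit)    = " + gini(evenSplit));
        System.out.println("entropy(pureNode)  = " + entropy(pureNode));
        System.out.println("gini(pureNode)     = " + gini(pureNode));
    }
}
```

Both measures are zero for a node that contains a single class and largest when the classes are evenly mixed, which is why a split that lowers impurity is preferred when the tree is built.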