Cloud Computing for Data Analysis
Group Activity 08 - Spark MLlib Program
------------------------------

Part 1

The following instructions let you work with Spark inside Eclipse, so you can program and debug in Eclipse itself. Follow the instructions given in Exercise 04 - Spark SQL - to create a Maven project. Spark programming tutorials are available here: http://spark.apache.org/docs/latest/programming-guide.html

1. Copy the following Maven dependencies into pom.xml in a Maven project:

   <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.1.0</version>
   </dependency>
   <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>2.1.0</version>
   </dependency>

2. Copy the Scala word count program from http://spark.apache.org/examples.html into your project. (NOTE: The copied code contains only the word count logic. You have to initialize the Hadoop home directory, the Spark configuration, and the Spark context yourself. Review Exercise 04 to see how to initialize them; a minimal sketch also appears at the end of this handout.)

3. Clean the project to remove any old jar files (Maven Clean), then package the Maven project into a jar file.

4. Run the jar file on the DSBA Hadoop cluster with:
   spark2-submit --class "ClassName" --master yarn --deploy-mode client <your-jar-file>
   for each of the following datasets:
   4.1 data from http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/ActionRulesList.txt
   4.2 data.txt from http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/MammographicMassData/MammData.zip

5. Save the results in an output text file for each run, and upload both output text files to Canvas.

Part 2

6. Install / import the plug-in packages needed to use Spark MLlib (the Machine Learning Library).

7. Write a program to create a decision tree using CarData.txt (see the sketch at the end of this handout):
   7.1 Read Chapter 11 of Book 3, Learning Spark, on using MLlib. DecisionTree is on page 230.
   7.2 Use the mllib.tree.DecisionTree class with its trainClassifier() and trainRegressor() methods.
   7.3 The training methods take the following parameters:
       data - RDD of LabeledPoint.
       numClasses - (classification only) number of classes to use.
       impurity - node impurity measure; can be gini or entropy for classification. Use entropy.
       maxDepth - maximum depth of the tree. Use the default: 5.
       maxBins - number of bins to split data into when building each node. Use the value 32.
       categoricalFeaturesInfo - a map specifying which features are categorical, and how many categories each has. For example, if feature (attribute) 1 is a binary attribute with labels (values) 0 and 1, and feature (attribute) 2 has three values - 0, 1, and 2 - you would pass {1: 2, 2: 3}. Use an empty map if no features are categorical.

Part 3
----------

Run the following program, which builds a decision tree per the instructions in Part 2 above.

10. Background: Read Chapter 1 and Chapter 2 of Book 3, Learning Spark, and refer to the following links:
    http://spark.apache.org/docs/latest/mllib-guide.html
    https://www.mapr.com/blog/apache-spark-machine-learning-tutorial

11. Create a new Maven project.

12. Create a package named org.ML.

13. Copy a Spark decision tree program from http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/NewDecisionTreeProgram.zip

14. Run the project in Eclipse and upload the output file to Canvas - a file named like part-00000.

15. Copy the attached .jar file from the downloaded .zip file, and copy it to hadoop-dsba.uncc.edu via WinSCP or CyberDuck.

16. Download CarData from http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/CarEvaluationData/CarData.zip and copy it to hadoop-dsba.uncc.edu via WinSCP or CyberDuck.

17. Put the data.txt file into HDFS on the DSBA cluster using the 'hadoop fs -put' command in an SSH session (e.g., PuTTY).
18. Run the .jar file using the command:
    spark2-submit --class org.ML.DecisionTreeDriver --master yarn --deploy-mode client <your-jar-file>

19. Copy all text from the command window - including the output showing "Test Error" and the decision tree model classifier - and upload it to Canvas.
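
----------

Appendix: For reference, here is a minimal word count sketch for step 2 of Part 1. The object name WordCount, the winutils path C:\winutils, and the use of local[*] as the master are placeholders / assumptions - adjust them to your own setup, and remove setMaster() before submitting to the cluster.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Assumption: on a Windows dev machine, winutils.exe lives under C:\winutils\bin.
    // Skip this line on Linux / the cluster.
    System.setProperty("hadoop.home.dir", "C:\\winutils")

    // Initialize the Spark configuration and Spark context (see Exercise 04).
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // args(0) = input text file, args(1) = output directory (must not exist yet)
    val counts = sc.textFile(args(0))
      .flatMap(line => line.split("\\s+"))   // split each line into words
      .map(word => (word, 1))                // pair each word with a count of 1
      .reduceByKey(_ + _)                    // sum the counts per word

    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}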
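And here is a minimal sketch of the decision tree program from Part 2, using DecisionTree.trainClassifier with the parameters listed in step 7.3 (entropy, maxDepth 5, maxBins 32, empty categorical-features map). The comma-separated "label first" parsing and the numClasses value of 4 are assumptions about how CarData.txt is encoded (the Car Evaluation data has four class labels) - adjust them to the actual file format.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

object DecisionTreeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DecisionTreeSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: each line of CarData.txt is "label,feature1,feature2,..."
    // with all values already numeric.
    val data = sc.textFile("CarData.txt").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }

    // Hold out 30% of the data for testing.
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

    val model = DecisionTree.trainClassifier(
      training,        // data: RDD of LabeledPoint
      4,               // numClasses (assumption: 4 class labels)
      Map[Int, Int](), // categoricalFeaturesInfo: empty map = no categorical features
      "entropy",       // impurity, per step 7.3
      5,               // maxDepth, the default
      32)              // maxBins

    // Fraction of misclassified test points, then the learned tree itself.
    val testErr = test.map { p =>
      if (model.predict(p.features) != p.label) 1.0 else 0.0
    }.mean()
    println(s"Test Error = $testErr")
    println(model.toDebugString)

    sc.stop()
  }
}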