Cloud Computing for Data Analysis --------------------------------- GroupActivity 05 - K-Means Clustering on Spark MLib ---------------------------------------------------- SPARK program - Use the the Spark MLib ( Machine Learning Library ) with the CarEvaluation data - Download CarEvaluation Data from : http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/CarEvaluationData/CarData.zip - Create the following : Clustering : - use K-MEANS clustering , create 3 clusters ; - Save the output for the Clustering in a Text file . Upload your output file with command line window to Canvas. Sample Code for the Car Data: import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} import org.apache.spark.mllib.linalg.Vectors //Parsing the input data def parseData(str:String) : Car = { var line = str.split(",") if(line.length == 7) Car(line(0), line(1), line(2), line(3), line(4), line(5), line(6)) else Car("None", "None", "None", "None", "None", "None", "None") } //Setting up Spark configurations val conf = new SparkConf().setAppName("SparkAction").setMaster("local") val sc = new SparkContext(conf) //Reading an input data file val inputDataRDD = sc.textFile(args(0)) val parsedInputRDD = inputDataRDD.map(parseData).cache() val validParsedInputRDD = parsedInputRDD.filter(line => !line.carClass.equals("None")) //Converting Strings to Double var buyingMap : Map[String,Double] = Map() var index1 = 0.0 validParsedInputRDD.map(car => car.buying).distinct.collect().foreach(x => { buyingMap += (x -> index1); index1 += 1.0 }) var maintMap : Map[String,Double] = Map() var index2 = 0.0 validParsedInputRDD.map(car => car.maint).distinct.collect().foreach(x => { maintMap += (x -> index2); index2 += 1.0 }) var doorsMap : Map[String,Double] = Map() var index3 = 0.0 validParsedInputRDD.map(car => car.doors).distinct.collect().foreach(x => { doorsMap += (x -> index3); index3 += 1.0 }) var personsMap : Map[String,Double] = Map() var index4 = 0.0 validParsedInputRDD.map(car => car.persons).distinct.collect().foreach(x => { personsMap += (x -> index4); index4 += 1.0 }) var lugMap : Map[String,Double] = Map() var index5 = 0.0 validParsedInputRDD.map(car => car.lug_boot).distinct.collect().foreach(x => { lugMap += (x -> index5); index5 += 1.0 }) var safetyMap : Map[String,Double] = Map() var index6 = 0.0 validParsedInputRDD.map(car => car.safety).distinct.collect().foreach(x => { safetyMap += (x -> index6); index6 += 1.0 }) var classMap : Map[String,Double] = Map() var index7 = 0.0 validParsedInputRDD.map(car => car.carClass).distinct.collect().foreach(x => { classMap += (x -> index7); index7 += 1.0 }) //Getting final data for Decision tree val dataPrep = validParsedInputRDD.map(car => { val carClass = classMap(car.carClass) val buying = buyingMap(car.buying) val maint = maintMap(car.maint) val doors = doorsMap(car.doors) val persons = personsMap(car.persons) val lugBoot = lugMap(car.lug_boot) val safety = safetyMap(car.safety) Array(carClass.toDouble,buying.toDouble,maint.toDouble,doors.toDouble,persons.toDouble,lugBoot.toDouble,safety.toDouble) }) //Creating vector from the data val dataLabels = dataPrep.map(dataLine => { Vectors.dense(dataLine.apply(1), dataLine.apply(2), dataLine.apply(3), dataLine.apply(4), dataLine.apply(5), dataLine.apply(6), dataLine.apply(0)) }) // Split the data into training and test sets (30% held out for testing) val splits = dataLabels.randomSplit(Array(0.7, 0.3)) val (trainingData, testData) = (splits(0), splits(1)) // Cluster the data into two classes using KMeans val numClusters = 3 val numIterations = 20 val clusters = KMeans.train(trainingData, numClusters, numIterations) // Makng predictions and printing which text data belongs to which cluster var id = 0 val predictions = testData.map { x => (x,clusters.predict(x)) } .map(x => x._1.toArray.mkString(",") + " - Cluster:" + x._2) predictions.saveAsTextFile(args(1)) //END OF SAMPLE CODE Tutorial for K-Means clustering in Spark-MLlib: https://spark.apache.org/docs/latest/mllib-clustering.html Get a .jar file Copy the .jar file and the input Car Data file to the master node using WinSCP(Windows) or CyberDuck (MAC or Linux) Copy the CarData to HDFS using the following command: hadoop fs put /users/UNCC-USERNAME/CarData.txt /user/UNCC-USERNAME/ To execute the program, use the following command: spark2-submit --class package_name.ClassName --master yarn --deploy-mode client /users/UNCC-USERNAME/KMeans.jar /user/UNCC-USERNAME/CarData/data.txt KMeansOutput SUBMIT THE OUTPUT FILE WITH YOUR COMMANDS IN THE COMMAND LINE WINDOW IN CANVAS