Cloud Computing for Data Analysis Exercise04 Spark SQL program ------------------------------ Part 1 Scala Environment SetUp Download the following: 1. Scala: Go to website http://www.scala-lang.org/download/ and download Scala binaries from 'Other ways to install Scala' Install Scala in the laptop Set environment variable (under system variables), set a path to scala/bin folder (NOTE: To set the environment variables, in Windows : Right click on My Computer -> Select Properties -> 'Select Advanced system Settings' -> Select 'Environment Variables' button -> Under 'System Variables', select PATH and click on 'Edit button' -> Click on 'New button' and add the desired path and give OK) in Macintosh : See instructions here: http://osxdaily.com/2015/07/28/set-enviornment-variables-mac-os-x/ ) 2. Scala-IDE Go to website http://scala-ide.org/ and download Scala-IDE Unzip the .zip file Copy the Eclipse to any desired locations 3. Hadoop packages: Go to website https://github.com/srccodes/hadoop-common-2.2.0-bin and download the .zip file Unzp the .zip file Copy the hadoop-common-2.2.0-bin-master, which contains a bin folder, to any desired location. (Say it has been copied to C://Hadoop folder. So, the bin folder is in the location: C://Hadoop/hadoop-common-2.2.0-bin-master/) (NOTE: while programming, we need this location) 4. Make sure the environment variable for Java ( java/bin folder ) is also set like Scala see Step 1. If not, make the path variable to Java's bin folder Part2 Create a New Maven Project The following instructions are given to work with Spark in Eclipse itself. So you can program and debug in Eclipse itself. 1. Open Eclipse which came with Scala-IDE 2. Click File-> New-> Other -> Maven Project and select Next 3. Select 'Create simple project checkbox' and click Next 4. Give any 'Group Id' similar 'org.SparkSQL' 5. Give any 'Artifact Id' similar to 'SparkSQLScala' and click Finish 6. Right click on newly created project and go to Configure and select 'Add Scala nature' 7. Open the new project and open the pom.xml and go to pom.xml tab 8. Copy the following Maven dependency into pom.xml org.apache.spark spark-core_2.11 2.2.1 org.apache.spark spark-sql_2.11 2.2.1 and paste it under .... line 9. Inside the project, right click on src/main/java, select Refactor and select Rename. Rename 'java' as 'scala' 10. Right click on src/main/scala and select New -> Package. Name the package as 'org.SparkSQL'(Same name as you have given for 'Group Id' in step 4) 11. Right click on the package(org.SparkSQL) and select New -> Scala Object. Name it as 'org.SparkSQL.Driver' Find Spark programming tutorials here: http://spark.apache.org/docs/latest/programming-guide.html Part3 Use SparkSQL Library to load Data into a Table , and run SQL queries on it 1. Delete all contents inside Driver.scala and copy the following code into the Maven Project createin Part2 - Driver.scala package org.SparkSQL import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql.SQLContext case class Car(buying:String, maint:String, doors:String, persons:String, lug_boot:String, safety:String, car_class:String) object Driver { def main(args: Array[String]): Unit = { //Set Hadoop home directory //System.setProperty("hadoop.home.dir", "C:\\Hadoop\\hadoop-common-2.2.0-bin-master\\") //Setting up Spark configurations val conf = new SparkConf().setAppName("SparkAction").setMaster("local") val sc = new SparkContext(conf) val sqlc = new SQLContext(sc) //Loading a data file val loadFile = sc.textFile(args(0)) val fileRDD = loadFile.map { line => line.split(",") } //Mapping the data into an RDD val carRDD = fileRDD.map { car => Car(car(0),car(1),car(2),car(3),car(4),car(5),car(6)) } import sqlc.implicits._ //Converting the RDD into a dataframe val carsDF = carRDD.toDF() carsDF.registerTempTable("Cars") // Creating a table from the dataframe //Querying the table val queryDF = sqlc.sql("select * from Cars where buying='vhigh' and doors like '%more' and car_class='unacc'") val queryRDD = queryDF.rdd queryRDD.saveAsTextFile(args(1)) //Storing the result as a textfile } } In the program, line number 12, System.setProperty("hadoop.home.dir", "location of bin folder of Hadoop you saved in step 3 of Part 1") Follow the same pattern for the folder location Uncomment line 12 3. Save the project. 4. Right click on the project. Select 'RunAs' and click 'Maven Clean'. This step removes old .jar files in the 'target' folder 5. Right click on the project. Select 'RunAs' and click 'Maven Install'. Check the 'target' folder for a .jar file. Rename the .jar file as 'SparkSQLScala.jar' 6. Download Car Evaluation Data from http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/CarEvaluationData/CarData.zip 7. Copy data.txt and SparkSQLScala.jar files to your UNCC account using WinSCP or CyberDuck 8. Copy data.txt to HDFS using the command: hadoop fs -put /users/UNCCUserName/data.txt /user/UNCCUserName/ 9. Run the program using the command: spark2-submit --class org.SparkSQL.Driver --master yarn --deploy-mode client /users/abagavat/SparkSQLScala.jar /user/UNCCUserName/data.txt /user/UNCCUserName/SparkSQLOutput 10. Copy the output to your server using the command: hadoop fs -get /user/UNCCUserName/SparkSQLOutput/part-00000 /users/abagavat/SparkSQLOutput.txt 11. Get SparkSQLOutput.txt from CyberDuck or WinSCP and upload in Canvas