Cloud Computing for Data Analysis
Exercise 04: Spark SQL Program
------------------------------

Part 1: Scala Environment Setup

Download the following:

1. Scala:
   Go to http://www.scala-lang.org/download/ and download the Scala binaries from 'Other ways to install Scala'.
   Install Scala on the laptop.
   Set an environment variable (under system variables) pointing to the scala/bin folder.
   (NOTE: To set the environment variables:
    In Windows: Right click on My Computer -> Select Properties -> Select 'Advanced system Settings' -> Select the 'Environment Variables' button -> Under 'System Variables', select PATH and click the 'Edit' button -> Click the 'New' button, add the desired path, and click OK.
    On Macintosh: See the instructions here: http://osxdaily.com/2015/07/28/set-enviornment-variables-mac-os-x/)

2. Scala IDE:
   Go to http://scala-ide.org/ and download Scala IDE.
   Unzip the .zip file.
   Copy the Eclipse folder to any desired location.

3. Hadoop packages:
   Go to https://github.com/srccodes/hadoop-common-2.2.0-bin and download the .zip file.
   Unzip the .zip file.
   Copy hadoop-common-2.2.0-bin-master, which contains a bin folder, to any desired location.
   (Say it has been copied to the C://Hadoop folder, so the bin folder is in the location C://Hadoop/hadoop-common-2.2.0-bin-master/.)
   (NOTE: We will need this location while programming.)

4. Make sure the environment variable for Java (the java/bin folder) is also set, as for Scala in step 1. If not, add Java's bin folder to the PATH variable.

5. Launch an AWS Spark cluster:
   5.1. Log in to the AWS account.
   5.2. Create a new cluster and add Spark to the Software configuration.
   5.3. Choose the required Hardware configuration (the cheapest is m4.large) and the number of instances (2 is sufficient for this exercise).
   5.4. Start the cluster.
   5.5. Once the cluster is started, log in to the master node using PuTTY (Windows) or SSH (Mac or Linux).
   5.6. Create a data bucket in AWS S3 for storing all data.
   (Video on setting up an AWS cluster: https://www.youtube.com/watch?v=_A_xEtd2OIM)

Part 2: Create a New Maven Project

The following instructions let you work with Spark in Eclipse itself, so you can program and debug directly in Eclipse.

1. Open the Eclipse that came with Scala IDE.
2. Click File -> New -> Other -> Maven Project and select Next.
3. Select the 'Create a simple project' checkbox and click Next.
4. Give any 'Group Id', e.g. 'org.SparkSQL'.
5. Give any 'Artifact Id', e.g. 'SparkSQLScala', and click Finish.
6. Right click on the newly created project, go to Configure, and select 'Add Scala Nature'.
7. Open the new project, open pom.xml, and go to the pom.xml tab.
8. Copy the following Maven dependencies into pom.xml, inside the <project> ... </project> element (a complete example pom.xml appears at the end of this Part):

       <dependencies>
         <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-core_2.11</artifactId>
           <version>2.2.1</version>
         </dependency>
         <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-sql_2.11</artifactId>
           <version>2.2.1</version>
         </dependency>
       </dependencies>

9. Inside the project, right click on src/main/java, select Refactor, and select Rename. Rename 'java' to 'scala'.
10. Right click on src/main/scala and select New -> Package. Name the package 'org.SparkSQL' (the same name you gave for 'Group Id' in step 4).
11. Right click on the package (org.SparkSQL) and select New -> Scala Object. Name it 'org.SparkSQL.Driver'.

Find Spark programming tutorials here: http://spark.apache.org/docs/latest/programming-guide.html
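For reference, here is a minimal sketch of what the complete pom.xml might look like after step 8, assuming the Group Id and Artifact Id suggested above; the header fields Eclipse generates for your project may differ slightly:

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>org.SparkSQL</groupId>
      <artifactId>SparkSQLScala</artifactId>
      <version>0.0.1-SNAPSHOT</version>
      <dependencies>
        <!-- Spark core engine, compiled against Scala 2.11 -->
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>2.2.1</version>
        </dependency>
        <!-- Spark SQL module (SQLContext, DataFrames) -->
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_2.11</artifactId>
          <version>2.2.1</version>
        </dependency>
      </dependencies>
    </project>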
Part 3: Use the SparkSQL Library to Load Data into a Table and Run SQL Queries on It

1. Delete all contents inside Driver.scala and copy the following code into the Maven project created in Part 2.

   Driver.scala:

    package org.SparkSQL

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    case class Car(buying: String, maint: String, doors: String, persons: String,
                   lug_boot: String, safety: String, car_class: String)

    object Driver {
      def main(args: Array[String]): Unit = {

        // Set the Hadoop home directory (needed only when running in Eclipse)
        //System.setProperty("hadoop.home.dir", "C:\\Hadoop\\hadoop-common-2.2.0-bin-master\\")

        // Set up the Spark configuration
        val conf = new SparkConf().setAppName("SparkAction").setMaster("local")
        val sc = new SparkContext(conf)
        val sqlc = new SQLContext(sc)

        // Load the data file given as the first command-line argument
        val loadFile = sc.textFile(args(0))
        val fileRDD = loadFile.map { line => line.split(",") }

        // Map each comma-separated line to a Car object
        val carRDD = fileRDD.map { car => Car(car(0), car(1), car(2), car(3), car(4), car(5), car(6)) }

        import sqlc.implicits._

        // Convert the RDD into a DataFrame and register it as a table
        val carsDF = carRDD.toDF()
        carsDF.registerTempTable("Cars")

        // Query the table
        val queryDF = sqlc.sql("select * from Cars where buying='vhigh' and doors like '%more' and car_class='unacc'")
        val queryRDD = queryDF.rdd

        // Store the result as a text file at the second command-line argument
        queryRDD.saveAsTextFile(args(1))
      }
    }

2. In the program, the commented-out System.setProperty("hadoop.home.dir", ...) line must point to the location of the Hadoop bin folder you saved in step 3 of Part 1. Follow the same pattern for the folder location, and uncomment this line if you run the code in Eclipse.
3. Save the project.
4. Right click on the project, select 'Run As', and click 'Maven Clean'. This step removes old .jar files from the 'target' folder.
5. Right click on the project, select 'Run As', and click 'Maven Install'. Check the 'target' folder for a .jar file and rename it to 'SparkSQLScala.jar'.
6. Download the Car Evaluation data from http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/CarEvaluationData/CarData.zip
7. Upload the data.txt and SparkSQLScala.jar files to your AWS S3 data bucket.
8. From the master node, download SparkSQLScala.jar using the command:

       aws s3 cp s3://BUCKET_NAME/SparkSQLScala.jar .

9. Run the program using the command:

       spark-submit --class org.SparkSQL.Driver ./SparkSQLScala.jar s3://BUCKET_NAME/data.txt s3://BUCKET_NAME/SparkSQLOutput

10. Download the output folder (SparkSQLOutput) from S3 to your local machine.
11. Upload the downloaded folder to Canvas.
12. Save all the commands from your terminal window in a text file, from your login and username through the last command, and upload it to Canvas.
13. Delete/terminate the AWS cluster and delete all files from S3 when finished; otherwise Amazon will charge your credit card.
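Hint for step 10: Spark writes its output as a folder of part files, so downloading it from S3 needs a recursive copy. A sketch, assuming the AWS CLI is configured on your local machine:

    aws s3 cp s3://BUCKET_NAME/SparkSQLOutput ./SparkSQLOutput --recursive

Aside: the SQL query in Driver.scala can also be expressed with the DataFrame API instead of a SQL string. A minimal sketch, reusing the sqlc and carsDF values defined in the program above (not required for this exercise):

    import sqlc.implicits._

    // Same filter as the SQL query: buying='vhigh', doors ending in 'more', car_class='unacc'
    val queryDF = carsDF.filter(
      $"buying" === "vhigh" &&
      $"doors".like("%more") &&
      $"car_class" === "unacc")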