Cloud Computing for Data Analysis
Exercise 04: Spark SQL Program
------------------------------

Part 1: Scala Environment Setup

Download the following:

1. Scala:
   Go to http://www.scala-lang.org/download/ and download the Scala binaries from 'Other ways to install Scala'.
   Install Scala on the laptop.
   Set an environment variable (under system variables) pointing to the scala/bin folder.
   (NOTE: To set the environment variables:
    In Windows: Right click on My Computer -> Select Properties -> Select 'Advanced system Settings' -> Select the 'Environment Variables' button -> Under 'System Variables', select PATH and click the 'Edit' button -> Click the 'New' button, add the desired path, and click OK.
    On Macintosh: See the instructions here: http://osxdaily.com/2015/07/28/set-enviornment-variables-mac-os-x/)

2. Scala IDE:
   Go to http://scala-ide.org/ and download Scala IDE.
   Unzip the .zip file.
   Copy the Eclipse folder to any desired location.

3. Hadoop packages:
   Go to https://github.com/srccodes/hadoop-common-2.2.0-bin and download the .zip file.
   Unzip the .zip file.
   Copy hadoop-common-2.2.0-bin-master, which contains a bin folder, to any desired location.
   (Say it has been copied to the C://Hadoop folder, so the bin folder is in the location C://Hadoop/hadoop-common-2.2.0-bin-master/.)
   (NOTE: We will need this location while programming.)

4. Make sure the environment variable for Java (the java/bin folder) is also set, as for Scala in step 1. If not, add Java's bin folder to the PATH variable.

5. Launch an AWS Spark cluster:
   5.1. Log in to the AWS account.
   5.2. Create a new cluster and add Spark to the Software configuration.
   5.3. Choose the required Hardware configuration (the cheapest is m4.large) and the number of instances (2 is sufficient for this exercise).
   5.4. Start the cluster.
   5.5. Once the cluster is started, log in to the master node using PuTTY (Windows) or SSH (Mac or Linux).
   5.6. Create a data bucket in AWS S3 for storing all data.
   (Video on setting up an AWS cluster: https://www.youtube.com/watch?v=_A_xEtd2OIM)

Part 2: Create a New Maven Project

The following instructions let you work with Spark in Eclipse itself, so you can program and debug directly in Eclipse.

1. Open the Eclipse that came with Scala IDE.
2. Click File -> New -> Other -> Maven Project and select Next.
3. Select the 'Create a simple project' checkbox and click Next.
4. Give any 'Group Id', e.g. 'org.SparkSQL'.
5. Give any 'Artifact Id', e.g. 'SparkSQLScala', and click Finish.
6. Right click on the newly created project, go to Configure, and select 'Add Scala Nature'.
7. Open the new project, open pom.xml, and go to the pom.xml tab.
8. Copy the following Maven dependencies into pom.xml, inside the <project> ... </project> element (a complete example pom.xml appears at the end of this Part):

       <dependencies>
         <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-core_2.11</artifactId>
           <version>2.2.1</version>
         </dependency>
         <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-sql_2.11</artifactId>
           <version>2.2.1</version>
         </dependency>
       </dependencies>

9. Inside the project, right click on src/main/java, select Refactor, and select Rename. Rename 'java' to 'scala'.
10. Right click on src/main/scala and select New -> Package. Name the package 'org.SparkSQL' (the same name you gave for 'Group Id' in step 4).
11. Right click on the package (org.SparkSQL) and select New -> Scala Object. Name it 'org.SparkSQL.Driver'.

Find Spark programming tutorials here: http://spark.apache.org/docs/latest/programming-guide.html
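For reference, here is a minimal sketch of what the complete pom.xml might look like after step 8, assuming the Group Id and Artifact Id suggested above; the header fields Eclipse generates for your project may differ slightly:

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>org.SparkSQL</groupId>
      <artifactId>SparkSQLScala</artifactId>
      <version>0.0.1-SNAPSHOT</version>
      <dependencies>
        <!-- Spark core engine, compiled against Scala 2.11 -->
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>2.2.1</version>
        </dependency>
        <!-- Spark SQL module (SQLContext, DataFrames) -->
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_2.11</artifactId>
          <version>2.2.1</version>
        </dependency>
      </dependencies>
    </project>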
Part 3: Use the SparkSQL Library to Load Data into a Table and Run SQL Queries on It

1. Delete all contents inside Driver.scala and copy the following code into the Maven project created in Part 2.

   Driver.scala:

    package org.SparkSQL

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    case class Car(buying: String, maint: String, doors: String, persons: String,
                   lug_boot: String, safety: String, car_class: String)

    object Driver {
      def main(args: Array[String]): Unit = {

        // Set the Hadoop home directory (needed only when running in Eclipse)
        //System.setProperty("hadoop.home.dir", "C:\\Hadoop\\hadoop-common-2.2.0-bin-master\\")

        // Set up the Spark configuration
        val conf = new SparkConf().setAppName("SparkAction").setMaster("local")
        val sc = new SparkContext(conf)
        val sqlc = new SQLContext(sc)

        // Load the data file given as the first command-line argument
        val loadFile = sc.textFile(args(0))
        val fileRDD = loadFile.map { line => line.split(",") }

        // Map each comma-separated line to a Car object
        val carRDD = fileRDD.map { car => Car(car(0), car(1), car(2), car(3), car(4), car(5), car(6)) }

        import sqlc.implicits._

        // Convert the RDD into a DataFrame and register it as a table
        val carsDF = carRDD.toDF()
        carsDF.registerTempTable("Cars")

        // Query the table
        val queryDF = sqlc.sql("select * from Cars where buying='vhigh' and doors like '%more' and car_class='unacc'")
        val queryRDD = queryDF.rdd

        // Store the result as a text file at the second command-line argument
        queryRDD.saveAsTextFile(args(1))
      }
    }

2. In the program, the commented-out System.setProperty("hadoop.home.dir", ...) line must point to the location of the Hadoop bin folder you saved in step 3 of Part 1. Follow the same pattern for the folder location, and uncomment this line if you run the code in Eclipse.
3. Save the project.
4. Right click on the project, select 'Run As', and click 'Maven Clean'. This step removes old .jar files from the 'target' folder.
5. Right click on the project, select 'Run As', and click 'Maven Install'. Check the 'target' folder for a .jar file and rename it to 'SparkSQLScala.jar'.
6. Download the Car Evaluation data from http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/CarEvaluationData/CarData.zip
7. Upload the data.txt and SparkSQLScala.jar files to your AWS S3 data bucket.
8. From the master node, download SparkSQLScala.jar using the command:

       aws s3 cp s3://BUCKET_NAME/SparkSQLScala.jar .

9. Run the program using the command:

       spark-submit --class org.SparkSQL.Driver ./SparkSQLScala.jar s3://BUCKET_NAME/data.txt s3://BUCKET_NAME/SparkSQLOutput

10. Download the output folder (SparkSQLOutput) from S3 to your local machine.
11. Upload the downloaded folder to Canvas.
12. Save all the commands from your terminal window in a text file, from your login and username through the last command, and upload it to Canvas.
13. Delete/terminate the AWS cluster and delete all files from S3 when finished; otherwise Amazon will charge your credit card.
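Hint for step 10: Spark writes its output as a folder of part files, so downloading it from S3 needs a recursive copy. A sketch, assuming the AWS CLI is configured on your local machine:

    aws s3 cp s3://BUCKET_NAME/SparkSQLOutput ./SparkSQLOutput --recursive

Aside: the SQL query in Driver.scala can also be expressed with the DataFrame API instead of a SQL string. A minimal sketch, reusing the sqlc and carsDF values defined in the program above (not required for this exercise):

    import sqlc.implicits._

    // Same filter as the SQL query: buying='vhigh', doors ending in 'more', car_class='unacc'
    val queryDF = carsDF.filter(
      $"buying" === "vhigh" &&
      $"doors".like("%more") &&
      $"car_class" === "unacc")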