Cloud Computing for Data Analysis

Exercise04 Spark SQL program
------------------------------

Part 1
Scala Environment SetUp

Download the following:
1. Scala:
	Go to website http://www.scala-lang.org/download/ and download Scala binaries from 'Other ways to install Scala'
	Install Scala in the laptop
	Set environment variable (under system variables), set a path to scala/bin folder 
	(NOTE: To set the environment variables, in Windows :
		Right click on My Computer -> Select Properties -> 'Select Advanced system Settings' -> Select 'Environment Variables' button -> 
		Under 'System Variables', select PATH and click on 'Edit button' -> Click on 'New button' and add the desired path and give OK)
		in Macintosh :
		See instructions here: http://osxdaily.com/2015/07/28/set-enviornment-variables-mac-os-x/
	)


2. Scala-IDE
	Go to website http://scala-ide.org/ and download Scala-IDE
	Unzip the .zip file
	Copy the Eclipse to any desired locations

3. Hadoop packages:
	Go to website https://github.com/srccodes/hadoop-common-2.2.0-bin and download the .zip file
	Unzp the .zip file
	Copy the hadoop-common-2.2.0-bin-master, which contains a bin folder, to any desired location. (Say it has been copied to C://Hadoop folder. So, the bin folder is in the location: C://Hadoop/hadoop-common-2.2.0-bin-master/)
	(NOTE: while programming, we need this location) 

4. Make sure the environment variable for Java ( java/bin folder ) is also set like Scala see Step 1. If not, make the path variable to Java's bin folder


Part2
Create a New Maven Project

The following instructions are given to work with Spark in Eclipse itself. So you can program and debug in Eclipse itself.

1. Open Eclipse which came with Scala-IDE
2. Click File-> New-> Other -> Maven Project and select Next
3. Select 'Create simple project checkbox' and click Next
4. Give any 'Group Id' similar 'org.SparkSQL'
5. Give any 'Artifact Id' similar to 'SparkSQLScala' and click Finish
6. Right click on newly created project and go to Configure and select 'Add Scala nature'
7. Open the new project and open the pom.xml and go to pom.xml tab
8. Copy the following Maven dependency into pom.xml

  <dependencies>
  	<dependency>
  		<groupId>org.apache.spark</groupId>
  		<artifactId>spark-core_2.11</artifactId>
  		<version>2.2.1</version>
  	</dependency>
  	<dependency>
  		<groupId>org.apache.spark</groupId>
  		<artifactId>spark-sql_2.11</artifactId>
  		<version>2.2.1</version>
  	</dependency>
  </dependencies>

  and paste it under <version>....</version> line


9. Inside the project, right click on src/main/java, select Refactor and select Rename. Rename 'java' as 'scala'
10. Right click on src/main/scala and select New -> Package. Name the package as 'org.SparkSQL'(Same name as you have given for 'Group Id' in step 4)
11. Right click on the package(org.SparkSQL) and select New -> Scala Object. Name it as 'org.SparkSQL.Driver'

Find Spark programming tutorials here: http://spark.apache.org/docs/latest/programming-guide.html


Part3
Use SparkSQL Library to load Data into a Table , and run SQL queries on it

1. Delete all contents inside Driver.scala and copy the following code into the Maven Project createin Part2 - Driver.scala

package org.SparkSQL

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

case class Car(buying:String, maint:String, doors:String, persons:String, lug_boot:String, safety:String, car_class:String)

object Driver {
  def main(args: Array[String]): Unit = {
    //Set Hadoop home directory
    //System.setProperty("hadoop.home.dir", "C:\\Hadoop\\hadoop-common-2.2.0-bin-master\\")
    //Setting up Spark configurations
    val conf = new SparkConf().setAppName("SparkAction").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlc = new SQLContext(sc)
    
    //Loading a data file
    val loadFile = sc.textFile(args(0))
    val fileRDD = loadFile.map { line => line.split(",") }
    //Mapping the data into an RDD
    val carRDD = fileRDD.map { car => Car(car(0),car(1),car(2),car(3),car(4),car(5),car(6)) }
    
    import sqlc.implicits._
    //Converting the RDD into a dataframe
    val carsDF = carRDD.toDF()
    carsDF.registerTempTable("Cars") // Creating a table from the dataframe
    //Querying the table
    val queryDF = sqlc.sql("select * from Cars where buying='vhigh' and doors like '%more' and car_class='unacc'")
    val queryRDD = queryDF.rdd
    queryRDD.saveAsTextFile(args(1)) //Storing the result as a textfile
  }
}

In the program, line number 12,
	System.setProperty("hadoop.home.dir", "location of bin folder of Hadoop you saved in step 3 of Part 1")
	Follow the same pattern for the folder location
	Uncomment line 12


3. Save the project.
4. Right click on the project. Select 'RunAs' and click 'Maven Clean'. This step removes old .jar files in the 'target' folder
5. Right click on the project. Select 'RunAs' and click 'Maven Install'. Check the 'target' folder for a .jar file. Rename the .jar file as 'SparkSQLScala.jar'
6. Download Car Evaluation Data from http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/CarEvaluationData/CarData.zip
7. Copy data.txt and SparkSQLScala.jar files to your UNCC account using WinSCP or CyberDuck
8. Copy data.txt to HDFS using the command:
	hadoop fs -put /users/UNCCUserName/data.txt /user/UNCCUserName/
9. Run the program using the command: 
	spark2-submit --class org.SparkSQL.Driver --master yarn --deploy-mode client /users/abagavat/SparkSQLScala.jar /user/UNCCUserName/data.txt /user/UNCCUserName/SparkSQLOutput
10. Copy the output to your server using the command:
	hadoop fs -get /user/UNCCUserName/SparkSQLOutput/part-00000 /users/abagavat/SparkSQLOutput.txt

11. Get SparkSQLOutput.txt from CyberDuck or WinSCP and upload in Canvas