Cloud Computing for Data Analysis

Group Activity 04 - PART4: Download Spark Action Rules software and run on AWS-EMR Spark cluster

Get Car Evaluation Data from: http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/CarEvaluationData/CarData.zip
Get Mammographic Mass Data from: http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/MammographicMassData/MammData.zip

Replicate both data 1024 times. Templates of all data files are given in the 'Data' folder in the .zip file that we download from the below link. Execute the code for both datasets.

Download source code here: http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/SparkAction.zip

To run the Code:

1. Create a cluster with Hadoop and Spark in AWS and start the cluster. Once the cluster is running, log-in to the master node using Putty(Windows) or SSH(MAC or Linux)

2. Create a data bucket in AWS S3. Upload data.txt, attributes.txt, parameters.txt from the Data directory and SparkAction-0.0.1-SNAPSHOT.jar files to S3

3. From the master node download SparkAction-0.0.1-SNAPSHOT.jar using the command:
	aws s3 cp s3://BUCKET_NAME/SparkAction-0.0.1-SNAPSHOT.jar .

4. Run the .jar file using your terminal or Putty using following command:

	spark-submit --class org.ActionRules.Main --master yarn --deploy-mode client LOCATION_OF_JAR_FILE s3://BUCKET_NAME/attributes.txt s3://BUCKET_NAME/parameters.txt s3://BUCKET_NAME/data.txt s3://BUCKET_NAME/SparkActionRulesOutput

5. Download the output folder(SparkActionRulesOutput) from S3 to your local machine

6. Upload SparkActionRulesOutput along with text from the terminal window to Canvas

7. Delete/Terminate the AWS cluster and delete all files from S3 when finished, otherwise Amazon will charge your Credit Card