Cloud Computing for Data Analysis

Exercise - 03: Hive

1. Download Car data from : http://webpages.uncc.edu/aatzache/ITCS6162/Project/Data/CarEvaluationData/CarData.zip 

2. Create a cluster with Hive and Hadoop in AWS and start the cluster. Once the cluster is running, log-in to the master node using Putty(Windows) or SSH(MAC or Linux)

3. Create a data bucket in AWS S3. Unzip CarData.zip and upload data.txt and attributes.txt file to S3

4. In the master node, to enter Hive, type hive and click Enter

$ hive

hive>


5. Create a table called CarData in Hive by using the attribute names from  CarDataAttributes.txt.

CREATE TABLE UNCCUserNameCarData(Buying STRING,Maint STRING,Doors STRING,Persons STRING,LugBoot STRING,Safety STRING,Class String)
COMMENT 'This is the Cars table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

6. Load data into the created table using:
load data inpath 's3://BUCKET_NAME/data.txt' overwrite into table UNCCUserNameCarData;

To create Hive table see the tutorial below : https://cwiki.apache.org/confluence/display/Hive/Tutorial

(NOTE: The Hive tables that we create are centralized. If we create a table named CarData, it will be created for the entire class. So create unique Hive tables like UNCCUserNameCarData)

(NOTE: The directory(CarData) that we created in HDFS may get deleted after creating table. So if we run into any problem while running this command or if we want 
to create table again, we have to check if the file 'CarData.txt' is still available in HDFS. If it got deleted, we have to do the steps 3 and 4 again)

7. Run SQL queries on table CarData :

INSERT OVERWRITE LOCAL DIRECTORY './CarDataHiveOutput' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
SELECT * FROM UNCCUserNameCarData 
WHERE UNCCUserNameCarData.Buying = 'vhigh' 
AND UNCCUserNameCarData.Doors LIKE '%more' 
AND UNCCUserNameCarData.Class = 'unacc';

This command will store the query output in the directory 'CarDataOutput'. We can access files inside this directory from CyberDuck or WinSCP.
The output file would be named something like 000000_0. We can rename it to any name like 'CarDataOutput.txt'

8. Copy the output to S3 using:
	aws s3 cp ./CarDataHiveOutput/000000_0 s3://BUCKET_NAME/CarDataHiveOutput.txt

9. Download CarDataHiveOutput.txt from S3

10. Save all commands in your Terminal Window in a Text File , including your login and username until the last command, and upload the   TeminalWindow.txt  and  CarDataHiveOutput.txt   to Canvas .

11. Delete/Terminate the AWS cluster and delete all files from S3 when finished, otherwise Amazon will charge your Credit Card