Cloud Computing for Data Analysis Project Description Part 1 ----------- MapReduce program - worth 20 points ( out of 100 ) - Obtain PatentData XML files on the https://www.dropbox.com/sh/celzjax2ebrokby/AAAD_R6Sy2XKUOvobgSXI3l5a?dl=0 all the XML files are in ::: PatentData/patGrants2005_2016Nov to UnZip the file unzip ipg100727.zip - Parse the PatentData XML files to create data in a table ( relational ) format . MapReduce program should take as an input several XML files at once ( not just one ) . - For example from the following XML tags : new_chip electronics 2015 create a comma separated text file , containing one patenet per line , and each tag is a column ( AttributeName ) : patenet_name patent_class year ----- -------- ----- new_chip electronics 2015 monitor electronics 2014 electric_car vehicles 2012 - The XML file contains a text filed with description of the patent. Use your WordCount program to extract the 3 most frequent words ( words with 3 highest counts in the text ) , and create 3 additional attributes containint these words : patenet_name patent_class year word1 word2 word3 ----- -------- ----- ----- ----- ----- new_chip electronics 2015 chip intel design monitor electronics 2014 display apple 27" electric_car vehicles 2012 drive google battery - Save the data file as PatentData.txt each patenet on a new line , each value separated by comma ' , ' and NO attribute names . Save the attribute names in a separate file called PatentDataAttributes.txt each attribute on a new line . - Upload your PatentData.txt and PatentDataAttributes.txt to Canvas . Part 2 ------- HIVE program - worth 20 points ( out of 100 ) - Load PatentData.txt file ( you created in Step. 1 ) into a HIVE table - Run SQL queries on the data to obtain: 2.1. all Patents from year 2015 2.2. all Patents owned by the company ' Apple ' 2.3. all patents in class electronics, that contain the frequent word ' Samsung ' , AND ' Apple ' - Save the output of each query in a separate text file . Upload your output files to Canvas . Part 3 ------- SPARK program - worth 20 points ( out of 100 ) - Use the PatentData.txt file ( from Step. 1 ) with the Spark MLib ( Machine Learning Library ) - Create the following : 3.1. Decision Tree , use the attribute PatentClass as a decision attribute 3.2. Classification , use the attribute word1 ( most frequent word ) as a decision attribute 3.3. Clustering : - use K-MEANS clustering , create 12 clusters ; - use hierarchical clustering ( number of clusters automatic ) - Save the output for DecisionTree, Classification, and Clustering . Upload your output files to Canvas . Part 4 ------- PRESENTATION - worth 20 points ( out of 100 ) - bring candy / sweets ( ex. chocolates ( sneakers , mars , M&M's , etc.) , cookies , cupcakes , doughnuts ) - present your project to the class on the assigned presentation date - create 7 PowerPoint slides total for the group - each student presents 1 slide , and speaks maximum 1 to 2 minutes - each student should ask 1 question to the class about what is shown on his/her slide - if a student answers the question correctly - go and give them the candy / sweets . - one student runs the 3 programs above - MapReduce , Hive , and Spark MLib in front of the class , and explains the code , and what the results of the programs are - find a YouTube video related to your presentation subject , video should be 2 - 3 minutes maximum ( otherwise stop the video at the 3rd minute ) . one student should show the video to the class . - all students must attend the class on the presentation date . if a student is not present on the presentation date , the student will receive a grade of 0 for the project - all source code ( Part 3 ) , PowerPoint slides file , and YouTube video link are due on Canvas 3 days before the presentation date . Part 5 ------- Research Paper Report - worth 20 points ( out of 100 ) - Write a research paper showing the results of your project . - The paper should contain the following sections : 5.1. Abstract 5.2. Introduction 5.3. Related Work 5.4. Method 5.5. Experiment and Results 5.6. Conclusion 5.7. References - Use the following papers as an example . However, you cannot copy any parts of the example papers . Your writing should be original . MR – Random Forest Algorithm for Distributed Action Rules Discovery http://webpages.uncc.edu/aatzache/Papers/IEEE_Formatted_MRRandomForestDistributedActionRules.pdf MR - Apriori Count Distribution Algorithm for Parallel Action Rules Discovery http://webpages.uncc.edu/aatzache/Papers/IEEE_Formatted_MRCountDistributionAprioriActionRules.pdf - Format the paper in IEEE format ( 2 columns , times new roman font , etc. ). Use the following template : IEEE Formatting Template http://webpages.uncc.edu/aatzache/ITCS6190/Project/IEEE_Format_Instructions8.5x11x2.doc - References should be formatted the same as in the example papers above . References are shown in the order in which they are cited . In other words, the first paper which is cited in the text should be numbered [1], the second paper cited should be numbered [2], and so on . Use the following guidelines for formatting your references . Do not use Wikipedia , or web links for references . All references should be citing a published research paper - a conference paper , a journal paper , a textbook , or another published book . https://owl.english.purdue.edu/owl/resource/560/06/ - Upload your Paper as a Word document in IEEE format to Canvas .