Cloud Computing for Data Analysis
----------------------------------
Exercise 02: Example MapReduce Program Without Cloudera

Programming Language: Java

1. Install the required software:
   - PuTTY (for Windows); use Terminal (for Apple Macintosh)
   - WinSCP (for Windows); CyberDuck (for Apple Macintosh)
   More instructions for logging into the DSBA cluster:
   http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/GroupActivity01_LoggingInto_UNCC_cluster.docx

2. Read from the textbook "Hadoop: The Definitive Guide"
   http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1491901632
   - Chapter 1: Meet Hadoop
   - Chapter 2: MapReduce
   - Chapter 3: The Hadoop Distributed File System
   - Chapter 6: Developing a MapReduce Application
   - Chapter 7: How MapReduce Works
   - Chapter 8: MapReduce Types and Formats
   - Chapter 9: MapReduce Features

3. Read the MapReduce Tutorial at
   https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Part 1
--------
4. Copy the 'WordCount v1.0' program from
   https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0
   and run it on the AWS -OR- dsba-hadoop cluster.
   (NOTE: For the input file of the WordCount program, use the Mammals book from:
   http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt)

4.1. Open Eclipse.
4.2. Click File -> New -> Other -> Maven -> Maven Project and select Next.
4.3. Select the "Create a simple project" checkbox and select Next.
4.4. Give the Group ID (package name) as "org.wc".
4.5. Give the Artifact ID (project name) as "MRWordCount" and select Finish.
4.6. Open the pom.xml file inside the MRWordCount project.
4.7. Inside the <project> tag, after the existing content, copy and paste the following dependencies:

       <dependencies>
         <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-common</artifactId>
           <version>2.8.1</version>
         </dependency>
         <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-mapreduce-client-core</artifactId>
           <version>2.8.1</version>
         </dependency>
         <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-yarn-common</artifactId>
           <version>2.8.1</version>
         </dependency>
         <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-mapreduce-client-common</artifactId>
           <version>2.8.1</version>
         </dependency>
       </dependencies>

     All of the above dependencies can be downloaded from:
     https://mvnrepository.com/artifact/org.apache.hadoop
4.8. Save the pom.xml.
4.9. Right-click on src/main/java and select New -> Package. Give the package name as org.wc (the name that we gave for the Group ID) and select Finish.
4.10. Right-click on the org.wc package and select New -> Class. Give the class name as "WordCount" and select Finish.
4.11. Go to https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0
4.12. Copy the entire WordCount v1.0 program and paste it into the class. We have to keep the first line (package org.wc;) of the file. (For reference, a condensed sketch of this program appears at the end of Part 1.)
4.13. Convert the project into a jar file. To do this, right-click on the project, select Export -> Java -> JAR file and click Next. Give the export destination (suggestion: keep the .jar file in your home directory, since you can then run it directly without changing directories). The .jar file will be named 'untitled' by default; you can rename it. Click Finish.
4.14. Check that the .jar file is present in the desired folder.
4.15. Move the .jar file to your server using WinSCP or CyberDuck.
4.16. Get the Mammals book from the link above, save it as 'mammals.txt', and move it to the server using WinSCP or CyberDuck. Then move 'mammals.txt' from your server into HDFS with the command:
      hadoop fs -put /users/UNCC_USERNAME/mammals.txt /user/UNCC_USERNAME/
4.17. Execute the program with the command:
      hadoop jar JAR_FILE_NAME.jar org.wc.WordCount /user/UNCC_USERNAME/mammals.txt /user/UNCC_USERNAME/WordCountV1Output/
      (NOTE:
      1. In the above command, WordCount is the main class name.
      2. To run the same command a second time, you must delete the old output folder or give the new output folder another name; Hadoop cannot replace an existing output folder.
      3. To remove a folder from HDFS, use the command:
         hadoop fs -rm -r /user/UNCC_USERNAME/FOLDER_NAME )
4.18. Once the execution is complete, check that the output files were created using:
      hadoop fs -ls /user/UNCC_USERNAME/WordCountV1Output
4.19. To open a file in the output folder, use:
      hadoop fs -cat /user/UNCC_USERNAME/WordCountV1Output/part-r-00000
4.20. To copy this file from HDFS to local, use:
      hadoop fs -get /user/UNCC_USERNAME/WordCountV1Output/part-r-00000 /users/UNCC_USERNAME/WordCountV1.txt
4.21. Get "WordCountV1.txt" using WinSCP or CyberDuck and upload it in Canvas.
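For reference, the class from step 4.12 has the following shape. This is a condensed sketch close to the tutorial's WordCount v1.0 listing, with the package line from our project added and explanatory comments inserted; copy the official version from the link in step 4.11 for your actual submission.

   package org.wc;   // keep the package line Eclipse generated in step 4.10

   import java.io.IOException;
   import java.util.StringTokenizer;

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.Reducer;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

   public class WordCount {

     // Mapper: emits (word, 1) for every token in each input line
     public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

       private final static IntWritable one = new IntWritable(1);
       private Text word = new Text();

       public void map(Object key, Text value, Context context)
           throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) {
           word.set(itr.nextToken());
           context.write(word, one);
         }
       }
     }

     // Reducer (also used as combiner): sums the counts for each word
     public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

       private IntWritable result = new IntWritable();

       public void reduce(Text key, Iterable<IntWritable> values, Context context)
           throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
           sum += val.get();
         }
         result.set(sum);
         context.write(key, result);
       }
     }

     // Driver: wires the job together; args[0] is the HDFS input path,
     // args[1] is the (not yet existing) HDFS output folder
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       Job job = Job.getInstance(conf, "word count");
       job.setJarByClass(WordCount.class);
       job.setMapperClass(TokenizerMapper.class);
       job.setCombinerClass(IntSumReducer.class);
       job.setReducerClass(IntSumReducer.class);
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(IntWritable.class);
       FileInputFormat.addInputPath(job, new Path(args[0]));
       FileOutputFormat.setOutputPath(job, new Path(args[1]));
       System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
   }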
Part 2
-------
5. Copy the 'WordCount v2.0' program and run it on the AWS -OR- dsba-hadoop cluster.
   (NOTE: For the input file of the WordCount program, use the Mammals book from:
   http://webpages.uncc.edu/aatzache/ITCS6190/Exercises/03_MammalsBook_Text_34848.txt.utf8.txt)
5.1. Follow steps 4.1 to 4.21, using
     https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v2.0
5.2. Open PuTTY or Terminal.
5.3. Get the Mammals book from the link above and rename it to 'mammals.txt'. Copy the 'mammals.txt' file into HDFS. For example, if 'mammals.txt' is in the home folder of the FTP client, use the command:
     hadoop fs -put /users/UNCC_username/mammals.txt /user/UNCC_username/
5.4. Run the 'WordCount v2.0' program in a clustered environment. (This program will *not* run on the laptop, as it needs to use multiple machines.)
5.5. Execute the job and save the output into an output text file, using the same steps that we followed in 4.17-4.20; an example session is also sketched at the end of this handout.
     (NOTE: While running in the cluster, use /users/UNCC_username/ to access FTP client files and /user/UNCC_username/ to access HDFS files.)
5.6. Get the output file from HDFS and upload the .jar file and the output text file to Canvas.

Each student should upload this part individually.
--------------------------
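To put Part 2 together, a complete session on the cluster might look like the sketch below. The jar name wc2.jar, the main class name org.wc.WordCount2, the output folder name, and the patterns file patterns.txt are placeholders for whatever you created, not prescribed names; the wordcount.case.sensitive property and the -skip option are the extra features demonstrated by the tutorial's WordCount v2.0 example.

   # Copy the input file from your FTP home into HDFS (skip if already done in step 5.3)
   hadoop fs -put /users/UNCC_username/mammals.txt /user/UNCC_username/

   # Assumption: patterns.txt lists one regular expression per line to skip,
   # e.g. \. \, \! as in the tutorial's example patterns file
   hadoop fs -put /users/UNCC_username/patterns.txt /user/UNCC_username/

   # Run WordCount v2.0 case-insensitively, skipping the listed patterns
   hadoop jar wc2.jar org.wc.WordCount2 -Dwordcount.case.sensitive=false \
       /user/UNCC_username/mammals.txt /user/UNCC_username/WordCountV2Output \
       -skip /user/UNCC_username/patterns.txt

   # Inspect and fetch the result, as in steps 4.18-4.20
   hadoop fs -cat /user/UNCC_username/WordCountV2Output/part-r-00000
   hadoop fs -get /user/UNCC_username/WordCountV2Output/part-r-00000 /users/UNCC_username/WordCountV2.txt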