Hadoop + Solr AWS Deployment
Ariya Bala Sadaiappan
Hadoop+Solr+AWS Integration
2 Apr 2014

Installation - Cloudera Manager / Hadoop Cluster

1. Launch EC2 Instances: (Use Spot Instances)

Reference: http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-viacloudera-manager/

• Log in to AS AWS Dev
  • https://console.aws.amazon.com/iam/home?#users
• Launch ’n’ EC2 Spot Instances for installing Cloudera Manager and the Hadoop cluster
  • Navigate to EC2 > Spot Requests > Request Spot Instances > My AMIs
  • Select AMI Id: ami-a3d7ccca
  • Opt for the preferred instance type
• Configure Instance Details
  • Number of Instances - N {user specific}
  • Max. Price - 0.245
  • Availability Zone - select the one with the lowest current price in the table
• Add Storage
  • Ensure the root volume size is 50 GB
• Tag Spot Request
  • Tag with a custom key-value pair: POC - {Date}, ex: Apr14
  • Ensure the tag key is “POC”
• Configure Security Group
  • Select Existing Security Group
  • Security Group Id - sg-f1617e9a
  • Name - jclouds#poc-ariya
• Review and Launch
  • Choose Existing key pair - ads
• Request Spot Instances and View Instance Requests
  • Copy the comma-separated list of spot request ids into a file named “spotrequests.txt”, replacing each “,” so that the file holds one id per line (the layout Section 2 expects)
  • Wait until the Status changes from “pending-evaluation” to “active”

2. Prepare Export Variables:

This section explains the scripts used to prepare the list of export variables.

Note: The scripts and files referred to in this document are available in the GDrive under Hadoop_Solr POC/archive.zip

Ensure all the spot requests are active.
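Section 1 starts from a comma-separated list of spot request ids, while spotrequests.txt should hold one id per line, and every request has to be active before the scripts run. A minimal sketch of both steps follows. The function names and the sample ids are hypothetical, and the aws CLI call shown in the comment is an assumption about tooling; the scripts in archive.zip may do this differently.

```shell
#!/usr/bin/env bash
# Sketch only: helper functions for preparing spotrequests.txt.
# Function names and sample ids are hypothetical.

# Turn "id1, id2,id3" into one id per line.
split_spot_ids() {
  # Commas become newlines, stray whitespace is stripped,
  # and empty lines (for example from a trailing comma) are dropped.
  printf '%s\n' "$1" | tr ',' '\n' | tr -d ' \t\r' | sed '/^$/d'
}

# Read one request state per line on stdin.
# Succeeds only if there is at least one line and every line is "active".
all_active() {
  awk 'NF { n++; if ($1 != "active") bad = 1 }
       END { if (n > 0 && bad == 0) exit 0; exit 1 }'
}

# Example usage (not executed here):
#   split_spot_ids "sir-aaaa1111, sir-bbbb2222" > spotrequests.txt
#
# Assuming the AWS CLI is installed and configured, the states could be
# fetched and checked like this:
#   aws ec2 describe-spot-instance-requests \
#     --spot-instance-request-ids $(cat spotrequests.txt) \
#     --query 'SpotInstanceRequests[].State' --output text \
#     | tr '\t' '\n' | all_active && echo "all requests active"
```

Keeping the state check as a filter on stdin means it can be tried with any source of state strings, not only the AWS CLI.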
The export variables hold the public DNS names and are used by the automation to place the jars and files.

Master script: exec_prepIP.sh {SpotRequest_File Name}

    ./exec_prepIp.sh spotrequests.txt ip.txt publicip.txt privateip.txt exportpublicip.txt

Input Arguments:
• spotrequests.txt - holds the list of spot request ids, one per line
• ip.txt - created to hold the list of privateip:publicip pairs
• publicip.txt - created to hold the list of public IPs
• privateip.txt - created to hold the list of private IPs
• exportpublicip.txt - created to hold the export statements for the public IPs

Final Output:
• exportpublicip.txt - filled with the export statements for future use

The master script in turn calls:
• getSpotInstances.sh $spotRequestId - prepares ip.txt with internalIP:externalIP
• prepPublicPrivateIP.sh - reads ip.txt and prepares publicip.txt and privateip.txt
• exportPublicIP.sh - reads publicip.txt and prepares exportpublicip.txt with export statements

3. Install Latest Version of Cloudera Manager:

• Open Terminal
• Copy the content of the file “exportpublicip.txt” generated in Section 2
• The content will be similar to the following

    export m1=54.226.120.23
    export m2=54.197.132.98
    export m3=54.234.125.43
    export m4=54.242.203.115
    export m5=54.237.210.213
    export m6=54.80.65.111
    export m7=54.205.6.5
    export m8=54.197.221.12
    export m{N}=54.198.240.198

• Log in to machine $m1 and install Cloudera Manager using the commands below

    ssh -i ~/.ssh/asdev.pem -oStrictHostKeyChecking=no ubuntu@$m1
    wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
    chmod +x cloudera-manager-installer.bin
    sudo ./cloudera-manager-installer.bin

• Accept the license and continue the installation
• After the installation, log in to Cloudera Manager in the browser
  • http://54.226.120.23:7180 [i.e. $m1:7180]
  • Use admin/admin as the credentials
• Select Cloudera Standard Edition
• Click Launch Classic Wizard
• Enter the list of Private DNS names of all the instances (m1 - m{N})
  • Copy the contents of “privateip.txt” generated in Section 2

    ip-10-46-253-27.ec2.internal,ip-10-44-78-201.ec2.internal,ip-10-44-137-72.ec2.internal,ip-10-127-81-209.ec2.internal,ip-10-45-171-56.ec2.internal,ip-10-94-42-39.ec2.internal,ip-10-120-113-81.ec2.internal,ip-10-111-154-10.ec2.internal,ip-10-44-133-206.ec2.internal,ip-10-87-150-185.ec2.internal,ip-10-126-142-60.ec2.internal

• Click Search
• Ensure all the hosts are validated with the green tick
  • If the hosts are not able to talk to each other, correct the security group
  • i.e. add a new rule to the security group of the instances
  • The new rule should allow All TCP, with the group's own security group id (sg-xxxx) as the source
  • This allows the instances to talk to each other
• Click Continue
• Select the parcels to be installed and Continue
• Provide the appropriate ssh details
  • Login as ubuntu, not root
  • Private Key: asdev, as given during instance creation
• Installation will start
• Continue and customise the services to install as part of the Hadoop stack
• Use Embedded Database and Test Connection
• Complete the installation

In the Dashboard, the health status of the machines should not be Bad. If it is Bad, free up some space using the script mentioned in the Remove Cache/Archive Section.

4. Add/Remove Services: [Optional]

• Click Home
• Click on the drop-down to the right of the cluster name
• Rename the cluster if needed
• Click “Add a Service” from the drop-down
• Select Solr
• Select the set of dependencies
• Select one or more hosts for the new service
• Accept and Continue
• To delete a service, click on the drop-down next to it in the list of available services
• Click Delete
• If there are dependent services, delete those first and then continue the deletion

5. Preparing Environment:

Place Jars/Files:
• Open the Terminal (@laptop)
• Export all the machine variables as described in Section 3
• Execute

    sh exec_placeJars.sh publicip.txt placeJars_final

  This script prepares and sequentially executes the placeJars_final_{n}.sh scripts, which place the utility jars listed below. The prepared scripts can also be executed manually in parallel to save time.
• The script moves the jars from

  Source:
    /home/ubuntu/uploads/jars/dist-lib/*.jar
    /home/ubuntu/uploads/jars/ext/*.jar
    /home/ubuntu/uploads/jars/extraction-lib/*.jar
    /home/ubuntu/uploads/jars/morphlines-cell-lib/*.jar
    /home/ubuntu/uploads/jars/morphlines-core-lib/*.jar
    /home/ubuntu/uploads/jars/solrj-lib/*.jar
    /home/ubuntu/uploads/jars/web-inf-lib/*.jar

  Target:
    /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop-0.20-mapreduce/lib/

• The jars are already present on all the machines as part of the AMI {HASOLR-POC - ami-0dc3df64}
• Restart the cluster from the Cloudera Manager UI

Prepare HDFS:

Deploy MRIndexer Tool:

Use the command below to build the MR Indexer Tool locally and ship the jar to $m1

    sh deployMRjar.sh $m1

This script builds the code locally and pushes it to /opt/cloudera/parcels/SOLR-1.2.0-1.cdh4.5.0.p0.4/lib/solr/contrib/mr/

Note: Modify the script to reflect your local build path

Preparing Environment for docS3:

Open the Terminal

    sh ~/scripts/aws-dev/prepdocS3.sh $m1

This script
• Downloads the docs3 jar from the AS dev environment to the EC2 instance
• Installs mysql server on the EC2 instance (enter “root” as the password when prompted)
• Executes prepareFileList.sh and prepares the list of S3 object keys (inputFileList.txt) to be processed

Note: Edit mysql.sh to adjust the query for the Accession numbers

Prepare Hadoop job folders:

Log in to $m1 and prepare the HDFS using the commands below
    sudo -u hdfs hadoop fs -mkdir -p /user/$USER
    sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER
    hadoop fs -mkdir -p /user/$USER/indir
    hadoop fs -copyFromLocal ~/uploads/files/samplefiles/* /user/$USER/indir/
    hadoop fs -ls /user/$USER/indir
    hadoop fs -rm -r -skipTrash /user/$USER/outdir
    hadoop fs -mkdir /user/$USER/outdir
    hadoop fs -ls /user/$USER/outdir
    sudo -u hdfs hadoop fs -mkdir /outdir
    sudo -u hdfs hadoop fs -chown $USER:$USER /outdir
    hadoop fs -put /home/ubuntu/uploads/files/txshards.conf /tmp/
    hadoop fs -put /home/ubuntu/uploads/files/fdshards.conf /tmp/
    nano /etc/hadoop/conf/hdfs-site.xml

With inputFileList:

    hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 \
      jar /home/ubuntu/uploads/jars/mrlib/solr-map-reduce-4.7-SNAPSHOT.jar \
      -D 'mapred.child.java.opts=-Xmx8G' \
      --log4j /opt/cloudera/parcels/SOLR-1.2.0-1.cdh4.5.0.p0.4/share/doc/search/examples/solr-nrt/log4j.properties \
      --shards 1 \
      --shardsConf hdfs://ip-10-111-154-10.ec2.internal:8020/tmp/txshards.conf \
      --fulldocshardsConf hdfs://ip-10-111-154-10.ec2.internal:8020/tmp/fdshards.conf \
      --morphline-file /home/ubuntu/uploads/files/readASXML.conf \
      --solr-home-dir /home/ubuntu/uploads/files/TX-collection \
      --TX-solr-home-dir /home/ubuntu/uploads/files/TX-collection \
      --FF-solr-home-dir /home/ubuntu/uploads/files/FF-collection \
      --FD-solr-home-dir /home/ubuntu/uploads/files/FD-collection \
      --output-dir hdfs://ip-10-111-154-10.ec2.internal:8020/user/$USER/outdir_5L \
      --collection collection1 \
      --verbose \
      --input-list /home/ubuntu/uploads/scripts/docS3/inputFileList_5L.txt

    rm -r /home/ubuntu/downloads/outdir
    hadoop fs -get /user/ubuntu/outdir /home/ubuntu/downloads/

    scp -i ~/.ssh/asdev.pem -r ubuntu@$m1:/home/ubuntu/downloads/outdir ~/Downloads/stats/

6. Preparing Environment for creating AMI: [Optional]
Create a spot instance using Ubuntu 12.04 64 bit (ami-59a4a230) with 30 GB of RAM

Install Java:

If the AMI doesn't come with Java 7, follow the steps below to install it.

Download the tarball from
http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html#jre-7u25-oth-JPR

    sudo scp -i ~/.ssh/asdev.pem ~/Downloads/jre-7u25-linux-x64.gz ubuntu@$m1:/home/ubuntu/
    sudo scp -i ~/.ssh/asdev.pem ~/Downloads/jdk-7u25-linux-x64.gz ubuntu@$m1:/home/ubuntu/
    ssh -i ~/.ssh/asdev.pem ubuntu@$m1 "sudo mkdir /usr/local/java;sudo cp jre-7u25-linux-x64.gz /usr/local/java/;sudo cp jdk-7u25-linux-x64.gz /usr/local/java/"
    ssh -i ~/.ssh/asdev.pem ubuntu@$m1
    cd /usr/local/java
    sudo tar xvzf jdk-7u25-linux-x64.gz
    sudo tar xvzf jre-7u25-linux-x64.gz
    sudo vi /etc/profile

Add the following lines to /etc/profile:

    JAVA_HOME=/usr/local/java/jdk1.7.0_25
    PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
    JRE_HOME=/usr/local/java/jre1.7.0_25
    PATH=$PATH:$HOME/bin:$JRE_HOME/bin
    export JAVA_HOME
    export JRE_HOME
    export PATH

Ensure Java 7 is installed in one of the folders below. If not, duplicate the “java 7” folder to /usr/lib/jvm/j2sdk1.7-oracle

    /usr/lib/j2sdk1.6-sun
    /usr/lib/jvm/java-6-sun
    /usr/lib/jvm/java-1.6.0-sun-1.6.0.*
    /usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/
    /usr/lib/jvm/j2sdk1.6-oracle
    /usr/lib/jvm/j2sdk1.6-oracle/jre
    /usr/java/jdk1.6*
    /usr/java/jre1.6*
    /usr/java/jdk1.7*
    /usr/java/jre1.7*
    /usr/lib/jvm/j2sdk1.7-oracle
    /usr/lib/jvm/j2sdk1.7-oracle/jre
    /Library/Java/Home
    /usr/java/default
    /usr/lib/jvm/default-java
    /usr/lib/jvm/java-openjdk
    /usr/lib/jvm/jre-openjdk
    /usr/lib/jvm/java-1.7.0-openjdk*
    /usr/lib/jvm/jn

Also place the folders and files below on the instance before creating the AMI
• FF-collection
• FD-collection
• txshards.conf
• fdshards.conf
• readASXML.conf
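Before the image is taken, it can help to confirm that the items listed above are actually on the instance. The following is a minimal sketch of such a check; the function name is hypothetical, and the directory to check is an assumption (the job command in Section 5 reads these files from /home/ubuntu/uploads/files).

```shell
#!/usr/bin/env bash
# Sketch only: print every item from the list above that is missing
# from the given directory. The function name is hypothetical.
missing_ami_items() {
  dir="$1"
  for item in FF-collection FD-collection txshards.conf fdshards.conf readASXML.conf; do
    # -e is true for both files and directories.
    [ -e "$dir/$item" ] || printf '%s\n' "$item"
  done
}

# Example usage on the instance (directory is an assumption):
#   missing_ami_items /home/ubuntu/uploads/files
# No output means every listed item was found.
```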
Create an image out of this instance.

Execution Statistics:

Columns (in order): Machine Type | Machine Count | Input File Count | File Read | Mappers Used | Reducers Used | Start Time | End Time | Job Id | Custom

Run 1: m2.4xlarge | 300 | 897,174 | 3052 | 1148 | Tue Apr 22 21:49:15 UTC 2014 | Wed Apr 23 04:28:03 UTC 2014 | job_201404222147_0001[58] | numLines*6

Run 2: m2.4xlarge | 300 | 500,000 | 2977 | 1148 | Wed Apr 23 06:18:24 UTC 2014 | Wed Apr 23 10:18:03 UTC 2014 | job_201404222147_0059 | numLines*6

List of Errors Faced:

http://stackoverflow.com/questions/20687517/cannot-allocate-memory-errno-12-errors-during-runtime-of-java-application

Error 1:

    Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000752680000, 167247872, 0) failed; error='Cannot allocate memory' (errno=12)

Error 2:

    attempt_201404220829_0001_m_000252_1: Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000609c00000, 255328256, 0) failed; error='Cannot allocate memory' (errno=12)
    attempt_201404220829_0001_m_000252_1: #
    attempt_201404220829_0001_m_000252_1: # There is insufficient memory for the Java Runtime Environment to continue.
    attempt_201404220829_0001_m_000252_1: # Native memory allocation (malloc) failed to allocate 255328256 bytes for committing reserved memory.
    attempt_201404220829_0001_m_000252_1: # An error report file with more information is saved as:
    attempt_201404220829_0001_m_000252_1: # /mapred/local/taskTracker/ubuntu/jobcache/job_201404220829_0001/attempt_201404220829_0001_m_000252_1/work/hs_err_pid1099.log
    java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
    Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
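Both errors report that the operating system could not commit memory for a task JVM. One possible cause, which the source does not confirm, is that the number of concurrent task JVMs multiplied by their maximum heap (-Xmx8G in the job command in Section 5) exceeds what a node can provide. The sketch below is plain shell arithmetic for that budget. The function name, the 68 GB node size and the 8 GB reserve are assumptions for illustration, not measured values.

```shell
#!/usr/bin/env bash
# Sketch only: how many task JVMs of a given maximum heap fit on one node.
# All figures are assumptions; replace them with the real values.
# Note that a JVM also needs native memory beyond its heap, so the
# result is an upper bound, not a safe setting.
max_task_slots() {
  node_ram_gb="$1"   # physical RAM on the node, in GB
  reserved_gb="$2"   # RAM kept for the OS and Hadoop daemons (assumed)
  heap_gb="$3"       # per-task maximum heap, e.g. 8 for -Xmx8G
  echo $(( (node_ram_gb - reserved_gb) / heap_gb ))
}

# Example with assumed numbers: 68 GB node, 8 GB reserved, -Xmx8G per task.
max_task_slots 68 8 8
```

If the configured number of map and reduce slots per node is higher than this figure, lowering the slot count or the heap size would be one thing to try.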