Platfora Deployment Planning Guide
Version 4.5
Copyright Platfora 2015
Last Updated: 10:14 p.m. June 28, 2015

Contents

Document Conventions
Contact Platfora Support
Copyright Notices
Chapter 1: About Platfora Deployments
    Platfora Deployment Architectures
        On-Premise Hadoop Deployments
        Amazon AWS Cloud Deployments
    Platfora Server Architecture
    FAQs - Platfora Deployments
Chapter 2: Supported Hadoop and Hive Versions
Chapter 3: System Requirements (On-Premise)
    Platfora Server Requirements
    Hadoop Resource Requirements
Chapter 4: System Requirements (AWS Cloud)
    Platfora EC2 Instance Requirements
    Amazon EMR Instance Requirements
    AWS Security Settings for Platfora
        Amazon AWS Virtual Private Cloud (VPC)
        IAM User and IAM Roles for Platfora
        EC2 Security Group Settings
Chapter 5: Port Configuration Requirements
    Ports to Open on Platfora Nodes
    Ports to Open on Hadoop Nodes
Chapter 6: Browser Requirements
Appendix A: Hardware Specifications for Platfora Nodes
Appendix B: EC2 Considerations for Platfora Instances

Preface

This guide provides information about what you need to consider when deploying a new Platfora® cluster. This guide is intended for system and Hadoop administrators who are responsible for procuring and managing server resources. Knowledge of Linux system administration, network administration and Hadoop administration is recommended.

Document Conventions

This documentation uses certain text conventions for language syntax and code examples.

Convention: $
    Usage: Command-line prompt. Precedes a command to be entered in a command-line terminal session.
    Example: $ ls

Convention: $ sudo
    Usage: Command-line prompt for a command that requires root permissions (commands will be prefixed with sudo).
    Example: $ sudo yum install open-jdk-1.7

Convention: UPPERCASE
    Usage: Function names and keywords are shown in all uppercase for readability, but keywords are case-insensitive (can be written in upper or lower case).
    Example: SUM(page_views)

Convention: italics
    Usage: Italics indicate a user-supplied argument or variable.
    Example: SUM(field_name)

Convention: [ ] (square brackets)
    Usage: Square brackets denote optional syntax items.
    Example: CONCAT(string_expression[,...])

Convention: ... (ellipsis)
    Usage: An ellipsis denotes a syntax item that can be repeated any number of times.
    Example: CONCAT(string_expression[,...])

Contact Platfora Support

For technical support, you can send an email to: [email protected]

Or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and product tips.

http://support.platfora.com

To access the support portal, you must have a valid support agreement with Platfora. Please contact your Platfora sales representative for details about obtaining a valid support agreement or with questions about your account.

Copyright Notices

Copyright © 2012-15 Platfora Corporation. All rights reserved.

Platfora believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” PLATFORA CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any Platfora software described in this publication requires an applicable software license.

Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™, and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the property of their respective owners.

Embedded Software Copyrights and License Agreements

Platfora contains the following open source and third-party proprietary software subject to their respective copyrights and license agreements:
• Apache Hive PDK
• dom4j
• freemarker
• GeoNames
• Google Maps API
• javassist
• javax.servlet
• Mortbay Jetty 6.1.26
• OWASP CSRFGuard 3
• PostgreSQL JDBC 9.1-901
• Scala
• sjsxp : 1.0.1
• Unboundid

Chapter 1: About Platfora Deployments

Platfora runs on dedicated servers in the same network as your Hadoop deployment, which can be in an on-premise data center or in the cloud. Platfora uses the data processing services of Hadoop to process and prepare data for analysis. Platfora uses the data storage services of Hadoop to access the raw data and to store the output of the optimized data it prepares. This section explains how Platfora is deployed and the basics of the Platfora/Hadoop server architecture.

Topics:
• Platfora Deployment Architectures
• Platfora Server Architecture
• FAQs - Platfora Deployments

Platfora Deployment Architectures

The Platfora software runs on a scale-out cluster of servers. These servers can be physical servers in an on-premise data center or virtual server instances in the cloud. Platfora uses native Hadoop protocols to connect to the distributed file system and data processing services of Hadoop. Platfora should be deployed on dedicated machines with low-latency connections to these Hadoop cluster services.
This section explains how Platfora is deployed in your network environment, using either an on-premise or AWS cloud deployment of Hadoop.

On-Premise Hadoop Deployments

An on-premise Hadoop deployment means that you already have an existing Hadoop installation in your data center (either a physical data center or a virtual private cloud). Platfora connects to the Hadoop cluster managed by your organization, and the majority of your organization's data is stored in the distributed file system of this primary Hadoop cluster.

For on-premise Hadoop deployments, the Platfora servers should be on their own dedicated hardware co-located in the same data center as your Hadoop cluster. A data center can be a physical location with actual hardware resources, or a virtual private cloud environment with virtual server instances (such as Rackspace or Amazon EC2). Platfora recommends putting the Platfora servers on a network with at least 1 Gbps connectivity to the Hadoop nodes.

Platfora users access the Platfora master node using an HTML5-compliant web browser. The Platfora master node accesses the HDFS NameNode and the MapReduce JobTracker or YARN Resource Manager using native Hadoop protocols. The Platfora worker nodes access the HDFS DataNodes directly. If using a firewall, Platfora recommends placing the Platfora servers on the same side of the firewall as your Hadoop cluster.

Platfora software can run on a wide variety of server configurations, from a single server to a cluster scaled across multiple servers. Since Platfora runs best with all of the active lenses readily available in RAM, Platfora recommends obtaining servers optimized for higher RAM capacity and a minimum of 8 CPUs.

Amazon AWS Cloud Deployments

An Amazon Web Services (AWS) cloud deployment means that you do not have a persistent Hadoop cluster.
Instead, your organization uses Amazon S3 for raw data storage and Amazon EMR for on-demand Hadoop data processing.

In an Amazon AWS cloud deployment, the Platfora server instances are deployed on dedicated, high-memory EC2 instances. Your organization’s raw data is managed in Amazon's Simple Storage Service (S3). Platfora uses Amazon Elastic MapReduce (EMR) to run its data processing jobs (lens builds). The results of the lens build jobs are then written back to S3.

Platfora Server Architecture

Platfora connects to an existing Hadoop implementation, and makes the raw data residing in Hadoop accessible to users. The Platfora server has a number of services that work together with Hadoop's services to access the raw data, prepare it for analysis, and present the results to users. This topic helps you understand the main components of the Platfora server architecture.

The Platfora Master Node

You can have a fully-functioning Platfora installation with just one node: the master node. The master node manages the following Platfora services:
• Metadata Catalog - Platfora's metadata catalog holds all of the information about the data managed by Platfora (the datasets, lenses, vizboards and so on). The metadata catalog is a relational database that runs on the Platfora master node, but is accessed by all nodes in the Platfora cluster.
• Lens Builder - The lens builder interfaces with the data processing services of Hadoop. It translates data requests from the Platfora application into a series of custom MapReduce jobs, which it then submits to the Hadoop Job Tracker or Resource Manager for execution. After the requested data has been extracted and transformed in Hadoop, the job results are written back to the Hadoop file system in Platfora's proprietary file format called a lens.
• On-Disk Storage - Finished lenses are immediately copied from the Hadoop file system to on-disk storage of the Platfora nodes. The data of a lens is distributed across all of the available worker nodes in a Platfora cluster.
• In-Memory Query Engine - When users explore and analyze data in Platfora, they are actually generating queries that run against a lens. The result of a lens query is rendered as a visualization in Platfora. When users construct visualizations, they choose a lens to work with. Choosing a lens loads its data into Platfora's in-memory query engine. The in-memory query engine has two kinds of processes that work on a query:
    1. Query Coordinator - The query coordinator process runs on the master node only, and translates actions made in the Platfora application into queries. The coordinator sends the query to the workers for processing, then consolidates the partial results from each worker into a final result.
    2. Query Worker - The query worker process typically runs on the worker nodes, but the master may also serve as a worker in some cases. A query worker process works on its portion of lens data for a given query.
• Web Application Server - Platfora's user interface runs as a web application in your network. Users connect to Platfora using any HTML5-compliant browser. Through the browser, users interact with data in Hadoop as easily as browsing a web site.

The Platfora Worker Nodes

The Platfora worker nodes are used to distribute lens storage capacity and query processing workload. As users work with more and bigger lenses in Platfora, more memory and processing power is needed to render visualizations quickly. Administrators can add additional worker nodes to scale up lens storage capacity and performance. By using the resources of multiple machines to store and process lens data, Platfora can handle true 'big data' query workloads.
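
The coordinator/worker split described above follows a general scatter-gather pattern. The following Python sketch illustrates that pattern only; it is not Platfora's query engine. The function names, the in-memory partitions, and the use of threads in place of separate worker nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_query(partition, field):
    """Query worker: aggregate only this worker's portion of the lens data."""
    return {
        "sum": sum(row[field] for row in partition),
        "count": len(partition),
    }


def coordinator_query(partitions, field):
    """Query coordinator: send the query to every worker, then consolidate
    the partial results into one final result."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda part: worker_query(part, field), partitions))
    return {
        "sum": sum(p["sum"] for p in partials),
        "count": sum(p["count"] for p in partials),
    }


# Hypothetical lens rows, distributed across three worker nodes.
LENS_PARTITIONS = [
    [{"page_views": 10}, {"page_views": 5}],
    [{"page_views": 7}],
    [{"page_views": 3}, {"page_views": 25}],
]

result = coordinator_query(LENS_PARTITIONS, "page_views")
```

Because each worker returns a small partial aggregate rather than its raw rows, adding worker nodes spreads both the memory needed to hold lens data and the CPU work per query, which is the scaling behavior this section describes.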

FAQs - Platfora Deployments

Got questions about what you need to get Platfora up and running? Want to know how Platfora is deployed in your data center environment and how it works with Hadoop? This topic answers the most frequently asked questions (FAQs) about Platfora installation and deployment.

What do I need before I can install Platfora?

Before you can install Platfora, you will need:
• Hadoop - Platfora needs access to an installed and running Hadoop cluster, or to an Amazon Web Services (AWS) account with Amazon S3 (Simple Storage Service) and EMR (Elastic MapReduce) enabled.
• Linux Server(s) - You will need one or more dedicated servers running a supported Linux operating system on which to install Platfora. The Platfora server(s) should be in the same data center (or region) as your Hadoop distribution, but not on the same machines.
• Platfora Binaries - A Platfora customer support representative can give you the download link to the Platfora installation package for your chosen Hadoop distribution. Platfora provides both rpm and tar installer packages.
• Platfora License - A Platfora customer support representative must issue you a license file. Trial period licenses are available upon request for pilot installations.
• Platfora Installation Guide - You will need the Platfora installation guide for your specific Hadoop distribution. The setup steps vary slightly depending on the version of Hadoop you are using.

What are the high-level steps involved in installing Platfora?

Every Platfora installation involves these basic steps, although the details will vary slightly depending on the Hadoop distribution you are using:
• Configure Hadoop for Platfora Access - Make sure that the Platfora server(s) can access your Hadoop services over the network and that Platfora has write access to a designated directory in the Hadoop file system.
  Obtain the required connection details for your Hadoop services (Platfora connects to Hadoop during setup).
• Install Prerequisites on all Platfora Nodes - Make sure the Platfora servers have the required dependencies before installing Platfora. If using the rpm installer, Platfora provides a base package that includes the dependencies. If using the tar installer, you will need to install the dependent software manually.
• Install the Platfora Software on the Master - Install the Platfora binaries on the master node.
• Set Up the Platfora Master - Run the setup utility to configure the Platfora master server and connect it to your Hadoop services.
• Start Platfora - After setup completes, start the Platfora server. You should now have a fully-functioning single-node Platfora installation.
• Run Tests and Load the Tutorial Data - After setup completes, you may want to run some tests to make sure that Platfora is properly configured and can access your Hadoop cluster. One way to test everything is to load the tutorial data that comes with your Platfora installation. This will put some data in Hadoop and build a small lens to make sure everything is working.
• Add Platfora Worker Nodes - Once you have the Platfora master node up and running, you can use it to add Platfora worker nodes to the cluster. The master node is always used to install and manage the worker nodes.

Is there a trial version of Platfora?

Platfora does not currently have a trial version available for download. You can contact Platfora Customer Support to arrange for a pilot or trial installation.

Why would I need multiple Platfora nodes?

When users work with lens data in Platfora, that data is loaded into memory so that queries (vizzes) are fast and responsive. If there is more lens data than can fit into memory, then some queries may be slow or not be able to run at all.
Adding more nodes to your Platfora cluster makes more disk, memory and CPU available to store and process lens data.

How many Platfora nodes would I need?

Platfora is intended for big data query workloads, and performs best when using the resources of multiple machines. Although you can have a fully-functioning Platfora installation with just one node, a multi-node installation is necessary for optimal performance and bigger lens sizes. The ideal number of Platfora nodes depends on several factors, including lens size, lens quantity, data variety, and number of concurrent users. Your Platfora account representative will help you determine the number of nodes that best fits your data requirements. You can also scale up your Platfora cluster as your data and usage grow.

How does Platfora interact with Hadoop?

Platfora uses the powerful distributed storage and processing features of Hadoop, but masks the complexity of working with HDFS and MapReduce by providing an easy-to-use web interface.

Platfora uses Hadoop to access the raw data stored in its distributed file system (DFS) and makes the data visible to Platfora users. It uses the data processing services of Hadoop (MapReduce) to pull requested data and prepare it for analysis. The result of these processing jobs is the Platfora lens. Platfora lenses are stored in the Hadoop distributed file system, as well as copied over to the Platfora servers.

Can Platfora connect to more than one source system?

When you install Platfora, you connect it to one Hadoop distribution. This is the primary source system that Platfora uses to access the source data and process its lens builds. You can create data sources that point to external sources (such as a cloud storage service or a relational database). However, this external data must be pulled over to the primary Hadoop source system during lens build processing.
To avoid moving large amounts of data over the network, Platfora recommends using external data sources for smaller, supplemental datasets only.

What does Platfora do to the data in Hadoop?

Platfora reads the raw data, but does not edit, update, or delete it in place. It makes a copy of the requested portion of the data when it builds a lens, and does its lens processing on the copied data. Your original data remains intact and unaltered.

How does Platfora keep my data secure?

Platfora's role-based security allows you to control who can authenticate to the Platfora application and what actions they can perform. You can maintain user credentials within the Platfora application, or configure Platfora to use an external LDAP directory service to authenticate users.

To authorize access to the raw data, you can either manage data access permissions within the Platfora application itself, or you can configure Platfora to use Kerberos authorization to check the HDFS file system permissions.

How does Platfora handle redundancy and high availability?

Platfora relies on Hadoop for redundancy and high-availability of the raw data itself.

The Platfora worker nodes are fully redundant and highly available. The worker nodes process the lens queries submitted to the Platfora application. Lens data is distributed and replicated across all of the worker nodes in the Platfora cluster. Depending on the number of worker nodes you have, you can lose a node and still continue processing queries without interruption of service.

A redundant Platfora master node involves taking routine backups of the metadata catalog database so you can restore the master node if needed.

Chapter 2: Supported Hadoop and Hive Versions

This section lists the Hadoop distributions and versions that are compatible with the Platfora installation packages.
If using Hive as a data source for Platfora, the version of Hive must be compatible with the version of Hadoop you are using.

Hadoop Distro            Version         Hive Version   M/R Version   Platfora Package
Cloudera 5               CDH5.0          0.12           YARN          cdh5
                         CDH5.1          0.12           YARN          cdh5
                         CDH5.2          0.13           YARN          cdh52
                         CDH5.3          0.13.1         YARN          cdh52
                         CDH5.4          1.1            YARN          cdh54
Hortonworks              HDP 2.1.x       0.13.0         YARN          hadoop_2_4_0_hive_0_13_0
                         HDP 2.2.x       0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
MapR                     MapR 4.0.1      0.12.0         YARN          mapr4
                         MapR 4.0.2      0.13.0         YARN          mapr402
                         MapR 4.1.0      0.13.0         YARN          mapr402
Pivotal Labs             PivotalHD 3.0   0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
Amazon EMR (AMI 3.7.x)   Hadoop 2.4.0    0.13.1         YARN          hadoop_2_4_0_hive_0_13_0

Chapter 3: System Requirements (On-Premise)

The Platfora software runs on a scale-out cluster of servers. You can install Platfora on a single node to start, and then scale up storage and processing capacity by adding additional nodes. Platfora requires access to an existing, compatible Hadoop implementation in order to start. Users then access the Platfora application using a compatible web browser client.

This section describes the system requirements for on-premise deployments of the Platfora servers, Hadoop source systems, network connectivity, and web browser clients.

Topics:
• Platfora Server Requirements
• Hadoop Resource Requirements

Platfora Server Requirements

Platfora recommends the following minimum system requirements for Platfora servers. For multi-node installations, the master server and all worker servers must be the same operating system (OS) and system configuration (same amount of memory, CPU, etc.).

64-bit Operating System or Amazon Machine Image (AMI)
    • CentOS 6.2-6.5 (7.0 is not supported)
    • RHEL 6.2-6.5 (7.0 is not supported)
    • Scientific Linux 6.2
    • Amazon Linux AMI 2014.03+
    • Oracle Enterprise Linux 6.x
    • Ubuntu 12.04.1 LTS or higher
    • Security-Enhanced Linux 6.2 [1]

Software
    • Java 1.7
    • Python 2.6.8, 2.7.1, 2.7.3 through 2.7.6 (3.0 not supported)
    • PostgreSQL 9.2.1-1, 9.2.5, 9.2.7 or 9.3 (master only)
    • OpenSSL 1.0.1 or higher [2]

Unix Utilities
    rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget

Memory
    64 GB minimum, 256 GB recommended. The server needs enough memory to accommodate actively used lens data. Additionally, it needs 1-2 GB reserved for normal operations and the lens query engine workspace.

CPU
    8 cores minimum, 16 recommended

Disk
    All Platfora nodes (master or worker) require 300 MB for the Platfora installation.
    Every node requires high-speed local storage and a local disk cache configured as a single logical volume. Hardware RAID is recommended for the best performance.
    All nodes combined require appropriate free space for aggregated data structures (Platfora lenses). At a minimum, you will need twice as much disk space as system memory.
    The Platfora master node requires approximately 700 MB of additional space for the metadata catalog (dataset definitions, vizboard and visualization definitions, lens definitions, etc.).

[1] If you wish to install Security-Enhanced Linux, refer to Platfora's Support site for installation instructions.
[2] Only required if you want to enable SSL for secure communications between Platfora servers.

Network
    • 1 Gbps reliable network connectivity between Platfora master server and query processing servers
    • 1 Gbps reliable network connectivity between Platfora master server and Hadoop NameNode and JobTracker/ResourceManager node
    • Network bandwidth should be comparable to the amount of memory on the Platfora master server

Hadoop Resource Requirements

Platfora must be able to connect to an existing Hadoop installation. Platfora also requires permissions and resources in the Hadoop source system. This section describes the Hadoop resource requirements for Platfora.

Platfora uses the remote Distributed File System (DFS) of the Hadoop cluster for persistent storage and as the primary data source. Optionally, you can also configure Platfora to use a Hive metastore server as a data source.

Platfora uses the Hadoop MapReduce services to process data and build lenses. For larger lens builds to succeed, Platfora requires minimum resources on the Hadoop cluster for MapReduce tasks.

DFS Disk Space
    Platfora requires a designated persistent storage directory in the remote distributed file system (DFS) with appropriate free space for Platfora system files and data structures (lenses). The location is configurable.

DFS Permissions
    The platfora system user needs read permissions to source data directories and files.
    The platfora system user needs write permissions to Platfora's persistent storage directory on DFS.

MapReduce Permissions
    The platfora system user needs to be added to the submit-jobs and administer-jobs access control list (or added to a group that has these permissions).

DFS Resources
    Minimum Open File Limit = 5000

MapReduce Resources
    Minimum Memory for Task Processes = 1 GB

Chapter 4: System Requirements (AWS Cloud)

This section describes the system requirements for customers who plan to use Amazon Web Services (AWS) as their installation environment for Platfora, with Simple Storage Service (S3) and Elastic MapReduce (EMR) as their Hadoop distributed data storage and processing services.

Topics:
• Platfora EC2 Instance Requirements
• Amazon EMR Instance Requirements
• AWS Security Settings for Platfora

Platfora EC2 Instance Requirements

Platfora recommends the following system requirements for Amazon EC2 instances that will serve as Platfora server nodes. For multi-node installations, the master server instance and all worker server instances must be the same configuration (same EC2 instance type, storage configuration, network configuration, etc.).

Amazon Machine Images (AMIs)
    • Amazon Linux AMI 2014.03.x or higher
    • Red Hat Enterprise Linux 6.2 - 6.5
    • Ubuntu Server 12.04.1 LTS or higher

EC2 Instance Type
    • Small to Medium Lens Sizes: c3.8xlarge
    • Medium to Large Lens Sizes, 10+ Platfora nodes: r3.8xlarge
    • Medium to Large Lens Sizes, 1-9 Platfora nodes: i2.8xlarge

Root Device Volume (EBS)
    Recommended Size = 1 TB
    Type = General Purpose (SSD)

Additional EBS Volumes
    Optional. Additional EBS volumes can be attached to an EC2 instance after launch time, and can be used to increase lens cache storage capacity if needed. EBS volumes are less expensive than Instance Store volumes, and the data is persistent between shutdowns.

Instance Store Volume (Ephemeral)
    Optional. You may choose to add instance store volumes for the Platfora lens cache instead of using EBS volumes. This costs more, but offers slightly faster performance. Instance store volumes can only be attached to an EC2 instance at launch time, and the data is not saved when the instance shuts down.
    The size of an instance store volume depends on the instance type:
    • c3.8xlarge: 2 x 320 GB SSD (640 GB)
    • r3.8xlarge: 2 x 320 GB SSD (640 GB)
    • i2.8xlarge: 8 x 800 GB SSD (6400 GB)

Enhanced Networking
    yes (requires use of VPC instead of EC2-Classic)

EBS Optimized Instance
    yes (the 8xlarge instance types are EBS optimized instances by default)

Availability Zone
    yes (use same zone for all nodes in the Platfora cluster)

Placement Group
    yes (use same placement group for all nodes in the Platfora cluster)

IAM User
    yes (create a dedicated Platfora IAM User in your AWS account)

Other Required Software
    • Java 1.7
    • Python 2.7.8 through 2.7.9 (3.0 not supported)
    • PostgreSQL 9.2.1-1.28 (AMZN), 9.2.5, 9.2.7 or 9.3 (master node only)
    • OpenSSL 1.0.1 or higher [3]

Required Unix Utilities
    rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget

[3] Only required if you want to enable SSL for secure communications between Platfora servers.

Amazon EMR Instance Requirements

Platfora launches an Elastic MapReduce (EMR) cluster when it builds a lens. This section describes the recommended requirements for the EMR instances that are launched by Platfora.

Amazon EMR is Hadoop as a web service. Platfora uses the EMR Hadoop cluster to process its lens builds. Since the EMR Hadoop cluster is only instantiated as needed, the source data does not reside in the Hadoop Distributed File System (HDFS) of the EMR Hadoop cluster. The source data is instead stored on Amazon S3. Data is copied from S3 to EMR for data processing, then the results are written back to S3 when the job completes.

At the start of a lens build job, the raw source data is copied from S3 to the local HDFS file system on the EMR nodes. The EMR instances must have enough local instance storage to support the input source dataset and the temporary workspace for intermediate lens build job results.
Also consider that the local HDFS of the EMR cluster replicates the data to ensure redundancy and high availability during lens build processing.

Platfora recommends the i2.4xlarge instance type for EMR data nodes and the m3.xlarge for the EMR name node. The i2.4xlarge offers a good balance between total local disk space, CPU power, and per-node memory size.

Hadoop Version
    2.4.0

AMI Version
    3.7.0

EMR NameNode Instance Type
    m3.xlarge

EMR DataNode Instance Type
    i2.4xlarge

Number of EMR DataNodes
    The number of nodes you will need to complete a lens build depends on the following factors:
    • The size of the raw dataset in S3 that is considered as input to the lens build.
    • The replication factor of HDFS. EMR clusters of 1-4 nodes have a replication factor of 1, 5-9 nodes have a replication factor of 2, and 10 or more nodes have a replication factor of 3.
    • Temporary work space for intermediate lens build results, which takes about 20-30% of total disk space.

AWS Security Settings for Platfora

Amazon Web Services (AWS) has a number of security features that you can use to protect your AWS account and cloud server instances. This section contains security setting recommendations if you plan to use Amazon Elastic MapReduce (EMR) as the Hadoop implementation for your Platfora cluster.

Amazon AWS Virtual Private Cloud (VPC)

To use Amazon EMR for Hadoop data processing, Platfora must be able to launch an EMR cluster in a public subnet. Administrators do this by provisioning an Amazon VPC with a public subnet, and then specifying the subnet identifier in Platfora.

Platfora must create the EMR cluster on an Internet-facing subnet to allow the AWS EMR Provisioning Service to reach the EMR cluster.

Additionally, you must ensure the Platfora server can communicate with the Amazon EMR cluster. If the Platfora server is on the same subnet as the Amazon EMR cluster, this happens automatically.
If the Platfora server and the EMR cluster are on different VPC subnets, then a route between the subnets needs to be added to the Route table(s) so that communication can occur between the two subnets. Also, if the VPC uses Access Control Lists (ACLs), then those ACLs must be modified to allow traffic from Platfora to Hadoop.

The subnet identifier cannot exceed 255 characters in length.

After the Amazon VPC has been provisioned, specify its subnet identifier in the platfora.emr.subnet.id Platfora configuration property.

For more information on setting up and using an Amazon VPC with Amazon EMR, see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-vpc-subnet.html.

IAM User and IAM Roles for Platfora

AWS Identity and Access Management (IAM) allows you to create users, groups, and roles to control access to AWS services and resources. Platfora recommends creating an IAM User account and two IAM Roles specifically for use by Platfora.

Platfora uses a combination of an IAM User and IAM Roles to communicate with Amazon AWS and to create an EMR cluster. An Amazon AWS administrator needs to create a platfora IAM User and two IAM Roles specifically for use by Platfora. Then a Platfora system administrator needs to enter some information about that user and those roles in Platfora.

The Platfora server uses security credentials of the platfora IAM User to request Amazon AWS to create an Amazon EMR cluster. Once that request is approved, the platfora IAM User then passes an IAM Role to actually launch an EMR cluster, and then uses another IAM Role to start EC2 instances in the EMR cluster. You must specify these roles in Platfora.

For more details on creating the user and roles, see Create IAM User for Platfora and Create IAM Roles for Platfora.
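
To make the division of labor between the subnet, the IAM User, and the two IAM Roles more concrete, the following Python sketch assembles a cluster launch request. It is an illustration under stated assumptions, not Platfora's implementation: the guide does not document how Platfora calls AWS. The dictionary keys follow the parameter names of the `run_job_flow` operation in the AWS SDK for Python (boto3); the cluster name, subnet identifier, and role names are hypothetical placeholders. The instance types and AMI version are the ones recommended in this guide.

```python
# Values recommended in this guide; everything else is a hypothetical illustration.
EMR_AMI_VERSION = "3.7.0"
EMR_NAMENODE_TYPE = "m3.xlarge"
EMR_DATANODE_TYPE = "i2.4xlarge"


def build_emr_launch_request(subnet_id, service_role, instance_role, datanode_count):
    """Assemble keyword arguments shaped like boto3's EMR run_job_flow call."""
    return {
        "Name": "platfora-lens-build",  # hypothetical cluster name
        "AmiVersion": EMR_AMI_VERSION,
        "Instances": {
            "MasterInstanceType": EMR_NAMENODE_TYPE,
            "SlaveInstanceType": EMR_DATANODE_TYPE,
            "InstanceCount": 1 + datanode_count,  # one name node plus the data nodes
            "Ec2SubnetId": subnet_id,  # the public VPC subnet
            "KeepJobFlowAliveWhenNoSteps": False,  # cluster exists only for the job
        },
        "ServiceRole": service_role,   # role used by the EMR service itself
        "JobFlowRole": instance_role,  # role used by the EC2 instances in the cluster
    }


request = build_emr_launch_request(
    subnet_id="subnet-0example",              # placeholder identifier
    service_role="Platfora_EMR_ServiceRole",  # placeholder custom role name
    instance_role="Platfora_EMR_EC2_Role",    # placeholder custom role name
    datanode_count=4,
)
```

In this picture, the IAM User's access key would sign the request, while the two roles are only passed by name, which is why the IAM User needs permission to pass roles.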
Create IAM User for Platfora

The Amazon AWS administrator can create a new platfora user in the IAM Management Console of your AWS account. After creating the user, download the AWS credentials for this user. The Platfora system administrator needs the Access Key ID and Secret Access Key when initializing Platfora for use with Amazon EMR.

The security policy for the platfora IAM User must have (at a minimum) the permissions listed in the following sample policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "iam:ListRoles",
        "iam:PassRole",
        "elasticmapreduce:*",
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml",
        "arn:aws:s3:::Datasource_Bucket_1",
        "arn:aws:s3:::Datasource_Bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:Get*",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*"
      ],
      "Resource": [
        "arn:aws:s3:::Datasource_Bucket_1/path/to/files/*",
        "arn:aws:s3:::Datasource_Bucket_n/*"
      ]
    }
  ]
}

Under Permissions for this user, attach a security policy that contains the permissions listed above. These permissions allow the platfora IAM User to pass an IAM Role to launch the EMR cluster, start an EMR cluster, and access S3 for source data during data ingest.

Create IAM Roles for Platfora

Amazon requires all AWS users to use IAM Roles to launch EMR clusters. One IAM Role is used to start the Amazon EMR service, and the other role is used by the EC2 instances in the EMR cluster. Amazon AWS offers some default IAM Roles for these services.
However, Platfora recommends creating custom IAM Roles specifically for use by Platfora instead. The Amazon AWS administrator can create the IAM Roles in the IAM Management Console of your AWS account.

Create a role for each of the following EMR cluster services, and specify them in Platfora using the specified configuration properties:
• Amazon EMR service (service role). In Amazon AWS, create an IAM Role and attach a security policy that contains at a minimum the permissions specified below. Enter this IAM Role name in the platfora.emr.service.role Platfora configuration property. The custom role you define corresponds to the default IAM Role Amazon offers called EMR_DefaultRole.
• EC2 instances (instance profile) in the Amazon EMR cluster. In Amazon AWS, create an IAM Role and attach a security policy that contains at a minimum the permissions specified below. Enter this IAM Role name in the platfora.emr.jobflow.role Platfora configuration property. The custom role you define corresponds to the default IAM Role Amazon offers called EMR_EC2_DefaultRole.
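The three EMR-related properties named in this chapter can be sanity-checked before they are entered in Platfora. The following is a minimal sketch only: the property names and the 255-character subnet limit come from this guide, while the function name, the dictionary representation, and the example values are hypothetical and do not reflect how Platfora itself stores or validates configuration.

```python
# Property names come from this guide; everything else here is illustrative.
REQUIRED_EMR_PROPERTIES = (
    "platfora.emr.subnet.id",     # public subnet of the provisioned VPC
    "platfora.emr.service.role",  # custom role for the Amazon EMR service
    "platfora.emr.jobflow.role",  # custom role for EC2 instances in the cluster
)

MAX_SUBNET_ID_LENGTH = 255  # limit stated in the VPC section of this guide


def check_emr_properties(props):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name in REQUIRED_EMR_PROPERTIES:
        if not props.get(name, "").strip():
            problems.append(f"{name} is missing or empty")
    subnet_id = props.get("platfora.emr.subnet.id", "")
    if len(subnet_id) > MAX_SUBNET_ID_LENGTH:
        problems.append(
            f"platfora.emr.subnet.id exceeds {MAX_SUBNET_ID_LENGTH} characters"
        )
    return problems


if __name__ == "__main__":
    example = {
        "platfora.emr.subnet.id": "subnet-0example",          # hypothetical value
        "platfora.emr.service.role": "Platfora_EMR_Service",  # hypothetical name
        "platfora.emr.jobflow.role": "Platfora_EMR_EC2",      # hypothetical name
    }
    for problem in check_emr_properties(example):
        print(problem)
```

A check like this only catches missing or oversized values. It does not confirm that the roles exist in AWS or carry the required permissions.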
The security policy for the Amazon EMR service (service role) IAM Role must have (at a minimum) the permissions listed in the following sample policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateSecurityGroup",
        "ec2:CreateTags",
        "ec2:DeleteTags",
        "ec2:Describe*",
        "ec2:ModifyImageAttribute",
        "ec2:ModifyInstanceAttribute",
        "ec2:RequestSpotInstances",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "iam:PassRole",
        "iam:ListRolePolicies",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:ListInstanceProfiles"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*"
      ],
      "Resource": "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
    }
  ]
}

The security policy for the EC2 instances (instance profile) IAM Role must have (at a minimum) the permissions listed in the following sample policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "ec2:Describe*",
        "elasticmapreduce:Describe*",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListSteps",
        "s3:ListAllMyBuckets"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml",
        "arn:aws:s3:::Datasource_Bucket_1",
        "arn:aws:s3:::Datasource_Bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:Get*",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::Datasource_Bucket_1/path/to/files/*",
        "arn:aws:s3:::Datasource_Bucket_n/*",
        "arn:aws:s3:::*elasticmapreduce/*"
      ]
    }
  ]
}
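The S3 statements in the sample policies repeat the same pattern for each bucket, so they can be generated from a list of bucket names rather than edited by hand. The sketch below is one possible way to do that and is not a Platfora or AWS tool. The function name and its parameters are hypothetical. It reproduces the three S3 statements of the platfora IAM User sample policy; the EC2 instance role additionally needs s3:List* and the elasticmapreduce resource shown in its sample policy.

```python
import json


def s3_statements(work_bucket, datasource_paths):
    """Build the S3 statements of the sample IAM User policy.

    work_bucket: name of the bucket defined in core-site.xml.
    datasource_paths: mapping of data source bucket name to a key prefix;
        an empty prefix grants read access to the whole bucket.
    """
    bucket_arns = [f"arn:aws:s3:::{work_bucket}"]
    object_arns = []
    for bucket, prefix in datasource_paths.items():
        bucket_arns.append(f"arn:aws:s3:::{bucket}")
        prefix = prefix.strip("/")
        path = f"{prefix}/*" if prefix else "*"
        object_arns.append(f"arn:aws:s3:::{bucket}/{path}")
    return [
        # Listing is granted on the buckets themselves.
        {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arns},
        # Read and write access to objects in the work bucket.
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:Get*", "s3:DeleteObject"],
            "Resource": [f"arn:aws:s3:::{work_bucket}/*"],
        },
        # Read-only access to the data source objects.
        {"Effect": "Allow", "Action": ["s3:Get*"], "Resource": object_arns},
    ]


if __name__ == "__main__":
    # Bucket names below are placeholders, as in the sample policies.
    statements = s3_statements(
        "Bucket_defined_in_core-site.xml",
        {"Datasource_Bucket_1": "path/to/files", "Datasource_Bucket_n": ""},
    )
    print(json.dumps(statements, indent=2))
```

Generating the statements from one list helps keep the IAM User policy and the EC2 instance role policy pointing at the same set of buckets.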
Verify that the EC2 instances role has the same or greater permissions for, and access to, Amazon resources (especially S3) as the platfora IAM User. For example, if the platfora IAM User can access an Amazon S3 bucket, but the EC2 instances role cannot, then lens builds that rely on that S3 bucket will fail.

For more information on using IAM Roles for EMR, see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html.

EC2 Security Group Settings

EC2 security groups allow you to specify firewall rules for your Amazon Elastic Compute Cloud (EC2) server instances. EC2 security group rules are independent of, and in addition to, the software firewall provided by the instance's operating system. Security groups must be defined before you create an EC2 instance.

The security group configured for the Platfora server instance must permit connections from your user network to the Platfora web application server port (8001 by default). You may also want to open the EMR Hadoop ResourceManager and JobHistory web ports so that you can monitor and troubleshoot YARN jobs executed by Platfora.

An example security group configuration for a Platfora server instance would look something like the following:

[Figure: example security group configuration for a Platfora server instance]

Chapter 5 Port Configuration Requirements

You must open ports in the firewall of your Platfora nodes to allow client access and intra-cluster communications. You also must open ports within your Hadoop cluster to allow access from Platfora. This section lists the default ports required.

Topics:
• Ports to Open on Platfora Nodes
• Ports to Open on Hadoop Nodes

Ports to Open on Platfora Nodes

Your Platfora master node must allow HTTP connections from your user network. All nodes must allow connections from the other Platfora nodes in a multi-node cluster.
On Amazon EC2 instances, you must configure the port firewall rules on the Platfora server instances in addition to the EC2 Security Group Settings.

Platfora Service                          Default Port   Allow connections from…
Master Web Services Port (HTTP)           8001           External user network, Platfora worker servers, localhost
Secure Master Web Services Port (HTTPS)   8443           External user network, Platfora worker servers, localhost
Master Server Management Port             8002           Platfora worker servers, localhost
Worker Server Management Port             8002           Platfora master server, other Platfora worker servers, localhost
Master Data Port                          8003           Platfora worker servers, localhost
Worker Data Port                          8003           Platfora master server, other Platfora worker servers, localhost
Master PostgreSQL Database Port           5432           Platfora worker servers, localhost

Ports to Open on Hadoop Nodes

Platfora must be able to access certain services of your Hadoop cluster. This section lists the Hadoop services Platfora needs to access and the default ports for those services. Note that this only applies to on-premise Hadoop deployments or to self-managed Hadoop deployments in a virtual private cloud, not to Amazon Elastic MapReduce (EMR).
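Once firewall rules are in place, reachability of a Hadoop service port can be spot-checked from a Platfora node with a plain TCP connection attempt. The following is a minimal sketch using only the Python standard library. The function name and the host name are hypothetical, and port 8020 is used as an example because it is the default HDFS NameNode port for CDH, HDP, and Pivotal.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical host name; replace with a node of your Hadoop cluster.
    checks = [("namenode.example.com", 8020)]
    for host, port in checks:
        state = "open" if port_is_open(host, port) else "unreachable"
        print(f"{host}:{port} {state}")
```

A successful connection shows only that the port accepts TCP connections from the machine running the check. It does not verify the service behind the port, so run the check from each Platfora node that needs access.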
Default Ports by Hadoop Distro

Hadoop Service                   CDH, HDP, Pivotal    Apache Hadoop   MapR    Allow connections from…
HDFS NameNode                    8020                 9000            N/A     Platfora master and worker servers
HDFS DataNodes                   50010                50010           N/A     Platfora master and worker servers
MapRFS CLDB                      N/A                  N/A             7222    Platfora master and worker servers
MapRFS DataNodes                 N/A                  N/A             5660    Platfora master and worker servers
MRv1 JobTracker                  8021                 9001            9001    Platfora master server
MRv1 JobTracker Web UI           50030                50030           50030   External user network (optional)
YARN ResourceManager             8032                 8032            8032    Platfora master server
YARN ResourceManager Web UI      8088                 8088            8088    External user network (optional)
YARN Job History Server          10020                10020           10020   Platfora master server
YARN Job History Server Web UI   19888                19888           19888   External user network (optional)
HiveServer Thrift Port           10000                10000           10000   Platfora master server
Hive Metastore DB Port [4]       9083, 9933 (HDP2)    N/A             9083    Platfora master server

[4] If connecting to Hive directly using JDBC

Chapter 6 Browser Requirements

Users can connect to the Platfora web application using the latest HTML5-compliant web browsers. Platfora supports the latest releases of the following web browsers:
• Chrome (preferred browser)
• Firefox
• Safari
• Internet Explorer with the Compatibility View feature disabled (versions prior to IE 10 are not supported)

Platfora supports these web browsers on desktop machines only.

Appendix A Hardware Specifications for Platfora Nodes

This section shows some example hardware configurations that have worked well in other Platfora deployments. To achieve the best performance and lowest operating cost, Platfora recommends that all servers in the Platfora cluster have the same configuration.
At a minimum, all servers in the Platfora cluster should have identical RAM capacity and the same number of CPU cores. Platfora software can be deployed on either rack or blade servers. Typical Platfora server configurations have specifications similar to:

           Rack Server Specs                 Blade Server Specs
CPU:       2x E5-2440 2.40GHz 6-cores        2x E5-2470 2.30GHz 8-cores
RAM:       12x 16GB RAM (192GB total)        12x 16GB RAM (192GB total)
Disk:      8x 300GB 10K SAS 2.5” HDDs        2x 900GB 10K SATA 2.5” HDDs
Network:   1x Gbps NIC

Appendix B EC2 Considerations for Platfora Instances

This section explains what to consider when using Amazon Elastic Compute Cloud (EC2) instances to deploy a production Platfora cluster.

EC2 Storage Considerations

When you launch an Amazon EC2 instance, you have several choices with regard to the storage that you can attach to the instance. There are two main types of storage available: Elastic Block Store (EBS) and Instance Store (Ephemeral). The type and capacity of storage available depends on the instance type you choose.
• The Root Device Volume - All instances have a root device volume, which is backed by either EBS or Instance storage. Platfora recommends EBS-backed instance types; they launch faster and use persistent storage. Root device volumes for Platfora nodes should always be increased to the maximum size (1 TB). This ensures adequate space for the Platfora installation and logs. When using the Platfora recommended 8xlarge instance types, general purpose (SSD) EBS volumes also guarantee 3,000 IOPS.
• EBS Volumes - Amazon EBS volumes are highly available and reliable storage volumes that can be attached to any running instance that is in the same Availability Zone. Amazon EBS volumes that are attached to an Amazon EC2 instance are exposed as storage volumes that persist independently from the life of the instance. Also, with Amazon EBS you only pay for what you use, making it a cost-effective choice.
Platfora recommends General Purpose (SSD) EBS volumes. For maximum performance, you can choose Provisioned IOPS EBS volumes instead. If you choose an instance type that is not EBS optimized by default, make sure to choose EBS Optimized Instance at launch time. This ensures that the instance has a dedicated connection to the EBS volume, which reduces overall latency and maximizes throughput. The Platfora recommended 8xlarge instance types are already EBS optimized instances.
• Instance Store Volumes - Ephemeral storage is ideal for temporary storage of information that changes frequently, such as caches, or for data that is replicated across multiple instances. Instances that use EBS for the root device do not, by default, have instance store volumes available at boot time. Also, you can't attach instance store volumes after you've launched an instance. Therefore, if you want your Amazon EBS-backed instance to use instance store volumes, you must specify them when you first launch your instance.

The choice to add instance store volumes to Platfora nodes depends on price, performance, and persistence of the data. Ephemeral storage allows data to be read faster from disk, but is also more expensive. Also, the data stored on these volumes is not persistent: it is lost if the instance is stopped or terminated.

If you do decide to use ephemeral drives for the Platfora cache directories, use RAID 0 (striping). This gives Platfora access to the maximum possible disk space and also yields the highest performance. Because ephemeral drives are temporary storage and their data is not saved when the instance is stopped, there is no need to use RAID 1.

In Platfora, the PLATFORA_DATA/dfscache and PLATFORA_DATA/fsCache directories can be mapped to instance store volumes (if you decide to use them).
These are the only directories of a Platfora installation that should use ephemeral storage. Lens data is backed up in S3, so the loss of any cached data is temporary.

EC2 Network Considerations

• Placement Groups - All Platfora server instances should be launched within the same Amazon EC2 Placement Group. A placement group is a logical grouping of instances within a single Availability Zone. Using placement groups gives applications access to a low-latency, 10 Gbps network. Placement groups are recommended for applications that benefit from low network latency, high network throughput, or both. See the Amazon EC2 Documentation on Placement Groups.
• Enhanced Networking - To enable enhanced networking, you must launch each instance in the same Amazon EC2 virtual private cloud (VPC). You can't enable enhanced networking if the instance is in EC2-Classic. For more information, see the Amazon VPC User Guide and the Amazon EC2 Documentation on Enhanced Networking.
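The two network recommendations above can be expressed as a simple pre-launch checklist over a list of planned or existing instances. The sketch below is illustrative only. The function name and the record keys ("id", "placement_group", "vpc_id") are hypothetical and would need to be filled from whatever inventory or AWS query you use.

```python
def network_placement_problems(instances):
    """Check instance records against the placement group and VPC recommendations.

    Each record is a dict with the hypothetical keys "id", "placement_group",
    and "vpc_id"; a missing or empty value means the attribute is not set.
    """
    problems = []
    groups = {inst.get("placement_group") or None for inst in instances}
    if None in groups:
        problems.append("some instances are not in a placement group")
    if len(groups - {None}) > 1:
        problems.append("instances are spread across multiple placement groups")
    vpcs = {inst.get("vpc_id") or None for inst in instances}
    if None in vpcs:
        problems.append("some instances are not in a VPC (EC2-Classic)")
    if len(vpcs - {None}) > 1:
        problems.append("instances are spread across multiple VPCs")
    return problems


if __name__ == "__main__":
    # Hypothetical instance records for a three-node Platfora cluster.
    cluster = [
        {"id": "master", "placement_group": "platfora-pg", "vpc_id": "vpc-0example"},
        {"id": "worker1", "placement_group": "platfora-pg", "vpc_id": "vpc-0example"},
        {"id": "worker2", "placement_group": "", "vpc_id": "vpc-0example"},
    ]
    for problem in network_placement_problems(cluster):
        print(problem)
```

An empty result means only that the records are consistent with the recommendations. It does not confirm that enhanced networking is actually enabled on the instances.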