TRAINING PROGRAM ON BIGDATA/HADOOP
Course: Training on Bigdata/Hadoop with Hands-on
Course Duration / Dates / Time: 4 Days / 24th - 27th June 2015 / 9:30 - 17:30 Hrs
Venue: Eagle Photonics Pvt Ltd
First Floor, Plot No 31, Sector 19C, Vashi, Navi Mumbai - 400705
Ph: 022 27841425
Fee Details:
For Indian participants: 20,000 INR
For Foreign participants: 350 USD
Course Description:
Big Data refers to collections of large and complex data sets that cannot be processed using conventional database
management tools or processing applications. The Apache Hadoop software library, on the other hand, is a
framework that allows for the distributed processing of large data sets using simple programming models. This
hands-on course equips participants to manage Big Data using Hadoop.
Who should attend?
This course is meant for software developers/programmers who are interested in Bigdata/Hadoop.
Key benefits:
On course completion, participants will be knowledgeable in managing Big Data and comfortable working with the
Hadoop Distributed File System (HDFS) and its components.
Course Outline:
Module 1: Introduction to Big Data
Session 1: Introduction to Big Data
• So What Is Big Data?
• History of Data Management—Evolution of Big Data
• Structuring of Big Data
• Types of Big Data
• Elements of Big Data
• Application of Big Data in the Business Context
• Careers in Big Data
Session 2: Business Applications of Big Data
• Significance of Social Network Data
• Uses of Social Network Data Analysis
• Financial Fraud and Big Data
• Preventing Fraud Using Big Data Analytics
• Use of Big Data in the Retail Industry
Session 3: Technologies for Handling Big Data
• Distributed and Parallel Computing for Big Data
• Virtualization and its Importance to Big Data
• Introducing Hadoop
• Cloud Computing and Big Data
• Features of Cloud Computing
• Providers in Big Data Cloud Market
• Issues in Using Cloud Services
• In-Memory Technology for Big Data
Session 4: Understanding the Hadoop Ecosystem
• The Hadoop Ecosystem
• Processing Data with Hadoop MapReduce
• Managing Resources and Applications with Hadoop YARN
• Storing Big Data with HBase
• Using Hive for Querying Big Databases
• Interacting with Hadoop Ecosystem
Session 5: MapReduce Fundamentals
• Origins of MapReduce
• Characteristics of MapReduce
• How MapReduce Works
• More about Map and Reduce Functions
• Optimization Techniques for MapReduce Jobs
• Hardware/Network Topology
• Applications of MapReduce
• Role of HBase in Processing Big Data
• Mining Big Data with Hive
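The map and reduce functions covered in this session follow a fixed pipeline: map emits key-value pairs, a shuffle step groups them by key, and reduce aggregates each group. A minimal single-process sketch of that pipeline (illustrative only; real Hadoop distributes these phases across nodes, and the sales data here is hypothetical):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Group all values by key, as Hadoop's shuffle/sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: total sales per city (hypothetical data).
sales = [("mumbai", 120), ("pune", 80), ("mumbai", 40)]
mapper = lambda record: [record]           # identity map: pass pairs through
reducer = lambda key, values: sum(values)  # sum the values for each key
result = reduce_phase(shuffle_phase(map_phase(sales, mapper)), reducer)
# result == {"mumbai": 160, "pune": 80}
```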
Module 2: Managing an Enterprise Wide Big Data Ecosystem
Session 1: Big Data Technology Foundations
• Exploring the Big Data Stack
• Virtualization and Big Data
• Processor and Memory Virtualization
• Data and Storage Virtualization
• Managing Virtualization with Hypervisor
• Abstraction and Virtualization
• Implementing Virtualization to Work with Big Data
Session 2: Big Data Management Systems – Databases and Warehouses
• RDBMSs and Big Data Environment
• PostgreSQL Relational Database
• Nonrelational Databases
• Key-Value Pair Databases
• Document Databases
• Columnar Databases
• Graph Databases
• Spatial Databases
• Polyglot Persistence
• Integrating Big Data with Traditional Data Warehouse
• Rethinking Extraction, Transformation, and Loading
• Big Data Analysis and Data Warehouse
• Changing Deployment Models in Big Data Era
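The nonrelational models listed above differ mainly in how much structure the store understands. A sketch contrasting the key-value and document models with plain Python structures (illustration only; real stores add persistence, indexing, and distribution, and the record here is hypothetical):

```python
# Key-value pair model: an opaque value looked up by a single key.
# The store cannot query inside the value.
kv_store = {}
kv_store["user:42"] = '{"name": "Asha", "city": "Mumbai"}'  # value is a blob

# Document model: the store understands the structure of each record,
# so individual fields can be queried.
doc_store = {
    "user:42": {"name": "Asha", "city": "Mumbai"},
}
mumbai_users = [d["name"] for d in doc_store.values() if d["city"] == "Mumbai"]
# mumbai_users == ["Asha"]
```

Polyglot persistence, also listed above, is simply the practice of using both kinds of store side by side, each for the workload it fits.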
Session 3: Analytics and Big Data
• Using Big Data to Get Results
• What Constitutes Big Data
• Exploring Unstructured Data
• Understanding Text Analytics
• Building New Models and Approaches to Support Big Data
Session 4: Integrating Data, Real-Time Data and Implementing Big Data
• Stages in Big Data Analysis
• Fundamentals of Big Data Integration
• Streaming Data and Complex Event Processing
• Making Big Data a Part of Your Operational Process
• Ensuring Validity, Veracity, and Volatility of Big Data
• Data Validity and Veracity
• Data Volatility
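Complex event processing, covered in this session, typically watches a stream for patterns such as "more than N events inside a time window." A tiny single-process sketch of that pattern (illustration only; real CEP engines handle out-of-order events, persistence, and scale, and the login events here are hypothetical):

```python
from collections import deque

def window_alerts(events, window_seconds=10, threshold=3):
    """events: iterable of (timestamp, payload) in time order.
    Returns the timestamps at which the count of events in the
    trailing window exceeds the threshold."""
    window = deque()
    alerts = []
    for ts, _payload in events:
        window.append(ts)
        # Drop events that have fallen out of the trailing window.
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        if len(window) > threshold:
            alerts.append(ts)
    return alerts

# Hypothetical login events: four within 10 seconds trigger an alert.
events = [(1, "login"), (3, "login"), (5, "login"), (8, "login"), (30, "login")]
# window_alerts(events) == [8]
```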
Session 5: Big Data Solutions and Data in Motion
• Big Data as a Business Strategy Tool
• Analysis in Real-Time: Adding New Dimensions to the Cycle
• The Need for Data in Motion
• Case 1: Using Streaming Data for Environmental Impact
• Case 2: Using Streaming Data for Public Policy
• Case 3: Use of Streaming Data in Health Care Industry
• Case 4: Use of Streaming Data in Energy Industry
• Case 5: Improving Customer Experience with Real-Time Text Analytics
• Case 6: Using Real-Time Data in the Finance Industry
• Case 7: Using Real-Time Data for Insurance Fraud Prevention
Module 3: Storing and Processing Data – HDFS and MapReduce
Session 1: Storing Data in Hadoop
• HDFS, HBase
• Combining HDFS and HBase for Effective Data Storage
• Choosing an Appropriate Hadoop Data Organization for Your Applications
Session 2: Processing Your Data with MapReduce
• Getting to Know MapReduce
• Your First MapReduce Application
• Designing MapReduce Implementations
Session 3: Customizing MapReduce Execution
• Controlling MapReduce Execution with Input Format
• Reading Data Your Way with Custom Record Reader
• Organizing Output Data with Custom Output Formats
• Optimizing Your MapReduce Execution with a Combiner
• Controlling Reducer Execution with Partitioners
Session 4: Testing and Debugging MapReduce Applications
• Unit Testing MapReduce Applications
• Local Application Testing with Eclipse
• Using Logging for Hadoop Testing
• Reporting Metrics with Job Counters
• Defensive Programming in MapReduce
Session 5: Implementing MapReduce WordCount Program - A Case Study
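The WordCount case study follows the same map-emit, group-by-key, reduce-sum structure as any MapReduce job. A single-machine sketch of the logic (illustration only; the Hadoop version splits the input across mappers and merges per-word counts in reducers, but each phase does the same work):

```python
import re
from collections import Counter

def wc_map(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]

def wc_reduce(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Hypothetical input lines.
lines = ["Big data needs big tools", "big ideas"]
pairs = [p for line in lines for p in wc_map(line)]
counts = wc_reduce(pairs)
# counts["big"] == 3
```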
Module 4: Increasing Efficiency with Hadoop Tools: Hive and Pig
Session 1: Exploring Hive
• Introducing Hive
• Starting Hive
• Executing Hive Queries from Files
• Data Types
• Hive Built-In Functions
• Compressed Data Storage
• Data Manipulation in Hive
Session 2: Advanced Querying with Hive
• Queries
• Manipulating Column Values Using Functions
• JOINS in Hive
• Hive Best Practices
• Performance Tuning and Query Optimizations
• Various Execution Types
• Hive File and Record Formats
• HiveThrift Service
• Security in Hive
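HiveQL is SQL-like, so the JOIN patterns covered in this session read much like standard SQL. Purely to illustrate the query shape, here is a sketch using Python's built-in sqlite3 (this is not Hive; the customers/orders tables and data are hypothetical):

```python
import sqlite3

# In-memory database standing in for Hive tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (1, 100), (1, 50), (2, 75);
""")

# The same JOIN + GROUP BY shape a HiveQL query would use.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
# rows == [('Asha', 150), ('Ravi', 75)]
```

The practical difference is execution, not syntax: Hive compiles such a query into distributed jobs over HDFS data rather than running it against a local file.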
Session 3: Analyzing Data with Pig
• Introduction to Pig
• Installing Pig
• Properties of Pig
• Running Pig
• Pig Latin Application Flow
• Beginning with Pig Latin
• Relational Operators in Pig
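Pig Latin's relational operators describe a data flow: each statement takes a relation and produces a new one. The flow of a FILTER → GROUP → FOREACH script can be mimicked over plain Python tuples (this is not Pig; the city/count records are hypothetical):

```python
from itertools import groupby

records = [("mumbai", 5), ("pune", 12), ("mumbai", 9), ("pune", 3)]

# FILTER records BY count > 4;
filtered = [r for r in records if r[1] > 4]

# GROUP filtered BY city;  (groupby needs sorted input)
grouped = {
    city: list(group)
    for city, group in groupby(sorted(filtered), key=lambda r: r[0])
}

# FOREACH grouped GENERATE city, SUM(counts);
totals = {city: sum(c for _, c in rows) for city, rows in grouped.items()}
# totals == {"mumbai": 14, "pune": 12}
```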
Module 5: Additional Hadoop Tools: Sqoop, Flume, YARN and Storm
Session 1: Efficiently Transferring Bulk Data Using Sqoop
• Introducing Sqoop
• Using Sqoop 1
• Importing Data with Sqoop
• Controlling Parallelism
• Encoding NULL Values
• Importing Data into Hive Tables
• Importing Data into HBase
• Exporting Data
• Exporting Data into Subset of Columns
• Drivers and Connectors in Sqoop
• Sqoop Architecture Overview
• Sqoop 2
Session 2: Flume
• Introducing Flume
• The Flume Architecture
• Setting Up Flume
• Building Flume
Session 3: Beyond MapReduce – YARN
• Why YARN?
• The YARN Ecosystem
• A YARN API Example
• Mesos versus YARN
Session 4: Storm on YARN
• Storm and Hadoop
• Overview of Storm
• The Storm API
• Storm on YARN
• Installing Storm on YARN
• An Example of Storm on YARN
Module 6: Leveraging NoSQL, Hadoop Security, Hadoop on the Cloud and Real-Time Hadoop
Session 1: Hello NoSQL
• Two Simple Examples
• Storing and Accessing Data
• Storing and Accessing Data in MongoDB
• Storing and Accessing Data in HBase
• Storing and Accessing Data in Apache Cassandra
• Language Bindings for NoSQL Data Stores
Session 2: Working with NoSQL
• Creating Records
• Accessing Data
• Updating and Deleting Data
• MongoDB Query Language Capabilities
• Accessing Data from Column-Oriented Databases Like HBase
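The create/read/update/delete operations listed above follow the same pattern across document stores. A sketch of that pattern using an in-memory dict (illustration only, not a real client; a MongoDB driver would issue calls such as insert_one, find, and update_one against a server instead, and the user record here is hypothetical):

```python
store = {}

# Create: insert a document under a key.
store["u1"] = {"name": "Asha", "city": "Mumbai"}

# Read: query by field, as a document database allows.
hits = [doc for doc in store.values() if doc["city"] == "Mumbai"]

# Update: change a single field in place.
store["u1"]["city"] = "Pune"

# Delete: remove the document.
del store["u1"]
# store == {} and hits[0]["name"] == "Asha"
```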
Session 3: Hadoop Security
• Hadoop Security Challenges
• Authentication
• Delegated Security Credentials
• Authorization
Session 4: Running Hadoop Applications on AWS
• Getting to Know AWS
• Options for Running Hadoop on AWS
• Understanding the EMR–Hadoop Relationship
• Using AWS S3
• Automating EMR Job Flow Creation and Job Execution
• Orchestrating Job Execution in EMR
Session 5: Real Time Hadoop
• Real-Time Hadoop Applications
• Using Specialized Real-Time Hadoop Query Systems
• Using Hadoop-Based Event-Processing Systems
Trainer Profile
Mr Biswajyoti Kar holds an A.M.I.E. from the Institution of Engineers (India), Gokhale Road, Calcutta, and a B.Sc. in
Physics from the University of Calcutta. He is a Senior Architect with over 19 years of rich experience and a proven
record in architecting, designing, and implementing systems software. He has experience in Big Data analytics,
UNIX/Linux kernel-mode development, and data structures and algorithm development in C. His areas of interest are
building solutions around Big Data and analytics and IP creation in the Big Data space.
Training Experience
• Big Data and Hadoop Distributed File System at Dell
• Algorithms and Data Structures in C/C++, UNIX/Linux advanced programming, and shell scripting at Dell
• Algorithms and Data Structures in C at Proton Solutions
Project Experiences
1. Big Data Work
   Leading a project that involved setting up the Hadoop Distributed File System (HDFS) on Linux boxes
   to test the elasticity aspect of cloud computing. Benchmarking the Hadoop system for crunching
   terabytes of data using the macro-programming model PIG Latin. Statistical analysis was done using
   the R language.
2. Parallel Network File System
   Leading a project that involved setting up pNFS client-server file and block layout configurations.
   Figuring out the pros and cons of each configuration in HPC and NAS environments.
3. Big Data Analytics
   Providing consulting in the area of Big Data analysis to a credit rating agency.
***