Intro to LargeData On the HPC cluster

Transcription

Harry Mangalam
[email protected]
LargeData vs BigData
Some talking points:
• If you aren't comfortable with Linux...
• You can Google, and read further by yourself.
• Questions: ASK THEM, but I may not answer them immediately. – "You don't know what you don't know"
Computing Philosophy
Unlike your Science...
• Be lazy.
• Copy others.
• Don't invent anything you don't have to.
• Re-USE, re-CYCLE, DON'T re-invent.
• Don't be afraid to ask others.
• Resort to new code only when absolutely necessary.
Philosophy – Take Away

• You're not CS, not programmers.
• Don't try to be them. (I'm not.)
• But! Try to think like them, a bit.
• Again, Google is your friend.
How to Ask Questions

• Reverse the situation: if you were answering the question, what information would you need?
• Not Science, but it is Logic.
• Include enough info to recreate the problem.
• Exclude what's not helpful or ginormous (use <pastie.org> or <tny.cz>).
• Use text, not screenshots, if possible.
• HELP US HELP YOU
We Need:
• the directory in which you're working (pwd)
• the machine you're working on (hostname)
• modules loaded (module list)
• the computer / OS you're connecting from
• the command you used and the error it caused (in text, not a screenshot)
Much of this info is shown by your prompt – see <http://goo.gl/6eZORd>
On to HPC
What is the High Performance Computing Cluster?
and…
Why do I need HPC?
What is a Cluster?

• A pod of large, general-purpose computers that
• run the Linux Operating System,
• are linked by some form of networking,
• have access to networked storage,
• and can work in concert to address large problems
• by scheduling jobs very efficiently.
HPC @ UCI in Detail

• ~6560 64-bit cores – mostly AMD, a few Intel
• ~40TB aggregate RAM
• ~1PB of storage
• Connected by 1Gb ethernet (100MB/s) & QDR IB (800MB/s)
• Grid Engine Scheduler to handle queues
• >1700 users; 100+ are online at any time
What HPC is NOT

• NOT your personal machine.
• It is a shared resource.
• What you do affects all the other users, so think before you hit that 'Enter' key.
• It is well secured from mischief and disasters – but that's not an invitation.
HPC FileSystem Layout
(Orange – Cluster Wide; Black – Node Specific)

/
├── data/            NFS mount
│   ├── apps/        All programs are installed here
│   └── users/       Users' home directories – 50GB LIMIT PER USER
├── pub/             Public scratch space, overflow – 2TB limit (but only active data)
├── bio/             Space for BIO group → /dfs1
├── som/             Space for SOM group → /dfs1
├── cbcl/            Space for CBCL group → /dfs1
├── dfs1/            BeeGFS Distributed File System, ~460TB
├── dfs2/            BeeGFS Distributed File System, ~190TB
├── scratch/         Node-specific temporary storage per job (faster than all above), ~1TB–14TB
├── fast-scratch/    High-speed Fraunhofer FileSystem for temporary storage, ~13TB
├── ssd-scratch/     Very high IOPS for DB and other jobs
└── /tmp             Same as scratch
Disk Space / Quotas / Policies

• You can only have so much space.
• 50GB for /data/ (home directory).
• 2TB on /pub ... but ...
• if data sits 6 months or older without use, please remove it from the cluster.
• More for Condo owners or groups who have bought extra disk space.
• Regardless, NO DATA IS BACKED UP.
Here vs There

• Your laptop is HERE (and HERE is often dynamic).
  (How do you find out your IP #?)
• HPC is THERE (and THERE is always static).
• Files have to get from HERE to THERE (so it's always easier to push data from HERE to THERE, but .....)
• Displays are generated THERE but are seen HERE (both text and graphics).
• The point above can be exploited to make life easier. [byobu and x2go]
• Make sure of where you are and in which direction the bytes are going.
Keeping SSH Session Alive

• If you need to maintain a live connection for some reason, use 'byobu' or 'screen'.
• They let you multiplex and maintain connections in a single terminal window.
• Somewhat unintuitive interface, but very powerful.
• You know about cheatsheets (Google!!)

Byobu / Screen
Graphics Apps on HPC

• Linux uses X11 for graphics.
• X11 is very chatty, high-bandwidth, and sensitive to network hops/latency.
• If you need graphics programs on HPC, use x2go instead of native X11; it does for graphics what byobu does for terminal screens.
• x2go is described in the Tutorial & HOWTO; also GOOGLE.
How to: SSH & The Shell

• Once logged in to HPC via SSH, you are using the Shell, which is...
• a program that intercepts and translates what you type, to tell the computer what to do.
• It's what you will be interacting with mostly.
• The HPC shell is 'bash', although there are others.
Know the shell, Embrace the Shell

• If you don't get along with the shell, life will be hard.
• Before you submit anything to the cluster via qsub, get it going in your login shell.
• You're welcome to start jobs on the IO node; type: qrsh
• "DO NOT RUN JOBS ON THE LOGIN NODE"
Foreground & Background Jobs

• Foreground (fg) jobs are connected to the terminal. You kill a fg job with ^C.
• Background (bg) jobs have been disconnected from the terminal and are running in the bg.
• Send a job to the bg immediately by appending &.
• Recall a job to the fg with fg.
• Send a fg job to the bg with ^Z (suspend), then 'bg'.
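The same mechanics work inside a script – a minimal sketch (file path is a made-up example; ^Z/fg/bg themselves are interactive-shell keystrokes, so they aren't shown):

```shell
#!/usr/bin/env bash
# Background jobs from a script: '&' forks the job off the terminal,
# $! captures its PID, and 'wait' blocks until the bg jobs finish.

sleep 1 &                             # '&' sends the job to the bg immediately
bgpid=$!                              # PID of the most recent bg job

echo "working" > /tmp/bg_demo.txt &   # a second bg job
wait                                  # block until ALL bg jobs finish
echo "background jobs done (first PID was $bgpid)"
```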
STDIN, STDOUT, STDERR

• THIS IS IMPORTANT
• STDIN is usually the keyboard, but...
• STDOUT is usually the screen, but...
• STDERR is also usually the screen, but...
• All can be redirected all over the place:
• to files, to pipes, to FIFOs, to network sockets;
• they can be combined or split (by 'tee') to make simple workflows.
• More on this later.
Redirection

• STDIN, STDOUT, STDERR can be 'redirected', merged, swapped, split, etc.
Redirection

• '<' reads STDIN from a file
• '>' writes STDOUT to a file
• '>>' appends STDOUT to a file
• '|' pipes the STDOUT of one program to the STDIN of another program (no file)
• 'tee' splits the STDOUT and sends one of the outputs to a file; the other output continues as STDOUT (no file)
Redirection

• '2>' redirects STDERR to a file
• '2>>' appends STDERR to a file
• '&>' redirects BOTH STDERR and STDOUT to a file
• '2>&1' merges STDERR with STDOUT
• '2>&1 |' merges STDERR with STDOUT and sends both to a pipe
• '|&' same as '2>&1 |' above (recent bash)
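The operators above, exercised in one short session (file names in /tmp are made-up examples):

```shell
#!/usr/bin/env bash
# Demonstrate the redirection operators from the slides above.

echo "line 1" >  /tmp/redir_demo.txt    # '>'  write STDOUT to a file
echo "line 2" >> /tmp/redir_demo.txt    # '>>' append STDOUT to a file

# '2>' sends only STDERR to a file (the ls error for the missing file):
ls /tmp/redir_demo.txt /tmp/no_such_file 2> /tmp/redir_err.txt

# '2>&1' merges STDERR into STDOUT so both travel down the same pipe:
ls /tmp/redir_demo.txt /tmp/no_such_file 2>&1 | wc -l

wc -l < /tmp/redir_demo.txt             # '<' read STDIN from a file
```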
Redirection

• These streams can go to/from each other via pipes, named pipes, FIFOs, network sockets, or files.
• All but the last involve in-memory redirection.
• Files (usually) involve data being written to disks... which is BAD.
Pipe = |

• Works with STDIN/OUT/ERR to create 'pipelines'.
• Very similar to plumbing; you can add 'tee's to introduce splits:
  $ ls | tee 1lsfile 2lsfile 3lsfile | wc
• The STDOUT of one program goes to the STDIN of another program, whose STDOUT goes to the STDIN of another program, ad infinitum.
• Sooooo......
Pipe Example

$ w | cut -f1 -d' ' | egrep -v "(^$|USER)" | sort | uniq -c | wc -l

• w spits out who is on the system right now
• cut -f1 -d' ' chops out the 1st field (the user), based on the space token
• egrep -v "(^$|USER)" filters out both blank lines and lines with 'USER'
• sort sorts the usernames alphabetically
• uniq -c counts the unique lines
• wc -l counts the lines of that output.

Example: Now on HPC!
Regular Expressions

YOU KNOW ABOUT REGULAR EXPRESSIONS, RIGHT???
Regular Expressions

• Among the most powerful concepts in pattern matching
• Simple in concept, NASTY in implementation
• Among the ugliest / most confusing things to learn well
• But pretty easy to learn the simple parts.
• You will NEED to learn it – it's central to computers and especially biology.
Regexes

• Simplest form is called globbing: a*
• Barely more complicated: a*.txt
• A bit more: a*th.txt
• Can be MUCH more complex:
  – [aeiou] = any one of 'aeiou'
  – F{3,5} = 3-5 'F's
  – H+ = 1 or more 'H's
  – . = any character
• Also classes of characters (#s, alphabetic, words)
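The atoms above, tried out with grep -E on a made-up test file:

```shell
#!/usr/bin/env bash
# Exercise the regex atoms from the slide with grep -E.
# /tmp/regex_demo.txt is an invented 4-line test file.

printf 'aeiou\nFFFF\nHH\nabc\n' > /tmp/regex_demo.txt

grep -E '[aeiou]' /tmp/regex_demo.txt   # lines containing any one vowel
grep -E 'F{3,5}'  /tmp/regex_demo.txt   # 3-5 consecutive 'F's
grep -E 'H+'      /tmp/regex_demo.txt   # one or more 'H's
grep -c '.'       /tmp/regex_demo.txt   # '.' = any character (all 4 lines)
```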
Archiving / Compression

• tar = standard archive format for Linux
• zip = common archive format, from Windows
• gzip = common compression format
• bzip2 = a better (tighter), slower format
• lz4 = a weaker, faster format
• pigz/pbzip2 = parallel versions (for large files)
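A minimal tar+gzip round trip, assuming a made-up payload directory in /tmp:

```shell
#!/usr/bin/env bash
# Create, list, and extract a gzipped tar archive.
# /tmp/demo_dir is an invented example payload.

mkdir -p /tmp/demo_dir /tmp/extract_here
echo "hello" > /tmp/demo_dir/a.txt

tar -czf /tmp/demo.tar.gz -C /tmp demo_dir      # -c create, -z gzip, -f archive name
tar -tzf /tmp/demo.tar.gz                       # -t list contents without extracting
tar -xzf /tmp/demo.tar.gz -C /tmp/extract_here  # -x extract
```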
Move Data to / from HPC

• Covered in detail in the HPC USER HOWTO, which references: goo.gl/XKFEp
• scp, bbcp, netcat/tar on Mac, Linux.
• WinSCP, Filezilla, CyberDuck, FDT on Windows.
• Everyone should know how to use rsync. Not the easiest to learn, but very powerful & scriptable.
• There are rsync GUIs for Linux, Windows, MacOSX.
The Scheduler (GE)

• Just another program that juggles requests for resources.
• Make sure a program is working on a small set of test data in an interactive shell.
• You need a short bash script (aka qsub script) to tell the GE what your program needs to run.
• You can improve the performance of your program in a variety of ways (staging data, running in parallel, using array jobs, etc).
GE Useful Commands

• qstat – queue status
• queue / q – what queues you have access to
• qdel – delete/stop your job
• qhost – show all nodes and their status
• Use 'man <cmd>' to find out more information on the above.
• http://hpc.oit.uci.edu/PHPQstat

Ref:
http://hpc.oit.uci.edu/running-jobs
http://hpc.oit.uci.edu/PHPQstat/
Sample QSUB Script

Visit: <http://hpc.oit.uci.edu/guides/qsub-biolinux.html>

Ref: <http://goo.gl/hrcXBg>
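For flavor, a minimal sketch of what such a script looks like — the `#$` lines are common Grid Engine directives, but the queue name, module, and commands below are all invented examples; see the guide above for the real details:

```shell
#!/usr/bin/env bash
# Write out a minimal (hypothetical) qsub script and syntax-check it.

cat > /tmp/myjob.sh <<'EOF'
#!/bin/bash
#$ -N myjob          # job name
#$ -q free64         # queue to run in (example queue name)
#$ -o myjob.out      # STDOUT file
#$ -e myjob.err      # STDERR file
#$ -cwd              # run from the current working directory

module load blast    # load whatever software the job needs (example)
blastn -query in.fa -db nt -out out.txt   # the actual work (example)
EOF

bash -n /tmp/myjob.sh && echo "script parses OK"
# Submit it with:  qsub /tmp/myjob.sh
```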
Inodes and ZOT Files

• Inodes contain the metadata for files and dirs.
• Inodes are pointers to the data.
• Regardless of size, a file needs at least one inode to locate it.
• A file of 1 byte takes up the same minimum inode count as a file of 1TB.
• DO NOT USE ZOT FILES!! – Zillions of Tiny Files
Streaming Reads & Writes

Let me demonstrate with a card trick.

Data on Spinning Disks

• Data usually lives on spinning disks.

Data on SSDs

• Progressively it lives on SSDs.
• MUCH faster IOPS.
• Getting to be as reliable, getting cheaper.
Reads from a Spinner

• Your program asks for some data.
• The OS asks the disk controller for the data.
• The controller asks the disk to go get it.
• A SPINNING disk has to:
  – position the heads,
  – wait until the disk spins around to the right place,
  – follow the tracks to pick up all the data.
Reads from an SSD

• Your program asks for some data.
• The OS asks the disk controller for the data.
• The controller asks the disk to go get it.
• An NVRAM Solid State Disk has to:
  – give the controller the data.
The next medium

• A new technology from Intel (3D XPoint™) is promising to make NVRAM obsolete as well as being ~1000X faster.
• … and ...eventually... cheaper.
The Data path

• The data, in a // stream of bytes, goes back up through the circuitry, stashing significant bits in the on-disk cache.
• Then back to the controller, where it stuffs bits into the controller cache.
• Then up into main RAM, where it stores large chunks in otherwise unused RAM.
• And from there, the fun begins.

Data path from disk to RAM
RAIDS

Redundant Arrays of Inexpensive Disks

[Diagram: a Client asks a Metadata server where the blocks live (A:7265, B:9286, C:0757, D:9822, E:9667); the blocks A–E are spread across multiple Storage Servers.]
From disk to RAM

• The data, in a // stream of bytes, goes back up through the circuitry, stashing significant bits in the on-disk cache.
• As it gets closer to RAM, the data width and speed increase.
• Then back to the controller, where it stuffs bits into the controller cache.
• Then up into main RAM, where it stores large chunks in otherwise unused RAM.
• And from there, the fun begins.
From RAM to CPU

• Data in RAM is now summoned to the CPU, typically via a POINTER.
• As it travels closer to the CPU registers, the speed and width of the data path increase.
• To provide data as fast as possible, CPUs have been tricked out with various levels of on-CPU cache.
• 1°, 2°, and even 3° caches are used, getting larger as they get further from the CPU registers.
Inside the CPU

• Registers: run at clock speed.
• 1° cache (64KB): runs at ~1/4 clock speed.
• 2° cache (256KB-1MB): ~1/10 clock speed.
• 3° cache (40MB): ~1/20 clock speed.
• Main RAM (100s of GB): ~1/1000 clock speed.
• As data comes into the registers, new data replaces old data in all the caches, so it's nearby in case it's needed again.
Inside RAM
'top' can show you what's going on...
The FileCache

$ time Rwbinary.pl   # reads a 2GB file of random data
Read 2147483648 bytes
real  0m41.981s   ← compare with below
user  0m0.596s
sys   0m14.855s

$ time Rwbinary.pl   # repeat it 2 seconds later
Read 2147483648 bytes
real  0m2.572s    ← 1/16th the time
user  0m0.288s
sys   0m2.284s
Getting to BigData

• Obtaining it
• Extracting / Joining it
• Cleaning & Filtering it
• Formatting & Coding it
• Sectioning it
• Staging it for max performance
• [Data Provenance]
Big Data

• Volume
  – Scary sizes, and getting bigger
• Velocity
  – Special approaches to speed analysis
• Variety
  – Domain-specific standards (HDF5/netCDF, bam/sam, FITS), but often aggregations of unstructured data

BigData Hints for Newbies: <http://goo.gl/aPj4az>
Big Data – How Big is Big?
Integer Byte Sizes
Editing Big Data

• Don't.
• Use format-specific utilities to view such files, and hash values to check whether they're identical to what they should be.
• Try not to be the person who tries to open a 200GB compressed data file with nano/vim/joe/emacs, etc.
[De]Compression

• If your applications can deal with compressed data, KEEP IT COMPRESSED.
• If they can't, try to use pipes '|' to decompress in memory and feed the decompressed stream to the app. Many popular apps now allow this.
• Use native utilities to examine the compressed data (zcat/unzip/gunzip, grep, archivemount, ViTables, ncview, etc.)
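The in-memory decompression trick, sketched with zcat and a pipe (the haystack file is an invented example):

```shell
#!/usr/bin/env bash
# Examine compressed data without ever writing a decompressed file.

echo "needle in a haystack" | gzip > /tmp/comp_demo.txt.gz

zcat /tmp/comp_demo.txt.gz                 # view the decompressed stream
zcat /tmp/comp_demo.txt.gz | grep needle   # search it in-memory via a pipe
zgrep -c needle /tmp/comp_demo.txt.gz      # zgrep wraps the same idea
```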
Moving LargeData

• 1st: Don't.
• Otherwise, plan where your data will live for the life of the analysis, have it land there, and don't move it across filesystems.
• Don't DUPLICATE DUPLICATE DUPLICATE BigData. See: <http://goo.gl/2iaHqD>
• Move LargeData as archives, not ZOTfiles (remember the card trick?)
Moving BigData

• rsync for modified data
• bbcp for new transfers of large single files, regardless of network
• tar/netcat for deep/large dir structures over LANs
• tar/gzip/bbcp to copy deep/large dir structures over WANs
Checksums

• They work. Choose one and use it.
• md5sum / jacksum / shasum
• Use MANIFEST files & copy them along with the data files.
• See Checksum example: http://goo.gl/uvB5Fy
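A sketch of the MANIFEST-file idea with md5sum (directory and file names are invented examples):

```shell
#!/usr/bin/env bash
# Build a MANIFEST of checksums for a data dir, then verify it.

mkdir -p /tmp/dataset && cd /tmp/dataset
echo "sample data" > reads_01.fq

md5sum *.fq > MANIFEST.md5   # one checksum line per data file

# ... copy the data files AND MANIFEST.md5 together; at the far end:
md5sum -c MANIFEST.md5       # prints 'reads_01.fq: OK' if intact
```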
Processing LargeData

• Files (HDF5, bam/sam) and specialized utilities (nco/ncview, [Py/Vi]tables, R, Matlab)
• Relational DBs (SQLite, Postgres, MySQL)
• NoSQLs (MongoDB, CouchDB)
• Binary dumps (Perl's Data::Dumper, Python's pickle)
• Non-storage (pipes, named pipes/FIFOs, sockets)
• Keep it RAM-resident.
Iteration vs Index

• Iteration: read through data until you find what you're looking for. (grep; slow)
• Index: look up what you're looking for and then go there. (ToC, Index; very fast)
• BUT: iteration is easy; indexing requires prep.
• If the file will be traversed a few times, iterate.
• If it's going to be used repeatedly, index.
• Google is an enormous Index.
BigData Formats

• Unstructured document files
• Structured (row/col) files (Excel, SQL tables)
• XML
• YAML/JSON/NoSQL DBs
  – Key:Value pairs & variants
• Personal binary dumps (Data::Dumper, pickle)
• Structured binary files
  – bam
  – ORCish formats from Hadoop
  – HDF5/netCDF4 from NASA
• Compressed formats
• Unstructured ASCII files
Data Consumption

It often takes more effort to migrate data consumers to new storage in a new system than to migrate the data itself. Once we have used a hybrid model to migrate data, applications designed to access the data using relational methodology must be redesigned to adapt to the new data architecture. Hopefully, future hybrid databases will provide built-in migration tools to help leverage the potential of NoSQL. In the present, remodeling data is critical before an existing system is migrated to NoSQL. In addition, modeling tools must be evaluated to verify they meet the new requirements.

Physical Data-aware Modeling

For the past 30 years, 'structure first, collect later' has been the data modeling approach. With this approach, we determine the data structure before any data goes into the data system. In other words, the data structure definition is
Structured ASCII files

0:ID                 1:BEGIN   2:END  3:Blue  4:Red      5     6      7      8
ENSXETG00000000007   SCAFFOLD  1356   47264   67263      1    50  -0.67   0.70
ENSXETG00000000007   SCAFFOLD  1356   47264   67263     70   119  -0.47   0.64
ENSXETG00000000007   SCAFFOLD  1356   47264   67263    139   188   0.36   0.74
ENSXETG00000000007   SCAFFOLD  1356   47264   67263    232   281   0.42  -0.19
ENSXETG00000000007   SCAFFOLD  1356   47264   67263    301   350   0.09  -0.65
ENSXETG00000000007   SCAFFOLD  1356   47264   67263    370   419   0.22  -0.38
ENSXETG00000000007   SCAFFOLD  1356   47264   67263    439   488  -0.42  -0.09
ENSXETG00000000007   SCAFFOLD  1356   47264   67263    508   557   0.61  -0.81
ENSXETG00000000007   SCAFFOLD  1356   47264   67263    577   626   0.71  -0.45
ENSXETG00000000007   SCAFFOLD  1356   47264   67263    658   707   1.61  -0.84
Relational Data

• Relational Data is data that relates to other data … formally.
• That is, it has an 'is a', 'has a', 'is part of', 'talks to', etc.
• Object-oriented data is largely relational.
• I have a first name, last name, age, fashion sense; I ride a bike.
• I live in Irvine, work at UCI, in the Dept of OIT.
• The numbers & text are the data.
• The 'relational' part is the metadata: how the numbers relate to each other.
Relational Data

Make your data relational:
• Often, your data will appear in the form of spreadsheets or flatfiles.
• In order to QUERY the data, you import them into a relational data structure.
• Relational Engine vs Database?
• Either way, it's fairly easy:
  – Import data into flat tables.
  – Set up the relationships.
  – Query away.
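The import-and-query steps above, sketched with SQLite (one of the relational DBs this deck mentions); the sqlite3 CLI is assumed to be installed, and the file, table, and column names are invented examples:

```shell
#!/usr/bin/env bash
# Import a headerless flatfile into SQLite, then query it.

printf 'brca1,0.9\ntp53,0.4\n' > /tmp/flat_data.csv   # example flatfile

rm -f /tmp/rel_demo.db
sqlite3 /tmp/rel_demo.db <<'EOF'
CREATE TABLE genes (gene TEXT, score REAL);
.mode csv
.import /tmp/flat_data.csv genes
SELECT gene FROM genes WHERE score > 0.5;
EOF
```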
XML
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
YAML

--- !clarkevans.com/^invoice
invoice: 34843
date   : 2001-01-23
bill-to: &id001
    given  : Chris
    family : Dumars
    address:
        lines: |
            458 Walkman Dr.
            Suite #292
        city   : Royal Oak
        state  : MI
        postal : 48046
ship-to: *id001
JSON

{"widget": {
    "debug": "on",
    "window": {
        "title": "Sample Konfabulator Widget",
        "name": "main_window",
        "width": 500,
        "height": 500
    },
    "image": {
        "src": "Images/Sun.png",
        "name": "sun1",
        "hOffset": 250,
        "vOffset": 250,
        "alignment": "center"
    }
}}
Personal Binary Dumps

163631823645618364912152ducksheepshark387ratthingpokemon

(each field in the packed record above maps to one format code below)

%3d, %4d, %5d, %3d, %9.4f, %4s, %10s, %3d, %8s, %7s   # read format
Structured Binary: HDF5
HDF5 (contd)
Parallel – Programming

• What type of parallel programming model to use?
• Problem partitioning
• Load balancing
• Communications between threads, processes, nodes
• Data dependencies
• Synchronization and race conditions
• Memory issues
• I/O issues
• Program complexity
• Programmer effort/costs/time
Is it worth the pain?

• Check the performance profile before you even look at parallel.
• perf record [commandline]
• perf report
• Also 'oprofile'
Parallel Jobs

• Easy: Embarrassingly //, SGE Array Jobs, GNU Parallel, Clusterfork, other types of forking, pseudo-forking.
• Harder: Untangling data dependency issues, Compiler Vectorization, OpenMP.
• Even Harder: Same-node //, threads.
• Extraordinarily Hard: Inter-node //, MPI & other horrors.
Parallel – Easy (no programming)

• Embarrassingly // problems
• i.e.: Array Jobs in SGE – iterate a command over a range
• GNU Parallel
• Scripts forking other scripts/processes
  (OK, I lied about the no programming)
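The "iterate a command over a range" idea can be sketched with plain bash forking — no scheduler or GNU Parallel needed (output paths are invented examples):

```shell
#!/usr/bin/env bash
# Embarrassingly // from a plain script: fork each iteration into the
# background, then 'wait' collects them all.

mkdir -p /tmp/par_demo
for i in 1 2 3 4; do
    # each loop body runs as its own background process
    ( echo "chunk $i processed" > "/tmp/par_demo/out.$i" ) &
done
wait   # block until all forked children finish
cat /tmp/par_demo/out.*
```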
Parallel – Harder

• Untangling data dependency issues
  – If X does not depend on Y, do them at the same time.
  – Map/Reduce & Hadoop, Spark
• Compiler Vectorization
  – Some compilers/interpreters will // your code automatically.
• OpenMP
  – Explicit request to // loops (if done carefully)
Parallel – Even Harder

• Threads on the same node.
• Differences between threads and processes.
• Threads are fairly easy to do badly, REALLY hard to do well.
• Thread toolkits: Intel's Thread Building Blocks for C++.
• Can also do threads in Python, Perl, Java (although they might not be the same kind of thread).
Remember...

Good Judgment comes from Experience.
Experience comes from Bad Judgment.
And Now...
QUESTIONS??