Intro to LargeData On the HPC cluster
Intro to LargeData On the HPC cluster
Harry Mangalam [email protected]

LargeData vs BigData – some talking points:
- If you aren't comfortable with Linux, you can Google and read further by yourself.
- Questions: ASK THEM, but I may not answer them immediately.
- "You don't know what you don't know."

Computing Philosophy
- Unlike your Science... Be lazy. Copy others. Don't invent anything you don't have to.
- Re-USE, re-CYCLE, DON'T re-invent. Don't be afraid to ask others.
- Resort to new code only when absolutely necessary.

Philosophy – Take Away
- You're not CS, not programmers. Don't try to be them. (I'm not.)
- But! Try to think like them, a bit.
- Again, Google is your friend.

How to Ask Questions
- Reverse the situation: if you were answering the question, what information would you need?
- Not Science, but it is Logic.
- Include enough info to recreate the problem.
- Exclude what's not helpful or ginormous (use <pastie.org> or <tny.cz>).
- Use text, not screenshots, if possible.

HELP US HELP YOU – we need:
- the directory in which you're working (pwd)
- the machine you're working on (hostname)
- the modules loaded (module list)
- the computer / OS you're connecting from
- the command you used and the error it caused (in text, not a screenshot)
Much of this info is shown by your prompt; see <http://goo.gl/6eZORd>

On to HPC
What is the High Performance Computing Cluster? and... Why do I need HPC?

What is a Cluster?
A pod of large, general-purpose computers that:
- run the Linux Operating System
- are linked by some form of networking
- have access to networked storage
- can work in concert to address large problems by scheduling jobs very efficiently

HPC @ UCI in Detail
- ~6560 64-bit cores – mostly AMD, a few Intel
- ~40TB aggregate RAM
- ~1PB of storage
- connected by 1Gb ethernet (100MB/s) & QDR InfiniBand (800MB/s)
- Grid Engine scheduler to handle the queues
- >1700 users, 100+ online at any time

What HPC is NOT
NOT your personal machine. It is a shared resource.
What you do affects all the other users, so think before you hit that 'Enter' key.
It is well secured from mischief and disasters – that's not an invitation.

HPC FileSystem Layout (cluster-wide, except where marked node-specific):
- /data            NFS mount
- /data/apps       all programs are installed here
- /data/users      home directories – 50GB LIMIT PER USER
- /data/pub        public scratch space, overflow – 2TB limit (but only active data)
- /bio             space for the BIO group → /dfs1
- /som             space for the SOM group → /dfs1
- /cbcl            space for the CBCL group → /dfs1
- /dfs1            BeeGFS distributed filesystem, ~460TB
- /dfs2            BeeGFS distributed filesystem, ~190TB
- /scratch         node-specific temporary storage per job (faster than all of the above), ~1TB–14TB
- /fast-scratch    high-speed Fraunhofer filesystem for temporary storage, ~13TB
- /ssd-scratch     very high IOPS for DB and similar jobs; otherwise same as /scratch
- /tmp             node-specific

Disk Space / Quotas / Policies
- You can only have so much space: 50GB for /data (home directory), 2TB on /pub... but...
- if data is 6 months or older without use, please remove it from the cluster.
- More for Condo owners, or for groups who have bought extra disk space.
- Regardless, NO DATA IS BACKED UP.

Here vs There
- Your laptop is HERE (and HERE is often dynamic). (How do you find out your IP #?)
- HPC is THERE (and THERE is always static).
- Files have to get from HERE to THERE (so it's always easier to push data from HERE to THERE, but.....)
- Displays are generated THERE but are seen HERE (both text and graphics).
- The point above can be exploited to make life easier [byobu and x2go].
- Make sure of where you are and in which direction the bytes are going.

Keeping SSH Sessions Alive
If you need to maintain a live connection for some reason, use 'byobu' or 'screen'. They let you multiplex and maintain connections in a single terminal window. Somewhat unintuitive interface, but very powerful. You know about cheatsheets (Google!!).
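As a minimal, non-interactive sketch of the screen workflow described above (byobu wraps the same idea with friendlier function-key bindings); the session name 'analysis' is just an example:

```shell
# Start a detached session named 'analysis' running a long command
# ('sleep 60' stands in here for your real long-running job).
screen -dmS analysis sleep 60

# List your sessions -- handy after you log back in over a new SSH connection.
screen -ls

# Reattach with: screen -r analysis   (detach again from inside with Ctrl-a d)
# Kill the session without attaching:
screen -S analysis -X quit
```

With byobu the same pattern is mostly automatic: run byobu after login, detach with F6, and it reattaches the next time you run it.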
Byobu / Screen

Graphics Apps on HPC
- Linux uses X11 for graphics. X11 is very chatty, high-bandwidth, and sensitive to network hops/latency.
- If you need graphics programs on HPC, use x2go instead of native X11; it does for graphics what byobu does for terminal screens.
- x2go is described in the Tutorial & HOWTO; also GOOGLE.

How to: SSH & The Shell
Once logged in to HPC via SSH, you are now using the Shell, which is...
- a program that intercepts and translates what you type, to tell the computer what to do;
- what you will be interacting with, mostly.
The HPC shell is 'bash', although there are others.

Know the Shell, Embrace the Shell
- If you don't get along with the shell, life will be hard.
- Before you submit anything to the cluster via qsub, get it going in your login shell.
- You're welcome to start jobs on an IO node; type: qrsh
- "DO NOT RUN JOBS ON THE LOGIN NODE"

Foreground & Background Jobs
- Foreground (fg) jobs are connected to the terminal. You kill a fg job with ^C.
- Background (bg) jobs have been disconnected from the terminal and are running in the bg.
- Send a job to the bg immediately by appending '&'.
- Recall a job to the fg with 'fg'.
- Send a fg job to the bg with ^Z (suspend), then 'bg'.

STDIN, STDOUT, STDERR – THIS IS IMPORTANT
- STDIN is usually the keyboard, but...
- STDOUT is usually the screen, but...
- STDERR is also usually the screen, but...
- all can be redirected all over the place: to files, to pipes, to FIFOs, to network sockets;
- they can be combined or split (by 'tee') to make simple workflows. More on this later.

Redirection
STDIN, STDOUT, and STDERR can be 'redirected': merged, swapped, split, etc.
- '<' reads STDIN from a file
- '>' writes STDOUT to a file
- '>>' appends STDOUT to a file
- '|' pipes the STDOUT of one program to the STDIN of another program (no file)
- 'tee' splits STDOUT and sends one copy to a file.
The other copy continues as STDOUT (no file).
- '2>' redirects STDERR to a file
- '2>>' appends STDERR to a file
- '&>' redirects BOTH STDERR and STDOUT to a file
- '2>&1' merges STDERR with STDOUT
- '2>&1 |' merges STDERR with STDOUT and sends both to a pipe
- '|&' same as '2>&1 |' above (recent bash)

Streams can go to/from each other via pipes, named pipes/FIFOs, network sockets, or files. All but the last involve in-memory redirection; files (usually) involve data being written to disk, which is BAD (slow).

Pipe = '|'
Works with STDIN/OUT/ERR to create 'pipelines'. Very similar to plumbing; you can add 'tee's to introduce splits:
    $ ls | tee 1lsfile 2lsfile 3lsfile | wc
The STDOUT of one program goes to the STDIN of another command, whose STDOUT goes to the STDIN of another program, ad infinitum. Sooooo......

Pipe Example
    $ w | cut -f1 -d' ' | egrep -v "(^$|USER)" | sort | uniq -c | wc -l
- w spits out who is on the system right now
- cut -f1 -d' ' chops out the 1st field (the user), based on the space token
- egrep -v "(^$|USER)" filters out both blank lines and lines containing 'USER'
- sort sorts the usernames alphabetically
- uniq -c counts the unique lines
- wc -l counts the lines of that output
Example: now on HPC!

Regular Expressions
YOU KNOW ABOUT REGULAR EXPRESSIONS, RIGHT???
- Among the most powerful concepts in pattern matching.
- Simple in concept, NASTY in implementation.
- Among the ugliest / most confusing things to learn well, but pretty easy to learn the simple parts.
- You will NEED to learn it – it's central to computers and especially biology.

Regexes
- The simplest form is called globbing: a*
- Barely more complicated: a*.txt
- A bit more: a*th.txt
- Can be MUCH more complex:
    [aeiou]  = any one of 'aeiou'
    F{3,5}   = 3 to 5 'F's
    H+       = 1 or more 'H's
    .        = any character
- Also classes of characters (digits, alphabetic, words).

Archiving / Compression
- tar = standard archive format for Linux
- zip = common archive format, from Windows
- gzip = common compression format
- bzip2 = a better (smaller), slower format
- lz4 = a poorer (larger), faster format
- pigz / pbzip2 = parallel versions (for large files)

Move Data to / from HPC
- Covered in detail in the HPC USER HOWTO, which references: goo.gl/XKFEp
- scp, bbcp, netcat/tar on Mac & Linux; WinSCP, Filezilla, CyberDuck, FDT on Windows.
- Everyone should know how to use rsync. Not the easiest to learn, but very powerful & scriptable. There are rsync GUIs for Linux, Windows, and MacOSX.

The Scheduler (GE)
- Just another program that juggles requests for resources.
- Make sure your program works on a small set of test data in an interactive shell.
- You need a short bash script (aka qsub script) to tell the GE what your program needs to run.
- You can improve the performance of your program in a variety of ways (staging data, running in parallel, using array jobs, etc.)

GE Useful Commands
- qstat – queue status
- queue / q – what queues you have access to
- qdel – delete/stop your job
- qhost – show all nodes and their status
Use 'man <cmd>' to find out more about the above.
Refs: http://hpc.oit.uci.edu/running-jobs and http://hpc.oit.uci.edu/PHPQstat/

Sample QSUB Script
Visit: <http://hpc.oit.uci.edu/guides/qsub-biolinux.html> Ref: <http://goo.gl/hrcXBg>

Inodes and ZOT Files
- Inodes contain the metadata for files and dirs; they are pointers to the data.
- Regardless of size, a file needs at least one inode to locate it: a file of 1 byte takes up the same minimum inode count as a file of 1TB.
- DO NOT USE ZOTFILES!! – Zillions Of Tiny Files.

Streaming Reads & Writes
Let me demonstrate with a card trick.

Data on Spinning Disks
- Data usually lives on spinning disks.

Data on SSDs
- Progressively, it lives on SSDs.
- MUCH faster IOPS.
- Getting to be as reliable; getting cheaper.

Reads from a Spinner
- Your program asks for some data.
- The OS asks the disk controller for the data.
- The controller asks the disk to go get it.
- A SPINNING disk has to:
    position the heads,
    wait until the disk spins around to the right place,
    follow the tracks to pick up all the data.

Reads from an SSD
- Your program asks for some data.
- The OS asks the disk controller for the data.
- The controller asks the disk to go get it.
- An NVRAM Solid State Disk just has to: give the controller the data.

The next medium
- A new technology from Intel (3D XPoint™) is promising to make NVRAM obsolete, as well as being ~1000X faster...
- ... and ... eventually ... cheaper.

From disk to RAM: the data path
- The data, in a parallel (//) stream of bytes, goes back up thru the circuitry, stashing significant bits in the on-disk cache.
- As it gets closer to RAM, the data width and speed increase.
- Then back to the controller, where it stuffs bits into the controller cache.
- Then up into main RAM, where large chunks are stored in otherwise unused RAM.
- And from there, the fun begins.

(Diagram: data path from disk to RAM.)

RAIDs – Redundant Arrays of Inexpensive Disks
(Diagram: a client, a metadata server, and storage servers A–E, with chunk IDs A:7265, B:9286, C:0757, D:9822, E:9667.)

From RAM to CPU
- Data in RAM is now summoned to the CPU, typically via a POINTER.
- As it travels closer to the CPU registers, the speed and width of the data path increase.
- To provide data as fast as possible, CPUs have been tricked out with various levels of on-CPU cache.
- 1°, 2°, and even 3° caches are used, getting larger as they get further from the CPU registers.

Inside the CPU
- Registers run at clock speed.
- 1° cache (64KB): ~1/4 clock speed
- 2° cache (256KB–1MB): ~1/10 clock speed
- 3° cache (~40MB): ~1/20 clock speed
- Main RAM (100s of GB): ~1/1000 clock speed
- As data comes into the registers, new data replaces old data in all the caches, so it's nearby in case it's needed again.

Inside RAM
'top' can show you what's going on...

The FileCache
    $ time Rwbinary.pl        # reads a 2GB file of random data
    Read 2147483648 bytes
    real 0m41.981s   ← compare with below
    user 0m0.596s
    sys  0m14.855s

    $ time Rwbinary.pl        # repeat it 2 seconds later
    Read 2147483648 bytes
    real 0m2.572s    ← 1/16th the time
    user 0m0.288s
    sys  0m2.284s

Getting to BigData
- Obtaining it
- Extracting / joining it
- Cleaning & filtering it
- Formatting & coding it
- Sectioning it
- Staging it for max performance
- [Data Provenance]

Big Data
- Volume: scary sizes, and getting bigger.
- Velocity: special approaches to speed analysis.
- Variety: domain-specific standards (HDF5/netCDF, bam/sam, FITS), but often aggregations of unstructured data.
BigData Hints for Newbies: <http://goo.gl/aPj4az>

Big Data – How Big is Big?
(Table: integer byte sizes.)

Editing Big Data
Don't. Use format-specific utilities to view such files, and hash values to check whether they're identical to what they should be. Try not to be the person who tries to open a 200GB compressed data file with nano/vim/joe/emacs, etc.

[De]Compression
- If your applications can deal with compressed data, KEEP IT COMPRESSED.
- If they can't, try to use pipes '|' to decompress in memory and feed the decompressed stream to the app. Many popular apps now allow this.
- Use native utilities to examine the compressed data (zcat/unzip/gunzip, grep, archivemount, ViTables, ncview, etc.)

Moving LargeData
- 1st: Don't.
- Otherwise, plan where your data will live for the life of the analysis, have it land there, and don't move it across filesystems.
- Don't DUPLICATE DUPLICATE DUPLICATE BigData. See: <http://goo.gl/2iaHqD>
- Move LargeData as archives, not ZOTfiles (remember the card trick?)
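A sketch of the 'archives, not ZOTfiles' advice above; the directory name 'myproj/' and the remote path are hypothetical, and pigz can replace gzip when you want parallel compression:

```shell
# Pack a deep directory of zillions of tiny files into ONE file before
# moving it: a single large stream lets the disks read/write sequentially,
# and the transfer costs one inode instead of zillions.
tar -cf - myproj/ | gzip > myproj.tar.gz

# Move the single archive rather than the tree, e.g.:
#   rsync -av myproj.tar.gz hpc:/pub/you/
# ...then unpack it at its final home on the far side:
#   tar -xzf myproj.tar.gz
```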
Moving BigData
- rsync for modified data
- bbcp for new transfers of large single files, regardless of network
- tar/netcat for deep/large dir structures over LANs
- tar/gzip/bbcp to copy deep/large dir structures over WANs

Checksums
- They work. Choose one and use it: md5sum / jacksum / shasum.
- Use MANIFEST files & copy them along with the data files.
- Checksum example: http://goo.gl/uvB5Fy

Processing LargeData
- Files (HDF5, bam/sam) and specialized utilities (nco/ncview, [Py/Vi]tables, R, Matlab)
- Relational DBs (SQLite, Postgres, MySQL)
- NoSQLs (MongoDB, CouchDB)
- Binary dumps (Perl's Data::Dumper, Python's pickle)
- Non-storage (pipes, named pipes/FIFOs, sockets)
- Keep it RAM-resident.

Iteration vs Index
- Iteration: read thru the data until you find what you're looking for (grep; slow).
- Index: look up what you're looking for and then go there (ToC, index; very fast).
- BUT: iteration is easy; indexing requires prep.
- If the file will be traversed only a few times, iterate. If it's going to be used repeatedly, index.
- Google is an enormous Index.

BigData Formats
- Unstructured document files
- Structured (row/col) files (Excel, SQL tables)
- XML
- YAML/JSON/NoSQL DBs – Key:Value pairs & variants
- Personal binary dumps (Data::Dumper, pickle)
- Structured binary files – bam; ORCish formats from Hadoop; HDF5/netCDF4 from NASA
- Compressed formats
- Unstructured ASCII files

Data Consumption
It often takes more effort to migrate data consumers to the storage in a new system than to migrate the data itself. Once we have used a hybrid model to migrate data, applications designed to access the data using relational methodology must be redesigned to adapt to the new data architecture. Hopefully, future hybrid databases will provide built-in migration tools to help leverage the potential of NoSQL. For the present, remodeling data is critical before an existing system is migrated to NoSQL. In addition, modeling tools must be evaluated to verify they meet the new requirements.
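The MANIFEST-file idea from the Checksums slide above can be sketched in two commands; the .fq filenames are hypothetical:

```shell
# Before the transfer: record a checksum for every data file.
md5sum data1.fq data2.fq > MANIFEST.md5

# Ship the data files AND MANIFEST.md5 together; on arrival, verify:
md5sum -c MANIFEST.md5
# prints 'data1.fq: OK' per file, and exits nonzero on any mismatch
```

shasum or jacksum drop in the same way; the point is that the manifest travels with the data.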
Physical Data-aware Modeling
For the past 30 years, 'structure first, collect later' has been the data-modeling approach: with it, we determine the data structure before any data goes into the data system. In other words, the data structure definition is ...

Structured ASCII files

    0:ID                 1:BEGIN   2:END  3:Blue  4:Red    5    6      7      8
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263    1    50  -0.67   0.70
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263   70   119  -0.47   0.64
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263  139   188   0.36   0.74
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263  232   281   0.42  -0.19
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263  301   350   0.09  -0.65
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263  370   419   0.22  -0.38
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263  439   488  -0.42  -0.09
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263  508   557   0.61  -0.81
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263  577   626   0.71  -0.45
    ENSXETG00000000007   SCAFFOLD  1356   47264   67263  658   707   1.61  -0.84

Relational Data
- Relational Data is data that relates to other data... formally. That is, it has an 'is a', 'has a', 'is part of', 'talks to', etc.
- Object-oriented data is largely relational.
- I have a first name, last name, age, fashion sense; I ride a bike. I live in Irvine, work at UCI, in the Dept of OIT.
- The numbers & text are the data. The 'relational' part is the metadata: how the numbers relate to each other.

Relational Data – make your data relational
- Often, your data will appear in the form of spreadsheets or flatfiles.
- In order to QUERY the data, you import them into a relational data structure.
- Relational engine vs database? Either way, it's fairly easy:
    import the data into flat tables,
    set up the relationships,
    query away.

XML
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>

YAML
    --- !clarkevans.com/^invoice
    invoice: 34843
    date   : 2001-01-23
    bill-to: &id001
        given  : Chris
        family : Dumars
        address:
            lines: |
                458 Walkman Dr.
                Suite #292
            city   : Royal Oak
            state  : MI
            postal : 48046
    ship-to: *id001

JSON
    {"widget": {
        "debug": "on",
        "window": {
            "title": "Sample Konfabulator Widget",
            "name": "main_window",
            "width": 500,
            "height": 500
        },
        "image": {
            "src": "Images/Sun.png",
            "name": "sun1",
            "hOffset": 250,
            "vOffset": 250,
            "alignment": "center"
        }
    }}

Personal Binary Dumps
    163631823645618364912152ducksheepshark387ratthingpokemon
    %3d, %4d, %5d, %3d, %9.4f, %4s, %10s, %3d, %8s, %7s    # read format

Structured Binary: HDF5
HDF5 (contd)

Parallel Programming – the issues
- What type of parallel programming model to use?
- Problem partitioning
- Load balancing
- Communications between threads, processes, nodes
- Data dependencies
- Synchronization and race conditions
- Memory issues
- I/O issues
- Program complexity
- Programmer effort/costs/time

Is it worth the pain?
Check the performance profile before you even look at parallel:
- perf record [commandline]
- perf report
- also 'oprofile'

Parallel Jobs
- Easy: embarrassingly //, SGE Array Jobs, GNU Parallel, clusterfork, other types of forking, pseudo-forking.
- Harder: untangling data-dependency issues, compiler vectorization, OpenMP.
- Even harder: same-node //, threads.
- Extraordinarily hard: inter-node //, MPI & other horrors.

Parallel – Easy (no programming)
- Embarrassingly // problems, e.g. Array Jobs in SGE iterate a command over a range.
- GNU Parallel.
- Scripts forking other scripts/processes (OK, I lied about the "no programming").

Parallel – Harder
- Untangling data-dependency issues: if X does not depend on Y, do them at the same time. Map/Reduce & Hadoop, Spark.
- Compiler vectorization: some compilers/interpreters will // your code automatically.
- OpenMP: explicit requests to // loops (if done carefully).

Parallel – Even Harder
- Threads on the same node; the differences between threads and processes.
- Threads are fairly easy to do badly, REALLY hard to do well.
- Thread toolkits: Intel's Threading Building Blocks for C++.
- You can also do threads in Python, Perl, Java (although they might not be the same kind of thread).

Remember...
Good Judgment comes from Experience. Experience comes from Bad Judgment.

And Now... QUESTIONS??