DROSS Andrew Hardie ECPRD WGICT 17-21 November 2010
Transcription
DROSS Distributed & Resilient Open Source Software Andrew Hardie http://ashardie.com ECPRD WGICT 17-21 November 2010 Chamber of Deputies, Bucharest 17-21 November 2010 ECPRD - WGICT - Bucharest 1 Topics Distributed, not virtualized or ‘cloud’ DRBD Gluster Heartbeat Nginx Trends: • NoSQL • Map / Reduce • Cassandra, Hadoop & family Other stuff ‘out there’ Predictions… 17-21 November 2010 ECPRD - WGICT - Bucharest 2 DRBD Block-level disk replicator (effectively, net RAID-1) 17-21 November 2010 ECPRD - WGICT - Bucharest 3 DRBD – Good/bad points Good for HA clusters (e,g, LAMP servers) Ideal for block-level apps, e.g. MySQL Sync/Async operation Auto recovery from disk, net or node failure In Linux kernels from 2.6.33 (Ubuntu 10.10 is 2.6.35) Supports Infiniband, LVM, XEN, Dual primary config Hard to extend beyond two systems, three is maximum Remote offsite really needs DRBD Proxy (commercial) Requires dedicated disk/partition Moderately difficult to configure Documentation could be better 17-21 November 2010 ECPRD - WGICT - Bucharest 4 Gluster Filesystem-level replicator More like NAS than RAID Claims to scale to petabytes Nodes can be servers, clients or both On the fly reconfig of disks & nodes Scripting interface ‘Cloud compliant’ (isn’t everything?) 17-21 November 2010 ECPRD - WGICT - Bucharest 5 Gluster – Use case - Dublin Real-time mirroring of Digital Audio 17-21 November 2010 ECPRD - WGICT - Bucharest 6 Gluster – Good/bad points Moving to “turnkey system” (black box) N-way replication easy Easier than DRBD to configure Dedicated partitions or disks not required Supports Infiniband Background self-healing (pull rather than push) Aggregate and/or replicate volumes POSIX support Native support for NFS, CIFS, HTTP & FTP No specific features for slow link replication Similar documentation vs revenue earning tension 17-21 November 2010 ECPRD - WGICT - Bucharest 7 Heartbeat HA Cluster infrastructure (“cluster glue”) Needs Cluster Resource manager (CRM), e.g. Pacemaker, to be useful Part of the Linux-HA project Provides: hot-swap of synthetic IP address between nodes (Synthetic IP is in addition to node’s own IPs) Node failure/restore detection Start/stop of services to be managed, via init scripts 17-21 November 2010 ECPRD - WGICT - Bucharest 8 Heartbeat/DRBD – use case HA LAMP Server pair 17-21 November 2010 ECPRD - WGICT - Bucharest 9 Heartbeat – good/bad points Lots of resource agents available e.g. Apache, Squid, Sphinx search, VMWare, DB2, WebSphere, Oracle, JBOSS, Tomcat, Postfix, Informix, SAP, iSCSI, DRBD, … Beyond simple 2-way hot-swap, config can get very complicated Good for stateless (e.g. HTTP); not so good for file shares (e.g. Samba) Documentation out of date in some areas, e.g. Ububtu ‘upstart’ scripts (boot-time startup of services to be managed by Heartbeat has to be disabled) 17-21 November 2010 ECPRD - WGICT - Bucharest 10 NGINX Fast, simple Russian HTTP server Reverse proxy server Mail proxy server Fast static content serving Very low memory footprint Load balancing and fault tolerance Name and IP based virtual servers Embedded Perl FLV streaming Non-threaded, event-driven architecture Modular architecture Can front-end Apache (instead of mod_proxy) 17-21 November 2010 ECPRD - WGICT - Bucharest 11 Trends – NoSQL, etc… NoSQL Or, is it really NoACID (atomicity, consistency, isolation, durability)? It’s really the ACID that’s hard to scale, esp. in the very large, very active data stores (e.g. 
Slide 12: Trends – NoSQL, etc.
• NoSQL – or is it really "NoACID" (atomicity, consistency, isolation, durability)? It is really the ACID that is hard to scale, especially in very large, very active data stores (e.g. social networks)
• Some NoSQL systems now have SQL for querying only
• Ways of solving ACID scalability are being discussed
• The problems:
  • huge numbers of simultaneous updates
  • large JOINs across very large tables (i.e. big SQL queries)
  • lots of updates and searches on small data elements in vast data sets
• The alternative: key/value stores and de-normalized data

Slide 13: Consequences of de-normalizing
• Order(s) of magnitude increase in storage requirements
• Difficulty of updating numerous "key equivalents" in many places – it can't be done synchronously
• Breaking relationship links allows parallel processing, which helps with the bottleneck of storage read speed (storage capacity is growing much faster than transfer rates)
• No JOINs or transactions

Slide 14: Name/Value models
• Just name/value pairs, e.g. memcachedb, Dynamo
• Name/value pairs plus associated data, e.g. CouchDB, MongoDB – think document stores with metadata
• Name/value pairs with nesting, e.g. Cassandra

Slide 15: Cassandra
• Distributed, fault-tolerant database, based on ideas from Dynamo (Amazon) and BigTable (Google)
• Developed by Facebook, open-sourced in 2008; now an Apache project
• Key/value pairs, in a column-oriented format:
  • Standard column: name, value, timestamp
  • Super-column: name, plus a map of columns, each with name, value, timestamp (think array of hashes)
  • Columns are grouped by column family, itself either standard or super
  • A column family contains 'rows', roughly like a database table
  • Column families then go in keyspaces

Slide 16: Cassandra – NoACID
• Cassandra et al. (e.g. Voldemort at LinkedIn) trade consistency and atomicity for speed, distribution and availability
• No single point of failure
• "Eventually consistent" model, with tunable levels of consistency
• Atomicity is only guaranteed within a column family
• Accessed using Thrift (also developed by Facebook)
• Used by: Facebook, Digg, Twitter, Reddit

Slide 17: NoSQL for Parliaments?
• Much parliamentary material is naturally unstructured and suited to the name/value model (think XML)
• Remember the old discussions about how to map such parliamentary material into relational databases?
• Think of every MP's contribution (speech) in chamber or committee as a key/value pair, i.e. a column
• Think of every PQ and answer as a super-column of name/value pairs for question, answer, holding, supplementary, pursuant, referral…
• Hansard becomes a super-column family!

Slide 18: Map/Reduce
• Column- (or record-) oriented design and de-normalized data power the parallel "map/reduce" model (think "sharding on speed") – see the sketch after the Hadoop slide

Slide 19: Hadoop
• Nothing to do with NoSQL
• Hadoop is an infrastructure, and now a family of tools, for managing distributed systems and immense datasets
• How immense? Hundreds of GB and a 10-node cluster is 'entry level' in Hadoop terms
• Developed by Yahoo for their cloud; now an Apache project
• Supports map/reduce by pre-dividing and distributing data – it "moves computation to the data instead of data to the computation"
• The HDFS file system is particularly interesting: distributed and resilient (far more advanced than DRBD or Gluster), but not real-time (more eventually consistent…)
• Hive data warehouse front end – has SQL-like queries
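Since the Map/Reduce and Hadoop slides stay at the conceptual level, here is a minimal, single-process sketch in Python of the model they describe: de-normalized key/value records (speeches), a map step that processes each record independently (and so could run in parallel across many nodes), and a reduce step that aggregates the intermediate pairs. The record layout and field names are invented for illustration; a real Hadoop job would normally be written in Java or run through Hadoop Streaming.

# Toy, single-process illustration of the map/reduce model over
# de-normalized key/value records. Field names are invented; a real
# Hadoop job distributes the map and reduce phases across nodes.
from collections import defaultdict

# Each speech is a self-contained record: no JOINs needed to process it.
speeches = [
    {"key": "2010-11-17/committee/042", "member": "A. Popescu",
     "text": "the budget the committee considered"},
    {"key": "2010-11-17/chamber/007", "member": "B. Ionescu",
     "text": "the committee divided on the budget"},
]

def map_phase(record):
    # Emit (word, 1) pairs; this depends only on one record, so it parallelises.
    for word in record["text"].split():
        yield word, 1

def reduce_phase(word, counts):
    # Combine all the values emitted for one key.
    return word, sum(counts)

# "Shuffle": group the intermediate pairs by key, then reduce each group.
grouped = defaultdict(list)
for record in speeches:
    for word, count in map_phase(record):
        grouped[word].append(count)

totals = dict(reduce_phase(w, c) for w, c in grouped.items())
print(totals)   # e.g. {'the': 4, 'budget': 2, 'committee': 2, ...}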
Slide 20: Who uses Hadoop?
• Twitter, AOL, IBM, Last.fm, LinkedIn, eBay
• Yahoo: 36,000 machines with more than 100,000 cores running Hadoop, although the largest single cluster is only 4,000 nodes
• The largest known cluster is Facebook's: 2,000 machines with 22,400 cores and 21 petabytes in a single HDFS store

Slide 21: Hadoop for Parliaments?
• Hadoop may seem overkill for parliaments now…
• But when you start your legacy-collection digitization and digital preservation projects, its model – managing large datasets which essentially do not change and do not need real-time commit – is a very good fit!
• Other interesting Hadoop projects:
  • ZooKeeper (distributed application co-ordination)
  • Hive (data warehouse infrastructure)
  • Pig (high-level data-flow language)
  • Mahout (scalable machine-learning library)
  • Scribe (for aggregating streaming log data) [not strictly a Hadoop project, but it can be integrated with Hadoop, using an interesting workaround for the non-real-time behaviour and the NameNode single point of failure]

Slide 22: Other things 'out there'
Drizzle
• A database "optimized for Cloud infrastructure and Web applications"
• "Design for massive concurrency on modern multi-cpu architecture"
• But it doesn't actually explain how to use it for these…
• It is SQL and ACID
• Mostly seems to be a reaction against what is happening at MySQL…
• Has to be compiled from source – no distro packages available for it yet
CouchDB
• Distributed, fault-tolerant, schema-free, document-oriented database
• RESTful JSON API (i.e. a Web front end) – a small sketch is appended after the last slide
• Incremental replication with bi-directional conflict detection
• Written in Erlang (a highly reliable language developed by Ericsson)
• Supports 'map/reduce'-like querying and indexing
• An interesting model, different from most other offerings; also now an Apache project
• Still too immature for anything beyond experimentation

Slide 23: Also 'out there'
Voldemort
• Another distributed key/value storage system, used at LinkedIn
• Doesn't seem to have much of a future – Cassandra is similar, better and more widely used
MonetDB
• A "database system for high-performance applications in data mining, OLAP, GIS, XML Query, text and multimedia retrieval"
• SQL and XQuery front ends
• Also hard to see where it's going…
MongoDB
• Tries to bridge the gap between RDBMS and map/reduce
• JSON document storage (like CouchDB)
• No JOINs, no transactions; atomic operations are supported only on single documents
• Interesting, but may 'fall between two stools'

Slide 24: Predictions
• Hadoop and Cassandra are the ones to watch
• There will likely be some sort of re-convergence between NoSQL and query languages of some kind – you can't do everything with map/reduce (especially not ad hoc queries)
• SQL may be destined to become like COBOL – still around and running things, but not something to use for new projects
• Distributed storage models (with or without map/reduce) have a good future
• Datasets will only get bigger – compliance, audit, digital preservation, the shift to visuals, etc.
• Information management models ("strategy") and access speed will remain the key problems

Slide 25: Questions
"What's it all about?"
http://ashardie.com
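Addendum: as flagged on the 'Other things out there' slide, CouchDB exposes a RESTful JSON API. The sketch below is a minimal illustration of that interface from Python; it assumes a local CouchDB instance on the default port 5984, and the database and document names are invented for the example.

# Minimal sketch of CouchDB's RESTful JSON API from Python.
# Assumes CouchDB is running on localhost:5984; names are illustrative.
import json
import urllib.request

BASE = "http://127.0.0.1:5984"

def couch(method, path, body=None):
    # Every CouchDB operation is an ordinary HTTP request carrying JSON.
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Create a database, store a schema-free document, then read it back.
print(couch("PUT", "/divisions"))                       # {'ok': True}
print(couch("PUT", "/divisions/2010-11-17-1",
            {"motion": "Second reading", "ayes": 312, "noes": 289}))
print(couch("GET", "/divisions/2010-11-17-1"))

Creating a database, storing a schema-free document and reading it back are all plain HTTP calls, which is what makes CouchDB straightforward to sit behind a Web front end.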