Grids@Work V Oracle Coherence for Finance Applications Ewan Slater Senior Solution Specialist
EMEA Technology Fusion Middleware

Topics
• Scalability – why do we care?
• Scalability – what’s the problem?
• Traditional approaches and their drawbacks
• The Coherence approach
• What is Coherence?
• Where does Coherence fit?
• How Coherence works
• Using Coherence
• Coherence in Action
• Conclusion
• Q&A

Scalability – why do we care?

IT Initiatives Driving Scalability Demand
• XTP
  • Highest volume, Low Latency, Absolute Transactional Integrity
• Virtualization Resources
  • Increased demand on Data Sources
  • Application re-provisioning must occur transparently without interruption of data access
  • Must handle multiple load increases at the same time
• SOA
  • Increasing common access to resources
  • Sharing access means continuous availability and absolute reliability
• EDA
  • Event driving transactions causing massive increase in load
  • Pervasiveness driving data need across all systems affected

[Chart labels: Demand, Supply, Time]
The more people have, the more they want!

Software Framework Pressures
• Service Oriented Architecture
• Web 2.0
• Event Driven Architecture
• Extreme Transaction Volumes

Hardware Capacity Impact
• Compute Power: SMP/Multicore
• Memory Arrives: “In Memory Option”
• Network Speed: Gbe/10G/IB
• Storage: Flexibility

Enterprise Manageability Requirements
• Grid Automation
• Service Level Management
• Application Performance Mgmt
• Provisioning

Enterprise Infrastructure Requirements
• Availability – Continuous
• Reliability – Transactional Integrity
• Scalability – Capacity on Demand
• Performance – Zero Latency

Scalability – what’s the problem?

In general, applications don’t scale well…
…what worked fine in development, or for 50 users…
…can’t cope with production demand…
…that increases over time…

Why don’t applications scale?
• Single points of failure (SPOF)
  • Database failure or pause = application failure or pause
  • One server fails, the entire system fails
  • One application or JVM fails, the application fails
• Single points of bottleneck (SPOB)
  • Shared resources
  • The “hub” of Hub-and-spoke architectures
  • Heavy database or disk I/O
• Applications are not designed to scale
  • It works in single-user testing on a PC, but will it work in production?
  • Scaling is often an afterthought – “it’s the DBA’s problem”

Scaling the Application Tier: Traditional Approaches

Scale up (or even bigger boxes)
Approach:
  • Scale-Up
  • “It’s an infrastructure problem”
How:
  • Buy Big Boxes
  • Increase Resources (CPU, memory, HDD capacity, speed and network, etc.)
  • Buy specialized hardware (Azul, Infiniband…)
Advantages:
  • Simple (overnight)
  • No development
  • No impact on internal design
Disadvantages:
  • Expensive
  • Will hit physical limits
  • Will have to redesign at limit
  • Non-graceful deterioration at limit
  • Stop, Add, Restart required to scale

Bigger box = Much Bigger price tag!!!
• High incremental cost
• Wasted capacity
At some point, even the biggest box has its limits!
Stateless application tier (or blame the DBA)
Approach:
  • Stateless Scale-Out
  • “Push state scale-out into lower Data Source layer”
  • “It’s the DBA’s problem”
How:
  • Make application stateless (e.g. stateless sessions)
  • Use lots of stateless servers
  • Use load-balancing
  • Use “big” and “scalable” Data Source to ensure application state scale-out
Advantages:
  • Easy to develop (not overnight, but relatively simple as no state is managed)
  • Scale-out is easy, just add more servers
Disadvantages:
  • Only scales to match underlying Data Source performance
  • When underlying limit is reached, have to redesign
  • Network bottlenecks experienced as data is moved between layers

Performance Bottleneck Between Tiers
[Diagram: Application (Object, Java) and Database (Relational, SQL)]
A HUGE performance bottleneck: Volume / Complexity / Frequency of Data Access

Performance Bottleneck Between Tiers
Solution: Move relevant data to middle tier
[Diagram: Java Application; three Application Servers, each with a Memory Cache holding Objects; Relational Database]
• One Solution is to keep the object data in object form in high-speed distributed memory cache
• Database remains the system of record (persistence)

Caching in the application
Approach:
  • Caching
  • “Keep recent copies of state”
  • “We’ll save the DB and DBA by caching”
How:
  • Application keeps local copies (in memory or on local disk) of recently / commonly used state
Advantages:
  • Seems simple
  • Reduces Data Source and Network load
  • Significant application performance improvements
Disadvantages:
  • Maintaining consistency of data between Local and Data Source instances can be difficult
  • Requires “messaging infrastructure” to ensure consistency across a cluster (and application development)
  • Typically applicable to “read only” applications and not “write a lot” applications
  • Easy to get wrong

Local Caching
Can be scaled out…

Farm Caching
Inconsistent Local Cache

Farm Caching
• Benefits:
  • Same as Local Cache
  • May now scale out
• Constraints:
  • Same as Local Cache - but now worse - across Farm!
  • Singularity broken between members (Incoherent)
  • Members have own copies of Entries
  • No cost savings in making copies to members
  • Cache capacity doesn’t increase with Farm size

Scale out the Container (or blame the App Server)
Approach:
  • Use an Application Container
  • “Our magical clustered container will scale our application infinitely”
  • … It scaled the “Pet Store” linearly, therefore our X application will also scale linearly (where X ≠ “Pet Store”)
How:
  • Believe the vendors & the marketing
  • Follow a “scalability paradigm”
  • Use a “Clustering Container”
Advantages:
  • Simple
  • Well documented and communicable paradigm
  • Easily scale development team
Disadvantages:
  • Typically scales in-the-small
  • Usually relies on “scale-up” rather than “scale-out”
  • Requires specialized skills or products (outside of the standard paradigm) to really scale

Clustering is primarily about High-Availability, not Scalability!

Traditional Scale-Out Approaches…
#1. Avoid the challenge of maintaining consensus
  • Opt for the “single point of knowledge”
    • Client + Server Model (Hub + Spoke)
    • Active + Passive (High Availability)
    • Master + Worker Model (Grid Agents)
#2. Have crude consensus mechanisms that typically fail and result in data integrity issues (including loss)

Traditional Scale-Out Consequences…
• Have unbalanced / unfair load and task management
  • Some servers have greater system responsibility than others
• Have Single Points of Bottleneck (SPoB)
• Have Single Points of Failure (SPoF)
• “Micro outages” are magnified as you scale-out
• Exhibit Strong Coupling to Physical Resources
  • Software completely dependent on individual physical servers
• Require specialized deployment and operation for individual Resources
  • Some servers require “special attention” to operate

The Coherence Approach

So how does Coherence solve the problem?
Consensus is the key…
Imagine a team where some members…
• Have a different impression of the actual members of the team
• Allocate tasks and information to their members (from their perspective) but on behalf of the team
• Result?
  • Inconsistent views of team information
  • Without consensus some information will be inconsistent (at best) or be unavailable or lost (at worst / common)
Real Madrid before Capello

Membership Consensus
• Consensus between resources is fundamental to ensure integrity of information (and work) when scaling-out
Real Madrid after Capello

Coherence relies on Consensus
• Traditional scale-out approaches limit
  • Scalability, Availability, Reliability and Performance
• In Coherence…
  • Servers share responsibilities (health, services, data…)
  • No SPoB
  • No SPoF
  • Massively scalable by design
• Logically servers form a “mesh”
  • No Masters / Slaves etc.
  • Members work together as a team

The result?
Oracle Coherence: In Memory Data Grid

What is Coherence?
(c) Copyright 2007. Oracle Corporation

Oracle Coherence…
• Is an enabling technology that…
• Allows customers to build bullet proof applications…
• And achieve high performance and predictable scalability

Typical Coherence Customers
• Online gaming (e.g. trading system)
• Telcos (e.g. SMS backbone)
• Hospitality (e.g. flight reservation system)
• Insurance (e.g. user profile management)
• Financial Services (e.g. risk engine)
• Public sector (e.g. railway signalling)

Common theme: Mission-critical, bullet-proof solutions
• Reliability
• Availability
• Scalability
• Performance

Coherence doesn’t need an app server
There is a .NET client library…and this is pure .NET
…and…
There is a C++ client library…and this is pure C++

Where does Coherence fit?
Look at the shape of the data

Application Layers
• Web Server
• App Server
• DB Server

Data “Shape” across tiers
• Web Tier (Network, Web Cache, Web Servers): HTML Data Structures in Memory
  • Web Cache offloads Web Servers, Improves Network Performance via Compression
• Application Tier (Application, Coherence Servers): Java Data Structures in Memory
  • Coherence caches Java Structures in Memory; Very Fast Access to Java Data in Memory across MidTier Grid
• Database Tier (Times Ten, RAC): SQL Data Structures in Memory
  • Times Ten & RAC provide Scalability to Database Data improving Query & Transaction Write Performance

What is Coherence not?
• Plug and play - the application code will need to change.
• A database – persistent data will need to be written to a database (Oracle RAC is often an ideal fit).
• A Transaction Processing Monitor.
• A panacea for:
  • Inadequate hardware
  • Badly written applications
  • Poor database design

How Coherence Works

Coherence Works by Consensus
• Consensus is key
  • Communication is more efficient (peer-to-peer)
  • No outages for voting (no need – everyone is a peer)
  • No SPoF, SPoB
  • No need for broadcast traffic (yelling at each other)
• You can do many things once you have “consensus”.
…made possible by TCMP (the “secret sauce”)

Tangosol Cluster Management Protocol (TCMP)
• Coherence’s own protocol between cluster members
• TCMP utilizes UDP
  • Massively scalable
  • Asynchronous
  • Point-to-point
• UDP Multicast is used for:
  • New JVMs to join the cluster automatically
  • Maintaining cluster membership
  • Multicast is not required; it may be disabled with Well Known Addresses (WKA)
• UDP Unicast is used for most communication
  • Very fast and scalable
• TCMP guarantees packet order and delivery
• TCP/IP connections do not need to be maintained

Distributed caching for your data…
…and go faster stripes for your data

Hardware implications (Blades not Bludgeons)
Big Iron
• Buy based on predicted growth
• High incremental cost
Low cost clusters
• Buy as you grow
• Small increments at present day prices & clock speeds

Using Coherence

Building an Application
• Developers use Coherence API to
  • Access Data
  • Listen for Events
  • Query Data
  • Process Data in the Grid

Setting up a grid
• Coherence clusters to form a grid out of the box (OOTB)
• A grid may contain many caches
• A cache structure is defined by a scheme
• Schemes are defined in config files

Distributed Data Management (access)
The Distributed Scheme (one of many)
In-Process Data Management

Distributed Data Management (update)

Distributed Data Management (failover)

Distributed Data Management
• Members have logical access to all Entries
  • At most 2 network operations for Access
  • At most 4 network operations for Update
  • Regardless of Cluster Size
  • Deterministic access and update behaviour (performance can be improved with local caching)
• Predictable Scalability
  • Cache Capacity Increases with Cluster Size
  • Coherence Load-Balances Partitions across Cluster
  • Point-to-Point Communication (peer to peer)
  • No multicast required (sometimes not allowed)
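The "at most 2 network operations for Access, at most 4 for Update, regardless of cluster size" claim on the Distributed Data Management slide can be illustrated with a toy model. The sketch below is not the Coherence API: the class, its methods, the key-to-partition hashing, the "backup lives on the next member" rule and the way hops are counted (request plus response to the primary owner, then a forward plus acknowledgement to the backup owner) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a partitioned cache. This is NOT the Coherence API:
// every class and method name here is hypothetical.
class PartitionedCacheSketch {
    private final int memberCount;
    private final int partitionCount;
    private final List<Map<String, String>> primaryStores = new ArrayList<>();
    private final List<Map<String, String>> backupStores = new ArrayList<>();
    private int lastOpHops; // network hops used by the most recent get/put

    PartitionedCacheSketch(int memberCount, int partitionCount) {
        this.memberCount = memberCount;
        this.partitionCount = partitionCount;
        for (int i = 0; i < memberCount; i++) {
            primaryStores.add(new HashMap<>());
            backupStores.add(new HashMap<>());
        }
    }

    // Key -> partition -> member that owns the primary copy.
    int primaryOwner(String key) {
        int partition = Math.floorMod(key.hashCode(), partitionCount);
        return partition % memberCount;
    }

    // Assumption: the backup copy lives on the next member, so it sits on a
    // different member whenever the cluster has more than one member.
    int backupOwner(String key) {
        return (primaryOwner(key) + 1) % memberCount;
    }

    // A message between two different members costs one hop; local calls cost none.
    private static int hop(int from, int to) {
        return from == to ? 0 : 1;
    }

    String get(int caller, String key) {
        int owner = primaryOwner(key);
        lastOpHops = 2 * hop(caller, owner); // request + response
        return primaryStores.get(owner).get(key);
    }

    void put(int caller, String key, String value) {
        int owner = primaryOwner(key);
        int backup = backupOwner(key);
        primaryStores.get(owner).put(key, value);
        backupStores.get(backup).put(key, value);
        // caller -> owner, owner -> backup, backup ack, owner ack
        lastOpHops = 2 * hop(caller, owner) + 2 * hop(owner, backup);
    }

    // Simulated failover: the backup holder serves the entry if the primary is gone.
    String getAfterPrimaryFailure(String key) {
        return backupStores.get(backupOwner(key)).get(key);
    }

    int lastOpHops() {
        return lastOpHops;
    }

    public static void main(String[] args) {
        PartitionedCacheSketch cache = new PartitionedCacheSketch(4, 257);
        cache.put(0, "EUR/USD", "1.47"); // illustrative key and value
        System.out.println("put hops: " + cache.lastOpHops());
        String value = cache.get(3, "EUR/USD");
        System.out.println("get -> " + value + ", hops: " + cache.lastOpHops());
    }
}
```

In this model the hop count depends only on whether the caller, the primary owner and the backup owner are the same member, never on how many members exist, which is the property the slide describes.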
Data Distribution: Clients and Servers
“Clients” with storage disabled
“Servers” with storage enabled

Near Caching (L1 + L2) Topology

Observing Data Changes

Parallel Queries

Parallel Processing and Aggregation

Data Source Integration (read-through)

Data Source Integration (write-through)

Data Source Integration (write-behind)

Coherence*Extend WAN Topology

Oracle Coherence in Action

Example Use Cases
• Mainframe Cost Reduction
  • Caching repeated queries
• Oracle Coherence with Compute Grid
  • Intra-day risk calculation
• Oracle Coherence Cloud
  • Message-based infrastructure replacement
• Eliminating SPoB
  • Trading Exchange Redevelopment

Mainframe Cost Reduction
Taming the MIP Monster
• Retail banking IT provider
  • Supports 400+ banks
  • 4 key systems – repeated queries to mainframe
  • 100,000 queries to mainframe each day
  • Large recurring cost to the business
• Coherence deployed as distributed cache
  • 100,000 queries → 1,600 queries
  • Saving ~€1,000,000 in 1st year

Oracle Coherence with Compute Grid

Compute Grid on Database
Traditional Compute Grid
[Diagram labels: Grid Applications, Grid Manager]
• Emphasis on orchestrating tasks out to compute nodes in grid
• Data Set either loaded locally or pulled off of back end data source
• Applications Highly Customized for Grid Environment
Great processing scalability with inevitable data bottlenecking
Orchestration can be point of bottleneck as well

Compute Grid on Data Grid
Traditional Compute Grid with Data Scale Out
High Performance Computing (HPC)
[Diagram labels: Grid Applications, Grid Manager, Oracle Coherence, Oracle RAC]
• Oracle Coherence Data Grid Overlay onto Compute Grid
• Compute Grid Scale Out with Data Fault Tolerance
• Massive Persistent Scale Out with Oracle RAC

Customer Story: Wachovia
Scenario
• Wachovia Investment Bank introducing “Service Oriented Infrastructure (SOI)”
• Requires absolute data availability for complex Grid Computations
Problem
• Existing Compute Grid infrastructure suffering from data latency and throughput problems
• Complex calculations so lengthy as to be outdated
Solution
• Data Grid overlay on Compute Grid
• Enable risk calculations to fully utilize the grid hardware by having real time access to in-memory data as well as parallelization.
• Reduced critical risk computation from 50 days to under 1 hour!
Over 300 CPUs in Production!

Oracle Coherence Cloud

The challenge: Scale this...
• Domain: Retail Banking Infrastructure
  • Over 500 Banks
  • 100,000+ Teller Staff Desktop Applications
  • 10,000+ Cash Machines (ATMs)
  • 10,000,000’s of Internet Banking Transactions/day
• Current Infrastructure
  • Java SE based (no J2EE – apart from Servlets)
  • Oracle RAC (not an issue – scaling across a WAN)
  • Messaging (serious challenges)
  • Processing Business Tasks (challenges approaching)
  • 30,000,000+ Business Tasks a day – minimum.
    • must do 100,000,000 effortlessly per day before going live

The challenge continued: Scale this...
• Execution of Business Tasks
  • Account Balance, Credit/Debit, Funds Transfer, Statement Processing, Batch Processing, Payment Processing
  • Tasks arrive from a variety of clients (thin, rich, cross-platform, mainframes...) – variety of languages
• Goal:
  • Tasks are executed by the “cloud”
  • Don’t want to build own “cloud” software
The Cloud
• Their knowledge:
  • Massive experience in scale-out. Could build it themselves, but budget (time/resources/money) will be saved by buying.
Architectural issue: Performance Bottleneck Between Tiers
[Diagram: Application (Object, Java) and Database (Relational, SQL)]
A HUGE performance bottleneck: Volume / Complexity / Frequency of Data Access
(in some companies, this would be the time to blame the DBA)

Constraints...
• No Single Points of Failure
• No Single Points of Bottleneck
• No Service Registries
• No Masters + Workers
  • already got one that is partitioned into over 200 separate clusters
• No Manual Partitioning
• Keep everything in Memory
• Active + Active Sites
  • Across WAN
• Develop system on a notebook
• Scale to over 500 servers
• No reconfiguration outages
• No byte-code manipulation / proxies
• No Data or Task Loss
  • During failure
  • During server upgrade
  • During scale out
• No Transactions (XA)
• Support multiple versions
• Predictable response times
• Predictable scale out costs
• Manage via JMX, from any point in the “Cloud”.
• Pure Java Standard Edition
• Infrastructure adds a maximum of 3ms latency to tasks.
• Integrate with existing applications (Java 1.4.2+)

Approach
• Business Tasks are regular Java objects (POJOs)
• Place Business Tasks into Coherence
  • Coherence dynamically distributes Tasks across the Cluster
  • Tasks are resilient in the Cluster
  • May use “affinity” to ensure related Tasks are processed together
  • Coherence triggers task processing
• Scaling out Coherence = Scaling out Task Processing
(c) Copyright 2008. Oracle Corporation

List of the Performed Tests
• Scalability Test
• Guaranteed Delivery Test
• Failover Test
• Server Joining Test
• Unattended Long Term Test

Results
• While submitting Tasks (regular system load)
  • Test 1: Scale from 1 server to over 400
    • No reconfiguration
  • Test 2: Randomly kill servers
    • No reconfiguration
  • Test 3: Kill 1, 2, 4, 8, 16, 32, 64, 128, 160 servers at once
    • No data loss
• Possible 1,200,000,000 Tasks execution capacity per day
• Client may reduce current hardware costs by 75%
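To make the Approach and Results slides more concrete, here is a minimal sketch of business tasks as plain Java objects routed by an affinity key, plus the arithmetic behind the daily volumes quoted above. It does not use the Coherence API. The `Task` class, the `accountId` affinity key, and the `serverFor`, `distribute` and `tasksPerSecond` helpers are hypothetical names, and routing by hash modulo the server count is a simplification (the slides describe Coherence load-balancing partitions across the cluster instead).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: none of these names come from the Coherence API.
class TaskAffinitySketch {

    // A business task as a plain Java object (POJO).
    static final class Task {
        final String accountId; // assumed affinity key: related tasks share it
        final String action;

        Task(String accountId, String action) {
            this.accountId = accountId;
            this.action = action;
        }
    }

    // Route by affinity key so that all tasks for one account land on one server.
    static int serverFor(Task task, int serverCount) {
        return Math.floorMod(task.accountId.hashCode(), serverCount);
    }

    // Group the submitted tasks by the server that would process them.
    static List<List<Task>> distribute(List<Task> tasks, int serverCount) {
        List<List<Task>> byServer = new ArrayList<>();
        for (int i = 0; i < serverCount; i++) {
            byServer.add(new ArrayList<>());
        }
        for (Task task : tasks) {
            byServer.get(serverFor(task, serverCount)).add(task);
        }
        return byServer;
    }

    // Average rate implied by a daily volume (86,400 seconds per day).
    static double tasksPerSecond(long tasksPerDay) {
        return tasksPerDay / 86_400.0;
    }

    public static void main(String[] args) {
        List<Task> tasks = new ArrayList<>();
        tasks.add(new Task("acct-1", "debit"));
        tasks.add(new Task("acct-2", "balance"));
        tasks.add(new Task("acct-1", "credit"));

        List<List<Task>> byServer = distribute(tasks, 4);
        for (int i = 0; i < byServer.size(); i++) {
            System.out.println("server " + i + ": " + byServer.get(i).size() + " task(s)");
        }
        System.out.printf("go-live target: %.0f tasks/s on average%n",
                tasksPerSecond(100_000_000L));
        System.out.printf("tested capacity: %.0f tasks/s on average%n",
                tasksPerSecond(1_200_000_000L));
    }
}
```

Spread evenly over a day, the 100,000,000 tasks per day go-live target works out to roughly 1,157 tasks per second, and the 1,200,000,000 tasks per day capacity from the tests to roughly 13,889 per second. Real load would peak well above those averages.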
Eliminating Single Point of Bottleneck

Trading Exchange
• Similar requirements and constraints
• Order processing (Foreign Exchange)
• 1,000’s per second (initial) per currency pair
• No manual partitioning
• No transactions
• 10ms max latency for full accept, validate, match, respond
  • Achieved with Coherence using BMLs (< 3ms)
• 14 weeks development (start to go live)

Previous Approach (failed to meet SLAs)

Coherence-based Solution

Conclusion

Oracle Coherence…
• Is an in-memory object data grid, providing
  • Scalability
  • Availability
  • Reliability
  • Performance
• Supports many mission-critical apps, especially in Financial Services
• Integrates with and supports other technologies:
  • Compute Grids
  • Database Grids
  • C++, .NET
• Is a key component of Oracle’s XTP platform

Q&A