What can I do to Protect DB2 Against Warehousing
What can I do to Protect DB2 Against Warehousing, WebSphere, and Itself
Adrian Burke, DB2 SWAT Team, SVL
[email protected]
#ibmiod

Please note
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Topics of discussion
• Warehouse
  o Customer example
• Virtual/Real Storage
  o What it really looks like
• WebSphere (any application server)
  o Environmental topics
  o WLM thread classification
  o Connection limits

Warehouse Experiences
• Overview
  o One query that builds the MQTs, one query to get summary data (5 hours and 3 hours elapsed, respectively)
• From the perspective of:
  o DB2
  o The LPAR
  o WLM
  o DASD
To blame or not blame…. DB2
The Secret to Success is Knowing Who to Blame

Accounting Report
• Where is elapsed time spent and what does that mean for performance investigation?
• Notice SE CPU, no longer reported as ‘IIP CPU’
• Vast majority of parallel tasks ran on zIIP
• The largest element of elapsed time was not-accounted-for time
  o CPU starvation?
  o Massive paging?
  o Performance traces?
• Class 3 suspense time is also larger than the actual CPU time, generally due to Other Read I/O and Sync I/O (more to come)

WLM and Unaccounted for Time
• WLM needs a donor and a recipient
• It will gauge whether or not the ‘transplant’ of resources is warranted:
  o Can the bottleneck be improved by a service class bump?
  o Can the recipient’s performance be significantly improved?
  o Will stealing from the selected donor help the situation?
  o Does WLM have all the data it needs?
  o Skip Clock?
• More on WLM later
(Chart: WLM performance index for both DB2 and the warehouse workload)

Parallelism Investigation
• RMF Spreadsheet Reporter Response delay report
  o Part of WLM Activity report
• DSNTIJUZ
  o PARAMDEG?
  o BUFFERPOOL VPPSEQT?
  o It does not look like parallelism was hampered here
• Lots of unaccounted for time
  o OMPE accounting
  o This block does not show child task class 2 time
• SYS1.PARMLIB (IEAOPTxx)
  o IIPHONORPRIORITY = NO
    • 3 parallel tasks waiting for 1 zIIP
  o ZIIPAWMT also important here if delays seen in I/O in V10
    • This becomes very important in V10 and V11
• CPU delay at about 33%, and the zIIP suspense time at 34%

I/O Suspicions
• Database I/O could be blamed on synch I/O average of 30 ms at times (8.6 ms total average)
• Other Read I/O
  o Usually prefetch time
  o Obviously a lot of prefetch
    • ROT (rule of thumb): 0.4 ms per page
    • We saw >1 ms in this case
  o Maybe look for Sync Read Seq. (VPSEQT)
    • This means there was an asynch I/O request, followed by a synch I/O because the page was thrown out too quickly
    • Affected by BP size and VPSEQT/VPPSEQT settings

What is causing the I/O delay
• DASD device or I/O subsystem could be part of issue, but what else?
• Other read I/O means prefetch is a significant factor
• Vicious cycle of CPU then I/O starvation
• WLM would monitor IOSQ and Pend time (next slide)
• If I/O delays are not calculated into the performance index, WLM cannot correct the root cause of the delay
• If DB2 is starved for CPU it cannot schedule a prefetch operation
(Charts on this slide use a scale of 100%)

DASD subsystem
• Top 10 volume response time chart
  o Primary axis is number of I/O systems queue events per second
  o Secondary axis is response time
• Response time =
  o Connect + disconnect + pend + IOSQ time
  o If control unit has 2 requests to same volume, 1 will queue
  o This was resolved for the most part with Parallel Access Volumes (Hiper is best)
  o Or moving physical data sets around to avoid LCU and volume contention
• ROT: If IOSQ is more than half of (DISC+CONN) then it should be investigated

Capturing documentation for IBM for access path regression
• Methods for capturing documentation for all releases are documented here
  o https://www.ibm.com/support/docview.wss?uid=swg21206998
  o OSC and DB2PLI8 do not support DB2 10
• SYSPROC.ADMIN_INFO_SQL supports V8 -> V10 (Required)
  o Excellent developerWorks article here:
    • http://www.ibm.com/developerworks/data/library/techarticle/dm-1012capturequery/index.html
  o It is installed in V10 base and is subject to the installation verification process
    • DB2HLQ.SDSNSAMP(DSNTESR) will create and bind it
    • Calling program is DSNADMSB, and sample JCL is in DSNTEJ6I
  o Ensure DB2 9 and DB2 10 have APAR PM39871 applied
• Data Studio V3.2 incorporates this procedure into a GUI (Best Practice)
  • http://www.ibm.com/developerworks/downloads/im/data/#optional
  • No charge product, replacement for OSC and Visual Explain
  • Single option to download with V3.2
  • Incorporates Statistics Advisor
  • Query Environment Capture used to collect doc.
    – FTP doc directly to DB2 Level 2 in tool
  • Can be used to duplicate stats in TEST environment

Possible Solution: engage the appropriate resources
• New z196
  o Larger processors
  o More memory
  o Page-fix workfile bufferpool
  o Workfile best practices Info APAR
    • II14587 - http://www-01.ibm.com/support/docview.wss?uid=isg1II14587&myns=apar&mynp=DOCTYPEcomponent&mync=E
• New DS8000 Shark with parallel access volumes
• WLM goals adjusted to reflect warehouse as a business critical workload
  o And allow WLM to manage it (IIPHONORPRIORITY=YES) and >1 zIIP
• **Management buy-in that ‘workload’ was a victim not the instigator

Virtual/Real Storage Monitoring
• What tools can you use
• How to make good decisions regarding ZPARMS
• Customer examples of virtual and real storage consumption
(Diagram: EDM storage relative to the 2GB bar in V8, V9 and V10, enabled in CM. Labels: EDM Pool (EDMPOOL), Skeleton Pool (EDM_SKELETON_POOL), Global Stmt Pool (EDMSTMTC), DBD Pool (EDMDBDC), SKCT / SKPT, CT / PT, working storage, others)

Where do I stand with respect to storage?
• MEMU2.zip is a REXX exec (no support provided), available free on IBM developerWorks under the DB2 for z/OS Exchange > REXX
  o https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/270000K6H5
  o Or Google > IBM developerworks MEMU2
  o DB2 V8 and DB2 9 use one version; another version is specific to DB2 10
  o MEMU2 REXX – outputs IFCID225 info when invoked as a batch job
  o MEMUSAGE REXX – outputs IFCID225 if invoked from TSO Option 6
    • Returns IFCID225 immediately for a one-time snapshot
• InfoCenter V10 real storage estimate – you provide working storage size
  o http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z10.doc.inst/src/tpc/db2z_calcrealstgreqs.htm

The ramp up
• As more work gets done the stack storage grows (high use thread storage DB2 keeps around)
• Prefetch, deferred write, castout engines all increase on demand and occupy storage, only released if DB2 is bounced
  o 600 prefetch engines
  o 300 deferred write engines
  o 300 castout engines
  o 300 GBP write engines
  o 500 P-lock notify/exit engines
• The global (above) and local dynamic statement cache (below - KEEPDYNAMIC), and EDM pool
  o RELEASE(DEALLOCATE) threads grow over time as well
• DB2 keeps hold of much of the storage, assuming that if it was that busy once it will be again

System engines reduced 31-bit footprint in V10
• Reduced by 90%, much like the user thread footprint
  o Compare total agent system storage and sum of engines
  o In this example: total of 110 engines in DB2 9 and 164 in DB2 10
• These should be static when making estimates on number of threads supported by virtual and real storage
*(blue line is storage in MB, right axis is engines)

Actual thread footprint in SAP subsystem
• V9 -> V10 31-bit footprint decreased 73%
  o 1.39MB -> 0.38MB
• Estimated max number of active threads increases about 3x
  o This number will be more accurate the more threads are in the system
• Done with MEMU2 in a spreadsheet through calculated fields

Memory Monitor
• DB2 9 Automated Memory Monitor
  o Built-in monitor runs from startup to shutdown and checks the health of the system at one-minute intervals.
  o When the DBM1 storage below the bar reaches thresholds of 88, 92, 96, or 98 percent of the available storage, messages (DSNV508I, DSNV510I, DSNV511I, and DSNV512I) are issued.
• PM38435 respects storage reserved for cushion in %, and lists it
• -DISPLAY THREAD(*) SERVICE(STORAGE)
  o DSNV492I LONG + VLONG is storage below 2GB bar (V9)
  o Using display command we see a 72% below the bar reduction
    • Not including stack storage

Storage contraction
• 3 critical numbers for calculating cushion
• Storage reserved for must complete (e.g. ABORT, COMMIT) – QW0225CR/PVTCRIT
  o = (CTHREAD+MAXDBAT+1)*64K (Fixed, real value) +25M
• Storage reserved for open/close of datasets – QW0225MV/PVTMVS
  o = (DSMAX*1300)+40K (space reserved in low private)
• Warning to contract – QW0225SO/PVTSOS
  o = Max (5% of Extended Region Size, QW0225CR -25M)
• Storage Cushion = QW0225CR + QW0225MV + QW0225SO
  o Note: QW0225MV will decrease as page sets are opened
• In full system contraction DB2 grabs the LPVT latch so current threads cannot request storage and no new units of work can get in…
(Diagram regions: "In here DB2 hits full system contraction", "In here DB2 threads abend (storage critical)", "In here DB2 can come down")

Respect the cushion
• Storage critical means threads will be abended to gain storage
  o If must complete work fails to get storage, like for a roll-back, or an unconditional request for storage comes in, then DB2 could come down
  o Need to build in extra 100MB in cushion and set MAXDBAT+CTHREAD appropriately (that is SWAT team’s approach when sizing)

MVS Storage DB2 10
• Almost everything is above the bar:
  o 6GB Common per subsystem: z/OS 1.10 has 64GB as default (D VIRTSTOR,HVCOMMON)
    • If you have many subsystems on same LPAR z/OS default may not be enough
  o 128GB Shared per subsystem came in V9: now in V10 this is where all the thread storage went (D VIRTSTOR,HVSHARE)
    • In V9 only DRDA comm. area and trusted context ran here
• We still have 31-bit stack for agents (threads)
  o 16K for system agents
  o 32K for attach
  o xPROCS still there, and reported in IFCID 225

What affects the storage?
• 64-bit private
  o Mostly buffer pools
  o Fixed RID pool
  o EDM pool
• 64-bit shared thread and system
  o Used to execute SQL
  o Variable - dynamic statement caches, CT, PT, SKCT, SKPT
  o *Compare to number of system and user agents/threads, parallelism
• 64-bit stack
  o Just as 31-bit stack in V9, used by thread to execute SQL
  o *same
• 64-bit common
  o Distributed agents, package accounting, rollup
• All are affected by REALSTORAGE_MANAGEMENT

REALSTORAGE_MANAGEMENT (Discard Mode)
• OFF: Do not enter discard mode unless the REALSTORAGE_MAX boundary is approached OR z/OS has notified us that there is a critical aux shortage
• ON: Always operate in discard mode. This may be desirable for LPAR with many DB2s or dev/test systems
• AUTO (the default): When significant paging is detected, discard mode will be entered
• Important notes:
  o Discard mode is not exited immediately upon relief, to avoid constant toggling in and out of this mode
  o Discard mode shows <1% CPU degradation with no detectable impact to running workloads
  o DSNV516I – beginning storage discard mode
  o DSNV517I – ending storage discard mode

V10 storage monitoring
• Possible to map real storage fluctuations and real storage available on the LPAR via IFCID 225 and MEMU2
  o DB2 directly affects the real storage available on the LPAR (shown in the chart on the slide)
  o Large sorts and other concurrent workload play a factor here
  o We can see that when REALAVAIL drops, DB2 sometimes gets paged out to AUX, even though very little paging registered on the system

Reality Check: Storage summary
• V9 - virtual
  o IFCID 225 shows real and virtual storage
    • Full system contraction
    • Storage creep / non-DB2 storage in DBM1
    • Storage for threads
  o The race begins – (virtual storage) EDM pool not sized based on max number of threads
    • Individual DB2 threads (allied, DBAT) may abend with 04E/RC=00E200xx when insufficient storage is available
  o Eventually the DB2 subsystem may abend with S878 or S80A due to a non-DB2 subsystem component (e.g. DFP) issuing an unconditional MVS getmain
    • DB2 getmains are MVS conditional getmains, so they are converted to DB2 abends, e.g. 00E20016
• V10 – real
  o Get an idea of DB2’s real storage consumption, as well as that of sort and other workload running concurrently
  o Before DB2 member consolidation, set a baseline for the amount you can absorb
  o Don’t forget MAXSPACE, REALSTORAGE_MANAGEMENT, REALSTORAGE_MAX (MEMLIMIT is ignored)

WebSphere (any appl. server) disaster prevention
It’s just ONE WebSphere connection… right?
• Overview of application server environment
  o Types of connections
• WLM and identifying JAVA threads
• Thread and timeout settings

WebSphere Infrastructure
• Understanding WLM & WAS on z/OS
  o The application server consists of 1 control region and 1:N servant regions
    • Each servant region spawns application environments WLM sees fit to satisfy application workload - similar to WLM stored procedure address spaces we know in DB2
    • Each appl environment takes advantage of thread/connection pools
(Diagram: a server made up of a controller region (CR) and servant regions (SR), each servant region running applications in a JVM. WLM starts servants from JCL based on the workload seen, and more as WLM sees the need, based on policies. Parameters provide control over the minimum and maximum number of regions.)

Different types of connections – app server to DB
1. Same LPAR: no network … cross memory … ultra fast
2. Different LPAR, HiperSockets: no network … cross memory … very fast
3. Different LPAR, not HiperSockets: no wire … just adapter card … fast
4. Off System z: traditional networking here
(Diagram labels: LPARs A, B, C and D, TCP stacks, OSA, network, real resources)
• When throughput and scalability are important, network delays can add up
• TCP/IP on z/OS is very aware and optimizes its path to reduce overhead and benefit your business

Thread footwork: T2 vs. T4 connections, same LPAR
• Going through DIST adds network translations, another address space, and context switch to an SRB
  o DB2 objective is to improve T2 driver performance to beat T4
  o Even local T4 connection hits DIST and TCP stack – need DDF WLM service class
  o Moving T2 to T4 is not 1 for 1 MIP exchange due to overhead
    • But the worse behaving the application, the more there is to offload, so in the end it must be tested and compared

WLM ROTs – samples every 250ms, acts every 10 seconds
• IRLM highest (SYSSTC, dispatch priority 255)
• MSTR (SYSSTC or Imp 1 and high velocity)
  o In DB2 9 MSTR controls system health monitor for virtual memory constraints and reporting to WLM for sysplex routing
• Importance 1, high velocity
  o DBM1
  o DIST
  o WLM stored procedure environments

  Number of CPs defined for the LPAR    Recommended High Velocity Goal
  1-5                                   50-70
  6-15                                  60-80
  More than 15                          70-90

• Discretionary is not appropriate for DB2 work for the most part
  o Performance index ALWAYS 0.8, so it always looks good
• DDF work should be lower priority than DIST
  o Response time is much more effective than velocity for transactions, as velocity means nothing to end user, and response time makes the phone ring
  o If ignored becomes SYSOTHER, discretionary
• Despite z/OS enqueue processing, could still hold locks on resources
  o Enclaves independent, but dependent on DIST
• If CMTSSTAT=ACTIVE do not use multiple periods

Do Not Put DDF Work Above DIST Address Space
• Example of DDF workload with importance 1 velocity goal coming into DIST address space with importance 2 and high velocity goal
  o DB2 itself is preempted by remote work, which is then punished when DIST cannot get the cycles it needs

WLM Qualifiers: Work qualifiers are used to help identify a thread or unit of work
• AI - Accounting Information*
• CI - Correlation Information*
• CN - Collection Name
• CT - Connection Type
• CTG - Connection Type Group
• LU - LU Name
• LUG - LU Name Group
• NET - Net ID
• NETG - Net ID Group
• PC - Process Name*
• PF - Perform
• PFG - Perform Group
• PK - Package Name
• PKG - Package Name Group
• PN - Plan Name
• PNG - Plan Name Group
• PR - Procedure Name
• PX - Sysplex Name
• SI - Subsystem Instance
• SIG - Subsystem Instance Group
• SSC - Subsystem Collection
• UI - Userid*
* Remote processes

WebSphere Connection Annotation
• Supported since WAS v6.0
• Allows you to specify certain annotations on a connection, similar to datasource annotations but on the connection level
  o CLIENT_ACCOUNTING_INFO (AI in WLM)
  o CLIENT_APPLICATION_NAME (PC in WLM)
  o CLIENT_ID (UI in WLM)
  o CLIENT_LOCATION
• Can be set explicitly or implicitly
  o Implicitly – use the trace string: WAS.clientinfo=all
    • Dynamically enabled and disabled.
    • Passes down workstation ID and client ID (if security is enabled in WAS)
    • Passes down WAS application name (if one exists).
  o Explicitly – use setClientInformation API defined in com.ibm.websphere.rsadapter.WSConnection
    • WLM_SET_CLIENT_INFO procedure can be called by app.
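
Sketch: mapping connection annotations to WLM qualifiers
The annotation-to-qualifier pairs on the WebSphere Connection Annotation slide can be sketched as a small lookup. This is an illustration only, not code from the deck: the Python names (ANNOTATION_TO_WLM_QUALIFIER, wlm_qualifiers) and the sample values "agb" and "ws01" are hypothetical. Only the three annotation/qualifier pairs come from the slide, and CLIENT_LOCATION is left unmapped because the slide lists no WLM qualifier for it.

```python
# Annotation -> WLM work qualifier, as listed on the slide.
ANNOTATION_TO_WLM_QUALIFIER = {
    "CLIENT_ACCOUNTING_INFO": "AI",   # Accounting Information
    "CLIENT_APPLICATION_NAME": "PC",  # Process Name
    "CLIENT_ID": "UI",                # Userid
}


def wlm_qualifiers(annotations):
    """Return {qualifier: value} for the annotations WLM could classify on.

    Annotations with no qualifier on the slide (for example CLIENT_LOCATION)
    and annotations with empty values are skipped.
    """
    result = {}
    for name, value in annotations.items():
        qualifier = ANNOTATION_TO_WLM_QUALIFIER.get(name)
        if qualifier is not None and value:
            result[qualifier] = value
    return result


if __name__ == "__main__":
    example = {
        "CLIENT_APPLICATION_NAME": "AdrianBurke",
        "CLIENT_ID": "agb",         # hypothetical user id
        "CLIENT_LOCATION": "ws01",  # hypothetical workstation name
    }
    print(wlm_qualifiers(example))
```

A lookup like this is mainly useful as a checklist: if an application sets none of the three mapped annotations, a WLM classification rule on AI, PC or UI has nothing to match.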
Identify WAS threads - Setting Client info
• In the data source
  o All applications sharing the data source appear the same to DB2
  o Need one source per application to change information
  o Data source approach: ease of implementing, but static
• Calling WLM_SET_CLIENT_INFO stored procedure
  o Requires application to add a call to the proc and populate the information
• Having application set it
  o WSConnection() method to set correlation and accounting info
• Create a wrapper from incoming getConnection() string that dynamically picks up program name and IDs
  o Can use Hibernate or Spring framework class to populate their intermediary config file
  o Could use a wrapper from WebSphere that uses getConnection() and WSSubject class to pull the information out of the incoming request to populate client info
  o Other approaches: requires coding, but flexible

Data Source Definition in WebSphere
• clientWorkstation
• clientApplicationInformation
  o Note that this is not ProgramName
• clientUser
• clientAccountingInformation

WLM_SET_CLIENT_INFO
• A stored procedure was introduced in V8 to allow remote applications to call the same APIs via a DB2 stored procedure
  o PK74330 - https://www-304.ibm.com/support/docview.wss?uid=swg1PK74330
• What these fields look like in OMEGAMON Performance Expert

Example of Creating a DDF Service Class for a Specific Application
• If a service class of ‘PC=AdrianBurke’ had been created and the application code contained connectionProperties.put("clientProgramName", "agb_v9") and connectionProperties.put("clientApplicationInformation", "AdrianBurke") (or WSConnection.CLIENT_APPLICATION_NAME, "AdrianBurke") in the connection string, then the snapshot of the enclave screen would show the following details:
(Screenshot: enclave details)

Thread Monitoring: -DIS DDF DETAIL
Real time information on DBAT and CONDBAT metrics
DSNL080I -DSSP DSNLTDDF DISPLAY DDF REPORT FOLLOWS:
DSNL081I STATUS=STARTD
DSNL082I LOCATION LUNAME GENERICLU
DSNL083I USTOPNETDB2P TOPNET.LUDB2P -NONE
DSNL084I TCPPORT=446 SECPORT=0 RESPORT=5020 IPNAME=-NONE
DSNL085I IPADDR=::192.168.10.193
DSNL086I SQL DOMAIN=KSEE1.BCBSKS.BLUESNET.NE
DSNL090I DT=I CONDBAT= 2000 MDBAT= 300
DSNL092I ADBAT= 64 QUEDBAT= 0 INADBAT= 0 CONQUED= 0
DSNL093I DSCDBAT= 56 INACONN= 370
DSNL099I DSNLTDDF DISPLAY DDF REPORT COMPLETE
• CONDBAT: max connections, could be idle and large number
• MDBAT: max concurrent threads, finite number
• ADBAT: how many threads are currently doing work
• QUEDBAT: you hit max DBAT and some are waiting
• DSCDBAT: how many threads are lounging in the pool
• INACONN: inactive connection

An Example of Tuning a Connection Pool
Recommendations
• To compute if minimum and maximum connections from WAS to DB2 are set correctly, add up all connections from every data source definition and compare to the corresponding DB2 zParms related to thread and connection count (after running MEMU2)

  Application Server Settings
  Connection Timeout       400
  Max Connections          40
  Min Connections          20
  Reap Time                90
  Unused Time              115
  Aged Timeout             600
  Purge Policy             FailingConnectionOnly
  Number of Servants       9
  Total Max Connections    9 servants * 40 max conns = 360
  Total Min Connections    9 servants * 20 min conns = 180

  DB2 zParms
  Rows on the slide: Max WAS Connections to DB2, Min WAS Connections to DB2, IDTHTOIN, MAXDBAT, CTHREAD, CONDBAT, TCPKPALV
  Values on the slide (row pairing not recoverable): 520, 120, 260, 395, 100, 1000, 300

• Unused Time: how long an idle or inactive connection can remain in the pool. Should be < IDTHTOIN and > Reap Time
• Aged Timeout should be set to a value greater than Reap Time. Setting the value to 0 means the physical connection will exist in the pool forever. Can help resize the thread footprint, like CONTSTOR, by allowing connection and thread to be recycled
• Reap Time denotes how often the pool maintenance thread runs. The more often it runs, the more accurate the pool is. Running it can affect performance.

WebSphere for z/OS Production Environment
• The point is to determine the max threads DB2 can handle and queue the excess outside DB2. With pooled threads the queued WAS requests can be serviced quickly.
  o WAS2 –
    • 1 Control Region
    • 3 Servant Regions
  o WAS3 –
    • 1 Control Region
    • 3 Servant Regions
• WAS2 and WAS3 reside on the same LPAR running against the same DB2
• WAS profile
  o Considered IOBOUND (default) where #processors * 3 = # of threads
  o CPUBOUND is where #processors * 1 = # of threads

WAS Production Settings – WAS2, WAS3
• We do not want WAS to be able to consume all available threads, as DB2 may have other work to process (DB2 can handle 550 total [CTHREAD + MAXDBAT])
• IDBACK and CTHREAD govern local T2 connections in this instance (200<600)
• 4 processors x 3 = 12 for IOBOUND

  WAS Application Server Settings
  Connection Timeout              180
  Max Connections                 Default=100 (min 12)
  Min Connections                 1
  Reap Time                       180
  Unused Time                     1800 (<900)
  Aged Timeout                    0
  Purge Policy                    EntirePool
  Tot Number of Servant Regions   6
  Total Max Connections           6 servants * 100 = 600
  Total Min Connections           3*1=3

  DB2 zParms
  Max WAS Connections to DB2      600 (72 min)
  Min WAS Connections to DB2      6
  IDTHTOIN                        900
  MAXDBAT                         200
  CTHREAD                         350
  IDBACK                          200
  CONDBAT                         1000
  TCPKPALV                        Enable

• Possibility of 600 thread requests from WAS is greater than the 350 calculated max we see in the MEMU2 data

References
• John Campbell’s Performance Statistics presentations
• Nigel Slinger’s Session 2639 on Real Storage
• RMF spreadsheet reporting tool
  o http://www-03.ibm.com/systems/z/os/zos/features/rmf/tools/rmftools.html
• Akira and Akiko’s performance updates at IOD
• **Trials and tribulations endured by customers via trial and error tuning

Extras
• No-Charge workshop (email me)
  o System z Synergy workshop focused on WebSphere (LUW or z) and DB2 for z/OS, settings, best practices, lessons learned

Questions???
Adrian Burke, DB2 SWAT Team, SVL Lab
[email protected]
• VISIT the DB2 Best Practices
• VISIT the DB2 for z/OS Exchange
• JOIN the World of DB2 for z/OS
• JOIN the DB2 for z/OS group
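
Worked example: storage cushion
The three cushion formulas on the Storage contraction slide can be sketched as a small calculation. This is a rough illustration under stated assumptions, not an IBM tool: the function name storage_cushion is hypothetical, K and M are assumed to mean 1024 and 1024*1024 bytes, and the DSMAX and extended region size in the sample call are made up. CTHREAD=350 and MAXDBAT=200 are the values from the WAS Production Settings slide.

```python
KB = 1024        # assumption: "K" on the slide means 1024 bytes
MB = 1024 * KB   # assumption: "M" on the slide means 1024*1024 bytes


def storage_cushion(cthread, maxdbat, dsmax, extended_region_bytes):
    """Apply the slide's three formulas; all results are in bytes."""
    # QW0225CR: storage reserved for must-complete work (e.g. ABORT, COMMIT)
    crit = (cthread + maxdbat + 1) * 64 * KB + 25 * MB
    # QW0225MV: storage reserved for open/close of data sets
    mvs = dsmax * 1300 + 40 * KB
    # QW0225SO: warning-to-contract threshold
    sos = max(extended_region_bytes * 5 // 100, crit - 25 * MB)
    return {
        "QW0225CR": crit,
        "QW0225MV": mvs,
        "QW0225SO": sos,
        "cushion": crit + mvs + sos,
    }


if __name__ == "__main__":
    # DSMAX and the extended region size are hypothetical sample values.
    numbers = storage_cushion(cthread=350, maxdbat=200, dsmax=20000,
                              extended_region_bytes=1500 * MB)
    for name, value in numbers.items():
        print(f"{name}: {value / MB:.1f} MB")
```

The Respect the cushion slide suggests building in an extra 100MB on top of this figure. Because QW0225CR scales with CTHREAD+MAXDBAT, raising either zParm also raises the cushion.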
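
Worked example: WAS connection demand vs. DB2 thread limits
The arithmetic on the two connection pool slides (servants multiplied by pool size, then compared with what DB2 can support) can be sketched as follows. The function names pool_demand and fits_in_db2 are hypothetical, and the sketch assumes every servant region can fill one pool of the stated size. The numbers in the sample calls are the ones shown on the slides.

```python
def pool_demand(servants, max_conns_per_pool, min_conns_per_pool):
    """Return (total max, total min) connections WAS could hold open."""
    return servants * max_conns_per_pool, servants * min_conns_per_pool


def fits_in_db2(total_max_connections, db2_thread_limit):
    """True when the worst-case WAS demand stays within the DB2 limit."""
    return total_max_connections <= db2_thread_limit


if __name__ == "__main__":
    # Tuning example slide: 9 servants, 40 max and 20 min connections.
    print(pool_demand(9, 40, 20))

    # Production slide: 6 servants * 100 max connections, compared with
    # the 350 calculated max taken from the MEMU2 data.
    total_max, _ = pool_demand(6, 100, 1)
    print(total_max, fits_in_db2(total_max, 350))
```

When the check fails, as it does for the production example, the deck's advice is to lower the pool maximums so that excess requests queue in WAS rather than inside DB2.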