What can I do to Protect DB2 Against Warehousing

What can I do to Protect DB2 Against Warehousing, WebSphere, and Itself
Adrian Burke
DB2 SWAT Team SVL
[email protected]
#ibmiod
Please note
IBM’s statements regarding its plans, directions, and intent are subject to change
or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be incorporated
into any contract. The development, release, and timing of any future features or
functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream,
the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results
similar to those stated here.
Topics of discussion
• Warehouse
o Customer example
• Virtual/Real Storage
o What it really looks like
• WebSphere (any application server)
o Environmental topics
o WLM thread classification
o Connection limits
Warehouse Experiences
• Overview
o One query that builds the MQTs, one
query to get summary data (5 hours,
3 hours elapsed respectively)
• From the perspective of:
o DB2
o The LPAR
o WLM
o DASD
To blame or not to blame… DB2
The Secret to Success is Knowing Who to Blame
Accounting Report
• Where is elapsed time spent and what does
that mean for performance investigation?
• Notice SE CPU, not reported as ‘IIP CPU’
anymore
• Vast majority of parallel tasks ran on zIIP
• Highest element of time was not accounted
for time
o CPU starvation?
o Massive paging?
o Performance traces?
• Class 3 suspense time is also larger than
the actual CPU time, generally due to Other
read I/O and Sync I/O (more to come)..
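The "not accounted for" figure on this slide is the part of elapsed time that is neither CPU nor a recorded class 3 wait. A minimal sketch of that arithmetic follows; the function names and the sample numbers are illustrative assumptions, not values from the customer report, and the CPU input is assumed to include any zIIP (SE CPU) time.

```python
def not_accounted_seconds(class2_elapsed, class2_cpu, class3_suspension):
    """Elapsed time in DB2 that is neither CPU nor a recorded class 3 wait.

    class2_cpu is assumed to already include zIIP (SE CPU) time.
    """
    return max(0.0, class2_elapsed - class2_cpu - class3_suspension)


def largest_component(class2_elapsed, class2_cpu, class3_suspension):
    """Name the biggest slice of elapsed time, to decide where to look first."""
    parts = {
        "cpu": class2_cpu,
        "class3_suspension": class3_suspension,
        "not_accounted": not_accounted_seconds(
            class2_elapsed, class2_cpu, class3_suspension
        ),
    }
    return max(parts, key=parts.get)


if __name__ == "__main__":
    # Hypothetical numbers in seconds, for illustration only.
    print(largest_component(class2_elapsed=100.0, class2_cpu=20.0,
                            class3_suspension=30.0))
```

When "not_accounted" comes out on top, the questions on the slide (CPU starvation, paging, traces) are the ones to ask before blaming DB2 itself.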
WLM and Unaccounted for Time
• WLM needs a donor and a recipient
• It will gauge whether or not the ‘transplant’ of
resources is warranted and:
o Can the bottleneck be improved by service class
bump?
o Can recipient’s performance be significantly
improved?
o Will stealing from the selected donor help the
situation?
o Does WLM have all the data it needs?
o Skip Clock?
• More on WLM later…..
[Figure: WLM performance index for both DB2 and the warehouse workload]
Parallelism Investigation
• RMF Spreadsheet Reporter Response delay report
o Part of WLM Activity report
• DSNTIJUZ
o PARAMDEG?
o BUFFERPOOL VPPSEQT??
o It does not look like parallelism was hampered here
• Lots of unaccounted for time
o OMPE accounting
o This block does not show child task class 2 time
• SYS1.PARMLIB (IEAOPTxx)
o IIPHONORPRIORITY = NO
• 3 parallel tasks waiting for 1 zIIP
o ZIIPAWMT also important here if delays seen in I/O in V10
• This becomes very important in V10 and V11
CPU delay at about 33%, and the zIIP suspense time at 34%.
I/O Suspicions
• Database I/O could be blamed on synch I/O average of 30 ms at times (8.6ms total average)
• Other Read I/O
o Usually prefetch time
o Obviously a lot of prefetch
• ROT: 0.4ms per page
• We saw >1ms in this case
o Maybe look for Sync Read Seq. (VPSEQT)
• This means there was an asynch. I/O request, followed by a synch I/O because the page was thrown out too quickly
• Affected by BP size and VPSEQT/VPPSEQT settings
What is causing the I/O delay
• DASD device or I/O subsystem could be part of issue, but what else?
• Other read I/O means prefetch is a significant factor
• Vicious cycle of CPU then I/O starvation
• WLM would monitor IOSQ and Pend time… next slide
If I/O delays are not calculated into performance index WLM cannot correct the root cause of the delay
If DB2 is starved for CPU it cannot schedule a prefetch operation
[Charts: scale of 100%]
DASD subsystem
• Top 10 volume response time chart
o Primary axis is number of I/O systems queue events per second
o Secondary axis is response time
• Response time =
o Connect + disconnect + pend + IOSQ time
o If control unit has 2 requests to same volume, 1 will queue
o This was resolved for the most part with Parallel Access Volumes (Hiper is best)
o Or moving physical data sets around to avoid LCU and volume contention
ROT: If IOSQ is more than half of (DISC+CONN) then it should be investigated
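The response time breakdown and the IOSQ rule of thumb above can be sketched as a small check. The class name, field names, and sample values below are assumptions for illustration; real values would come from the RMF device activity data.

```python
from dataclasses import dataclass


@dataclass
class VolumeSample:
    """Average per-I/O times for one volume, in milliseconds (hypothetical input)."""
    connect_ms: float
    disconnect_ms: float
    pend_ms: float
    iosq_ms: float

    @property
    def response_ms(self) -> float:
        # Response time = connect + disconnect + pend + IOSQ
        return self.connect_ms + self.disconnect_ms + self.pend_ms + self.iosq_ms

    def iosq_needs_investigation(self) -> bool:
        # ROT from the slide: IOSQ greater than half of (DISC + CONN)
        return self.iosq_ms > 0.5 * (self.disconnect_ms + self.connect_ms)


if __name__ == "__main__":
    sample = VolumeSample(connect_ms=1.0, disconnect_ms=2.0, pend_ms=0.5, iosq_ms=2.0)
    print(sample.response_ms, sample.iosq_needs_investigation())
```

Running this over the top 10 volumes gives a quick list of candidates for PAV or data set placement changes.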
Capturing documentation for IBM for access path regression
• Methods for capturing documentation for all releases are documented here
o https://www.ibm.com/support/docview.wss?uid=swg21206998
o OSC and DB2PLI8 do not support DB2 10
• SYSPROC.ADMIN_INFO_SQL supports V8 -> V10 (Required)
o Excellent developerWorks article here:
• http://www.ibm.com/developerworks/data/library/techarticle/dm-1012capturequery/index.html
o It is installed in V10 base and is subject to the installation verification process
• DB2HLQ.SDSNSAMP(DSNTESR) will create and bind it
• Calling program is DSNADMSB, and sample JCL in DSNTEJ6I
o Ensure DB2 9 and DB2 10 have APAR PM39871 applied
• Data Studio V3.2 incorporates this procedure into a GUI (Best Practice)
• http://www.ibm.com/developerworks/downloads/im/data/#optional
• No charge product, replacement for OSC and Visual Explain
• Single option to download with V3.2
• Incorporates Statistics Advisor
• Query Environment Capture used to collect doc.
– FTP doc directly to DB2 Level 2 in tool
• Can be used to duplicate stats in TEST environment
Possible Solution: engage the appropriate resources
• New z196
o Larger processors
o More memory
o Page-fix workfile bufferpool
o Workfile best practices Info APAR
• II14587 - http://www-01.ibm.com/support/docview.wss?uid=isg1II14587&myns=apar&mynp=DOCTYPEcomponent&mync=E
• New DS8000 Shark with parallel access volumes
• WLM goals adjusted to reflect warehouse as a business critical workload
o And allow WLM to manage it (IIPHONORPRIORITY=YES) and >1 zIIP
• **Management buy-in that ‘workload’ was a victim not the instigator
Virtual/ Real Storage Monitoring
• What tools can you use
• How to make good decisions regarding ZPARMS
• Customer examples of virtual and real storage consumption
[Figure: storage relative to the 2GB bar in V8, V9, and V10 (enabled in CM). Labels: EDM Pool (EDMPOOL), Skeleton Pool (EDM_SKELETON_POOL), Global Stmt Pool (EDMSTMTC), DBD Pool (EDMDBDC), SKCT / SKPT, CT / PT, Others, Working storage]
Where do I stand with respect to storage?
• MEMU2.zip is a REXX exec (no support provided), available free on IBM developerWorks under the DB2 for z/OS Exchange > REXX
o https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/270000K6H5
o Or Google > IBM developerworks MEMU2
o DB2 V8 and DB2 9 use 1 version, another version specific to DB2 10
o MEMU2 REXX – outputs IFCID225 info invoked as batch job
o MEMUSAGE REXX – outputs IFCID225 if invoked from TSO Option 6
• Returns IFCID225 immediately for one time snap-shot
• InfoCenter V10 real storage estimate – you provide working storage size
o http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z10.doc.inst/src/tpc/db2z_calcrealstgreqs.htm
The ramp up
• As more work gets done the stack storage grows (high use
thread storage DB2 keeps around)
• Prefetch, deferred write, castout engines all increase on
demand and occupy storage, only released if DB2 bounced
o 600 prefetch engines
o 300 deferred write engines
o 300 castout engines
o 300 GBP write engines
o 500 P-lock notify/exit engines
• The global (above) and local dynamic statement cache
(below - KEEPDYNAMIC), and EDM pool
o RELEASE(DEALLOCATE) threads grow over time as
well
• DB2 keeps a hold of much of the storage assuming if it was
that busy once it will occur again
System engines reduced 31-bit footprint in V10
• Reduced by 90%, much like the user thread footprint
o Compare total agent system storage and sum of engines
o In this example: Total of 110 engines in DB2 9 and 164 in DB2 10
• These should be static when making estimates on number of threads supported
by virtual and real storage *(blue line is storage in MB, right axis is engines)
Actual thread footprint in SAP subsystem
• V9 -> V10 31-bit footprint decreased 73%
o 1.39MB -> 0.38MB
• Estimated max number of active threads increases about 3x
o This number will be more accurate the more threads are in the system
• Done with MEMU2 in a spreadsheet through calculated fields
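The percentages on this slide can be re-derived from the two footprint values. The sketch below is only that arithmetic; the function names are illustrative, and the spreadsheet's actual calculated fields are not shown in the slides, so this is not a reproduction of them.

```python
def percent_decrease(before_mb: float, after_mb: float) -> float:
    """Relative reduction of the per-thread 31-bit footprint, in percent."""
    return (before_mb - after_mb) / before_mb * 100.0


def footprint_ratio(before_mb: float, after_mb: float) -> float:
    """How many new-size threads fit in the storage one old-size thread used."""
    return before_mb / after_mb


if __name__ == "__main__":
    # Values quoted on the slide: 1.39MB in V9, 0.38MB in V10.
    print(round(percent_decrease(1.39, 0.38)))    # about 73
    print(round(footprint_ratio(1.39, 0.38), 2))
```

The raw footprint ratio is somewhat higher than the "about 3x" thread estimate quoted above, which is expected if the spreadsheet also reserves storage for things other than threads; that part is an assumption here.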
Memory Monitor
• DB2 9 Automated Memory Monitor
o Built-in monitor runs from startup to shutdown and checks the
health of the system at one-minute intervals.
o When the DBM1 storage below the bar reaches specific
thresholds of 88, 92, 96, or 98 percent used of the available
storage, messages (DSNV508I, DSNV510I, DSNV511I, and
DSNV512I) are issued.
• PM38435 respects storage reserved for cushion in %, and lists it
• -DISPLAY THREAD(*) SERVICE (STORAGE)
o DSNV492I LONG + VLONG is storage below 2GB bar (V9)
o Using display command we see a 72% below the bar reduction
• Not including stack storage
Storage contraction
[Chart callouts: “In here DB2 can come down”; “In here DB2 threads abend (storage critical)”]
• 3 critical numbers for calculating cushion
• Storage reserved for must complete (e.g. ABORT, COMMIT) QW0225CR/PVTCRIT
o = (CTHREAD+MAXDBAT+1)*64K (Fixed, real value) +25M
• Storage reserved for open/close of datasets – QW0225MV/PVTMVS
o = (DSMAX*1300)+40K (space reserved in low private)
• Warning to contract – QW0225SO/PVTSOS
o = Max (5% of Extended Region Size, QW0225CR -25M)
• Storage Cushion = QW0225CR + QW0225MV + QW0225SO
o Note: QW0225MV will decrease as page sets are opened
• In full system contraction DB2 grabs the LPVT latch so current threads cannot request storage and no new units of work can get in…
[Chart callout: “In here DB2 hits full system contraction”]
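The three cushion formulas on the Storage contraction slide can be put together as a small calculator. This is a sketch under stated assumptions: "K" and "M" are taken as binary units (1024 and 1024*1024 bytes), 1300 is taken as bytes per data set, and the function names and sample zParm values are illustrative only.

```python
KB = 1024
MB = 1024 * 1024


def must_complete_reserve(cthread: int, maxdbat: int) -> int:
    """QW0225CR: (CTHREAD + MAXDBAT + 1) * 64K + 25M."""
    return (cthread + maxdbat + 1) * 64 * KB + 25 * MB


def open_close_reserve(dsmax: int) -> int:
    """QW0225MV: (DSMAX * 1300) + 40K."""
    return dsmax * 1300 + 40 * KB


def contraction_warning(extended_region_bytes: int, cr: int) -> int:
    """QW0225SO: max(5% of extended region size, QW0225CR - 25M)."""
    return max(extended_region_bytes * 5 // 100, cr - 25 * MB)


def storage_cushion(cthread: int, maxdbat: int, dsmax: int,
                    extended_region_bytes: int) -> int:
    """Storage cushion = QW0225CR + QW0225MV + QW0225SO, in bytes."""
    cr = must_complete_reserve(cthread, maxdbat)
    mv = open_close_reserve(dsmax)
    so = contraction_warning(extended_region_bytes, cr)
    return cr + mv + so


if __name__ == "__main__":
    # Hypothetical subsystem: CTHREAD=350, MAXDBAT=200, DSMAX=10000, 1500MB extended region.
    cushion = storage_cushion(350, 200, 10000, 1500 * MB)
    print(f"{cushion / MB:.1f} MB")
```

Because QW0225MV shrinks as page sets are opened, a value computed from DSMAX is the starting figure rather than what IFCID 225 will report later in the day.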
Respect the cushion
• Storage critical means threads will be abended to gain storage
o If must complete work fails to get storage, like for a roll-back, or unconditional
request for storage comes in, then DB2 could come down
o Need to build in extra 100MB in cushion and set MAXDBAT+CTHREAD
appropriately (that is SWAT team’s approach when sizing)
MVS Storage DB2 10
• Almost everything is above the bar:
o 6GB Common per subsystem: z/OS 1.10 has 64GB as default
• If you have many subsystems on same LPAR z/OS default may not be enough
o 128GB Shared per subsystem came in v9: now in V10 this is where all the thread storage went
• In V9 only DRDA comm. area and trusted context ran here
• We still have 31bit stack for agents (threads)
o 16K for system agents
o 32k for attach
o xPROCS still there, and reported in IFCID 225
[Screenshots: D VIRTSTOR,HVCOMMON output (6GB) and D VIRTSTOR,HVSHARE output (128GB)]
What affects the storage?
• 64-bit private
o Mostly buffer pools
o Fixed RID pool
o EDM pool
• 64-bit shared thread and system
o Used to execute SQL
o Variable-dynamic statement caches, CT, PT, SKCT, SKPT
o *Compare to number of system and user agents/threads, parallelism
• 64-bit stack
o Just as 31-bit stack in V9, used by thread to execute SQL
o *same
• 64-bit common
o Distributed agents, package accounting, rollup
• All are affected by REALSTORAGE_MANAGEMENT
REALSTORAGE_MANAGEMENT (Discard Mode)
• OFF Do not enter discard mode unless the REALSTORAGE_MAX boundary
is approached OR z/OS has notified us that there is a critical aux shortage
• ON Always operate in discard mode. This may be desirable for LPAR with
many DB2s or dev/test systems
• AUTO (the default) When significant paging is detected, discard mode will
be entered
• Important notes:
o discard mode is not exited immediately upon relief to avoid constant
toggling in and out of this mode
o discard mode shows <1% CPU degradation with no detectable impact
to running workloads
o DSNV516I – beginning storage discard mode
o DSNV517I – ending storage discard mode
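The three REALSTORAGE_MANAGEMENT settings can be read as a simple decision rule. The sketch below encodes only the triggers named on this slide; the enum, function, and parameter names are assumptions, and whether AUTO also reacts to the OFF triggers is not stated here, so the code does not assume it.

```python
from enum import Enum


class RealStorageManagement(Enum):
    OFF = "OFF"
    ON = "ON"
    AUTO = "AUTO"  # the default


def should_enter_discard_mode(setting: RealStorageManagement,
                              near_realstorage_max: bool,
                              critical_aux_shortage: bool,
                              significant_paging: bool) -> bool:
    """Return True when, per the slide, DB2 would enter discard mode."""
    if setting is RealStorageManagement.ON:
        return True  # always operate in discard mode
    if setting is RealStorageManagement.OFF:
        # Only when REALSTORAGE_MAX is approached or z/OS reports a critical aux shortage.
        return near_realstorage_max or critical_aux_shortage
    # AUTO: the slide names significant paging as the trigger.
    return significant_paging


if __name__ == "__main__":
    print(should_enter_discard_mode(RealStorageManagement.AUTO, False, False, True))
```

The sketch deliberately has no exit logic: as noted above, DB2 does not leave discard mode immediately upon relief, and the timing of that is not given in the slides.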
V10 storage monitoring
• Possible to map real storage fluctuations and real storage available on
the LPAR via IFCID 225 and MEMU2
o DB2 directly affects the real available on the LPAR as shown below
o Large sorts and other concurrent workload play a factor here
o We can see when REALAVAIL drops sometimes DB2 gets paged out
to AUX, even though very little paging registered on the system
Reality Check
Storage summary
• V9 - virtual
o IFCID 225 shows real and virtual storage
• Full system contraction
• Storage creep/ non-DB2 storage in DBM1
• Storage for threads
o The race begins – (virtual storage) EDM pool not sized based on max number
of threads
• Individual DB2 threads (allied, DBAT) may abend with 04E/RC=00E200xx when
insufficient storage available
o Eventually DB2 subsystem may abend with abend S878 or S80A due to non-DB2
subsystem component (e.g DPF) issuing unconditional MVS getmain
• DB2 getmains are MVS conditional getmains so are converted to DB2 abends
e.g. 00E20016
• V10 –real
o Get idea of DB2’s real storage consumption, as well as sort’s and other
workload running concurrently
o Before DB2 member consolidation set baseline for amount you can
absorb
o Don’t forget MAXSPACE, REALSTORAGE_MANAGEMENT,
REALSTORAGE_MAX (MEMLIMIT is ignored)
WebSphere (any appl. Server) disaster prevention
• Overview of application server environment
o Types of connections
• WLM and identifying JAVA threads
• Thread and timeout settings
It's just ONE WebSphere connection… right?
WebSphere Infrastructure
• Understanding WLM & WAS on z/OS
o The Application server consists of 1 control region and 1:N servant
regions
• Each servant region spawns application environments WLM sees fit to
satisfy application workload - similar to WLM stored procedure address
spaces we know in DB2
• Each appl environment takes advantage of thread/connection pools
[Figure: a Server consists of a Controller Region (CR) and Servant Regions (SR); each servant region runs a JVM hosting applications. WLM starts servants from JCL based on workload seen, and more as WLM sees the need, based on policies. Parameters provide control over the minimum and maximum number of regions.]
Different types of connections –app server to DB
[Figure: LPARs A to D with their TCP stacks, an OSA adapter (real resources), and the network; the numbered paths are:]
1. Same LPAR: No network … cross memory … ultra fast
2. Different LPAR, HiperSockets: No network … cross memory … very fast
3. Different LPAR, not HiperSockets: No wire … just adapter card … fast
4. Off System z: Traditional networking here
When throughput and scalability are important, network delays can add up
TCP/IP on z/OS is very aware and optimizes its path to reduce overhead and benefit your business
Thread footwork T2 vs. T4 Connections same LPAR
• Going through DIST adds network translations, another address space, and
context switch to an SRB
o DB2 objective is to improve T2 driver performance to beat T4
o Even local T4 connection hits DIST and TCP stack – need DDF WLM
service class
o Moving T2 to T4 is not 1 for 1 MIP exchange due to overhead
• But, the worse behaving the application the more there is to offload so
in the end it must be tested and compared
WLM ROTs – samples every 250ms, acts every 10 seconds
• IRLM Highest (SYSSTC, dispatch priority 255)
• MSTR (SYSSTC or Imp 1 and high velocity)
o In DB2 9 MSTR controls system health monitor for virtual memory constraints and reporting to WLM for sysplex routing
• Importance 1, high velocity
o DBM1
o DIST
o WLM sto proc environments

Number of CPs defined for the LPAR    Recommended High Velocity Goal
1-5                                   50-70
6-15                                  60-80
More than 15                          70-90

• Discretionary is not appropriate for DB2 work for the most part
o Performance index ALWAYS 0.8, so it always looks good
• DDF work should be lower priority than DIST
o Response time is much more effective than velocity for transactions, as velocity means nothing to end user, and response time makes the phone ring
o If ignored becomes SYSOTHER, discretionary
• Despite z/OS Enqueue processing could still hold locks on resources
o Enclaves independent, but dependent on DIST
• If CMTSSTAT=ACTIVE do not use multiple periods
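The velocity goal table on this slide is a simple lookup by the number of CPs defined for the LPAR. A minimal sketch follows, with a hypothetical function name; the ranges are the ones in the table.

```python
def recommended_velocity_range(cp_count: int) -> tuple:
    """Return the (low, high) velocity goal suggested for high-velocity DB2 address spaces."""
    if cp_count < 1:
        raise ValueError("an LPAR needs at least one CP")
    if cp_count <= 5:
        return (50, 70)
    if cp_count <= 15:
        return (60, 80)
    return (70, 90)


if __name__ == "__main__":
    for cps in (4, 10, 24):
        low, high = recommended_velocity_range(cps)
        print(f"{cps} CPs: velocity goal {low}-{high}")
```

This only suggests a starting range; the goal still has to be checked against the performance index WLM actually reports.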
Do Not Put DDF Work Above DIST Address Space
• Example of DDF workload with importance 1 velocity goal coming into
DIST address space with importance 2 and high velocity goal
o DB2 itself is preempted by remote work, which is then punished
when DIST cannot get the cycles it needs
WLM Qualifiers:
Work qualifiers are used to help identify a thread or unit of work
• AI - Accounting Information*
• CI - Correlation Information*
• CN - Collection Name
• CT - Connection Type
• CTG - Connection Type Group
• LU - LU Name
• LUG - LU Name Group
• NET - Net ID
• NETG - Net ID Group
• PC - Process Name*
• PF - Perform
• PFG - Perform Group
• PK - Package Name
• PKG - Package Name Group
• PN - Plan Name
• PNG - Plan Name Group
• PR - Procedure Name
• PX - Sysplex Name
• SI - Subsystem Instance
• SIG - Subsystem Instance Group
• SSC - Subsystem Collection
• UI - Userid*
* Remote processes
WebSphere Connection Annotation
• Supported since WAS v6.0
• Allows you to specify certain annotations on a connection similar to
datasource annotations but on the connection level
o CLIENT_ACCOUNTING_INFO (AI in WLM)
o CLIENT_APPLICATION_NAME (PC in WLM)
o CLIENT_ID (UI in WLM)
o CLIENT_LOCATION
• Can be set explicitly or implicitly
o Implicitly – use the trace string: WAS.clientinfo=all
• Dynamically enabled and disabled.
• Passes down workstation ID and client ID (if security is enabled in WAS)
• Passes down WAS application name (if one exists).
o Explicitly – use setClientInformation API defined in
com.ibm.websphere.rsadapter.WSConnection
• WLM_SET_CLIENT_INFO procedure can be called by app.
Identify WAS threads- Setting Client info
• In the data source (ease of implementing, but static)
o All applications sharing the data source appear the same to DB2
o Need one source per application to change information
• Calling WLM_SET_CLIENT_INFO stored procedure
o Requires application add a call to proc and populate the information
• Having application set it (requires coding, but flexible)
o WSConnection() method to set correlation and accounting info
• Create a wrapper from incoming getConnection() string that dynamically picks up program name and IDs
o Can use Hibernate or Spring framework class to populate their intermediary config file
o Could use a wrapper from WebSphere that uses getConnection() and WSSubject class to pull the information out of the incoming request to populate client info
Data Source Definition in WebSphere
• clientWorkstation
• clientApplicationInformation
o Note that this is not ProgramName
• clientUser
• clientAccountingInformation
WLM_SET_CLIENT_INFO
• A stored procedure was introduced in V8 to allow remote
applications to call the same APIs via a DB2 stored procedure
o PK74330 - https://www304.ibm.com/support/docview.wss?uid=swg1PK74330
• What these fields look like in Omegamon Performance Expert
Example of Creating a DDF Service Class for a Specific
Application . . .
• If a service class of ‘PC=AdrianBurke’ had been created and the
application code contained
connectionProperties.put("clientProgramName", "agb_v9") and
connectionProperties.put("clientApplicationInformation", "AdrianBurke")
(or WSConnection.CLIENT_APPLICATION_NAME, "AdrianBurke") in the
connection string, then the snapshot of the enclave screen would
show the following details:
Thread Monitoring:
-DIS DDF DETAIL real time information on DBAT and CONDBAT metrics
DSNL080I -DSSP DSNLTDDF DISPLAY DDF REPORT FOLLOWS:
DSNL081I STATUS=STARTD
DSNL082I LOCATION LUNAME GENERICLU
DSNL083I USTOPNETDB2P TOPNET.LUDB2P -NONE
DSNL084I TCPPORT=446 SECPORT=0 RESPORT=5020 IPNAME=-NONE
DSNL085I IPADDR=::192.168.10.193
DSNL086I SQL DOMAIN=KSEE1.BCBSKS.BLUESNET.NE
DSNL090I DT=I CONDBAT= 2000 MDBAT= 300
DSNL092I ADBAT= 64 QUEDBAT= 0 INADBAT= 0 CONQUED= 0
DSNL093I DSCDBAT= 56 INACONN= 370
DSNL099I DSNLTDDF DISPLAY DDF REPORT COMPLETE
Callouts:
• CONDBAT: Max connections, could be idle and large number
• MDBAT: Max concurrent threads, finite number
• ADBAT: How many threads are currently doing work
• QUEDBAT: You hit max DBAT and some are waiting
• DSCDBAT: How many threads are lounging in the pool
• INACONN: Inactive connection
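The counters in a -DIS DDF DETAIL report can be pulled out with a few lines of parsing, for example to alert when work is queuing behind MAXDBAT. This is a sketch, not a supported tool: the function names are hypothetical, the sample text is made up for illustration, and it assumes the report has already been captured as plain text with one message per line.

```python
import re

# Matches numeric counters such as "CONDBAT= 2000"; fields like "DT=I" are skipped.
_COUNTER = re.compile(r"([A-Z]+)=\s*(\d+)")
_COUNTER_MESSAGES = ("DSNL090I", "DSNL092I", "DSNL093I")


def parse_ddf_counters(report: str) -> dict:
    """Collect the numeric fields of the DSNL090I/092I/093I lines into a dict."""
    counters = {}
    for line in report.splitlines():
        line = line.strip()
        if line.startswith(_COUNTER_MESSAGES):
            for name, value in _COUNTER.findall(line):
                counters[name] = int(value)
    return counters


def hit_max_dbat(counters: dict) -> bool:
    """Per the slide, a non-zero QUEDBAT means MAXDBAT was hit and work waited."""
    return counters.get("QUEDBAT", 0) > 0


# Hypothetical report fragment, not taken from the customer system.
SAMPLE = """\
DSNL090I DT=I CONDBAT= 2000 MDBAT= 300
DSNL092I ADBAT= 64 QUEDBAT= 3 INADBAT= 0 CONQUED= 0
DSNL093I DSCDBAT= 56 INACONN= 370
"""

if __name__ == "__main__":
    counters = parse_ddf_counters(SAMPLE)
    print(counters, hit_max_dbat(counters))
```

Restricting the parse to the three counter messages keeps port numbers and other "NAME=value" fields from the DSNL084I line out of the result.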
An Example of Tuning a Connection Pool Recommendations
To compute if minimum and maximum connections from WAS to DB2 are set correctly, add up all connections from every data source definition and compare to corresponding DB2 zParms related to thread and connection count (after running MEMU2)
Application Server Settings: Connection Timeout, Max Connections (40), Min Connections (20), Reap Time, Unused Time, Aged Timeout, Purge Policy (FailingConnectionOnly), Number of Servants (9)
• Total Max Connections = 9 servants * 40 max conns = 360
• Total Min Connections = 9 servants * 20 min conns = 180
DB2 zParms: Max WAS Connections to DB2, Min WAS Connections to DB2, IDTHTOIN, MAXDBAT, CTHREAD, CONDBAT, TCPKPALV
[Remaining values shown on the slide, whose pairing with the settings is not recoverable from this transcription: 400, 90, 115, 600, 520, 120, 260, 395, 100, 1000, 300]
Callouts:
• Unused Time: How long an idle or inactive connection can remain in the pool. Should be < IDTHTOIN and > Reap time
• Reap Time: Denotes how often the pool maintenance thread runs. The more often it runs, the more accurate the pool is. The running can affect performance.
• Aged Timeout: Should be set to a value greater than reap time out. Setting the value to 0 means the physical connection will exist in the pool forever. Can help resize the thread footprint, like CONTSTOR, by allowing connection and thread to be recycled.
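The "add up all connections from every data source definition" step is plain multiplication and addition, sketched below. The function name and the (max, min) tuple layout are assumptions for illustration; the figures in the usage line are the 9 servants with 40 max and 20 min connections from this slide.

```python
from typing import Iterable, Tuple


def total_pool_connections(servants: int,
                           data_sources: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (total max, total min) connections WAS could open to DB2.

    data_sources holds one (max_connections, min_connections) pair per
    data source definition; every servant region gets its own pools.
    """
    pairs = list(data_sources)
    total_max = servants * sum(max_conns for max_conns, _ in pairs)
    total_min = servants * sum(min_conns for _, min_conns in pairs)
    return total_max, total_min


if __name__ == "__main__":
    print(total_pool_connections(9, [(40, 20)]))  # one data source, nine servants
```

The totals are what should be compared with the thread and connection zParms, since each servant region multiplies whatever a single data source is allowed to open.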
WebSphere for z/OS Production Environment
• The point is to determine max threads DB2 can handle
and queue the excess outside DB2, with pooled threads
the queued WAS requests can be serviced quickly.
o WAS2 –
• 1 Control Region
• 3 Servant Regions
o WAS3 –
• 1 Control Region
• 3 Servant Regions
• WAS2 and WAS3 reside on the same LPAR running
against the same DB2
• WAS profile
o Considered IOBOUND (default) where #processors * 3 = # of threads
o CPUBOUND is where #processors * 1 = # of threads
WAS Production Settings – WAS2, WAS3
• We do not want WAS to be able to consume all available threads, as DB2 may have other work to process (DB2 can handle 550 total [CTHREAD + MAXDBAT])
• IDBACK and CTHREAD govern local T2 connections in this instance (200<600)
• 4 processors x 3 = 12 for IOBOUND WAS

Application Server Settings
Connection Timeout               180
Max Connections                  Default=100 (min 12)
Min Connections                  1
Reap Time                        180
Unused Time                      1800 (<900)
Aged Timeout                     0
Purge Policy                     EntirePool
Tot Number of Servant Regions    6
Total Max Connections            6 servants * 100 = 600
Total Min Connections            3*1=3

DB2 zParms
Max WAS Connections to DB2       600 (72 min)
Min WAS Connections to DB2       6
IDTHTOIN                         900
MAXDBAT                          200
CTHREAD                          350
IDBACK                           200
CONDBAT                          1000
TCPKPALV                         Enable

Possibility of 600 thread requests from WAS is greater than the 350 calculated max we see in the MEMU2 data
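The closing comparison on this slide, possible WAS requests versus what DB2 can sustain, can be sketched as follows. The function names are hypothetical; the numbers in the usage lines are the ones quoted above (CTHREAD 350, MAXDBAT 200, 600 possible requests, 350 calculated from MEMU2).

```python
def db2_thread_ceiling(cthread: int, maxdbat: int) -> int:
    """Total threads DB2 is configured to accept: CTHREAD + MAXDBAT."""
    return cthread + maxdbat


def requests_to_queue_outside_db2(possible_requests: int, sustainable_threads: int) -> int:
    """How many concurrent requests exceed what DB2 can sustain and should wait in WAS."""
    return max(0, possible_requests - sustainable_threads)


if __name__ == "__main__":
    print(db2_thread_ceiling(cthread=350, maxdbat=200))
    print(requests_to_queue_outside_db2(possible_requests=600, sustainable_threads=350))
```

A non-zero result is the signal to lower the pool maximums so that the excess queues in the application server rather than inside DB2.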
References
• John Campbell’s Performance Statistics presentations
• Nigel Slinger’s Session 2639 on Real Storage
• RMF spreadsheet reporting tool
o http://www03.ibm.com/systems/z/os/zos/features/rmf/tools/rmftools.html
• Akira and Akiko’s performance updates at IOD
• **Trials and tribulations endured by
customers via trial and error tuning
Extras……
• No-Charge workshop (email me)
o System z Synergy workshop focused on Websphere (LUW or z)
and DB2 for z/OS, settings, best practices, lessons learned
Questions???
Adrian Burke
DB2 SWAT Team SVL Lab
[email protected]
• VISIT the DB2 Best Practices
• VISIT the DB2 for z/OS Exchange
• JOIN the World of DB2 for z/OS
• JOIN the DB2 for z/OS group
