proc sort

Sorting from A to Z
Henrik Dorf
Chief Consultant, SAS Institute
Why not sorting from A to Å?
• It is too hard…
So not A to Å
• And not much about NLS
Sorting: why sort?
• Because order is good
• And efficient
• A little card trick:
Faster access to data
Because data is stored together by
• Date
• Departments
• Product groups
• …
Data is then faster to read from disk, because the read does not have to visit as many
different places. (Not an IKEA warehouse.)
What happens when data is sorted?
• When data is sorted, it is ordered in a work area in the computer's memory
What happens?
• With larger data volumes the work area becomes too small
• So a work area is created on disk: utility files
What happens?
• Utility files are clever, but what do they mean for performance?
• More disk I/O: that takes more time, but can I show it?
SAS session and work areas
D:\WORK\_TD7160_KOHDOW72_
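The work area path shown above can be looked up from inside a running session. A minimal sketch, assuming a standard SAS 9 session where the WORK library is assigned and the UTILLOC system option is available (by default utility files are placed under WORK):

```sas
/* Print the physical path of the WORK library to the log */
%put WORK path: %sysfunc(pathname(work));

/* Print the current utility file location (often simply WORK) */
%put UTILLOC:   %sysfunc(getoption(utilloc));
```

These are the folders a monitoring tool would need to watch while a sort is running.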
SAS jobs cannot measure themselves
Local Work
Workset Monitor
• A "measuring app" that keeps an eye on selected resources
• Started from the job
• Terminates and delivers statistics
• SAS can deliver the analysis
…naturally
Workset Monitor
SAS Work
SAS utility folder
Input sas table
Output SAS table
Workset Monitor
Current size (bytes)
Max size (bytes)
Stat output CSV
Workset Monitor - output
The analysis
Example measurement
Better use of memory
Elapsed times
Memory should be used
• Computer memory (16-96GB)
16 GB
SAS options
proc options group=memory ; run ;
Group=MEMORY
 SORTSIZE=1073741824
     Specifies the amount of memory that is available to the SORT procedure.
 SUMSIZE=0
     Specifies a limit on the amount of memory that is available for data
     summarization procedures, when class variables are active.
 MAXMEMQUERY=0
     Specifies the maximum amount of memory that is allocated for procedures.
 MEMBLKSZ=16777216
     Specifies the memory block size for Windows memory-based libraries.
 MEMMAXSZ=2147483648
     Specifies the maximum amount of memory to allocate for using
     memory-based libraries.
 LOADMEMSIZE=0
     Specifies a suggested amount of memory that is needed for executable
     programs loaded by SAS.
 MEMSIZE=2147483648
     Specifies the limit on the amount of virtual memory that can be used
     during a SAS session.
 REALMEMSIZE=0
     Specifies the amount of real memory SAS can expect to allocate.
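When only one or two of these settings are of interest, they can be queried individually instead of listing the whole group. A small sketch, assuming a standard SAS 9 session; the values printed depend on the installation:

```sas
/* Write the two settings most relevant for sorting to the log */
%put SORTSIZE=%sysfunc(getoption(sortsize));
%put MEMSIZE=%sysfunc(getoption(memsize));

/* Show a single option and where its value was set */
proc options option=sortsize value;
run;
```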
Memory
• Computer memory
Memsize=1G
16 GB
Memsize=2G
Memsize=0
Memory
• Computer memory
Sort memory
16 GB
Memsize=2G
Sortsize=1G
Memory
• Computer memory
Sort memory
Memsize=6G
Sortsize=4G
16 GB
What happens?
MEMSIZE
SAS does not automatically reserve or allocate the amount of memory that you specify in
the MEMSIZE system option. SAS uses only as much memory as it needs to complete a
process. For example, a DATA step might require only 20M of memory, so even though
MEMSIZE is set to 500M, SAS uses only 20M of memory.
SORTSIZE
The SORTSIZE system option can reduce the amount of swapping SAS must do to sort the
data set. If PROC SORT needs more memory than you specify, it creates a temporary
utility file in your SAS work directory in which to store the data. The SORT procedure's
algorithm can swap unneeded data more efficiently than Windows can.
Memory: shared among many jobs
• Computer memory
Sort memory
Memsize=6G
Sortsize=5G
16 GB
Memory options
• Memsize 4G must be set before SAS starts
• Sortsize 3G can be set during the job
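The two bullets above could look like this in practice. This is a sketch only: the command line in the comment is one common way to pass MEMSIZE at startup (a configuration file is another), and the data sets a and b are placeholders taken from the examples later in this deck.

```sas
/* MEMSIZE is a startup option, so it is given when SAS is launched, e.g.
     sas -memsize 4G
   It cannot be changed with an OPTIONS statement in a running session. */

/* SORTSIZE can be changed inside the job, before the sort that needs it */
options sortsize=3G;

proc sort data=a out=b;
  by b;
run;
```

Keeping SORTSIZE comfortably below MEMSIZE leaves room for the rest of the session, which matches the diagrams above.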
Possible alternatives to using a lot of memory
• Proc SQL: order by
• Proc sort data=a out=b TAGSORT; By …..
• Proc sort via index
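The first two alternatives can be sketched as follows. The table a and the variable b are borrowed from the examples later in the deck and are only placeholders here; the output table names are hypothetical.

```sas
/* Alternative 1: let PROC SQL deliver the rows in order */
proc sql;
  create table b_sql as
  select *
  from a
  order by b;
quit;

/* Alternative 2: TAGSORT keeps only the BY variables and the observation
   numbers in the temporary files, which saves utility space but usually
   costs extra time when the full rows are fetched afterwards */
proc sort data=a out=b_tag tagsort;
  by b;
run;
```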
SQL Order by
SQL
25 seconds
Alternatives
• Tagsort: saves space in the utility area
Index sort
• If a table is indexed on the (single) variable to be sorted by, the index
can be used to read the data in sorted order without performing an
actual sort.
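An index can also be added to an existing table after it has been created. A minimal sketch, assuming a table WORK.A with a variable b as in the examples in this deck:

```sas
/* Add a simple index on b to an existing table */
proc datasets library=work nolist;
  modify a;
  index create b;
quit;

/* MSGLEVEL=I makes SAS report in the log when an index is used */
options msglevel=i;

data c;
  set a;
  by b;   /* no PROC SORT needed: the index supplies the order */
run;
```

Whether this is faster than sorting is a separate question, which the timings in the deck address.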
Sort via index: the base table with an index
data a(INDEX=(B));
  length c $100;
  do i=1 to 10000000;
    b=ranuni(0);
    c=repeat("A",100);
    output;
  end;
run;
NOTE: The data set WORK.A has 10.000.000 observations and 3 variables.
INFO: Multiple concurrent threads will be used to create the index.
NOTE: Simple index B has been defined.
NOTE: DATA statement used (total process time):
      real time           4.48 seconds
      user cpu time       6.02 seconds
      system cpu time     1.17 seconds
      memory              158521.56k
      OS Memory           186072.00k
      Timestamp           25/09/2015 11.01.18 f.m.
By clause via Index
Options msglevel=I;
DATA c ;
  SET a ;
  BY b ;
RUN;
INFO: Index b selected for BY clause processing.
NOTE: There were 10.000.000 observations read from the data set WORK.A.
NOTE: The data set WORK.C has 10.000.000 observations and 3 variables.
NOTE: DATA statement used (total process time):
      real time           4:38.62
      user cpu time       57.12 seconds
      system cpu time     3:18.77
      memory              641.37k
      OS Memory           28908.00k
      Timestamp           25/09/2015 01.03.23 e.m.
Sorted table
proc sort data=a out=b ;
  by b ;
run;
NOTE: There were 10.000.000 observations read from the data set WORK.A.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.B has 10.000.000 observations and 3 variables.
NOTE: PROCEDURE SORT used (total process time):
      real time           7.73 seconds
      user cpu time       6.49 seconds
      system cpu time     1.84 seconds
      memory              156390.64k
      OS Memory           183676.00k
      Timestamp           25/09/2015 11.01.26 f.m.
Sorting and standard BY
Data c ;
  set b ;
  by b;
run ;
NOTE: There were 10.000.000 observations read from the data set WORK.B.
NOTE: The data set WORK.C has 10.000.000 observations and 3 variables.
NOTE: DATA statement used (total process time):
      real time           11.40 seconds
      user cpu time       1.15 seconds
      system cpu time     1.49 seconds
      memory              630.53k
      OS Memory           28908.00k
      Timestamp           25/09/2015 01.08.27 e.m.
Result
INFO: Index b selected for BY clause processing
                          00:04:38.62
Proc sort + data extract
  Proc sort               00:00:07.73
  Data step with BY       00:00:11.40
                          ==========
  Total                   00:00:19.13
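The comparison above can be reproduced as a small calculation. The sketch below only restates the real times from the logs in this deck in seconds; by this arithmetic the index-driven BY step took roughly 14.6 times as long as sorting first. The variable names are made up for the illustration.

```sas
data _null_;
  index_by   = 4*60 + 38.62;   /* 00:04:38.62 expressed in seconds        */
  sort_first = 7.73 + 11.40;   /* PROC SORT plus the DATA step with BY    */
  ratio      = index_by / sort_first;
  put index_by= sort_first= ratio= 6.2;
run;
```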
Sorting: the data set metadata flag
• "Sorted by" is set by SAS when it sorts
Proc sort data=master ; by kundenummer dato ; run;
• "Sorted by" can be set by the program
Data master(sortedby=(Kundenummer dato) );
  Merge master transactions ;
  By kundenummer dato ;
Run;
Unsorted data
Proc contents data=A ; run;
Data Set Name          WORK.A
Observations           10000000
Member Type            DATA
Variables              3
Engine                 V9
Indexes                1
Created                25/09/2015 11:04:00
Observation Length     120
Last Modified          25/09/2015 11:04:00
Deleted Observations   0
Protection
Compressed             NO
Data Set Type
Sorted                 NO
Label
Data Representation    WINDOWS_64
Encoding               wlatin1 Western (Windows)
Validated sort
proc sort data=a out=b(INDEX=(B)) ; by b ;
run;
Proc contents data=B ; run;
Data Set Name          WORK.B
Observations           10000000
Member Type            DATA
Variables              3
Engine                 V9
Indexes                0
Created                25/09/2015 10:37:57
Observation Length     120
Last Modified          25/09/2015 10:37:57
Deleted Observations   0
Protection
Compressed             NO
Data Set Type
Sorted                 YES
Label
Data Representation    WINDOWS_64
Encoding               wlatin1 Western (Windows)

Sort Information
Sortedby               b
Validated              YES
Character Set          ANSI
Unvalidated sort
Data D(sortedby=b);
  SET B ;
  BY B ;
RUN;
Proc contents data=D ; run ;
Data Set Name          WORK.D
Observations           10000000
Member Type            DATA
Variables              3
Engine                 V9
Indexes                0
Created                25/09/2015 14:46:23
Observation Length     120
Last Modified          25/09/2015 14:46:23
Deleted Observations   0
Protection
Compressed             NO
Data Set Type
Sorted                 YES
Label
Data Representation    WINDOWS_64
Encoding               wlatin1 Western (Windows)

Sort Information
Sortedby               b
Validated              NO
Character Set          ANSI
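The same sort information can be read programmatically instead of from a PROC CONTENTS listing. A minimal sketch, assuming the table WORK.D from the slide above exists; as far as I know the ATTRC function reports SORTLVL as WEAK when the flag was set by the user (for example with SORTEDBY=) and STRONG when SAS established the order itself, but treat that mapping as an assumption to verify in your release.

```sas
data _null_;
  dsid = open('work.d');                  /* open the table for inspection   */
  if dsid > 0 then do;
    sortedby = attrc(dsid, 'SORTEDBY');   /* BY variables, blank if unsorted */
    sortlvl  = attrc(dsid, 'SORTLVL');    /* how the sort flag was set       */
    put sortedby= sortlvl=;
    rc = close(dsid);
  end;
  else put 'Could not open WORK.D';
run;
```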
Checking a non-validated sortedby=
• Validating the sort order:

Data _null_ ;
  Set master ;
  By kundenummer dato ;
Run;

Log when the data is correctly sorted:
NOTE: There were 10.000 observations read from the data set WORK.A.
NOTE: DATA statement used (Total process time):

Log when the data is not correctly sorted:
ERROR: BY variables are not properly sorted on data set WORK.A.
a=1 b=0.3620147008 FIRST.b=1 LAST.b=1 _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set WORK.A.

options sortvalidate;
Proc sort data=master ;
  By kundenummer dato ;
Run;
Options nosortvalidate;

Log when the data is correctly sorted:
NOTE: Sort order of input data set has been verified.
NOTE: There were 10.000 observations read from the data set WORK.A.
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):

Log when the data is not correctly sorted *):
NOTE: Input data set is not in sorted order.
NOTE: There were 10.000 observations read from the data set WORK.A.
NOTE: SAS sort was used.
NOTE: The data set WORK.A has 10.000 observations and 2 variables.
NOTE: PROCEDURE SORT used (total process time):

*) The sort is now validated and the sortedby flag has been changed
Sorts
• Are good to have
• Require care and consideration
• Some problems are solved most easily by sorting
• The best place to look for performance improvements
• Minimize the number of sorts
• Avoid sorting and re-sorting
• Use the memory
• Avoid "INFO: Index b selected for BY clause processing"
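One possible way to act on "avoid sorting and re-sorting" is to let PROC SORT check the order before doing any work. This is a sketch under the assumption that the PRESORTED option of PROC SORT is available in your SAS release; the table master and its BY variables are reused from the earlier slides as placeholders.

```sas
/* PRESORTED asks PROC SORT to verify the order first and to skip the
   sort when the observations are already in BY order */
proc sort data=master presorted;
  by kundenummer dato;
run;
```

If the data turns out not to be in order, the procedure sorts it as usual, so the step is safe to leave in a job that normally receives presorted input.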
Why not sorting from A to Å?

proc sort data=navne out=navne1 ;
  by navn;
run ;

proc sort data=navne out=navne2 danish ;
  by navn;
run ;

navne1 (default)    navne2 (danish)
Navn                Navn
a.txt               a.txt
aa.txt              aa.txt
b.txt               b.txt
bb.txt              bb.txt
c.txt               c.txt
cc.txt              cc.txt
Æ.txt               æ.txt
å.txt               Æ.txt
æ.txt               ø.txt
ø.txt               å.txt

proc sort data=navne out=navne3 sortseq=linguistic ;
  by navn;
run ;

navne3 (sortseq=linguistic)
Navn
a.Txt
b.txt
bb.Txt
c.Txt
cc.txt
Æ.txt
æ.txt
ø.txt
å.txt
aa.txt
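The linguistic sort above relies on the locale of the session. A hedged sketch of how the locale could be stated explicitly, assuming the LOCALE= suboption of SORTSEQ=LINGUISTIC and a session encoding that can represent æ, ø and å; the small input table is made up for the illustration and is not the presenter's data.

```sas
/* Hypothetical test data: a few file names with Danish letters */
data navne;
  length navn $20;
  input navn $;
  datalines;
b.txt
å.txt
aa.txt
a.txt
ø.txt
æ.txt
;
run;

/* Ask for Danish collation explicitly instead of relying on the session */
proc sort data=navne out=navne_da sortseq=linguistic(locale=da_DK);
  by navn;
run;

proc print data=navne_da noobs;
run;
```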
Questions
?
ABCDEFHGIJKLMNOPQRSTUVWXYZ