proc sort
Sorting from A to Z
Henrik Dorf, Chief Consultant (Chefkonsulent), SAS Institute
07-10-2015

Why not sorting from A to Å?
• It is too hard…

So not A to Å
• And not much about NLS

Sorting: why sort?
• Because order is good
• And efficient
• A little card trick:

Faster access to data
Because the data is stored together by
• Date
• Departments
• Product groups
• ….
This makes it faster to read data from disk, because the reads do not have to go to as many different places. (Not an IKEA warehouse.)

What happens when data is sorted?
• When data is sorted, it is put in order in a work area in the computer's memory

What happens?
• With larger data volumes, the work area becomes too small
• So a work area is created on disk: utility files

What happens?
• Utility files are clever, but what do they mean for performance?
• More disk I/O: it takes more time, but can I show it?
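The spill from memory to utility files described above can be made visible in the SAS log. The following is a minimal sketch, not taken from the slides: the table WORK.DEMO and its variables are hypothetical, and it assumes a SAS 9.x release where PROC SORT accepts the DETAILS option and the threaded sort is in use.

```sas
/* Show where utility files are written and how much memory PROC SORT may use. */
proc options option=utilloc;
run;
proc options option=sortsize;
run;

/* FULLSTIMER adds memory and CPU figures to each step's timing note in the log. */
options fullstimer;

/* Hypothetical test table: one million rows with a random sort key. */
data work.demo;
   length pad $100;
   do i = 1 to 1000000;
      key = ranuni(0);
      pad = repeat("A", 99);   /* "A" plus 99 repeats = 100 characters */
      output;
   end;
run;

/* DETAILS asks PROC SORT to report whether the sort ran in memory
   and, if not, how many utility files it needed. */
proc sort data=work.demo out=work.demo_sorted details;
   by key;
run;
```

Lowering SORTSIZE before the PROC SORT step (for example `options sortsize=64M;`) and rerunning should, under the same assumptions, change what the DETAILS messages report.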

SAS session and work areas
D:\WORK\_TD7160_KOHDOW72_
SAS jobs cannot measure themselves
Local Work

Workset Monitor
• A "measuring app" that keeps an eye on selected resources
• Is started from the job
• Finishes and delivers statistics
• SAS can deliver the analysis …naturally

Workset Monitor
[Diagram labels: SAS Work; SAS utility folder; Input SAS table; Output SAS table; Workset Monitor; Current size (bytes); Max size (bytes); Stat output CSV]

Workset Monitor - output

The analysis

Example of a measurement

Better use of memory

Elapsed times

Memory must be used
• Computer memory (16-96 GB)
[Diagram label: 16 GB]

SAS options
   proc options group=memory;
   run;

   Group=MEMORY
   SORTSIZE=1073741824    Specifies the amount of memory that is available to the SORT procedure.
   SUMSIZE=0              Specifies a limit on the amount of memory that is available for data summarization procedures, when class variables are active.
   MAXMEMQUERY=0          Specifies the maximum amount of memory that is allocated for procedures.
   MEMBLKSZ=16777216      Specifies the memory block size for Windows memory-based libraries.
   MEMMAXSZ=2147483648    Specifies the maximum amount of memory to allocate for using memory-based libraries.
   LOADMEMSIZE=0          Specifies a suggested amount of memory that is needed for executable programs loaded by SAS.
   MEMSIZE=2147483648     Specifies the limit on the amount of virtual memory that can be used during a SAS session.
   REALMEMSIZE=0          Specifies the amount of real memory SAS can expect to allocate.

Memory
• Computer memory
[Diagram labels: 16 GB; Memsize=1G; Memsize=2G; Memsize=0]

Memory
• Computer memory
[Diagram labels: 16 GB; Sort memory; Memsize=2G; Sortsize=1G]

Memory
• Computer memory
[Diagram labels: 16 GB; Sort memory; Memsize=6G; Sortsize=4G]

What happens?
MEMSIZE
SAS does not automatically reserve or allocate the amount of memory that you specify in the MEMSIZE system option. SAS uses only as much memory as it needs to complete a process.
For example, a DATA step might require only 20M of memory, so even though MEMSIZE is set to 500M, SAS uses only 20M of memory.

SORTSIZE
The SORTSIZE system option can reduce the amount of swapping SAS must do to sort the data set. If PROC SORT needs more memory than you specify, it creates a temporary utility file in your SAS work directory in which to store the data. The SORT procedure's algorithm can swap unneeded data more efficiently than Windows can.

Memory: shared between many jobs
• Computer memory
[Diagram labels: 16 GB; Sort memory; Memsize=6G; Sortsize=5G]

Memory options
• Memsize 4G must be set before SAS starts
• Sortsize 3G can be set during the job

Possible alternatives to a lot of memory
• Proc SQL, order by:
• Proc sort data=a out=b TAGSORT; By …..
• Proc sort via index

SQL Order by
SQL: 25 seconds

Alternatives
• Tagsort: saves space in the utility area

Index sort
• If a table is indexed on the (single) variable it is to be sorted by, the index can be used to read the data in sorted order without performing an actual sort.

Sort via index: the base table with an index
   data a(INDEX=(B));
      length c $100;
      do i=1 to 10000000;
         b=ranuni(0);
         c=repeat("A",100);
         output;
      end;
   run;

   NOTE: The data set WORK.A has 10.000.000 observations and 3 variables.
   INFO: Multiple concurrent threads will be used to create the index.
   NOTE: Simple index B has been defined.
   NOTE: DATA statement used (total process time):
         real time           4.48 seconds
         user cpu time       6.02 seconds
         system cpu time     1.17 seconds
         memory              158521.56k
         OS Memory           186072.00k
         Timestamp           25/09/2015 11.01.18 f.m.

BY clause via index
   options msglevel=I;
   DATA c;
      SET a;
      BY b;
   RUN;

   INFO: Index b selected for BY clause processing.
   NOTE: There were 10.000.000 observations read from the data set WORK.A.
   NOTE: The data set WORK.C has 10.000.000 observations and 3 variables.
   NOTE: DATA statement used (total process time):
         real time           4:38.62
         user cpu time       57.12 seconds
         system cpu time     3:18.77
         memory              641.37k
         OS Memory           28908.00k
         Timestamp           25/09/2015 01.03.23 e.m.

Sorted table
   proc sort data=a out=b;
      by b;
   run;

   NOTE: There were 10.000.000 observations read from the data set WORK.A.
   NOTE: SAS threaded sort was used.
   NOTE: The data set WORK.B has 10.000.000 observations and 3 variables.
   NOTE: PROCEDURE SORT used (total process time):
         real time           7.73 seconds
         user cpu time       6.49 seconds
         system cpu time     1.84 seconds
         memory              156390.64k
         OS Memory           183676.00k
         Timestamp           25/09/2015 11.01.26 f.m.

Sorting and standard BY
   data c;
      set b;
      by b;
   run;

   NOTE: There were 10.000.000 observations read from the data set WORK.B.
   NOTE: The data set WORK.C has 10.000.000 observations and 3 variables.
   NOTE: DATA statement used (total process time):
         real time           11.40 seconds
         user cpu time       1.15 seconds
         system cpu time     1.49 seconds
         memory              630.53k
         OS Memory           28908.00k
         Timestamp           25/09/2015 01.08.27 e.m.
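
The INFO line about index selection only appears in the log when msglevel=I is set. As a hedged sketch that is not part of the slides, one way to check beforehand whether a BY statement could end up being served from an index is to list the indexes defined on the table. It assumes the table WORK.A from the slides and uses the DICTIONARY.INDEXES view in PROC SQL.

```sas
/* List the indexes defined on WORK.A.
   Library and member names are stored in uppercase in the dictionary views. */
proc sql;
   select libname, memname, name, indxname
      from dictionary.indexes
      where libname = 'WORK' and memname = 'A';
quit;

/* PROC CONTENTS shows the index count and an index listing for the same table. */
proc contents data=work.a;
run;
```

If the query returns a row for the BY variable and the table is not flagged as sorted, the slow index path measured above is a possibility worth ruling out.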

Result
   INFO: Index b selected for BY clause processing    00:04:38.62

   Proc sort + data extract                           00:00:07.73
                                                      00:00:11.40
                                                      ==========
                                                      00:00:19.13

Sorting: the data set's metadata flag
• "Sorted by" is set by SAS when sorting
   Proc sort data=master ;
      by kundenummer dato ;
   run;
• "Sorted by" can be set by the program
   Data master(sortedby=(Kundenummer dato) );
      Merge master transactions ;
      By kundenummer dato ;
   Run;

Unsorted data
   Proc contents data=A ;
   run;

   Data Set Name        WORK.A                      Observations          10000000
   Member Type          DATA                        Variables             3
   Engine               V9                          Indexes               1
   Created              25/09/2015 11:04:00         Observation Length    120
   Last Modified        25/09/2015 11:04:00         Deleted Observations  0
   Protection                                       Compressed            NO
   Data Set Type                                    Sorted                NO
   Label
   Data Representation  WINDOWS_64
   Encoding             wlatin1 Western (Windows)

Validated sort
   proc sort data=a out=b(INDEX=(B)) ;
      by b ;
   run;
   Proc contents data=B ;
   run;

   Data Set Name        WORK.B                      Observations          10000000
   Member Type          DATA                        Variables             3
   Engine               V9                          Indexes               0
   Created              25/09/2015 10:37:57         Observation Length    120
   Last Modified        25/09/2015 10:37:57         Deleted Observations  0
   Protection                                       Compressed            NO
   Data Set Type                                    Sorted                YES
   Label
   Data Representation  WINDOWS_64
   Encoding             wlatin1 Western (Windows)

   Sort Information
   Sortedby             b
   Validated            YES
   Character Set        ANSI

Unvalidated sort
   Data D(sortedby=b);
      SET B ;
      BY B ;
   RUN;
   Proc contents data=D ;
   run ;

   Data Set Name        WORK.D                      Observations          10000000
   Member Type          DATA                        Variables             3
   Engine               V9                          Indexes               0
   Created              25/09/2015 14:46:23         Observation Length    120
   Last Modified        25/09/2015 14:46:23         Deleted Observations  0
   Protection                                       Compressed            NO
   Data Set Type                                    Sorted                YES
   Label
   Data Representation  WINDOWS_64
   Encoding             wlatin1 Western (Windows)

   Sort Information
   Sortedby             b
   Validated            NO
   Character Set        ANSI

Checking a non-validated sortedby=
• Validating the sort:
   Data _null_ ;
      Set master ;
      By kundenummer dato ;
   Run;

   options sortvalidate;
   Proc sort data=master ;
      By kundenummer dato ;
   Run;
   Options nosortvalidate;

   NOTE: There were 10.000 observations read from the data set WORK.A.
   NOTE: DATA statement used (Total process time):

   ERROR: BY variables are not properly sorted on data set WORK.A.
   a=1 b=0.3620147008 FIRST.b=1 LAST.b=1 _ERROR_=1 _N_=1
   NOTE: The SAS System stopped processing this step because of errors.
   NOTE: There were 2 observations read from the data set WORK.A.

   NOTE: Sort order of input data set has been verified.
   NOTE: There were 10.000 observations read from the data set WORK.A.
   NOTE: Input data set is already sorted, no sorting done.
   NOTE: PROCEDURE SORT used (Total process time):

   NOTE: Input data set is not in sorted order.
   NOTE: There were 10.000 observations read from the data set WORK.A.
   NOTE: SAS sort was used.
   NOTE: The data set WORK.A has 10.000 observations and 2 variables.
   NOTE: PROCEDURE SORT used (total process time):

   *) The sort is now validated and the sortedby flag has been changed

Sorts
• Are good to have
• Require care and thought
• Some problems are solved most easily by sorting
• The best place to consider performance improvements
   • Minimize the number of sorts
   • Avoid sorting and re-sorting
   • Use the memory
   • Avoid "INFO: Index b selected for BY clause processing"

Why not sorting from A to Å?
   proc sort data=navne out=navne1 ;
      by navn;
   run ;

   proc sort data=navne out=navne2 danish ;
      by navn;
   run ;

   navne1 (Navn)    navne2 (Navn)
   a.txt            a.txt
   aa.txt           aa.txt
   b.txt            b.txt
   bb.txt           bb.txt
   c.txt            c.txt
   cc.txt           cc.txt
   Æ.txt            æ.txt
   å.txt            Æ.txt
   æ.txt            ø.txt
   ø.txt            å.txt

   proc sort data=navne out=navne3 sortseq=linguistic ;
      by navn;
   run ;

   navne3 (Navn)
   a.Txt
   b.txt
   bb.Txt
   c.Txt
   cc.txt
   Æ.txt
   æ.txt
   ø.txt
   å.txt
   aa.txt

Questions?
ABCDEFHGIJKLMNOPQRSTUVWXYZ