HOW TO USE SAS® SOFTWARE EFFECTIVELY ON LARGE FILES
Juliana M. Ma, University of North Carolina
INTRODUCTION

The intent of this tutorial was to present basic principles worth remembering when working with SAS software and large files. The paper varies slightly from the original tutorial to compensate for format differences. The methods discussed are directed toward project managers, data managers, programmers, and other users who have data files that must be processed carefully because of size. The objective is to present ideas for improving efficiency in projects using the SAS system on large files.

The topics include:

• documentation of projects,
• data storage considerations,
• efficient programming techniques.

Examples with actual SAS statements show how the principles can be applied. Basic knowledge of base SAS software is assumed. The emphasis is on batch processing in a mainframe computing environment, e.g., under OS.

What is a Large File?

A basic characteristic of a "large" file is that you do not want to process it unnecessarily. When used alone, file refers to either a raw file or a SAS dataset. Processing a large file is generally expensive and only done after careful program testing. Any production run, which processes an entire file, becomes a significant event. Common avoidable reasons for wasted production runs are programming mistakes, programming oversights, or slight changes in the original program request.

In a mainframe computing environment a file with 10,000 to 1,000,000 records may be considered large. The number of variables is obviously another factor in determining a file's size category. Batch processing, overnight processing, and tape storage are other indications that you are dealing with large files.

In computing environments using microcomputers or minicomputers, the definition of "large" clearly depends on the computer system's capacity. The principles presented here remain useful with appropriate adjustments to the specific techniques.

PROJECT MANAGEMENT

The importance of project management is magnified when large files are involved. The first principle is to require advance planning for all processes and programs. Just a few of the advantages of proper planning are:

• efficient use of resources,
• easy progress tracking,
• cooperative programming,
• reusable programs,
• reusable databases.

Planning should begin well before a project requires any programming. Early planning ensures that computing resources, both machinery and personnel, are used efficiently. A simple example of poor planning is a project for which programming must begin, but no arrangement has been made for access to the mainframe computer through an appropriate account.

During a project, planning results in better management control. Project management is simpler when a structure for reporting project progress is well defined. If you know the current status of a project, then making a modification is usually easier. Cooperative programming, in which one programmer continues a task begun by another, benefits from planned documentation of programs.

Programs and data sets are more useful after a project is completed when project wrap-up is an integral part of project planning. For instance, a program may contain programming techniques applicable to another situation, but without documentation such a program may be lost or too difficult to use effectively. Datasets are difficult to reuse effectively without thorough documentation from the original project.
Project Documentation

Documentation is the key to simplifying the job of project management. Some of the basic forms for documentation are:

• flowcharts,
• comments and TITLEs in programs,
• codebooks,
• programmer notes.

Flowcharts are still the best way to follow complex program logic, or the flow of data through many processing stages. Program flowcharts are particularly appropriate for complex system programs that require maintenance. The CONTENTS procedure with the HISTORY option provides one way of reviewing the background of a dataset, but the graphical format of a flowchart is easier to understand quickly.

An effective way to accomplish program documentation is to encourage the creation of self-documenting programs. The modular structure of the SAS system, in which programs have separate DATA and PROC steps, provides a good foundation on which to build. A self-documenting program includes blank lines, indentation, TITLE statements, labels, descriptive variable names, and comments. Any programmer finds it easier to modify a program with built-in documentation; even the original programmer benefits when changes are required after a time lapse (i.e., a month or more). Comments can apply to program logic, version changes, basically anything without an obvious SAS statement.

Any permanent file must have a codebook with detailed information about the values of each variable. A codebook for a SAS dataset may use output from the CONTENTS and FMTLIB procedures, especially if all variables are associated with permanent formats. New permanent variables should be added to a project's codebook as they are created.

Along with the formal documentation required for a project, programmers should be encouraged to keep their own notes. Plan to incorporate these notes into the archived documentation at the end of a project.

Naming Conventions

A well organized project has pre-defined naming conventions that are readily apparent. Naming conventions make it easier to understand any project or program. You should establish rules and guidelines for the following:

• program names,
• dataset names,
• variable names,
• format names,
• value labels.

The name of a program should indicate which project it belongs to, as well as its place in the program sequence. Even small programs should adhere to the conventions since any program has the potential to grow beyond original expectations. Whether you choose to use descriptive names (e.g., NC84REP, KIDTAB) or sequential names (e.g., STEP1, STEP2) is relatively unimportant as long as the choice is deliberate. (See Muller, Smith, and Christiansen, SUGI '81.)

Names need to be chosen carefully for all types of files. Raw files, SAS datasets, SAS databases, and files of program statements should each follow prescribed guidelines. In particular, the name of a file should immediately indicate the file type. Planned file names make it easier to locate a specific file, clearly associate files with a particular project, and speed the cleanup of unnecessary files at the end of a project.

As with program names, define variable names and format names deliberately. In general, descriptive names simplify program development. Although sequential names (VAR1, VAR2, etc.) can be used in variable lists (VAR1-VAR16), remembering that driver's age is VAR3 is more difficult than remembering DR_AGE. Choose format names that have natural associations with their variables. For instance, use AGEF as the format name for DR_AGE.
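The codebook and format-naming ideas above can be sketched in a few statements; the AGEF value ranges below are hypothetical, chosen only to illustrate the convention:

```sas
/* Hypothetical sketch: a format named for its variable (AGEF for
   DR_AGE), plus CONTENTS output for the codebook                  */
PROC FORMAT ;
   VALUE AGEF                       /* format for driver's age, DR_AGE */
      LOW-15  = 'UNDER 16'
      16-20   = '16 TO 20'
      21-HIGH = 'OVER 20' ;

PROC CONTENTS DATA = ddtape.name ;  /* variable attributes for codebook */
```

Printed format descriptions (via the FMTLIB facility named in the procedure list later in this paper) and the CONTENTS listing together document both the stored values and their labels.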
Data Storage

The method used for data storage is more critical with large files. You should understand the advantages and disadvantages of using raw files versus SAS datasets. In some cases the relatively high cost of converting a large file into a SAS dataset is quickly recouped in later processing. Using a SAS dataset, with a SET statement, eliminates certain sources of programming errors because SAS variables can have permanent labels, formats, and storage lengths. However, if a large raw file will only be used infrequently, or processing will always involve selecting a very small subset, then using an INPUT statement may be the best method. Another consideration is that concatenating files stored on many separate tapes may be simpler with raw files.

When a permanent SAS dataset is created, consider whether or not permanent formats are desirable. A format is permanently associated with a variable when a FORMAT statement is placed in a DATA step. All procedures then display formatted values unless an overriding FORMAT statement is included. A compromise is to only associate permanent formats with key variables. One result of including a permanent format is that a user cannot even list observations without providing a format library or an appropriate PROC FORMAT. On the other hand, that means only a well informed user has ready access to the data.

The storage space used by the SAS system for numeric variables is a very important aspect to understand when creating large SAS datasets. The LENGTH statement gives you control over the storage space used. Remember that, without a LENGTH statement, a raw two-digit value can become a SAS value that takes four times the storage space. Always use the shortest length possible for efficiency; when in doubt, use a conservative length to avoid truncation problems. Storing numeric identifiers (e.g., eight-digit case numbers) as character variables eliminates the possibility of numeric conversion problems.

PROGRAMMING TECHNIQUES

Although a complete discussion of large file programming techniques is impossible in this tutorial, some examples are instructive. An important point to remember is that processing time becomes a factor with large files. In contrast, techniques that decrease the time spent programming and testing may be more appropriate for projects with smaller files.

Efficient Programming

Efficient programming techniques that decrease processing costs are especially important for large file processing. Three techniques to consider are:

• Select records as soon as possible.
• Select variables when possible.
• Use intermediate working files.

A variety of methods exist for selecting records in a DATA step. The difference in execution logic between using a subsetting IF and IF condition THEN OUTPUT is worth understanding. DATA step statements following a subsetting IF (or DELETE) are executed selectively because of an implicit RETURN. Statements following IF ... THEN OUTPUT are executed regardless of whether or not a record satisfies the condition. For processing efficiency, place appropriate subsetting IF statements before other executable statements when creating a subset from a large file.

In some situations, limiting the number of variables is a way to improve program performance. However, do not eliminate variables unless you are positive that they have no potential uses for related tasks. When a SAS dataset is used as input, consider including a dataset DROP/KEEP option in the SET, MERGE, or UPDATE statement.

Intermediate working files are helpful, and occasionally unavoidable, with large file processing. (See Fischell and Hamilton, SUGI '87.) These files are usually stored as permanent SAS datasets during a project, but not retained after project completion. One type of working file is a permanent sample dataset; probability or systematic samples are easily created using a DATA step. Alternatively, summary statistics can be generated and used as input for analyses. In Example 2, PROC SUMMARY is used as an intermediate step for producing PROC FREQ tables.
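The difference between the two record-selection forms can be sketched as follows; this is a minimal illustration using the paper's ddtape.name dataset, with comments marking where further statements would go:

```sas
/* Form 1: subsetting IF -- an implicit RETURN means statements
   after the IF execute only for records meeting the condition    */
DATA CARS1 ;
   SET ddtape.name ;
   IF VEHTYPE = 1 OR VEHTYPE = 2 ;
   /* ...expensive derivations here run for cars only...          */

/* Form 2: IF...THEN OUTPUT -- statements after the IF still
   execute for every record, whether or not it was output         */
DATA CARS2 ;
   SET ddtape.name ;
   IF VEHTYPE = 1 OR VEHTYPE = 2 THEN OUTPUT ;
   /* ...statements here also run for trucks, buses, etc. ...     */
```

With a large file, Form 1 avoids wasted work on the records that will never reach the output dataset.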
Program Testing

Even an efficient program must be tested carefully before a production run. The SAS system provides many ways of simplifying the testing process. For less complex programs, a test run with the general OBS= option may be sufficient. Regardless of how you create an appropriate test file, check intermediate results with procedures such as CONTENTS and PRINT. Ideally, statements used for testing should be easy to identify, and change, or should not require any changes before a production run. Changing test statements to comments, which remain in the production run, is a helpful documenting technique.

APPLICATIONS

The examples of programs for large file processing are based on motor vehicle accident files. One input file has about 300,000 records for one calendar year. The data for a single year is stored on a separate full-size magnetic tape (2400 feet at density 6250). Each raw record is about 300 characters long. One record has information about one accident-involved unit; a unit is a motor vehicle (car, truck, bus, motorcycle), a pedestrian, or a bicyclist.

Techniques are emphasized that relate to efficiently creating a subset from a large file. The following elements are discussed:

• test statements,
• meaningful SAS dataset names,
• use of the dataset LABEL option,
• efficient record selection,
• intelligent variable selection,
• definition of variable attributes,
• use of PROCs to check results,
• use of comments,
• creation of working files.

Example 1

This program, shown in Figure 1, is a prototype for extracting information from raw records. We want to create a permanent SAS dataset of selected variables for a specific analysis task. Our interest is only in cars, which make up seventy-five percent of the records. The analysis dataset will become input for a series of programs to investigate seat belt usage of car drivers.

The program is meant to be self-documenting. For an actual run, the SAS dataset name would be chosen to be as informative as possible, e.g., NC86_CAR. Note that variable names are descriptive. Comments and labels are included; blank lines and indentation make the program easier to understand. The procedures included to check DATA step results have appropriate TITLEs associated with them. For instance, a descriptive title could indicate that the dataset contained records for cars in accidents that occurred in 1986.

Testing is accomplished by using the OBS= option. In this fashion a separate test file is unnecessary. All references to the actual raw file are automatically verified during test runs. For example, the tape containing the complete file is mounted and used, but only 1000 records are processed. Converting the OPTIONS statement to a comment provides documentation about the number of records required to test the program adequately.

The DATA statement includes several components for which naming conventions should be followed. The first level of the SAS dataset name, ddtape, corresponds to an appropriate DDNAME in the IBM® Job Control Language associated with the SAS statements. The second level, sasname, is permanent and limited to eight alphanumeric characters. The dataset LABEL option allows better documentation since its limit is forty characters.

The first INPUT statement (for VEHTYPE) uses a trailing @ to efficiently select car records only. Since a subsetting IF statement is used, no further DATA step statements are executed for non-car records. The relatively expensive process of converting raw data is only done for the records of interest. The
Example 1
Creating a SAS Dataset From a Large Raw File
Figure 1

*OPTIONS OBS=1000 ;                    /* for testing */

DATA ddtape.name (LABEL='description') ;
   INFILE indd ;
   INPUT VEHTYPE 12-13 @ ;
   IF VEHTYPE = 1 OR VEHTYPE = 2 ;     /* CARS */
   DROP VEHTYPE ;
   INPUT
      TOWN     3-5
      MONTH    12-13
      ACCYEAR  14-15
      RDCLASS  30
      other variables (1 or 2 digit fields)
      DRINJ    214
      DRBELT   215
      ;
   LENGTH TOWN 3 DEFAULT=2 ;
   LABEL
      MONTH   = 'MONTH OF ACCIDENT'
      ACCYEAR = 'YEAR OF ACCIDENT'
      RDCLASS = 'ROAD CLASS'
      DRINJ   = 'DRIVER INJURY'
      DRBELT  = 'DRIVER SEAT BELT USAGE' ;
   OUTPUT ;                            /* this is an optional statement */
   RETURN ;                            /* this is an optional statement */

PROC CONTENTS NOSOURCE ;
   TITLE 'description' ;
PROC PRINT DATA=ddtape.name (OBS=10) ;
PROC FREQ DATA=ddtape.name (OBS=1000) ;
   TITLE2 'FIRST 1000 OBS ONLY' ;
variable VEHTYPE is not kept in the output dataset since we have no further interest in the vehicle type information. Notice how a simple comment makes it easier to understand that cars are being selected. The second INPUT statement uses column input to read the required variables for car records only. Placing each variable on a separate line and listing variables in the same order as a codebook makes checking the program simpler. Adding another variable is also simplified with this format. Variable names are descriptive and standardized. With this type of extract program, think carefully about which variables are necessary. Including a variable that is never analyzed is usually less expensive than reprocessing a large file because one variable was not included in the extract program.

In this example, specifying DEFAULT=2 on the LENGTH statement is appropriate because only one variable requires a storage length greater than two. However, an explicit length specification is necessary for TOWN. Note that no error messages would be generated if TOWN were given a length of two, even though numeric truncation would result in errors for values greater than 255.

The CONTENTS, PRINT, and FREQ procedures provide simple ways to check the results of INPUT, LENGTH, and LABEL statements. PROC CONTENTS shows attributes of all variables in the output dataset. The NOSOURCE option is appropriate for batch printouts in which program statements, called source code, are printed together with the output from PROC CONTENTS. The PRINT and FREQ procedures can highlight obvious errors in the INPUT or LENGTH statements, e.g., incorrect input column specifications. The dataset OBS option serves to avoid inadvertent problems when the complete large file is processed.

The first TITLE statement applies to the output of all procedures in this program. The second applies only to PROC FREQ and serves to clarify that the frequencies are for program verification only.

Example 2

These programs show two ways of creating simple cross-tabulation tables from a subset of a large SAS dataset. The objective is to generate three tables considering the use of seat belts by car drivers. The input SAS dataset contains 150 variables and we expect it to be sorted by month.

The first program, Figure 2a, gets the job done, but is not self-documenting and does not allow for potential changes in the table specifications. A follow-up request to combine months into quarters would require reprocessing the large dataset or "hand calculations" based on the first tables. The combination of an uninformative dataset name, SUBSET, and a lack of comments makes it difficult to know which records are being selected.

The second program, Figure 2b, strives to be self-documenting and produces a permanent intermediate dataset of summary counts appropriate for producing a variety of tables. A summary dataset provides input for PROC FREQ using the WEIGHT statement, or PROC TABULATE using the FREQ statement, or even a DATA step if recoding of unknown categories to SAS missing values is needed (see Ma and Leininger, SUGI '84). Collapsing categories of any of the CLASS variables is possible using FORMAT statements or DATA step recoding. In addition to providing more flexibility, PROC SUMMARY is more efficient than PROC FREQ (see Sharlin, SUGI '83).

The dataset KEEP option appears in the SET statement of both programs to select only the required variables. This is more efficient than a KEEP statement, which only affects the output dataset.

When MONTH is a CLASS variable, as opposed to a BY variable, the successful completion of a production run depends less on whether the input dataset is sorted by MONTH. With BY processing, records are processed until the first out-of-place record is encountered. This could cause an expensive mistake if a problem occurs at record number 290,000 of 300,000.
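Collapsing CLASS categories with a format, as described above, can be sketched briefly; the format name QTRF and its month-to-quarter groupings are hypothetical:

```sas
/* Hypothetical sketch: answer the "combine months into quarters"
   request from the summary dataset, without reprocessing the
   large file -- PROC FREQ groups levels by formatted value       */
PROC FORMAT ;
   VALUE QTRF  1-3 = 'Q1'  4-6 = 'Q2'  7-9 = 'Q3'  10-12 = 'Q4' ;

PROC FREQ DATA = ddkeep.DRSTAT ;
   FORMAT MONTH QTRF. ;
   TABLES MONTH * DRBELT ;
   WEIGHT _FREQ_ ;
```

Because only the small summary dataset is read, the follow-up request costs a fraction of a second production run.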
Example 2
Simple Tables From A Large SAS Dataset
Figure 2a

DATA SUBSET ;
   SET ddtape.name (KEEP = MONTH VEHTYPE DRINJ DRBELT
                           DRSEX DRAGE) ;
   IF VEHTYPE = 1 OR VEHTYPE = 2 ;
   IF DRINJ = 6 THEN DELETE ;
   IF DRBELT = 5 THEN DRBELT = 0 ;
PROC FREQ DATA = SUBSET ;
   BY MONTH ;
   TABLES (DRINJ DRSEX DRAGE) * DRBELT ;
   TITLE 'description' ;

Figure 2b

*OPTIONS OBS=5000 ;                    /* for testing */

DATA CAR_DR (LABEL = 'description') ;
   SET ddtape.name (KEEP = MONTH VEHTYPE
                           DRINJ DRBELT DRSEX DRAGE) ;
   IF VEHTYPE = 1 OR VEHTYPE = 2 ;     /* CARS ONLY */
   DROP VEHTYPE ;
   IF DRINJ = 6 THEN DELETE ;          /* NO DRIVER PRESENT */
   IF DRBELT = 5 THEN DRBELT = 0 ;     /* COMBINE UNKNOWNS */
   OUTPUT ;
   RETURN ;

PROC CONTENTS NOSOURCE ;
   TITLE 'description' ;
PROC PRINT DATA = CAR_DR (OBS = 20) ;
PROC SUMMARY DATA = CAR_DR NWAY ;
   CLASS MONTH DRINJ DRSEX DRAGE DRBELT ;
   OUTPUT OUT = ddkeep.DRSTAT (LABEL = 'description') ;
PROC PRINT DATA = ddkeep.DRSTAT (OBS = 50) ;
   TITLE2 'SUMMARY STATS' ;
PROC FREQ DATA = ddkeep.DRSTAT ;
   TABLES MONTH--DRBELT ;
   WEIGHT _FREQ_ ;
PROC FREQ DATA = ddkeep.DRSTAT ;
   BY MONTH ;
   TABLES (DRINJ DRSEX DRAGE) * DRBELT ;
   WEIGHT _FREQ_ ;
   TITLE2 'description of requested tables' ;
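The summary dataset produced in Figure 2b can also drive PROC TABULATE through its FREQ statement, as noted in the discussion of Example 2; a minimal hypothetical sketch:

```sas
/* Sketch: driver injury by belt usage from the summary counts;
   _FREQ_ holds the count for each CLASS-variable combination     */
PROC TABULATE DATA = ddkeep.DRSTAT ;
   FREQ _FREQ_ ;              /* weight each observation by its count */
   CLASS DRINJ DRBELT ;
   TABLE DRINJ, DRBELT ;      /* rows: injury; columns: belt usage    */
```

The table layout can be rearranged freely without another pass over the 300,000-record file.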
SAS Procedures to Remember

The following SAS procedures prove useful for data management or checking programs. The list demonstrates the wide variety of utility procedures found in the SAS system.

Using SAS procedures in place of operating system utilities often makes data management tasks easier. The SAS utility procedures usually have clearer syntax than comparable operating system utilities. Information provided by the SAS system is more understandable in some cases, e.g., the output from PROC TAPELABEL.

Procedures are available that simplify the process of checking programming logic and results. As shown in the examples, procedures such as CONTENTS and PRINT provide simple ways to verify intermediate results as well as final output.

COMPARE
CONTENTS
COPY
*DATACHK
DATASETS
*FMTLIB
*ISAM
PDS
PDSCOPY
PRINT
QPRINT (new in Version 5)
RELEASE
SOURCE
TAPECOPY
TAPELABEL

* SUGI Supplemental Library User's Guide, Version 5.

CONCLUSION

The objective of discussing a variety of topics related to processing large files is to encourage a better understanding of the data management and program checking tools in SAS software. The topics covered in this paper are meant to inspire you to explore the power of SAS software, not to provide a comprehensive set of techniques. In the long run, any project with large files is better when you take time to think ahead, and document, at every step.

The Main Objective

Efficient, self-documenting projects

REFERENCES

Clark, S. and Konowal, L. (1986), "Efficient Use of PROC SUMMARY," Proc. of the Eleventh Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 664-669.

Council, K.A. (1980), SAS Applications Guide, 1980 Edition, Chapter 10: Processing Large Data Sets with SAS, Cary, NC: SAS Institute, Inc., 149-157.

Fischell, T.R. and Hamilton, E.G. (1987), "Processing Large Data Sets into Customized Tables," Proc. of the Twelfth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., in press.

Helms, R. (1978), "An Overview of RDM: I. Project and Data Management System Planning," American Statistical Association Proceedings of the Statistical Computing Section, 10-17.

Ma, J.M. and Leininger, C. (1984), "PROC SUMMARY As the Basis for Efficient Analysis of Large Files," Proc. of the Ninth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 309-312.

Ma, J.M. and Fischell, T.R. (1985), "Descriptive Statistics: Using Indicator Variables," Proc. of the Tenth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 1026-1030.

Merlin, R.Z. (1984), "Design Concepts for SAS Applications," Proc. of the Ninth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 283-287.
Muller, K.E., Smith, J., and Bass, J. (1982), "Managing 'not small' Datasets in a Research Environment," Proc. of the Seventh Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 371-376.

Muller, K.E., Smith, J., and Christiansen, D.H. (1981), "Rules We Followed and Wish We Had Followed in Managing Datasets, Programs and Printouts," Proc. of the Sixth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 401-405.

Ramsay, A. (1984), "Keyed Access to SAS Data Sets," Proc. of the Ninth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 322-325.

SAS Institute (1985), SAS® User's Guide: BASICS, Version 5 Edition, Cary, NC: SAS Institute Inc.

SAS Institute (1986), SUGI Supplemental Library User's Guide, Version 5 Edition, Cary, NC: SAS Institute Inc.

Sharlin, J. (1983), "Data Reduction and Summarization," Proc. of the Eighth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 912-919.

ACKNOWLEDGEMENTS

The examples are based on research done at the UNC Highway Safety Research Center, Chapel Hill, NC 27514.

SAS is a registered trademark of SAS Institute Inc., Cary, NC 27511-8000, USA.

IBM is a registered trademark of International Business Machines Corp.