IBM SPSS Statistics Performance Best Practices

Contents
Overview
    Target User
    Introduction
Methods of Problem Diagnosis
    Performance Logging for Statistics Server
    Timing for Backend Procedures
    Benchmarking with a Python Module
Best Practices for Data Preparation
    Preparing data automatically with ADP
        Benefits
        Obtaining ADP
        Note
    SQL Pushback
        Preconditions
        Obtaining SQL Pushback
        Example
        Summary
        Note
Best Practices for Data Transformations
    Grouping the Transformations
        Benefits
        Example
        Summary
    Compiled Transformations
        Preconditions
        Obtaining Compiled Transformations
        Example
Best Practices for Data Analysis
    Cache Compression for Large Datasets
        Benefits
        Obtaining Cache Compression
        Example
    Multithreading
        Preconditions
        Setting
        Example
Working with Output
    Extract What You Need from Large Output
        Benefits
        Obtaining OMS and OUTPUT Commands
        Examples
        Summary
Working with Command Syntax
    Removing Unnecessary EXECUTE Commands
        Benefits
        Examples
Working with SPSS Statistics Server
    Decreasing Data Passing Costs with SPSS Statistics Server
        Benefits
        Testing and Results
        Guidelines for purchasing Statistics Server
    64-bit Computing with Statistics Server
        Benchmarking Test
    Using Multiple Locations for Temporary Files
        Benefits
        How to Set Multiple Temporary File Locations
Conclusion
Trademarks
Overview
Target User
This paper is intended for users of and support specialists for both IBM® SPSS® Statistics Desktop and
IBM® SPSS® Statistics Server. You will find information about optimizing performance and
troubleshooting performance-related issues.
Introduction
SPSS Statistics is comprehensive software for data and statistical analysis. It enables users to quickly look
at their data and includes a wide range of procedures and tests to help users solve complex business and
research challenges. This article provides SPSS Statistics users and support specialists with best practices
for configuration, data preparation, data analysis, and other tasks. These best practices can improve the
efficiency, performance, and optimization of SPSS Statistics.
This article contains the following information:
• Methods for diagnosing problems
• Best practices for data preparation, primarily with Automatic Data Preparation (ADP)
• Best practices for data transformations, including compiled transformations and how to group transformations for best performance
• Best practices for data analysis, including multithreading and cache compression
• Best practices for extracting useful information from large output efficiently
• Best practices for working with syntax
• Best practices for SPSS Statistics Server
For each of the best practices, this article provides detailed background, sample code, and instructions
for running the sample code.
Methods of Problem Diagnosis
To use SPSS Statistics efficiently, you must first identify the problems, especially performance issues. The methods described in this section help you identify which areas may be problematic.
Performance Logging for Statistics Server
If you need to check the performance of SPSS Statistics Server, the IBM® SPSS® Statistics
Administration Console allows you to configure the analytic server software to write performance
information to a log file. The log file provides detailed information about current users, CPU usage, and
RAM usage. For more information about logging, refer to Chapter 4 in the IBM SPSS Statistics Server
Administrator’s Guide.
Timing for Backend Procedures
This method is designed for backend procedures. In this method, the SHOW $VARS command is used to obtain time information. By issuing the command at the beginning and end of a job, you can measure the cost of the job accurately and diagnose the problematic area.
Example
GET FILE = dataset.
SHOW $VARS.
FREQUENCIES VARIABLES= var1 var2.
SHOW $VARS.
FREQUENCIES VARIABLES=var3 var4.
SHOW $VARS.
• The first SHOW $VARS command records the start time of the first FREQUENCIES command.
• The second SHOW $VARS command records the end time of the first FREQUENCIES command and the start time of the second FREQUENCIES command.
• The last SHOW $VARS command records the end time of the second FREQUENCIES command.
You can then calculate the cost of each FREQUENCIES command by subtraction.
Benchmarking with a Python Module
The benchmark Python module helps you to identify inefficient work. It provides classes that measure
various aspects of the SPSS Statistics syntax that is executed on the Microsoft Windows platform. To run
this module, you must do the following.
• Install Python. Note that the required Python version is specific to the SPSS Statistics version and the operating system.
• Download and install the win32com utility from http://sourceforge.net/projects/pywin32.
• Download and install IBM SPSS Statistics – Integration Plug-In for Python, which is installed with IBM SPSS Statistics – Essentials for Python. For more information, refer to the document IBM SPSS Statistics - Essentials for Python: Installation Instructions for Windows.
• Download the benchmark module, which can be found in the SPSS community’s Utilities collection at http://www.ibm.com/developerworks/spssdevcentral. To install this module, read the article “How to Use Downloaded Python Modules,” which is also available in the SPSS community.
After finishing the installation process, open benchmark.py in a text editor or Python development environment and follow the instructions to execute the benchmarking work.
Best Practices for Data Preparation
This section provides best practices for data preparation. The IBM SPSS Statistics Data Preparation option allows you to identify unusual and invalid cases, variables, and data values in your active dataset. It also
allows you to prepare data for modeling.
Preparing data automatically with ADP
Preparing data for analysis is one of the most important steps in any project—and traditionally, one of
the most time consuming. Automated Data Preparation (ADP) handles the task for you, analyzing your
data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving
new attributes when appropriate, and improving performance through intelligent screening techniques.
Benefits
Using ADP enables you to make your data ready for model building quickly and easily, without needing
prior knowledge of the statistical concepts involved. Models will tend to build and score more quickly; in
addition, using ADP improves the robustness of automated modeling processes.
Obtaining ADP
To run ADP automatically, from the menus choose:
• Transform > Prepare Data for Modeling > Automatic...
• Click Run.
Optionally, you can:
• Specify an objective on the Objective tab.
• Specify field assignments on the Fields tab.
• Specify expert settings on the Settings tab.
Note
This article provides only general instructions for using ADP. For more details, read the document IBM
SPSS Statistics Data Preparation released with the product. In particular, refer to the following:
• Chapter 4 provides detailed instructions for running ADP, including background information, user interface operations, and explanations of the settings.
• Chapter 8 provides ADP sample code and examples, including the full process of running ADP. It also builds models using the data before and after preparation so that you can compare the results.
SQL Pushback
SPSS Statistics Server supports the pushback of sorting and aggregation to a SQL database. This ability to
perform sorting and aggregation operations in the SQL database is called SQL Pushback. When large
datasets are sourced from a SQL database, SQL Pushback ensures that operations that can be performed
more efficiently in the database are performed there.
Preconditions
The following preconditions are required for SQL Pushback functionality.
• SPSS Statistics Server
• SPSS Statistics Client used to connect to an SPSS Statistics Server
• A SQL database, such as IBM DB2®, Microsoft SQL Server, or Oracle Database
Obtaining SQL Pushback
SQL Pushback is available only through the graphical user interface. Therefore, you first need to use the SPSS Statistics client to connect to the SPSS Statistics Server. Then complete the following steps.
• From the menus choose File > Open Database > New Query...
• Select the data source.
• If necessary (depending on the data source), select the database file and/or enter a login name, password, and other information.
• Select the table(s) and fields. For OLE DB data sources (available only on Windows operating systems), you can select only one table.
• Specify any relationships between your tables, such as selection criteria.
• If needed, aggregate the data by selecting one or more break variables, aggregated variables, and an aggregate function for each aggregated variable. Otherwise, skip this step.
• Edit variable names and properties.
• If needed, sort the data. Otherwise, press Next to skip this step.
• Run the query or save it.
Example
This example compares the performance of SQL Pushback versus using the SORT procedure with SPSS
Statistics client.
Data File and Configurations
Dataset: Size 1.25 GB, 7.71 million cases, 27 variables
CPU: 1 CPU, Intel T 9400, 2.53 GHz, dual-core processor
RAM: 3 GB
Operating System: Windows XP, 32-bit
IBM SPSS Statistics: Statistics Server 20, Statistics Client 20
Test Results
Sort with SQL Pushback: 77 seconds
Sort with Statistics Client: 289 seconds
Time Saved: 212 seconds (73.35%)
Note: The above result is based on testing done in IBM SPSS laboratories. Although our test
environments simulate typical production environments in the field, we can’t guarantee that
organizations performing similar tests will see identical results. These data are presented for general guidance.
Summary
In this example, executing the sort with SQL Pushback improved performance by up to 73.35%. The improvement may vary depending on configuration, data size, and syntax.
Note
If you are familiar with SQL, you can write the SQL query so that the sorting and aggregation are executed in the database, which yields the same performance improvement as SQL Pushback; a sketch follows.
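As an illustration, the following sketch embeds the aggregation and sorting in the SQL passed to GET DATA, so the database performs that work before the data reach SPSS Statistics. The data source name, connection string, table, and field names are hypothetical; adjust them to your own ODBC configuration.
* Aggregation (GROUP BY) and sorting (ORDER BY) are executed by the database.
GET DATA
  /TYPE=ODBC
  /CONNECT='DSN=SalesDB;UID=analyst;PWD=secret'
  /SQL='SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region'.
EXECUTE.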
Best Practices for Data Transformations
In most situations, the raw data aren’t perfectly suitable for the type of analysis you want to perform.
Preliminary analysis may reveal inconvenient coding schemes or coding errors, and then data
transformations may be required in order to expose the true relationship between variables. You can
perform data transformations ranging from simple tasks, such as collapsing categories for analysis, to
more advanced tasks, such as creating new variables.
This section introduces several best practices for data transformations, which help to use SPSS Statistics
Data Transformations more efficiently.
Grouping the Transformations
Data transformations are usually necessary for data analysis. A typical job alternates between defining data, transforming, analyzing, transforming again, analyzing again, and so on.
As a result, transformation commands are interspersed with analytic procedures, which is inefficient because the data transformations are executed repeatedly. In this situation, you should group the transformations.
Benefits
By grouping the transformation commands, you can execute all the transformation work at one time,
which saves extra interpretation cost for the transformations. In addition, it makes syntax arrangement
clearer and more ordered.
Example
The example executes the sample syntax before and after grouping the transformation work, so that
you can see the difference from the results.
Ungrouped Syntax
Get file="dataset".
COMPUTE testvar1=var1-var2.
IF (testvar1 LT 10 OR testvar1 GT 50) testvar1=20.
FREQUENCIES testvar1.
COMPUTE testvar2=var3.
RECODE testvar2 (1 thru 10=1) (11 thru 30=2) (31 thru 50=3) (51 thru
Highest=4).
FREQUENCIES testvar2.
COMPUTE testvar3=var4.
RECODE testvar3 (SYSMIS=SYSMIS) (Lowest thru 20=1) (21 thru 50=2) (100 thru
Highest=4) (51 thru 100=3).
FREQUENCIES testvar3.
Grouped Syntax
Get file="dataset".
COMPUTE testvar1=var1-var2.
IF (testvar1 LT 10 OR testvar1 GT 50) testvar1=20.
COMPUTE testvar2=var3.
RECODE testvar2 (1 thru 10=1) (11 thru 30=2) (31 thru 50=3) (51 thru
Highest=4).
COMPUTE testvar3=var4.
RECODE testvar3 (SYSMIS=SYSMIS) (Lowest thru 20=1) (21 thru 50=2) (100 thru
Highest=4) (51 thru 100=3).
FREQUENCIES testvar1.
FREQUENCIES testvar2.
FREQUENCIES testvar3.
The syntax creates three test variables (testvar1, testvar2, and testvar3) based on the original variables
(var1, var2, var3, and var4), and then recodes them for the next step of the analysis. The simple FREQUENCIES command is used for demonstration.
Data File and Configurations
Dataset: Size 0.9 GB, 3 million cases, 132 variables
CPU: 1 CPU, Intel T 9400, 2.53 GHz, dual-core processor
RAM: 3 GB
Operating System: Windows XP, 32-bit
IBM SPSS Statistics: Statistics Client 20
Test Results
Ungrouped syntax: 77 seconds
Grouped syntax: 43 seconds
Time saved: 26 seconds (33%).
Note: The above result is based on testing done in IBM SPSS laboratories. Although our test
environments simulate typical production environments in the field, we can’t guarantee that
organizations performing similar tests will see identical results. These data are presented for general guidance.
Summary
In this example, grouping the transformations improved performance by up to 33%. The improvement may vary depending on configuration, data size, and syntax, but it is usually noticeable. Grouping your transformation work is a good practice.
Compiled Transformations
The compiled transformations feature is designed to improve the performance of complex
transformations. When you use compiled transformations, transformation commands (such as
COMPUTE and RECODE) are compiled to machine code at run time for better performance. This feature
works only with SPSS Statistics Server running on Windows Server.
Preconditions
The following preconditions are required for the compiled transformations feature.
• SPSS Statistics Server running on Windows.
• The SPSS Statistics Administration Console for configuring SPSS Statistics Server.
• GNU G++ compiler.
• Because there is an overhead involved in compiling the transformations, you should use compiled transformations only when there are a large number of cases and multiple transformation commands.
Obtaining Compiled Transformations
To run compiled transformations, complete the following steps:
• Have an administrator use the SPSS Statistics Administration Console to turn on the feature and set the correct compiler path. Chart 1 highlights these settings.
Chart 1: Settings for compiled transformations
• Set CMPTRANS to YES in the syntax file.
• Execute the syntax while connected to the SPSS Statistics Server or with the SPSS Statistics Batch Facility.
Note: For compiled transformations to be available, the administrator must turn on compiled transformations in the SPSS Statistics Server settings, and CMPTRANS must be set to YES. If the administrator does not turn on compiled transformations, a warning message is displayed and the command is ignored.
Example
This example runs compiled transformations with different data sizes and complexity levels. It also
provides the test results without compiled transformations for a contrast.
Sample Syntax
INPUT PROGRAM.
LOOP icase = 1 to 1000000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
SET CMPTRANS=ON.
VECTOR x(10).
LOOP jvar = 1 to 10.
COMPUTE x(jvar)=rnd(uniform(10)).
END LOOP.
EXECUTE.

• The above syntax generates a dataset and initializes the variables with the COMPUTE command. The first LOOP command defines the number of cases, and the second LOOP command defines the number of variables.
Configurations
CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating System: Windows 2003 Server, 64-bit
IBM SPSS Statistics: Statistics Server 20
Test Results
The following table summarizes the test results.
Cases Number   Loops       No Compiled Transformations   Compiled Transformations   Time Saved
1,000,000      10 loops    9 seconds                     9 seconds                  0
1,000,000      100 loops   70 seconds                    51 seconds                 27%
5,000,000      10 loops    45 seconds                    36 seconds                 20%
5,000,000      100 loops   349 seconds                   255 seconds                28%
Table 1: Test results of compiled transformations
Note: The above result is based on testing done in IBM SPSS laboratories. Although our test
environments simulate typical production environments in the field, we can’t guarantee that
organizations performing similar tests will see identical results. These data are presented for general guidance.
Summary
Compiled transformations may improve performance when there are a large number of cases and
complex transformation commands.
Best Practices for Data Analysis
This section provides best practices for data analysis. IBM SPSS Statistics is a comprehensive system for
analyzing data. It makes statistical analysis more accessible for the beginner and more convenient for
the experienced user. The best practices introduced here are helpful for analyzing large datasets more
efficiently and improving the parallelization for CPU intensive procedures.
Cache Compression for Large Datasets
When running many procedures on a large dataset, the cost of reading the data increases noticeably, because the application must read the original dataset for each procedure. For data read from a database source, this means that the SQL query must be re-executed for any command or procedure that needs to read the data. Cache compression allows you to avoid this overhead.
Benefits
Creating a data cache eliminates multiple data readings. The CACHE command copies all of the data to a
temporary disk file for subsequent uses of the data. To decrease I/O costs, you can also compress the
temporary data file. Combining CACHE with compression improves efficiency when dealing with large
datasets.
Obtaining Cache Compression
Cache compression works only when you are connected to SPSS Statistics Server. Complete the following steps (a syntax sketch follows the list).
• Have an administrator use the SPSS Statistics Administration Console to turn on the feature. Chart 2 highlights these settings.
Chart 2: Settings for Cache Compression
• Issue an explicit CACHE command before the analytical procedures.
• Set ZCOMPRESSION to YES in the syntax file.
• Execute the syntax while connected to the SPSS Statistics Server or with the SPSS Statistics Batch Facility.
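The following is a minimal syntax sketch of these steps, assuming you are connected to SPSS Statistics Server and the administrator has enabled cache compression; the file and variable names are hypothetical.
GET FILE='largedata.sav'.
* Request compressed temporary files.
SET ZCOMPRESSION=YES.
* Copy the data to a temporary cache on the next data pass.
CACHE.
EXECUTE.
* Subsequent procedures read the cached copy instead of the original source.
FREQUENCIES VARIABLES=var1 var2.
DESCRIPTIVES VARIABLES=var1 var2.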
Example
This example runs several procedures with cache compression on a large dataset and then summarizes the test results.
Configurations
Dataset: Size 1.25 GB, 7.71 million cases, 27 variables
CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating System: Windows 2003 Server, 64-bit
IBM SPSS Statistics: Statistics Server 20
Results
The following table summarizes the test results.
Procedures       No Cache Compression (Seconds)   Cache Compression (Seconds)   Time Saved
CODEBOOK         41.79                            11.77                         71.83%
CORRELATIONS     21.18                            8.42                          60.25%
COXREG           28.09                            16.28                         42.04%
CROSSTABS        21.85                            8.84                          59.54%
CTABLES          25.06                            11.10                         55.71%
EXAMINE          1343.87                          1157.34                       13.88%
GLM              20.72                            8.68                          58.11%
LOGISTIC         37.09                            25.55                         31.11%
NOMREG           29.26                            16.35                         44.12%
OLAP CUBES       30.83                            16.96                         44.98%
T-TEST           20.22                            8.24                          59.25%
TREE             192.09                           164.17                        14.54%
Table 2: Test results for cache compression
As shown in Table 2, the CODEBOOK, CORRELATIONS, CROSSTABS, CTABLES, GLM, and T-TEST procedures improve by more than 50%.
Note: The data shown is based on testing done in IBM SPSS laboratories. Although our test
environments simulate typical production environments in the field, we cannot guarantee that
organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients
(number of CPU cores, RAM, disk speed, etc.).
Multithreading
Multithreading is the technique of breaking a task into multiple subtasks that can be executed in parallel. Not all analytical procedures can take advantage of multithreading; procedures that can be easily parallelized and scheduled to run simultaneously on different CPUs or cores benefit the most. The procedures that are multithreaded in SPSS Statistics are listed in the following table.
Procedure family      Procedure Name
Correlations          Bivariate
                      Partial
Regression            Linear
                      Ordinal
                      Multinomial
                      Logistic
Data Reduction        Factor Analysis
Survival Analysis     Cox Regression
                      Logistic Regression
Multiple Imputation   Impute missing values
Table 3: Multithreaded analytical procedures
Preconditions
To benefit from multithreading, the following preconditions are required.
• The computer on which the procedure is run has multiple processors, or each processor has multiple cores.
• The procedure that is executed is listed in Table 3.
Note: In the SPSS Statistics client, the maximum number of threads is 4. In SPSS Statistics Server, there is no limit on the number of threads.
Setting
By default, SPSS Statistics uses an internal algorithm to determine the number of threads for a particular computer. You can change this setting, but the default often provides the best performance. You can override the default by issuing the command SET THREADS=n, where n indicates the number of threads, often corresponding to the number of CPUs or cores. It is appropriate to use SET THREADS to override the default in the following scenarios (a syntax sketch follows the list).
• The default number of threads is usually equal to the number of processing units. The threads consume CPU resources, which may reduce the processing cycles available to other CPU-intensive applications. In this situation, you can use SET THREADS to limit the number of threads.
• For multithreaded procedures, performance may not improve as the number of threads increases, because the overhead of separating the data, managing the threads, and merging the results also increases (for specific results, refer to Table 4). Therefore, you should find the optimal number of threads and set it with SET THREADS.
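As a minimal sketch, the following syntax caps the number of threads for one multithreaded procedure and then restores the default; the variable names are hypothetical, and the best value depends on your hardware and workload.
* Limit SPSS Statistics to 2 threads to leave CPU capacity for other applications.
SET THREADS=2.
CORRELATIONS /VARIABLES=var1 var2 var3.
* Return to the automatic (default) thread selection.
SET THREADS=AUTO.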
Example
This example provides detailed performance information for multithreaded procedures using different data sizes and different numbers of threads.
Configurations
CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating System: Windows 2003 Server, 64-bit
IBM SPSS Statistics: Statistics Server 20
Results
The following table summarizes the test results.
Multi-threaded        File Size  Case       Variable  Time (sec)  Time (sec)  Time (sec)  Time (sec)  Saved
Procedures            (MB)       Number     Number    2 threads   4 threads   8 threads   16 threads  Time
Discriminant          688        400,000    200       7.56        6.67        6.03        6.34        20.23%
Cscoxreg              2.38       50,000     50        32.12       20.36       15.25       12.90       59.84%
SORT                  2610       2,000,000  457       392.68      261.13      241.29      249.42      38.55%
Csordinal             47.6       1,000,000  50        40.11       37.05       36.87       36.94       8.07%
Cslogistic            47.6       1,000,000  50        63.03       55.74       58.49       54.72       13.18%
Linear regression     703        200,000    400       50.13       30.93       14.74       11.83       76.4%
Factor                686        200,000    400       97.83       49.24       27.94       28.18       71.44%
Correlation           343        200,000    200       29.67       19.55       16.81       12.97       56.28%
Partially correlated  343        200,000    200       21.94       12.41       12.56       12.31       43.89%
Nomreg                3.76       50,000     15        16.61       11.89       9.41        8.16        50.87%
Csselect              33.2       1,069,000  6         47.97       48.27       48.10       48.00       0.00%
Table 4: Benchmarking results with different thread numbers
Based on the above results, as the number of threads increases from 2 to 16:
• The Cscoxreg procedure improves by 59.84%.
• The Linear regression procedure improves by 76.4%.
• The Factor procedure improves by 71.44%.
• The Partially correlated procedure improves by 43.89%.
• The Nomreg procedure improves by 50.87%.
Note: The data shown is based on testing done in IBM SPSS laboratories. Although our test
environments simulate typical production environments in the field, we cannot guarantee that
organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients
(number of CPU cores, RAM, disk speed, etc.).
Working with Output
SPSS Statistics provides rich methods to display the statistical results, including tables, charts, and text.
By default, the results are displayed in an SPSS Statistics Viewer window. You can manipulate the output
and create an output document that contains precisely the output you want, arranged and formatted
appropriately. The best practice introduced in this section helps you to achieve this goal.
Extract What You Need from Large Output
When you run multiple procedures, SPSS Statistics often generates a large volume of output consisting of tables, charts, logs, text, and so on. Reviewing so much information to find what you need can be painful. Fortunately, SPSS Statistics provides the Output Management System (OMS) and the OUTPUT commands (OUTPUT NEW, OUTPUT NAME, OUTPUT ACTIVATE, OUTPUT OPEN, OUTPUT SAVE, OUTPUT CLOSE) to help you refine and route the output.
Benefits
With the OMS and OUTPUT commands, you can gain the following benefits.
• Partition large output into separate output documents.
• Select and route the required information from the output.
• Work with multiple open output documents in a given session.
• Use output as input with OMS.
Obtaining OMS and OUTPUT Commands
There are two ways to run OMS: from the OMS Control Panel or with a syntax command.
• Use the OMS Control Panel. From the menus choose Utilities > OMS Control Panel. With the control panel, you can start and stop the routing of output to various destinations. Note that the OUTPUT commands can be used only in command syntax.
• Use the OMS and OUTPUT commands. The following examples illustrate how to insert these commands into your existing syntax.
Examples
Example 1: Partitioning the Output with OUTPUT Commands
This example demonstrates how to partition the statistical results according to gender. Results for males appear in one output document, and results for females appear in another.
GET FILE='SurveyData.sav'.
TEMPORARY.
SELECT IF (Sex='Male').
FREQUENCIES VARIABLES=ALL.
OUTPUT NAME males.
TEMPORARY.
SELECT IF (Sex='Female').
OUTPUT NEW NAME=females.
FREQUENCIES VARIABLES=ALL.
OUTPUT SAVE NAME=males OUTFILE='Males.spv'.
OUTPUT SAVE NAME=females OUTFILE='Females.spv'.
OUTPUT CLOSE *.
• The GET command loads survey data for male and female respondents.
• The FREQUENCIES output for male respondents is written to the designated output document.
• The OUTPUT NAME command assigns the name males to the designated output document.
• The FREQUENCIES output for female respondents is written to a new output document named females.
• The OUTPUT SAVE commands save the output in two separate files.
• The OUTPUT CLOSE command closes all open output documents.
Example 2: Formatting and Routing the Output with OMS
This example demonstrates how to route the output to different formats. The following is the sample code.
OMS
/SELECT TABLES
/IF COMMANDS = ['Regression']
/DESTINATION FORMAT = DOC
OUTFILE = 'tables.doc'.
REGRESSION
/STATISTICS COEFF OUTS R ANOVA
/DEPENDENT income
/METHOD=ENTER age address EDU employ.
OMS
/SELECT WARNINGS
/DESTINATION FORMAT=HTML
OUTFILE='warnings.htm'.
FREQUENCIES age EDU.
OMSEND.
• The first OMS command selects tables from the REGRESSION results and saves them to tables.doc.
• The REGRESSION command generates the output used by OMS.
• The second OMS command selects warnings from the FREQUENCIES results and saves them to warnings.htm.
• The FREQUENCIES command generates the results used by the second OMS command.
• The OMSEND command ends all active OMS commands.
Example 3: Converting Output into Input with OMS
Using the OMS command, you can save the output to an SPSS Statistics data file and then use that
output as input in subsequent commands or sessions.
OMS
/SELECT TABLES
/IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics']
/DESTINATION FORMAT=SAV OUTFILE='des_table.sav'
/COLUMNS DIMNAMES=['Variables'].
DESCRIPTIVES VARIABLES=salary salbegin.
OMSEND.
• The OMS command selects the “Descriptive Statistics” table from the DESCRIPTIVES results and saves it as the SPSS Statistics data file des_table.sav. The COLUMNS subcommand places the Variables dimension in the columns of the output data file.
• The DESCRIPTIVES command generates the table used by OMS.
• The OMSEND command ends OMS commands.
Summary
The OMS and OUTPUT commands provide the ability to manage one or more output documents
programmatically. This ability helps you deal with the output more easily. For more information, please
refer to the IBM SPSS Statistics Command Syntax Reference, which is released with the product.
Working with Command Syntax
The powerful command syntax allows you to save and automate many common tasks. It also provides
some functionality not found in the menus and dialog boxes. You can also save your jobs in a syntax file
so that you can repeat your analysis at a later date. This section provides best practices for working with
command syntax.
Removing Unnecessary EXECUTE Commands
The EXECUTE command is designed for use with transformation commands and facilities such as ADD
FILES, MATCH FILES, UPDATE, PRINT, and WRITE, which do not read data and are not executed
unless followed by a data-reading procedure. Because the EXECUTE command forces the data to be
read, unnecessary EXECUTE commands can result in extra data passing and wasted time.
Benefits
By identifying and removing unnecessary EXECUTE commands, you can optimize syntax arrangement
and reduce the time needed for reading data. This optimization is especially effective for I/O intensive
procedures.
Examples
The following examples demonstrate the improper usage of EXECUTE commands and how to correct the
improper usage.
Example 1: Using EXECUTE Between Independent Transformations
COMPUTE var1=var1*2.
EXECUTE.
COMPUTE var2=var2*2.
• The two COMPUTE commands operate on different variables, so they are independent. In this scenario, inserting the EXECUTE command causes unnecessary data passes and lowers the execution efficiency of the transformations.
Ensuring that the transformations are truly independent is critical. If the transformations are in fact dependent, you may need to put the EXECUTE command between them to get the right results. For example:
Syntax 1:
COMPUTE lagvar=LAG(var1).
COMPUTE var1=var1*2.
Syntax 2:
COMPUTE lagvar=LAG(var1).
EXECUTE.
COMPUTE var1=var1*2.
• Compared with Syntax 1, the only difference in Syntax 2 is the EXECUTE command between the two COMPUTE commands. However, the value of lagvar is completely different in the two versions. Syntax 1 uses the transformed value of var1 to calculate lagvar, while Syntax 2 uses the original value.
Example 2: Inserting EXECUTE Between a Transformation and Statistical Procedure
COMPUTE var1=var1*2.
EXECUTE.
FREQUENCIES VARIABLES=var1.
• Sometimes it is necessary to force the transformations to execute with the EXECUTE command. However, when the transformations are followed by one or more statistical procedures that need to read the data, the EXECUTE command becomes redundant. In this example, you should remove the EXECUTE command, as shown below.
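For reference, this is the corrected version of the example above: the FREQUENCIES command reads the data, so it also executes the pending transformation.
COMPUTE var1=var1*2.
FREQUENCIES VARIABLES=var1.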
Working with SPSS Statistics Server
SPSS Statistics Server is robust, powerful analytical software that seamlessly scales from handling the
analytical needs of a single department to hundreds of users across the enterprise. It provides all of the
features of SPSS Statistics, plus capabilities that deliver faster performance, more efficient processing of
large datasets, and enhanced security in enterprise deployments. This section provides best practices for
working with SPSS Statistics Server.
Decreasing Data Passing Costs with SPSS Statistics Server
For an organization with distributed offices, accessing large data files across offices takes a significant
amount of time. Passing large data on the network can cause bandwidth saturation, which disturbs the
normal use of other applications. In this situation, SPSS Statistics Server is a good choice.
Benefits
With SPSS Statistics Server, data is read on the server machine, which avoids transferring large datasets to end users’ desktops. The amount of data transferred over the network is minimized and performance is improved. This prevents bandwidth saturation and improves the performance of SPSS Statistics as well as that of other mission-critical applications, including e-mail, enterprise resource planning (ERP), and customer relationship management (CRM).
Testing and Results
The following table compares the time needed to access data in these situations:
• SPSS Statistics client is running in local mode and accesses files in the data center directly over the wide area network (WAN).
• SPSS Statistics client is running in distributed mode and is connected to an SPSS Statistics Server installed at the data center.
File Size  SPSS Statistics client connecting directly   SPSS Statistics client connecting to the SPSS       Time saved with SPSS
           to the data over a WAN (T1 3.0 Mbps)          Statistics Server at the data center over a WAN     Statistics Server
                                                         (T1 3.0 Mbps)
50 MB      2 minutes, 10 seconds                         4 seconds                                           2 minutes, 6 seconds
250 MB     10 minutes, 50 seconds                        40 seconds                                          10 minutes, 10 seconds
1 GB       43 minutes, 17 seconds                        80 seconds                                          41 minutes, 57 seconds
Table 5: Timing to access a data file
As shown in the above table, compared with the SPSS Statistics client alone, significant time savings can be achieved with SPSS Statistics Server when accessing files from distributed offices. For example, about 2 minutes were saved for a 50 MB file, 10 minutes for a 250 MB file, and 42 minutes for a 1 GB file.
Note: The results are based on the assumption that the available bandwidth is 3.0 Mbps. In reality, the
time saved will be greater as bandwidth is taken up by other applications such as e-mail, network
backups, and other network resources. The data presented here are for illustrative purposes only. Actual
results will vary depending on the configuration, bandwidth, and latency of the WAN. Therefore,
organizations performing similar tests may not see identical results.
Guidelines for purchasing Statistics Server
SPSS Statistics Server is especially well suited for the following scenarios:
• Organizations with distributed offices looking to centralize their data and IT infrastructure in one or more data centers.
• Organizations with distributed offices that need to analyze and share files greater than 25 MB across offices.
• Organizations that need to perform analysis on large datasets (greater than 100 MB) sourced from a SQL server or a data warehouse.
64-bit Computing with Statistics Server
The amount of physical RAM is critical for performance because accessing data from RAM is much faster
than accessing data from a disk. For faster performance, it’s best to have the entire dataset in RAM.
However, the total amount of RAM supported depends on the processor. Theoretically, 32-bit processors are limited to accessing 4 GB of RAM. Moving to a 64-bit machine allows you to increase the amount of RAM to several times what a 32-bit machine supports, making it much faster to execute analytical procedures on larger datasets.
SPSS Statistics Server has strong support for 64-bit computing on multiple server operating systems, including Windows Server, IBM® AIX®, Sun Solaris, HP-UX, Red Hat Enterprise Linux, and SUSE Linux Enterprise Server. Most analytical procedures run much faster on 64-bit SPSS Statistics Server than on the 32-bit SPSS Statistics client.
Benchmarking Test
We compare the processing times for statistical procedures run on 64-bit SPSS Statistics Server and the 32-bit SPSS Statistics client.
Configuration of Statistics Server
CPU: 4 CPUs, Intel Xeon 3 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating system: Windows 2003 Server, 64-bit
Configuration of Statistics Client
CPU: 1 CPU, Intel T 7500, 2.19 GHz, dual-core processor
RAM: 3 GB
Operating system: Windows XP, 32-bit
Datasets
Two datasets were used:
• Dataset 1: Size 2.1 GB, 5 million cases, 127 variables
• Dataset 2: Size 3 GB, 10 million cases, 127 variables (for testing multithreaded procedures)
Result
The test results are summarized in the following table. The chosen procedures represent the typical types of analysis and data processing that an SPSS Statistics user might execute in daily work.
Procedures     64-bit Server (seconds)   32-bit Client (seconds)   Time Saved   Average Speedup Factor
ADD FILES      18.45                     169.34                    89.10%       9.18
AGGREGATE      33.19                     94.95                     65.04%       2.86
MATCH FILES    22.00                     224.17                    90.19%       10.19
SORT           146.90                    578.73                    74.62%       3.94
CORRELATION    230.78                    800.83                    71.18%       3.47
FACTOR         140.95                    219.22                    35.70%       1.56
GLM            70.09                     350.91                    80.03%       5.01
MIXED          116.23                    174.13                    33.25%       1.50
TREES          615.00                    885.49                    43.98%       1.44
BETA           40.12                     106.20                    62.22%       2.65
Table 6: Benchmarking results for jobs run on 64-bit SPSS Statistics Server and the 32-bit SPSS Statistics client
Note: The results shown in Table 6 are based on testing done in IBM SPSS laboratories. Although our
test environments simulate typical production environments in the field, we cannot guarantee that
organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).
Summary
The benchmarking results show impressive speedup for most procedures with 64-bit SPSS Statistics
Server. For best performance, use 64-bit SPSS Statistics Server.
Using Multiple Locations for Temporary Files
When SPSS Statistics Server processes data, it often keeps a temporary copy of that data on disk. In
addition, some procedures (CACHE, SORT, AGGREGATE, transformations, etc.) can create temporary
files during execution. The size of temporary files varies from the size of the data file to three times the
size of the data file. Because the temporary files are writable and can get quite large, it can be hard to manage I/O operations, especially when there are several concurrent I/O-intensive users. In this situation, setting multiple temporary file locations is necessary.
Benefits
Using multiple temporary file locations, you can:
• Limit users to operating in the directories to which they have access.
• Control the temporary file space allocated to each user by specifying a partitioned drive.
• Improve performance when the locations are on different spindles. This option requires the server machine to have multiple physical disks.
How to Set Multiple Temporary File Locations
There are several ways to set multiple temporary file locations using the SPSS Statistics Administration
Console. Note that this optimization is available only when using SPSS Statistics Server.
Set global temporary file locations
Chart 3 shows a screen capture from the SPSS Statistics Administration Console and highlights the
setting for temporary file locations.
Chart 3: Setting for temporary file location
As shown in Chart 3, the administrator set three locations: c:\temp, d:\temp, and e:\temp. This setting is
global for all users but can be overridden by the user profile or group setting.
Set Temporary File Location with Group Setting
The group setting applies to all users in a group, but it can be overridden with the setting in specific user
profiles. To display the group settings, double-click the User Profiles and Groups node beneath the
desired SPSS Statistics Server in the Server Administration pane.
The Manage Users and Groups pane displays the currently defined user profiles and groups in the User
Profiles and Groups grid. To create a new user group, complete the following steps.
• In the Manage Users and Groups pane, click New Group.
• In the Create New Group dialog box, enter a name for the group.
• Define any of the available settings, including temporary file locations.
Set Temporary File Location for Each User
To create a new user profile, open the Manage Users and Groups pane in the Server Administration
pane and complete the following steps.
• In the Manage Users and Groups pane, click New User Profile.
• In the Create New User Profile dialog box, enter the name of the user for whom you are creating the profile.
• If necessary, define any of the available settings. You can define the temporary file location for this user. If you are creating a user profile to assign to a group, you don’t have to define any settings. The group settings will be applied to the user.
For more information about creating and editing SPSS Statistics Server user profiles and groups, please
refer to Chapter 4 of the IBM SPSS Statistics Server Administrator’s Guide.
Conclusion
This paper provides best practices for improving the efficiency, performance, and optimization of IBM SPSS Statistics. These best practices cover data preparation, data transformations, data analysis, output, command syntax, and Statistics Server. By learning from these cases, SPSS Statistics users can optimize their work and improve overall performance.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in
many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other
companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark
information" at http://www.ibm.com/legal/copytrade.shmtl.