IBM SPSS Statistics Performance Best Practices
©Copyright IBM Corporation 1989, 2012.

Contents

Overview
    Target User
    Introduction
    Methods of Problem Diagnosis
        Performance Logging for Statistics Server
        Timing for Backend Procedures
        Benchmarking with a Python Module
Best Practices for Data Preparation
    Preparing data automatically with ADP
        Benefits
        Obtaining ADP
        Note
    SQL Pushback
        Preconditions
        Obtaining SQL Pushback
        Example
        Summary
        Note
Best Practices for Data Transformations
    Grouping the Transformations
        Benefits
        Example
        Summary
    Compiled Transformations
        Preconditions
        Obtaining Compiled Transformations
        Example
Best Practices for Data Analysis
    Cache Compression for Large Datasets
        Benefits
        Obtaining Cache Compression
        Example
    Multithreading
        Preconditions
        Setting
        Example
Working with Output
    Extract What You Need from Large Output
        Benefits
        Obtaining OMS and OUTPUT Commands
        Examples
        Summary
Working with Command Syntax
    Removing Unnecessary EXECUTE Commands
        Benefits
        Examples
Working with SPSS Statistics Server
    Decreasing Data Passing Costs with SPSS Statistics Server
        Benefits
        Testing and Results
        Guidelines for purchasing Statistics Server
    64-bit Computing with Statistics Server
        Benchmarking Test
    Using Multiple Locations for Temporary Files
        Benefits
        How to Set Multiple Temporary File Locations
Conclusion
Trademarks

Overview

Target User

This paper is intended for users of and support specialists for both IBM® SPSS® Statistics Desktop and IBM® SPSS® Statistics Server. You will find information about optimizing performance and troubleshooting performance-related issues.

Introduction

SPSS Statistics is comprehensive software for data and statistical analysis. It enables users to quickly look at their data and includes a wide range of procedures and tests to help users solve complex business and research challenges.

This article provides SPSS Statistics users and support specialists with best practices for configuration, data preparation, data analysis, and other tasks. These best practices can improve the efficiency, performance, and optimization of SPSS Statistics.
This article contains the following information:

- Methods for diagnosing problems
- Best practices for data preparation, primarily with Automated Data Preparation (ADP)
- Best practices for data transformations, including compiled transformations and how to group the transformations for best performance
- Best practices for data analysis, including multithreading and cache compression
- Best practices for extracting useful information from large output efficiently
- Best practices for working with syntax
- Best practices for SPSS Statistics Server

For each of the best practices, this article provides detailed background, sample code, and instructions for running the sample code.

Methods of Problem Diagnosis

To use SPSS Statistics efficiently, you must first identify where the problems are, especially performance problems. The methods described in this section help you identify which areas may be problematic.

Performance Logging for Statistics Server

If you need to check the performance of SPSS Statistics Server, the IBM® SPSS® Statistics Administration Console allows you to configure the analytic server software to write performance information to a log file. The log file provides detailed information about current users, CPU usage, and RAM usage. For more information about logging, refer to Chapter 4 in the IBM SPSS Statistics Server Administrator’s Guide.

Timing for Backend Procedures

This method is designed for backend procedures. It uses the SHOW $VARS command to get time information. By issuing the command at the beginning and end of a job, you can obtain an accurate cost of the job and diagnose the problematic area.

Example

    GET FILE = dataset.
    SHOW $VARS.
    FREQUENCIES VARIABLES=var1 var2.
    SHOW $VARS.
    FREQUENCIES VARIABLES=var3 var4.
    SHOW $VARS.

The first SHOW $VARS command records the start time of the first FREQUENCIES command.
The second SHOW $VARS command records the end time of the first FREQUENCIES command and the start time of the second FREQUENCIES command. The last SHOW $VARS command records the end time of the second FREQUENCIES command. You can then calculate the cost of each FREQUENCIES command by subtraction.

Benchmarking with a Python Module

The benchmark Python module helps you to identify inefficient work. It provides classes that measure various aspects of the SPSS Statistics syntax that is executed on the Microsoft Windows platform. To run this module, you must do the following:

- Install Python. Note that the Python version is specific to the SPSS Statistics version and the operating system.
- Download and install the win32com utility from http://sourceforge.net/projects/pywin32.
- Download and install IBM SPSS Statistics – Integration Plug-In for Python, which is installed with IBM SPSS Statistics – Essentials for Python. For more information, refer to the document IBM SPSS Statistics - Essentials for Python: Installation Instructions for Windows.
- Download the benchmark module, which can be found in the SPSS community’s Utilities collection at http://www.ibm.com/developerworks/spssdevcentral. To install this module, read the article “How to Use Downloaded Python Modules,” which is also available in the SPSS community.
- After finishing the installation process, open benchmark.py in a text editor or Python development environment and follow the instructions to execute the benchmarking work.

Best Practices for Data Preparation

This section provides best practices for data preparation. The IBM SPSS Statistics Data Preparation option allows you to identify unusual and invalid cases, variables, and data values in your active dataset. It also allows you to prepare data for modeling.

Preparing data automatically with ADP

Preparing data for analysis is one of the most important steps in any project and, traditionally, one of the most time consuming.
Automated Data Preparation (ADP) handles the task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques.

Benefits

Using ADP enables you to make your data ready for model building quickly and easily, without needing prior knowledge of the statistical concepts involved. Models will tend to build and score more quickly; in addition, using ADP improves the robustness of automated modeling processes.

Obtaining ADP

To run ADP automatically, from the menus choose:

    Transform > Prepare Data for Modeling > Automatic...

Click Run.

Optionally, you can:

- Specify an objective on the Objective tab.
- Specify field assignments on the Fields tab.
- Specify expert settings on the Settings tab.

Note

This article provides only general instructions for using ADP. For more details, read the document IBM SPSS Statistics Data Preparation released with the product. In particular, refer to the following:

- Chapter 4 provides detailed instructions for running ADP, including background information, user interface operations, and explanations of the settings.
- Chapter 8 provides ADP sample code and examples, including the full process of running ADP. The example also builds models using the data “before” and “after” preparation so that you can compare the results.

SQL Pushback

SPSS Statistics Server supports the pushback of sorting and aggregation to a SQL database. This ability to perform sorting and aggregation operations in the SQL database is called SQL Pushback. When large datasets are sourced from a SQL database, SQL Pushback ensures that operations that can be performed more efficiently in the database are performed there.

Preconditions

The following preconditions are required for SQL Pushback functionality.
- SPSS Statistics Server
- SPSS Statistics Client used to connect to an SPSS Statistics Server
- SQL database, such as IBM DB2®, Microsoft SQL Server, or Oracle Database

Obtaining SQL Pushback

SQL Pushback is available only through the graphical user interface. Therefore, you first need to use the SPSS Statistics client to connect to the SPSS Statistics Server. Then complete the following steps.

1. From the menus choose File > Open Database > New Query...
2. Select the data source.
3. If necessary (depending on the data source), select the database file and/or enter a login name, password, and other information.
4. Select the table(s) and fields. For OLE DB data sources (available only on Windows operating systems), you can select only one table.
5. Specify any relationships between your tables and any selection criteria.
6. If needed, aggregate the data by selecting one or more break variables, aggregated variables, and an aggregate function for each aggregated variable. Otherwise, skip this step.
7. Edit variable names and properties.
8. If needed, sort the data. Otherwise, click Next to skip this step.
9. Run the query or save it.

Example

This example compares the performance of sorting with SQL Pushback against sorting with the SORT procedure in the SPSS Statistics client.

Data File and Configurations

- Dataset: Size 1.25 GB, 7.71 million cases, 27 variables
- CPU: 1 CPU, Intel T9400, 2.53 GHz, dual-core processor
- RAM: 3 GB
- Operating System: Windows XP, 32-bit
- IBM SPSS Statistics: Statistics Server 20, Statistics Client 20

Test Results

- Sort with SQL Pushback: 77 seconds
- Sort with Statistics Client: 289 seconds
- Time Saved: 212 seconds (73.35%)

Note: The above result is based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we can’t guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance.
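
As a quick check of the Time Saved figure in the test results above, the short Python sketch below recomputes it from the two measured run times. The function name time_saved is a hypothetical helper introduced only for this illustration; it is not part of SPSS Statistics or of the benchmark module.

```python
def time_saved(baseline_seconds, improved_seconds):
    """Return (seconds saved, percent saved) relative to the baseline run."""
    saved = baseline_seconds - improved_seconds
    percent = 100.0 * saved / baseline_seconds
    return saved, percent


# Figures from the SQL Pushback test: sort in the client vs. sort pushed to the database.
saved, percent = time_saved(baseline_seconds=289, improved_seconds=77)
print(f"Time saved: {saved} seconds ({percent:.2f}%)")
```

The ratio is 212/289, or about 73.356%, so the script prints 73.36% when rounding to two decimals; the 73.35% reported above is the same value truncated. The same helper can be reused for the other timing comparisons in this article.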

Summary

In this example, sorting with SQL Pushback improved performance by 73.35%. The improvement may vary depending on configurations, data size, and syntax.

Note

If you are familiar with the SQL language, you can write the SQL query so that sorting and aggregation are executed in the database, which gains the same performance improvement as SQL Pushback.

Best Practices for Data Transformations

In most situations, the raw data aren’t perfectly suitable for the type of analysis you want to perform. Preliminary analysis may reveal inconvenient coding schemes or coding errors, and data transformations may then be required in order to expose the true relationship between variables. You can perform data transformations ranging from simple tasks, such as collapsing categories for analysis, to more advanced tasks, such as creating new variables.

This section introduces several best practices for data transformations, which help you use SPSS Statistics data transformations more efficiently.

Grouping the Transformations

Data transformations are usually necessary for data analysis. A typical job defines the data, then transforms, analyzes, transforms, analyzes, and so on. When transformation commands are interspersed with analytic procedures in this way, efficiency is low because the transformation work is executed repeatedly, once for each procedure. In this situation, you need to group the transformations.

Benefits

By grouping the transformation commands, you can execute all the transformation work at one time, which saves extra interpretation cost for the transformations. In addition, it makes the syntax clearer and better ordered.

Example

The example executes the sample syntax before and after grouping the transformation work, so that you can see the difference in the results.

Ungrouped Syntax

    GET FILE="dataset".
    COMPUTE testvar1=var1-var2.
    IF (testvar1 LT 10 OR testvar1 GT 50) testvar1=20.
    FREQUENCIES testvar1.
    COMPUTE testvar2=var3.
    RECODE testvar2 (1 thru 10=1) (11 thru 30=2) (31 thru 50=3) (51 thru Highest=4).
    FREQUENCIES testvar2.
    COMPUTE testvar3=var4.
    RECODE testvar3 (SYSMIS=SYSMIS) (Lowest thru 20=1) (21 thru 50=2) (100 thru Highest=4) (51 thru 100=3).
    FREQUENCIES testvar3.

Grouped Syntax

    GET FILE="dataset".
    COMPUTE testvar1=var1-var2.
    IF (testvar1 LT 10 OR testvar1 GT 50) testvar1=20.
    COMPUTE testvar2=var3.
    RECODE testvar2 (1 thru 10=1) (11 thru 30=2) (31 thru 50=3) (51 thru Highest=4).
    COMPUTE testvar3=var4.
    RECODE testvar3 (SYSMIS=SYSMIS) (Lowest thru 20=1) (21 thru 50=2) (100 thru Highest=4) (51 thru 100=3).
    FREQUENCIES testvar1.
    FREQUENCIES testvar2.
    FREQUENCIES testvar3.

The syntax creates three test variables (testvar1, testvar2, and testvar3) based on the original variables (var1, var2, var3, and var4), and then recodes them for the next analysis step. We use the simple FREQUENCIES command for demonstration.

Data File and Configurations

- Dataset: Size 0.9 GB, 3 million cases, 132 variables
- CPU: 1 CPU, Intel T9400, 2.53 GHz, dual-core processor
- RAM: 3 GB
- Operating System: Windows XP, 32-bit
- IBM SPSS Statistics: Statistics Client 20

Test Results

- Ungrouped syntax: 77 seconds
- Grouped syntax: 43 seconds
- Time saved: 26 seconds (33%)

Note: The above result is based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we can’t guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance.

Summary

Based on the example, the performance improvement is up to 33% by grouping the transformations. The improvement may vary depending on configurations, data size, and syntax, but you can see obvious improvement. Grouping your transformation work is a good practice.

Compiled Transformations

The compiled transformations feature is designed to improve the performance of complex transformations.
When you use compiled transformations, transformation commands (such as COMPUTE and RECODE) are compiled to machine code at run time for better performance. This feature works only with SPSS Statistics Server running on Windows Server.

Preconditions

The following preconditions are required for the compiled transformations feature.

- SPSS Statistics Server running on Windows.
- The SPSS Statistics Administration Console for configuring SPSS Statistics Server.
- GNU G++ compiler.

Because there is an overhead involved in compiling the transformations, you should use compiled transformations only when there are a large number of cases and multiple transformation commands.

Obtaining Compiled Transformations

To run compiled transformations, complete the following steps:

1. Have an administrator use the SPSS Statistics Administration Console to turn on the feature and set the correct compiler path. Chart 1 highlights these settings.

   Chart 1: Settings for compiled transformations

2. Set CMPTRANS to YES in the syntax file.
3. Execute the syntax while connected to the SPSS Statistics Server or with the SPSS Statistics Batch Facility.

Note: For compiled transformations to be available, the administrator must turn on compiled transformations with the SPSS Statistics Server setting, and CMPTRANS must be set to YES. If the administrator does not turn on compiled transformations, a warning message is displayed and the command is ignored.

Example

This example runs compiled transformations with different data sizes and complexity levels. It also provides the test results without compiled transformations for contrast.

Sample Syntax

    INPUT PROGRAM.
    LOOP icase = 1 to 1000000.
    END CASE.
    END LOOP.
    END FILE.
    END INPUT PROGRAM.
    EXECUTE.
    SET CMPTRANS=ON.
    VECTOR x(10).
    LOOP jvar = 1 to 10.
    COMPUTE x(jvar)=rnd(uniform(10)).
    END LOOP.
    EXECUTE.

The above syntax generates a dataset and initializes the variables with the COMPUTE command.
The first LOOP command defines the number of cases, and the second LOOP command defines the number of variables.

Configurations

- CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
- RAM: 8 GB
- Operating System: Windows 2003 Server, 64-bit
- IBM SPSS Statistics: Statistics Server 20

Test Results

The following table summarizes the test results.

    Number of Cases   Loops   No Compiled Transformations   Compiled Transformations   Time Saved
    1,000,000         10      9 seconds                     9 seconds                  0
    1,000,000         100     70 seconds                    51 seconds                 27%
    5,000,000         10      45 seconds                    36 seconds                 20%
    5,000,000         100     349 seconds                   255 seconds                28%

Table 1: Test results of compiled transformations

Note: The above result is based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we can’t guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance.

Summary

Compiled transformations may improve performance when there are a large number of cases and complex transformation commands.

Best Practices for Data Analysis

This section provides best practices for data analysis. IBM SPSS Statistics is a comprehensive system for analyzing data. It makes statistical analysis more accessible for the beginner and more convenient for the experienced user. The best practices introduced here help you analyze large datasets more efficiently and improve parallelization for CPU-intensive procedures.

Cache Compression for Large Datasets

When you run many procedures on a large dataset, the cost of reading the data increases, because the application must read the original dataset for each procedure. For data tables read from a database source, this means that the SQL query must be re-executed for any command or procedure that needs to read the data. Cache compression allows you to avoid this overhead.

Benefits

Creating a data cache eliminates repeated reads of the source data.
The CACHE command copies all of the data to a temporary disk file for subsequent uses of the data. To decrease I/O costs, you can also compress the temporary data file. Combining CACHE with compression improves efficiency when dealing with large datasets.

Obtaining Cache Compression

Cache compression works only if you are connected to Statistics Server. Once connected, complete the following steps.

1. Have an administrator use the SPSS Statistics Administration Console to turn on the feature. Chart 2 highlights these settings.

   Chart 2: Settings for Cache Compression

2. Issue an explicit CACHE command before the analytical procedures.
3. Set ZCOMPRESSION to YES in the syntax file.
4. Execute the syntax while connected to the SPSS Statistics Server or with the SPSS Statistics Batch Facility.

Example

This example runs several procedures with cache compression on a large dataset and then summarizes the test results.

Configurations

- Dataset: Size 1.25 GB, 7.71 million cases, 27 variables
- CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
- RAM: 8 GB
- Operating System: Windows 2003 Server, 64-bit
- IBM SPSS Statistics: Statistics Server 20

Results

The following table summarizes the test results.

    Procedure      No Cache Compression (Seconds)   Cache Compression (Seconds)   Time Saved
    CODEBOOK       41.79                            11.77                         71.83%
    CORRELATIONS   21.18                            8.42                          60.25%
    COXREG         28.09                            16.28                         42.04%
    CROSSTABS      21.85                            8.84                          59.54%
    CTABLES        25.06                            11.10                         55.71%
    EXAMINE        1343.87                          1157.34                       13.88%
    GLM            20.72                            8.68                          58.11%
    LOGISTIC       37.09                            25.55                         31.11%
    NOMREG         29.26                            16.35                         44.12%
    OLAP CUBES     30.83                            16.96                         44.98%
    T-TEST         20.22                            8.24                          59.25%
    TREE           192.09                           164.17                        14.54%

Table 2: Test results for cache compression

As shown in Table 2, the procedures CODEBOOK, CORRELATIONS, CROSSTABS, CTABLES, GLM, and T-TEST improve by more than 50%.

Note: The data shown is based on testing done in IBM SPSS laboratories.
Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).

Multithreading

Multithreading is a technique that breaks a task into multiple subtasks that can be executed in parallel. Not all analytical procedures can take advantage of multithreading. Procedures that can be easily parallelized and scheduled to run simultaneously on different CPUs/cores benefit the most. The procedures that are multithreaded in SPSS Statistics are listed in the following table.

    Procedure family      Procedure Name
    Correlations          Bivariate
                          Partial
    Regression            Linear
                          Ordinal
                          Multinomial Logistic
    Data Reduction        Factor Analysis
    Survival Analysis     Cox Regression
                          Logistic Regression
    Multiple Imputation   Impute missing values

Table 3: Multithreaded analytical procedures

Preconditions

To benefit from multithreading, the following preconditions are required.

- The computer on which the procedure is run has multiple processors, or each processor has multiple cores.
- The procedure that is executed is listed in Table 3.

Note: In SPSS Statistics client, the maximum number of threads is 4. In SPSS Statistics Server, there is no limit to the number of threads.

Setting

By default, SPSS Statistics uses an internal algorithm to determine the number of threads for a particular computer. You can change this setting, but the default will often provide the best performance. You can override the default setting by issuing the command SET THREADS=n, where n indicates the number of threads, often corresponding to the number of CPUs or cores.

Overriding the default with SET THREADS is appropriate in the following scenarios.

- The default thread number is usually equal to the number of processing units.
The threads consume CPU resources, which may reduce the processing cycles available for other CPU-intensive applications. In this situation, you can use SET THREADS to limit the thread number.
- For multithreaded procedures, performance may not improve when the thread number increases, because the overhead of separating the data, managing the threads, and merging the results also increases. (For specific results, refer to Table 4.) Therefore, you should find the optimal thread number and set it by using the command SET THREADS.

Example

This example provides detailed performance information for multithreaded procedures using different data sizes and different numbers of threads.

Configurations

- CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
- RAM: 8 GB
- Operating System: Windows 2003 Server, 64-bit
- IBM SPSS Statistics: Statistics Server 20

Results

The following table summarizes the test results. All times are in seconds.

    Multi-threaded Procedure   File Size (MB)   Case Number   Variable Number   2 threads   4 threads   8 threads   16 threads   Saved Time
    Discriminant               688              400,000       200               7.56        6.67        6.03        6.34         20.23%
    Cscoxreg                   2.38             50,000        50                32.12       20.36       15.25       12.90        59.84%
    SORT                       2610             2,000,000     457               392.68      261.13      241.29      249.42       38.55%
    Csordinal                  47.6             1,000,000     50                40.11       37.05       36.87       36.94        8.07%
    Cslogistic                 47.6             1,000,000     50                63.03       55.74       58.49       54.72        13.18%
    Linear regression          703              200,000       400               50.13       30.93       14.74       11.83        76.4%
    Factor                     686              200,000       400               97.83       49.24       27.94       28.18        71.44%
    Correlation                343              200,000       200               29.67       19.55       16.81       12.97        56.28%
    Partially correlated       343              200,000       200               21.94       12.41       12.56       12.31        43.89%
    Nomreg                     3.76             50,000        15                16.61       11.89       9.41        8.16         50.87%
    Csselect                   33.2             1,069,000     6                 47.97       48.27       48.10       48.00        0.00%

Table 4: Benchmarking results with different thread numbers

Based on the above results, as the number of threads increases from 2 to 16:

- Cscoxreg procedure improves by 59.84%.
- Linear regression procedure improves by 76.4%.
- Factor procedure improves by 71.44%.
- Partially correlated procedure improves by 43.89%.
- Nomreg procedure improves by 50.87%.

Note: The data shown is based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).

Working with Output

SPSS Statistics provides rich methods to display the statistical results, including tables, charts, and text. By default, the results are displayed in an SPSS Statistics Viewer window. You can manipulate the output and create an output document that contains precisely the output you want, arranged and formatted appropriately. The best practice introduced in this section helps you to achieve this goal.

Extract What You Need from Large Output

When running multiple procedures, SPSS Statistics often generates a large volume of results consisting of tables, charts, logs, text, and so on. Reviewing so much information to find what you want is tedious. Fortunately, SPSS Statistics provides the Output Management System (OMS) and OUTPUT commands (OUTPUT NEW, OUTPUT NAME, OUTPUT ACTIVATE, OUTPUT OPEN, OUTPUT SAVE, OUTPUT CLOSE) to help you refine and route the output.

Benefits

With OMS and OUTPUT commands, you can gain the following benefits.

- Partition the large output into separate output documents.
- Select and route required information from the output.
- Work with multiple open output documents in a given session.
- Use output as input with OMS.

Obtaining OMS and OUTPUT Commands

There are two ways to run OMS: from the OMS Control Panel and with command syntax.

- Use the OMS Control Panel. From the menus choose Utilities > OMS Control Panel. With the control panel, you can start and stop the routing of the output to various destinations.
Note that the OUTPUT commands can be used only through command syntax.

Use OMS and OUTPUT commands. The following examples illustrate how to insert these commands into your existing syntax.

Examples

Example 1: Partitioning the Output with OUTPUT Commands

This example demonstrates how to partition the statistical results according to gender. Results for males appear in one output document, and results for females appear in another.

    GET FILE='SurveyData.sav'.
    TEMPORARY.
    SELECT IF (Sex='Male').
    FREQUENCIES VARIABLES=ALL.
    OUTPUT NAME males.
    TEMPORARY.
    SELECT IF (Sex='Female').
    OUTPUT NEW NAME=females.
    FREQUENCIES VARIABLES=ALL.
    OUTPUT SAVE NAME=males OUTFILE='Males.spv'.
    OUTPUT SAVE NAME=females OUTFILE='Females.spv'.
    OUTPUT CLOSE *.

The GET command loads survey data for male and female respondents.
The FREQUENCIES output for male respondents is written to the designated output document.
The OUTPUT NAME command assigns the name males to the designated output document.
The FREQUENCIES output for female respondents is written to a new output document named females.
The OUTPUT SAVE commands save the output in two separate files.
The OUTPUT CLOSE command closes all open output documents.

Example 2: Formatting and Routing the Output with OMS

This example demonstrates how to route output in different formats. The following is the sample code.

    OMS
      /SELECT TABLES
      /IF COMMANDS = ['Regression']
      /DESTINATION FORMAT = DOC OUTFILE = 'tables.doc'.
    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA
      /DEPENDENT income
      /METHOD=ENTER age address EDU employ.
    OMS
      /SELECT WARNINGS
      /DESTINATION FORMAT=HTML OUTFILE='warnings.htm'.
    FREQUENCIES age EDU.
    OMSEND.

The first OMS command selects tables from REGRESSION results and saves them to tables.doc.
The REGRESSION command generates the output used by OMS.
The second OMS command selects warnings from FREQUENCIES results and saves them to warnings.htm.
The FREQUENCIES command generates the results used by the second OMS command.
The OMSEND command ends all active OMS commands.

Example 3: Converting Output into Input with OMS

Using the OMS command, you can save the output to an SPSS Statistics data file and then use that output as input in subsequent commands or sessions.

    OMS
      /SELECT TABLES
      /IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics']
      /DESTINATION FORMAT=SAV OUTFILE='des_table.sav'
      /COLUMNS DIMNAMES=['Variables'].
    DESCRIPTIVES VARIABLES=salary salbegin.
    OMSEND.

The OMS command selects the "Descriptive Statistics" table from DESCRIPTIVES results and saves it as the SPSS Statistics data file des_table.sav.
The COLUMNS subcommand places the Variables dimension in the columns, so that the analyzed variables become the variables of the output data file.
The DESCRIPTIVES command generates the table used by OMS.
The OMSEND command ends the OMS command.

Summary

The OMS and OUTPUT commands provide the ability to manage one or more output documents programmatically, which makes large output easier to handle. For more information, refer to the IBM SPSS Statistics Command Syntax Reference, which is released with the product.

Working with Command Syntax

The powerful command syntax allows you to save and automate many common tasks. It also provides some functionality not found in the menus and dialog boxes. You can save your jobs in a syntax file so that you can repeat your analysis at a later date. This section provides best practices for working with command syntax.

Removing Unnecessary EXECUTE Commands

The EXECUTE command is designed for use with transformation commands and facilities such as ADD FILES, MATCH FILES, UPDATE, PRINT, and WRITE, which do not read data and are not executed unless followed by a data-reading procedure. Because the EXECUTE command forces the data to be read, unnecessary EXECUTE commands result in extra data passes and wasted time.
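To make the cost of an extra EXECUTE concrete, the following minimal Python sketch counts data passes under a deliberately simplified model: transformations are queued, and each EXECUTE or procedure triggers one pass over the data. The function name count_data_passes and the DATA_READING set are hypothetical names chosen for illustration; they are not part of SPSS Statistics, and the model ignores everything except the number of passes.

```python
# Toy model (an assumption for illustration, not SPSS internals):
# transformation commands are queued and cost nothing by themselves;
# every command that reads the data triggers exactly one data pass.
DATA_READING = {"EXECUTE", "FREQUENCIES", "DESCRIPTIVES", "REGRESSION"}


def count_data_passes(commands):
    """Return how many data passes a sequence of command names triggers."""
    return sum(1 for name in commands if name.upper() in DATA_READING)


# Two independent transformations followed by one procedure.
with_execute = ["COMPUTE", "EXECUTE", "COMPUTE", "FREQUENCIES"]
without_execute = ["COMPUTE", "COMPUTE", "FREQUENCIES"]

print(count_data_passes(with_execute))     # 2
print(count_data_passes(without_execute))  # 1
```

Under this model, removing the EXECUTE halves the number of passes for the same result, which matters most when the data file is large.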
Benefits

By identifying and removing unnecessary EXECUTE commands, you can optimize the arrangement of your syntax and reduce the time needed for reading data. This optimization is especially effective for I/O-intensive procedures.

Examples

The following examples demonstrate improper usage of EXECUTE commands and how to correct it.

Example 1: Using EXECUTE Between Independent Transformations

    COMPUTE var1=var1*2.
    EXECUTE.
    COMPUTE var2=var2*2.

The two COMPUTE commands operate on different variables, so they are independent. In this scenario, inserting the EXECUTE command causes an unnecessary data pass and lowers the execution efficiency of the transformations.

Ensuring that the transformations are truly independent is critical. If the transformations are in fact dependent, you may need to put the EXECUTE command between them to get the right results. For example:

Syntax 1:

    COMPUTE lagvar=LAG(var1).
    COMPUTE var1=var1*2.

Syntax 2:

    COMPUTE lagvar=LAG(var1).
    EXECUTE.
    COMPUTE var1=var1*2.

The only difference between Syntax 1 and Syntax 2 is the EXECUTE command between the two COMPUTE commands. However, the value of lagvar is very different in the two cases. Syntax 1 uses the transformed value of var1 to calculate lagvar, while Syntax 2 uses the original value.

Example 2: Inserting EXECUTE Between a Transformation and Statistical Procedure

    COMPUTE var1=var1*2.
    EXECUTE.
    FREQUENCIES VARIABLES=var1.

Sometimes it is necessary to execute transformations with the EXECUTE command. However, when the transformations are followed by one or more statistical procedures that need to read the data, the EXECUTE command becomes redundant. In this example, you should remove the EXECUTE command.

Working with SPSS Statistics Server

SPSS Statistics Server is robust, powerful analytical software that seamlessly scales from handling the analytical needs of a single department to hundreds of users across the enterprise.
It provides all of the features of SPSS Statistics, plus capabilities that deliver faster performance, more efficient processing of large datasets, and enhanced security in enterprise deployments. This section provides best practices for working with SPSS Statistics Server.

Decreasing Data Passing Costs with SPSS Statistics Server

For an organization with distributed offices, accessing large data files across offices takes a significant amount of time. Passing large amounts of data over the network can saturate the bandwidth, which disrupts the normal use of other applications. In this situation, SPSS Statistics Server is a good choice.

Benefits

With SPSS Statistics Server, data is read on the server machine, which avoids transferring large datasets to end users' desktops. The data transferred over the network is minimized. This prevents bandwidth saturation and improves the performance of SPSS Statistics as well as other mission-critical applications, including e-mail, enterprise resource planning (ERP), and customer relationship management (CRM).

Testing and Results

The following table compares the time needed to access data in these situations:
SPSS Statistics client is running in local mode and accesses files in the data center directly over the wide area network (WAN).
SPSS Statistics client is running in distributed mode and is connected to an SPSS Statistics Server installed at the data center.
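Before looking at the measurements, a rough estimate shows why the first situation is slow: the ideal time to move a file over a link follows directly from the file size and the bandwidth. The Python sketch below is a back-of-the-envelope calculation that assumes 1 MB equals 8 megabits and ignores latency, protocol overhead, and competing traffic; the function name transfer_seconds is hypothetical.

```python
def transfer_seconds(size_mb, bandwidth_mbps):
    """Ideal time in seconds to move size_mb megabytes over a link
    of bandwidth_mbps megabits per second (no latency or overhead)."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_mb * 8 / bandwidth_mbps


# A 50 MB data file over a 3.0 Mbps WAN link.
print(round(transfer_seconds(50, 3.0)))  # 133
```

The estimate of roughly 133 seconds is of the same order as the 2 minutes, 10 seconds reported in Table 5 for a 50 MB file accessed directly over the WAN.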
File Size  SPSS Statistics client connecting  SPSS Statistics client connecting to  Time saved with
           directly to the data over a WAN    the SPSS Statistics Server at the     SPSS Statistics Server
           (T1 3.0 Mbps)                      data center over a WAN (T1 3.0 Mbps)
50 MB      2 minutes, 10 seconds              4 seconds                             2 minutes, 6 seconds
250 MB     10 minutes, 50 seconds             40 seconds                            10 minutes, 10 seconds
1 GB       43 minutes, 17 seconds             80 seconds                            41 minutes, 57 seconds

Table 5: Time needed to access a data file

As shown in the above table, significant time savings can be achieved with SPSS Statistics Server, compared with the SPSS Statistics client alone, when accessing files in distributed offices. For example, about 2 minutes were saved for a 50 MB file, 10 minutes for a 250 MB file, and 42 minutes for a 1 GB file.

Note: The results are based on the assumption that the available bandwidth is 3.0 Mbps. In reality, the time saved will be greater because bandwidth is taken up by other applications such as e-mail, network backups, and other network resources. The data presented here are for illustrative purposes only. Actual results will vary depending on the configuration, bandwidth, and latency of the WAN. Therefore, organizations performing similar tests may not see identical results.

Guidelines for Purchasing Statistics Server

SPSS Statistics Server is especially designed for the following scenarios:
Organizations with distributed offices looking to centralize their data and IT infrastructure in one or more data centers.
Organizations with distributed offices that need to analyze and share files greater than 25 MB across offices.
Organizations that need to perform analysis on large datasets (greater than 100 MB) sourced from a SQL server or a data warehouse.

64-bit Computing with Statistics Server

The amount of physical RAM is critical for performance because accessing data from RAM is much faster than accessing data from a disk. For faster performance, it is best to have the entire dataset in RAM.
However, the total amount of RAM supported depends on the processor. Theoretically, 32-bit processors are limited to accessing 4 GB of RAM. Moving to a 64-bit machine allows you to install several times more RAM than a 32-bit machine can address, so analytical procedures on larger datasets execute much faster on a 64-bit machine.

SPSS Statistics Server has strong support for 64-bit computing on multiple server operating systems, including Windows Server, IBM® AIX®, Sun Solaris, HP-UX, Red Hat Enterprise Linux, and SUSE Linux Enterprise Server. Most analytical procedures run much faster on 64-bit SPSS Statistics Server than on the 32-bit SPSS Statistics client.

Benchmarking Test

We compare the processing times for statistical procedures run on 64-bit SPSS Statistics Server and on the 32-bit SPSS Statistics client.

Configuration of Statistics Server

CPU: 4 CPUs, Intel Xeon 3 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating system: Windows 2003 Server, 64-bit

Configuration of Statistics Client

CPU: 1 CPU, Intel T 7500, 2.19 GHz, dual-core processor
RAM: 3 GB
Operating system: Windows XP, 32-bit

Datasets

Two datasets were used:
Dataset 1: Size 2.1 GB, 5 million cases, 127 variables
Dataset 2: Size 3 GB, 10 million cases, 127 variables (for testing multithreaded procedures)

Result

The test results are summarized in the following table. The chosen procedures represent the typical types of analysis or data processing that an SPSS Statistics user might execute in daily work.
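When reading the table, it helps to know how its two derived columns relate to the raw timings. The Python sketch below assumes the usual definitions (time saved as a percentage of the client time, and speedup as client time divided by server time); the function names are hypothetical, and the sample values are the ADD FILES timings from Table 6.

```python
def time_saved_percent(server_seconds, client_seconds):
    """Percentage of the client run time saved by running on the server."""
    return (client_seconds - server_seconds) / client_seconds * 100


def speedup_factor(server_seconds, client_seconds):
    """How many times faster the server run is than the client run."""
    return client_seconds / server_seconds


# ADD FILES timings from Table 6: 18.45 s on the server, 169.34 s on the client.
print(round(time_saved_percent(18.45, 169.34), 2))  # 89.1
print(round(speedup_factor(18.45, 169.34), 2))      # 9.18
```

The two figures carry the same information in different forms: a speedup factor of about 9 corresponds to saving about 89% of the original run time.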
Procedures   64-bit Server  32-bit Client  Time Saved  Average Speedup
             (seconds)      (seconds)                  Factor
ADD FILES    18.45          169.34         89.10%      9.18
AGGREGATE    33.19          94.95          65.04%      2.86
MATCH FILES  22.00          224.17         90.19%      10.19
SORT         146.90         578.73         74.62%      3.94
CORRELATION  230.78         800.83         71.18%      3.47
FACTOR       140.95         219.22         35.70%      1.56
GLM          70.09          350.91         80.03%      5.01
MIXED        116.23         174.13         33.25%      1.50
TREES        615.00         885.49         43.98%      1.44
BETA         40.12          106.20         62.22%      2.65

Table 6: Benchmarking results for jobs run on 64-bit SPSS Statistics Server and 32-bit SPSS Statistics clients

Note: The results shown in Table 6 are based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).

Summary

The benchmarking results show a substantial speedup for most procedures with 64-bit SPSS Statistics Server. For best performance, use 64-bit SPSS Statistics Server.

Using Multiple Locations for Temporary Files

When SPSS Statistics Server processes data, it often keeps a temporary copy of that data on disk. In addition, some procedures (CACHE, SORT, AGGREGATE, transformations, etc.) can create temporary files during execution. The size of the temporary files ranges from the size of the data file to three times that size. Because the temporary files are writable and can get quite large, I/O operations are hard to manage, especially when there are several concurrent I/O-intensive users. In this situation, setting multiple temporary file locations is necessary.

Benefits

Using multiple temporary file locations, you can:
Limit users to the directories to which they have access.
Control the temporary file space allocated to each user by specifying a partitioned drive.
Improve performance when the locations are on different spindles. This option requires your server workstation to have multiple physical disks.

How to Set Multiple Temporary File Locations

There are several ways to set multiple temporary file locations using the SPSS Statistics Administration Console. Note that this optimization is available only when using SPSS Statistics Server.

Set Global Temporary File Locations

Chart 3 shows a screen capture from the SPSS Statistics Administration Console and highlights the setting for temporary file locations.

Chart 3: Setting for temporary file location

As shown in Chart 3, the administrator set three locations: c:\temp, d:\temp, and e:\temp. This setting is global for all users but can be overridden by the user profile or group setting.

Set Temporary File Location with Group Setting

The group setting applies to all users in a group, but it can be overridden with the setting in specific user profiles. To display the group settings, double-click the User Profiles and Groups node beneath the desired SPSS Statistics Server in the Server Administration pane. The Manage Users and Groups pane displays the currently defined user profiles and groups in the User Profiles and Groups grid.

To create a new user group, complete the following steps.
In the Manage Users and Groups pane, click New Group.
In the Create New Group dialog box, enter a name for the group.
Define any of the available settings, including temporary file locations.

Set Temporary File Location for Each User

To create a new user profile, open the Manage Users and Groups pane in the Server Administration pane and complete the following steps.
In the Manage Users and Groups pane, click New User Profile.
In the Create New User Profile dialog box, enter the name of the user for whom you are creating the profile.
If necessary, define any of the available settings.
You can define the temporary file location for this user. If you are creating a user profile to assign to a group, you don't have to define any settings. The group settings will be applied to the user.

For more information about creating and editing SPSS Statistics Server user profiles and groups, refer to Chapter 4 of the IBM SPSS Statistics Server Administrator's Guide.

Conclusion

This paper provides best practices for improving the efficiency and performance of IBM SPSS Statistics. These best practices cover data preparation, data transformations, data analysis, output, command syntax, and Statistics Server. By applying them, SPSS Statistics users can optimize their work and improve overall performance.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at http://www.ibm.com/legal/copytrade.shtml.