Data Cleansing and Analysis as a Prelude to Model Based Decision Support

Mark W. Isken
Dept. of Decision and Information Sciences
School of Business Administration
Oakland University
[email protected]

ABSTRACT

We describe a loosely integrated set of teaching modules that allow students to explore decision support systems (DSS) topics ranging from data warehousing to model based decision support. The modules are used as part of a spreadsheet based modeling approach to teaching DSS. We describe one particular module in detail in which students are forced to deal with a muddy dataset as a prelude to analysis, modeling and decision support system development.

Editor's note: This is a pdf copy of an html document which resides at http://ite.pubs.informs.org/Vol3No3/Isken/

The huge umbrella of decision support systems (DSS) has long provided a welcome gathering spot for those interested in building software applications based on a mixture of models, data analysis, and powerful interfaces. DSS attracts practitioners, scholars and students from a range of fields including information systems, operations research/management science, computer science, psychology and other business disciplines. While this breadth is a strength, it also poses challenges to those teaching DSS courses. The sheer range of potential topics poses a challenge, as does the fact that while management information systems (MIS) is a strong, popular major at many business schools, the elimination of the required management science course at numerous universities (Powell, 1998; Grossman, 2001) makes it more difficult to introduce model based DSS to students at a deeper than superficial level.

However, there has been a positive side effect of the pressure put on management science courses: the advent of the spreadsheet based modeling course. There has been a virtual revolution in spreadsheet based management science and operations management courses that seems to have stuck in business schools (Winston, 1996; Powell, 1998, 2001). Spreadsheets have evolved into a quite capable platform for end-user decision support modeling (Grossman, 1997; Ragsdale, 2001). Within Microsoft Excel, this evolution has resulted in the inclusion of Solver (Fylstra, Lasdon, Watson, and Waren, 1998) for optimization, Pivot Tables, database connectivity, numerous mathematical and statistical functions, and the Visual Basic for Applications (VBA) programming language. Sophisticated add-ins such as @Risk, Crystal Ball and PrecisionTree allow analysts to do Monte Carlo simulation and decision analysis within the familiar confines of Excel. Relational database management systems such as Microsoft Access provide a viable platform for creating decision support tools with the inclusion of VBA, database functionality and connectivity, as well as simple yet powerful interface development tools. Taken together, these developments suggest there exists tremendous opportunity for business analysts in the role of end-user modelers and end-user developers to pair models with rich data sources and create sophisticated decision support tools using readily available software components.
EVOLUTION OF A DSS COURSE
While this is all quite exciting for those with a passion for both systems development and mathematical modeling, there are challenges involved in effectively teaching such concepts and skills in today's business schools. At Oakland University, MIS is the largest major in the business school; there is no operations management major, nor is there a management science course taught regularly. Students take six credit hours of business statistics in their sophomore year. The MIS department has offered a traditional survey DSS course for a number of years. In 2001, we created a spreadsheet based version of the course. Given the heavy MIS emphasis at our school, we decided to develop a course that integrates MIS, DSS, and management science concepts with the goal of helping students become technically adept problem solvers in practical business settings. Having spent many years as a practicing operations research and decision support specialist in the healthcare industry, the author is convinced of the value of equipping business students with this integrated tool set. The course is taught in a computer teaching lab and is extremely hands-on. Currently it is cross-listed and attracts senior undergraduates, MBA students, and MS in Information Technology Management students. No computer programming experience is assumed, though all students are taught to program as part of the course. While we will give a brief overview in this paper, the course web site (http://www.sba.oakland.edu/faculty/isken/mis646/) contains detailed information about the entire course.

The primary focus of this paper will be on a particular set of modules used to integrate concepts of data based and model based decision support. These modules were designed to provide a natural progression from the world of corporate transaction system data to data based decision support and finally to model based decision support.
MOTIVATION FOR THE TEACHING MODULES
Many businesses have realized the tremendous value of their internal operational data for supporting decision making. This has played out in the software marketplace in the growth of the data warehousing, OLAP (online analytical processing), and data mining fields. Data warehouses provide a foundation of data that has already been merged from multiple systems, validated, transformed for purposes of facilitating analysis, stored according to a logical organizational scheme, and which is easily accessible using various client side tools such as query builders, spreadsheets and OLAP systems (Kimball, 1996). Multidimensional data modeling provides an alternative way of structuring databases for business analysis (Thomsen, 2002). OLAP tools such as Microsoft Excel's Pivot Table and Pivot Chart functionality provide a powerful front-end for flexible data exploration. However, no matter how good the data warehouse and data analysis tools, many common managerial decision making problems require models to aid the decision maker in arriving at a defensible decision.
We created a set of loosely integrated teaching modules that allows students to explore the inextricably linked concepts of data warehousing, OLAP, and decision modeling. In doing so, the students also become familiar with a host of advanced Excel tools and techniques that they will find extremely valuable in the business world. This multi-part module begins with muddy data, then OLAP, and concludes with queueing and optimization models. The application thread running through all three modules is that of analyzing call center data for purposes of assessing staffing capacity and customer service related questions. In this paper we will describe the muddy data module in some detail as it is somewhat unique, the OLAP module in less detail, and just briefly describe the follow-on modeling topics.

The progression of the modules parallels a situation commonly faced by business analysts. In response to some business problem or managerial inquiry, the analyst might begin by trying to better understand the problem through preliminary data analysis and exploration. Unfortunately, at this stage one is often faced with formidable data gathering and cleansing problems that must be tackled before the data is usable for analysis. Next, relatively unstructured data exploration and analysis might be done in an attempt to quantify the problem, explore relationships between variables, and gain better insight into the nature of the business problem. Depending on the specific problem at hand and the set of questions to be answered, data analysis may simply not be sufficient and the analyst must move on to modeling and more structured data analysis. This linear progression is idealized in the sense that, certainly, one can do modeling in parallel with (or even in the absence of) data analysis. However, it is a common scenario in practice and has pedagogical value in that business students are usually more comfortable with concrete data than with more abstract modeling topics.

MODULE 1 - MUDDY DATA

The perceived value of data buried in corporate information systems, the growth in enterprise resource planning (ERP), workflow and e-commerce systems, and the availability of inexpensive computing resources have contributed to massive growth in the data warehousing and OLAP markets (Dhond et al., 2000). Typically, such efforts involve a significant amount of effort to get data out of one place and into another. This phase, optimistically called extraction-transform-load (ETL), can be an unbelievably difficult exercise in dealing with missing, contradictory, ambiguous, undefined, and just plain messy data (Pyle, 1999; Mallach, 2000).
In general, much time is spent on getting data out of systems for:

• Analysis (spreadsheet, database, statistics package)
• Transformation and input into a new system (ERP, data warehouse)
Ideally, systems can export data in nice comma delimited, tab delimited, spreadsheet or database formats. This is not always the case with legacy systems. Often the fastest way to get data from such systems is to redirect a printed report to disk. Now the business analyst is faced with a huge electronic file that needs to be manipulated in order to make it useful. This is a typical and important aspect of working with data from corporate information systems. It is often dismissed as tedious yet mindless work. This is far from the truth, as it provides an opportunity for clever problem solving as well as for saving time during the analysis process.
We have created an Excel based teaching module entitled "Dealing with Muddy Data Using Excel" to introduce this real business problem to our students. The module has multiple learning objectives:
1. Let students experience the difficulty of ETL activities,

2. Allow students to learn some of Excel's useful data manipulation functions such as string parsing and date/time functions, and to generally raise their power level in Excel (which translates to most database and spreadsheet applications),

3. Help the students start to learn how to make guesses about function existence, to hunt down useful functions, and to figure out how to use them from online help systems and wizards,

4. Provide students with their first look at VBA via small user defined functions to aid the ETL process,

5. Provide a bridge for our predominantly MIS audience from the comfortable world (for them) of information systems to the uncomfortable world of data analysis and modeling,

6. Cultivate an MIS mindset that abhors manual data cleaning/transformation/manipulation and is always looking for structure in data files that can be exploited to create automated solutions,

7. Remind students that knowledge work often involves getting down and dirty with data and that their value as analysts is increased by developing proficiency in handling any data set thrown at them.

There are commercial products such as Monarch (from Data Watch (http://www.datawatch.com/index.asp)) and Content Extractor (Data Junction (http://www.datajunction.com/)) which address the general problem of extracting usable data from formatted electronic reports. However, the process of doing this task with a general purpose tool such as Excel gives students a better understanding of data manipulation concepts such as data types, string parsing, exploiting structure in data, and automation of repetitive tasks, which is often required even with commercial ETL tools. Furthermore, virtually every business analyst will have access to a spreadsheet and/or database package.

STRUCTURE OF THE CLASS

We begin the session with a brief introduction to the general problem and stress the ubiquitous nature of such problems and the undesirability of brute force manual approaches due to time constraints and the likelihood of error. Then the specific case that we are going to use is introduced: data generated from an automatic call distributor (ACD) at a call center in a large tertiary care hospital. We take the perspective of analysts interested in studying call center performance and suggesting staffing changes that would improve customer service and/or reduce operating costs. The ACD system generated many summary reports but it was difficult to get at this data in an electronic fashion. The system was capable of sending reports to disk, so we had a large number of reports for different areas (called splits) in the hospital and different dates generated and redirected to disk. A small set of slides is used during the discussion: DataMud Presentation (http://ite.pubs.informs.org/Vol3No3/Isken/DataMud.ppt).
The tutorial is an HTML document, MuddyDataTutorial (http://ite.pubs.informs.org/Vol3No3/Isken/index.php#MuddyDataTutorial), and contains hyperlinks to all of the data files and spreadsheets used. The tutorial is written in an informal tone. It begins with some background on the general problem and then has the student open the primary data file in a text editor such as WordPad. They get a chance to explore the structure of the file before attempting to import it into Excel. An annotated snapshot of the file (Figure 1) is displayed and we discuss structural elements such as report headers and footers which contain useful information (e.g. report date), blank and other unnecessary lines, and totals sections. The notion of looking for exploitable structure is emphasized. For example, subtotal lines follow a line with a string of hyphens partially spanning the page, the split number follows the heading "Split:", and the phrase "Daily Split Report" signals the beginning of a new report.

Figure 1: Sample ACD Report
EXAMPLE 1: SINGLE REPORT

We start by working with a text file that contains a single ACD report for one date. The Text Import feature of Excel, treating the file as Fixed Width (as opposed to Delimited), is used to import the file. Surprisingly, most students, though regular users of Excel, had little experience even with simple text file importing. Figure 2 shows the result of this step.

Figure 2: Result of First Import
While the numeric data in the report falls nicely into columns and rows, one important piece of information, the Split Number and Split Name, has gotten somewhat garbled:
Figure 3: Garbled Split Name
The students are now told that we want to create a formula in cell A3 that will return the split number, i.e. 2. They can assume that the split number is always <=100 and they are given a hint that the Value() and Mid() functions might be useful.

Next we want to create a formula in cell J3 that will return the split name, i.e. OUTPAT CLINIC REG. They get the following guidance:

Step 1. Concatenate A4:G4 with a formula in B3. Use the & operator.

Step 2. Find the numeric position of the left parenthesis in the result in cell B3. Use the Find() function in cell F3.

Step 3. Use the result of Step 2 along with the Mid() function to get OUTPAT CLINIC REG. in cell G3.
Figure 4: Fixing the Split Name
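To make the exercise concrete, here is one formula sketch of the sort students produce for the split number. It assumes the garbled split text sits in cell A4 as in Figure 3, and it adds a Find() call to the hinted Value()/Mid() pair so the formula tolerates varying amounts of whitespace around the number:

=VALUE(MID(A4, 7, FIND("(", A4) - 7))

Here MID() grabs the characters between the "SPLIT:" label (which occupies the first six characters) and the left parenthesis, and VALUE() converts that text, ignoring the surrounding spaces, into the number 2.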
Of course, several of these steps could be combined. However, students are urged to break problems like this into small formulaic chunks, as it is much easier to debug multiple simple formulas than one enormous formula containing five or six nested Excel functions. After getting the desired result, one can always build a single formula by copying and pasting the pieces.
A LITTLE DATE/TIME PROBLEM
It turns out that the date has luckily made it unscathed into a cell, but the time periods in column A look like:

07:00-07:30AM

This is one string in a cell. The students are challenged to create formulas in two separate cells that contain the actual date/time values corresponding to the beginning and end of the period.
This leads to a discussion about how dates and times are stored in most spreadsheet and database packages. Our experience has been that few students are aware of how they are stored and thus cannot effectively manipulate date/time values in formulas nor take advantage of the wealth of date/time functions built into standard office productivity applications. A related discussion is likely to ensue about the differences between date/time values, strings that look like dates, and different date/time cell formats. We also use this first example to talk about things like using the concatenation operator to create unique field names in a single row in preparation for creating Pivot Tables or importing the data into a database. Finally, to get ready for the second part of the tutorial and to illustrate another useful approach to text manipulation, we re-import the text file into a single column and then use the Text to Columns tool available in Excel to parse just the data rows into columns. This is all explained in detail in MuddyDataTutorial (http://ite.pubs.informs.org/Vol3No3/Isken/index.php#MuddyDataTutorial).
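One plausible pair of formulas, assuming the period string 07:00-07:30AM is in A1, the report date is in Q2, and the first formula lives in B1 (the cell addresses are illustrative, not the tutorial's exact layout):

B1 (end of period):   =$Q$2 + TIMEVALUE(MID(A1,7,7))
C1 (start of period): =B1 - TIME(0,30,0)

Only the end time carries the AM/PM marker, so it is easiest to convert the end time first with TIMEVALUE(). Then, since Excel stores date/times as day-valued serial numbers and the report uses 30-minute bins, subtracting a 30-minute TIME() value yields the start of the period.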
EXAMPLE 2: MANY DAYS OF REPORTS IN ONE FILE

Now that the students have some familiarity with data importing and cleaning, we tackle a case of over sixty ACD reports from three different splits and multiple dates all contained in a single text file. The size of the file drives home the point that we really do not want to manually wade through this file to prepare it for analysis. Certainly, if the data files will need to be dealt with on an ongoing basis or if the file is enormous, the value of automation is clear. However, even if the file is somewhat of a one-time obstacle and of manageable size, the value of the learning that takes place in figuring out how to automate the process usually outweighs any extra time spent doing so. The goal is to eventually create a spreadsheet that looks like Figure 5 (many more columns to the right).
Figure 5: Our Goal
In other words, we have gotten rid of all the extraneous rows and we have the report date and split number on each row so that we know where the data came from. This file is now in good shape for pivoting or for importing into a database or statistics package.
The tutorial walks the students through one general approach for tackling this problem. A combination of instruction, hints, screen shots, and questions is used to guide the learning experience. Solutions to the questions can be found in the file MuddyDataTutorial-Notes (http://ite.pubs.informs.org/Vol3No3/Isken/index.php#MuddyDataTutorial-Notes). The students are reminded that this is certainly not the only way to proceed and they are urged to look for other, perhaps better, approaches. Along the way they encounter concepts and tactics like:

• inserting a column containing a unique sequence number so that the original row order can be regained as needed,

• use of functions such as FIND(), ISBLANK(), ISERROR(), and ISNUMBER() to test cells for certain conditions,

• using VBA to create user defined functions to encapsulate complex, nested formulas,

• using data filtering to hide unwanted data rows,

• copying and Paste Special | Values to freeze intermediate results of complex calculations.
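As a flavor of the third tactic, here is a minimal sketch of the kind of user defined function students write; it wraps the nested Find/Mid/Left logic from Example 1 behind a single worksheet-callable name (the function name and exact rules are ours, for illustration, not code from the tutorial):

Function SplitName(reportLine As String) As String
    ' Return the text between "(" and ")" in a garbled split line
    ' such as "SPLIT: 2 (OUTPAT. CLINIC REG.)".
    Dim p As Integer
    p = InStr(reportLine, "(")
    If p = 0 Then
        SplitName = ""                 ' no parenthesized name found
    Else
        SplitName = Mid(reportLine, p + 1)
        If Right(SplitName, 1) = ")" Then
            SplitName = Left(SplitName, Len(SplitName) - 1)
        End If
    End If
End Function

Once it is defined in a VBA module, a student can simply enter =SplitName(B3) in a cell instead of maintaining a three- or four-formula chain.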
APPLICATIONS

This tutorial has never failed to elicit examples from students from their places of employment. One student, a network manager for an auto supplier, used these ideas to create a network utility that summarized the file content of their LAN by processing the output of a VBA automated stream of DOS DIR commands from within an Excel application. In addition to data cleansing, the utility also included pivot table based analytics (the subject of the next section) which were made available via their corporate intranet. Another student was faced with a difficult data transformation problem as part of an ERP implementation. The project involved a mainframe report directed to a text file with over 50,000 lines of data describing information security access levels for several thousand employees. Also, a graduate student who was working in the University's finance department gave the class a demonstration of a commercial ETL product and described numerous muddy data problems that they deal with routinely. Finally, a former graduate student contacted me and described her new position with a major consulting firm: "Lots of modeling, muddy data problems, and working with OLAP tools and data warehousing. I know the muddy data area was a particular area of interest to you and it seems that it is a really BIG issue for many businesses - actually bigger than I imagined."

MODULE 2 - OLAP STYLE ANALYSIS WITH EXCEL PIVOT TABLES AND CHARTS

The muddy data tutorial is followed by a module focusing on OLAP style analysis using Excel Pivot Tables and Pivot Charts. While a precise definition of OLAP software technology is elusive, an oft cited definition is the one given by Nigel Pendse, co-founder of The OLAP Report (http://www.olapreport.com/), an industry research group. OLAP is characterized as tools providing
fast analysis of shared multidimensional information
(FASMI (http://www.olapreport.com/fasmi.htm)). The
multidimensional part, arguably the defining feature
of OLAP (Thomsen, 2002), refers to the approach of
organizing and accessing business facts, such as sales
volume, along natural dimensions such as customers,
products, and time. The notion of a data cube arises
from this approach. Numerous software tools exist
for creating, managing, and analyzing multidimensional data to support business analysis. A few good
sources for information related to data warehousing
and OLAP include The Data Warehousing Information
Center (http://www.dwinfocenter.org/), DSS Resources
(http://dssresources.com/), and Intelligent Enterprise
(http://www.intelligententerprise.com/).
Excel Pivot Tables, while certainly not considered a full-featured OLAP analysis tool, provide a readily accessible and simple way to introduce OLAP analysis concepts in the classroom. Through polling our students, many of whom are working and are routine spreadsheet users, we have found that very few are familiar with Pivot Tables and Pivot Charts. This session begins with a discussion of basic data warehousing concepts including star schemas and multidimensional data models. Differences are highlighted between databases designed for analysis and databases designed to support transaction processing. For this module we have created a fictitious call center that responds to technical support calls for Microsoft Office products. The application itself, a computer technical support call center, was chosen partly because it falls under the MIS managerial domain - MIS students may end up managing such call centers. The data model, implemented in the database management package Microsoft Access, consists of time, customer, application, and problem dimensions and two facts: time spent on hold and time to service the technical support call. The dataset consists of approximately 26,000 rows of data generated from a simulation model of a call center. Each row is a single phone call. Like the previous module, the students are given a tutorial consisting of a background section and a combination of instruction, hints, screen shots, and questions. Again, the tutorial is an HTML document, ExcelPivotTutorial (http://ite.pubs.informs.org/Vol3No3/Isken/index.php#ExcelPivotTutorial), and contains hyperlinks to all of the data files and spreadsheets used.

A warning is in order: this tutorial contains material related to Access. At our university, since many of the students taking the course are MIS majors, they are familiar with Access (often more so than with Excel). However, even those students with limited or no experience with Access are usually able to follow along even if they cannot do all of the Access portions of the tutorial. I believe it is important for business analysts to have a working knowledge of database concepts, and Access is a very good package for meeting this objective. It also offers a perfect opportunity to discuss the pros and cons of spreadsheets versus databases for various business computing tasks. For those interested in an Excel only version of this tutorial, see ExcelPivotTutorial-NoAccess (http://ite.pubs.informs.org/Vol3No3/Isken/index.php#ExcelPivotTutorial-NoAccess).
After becoming familiar with the data model, we use
the MS Access QBE (Query by Example) tool to create basic queries and quickly find that SQL is far from
ideal for ad-hoc exploratory analysis. This motivates
the need for flexible OLAP tools such as Pivot Tables.
The students are provided with an Excel workbook
containing the results of a query that combines fields
from the dimension and fact tables. The bulk of the
tutorial focuses on the basics of creating Pivot Tables
and Charts to answer business questions such as the
volume of calls by various combinations of customer,
application, problem and time values. Specific topics
include creating the pivot table, sections of pivot tables,
pivoting, changing display types, drilling down to detail data, formatting, general table options, controlling
subtotals, filtering of field values and creating pivot
charts.
We recently used this tutorial with a group of healthcare executive MBA students (using data from a surgical recovery room) and it was received extremely well. They are frustrated by the lack of data to support their decision making and quickly saw the potential for these types of tools. Many of their questions focused on how to go about getting access to such data from healthcare information systems. This provided a nice lead-in to the section in the tutorial where we see how to connect Excel's Pivot Tables to data in an external database such as Access. Discussion of commercial OLAP tools (e.g. EssBase, WebFocus, Business Objects, Oracle Express, SQL Server Analysis Services) complements
this topic since such tools usually are capable of working with data in a myriad of database and file formats.
As the tutorial nears the end, the students are presented with their task. They are placed in the role of a call center manager with staffing and scheduling responsibilities and challenged to analyze the data in response to customer complaints. To begin to address this issue, they are first guided to look for relationships between time on hold and staffing levels by time of day and day of week. They use pivot charts to illustrate such relationships. Then the tutorial closes with a series of managerial questions and a preview of things to come:
In this module, you've analyzed call center data using pivot tables and pivot charts. You've taken the perspective of the manager of the call center. As the manager, certain questions would likely arise based on the analysis.

1. It seems like there are times where staff utilization is really low. If we reduce staffing, how will the average time on hold be affected?

2. There are times when average time on hold and staff utilization are both pretty high. I bet we could reduce average time on hold by adding staff (thus reducing staff utilization). How much staff would we need to reduce average time on hold to 30 seconds? 15 seconds?

3. Average time on hold isn't the only important performance measure. We're also interested in the percentage of customers who have to hold and the percentage of customers who hold less than 1 minute. How could we calculate these measures from the current data and how could we figure out how they will be affected by changes in staff levels?

4. We currently just have our tech support staff working 8-hour shifts. The midnight shift is from 12a-8a, the day shift is from 8a-4p, and the afternoon shift is from 4p-12a. If we were better able to estimate our staffing needs by hour of day and day of week, how would we figure out how to schedule our staff (e.g. shift start times and shift lengths) to meet these staffing needs?

These are not easy questions, but they are the type routinely faced by managers of a myriad of customer service related systems including software tech support, retail sales (L.L. Bean, Lands' End), healthcare, police and emergency services, fast food, insurance companies, and product support. While historical data is useful for analyzing current system performance, we need something else to handle "What if?" or "What's best?" type questions. We need models. Try to think a bit about what makes questions 1-4 difficult and we'll discuss modeling with spreadsheets next time.

FOLLOW-ON ANALYTICAL POSSIBILITIES

The previous two modules have set the stage to discuss the use of queueing models to explore the relationship between capacity and system performance and customer service measures such as average and upper percentiles of time on hold, employee utilization, and lost calls. We use the spreadsheet modeling based textbook Practical Management Science (Winston and Albright, 2001), which includes a chapter on basic queueing models. The students also receive a tutorial which introduces queueing models in the context of our call center example. The basic physics of the M/M/s queue are discussed and the students work with a pre-developed spreadsheet model that calculates performance measures for user defined parameter values such as the call arrival rate and service time. The objective is to have students discover that: 1) retrospective data analysis is simply not sufficient to answer these kinds of basic operational questions, 2) mathematical models can provide a way to address such questions, 3) models are inherently based on a host of simplifying assumptions, and 4) they should have paid better attention in their math and probability/statistics courses.

The last objective is a real issue (Groleau, 1999). It is valuable for students to see what's going on underneath the hood of these mysterious things called mathematical models. As one graduate MIS student recently commented, it was nice to learn that there was no man behind the green curtain. However, the mathematical background of many of the students is sparse and care must be taken to present the material so they do actually gain insight from it. By exploring the queue length and wait time distribution formulas for the M/M/s queue, students see that not everything in business can be described by linear regression lines, and that very powerful and informative statements about real business systems can be made by equations containing nothing more complex than summations, exponentiation, and factorials.
Then students answer basic questions using the M/M/s spreadsheet model concerning the relationship between wait time and changes in call volume, staffing to meet service level goals, and the use of Little's Law. They are also asked to examine the formula for the probability of an empty system in the M/M/s queue and to discuss the relative merits of cell based formulas versus procedures written in VBA for such iterative calculations. This brings MIS development issues back into a discussion about modeling.
Like others, we have struggled with the difficulty of
illustrating the dynamics of queueing systems within
the spreadsheet environment. We usually demonstrate
a discrete event simulation model of a simple queueing
system to try to augment the spreadsheet based queueing models. Recently, this issue has been addressed
by contributors to this journal (see Ashley (2002) and
Grossman and Ingolfsson (2002)) and we are actively
working to integrate these ideas into the queueing module.
Finally, we discuss the fact that though we can use this queueing model to estimate staffing needs by time of day and day of week, our job is far from over. Now we have to figure out how to develop work schedules that meet our staffing targets and conform to policies concerning shift lengths, start times, days worked per week, and other employee desires and constraints concerning schedules. This provides a way to introduce optimization models using Solver in Excel. Over several class sessions, the students are introduced to the basics of linear and integer programming using Solver. They explore and enhance a basic employee scheduling model (Albright, 2001). We then use this model to provide a bridge to talking about creating full blown decision support systems on the Excel platform, complete with a graphical user interface that hides complex mathematical models from the user (see Albright (2001), Zobel, Ragsdale and Rees (2000), and Ragsdale (2001)).
SUMMARY
We have developed a number of related modules that take students on a journey through a range of data and model based DSS topics using Excel. These tutorials naturally lead the students to difficult managerial questions involving call center staffing and scheduling for which mathematical models prove extremely helpful. The greatest pedagogical challenge has been to structure the DSS course (recently renamed Business Analysis and Modeling) so as to move at a pace the students can handle and yet cover the range of topics necessary for the students to actually build Excel based DSSs for their end-of-term project. In addition to the topics discussed above, students are taught to program in VBA (using Albright, 2001), develop basic user forms that communicate with their spreadsheet models, do spreadsheet simulation with @Risk, and do decision analysis with PrecisionTree (specific topics have varied slightly over semesters). Obviously, in a single semester, we can only explore each topic to a limited degree. Ideally this would be a two course sequence with the VBA and other application development topics moved into the second course. The lack of a required management science course makes this difficult to accomplish and so we have decided to err on the side of packing a bit too much into this one course. At the very least, students leave the class as power users of Excel and with a newfound appreciation and respect for this unassuming tool.
REFERENCES

Albright, S.C. (2001), VBA for Modelers, Duxbury, Pacific Grove, CA.

Ashley, D.W. (2002), "An Introduction to Queuing Theory in an Interactive Text Format", INFORMS Transactions on Education, Vol. 2, No. 3, http://ite.pubs.informs.org/Vol2No3/Ashley/

Dhond, A., Gupta, A., and S. Vadhavkar (2000), "Data Mining Techniques for Optimizing Inventories for Electronic Commerce", Proceedings of KDD 2000, Boston, MA.

Foulds, L. and C. Thachenkary (2001), "Empower to the People: How Decision Support Systems and IT Can Aid the Analyst", OR/MS Today, Vol. 28, No. 3, pp. 32-37.

Fylstra, D., L. Lasdon, J. Watson, and A. Waren (1998), "Design and Use of the Microsoft Excel Solver", Interfaces, Vol. 28, No. 5, pp. 29-55.

Groleau, T.G. (1999), "Spreadsheets Will Not Save OR/MS!", OR/MS Today, Vol. 26, No. 1, p. 6.

Grossman, T.A. Jr. (1997), "End User Modeling", OR/MS Today, Vol. 24, No. 5, p. 10.

Grossman, T.A. Jr. (2001), "Causes of the Decline of the Business School Management Science Course", INFORMS Transactions on Education, Vol. 1, No. 2, http://ite.pubs.informs.org/Vol1No2/Grossman/

Grossman, T.A. Jr. and A. Ingolfsson (2002), "Graphical Spreadsheet Simulation of Queues", INFORMS Transactions on Education, Vol. 2, No. 2, http://ite.pubs.informs.org/Vol2No2/IngolfssonGrossman/

Kimball, R. (1996), The Data Warehouse Toolkit, John Wiley & Sons, Inc., NY.

Mallach, E.G. (2000), Decision Support and Data Warehouse Systems, McGraw-Hill, Boston, MA.

Powell, S.G. (1997), "The Teachers' Forum: From Intelligent Consumer to Active Modeler, Two MBA Success Stories", Interfaces, Vol. 27, No. 3, pp. 88-98.

Powell, S.G. (1998), "The Teachers' Forum: Requiem for the Management Science Course?", Interfaces, Vol. 28, No. 2, pp. 111-117.

Powell, S.G. (2001), "Teaching Modeling in Management Science", INFORMS Transactions on Education, Vol. 1, No. 2, http://ite.pubs.informs.org/Vol1No2/Powell/

Pyle, D. (1999), Data Preparation for Data Mining, Academic Press, San Diego, CA.

Ragsdale, C.T. (2001), "Teaching Management Science with Spreadsheets: From Decision Models to Decision Support", INFORMS Transactions on Education, Vol. 1, No. 2, http://ite.pubs.informs.org/Vol1No2/Ragsdale/

Thomsen, E. (2002), OLAP Solutions: Building Multidimensional Information Systems, Wiley Computer Publishing, NY.

Winston, W. (1996), "The Teachers' Forum: Management Science with Spreadsheets for MBAs at Indiana University", Interfaces, Vol. 26, No. 2, pp. 105-111.

Winston, W. (2001), "Executive Education Opportunities: Millions of Analysts Need Training in Spreadsheet Modeling, Optimization, Monte Carlo Simulation and Data Analysis", OR/MS Today, Vol. 28, No. 4, pp. 36-39.

Winston, W. and Albright, S.C. (2001), Practical Management Science, Duxbury, Pacific Grove, CA.

Zobel, C., Ragsdale, C.T., and L. Rees (2000), "Hands-on Learning Experience: Teaching Decision-Support Systems Using VBA", OR/MS Today, Vol. 27, No. 6, pp. 28-31.
APPENDICES
DEALING WITH MUDDY DATA USING EXCEL
Background
The perceived value of data buried in corporate information systems, the growth in enterprise resource planning (ERP), workflow and e-commerce systems, and the availability of inexpensive computing resources have contributed to massive growth in the data warehousing and OLAP markets. Typically, such efforts involve a significant amount of effort to get data out of one place and into another. This phase, optimistically called extraction-transform-load (ETL), can be an unbelievably difficult exercise in dealing with missing, contradictory, ambiguous, undefined, and just plain messy data.

In general, much time is spent on getting data out of systems for:

• Analysis (spreadsheet, database, statistics package, etc.)
• Transformation and input into a new system (ERP, data warehouse, etc.)

Ideally, systems can export in nice comma delimited, tab delimited, spreadsheet or database formats. That's not always the case with legacy systems. Often the fastest way to get data from such systems is to redirect a printed report to disk. Now the business analyst is faced with a huge electronic file that needs to be manipulated in order to make it useful. Similarly, automating the parsing or mining of log files, web pages, and other irregularly formatted text files leads to similar problems. This is a necessary and important part of working with information systems. It is often dismissed as tedious yet mindless work, kind of a "dogs could do it" mentality. This is far from the truth. We don't all start out as strategic IS visionaries, and even if we do, small companies require people who are willing to sweep the floor as well as envision the future of the Internet, in the same day. Today we are going to sweep the floor (or more appropriately, take out the garbage).
Objective
In this tutorial we will learn techniques for dealing with ASCII (text) data files that we want to get into Excel and/or Access. We will learn about string parsing and other string functions, date/time functions, and the Text Import and Text to Columns Wizards. We will also learn some tactics for dealing with really large text files that have been created from mainframe reports being dumped to an ASCII file. Our MIS mindset abhors manual data cleaning/transformation/manipulation and is always looking for structure in data files that we can exploit to create automated approaches to data cleaning and transformation.
The Tutorial
We will use data from an ACD system at a major tertiary care hospital. As analysts, we are interested in studying call center performance and suggesting staffing changes that would improve customer service and/or reduce operating costs. The ACD system generates many summary reports but it is difficult to get at this data in an electronic fashion. Fortunately, the system was capable of sending reports to disk. So we had a large number of reports for different areas in the hospital (each area is called a Split) and different dates generated and dumped to disk. Here's what one report for one split for one day looks like (see Figure 1).
Hmm, doesn't look too bad. We'll look at two different scenarios to illustrate a whole host of techniques for dealing with data in Excel. In the first example, we'll start simple by just dealing with one report from one date as in the screen shot above. Our goal is to get the report into multiple columns in Excel so we can analyze the data. We won't do any analysis here; we're just concerned with the data loading, cleaning, and transformation problem. Since there's just one report, we don't need a ton of elaborate techniques to deal with the report header and footer info. In the second example we will deal with one huge text file that has many such reports for different splits and different dates all in the same file. We'll use more sophisticated techniques to deal with this very common scenario.
Example 1: One report for one date
The file example1.txt is available in the Downloads section of the course web site. It contains the report you see above. Let's look at some quick ways to get the data in that report into different columns in Excel. First, open the file in WordPad and take a look around the file. OK, close the file.
Method 1: Fixed Width Multi-Column Text Import
In Excel, open example1.txt (http://ite.pubs.informs.org/Vol3No3/Isken/Example1.txt) as a text file. You'll see the first screen of the Text Import Wizard. Notice there are two main types of imports: Delimited and Fixed Width. Let's do Fixed Width since we saw in WordPad that the columns are lined up nicely in the data section. You can use the horizontal and vertical scroll bars to view your data as you go through the Wizard. Also, notice the Start Import at Row spinner. You can use this to skip the report header information. Let's leave it at row 1 for now.
Click Next and Excel will guess at where you want the column breaks. Here's the result of its guess (notice I've scrolled down a bit to look at the data). Excel will often do a nice job of setting the column breaks with nicely formatted data. You can modify the breaks by clicking and dragging/dropping per the on-screen instructions. In this case, Excel has done a fine job (look for yourself) and we'll click Next.

The next screen lets you go column by column, setting data types and/or skipping columns. This can be very useful but we'll just click Finish.
Ta da. Notice that each data value in the body and totals section of the report is in a separate column, which means we can actually analyze the data. Notice also that we got lucky in that the date in Q2 fell into a column. Now let's look at a few basic techniques for dealing with the not so nice parts of this spreadsheet. This is a warm-up for Example 2, in which we'll really have to take advantage of Excel's rich function library as well as our own ingenuity.
Split number and name

Here's a table showing the values in cells A4:D4:

A4: SPLIT: 2 (OUT
B4: PAT. CLI
C4: NIC REG.
D4: )
We want to create a formula in cell A3 that will return the split number, i.e. 2. Assume that the split number is always <=100. HINT: Use the Value() and Mid() functions.

We also want to create a formula in cell J3 that will return the split name, i.e. OUTPAT CLINIC REG. (assume the name is always surrounded by parentheses). HINT: Do this in multiple steps. After the four steps below, a screen shot is shown with the results.

Step 1. Concatenate A4:G4 with a formula in B3. Use the & operator or the Concatenate() function. Going out to column G should catch even very long split names.

Step 2. Find the numeric position of the left parenthesis in the result in cell B3. Use the Find() function in cell F3.

Step 3. Use the result of Step 2 along with the Mid() function to get OUTPAT CLINIC REG.) in cell G3.

Step 4. Now we just need to get rid of the right parenthesis. Use Left() and Len() to do this. The final result is in J3.
See if you can figure out how to do these. If you can't, see the file Example1-splitnum.xls and look at the formulas in A3, B3, F3, G3 and J3.
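For self-checking, here is one plausible set of formulas for the four steps; the exact formulas in Example1-splitnum.xls may differ, so treat these as a sketch:

B3:  =A4&B4&C4&D4&E4&F4&G4       (Step 1: rebuild "SPLIT: 2 (OUTPAT. CLINIC REG.)")
F3:  =FIND("(",B3)               (Step 2: position of the left parenthesis)
G3:  =MID(B3,F3+1,LEN(B3))       (Step 3: everything after the "(")
J3:  =LEFT(G3,LEN(G3)-1)         (Step 4: drop the trailing ")")

A similar idea gives the split number in A3, e.g. =VALUE(MID(B3,7,F3-7)), which converts the text between the "SPLIT:" label and the parenthesis into the number 2.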
Field Names
Notice that the column headers in the report are spread over multiple rows. If we want to eventually get this data into a database such as Access, or if we want to create a Pivot Table, we'll need unique field names on one row. Use the & operator to do this.
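For example, if the three header fragments for one column sit in B5, B6, and B7 (illustrative addresses; the actual header rows depend on the import), a formula like

=TRIM(B5&" "&B6&" "&B7)

placed in a new row glues them into a single unique field name, and TRIM() cleans up any extra spaces.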
A Little Date/Time Problem

Notice that in column A are time periods that look like: 07:00-07:30AM. This is a string representing a range of time stored in a single cell. How might you create formulas in two separate cells that contain the actual date/time values corresponding to the beginning and end of the period? Recall that the date is up in cell Q2. So, you want to end up with: 11/2/92 7:00 AM and 11/2/92 7:30 AM.

Give it a try. HINT: Use the Mid() and TimeValue() functions on the end time first, and understand how Excel (and most other software, for that matter) stores dates and times to get the beginning time. If you get stuck, check out the formulas in A43 and B43. How do those formulas work?
Unprintable Character Codes

Often when reports are sent to disk, you'll see unprintable characters which show up as empty boxes. For example, see A39 in the following screen shot.

Take a look at the formula in B39, =CODE(A39). The Excel function CODE() returns the ASCII code number for the given character. So it turns out that our unprintable character in A39 is a ^Z (or Ctrl-Z), which I believe is an end of file character.
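A handy companion to CODE() here is Excel's CLEAN() function, which strips nonprintable characters (ASCII codes 0-31, which includes Ctrl-Z) from text. For example:

=CLEAN(A39)

returns the contents of A39 with the stray character removed, which saves hunting down such characters one by one.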
A Different Approach Using Text to Columns
A similar approach to importing this file can be used when you really don't want the header or footer information to get chopped up. Close all of your worksheets and try to reopen example1.txt. This time, select Delimited in the first screen of the Text Import Wizard.

Leave it set to Tab delimited. Since there are no tabs in our file, all of the data will come into a single column in the spreadsheet. Just click Finish and here's what you'll see.
A useful trick at this point is to change the font from Arial (which is a variable width font) to a fixed width font such as Courier New so you can see whether the columns are naturally aligned. Much nicer, no?

However, the data is still not usable since it's embedded in the strings in column A. Select the range A10:A36 (the data rows) and pick Text to Columns from the Data menu. You'll get a Wizard that looks a lot like the Text Import Wizard. You can now either select Delimited (by what?) or Fixed Width and Excel will parse column A into multiple columns. Try it. This ends the warm up. Now let's tackle something a little more difficult.
Example 2: Many days of reports in one file
Completing this second part of the tutorial is your job. There are seven questions within the tutorial that you should answer in addition to completing the overall task outlined in the tutorial.

A common scenario is to get a huge report which has been directed to an electronic file. We'll use the file called example2.txt. It has about sixty ACD reports all in one file. Open it in WordPad and scroll through it a bit. Each report is for some date and some split. There are three different splits. Our goal is to eventually create a spreadsheet that looks like the following (with many more rows and columns).
In other words, we've gotten rid of all the extraneous rows and we have the report date and split number on each row. This file is now in good shape for pivoting or for importing into a database or statistics package.

Let's start by opening it in Excel. Open it as a tab delimited text file so it all comes in one column (A), just like we did at the end of Example 1. Change the font to Courier New so we can better see the structure of the file. You'll see it's just a sequence of the report we dealt with in Example 1 for different dates and different split numbers. Save it as an XLS file called acddata-[your name].xls.
Step 0: Add a Sequence Number column
It's not a bad idea to insert a column at the far left of the sheet and sequentially number each row in the spreadsheet. I also added a row at the top and created the field names Sequence and Line.

Notice that the second report starts at Sequence=36. The Sequence field provides a means of recovering the order of the rows in the report and a reference back to the original report after we do a bunch of data gymnastics. This sheet is the sheet named Step0 in acddata-finished.xls (http://ite.pubs.informs.org/Vol3No3/Isken/acddata-finished.xls). Note: This file contains VBA code. Make sure you select Enable Macros when opening the file and that you have Macro Security set to Medium (Tools | Macro | Security).
The basic strategy used to get this file into the desired format is summarized in the following steps. Each step is described in a bit more detail below, and each step has a corresponding tab in acddata-finished.xls (http://ite.pubs.informs.org/Vol3No3/Isken/acddata-finished.xls) so you can see what was done.

Step 0: Add a record sequence number.

Step 1: Identify unneeded rows using indicator formulas.

Step 2: Sort the records by Keep (descending) and Sequence (ascending) so the unneeded rows sort to the bottom of the worksheet but all other rows are kept in correct order. Classify all the remaining rows into a small number of record type categories. We used a little VBA function called GetRecType() which contains a sequence of If-Then type rules to classify each record.

Step 3: Copy the results (not including the unneeded rows) from Step 2 and Paste Special | Values into a new worksheet to freeze all the formulas. Insert two new columns: one for report date and one for report split number. Write formulas to fill these columns.

Step 4: Copy the results (not including the unneeded rows) from Step 3 and Paste Special | Values into a new worksheet to freeze all the formulas. Sort it by split, date, and then sequence. Use Data Filter to see just the record type we're interested in (the data rows).

Step 5: Copy and paste the filtered data into a new worksheet and delete unneeded columns. Then do a Data | Text to Columns on the Line field to get the data into separate columns. Finally, insert two columns and use formulas to break up the time bin string into separate start and end times as we did in Example 1.

Step 6: Paste Special | Values the column headings that we created in Example 1.
Step 1 - Identify unneeded rows using indicator formulas

Scanning the data usually results in identification of several types of unneeded rows. In this example, there are three of these:

1. DAILY SPLIT REPORT headers
2. Blank rows
3. Rows with many hyphens used as lines

In sheet Step1 we added three columns, one for each of these unneeded row types. They are the indicator columns B, C and D and are labeled Daily, Blanks, and Hyphens. In each indicator column there is a formula that returns a 0 if this row was that type of unneeded row and returns a 1 if not.

Notice that Sequence=1 is a DAILY SPLIT REPORT header, so the value in the Daily column is 0, indicating that we don't want this record for that reason. Likewise, Sequence=3 is a blank row and Sequence=8 is a hyphen row. So, if a row has a 0 in column B, C or D, we know we don't want it. Then a fourth column was inserted immediately to the right of column D. The formula in this new column E (labeled Keep) is just =PRODUCT(B2:D2) (filled down), which will equal 0 if any of the indicators in that row is equal to 0. Now we can sort the data by column E (descending) and column A (ascending) to move the unneeded rows to the bottom. By cleaning out obviously unneeded stuff, it can be easier to find structure in the remaining records.

Question 1: How does the formula in cell B2 work? The formula is: =IF(ISERROR(FIND("DAILY",F2)),1,0)

Question 2: How does the formula in cell D2 work? The formula is: =IF(LEN(TRIM(F2))=0,0,1)
Step 2 - Record Type Identification Function

Usually you can categorize the remaining records into a few types (in this case, seven):

1. Line with report date
2. Line with split number and split name
3. First line of the data column header
4. Second line of the data column header
5. Third line of the data column header
6. A data detail line (this is the type we're really interested in)
7. The totals line

While we could take the indicator column approach again to try to automatically classify each row, instead we will use a small user-defined VBA function called GetRecType() to do this. It's important to remember that you can write your own functions in VBA for Excel (and Access, and any other product, for that matter, that supports VBA). These functions can then be used just like any other Excel function. Like Access, the VBA code lives in the file but is only visible through the Visual Basic Editor (Tools | Macros | Visual Basic Editor). Here's the code:
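(The original code appears in acddata-finished.xls as a screen shot, so the version below is a plausible reconstruction from the report layout rather than the author's exact rules; in particular, the three column header lines are collapsed into one default case here.)

Function GetRecType(line As String) As Integer
    ' Classify a report line into one of the seven record types
    ' using a sequence of If-Then style rules.
    Dim s As String
    s = Trim(line)
    If InStr(s, "SPLIT:") > 0 Then
        GetRecType = 2                  ' split number and split name
    ElseIf IsDate(Right(s, 8)) Then
        GetRecType = 1                  ' line containing the report date
    ElseIf s Like "##:##-##:##[AP]M*" Then
        GetRecType = 6                  ' a data detail (time bin) line
    ElseIf s Like "TOTAL*" Then
        GetRecType = 7                  ' the totals line
    Else
        GetRecType = 3                  ' a column header line (types 3-5)
    End If
End Function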
Step 3 - Date and Split Formulas

Question 6: How does the formula in cell G2 work on sheet Step3? The formula is: =IF(F2=1,DATEVALUE(RIGHT(I2,8)),G1)

Question 7: How does the formula in cell H2 work on sheet Step3? The formula is: =IF(F2=1,VALUE(MID(I3,8,3)),IF(F2=2,VALUE(MID(I2,8,3)),H1))

Yep, these are a bit disgusting, but they are pretty typical. Notice the general pattern: if the current record carries the value (the report date line for G2, the split line for H2), parse the value out of the line; otherwise, copy the value down from the row above. (The H2 formula also peeks ahead one row, to I3, when it is sitting on the report date line.)
Step 4 - Data Filter

Put your cursor anywhere within a list of data with column headings (so Excel guesses to treat it as a database) and select Data | Filter | AutoFilter. Now you can control which data rows are displayed by making selections in one or more of the drop down boxes that Excel has added to the header row. If you make selections in multiple columns, Excel treats it as a logical AND condition and only rows meeting all filter conditions will be displayed. Data filtering is a powerful data exploration tool that we are using here to help with data cleansing. By selecting RecType 6, we filter out all of the unneeded rows in the data set, i.e. all rows which are not data detail lines. Check out the Advanced Filter feature to see how to create more complex filter conditions combining logical ANDs and ORs.
Step 5 Parse and Transform
Copy and paste the filtered data into a new worksheet. You can delete unneeded columns such as the indicator columns B-E. Then do a Data | Text to Columns on the Line field to get the data into separate columns. Now, insert two columns to the right of the Line column and use formulas to break up the time bin string into separate start and end times, as we did in Example 1 (see the formulas below).
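For instance, if the time bin strings look like 11:00-11:30, a pair of formulas along these lines will pull out the start and end times as true Excel time values (the reference C2 is just for illustration; point it at wherever your time bin column landed):

=TIMEVALUE(TRIM(LEFT(C2,FIND("-",C2)-1)))
=TIMEVALUE(TRIM(MID(C2,FIND("-",C2)+1,10)))

FIND() locates the hyphen, LEFT() and MID() grab the text on either side of it, TRIM() removes stray spaces, and TIMEVALUE() converts the resulting strings into actual times.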
Step 6 Field Headings
It is important that each column (or field, in database terminology) have a unique heading (field name) if one wants to begin exploring the data using Pivot Tables/Charts or a statistics package, or to move the data into a database. We did this back in Example 1 with the concatenation operator. You can copy and paste from that example or type in new field names if you wish; a sample formula follows.
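Since the report's column headers span three rows, one way to build a single unique field name is to concatenate the three fragments (here assumed, for illustration, to sit in B3, B4 and B5):

=TRIM(B3) & " " & TRIM(B4) & " " & TRIM(B5)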
After completing these last two steps, your data should look something like the tab labeled Step5-6 in acddatafinished.xls. Finally, we are ready to analyze the data.
Summary
This concludes our brief journey into the wonderful world of muddy data. You might be surprised to find out how much time is wasted in the corporate world doing tasks like this manually. With very little programming (which we could have avoided entirely if we wished), we were able to use built-in Excel tools to extract useful data embedded in electronic reports. If we had to routinely process this file, it would be relatively easy to automate the entire process within Excel with a little VBA (see the sketch below). While this type of work may not be glamorous, it is a fact of life in the world of information systems and business analysis. It is well worth your time to become more familiar with the huge number of functions available in Excel (and Access) for manipulating all types of data. If you need to do this kind of thing all the time on the same report, you might want to look into commercial products such as Content Extractor from Data Junction or Monarch from Datawatch.
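As a taste of what such automation might look like, here is a minimal sketch. It assumes the raw report lines sit in column A of the active sheet and that the GetRecType() function from Step 2 is available in the workbook; it simply deletes every row that is not a data detail line.

Sub KeepDetailLinesOnly()
    Dim lngRow As Long
    Dim lngLastRow As Long

    'Find the last used row in column A
    lngLastRow = ActiveSheet.Cells(ActiveSheet.Rows.Count, 1).End(xlUp).Row
    'Work from the bottom up so deletions don't shift unvisited rows
    For lngRow = lngLastRow To 1 Step -1
        If GetRecType(CStr(ActiveSheet.Cells(lngRow, 1).Value)) <> 6 Then
            ActiveSheet.Rows(lngRow).Delete
        End If
    Next lngRow
End Sub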
DEALING WITH MUDDY DATA USING EXCEL: SOLUTIONS
Question 1: How does the formula in cell B2 work? The formula is: =IF(ISERROR(FIND("DAILY",F2)),1,0)
The FIND() function will look for the string DAILY in cell F2. If it finds it, it will return the character position of the letter D in F2. If it does not find it, it will return a #VALUE! error. The ISERROR() function is used to trap this error. If DAILY is in F2, FIND() will not return an error, ISERROR() will return FALSE, and the IF() will return 0 (the else part of the if-then-else), indicating that this is a row we do not want to keep.
Question 2: How does the formula in cell D2 work? The formula is: =IF(LEN(TRIM(F2))=0,0,1)
We are checking to see if F2 is empty. The TRIM() function gets rid of any leading or trailing spaces in the
cell and the LEN() function returns the length of the trimmed text in F2. If LEN() returns 0, the IF() will return 0,
indicating that this row is blank and we do not want to keep it.
Step 2 Record Type Identification Function
Usually you can categorize the remaining records into a few types (in this case, seven):
1. Line with report date
2. Line with split number and split name
3. First line of the data column header
4. Second line of the data column header
5. Third line of the data column header
6. A data detail line (this is the type we're really interested in)
7. The totals line
While we could take the indicator column approach again to try to automatically classify each row, instead we will use a small user-defined VBA function called GetRecType() to do this. It's important to remember that you can write your own functions in VBA for Excel (and Access, and any other product for that matter that supports VBA). These functions can then be used just like any other Excel function. Like Access, the VBA code lives in the file but is only visible through the Visual Basic Editor (Tools|Macros|Visual Basic Editor). Here's the code:
User-defined functions are a nice way to introduce VBA programming. They can be used to encapsulate complex logic and avoid giant, deeply nested IF() formulas.
The input parameter, strLine, is a string containing the contents of the cell passed as the argument to the user-defined function. Questions 3-5 deal with the parts of the function used to check whether the string contains certain interesting information.
Question 3:
If IsDate(Trim(Left(strLine, 10))) Then
    intRecType = 1 'Date
    GetRecType = intRecType
    Exit Function
End If
If the leftmost 10 characters of the cell (after removal of leading and trailing spaces via the Trim() function) can be interpreted as a valid date, then the function returns a 1 to indicate that this is a record of type 1, the report date. The IsDate() function is a VBA function, not an Excel worksheet function.
Question 4:
If InStr(strLine, "SPLIT") Then
    intRecType = 2 'Split
    GetRecType = intRecType
    Exit Function
End If
The VBA function InStr() is used to search the input string for an occurrence of the word SPLIT. InStr() will return a 0 if it does not find it; any nonzero position is treated as True by the If statement. If the word is found, we return a 2 to indicate that this is a line with the split number and split name.
Question 5:
If IsNumeric(Left(Trim(strLine), 6)) Then
    intRecType = 7 'Totals line
    GetRecType = intRecType
    Exit Function
End If
If the leftmost 6 characters (after removal of leading and trailing spaces via the Trim() function) of the cell
can be interpreted as a valid number, then the function returns a 7 to indicate that this is a record containing the
totals for the report. The IsNumeric() function is a VBA function, not an Excel worksheet function. There is an
Excel worksheet function called ISNUMBER() which serves the same purpose.
Step 3 Date and Split Formulas
The next two questions deal with how we fill in repeating dates and split numbers for each line of each report.
Question 6: How does the formula in cell G2 work on sheet Step3? The formula is:
=IF(F2=1,DATEVALUE(RIGHT(I2,8)),G1)
The basic strategy is to determine whether a new report is starting on the current line. If it isn't, the date can be obtained from the row immediately above the current row. One way to determine if a new report is starting is to check if the RecType column has a value of 1, a report date line. This is the IF(F2=1 part of the formula. If this is true, we extract the date as a string using RIGHT(), and then convert it to a valid Excel date with the DATEVALUE() function. This is a useful function that can convert strings that look like dates into actual dates.
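For example, on a system using US date settings, =DATEVALUE("12/25/99") returns the date serial number for December 25, 1999, which Excel can then format and do date arithmetic on like any other date.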
Question 7: How does the formula in cell H2 work on sheet Step3? The formula is:
=IF(F2=1,VALUE(MID(I3,8,3)),
IF(F2=2,VALUE(MID(I2,8,3)),H1))
For the split numbers, we use a strategy similar to the one used for the dates. If it's a report date line (F2=1), we extract the split number from the next row. If it's a split name and number line (F2=2), we extract the split number from the current row. If F2>2, the split number can be obtained from the row immediately above the current row. The string function MID() pulls out the split number by starting at position 8 of the text in column I and taking the next 3 characters. The VALUE() function converts the resulting string into an actual number.
EXCEL PIVOT TABLE TUTORIAL: AN EXAMPLE OF AN OLAP-TYPE ANALYSIS
Objective
Introduce OLAP-type analysis via Excel's Pivot Table and Pivot Chart features. We will also use some more advanced Access and Excel functions which are useful for data analysis. Our perspective is that of a business analyst using data for decision support.
The Tutorial
Introduction
We will use the call center example database discussed in class. The call center handles technical support calls for MS Office products. We are interested in using concepts from multidimensional data modeling, OLAP (online analytical processing), and data analysis functions to do some quantitative business analysis. The file is called CallCenter-DataWarehouse.mdb and can be found in the Downloads section of the course web. Make a copy of the database for yourself. Since it's a multidimensional data warehouse, the data model is a star-schema. The main fact table, tblFACT_Scen01_CallLog, contains over 25,000 records from 10/3/99-12/25/99. Each record is a single phone call. See the table design for descriptions of each of the fields. The two measures included in the fact table are:
OnHoldMins: Time in minutes the customer spends on hold before talking to a tech support person.
ServiceTimeMins: Time in minutes spent on the phone by the tech support person with the customer.
There are four dimension tables in the star-schema. Look inside each table in both datasheet and design view in Access to learn more about them.
tblDIM_Customer: About the customers.
tblDIM_Application: About the MS Office applications supported.
tblDIM_Problem: About the nature of the problem.
tblDIM_Time: About each hour of each date in 10/3/99-12/25/99 (12 weeks).
Date/Time Analysis with Access Queries
Before getting into OLAP with Excel Pivot Tables and Pivot Charts, let's explore some of the capabilities in Access for analyzing date/time data. Both Access and Excel have a wide range of functions to facilitate date/time analysis, as this is a very common business analysis need. Some of the Access functions include Month(), Year(), Weekday(), and the very general DatePart() function. Using these and/or perhaps other date/time related functions in Access, create the following queries using ONLY the fact table tblFACT_Scen01_CallLog (i.e. you can't use tblDIM_Time). Save your query with the same name that appears in the title bar of each query result. If you get stuck on the SQL, a sketch follows the problem statements.
Problem 1a Number of phone calls arriving by year and month
Problem 1b Average time on hold by hour of day and day of week
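Here is roughly what the SQL behind Problem 1a might look like. The field name CallTimestamp is just a placeholder; substitute the actual date/time field from the fact table.

SELECT Year(CallTimestamp) AS Yr, Month(CallTimestamp) AS Mo, Count(*) AS NumCalls
FROM tblFACT_Scen01_CallLog
GROUP BY Year(CallTimestamp), Month(CallTimestamp);

Problem 1b works the same way, but grouping on Weekday(CallTimestamp) and Hour(CallTimestamp) and using Avg(OnHoldMins) as the measure.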
Looking through the results of the previous query, do you see some potential problem times and days during the week?
Problem 2 General Exploratory Analysis with SQL
Multidimensional models were created for exactly this type of problem: queries such as summing volume over different categories. For example, create the following query.
As you scroll through the table, notice that it can be difficult to see patterns. Also, the QBE window in Access is OK, but not great, for interactive, exploratory analysis.
Problem 3 Limitations of SQL
A common business analysis problem is to compare some measure, such as average time customers spent on hold, across different time periods. In Access, try to figure out how to create queries using just tblFACT_Scen01_CallLog that will eventually give the following query result, where the numbers are the average of the OnHoldMins field for the month in the column heading. This is not terribly easy (if you get stuck, move on) and is meant to show the limitations of SQL for relatively simple business questions.
OLAP Style Analysis with Excel Pivot Tables and Pivot Charts
Excel's Pivot Table and Pivot Chart tools are among the most sophisticated features in Excel. They provide a simple, yet powerful, method for interactively slicing, dicing and displaying data. Multidimensional data is particularly well suited to Pivot Table analysis. Pivot Tables can be used to analyze data that resides in Excel or in an external data source such as Access or some other database.
There is an Excel file called CallCenterPivot.xls (http://ite.pubs.informs.org/Vol3No3/Isken/CallCenterPivot.xls) which you can get from the Downloads section of the course web. Make a copy of it for yourself in your folder to work in. In order to facilitate the analysis, the data you'll analyze is in the sheet entitled Data. We'll use it for pivoting. All of the fields included in the sheet are defined in the file OLAP-DataDictionary.xls (http://ite.pubs.informs.org/Vol3No3/Isken/OLAP-DataDictionary.xls).
I've also created a query called qselDataCube in the call center database, CallCenter-DataWarehouse.mdb (http://ite.pubs.informs.org/Vol3No3/Isken/CallCenter-DataWarehouse.mdb), that draws fields from the fact table as well as the four dimension tables. We'll use this query to see how Excel can pivot on data in Access.
Start Excel and Read About Pivot Tables
No sense repeating the stuff in the help files. Click on the Assistant, type "pivot table" and select "About PivotTable reports: interactive data analysis" from the help menu when the Assistant answers. Read it.
First, we'll look at basic Pivot Table functionality using the data within CallCenterPivot.xls (http://ite.pubs.informs.org/Vol3No3/Isken/CallCenterPivot.xls).
Creating a Pivot Table
Select Pivot Table and Pivot Chart from the Data menu and you'll get the first screen of the Wizard.
Select the Data to use for Pivot Table
Select the Location to put the Pivot Table
The Pivot Table Interface
Now you'll see:
You build your pivot table by dragging and dropping fields from the Pivot Table toolbar into the different sections of the pivot table: rows, columns, pages, and data areas. Let's revisit the query we did in Problem 2. Drag Customer into the Page area, AppName into the Row area and ProbName into the Column area. In order to count the number
of calls, we can drag any non-null field into the Data area. Drag the CallID field into the Data area. Excel will
create the pivot table and will guess that you want to sum the CallID field. The first result will look like this:
There are many ways to modify the data field to change the summation to a count. On the Pivot Table toolbar there
is a Wizard button. Push it (you can also right click anywhere in the pivot table and select Wizard from the menu).
When you get into the Wizard, push the Layout button. You'll see:
Obviously you can drag and drop fields in this screen as well. Notice that Excel has summed the CallID field. Double click that button and you'll get:
You can get this same dialog box if you select the field in your pivot table, right click, and select Field Settings
from the menu.
Here's the revised pivot table:
The drop down arrow on the Page field controls which data is displayed. Your choices are any one value or all. The
drop down arrows on the row and column fields let you selectively exclude certain values. Try them to create the
following:
INFORMS Transaction on Education 3:3 (23-75)
56
c INFORMS ISSN: 1532-0545
ISKEN
Data Cleansing and Analysis as a Prelude to Model Based Decision Support
There's no law against having multiple fields in any section, so here's another variation on this layout. Try it.
Subtotals for each field can be controlled by selecting the field, right clicking, and selecting Field Settings. You'll see:
There are also a whole host of options for the entire pivot table. Activate any cell in the pivot table, right click, and select Table Options. You'll see:
See online help for descriptions of each of these options. Some are a bit complex.
Formatting the Pivot Table
You can pretty up your pivot table with various built-in Report Formats. Yep, right click.
Refreshing the Pivot Table
If you make changes to the underlying data of a pivot table, you need to refresh the pivot table. As always
there are several ways to do this. See the Pivot Table toolbar or right click while anywhere in the Pivot Table.
Normal and Special Display of Values
So far we have just displayed data in its normal form. Pivot tables also let you show data in many interesting ways. For example, we might be interested in the percentage of total calls placed by each customer. First, modify the pivot table so just total calls for each customer is displayed. Then activate a cell in the data area, right click, and select Field Settings. Then push the Options>> button and click the drop down arrow. Notice that I've selected % of Column.
The result is:
Pivot Charts
It is easy to create a Pivot Chart from a pivot table. Just select the pivot table, right click, and select Pivot Chart (or click the Pivot Chart button on the Pivot Table toolbar). I changed our Pivot Table to display the data normally again (as opposed to % of column) and then created a Pivot Chart. Here's what you get:
Pivot charts are live charts in the sense that you can manipulate them just like pivot tables. You can drag new fields
in, reorient the axes, change the formatting, etc. For example, create the following pivot chart:
Pivoting Using An External Database
Pivot Tables aren't just for Excel-based data. For example, let's see how to connect to an Access database. Start a new pivot table in an Excel worksheet. In the first screen of the Wizard, select External data source. Push the Get Data button. Select Access.
Browse to the call center database we've been using, CallCenter-DataWarehouse.mdb (http://ite.pubs.informs.org/Vol3No3/Isken/CallCenter-DataWarehouse.mdb). You'll then see a list of all the tables and queries in that database. Select the query qselDataCube and push the > button. You'll see:
Click Next and then Next two more times. On the last screen of the Wizard, you have a few options regarding what to do with the data. By default, the first option, Return Data to Microsoft Excel, is selected. Leave it and just click Finish. You'll be returned to the Pivot Table Wizard where you left off. Now just keep going as you would with any other pivot table.
Drilling Down to Underlying Data
Double clicking on a data cell in a pivot table will create a new worksheet with the underlying data detail
records that correspond to that cell. Try it.
We have only briefly touched on the basics of pivot tables and pivot charts. There are many more advanced features which you can explore as you find the need. For example, there are ways to create calculated custom fields and data items within pivot tables. Also, in the last screen of the Wizard for getting external data that we just used, there was an option to create an OLAP Cube from the data. This is a slightly more sophisticated way of structuring the data in which hierarchical relationships such as time periods (minutes, hours, days, weeks, months, quarters, years) are modeled explicitly and which facilitates automatic roll-ups of data. Read "About creating an OLAP cube from relational data" in the online help if you're interested.
Your Job
Now that you're a bit familiar with pivot tables, let's use them to do some exploratory analysis. You are the manager of this call center. You've heard some grumbling about long waits on hold from some customers. Since you don't want to overreact to anecdotal evidence, you decide to explore the data warehouse a bit to find out more, since OnHoldMins is a field in the database. Also, the actual staffing level of the call center is summarized by day of week and hour of day in tblStaffLevelSummary. That table includes a field called WeeklyHourBin which ranges from 1-168 and represents each hour of the week starting on Sunday at midnight. Some basic questions you might want to address are:
1. Does overall average time on hold differ by time of day, by day of week, by both?
2. Do different customer types experience different hold times on average? What about by time of day and day of week?
As the manager, your primary way of controlling caller time on hold is through staffing and scheduling. You'd like to see a summary showing average hold time and average staff utilization by day of week and time of day. Seems like a pretty straight-forward thing for a manager of a call center to want. It turns out to be a bit of a bear to create this using SQL in Access. Here's part of the final summary query result (see qsumUtilization_Step4_FinalSummary in the call center database). If you want to see how this was done, see the queries qsumUtilization_Step1, 2, and 3.
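By the way, relating the call log to tblStaffLevelSummary requires computing the same hour-of-week index on both sides of the join. A calculated field like the following reproduces the 1-168 WeeklyHourBin definition in an Access query (CallTimestamp is a placeholder for the actual date/time field in the fact table; Weekday() returns 1 for Sunday by default):

WeeklyHourBin: (Weekday([CallTimestamp]) - 1) * 24 + Hour([CallTimestamp]) + 1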
Your job is to get the results of this query from the call center database into Excel somehow (obviously, without retyping it) and create a graph or two (using either Pivot Charts or regular old Charts) which provide some insight into the relationship between staff utilization, time on hold, time of day, day of week, etc. Use your imagination. Here are two things I did.
It would be nice to see a line showing staff utilization on a second y-axis on the previous graph as well. See if you can figure out how to do that.
Here's a scatter graph (not a Pivot Chart) relating average time on hold to staff utilization for each of the 168 hours of the week. Hmm, seems to be some sort of pattern. What do you think?
From Data Analysis to Modeling
In this module, you've analyzed some call center data using pivot tables and pivot charts. You've taken the perspective of the manager of the call center. As the manager, certain questions would likely arise based on the analysis.
1. It seems like there are some times when staff utilization is really low. If we reduce staffing, how will the average time on hold be affected?
2. There are times when average time on hold and staff utilization are both pretty high. I bet we could reduce average time on hold by adding staff (thus reducing staff utilization). How much staff would I need to reduce average time on hold to 30 seconds? 15 seconds?
3. Average time on hold isn't the only important performance measure. We're also interested in the percentage of customers who have to hold and the percentage of customers who hold less than 1 minute. How could we calculate these measures from the current data and how could we figure out how they will be affected by changes in staff levels?
4. We currently just have our tech support staff working 8-hour shifts. The midnight shift is from 12a-8a, the day shift is from 8a-4p, and the afternoon shift is from 4p-12a. If we were better able to estimate our staffing needs by hour of day and day of week, how would we figure out how to schedule our staff (e.g. shift start times and shift lengths) to meet these staffing needs?
These are not easy questions, but they are the type routinely faced by managers of a myriad of customer service related systems including software tech support, retail sales (L.L. Bean, Lands' End), healthcare, police and emergency services, fast food, insurance companies, and product support. While historical data is useful for analyzing current system performance, we need something else to handle "What if?" or "What's best?" type questions. We need models. Try to think a bit about what makes questions 1-4 difficult and we'll discuss modeling with spreadsheets next time.
EXCEL PIVOT TABLE TUTORIAL: AN EXAMPLE OF AN OLAP-TYPE ANALYSIS
Objective
Introduce OLAP-type analysis via Excel's Pivot Table and Pivot Chart features. We will also use some more advanced Access and Excel functions which are useful for data analysis. Our perspective is that of a business analyst using data for decision support.
The Tutorial
Introduction
We will use the call center example discussed in class. The call center handles technical support calls for MS Office products. We are interested in using concepts from multidimensional data modeling, OLAP (online analytical processing), and data analysis functions to do some quantitative business analysis. The data consists of over 25,000 records from 10/3/99-12/25/99. Each record is a single phone call. The two measured facts are:
OnHoldMins: Time in minutes the customer spends on hold before talking to a tech support person.
ServiceTimeMins: Time in minutes spent on the phone by the tech support person with the customer.
OLAP Style Analysis with Excel Pivot Tables and Pivot Charts
Excel's Pivot Table and Pivot Chart tools are among the most sophisticated features in Excel. They provide a simple, yet powerful, method for interactively slicing, dicing and displaying data. Multidimensional data is particularly well suited to Pivot Table analysis. Pivot Tables can be used to analyze data that resides in Excel or in an external data source such as Access or some other database.
There is an Excel file called CallCenterPivot.xls (http://ite.pubs.informs.org/Vol3No3/Isken/CallCenterPivot.xls) which you can get from the Downloads section of the course web. Make a copy of it for yourself in your folder to work in. In order to facilitate the analysis, the data you'll analyze is in the sheet entitled Data. We'll use it for pivoting. All of the fields included in the sheet are defined in the file OLAP-DataDictionary.xls (http://ite.pubs.informs.org/Vol3No3/Isken/OLAP-DataDictionary.xls).
Start Excel and Read About Pivot Tables
No sense repeating the stuff in the help files. Click on the Assistant, type "pivot table" and select "About PivotTable reports: interactive data analysis" from the help menu when the Assistant answers. Read it.
First, we'll look at basic Pivot Table functionality using the data within CallCenterPivot.xls (http://ite.pubs.informs.org/Vol3No3/Isken/CallCenterPivot.xls).
Creating a Pivot Table
Select Pivot Table and Pivot Chart from the Data menu and you'll get the first screen of the Wizard.
Select the Data to use for Pivot Table
Select the Location to put the Pivot Table
The Pivot Table Interface
Now you'll see:
You build your pivot table by dragging and dropping fields from the Pivot Table toolbar into the different sections of the pivot table: rows, columns, pages, and data areas. Let's revisit the query we did in Problem 2. Drag Customer into the Page area, AppName into the Row area and ProbName into the Column area. In order to count the number
of calls, we can drag any non-null field into the Data area. Drag the CallID field into the Data area. Excel will
create the pivot table and will guess that you want to sum the CallID field. The first result will look like this:
There are many ways to modify the data field to change the summation to a count. On the Pivot Table toolbar there
is a Wizard button. Push it (you can also right click anywhere in the pivot table and select Wizard from the menu).
When you get into the Wizard, push the Layout button. You'll see:
Obviously you can drag and drop fields in this screen as well. Notice that Excel has summed the CallID field. Double click that button and you'll get:
You can get this same dialog box if you select the field in your pivot table, right click, and select Field Settings
from the menu.
Here's the revised pivot table:
The drop down arrow on the Page field controls which data is displayed. Your choices are any one value or all. The
drop down arrows on the row and column fields let you selectively exclude certain values. Try them to create the
following:
There's no law against having multiple fields in any section, so here's another variation on this layout. Try it.
Subtotals for each field can be controlled by selecting the field, right clicking, and selecting Field Settings. You'll see:
There are also a whole host of options for the entire pivot table. Activate any cell in the pivot table, right click, and select Table Options. You'll see:
See online help for descriptions of each of these options. Some are a bit complex.
Formatting the Pivot Table
You can pretty up your pivot table with various built-in Report Formats. Yep, right click.
Refreshing the Pivot Table
If you make changes to the underlying data of a pivot table, you need to refresh the pivot table. As always
there are several ways to do this. See the Pivot Table toolbar or right click while anywhere in the Pivot Table.
Normal and Special Display of Values
So far we have just displayed data in its normal form. Pivot tables also let you show data in many interesting ways. For example, we might be interested in the percentage of total calls placed by each customer. First, modify the pivot table so just total calls for each customer is displayed. Then activate a cell in the data area, right click, and select Field Settings. Then push the Options>> button and click the drop down arrow. Notice that I've selected % of Column.
The result is:
Pivot Charts
It is easy to create a Pivot Chart from a pivot table. Just select the pivot table, right click, and select Pivot Chart (or click the Pivot Chart button on the Pivot Table toolbar). I changed our Pivot Table to display the data normally again (as opposed to % of column) and then created a Pivot Chart. Here's what you get:
Pivot charts are live charts in the sense that you can manipulate them just like pivot tables. You can drag new fields
in, reorient the axes, change the formatting, etc. For example, create the following pivot chart:
Drilling Down to Underlying Data
Double clicking on a data cell in a pivot table will create a new worksheet with the underlying data detail
records that correspond to that cell. Try it.
We have only briefly touched on the basics of pivot tables and pivot charts. There are many more advanced features which you can explore as you find the need. For example, there are ways to create calculated custom fields and data items within pivot tables. Also, in the last screen of the Wizard for getting external data that we just used, there was an option to create an OLAP Cube from the data. This is a slightly more sophisticated way of structuring the data in which hierarchical relationships such as time periods (minutes, hours, days, weeks, months, quarters, years) are modeled explicitly and which facilitates automatic roll-ups of data. Read "About creating an OLAP cube from relational data" in the online help if you're interested.
Your Job
Now that you're a bit familiar with pivot tables, let's use them to do some exploratory analysis. You are the manager of this call center. You've heard some grumbling about long waits on hold from some customers. Since you don't want to overreact to anecdotal evidence, you decide to explore the data warehouse a bit to find out more.
As the manager, your primary way of controlling caller time on hold is through staffing and scheduling. You'd like to see a summary showing average hold time and average staff utilization by day of week and time of day. The file WeeklySummary.xls (http://ite.pubs.informs.org/Vol3No3/Isken/WeeklySummary.xls) was created from the data warehouse and contains 168 rows, one for each hour of the week. Columns I and J contain fields called StaffUtil and AvgOnHoldMins, which give the overall percentage of time the staff is utilized fielding calls and the average customer time on hold, respectively. Some basic questions you might want to address are:
1. Does overall average time on hold differ by time of day, by day of week, by both?
2. Do different customer types experience different hold times on average? What about by time of day and day of
week?
Your job is to create a graph or two (using either Pivot Charts or regular old Charts) which provide some insight into the relationship between staff utilization, time on hold, time of day, day of week, etc. Use your imagination. Here are two things I did.
It would be nice to see a line showing staff utilization on a second y-axis on the previous graph as well. See if you can figure out how to do that.
Here's a scatter graph (not a Pivot Chart) relating average time on hold to staff utilization for each of the 168 hours of the week. Hmm, seems to be some sort of pattern. What do you think?
From Data Analysis to Modeling
In this module, you've analyzed some call center data using pivot tables and pivot charts. You've taken the perspective of the manager of the call center. As the manager, certain questions would likely arise based on the analysis.
1. It seems like there are some times when staff utilization is really low. If we reduce staffing, how will the average time on hold be affected?
2. There are times when average time on hold and staff utilization are both pretty high. I bet we could reduce average time on hold by adding staff (thus reducing staff utilization). How much staff would I need to reduce average time on hold to 30 seconds? 15 seconds?
3. Average time on hold isn't the only important performance measure. We're also interested in the percentage of customers who have to hold and the percentage of customers who hold less than 1 minute. How could
we calculate these measures from the current data and how could we figure out how they will be affected by changes in staff levels?
4. We currently just have our tech support staff working 8-hour shifts. The midnight shift is from 12a-8a, the day shift is from 8a-4p, and the afternoon shift is from 4p-12a. If we were better able to estimate our staffing needs by hour of day and day of week, how would we figure out how to schedule our staff (e.g. shift start times and shift lengths) to meet these staffing needs?
These are not easy questions, but they are the type routinely faced by managers of a myriad of customer service related systems including software tech support, retail sales (L.L. Bean, Lands' End), healthcare, police and emergency services, fast food, insurance companies, and product support. While historical data is useful for analyzing current system performance, we need something else to handle "What if?" or "What's best?" type questions. We need models. Try to think a bit about what makes questions 1-4 difficult and we'll discuss modeling with spreadsheets next time.