Going the Distance - Western Users of SAS Software

Going the Distance: Google Maps Capabilities in a Friendly SAS Environment
Anton Bekkerman, Ph.D., Montana State University, Bozeman, MT
ABSTRACT
While the GEODIST function allows users to calculate “as the crow flies,” straightline distances, SAS does not directly
provide capabilities to calculate road distances between locations. For many users who might only have addresses
(rather than geographic coordinates) and who need to determine actual road distances for optimizing routes, minimizing
transportation costs, or simply translating postal addresses to geographic coordinates, existing SAS functionality may
be insufficient. I demonstrate how Google Maps can be integrated with SAS to perform these functions and output the
desired results within the SAS environment. That is, after a SAS user specifies a location or multiple locations (as postal
addresses, city names, state names, etc.), the information is passed to Google Maps from within SAS, the underlying
Google Maps HTML code with the coordinates and/or directions is retrieved and parsed, and the desired results are
recorded to a SAS dataset. The entire process is completed using only a few lines of code within a single DATA step.
Moreover, I demonstrate how the process can be easily automated for numerous location entries within
a MACRO environment. A comparison of the native SAS straightline and integrated road distance methods indicates
that, on average, the straightline method underestimates the true road distance by approximately 25%, and this error
becomes larger as the distance between spatially separated locations increases.
INTRODUCTION
When you are asked to get from location A to location B, what is your first reaction? Perhaps it is to pull out your smart
phone and use one of the myriad driving directions apps. Or maybe it is to access a web-based option, such as Google
Maps, MapQuest, Bing, among others. Or, perhaps you may even be tempted to pull out the circa 1997 road atlas,
which has had its corners chewed off by your dog (or kid), proudly displays travel mug coffee stains, and has been
accumulating dust in your car’s trunk and waiting for the “just-in-case” scenario when there is neither a wifi nor cellular
phone signal.1
Regardless of your preferred method, rarely do you consider calculating distances using the “as the crow flies” method—
a straightline connection between two spatially-separated points, which accounts for the Earth’s curvature but ignores
the constraints associated with traveling on roads. Such constraints are manifest in routes being indirect connections
between starting and ending locations due to factors such as geological characteristics (e.g., unbridged bodies of
water), construction or road repair projects, or simply no available routes that mimic the “as the crow flies” path.
Moreover, it is reasonable to assume that most travel occurs using ground transportation, rather than other methods
that may be more characteristic of a straightline distance.2
While the SAS software has continued to update and expand its spatial analysis capabilities, tools for easily determining
and automating road distances between locations are not directly available. Moreover, the constantly changing road
conditions and accessibility to driving routes require a dynamic method for recognizing these changes and providing
the most current spatial analysis results. This paper presents a relatively straightforward method for determining
road distances by integrating the Google Maps directions tool, which has developed a mechanism for optimizing
transportation routes within much of North America and the world. A preliminary example demonstrates the underlying
process for calling the Google Maps directions tool directly from SAS and extracting relevant distance information into a
SAS dataset. The technique is then generalized to determine distances for any number of starting and ending location
combinations.
The presented methodology is then compared to the native distance calculation tool in SAS—the GEODIST function—
which calculates the straightline distance between two spatially-separated points. The comparison analysis shows that
the GEODIST function underestimates road distances by approximately 25%. Such errors can have non-trivial impacts
on studies that rely on the precise understanding of distances and travel routes for estimating costs and revenues,
optimizing logistics, and improving marketing efforts, among other activities.
1 Yes. From a recent personal experience, I can attest that such places still exist.
2 One could argue that travel by rail or air follows straightline routes. However, railroads are often subject to similar constraints as roads and air travel is
subject to layovers in locations that prevent direct routes.
NATIVE SAS DISTANCE CALCULATION TOOLS
The GEODIST function is used to calculate distances between two geographic coordinates using the Haversine
formula (SAS Institute, Inc. 2011). The formula determines the shortest, straightline distance between two coordinates,
accounting for the approximate curvature of the Earth. The function requires four arguments—the latitude and longitude
of the starting location and the latitude and longitude of the destination. While manually obtaining these coordinates
from postal addresses or location names is not overly costly when dealing with only a few locations, increasing the
number of observations can become expensive or even impractical.3 Using a known set of coordinates, the GEODIST
function can be called within the DATA step as follows:
distance = geodist(latitudeStart,longitudeStart,latitudeEnd,longitudeEnd,'M');
where latitudeStart and longitudeStart represent columns of the latitude and longitude coordinates of the
starting locations, latitudeEnd and longitudeEnd represent columns of the latitude and longitude coordinates of
the destinations, and distance is the column containing the resulting straightline distances. The option ‘M’ requests
that the distance is output in miles rather than kilometers, which are the default units.
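As a minimal sketch of this call, the DATA step below applies GEODIST to the Bozeman, MT and Las Vegas, NV coordinates listed in the notes to Figure 1. For comparison it also evaluates the Haversine formula by hand. The dataset name crow, the variable names, and the spherical Earth radius of 3958.8 miles are assumptions made for illustration, so the two values may differ slightly.

```sas
data crow;
  /* Coordinates taken from the notes to Figure 1 */
  latStart = 45.682677;  lonStart = -111.053288;   /* Bozeman, MT   */
  latEnd   = 36.116799;  lonEnd   = -115.174534;   /* Las Vegas, NV */

  /* Built-in function; 'M' requests miles */
  dist_geodist = geodist(latStart, lonStart, latEnd, lonEnd, 'M');

  /* Hand-coded Haversine formula; the radius (miles) is an assumed value */
  pi   = constant('pi');
  dLat = (latEnd - latStart) * pi / 180;
  dLon = (lonEnd - lonStart) * pi / 180;
  a    = sin(dLat/2)**2
         + cos(latStart*pi/180) * cos(latEnd*pi/180) * sin(dLon/2)**2;
  dist_haversine = 2 * 3958.8 * arsin(sqrt(a));

  keep dist_:;   /* keep only the two distance columns */
run;

proc print data=crow noobs; run;
```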
The straightline route is rarely the same as the driving route between the two locations. Moreover, it is expected that
the difference between the two alternatives will be more substantial as the distance between two locations increases.
Figure 1 provides a visual comparison of the straightline distance and one that is based on drivable routes between
Bozeman, MT and Las Vegas, NV. The figure makes evident the constraints that bind road travel but not necessarily
straightline approximations.
INTEGRATING GOOGLE MAPS
As shown in Figure 1, the Google Maps directions tool can be used to obtain a more precise estimate of driving
distances. This is the underlying mechanism for generating driving distance data within SAS. The following SAS code
demonstrates a basic framework for performing the SAS—Google Maps integration.
%let addr1 = Bozeman,MT;
%let addr2 = Las+Vegas,NV;
filename google url "http://maps.google.com/maps?daddr=&addr2.%nrstr(&saddr)=&addr1";
data dist(drop=html);
  infile google recfm=f lrecl=10000;
  input @ '<div class="altroute-rcol altroute-info"> <span>' @;
  input html $50.;
  if _n_ = 1;
  locStart = "&addr1";
  locEnd = "&addr2";
  roaddistance = input(scan(html,1," "),comma12.);
run;
proc print data=dist noobs; run;
The MACRO variables addr1 and addr2 specify the starting and ending locations, respectively, and are the only
user-input variables. The FILENAME statement assigns the fileref google to a URL that requests that Google Maps
generate driving directions between the two specified locations. The HTML code underlying the route displayed in
Google Maps is then read into SAS and parsed within the DATA step. The third line of the DATA step directs SAS to
begin reading the HTML code immediately after the string <div class="altroute-rcol altroute-info"> <span>. That is,
the DATA step skips all text that precedes the location where the road distance value is reported. Lastly, the SCAN
function extracts the first blank-delimited token of the retrieved text, and the INPUT function converts it to a numeric
road distance value in the SAS dataset. Table 1 shows the contents of the resulting dist dataset.
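To see the parsing step in isolation, the sketch below applies the same SCAN and INPUT calls to hard-coded strings rather than to a live page. The two strings are hypothetical stand-ins for the text that follows the <span> tag, and the dataset name parse_demo is likewise an assumption.

```sas
data parse_demo;
  length html $50;
  /* Hypothetical fragments of the text that follows the <span> tag */
  do html = "832 mi", "1,245 mi";
    /* First blank-delimited token, read with COMMA12. so that thousands
       separators are stripped before conversion to a number */
    roaddistance = input(scan(html, 1, " "), comma12.);
    output;
  end;
run;

proc print data=parse_demo noobs; run;
```

The COMMA12. informat matters only for distances of 1,000 miles or more; for shorter routes a plain numeric informat would give the same result.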
3 The Appendix presents SAS code that helps automate the process for obtaining geographic coordinates for postal addresses and location names.
Users can also use the GEOCODE procedure, but a detailed discussion of this procedure is out of the scope of this paper.
Figure 1: Comparison of Straightline and Driving Routes Between Bozeman, MT and Las Vegas, NV
Source: The map was generated using Google Maps.
Notes: The starting location is Bozeman, MT (45.682677,-111.053288) and the ending location is Las Vegas, NV (36.116799,-115.174534).
Table 1: Contents of the dist Dataset: Road Distance Information
locStart      locEnd          roaddistance
Bozeman,MT    Las+Vegas,NV    832
Of course, the advantages of using this approach are minimal when determining distances for one or a few location
pairs—users can go directly to Google Maps and obtain the same outputs. Substantial improvements in efficiency (and
cost-savings) begin to be realized when road distances need to be recorded for a large number of location pairs. For
example, consider a courier service that has three warehouses from where packages could be delivered to customers.
The courier service may be interested in understanding how to efficiently allocate delivery packages to the warehouses
such that the final delivery distances are minimized. This requires that the courier service determines the driving
distances from each of the three warehouses to the final destinations. The following data represent randomly generated
courier service warehouse sites and customer locations in the Bozeman, MT area.
data courier;
  input warehouse_address & $19. warehouse_city $ & warehouse_state $
        customer_address & $19. customer_city $ & customer_state $ ;
  datalines;
8250 Huffine Lane  Bozeman  MT  2884 Caterpillar Dr.  Bozeman  MT
8250 Huffine Lane  Bozeman  MT  408 S 12th Ave.  Bozeman  MT
8250 Huffine Lane  Bozeman  MT  30 Main Street  Belgrade  MT
6553 N 19th Ave  Bozeman  MT  2884 Caterpillar Dr.  Bozeman  MT
6553 N 19th Ave  Bozeman  MT  408 S 12th Ave.  Bozeman  MT
6553 N 19th Ave  Bozeman  MT  30 Main Street  Belgrade  MT
1340 Kagy Blvd  Bozeman  MT  2884 Caterpillar Dr.  Bozeman  MT
1340 Kagy Blvd  Bozeman  MT  408 S 12th Ave.  Bozeman  MT
1340 Kagy Blvd  Bozeman  MT  30 Main Street  Belgrade  MT
...
;
run;
The following MACRO uses the location pair information in the courier dataset and creates an output dataset
containing the driving distance for each pair.
/**********************************************************************/
/* Purpose: Determine road distances for location pairs               */
/* Author:  Anton Bekkerman                                           */
/*                                                                    */
/* User inputs:                                                       */
/*   input     = name of SAS input dataset                            */
/*               (e.g., libname.inputName)                            */
/*   output    = name of SAS output dataset                           */
/*               (if empty, then libname.inputName_dist)              */
/*   startAddr = variable name of starting location address           */
/*               (variable content example: 555 StreetName Dr.)       */
/*   startCity = variable name of starting location city              */
/*               (variable content example: Bozeman)                  */
/*   startSt   = variable name of starting location state             */
/*               (variable content example: MT)                       */
/*   endAddr   = variable name of destination address                 */
/*   endCity   = variable name of destination city                    */
/*   endSt     = variable name of destination state                   */
/**********************************************************************/
%macro road(input,output,startAddr,startCity,startSt,endAddr,endCity,endSt);
  /* Check if input data set exists; otherwise, print a message and abort */
  %if %sysfunc(exist(&input))^=1 %then %do;
    data _null_;
      file print;
      put #3 @10 "Data set &input. does not exist";
    run;
    %abort;
  %end;
  /* Check if user specified output dataset name; otherwise, create default */
  %if %length(&output) > 0 %then %let outData = &output;
  %else %let outData = &input._dist;
  /* Replace all inter-word spaces with plus signs */
  data tmp; set &input;
    addr1 = tranwrd(left(trim(&startAddr))," ","+")||","||
            tranwrd(left(trim(&startCity))," ","+")||","||
            left(trim(&startSt));
    addr2 = tranwrd(left(trim(&endAddr))," ","+")||","||
            tranwrd(left(trim(&endCity))," ","+")||","||
            left(trim(&endSt));
    n = _n_;
  run;
  /* Store the number of observations in a macro variable */
  data _NULL_;
    if 0 then set tmp nobs=n;
    call symputx("nObs",n); stop;
  run;
  %do i=1 %to &nObs;
    /* Place starting and ending locations into macro variables */
    data _null_; set tmp(where=(n=&i));
      call symput("addr1",trim(left(addr1)));
      call symput("addr2",trim(left(addr2)));
    run;
    /* Determine road distance */
    options noquotelenmax;
    filename google url "http://maps.google.com/maps?daddr=&addr2.%nrstr(&saddr)=&addr1";
    data dist(drop=html);
      infile google recfm=f lrecl=10000;
      input @ '<div class="altroute-rcol altroute-info"> <span>' @;
      input html $50.;
      if _n_ = 1;
      roaddistance = input(scan(html,1," "),comma12.);
    run;
    data dist; merge tmp(where=(n=&i)) dist; run;
    /* Append to output dataset */
    %if &i=1 %then %do;
      data &outData; set dist(drop=n addr:); run;
    %end;
    %else %do;
      proc append base=&outData data=dist(drop=n addr:) force; run;
    %end;
  %end;
  /* Delete the temporary dataset */
  proc datasets library=work noprint;
    delete tmp;
  quit;
%mend;
The MACRO road is used to evaluate road distances for the destinations contained in the courier dataset. Table 2
presents an abbreviated representation of the resulting output data, courier_dist. These data can now be used to
evaluate the optimal courier warehouse location (conditional on distance to final destination) to minimize the total costs
for delivering packages to their final destinations.
Table 2: Contents of the courier_dist Dataset: Road Distance Information for Multiple Location Pairs
Warehouse                            Destination                             Road Distance
Address            City     State    Address              City      State    (miles)
8250 Huffine Lane  Bozeman  MT       2884 Caterpillar Dr  Bozeman   MT       6.3
8250 Huffine Lane  Bozeman  MT       408 S 12th Ave.      Bozeman   MT       6.4
8250 Huffine Lane  Bozeman  MT       30 Main Street       Belgrade  MT       8.1
6553 N 19th Ave    Bozeman  MT       2884 Caterpillar Dr  Bozeman   MT       6.2
6553 N 19th Ave    Bozeman  MT       408 S 12th Ave.      Bozeman   MT       4.8
6553 N 19th Ave    Bozeman  MT       30 Main Street       Belgrade  MT       14.1
1340 Kagy Blvd     Bozeman  MT       2884 Caterpillar Dr  Bozeman   MT       3.2
1340 Kagy Blvd     Bozeman  MT       408 S 12th Ave.      Bozeman   MT       1.3
1340 Kagy Blvd     Bozeman  MT       30 Main Street       Belgrade  MT       11.1
...
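The following sketch illustrates one way the distances in Table 2 could be used to assign each destination to its closest warehouse. The macro call shown in the comment is an assumed invocation based on the parameter list of road. The dataset names table2, ranked, and nearest are hypothetical, and the nine rows are copied from Table 2 so that the steps run without contacting Google Maps. For brevity, the destination address alone identifies a customer here.

```sas
/* Assumed call that would create work.courier_dist from the courier data:
   %road(work.courier, , warehouse_address, warehouse_city, warehouse_state,
         customer_address, customer_city, customer_state);                  */

/* Rows copied from Table 2 (addresses and road distances only) */
data table2;
  infile datalines dsd;                    /* comma-delimited values */
  length warehouse_address customer_address $19;
  input warehouse_address $ customer_address $ roaddistance;
  datalines;
8250 Huffine Lane,2884 Caterpillar Dr,6.3
8250 Huffine Lane,408 S 12th Ave.,6.4
8250 Huffine Lane,30 Main Street,8.1
6553 N 19th Ave,2884 Caterpillar Dr,6.2
6553 N 19th Ave,408 S 12th Ave.,4.8
6553 N 19th Ave,30 Main Street,14.1
1340 Kagy Blvd,2884 Caterpillar Dr,3.2
1340 Kagy Blvd,408 S 12th Ave.,1.3
1340 Kagy Blvd,30 Main Street,11.1
;
run;

/* Order each destination's candidate warehouses from nearest to farthest */
proc sort data=table2 out=ranked;
  by customer_address roaddistance;
run;

/* Keep the first (shortest-distance) row for each destination */
data nearest;
  set ranked;
  by customer_address;
  if first.customer_address;
run;

proc print data=nearest noobs; run;
```

With the Table 2 values, the two Bozeman destinations are assigned to the 1340 Kagy Blvd warehouse (3.2 and 1.3 miles) and the Belgrade destination to the 8250 Huffine Lane warehouse (8.1 miles).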
AN EMPIRICAL COMPARISON OF METHODS
As noted above and shown in Figure 1, there is likely a discrepancy between the straightline and driving directions
methods for calculating distances. However, if the discrepancy is only trivial, then the additional effort of the
integrated Google Maps approach may not be worthwhile.
To evaluate whether the dissimilarities are statistically significant and to quantify the potential error, I use the road
MACRO and the GEODIST function to determine distances for a large number of location pairs. As an example,
the comparison uses the travel distance from each four- and two-year university in California,
Colorado, Montana, Nevada (excluding those in Las Vegas), Oregon, Washington, and Wyoming to Las Vegas, NV,
the 2013 location of the Western Users of SAS Software annual conference. The resulting dataset contains a total of
272 location pairs.
Figure 2 shows a comparison of these distances across all location pairs, across pairs that are separated by less than
or equal to 500 miles, and across locations that are separated by a distance greater than 500 miles. In each case, the
straightline approximation underestimates the road distance. More importantly, this difference is statistically significant
across all scenarios. This suggests that using straightline distances as approximations to road distances could lead to
inaccurate inferences. In the sample used for this example, the average error is approximately 25%—that is, the road
distance is underestimated by approximately 25% when using the straightline approach.
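The percent underestimation for a single pair can be computed as sketched below for the Bozeman, MT to Las Vegas, NV example, using the road distance from Table 1 and the coordinates from the notes to Figure 1. The dataset and variable names are assumptions, and this is an illustration of the calculation rather than the code behind Figures 2 and 3.

```sas
data one_pair;
  /* Straightline miles from the coordinates in the notes to Figure 1 */
  straight = geodist(45.682677, -111.053288, 36.116799, -115.174534, 'M');

  /* Road miles reported in Table 1 */
  road = 832;

  /* Share of the road distance that the straightline method misses */
  pct_under = 100 * (road - straight) / road;

  put straight= road= pct_under=;   /* write the three values to the log */
run;
```

Applied to a dataset with one row per location pair, the same assignment statement yields the error for every pair. A paired comparison of the two distance columns, for example with the PAIRED statement of PROC TTEST, is one possible way to test whether the difference is statistically significant.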
The results also indicate that the error is larger when two spatially separated locations are farther apart. This is
generally observable in Figure 2, but is much more clearly observed in Figure 3. The latter figure shows that as the distance
between the two locations in a pair increases, so does the degree of underestimation due to the use of a straightline distance
approximation.
Figure 2: Comparison of Haversine (Straightline) and Road Distances Across 272 Location Pairs
Source: Figure generated by the author.
Notes: Bar heights indicate average distances and bands represent 95% confidence limits.
Figure 3: Percent Underestimation of Road Distance when Using the Straightline Distance Approximation
Source: Figure generated by the author.
CONCLUSION
The capabilities for spatial analysis continue to rapidly improve in SAS, but there remain aspects that require additional
external resources. One such deficiency is the ability to calculate road distances between spatially separated locations.
While the GEODIST function offers an approximation (which is appropriate to use in some cases), a more precise
mechanism is not currently available. This imprecision is relatively straightforward to overcome by using the geocoding
and driving direction functions of Google Maps. The Google Maps directions feature becomes even more powerful
when it is coupled with SAS, enabling users to easily automate the road distance data collection process. This allows
users to determine road distances across large datasets and immediately employ these data for statistical analyses.
Being able to obtain a more detailed and precise understanding of distances can substantially improve individuals’ and
companies’ abilities to optimize their decisions and strategies, and can have significant economic impacts.
APPENDIX: DETERMINING LATITUDE AND LONGITUDE COORDINATES
The following code uses the SAS—Google Maps integration to geocode an address or location (determine the latitude
and longitude coordinates).
%let addr1 = Bozeman,MT;
filename google url "http://maps.google.com/maps?q=&addr1";
data location(keep=lat long);
  infile google recfm=f lrecl=10000;
  input @ 'viewport:{center:{' @;
  input html $50.;
  if _n_ = 1;
  /* Locate the start and end of the latitude and longitude values */
  ystart = index(html,"lat:");
  yend = index(html,",lng");
  xstart = index(html,"lng:");
  xend = index(html,"},span");
  /* The third SUBSTR argument is a length, not an end position */
  lat = input(substr(html,ystart+4,yend-ystart-4),best12.);
  long = input(substr(html,xstart+4,xend-xstart-4),best12.);
run;
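The INDEX and SUBSTR logic can be checked without a live request. In the sketch below, the string is a hypothetical fragment whose layout is inferred from the markers that the code above searches for, with the Bozeman, MT coordinates taken from the notes to Figure 1. Each substring length is computed as the marker position minus the start of the value, because the third argument of SUBSTR is a length.

```sas
data geo_demo;
  length html $50;
  /* Hypothetical text following 'viewport:{center:{' */
  html = "lat:45.682677,lng:-111.053288},span:{";

  ystart = index(html, "lat:");      /* start of the latitude label  */
  yend   = index(html, ",lng");      /* comma that ends the latitude */
  xstart = index(html, "lng:");      /* start of the longitude label */
  xend   = index(html, "},span");    /* brace that ends the longitude */

  /* Skip the 4-character label, then read up to the closing marker */
  lat  = input(substr(html, ystart + 4, yend - ystart - 4), best12.);
  long = input(substr(html, xstart + 4, xend - xstart - 4), best12.);

  put lat= long=;                    /* write both values to the log */
  keep lat long;
run;
```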
REFERENCES
SAS Institute, Inc. 2011. SAS/STAT 9.3 User's Guide. Cary, NC: SAS Institute Inc.
CONTACT INFORMATION
All SAS code described in this paper can be accessed by visiting the “Tools/Code” tab on the website
listed below. Please address comments and questions to:
Anton Bekkerman, Ph.D.
205 Linfield Hall
Montana State University
P.O. Box 172920
Bozeman, MT 59717-2920
Phone: (406) 994-3032
http://www.montana.edu/bekkerman
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute
Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of
their respective companies.