Geocluster

Geocluster
Master's thesis presentation
Geocluster: Server-side clustering for mapping in Drupal based on Geohash
Master's program: Software Engineering & Internet Computing
Josef Dabernig
Problem
Clustering is the task of grouping unlabeled data in an automated way. The thesis researches cluster analysis to create an algorithm for server-side clustering with maps.
Maps visualize data in an intuitive way. Performance and readability of digital mapping applications decrease when displaying large amounts of data. Client-side clustering uses JavaScript to group overlapping items. Server-side clustering is needed when too many items slow down processing and create network bottlenecks.
[Figure: Geohash space decomposition on level 1. The letter "D" covers parts of the Americas.]
Geohash is a latitude/longitude geocode system based on the Morton order. Coordinates are encoded as string identifiers with a hierarchical spatial structure.
... then a cluster can be defined as a connected component: a group of objects that are connected to one another, but have no connection to objects outside the group. An example can be seen in Figure 2.3c.
Density-based. A cluster is a dense region of objects that is surrounded by a region of low density. A density-based definition of a cluster is often employed when the clusters are irregular or intertwined and when noise and outliers are present. A density-based cluster can take on any shape; an example can be seen in Figure 2.3d.
Algorithm considerations
◘ Pattern representation: spatial clusters
◘ Proximity measure: Euclidean distance
◘ Cluster type: prototype-based
◘ Algorithm: based on Geohash
Goals
◘ Implement real-time, server-side clustering
◘ Cluster up to 1,000,000 items within 1 second
◘ Visualize clusters on an interactive map
◘ Integrate with the Drupal framework
◘ Publish under the Open Source GPL license
◘ Implement use cases and evaluate results
Approach
◘ Research clustering, mapping and visualization
◘ Evaluate state-of-the-art technologies
◘ Design a scalable algorithm for clustering
◘ Implement and test the algorithm
Clustering
Shared-Property (Conceptual Clusters). More generally, we can define a cluster as a set of objects that share a property. This definition encompasses all the previous definitions of a cluster. The process of finding such clusters is called conceptual clustering. When this conceptual clustering gets too sophisticated, it becomes pattern recognition on its own. Then this definition is no longer a basic definition.
... specific interpretation of clusters that a method uses to create these clusters can result in totally different mathematical approaches. It is important to decide which type of cluster is needed to solve a problem.
Implementation
The algorithm has been integrated into the Drupal mapping stack, as shown in the figure "Geocluster Solr architecture overview".
Create a Geohash-based hierarchical spatial index:
1) initialize algorithm variables (cluster level)
2) pre-cluster points based on Geohash
3) merge clusters by neighbor-check
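The three steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code of the Geocluster module: the function names `geohash_encode`, `precluster` and `merge_close` and the `min_distance` parameter (in degrees) are hypothetical, and the merge step compares all cluster pairs by Euclidean distance instead of checking only neighboring Geohash cells.

```python
from collections import defaultdict
from math import hypot

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_encode(lat, lon, length):
    """Encode a coordinate by interleaving longitude and latitude bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value, use_lon = [], 0, 0, True
    while len(chars) < length:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        value <<= 1
        if coord >= mid:
            value |= 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits form one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def precluster(points, level):
    """Step 2: group points sharing a Geohash prefix of length `level`."""
    cells = defaultdict(list)
    for lat, lon in points:
        cells[geohash_encode(lat, lon, level)].append((lat, lon))
    # Each cell becomes one cluster, represented by its centroid and size.
    return [
        {"lat": sum(p[0] for p in pts) / len(pts),
         "lon": sum(p[1] for p in pts) / len(pts),
         "count": len(pts)}
        for pts in cells.values()
    ]


def merge_close(clusters, min_distance):
    """Step 3 (simplified assumption): merge clusters closer than min_distance."""
    merged = []
    for c in clusters:
        for m in merged:
            if hypot(c["lat"] - m["lat"], c["lon"] - m["lon"]) < min_distance:
                total = m["count"] + c["count"]
                # Weighted centroid of the two merged clusters.
                m["lat"] = (m["lat"] * m["count"] + c["lat"] * c["count"]) / total
                m["lon"] = (m["lon"] * m["count"] + c["lon"] * c["count"]) / total
                m["count"] = total
                break
        else:
            merged.append(dict(c))
    return merged


# Step 1: choose the cluster level (Geohash prefix length), here 5.
points = [(48.2082, 16.3738), (48.2100, 16.3700), (40.7128, -74.0060)]
clusters = merge_close(precluster(points, level=5), min_distance=0.5)
```

Because a Geohash prefix identifies a cell of the hierarchical grid, choosing a shorter prefix yields coarser clusters, which is presumably why the cluster level is initialized first.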
Mapping
[Figure: A modern web mapping stack. Labels: User, Interact with map, Visualize map, Map, Browser, Client, Web page, JavaScript mapping library, Vector data layer, Base image layer, Base layer tiles, Vector data, Server, Tile Server, Feature server, Spatial database.]
◘ Spatial data is represented by points, lines or polygons in vector format or as raster images
◘ Projections map the geoid earth onto a planar surface, which causes distortion
◘ A modern web mapping stack uses image base tiles with overlays of vector data
◘ The slippy map is rendered client-side by a JavaScript mapping library
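The distortion caused by projection can be illustrated with a short sketch. It assumes the spherical Web Mercator projection that slippy-map tile services commonly use; the source does not name a specific projection, and the function names `web_mercator` and `scale_factor` are hypothetical.

```python
from math import cos, log, pi, radians, tan

EARTH_RADIUS_M = 6378137.0  # WGS 84 equatorial radius used by Web Mercator


def web_mercator(lat, lon):
    """Project latitude/longitude in degrees to Web Mercator meters (x, y)."""
    x = EARTH_RADIUS_M * radians(lon)
    y = EARTH_RADIUS_M * log(tan(pi / 4 + radians(lat) / 2))
    return x, y


def scale_factor(lat):
    """Factor by which distances are stretched at the given latitude."""
    return 1 / cos(radians(lat))
```

Under this assumption, distances at 60 degrees latitude appear about twice as long on the map as the same distances at the equator, which matters when a planar Euclidean distance is used as the proximity measure.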
[Figure: Geocluster Solr architecture overview. Labels: Drupal Server, Apache Solr Server, Client Browser, Array of Geodata, Search API, Search API Solr, Apache Solr, Geocluster, Geocluster Algorithm, Geocluster Solr, BBOX Strategy, Views, Views GeoJSON, GeoJSON, Geocluster Visualization, Leaflet Library, HTML Map Wrapper, Interactive Map.]
The Drupal mapping stack has been studied for integration of a server-side clustering solution.
Types of cluster analysis
[Figure 2.3: Types of clusters. (a) Well-separated, (b) Prototype-based, (c) Graph-based, (d) Density-based.]
When the wanted type of cluster is known, a suited method for extracting these clusters is needed. A variety of methods for searching clusters is available, each producing its own type of clusters. The way these methods work can be divided based on three characteristics. This defines not the ...
Visualization
Foundations of geovisualization, visual variables, data exploration techniques and clutter reduction have been researched. A state-of-the-art analysis enumerates map visualization types and techniques for putting clustered, multi-variate data on maps.
◘ Map types: Geographic maps with markers, Heat/choropleth maps, Dot grid maps and Voronoi maps
◘ Cluster visualization techniques: Icon-based/Glyphs, Pixel-oriented as well as Geometric techniques and Diagrams
An evaluation classifies the stated techniques for cluster visualization on maps, based on exploratory analysis.
[Figure: Cluster islands]
Drupal
Drupal is a free and open source content management system and framework. Developed and maintained by an international community, it currently backs more than 2% of all websites.
The Drupal mapping stack has been evaluated for integration of a server-side clustering implementation, including modules for spatial data storage and presentation.
Geocluster
Geocluster integrates with state-of-the-art Drupal 7 modules like Geofield, Views and Leaflet to provide interactive, scalable, clustered maps.
It has been released under the GPL license and can be downloaded from: http://drupal.org/project/geocluster
Results
Two use cases have been realized and evaluated for
performance and visualization: a geocluster demo use
case and a GeoRecruiter prototype that extends the
Recruiter distribution for job boards in Drupal 7.
The performance tests show that one of the 3
algorithm implementations fulfills the objective:
◘ the PHP implementation doesn't scale well
◘ the MySQL clustering scales up to 100,000 items
◘ the Solr version scales beyond 1,000,000 items
[Figure: Cluster algorithm performance. Request time (100ms to 1000ms) by number of clustered items, for the series none, mysql, php and solr.]
Vienna University of Technology (Technische Universität Wien)
Institute of Software Technology and Interactive Systems
Research group: Information & Software Engineering Group
Supervisor: O.Univ.Prof. Dr. A Min Tjoa
[Figure labels: Simple glyph types; no edge; entry; Matrix edge detection]
http://drupal.org/project/geocluster
Geocluster performance
Contact: http://dasjo.at