Freie Universität Berlin
Fachbereich für Mathematik und Informatik
Institut für Informatik

Master's Thesis

Storage and Parallel Processing of Image Data Gathered for
the Analysis of Social Structures in a Bee Colony

August 7, 2014

Submitted by:
Simon Wichmann
Erich-Baron-Weg 24A
12623 Berlin
Contact: [email protected]

Reviewer: Prof. Raúl Rojas
Supervisor: Dr. Tim Landgraf
Abstract
The Beesbook project's goal is to gather information of unprecedented detail about the
processes in a beehive. Owing to the high temporal and image resolution, a large amount
of image data (up to 300 Terabyte) will be acquired during the planned experiments.
This thesis, firstly, presents an approach to transfer the recorded image data live to a
specialized data storage facility in a stable and efficient way. For that purpose, it was
worked out how to use a previously established Gigabit connection to capacity. Secondly,
because analyzing the images would take an estimated 700 years on a single processor, a
parallelization of the analysis has been developed. The presented approach uses the newly
built supercomputer of Zuse-Institut Berlin, which provides 17856 processors. In order
to analyze the whole dataset in an automated way, a program has been developed that
monitors the supercomputer's batch queue and submits processing jobs continuously. The
resulting program is applicable to all perfectly parallel problems. In fact, it can be
compared to the Map step of the well-known MapReduce programming model, but in a
supercomputer environment instead of a compute cluster.
Statutory Declaration
I hereby declare that I have written this thesis independently and without outside help,
and that I have used no aids other than those stated. This thesis has not been submitted
to any other examination authority in the same or a similar form.
Berlin, August 7, 2014
..............................................
Simon Wichmann
Acknowledgements
Far more people than just the author contribute to the successful completion of a thesis
like this. At this point I would like to thank some of them.
A big thank-you goes to Dr. Tim Landgraf for his exceptionally creative and pragmatic way of working and his support, as well as for his confidence in my work. Furthermore,
I thank the whole Beesbook and Biorobotics team for many inspiring conversations
and a very pleasant collaboration.
I would also like to thank Dr. Wolfgang Baumann and Wolfgang Pyszkalski of Zuse-Institut
Berlin for long and insightful discussions. Thanks are also due to the IT services of the
Institut für Informatik for their help with the network infrastructure and for the quick
provisioning and maintenance of a Gigabit line for the Beesbook project.
I warmly thank Anne Becker for advice on the correct use of the English language and for
thorough proofreading.
Special thanks go to my girlfriend, Michele Ritschel, who always had an open ear for any
problems during this work. Finally, I would like to thank my parents, without whom this
thesis would not have been possible in this scope.
Contents
1. Introduction
   1.1. The Beesbook Project
      1.1.1. Motivation
      1.1.2. Project Structure
   1.2. The Scope of this Thesis
      1.2.1. Transfer and Storage of the Image Data
      1.2.2. Parallelized Image Processing
2. Comparison with Similar Projects
3. Transfer and Storage of the Image Data
   3.1. Requirements
   3.2. Storage Approaches
      3.2.1. Local Hard Disks
      3.2.2. Tape Storage Devices at Zuse Institut Berlin
      3.2.3. Using the Supercomputer's File System
   3.3. Live Transfer of the Image Data
      3.3.1. Overview
      3.3.2. Monitoring the Image Directory with the FileSystemWatcher
      3.3.3. Archiving the Images with 7zip
      3.3.4. Transferring the Archives with WinSCP
      3.3.5. Error Handling
4. Parallelization of the Image Analysis
   4.1. The Parallelization Approach
      4.1.1. Problem Decomposition and Computing Architectures
      4.1.2. The Right Parallelization Approach for the Beesbook Problem
   4.2. The HLRN/Cray Supercomputer System
      4.2.1. Overview of the CrayXC30 Supercomputer
   4.3. Parallelization on the Cray Supercomputer System
      4.3.1. Parallelization per Job
   4.4. Organizing the Automatic Job Submission
      4.4.1. The Beesbook Observer Script
      4.4.2. The BbCtx Module
      4.4.3. The Job Queue Manager Module
      4.4.4. The Image Provider Module
      4.4.5. Recovery
   4.5. How to Configure and Use the Beesbook Observer
      4.5.1. The Configuration via the BbCtx Module
      4.5.2. (Re-)Initializing the Observer's Work Directory
5. Evaluation
   5.1. Evaluation of the Image Transfer
      5.1.1. The Best Transfer Protocol
      5.1.2. Transfer Stability
   5.2. Evaluation of the Parallelization
6. Discussion
   6.1. Transfer
   6.2. Parallelization
7. Future work
   7.1. Transfer
   7.2. Parallelization
      7.2.1. Starting the Observer
      7.2.2. Extending the Observer
Appendices
A. Calculations and Tables
   A.1. Calculation Tables
      A.1.1. Image Sizes and Bandwidth
      A.1.2. HDD Capacities
      A.1.3. Maximal Parallelization
      A.1.4. Needed NPL
   A.2. Additional Calculations
      A.2.1. FileSystemWatcher Event Buffer Size
      A.2.2. Archive Size
      A.2.3. Processing Times
      A.2.4. Number of Batch Jobs
      A.2.5. Work Directory Structure
B. Glossary
   B.1. Units
List of Figures
1.1. The top level of the Beesbook structure, showing the separate steps of the experiment's workflow.
1.2. The design of the bee tags (source: [9])
3.1. The hard- and software setup during the recording stage
3.2. A diagram showing the activities of one image transfer thread
4.1. An illustration of the batch job scheduling
4.2. An illustration of the proceedings in one batch job
4.3. The top-level Beesbook Observer schema
5.1. The results of the data transfer stability tests
5.2. Determined NPL overhead during several test runs
A.1. The bandwidths and total sizes corresponding to certain JPEG quality levels
A.2. The time capacities of differently sized hard disks
A.3. The maximal number of concurrent computations
A.4. NPL need depending on runtime per image
A.5. The image used for benchmarking
A.6. The structure of the Beesbook work directory.
1. Introduction
The tasks I worked on in the scope of this master thesis are part of the Beesbook project [1]
led by Tim Landgraf. To give an idea of how this thesis' topic fits into Beesbook,
I will first describe the project and its structure. The second chapter introduces and
compares some related work. Subsequently, chapters three and four describe the implementation of the two main tasks of this thesis. The implementations are then evaluated and
discussed in chapters five and six. The last chapter gives an outlook on further applications
of the solved problems.
1.1. The Beesbook Project
Pollination services provided by bee colonies are essential for a wide range of important
plants, for example almond, apple and cocoa [17]. In North America, bee-pollinated crops
represent about 30% of the food consumed by humans [12]. In recent years there has been
a growing number of reports about whole colonies dying (colony collapse disorder) [16],
while the cause remains unclear. The honeybee's agricultural importance, as well as the
complexity of the processes in a beehive, makes it a frequent object of research.
1.1.1. Motivation
Understanding the individual behaviors and a bee colony's coaction is crucial in order
to understand the colony's behavior as a whole. One prominent example of interactive
communication between bees is the waggle dance, which is an important part of foraging.
Bees following the dance are pointed to a certain spot in the environment by decoding the
location's polar coordinates from the body movements. While many technical aspects of
the waggle dance are well understood, its social components remain unclear. Why does a
bee follow a specific dance? Do dancers target certain individuals?
To answer these questions, we have to survey factors that might have an influence
on the waggle dance behavior. Beesbook aims at acquiring interaction information not
only about a few individuals in the proximity of a dance but about all individuals in the
hive. A novel automated tracking system is currently being built that facilitates this unprecedented breadth of observation. In order to identify patterns of interesting interaction, the
observation also covers a very large time frame of 60 days.
The gathered time and location data about every single bee will make it possible to
analyze the social bee network. However, the storage capacity and computing time needed
to achieve this are far beyond the capabilities of current commodity hardware. The amount
of image data to be recorded will sum up to about 300 Terabyte (TB), and the expected
computing time to analyze all images will be up to 700 years (on a single processor). For
that reason both a storage solution and a parallelization approach have to be invented.
1.1.2. Project Structure
In order to analyze all the interactions inside a beehive, they have to be observed in an
automated way. For this purpose, cameras are set up for video recording. In addition, a
tracking system that allows the image data to be analyzed is needed. It consists of two parts:
1. a unique marking for each individual, and
2. a program that can spot and identify each marking and its orientation in an image
of the hive — the software decoder.
The top level of the project structure is depicted in figure 1.1. A short description of each
stage of the experiment is given in the following paragraphs.
Further background information, for example about biological considerations and camera/hive setup, will be given in upcoming Beesbook papers.

Figure 1.1.: The top level of the Beesbook structure, showing the separate steps of the experiment's workflow.
Growing the Beehive
As the basis for the experiment, the bee colony has to be grown first. Moreover, a container
for the beehive that allows its interior to be observed must be built.
Bee Tagging
Each single bee has to be marked with a unique number tag in order to identify it unambiguously. This stage again consists of two parts:
• The tag design: This includes the tag’s physical properties like material, dimensions
and production facilities. Figure 1.2 shows the information encoding based on a
circular alignment.
• Fixing the tags to the bees: Each bee has to be extracted individually from the hive
to glue a tag to its torso. All tags are fixed with the same defined orientation so
that the bee's orientation can also be observed, because it correlates with the tag
orientation.

Figure 1.2.: The tag design: The binary sequence is represented by black and white cells lying
on the outer arc. The inner semicircles define the tag's orientation and the location
of the least significant bit.
Image Capturing and Storing
During this stage the actual experiment data is recorded. The aim is to capture 4 frames
per second from 4 cameras with a resolution of 12 megapixels each. Due to the large
bandwidth produced (between 20 MB/s and 66 MB/s, depending on the image quality)
and the complexity of the tracking step, the data cannot be processed in real time. Since
the overall size of the recorded data will amount to about 300 TB, it has to be transferred
directly to a specialized data storage facility. The live transfer of the image data is one
task I worked on and will be further introduced in section 1.2.1.
Image Analysis
At this stage the image data gathered has to be processed in a way that allows for later
reconstruction of the social bee network. For each single image a table containing the
location (two-dimensional coordinates) and orientation (angle) of each bee is saved as an
intermediate result. For this purpose a software decoder using the openCV framework is
being developed.
Even if one image could be analyzed in only one second (it actually takes much more
time), due to the large number of images the overall core time would add up to about
3 years. To obtain results in a reasonable time it is necessary to parallelize the image
analysis. This is the second task I worked on, which will be introduced in more detail in
section 1.2.2.
Location Data Analysis
By utilizing the location and orientation data extracted in the previous stage, it is possible
to reconstruct a social network which covers the interactions of all marked bees. Two
important points will affect the result:
• The definition of what an interaction is, in terms of distance/orientation between
two bees.
• The handling of detection errors: a filtering step could remove or even correct implausible bee detections by using the chronological context.
Eventually, the resulting interaction network (or other representations which can be derived from the intermediate location data) can be used to obtain new insights into the
waggle dance behavior.
1.2. The Scope of this Thesis
As I already announced in the previous sections, I now give some additional introductory
information on the tasks I worked on.
1.2.1. Transfer and Storage of the Image Data
The bandwidth of the recorded image data depends on the quality needed to obtain sufficient tracking results. Since the software decoder is still in the development process, there
are no quantitative measurements of its tracking performance yet. The estimated bandwidths range from about 9.1 MB/s (Q=70) to 66 MB/s (Q=100), assuming that the black
and white JPEG format will satisfy the software decoder’s requirements. Accordingly, a
hard disk with a capacity of 2 TB could merely hold 4.3 hours (Q=100) to 13.9 h (Q=90)
or 63 h (Q=70) of image data.
Given that real-time processing is not feasible, as described in 1.1.2, it becomes necessary
to store the data for later analysis. The planned experiment length of 60 days will result
in an overall data volume of 45 TB (Q=70) to 159 TB (Q=90) or even 325 TB (Q=100).
The only way to handle such a volume, rather than storing it locally, is to transfer the
data directly to a specialized data storage facility during the recording. For this purpose, I
researched and implemented solutions for the following questions:
• Where can a data volume of more than 300 Terabyte be stored?
• How do we establish a stable connection to the target storage server?
• How can we provide a constant bandwidth of up to 66 MB/s (528 Mbit/s)?
The answers and their implementations are described in chapter 3.
1.2.2. Parallelized Image Processing
The experiment, if we use the parameters we aim for (4 cameras, 4 fps, experiment
runtime of 60 d), will produce a total of 4 · 4/s · 60 d · 24 h/d · 3600 s/h = 82,944,000 images.
Because the software decoder is still in development, it was initially not clear how long
its runtime would be. Still, as described in section 1.1.2, even if one image could be
processed in one second, the overall core time would sum up to about 3 years. Since the
estimated runtime is much longer, only a parallelization of the processing stage allows
results to be obtained in a reasonable time.
Thus, to achieve a considerable speedup I investigated various parallelization approaches
and implemented the one that fitted best. Chapter 4 describes how the best approach was
identified and how it was implemented.
2. Comparison with Similar Projects
Seeley [13] marked up to 4000 bees by combining plastic bee tags and different colors
of paint. Yet, the observation of the hive could only be performed manually (sometimes
augmented by video recordings for later reference). The spatial and temporal resolution of
these observations is comparatively small, calling for a more powerful observation system
in order to investigate larger-scale questions.
A study [10] similar to Beesbook used video cameras to track six ant colonies (containing
about 900 individuals altogether) over 41 days to learn how labor is divided among worker
ants. The ants’ positions and orientations were extracted from image data in real time
twice a second, using an existing tag library. Only a video with reduced resolution is saved
in the process.
To be able to identify far more individuals (about 2000), a new tag had to be designed
for the Beesbook project. Since bees would chew the paper tags, the shape and material of
the tags have to be different from the ones used in [10]. Because more information is concentrated on a small marker, locating and decoding these new tags is too expensive to
be carried out in real time. Hence, the full-resolution image data has to be stored for later
analysis, which places challenging demands on storage facilities. Furthermore, the much more
expensive tag decoding also poses a challenge to the available computing power. This
is why a transfer and storage solution as well as a highly parallel computing environment
had to be developed for the Beesbook project.
Processing large amounts of data is a common task for big companies such as Google.
A total of more than twenty petabytes of data is processed on Google's clusters
every day (as of 2008 [7]). A programming model called MapReduce was invented by
Google (and is widely used today) to automatically parallelize computations on a large
cluster once the problem has been specified in terms of a map and a reduce function. The
MapReduce model is useful in cluster environments, but it is less suited to high-performance
computing systems that do not share a cluster's network limitations. Chapter 4 explains
the decision for the CrayXC30 supercomputer system and the need for a new way of
partitioning the work and organizing the jobs.
3. Transfer and Storage of the Image Data
The following two chapters deal with the actual implementation of the tasks I worked on.
The calculations behind most of the numbers mentioned are given in the appendices.
This chapter deals with the transfer and storage of the recorded image data. It consists
of three parts: First, I give a list of the general requirements for this stage of the
experiment. The subsequent section deals with the various possible storage approaches.
Finally, the chapter is concluded by a description of the data transfer program I developed.
3.1. Requirements
The image capturing stage will produce up to 66 Megabyte per second of image data.
Together with the stage's length of 60 days and the impossibility of real-time processing,
this leads to the following requirements:
• A storage facility capable of storing up to 325 Terabyte of image data must be found.
• The recorded data must be transferred to the storage facility of choice, presumably
over the Internet.
• Throughout the transfer, the local storage must be prevented from overflowing by
ensuring a constant bandwidth of at least 66 MB/s.
3.2. Storage Approaches
There is a range of possible storage approaches. Yet, both the bandwidth and the total
size of the data volume render some of them unfeasible. The following sections describe three
of them, including their advantages and disadvantages and how they can meet the given
requirements.
When speaking about data storage, the possibility of additional binary compression (e.g.
gzip) has to be considered. In the case of image data, however, this yields no advantage,
because image formats like PNG and JPEG already employ advanced image compression
algorithms. These algorithms leave little to no room for further compression on the binary layer.
3.2.1. Local Hard Disks
The easiest solution would be to store the image data on local hard disks, making a (live)
transfer unnecessary. This would require 82 currently available hard disks (capacity of
4 TB each) to store all the data, costing more than 10000€. Moreover, they would have to be
hot-swapped in the running system whenever a hard disk reaches the limits of its capacity.
After the recording, the data would have to be transferred to a compute cluster. In an
article assessing the potentials of cloud computing [5] the data transfer over the Internet
(20 Mbit/s) is said to be slower than sending the hard disks via overnight shipping, which
would yield a bandwidth of 1500 MBit/s. However, this assumes that the whole data set
is present on hard disks at one time.
This approach is not viable due to poor cost-efficiency and the need for swapping the
hard disks every 44 minutes (see figure A.2). Consequently, the live transfer of the recorded
data to a specialized storage facility is inevitable.
3.2.2. Tape Storage Devices at Zuse Institut Berlin
The data management department of Zuse Institut Berlin (ZIB) provides quasi-infinite
disk space. This is achieved by internally copying all the files to tapes while only the
metadata (e.g. the inodes) remains present on the hard disks [11, p. 13]. Each tape has
a capacity of about 5 TB and is read/written using a tape drive. 24 tape robots swap the
tapes in one of the 60 tape drives, according to current access needs. Furthermore, there
is room for 18,000 tapes, resulting in an overall capacity of about 45 Petabyte.
This storage approach has some advantages over the local storage: It is cheaper (the
cooperation contract with ZIB would cost 4000€) and it is more secure due to professional
data storage methods, such as replication. However, the recorded data has to be transferred to the storage servers. Since the servers are located outside of the institute's local
network, the transfer relies on the Internet connection of the Freie Universität (FU).
Depending on the Internet bandwidth available, the frame rate which can be used for
the recording could be drastically reduced by the transfer. Yet, we were able to establish
a Gigabit connection to the storage gateway server, thanks to our cooperation with the
institute’s IT–Services and our institute’s direct Internet connection to ZIB.
The resulting Gigabit connection provides a theoretical bandwidth of 117.5 MB/s net
[18], which exceeds the limitations of a standard 100 Mbit/s connection (11.75 MB/s net).
Tests showed an actual application level bandwidth of approx. 100 MB/s when using at
least four connections.
3.2.3. Using the Supercomputer’s File System
The supercomputer on which the image analysis will be executed is introduced in chapter 4.
Its work file system has a capacity of about 1 Petabyte which would be open for us to use
for a limited time.
However, there are a few drawbacks: The file system has no means of data security.
Furthermore, the supercomputer has scheduled downtimes which last up to two days.
Therefore we would have to buffer the data locally during these times.
Since the usage of the supercomputer’s file system would be free of charge, this approach
is the preferred one. Some future work will have to research whether the required buffering
is practicable. In addition, it is unclear whether the Gigabit connection can be established
to the supercomputer’s login servers too.
3.3. Live Transfer of the Image Data
The local storage approach is not feasible, as explained in section 3.2.1. As a consequence,
the data has to be transferred to a storage facility right away during the recording process.
Figure 3.1 gives an overview of the hardware and software setup of the image recording
stage. It also shows the data flow from the cameras to the ZIB servers. One important point
is the utilization of a RAM disk instead of a hard disk to capture the images from the
cameras. This ensures that concurrently writing new images and reading older ones, in
order to transfer them, is not obstructed by slow hard disks.

Figure 3.1.: The hard- and software setup during the recording stage. The arrows show the
image data rate, the data flow direction and the interface speeds of SSD [15], PCI
and USB 3.0 [6].

Several points have to be considered in order to automate the continuous transfer process. For this reason I developed a transfer script in the Windows PowerShell language
which takes care of the necessary steps like image archiving and connection maintenance.
The following sections describe the implementation details of the script.
3.3.1. Overview
Windows PowerShell is an alternative to cmd.exe, available since Windows XP and
based on the .NET Framework. In addition, the PowerShell Scripting Language facilitates
the scripting of system commands, as known from Unix shells. It offers sophisticated tools
like the execution of multiple concurrent jobs and modularization by the use of Cmdlets.
The number of concurrent jobs to use is determined by two factors. One of them is
the fact that only by using multiple concurrent FTP connections can one ensure that
the available Internet bandwidth is used to capacity. This can only be achieved by using
multiple asynchronous jobs.
The other factor is a limitation of the PowerShell: it is impossible to communicate with
a job once it is started. Consequently, the file system monitoring and the transfer must
be executed together in the same job.
Tests showed that three connections suffice to use the bandwidth to capacity. This is due
to a bandwidth of about 35.15 MB/s per connection (see section 5.1.1) and the absence of
a bandwidth loss when using multiple connections concurrently. However, since there are
four cameras to handle, four connections are used so that each job can use its own.
The transfer script starts four concurrent jobs (threads), one for each camera. Prior
to the data transfer loop, each job initializes its own FileSystemWatcher (described in
section 3.3.2) and FTP (File Transfer Protocol) session.
Figure 3.2 shows the data transfer loop executed by each job after the initialization.
Whenever a certain number of images has been captured, they are stored in a tar archive
and deleted afterwards. After the archive has been transferred via FTP connection, it will
be deleted in order to release disk space. In case an error occurs, the respective job will
be terminated and the error will be printed to the command line. The single steps of the
data transfer loop are described in the next three sections.

Figure 3.2.: This diagram shows the activities of a transfer thread. The initialization of the
FileSystemWatcher and the FTP session is not shown here and is performed prior
to this loop. The archive size (number of images) is configured in the transfer script
and depends on the available disk space and the transfer speed.
3.3.2. Monitoring the Image Directory with the FileSystemWatcher
The FileSystemWatcher (FSW) is a .NET Framework Class which is used to monitor file
system events. Whenever an event occurs to which the job is subscribed (in this case the
creation of a file), the information is stored in an event buffer and the job gets notified.
Since all cameras write their images into the same directory, an event filter is registered
with each FSW. The filter is represented by a string of the form ∗_$id.png, matching
only filenames of camera $id (∗ is a wildcard). Thereby each job only handles new images
created by its associated camera.
Whenever an event is raised, the respective job increments its file counter and stores
the file name. Then, if the archive size is reached, the loop proceeds to the next stage, the
archiving (see next section). After the event has been handled, it will be removed from
the event queue to release its memory.
The maximum event buffer size is 64 Kilobyte [2] and therefore limits the number of
events which can be stored. Conservative calculations yield an event buffer queue capacity
of approximately 1638 events (see appendix A.2.1). As a consequence, the transfer of an
archive must not take longer than 409 s (1638/frame rate). Stated differently, each archive
must have a maximal size of about
(409 s · 100 MB/s) / 4 = 10225 MB.
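As a brief restatement of where these bounds come from: the per-event size of roughly 40 bytes is an assumption that is consistent with the 1638-event figure from appendix A.2.1, and 4 events per second corresponds to the frame rate of one camera.
\[
\frac{64\,\text{KB}}{\approx 40\,\text{B per event}} \approx 1638\ \text{events},
\qquad
\frac{1638\ \text{events}}{4\ \text{events/s}} \approx 409\,\text{s}.
\]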
3.3.3. Archiving the Images with 7zip
In order to minimize transfer protocol (FTP) overhead, it is convenient to transfer one
large file instead of many smaller ones. For this purpose, the images are stored in a tar
archive using the 7zip command line program 7za.exe. Then the images are deleted to
release disk space as early as possible.
As I explained earlier (see section 3.2), it is of no use to compress the images. This is
why I chose compression mode -mx0, which just copies the data without any compression.
Even if the additional compression were of any benefit, it would most likely not be worth
the additional computational work. All in all, the computational throughput must not be
less than the Internet bandwidth during the whole pre–transfer processing.
A bigger archive size can reduce the transfer protocol overhead. Yet, the archive size is
limited by the available amount of disk space. In our case this is a portion of the available
RAM, of which I allocated 15 GB for the RAM–disk. The calculations resulted in a
maximal archive size (number of contained images) of 467 (see appendix A.2.2). However,
tests showed (see evaluation chapter) that the transfer throughput is already stable enough
at an archive size of 190.
3.3.4. Transferring the Archives with WinSCP
After a job has stored the images into an archive, it is transferred to the remote storage
facility over the job’s Internet connection. One essential requirement for this connection is
a bandwidth of one Gigabit. The next slower level of 100 Mbit/s (approx. 10 MB/s net.)
would allow a JPG quality level of 70 at most, which will presumably not be sufficient for
the tracking. This is why we established a Gigabit connection to ZIB (see 3.2.2), which is
the basis for the viability of the transfer step.
Tests showed that encrypted file transfer protocols achieve much lower bandwidths than
unencrypted ones (see section 5.1.1). This is mostly because the encryption requires a
significant amount of computing power, which reduces the computational throughput.
Since data security is not relevant for us, I chose FTP as the transfer protocol because
it provides the necessary bandwidths.
The WinSCP program delivers a .NET library in addition to its standard executable.
This library provides an interface to several file transfer protocols and can be used natively
inside the PowerShell Scripting Language. Together with the FileSystemWatcher, an FTP
session is configured (server information, binary mode) and initialized at the beginning of
each job.
Immediately after archiving a certain number of images, the archive is transferred by
calling the PutFiles() function of the WinSCP session object. Then the transfer result
is checked and the transfer will be repeated in case anything goes wrong. Finally, the
archive will be deleted to release its memory.
3.3.5. Error Handling
The fault tolerance of the transfer script is a vital property in order to ensure a continuous
data acquisition during the whole experiment. All errors will immediately be printed to
the command line, but the transfer script will not be able to recover from most of the
errors because they arise from problems in the underlying system (e.g. file system or
Internet problems). There are three steps which can produce errors during the runtime of
the script:
• The FileSystemWatcher will produce an Error event instead of a Created event if
there is a problem with the FSW. This can happen when the event queue reaches
the limits of its capacity (1638 events, see appx. A.2.1), which can only be avoided
by setting conservative archive sizes. If this or another error occurs here, the script
will not be able to recover itself.
• The image archiving with 7zip can fail. In that case it sets a global exit code to a
value indicating the cause. One cause can be an image which is still opened while
trying to archive it. This can happen because the Created event is raised immediately
after the file is created and hence the script might try to archive the images while
the camera is still writing data to the newest file. To overcome this uncertainty,
the archiving is executed again until it succeeds or another error code is specified
by 7zip. In case of another error code the transfer script will not be able to recover
itself.
• The transfer of the archive may not succeed. Since the WinSCP interface already
employs connection recovery, all other errors are beyond the transfer script’s capability to recover itself.
4. Parallelization of the Image Analysis
The experiment will produce a total of 82,944,000 images. As shown in appendix A.2.3,
the overall processing time sums up to more than 770 years. For this reason, to obtain
results in a reasonable amount of time, it is necessary to parallelize the image processing.
This chapter covers the research and implementation of an efficient way to speed up this
experiment stage.
I will first explain some general points regarding the parallelization approach. Then,
an explanation of the parallelization approach applied to the Beesbook image analysis
concludes the conceptual part of this chapter. The following three sections go into detail
about working with the Cray supercomputer at Norddeutscher Verbund für Hoch– und
Höchstleistungsrechnen (HLRN) and the implementation of the parallelization of the data
analysis. Finally, the last section explains how to configure and use the resulting Beesbook
Observer program.
4.1. The Parallelization Approach
In this section I explain program parallelization driven by problem decomposition. Furthermore, I introduce the parallelization approach I decided to use for the Beesbook problem.
4.1.1. Problem Decomposition and Computing Architectures
A program or problem can be parallelized in three ways:
• By decomposing the functionality of the program into multiple parts with few interconnections (henceforth intra-parallelization). Usually, this is done for systems
which are meant to be highly responsive while executing multiple computations in
parallel, which is not required for this project. Intra-parallelization can also become
useful if there are fewer data items to analyze than processors available (see next point).
• By decomposing the domain data into chunks (single images, in the case of Beesbook),
which can be processed in parallel by multiple program instances (henceforth inter-parallelization). That makes it possible to process several images at a time, speeding
up the whole process. If there are fewer data chunks than processors available,
additional intra-parallelization could allow more processors to be utilized.
All the approaches assume the availability of multiple processing units. Currently there
are two common architectures (taxonomy: [8]) providing a large amount of processing
units:
Array Processors like a GPU employ a Single Instruction Multiple Data (SIMD) approach.
All processing units execute the same command in the program flow on separate
parts of the data. This is only efficient while the processing of all data parts is highly
homogeneous. The program flow of our software decoder depends strongly on the input
image, which is why the analysis of multiple images is not homogeneous.
Supercomputer Systems commonly consist of many nodes containing several independent
processors. These massively parallel systems constitute a Multiple Instruction Multiple
Data (MIMD) architecture, making it possible to execute multiple independent programs
concurrently. The Single Program Multiple Data (SPMD) approach makes use of this
architecture by executing one program in multiple instances, each working on separate data
(images).
4.1.2. The Right Parallelization Approach for the Beesbook Problem
First of all, the parallelization of the image analysis stage shall decrease the overall running
time. Consequently, the objective is to utilize as many processors as possible. The number
of processors which can be used concurrently is limited by several factors, though. These
factors include the number of processors available, the problem’s degree of decomposability
and the data transfer bandwidth in case the data does not reside near the supercomputer.
The effectiveness of data decomposition depends on the problem’s degree of decomposability and the number of parts the decomposition results in. In case of Beesbook each
image can be processed independently, which is already a sufficient decomposition in order
to use a huge number of processors concurrently. Sometimes this problem structure, which
has little to no interdependencies between results, is called perfectly parallel because of
that property. Additional intra–parallelization would only be helpful if there were many
more processors available than images. Hence, the most efficient solution is to map one
process to each available processor, each analyzing a chunk of the image data. In this
case, implicit intra–parallelization, as offered by the openCV library and another research
library [14], would even hinder the process because additional threads would compete for
processors already allocated to particular processes.
In case the data does not reside in the file system of the supercomputer, a data transfer
is necessary prior to processing. Depending on the bandwidth available between the
supercomputer and the data storage, as well as on the processing time, this transfer can be a
considerable bottleneck. See appendix A.1.3 for actual calculations of how many computations can happen in parallel. This bottleneck is one important reason for us to pursue
both storing the data at ZIB and processing it on the nearby Cray system.
4.2. The HLRN/Cray Supercomputer System
As pointed out in section 4.1.1, the Beesbook image analysis is not homogeneous enough
to be carried out on an array processor. Accordingly, the architecture of choice is a
MIMD supercomputer system. There are several reasons for the decision for HLRN: First of all, it
shares the file system designated to store our data, which circumvents an additional data
transfer. Second, HLRN houses the new Cray XC30 supercomputer, which provides
17856 processors (we need about 4.3% of its yearly capacity of 17856 core years, compare appx. A.2.3)
and thus enables us to obtain the computing capacity we need.
In the following sections I introduce a system that is able to process all data in
a directory in an efficient, automatic and configurable way. The system described is
completely independent of the actual processing program and could also be applied to
other perfectly parallel problems. (For further information about the Beesbook software
decoder, see Laubisch, 2014 [9].)
4.2.1. Overview of the CrayXC30 Supercomputer
The CrayXC30 supercomputer system at HLRN consists of 744 compute nodes with 24
CPU cores each, and some additional auxiliary servers for login and data management.
Moreover, the system has 46.5 TB of memory and 1.4 Petabyte of hard disks, organized
as RAID 6. This information and further hints on how to use the Cray system can be
found in the Cray user documentation [3].
The Batch System
The compute nodes are not directly accessible from the login shell. In order to execute a
program, a batch script has to be submitted to the job queue. Said batch script contains
information about the resources needed (number of compute nodes, runtime) and a call
to the actual binary. After submitting the script with the msub command, the system
scheduler decides when to execute the job. This decision is based on complex calculations
on the job’s resource requirements and the system load, among other things. One job may
run for a maximum of twelve hours and allocate a maximum of 256 compute nodes, which
requires us to partition the whole workload into smaller jobs.
The important thing to note about the batch system is that the actual job execution
happens in a completely asynchronous way. Therefore it is not possible to guarantee a
constant number of jobs running at a time. Hence, it is also impossible to guarantee a
constant speedup during the whole analysis stage. Instead, one can only fine-tune the job
sizes in order to fit into the small gaps left by scheduling fragmentation (see figure 4.1).
After some organizational work in the batch script, the binary is executed using the
aprun command. Its parameters include the number of nodes and cores used for this
execution, program arguments and the path to the binary. An example call to execute the
program on 48 cores would look like this: aprun -n 48 beesbook.bin jobWorkDir.
Figure 4.1.: An illustration of the batch job scheduling. The jobs are scheduled so that unused
capacities are minimized. In contrast to the Beesbook problem, most of the computations carried out on Cray cannot be decomposed to an arbitrary degree (dark
gray). Thus, the scheduling usually leaves some slots unassigned because no job
fits in there. The scheduler will backfill these slots with Beesbook jobs (light gray)
to minimize unused capacities. The Beesbook jobs trickle into these slots like grains
of sand (also called Sandkornprinzip), thereby minimizing the waiting time.
Job Accounting
The HLRN’s currency is called NPL (Norddeutsche Parallelrechner–Leistungseinheit).
Each active account gets 2500 NPL per quarter (a project proposal is required to get
more NPL) and one compute node costs 2 NPL per hour. There is no deduction in case
not all cores of a node are used, hence it is important to utilize all reserved cores as
efficiently as possible.
If we neglected any overhead, the overall core time of 773.57 core years would cost
773.57 core years · (2 NPL / (node·h)) / (24 cores/node) ≈ 564,703 NPL.
This corresponds to about 4.3% of the HLRN’s yearly capacity (compare appx. A.2.3).
One important point to note about figure 4.1 and the Sandkornprinzip is that minimizing
job waiting times becomes unnecessary if the project's NPL are granted over multiple
quarters. This is because the overall runtime will then be at least n-1 quarters, where n is
the number of quarters over which the NPL are granted. In this case, only the last
quarter's runtime would benefit from efficient scheduling.
4.3. Parallelization on the Cray Supercomputer System
After introducing the parallelization approach and the Cray supercomputer system, I now
describe the actual implementation and organization of the automated image analysis. As
was explained earlier, the processing will be divided into many jobs. Each job will reserve
a certain number of compute nodes for a certain amount of time to process one part of
the image data. The parallelized processing therefore consists of two parts: concurrently
analyzing images on all allocated cores, and organizing the continuous job submission.
4.3.1. Parallelization per Job
As described in section 4.1.2, the partitioning of the data into single images is enough to
utilize many processors concurrently. Hence, during the execution of one job the software
decoder is simply started on each allocated core. Each decoder then loops over a part of
the images which was assigned to this program instance.
In order to avoid conflicts, the data must be partitioned so that each process works on
its own part of the data. This is achieved by supplying the data in separate directories,
one for every process.
The last difficulty is to actually assign the directories to the processes. Due to a limitation of the aprun command it is not possible to do this by passing the particular directory
via command line arguments. This limitation arises from the fact that aprun can only
spawn processes in a homogeneous way, hence all processes on a node are started with the
same command line argument. (It is also impossible to call aprun for each single process
to pass different arguments, because each subsequent call to aprun targets another compute
node, which would prevent us from starting more than one process per node.)
Consequently, the only way to have both one process per core and a coordinated data
distribution is to use interprocess communication. For this purpose I utilize the MPI
(Message Passing Interface) library, which has several implementations present at Cray.
Besides initializing and finalizing the MPI environment, there is one single function used:
MPI_Comm_rank( MPI_COMM_WORLD, &world_rank ). It determines the ID (rank) of the process
among all other processes started with the same aprun call, ranging from zero to the number
of processes minus one. This way each process knows which directory it has to work on. A more
detailed explanation of the proceedings during one batch job can be found in figure 4.2.
Furthermore, the structure of the Beesbook work directory is shown in appx. A.2.5.

Figure 4.2.: An illustration of the proceedings in one batch job. After adding the openCV
libraries to the environment variable LD_LIBRARY_PATH, the aprun command is called
to start the execution of the Beesbook program on all reserved cores. Then, each
process determines its ID by calling the appropriate MPI function. Subsequently,
the actual software decoder is executed for every image contained in the directory
of the process. At last the MPI_Finalize() function is called in every process, which
behaves like a barrier, ensuring that all processes have finished their analysis before
the job exits.
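To make this per-process logic more concrete, the following is a minimal sketch in Python using mpi4py, purely for illustration: the actual decoder is a separate C++ program started by aprun, and both the directory naming (one sub-directory per MPI rank) and the process_image() helper are assumptions.

# Illustrative sketch (not the actual C++ decoder) of the per-process logic
# from figure 4.2, assuming mpi4py and one sub-directory per MPI rank.
import os
import sys

from mpi4py import MPI


def process_image(path):
    """Hypothetical stand-in for the Beesbook software decoder, which locates and
    decodes the tags in one image and deletes the input on success."""


def main(job_work_dir):
    rank = MPI.COMM_WORLD.Get_rank()                 # 0 .. number of processes - 1
    my_dir = os.path.join(job_work_dir, str(rank))   # assumed: one directory per rank

    for name in sorted(os.listdir(my_dir)):          # loop over this process's images
        process_image(os.path.join(my_dir, name))

    # In the real program MPI_Finalize() acts like a barrier; mpi4py finalizes
    # automatically on exit, so an explicit barrier mimics that synchronization.
    MPI.COMM_WORLD.Barrier()


if __name__ == "__main__":
    main(sys.argv[1])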
4.4. Organizing the Automatic Job Submission
The maximum runtime for one job is twelve hours. This means that between 471 and
23530 jobs will have to be submitted in total (compare appx. A.2.4), which necessitates
an automated job organization and submission program. Since we want to comply with
the Sandkornprinzip mentioned in figure 4.1, the jobs will be shorter and there will hence
be even more of them.
There are two conceivable approaches to implement the automation:
• The so-called job script chaining would use some reserved time of a batch job to
organize and submit the next job before calling the aprun command. The data
would be provided during another asynchronous job and the newly submitted job
would have to wait for that job to complete.
• An external observer program can organize the batch jobs just like a human would
do it manually. However, this requires the possibility to have a process running on
the Cray system during the whole data processing stage.
Since the job script chaining requires some non-trivial synchronization between multiple
jobs (only one of them must perform the organization), I chose the latter possibility. Yet,
this is only possible because user processes are allowed to run for an unlimited time on
a data node (login node processes are limited to 3000 seconds) at Cray, given that the
process does little CPU–intensive work.
4.4.1. The Beesbook Observer Script
The automation program is implemented as a Python script which utilizes the command
line programs available on Cray: msub <jobScript> returns the jobID of the submitted
job and checkjob <jobID> returns the status of the job. The status comprises a state
(Idle/Running/Completed) and a remaining runtime if the job is already running.
In order to facilitate a fast and efficient image processing (and to minimize waiting
time), the automation program performs the following tasks:
• It provides a block of images to the working file system for the next job to be
submitted. This may take some time, depending on where the image data is stored.
The data providing must be completed before the actual job is executed in order to
avoid idle running during the reserved computation time. In particular, the image
providing can be done while waiting for a job to finish.
• In order to minimize the overall processing time, the waiting time between jobs has
to be minimized. For that purpose, the program has to maintain an internal job
queue, ensuring that at least one Beesbook job is available for scheduling at any
time.
• Whenever a job is finished its results must be saved and a new job has to be prepared
for the job queue.
Besides the organizational tasks, the program is able to recover itself from every failure to
continue with normal production. Figure 4.3 is a schematic depiction of the main routine
of the Beesbook Observer script.

Figure 4.3.: The top-level Beesbook Observer schema. First of all, the Beesbook Observer
recovers the state of the job queue and the overall progress. This is necessary because
the program could have been terminated at any point of execution. Then, while there
are images left to process, the Observer keeps the job queue filled in the following way:
At the beginning of the loop an image block is provided in advance for the next job to
submit. Then the next job is submitted if there is still room in the job queue or after a
job is finished. The submission can be executed immediately because the image block
was already provided before waiting for a job to finish. Since there is one more job slot
than the job queue is big, the next image block can always be provided before waiting.
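The following self-contained Python sketch mimics this main loop. All functions are simplified placeholders for the Job Queue Manager and Image Provider functions described below; the queue size, the number of image blocks and the printed messages are illustrative assumptions, while the slot arithmetic mirrors the formula given in section 4.4.3.

# Simulation-style sketch of the Observer main routine from figure 4.3.
from collections import deque

MAX_QUEUE_SIZE = 2            # illustrative value; a BbCtx constant in the real module
job_queue = deque()           # holds the slots of currently submitted jobs
remaining_blocks = 5          # pretend five image blocks are left to process


def images_left():
    return remaining_blocks > 0


def provide_next_image_block(slot):
    global remaining_blocks
    remaining_blocks -= 1     # real module: extract archives, fill the slot directories
    print(f"provided image block in slot {slot}")


def submit_job(slot):
    job_queue.append(slot)    # real module: call msub and store the Cray job ID
    print(f"submitted job for slot {slot}")


def wait_for_finished_job():
    slot = job_queue.popleft()  # real module: poll checkjob until the oldest job ends
    print(f"job in slot {slot} finished; saving results")


def observe():
    slot = 0
    while images_left():
        provide_next_image_block(slot)             # stage data before any waiting
        if len(job_queue) >= MAX_QUEUE_SIZE:       # queue full: wait for a job to end
            wait_for_finished_job()
        submit_job(slot)
        slot = (slot + 1) % (MAX_QUEUE_SIZE + 1)   # one more slot than queue entries
    while job_queue:                               # no images left: collect the rest
        wait_for_finished_job()


observe()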
The Beesbook Observer is divided into three modules, which are described in the following sections.
4.4.2. The BbCtx Module
The Beesbook Context module is a container for all global constants and is used in both
other modules. These constants include for example directory paths as well as job properties (wall–clock time, number of used cores, etc.).
Additionally, the persistent status file is loaded here, using Python’s shelve module.
The statusShelf is used to persistently store information about the job queue and the
overall progress across consecutive executions of the Beesbook Observer. Moreover, certain
checkpoint values are stored in the statusShelf to indicate whenever a critical section
is entered. This way, the Observer can recover from unexpected terminations (e.g. after
the program was terminated by the system). The statusShelf is used like a dictionary (a
key-value map) so that, e.g., the job queue is accessed as BbCtx.statusShelf['jobQueue'].
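As a minimal illustration of this mechanism, the sketch below shows how such a persistent status could be kept with Python's shelve module. The file name, the (ID, slot) tuple layout and the checkpoint key are assumptions; only the 'jobQueue' key is taken from the text.

import shelve

# Open (or create) the persistent status file; "bb_status" is an assumed name.
status_shelf = shelve.open("bb_status", writeback=True)

if "jobQueue" not in status_shelf:          # initialize the queue on first use
    status_shelf["jobQueue"] = []

status_shelf["providingImages"] = True      # assumed checkpoint key: critical section entered
status_shelf["jobQueue"].append(("cray-job-id", 0))   # hypothetical (ID, slot) tuple
status_shelf.sync()                         # persist the new state to disk immediately
status_shelf["providingImages"] = False     # checkpoint cleared: critical section left
status_shelf.close()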
Before starting the actual analyzing stage, some of the variables in the BbObserverConfig
file will have to be adapted to the system (e.g. path to Beesbook binary and the image
archives).
4.4.3. The Job Queue Manager Module
The Job Queue Manager encapsulates functions for the manipulation of the internal job
queue. It is based on the implementation of a persistent queue which immediately writes
all changes to the status file. A description of the Manager's functions follows:
SubmitJob calls the msub command to submit the job in the next job slot and parses
the job's information (Cray-ID and job slot) from the output. A tuple representing the
the job’s information (Cray–ID and job slot) from the output. A tuple representing the
job is then enqueued in the job queue. After that, the next job slot, which will be used,
is calculated as (currentSlot +1) % (MAX_QUEUE_SIZE +1).
WaitForQueue returns nothing if the job queue is not full. Otherwise it waits for a job
to finish. This is used to fill the queue with jobs in the beginning (see figure 4.3). The
size of the queue is defined by the BbCtx constant MAX_QUEUE_SIZE.
WaitForJobsToFinish checks the status of the oldest job in the queue (by parsing the
output of the checkjob command) and waits until it is finished. The job is then removed
from the queue and its information is returned to the caller.
The checkjob command returns the job state, among other information. There are four
cases to handle (see the sketch after this list):
• If the job state is Completed, the job finished execution and can be removed from
the queue.
• If the state is Idle, the job is waiting for execution. The job will not finish before at
least the given wallclock time expired, so the Observer waits to check again later.
• The Running state signals that the job is already running. The checkjob command also returns a remaining wallclock time, so the Observer parses and waits for
that amount of time.
• In case the system cannot find a job with the given ID, it returns an error message.
The system deletes completed jobs after some minutes so the error message means
that the job finished the execution and was deleted before the Observer polled the
job’s status. Thus, the job can be handled as if its status was Completed.
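The following polling loop sketches how these four cases can be handled; the two parse functions are placeholders for the actual checkjob output parsing, and it is only an assumption that an unknown job ID surfaces as a non-zero exit code.

    import subprocess
    import time

    def parseState(checkjobOutput):
        # Placeholder: extract the job state from the checkjob output.
        raise NotImplementedError

    def parseRemainingWallclock(checkjobOutput):
        # Placeholder: extract the remaining wallclock time in seconds.
        raise NotImplementedError

    def waitForJobsToFinish(jobQueue, walltimeSeconds):
        crayId, slot = jobQueue[0]                  # oldest job in the queue
        while True:
            try:
                out = subprocess.check_output(['checkjob', crayId]).decode()
            except subprocess.CalledProcessError:
                break                               # unknown ID: treat as Completed
            state = parseState(out)
            if state == 'Completed':
                break
            elif state == 'Idle':
                time.sleep(walltimeSeconds)         # cannot finish earlier anyway
            elif state == 'Running':
                time.sleep(parseRemainingWallclock(out))
        return jobQueue.pop(0)                      # remove and return the job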
SaveResults cleans the given job slot by moving the results to the result directory. Then
all remaining image files (which are unprocessed, the software decoder would have deleted
the input after successful processing) are moved back to the image directory so that they
are processed in another job.
The Observer relies on the Beesbook program to delete the input image to signal that the analysis was completed successfully. The Observer saves the result only if the
input image is not present anymore. Otherwise the result is discarded and the input file
is returned to the image heap. Note that, besides some MPI commands (see section 4.3.1),
this is the only requirement the Beesbook software decoder has to fulfill in order to be
compatible with the Observer.
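Per input file, this convention can be sketched as follows; the function and path names are illustrative and not taken from the actual implementation.

    import os
    import shutil

    def saveResultFor(imagePath, resultPath, resultDir, imageHeapDir):
        if not os.path.exists(imagePath):
            # The decoder deleted the input image: the analysis succeeded,
            # so the result file is kept.
            shutil.move(resultPath, resultDir)
        else:
            # The input image is still present: discard a possibly partial
            # result and return the image to the heap for reprocessing.
            if os.path.exists(resultPath):
                os.remove(resultPath)
            shutil.move(imagePath, imageHeapDir)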
Recover
uses the statusShelf to recover the job queue from previous Observer runs. It
also takes action against possible inconsistencies caused by abnormal program termination.
For further information on recovery, see section 4.4.5.
4.4.4. The Image Provider Module
The Image Provider encapsulates functions to provide jobs with images and to track the
overall processing progress. The image providing is organized in two stages: When an
image block is provided to a job slot, the images are taken from an internal image heap
which is a directory containing the extracted images. In case there are not enough images
left, the next image archive is extracted into the image heap.
The functions utilized in the Observer are:
ProvideNextImageBlock fills the given job slot with images to process. As described in
section 4.3.1, each slot consists of n directories, with n as the number of processes (and
hence cores) per job. The number of images per process is defined in the BbCtx constant
CHUNKSIZE_PER_PROC. The function unpackNextArchive (which utilizes the Python tarfile
module) is called to extract the next image archives if necessary. In that case, the new progress is also saved in the statusShelf, represented as the number of archives already extracted.
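The extraction step itself reduces to a few lines with the tarfile module, sketched below; the real unpackNextArchive additionally selects the next archive and updates the progress as described above.

    import tarfile

    def extractArchive(archivePath, imageHeapDir):
        # Extract all images of the given archive into the image heap.
        # Existing files are silently overwritten, which keeps the
        # operation idempotent (see section 4.4.5).
        with tarfile.open(archivePath) as archive:
            archive.extractall(path=imageHeapDir)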
ImagesLeft
checks if there are images left in the image heap or if there are image archives
left which have not been extracted yet.
Recover
uses the statusShelf to recover the current progress. This is necessary if the
Observer was terminated during the extraction of an archive or the providing of an image
block. For further information on recovery, see section 4.4.5.
4.4.5. Recovery
It may happen that the Beesbook Observer program gets terminated at any point during
execution. This may be, for example, due to Cray maintenance shutdowns or Cray system
errors. The Observer is able to recover from most inconsistencies, which can arise from
incomplete operations. Note that the recovery system that is described here does not
handle errors occurring in the Beesbook binary.
All operations manipulating data that is located in files on the hard disk may produce
inconsistencies due to incomplete execution. This includes program terminations before a
new status can be persisted in the statusShelf. Yet, due to the sequential nature of the Observer, there can be at most one incomplete operation that has to be recovered at each program start. This insight greatly simplifies reasoning about the recovery process.
The recovery system consists of two parts: Checkpoint values are stored in the statusShelf
(comp. section 4.4.2) to indicate that a critical section was entered. The recovery functions then check for these values to identify incomplete operations and to recover from the
inconsistent state. In the following paragraphs, I give an analysis of the critical sections
in the program and how they are handled by the recovery system. The first three paragraphs cover the critical sections of the Job Queue Manager module, while the subsequent
paragraphs address the Image Provider’s critical sections.
Job Submission
The function submitJob calls msub to submit the next job and stores the job’s ID and
its slot in the persistent job queue. If the program is terminated after submitting the
job but before storing the information, the result is a ghost job which will lead to errors
when the next job is submitted in the same slot. A placeholder job with ID = −1 is
saved in the job queue before calling msub to be able to detect this situation. After the
submission the placeholder is updated with the actual ID. During recovery, a job with
ID = −1 is recovered by comparing the internal job queue with the output of the Cray
showq command. If there is an ID that is not present in the internal job queue, that
would be the ghost job’s missing ID. If there is no excess ID in the Cray queue, that would
indicate that the ghost job already finished and can be saved.
If the termination occurs before determining the next slot to use, the current slot would
be used again for the next job. For that reason, the next slot is inferred from the youngest
job in the persistent job queue during recovery.
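The recovery side of this placeholder pattern might be sketched as follows; the showq invocation and its parsing are simplified (a real implementation would restrict the listing to the user's own jobs), and the helper name is an assumption.

    import subprocess

    def parseJobIds(showqOutput):
        # Placeholder: extract all job IDs from the showq output.
        raise NotImplementedError

    def findGhostJobId(jobQueue):
        # IDs the Observer knows about; the placeholder entry has ID -1.
        knownIds = set(crayId for crayId, slot in jobQueue)
        # IDs the batch system currently reports.
        output = subprocess.check_output(['showq']).decode()
        systemIds = set(parseJobIds(output))
        excess = systemIds - knownIds
        if excess:
            return excess.pop()   # the ghost job's missing ID
        return None               # ghost job already finished; clean via saveResults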
Job Waiting
The only critical section is after the job is removed from the persistent job queue in
waitForJobsToFinish. If the Observer is terminated before the results of the job are
saved, the reuse of the slot would possibly lead to errors. To avoid this, the slot of the
finished job is stored under the key saveResults in the statusShelf before removing the
job from the queue. Then, at the end of the saveResults function, the entry is deleted.
During recovery, if an entry for saveResults is present in the statusShelf, saveResults
is executed retroactively for the corresponding job slot.
If the Observer is terminated before actually removing the finished job from the job
queue, it would check again for the job’s status during the next run and remove it because
it is finished. Hence, this operation does not need to be guarded by recovery. As explained
in the next paragraph, it is also safe to execute saveResults again during this process.
Saving Results
After a job is finished, its slot directories (one for each process) are traversed and the
results are moved into the result directory. Moreover, remaining image files are moved
back into the image heap for later reprocessing. If the Observer is terminated before or
during saveResults, the function can just be called again during recovery. As described in
the previous paragraph, callers of saveResults can guard the completion of the operation
by adding the key saveResults to the statusShelf. Since it is an idempotent function (the
result stays the same if it is called multiple times in succession), it is safe to call saveResults again until the operation has completed.
Recovery
The recovery functions of both the Job Queue Manager and the Image Provider are the first functions to be called during each program start. Therefore, if the program is
terminated while still recovering, the respective recovery operation will be repeated since
the indicating statusShelf entry was not deleted yet.
The following paragraphs address the Image Provider’s critical sections.
Image Block Providing
To recover from partially provided image blocks, saveResults is called for the slot that
will be used for the next job. This will move all images back to the image heap so that
the image providing can be executed anew. Since this is done during each program start,
regardless of whether the image providing for the next slot has been completed successfully,
an image block might be provided multiple times. To prevent this overhead, it is insufficient
just to indicate a failure by adding an appropriate key to the statusShelf. For further
discussion on this topic see chapter 7.
Archive Extraction
In case the image heap does not contain enough images for the next image block, the
next archives are extracted into the image heap. To ensure the successful completion of
the extraction, a flag is stored under the key extracting in the statusShelf. If this flag
is set during recovery, the function unpackNextArchive is executed again to complete
the operation. Since the extraction silently overwrites existing files, it is an idempotent
operation and therefore it is safe to execute it again.
Then, after extracting the next archive, the overall progress has to be updated. To
make sure that this happens, the old progress value is stored under the key oldProgress
in the statusShelf before deleting the extracting flag. By interlocking these flags, one
can ensure that a termination during unpackNextArchive is definitely detected. During
recovery, if the oldProgress value is set, it is compared to the current progress value. If
they have the same value, which indicates that the update of the progress did not happen,
the update will be repeated. Now, if the Observer is terminated right after incrementing
the progress but before deleting the oldProgress value, the incrementation will not happen
again, since now oldProgress and progress differ.
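The interlocking of the two keys can be sketched as follows; the key names extracting and oldProgress are taken from the text, the key progress and everything else are illustrative, and extractArchive refers to the sketch in section 4.4.4.

    def extractWithGuards(statusShelf, archivePath, imageHeapDir):
        # Mark the critical section before extracting; the extraction itself
        # is idempotent and may simply be repeated during recovery.
        statusShelf['extracting'] = True
        extractArchive(archivePath, imageHeapDir)

        # Remember the old progress before dropping the extracting flag, so
        # a termination between the two steps is still detectable.
        statusShelf['oldProgress'] = statusShelf.get('progress', 0)
        del statusShelf['extracting']

        # Update the progress, then drop the guard value.
        statusShelf['progress'] = statusShelf['oldProgress'] + 1
        del statusShelf['oldProgress']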
4.5. How to Configure and Use the Beesbook Observer
In order to start the Observer, it must be configured first.
4.5.1. The Configuration via the BbCtx Module
The file BbObserverConfig.py contains all variable information for the Observer, such as the binary path, the job queue size, etc. An explanation of all relevant values and how to pick the right value follows; an example configuration sketch is given after the list. If any of the parameters is changed during the analysis, the Observer has to be re-initialized (a description of how to do that follows below). However, only Cray parameters should be changed after starting the analysis.
The values shown are chosen to comply with the Sandkornprinzip and will work for the whole analysis. If it turns out that this is unnecessary, CHUNKSIZE_PER_PROC, WALLTIME, and NUM_NODES can be increased, e.g. to (50, 04:04:50, 10).
• MAX_QUEUE_SIZE: 8
– Number of the Cray jobs put into queue;
– Since most of the time no more than 4 jobs will be running (Cray soft limit),
this number should stay below 10.
– Note that there will be CHUNKSIZE_PER_PROC * NUM_PROCS images provided per
queue slot. One should make sure that there is enough disk space (and quota)
available.
• NUM_NODES: 3
– The number of nodes to reserve per Job.
– This affects the number of images analysed per job. Should be relatively small
to conform with the Sandkornprinzip. Values up to 3 nodes are scheduled into
the "smallqueue".
– 3 nodes correspond to 0.4% of all available nodes (744).
– For system maxima of Walltime and number of nodes/cores see [3].
• CHUNKSIZE_PER_PROC: 10
– Number of images per process per job.
– CHUNKSIZE_PER_PROC and WALLTIME are the most important values to adjust
properly in order to minimize the NPL overhead.
– Should correspond to the WALLTIME: depending on the variance of the Beesbook program, the WALLTIME should be adjusted to CHUNKSIZE_PER_PROC * <avg runtime for one image>, in order to avoid idle running.
– To conform with the Sandkornprinzip, the Walltime should not be longer than 60 min. Hence, for a runtime of 293 s per image, 10–12 is a good value.
• WALLTIME: 00:48:55
– Specifies the time the requested resources are reserved for. Has to be of the
form hh:mm:ss.
– Should correspond to CHUNKSIZE_PER_PROC, see explanation above: 10 · 293 s = 2930 s = 48:50; the value above includes a few extra buffer seconds.
– Maximum Walltime is 12h.
• WORKING_DIRECTORY: <Path to where the work directory will be initialized>
– The directory where the image blocks are provided and the log and status files
are written.
• BIN_PATH: <Path to the Beesbook binary>
• ARCHIVE_DIR: <Path to the image archives>
– The archives will be untarred into IMAGE_FILE_DIR during the analysis.
• RESULT_FILE_DIR: <Path to where the results will be stored>
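Put together, a BbObserverConfig.py following the values above could look roughly like this; it is only a sketch, and the path entries are placeholders that must be adapted to the system.

    # BbObserverConfig.py -- example values following section 4.5.1;
    # the path entries are placeholders that must be adapted to the system.
    MAX_QUEUE_SIZE     = 8            # Cray jobs kept in the internal queue
    NUM_NODES          = 3            # nodes reserved per job ("smallqueue")
    CHUNKSIZE_PER_PROC = 10           # images per process per job
    WALLTIME           = '00:48:55'   # hh:mm:ss, matched to the chunk size

    WORKING_DIRECTORY  = '<path to the work directory>'
    BIN_PATH           = '<path to the Beesbook binary>'
    ARCHIVE_DIR        = '<path to the image archives>'
    RESULT_FILE_DIR    = '<path to the result directory>'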
4.5.2. (Re-)Initializing the Observer's Work Directory
Before starting the analysis, the Observer has to be executed with
python BeesbookObserver.py --init. This way the work directory is initialized at the
path specified in WORKING_DIRECTORY. Later, if parameters are changed in the configuration, the parameter --reInit would be used to update the work directory according to
the new parameters. Note that the job queue has to be empty before reinitializing the Observer because that will delete the present directories. The program switch --collectJobs
can be used to collect all active jobs without submitting new ones.
5. Evaluation
Both the evaluation chapter and the discussion chapter are divided into two individual sections: each begins with the image transfer, followed by the parallelization.
5.1. Evaluation of the Image Transfer
The efficiency of the transfer of the image archives was measured in two ways. Firstly, the
best transfer protocol was identified by measuring the maximal bandwidth one can achieve
with it using one single connection. Secondly, long-term tests were performed with varying archive sizes and using four connections. The long-term tests should measure the stability
of the transfer and identify the archive size needed in order to achieve the maximal possible
bandwidth.
5.1.1. The Best Transfer Protocol
During these tests an image archive of 104 MB was transferred multiple times. The following results represent the average bandwidth over all transfers per protocol.
• Putty SCP: 14.87 MB/s
• WinSCP: 18.55 MB/s
• WinFTP: 35.15 MB/s
The results show that the unencrypted File Transfer Protocol is by far the most efficient
one. This is due to additional computations that are needed for encryption, which is
unnecessary in case of Beesbook.
5.1.2. Transfer Stability
In order to measure the long-term stability of the data transfer, a script has been written that generates images inside the observed folder at a rate of about 97.92 MB/s. The transfer stability was then measured by observing, once per second, the number of images residing in the observed directory while the transfer script was running. For each archive size there is a maximum for this number of images as long as the transfer keeps up. If this number is exceeded, the transfer is too slow (less than an average of 97.92 MB/s) and hence one could soon run out of disk space.
See figure 5.1 for the results.
The expected maximum number of files in the directory is calculated as the archive size
times four (four cameras) times two (while transferring one archive, new images continue
to be generated) plus four (number of archives that reside in the directory):
Archive Size · 4 · 2 + 4
The archive sizes that were tested are:
• 16 images: 97.92 MB, expected maximum number of files: 132
• 32 images: 195.84 MB, expected maximum number of files: 260
• 64 images: 391.68 MB, expected maximum number of files: 516
• 128 images: 783.36 MB, expected maximum number of files: 1028
In addition to the tests depicted in figure 5.1, test runs with an archive size of 128 were
also performed for runtimes of 8 hours, 10h, 16h and 19h. The maximum and average
file count did not differ significantly from the ones shown in the diagram. Those results show that the present 1 GBit connection to ZIB can successfully be used to capacity and that it is stable enough to be used for the data acquisition of the Beesbook project.
However, further tests with other parameters like another data generation bandwidth
or longer runtime were not possible because the account at the ZIB data storage server
expired and could not be reactivated free of charge. Hence, there could still be bandwidth fluctuations during certain periods which were not detected during the tests.
Figure 5.1.: This chart shows the results of the transfer stability tests. The bars display the
maximum (and average) number of files that resided in the image directory during
the whole test (26.6 min). For that purpose a test script generated about 100 MB/s
of image data inside a directory located on the RAM–disk. Values above 100%
indicate that the transfer was slower than the image generation and hence the
number of images increased continuously. During the actual experiment this must
not happen because eventually there would be no more disk space available, which
would lead to data loss.
The results show that the transfer speed is too low for archive sizes below 64. This
is due to connection maintenance and archiving overhead. Yet, archive sizes from
64 on result in a sufficient transfer bandwidth of nearly 100 MB/s.
5.2. Evaluation of the Parallelization
The success of the parallelization effort can be measured in 3 dimensions:
• Reliability of the Beesbook Observer regarding the recovery and the seamless organization of the data analysis.
• The overall speedup/overall runtime. The speedup is the factor by which the total runtime is reduced by parallelizing the computation.
• The efficiency in terms of needed NPL. How much NPL are needed beyond the actual
runtime of the software decoder? How short can one single job be so that the NPL
overhead is minimal (to comply with the Sandkornprinzip)?
The Observer’s reliability
was extensively tested in three ways:
1. The Observer was intentionally terminated before, during and after every critical
section during several runs. For this purpose, a call to sys.exit() was executed at
the corresponding spots in the code.
2. The Observer was terminated manually several times during test runs.
3. A script was written, which automatically restarts the Observer repeatedly after a
random amount of time. During one test run the times were chosen from the range
20s − 120s, during another from 2s − 40s.
After each run, the outcome was verified by checking whether all result files were present
in the result directory. No errors were detected during these executions.
The overall speedup was tested by measuring the runtimes for one single image while
using:
• one core: 22 seconds
• 24 cores (one node): 22.5s
• 48 cores (2 nodes): 23s
• 96 cores (four nodes): 23s
These results show that the overhead introduced by using all cores on a node and by communication via MPI is minimal and grows very slowly. Hence, the overall speedup is close
to the number of cores used, as expected for a problem with this structure (explained in
section 4.1.2). However, the number of cores used concurrently does highly depend on
several Cray properties and hence it is subject to fluctuations. For the most part this
can be attributed to the scheduling system which will execute an unpredictable number
of Beesbook jobs concurrently, depending on the system load.
It is more promising to examine the speedup from another point of view: Computing
time is measured in NPL, and the HLRN allocates them to users quarterly. Since no more NPL are allocated than can be used during the quarter, every user is guaranteed that his computations will be carried out within three months. In case the overall
amount of NPL granted by the HLRN is distributed over q quarters, this implies that the
entire analysis will take at least q − 1 quarters. In view of the large amount of processing
time (several hundred years) needed, the critical question will not be about how long the
analysis will take but whether the amount of NPL needed will be granted by the HLRN
(see appx. A.1.4 for NPL calculations). The possible speedup then depends directly on
the amount of NPL granted per quarter.
The efficiency of the parallelization was tested in eleven test runs with varying properties. All tests have in common the use of 24 cores (one node) per job and the chunk size
of 40 images per process (960 images per job). The number of tests one can run is limited
by the amount of NPL available (2500 per quarter) and the long runtime per test. This
is because in order to measure the overhead of NPL usage, the software decoder must
actually run for a certain time during the batch job. For this reason, only qualitative
measurements of the NPL usage can be performed.
Because the actual software decoder was not finished at testing time, it simply slept for
a given time during execution. This way, the analysis of each image took the exact same
time. Yet, the actual software decoder will likely take varying amounts of time.
There are 3 properties that were changed during the tests: the total number of images
to process, the runtime per image (a constant value) and the wallclock time (the job
runtime). Figure 5.2 shows the most important results of the measurements. The overhead
was determined by calculating the NPL demand (see appx. A.1.4 for the formula) for a
number of images and a runtime per image. Then, the job’s NPL costs of the test run
were summed up and divided by the calculated value.
Figure 5.2.: Determined NPL overhead during several test runs. WC — wallclock time.
One can see that, with an increasing number of images and runtime per image, the NPL overhead decreases. The data point [20000 images, 21s runtime per image] emphasizes the importance of an accurate wallclock adjustment. If the wallclock time is too short to analyze all images in a job, each process is terminated while analyzing an image. Hence, up to [number of cores] · [runtime per image] seconds of computing time are wasted. The smaller the ratio [wallclock] / [runtime per image], the greater the NPL overhead. On the other hand, if the processes finished their analysis but there was still wallclock time available, the job would idle up to 70 seconds (verified during multiple tests). This is due to the scheduling system taking some time to release the resources. The resulting overhead again depends on the wallclock time and diminishes with increasing job length.
For jobs whose wallclock time is too long, the overhead goes below 1% for wallclock times above 70 s / 0.01 / 60 = 116 min. For jobs whose wallclock time is too short, the overhead goes below 1% for wallclock times above [runtime per image] / 0.01 s, which is potentially much longer than in the former case. Thus, the most efficient way is to assign a slightly longer wallclock time than the job is expected to need.
6. Discussion
6.1. Transfer
The results show that it is possible to use a Gigabit connection to capacity in an efficient
and stable way. However, the risk of bandwidth fluctuations and even server downtimes
cannot be ruled out entirely. A system that can buffer data during short connection
problems will be needed in order to guarantee a data acquisition without any data loss.
Yet, as a prerequisite for the parallelized image processing on the Cray supercomputer,
the possibility of a continuous data transfer was proven.
6.2. Parallelization
The Beesbook Observer automatically and reliably analyzes images supplied in an archive
directory. In case the Observer gets terminated, it can simply be started again and
continue its work without further user interaction, thanks to the recovery system. However,
there is not yet a mechanism that notifies the user about the Observer’s termination.
Currently, the result files are stored unorganized in a result directory. While such an organization scheme would still have to be designed, one could also use a database to store the results. Additionally, the Beesbook Observer has to be configured correctly
before the image analysis can be started. The configuration process and its values are
described in section 4.5.
The overall speedup of the presented parallelization approach is close to the possible
maximum (the number of cores in use per job). Yet, as I explained in section 5.2 (overall
speedup paragraph), the critical condition to successfully execute the whole image analysis
stage on Cray is that the project is granted the corresponding amount of NPL. In this light,
it is important to take the efficiency of the NPL usage, which was calculated during
test runs, into account. Overhead values ranging from 0.5% to 11% were found. However,
because of the unknown variance of the actual Beesbook software decoder’s runtime, it is
impossible to optimize this value any further at this point.
The NPL overhead highly depends on the wallclock time chosen for a job. Generally,
as explained in figure 5.2, it is best to assign to the job a longer wallclock time than the
processes’ estimated runtime. However, due to the fact that the decoder runtimes per
image will vary, a larger overhead is possible because each process that completed its work
before the last process wastes computing time (which has to be paid for). A dynamic work
scheduling system might be needed to diminish this overhead. Yet, it was not possible to
address this problem within the scope of this thesis (especially as its benefit cannot be
quantified without knowing the actual variance of the runtime).
7. Future work
7.1. Transfer
As explained in the discussion, bandwidth fluctuations due to varying network load or
server issues cannot be ruled out entirely. Hence, the most urgent improvement is to
facilitate some sort of local buffering. Local buffering would not make sense with the
parameters used in this thesis because such a buffer can never be emptied if the available
connection is already used to capacity. However, the actual experiments will produce
smaller bandwidths (likely no more than 66 MB/s), so a local buffer can be a viable option
to handle unstable bandwidths.
7.2. Parallelization
First, I give an overview of the most immediate future work. Then some more general
developmental possibilities follow.
7.2.1. Starting the Observer
In order to actually start the image analysis, the following points must be prepared:
• The Observer must be configured correctly (described in 4.5).
• The Observer must be adapted to the image archive nomenclature used during the
experiment’s image acquisition. The Image Provider module provides internal functions that can be changed to reflect the archive naming.
• The analysis’ results must be organized, either in archives or in a database.
Furthermore, as described in section 6.2, a scheduling system might be needed in order to
reduce the NPL overhead.
7.2.2. Extending the Observer
The design of the Observer makes it possible to apply it to all perfectly parallel problems.
Basically, just like the Map step of the MapReduce programming model, the Observer
applies a given function (the program binary) to a given dataset (the input file archives).
Adapting the Observer to another problem involves just the same steps as preparing it
for the Beesbook analysis. There are few dependencies regarding the program that will be
executed for the analysis during the batch jobs:
• The program must take one command line argument that represents a job’s data
directory.
• It must use MPI (or another interprocess communication library) to identify the
individual data directory for each process. The program then must run its analysis
on each of the files in its data directory.
• Each input file must be deleted after persisting the result to indicate a successful
analysis.
See the Beesbook source code for an example how to implement those requirements. The
additional code needed to prepare a program for the Observer should be no more than 30
lines of code.
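The following sketch illustrates how a program can satisfy these requirements using mpi4py. It is not the Beesbook software decoder; the proc_<rank> directory layout follows the work directory structure in appendix A.2.5, and the result file naming is an assumption.

    import os
    import sys
    from mpi4py import MPI

    def analyze(path):
        # Placeholder for the actual per-file analysis.
        return 'result for %s' % os.path.basename(path)

    def main():
        dataDir = sys.argv[1]                        # the job's data slot directory
        rank = MPI.COMM_WORLD.Get_rank()             # identify this process
        procDir = os.path.join(dataDir, 'proc_%d' % rank)

        for name in sorted(os.listdir(procDir)):
            if name.endswith('.result'):
                continue                             # skip results of earlier runs
            inputFile = os.path.join(procDir, name)
            result = analyze(inputFile)
            # Persist the result first ...
            with open(inputFile + '.result', 'w') as resultFile:
                resultFile.write(result)
            # ... then delete the input file to signal a successful analysis.
            os.remove(inputFile)

    if __name__ == '__main__':
        main()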
In light of fields of application beyond Beesbook, there are a number of conceivable enhancements:
• In order to reduce the NPL overhead when an analysis did not finish because the job
was terminated, a checkpoint system could be implemented. Intermediate results
can be written to a file which is always moved together with its input file by the
Observer. This way an aborted analysis need not start from the beginning.
• Currently, the Observer’s state has to be checked manually to ensure that it is
still running. An alerting system might be implemented by setting up an external
watchdog.
• In case a program error occurs during analysis, it would be helpful to report it (e.g.
via email).
• If an error occurs during the analysis, the input file concerned would be analyzed
again and again. Besides error reporting, the Observer could keep track of
erroneous input files and put them into a separate directory to prevent too much
overhead.
• Currently, all results are assumed to be independent from each other. But in case of
the Beesbook tracking problem, it could be beneficial to take the results of images
from previous time frames into account. Thereby, the last position of an individual
could be used to locate it faster in the new image. To make that possible, some kind
of processing order would have to be enforced by the Observer. Furthermore, the
respective result files would have to be supplied properly.
Appendices
A. Calculations and Tables
Here I include some of the calculation tables I generated during the research process. All
calculations were made using the following parameters:
Number of cameras: 4; frame rate: 4 fps; Experiment length: 60 days;
The resulting number of images is:
4 cameras · 4 fps · 86400 s/day · 60 days = 82,944,000 frames    (A.1)
A.1. Calculation Tables
A.1.1. Image Sizes and Bandwidth
This table shows the produced bandwidth and the total size of the dataset at the end of
the experiment, depending on the size per image (second column).
Figure A.1.: The bandwidths and total sizes corresponding to certain JPEG quality levels
A.1.2. HDD Capacities
Depending on the image quality (i.e. the respective image size), this table shows the time until a hard disk of the given capacity runs full during the experiment.
Figure A.2.: The time capacities of differently sized hard disks
A.1.3. Maximal Parallelization
Depending on the bandwidth available between the supercomputer and the data storage,
as well as the processing time, the maximal number of concurrent computations can be
limited to relatively small values. Generally, this happens when the transfer time of one
image dominates its processing time. This means that only a small number of images
can be transferred while processing one image, which is actually the maximal number of
images processed concurrently. The table shows values for a processing time of 294s per
image (for calculation of the processing time see A.2.3).
Figure A.3.: The maximal number of concurrent computations
A.1.4. Needed NPL
This table shows the expected amount of NPL needed for the analysis of the data depending on the runtime per image. The formula is as follows:
[runtime per image] s · [number of images] / (3600 s/h) / (24 cores per node) · 2 NPL/(node · h)
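As an illustrative calculation with the estimate from section A.2.3 (294 s per image) and the full dataset of 82,944,000 images: 294 · 82,944,000 / 3600 / 24 · 2 ≈ 564,480 NPL.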
Figure A.4.: NPL need depending on runtime per image
A.2. Additional Calculations
A.2.1. FileSystemWatcher Event Buffer Size
Each event uses 16 Byte of memory (according to Microsoft), excluding the filename.
Since the organization of the recorded data is not specified yet, it is unclear how long the
image filenames (of the form ∗_$id.png) will exactly be. At least, the image names need
to be unique inside each archive. Because the archive size will not exceed 1000 images
(see section 3.3.3), the part before the underscore will consist of at least three digits. This
results in a total filename length of nine characters, and hence one event can consume up
to 25 Byte.
Consequently, an event buffer of 64 KB can hold 65536/25 = 2621 events. Other sources [4]
state that one event’s size is 12 Byte + 2*|file path| which would result in an event queue
capacity of approx. 1638 events, assuming the file path contains 14 characters (the shortest
possible path would be something like "G:\b\").
A.2.2. Archive Size
The maximal number of images per archive depends on the image size (test image is
4210 KB), the Internet bandwidth (approx. 100 MB/s), the bandwidth of the recorded
data (65.8 MB/s) and the available disk space (15 GB RAM–disk). The point is that the
system must not run out of disk space while transferring the archives.
We have to assume that all jobs transfer their archives at the same time, while the cameras
continue to produce data. If the Internet bandwidth and the bandwidth of the recorded
data were equal, the needed memory would amount to two times the archive size: the
archive itself and the images newly created while transferring the archive. The formula
for the maximal archive size hence is (including the factor for the four cameras):
15 GB / (2 · 4 · 4.11 MB) ≈ 467
If the recorded bandwidth is smaller than the Internet bandwidth, this value would slightly
increase. Yet, tests showed (see evaluation chapter) that the transfer throughput is already
stable enough at an archive size of 190 (corresponding to 781.15 MB per archive).
A.2.3. Processing Times
Since the software decoder is still being developed, the runtime can only be estimated.
Furthermore, the available test images contained only a small number of tags. Consequently, I measured the runtime with an image (see fig. A.5) containing 17 visible tags
and extrapolated it to the runtime of an image containing 250 tags (the average number
of tags in one image). The lack of images containing a larger number of tags is due to the
difficult tagging process and a yet nonexistent synthesis of images. The runtime of the
test image amounts to 20s (measured on Cray).
Hence, the runtime per image will be
20 core seconds/17 tags ∗ 250 tags = 294 core seconds.
(A.2)
The overall runtime will hence amount to
294 core seconds · 82,944,000 images ≈ 773.57 core years.    (A.3)
Figure A.5.: The image used for benchmarking
This corresponds to about 4.3% of the HLRN’s yearly capacity (about 17856 core years),
or 17.3% of its quarterly capacity (about 4464 core years).
A.2.4. Number of Batch Jobs
The maximum runtime for one job is twelve hours. Depending on how many nodes we can
use per job, the overall number of jobs will range between the following values. The batch
system allows 256 nodes to be used per job which represents an upper bound of nodes to
use per job. A much more realistic number of nodes is 50 or even less (complying with the
Sandkornprinzip, see figure 4.1). The runtime per job will most likely also be much less
than twelve hours, which increases the number of jobs even further.
773.57 core years / (12 h/job · 24 cores/node · 1 node) ≈ 23530 jobs
773.57 core years / (12 h/job · 24 cores/node · 50 nodes) ≈ 471 jobs
773.57 core years / (12 h/job · 24 cores/node · 256 nodes) ≈ 92 jobs
A.2.5. Work Directory Structure
The Beesbook work directory is structured in a way that allows the use of multiple job
slots. For one job slot the image data is distributed among the process directories.
Beesbook_Work Directory
    BbMPI_jobScript_Slot_0
    ...
    BbMPI_jobScript_Slot_<n+1>
    Beesbook.log
    Beesbook.status
    image_heap
    job_outputs
    results
    job_slots
        data_slot_0
        ...
        data_slot_<n+1>
            proc_0
            ...
            proc_<c>
Figure A.6.: The structure of the Beesbook work directory. Files have round boxes, directories
have rectangular boxes.
Explanation:
• The BbMPI_jobScript_Slot_n files are the job scripts for each job slot. They are
generated by the Observer when initializing the work directory.
• The image_heap directory contains the extracted images.
• The job_outputs directory contains the command line output of the batch jobs (e.g.
for error reporting).
• The results directory contains the result files of the image analysis.
• The job_slots directory contains the image data for the batch jobs. Each data_slot
directory contains one proc_ directory for each process in the batch job. The image
blocks are provided here for their respective job.
CD Content
There are three directories on the CD. The Beesbook Observer directory contains the implementation of the parallelization. The Data Transfer directory contains the transfer
script, as well as a test script. Additionally, all executables and libraries needed for the
transfer (7zip and WinSCP) are included.
The Thesis directory again contains two directories. The Document directory contains
the LaTeX source of this thesis and all images used in it. The Evaluation directory
contains the protocols of the Observer tests.
B. Glossary
Cray The Cray XC30 is the supercomputer system housed at HLRN.
FTP File Transfer Protocol – An unencrypted protocol for file transfers over the internet.
HLRN Norddeutscher Verbund für Hoch- und Höchstleistungsrechnen
MPI Acronym for Message Passing Interface; A library for interprocess communication.
It provides an abstraction layer which hides the work on the network layer. It also
brings various convenience functions, for example, to determine a process's ID (rank)
among all other processes.
RAID 6 Acronym for Redundant Array of Independent Disks; An organization approach
for hard disks to introduce data redundancy and performance improvement.
RAM–Disk A portion of the main memory (RAM) is allocated and can be used like a
hard disk, thereby significantly reducing data access times.
SCP Secure CoPy – An encrypted protocol for file transfers over the internet.
B.1. Units
fps Acronym for frames per second.
KB, MB, GB, TB, PB Kilobyte (equals 1024 Bytes), Megabyte, Gigabyte, Terabyte,
Petabyte.
Wall–clock Amount of time a batch job reserves its resources for.
NPL Acronym for Norddeutsche Parallelrechner-Leistungseinheit; One Cray compute node
costs 2 NPL per hour.
Bibliography
[1] Beesbook website. http://beesbook.mi.fu-berlin.de/wordpress.
[2] FileSystemWatcher — Internal Buffer Size. http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.internalbuffersize(v=vs.110).aspx.
[3] HLRN–III User Documentation. https://www.hlrn.de/home/view/System3/WebHome.
[4] Is it really that expensive to increase Filesystemwatcher internal buffersize? http://stackoverflow.com/a/13917670/909595.
[5] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz,
Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei
Zaharia. A View of Cloud Computing. Commun. ACM, 53(4):50–58, April 2010.
[6] Benjamin Benz. Pfeilschnell. Die dritte USB-Generation liefert Transferraten von 300
MByte/s. c’t, 2008. ISSN 0724-8679.
[7] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on
Large Clusters. Commun. ACM, 51(1):107–113, January 2008.
[8] M. Flynn. Some Computer Organizations and Their Effectiveness. Computers, IEEE
Transactions on, C-21(9):948–960, Sept 1972.
[9] Balduin Laubisch. Automatic Decoding of Honeybee Identification Tags from Comb
Images. 2014. Bachelor Thesis, Freie Universität Berlin.
[10] Danielle P. Mersch, Alessandro Crespi, and Laurent Keller. Tracking Individuals
Shows Spatial Fidelity Is a Key Regulator of Ant Social Organization. Science,
340(6136):1090–1093, 2013.
[11] Wolfgang Pyszkalski. Technischer Bericht TR13-16 — Übersicht über die Datenhaltung im ZIB und die Möglichkeiten einer Nutzung durch Projekte, 2013.
[12] Roger A. Morse and Nicholas Calderone. The value of honey bees as pollinators of U.S. crops in 2000. 2000. http://www.utahcountybeekeepers.org/Other%20Files/Information%20Articles/Value%20of%20Honey%20Bees%20as%20Pollinators%20-%202000%20Report.pdf.
[13] T.D. Seeley. The Wisdom Of The Hive. Harvard University Press, 1995.
[14] F.J. Seinstra, D. Koelma, and J.M. Geusebroek. A software architecture for user
transparent parallel image processing. Parallel Computing, 28(7–8):967 – 993, 2002.
[15] Anand Lal Shimpi. AS-SSD Incompressible Sequential Performance (Samsung SSD 840 Pro (256GB) Review). 2012. http://www.anandtech.com/show/6328/samsung-ssd-840-pro-256gb-review/2.
[16] Bryan Walsh. Beepocalypse redux: Honeybees are still dying — and we still don’t know why. 2013. http://science.time.com/2013/05/07/beepocalypse-redux-honey-bees-are-still-dying-and-we-still-dont-know-why/.
[17] Wikipedia. List of crop plants pollinated by bees — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=List_of_crop_plants_pollinated_by_bees&oldid=585911208, 2013. [Online; accessed 19-March-2014].
[18] Dusan Zivadinovic. Selbst ist der Spiderman. Netzausbau: Weitere Räume und
Gebäude ans LAN anbinden. c’t, 2008. ISSN 0724-8679.