SEVENTH FRAMEWORK PROGRAMME
THEME ICT 2009.1.6
“Future Internet Experimental Facility and
Experimentally-driven Research”
Project acronym: NOVI
Project full title: Networking innovations Over Virtualized
Infrastructures
Contract no.: ICT-257867
Review of Monitoring of
Virtualized Resources
Project Document Number: NOVI-D3.2-11.02.28-v3.0
Project Document Date: 28-02-2011
Workpackage Contributing to the Project Document: WP3
Document Security Code (1): PU
Editors: Carsten Schmoll (Fraunhofer)
(1) Security Class: PU – Public, PP – Restricted to other programme participants (including the Commission), RE – Restricted to a group defined by the consortium (including the Commission), CO – Confidential, only for members of the consortium (including the Commission)
Abstract: Infrastructure as a Service (IaaS) is in the focus of current Future Internet
research and experimentation both in Europe and in the US. Monitoring of such
virtualized infrastructures is of primary importance for reliable control and management
operations and for providing experimenters with accurate monitoring tools.
Classical measurement methods are designed to operate in non-virtualized
environments. In virtualized environments, the classical metrics assume stationary
network characteristics and can therefore be misleading. This document provides a review of the
state-of-the-art monitoring approaches and tools for existing experimental facilities
under the EC FIRE and the US NSF GENI initiatives, with emphasis on monitoring aspects
related to the federation of the FEDERICA and PlanetLab Europe virtualized
infrastructures.
We also analyze state-of-the-art network measurement infrastructures, which offer
network monitoring and measurement capabilities and tools. Finally, we present the
challenges of NOVI’s federated environment.
Keywords: NOVI, Monitoring, Measurements, Virtualized Infrastructures, Federation,
Metrics, Monitoring Tools
Project Number: ICT-257867
Project Name: Networking innovations Over Virtualized Infrastructures
Document Number: NOVI-D3.2-11.02.28-v3.0
Document Title: Review of Monitoring of Virtualized Resources
Workpackage: WP3
Editors: Carsten Schmoll (Fraunhofer FOKUS)
Authors: Susanne Naegele-Jackson (DFN/University of Erlangen-Nuremberg), Álvaro Monje (UPC), Cristina Cervelló-Pastor (UPC), Alejandro Chuang (i2Cat), Celia Velayos López (i2Cat), Mary Grammatikou (NTUA), Alexandros Sioungaris (NTUA), Christos Argyropoulos (NTUA), Georgios Androulidakis (NTUA), Leonidas Lymberopoulos (NTUA), Tanja Zseby (Fraunhofer FOKUS), Carsten Schmoll (Fraunhofer FOKUS), Attila Fekete (ELTE), József Stéger (ELTE), Péter Hága (ELTE), Béla Hullár (ELTE), Bartosz Belter (PSNC), Artur Binczewski (PSNC)
Reviewers: Peter Kaufmann (DFN), Błażej Pietrzak (PSNC), Vasilis Maglaris (NTUA)
Contractual delivery date: 28-02-2011
Delivery Date: 28-02-2011
Version: V3.0
Executive Summary
This deliverable provides an overview of the state-of-the-art of monitoring methods and
tools in experimental facilities for Future Internet research. Section 2 presents the
approaches, monitoring tools, and metrics available in the experimental facilities developed
under the EC FIRE and the US NSF GENI initiatives. The main focus is on the FEDERICA and
the PlanetLab Europe virtualized infrastructures.
Section 3 gives an overview of state-of-the-art measurement infrastructures, such as the HADES and
ETOMIC active measurement infrastructures. Furthermore, other monitoring environments
that could potentially be deployed and used in virtualized infrastructures are discussed;
examples include perfSONAR, SONoMA and PHOSPHORUS. Potential
metrics that need to be measured for the needs of NOVI are also discussed and highlighted in
this section.
Section 4 discusses the technical challenges related to monitoring within NOVI’s federated
virtualized environment. An overview of the tools, metrics, and some initial monitoring
concepts are introduced. Section 5 summarizes and concludes this document.
Table of Contents
Acronyms and Abbreviations .................................................................................................. 8
List of Tables................................................................................................................................ 11
List of Figures .............................................................................................................................. 11
1. Introduction ................................................................ 13
2. Monitoring of Virtualized Platforms ........................................ 14
2.1 FEDERICA .................................................................. 14
2.1.1 SNMP Monitoring Tool from CESNET (G3 System) ............................ 14
2.1.2 VMware Client Tools ..................................................... 18
2.1.3 Nagios .................................................................. 23
2.1.4 FlowMon (NetFlow Based Monitoring) ...................................... 25
2.1.5 HADES (Hades Active Delay Evaluation System) ............................ 27
2.1.6 Slice Characteristics ................................................... 27
2.1.7 Current Monitoring Difficulties ......................................... 28
2.2 PlanetLab ................................................................. 29
2.2.1 CoMon ................................................................... 30
2.2.2 MyOps ................................................................... 36
2.2.3 TopHat .................................................................. 40
2.3 VINI ...................................................................... 40
2.4 Panlab .................................................................... 42
2.4.1 Traces Collection Framework ............................................. 42
2.4.2 Omaco ................................................................... 44
2.5 ProtoGENI/Emulab .......................................................... 46
2.5.1 Measurement Capabilities ................................................ 46
2.6 WISEBED ................................................................... 47
2.7 ORBIT ..................................................................... 49
2.8 ORCA/BEN .................................................................. 50
2.9 OneLab .................................................................... 51
2.9.1 ScriptRoute ............................................................. 51
2.9.2 CoMo .................................................................... 51
2.9.3 APE ..................................................................... 53
2.9.4 ANME .................................................................... 53
2.9.5 Packet Tracking ......................................................... 54
2.9.6 ETOMIC .................................................................. 55
2.9.7 DIMES ................................................................... 55
2.10 GpENI .................................................................... 56
2.10.1 Cacti .................................................................. 56
2.11 4WARD .................................................................... 57
2.11.1 Monitored Resources .................................................... 57
2.11.2 Algorithms ............................................................. 57
2.11.3 Measurements ........................................................... 58
2.12 Comparison of Current Monitoring Frameworks for Virtualized Platforms ... 59
3. Network Measurement Infrastructures ........................................ 60
3.1 ETOMIC .................................................................... 60
3.2 HADES ..................................................................... 63
3.3 PerfSONAR ................................................................. 69
3.4 SONoMA .................................................................... 71
3.5 PHOSPHORUS ................................................................ 73
4. Discussion on NOVI’s Monitoring ............................................ 75
4.1 Challenges on NOVI’s Monitoring ........................................... 75
4.2 NOVI’s Monitoring Requirements ............................................ 78
5. Conclusions ................................................................ 80
6. References ................................................................. 81
Appendix A: Comparison Table of Current Monitoring Frameworks ................. 86
Appendix B: Potential Metrics in a Virtualized Environment .................... 87
Appendix C: FEDERICA and PlanetLab tools and related metrics .................. 90
Acronyms and Abbreviations
ACM: Association for Computing Machinery
ANME: Advanced Network Monitoring Equipment
APE: Active Probing Equipment
API: Application Programming Interface
AS: Autonomous System
ASN: Abstract Syntax Notation
BEN: Breakable Experimental Network
BWCTL: Bandwidth Test Controller
CGI: Common Gateway Interface
CLI: Command Line Interface
CMS: Central Management System
CoS: Class of Service
CPU: Central Processing Unit
DANTE: Delivery of Advanced Network Technology to Europe
DFN: Deutsches Forschungs-Netzwerk
DICE: Dante-Internet2-Canarie-Esnet (cooperation)
DNS: Domain Name System
FI: Future Internet
FIRE: Future Internet Research and Experimentation
FMS: FEDERICA Monitoring System
GAP: Generic Aggregation Protocol
GENI: Global Environment for Network Innovations
GLIF: Global Lambda Integrated Facility
GPO: GENI Project Office
GPS: Global Positioning System
GUI: Graphical User Interface
HADES: Hades Active Delay Evaluation System
HRN: Human Readable Name
HTML: Hypertext Markup Language
HTTP: Hypertext Transfer Protocol
ICMP: Internet Control Message Protocol
ICT: Information and Communication Technologies
IETF: Internet Engineering Task Force
IMS: IP Multimedia Subsystem
IPFIX: IP Flow Information eXport (protocol)
IPPM: IP Performance Metrics
ISBN: International Standard Book Number
LAN: Local Area Network
LBNL: Lawrence Berkeley National Laboratory
LNCS: Lecture Notes in Computer Science
MA: Monitoring Agent
MRTG: Multi Router Traffic Grapher
NDL: Network Description Language
NOC: Network Operations Center
NREN: National Research and Education Network
NSF: National Science Foundation
NTP: Network Time Protocol
OFA: Open Federation Alliance
OMACO: Open Source IMS Management Console
OMF: cOntrol & Management Framework
OSDI: Operating Systems Design and Implementation (conference)
OSPF: Open Shortest Path First (routing protocol)
OWAMP: One-Way Active Measurement Protocol
OWD: One-Way Delay
OWDV: One-Way Delay Variation
PCU: Power Control Unit
POSIX: Portable Operating System Interface (for UNIX)
QoS: Quality of Service
RDL: Resource Description Language
REST: Representational State Transfer
RFC: Request for Comments
RRD: Round-Robin Database
RTT: Round Trip Time
SFA: Slice-based Federation Architecture
SNMP: Simple Network Management Protocol
SOA: Service Oriented Architecture
SOAP: Simple Object Access Protocol
SONoMA: Service Oriented NetwOrk Measurement Architecture
SSH: Secure Shell
SSL: Secure Sockets Layer
TARWIS: Testbed Architecture for Wireless Sensor Networks
TCC: Traces Collection Client
TCP: Transmission Control Protocol
TCS: Traces Collection Server
TERENA: Trans-European Research and Education Networking Association
TLS: Transport Layer Security
UDP: User Datagram Protocol
URL: Uniform Resource Locator
URN: Uniform Resource Name
VHDL: VHSIC Hardware Description Language
VHSIC: Very-High-Speed Integrated Circuits
VINI: Virtual Network Infrastructure
VLAN: Virtual Local Area Network
VM: Virtual Machine
VN: Virtual Network
WSDL: Web Service Definition Language
WSN: Wireless Sensor Network
XML: eXtensible Markup Language
List of Tables
Table 1: Monitoring systems and tools used by PlanetLab administrators and users ............ 30
Table 2: Metrics provided by CoMon’s node-centric daemon ................................................ 33
Table 3: Metrics provided by CoMon’s slice-centric daemon ................................................. 34
Table 4: HADES measurements in various research projects.................................................. 65
List of Figures
Figure 1: G3 modules and relations (generic scheme) ............................................................ 15
Figure 2: Basic orientation map of the FEDERICA infrastructure ............................................ 17
Figure 3: View of interfaces on the routers and switches ....................................................... 18
Figure 4: The basic characteristics of a Vnode ........................................................................ 19
Figure 5: Vnodes of FEDERICA with basic status and parameters.......................................... 20
Figure 6: Vnode load and traffic monitoring - CPU usage in week scale................................. 21
Figure 7: Network monitoring in Vnodes: Network activity 24 hour scale ............................. 21
Figure 8: CPU usage in month scale ........................................................................................ 22
Figure 9: Detailed CPU usage in 24h scale............................................................................... 22
Figure 10: Network data transmitted in a yearly scale. .......................................................... 23
Figure 11: Nagios interface...................................................................................................... 24
Figure 12: FlowMon probe with installed plugins ................................................................... 25
Figure 13: FlowMon slice (VLAN) communication chart ......................................................... 25
Figure 14: FlowMon OSPF traffic communication chart ......................................................... 26
Figure 15: FlowMon monitoring center: Visualization of virtual nodes’ (VMware)
communication bandwidth ............................................................................................. 26
Figure 16: Virtual machines table of slice “isolda” .................................................................. 27
Figure 17: Virtual network (vmnic-slice relation) .................................................................... 28
Figure 18: The MyPLC software stack (taken from [MyOps]) ................................................. 36
Figure 19: MyOps management structure (taken from [MyOps]) .......................................... 37
Figure 20: PlanetLab nodes’ PCU state (screenshot taken on the MyOps site) ...................... 38
Figure 21: MyOps action summary (screenshot taken on the MyOps site) ............................ 39
Figure 22: VINI nodes on Internet2 ......................................................................................... 41
Figure 23: VINI nodes on NLR .................................................................................................. 41
Figure 24: Panlab traces collection framework structure ....................................................... 43
Figure 25: Wireshark user interface ........................................................................................ 44
Figure 26: OMACO screenshot ................................................................................................ 45
Figure 27: OMACO as part of Panlab’s Teagle installation ...................................................... 45
Figure 28: Control plane interface of ProtoGENI .................................................................... 46
Figure 29: WISEBED system architecture ................................................................................ 48
Figure 30: OneLab user interface showing host status ........................................................... 52
Figure 31: MySlice overview of networks ............................................................................... 52
Figure 32: OneLab’s APE box ................................................................................................... 53
Figure 33: ANME’s ARGOS measurement card. ...................................................................... 54
Figure 34: Map showing a packet tracking scenario involving different nodes from testbeds
such as VINI, G-Lab, PlanetLab and others. ..................................................................... 55
Figure 35: Cacti screenshot ..................................................................................................... 56
Figure 36: Detection of threshold-crossing ............................................................................. 58
Figure 37: Network delay tomography, showing delay estimates between different............ 60
Figure 38: ETOMIC system architecture .................................................................................. 61
Figure 39: Extraction of network monitoring statistics ........................................................... 62
Figure 40: Time evolution of the number of experiments ...................................................... 63
Figure 41: Worldwide measurements using HADES................................................................ 64
Figure 42: Current nodes of FEDERICA that can be monitored by HADES tools ..................... 66
Figure 43: Website - HADES available measurement links for FEDERICA ............................... 66
Figure 44: Request of the specific measurements over each link ........................................... 67
Figure 45: HADES statistics. RedIRIS - CESNET ........................................................................ 68
Figure 46: PerfSONAR User Interface ...................................................................................... 70
Figure 47: SONoMA Measurement Agents Status Interface ................................................... 72
Figure 48: PHOSPHORUS test-bed architecture overview. ..................................................... 73
Figure 49: PHOSPHORUS monitoring system architecture. .................................................... 74
Figure 50: PHOSPHORUS monitoring visualization. ................................................................ 74
Figure 51: A preliminary conceptual view of NOVI’s Monitoring architecture ........................... 76
1. INTRODUCTION
NOVI’s vision with respect to monitoring in federated virtual infrastructures stems from the
acknowledgement that different virtualized platforms employ a wide variety of different
monitoring methods, tools, algorithms, access methods and measurements. In this
document, the tools, algorithms and metrics related to the major EC FIRE and US NSF GENI
platforms are presented.
The goal of this document is to discuss and analyze the state-of-the-art of monitoring and
measurement of current experimental Future Internet platforms, focusing on solutions for
federated environments. We start (section 2) by illustrating the monitoring frameworks of
virtualized platforms and continue (in section 3) with the description of currently deployed
measurement infrastructures, such as HADES and ETOMIC.
In section 4, we analyze the challenges and requirements for a monitoring framework for
NOVI’s federated environment. Based on this analysis, we provide the basis for decisions
that will be undertaken in NOVI regarding monitoring and measurement methods and tools.
This deliverable is structured as follows:
- Section 2 describes the approaches, tools, and metrics available from the experimental networking facilities, with the main focus on the FEDERICA and PlanetLab platforms.
- Section 3 provides an overview of existing network measurement infrastructures, highlighting their features and tools.
- Section 4 summarizes this document with an overview of the tools, metrics, architectural and technical challenges of monitoring within a federation of heterogeneous virtualized platforms.
- Section 5 concludes the document and summarizes the requirements that need to be addressed as part of the design and implementation of NOVI’s monitoring framework for a federated virtualized environment.
2. MONITORING OF VIRTUALIZED PLATFORMS
2.1 FEDERICA
The FEDERICA project [FEDERICA] is a Europe-wide “technology agnostic” infrastructure
based upon Gigabit Ethernet circuits, transmission equipment and computing nodes capable
of virtualization, built to host experimental activities and to test new architectures and protocols
for the Future Internet. More information about FEDERICA’s control and management planes
can be found in the NOVI deliverable [D3.1].
The FEDERICA Monitoring System (FMS) can be seen from at least three different views: the
physical view, the node-centric view, and the slice-centric view. The node-centric view
contains information about all virtual instances created on top of the physical infrastructure
in each node, while the slice-centric view gives information about how slices are spread
across the network. This variety of views, environments and available tools introduces
additional complexity.
In the FEDERICA project, monitoring data is currently offered by FEDERICA partner CESNET
and by the University of Erlangen/DFN. Both monitoring systems provide different types of
monitoring data: CESNET provides a detailed map with all FEDERICA nodes and ports. With a
click on one of these network components listed on the map, more statistics can be
accessed. Information provided via the CESNET URL includes average bit rates over the links,
descriptions of the network interfaces, physical addresses, and system uptimes for each
node, as well as ICMP and SNMP information. Furthermore, the monitoring system is
complemented with Nagios, VMware tools and the FlowMon Monitoring Center.
While this information is important, it does not provide sufficient statistics in the area of
network performance. The University of Erlangen/DFN has therefore expanded FEDERICA
monitoring to include IP Performance Metrics (IPPM) such as one-way delay, delay variation
and packet loss. The measurements are conducted using the perfSONAR [PERFSONAR]
measurement suite, in particular the measurement tool HADES (Hades Active Delay
Evaluation System). For more details see sections 3.2 and 3.3.
FMS spans all types of resources managed inside the project: routers and switches and their
interfaces, links, and virtual machines (VMs). The next sections explain how they are
monitored, what is measured, and which tools are used for each kind of resource.
2.1.1 SNMP Monitoring Tool from CESNET (G3 System)
The G3 System [G3System] is a monitoring system based on standard measurement and
data export methods via SNMP, with non-standard measurement timing, specific data
processing and an advanced user interface. The G3 system is designed to provide large-scale
continuous network infrastructure monitoring. Its data processing mechanisms ensure
automated adaptation to real device reconfigurations and provide a continuous and flexible
mapping between the technological structure (given by SNMP agents) and the logical
structure (human point of view) of the measured devices. The user interface is able to
aggregate measured data while retrieving it from the storage; therefore the whole user
interface is designed to be as interactive as possible.
As shown in Figure 1, the G3 system has an additional standalone module, the G3 reporter,
to satisfy visualization requirements of different groups of users. It generates “static” reports
in the form of HTML pages and images to make user navigation easy and fast.
Figure 1: G3 modules and relations (generic scheme)
The G3 system is designed to provide large-scale measurements of the network
infrastructure, therefore a huge number of measured objects can be expected (tens of
devices, thousands of interfaces). Administrators need filtering mechanisms to find all
requested objects quickly and at once. The current G3 implementation of the searching and
filtering part is very simple but satisfies most typical requests.
The internal data aggregation mechanisms are supported separately in the visualization
modules. The aggregated values may be sums, minimums, maximums, or whatever a user is
interested in. For more information about G3, refer to [G3System].
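G3 itself is accessed through its own web interface, but the underlying data collection is plain SNMP. As an illustrative sketch (not taken from the G3 code base), the following Python fragment polls the standard IF-MIB interface counters that such a system typically gathers, using the net-snmp command-line tools; the host name, community string and interface index are placeholders.

import subprocess

# Standard IF-MIB objects commonly polled for link utilization reports.
OIDS = {
    "ifDescr":      "IF-MIB::ifDescr",
    "ifOperStatus": "IF-MIB::ifOperStatus",
    "ifInOctets":   "IF-MIB::ifInOctets",
    "ifOutOctets":  "IF-MIB::ifOutOctets",
}

def poll_interface(host, community, if_index):
    """Query one interface of an SNMP agent and return its counters as a dict."""
    values = {}
    for name, oid in OIDS.items():
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", community, "-Ovq", host, f"{oid}.{if_index}"],
            capture_output=True, text=True, check=True,
        )
        values[name] = out.stdout.strip()
    return values

if __name__ == "__main__":
    # Placeholder values; a real deployment would iterate over all devices and interfaces.
    print(poll_interface("router.example.net", "public", 1))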
2.1.1.1 Access Restrictions
Access to G3 System monitoring data is provided via the Web
(https://tom1.cesnet.cz/netreport/FEDERICA_utilization1/), showing generated network
measurement reports with statistical data published only for FEDERICA users. In order to
provide access to FEDERICA developers, there is a generic user name which is common to all
FEDERICA partners.
Similar access rules are implemented for external users of FEDERICA. However, these
kinds of users only have access to data which belongs to the slice in which they work and to
the Vnodes which are part of their slice(s).
For developers who need online monitoring data there is direct web access to a
standalone directory subtree. The data are provided for all currently monitored objects.
Lower-layer directories contain per-item exports for the appropriate objects. Expired export
files are removed after 20 minutes. All exports are in the RRDTool [RRDTool] XML export format.
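Since these per-item export files are plain RRDTool XML, a consumer can parse them with any XML library. A minimal sketch in Python, assuming the usual RRDTool layout of row elements carrying v value fields (an optional t timestamp element appears in xport-style output; the exact schema depends on the RRDTool version and export command used):

import xml.etree.ElementTree as ET

def read_rrd_export(path):
    """Return the rows of an RRDTool XML export file.

    Each row is a (timestamp, values) tuple; the timestamp is None when the
    file does not carry a per-row <t> element.
    """
    tree = ET.parse(path)
    rows = []
    for row in tree.iter("row"):
        t = row.findtext("t")
        values = []
        for v in row.findall("v"):
            txt = (v.text or "").strip()
            values.append(None if txt in ("", "NaN") else float(txt))
        rows.append((int(t) if t is not None else None, values))
    return rows

if __name__ == "__main__":
    # Placeholder file name; real exports are fetched from the G3 directory subtree.
    for ts, vals in read_rrd_export("interface_42.xml")[:5]:
        print(ts, vals)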
2.1.1.2 SNMP Measurements
Using the G3 system, all network devices connected via physical links (physical level view)
can be easily monitored. Similarly, all of the virtual paths on the core routers can be
monitored, and information about I/O on all child virtual ports that are linked to the
corresponding physical port can be made available. Reports are provided for all devices
installed in FEDERICA. Reports (mainly in graphical form) contain counter values of I/O
operations, error counters, interface status, packet distributions on different time scales and
round trip time (RTT). When accessing the G3 System (see the “Access Restrictions“
subsection), the user can find a map in which all interfaces on core routers are displayed,
pointing to Vnode interfaces at a particular site. This map (shown in the following figure)
is called the “Basic Orientation Map” and it is the most useful graphical interface.
Figure 2: Basic orientation map of the FEDERICA infrastructure
On this map users can immediately see the load of the links (represented by colours). In
addition, users can get detailed graphs about traffic on a link with a simple click. It is
also possible to access full information about the VMware instances running on a Vnode by
clicking on an existing VM, and to see network traffic on each of its interfaces (vmnics) by
clicking on the link between two resources.
The only way to get information about the networking structure in real time is to read the
description of interfaces from all routers and switches. Therefore a list of interfaces is
prepared, which is available to all FEDERICA users via the basic web pages as shown in the
following figure:
Figure 3: View of interfaces on the routers and switches
2.1.2 VMware Client Tools
VMware provides a range of tools and utilities for checking the state of the virtual machines
and their services and resources. The VMware Infrastructure Client is used to obtain
information about the status of individual Virtual Nodes.
The VMware Remote CLI provides a command-line interface for data-center management
from a remote server. Through this tool it is possible to acquire the desired monitoring
statistics from virtual machines (as with the VMware Infrastructure Client) by remotely
running several queries and commands; unlike the Remote CLI, the VMware Infrastructure
Client uses a GUI to obtain and show statistics.
The VMware API provides information to the FMS in order to obtain monitoring data
for Virtual Nodes. The VI API works as a Web service hosted on the ESXi Server. The API
provides access to the VMware Infrastructure management components (monitoring
components among others).
To access the VMware Infrastructure Client (VI Client), a user must log on from his PC through
the tool into the server (FEDERICA node) under a username which grants him access to the
physical machine. Typically such user IDs are not assigned to researchers; instead, access
to this tool is intended mainly for NOC personnel. The VI Client can only access a single node
at a time, which is a limitation of this method. When the NOC logs into the server, it gets
an information window giving control over that particular node.
The VMware API and Remote CLI use the same user authentication as the VMware
Infrastructure Client.
It is also possible to access VMware status, characteristics and traffic statistics through the
basic web page of FEDERICA: https://tom1.cesnet.cz/vmware-test/sledovac.html
2.1.2.1 VMware Measurements
The VMware Infrastructure remote command line interface (CLI) provides access to the
data centre management console from a remote server. It is possible to run several queries
and commands on the virtual machines remotely and acquire the desired monitoring
statistics. For FEDERICA the command “VMware-cmd” has been used in order to perform
virtual machine operations remotely – for example, creating a snapshot, powering the
virtual machines on or off, or getting information about the state of the virtual machine.
Figure 4: The basic characteristics of a Vnode
As far as Vnode measurements are concerned, FEDERICA mostly requires the current
status of Vnodes. This includes monitoring the load and utilization of fundamental resources
(such as CPU and memory usage) and monitoring the amount of networking traffic which
terminates at individual interfaces (vmnics). Figure 4 shows how this information is
presented for the VMs deployed on a node.
For this task the VMware Infrastructure Perl Toolkit [VI-Toolkit] provides an easy-to-use Perl
scripting interface to the VMware Infrastructure API. Moreover, the VI Perl Toolkit includes
numerous utility applications ready to run for querying virtual machines.
The following example call lists virtual machines present at the host “host_address”:
$ VMware-cmd --server <host_address> --username <username> --password <pass>
This VI API provides access to the VMware infrastructure managed objects that can be used
to access, monitor, and control life-cycle operations of virtual machines and other VMware
infrastructure components.
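The toolkit utilities are invoked from the command line, so monitoring scripts typically just wrap such calls. Below is a minimal Python sketch wrapping the VMware-cmd invocation shown above; the "-l" (list registered VMs) and "getstate" operations follow standard vmware-cmd usage and are assumed to be available in the Remote CLI version deployed in FEDERICA, and the host and credentials are placeholders.

import subprocess

def vmware_cmd(server, username, password, *args):
    """Run a remote vmware-cmd invocation and return its stdout lines."""
    cmd = ["vmware-cmd", "--server", server,
           "--username", username, "--password", password, *args]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.strip().splitlines()

def vnode_vm_states(server, username, password):
    """List the registered VMs on one Vnode and query each VM's power state."""
    states = {}
    # "-l" lists the configuration file paths of all registered virtual machines.
    for cfg in vmware_cmd(server, username, password, "-l"):
        # "getstate" reports whether the VM is on, off or suspended.
        states[cfg] = vmware_cmd(server, username, password, cfg, "getstate")
    return states

if __name__ == "__main__":
    # Placeholder credentials; in FEDERICA these would be the NOC account for the Vnode.
    print(vnode_vm_states("host_address", "username", "password"))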
Figure 5: Vnodes of FEDERICA with basic status and parameters
The figure above shows the status and parameters of some of the VMware servers of
FEDERICA. A green hostname means that the Vnode is active, while red means it is
inaccessible.
The monitoring data is collected in 5-minute intervals and displayed on charts at yearly,
monthly, weekly and 24-hour scales. These graphs show the situation of individual Vnodes
and the situation in the networks linked to the Vnode.
As it is not possible to monitor ESXi servers at the physical level (PN) via SNMP, we cannot
monitor them at the virtual level (VN) via SNMP tools either. Thus a proprietary VMware
software solution needs to be used instead.
Several tools have been developed in order to: 1) get the status of the Vnodes; 2) monitor
the load of individual Vnodes and the utilization of fundamental resources such as individual
CPUs and memory; 3) monitor networking traffic in the networks which terminate at
individual vmnics; and 4) make all monitoring data from the node level accessible to all
FEDERICA users.
Some examples of this monitoring can be seen in the following figures:
Figure 6: Vnode load and traffic monitoring - CPU usage in week scale
Figure 7: Network monitoring in Vnodes: Network activity 24 hour scale
The next figure depicts CPU usage (%) of a certain Vnode within the current month. The
x-axis shows weeks of the month and the y-axis shows the global percentage of CPU usage
(value across all available CPUs).
Figure 8: CPU usage in month scale
Another example of CPU usage is shown in the graph below, which illustrates how the
individual CPUs are used, on a 24-hour scale.
Figure 9: Detailed CPU usage in 24h scale
VMware allows monitoring of individual values of CPU usage, memory usage, disk access and
networking activity (in 20-second intervals). This monitoring granularity can be used on
special request of the user.
Moreover, the network activity linked with a particular Vnode can also be monitored. The
next figure shows an example of these types of measurements.
Figure 10: Network data transmitted in a yearly scale.
In the figure above, each colour represents a physical interface (vmnic) of the server. The x-axis
shows the months of the year; the y-axis shows the transmitted data rate (in
Mbps).
2.1.3 Nagios
Unfortunately, the implementation of the SNMP agent in VMware ESXi server version 3 is
insufficient for the FEDERICA monitoring requirements. Thus, FEDERICA uses the Nagios
[Nagios] monitoring system to monitor the VMware ESXi servers. Nagios is a powerful
open-source monitoring system that enables network administrators to identify and resolve
network infrastructure problems, and it is very popular among them. Nagios
monitors entire network infrastructures including servers, network devices, other systems,
applications, services, and business processes, and verifies their proper functioning. In the
event of a failure, Nagios can trigger an alert to inform network administrators about the
problem.
Nagios works either in an active fashion, probing services at regular intervals, or together
with local monitoring agents which supply it with pre-processed results, in case external
querying by the Nagios engine would be impossible, e.g. due to security constraints.
There exists a wide range of third-party Nagios plugins (mostly open source) for different
server and service checks. For VMware monitoring with Nagios there are publicly available
tools like the VI Perl Toolkit or the VMware Remote CLI.
The following figure depicts the typical look of a Nagios installation:
Figure 11: Nagios interface
Nagios is sometimes also used together with the Cacti tool [Cacti], which itself is a frontend
to RRDTool [RRDTool]. It is used, among others, by the GpENI testbed; see section 2.10.1 for
more details.
The Nagios monitoring is only available to FEDERICA administrators. Similarly to the G3
System, access is allowed only with the generic FEDERICA account, via the basic web pages
(https://FEDERICA6.cesnet.cz/nagios/).
Nagios provides a simple plug-in design which allows users to easily develop their own
service checks depending on their needs. For this reason a Nagios plug-in was deployed on
the FEDERICA facility in order to monitor the VMware servers in a suitable and integrated way.
The usual information that needs to be monitored in the case of FEDERICA virtualization
servers is their overall reachability (e.g. via ping) and their status (running/responding,
information about virtual machines running on the server, CPU load, and memory
occupancy).
Nagios plugins can be deployed in the form of compiled executables or scripts (Perl scripts,
shell scripts, etc.) that can be run from a command line in order to check the status of a
host or service.
Moreover, in the FEDERICA context, Nagios executes a scheduled plugin call in order to
check the status of a service or host. The plugin returns the results to Nagios, which then
processes them and takes any necessary actions (e.g. running event handlers, or sending out
notifications). Plugins act as an abstraction layer between the monitoring logic present in the
Nagios daemon and the actual services and hosts that are monitored.
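To make the plugin mechanism concrete, the sketch below shows the typical shape of a custom service check, assuming only the standard Nagios plugin conventions (one line of status output plus exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN); the ping-based reachability check mirrors the FEDERICA use case but is not the plugin actually deployed there.

#!/usr/bin/env python3
"""Minimal Nagios-style reachability check (illustrative sketch)."""
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_ping(host, count=3, warn_ms=100.0, crit_ms=500.0):
    try:
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True, timeout=30)
    except Exception as exc:
        return UNKNOWN, f"UNKNOWN - ping failed to run: {exc}"
    if out.returncode != 0:
        return CRITICAL, f"CRITICAL - {host} unreachable"
    # Parse the "rtt min/avg/max/mdev = a/b/c/d ms" summary line.
    avg_ms = None
    for line in out.stdout.splitlines():
        if "min/avg/max" in line:
            avg_ms = float(line.split("=")[1].split("/")[1])
    if avg_ms is None:
        return UNKNOWN, "UNKNOWN - could not parse ping output"
    if avg_ms >= crit_ms:
        return CRITICAL, f"CRITICAL - {host} RTT {avg_ms:.1f} ms"
    if avg_ms >= warn_ms:
        return WARNING, f"WARNING - {host} RTT {avg_ms:.1f} ms"
    return OK, f"OK - {host} RTT {avg_ms:.1f} ms"

if __name__ == "__main__":
    # Host name is a placeholder; Nagios passes real targets via its command definitions.
    code, message = check_ping(sys.argv[1] if len(sys.argv) > 1 else "localhost")
    print(message)      # the first output line is shown in the Nagios UI
    sys.exit(code)      # the exit code tells Nagios the service state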
2.1.4 FlowMon (NetFlow Based Monitoring)
NetFlow [NetFlow] is a network protocol developed by Cisco for gathering IP traffic
information, especially data related to traffic flows in the network. It is also supported by
other platforms such as Juniper routers, Linux or FreeBSD.
The FlowMon appliance generates NetFlow data with enhanced VLAN information and sends
the data to a FlowMon monitoring centre.
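As background on what such a probe exports, the sketch below decodes the fixed 24-byte header of a NetFlow version 5 datagram in Python; the field layout follows the publicly documented v5 format, while FlowMon's own VLAN enhancements are carried in additional export fields and plugins not shown here.

import struct

# NetFlow v5 header: version, count, sys_uptime, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval (24 bytes, big-endian).
V5_HEADER = struct.Struct("!HHIIIIBBH")

def parse_v5_header(datagram: bytes) -> dict:
    """Decode the NetFlow v5 header of one exported UDP datagram."""
    (version, count, sys_uptime, unix_secs, unix_nsecs,
     flow_sequence, engine_type, engine_id, sampling) = V5_HEADER.unpack_from(datagram)
    if version != 5:
        raise ValueError(f"not a NetFlow v5 datagram (version={version})")
    return {
        "version": version,
        "flow_records": count,          # number of 48-byte flow records that follow
        "sys_uptime_ms": sys_uptime,
        "export_time": unix_secs,
        "flow_sequence": flow_sequence,
        "sampling_interval": sampling & 0x3FFF,
    }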
Figure 12: FlowMon probe with installed plugins
The basic FlowMon Probe web interface (figure above) provides access to the functionalities
mentioned before and permits access to many monitoring tools from one place.
Through the FlowMon Probe web interface, a user can access the FlowMon Configuration
Center, FlowMon Monitoring Center, Nagios, and the G3 System (see Figure 12). Access to
the FlowMon Configuration Center is limited to administrators.
NetFlow measurements in FEDERICA (FlowMon)
NetFlow based monitoring in FEDERICA was deployed at the CESNET PoP in Prague. It can
monitor the backbone links to Athens, Erlangen, Milan and Poznan. The FlowMon
monitoring centre collects the NetFlow data and provides detailed traffic statistics. This tool
can easily monitor traffic to/from virtual nodes, slice-specific (VLAN) traffic as shown in
Figure 13, backbone related traffic and more.
Figure 13: FlowMon slice (VLAN) communication chart
Users can add new traffic profiles, for example for monitoring of OSPF traffic (see Figure 14).
NetFlow-based alerting can be used to send an alert in case of a problem in the FEDERICA
network. Figure 15 shows how virtual nodes’ bandwidth is displayed in the FlowMon
monitoring centre.
Figure 14: FlowMon OSPF traffic communication chart
Figure 15: FlowMon monitoring center: Visualization of virtual nodes’ (VMware)
communication bandwidth
2.1.5 HADES (Hades Active Delay Evaluation System)
HADES [HADES] is a monitoring tool that was designed for multi-domain delay evaluations
with continuous measurements and is used in FEDERICA for monitoring IP Performance
Metrics. HADES incorporates an active measurement process which allows layer 2 and layer
3 network measurements and produces test traffic which delivers high resolution network
performance diagnostics. Access to the measurement data is available through the
perfSONAR-UI graphical interface (http://www.perfSONAR.net/). See section 3.2 for details
on HADES.
The monitoring of IP Performance Metrics in FEDERICA is conducted with the HADES monitoring system that is described in detail in section 3.2 of this deliverable. IPPM data of
FEDERICA such as one-way delay, one-way delay variation and packet loss over selected links
can be accessed via http://www.win-labor.dfn.de/cgi-bin/hades/map.pl?config=FEDERICA;map=star
The FEDERICA NOC is one of the primary beneficiaries of the extended monitoring, as with
this information the NOC can know when certain links are starting to become congested.
2.1.6 Slice Characteristics
When FEDERICA is running in manual management mode, data is collected from individual
resources independently (from description of networking interfaces, VMware configurations
and from the NOC web pages). The information referencing a single slice is grouped together
based on a unique slice name. At the moment it is possible to see the detected slices from
the User Access Server (UAS) at one of FEDERICA’s monitoring web pages:
https://tom1.cesnet.cz/VMware-test/sledovac/slices.html. Viewing that page requires the
generic user account. The page lists the named slices; when a slice is clicked, the names of
the installed VMs and the node on which each one is installed are shown. An example of
one FEDERICA slice is shown in Figure 16.
In addition, when the information about one node is presented (by the VMware client tool),
not only the VM status is shown, but also the slices that are allocated in that host.
Figure 16: Virtual machines table of slice “isolda”
Regarding the relations between slice and the Vnode-Hardware, the monitoring system
collects data from individual subsystems independently. In order to put together complex
information for a slice there is a need to combine independent data received from different
monitors (SNMP, VMware) and from different Vnodes and Vrouters. Information about slice
structure and resource allocations is created at the moment of slice creation. The vmnic
(physical interface) can be allocated to one VM and to one slice. This vmnic allocation to a
particular entity (slice, VM or hypervisor) can be read from a VMware configuration file
and displayed in the VI Client (see the following figure).
Figure 17: Virtual network (vmnic-slice relation)
2.1.7 Current Monitoring Difficulties
There are currently two main difficulties regarding monitoring in FEDERICA that possibly
hold true also for other virtualized platforms:
The first one stems from the fact that the FMS collects data from individual subsystems
independently; in order to put together complete information for a slice, independent data
received from different monitors (SNMP, VMware) and from different virtual nodes and
virtual routers has to be combined. It is not trivial to deduce which information belongs to
which particular slice. There is a need for accurate, timely information about slice structure
and resource allocations; such information is created at the moment of slice creation. As an
example of this complexity, one can configure a vmnic X to be used by a group of slices, or
by hosts inside the same slice. With this schema it is impossible to separate local (slice)
traffic on vmnic X from external traffic, and it is difficult to analyze external traffic to other
slices if they are assigned to the same vmnic. This leads to a quite complicated process for
finding the correct mappings for each particular slice.
The second one concerns the VMware Client. It is a perfect client for standard utilization and
management of VMware systems: the NOC can use it to manually create a VM for a slice,
and the node site manager can easily control all tasks running on the node. For the slice
concept used in FEDERICA, where users have slices that run on different nodes, this tool is
not very effective or applicable. When the number of nodes grows and more slices are
created, the managers will have a lot of work to do; at that point manual control of such an
environment becomes slow and ineffective, and it is better to use other, more automated
tools for management.
2.2 PlanetLab
PlanetLab [Planetlab] is a global testbed which allows researchers to run their experiments
across the real Internet, using widely distributed end systems to run their software on. It
consists of a collection of machines distributed over the globe, a common software base
package [MyPLC], an overlay network for building one’s own virtual testbed for experiments,
and support for getting and keeping it (and the experiments) up and running.
Technically the separation of different experiments, which can be run in parallel on
PlanetLab, is achieved through techniques for application level virtualisation [VServer] and
network isolation. This means that experiments are not fully free of side-effects (i.e. when
running two different experiments on two VMs on the same host), however they are
logically separated, and running/scheduling them is much easier compared to exclusive,
limited-time access to hosts for dedicated experiments.
For this purpose PlanetLab provides its experimenters with virtualized UNIX machines around the
world that can be booked to run experiments on, connected via the real Internet. A set of
those machines booked for an experimenter is called a slice. To save resources, physical
machines are not booked to slices or users exclusively, but in a shared manner. Each
PlanetLab host can provide, via virtualization, multiple machines active at the same time to
the experimenter community. Each single virtual machine on a physical machine is called a
sliver. So multiple slivers, logically bound together, form the above-mentioned experimental
slice.
For PlanetLab, the VServer [VServer] virtualization technology was chosen as a lightweight
system, providing virtual machine abstraction by operating system level virtualization. With
this lightweight technique, virtual servers are provided on the operating system (kernel)
layer. Each such virtual machine looks and feels like a real server, from the point of view of
its owner. Technically however, in this setup, the virtual machines share a common kernel,
which means they must all run a Linux operating system in the PlanetLab case. They may,
however, run different Linux distributions, since most of those are quite independent of the
actual kernel version that is used. The VServer kernel provides the abstraction and
protection mechanisms to separate and shield the different virtual machines from each
other. It also provides multiple instances of some components, one for each VM, e.g. the
routing table and the network (IP) stack.
Both PlanetLab’s administrators and end-users have been using a wide variety of monitoring
systems and tools to examine the operational state of PlanetLab nodes and to debug possible
problems affecting user experiments. Such monitoring systems and tools are listed in
the following table.
Monitoring System/Tool | Objective | Operational Status
[Chopstix] | Diagnoses kernel functionality and observes process scheduling (runtime monitoring) | Active
[CoVisualize] | Creates visualizations of PlanetLab usage | Active
[Ganglia] | Monitors high performance computing systems (i.e. clusters, Grids) | Inactive
[iPerf] | Measures the network bandwidth | Inactive
[PlanetFlow] | Logs the network traffic and provides a query system for data | Active
[PLuSH] | Manages the whole “cycle” of an experiment | Active
[SWORD] | Manages distributed resource discovery queries | Active
[S33] | Provides a snapshot of all-pair capacity and available bandwidth metrics | Active
Trumpet or [PsEPR] | Monitors the node health using an open repository of data gathered from nodes | Active
Table 1: Monitoring systems and tools used by PlanetLab administrators and users
2.2.1 CoMon
All of the aforementioned monitoring systems and tools provide node-centric measurements.
However, many experiments require slice-centric measurements, where a slice is a
collection of virtual machines (i.e. slivers in the PlanetLab terminology) deployed across a
number of different PlanetLab nodes. To provide slice-centric measurements, a scalable
monitoring system, CoMon [CoMon], was implemented by Princeton University.
2.2.1.1 Requirements of the CoMon Monitoring Framework
At first, CoMon was implemented as a service-specific monitoring system for [CoDeeN], an
experimental Content Distribution Network that is deployed by Princeton within the
PlanetLab testbed. CoDeeN is a network of high-performance proxy servers, which
cooperate in order to provide a fast Web-Content Delivery service. CoMon provided CoDeeN
the ability to display historical aggregated values of various data metrics in a user-friendly
way, provided as plain HTML tables over the Web.
However, after a certain period of time, CoMon was separated from the CoDeeN monitoring
system, becoming an independent monitoring system for the PlanetLab testbed. This
happened because CoMon needed to evolve in a different direction than CoDeeN in order to
provide solutions for a number of monitoring-related problems which were not within
the scope of the CoDeeN project. The problems addressed separately by CoMon are the following:
- CoDeeN could not take into consideration some metrics, such as the accountability of nodes in low-bandwidth PlanetLab sites.
- Privacy was not addressed by the CoDeeN project, as researchers were allowed to see the status of other experiments via UNIX tools, such as top or ps.
- The lack of resource isolation between slices, which effectively means that one slice may affect the performance of other slices. This is a result of the limitations of the VServer technology (employed by the PlanetLab O/S) in providing full resource isolation among slivers (i.e. VMs) running within a PlanetLab node. Consequently, there is no full resource isolation between different PlanetLab slices that involve slivers running simultaneously within PlanetLab nodes.
To tackle the aforementioned problems, the main concept introduced by CoMon is that any
slice user is allowed to see the resources consumed by other slices that run over the shared
PlanetLab testbed. This way, the slice user is able to see if other slices have an effect on
his/her own slice. However, to preserve the privacy of users, a slice user does not have access to
process-specific information of other slices.
2.2.1.2 CoMon Main Features
CoMon is a scalable monitoring system designed to passively collect measurement data
provided or synthesized by the PlanetLab O/S. It also works as an active monitoring system
that gathers by itself a variety of measurement metrics that are useful to PlanetLab’s
developers and users.
CoMon’s measurement operations take place every day in fixed time periods on 900-950
active PlanetLab nodes, where 300-350 active experiments are running. All PlanetLab users
are able to view the reported values of measured data via a Web interface, in the form of
HTML tables. Users are able to sort the columns of the provided HTML tables and get access
to graphs showing history of measurement data. Overall, CoMon provides the following
functionalities to users and developers of experiments within PlanetLab:
- Monitors the resources consumed within an experiment, in order to identify processes consuming resources inefficiently and thereby degrading the performance of the experimental code.
- Provides sufficient information to the code developers of each experiment to trace code problems. However, as discussed earlier in this section, the privacy of other experimentation slices is preserved, i.e. a developer of an experiment can only see information relative to his/her own experimentation slice.
- Keeps track of errors occurring in specific nodes of the PlanetLab testbed.
- Allows experimenters to run custom queries to select PlanetLab nodes fitting the requirements of their experiments. For example, an experimenter who wants to test how the code implementing his/her peer-to-peer protocol behaves when nodes fail may issue a query to CoMon in order to find PlanetLab nodes whose historical data shows an availability of less than 100%.
2.2.1.3 CoMon Architecture and Services
As we discussed earlier, CoMon provides both node-centric and slice-centric monitoring
information. To achieve this dual mode of operation, CoMon is composed of two daemons
that run on each PlanetLab node, a node-centric daemon and a slice-centric daemon. Both
daemons accept HTTP requests as queries, and their reply messages are also sent via the
HTTP protocol. The functionality of each daemon process is the following:
Node-Centric Daemon
This daemon is implemented as an event-driven process together with a set of persistent
monitoring threads which periodically generate updated values for the metrics they
measure. Thus, when a client requests data from the node-centric daemon, it receives data
as a combination of the most recent values from each monitoring thread. Regarding its
performance, the node-centric daemon uses less than 0.5% of CPU of each PlanetLab node.
Monitored data provided by the node-centric daemon are O/S metrics, passively measured
metrics, and actively-measured metrics. The decision taken by the PlanetLab developers on
which metrics to provide was derived from what an experimenter needs in order to detect
anomalies in a particular node. The overall goal was to organize the measurement metrics
into a concrete set and also to guide the development of active tools for measuring metrics
not yet included in that set. The current set of metrics provided by the node-centric daemon
is shown in the table below:
OS-provided, passively-measured and actively-measured metrics: Uptime; Overall CPU utilization (Busy, Free); System CPU utilization; SSH Status; Clock Drift; 1-Second Timer max value; 1-Second Timer average value; Memory Active Consumption; Free Memory; Number of Raw Ports (in use); Number of ICMP Ports (in use); Number of Slices (in memory); Live Slices; Disk Size; Number of Slices using CPU; Disk Used; Number of Ports in use; GB Free (i.e. Disk Space Free); Swap Used; Number of Ports (in use) > 1h; CPU Hogs; Max time for Loopback Connection; Average time for Loopback Connection; Amount of CPU available to a spin-loop; Amount of Memory Pressure seen by a test program; UDP failure rates for local DNS servers (primary & secondary DNS UDP); TCP failure rates for local DNS servers (primary & secondary DNS TCP); Last HTTP failure time; File RW; Date; 1-Minute Load; FD Test; Memory Hogs; Processes Hogs & Number of Processes; Bandwidth Hogs (i.e. Tx, Tx Kb, Rx, Rx Kb); Port Hogs; Response Time; Last CoTop (i.e. last time the slice-centric daemon ran successfully); Bandwidth Limit; Transmit Rate; Receive Rate; Memory Size; 5-Minute Load; Swap In/Out Rate; Disk In/Out Rate; Kernel Version; Node Type; Boot State; FC Name; Presence of transparent Web proxies.
Table 2: Metrics provided by CoMon’s node-centric daemon
2.2.1.4 Slice-centric Daemon
This daemon receives monitoring information from Slicestat [Slicestat], a process which
monitors the resource usage of all active slices running on the PlanetLab node. Using the
CoTop PlanetLab tool [CoTop], the slice-centric daemon produces per-slice information whose output resembles that of the UNIX monitoring tool top [Unix-top]. Table 3 below shows the metrics provided by CoMon's slice-centric daemon. The same information is also produced by the slice-centric daemon in the form of HTML tables, which can be accessed via CoMon's web interface. As an example, the reader may refer to the webpage
http://comon.cs.princeton.edu/status/tabulator.cgi?table=nodes/table_vicky.planetlab.ntua.gr
to see the monitored information for all slices running within the PlanetLab node with DNS name vicky.planetlab.ntua.gr. This node is one of the two nodes located at the Netmode-NTUA PlanetLab site; the second node hosted there has the DNS name stella.planetlab.ntua.gr.
1 Minute Transmit Bandwidth Rate
15 Minute Transmit Bandwidth Rate
1 Minute Receive Bandwidth Rate
15 Minute Receive Bandwidth Rate
Number of processes in use
CPU usage %
Memory usage %
Physical Memory consumption (MB)
Virtual Memory consumption (MB)
Table 3: Metrics provided by CoMon’s slice-centric daemon
2.2.1.5 CoMon Data Collection Service
A process running within PlanetLab's MyPLC collects and processes the monitored data received from the node-centric and slice-centric daemons that run on every PlanetLab node. This MyPLC process runs every 5 minutes. The collected raw data is stored in the following three files on a MyPLC storage disk:
• one file with the most recent snapshot for all nodes
• one file containing all the data gathered for the current day
• one file containing an aggregation of the latest data received from every node that is in "boot" state (refer to the PlanetLab documentation for all states of a PlanetLab node)
Within MyPLC, processing of the raw data creates metafiles. These are later used by CGI programs to generate sorted HTML tables on demand through a web interface. The number of these metafiles depends on the number of slices and nodes within the testbed.
A slice-centric view shows all the slices running within each node. As an example, the reader may refer to the already mentioned webpage to see all the running slices within the PlanetLab node with DNS name vicky.planetlab.ntua.gr.
A node-centric view shows all measured metrics for a node within a specific slice. For example, the reader may refer to the webpage http://comon.cs.princeton.edu/status/tabulator.cgi?table=slices/table_princeton_codee to see all measured metrics for all nodes (i.e. slivers) that are members of the specified slice.
Currently, CoMon requires the transfer of a large amount of raw data for both the node-centric and the slice-centric collection operations. To reduce the amount of raw data received by MyPLC, the raw data produced within PlanetLab nodes is compressed with bzip2, reducing its volume by about 10%.
2.2.1.6 CoMon Data Handling and Visualization Service
The tool that PlanetLab operators use for the graphical representation of monitored data is the popular RRDtool, a high-performance data logging and visualization system for time series of data. Furthermore, PlanetLab operators use one separate database file for each row of any table, which gives them flexibility in creating the tables. In this manner, operators can add nodes/slices to CoMon without having to re-format the existing database files. However, adding more metrics per row requires update operations on these files. To control how much delay is introduced in the creation of the various tables/graphs, as well as the resource usage of the various components, the data is processed in a four-step procedure. The steps are the following:
1. Every five minutes, MyPLC collects data from the slice-centric daemons and stores
the data into a snapshot file showing the current state, writes another copy into an
“update” file, and appends this to the history file for the current day. Finally, it
writes the per-node, per-slice tables, and the tables for the maximum/average/total
resources consumed by each slice, if no other instance of the program is already
writing tables.
2. MyPLC splits the large “update” file into separate files per slice. This step is repeated
upon reception of every “update” file.
3. It then updates the RRD databases that contain the per-slice statistics for
max/avg/total resource consumption. Only one instance of this process can run at a
time and the processing is CPU intensive.
4. Finally, the per-slice files are split to create per-sliver files. All per-sliver files are then stored in the relevant RRD databases for fast query processing of a specific sliver.
CoMon provides support for users to specify “C-style” query expressions to request specific
data. This method allows users to select only the rows that satisfy their queries. An example
query, embedded in the URL used to access the CoMon website, is the following:
select ='numslices > 60 && cpu > 50'
This query requests the list of PlanetLab nodes that host more than 60 slices and that use
more than 50% of their CPU.
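As a rough illustration, the short Python sketch below issues such a select query against the CoMon CGI front-end mentioned above; the table name passed in the request is an assumption made purely for illustration, and only the select expression itself is taken from the example above.

import urllib.request, urllib.parse

BASE = "http://comon.cs.princeton.edu/status/tabulator.cgi"      # CoMon CGI front-end (see above)
query = "numslices > 60 && cpu > 50"                             # C-style select expression
params = urllib.parse.urlencode({"table": "table_nodeviewshort",    # table name: assumed, for illustration
                                 "select": "'" + query + "'"})
with urllib.request.urlopen(BASE + "?" + params) as resp:
    for line in resp.read().decode("utf-8", "replace").splitlines()[:5]:
        print(line)                                              # first few rows for the matching nodes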
2.2.2 MyOps
All the monitoring tools described in the introduction provide a substantial assessment of the physical infrastructure or of network functionality, but they are not able to detect and resolve serious problems arising at a PlanetLab site. The MyOps toolkit [MyOps] was built to fill the gap between these tools and to close the loop on the management cycle.
The start-up sequence of the node software stack shows that relying solely on slice-level monitoring services to diagnose the run-time status of the deployment is inadequate, since much happens before the system is able to run a slice. While some services on PlanetLab report the operational status of PlanetLab nodes or the connectivity between nodes, this information is not part of the MyPLC management platform and thus cannot provide deep insight into the cause of problems inside a node during start-up. As a result, PlanetLab operations have historically relied on the diligence of remote operators to restart crashed nodes, run hardware diagnostics, or perform physical maintenance.
MyOps consists of techniques and strategies for the precise monitoring and management of a MyPLC-based [MyPLC] deployment. Although some monitoring services can provide useful information about the operational status of a node, its interface cards and its Internet connectivity, this information is not provided as part of the MyPLC management platform and is thus not sufficient to provide essential hints on the specific problems that may have occurred during the start-up procedure of a node. To tackle this issue, MyOps has been implemented as a service within the MyPLC stack itself (Figure 18).
Figure 18: The MyPLC software stack (taken from [MyOps])
Generally, MyOps supplements the monitoring operations by detecting problems through data collection and aggregation of the observed history. It also enables automated actions to be enforced upon failures, as well as the delivery of specific notifications to users. In addition, MyOps notifies the responsible technical administrators of the appropriate PlanetLab site(s). Each PlanetLab site has a number of technical administrators, according to how the delegation of management responsibility is currently implemented within PlanetLab's architecture. When notified by MyOps, the per-site technical administrators can proceed to find what caused the disruption of the operational status of their PlanetLab nodes.
The MyOps management cycle is based on the following procedure (a simplified sketch is given after the list):
a) It first collects data to examine the behaviour of a PlanetLab node.
b) Based on these data, MyOps records any observed problem in a history table, applying a policy rule that classifies the observed data into a problem or a set of problems.
c) Finally, MyOps undertakes specific actions, again governed by policy rules, in order to resolve any existing problem on the node.
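The following Python sketch illustrates this collect/classify/act cycle in a very simplified form; all names, observation labels and policy rules below are hypothetical and are not taken from the MyOps implementation.

from dataclasses import dataclass, field

@dataclass
class NodeHistory:
    name: str
    observations: list = field(default_factory=list)   # e.g. "ssh_failed", "boot_ok" (hypothetical labels)

def classify(history: NodeHistory) -> str:
    """Step b): map the recorded history to a problem category via a simple policy rule."""
    recent = history.observations[-3:]
    if len(recent) == 3 and all(o == "ssh_failed" for o in recent):
        return "node_down"
    if "disk_full" in recent:
        return "disk_full"
    return "ok"

def act(problem: str) -> str:
    """Step c): pick an action for the detected problem; real MyOps policies are far richer."""
    return {"node_down": "notify technical contacts and attempt a PCU reboot",
            "disk_full": "notify slice users",
            "ok": "no action"}[problem]

node = NodeHistory("vicky.planetlab.ntua.gr",
                   ["boot_ok", "ssh_failed", "ssh_failed", "ssh_failed"])   # step a): collected data
print(act(classify(node)))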
Figure 19: MyOps management structure (taken from [MyOps])
In the following two sub-sections, we will describe in detail the functionality that
MyOps implements in order to a) detect and b) resolve problems.
2.2.2.1 MyOps Functionality for Problem Detection
If the behaviour of a PlanetLab node differs from the expected one, MyOps takes action. In particular, it tries to contact agents on the PlanetLab nodes administered by MyPLC [MyPLC] in order to:
• Check the network accessibility of a node.
• Retrieve important environmental values of a node, such as the kernel version, the system logs, the operational status of the slices, etc.
• Check the operational state of a node (i.e. online, good, down, failboot, etc.).
• Diagnose whether a node is working correctly (i.e. OK, not OK).
• View the functionality of the Power Control Unit (i.e. OK, Error, Not Run, etc.) – see Figure 20.
MyOps agents consequently report to MyPLC, at fixed intervals, how a node is running and update a field that maintains the last recorded information. This operation helps administrators to quickly trace nodes that have deviated from the expected behaviour for a long period. On the one hand, this method provides a good instant view of the system and detects problems currently occurring in any node. On the other hand, an instantaneous observation of the nodes' state does not give distinct and robust knowledge of the problems that have occurred in a node over time. To handle a wide variety of problems that demand different treatment, MyOps applies a policy rule framework based on the history of the nodes (see Figure 20). For example, a site of PlanetLab nodes that is "down" for one day due to power supply problems should be handled differently from a site whose nodes go into downtime due to operating system update procedures.
Figure 20: PlanetLab nodes’ PCU state (screenshot taken on the MyOps site)
2.2.2.2 MyOps Functionality for Problem Resolution
After detecting a problem, MyOps tries to cope with it efficiently based on particular policy rules, as shown in Figure 21. More precisely, it checks each time whether the problem can be resolved by performing the following actions. The MyOps system:
a) Attempts to automatically repair run-time or configuration errors so that the regular start-up process of a node can continue, based on existing knowledge from start-up script logs. MyOps either tries to restart (i.e. boot manager, hardware) or reinstall (in the worst case) the node, depending on the nature of the problem. If there is no known way to resolve the problem, manual intervention is needed to deal with it.
b) Sends email notices to remote contacts to notify them of the problem. Depending on the importance of the problem, the email is sent to the appropriate people according to the roles (i.e. Principal Investigator, Technical Contact, Slice User, Administrator) defined by MyPLC.
c) Gives incentives to users to examine, on their own, possible errors reported to them. As an example, a MyPLC operator may halt a slice on a PlanetLab node in order to prompt the site's responsible persons to look for the cause of the problem.
Figure 21 shows a table of the actions (and their frequencies) that MyOps performed during the last week.
Figure 21: MyOps action summary (screenshot taken on the MyOps site)
2.2.3 TopHat
TopHat [TOPHAT] is a topology information system for PlanetLab. Researchers use the
PlanetLab facility for its ability to host applications in realistic conditions over the public best-effort Internet.
TopHat continuously monitors the part of the network that is not under the users' control and supplies the data to PlanetLab-based applications. Its originality lies in its state-of-the-art measurement infrastructure and its support of the full experiment lifecycle.
During slice creation, TopHat provides information about nodes and the paths between them to help select the most appropriate nodes. This way, an experiment can leverage PlanetLab's topological and geographical diversity.
Users and applications can request live measurements of routes, delays and available
bandwidth. Alerts can also be set up by the users to trigger when changes in the underlying
topology occur, thus providing a stream of updates in real time.
Access to information archived during the experiment allows for retrospective analysis.
Knowledge of topology evolution is crucial to make sense of experimental results. Historic
data unrelated to individual experiments is also available for research purposes.
Straightforward experiment integration relieves a user from writing his/her own measurement code, and the consistent, easy-to-parse data formats for the different measurements make an experimenter's life easier. TopHat provides transparent access to best-of-breed measurement tools and infrastructures, such as Paris traceroute as a tool and DIMES plus ETOMIC as infrastructures.
Available measurements and metrics include:
• node and path characteristics (high precision, load balancers, ...)
• measurements such as traceroute, delay and available bandwidth
• AS-level information (prefixes, AS numbers and names, type of AS, ...)
• geolocation information (latitude/longitude, city, country, continent)
While all information is transparently tunnelled to the users, it can be annotated with the tools, platforms and methods that were involved in serving the request.
2.3 VINI
VINI [VINI] is a virtual network infrastructure that allows network researchers to evaluate
their protocols and services in a realistic environment. VINI allows researchers to deploy and
evaluate their ideas with real routing software, traffic loads, and network events. To provide
researchers flexibility in designing their experiments, VINI supports simultaneous
experiments with arbitrary network topologies on a shared physical infrastructure.
PL-VINI [BAV06] is a prototype of a VINI that runs on the public PlanetLab. PL-VINI enables
arbitrary virtual networks, consisting of software routers connected by tunnels, to be
configured within a PlanetLab slice. A PL-VINI virtual network can carry traffic on behalf of
real clients and can exchange traffic with servers on the real Internet. Nearly all aspects of
the virtual network are under the control of the experimenter, including topology, routing
protocols, network events, and ultimately even the network architecture and protocol
stacks.
The monitoring tools that VINI uses – mainly CoMon and MyOps – are the same as the ones used in PlanetLab (see the PlanetLab section 2.1 for details).
VINI currently consists of nodes at sites connected to the National Lambda Rail, Internet2,
and CESNET (Czech Republic). Note that VINI leverages the PlanetLab software, but is a
separate infrastructure from PlanetLab. The maps below show the current VINI deployments
in the two US Research & Education backbones: Internet2 (Figure 22) and NLR (Figure 23).
Both figures are taken from [VINI].
Figure 22: VINI nodes on Internet2
Figure 23: VINI nodes on NLR
VINI allows its users to prepare an overlay network across the booked nodes. The overlay and its routing can be set up by VINI for the user when a specification is given for it. To control the routing, for example the Quagga [Quagga] tool can be used. However, there are a few limitations on the structure of the overlay that depend on the physical connectivity of the underlying booked hosts: "Virtual links can only be added between nodes that are adjacent in the physical topology (meaning there are no intervening VINI nodes)." If needed, this limitation can be circumvented by building a private overlay on top of a booked VINI slice [VINI]. Additionally, the user can configure a bandwidth limit for the virtual VINI links.
The configuration is done by editing the RSpec that describes the resources to allocate to the user's slice: sliver tags are added to the nodes on which slivers are wanted, and virtual link (vlink) tags to the links on which virtual links between slivers should be created. Bandwidth limits are also configured as part of this RSpec.
2.4 Panlab
2.4.1 Traces Collection Framework
Panlab provides a traces collection framework whose goal is to provide a toolkit for exposing inter-component communication for initial setup and "debugging" purposes. It provides a means for the observation of real-time interactions, including integration with a widely used network protocol analyzer (Wireshark). However, it does not yet have native support for displaying network activity from the interfaces of remote nodes.
The traces collection framework is composed of two main components: 1) Traces Collection
Client (TCC) and 2) Traces Collection Server (TCS). TCC is a software component that runs at
the testbed nodes and from which traffic traces are collected. It intercepts the packets and
transmits them to the server in an appropriate format. TCS is located at a central location
where the trace files are collected, stored, and managed and from which traffic captures can
be exposed to any user.
An example is shown in Figure 24. The Traces Collection Clients are deployed on resources A and B. This allows users to observe the packets that are sent between A, B and C. Since there are no TCCs deployed on resources C and D, users have no platform-enabled means to see what is happening on the interfaces between them.
Figure 24: Panlab traces collection framework structure
Traces Collection Clients consist of the capturing tool (tshark) and a Traces Transmitter
which is a program that relays traffic captures to a central server.
A set of parameters can be configured for each TCC individually, such as sink IP, sink port and type of trace, among others. The developed solution supports real-time observation. For simplicity, users are provided with execution scripts, which can be downloaded from a web interface; with minimal browser configuration they can be executed immediately, starting a remote observation. Solutions have been developed for POSIX-compliant operating systems and for Windows.
The execution of the shell script carries out the following three actions (a simplified sketch is given after the list):
1. Creation of a named pipe with a unique name.
2. Start of Wireshark (see Figure 25), configured to wait for streaming input from the mentioned pipe.
3. Sending of an HTTP GET request to the server to retrieve traces for the selected resource and forwarding of this data into the previously created named pipe.
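The Python sketch below mimics these three steps on a POSIX system; the TCS URL and resource identifier are placeholders, since the real endpoint is obtained from the Panlab web interface.

import os, subprocess, urllib.request, uuid

pipe = "/tmp/panlab_trace_%s" % uuid.uuid4().hex        # 1. named pipe with a unique name
os.mkfifo(pipe)

viewer = subprocess.Popen(["wireshark", "-k", "-i", pipe])   # 2. Wireshark waits for streamed input

tcs_url = "https://tcs.example.org/traces?resource=A"   # 3. hypothetical TCS endpoint for resource A
with urllib.request.urlopen(tcs_url) as resp, open(pipe, "wb") as sink:
    while True:
        chunk = resp.read(4096)                         # forward the streamed capture into the pipe
        if not chunk:
            break
        sink.write(chunk)

viewer.wait()
os.unlink(pipe)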
As soon as new packets are captured on a remote node, they are delivered to the TCS and then immediately sent to the user. As this data is forwarded into Wireshark through the pipe, Wireshark automatically recognizes the changes and refreshes its GUI to present the new data.
Figure 25: Wireshark user interface
Results of an experiment, generally raw packet traces, can be stored in a centralized location (the Traces Collection Server). This resource can remain persistent for an extended duration compared to the rest of the booked testbed elements. Therefore, all other resources can be released without having to download a considerable amount of data onto a user's hard drive. Results can be stored in a "cloud" for further analysis and processing.
2.4.2 Omaco
The Open Source IMS Management Console (OMACO) and its GUI are shown in Figure 26. The OMACO tool is a monitoring toolkit integrated with Teagle [Teagle]. Teagle, as a control framework and collection of central federation services, helps the user in configuring and reserving a desired set of resources. Relying on a federation framework [WaMG10], users can get access to distributed resources and group them into a virtual environment called a VCT (Virtual Customer Testbed). Resources are offered by participating organizations across Europe [Panlab], [Wahle11]. More details on the Teagle architecture can be found in NOVI Deliverable D3.1 [D3.1]. The monitoring section of the Teagle UI is shown in the following two figures:
Figure 26: OMACO screenshot
The choice of the metric and of the resources to monitor is still hard-coded; in further steps they are expected to become dynamically selectable. OMACO initially started as a monitoring and fault management system for the IP Multimedia Subsystem (IMS). It includes SIPNuke (a high-performance IMS/SIP traffic generator and device emulator) and SIPWatch (a passive monitor for VoIP transactions (register/message/call)), and it allows other types of probes to be plugged in via a script-based policy engine written in Lua.
Figure 27: OMACO as part of Panlab’s Teagle installation
2.5 ProtoGENI/Emulab
ProtoGENI [ProtoGENI] offers its Control Plane for monitoring its testbed. The ProtoGENI Control Plane is responsible for the configuration, control and monitoring of an experiment.
Every node is connected to some "control net" to enable this control. In Emulab, a separate
(from the data plane) LAN is used. However, while separate, this LAN is also used for
external access to the data plane. In VINI and PlanetLab components, the Internet is used as
the control plane.
Figure 28: Control plane interface of ProtoGENI
ProtoGENI supports both virtualized and space-shared resources. The latter can run custom
OS software on dedicated hardware.
The Control Plane handles both internal operations and experiments and monitors the
testbed resource state across all components. Resource allocation and the corresponding policies are still under development; currently, resource brokering is not addressed (a slice embedding service handles normal resource discovery and allocation for experimenters).
2.5.1 Measurement Capabilities
Experiment management has rich support in Emulab, including functions for specifying
topologies, installing and updating software, controlling execution, and deploying in
heterogeneous environments.
The Ethernet switches used have basic packet counters and can be configured to "mirror" traffic between selected ports for collection purposes. The PCs can use standard measurement and capture tools, such as tcpdump. Due to their nature as "programmable hardware", a wide range of measurements is possible on the NetFPGAs.
An instrumentation and measurement system has been under development since 2010 by the "Leveraging and Abstracting Measurements with perfSONAR" (LAMP) project [LAMP], based on perfSONAR [perfSONAR], for use on ProtoGENI. The project roadmap aims at the development of a common GENI instrumentation and measurement architecture. Emphasis is placed on providing a common but extensible format for data storage and exchange.
The LAMP Instrumentation & Measurement (I&M) System is being realized as a ProtoGENI slice with I&M capabilities based on the perfSONAR framework and the perfSONAR-PS software suite.
The LAMP I&M System is built on top of three pillars:
• LAMP Portal: the LAMP Portal is the initial resource for experimenters to manage and visualize their I&M services and data.
• UNIS and topology-based configuration: the Unified Network Information Service (UNIS) is the most important component of the LAMP architecture. This service is the combination of the Lookup Service and the Topology Service of the perfSONAR framework. UNIS is used to store the topology data of the I&M slice, and services register themselves so that they can be found dynamically. The configuration of the LAMP services is done through annotations on the topology data: through the LAMP Portal, services are configured and the modified topology is uploaded to UNIS. The configuration pushed to UNIS is then fetched by the perfSONAR-PS pSConfig service that runs on each node; the pSConfig service parses the configuration for the node and makes all necessary changes.
• I&M software (measurement tools and perfSONAR services): the LAMP I&M System is based on existing measurement tools and the mature perfSONAR-PS software suite, and derives directly from the pS-Performance Toolkit and the perfSONAR-PS services. All services of the LAMP I&M System have Authentication and Authorization (AA) capabilities based on the ProtoGENI AA model.
2.6 WISEBED
WISEBED [WISEBED] is a federated environment of wireless sensor networks (WSN) of
considerable size (more than 2000 sensor nodes). WISEBED brings together and extends
different existing testbed platforms across Europe, forming a federation of distributed test
laboratories. The aim of bringing heterogeneous small-scale testbeds together to form organized, large-scale structures is addressed through a federated identity-based authentication and authorization infrastructure, a scheduler for resource leases, a Uniform Resource Name (URN) schema, a resource management framework and an operating system abstraction framework (Figure 29). A more detailed description of the overall architecture
can be found in NOVI-D3.1 Deliverable [D3.1].
Figure 29: WISEBED system architecture
In terms of monitoring, an Application Programming Interface (API) has been defined to present the testbed's data in a common manner. The Monitoring API includes a single remote function, getnetwork(), which returns monitoring data in WiseML format (described below).
This function allows a client of the API to acquire the network topology of an instance (a single-user unified testbed consisting of resources spread all over the federated testbed environment, called a virtualized testbed in WISEBED terminology). The Monitoring API also allows querying of changes in the network after the reservation has been made (for example, if a node has died and has been replaced by another).
The Web Services Description Language (WSDL) has been used as the specification language to define the web services related to the Wireless Sensor Network (WSN) APIs, including the Monitoring API.
A markup language called WiseML (Wireless Sensor Network Markup Language) [WiseML] has been developed for generating and storing the output of the experiment log and debug tracing. WiseML is also used for reading and parsing the necessary information about the testbed's underlying network resources.
A WiseML-formatted file lists the entire network environment of each experiment. The file contains all the information needed to identify and reproduce a simulation trace. In addition, it is used by the Testbed Architecture for Wireless Sensor Networks (TARWIS) system to display the nodes in the network graph.
A Uniform Resource Name (URN) schema is used to describe, in a standard namespace format, the sensor nodes, their capabilities, the network topology and the portal servers. Both the Monitoring API and WiseML use this URN schema.
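As a rough illustration of how such a file can be consumed, the Python sketch below extracts node URNs and capabilities from a strongly simplified, WiseML-like document; the element layout is illustrative only, and the WiseML specification [WiseML] should be consulted for the real schema.

import xml.etree.ElementTree as ET

wiseml = """<wiseml>
  <setup>
    <node id="urn:wisebed:node:testbed1:0x1234"><capability name="temperature"/></node>
    <node id="urn:wisebed:node:testbed1:0x5678"><capability name="light"/></node>
  </setup>
</wiseml>"""

root = ET.fromstring(wiseml)
for node in root.iter("node"):                              # walk all node elements
    caps = [c.get("name") for c in node.iter("capability")]
    print(node.get("id"), caps)                             # node URN plus its sensing capabilities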
After initialization of the experiment, by the TARWIS controller, the user can monitor the
ongoing experiment by observing the output of the selected sensor nodes. The web
interface of TARWIS allows the live monitoring of the ongoing experiment and displays the
output that the sensor nodes print on the serial interface. The sensor nodes’ output and the
experiment control output are updated every 500 milliseconds. The location map is updated
every 30 seconds, including the links between the sensor nodes.
Every output of a finishing experiment is exported by TARWIS to a WiseML file, zipped and saved in the TARWIS database. This WiseML file contains all the important information
about an experiment run, for example where the experiment took place geographically, what kinds of nodes were used and what their sensor environment was. Storing experiment-related information in one WiseML file enables post-experiment analysis.
Experiment results are stored in the designated TARWIS database. Researchers carrying out experiments can choose to declare them public; in that case, other WISEBED members can monitor the ongoing experiment at run-time and download the experiment results.
2.7 ORBIT
Open-Access Research Testbed for Next-Generation Wireless Networks (ORBIT) [ORBIT] is a
network testbed consisting of 400 802.11 a/b/g radio nodes deployed in the same indoor
area. The grid of these nodes can be used to create static or dynamic topologies (currently,
only through a mobility emulator). A hardware subsystem has been deployed for physical
infrastructure monitoring, plus chassis control, and a measurement framework has been
developed for data collection.
ORBIT uses a platform-independent subsystem for managing and autonomously monitoring
the status of each wireless node in the ORBIT network testbed. This subsystem consists of Chassis Managers (CM). Each ORBIT grid node consists of one Radio Node with two radio
interfaces, two Ethernet interfaces for experiment control & data, and one Chassis Manager
(CM) with a separate Ethernet network interface. Each CM is tightly coupled with its Radio
Node host. CM subsystems are also used with non-grid support nodes. A central Control and
Monitoring Manager (CMC) exists for all the CM elements of ORBIT.
Each Chassis Manager is used to monitor the operating status of one node. It can determine
out-of-limit voltage and temperature alarm conditions, and can regain control of the Radio
Node when the system must return to a known state. The CM subsystem also aids debugging
by providing telnet to the system console of the Radio Node, as well as telnet access to a CM
diagnostic console.
For measurement purposes a software library has been developed: the ORBIT measurement library (OML) [OML] is a distributed software framework enabling real-time collection of data in a large testbed environment. It enables the experimenter to define measurement points and parameters, to collect and pre-process measurements, and to organize the collected data into a single database together with the experiment's context, avoiding log files in various formats. The OML framework is based on a client/server architecture and uses IP multicast for the clients to report the collected data to the server in real time. It defines the data structures and functions for sending/receiving, encoding/decoding and storing experiment data. OML has three major components (a conceptual client-side sketch is given after the list):
• The OML-Service, which can generate wrapper code for C/C++ applications to interact with OML and also creates runtime experiment client/server files based on an experiment definition.
• The OML-Client Library, which is deployed on the nodes participating in the experiment; it can aggregate and filter data from the applications participating in the experiment and send it to a central server.
• The OML-Collection-Server, which collects the filtered data coming from all nodes involved in the experiment and stores it in a MySQL database.
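The Python sketch below illustrates the client-side concept only (measurement point, filtering, reporting); it does not use the real OML client library, and all names in it are purely illustrative.

import random, statistics, time

class MeasurementPoint:
    """Buffers samples for one metric and periodically emits a filtered summary."""
    def __init__(self, name, report):
        self.name, self.report, self.samples = name, report, []
    def inject(self, value):
        self.samples.append(value)
        if len(self.samples) >= 10:                  # simple size-based filter window
            self.report(self.name, statistics.mean(self.samples))
            self.samples.clear()

def send_to_collection_server(metric, value):
    # A real OML client would stream this to the collection server for storage in the database.
    print("report %s = %.2f at %.0f" % (metric, value, time.time()))

rssi = MeasurementPoint("rssi_dbm", send_to_collection_server)
for _ in range(30):
    rssi.inject(random.uniform(-90, -40))            # samples injected by the experiment application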
2.8 ORCA/BEN
In this section we give a brief overview of the monitoring capabilities of ORCA/BEN. ORCA is an
extensible framework for on-demand networked computing infrastructure. Its purpose is to
manage the hosting of diverse computing environments (guests) on a common pool of
networked hardware resources such as virtualized clusters, storage, and network elements.
ORCA is a resource control plane organized around resource leasing as a foundational
abstraction. The control plane architecture of ORCA is presented in NOVI Deliverable D3.1
[D3.1]. In this document we give an overview of the monitoring aspects of this system.
BEN (Breakable Experimental Network) is a platform for network experimentation. It
consists of several segments of NCNI dark fibre, a time-shared resource that is available to
the North Carolina research community. BEN provides access for university researchers to a
unique facility dedicated exclusively to experimentation with disruptive technologies (hence
the term Breakable).
The ORCA/BEN project expands the scope of ORCA to enable it to function as a management
plane for the network as well as edge resources, acting as a GENI management plane. See
also [Chase09] for details of the ORCA control framework architecture and internals.
One of the goals of Automat [YuSI07], the web GUI of ORCA, is to provide a standard
interface for controllers to subscribe to named instrumentation streams in a uniform
testbed-wide name space.
ORCA aims at supporting a plugin-based instrumentation framework for guest elements to
publish instrumentation streams and to subscribe to published streams [Jhase2009]. For
example, a system or a software application in a multi-tier web service running on leased
resources can continuously publish information about the usage of different service
components. A guest controller for the service might subscribe to this data and use feedback
control to implement self-optimizing or self-healing behaviour.
The initial approach in ORCA was to read instrumentation data published through Ganglia
[Ganglia], a distributed monitoring system used in many clusters and grids. It is based on a
hierarchical design targeted at federations of clusters. It leverages widely used technologies
such as XML for data representation, XDR for compact, portable data transport, and RRDtool
for data storage and visualization.
The Ganglia implementation consists of several components [MaCC04]. The Ganglia
Monitoring Daemon (gmond) provides monitoring on a single cluster by implementing the
listen/announce protocol and responding to client requests by returning an XML
representation of its monitoring data. It runs on every node of a cluster. The Ganglia Meta
Daemon (gmetad), on the other hand, provides federation of multiple clusters. A tree of TCP
connections between multiple gmetad daemons allows monitoring information for multiple
clusters to be aggregated. Finally, gmetric is a command-line program that applications can
use to publish application-specific metrics, while the client side library provides
programmatic access to a subset of Ganglia’s features.
The default site controller in ORCA instantiates a Ganglia monitor on each leased node
unless the guest controller overrides it. The monitors collect predefined system-level
metrics, like CPU, memory, disk, or network traffic, and publish them via multicast on a
channel for the owning guest controller.
At least one node held by each ORCA guest controller acts as a listener to gather the data
and store it in a round-robin database (rrd). Guests may publish new user-defined metrics
using the Ganglia Metric Tool (gmetric), which makes it possible to incorporate new sensors
into the Ganglia framework.
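A minimal sketch of how a guest could publish such a user-defined metric is shown below; it assumes only that the standard gmetric command-line tool is installed on the leased node and that gmond is already configured there.

import subprocess

def publish(name: str, value: float, units: str = "") -> None:
    """Push one custom metric sample into the node's Ganglia multicast channel via gmetric."""
    subprocess.run(["gmetric",
                    "--name", name,
                    "--value", str(value),
                    "--type", "double",
                    "--units", units],
                   check=True)

publish("guest_request_latency_ms", 12.7, "ms")      # hypothetical guest-level metric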
2.9 OneLab
2.9.1 ScriptRoute
The current state of the art in OneLab active network monitoring is ScriptRoute
[ScriptRoute]. It provides a scripting language to request sophisticated measurements. More
recently, TopHat, developed by partner UPMC, is improving on ScriptRoute in its reuse of
recent measurements. In OneLab, TopHat is being improved with: 1) a greatly expanded
number of measurement nodes and scope of the system through federation with the DIMES
infrastructure; 2) high-precision measurements by federating with the ETOMIC
measurement infrastructure run by ELTE and UAM; 3) improved algorithms for efficient
network probing to better account for the recently discovered widespread existence of
multiple paths, as well as reduced self-interference from active measurements.
2.9.2 CoMo
Researchers running experimental applications can also benefit from information within the
network regarding traffic flows and the flows of their own packets as they pass through
certain monitoring points. OneLab develops solutions based on the CoMo² monitoring
system, a free open-source system that is notable for its security model. CoMo provides
network operators with the confidence that they can open their systems to outside queries
without compromising the confidentiality of client traffic. External researchers can run
complex queries to obtain detailed information on their own flows and aggregated
information regarding other traffic. In OneLab2, our use of CoMo is improved as we
introduce: 1) a resource management system to prevent resource conflicts for the large user
population expected for PlanetLab Europe; 2) data selection techniques that allow the
control of measurement resources (e.g., processing, transmission) and the amount of result
data; 3) synchronised passive multipoint measurements, which is especially challenging
when using random packet sampling techniques; 4) the IPFIX protocol extensions for storage
and sharing of data, which are being developed at the IETF with strong leadership by partner
Fraunhofer.
² http://como.sourceforge.net/
Figure 30: OneLab user interface showing host status
Regarding the integration with the 'MySlice' component, in the new Web User Interface that is part of the OneLab 4.3 release the OneLab team has adopted and tuned a JavaScript library for the client-side display of sortable and searchable lists. This feature still remains to be augmented by the capacity for the user to indicate the set of columns that he/she wants to see displayed on a given page, which would provide even greater flexibility. However, even without this capability, the resulting library has turned out to be very useful in the new PLEWWW module that implements the User Web Interface that comes with release 4.3. It is illustrated in Figure 31.
Figure 31: MySlice overview of networks
2.9.3 APE
With the APE (Active Probing Equipment) monitoring box, OneLab offers platform builders
and users a low-cost probing component suitable for a range of different monitoring needs,
in particular medium-precision active measurements.
The APE box offers a low-cost, low-maintenance solution for tenth-of-microsecond
resolution active measurements.
This stand-alone hardware solution is likely to be of interest to small businesses and
individual developers. The measurements offered by the APE box can be scaled up to a
global measurement infrastructure at a very moderate cost, adding a qualitatively new
feature to the future of active Internet measurements. For this reason APE’s measurements
are less focused on accuracy and programming flexibility than, for example, those of OneLab's
other measurement box, ANME. Along with the APE box we also provide a GPS device which
allows the synchronization of endpoints on a global level.
Figure 32: OneLab’s APE box
The APE hardware runs a robust, stand-alone operating system, which provides a range of
active probing services. The measurement services offered by these boxes are managed and
accessed through the SONoMA system (Service Oriented NetwOrk Measurement Architecture) developed according to the paradigm of web services. This approach offers new
‘plug and play’ measurements for a wide variety of user applications. See also section 3.4 for
more details.
2.9.4 ANME
With the ANME (Advanced Network Monitoring Equipment) box and its related software,
OneLab offers platform builders and users a versatile monitoring solution that carries out
both active and passive measurements to a very high level of precision.
The ANME box contains high-precision tools for both active measurements, giving an
extremely accurate view of the network, as well as for passive monitoring, providing fine-grained time-stamping.
By installing the box, users gain the ability to carry out active measurements at a timescale
of tens of nanoseconds, with GPS synchronization. If two or more boxes are purchased and
installed in different locations, the variety of possible measurements to be run (delays,
available bandwidth, etc.) is further increased.
Figure 33: ANME’s ARGOS measurement card.
The ANME box also contains the CoMo software package, OneLab’s passive monitoring
component. This program supports both arbitrary traffic queries that run continuously on
live data streams, and retrospective queries that analyze past traffic data to enable network
forensics. To run CoMo all that is required in addition to the ANME hardware is a switch
equipped with a monitoring port.
The ANME box comprises a server-class computer, an ARGOS measurement card, APE (a
service oriented measurement module), a GPS receiver, and the CoMo software package.
The state-of-the-art VHDL-programmable ARGOS card provides a measurement resolution of
tens of nanoseconds, while the GPS device keeps ANME measurement endpoints globally
synchronized and CoMo timestamps accurate.
2.9.5 Packet Tracking
Nowadays an increasing number of sophisticated network applications (such as multimedia
and gaming) demand a high level of quality of service (QoS) from the network, for example
small latencies and low packet loss. In order to supervise the fulfilment of such demands, a
deeper understanding of the network and what happens within it is required.
OneLab offers a distributed network surveillance solution for tracking the paths of data
packets through networks. Users can download and install a Packet Tracking software
package, which allows them to “track” packets in the network and answer enquiries related
to routes followed by any given “multi-hop” packet through the network, packet delays and
losses between nodes, service chains used by some data traffic, cross traffic influence, and
network QoS on the specific network links.
Figure 34: Map showing a packet tracking scenario involving different nodes from
testbeds such as VINI, G-Lab, PlanetLab and others.
Multi-hop Packet Tracking uses a non-intrusive coordinated sampling method (specified in
RFC 5475) to show the hop-by-hop path and transmission quality of data packets from
arbitrary applications. The Packet Tracking solution provides a means to verify (multipath)
routing and overlay experiments and supply input to adaptive network algorithms, security
applications and the supervision of service-level agreements.
2.9.6 ETOMIC
ETOMIC [ETOMIC] is closely connected to OneLab; however, as a network measurement infrastructure it is described in detail in Section 3.
2.9.7 DIMES
DIMES [DIMES] is a testbed aimed at studying the structure and topology of the Internet. It is
based on distributed light-weight software agents hosted by a community of thousands of
volunteers worldwide.
DIMES offers the ability to launch topology measurements of the Internet from multiple
vantage points, as well as access to a vast database of accumulated measurement data.
DIMES deploys the most numerous and highly distributed set of agents of any open
measurement infrastructure available on the Internet today. In addition, they are located in
places where other infrastructures typically do not have agents. For example, many are
located in people’s homes, providing vantage points outside of conventional academic
networks. Each DIMES agent periodically wakes up and performs measurements from its
host machine with no noticeable effect on the machine’s CPU usage or bandwidth
consumption. Measurements are orchestrated from a central server without breaching the
host’s privacy. These measurements include standard daily measurements as well as ones
that users specifically request.
On a typical day, the DIMES server records 3.5 million measurements from 1300 agents in
over 50 countries.
2.10 GpENI
Since March of 2010³, GpENI has run the Nagios monitoring system to monitor the status of individual nodes and services across all clusters. In addition, the Cacti monitoring tool is running on each of the Netgear Gigabit Ethernet switches to monitor per-port network usage. Zenoss Core has been evaluated as an alternative to Nagios.
2.10.1 Cacti
Cacti [Cacti] is a front-end to RRDTool [RRDTool]. It collects and stores all of the information necessary to create graphs and populates them from a round-robin database. Along with maintaining graphs, data sources and round-robin archives in a database, Cacti handles the data gathering. SNMP support also allows traffic graphs to be created with Cacti, as was possible with its predecessor MRTG (Multi Router Traffic Grapher). A typical screenshot of the Cacti user interface can be seen in the following figure.
Figure 35: Cacti screenshot
³ GpENI Wiki, Milestones and Deliverables: https://wiki.ittc.ku.edu/gpeni/Milestones_and_Deliverables#GpENI:_Administration_and_monitoring
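As a rough illustration of what Cacti automates on top of RRDTool, the sketch below creates a round-robin database for one interface counter and feeds it a polled sample using the standard rrdtool command line; the data-source name and archive sizes are illustrative choices.

import subprocess

RRD = "ifInOctets_port1.rrd"

subprocess.run(["rrdtool", "create", RRD,
                "--step", "300",                      # one sample every 5 minutes
                "DS:traffic_in:COUNTER:600:0:U",      # data source: an interface octet counter
                "RRA:AVERAGE:0.5:1:600",              # about two days of 5-minute averages
                "RRA:AVERAGE:0.5:12:600"],            # plus about 25 days of hourly averages
               check=True)

subprocess.run(["rrdtool", "update", RRD, "N:1234567"], check=True)   # store one polled value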
2.11 4WARD
Monitoring in the 4WARD [4WARD] project is located in the block named In-Network Management (see [D3.1] for more information about the 4WARD architecture). Specifically, 4WARD has developed algorithms for situation awareness that address real-time monitoring of network-wide metrics, group size estimation, topology discovery, data search and anomaly detection. These algorithms provide the necessary input to the self-adaptation mechanisms of 4WARD.
2.11.1 Monitored Resources
4WARD's objective is to provide a model for creating Future Internet networks. Therefore, the monitoring system of 4WARD does not focus on particular elements of its infrastructure; instead, it provides a set of algorithms offering monitoring functionality that 4WARD regards as important and challenging for the management of the Future Internet. For that reason, the resources to be monitored are not specific items; overall, the algorithms explained in the next sections interact with link and node resources.
2.11.2 Algorithms
GAP (Generic Aggregation Protocol)
GAP is a distributed algorithm that provides continuous monitoring of global metrics.
NATO!
"Not All aT Once!" (NATO!) is a statistical probability scheme and algorithms for precisely
estimating the size of a group of nodes affected by the same event, without explicit
notification from each node, thereby avoiding feedback implosion.
Hide & Seek (H&S)
H&S is an algorithm for network discovery, and information propagation and
synchronization. The workflow of H&S starts with a seeker node sending directional contact
messages to its neighbourhood using a probabilistic eyesight direction (PED) function. This
function aims to narrow the directions through which contact messages are sent. Once
contacted, hider and seeker synchronize their knowledge, keeping track of each other. The
hider becomes a new seeker and the process is repeated until all nodes have been
contacted.
2.11.3 Measurements
Unlike other platforms, 4WARD does not propose a set of monitoring tools, because its goal is to offer "Design Processes" (see the 4WARD section of [D3.1]) for developing new networks in the Future Internet.
As previously said, monitoring in 4WARD is centred on real-time monitoring of network-wide metrics, group size estimation, topology discovery, data search, and anomaly detection.
Regarding the monitoring of network-wide metrics, an important task is to detect when network-wide threshold boundaries are crossed. For example, detecting that some network-wide quantity such as average or peak load has exceeded a threshold may indicate a problem that needs attention. 4WARD has developed algorithms for threshold detection based on both tree-based and gossip-based underlying protocols. The functionality is depicted in Figure 36 below.
Figure 36: Detection of threshold-crossing
Crossing the upper threshold (Tg+) from below causes an alarm to be raised, and crossing the lower threshold (Tg-) from above causes the alarm to be cleared again. This symmetry suggests a "dual mode" operation of the protocols, which allows the detection machinery to "switch sign" depending on which threshold was most recently crossed. This idea allows protocols to be defined, in both tree-based and gossip-based variants, that almost completely eliminate the overhead for aggregate values that are far from the thresholds.
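A minimal sketch of this dual-mode behaviour is given below: an alarm is raised when the monitored aggregate crosses Tg+ from below and cleared again when it crosses Tg- from above, so that only one of the two thresholds is "armed" at any time. The thresholds and sample values are illustrative.

def detect_crossings(samples, tg_plus=80.0, tg_minus=60.0):
    alarm = False
    for t, value in enumerate(samples):
        if not alarm and value > tg_plus:
            alarm = True
            yield (t, "RAISE")          # switch sign: now watch for the lower threshold Tg-
        elif alarm and value < tg_minus:
            alarm = False
            yield (t, "CLEAR")          # switch back: watch for the upper threshold Tg+ again

load = [40, 55, 82, 90, 75, 65, 58, 85]
print(list(detect_crossings(load)))     # [(2, 'RAISE'), (6, 'CLEAR'), (7, 'RAISE')]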
For group size estimation, 4WARD uses NATO!. Specifically, NATO! supports network monitoring (estimating the number of nodes that are in a bad state) and QoS management (estimating the number of nodes that can accommodate certain QoS parameters).
Measurements for topology discovery are performed using the H&S algorithm (see the algorithms section above). This is essential to ensure the existence of relevant and sufficient information for self-adaptive decision-making processes.
4WARD opts to use random walk algorithms for data search in Future Internet networks. Randomized techniques prove useful in the construction and exploration of self-organizing networks when the maintenance of search tables and structured indices is expensive.
Moreover, 4WARD proposes variations like multiple random walks in parallel and biased
random walks using partial routing information.
Finally, 4WARD has developed a distributed approach to adaptive anomaly detection and collaborative fault localization. The statistical method used is based on parameter estimation of Gamma distributions obtained by measuring probe response delays. The idea is to learn the expected probe response delay and packet drop rate of each connection in each node, and to use them for parameter adaptation such that the manual configuration effort is minimized. Instead of specifying algorithm parameters as time intervals and specific thresholds for when traffic deviations should be considered anomalous, manual parameter settings are here specified either as a cost or as a fraction of the expected probe response delay.
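A minimal sketch of this idea, assuming a simple method-of-moments fit, is given below: the Gamma parameters of the probe response delay are learned per connection, and a delay far above the learned expectation is flagged as anomalous. The fitting method, the threshold factor and the sample values are illustrative and are not taken from the 4WARD algorithm.

import statistics

def fit_gamma(delays):
    """Method-of-moments estimate: shape k = m^2/v, scale theta = v/m."""
    m = statistics.mean(delays)
    v = statistics.pvariance(delays)
    return m * m / v, v / m

def is_anomalous(delay, k, theta, factor=3.0):
    """Flag delays more than `factor` standard deviations above the fitted Gamma mean."""
    mean = k * theta                    # Gamma mean
    std = (k ** 0.5) * theta            # Gamma standard deviation
    return delay > mean + factor * std

history = [10.2, 11.0, 9.8, 10.5, 12.1, 10.9, 11.4, 10.1]   # learned probe delays in ms
k, theta = fit_gamma(history)
print(is_anomalous(11.5, k, theta), is_anomalous(25.0, k, theta))   # False True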
2.12 Comparison of Current Monitoring Frameworks for Virtualized Platforms
In the above subsections we presented methods and tools deployed in current virtualized
platforms. Appendix A provides a comparison of types of measurements, common tools
used, monitoring metrics, data export formats, measurement granularity, etc. for the
aforementioned platforms.
3. NETWORK MEASUREMENT INFRASTRUCTURES
3.1 ETOMIC
Obtaining an accurate picture of what is happening inside the Internet can be particularly
challenging using end-to-end measurements. Being able to make high-precision delay
measurements of probe packets is essential for creating much clearer pictures.
The European Traffic Observatory Measurement Infrastructure [ETOMIC] is a network traffic
measurement platform with high precision GPS-synchronized nodes. The infrastructure is
publicly available to the network research community, supporting advanced experimental
techniques by providing high precision hardware equipment and a Central Management
System. Researchers can deploy their own active measurement codes to perform
experiments on the public Internet. ETOMIC is distributed throughout Europe and is able to
carry out active measurements between OneLab’s specially-designed ANME boxes. These
boxes are globally synchronized to a high temporal resolution (~10 nanoseconds).
Figure 37: Network delay tomography, showing delay estimates between different
sites on the ETOMIC testbed.
ETOMIC allows users to infer network topology and discover its specific characteristics, such
as delays and available bandwidths. It also provides users with a dynamic, high-resolution,
and geographically-distributed picture of global network traffic. This data can then be
analyzed using methods including those published in scientific journals by the researchers
working on ETOMIC.
The ETOMIC Central Management System (CMS) also frequently invokes available nodes to carry out trivial experiments against each other, such as traceroute [TRACERT, TR-PARIS], ping [RFC2681], one-way delay [RFC2679] and bandwidth [HáDV07] measurements. These processes run in the background; when a user requests ETOMIC resources, these tasks are stopped. The results of these hidden periodic measurements are automatically stored in the CMS database, and users can graphically extract relevant information from it via dynamic forms on the ETOMIC GUI.
Figure 38: ETOMIC system architecture
Besides high precision and the capability of synchronization, ETOMIC aims at becoming a large-scale measurement platform. To meet the former requirement, older nodes are equipped with Endace Data Acquisition and Generation (DAG) cards [DAG] and newer nodes use ARGOS cards to provide nanosecond-scale time stamping of network packets. The second requirement is met by developments in the system kernel that federate and interconnect ETOMIC with a dedicated slice running on PlanetLab nodes.
Monitoring in ETOMIC is performed at three abstraction levels: 1) status monitoring of each node; 2) collection of statistics on the interconnecting network; and 3) activity monitoring of user processes. These abstraction levels are detailed in the following subsections.
Monitoring of a node
The system kernel running in the Central Management System (CMS) keeps track of the status of all nodes, which can be i) available; ii) busy; or iii) unreachable (i.e. down for maintenance). A node is available by default. Available nodes are regularly checked via ssh and are marked unreachable after a few failed attempts. A node becomes busy when it is instructed to carry out user experiments, which are blocking, and changes back to available after a measurement is over. The CMS regularly instructs available nodes to extract and return GPS satellite visibility statistics, which are stored in the CMS database; graphical output is delivered via the ETOMIC GUI server over HTTPS.
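A minimal sketch of such an availability check is given below: a node is probed over ssh and is marked unreachable only after a few consecutive failed attempts. The host name and the number of attempts are illustrative.

import subprocess

def ssh_alive(host: str, timeout: int = 5) -> bool:
    """Return True if a trivial command can be executed on the node over ssh."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=%d" % timeout, host, "true"],
        capture_output=True)
    return result.returncode == 0

def node_status(host: str, attempts: int = 3) -> str:
    failures = sum(0 if ssh_alive(host) else 1 for _ in range(attempts))
    return "unreachable" if failures == attempts else "available"

print(node_status("etomic-node.example.org"))   # hypothetical node name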
Network Monitoring
The CMS kernel frequently calls available nodes to carry out trivial experiments among themselves, such as traceroute, ping, one-way delay and bandwidth measurements. These processes run in the background; when a user requests ETOMIC resources, these tasks are killed and dropped. The results of these hidden periodic measurements are automatically stored in the CMS database, and users can graphically extract relevant information from it via dynamic forms of the ETOMIC GUI.
Figure 39: Extraction of network monitoring statistics
Process Monitoring
Accounting information is graphically available, such as the number of user logins in the
system, the number of experiments run and the amount of raw data downloaded daily.
Similarly, the aggregated durations of the busy periods of each node are shown on the statistics page of the ETOMIC GUI.
Figure 40: Time evolution of the number of experiments
The states of user experiments are recorded by the CMS kernel and can be: i) upload; ii) ready; iii) running; iv) finishing; v) result available; vi) no result; vii) canceled; or viii) deleted. The normal flow is of course i) → v). At any time users can interrupt their processes, which are then deleted. In the rare case that the system cannot complete the upload of the experiment script, or fails while launching the experiment, the process is marked canceled. A process is finally marked no result if it does not produce any output on the nodes' local file system.
The CMS provides a REST remote procedure to the nodes, which returns the list of active nodes that take part in the same running process as the caller.
3.2 HADES
HADES measurements
HADES is the second monitoring system used by FEDERICA. HADES measurements are currently also being performed by Erlangen/DFN within the German Research Network X-WiN, within GÉANT, within MDM and within the LHCOPN networks, using a worldwide pool of measurement stations equipped with GPS for time synchronization. By adding these measurement tools to the FEDERICA infrastructure as well, FEDERICA monitoring becomes compatible with monitoring frameworks in the US and in Japan.
Measurements obtained by HADES provide reference data for comparison with performance data collected within virtual slices or virtual infrastructures.
Monitoring of IP Performance Metrics with HADES
FEDERICA monitoring was expanded by the University of Erlangen/DFN with IPPM performance metrics (one-way delay, delay variation, packet loss, etc.). The tool used for these measurements is HADES (Hades Active Delay Evaluation System); see Figure 41.
Figure 41: Worldwide measurements using HADES
HADES [HADES, HoKa06] is a tool that provides performance measurements following the
IETF approach [RFC3393] and offers IP Performance Metrics (IPPM) such as one-way delay
(OWD) [RFC2679], IP Packet Delay Variation (IPDV/OWDV) [RFC3393] and packet loss
[RFC2680]. HADES is a monitoring tool that was designed for multi-domain delay evaluations
with continuous measurements and was not intended for troubleshooting or on-demand
tests.
HADES incorporates an active measurement process which allows layer 2 and layer 3 network measurements and produces test traffic that delivers the high-resolution network performance diagnostics listed above. Typically, the active measurement process generates bursts of UDP packets of a configurable length (e.g. 25 bytes) continuously throughout the day. The packets are sent with gaps of 10 ms in order to avoid any waiting times at the network card interfaces. Each packet contains a sequence number and a GPS-synchronized timestamp. The receiver timestamps the moment of arrival and uses the sequence number to determine packet loss.
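The measurement principle can be illustrated with the following self-contained Python sketch (not the actual HADES implementation): a sender emits a burst of small UDP packets carrying a sequence number and a send timestamp, spaced 10 ms apart, while the receiver timestamps arrivals, derives one-way delays from the two timestamps (meaningful only if both clocks are GPS-synchronized) and detects loss from missing sequence numbers. Packet size, burst length and the local address are arbitrary example values.

    import socket, struct, time

    PKT = struct.Struct("!Id")      # sequence number + send timestamp in seconds
    ADDR = ("127.0.0.1", 9999)      # illustrative endpoint, not a real HADES station

    def send_burst(n=9, gap=0.010):
        """Send n small UDP packets with a 10 ms gap between consecutive packets."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for seq in range(n):
            sock.sendto(PKT.pack(seq, time.time()), ADDR)
            time.sleep(gap)
        sock.close()

    def receive_burst(expected=9, timeout=2.0):
        """Timestamp arrivals, compute per-packet one-way delays and count losses."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(ADDR)
        sock.settimeout(timeout)
        seen, delays = set(), []
        try:
            while len(seen) < expected:
                data, _ = sock.recvfrom(64)
                t_recv = time.time()
                seq, t_send = PKT.unpack(data)
                seen.add(seq)
                delays.append(t_recv - t_send)   # valid only with synchronized clocks
        except socket.timeout:
            pass
        sock.close()
        return delays, expected - len(seen)      # one-way delays and lost packet count

Running receive_burst() in one process and send_burst() in another reproduces, in miniature, the loss and delay bookkeeping described above.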
As far as hardware is concerned, a HADES measurement box typically consists of a Linux-based workstation equipped with GPS for time synchronization. All measurement data is collected in a full mesh. HADES measurements are currently being collected by the DFN WiN-Labor within the German Research Network X-WiN, within GÉANT, within the LHCOPN network, within the FEDERICA infrastructure and within NRENs participating in the GÉANT2 MDM Prototype. Access to the measurement data is available through the perfSONAR-UI graphical interface [PerfSONAR].
In Project     Number of Locations     Monitored Links
X-WiN          57                      3500
GÉANT          36                      1200
MDM            23                      500
LHCOPN         10                      90
FEDERICA       6                       15
Table 4: HADES measurements in various research projects
HADES and Virtualization
Although HADES provides excellent measurements in real-time environments, these real-time measurement techniques cannot simply be transferred to virtual environments, as virtual environments typically lack the required time synchronization. Because virtualization leads to non-deterministic timing behaviour that makes real-time measurements inappropriate in this context, the WiN-Labor of the DFN will, as part of its work in NOVI, develop a new active measurement process for IPPM metrics that is tailored directly to the requirements of a virtual environment.
As the lack of time synchronization in a virtual environment prohibits time measurements of delay, other, more appropriate IPPM metrics, such as jitter, packet loss, reordering and throughput, will be investigated. To achieve user- and slice-related measurements, the VLAN-based active measurement process will focus on collecting measurements during the times when a user application process is actually active. The new measurement concept for virtual environments will be evaluated and compared to active measurements in real-time environments. For the collection and presentation of the measurement data, the already existing perfSONAR visualization tools will be used.
For the implementation, the FEDERICA-HADES slice that is currently used for IPPM measurements over the physical infrastructure of FEDERICA [FEDERICA] will be duplicated into a HADES-II slice, where each measurement node of the FEDERICA-HADES slice will be represented by a FEDERICA virtual machine with a HADES instance installed. The locations currently under consideration are the four core nodes of the FEDERICA network infrastructure as well as several non-core nodes at FCCN, RedIRIS and NTUA. By mapping the measurement process over the physical infrastructure exactly onto a second slice consisting only of virtual machines, the direct influence of virtualization on the measurement process can be investigated.
On the website, a user first chooses the node and the date of interest:
Figure 42: Current nodes of FEDERICA that can be monitored by HADES tools
The next figure shows the measurement links currently connected to the selected node (the CESNET node is shown in the figure below).
Figure 43: Website - HADES available measurement links for FEDERICA
The following screen (shown once the submit button is pressed) takes the user to a more detailed selection menu, where specific measurement metrics can be selected for each link.
Figure 44: Request of the specific measurements over each link
After clicking the submit button, the requested data is presented as statistical graphs. The metrics currently monitored are listed below:
- OWD: One-way Delay
- OWDV: One-way Delay Variation
- Loss statistics
- Hop count information
Figure 45: HADES statistics. RedIRIS - CESNET
The figure above shows an example of the four statistical graphs that can be requested for the FEDERICA link between RedIRIS and CESNET on November 18, 2010.
Access to HADES is available via its website (http://www.win-labor.dfn.de/cgi-bin/hades/select-new.pl?config=FEDERICA). Over time it can also serve as a long-term archive for research activities in FEDERICA II and NOVI. At present, the FEDERICA NOC is the beneficiary of the HADES monitoring statistics.
3.3 PerfSONAR
PerfSONAR (PERFormance focused Service Oriented Network monitoring ARchitecture) [PERFSONAR, TiBB09, HaBB05] is a framework for multi-domain network performance monitoring. The web-service-based infrastructure was developed by several international partners from Internet2, ESnet and the GÉANT2/GÉANT3 projects with the purpose of providing network administrators with easy access to cross-domain performance information and facilitating the management of advanced networks. As a special focus, perfSONAR offers IP performance metrics such as delay, loss or error rates, which describe the robustness of a network, indicate possible bottlenecks or identify rerouting after a hardware problem.
The architecture was designed to be well prepared for new metrics and extensions and to automate the processing of measurement data whenever possible. For this reason, perfSONAR is divided into three layers with different modules. On the so-called Measurement Point Layer, active or passive measurements are conducted by Measurement Points (MPs); there is a separate MP module for each network metric, such as packet loss or throughput. The second layer, the Service Layer, is used for the management of all measurement tasks, i.e. the Service Layer coordinates these tasks with specific web services. Typically, such a task would encompass the authorization of measurements or the archiving of data at the end of the measurement process. Communication between these web services is handled by exchanging clearly defined XML (Extensible Markup Language) documents. Finally, the third perfSONAR layer is the so-called User Interface Layer, which offers a set of tools for the visualization and presentation of the collected data. A screenshot of perfSONAR's user interface is depicted in Figure 46.
Figure 46: PerfSONAR User Interface
PerfSONAR and virtualization
A perfSONAR network performance measurement node can be easily deployed by using the pS Performance Toolkit [Perftoolkit]. However, it is not recommended at this time to use this toolkit on a virtual machine. The problem is that the measurement tools may not be accurate enough when hardware such as a network interface card is emulated, as the emulation process typically depends on a host clock. Three tools and services have been identified as particularly affected by interference in virtual environments:
- The NTP (Network Time Protocol) time synchronisation service of a computer is affected whenever host clocks are involved in the virtualization process. Host clocks are usually not as accurate as hardware clocks, and if the virtual machine receives clock signals from the host it is running on, the NTP service may suffer from high jitter and delay.
- OWAMP [OWAMP] (One-Way Active Measurement Protocol) is a tool that measures one-way delay and is heavily dependent on NTP. Without an accurate NTP service in a virtual environment, OWAMP cannot provide proper measurement data, since it requires hosts to provide precise timestamps upon packet arrival in order to determine one-way delay values between measurement points (a numerical sketch follows this list).
- BWCTL [BWCTL] (Bandwidth Test Controller) throughput measurements are also dependent on NTP and would not work reliably in a virtual environment where there is no accurate clock and tests cannot be scheduled at a precise time. Another problem for BWCTL tests in a virtual environment is the emulation of network interface cards: compared to hardware, a virtual network interface card does not yield the maximum available bandwidth, which limits throughput depending on how much access to the physical network card the host allows.
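To make the impact of missing clock synchronization concrete, the following small numerical sketch (with invented values) shows that any offset between the sender and receiver clocks enters a one-way delay result directly, whereas a round-trip measurement taken with a single clock is unaffected:

    # True one-way delay of 5 ms; the receiver clock runs 20 ms ahead of the sender.
    true_owd = 0.005           # seconds
    clock_offset = 0.020       # receiver clock minus sender clock, in seconds

    t_send = 100.000                           # departure time on the sender clock
    t_recv = t_send + true_owd + clock_offset  # arrival time on the receiver clock

    measured_owd = t_recv - t_send    # 0.025 s: the error equals the clock offset
    measured_rtt = 2 * true_owd       # a single-clock round-trip time has no such term

    print(measured_owd, measured_rtt)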
3.4 SONoMA
SONoMA provides users with monitoring information about the status of the measurement agents and about the state of measurement processes. Furthermore, system administrators are able to track the temporal behaviour of the whole system through status messages at various configurable debugging levels.
Monitoring the Measurement Agents
The central core of the SONoMA architecture, the Management Layer (ML), periodically queries the status of each Measurement Agent (MA) via its SOAP web-service interface. The service collects information about the availability of the MA's web service; the underlying hardware and the operating system running the service are not monitored for status or errors. If an MA returns an error message or does not respond at all, the ML marks it as DOWN. MAs marked DOWN are also checked regularly, and if their interfaces become available again they are marked UP. To speed up the recovery of lost agents, an MA may send a login request right after coming back up, or a logoff request before termination, and the ML will mark it UP or DOWN accordingly.
All of SONoMA's services, including status information requests, are accessed through a standard SOAP-based web service interface. The getNodeList() service returns a list of MAs, with each item containing status, address and type information. The caller may query a filtered list; the filter options are the MA type and status. SONoMA also provides a web-based front end where, amongst experiment requests, the status of the agents is listed in a table (see Figure 47).
Figure 47: SONoMA Measurement Agents Status Interface
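As a rough illustration of how such a SOAP interface might be queried from a script, the sketch below uses the Python zeep client; the WSDL location and the attribute names of the returned items are assumptions for illustration and must be taken from the actual SONoMA service description. Filtering is done locally here, although the service itself also accepts filter options.

    from zeep import Client

    # Hypothetical WSDL location; replace with the real SONoMA service description.
    WSDL_URL = "https://sonoma.example.org/services/sonoma.wsdl"

    def list_measurement_agents(status_filter=None, type_filter=None):
        """Fetch the MA list and optionally filter it by status or agent type."""
        client = Client(WSDL_URL)
        nodes = client.service.getNodeList()   # each item: status, address and type
        selected = []
        for node in nodes:
            if status_filter and node.status != status_filter:
                continue
            if type_filter and node.type != type_filter:
                continue
            selected.append((node.address, node.type, node.status))
        return selected

    # Example: print all agents currently marked UP.
    # for address, ma_type, status in list_measurement_agents(status_filter="UP"):
    #     print(address, ma_type, status)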
Monitoring the processes
Monitoring long-running experiments is crucial from a user's point of view, since large-scale distributed network measurements can take tens of minutes. The getProcessInfo() web service procedure returns the current status of a given measurement process; the response contains:
- The expected duration of the measurement
- The status of the measurement: RUNNING or FINISHED
- The number of MAs still busy with the process
- The number of MAs already done with their tasks
- The number of MAs lost during experiment launch.
Based on this information, the user may decide to terminate the measurement process early in order to release resources and gather partial measurement logs.
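A minimal polling loop built on this information could look like the sketch below; get_process_info() is a placeholder for the actual getProcessInfo() call, and the field names and abort threshold are illustrative assumptions rather than part of the SONoMA interface.

    import time

    def get_process_info(process_id):
        """Placeholder for the getProcessInfo() web service call; assumed to return a
        dict with the fields listed above (duration, status, busy, done, lost)."""
        raise NotImplementedError

    def wait_or_abort(process_id, max_lost_fraction=0.3, poll_interval=30):
        """Poll a measurement process and decide whether to wait or to stop it early."""
        while True:
            info = get_process_info(process_id)
            if info["status"] == "FINISHED":
                return "finished"
            total = info["busy"] + info["done"] + info["lost"]
            if total and info["lost"] / total > max_lost_fraction:
                # Too many agents lost at launch: stop early and keep partial logs.
                return "terminate_early"
            time.sleep(poll_interval)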
System Monitoring Interface
SONoMA logs the operation of the system at various levels:
- Each MA logs the invocation and behaviour of its services in the local file system.
- The ML logs user requests, the status of their measurement processes, information on database transactions and the status of local temporary caching, including informational messages, warnings and errors.
This logging provides a powerful means to locate and identify the root cause of problems. SONoMA system administrators can access these logs through a command-line interface.
3.5 PHOSPHORUS
The PHOSPHORUS project [PHOSPHORUS] focuses on integrating optical networks and e-infrastructure computing resources. It addresses the problems of on-demand service delivery and of Grid extensions to telecommunications standards like GMPLS. A dedicated test-bed interconnecting European and worldwide optical infrastructures has been developed within the PHOSPHORUS project (see Figure 48). It involves European NRENs and national test-beds, as well as the GÉANT2, Cross Border Dark Fiber and GLIF infrastructures.
Figure 48: PHOSPHORUS test-bed architecture overview.
The PHOSPHORUS monitoring system consists of two parts: the perfSONAR SQL MA service [PERFSONAR] and clients. Clients collect information about monitored links and send it to the service. Users access web pages to see the visualization produced by the monitoring system. The architecture is depicted in Figure 49.
Figure 49: PHOSPHORUS monitoring system architecture.
The perfSONAR SQL Measurement Archive (SQL MA) service is responsible for storing the link data collected by the measurement tools. A relational database is used as the persistence layer for the service.
Figure 50: PHOSPHORUS monitoring visualization.
Each PHOSPHORUS node has a machine responsible for measuring the Round Trip Time of the node's links. The results are sent to the perfSONAR service. Users can view the monitoring visualization in a web browser (see Figure 50).
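The client side can be pictured with the following sketch, which measures RTT with the system ping tool and stores the sample in a local SQLite table; the parsing assumes the Linux ping summary line, and the table layout is an illustrative stand-in for the perfSONAR SQL MA schema rather than the real one.

    import re, sqlite3, subprocess, time

    def measure_rtt(host):
        """Return the average RTT in ms from three ping probes, or None on failure."""
        out = subprocess.run(["ping", "-c", "3", host],
                             capture_output=True, text=True).stdout
        match = re.search(r"= [\d.]+/([\d.]+)/", out)   # min/avg/max/mdev summary line
        return float(match.group(1)) if match else None

    def store_rtt(db_path, link, rtt_ms):
        """Insert one RTT sample; the schema is invented for this example."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS rtt (ts REAL, link TEXT, rtt_ms REAL)")
        con.execute("INSERT INTO rtt VALUES (?, ?, ?)", (time.time(), link, rtt_ms))
        con.commit()
        con.close()

    # Example: rtt = measure_rtt("example.org"); store_rtt("rtt.db", "nodeA-nodeB", rtt)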
4. DISCUSSION ON NOVI’S MONITORING
4.1 Challenges on NOVI’s Monitoring
Monitoring a complex infrastructure poses a number of interesting challenges: a) to cater for the federation of the FEDERICA and PlanetLab Europe testbeds, a monitoring solution adopted within NOVI will have to be able to work in a federated environment; b) since NOVI's resources are drawn from a virtualized environment, it is a challenge to implement federated monitoring under the special constraints imposed by virtualization processes, taking into consideration their timing and synchronization issues.
A third challenge for NOVI monitoring is that the definition of user slices and slivers entails that monitoring should not be limited to physical infrastructures or virtualized topologies as a whole, but should offer measurements specific to individual virtual slices or slivers. As a consequence, monitoring many such virtual slices may also require careful scheduling of measurements affecting the same physical nodes, especially for load, throughput or bandwidth measurements that could affect the availability of a physical resource shared by several virtual slices.
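One simple way to limit such interference is to serialize intrusive measurements that touch the same physical node; the greedy scheduler below is only a sketch of this idea, with slice names, node sets and slot length chosen purely for illustration.

    def schedule_measurements(requests, slot_length=60):
        """Greedily assign each measurement request to the earliest time slot in which
        none of its physical nodes is already used by an intrusive measurement.
        requests: list of (slice_id, set_of_physical_nodes)."""
        slots = []   # slots[i] = set of physical nodes already busy in slot i
        plan = []
        for slice_id, nodes in requests:
            i = 0
            while i < len(slots) and slots[i] & nodes:
                i += 1
            if i == len(slots):
                slots.append(set())
            slots[i] |= nodes
            plan.append((slice_id, i * slot_length))   # start offset in seconds
        return plan

    # Example: two slices sharing node "p2" get different slots; a third slice runs in slot 0.
    print(schedule_measurements([
        ("sliceA", {"p1", "p2"}),
        ("sliceB", {"p2", "p3"}),
        ("sliceC", {"p4"}),
    ]))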
Communities such as the European Grid Infrastructure [EGI] and the DANTE-Internet2-CANARIE-ESnet collaboration [DICE] have already started to work on the general issues of providing monitoring services to multi-domain environments, without a special focus on virtualization issues. These communities have realized that there are many tools and monitoring services available, but that for a federated structure there should be a specific selection of tools that is widely available and is also considered by the affiliated user community to be a suitable monitoring service with desirable measurement metrics. In 2004, the Open Grid Forum's [OGF] Network Measurements Working Group (NM-WG) already investigated the network characteristics relevant to Grid computing and various measurement methodologies. This work resulted in the Recommendation [GFD-R.023], which then served as a basis for the perfSONAR framework. The typical Grid user may not always want to access all available tools within the perfSONAR suite, but appreciates a set of simple distributed services which provide control and management functions concerning alarms, authentication and authorization services, archiving of monitoring data, etc. [BiLŁ07]. The DICE collaboration also considers the perfSONAR architecture as a solution to provide access to cross-domain performance information along with the necessary automation of measurement processes. In 2010, DICE started the “DICE Service Challenge”, which focuses on how current monitoring architectures, such as perfSONAR, can be extended to define a set of inter-domain services that the primary DICE partners and NRENs can provide to their user communities. In this context, DICE proposed an Inter-domain Control Plane [VoSK07] and focuses on measurement services including diagnostic tools, circuit monitoring, path characterization services and measurement portals [BrMe10].
NOVI will have to adopt a similar approach, both to provide a common framework for management and control functions and to determine which measurement tools and metrics are most suitable and required in a virtualized environment. The virtualization aspect of NOVI will have a strong impact on which metrics and tools will be part of the NOVI monitoring framework. Measurement toolkits that are available today, such as
the pS Performance Toolkit for perfSONAR, are not recommended for use on a Virtual Machine (VM), since the measurement tools may not be accurate enough when hardware such as network interface cards is emulated and host clocks are involved. Within this context, the HADES monitoring system, which is part of the perfSONAR suite, will be investigated to check its suitability for NOVI. Special focus will be given to the extension of HADES for a virtualized environment to provide IP Performance Metrics (IPPM). HADES is a monitoring tool that was originally designed for multi-domain delay evaluation with continuous active measurements. Although HADES provides accurate measurements in real-time environments, these measurement techniques cannot be directly applied within virtual environments, as they typically lack the required time accuracy and continuity. As the emulation of a real-time clock in a virtual environment prohibits highly accurate time measurements of packet delay, the focus will have to shift to other, more appropriate IPPM metrics, such as IP Packet Delay Variation (IPDV), One-way Packet Loss, packet reordering and throughput.
A preliminary conceptual view of NOVI's monitoring architecture is illustrated in Figure 51. The architecture should cater for two types of users:
- Experimenters, who need to obtain information about their slices;
- Operators, who need to obtain information about the physical and virtual resources within the substrate.
NOVI's monitoring architecture will employ several monitoring tools deployed across heterogeneous virtualized infrastructures. Our focus is on the monitoring tools implemented within the PlanetLab Europe and FEDERICA platforms.
Figure 51: A preliminary conceptual view of NOVI's Monitoring architecture
One important aspect of monitoring in a federated environment is the correct configuration
and control of monitoring data export (from tools) and exchange (between different
platforms).
Monitoring data export from the monitoring tools or appliances can be done using common techniques based on existing protocols designed for that purpose (e.g. SNMP, IPFIX) or using proprietary solutions (e.g. direct export to a MySQL database). The IP Flow Information eXport (IPFIX) protocol is a very efficient and flexible way to implement push-style bulk data export from tools or traffic probes, e.g. of flow records, QoS values or intrusion notifications. With SNMP, bulk export is implemented by an SNMP agent, which is usually queried for data, thus realizing pull-style data export. If needed, SNMP traps can extend such a setup with push-style data export, e.g. for exporting intrusion alerts. Different export methods may co-exist within a single platform. However, results should be made available across different platforms in a well-defined way, so that they can be correlated. For example, bandwidth information from monitoring tool A1 (see Figure 51 above) should be exported in a data format that is compatible with the bandwidth information provided by tool B1.
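As a toy illustration of such harmonization, the sketch below maps the native outputs of two hypothetical tools, A1 and B1, onto one common bandwidth record; the field names and units are invented for the example and do not correspond to any specific tool.

    def from_tool_a1(record):
        """Hypothetical tool A1 reports bits per second under 'bw_bps'."""
        return {"link": record["link"], "timestamp": record["ts"],
                "bandwidth_mbps": record["bw_bps"] / 1e6, "source": "A1"}

    def from_tool_b1(record):
        """Hypothetical tool B1 reports megabytes per second under 'throughput_MBps'."""
        return {"link": record["link"], "timestamp": record["time"],
                "bandwidth_mbps": record["throughput_MBps"] * 8, "source": "B1"}

    # Both records can now be compared or correlated directly.
    a = from_tool_a1({"link": "vnode1-vnode2", "ts": 1298900000, "bw_bps": 9.4e8})
    b = from_tool_b1({"link": "vnode2-vnode3", "time": 1298900010, "throughput_MBps": 112.0})
    print(a["bandwidth_mbps"], b["bandwidth_mbps"])   # 940.0 and 896.0 Mbit/s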
From the experimenter's point of view, in a federated environment where two or more heterogeneous platforms need to cooperate, e.g. to observe the QoS in a cross-platform user slice, special care needs to be taken regarding how monitoring data is accessed and transferred. The following properties need to be taken into account:
1) Identity management and authentication:
A slice user must be identifiable in both platforms.
2) Policies and authorization:
When booking a cross-platform experimental slice, a user must also receive permission to access the monitoring data relevant to the slice's nodes in both platforms.
3) Data conversion / adaptation:
Monitoring data may need to be converted in order to harmonize data formats and semantics, e.g. when different delay monitoring tools are used in different platforms.
4) Data transfer:
Monitoring data must either be made accessible to the slice user from all associated platforms, or be transferred into the user's access platform so that it can be accessed from a single point.
In NOVI, slices span multiple platforms. Therefore, the above challenges must be addressed by all platforms within the NOVI federation. This can be achieved either in a peer-to-peer or in a hierarchical way. In both cases, the slice user must be provided with the monitoring information associated with the slivers contained in the slice.
4.2 NOVI’s Monitoring Requirements
The virtualization aspect of NOVI will have a strong impact on which metrics and tools will be part of the NOVI monitoring framework. Accordingly, the metrics that need to be monitored in NOVI's federated environment can be grouped into the following categories:
- Monitoring of Networking Devices (Routers/Switches)
- Monitoring of Physical Machines (Hosts/Hypervisors)
- Monitoring of Virtual Machines
- Monitoring of the Network
NOVI’s monitoring should cater for different categories of “users”. Firstly, NOVI should
support the network operators who need to have a view of the network elements’ status
(e.g. traffic/flow values, interface values, etc.). Secondly, NOVI should provide monitoring
data to the experimenters.
Several metrics related to network devices (i.e. physical routers/switches, logical routers) and computing elements (i.e. hosts, slivers) can potentially be measured in heterogeneous virtualized environments. Therefore, NOVI will investigate and decide which metrics are appropriate for a federated virtualized environment. This set of metrics should ideally be compatible among all platforms of the federation (e.g. PlanetLab Europe and FEDERICA). Otherwise, a proper transformation or adaptation of the measurement values and their units should be provided. Usually, min, max and average values should be recorded over a period of time. Moreover, percentiles over a set of values of a metric allow very useful conclusions to be drawn about its behaviour. Some important examples of the categories of potential metrics are illustrated below, while a detailed list of metrics is available in Appendix B.
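For instance, min, max, average and percentile values over a window of samples can be derived as in the sketch below (the sample values are invented):

    from statistics import mean, quantiles

    def summarize(samples):
        """Return min, max, mean and the 95th percentile of a metric time series."""
        p95 = quantiles(samples, n=100)[94]      # 95th percentile cut point
        return {"min": min(samples), "max": max(samples),
                "avg": mean(samples), "p95": p95}

    # One-way delay samples in milliseconds over a 5-minute window (invented values).
    owd_ms = [4.9, 5.1, 5.0, 5.3, 4.8, 9.7, 5.2, 5.0, 5.1, 6.4]
    print(summarize(owd_ms))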
Metrics for Network Devices (Routers/Logical Routers) are usually related to:
- Overall System Metrics (availability and performance)
- Link, network and transport layer metrics (commonly via SNMP)
- Per interface metrics (configuration and performance values)
Metrics for Computing Elements are usually related to:
- Physical machine metrics (configuration and performance of the host and its hypervisor)
- Virtual machine metrics (configuration and performance)
In addition, the network performance of both the physical and the virtual topology (slices) should be thoroughly measured by active monitoring tools. Examples of per-slice metrics for network performance monitoring are One Way Delay, One Way Delay Variation, Round Trip Delay and Packet Loss. Bearing in mind that a NOVI slice spans different platforms, the monitoring system needs to access several devices and combine their results in order to provide accurate values for the NOVI metrics.
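For additive or multiplicative metrics this combination is straightforward, as the sketch below illustrates for a slice path crossing two platforms; the segment values are invented, and losses are combined under an independence assumption.

    def combine_segments(segments):
        """Combine per-segment results (one per platform) into end-to-end values.
        Each segment is a dict with one-way delay (ms) and loss ratio (0..1)."""
        owd = sum(s["owd_ms"] for s in segments)            # delays add along the path
        delivered = 1.0
        for s in segments:
            delivered *= (1.0 - s["loss"])                  # assumes independent losses
        return {"owd_ms": owd, "loss": 1.0 - delivered}

    # A slice spanning a PlanetLab Europe segment and a FEDERICA segment (invented values).
    print(combine_segments([
        {"owd_ms": 12.4, "loss": 0.001},
        {"owd_ms": 7.9,  "loss": 0.002},
    ]))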
5. CONCLUSIONS
This document has presented an overview of the state of the art of monitoring in experimental facilities for Future Internet research. It described current approaches for monitoring virtualized infrastructures developed under the EC FIRE and the US NSF GENI initiatives. The main focus is on the FEDERICA and the PlanetLab Europe virtualized infrastructures. Appendix C provides a listing of tools and related metrics for these two platforms.
In addition, we presented other measurement infrastructures, such as the HADES and ETOMIC active measurement infrastructures, as well as monitoring environments that could potentially be deployed and used in virtualized infrastructures. Examples of such environments include PerfSONAR, SONoMA and PHOSPHORUS. The monitoring approaches, tools and metrics are described and compared, illustrated by user-interface snapshots that highlight the diversity of solutions already in use.
This document provides the basis for decisions in NOVI regarding monitoring and measurement methods and tools in federated environments. Key NOVI requirements include the following:
- Identify the metrics that will be measured by NOVI's monitoring framework
- Provide authentication mechanisms for monitoring operations across different platforms
- Define possible ways of making cross-platform monitoring results available to the slice user
- Adapt monitoring methodologies in order to ensure that measurements within a virtualized environment remain highly accurate
- Exchange configuration and monitoring data in a generic way to ensure interoperability and compatibility of the data amongst heterogeneous platforms.
Concluding this deliverable, we observe that a multitude of architectural and technical challenges exists for monitoring within a federated virtualized environment. This work provides the groundwork for the future tasks in WP3 on designing and building a monitoring approach that takes virtualization aspects into account and is aligned with the control and management approaches within NOVI's federated environment.
6. REFERENCES
[4WARD]
4WARD project page - http://www.4ward-project.eu/
[4WARD-D4.3]
4WARD, D-4.3 In-network management design PU
[BEN]
BEN Testbed - https://ben.renci.org
[BiLŁ07]
Binczewski A., Lawenda M., Łapacz R., Trocha Sz., "Application of
perfSONAR architecture in support of GRID monitoring", In Proceedings
of INGRID 2007: Grid Enabled Remote Instrumentation in Signals and
Communication Technology, ISBN 978-0-387-09662-9, pp. 447-454,
Springer US, 2009.
[BrMe10]
Brian Tierney, Joe Metzger, “Experience Deploying a Performance
Diagnostic Service on ESnet”, TERENA Networking Conference (TNC)
2010, Vilnius, June 2, 2010, available from:
tnc2010.terena.org/core/getfile.php?file_id=290
[BWCTL]
http://www.internet2.edu/performance/bwctl/
[Cacti]
Cacti toolkit - http://www.cacti.net/
[Chase09]
J. Chase. ORCA Control Framework Architecture and Internals.
Technical Report 2009. https://geni-orca.renci.org/trac/raw-attachment/wiki/WikiStart/ORCA%20Book.pdf
[Chopstix]
Sapan Bhatia, Abhishek Kumar, Marc E. Fiuczynski and Larry Peterson,
“Lightweight, High-Resolution Monitoring for Troubleshooting Production
Systems”, in Proceedings of OSDI '08
[CoDeeN]
http://codeen.cs.princeton.edu/
[CoMon]
KyoungSoo Park, and Vivek S. Pai. CoMon: A Mostly-Scalable Monitoring
System for PlanetLab. In Proceedings of ACM SIGOPS Operating Systems
Review, Volume 40, Issue 1, January 2006
[CoTest]
CoTest tool - http://codeen.cs.princeton.edu/cotest/
[CoTop]
CoTop tool - http://codeen.cs.princeton.edu/cotop/
[CoVisualize]
CoVisualize tool - http://codeen.cs.princeton.edu/covisualize/
[D3.1]
NOVI Deliverable D3.1 – State-of-the-Art of Management Planes
[DAG]
DAG capture cards - http://www.endace.com/endace-dag-high-speed-packet-capture-cards.html
[DICE]
http://www.geant2.net/server/show/conWebDoc.1308
[DIMES]
DIMES project - http://www.netdimes.org/
[DSA1.3-FED]
DSA1.3 – FEDERICA Infrastructure.
[EGI]
The European Grid Infrastructure - http://www.egi.eu/
[EmulabMap]
ProtoGENI flash based map tool - https://www.protogeni.net/trac/protogeni/wiki/MapInterface
[ETOMIC]
ETOMIC Project - http://www.etomic.org/
[Expedient]
Expedient - http://yuba.stanford.edu/~jnaous/expedient/
[FEDERICA]
FEDERICA TWiki Page - https://wiki.fp7-FEDERICA.eu/bin/view
[G3System]
http://www.cesnet.cz/doc/techzpravy/2008/g3-architecture/
http://www.cesnet.cz/doc/techzpravy/2005/g3/
http://www.cesnet.cz/doc/techzpravy/2007/g3-reporter/
[Ganglia]
Matthew L. Massie, Brent N. Chun, and David E. Culler. The ganglia
distributed monitoring system: design, implementation, and experience.
In Proceedings of Parallel Computing, Volume 30, Issue 7, July 2004,
Pages 817-840
Ganglia Monitoring System - http://ganglia.sourceforge.net/
[GFD-R.023]
Global Grid Forum, “GFD-R-P.023: A Hierarchy of Network Performance Characteristics for Grid Applications and Services”, available from: http://www.cs.wm.edu/~lowekamp/papers/GFD-R.023.pdf
[HaBB05]
Hanemann, A., Boote, J. W., Boyd, E. L., Durand, J., Kudarimoti, L., Lapacz,
R., Swany, D. M., Zurawski, J., Trocha, S., "PerfSONAR: A Service Oriented
Architecture for Multi-Domain Network Monitoring", In "Proceedings of
the Third International Conference on Service Oriented Computing",
Springer Verlag, LNCS 3826, pp. 241-254, ACM Sigsoft and Sigweb,
Amsterdam, The Netherlands, December, 2005, available from
http://www.springerlink.com/content/f166672085772143/.
[HADES]
HADES Active Delay Evaluation System, http://www.win-labor.dfn.de/.
[HáDV07]
"Granular model of packet pair separation in Poissonian traffic", P. Hága,
K. Diriczi, G. Vattay, I. Csabai, Computer Networks, Volume 51, Issue 3 ,
21 February 2007
[HoKa06]
P. Holleczek, R. Karch, et al, “Statistical Characteristics of Active IP One
Way Delay Measurements”, International conference on Networking and
Services (ICNS'06), Silicon Valley, California, USA, 16-18 July, 2006.
[IPPM-FED]
IPPM measurements over the FEDERICA infrastructure, http://www.win-labor.dfn.de/cgi-bin/hades/map.pl?config=FEDERICA;map=star.
[LAMP]
http://groups.geni.net/geni/wiki/LAMP
[MaCC04]
M. L. Massie, B. N. Chun, and D. E. Culler. The ganglia distributed
monitoring system: design, implementation, and experience. Parallel
Computing 30, 817, 2004.
[MADwifi]
MADwifi Project - http://madwifi-project.org/
[MyOps]
Stephen Soltesz, Marc Fiuczynski, and Larry Peterson. MyOps: A
Monitoring and Management Framework for PlanetLab Deployments.
[MyPLC]
https://www.planet-lab.eu/monitor/
https://svn.planet-lab.org/wiki/MyPLCUserGuide
[Nagios]
Nagios tool - http://www.nagios.org/
[OFA]
Open Federation Alliance - http://www.wisebed.eu/ofa/
[OGF]
Open Grid Forum,
http://www.ogf.org/gf/page.php?page=Standards::InfrastructureArea
[OMF]
OMF testbed - http://mytestbed.net/
[OML]
M. Singh, M. Ott, I. Seskar, P. Kamat, “ORBIT Measurements framework and library (OML): motivations, implementation and features”, First International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities (Tridentcom), Feb. 2005
[Onelab]
Onelab project - http://onelab.eu/
[ORBIT]
ORBIT Testbed - http://www.orbit-lab.org/
[ORCA]
ORCA - https://geni-orca.renci.org
[ORCABEN]
ORCABEN - http://groups.geni.net/geni/wiki/ORCABEN
[OWAMP]
http://www.internet2.edu/performance/owamp/
[Panlab]
Website of Panlab and PII European projects, supported by the
European Commission in its both framework programmes FP6
(2001-2006) and FP7 (2007-2013): http://www.panlab.net
[PERFSONAR]
PerfSONAR tool - http://www.perfSONAR.net/
[Perftoolkit]
http://www.internet2.edu/performance/toolkit/
[PHOSPHORUS]
PHOSPHORUS project page http://www.PHOSPHORUS.pl/
[PlanetFlow]
http://www.cs.princeton.edu/~sapanb/planetflow2/
[Planetlab]
Planet-lab Facility - http://www.planet-lab.org/
[PLuSH]
http://plush.cs.williams.edu/
[ProtoCred]
ProtoGENI credential format http://www.protogeni.net/trac/protogeni/wiki/Credentials
[ProtoDeleg]
ProtoGENI delegated credential example http://www.protogeni.net/trac/protogeni/wiki/DelegationExample
[PROTOGENI]
ProtoGENI Control Plane: www.protogeni.net
[ProtoGENI]
http://www.protogeni.net/trac/protogeni
[PsEPR]
http://psepr.org/
[Quagga]
http://www.quagga.net/
[RFC2679]
IETF Network Working Group Request for Comments 2679, "A One-way Delay Metric for IPPM", G. Almes, S. Kalidindi, M. Zekauskas, September
1999.
[RFC2680]
IETF Network Working Group Request for Comments 2680, "A One-way
Packet Loss Metric for IPPM", G. Almes, S. Kalidindi, M. Zekauskas,
September 1999.
[RFC2681]
"A Round-trip Delay Metric for IPPM", G. Almes, S. Kalidindi, M.
Zekauskas, September 1999.
[RFC3393]
IETF Network Working Group Request for Comments 3393, "IP Packet
Delay Variation Metric for IP Performance Metrics (IPPM)", C. Demichelis,
P. Chimento, November 2002.
[RRDTool]
http://www.mrtg.org/rrdtool/
[S3]
http://www.planet-lab.org/status
[ScriptRoute]
N. Spring, D. Wetherall, and T. Anderson, “Scriptroute: A public Internet
measurement facility”, in Proc. 4th USENIX Symposium on Internet
Technologies and Systems (USITS'03), Seattle, Washington, USA, March
2003.
[Shibboleth]
Shibboleth - an Internet2 middleware project.
http://shibboleth.internet2.edu
[Slicestat]
http://codeen.cs.princeton.edu/slicestat
[SWORD]
http://sword.cs.williams.edu/
[Teagle]
Teagle website: www.fire-teagle.org
[TiBB09]
Brian Tierney, Jeff Boote, Eric Boyd, Aaron Brown, Maxim Grigoriev, Joe
Metzger, Martin Swany, Matt Zekauskas, Yee-Ting Li, and Jason Zurawski,
Instantiating a Global Network Measurement Framework, LBNL Technical
Report LBNL-1452E, January 2009, available from
http://acs.lbl.gov/~tierney/papers/perfSONAR-LBNL-report.pdf
[TRACERT]
Traceroute tool - ftp://ftp.ee.lbl.gov/traceroute.tar.gz
[TR-PARIS]
Paris traceroute tool - http://www.paris-traceroute.net/
[VINI]
http://www.vini-veritas.net/documentation/userguide
[VI-Toolkit]
see FEDERICA deliverable “DSA1.3 FEDERICA Infrastructure”
[VMware-api]
VMware APIs and SDKs Documentation,
http://www.vmware.com/support/pubs/sdk_pubs.html
[VMware-cmd]
vmware-cmd command line tool manual, http://www.vmware.com/support/developer/vcli/vcli41/doc/reference/vmware-cmd.html
[VoSK07]
John Vollbrecht, Afrodite Sevasti, Radek Krzywania, “DICE Interdomain
Control Plane”, GLIF, September 2007, available from:
http://www.glif.is/meetings/2007/joint-session/vollbrecht-dice.pdf
[VServer]
Linux VServer project - http://linux-vserver.org/
[Wahle11]
Sebastian Wahle et al. Emerging testing trends and the Panlab enabling
infrastructure. IEEE Communications Magazine: Network Testing Series.
Accepted for publication in March 2011.
[WaMG10]
Sebastian Wahle, Thomas Magedanz, and Anastasius Gavras.
Towards the Future Internet - Emerging Trends from European
Research, chapter Conceptual Design and Use Cases for a FIRE
Resource Federation Framework, pages 51-62. IOS Press, April
2010. ISBN: 978-1-60750-538-9 (print), 978-1-60750-539-6 (online)
[WISEBED]
WISEBED Project – http://www.wisebed.eu
[WiseML]
WiseML schema document http://dutigw.st.ewi.tudelft.nl/wiseml/wiseml_schema.pdf
[WSDL]
http://www.w3.org/TR/wsdl.html
[YuSI07]
A. Yumerefendi, P. Shivam, D. Irwin, P. Gunda, L. Grit, A. Demberel, J.
Chase, and S. Babu. Towards an Autonomic Computing Testbed. In
Workshop on Hot Topics in Autonomic Computing (HotAC), June 2007.
APPENDIX A: COMPARISON TABLE OF CURRENT MONITORING FRAMEWORKS
Name/Platform FEDERICA/HADES
FEDERICA/G3
Onelab
Wisebed
Planetlab
(PLE/PLC)
Types of
active/
Measurements continuous
passive/ continuous active and passive active and passive active and passive
Common Tools HADES
used
SNMP, IGMP,
Nagios
Monitoring
Metrics
Data Export
formats
passive/
continuous
passive/active
-
OWD, topology,
bandwidth
topology, sensors' VServer stats,
measurements
Slice Stats (node
and slice view)
project partners
users, testbed
operator
users, project
slice users,
system
partners' (under PL administrators administrators
user's permission)
, users
GUI/web
GUI/web
GUI/web, ssh
GUI/web
ssh to VServer for
GUI/web,
node stats,
command line
CoMon web i/f for
slice & node stats
html, csv on
request
static html reports IPFIX, csv
WiseML format
file
custom XML
Recipients of project partners
measured data
Control
methods
ORBIT
ETOMIC, packet
tracking, DIMES
OWD, OWDV, loss CPU, memory,
bandwidth, slice
usage
Measurements 10ms
Granularity
collected in 5
different,
500ms for sensor's
minutes intervals; hardware support output, 30s for
charts with
topology
week/24 hour scale
CoMon , MyOps
ORCA/BEN
5mins for passive
monitoring data,
time units vary for
active metrics
Ganglia
active and
passive
GpENI
passive
OMF measurement emulab
NAGIOS, CACTI
library (OML)
software,
netFPGA cards
CPU, memory,
disk, network
traffic
user-defined
VINI
4WARD
active and passive active, continuous
same as PlanetLab generic algorithms
topology,
standard
measurement
and capture
tools
Nodes, services and
same as PlanetLab
switches (servers,
devices, status,
information about VM)
users
users
slice users,
users
PL administrators
SQL database
GUI/web
GUI/web
ssh to VServer for GUI (prototype)
node stats, CoMon
web i/f for slice &
node stats
XDR
standard
capture
formats
users
XDR over UDP,
XML over TCP
metric
dependent
ProtoGENI
-
custom XML
network-wide
metrics, data
search, anomaly
detection
OSGi framework
(prototype)
metric dependent granularity of Nagios and Cacti
5mins for passive the used
allow1min grid, but it is monitoring data,
standard tools configurable.
time units vary for
active metrics
APPENDIX B: POTENTIAL METRICS IN A VIRTUALIZED
ENVIRONMENT
Values for Network Devices (Routers/Logical Routers) can be:
- System
  o Availability (overall)
  o ICMP echo packet loss
  o ICMP echo Round Trip Time
  o Measurement time step
  o Time Spent (time spent with measurement, total time spent)
- On IP layer (network layer)
  o IP input traffic (received datagrams, locally delivered datagrams)
  o IP output traffic (forwarded datagrams, local output requests)
  o IP local traffic (locally delivered datagrams, local output requests)
  o IP reassembling and fragmentation (Fragments needed to be reassembled, Reassembled OK, Fragmented OK, Fragments created, Discarded)
  o IP problems (Input header errors, Input destination address error, Input unknown/unsupported protocol, input discards, output discards, no output route)
- UDP (transport layer)
  o UDP traffic (Input datagrams, Output datagrams)
  o UDP problems (Invalid port, Input errors)
- ICMP (transport layer)
  o ICMP input summary (Received messages, Input errors)
  o ICMP output summary (Sent messages, Output errors)
  o ICMP input echo (Echo requests received, Echo replies received)
  o ICMP output echo (Echo replies sent, Echo requests sent)
  o ICMP input details (Destination unreachable received, Time exceeded received, Parameter problem received, Source quench received, Redirect received, Timestamp request received, Timestamp reply received, Address mask request received, Address mask reply received)
  o ICMP output details (Destination unreachable sent, Time exceeded sent, Parameter problem sent, Source quench sent, Redirects sent, Timestamp request sent, Timestamp replies sent, Address mask request sent, Address mask replies sent)
- SNMP
  o SNMP traffic (SNMP input, SNMP output)
  o SNMP input detailed (Requested variables, Set variables, Get requests, Get next requests, Set requests, Get responses, Traps)
  o SNMP output detailed (Set requests, Get requests, Get responses, Traps, Silent drop, Proxy drop)
  o SNMP input problems (Bad version, Bad community, Bad community use, ASN parse error, Too big packet, No such name, Bad value, Readonly, GenError)
  o SNMP output problems (Too big packet, No such name, Bad value, GenError)
- Interfaces (type Other or tunnel)
  o Measured interfaces (Administrative up, Found, Operating)
  o Speed (technical)
  o Input bit rate
  o Output bit rate
  o Packet rates per class – Input (Unicasts, Multicasts, Broadcasts)
  o Packet rates per class – Output (Unicasts, Multicasts, Broadcasts)
  o Errors (Input errors, Output Errors)
  o Discards (Input discards, Output discards)
- Interfaces (type Ethernet or in Ethernet interfaces: propVirtual)
  o Measured interfaces (Administrative up, Found, Operating)
  o Speed (technical, input, output)
  o Input bit rate
  o Output bit rate
  o Input packets
  o Packet rates per class – Input (Unicasts, Multicasts, Broadcasts)
  o Output packets
  o Packet rates per class – Output (Unicasts, Multicasts, Broadcasts)
  o Utilization (Input utilization, Output utilization)
  o Estimated packet length (Input, Output)
  o Input / Output traffic ratio (Input, Output)
  o Errors (Input errors, Output Errors)
  o Discards (Input discards, Output discards)
- Interfaces (type Gigabit Ethernet or in Gigabit interfaces: propVirtual, L2vlan)
  o Measured interfaces (Administrative up, Found, Operating)
  o Speed (technical, input, output)
  o Input bit rate
  o Output bit rate
  o Input packets
  o Packet rates per class – Input (Unicasts, Multicasts, Broadcasts)
  o Output packets
  o Packet rates per class – Output (Unicasts, Multicasts, Broadcasts)
  o Utilization (Input utilization, Output utilization)
  o Estimated packet length (Input, Output)
  o Input / Output traffic ratio (Input, Output)
  o Errors (Input errors, Output Errors)
  o Discards (Input discards, Output discards)
  o Ethernet Errors (alignment errors, checksum errors, frames too long, SQE test errors, internal MAC transmit errors, internal MAC receive errors, carrier sense errors)
Values for Physical Machines (Hosts/Hypervisors) can be:
- Number of VMs
- Boot time
- Network usage [kbits/s]
- CPU percentage
- CPU usage (MHz)
- CPU speed
- Number of Cores
- Capacity of CPU speed (GHz)
- Memory Size (MB)
- Memory Usage (MB)
- Network Adapters
Values for virtual machines can be:
- Virtual Machine (VM) status (e.g. “VM is OK”)
- Connection State (e.g. “connected”)
- Power State (e.g. “poweredOFF”, “poweredON”)
- CPU usage (in MHz; “Not known” if unknown)
- Memory Size (in MB; “Not known” if unknown)
- Guest memory usage (in MB; “Not known” if unknown)
- Guest Operating System (“Not known” if unknown)
Values for network paths spanning multiple hops can be:
- One-way delay (OWD), one-way delay variation
- Round trip delay (RTD), round trip time (RTT)
- Packet loss, loss burstiness
- Packet rate/volume, byte rate/volume
- Packet reordering
APPENDIX C: FEDERICA AND PLANETLAB TOOLS AND
RELATED METRICS
Currently the following metrics are already covered by tools from either the FEDERICA or the
PlanetLab context:
- PING: ICMP echo packet loss, ICMP echo round trip time
- SNMP client [G3System]: IP input/output traffic, IP local traffic, IP reassembling and fragmentation, IP problems, UDP traffic, UDP problems, ICMP input/output summary, ICMP input/output echo, ICMP input/output details, SNMP traffic, SNMP input/output detailed, SNMP input/output problems, Physical interfaces
- VMWare API [vmware-api]: interfaces, number of VMs, Boot time, Network usage, CPU percentage, CPU usage, CPU speed, Number of cores, Capacity of CPU speed, Memory Size, Memory Usage, Network Adapters
- VMware-cmd [VMware-cmd]: connection state, power state, up time, has snapshot, guest info properties (via getguestinfo)
- CoMon [CoMon]: uptime, overall CPU utilization, System CPU utilization, Memory size, Memory active consumption, Free Memory, Disk Size, Disk Used, Transmit rate, Receive rate, slice CPU usage, slice Memory usage, slice physical memory consumption, slice virtual memory consumption, slice receive/transmit bandwidth rate
- HADES [HADES]: One Way Delay, One Way Delay Variation, Packet Loss
- MyOps [MyOps]: Kernel version, boot state, are system services running, is filesystem read-only, hard disk failure (using system logs), is dns working, are key slices created, last contact, finger prints of the start-up script logs, Problem detection
- TopHat: Network Topology Information on the AS- and interface-level; mapping from IP prefixes to AS domains