Ziua doctoranzilor (PhD Students' Day)

Advanced Techniques for Ensuring High Availability of Resources
in Parallel and Distributed Systems
Eliana-Dina Tirsa, Valentin Cristea
Computer Science Department, Politehnica University of Bucharest
Main topic: high availability in parallel and distributed systems
High availability:
- ability of a system to provide services to the users at a sufficient
performance level, particularly in the presence of failures
- two orthogonal issues: ensuring the system resilience to failures
and maintaining the performance of the services above an
expected threshold
Research Approach:
- address high availability problems at computation, communication
and data levels, in several types of systems
- consider both system resilience and performance aspects
Research Contributions:
- fault tolerance mechanisms
- optimization algorithms (focused on performance metrics related
to high availability – e.g. response time)
- design of fault-tolerant architectures (also used to validate some of
the proposed mechanisms and algorithms)
Repository Replication and
Synchronization in MonALISA
Distributed Monitoring Framework
Problem: repository failures leave gaps in
the monitoring data for some periods
Solution: Repository replication system
- small number of geographically
distributed replicas
- active replication (all replicas subscribe
to same parameters)
- low synchronization cost – only in case
of recovery from failure
- fault tolerant load balancer
- load balancing mechanism uses the
monitored states of the replicas
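
The load balancing decision described above can be illustrated with a minimal sketch. It assumes each replica reports a liveness flag and a single numeric load value; the `ReplicaState` type, its fields, and `pick_replica` are hypothetical names chosen for illustration and are not part of MonALISA.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ReplicaState:
    """Monitored state of one repository replica (hypothetical fields)."""
    name: str
    alive: bool   # last health check succeeded
    load: float   # monitored load metric, lower is better


def pick_replica(states: Sequence[ReplicaState]) -> Optional[ReplicaState]:
    """Return the least-loaded live replica, or None if every replica is down."""
    live = [s for s in states if s.alive]
    if not live:
        return None
    return min(live, key=lambda s: s.load)


replicas = [
    ReplicaState("replica-eu", alive=True, load=0.7),
    ReplicaState("replica-us", alive=False, load=0.1),
    ReplicaState("replica-asia", alive=True, load=0.4),
]
chosen = pick_replica(replicas)
```

A failed replica is skipped even when its last reported load is the lowest, which is how this sketch combines fault tolerance with load balancing.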
Challenges:
- replicas in different time zones
- keep track of gap intervals/last update
- synchronize from many sources
- false positives when detecting gaps (monitored service down,
or existing data with value 0)
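
One possible way to track gap intervals and synchronize from several sources is sketched below. It assumes every replica stores the intervals it has data for as half-open `[start, end)` ranges in UTC epoch seconds, which sidesteps the time zone differences. The names `find_gaps` and `plan_sync` are hypothetical, and the sketch does not address the false-positive problem.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), UTC epoch seconds


def find_gaps(covered: List[Interval], start: int, end: int) -> List[Interval]:
    """Return the parts of [start, end) not covered by any stored interval."""
    gaps = []
    cursor = start
    for lo, hi in sorted(covered):
        if hi <= cursor:
            continue          # interval lies entirely before the cursor
        if lo >= end:
            break             # interval lies entirely after the window
        if lo > cursor:
            gaps.append((cursor, lo))
        cursor = max(cursor, hi)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


def plan_sync(gaps: List[Interval],
              peers: Dict[str, List[Interval]]) -> List[Tuple[str, Interval]]:
    """Assign each missing piece to the first peer that holds data for it."""
    plan = []
    for gap in gaps:
        remaining = [gap]
        for peer, covered in peers.items():
            next_remaining = []
            for g_lo, g_hi in remaining:
                still_missing = find_gaps(covered, g_lo, g_hi)
                # What the peer can supply is the gap minus what is still missing.
                supplied = find_gaps(still_missing, g_lo, g_hi)
                plan.extend((peer, piece) for piece in supplied)
                next_remaining.extend(still_missing)
            remaining = next_remaining
            if not remaining:
                break
    return plan


local = [(0, 10), (20, 30)]
gaps = find_gaps(local, 0, 40)
plan = plan_sync(gaps, {"A": [(0, 15)], "B": [(12, 40)]})
```

Pieces that no peer can supply are simply left out of the plan, so a recovering replica can retry them later.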
Fault tolerant P2P architecture for
efficient multidimensional range
search
Communication Reliability
1. Scalable P2P communication architecture
- structured peer-to-peer topology based on local
interactions between peers
- enhanced reliability: routing messages over
backup/alternate paths
- improved throughput: routing messages
simultaneously over multiple paths
- node and object identifiers mapped in a
multidimensional geometric space
- support for dynamic node arrivals and
departures
- peers perform (extended) geometrical
routing
- replicated objects
- load aware topology based on local
decisions only (periodically, nodes
change identifiers in order to balance load
and objects change node owners)
2. Algorithms for computing backup shortest paths
when the network topology is known in advance
- start by computing the shortest paths
- identify the generic structure of a backup shortest path
and use it during a traversal of the tree of shortest
paths in order to compute backup shortest paths
3. Real-time data transfer scheduling techniques
- algorithms and data structures for improving the
response time
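
To make the notion of a backup shortest path concrete, the sketch below shows a naive baseline, not the tree-traversal algorithm summarized above. It assumes an undirected weighted graph and takes the backup to be the shortest path that avoids every link of the primary path; that definition and the names `shortest_path` and `primary_and_backup` are assumptions made for illustration.

```python
import heapq
from typing import Dict, FrozenSet, List, Optional, Tuple

Graph = Dict[str, Dict[str, float]]  # undirected: graph[u][v] == graph[v][u]


def shortest_path(graph: Graph, src: str, dst: str,
                  banned: FrozenSet[FrozenSet[str]] = frozenset()
                  ) -> Optional[List[str]]:
    """Dijkstra from src to dst, skipping any link listed in `banned`."""
    dist = {src: 0.0}
    parent: Dict[str, str] = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == dst:
            break
        for v, w in graph[u].items():
            if frozenset((u, v)) in banned:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(parent[path[-1]])
    path.reverse()
    return path


def primary_and_backup(graph: Graph, src: str, dst: str
                       ) -> Tuple[Optional[List[str]], Optional[List[str]]]:
    """Return the shortest path and a backup that shares no link with it."""
    primary = shortest_path(graph, src, dst)
    if primary is None:
        return None, None
    used = frozenset(frozenset(e) for e in zip(primary, primary[1:]))
    return primary, shortest_path(graph, src, dst, banned=used)


net: Graph = {
    "A": {"B": 1.0, "C": 2.0},
    "B": {"A": 1.0, "D": 1.0},
    "C": {"A": 2.0, "D": 2.0},
    "D": {"B": 1.0, "C": 2.0},
}
primary, backup = primary_and_backup(net, "A", "D")
```

This baseline reruns Dijkstra once per source-destination pair, which is the kind of repeated work that reusing the shortest path tree is meant to avoid.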
Lock Contention Advisor Tool
- applies to concurrent C programs
- based on a kernel probing tool
- intercepts futex system calls and builds
a sorted list of contended locks
- uses both runtime information and static
analysis to determine variables accessed
inside a lock
- suggestions according to detected
access patterns: lock splitting, moving
code out of the lock, readers-writer
lock, etc.
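
The step that turns intercepted futex waits into a sorted list of contended locks might look like the sketch below. It assumes the probing tool emits one `(lock_address, wait_ns)` pair per blocking futex wait; that event format and the function name `rank_contended_locks` are hypothetical, not the tool's actual output.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_contended_locks(events: Iterable[Tuple[int, int]]
                         ) -> List[Tuple[int, int, int]]:
    """Aggregate futex wait events per lock address.

    Returns (address, wait_count, total_wait_ns) tuples, most contended first.
    """
    totals = defaultdict(lambda: [0, 0])  # address -> [count, total wait]
    for addr, wait_ns in events:
        totals[addr][0] += 1
        totals[addr][1] += wait_ns
    ranked = [(addr, count, total) for addr, (count, total) in totals.items()]
    ranked.sort(key=lambda r: r[2], reverse=True)
    return ranked


trace = [(0x1000, 50), (0x2000, 500), (0x1000, 70)]
ranking = rank_contended_locks(trace)
```

Ranking by total wait time rather than by wait count puts locks that are held for a long time ahead of locks that are taken often but briefly.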
Future Research Directions
1. Background (data and resource) replication strategies inside Clouds
2. Application assisted checkpoint-restart API and associated service
Towards XtreemOS in the Clouds
- ongoing work on extending a grid operating system (XtreemOS)
with capabilities of managing virtual Cloud (Nimbus) resources
provided on demand
- technical difficulties:
  - different authorization, authentication, and security mechanisms
  - resource volatility in Clouds (Grid schedulers are designed for
    less dynamic environments)