Srikanth Sasty's home page

My latest work on a modeling framework for a special variant of failure detectors is accepted at the 31st Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing. A preprint of the submission is available [here].

Abstract: Failure detectors — oracles that provide information about process crashes — are an important abstraction for crash tolerance in distributed systems. Although current failure-detector theory provides great generality and expressiveness, it also poses significant challenges in developing a robust hierarchy of failure detectors. We address some of these challenges by proposing a variant of failure detectors called asynchronous failure detectors and an associated modeling framework. Unlike the traditional failure-detector framework, our framework eschews real time completely. We show that asynchronous failure detectors are sufficiently expressive to include several popular failure detectors. Additionally, we show that asynchronous failure detectors satisfy many desirable properties: they are self-implementable, guarantee that stronger asynchronous failure detectors solve more problems, and ensure that their outputs encode no information other than process crashes. We introduce the notion of a failure detector being representative of a problem to capture the idea that some problems encode the same information about process crashes as their weakest failure detectors do. We show that a large class of problems, called finite problems, do not have representative failure detectors.

 

The full version of my OPODIS 2010 paper "Failure Detectors Encapsulate Fairness" has been accepted for publication with the journal Distributed Computing. You can find a preprint of the paper here [pdf].


Abstract:

Failure detectors have long been viewed as abstractions for the synchronism present in distributed system models. However, investigations into the exact amount of synchronism encapsulated by a given failure detector have met with limited success. The reason for this is that traditionally, models of partial synchrony are specified with respect to real time, but failure detectors do not encapsulate real time. Instead, we argue that failure detectors encapsulate the fairness in computation and communication. Fairness is a measure of the number of steps executed by one process relative either to the number of steps taken by another process or relative to the duration for which a message is in transit. We argue that failure detectors are substitutable for the fairness properties (rather than real-time properties) of partially synchronous systems. We propose four fairness-based models of partial synchrony and demonstrate that they are, in fact, the ‘weakest system models’ to implement the canonical failure detectors from the Chandra-Toueg hierarchy. We also propose a set of fairness-based models which encapsulate the Gc parametric failure detectors which eventually and permanently suspect crashed processes, and eventually and permanently trust some fixed set of c correct processes.

 

I presented my recent work titled "Leader Election Using Loneliness Detection" at the 25th International Symposium on DIStributed Computing (DISC) in Rome last month. Briefly, the work focuses on the 'gap in the computational power' in single-hop wireless systems when the information about message loss in the system is varied.

We assume that the cause of message loss in wireless networks is collision, which happens when multiple nodes in the system transmit concurrently. We consider two models --- weak collision detection (WCD) and strong collision detection (SCD) --- which differ only with respect to the information that they provide about message loss. Specifically, when message collision occurs, in WCD systems, all the nodes that are not transmitting (and therefore listening) receive information about the collision whereas the transmitting processes do not receive any such information, and in SCD systems, all the nodes (both transmitting and receiving nodes) receive information about the collision. Intuitively, it makes sense to argue that SCD systems are more `powerful' than WCD systems because SCD systems provide more information about message collision than WCD systems; however, that does not answer the question: How much more powerful are SCD systems than WCD systems; how can the 'gap' be characterized? Alternatively, we may ask: how do we quantify the amount of information that is provided to transmitting processes when they are notified of a message collision?

We showed that the 'gap' between SCD and WCD systems is captured by the answer to the question: Is there exactly one node in the system? In other words, if WCD systems were to be augmented with an oracle that provides the answer to the foregoing question, then such a system would be able to solve all the problems solvable in SCD systems with the same time and message complexity (modulo a constant factor). We call such an oracle a Loneliness Detector (or LD, for short). We showed that LD can be implemented in WCD systems in O(log u - log n) time deterministically (where u is an upper bound on the number of nodes in the system and n is the actual number of nodes in the system) and in O(1) time with high probability. We then used LD to compare the time complexity of solving leader election in SCD and WCD systems. We showed that in both SCD and WCD systems, leader election may be solved in O(log u) time deterministically and in O(loglog n + log(1/epsilon)) time probabilistically, where epsilon is the error probability. We also provided matching lower bounds for each of the upper bounds presented, thus demonstrating the efficiency of our algorithms.

Abstract:

We consider the problem of leader election (LE) in single-hop radio networks with synchronized time slots for transmitting and receiving messages. We assume that the actual number n of processes is unknown, while the size u of the ID space is known, but is possibly much larger. We consider two types of collision detection: strong (SCD), whereby all processes detect collisions, and weak (WCD), whereby only non-transmitting processes detect collisions. We introduce loneliness detection (LD) as a key subproblem for solving LE in WCD systems. LD informs all processes whether the system contains exactly one process or more than one. We show that LD captures the difference in power between SCD and WCD, by providing an implementation of SCD over WCD and LD. We present two algorithms that solve deterministic and probabilistic LD in WCD systems with time costs of O(log u/n ) and O(min(log u/n , log(1/epsilon)/n)), respectively, where epsilon is the error probability. We also provide matching lower bounds. We present two algorithms that solve deterministic and probabilistic LE in SCD systems with time costs of O(log u) and O(min(log u, log log n + log( 1/epsilon ))), respectively, where epsilon is the error probability. We provide matching lower bounds.

The full version of the paper may be found here [link].

 

I successfully defended my dissertation titled "A Prescription For Partial Synchrony" on January 21st, 2011. The dissertation was approved and accepted by the Thesis Office on February 2nd, 2011.

Abstract:

Algorithms in message-passing distributed systems often require partial synchrony to tolerate crash failures. Informally, partial synchrony refers to systems where timing bounds on communication and computation may exist, but the knowledge of such bounds is limited. Traditionally, the foundation for the theory of partial synchrony has been real time: a time base measured by counting events external to the system, like the vibrations of Cesium atoms or piezoelectric crystals.

Unfortunately, algorithms that are correct relative to many real-time based models of partial synchrony may not behave correctly in empirical distributed systems. For example, a set of popular theoretical models, which we call M*, assume (eventual) upper bounds on message delay and relative process speeds, regardless of message size and absolute process speeds. Empirical systems with bounded channel capacity and bandwidth cannot realize such assumptions either natively, or through algorithmic constructions. Consequently, empirical deployment of the many M*-based algorithms risks anomalous behavior.

As a result, we argue that real time is the wrong basis for such a theory. Instead, the appropriate foundation for partial synchrony is {\em fairness}: a time base measured by counting events internal to the system, like the steps executed by the processes. By way of example, we redefine M* models with fairness-based bounds and provide algorithmic techniques to implement fairness-based M* models on a significant subset of the empirical systems. The proposed techniques use failure detectors --- system services that provide hints about process crashes --- as intermediaries that preserve the fairness constraints native to empirical systems. In effect, algorithms that are correct in M* models are now proved correct in such empirical systems as well.

Demonstrating our results requires solving three open problems. (1) We propose the first unified mathematical framework based on Timed I/O Automata to specify empirical systems, partially synchronous systems, and algorithms that execute within the aforementioned systems. (2) We show that crash tolerance capabilities of popular distributed systems can be denominated exclusively through fairness constraints. (3) We specify exemplar system models that identify the set of weakest system models to implement popular failure detectors.

I thank my advisors Scott Pike and Jennifer Welch for all their help, support, and guidance.

A copy of my dissertation (in PDF format) can be accessed here.

 

A new paper titled "Reliable Networks with Unreliable Sensors" has been published (and presented) at the 12th International Conference on Distributed Computing and Networking (ICDCN 2011) at Bangalore, India.

Abstract. Wireless sensor networks (WSNs) deployed in hostile environments suffer from a high rate of node failure. We investigate the effect of such failure rate on network connectivity. We provide a formal analysis that establishes the relationship between node density, network size, failure probability, and network connectivity. We show that as network size and density increase, the probability of network partitioning becomes arbitrarily small. We show that large networks can maintain connectivity despite a significantly high probability of node failure. We derive mathematical functions that provide lower bounds on network connectivity in WSNs. We compute these functions for some realistic values of node reliability, area covered by the network, and node density, to show that, for instance, networks with over a million nodes can maintain connectivity with a probability exceeding 99% despite node failure probability exceeding 57%.

A preprint of the paper can be access here [pdf].

 
«1»2345
You are here:   Home