Download slides

5 downloads 327 Views 18MB Size Report
Oct 17, 2013 - TENEX virtual machine even when one or more of the TENEX ...... https://ramcloud.stanford.edu/wiki/downlo
Rediscovering Distributed Systems Steve Vinoski Basho Technologies Cambridge, MA USA http://basho.com @stevevinoski [email protected] http://steve.vinoski.net/

Thursday, October 17, 13

1

Distributed Systems are Everywhere

Thursday, October 17, 13

2

Distributed Systems are Difficult

Thursday, October 17, 13

3

On The Shoulders of Giants • "Distributed systems" describes an enormous history of research and practice • Dist Sys research/practice addresses many issues from many angles • Know the issues so you can choose the right trade-offs Thursday, October 17, 13

4

Scope • Way way too much to cover • This talk is based in part on my personal history and experiences • Others would give a completely different talk

Thursday, October 17, 13

5

1960s

Thursday, October 17, 13

6

1960s • Beginnings of concurrent systems • 1965: Dijkstra's semaphores • Beginnings of computer networking • J.C.R. Licklider's 1962 dream of an "Intergalactic Computer Network" would eventually lead to the Internet • Beginning of OO: Simula 67 Thursday, October 17, 13

7

Distributed Systems Failure • 1965 Northeast blackout affected 7 US states and Ontario • A single misconfigured relay caused massive cascading failures • Then and now: distributed systems failure is not uncommon

Thursday, October 17, 13

8

Dijkstra and Multiprogramming • 1968: "The Structure of the 'THE' Multiprogramming System" • Describes a whole system designed as a set of hierarchical cooperating sequential processes • System resources shared via mutual synchronization via semaphores Thursday, October 17, 13

9

Dijkstra and Multiprogramming

"At the time this was written the testing had not yet been completed, but the resulting system is guaranteed to be flawless." —E.W. Dijkstra "The Structure of the 'THE' Multiprogramming System"

Thursday, October 17, 13

10

1970s

Thursday, October 17, 13

11

1970s Issues • Interprocess Communication • Resource sharing • Programming languages and distributed computing • Application-to-application protocols

Thursday, October 17, 13

12

Interprocess Communication (IPC) • "Interprocess Communication Facilities for Network Operating Systems", Akkoyunlu, Bernstein, Schantz, 1974 • Discusses IPC over different network topologies • Compares connection-oriented and messageoriented IPC facilities • Discusses concerns of how sender and receiver might find and identify each other Thursday, October 17, 13

13

IPC

Thursday, October 17, 13

14

IPC

Thursday, October 17, 13

15

Thursday, October 17, 13

16

Resource Sharing "Users and administrators of a small computer often desire more service than it can provide. In a network environment additional services can be provided to the small computer, and in turn to the users of the small computer, by one or more other computers." —Akkoyunlu, Bernstein, Schantz "Interprocess Communication Facilities for Network Operating Systems"

Thursday, October 17, 13

17

Resource Sharing

• "An Operational System for Computer Resource Sharing", Cosell et al., 1975 • Ideas like today's cloud computing

Thursday, October 17, 13

18

Resource Sharing "Further , it was becoming clear that for many users, in particular those whose access to the network was via TIPs or other nonTENEX hosts, it should not actually matter which host provides the TENEX service so long as the users could do their computing in the manner to which they had become accustomed." —Cosell et al. "An Operational System for Computer Resource Sharing" Thursday, October 17, 13

19

Resource Sharing "A number of advantages would result from such resource sharing. The user would see TENEX as a much more accessible and reliable resource. Because he would no longer be dependent upon a single host for his computing, he would be able to access the TENEX virtual machine even when one or more of the TENEX hosts were unavailable." —Cosell et al. "An Operational System for Computer Resource Sharing" Thursday, October 17, 13

20

Distributed Processing, January 1978

Thursday, October 17, 13

21

Distributed Processing, January 1978 "A terminal with a resident text editor, whether it is provided by hardware or software, is not an example of a distributed data processing system." "If the terminal coordinates several concurrent and simultaneous remote jobs, giving each a different type of service at a different location, without human intervention, then it more closely resembles a distributed system." —P.H. Enslow, Jr. "What is a 'Distributed' Data Processing System?" Thursday, October 17, 13

22

1978: Issues in Distributed Processing "Participants generally agreed that distributed processing is made possible by the price-performance revolution in microelectronics." —Eckhouse and Stankovic "Issues in Distributed Processing - An Overview of Two Workshops"

Thursday, October 17, 13

23

1978: Issues in Distributed Processing

• IPC • Distributed operating systems • Distributed databases • Load balancing

Thursday, October 17, 13

24

State Machines

"A distributed system can be described as a particular sequential state machine that is implemented with a network of processors." Leslie Lamport http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks

Thursday, October 17, 13

25

Ordering of Events

• "Time, Clocks, and the Ordering of Events in a Distributed System", L. Lamport, 1978 • Probably the most cited Dist Sys paper ever

Thursday, October 17, 13

26

Ordering of Events • event A happens before event B if • A occurs before B in the same process • or A is a send of M from process P1, and B is the receive of M in process P2 • happens-before is written as A → B • if A → B and B → C, then A → C

Thursday, October 17, 13

27

Ordering of Events

• Clock Condition: • if A → B then LogicalClock(A) < LogicalClock(B)

Thursday, October 17, 13

28

Ordering of Events e11 P1

t

1

e21 P2

1

t

e31 P3

Thursday, October 17, 13

1

t

29

Ordering of Events e11 P1

e12

1

2

e21 P2

P3

Thursday, October 17, 13

t

e22

1

3

e31

e32

1

2

t

t

30

Ordering of Events e11 P1

e12

1

2

e21 P2

e14

3

3

e31

e32

1

2

t

5

e22 e23

1

P3

e13

e24

4

5

t

e33 e34 3

4

t

Only partial ordering, since e.g. e32 ↛ e14 Thursday, October 17, 13

31

Ordering of Events • Total ordering achieved by choosing an arbitrary ordering for processes • A ⇒ B iff • C(A) < C(B) • or, C(A) == C(B) and Pa < Pb • Remainder of paper shows how it all works with physical clocks Thursday, October 17, 13

32

Languages and Distributed Systems • Handling concurrency in programming languages • Handling distribution in programming languages • What are the best primitives in such languages?

Thursday, October 17, 13

33

Communicating Sequential Processes • "Communicating Sequential Processes", Tony Hoare, 1978 • Defines "surprisingly versatile" primitives for structuring concurrent programs

Thursday, October 17, 13

34

Communicating Sequential Processes • Dijkstra's guarded commands for sequential control structures • Dijkstra's parbegin to specify concurrent execution • Input and output commands to allow processes to communicate • Input commands in guards • Pattern matching on input messages Thursday, October 17, 13

35

Communicating Sequential Processes

• The paper then uses these primitives to show solutions to a variety of programming problems

Thursday, October 17, 13

36

Primitives for Distributed Computing • "Primitives for Distributed Computing", Barbara Liskov, 1979 • Describes distributed computing primitives, focusing on modularity and communication

Thursday, October 17, 13

37

Primitives for Distributed Computing • Reasons for distributed computing: • To match distributed organizations • Reduced contention among organization divisions • Speed of access • Physical control • Better availability, reliability, extensibility • But at the time there was little experience in building distributed programs Thursday, October 17, 13

38

Primitives for Distributed Computing • Approach: extend the CLU language with distributed computing primitives • CLU supported data abstractions and objects Structured programming

CLU ⇒

+ Modularity Data abstraction

Thursday, October 17, 13

39

Modularity • Guardian: a collection of processes and objects that provides a distributed service • Processes in same guardian share objects • Processes in different guardians communicate only via messages • Each guardian bound to a single node • Each node can host multiple guardians Thursday, October 17, 13

40

Communication • No-wait send • Receive with timeout • Messages are commands with zero or more arguments • Messages are sent to ports • Ports are named and typed, defining the messages they accept • Guardians have one or more ports Thursday, October 17, 13

41

Primitives for Distributed Computing • Paper finishes by showing an example of a transactional airline reservation system • One conclusion: more experience with distributed programming needed

Thursday, October 17, 13

42

App-to-App Protocols • ARPANET, forerunner of the Internet, started operating in late 1969 • Early host-to-host protocols facilitated human-to-computer communications • Email in 1971 • FTP and interoperable Telnet in 1973 • Interest started growing in application-toapplication protocols Thursday, October 17, 13

43

RFC 707 • J. E. White, RFC 707, “A High-Level Framework for Network-Based Resource Sharing, 1975 • Expressed concerns that programmers • didn't know how to write distributed applications • but did know how to write libraries • so, why not make distributed programming look just like library programming? Thursday, October 17, 13

44

RFC 674 and 707 • RFCs 674 and 707 basically define what would become the remote procedure call (RPC) • But questioned in RFC 684: "While the procedure call may be an appropriate basis for certain applications, we believe that it can neither directly nor accurately model the interactions and control structures that occur in many distributed multi-computer systems." —R. Schantz, RFC 684 Thursday, October 17, 13

45

See Also • Carl Hewitt's Actor Model • Smalltalk-72 • Robin Milner's Calculus of Communicating Systems • Bruce Lindsay's "Notes on Distributed Databases" • Anything by Jim Gray (in any decade) Thursday, October 17, 13

46

1980s

Thursday, October 17, 13

47

Some 1980s Issues • Languages for distributed programming • Operating systems • Safety and liveness • Consensus

Thursday, October 17, 13

48

Languages for Distribution • Much research in this period focused on whole programming languages and runtimes • Even whole systems consisting of unified programming language, compiler, and operating system

Thursday, October 17, 13

49

Languages for Distribution • RPC was a key abstraction • Significant focus on uniformity: • local/remote transparency • location transparency • strong/static typing across the system

Thursday, October 17, 13

50

Languages for Distribution • Specialized, closed protocols • Protocols were rarely the focus of research efforts, publications almost never mentioned them • Protocol was viewed as part of the RPC “black box,” hidden between client and server RPC stubs Thursday, October 17, 13

51

Liskov's Argus • Integrated programming language and system • Extension of CLU • Designed to help with reliability issues (partitions, downed nodes) • Included atomic actions to support consistency Thursday, October 17, 13

52

Xerox Cedar Programming Environment • Gave us 1984 Birrell/Nelson paper "Implementing Remote Procedure Calls"

Thursday, October 17, 13

53

Interface Definition Language (IDL) • Declarative language used to define remote interface functions and types • Translated/mapped into specific programming language stubs and type definitions • There are many IDLs, not sure of the original Thursday, October 17, 13

54

IDL • In mid-80s Apollo Computer used an IDL to define system interfaces, then translated into C and Domain Pascal • Kept definitions for C and Pascal in sync • Apollo Network Computing System (NCS) also used the IDL to define remote interfaces • NCS was a forerunner of the Distributed Computing Environment (DCE)

Thursday, October 17, 13

55

Some Apollo Trivia • Apollo Aegis and Domain/OS provided a native networked file system (not bolted on later) • Access to a file on host "foo" from any other host: //foo/path/to/file • Sir Tim Berners-Lee told me he later borrowed the Apollo "//" to use in URLs • Microsoft Universal Naming Convention (UNC) path uses "\\", likely due to Paul Leach who left HP/Apollo for Microsoft in 1991

Thursday, October 17, 13

56

Emerald

• distributed RPC-based object language • local/remote transparency • object mobility

Thursday, October 17, 13

57

Erlang • Programming language/system invented in the mid-80s at Ericsson by Joe Armstrong • Provides reliability via concurrency and distribution • Useful features, reasonable trade-offs, clear influence from work preceding it • Open source, available at erlang.org • My favorite programming language Thursday, October 17, 13

58

Vector Clocks • Independently discovered by Mattern and Fidge • Instead of just transmitting clocks or timestamps, keep a vector of clocks, one for each process • Lamport timestamps can't prove causality, vector clocks can Thursday, October 17, 13

59

Vector Clocks

from Fidge "Timestamps in Message Passing Systems That Preserve the Partial Ordering" Thursday, October 17, 13

60

Consensus

• Coordination and reliability • getting processes to agree • even if some are faulty or unavailable • or even if some are malicious

Thursday, October 17, 13

61

Byzantine Generals

• Lamport, 1982 • Proves how to achieve consensus in the presence of malicious processes

Thursday, October 17, 13

62

FLP Impossibility • Fischer, Lynch, Paterson paper, 1983 • Proves that in asynchronous systems, reaching consensus in bounded time can be impossible with just one fault • Uses proof by contradiction • See also Nancy Lynch's "A Hundred Impossibility Proofs for Distributed Computing" Thursday, October 17, 13

63

Why Impossibility? "What good are impossibility results, anyway? They don't seem very useful at first... Most obviously, impossibility results tell you when you should stop trying to devise or improve an algorithm." —Nancy Lynch http://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf

Thursday, October 17, 13

64

Safety and Liveness • Lamport, "Proving the Correctness of Multiprocess Programs", 1977 • See also Alpern and Schneider, "Recognizing Safety and Liveness", 1987 and their prior related work • These properties help us reason about distributed systems designs, approaches, trade-offs Thursday, October 17, 13

65

Safety and Liveness • Safety: nothing bad happens • e.g. distributed transactions ensure consistency across a system • In consensus terms, only a single proposed value is chosen • Doing nothing is considered safe!

Thursday, October 17, 13

66

Safety and Liveness • Liveness: something good eventually happens • e.g., a system eventually responds to every request • In consensus terms, a proposed value is eventually chosen • Ensures system progress Thursday, October 17, 13

67

See Also

• Dwork, Lynch, Stockmeyer, "Consensus in the Presence of Partial Synchrony" 1988 • Oki's and Liskov's Viewstamped Replication work for high availability

Thursday, October 17, 13

68

See Also • Ken Birman's work on reliable distributed computing (Isis, Horus) • Andrew Black's Eden project, a full distributed OO operating system, RPC-based • Andrew Herbert's "Advanced Network Systems Architecture" (ANSA), models and rules for distributed systems designs. Objects, transactions, interfaces. Influenced the Object Management Group (OMG) Thursday, October 17, 13

69

1990s

Thursday, October 17, 13

70

Some 1990s Issues

• Distributed objects • Practical consensus • The rise of the web

Thursday, October 17, 13

71

Distributed Objects • OOP grew in popularity in the 70s and 80s • Many 80s distributed systems research systems were based on objects • But research systems were often full stacks, including OS, language, and compiler Thursday, October 17, 13

72

Distributed Objects • Computer vendors ultimately had little choice but to • incorporate distributed systems research into their own stacks • but make distributed programming features available for “normal” programming languages, without changing those languages • It all led to CORBA Thursday, October 17, 13

73

Common Object request Broker Architecture • First CORBA spec published 1991 • I co-authored this 1999 book • CORBA is still alive today in 2013, and this book still sells

Thursday, October 17, 13

74

Ideal Distributed Objects Architecture Example: Object Management Architecture (OMA) from the Object Management Group (OMG) AI DI CF OS

DI CF OS

OS

CORBA ORB

OS

AI = Application Interfaces CF = Common Facilities

Thursday, October 17, 13

CF OS

DI = Domain Interfaces OS = Object Services

75

Enterprise Integration Reality

Thursday, October 17, 13

76

Application Integration

• A "common bus" was/is an interesting but impractical goal • Even today in 2013, application integration still involves numerous approaches

Thursday, October 17, 13

77

Fallacies of Distributed Computing 1. The network is reliable.

5. Topology doesn't change.

2. Latency is zero.

6. There is one administrator.

3. Bandwidth is infinite. 4. The network is secure.

Thursday, October 17, 13

7. Transport cost is zero. 8. The network is homogeneous. 78

Revised Fallacies of Distributed Computing 1.

Partitions do not occur.

6.

There is one administrator.

2.

Latency is zero.

7.

Transport cost is zero.

3.

Bandwidth is infinite.

8.

4.

The network is secure.

The network is homogeneous.

9.

Clocks are synchronized.

5.

Thursday, October 17, 13

Topology doesn't change.

10. Concurrency can be ignored.

79

Distributed Systems Concurrency • Dist Sys are inherently concurrent • Yet the distributed objects movement largely ignored concurrent object access • there was an OMG concurrency control service, but used only for the distributed transaction service • also ignored consensus, again except for the transaction service Thursday, October 17, 13

80

A Note on Distributed Computing • 1994 paper by Jim Waldo, Geoff Wyant, Ann Wollrath, Sam Kendall • Addresses a prevalent issue of the era: that local/remote transparency was a desirable goal • Also mentions the concurrency issue

Thursday, October 17, 13

81

Paxos • Lamport's "The Part-Time Parliament" defines the Paxos algorithm, still widely used today • Paper originally submitted in 1990, panned by reviewers • But others recognized its significance, so Lamport finally resubmitted for publication in 1998 Thursday, October 17, 13

82

Implementing Consensus • Butler Lampson understood the importance of Paxos • Published "How to Build a Highly Available System Using Consensus" in 1996 • Practical advice on using Paxos in real systems Thursday, October 17, 13

83

See Also • F. Schneider's "Implementing Fault Tolerant Services Using the State Machine Approach: A Tutorial", 1990 • Karger et al. paper on Consistent Hashing, 1997 • A. Fox and E.A. Brewer, "Harvest, Yield, and Scalable Tolerant Systems", 1999 Thursday, October 17, 13

84

2000s

Thursday, October 17, 13

85

Some 2000s Issues

• REST • Paxos made simple • CAP • Dynamo

Thursday, October 17, 13

86

REST • "Representational State Transfer", defined by Roy Fielding in his doctoral thesis, 2000 based on his work on the web and HTTP • An architectural style for networked applications • Has unfortunately now become an overused & misunderstood buzzword Thursday, October 17, 13

87

REST Constraints • Constraints and trade-offs are what help define an architecture • Client-server • Statelessness • Caching • Layered system • Uniform interface • Code on demand Thursday, October 17, 13

88

REST Properties • REST imposes these constraints to induce desired properties such as • performance, scalability, portability, simplicity • visibility (monitoring and mediation) • modifiability (evolution, extension, reuse) • reliability (handling failure, failover, load balancing, redundancy) Thursday, October 17, 13

89

REST • REST works well for problems that fit its constraints • But it's not for everything • A great general lesson from REST: 1. understand the properties your apps need 2. impose the appropriate constraints/ trade-offs to get them Thursday, October 17, 13

90

Paxos Made Simple

• Lamport was tired of complaints of the complexity of Paxos, so he wrote this in 2001 • It's still complex

Thursday, October 17, 13

91

CAP Theorem • Eric Brewer's Consistency, Availability, Partition Tolerance conjecture, 2000 • Formally proven in 2002 by Gilbert and Lynch • Common interpretation "pick two" isn't quite right

Thursday, October 17, 13

92

CAP Theorem • Distributed systems fail • So, partition tolerance (P) isn't a choice • Under partition, does your system • try to be consistent (C) • try to be available (A) • it can't be both

Thursday, October 17, 13

93

Amazon Dynamo • 2007 paper from Amazon describing a large-scale highly-available eventually consistent key-value datastore • Strong influence on Cassandra, Riak, and Voldemort databases • Riak Core is a framework for implementing Dynamo-like systems https://github.com/basho/riak_core Thursday, October 17, 13

94

See Also • The Google Chubby lock service • Joe Armstrong's 2003 PhD thesis "Making reliable distributed systems in the presence of software errors" • Pastry and Chord distributed hash tables, 2001 • Multicore CPU cache coherence protocols • Papers that have won the Dijkstra Prize Thursday, October 17, 13

95

2010s

Thursday, October 17, 13

96

CRDTs Convergent Commutative

Replicated Data Types

Conflict-free • Replicated data types that handle updates correctly in eventually consistent, highly available systems • E.g., counters, sets, maps • Automatic handling updates that can occur concurrently or under partition Thursday, October 17, 13

97

Raft

• D. Ongaro and J. Ousterhout, "In Search of an Understandable Consensus Algorithm", 2013 • See also Zookeeper Atomic Broadcast (ZAB)

Thursday, October 17, 13

98

CALM and Bloom

• CALM: Consistency as Logical Monotonicity • Bloom: distributed programming language that helps deal with distributed consistency

Thursday, October 17, 13

99

• In general, pay attention to ALL distributed systems work coming from: • Joe Hellerstein, Neil Conway, Peter Alvaro, William Marczak from the UC Berkeley Database Group • Peter Bailis and Ali Ghodsi from AMPLab at UC Berkeley

Thursday, October 17, 13

100

Jepsen • Kyle Kingsbury (@aphyr) breaks systems using Jepsen https:// github.com/aphyr/jepsen • Read his "Call Me Maybe" series of blog posts on aphyr.com describing his experiments, lots of detailed distributed systems knowledge and insights

Thursday, October 17, 13

101

Summary

Thursday, October 17, 13

102

• Distributed systems are everywhere, and many developers work on them

Thursday, October 17, 13

103

• Distributed systems are hard to reason about, due to many subtle details

Thursday, October 17, 13

104

• Distributed systems R&D history is extremely rich

Thursday, October 17, 13

105

• Study the classics • These papers can be hard to understand but are worth it

Thursday, October 17, 13

106

• Learning this history gives you a vocabulary of theory and techniques for tackling the problems you work on

Thursday, October 17, 13

107

• Understand what trade-offs are best for your distributed system • Achieve them with the help of relevant prior work

Thursday, October 17, 13

108

Thanks @stevevinoski

Thursday, October 17, 13

109

References •

E.W. Dijkstra, "Cooperating Sequential Processes", 1965 http://www.cs.utexas.edu/~EWD/ transcriptions/EWD01xx/EWD123.html



E.W. Dijkstra, "The Structure of the 'THE' Multiprogramming System", CACM, May 1968 http://dl.acm.org/citation.cfm?id=363143



Akkoyunlu, Bernstein, Schantz, "Interprocess Communication Facilities for Network Operating Systems", 1974 http://ieeexplore.ieee.org/xpl/articleDetails.jsp? arnumber=6323582



B. Cosell,, P. Johnson, J. Malman, R. Schantz, J. Sussman, R. Thomas, D. Walden, "An Operational System for Computer Resource Sharing", 1975 http://www.webstart.com/ papers/tenex-rsexec.pdf



L. Lamport, "Time, Clocks and Ordering of Events in a Distributed System", 1978, http:// research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf



C.A.R. Hoare, "Communicating Sequential Processes", 1978 http://dl.acm.org/citation.cfm? id=359585 see also http://www.usingcsp.com



B. Liskov, "Primitives for Distributed Computing", 1975 http://dl.acm.org/citation.cfm? doid=800215.806567

Thursday, October 17, 13

110

References •

J. E. White, RFC 707, “A High-Level Framework for Network-Based Resource Sharing, 1975 http://tools.ietf.org/rfc/rfc707



R. Schantz, RFC 684, "A Commentary on Procedure Calling as a Network Protocol", 1975 http://tools.ietf.org/html/rfc684



Butler W. Lampson and Howard E. Sturgis, "Crash Recovery in a Distributed Data Storage System", 1979 http://research.microsoft.com/en-us/um/people/blampson/21crashrecovery/Abstract.html (originally part of the Distributed Processing Workshop, Brown University, August 1976)



L. Lamport, "Proving the Correctness of Multiprocess Programs", 1977 http:// research.microsoft.com/en-US/um/people/Lamport/pubs/proving.pdf



L. Lamport, "The Implementation of Reliable Distributed Multiprocess Systems", 1978 http:// research.microsoft.com/en-us/um/people/lamport/pubs/implementation.pdf



B. Lindsay et al., "Notes on Distributed Databases", 1979 http://domino.research.ibm.com/ library/cyberdig.nsf/papers/A776EC17FC2FCE73852579F100578964/$File/RJ2571.pdf

Thursday, October 17, 13

111

References •

M. Pease, R. Shostak, L. Lamport, "Reaching Agreement in the Presence of Faults", 1980 http:// research.microsoft.com/en-us/um/people/lamport/pubs/reaching.pdf



L. Lamport, R. Shostak, M. Pease, "The Byzantine Generals Problem", 1982 (available at) http:// www.cs.cornell.edu/courses/cs614/2004sp/papers/lsp82.pdf



B. Liskov and R. Scheifler, "Guardians and Actions: Linguistic Support for Robust, Distributed Programs", 1983 (available at) http://www.cs.brandeis.edu/~cs147a/papers/liskov-argus.pdf



M. Fischer, N. Lynch, and M. Paterson, "Impossibility of Distributed Consensus with One Faulty Process", 1983 http://groups.csail.mit.edu/tds/papers/Lynch/pods83-flp.pdf



A. Black et al., "Object Structure in the Emerald System", 1986 (available at) http://pdf.aminer.org/ 000/521/744/object_structure_in_the_emerald_system.pdf



B. Alpern and F. Schneider, "Recognizing Safety and Liveness", 1987 http://www.cs.cornell.edu/ fbs/publications/RecSafeLive.pdf



C. Fidge, "Timestamps in Message-Passing Systems That Preserve the Partial Ordering", 1988, (available at) http://zoo.cs.yale.edu/classes/cs426/2012/lab/bib/fidge88timestamps.pdf



F. Mattern, "Virtual Time and and Global States of Distributed Systems", 1989 (available at) http:// homes.cs.washington.edu/~arvind/cs425/doc/mattern89virtual.pdf

Thursday, October 17, 13

112

References •

C. Dwork, N. Lynch, L. Stockmeyer, "Consensus in the Presence of Partial Synchrony", 1988 http://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf



B. Oki and B. Liskov, "Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems", 1988 (available at) http://www.cs.princeton.edu/ courses/archive/fall09/cos518/papers/viewstamped.pdf



N. Lynch, "A Hundred Impossibility Proofs for Distributed Computing", 1989 http:// groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf



M. Kong, T. H. Dineen, P. J. Leach, E. A. Martin, N. W. Mishkin, J. N. Pato, and G. L. Wyant. 1990. Network Computing System Reference Manual. Prentice-Hall, Inc., Upper Saddle River, NJ.



F. Schneider, "Implementing Fault Tolerant Services Using the State Machine Approach: A Tutorial", 1990 http://www.cs.cornell.edu/fbs/publications/smsurvey.pdf



J. Waldo et al., "A Note on Distributed Computing", 1994 (available at) http:// www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/smli_tr-94-29.pdf



B. Lampson, "How to Build a Highly Available System Using Consensus", 1996 http:// research.microsoft.com/en-us/um/people/blampson/58-Consensus/Acrobat.pdf

Thursday, October 17, 13

113

References •

D. Karger, et al., "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web", 1997 http://dl.acm.org/citation.cfm? id=258660



A. Fox and E.A. Brewer, "Harvest, Yield, and Scalable Tolerant Systems", 1999, http:// lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf



R.T. Fielding, "Architectural Styles and the Design of Network-based Software Architectures", 2000 http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm



E. A. Brewer, "Towards Robust Distributed Systems", 2000 http://www.cs.berkeley.edu/ ~brewer/cs262b-2004/PODC-keynote.pdf



L. Lamport, "Paxos Made Simple", 2001 http://research.microsoft.com/en-us/um/people/ lamport/pubs/paxos-simple.pdf



A. Rowstron and P. Druschel, "Pastry: Scalable, Decentralized Object Location and Routing for Large-Scale Peer-to-Peer Systems", 2001 http://research.microsoft.com/en-us/um/ people/antr/PAST/pastry.pdf

Thursday, October 17, 13

114

References •

I. Stoica et al., "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications", 2001 http://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf



S. Gilbert and N. Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", 2002 http://lpd.epfl.ch/sgilbert/pubs/BrewersConjectureSigAct.pdf



J. Armstrong, "Making Reliable Distributed Systems in the Presence of Software Errors", 2003 http://www.erlang.org/download/armstrong_thesis_2003.pdf



M. Burrows, "The Chubby Lock Service for Loosely-coupled Distributed Systems", 2006 http://research.google.com/archive/chubby.html



G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store", 2007 http:// www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf



M. Shapiro et al., "A Comprehensive Study of Convergent and Commutative Replicated Data Types", 2011 http://hal.inria.fr/docs/00/55/55/88/PDF/techreport.pdf

Thursday, October 17, 13

115

References •

M. Shapiro et al., "Conflict-Free Replicated Data Types", 2011 http://hal.inria.fr/docs/ 00/60/93/99/PDF/RR-7687.pdf



E. Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed", 2012, http:// www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed



P. Bailis, A. Ghodsi, "Eventual Consistency Today: Limitations, Extensions, and Beyond", 2013 http://queue.acm.org/detail.cfm?id=2462076



P. Alvaro et al., "Consistency Analysis in Bloom: a CALM and Collected Approach", 2011 http://db.cs.berkeley.edu/papers/cidr11-bloom.pdf



N. Conway, et al., "Logic and Lattices for Distributed Programming", 2012 http:// db.cs.berkeley.edu/papers/socc12-blooml.pdf



D. Ongaro and J. Ousterhout, "In Search of an Understandable Consensus Algorithm", 2013 https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf



Zookeeper Atomic Broadcast, http://labs.yahoo.com/files/ladis08.pdf



Dotted Version Vector Sets, https://github.com/ricardobcl/Dotted-Version-Vectors

Thursday, October 17, 13

116