Monday, November 24, 2008

MapReduce Again!

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, "Improving MapReduce Performance in Heterogeneous Environments"


Intro
  • Hadoop is an open-source implementation of MapReduce (together with a distributed filesystem, HDFS)
  • new deployments: virtualized environments and datacenters with several generations of hardware exhibit heterogeneity
  • want to optimize the scheduler so jobs finish fastest without wasting unnecessary resources
Scheduling the next empty task slot
  • Priority 1: failed tasks
  • Priority 2: non-running tasks
  • Priority 3: Speculative execution
Speculative execution, the native Hadoop way
  • Speculative execution depends on a progress score in [0,1], divided equally between the copy, sort and reduce phases (sketched after this list)
  • a task is marked a straggler if its progress score is less than the average minus 0.2
  • a straggler's task is given to another machine (but at most one speculative copy per task), chosen based on data locality
(underlying assumptions see 2.2)
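As a toy illustration of the stock heuristic (identifiers and structure are mine, not Hadoop's actual code):

```python
def progress_score(task):
    """Hadoop's progress score in [0, 1]: for a reduce task, the copy,
    sort and reduce phases each count for a third; within a phase it is
    the fraction of input processed."""
    fraction = task.input_processed / task.input_total
    if task.is_reduce:
        return (task.phases_completed + fraction) / 3
    return fraction

def is_straggler(task, tasks, threshold=0.2):
    """Stock Hadoop marks a task as a straggler when its progress score
    falls more than `threshold` below the average of its category."""
    average = sum(progress_score(t) for t in tasks) / len(tasks)
    return progress_score(task) < average - threshold
```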

New speculative execution: LATE = Longest Approximate Time to End
  • Idea: speculatively execute the task that will finish farthest into the future (greatest opportunity to save time)
  • different methods for estimating the time left: the basic one is linear extrapolation from the progress score (see the sketch after this list)
  • speculative tasks are scheduled on fast nodes only (via a threshold on how much work a node has already performed)
  • rank tasks by estimated time left; launch a speculative copy only if it is likely to finish before the original
  • refinements to account for the costs of speculative execution: a SpeculativeCap on the number of concurrent speculative tasks, and a SlowTaskThreshold
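A minimal sketch of LATE's estimator and selection rule, assuming the basic linear extrapolation described in the paper (the names and the threshold handling are mine):

```python
import time

def estimated_time_left(task, now=None):
    """Basic LATE estimate: progress_rate = score / elapsed,
    time_left = (1 - score) / progress_rate."""
    now = now or time.time()
    rate = task.progress_score / (now - task.start_time)
    return (1 - task.progress_score) / rate

def choose_speculative(running, num_speculating, speculative_cap, slow_threshold):
    """Copy the task that will finish farthest in the future, but only if
    the SpeculativeCap is not exceeded and the task's progress rate is
    below the SlowTaskThreshold."""
    if num_speculating >= speculative_cap:
        return None
    slow = [t for t in running
            if not t.speculated and t.progress_rate < slow_threshold]
    return max(slow, key=estimated_time_left, default=None)
```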
Result: LATE reduces running time by a quarter to a half

Policy-aware Switching Layer for Data Centers

Dilip Antony Joseph, Arsalan Tavakoli, Ion Stoica, "A Policy-aware Switching Layer for Data Centers"

Ideas
  • shares some ideas with "Middleboxes no longer considered harmful": traffic can be directed to traverse middleboxes without the middlebox being on the physical path
  • but: it differs because it does not introduce new properties to IP; instead it replaces the Layer II switching fabric with PLayer
  • usual approach: put middleboxes on the physical path; this is hard to configure, there may be multiple physical paths between two nodes, and different traffic is supposed to go through different middleboxes
  • new approach: among Layer II switches, add (policy-aware) pswitches which route the packets through the middleboxes on Layer II based on policy specification
Goals
  • Correctness: Traffic should traverse middleboxes in the sequence specified by the network administrator
  • Flexibility: sequence of middleboxes should be easily reconfigured
  • Efficiency: traffic should not traverse unnecessary middleboxes
Principles
  • Separating policy from (physical) reachability
  • taking middleboxes off the physical network data path
  • policies: [Start,(5-tuple traffic selector)] -> sequence
  • translated into rules ([Previous Hop, Traffic Selector]: Next Hop) which are used by the pswitches (see the sketch after this list)
  • policy schemas (versioned) are created by administrator and pushed onto the pswitches
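A toy lookup a pswitch might perform, to make the rule format concrete (data structures are my guesses, not PLayer's actual ones):

```python
def next_hop(rules, previous_hop, five_tuple, final_destination):
    """rules: list of ((previous_hop, selector), next_hop), as in the notes
    above; selector is a predicate over the encapsulated packet's 5-tuple."""
    for (rule_prev, selector), hop in rules:
        if rule_prev == previous_hop and selector(five_tuple):
            return hop           # next middlebox in the policy sequence
    return final_destination     # no policy matched: normal forwarding

# hypothetical policy: web traffic arriving from outside must hit the firewall
rules = [(("external", lambda ft: ft["dst_port"] == 80), "firewall-1")]
print(next_hop(rules, "external", {"dst_port": 80}, "web-server"))  # firewall-1
```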
Mechanisms
  • decoupling the switch core (which can be a traditional core, using the usual techniques to learn MAC addresses and build spanning trees) from the policy core (where the new rules are applied), plus additional failure detection
  • encapsulate Ethernet frames in Ethernet-II frames (which carry the policy version): the spanning-tree algorithm needs a different source MAC (a more physical one) than the firewall does (a more logical one)
  • SrcMacRewrite decapsulates Ethernet-II frames before they enter a (legacy) middlebox
  • in some situations pswitches require per-flow state
Other
  • implementation in Click, results
  • detailed discussions on various topologies
  • formal analysis
I like this more than the "Middleboxes no longer considered harmful" because
  • it integrates better into existing Layer II structures instead of adding yet another identifier
  • in "Middleboxes no longer considered harmful", nodes had to be configured to use the off-path firewall, and it was unclear how an IP packet destined for a computer behind an off-path firewall would actually be prevented from reaching that computer. In this paper this is clear.
I got a bit confused because MAC addresses are used in two different ways: one address is used for forwarding between pswitches, but another can be used when the packet is handed to a middlebox. I think the paper should have given these MACs different names, if possible.

Monday, November 17, 2008

Internet Data Transfer

Niraj Tolia, Michael Kaminsky, David G. Andersen, Swapnil Patil, "An Architecture for Internet Data Transfer"

Principles
  • separating content negotiation from data transfer (motivation: 70% of Internet traffic is bulk transfer)
  • a common interface between various applications and the transfer service, so an application developer does not need to reinvent the transfer service for each new application
  • Figure 3 explains how an application works in the DOT framework: the application calls the GTC for the data transfer
Data transfer methods can even include physically moving hard disks. One benefit is that the same data can be cached across different applications like Web and electronic email.
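To make that split concrete, here is my toy mental model (the method names are mine; the point from the paper is that the transfer service names data by its content, so any transfer method or cache can serve it):

```python
import hashlib

class TransferService:
    """Toy stand-in for DOT's GTC: a content-addressed store shared by all
    applications, so the same data is cached across, e.g., Web and mail."""
    def __init__(self):
        self.store = {}

    def put(self, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()  # the object's name is its hash
        self.store[oid] = data
        return oid   # the application sends only this small OID in-band

    def get(self, oid: str) -> bytes:
        # a real service would pick among transfer plugins (TCP, multi-path,
        # even shipping disks) and verify the hash of the received data
        return self.store[oid]
```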

Delay Tolerant Networks

Kevin Fall, "A Delay-Tolerant Network Architecture for Challenged Internets"

This paper is abstract. It suggests an overlay network that runs over networks very different from the usual Internet and hence provides a very different service model. But there is no implementation (except one mentioned in a single sentence in the conclusion), no results section, and no new algorithms.
I wonder whether the picture in Figure 1 really is a typical example: are there really applications where challenged networks are used for transit, or don't most applications (e.g. sensor networks, satellite connections, ...) just have challenged networks at one or both ends?
And regarding the US Postal System as model for challenged Internets: looking back at my experiences so far, I kind of doubt whether this is a good idea...

Targeted networks
  • Terrestrial Mobile Networks
  • Exotic Media Networks (e.g. deep space, acoustic links)
  • Military Ad-Hoc Networks
  • Sensor Networks
Violated Assumptions
  • an E2E path exists, no disconnections
  • max RTT is not excessive, queuing times are reasonable
  • drop probability is small (<< 1)
Possibilities to deal with the violated assumptions
  • fix link-layer
  • attach challenged networks at the edge of the Internet through proxies (e.g. for sensor networks) or protocol boosters (e.g. an HTTP proxy for satellite connections)
  • electronic mail (mostly reliable, likely failure notification)
  • DTN

New principles of DTN
  • Message-switching (instead of packets)
  • Regions with DTN gateways in between
  • a name includes the region and an entity name within that region (see the toy sketch after this list)
  • several services (motivated by USPS)
  • path selection and scheduling, trying to predict when the next contact with a certain region will take place
  • custody transfer and reliability: e.g. if a message is sent from one node to another, does the old node need to keep a copy until it receives confirmation that the message was delivered?
  • convergence layers
  • time synchronization
  • congestion and flow control
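A toy rendering of the region-based naming and store-and-forward queuing (my simplification; field names are mine):

```python
class DTNGateway:
    """Names are (region, entity) tuples; routing looks only at the region,
    and the entity name is bound late, inside the destination region.
    Messages wait in per-contact queues until the next predicted contact."""
    def __init__(self, region, routing_table):
        self.region = region
        self.routing_table = routing_table   # region -> next-hop gateway
        self.queues = {}                     # next hop -> pending messages
        self.delivered = []

    def forward(self, name, message):
        region, entity = name
        if region == self.region:
            self.delivered.append((entity, message))   # late binding
        else:
            nxt = self.routing_table[region]
            self.queues.setdefault(nxt, []).append((name, message))
```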

Internet Measurements

Vern Paxson, "End-to-End Internet Packet Dynamics"

I found this paper hard to read, not because it is badly written, but because it discusses the measurements and results in extraordinary depth: for each measured effect it comes up with potential explanations, possible effects on higher layers, and even suggestions for implementations. The lecture on Thursday luckily turned my frustration when reading the paper into admiration for the pioneering role Paxson played in introducing Internet measurement methodology and for the enormous body of results he produced.

Methodology
  • large-scale experiment: 37 sites, N-to-N bulk transfers, done in Dec 1994 and in 1995
  • TCP bulk transfer (versus ICMP packet probing): 1. this is what most Internet traffic really looks like, 2. TCP adapts to avoid unduly loading the network
  • TCP measurements intertwine network and transport behavior; tcpanaly separates them, which requires knowing the specific TCP implementation (Vegas versus Reno, Windows NT versus Linux) and cannot be recovered from time-series analysis alone
Measurements
  • Pathologies: Reordering (on certain paths as high as 36%, due to route flaps), Replication (rare) and Corruption (1 in 5000 packets)
  • Bottleneck bandwidth (difficult to estimate; basic idea: send two consecutive packets, and at the bottleneck the second packet has to wait for the transmission of the first; limited by timer resolution, refined through packet bunches; see the sketch after this list)
  • packet loss (distinguishes between data and ack loss, loss at loaded and unloaded links) and effect on TCP (causes of a redundant retransmission: unavoidable, coarse feedback (no SACK), bad RTO)
  • packet delay (due to queuing/bandwidth, but even compression was observed)
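The packet-pair idea in a few lines, as a sketch (Paxson's actual estimator uses packet bunches and extensive filtering to cope with timer resolution and noise):

```python
def packet_pair_estimate(send_gap, arrival_gap, packet_size):
    """If two back-to-back packets queue at the bottleneck, the second is
    delayed by one transmission time, so the receiver-side spacing
    reflects the bottleneck bandwidth."""
    gap = max(arrival_gap, send_gap)   # the pair must actually have queued
    return packet_size / gap           # bytes per second

# e.g. 1500-byte packets arriving 1.2 ms apart -> 1.25e6 B/s (~10 Mbit/s)
print(packet_pair_estimate(0.0001, 0.0012, 1500))
```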
Conclusion
  • Justification of measurement (TCP) methodology over other measurement techniques
  • wide variety of path properties
  • underlying assumptions about the network often broken
  • robust TCP implementations can work nonetheless

Thursday, November 13, 2008

X-Trace

Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica, "X-Trace: A Pervasive Network Tracing Framework"

Motivation
  • current diagnostic tools are limited to one particular protocol, e.g. traceroute
  • need for comprehensive view of the system's behavior
  • complex systems: e.g. Wikipedia has different sites, web caches, DNS round-robin, load balancers, web servers, database servers (and memcached)
  • tracing across different administrative domains needed
Ideas and design principles
  • integrated tracing framework
  • network protocols modified to propagate X-Trace metadata
  • works inter-layer
  • works inter-Administrative Domains
  • decouples the client of the application from the recipient of the tracing data (Design principle 3); the destination is part of the X-Trace metadata
  • a trace is initiated by the user application or a network operator inserting X-Trace metadata
  • a trace is identified by its task identifier
  • X-Trace data is sent to a report server (which can be the client application or a delegated server)
  • X-Trace constructs the task tree offline, along two axes: one across "layers" (an event causes another event in a lower layer), one across "time" (an event causes another in the same layer); each node in the task tree has an ID, and children link to their parents
  • Design principle 1: trace request are sent in-band
  • Design principle 2: trace data are sent out-of-band
  • ASCII report format
  • report library, report collection through e.g. Postgres
  • visualization of task tree
Deployment
  • the API for applications has pushNext() and pushDown() to propagate X-Trace metadata along the two axes (see the toy sketch after this list); a device reports information accessible at its own layer and can include additional information like load
  • gradual deployment: for legacy clients, devices in the network can add X-Trace metadata
  • retrofitting X-Trace into existing applications faces difficulties: changes to various protocols (IP options, TCP, HTTP headers, SQL), partial deployment impairs the ability to trace parts of the network, lost trace reports can be interpreted as false positives
  • certain request topologies cannot be captured, e.g. a request that spreads through the network and rendezvouses at a node
  • the unique() function returning identifiers for the task tree is not specified in the paper
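A toy version of the metadata propagation, to make the two axes concrete (field names are my guesses at the spirit of the API, not the real library):

```python
import os

class XTraceMetadata:
    """A TaskID shared by the whole request plus a per-operation ID;
    each child records its parent's operation ID, so the report server
    can rebuild the task tree offline."""
    def __init__(self, task_id, op_id, parent_op_id=None):
        self.task_id, self.op_id, self.parent_op_id = task_id, op_id, parent_op_id

    def _child(self):
        return XTraceMetadata(self.task_id, os.urandom(4).hex(), self.op_id)

    def push_next(self):
        """Propagate to the next event in the same layer (the 'time' axis)."""
        return self._child()

    def push_down(self):
        """Propagate to an event caused in a lower layer (the 'layer' axis)."""
        return self._child()
```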
Uses and Experiences
  • low performance overhead
  • Web request and recursive DNS queries
  • Web hosting site (LAMP), users could initiate traces through a JavaScript/PHP library
  • overlay network
  • Tunnels, ISP connectivity
I really liked the framework this paper suggests, and I think it is very useful. Though there is a lot of accumulated experience, scalable websites are still non-trivial to set up, still require manual work to tune and integrate caches, and work across a lot of different layers, as mentioned in the introduction. Some difficulties not mentioned there: there are even more caching layers, like the SQL query cache and memcached; relational databases don't scale, so huge websites shard their data across multiple machines; and there is Brewer's theme of performance versus consistency (a website is not truly transactional, just enough so that users don't perceive it as bad, but when a user does see an inconsistency, it is hard to track down). This paper introduces the only debugging tool I am aware of that addresses all these things together.
The only thing I would add to this framework is the ability to send encrypted X-Trace data.

Post-lecture: Middleboxes done right?

Last lecture on Thursday again ended in a very interesting discussion.

Randy commented on my last post: "I always like the format you use for these comments! Any thoughts on whether the idea of outsourcing middlebox functions makes any sense?"

Here is my answer:
Yes:
  • framework provides the same functionality without breaking the two principles mentioned in the introduction
  • performance will be competitive only if the middlebox is close to or on the network path, i.e. where the middlebox would be anyway
  • user benefits from being able to choose between the middlebox service and direct service
No:
  • extra conceptual overhead
  • how to gradually deploy?
  • because the middlebox is not physically on the network path blocking IP packets, certain attacks on computers are still possible
  • users need to configure their computers to use the middlebox service (half will forget and leave the network vulnerable)
  • middleboxes might be deployed precisely to prevent certain kinds of services; with the middlebox outsourced, this can be circumvented
  • NAT on traditional boxes offers anonymity and invisibility from the outside

Wednesday, November 5, 2008

Middleboxes done right?

Michael Walfish, Jeremy Stribling, Maxwell Krohn, Hari Balakrishnan, Robert Morris, Scott Shenker, "Middleboxes No Longer Considered Harmful"

Benefits of Intermediate Network Elements like NAT, firewall, transparent cache
  • NAT: private IP spaces allow protection, more hosts than available IPs
  • Firewalls prevent attacks on endhosts
  • security, flexibility, convenience
  • they exist for an "important and permanent reason"
They violate the following internetworking principles
  • 1. "Every Internet entity has a unique network-layer identifier that allows others to reach it."
  • 2."Network elements should not process [the payload of IP] packets that are not addresses to them."
Consequences
  • "scorn" and "dismay"
  • halts the spread of newer protocols and P2P systems
  • layer violations, rigidity in the network infrastructure, may not accommodate new traffic classes
New Idea: Delegation-Oriented Architecture (DOA)
  • implement intermediaries without violating principles
  • Extra DOA Header between IP and TCP
  • the firewall does not need to be on the "physical path"; hosts can "outsource" to "off-path" hosts, and an end host has a primitive to choose a machine to delegate NAT or firewall functionality to
  • the DOA header has: 1. references to persistent host identifiers (in a globally flat namespace, staying with a host even when its IP changes), 2. a way to resolve these references to the delegated machine
  • does not require changes to IP (routers), allows incremental deployment
  • but it cannot circumvent tenet-violating middleboxes (e.g. those of a censorious government)
  • the persistent host identifier, the EID (endpoint identifier), is 160 bits long and has cryptographic meaning
  • mapping service: EID -> IP of the delegated host, or further EIDs (to chain several intermediaries, loose source routing); see the sketch below
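A toy resolution loop for the delegation chain (my own sketch of the loose source-routing idea; the mapping service is just a dict here):

```python
def resolve(eid, mappings):
    """Follow EID -> (IP | list of EIDs) mappings until only IPs remain;
    a host that outsources its firewall maps its EID to the firewall's
    EID followed by its own 'inner' EID."""
    hops, pending = [], [eid]
    while pending:
        entry = mappings[pending.pop(0)]
        if isinstance(entry, str):       # an IP address: a concrete waypoint
            hops.append(entry)
        else:                            # further EIDs to traverse in order
            pending = list(entry) + pending
    return hops

# hypothetical example: host H delegates to firewall F
mappings = {"eid:H": ["eid:F", "eid:H2"], "eid:F": "10.0.0.1", "eid:H2": "10.0.0.2"}
print(resolve("eid:H", mappings))   # ['10.0.0.1', '10.0.0.2']
```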

Internet Indirection Infrastructure

Ion Stoica, Daniel Adkins, Shelley Zhuang, Scott Shenker, Sonesh Surana, "Internet Indirection Infrastructure"

What kind of flavor of network is this?
  • It is a little bit like circuit-switched networking (allowing packet processing on top of it) in the sense that the receivers set up triggers which will then identify how a flow will be routed through the Internet.
  • It is not at all like Active Networks because the "routers" are not "programmed" by the senders through injecting code into the packet, but actually by the receivers through setting up triggers.
  • It uses DHTs to load balance.

Goals
  • generalize the Internet's point-to-point communication abstraction
  • implement features like multicast, anycast, mobility, and load-balancing at the application layer (challenging at the IP layer; previous proposals provided disjoint solutions for each problem separately)
  • data processing (H.263->MPEG conversion) while routing
  • decoupling the act of sending and receiving through rendezvous-based communication abstraction
  • receivers can control routing, can build trees for multicast
Concepts
  • each packet has an identifier; a receiver asks the infrastructure for delivery of the packets with a given identifier
  • a receiver addr expresses interest in identifier id through triggers (id, addr)
  • triggers are matched to packets by largest prefix overlap, which has to exceed an exact-match threshold (see the sketch after this list)
  • sending and receiving hosts don't need to know each others identity or location
  • indirection, e.g. the home agent in Mobile IP
  • private versus public identifier, e.g. for flows versus name lookups
  • Advanced: a stack of identifiers is a list of identifiers or addresses, written id_stack. A trigger can contain a stack (id, id_stack), and a packet can contain a stack (id_stack, payload)
  • soft-state of maintaining triggers (and possibly redundancy)
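A sketch of the matching rule (the 64-bit exact-match threshold is my placeholder, not the paper's exact constant; triggers are (identifier, address) pairs here):

```python
def common_prefix(a: str, b: str) -> int:
    """Length of the common leading bits of two identifiers (bit strings)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_trigger(packet_id, triggers, exact_threshold=64):
    """Among triggers whose identifier shares at least `exact_threshold`
    leading bits with the packet's identifier, pick the longest match."""
    best = max(triggers, key=lambda t: common_prefix(packet_id, t[0]),
               default=None)
    if best and common_prefix(packet_id, best[0]) >= exact_threshold:
        return best[1]   # the address (or id_stack) to forward to
    return None
```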
Main Ideas
  • identifiers map to a unique i3 node through Chord, inheriting Chord's robustness, scalability, efficiency and stability
  • senders send packet to i3 node corresponding to identifier
  • receivers install triggers on the i3 node corresponding to identifier
  • i3 nodes route packets matching triggers to the receivers
  • Advanced: if a trigger has a stack, a packet can be routed to another identifier (e.g. another i3 node) which will then essentially route it further down the stack, e.g. loose source routing. The algorithm is given in Figure 3.
  • Advanced: identifiers can be routed through applications, e.g. HTML-> WML transformation
Details and further ideas
  • server selection through least significant bits in identifier, random for load-balancing, through geographic properties for CDN
  • private and public triggers, example in 4.2
  • scalability, because per-flow state is distributed among many servers
  • avoiding hot-spots in multicast through pushing copies of triggers
  • an i3 proxy supports legacy UDP applications

Saturday, November 1, 2008

DNS more contemporary

Jaeyeon Jung, Emil Sit, Hari Balakrishnan, Robert Morris, "DNS Performance and the Effectiveness of Caching"

The paper presents a lot of interesting measurements and results. There are two prime investigations:
  • overall DNS performance (from a user perspective and from an amount-of-traffic perspective)
  • the impact of caches and TTLs. The motivation comes from a dichotomy: load-balancing applications answer queries with short validity, limiting the usefulness of caching, BUT the scalability of DNS is said to arise (besides from its hierarchical organization) from queries being answered out of caches
Methodology
  • collect Internet traces: DNS packets and TCP SYN/FIN/RST packets
  • trace, through a 60-second window, the process of iterating lookups until the answer is found
  • track TCP connections associated with a DNS query
  • group clients' IP addresses and simulate a common DNS cache for them (sketched below)
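A sketch of that trace-driven simulation as I picture it (the trace format is my assumption):

```python
def shared_cache_hit_rate(trace, client_group):
    """Replay DNS lookups of one client group against a single shared cache;
    a lookup hits if the name was answered before and its TTL is still valid.
    trace: iterable of (time, client, name, ttl) tuples."""
    cache, hits, misses = {}, 0, 0
    for t, client, name, ttl in trace:
        if client not in client_group:
            continue
        if cache.get(name, 0) > t:      # answer still valid in the cache
            hits += 1
        else:
            misses += 1
            cache[name] = t + ttl       # (re)fill with the answer's TTL
    return hits / max(hits + misses, 1)
```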
Performance results
  • distribution of types of DNS lookups (mostly A records hostname -> IP address)
  • half of the DNS lookups are associated with a TCP connection
  • DNS query latency has a median of 1/10 second, but a significant portion takes tens of seconds; distribution of the number of referrals
  • 70% of the queries do not hit a root/gTLD server (i.e. cached NS records improve performance and greatly reduce the load on root servers)
  • a successful DNS query needs on average ~1.3 packets
  • unanswered queries (due to e.g. NS records pointing to no-longer-existing servers) might cause substantially more traffic per query, due to loops and retransmissions
Effect of Caching
  • name popularity follows a Zipf distribution: 10% of names account for 68% of answers, plus a long tail
  • current TTL distributions
  • most caching benefit is achieved with 10-20 clients per cache
  • most caching benefit is achieved with TTLs of several minutes for the A records of web servers (I think, though, that in Figure 13 they should plot the cache miss rate: hit rates of 97% and 99% sound the same, but the latter means a three times lower miss rate, which implies we can serve three times as many clients from, e.g., the same gTLD server)
  • effect of eliminating A-record caching, per-client caching...
Concluding Remark: "The reasons for the scalability of DNS are due less to the hierarchical design of its name space or good A-record caching (as seems to be widely believed); rather, the cacheability of NS-records efficiently partition the name space and avoid overloading any single name server in the Internet." I think that is a rather provocative statement. If no single IP address were cached, every query would have to start at a root server, which is not scalable, so how am I supposed to understand their remark?

I think the numbers in the abstract are fairly useless unless one has read the paper, because only later in the paper does one find the answers to the following questions:
  • it is mentioned what fraction of lookups are unsuccessful: but are these network failures due to overload or dropped packets (which would be bad), or just a user typing a domain name wrong (for which we would expect the lookup to yield no answer)?
  • is the cache miss rate meant for queries destined for a root, gTLD, or domain server? E.g., is the conclusion we can draw from the measurements that if we didn't have caches anymore, the 13 root servers would suddenly be horribly underscaled, or that the name servers of some popular sites would just take a little more load?
  • The browser also has a cache for DNS entries, so is the cache miss rate with respect to each TCP connection the browser makes, or just for every DNS lookup the browser cannot answer itself?
One thing which doesn't make sense to me is the fact that root servers get more lookups than gTLD servers (Table VI).

DNS Intro

Paul V. Mockapetris, Kevin J. Dunlap, "Development of the Domain Name System"

Basic design assumptions
  • provide at least all the same information as hosts.txt
  • allow the database to be maintained in a distributed manner
  • no obvious size limits for names, name components, data associated with a name, etc.
  • interoperate across the DARPA Internet and in as many other environments as possible
  • tolerable performance
  • Lean service versus general distributed database
Design
  • Hierarchical organization (zones) and namespace
  • Caching, negative caching
  • Resource Record: Type, Class, TTL, data of variable types (see the example after this list)
  • servers and resolvers
  • root servers (with rates of 1 query/sec in 1988)
  • datagram (UDP) access
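For concreteness, here are two present-day resource records with the usual fields; "IN" is the Internet class, the one namespace out of several that the Class field was meant to allow:

```python
# (name, TTL, class, type, data): the A record maps a hostname to an IP,
# the MX record names the mail exchanger with a preference value
records = [
    ("www.example.com.", 3600, "IN", "A",  "192.0.2.1"),
    ("example.com.",     3600, "IN", "MX", (10, "mail.example.com.")),
]
```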
Things I was surprised about
  • Pre-DNS: hosts.txt was used for quite a long time
  • it took time to convert hosts from the pre-DNS hosts.txt to DNS and to delegate domains
  • the importance of Berkeley UNIX's BIND
  • applications had to be modified to handle transient failures when using DNS instead of hosts.txt lookups
  • DNS was intended to be far more general, to look up names of anything. Today it is used almost exclusively to map "hostname <-> IP (+MX)"
  • in the early days, people controlling a domain didn't necessarily have the expertise to configure DNS correctly (and I thought in the good old days only people who knew what they were doing had access to the Internet)
  • RRs had a Class field which would have allowed different namespaces for the DARPA net, the ISO directory service, ...