Monday, November 24, 2008

MapReduce Again!

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, "Improving MapReduce Performance in Heterogeneous Environments"


Intro
  • Hadoop is an open-source implementation of MapReduce (together with a distributed filesystem, HDFS)
  • new deployments: virtualized environments and datacenters with several generations of hardware exhibit heterogeneity
  • want to optimize the scheduler so jobs finish fastest without wasting unnecessary resources
Scheduling the next empty task slot
  • Priority 1: failed tasks
  • Priority 2: non-running tasks
  • Priority 3: Speculative execution
Speculative execution, the native Hadoop way
  • Speculative execution depends on a progress score in [0,1], divided equally between the copy, sort and reduce phases (sketched after this list)
  • a task is marked a straggler if its progress score is less than the average minus 0.2
  • a straggler's task is given to another machine (but at most one speculative copy per task), chosen based on data locality
(underlying assumptions see 2.2)
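As a toy illustration of the stock heuristic (identifiers and structure are mine, not Hadoop's actual code):

```python
def progress_score(task):
    """Hadoop's progress score in [0, 1]: for a reduce task, the copy,
    sort and reduce phases each count for a third; within a phase it is
    the fraction of input processed."""
    fraction = task.input_processed / task.input_total
    if task.is_reduce:
        return (task.phases_completed + fraction) / 3
    return fraction

def is_straggler(task, tasks, threshold=0.2):
    """Stock Hadoop marks a task as a straggler when its progress score
    falls more than `threshold` below the average of its category."""
    average = sum(progress_score(t) for t in tasks) / len(tasks)
    return progress_score(task) < average - threshold
```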

New speculative execution: LATE = Longest Approximate Time to End
  • Idea: speculatively execute the task that will finish farthest into the future (greatest opportunity to save time)
  • different methods for estimating the time left: the basic one is linear extrapolation from the progress score (see the sketch after this list)
  • speculative tasks are scheduled on fast nodes only (via a threshold on how much work a node has already performed)
  • rank tasks by estimated time left; launch a speculative copy only if it is likely to finish before the original
  • refinements to account for the costs of speculative execution: a SpeculativeCap on the number of concurrent speculative tasks, and a SlowTaskThreshold
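A minimal sketch of LATE's estimator and selection rule, assuming the basic linear extrapolation described in the paper (the names and the threshold handling are mine):

```python
import time

def estimated_time_left(task, now=None):
    """Basic LATE estimate: progress_rate = score / elapsed,
    time_left = (1 - score) / progress_rate."""
    now = now or time.time()
    rate = task.progress_score / (now - task.start_time)
    return (1 - task.progress_score) / rate

def choose_speculative(running, num_speculating, speculative_cap, slow_threshold):
    """Copy the task that will finish farthest in the future, but only if
    the SpeculativeCap is not exceeded and the task's progress rate is
    below the SlowTaskThreshold."""
    if num_speculating >= speculative_cap:
        return None
    slow = [t for t in running
            if not t.speculated and t.progress_rate < slow_threshold]
    return max(slow, key=estimated_time_left, default=None)
```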
Result: LATE reduces running time by a quarter to a half

Policy-aware Switching Layer for Data Centers

Dilip Antony Joseph, Arsalan Tavakoli, Ion Stoica, "A Policy-aware Switching Layer for Data Centers"

Ideas
  • shares some ideas with "Middleboxes no longer considered harmful": traffic can be directed to traverse middleboxes without the middlebox being on the physical path
  • but: it differs because it does not introduce new properties to IP; instead it replaces the Layer II switching fabric with PLayer
  • usual approach: put middleboxes on the physical path; this is hard to configure, there may be multiple physical paths between two nodes, and different traffic is supposed to go through different middleboxes
  • new approach: among Layer II switches, add (policy-aware) pswitches which route the packets through the middleboxes on Layer II based on policy specification
Goals
  • Correctness: Traffic should traverse middleboxes in the sequence specified by the network administrator
  • Flexibility: sequence of middleboxes should be easily reconfigured
  • Efficiency: traffic should not traverse unnecessary middleboxes
Principles
  • Separating policy from (physical) reachability
  • taking middleboxes off the physical network data path
  • policies: [Start,(5-tuple traffic selector)] -> sequence
  • translated into rules ([Previous Hop, Traffic Selector]: Next Hop) which are used by the pswitches (see the sketch after this list)
  • policy schemas (versioned) are created by administrator and pushed onto the pswitches
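A toy lookup a pswitch might perform, to make the rule format concrete (data structures are my guesses, not PLayer's actual ones):

```python
def next_hop(rules, previous_hop, five_tuple, final_destination):
    """rules: list of ((previous_hop, selector), next_hop), as in the notes
    above; selector is a predicate over the encapsulated packet's 5-tuple."""
    for (rule_prev, selector), hop in rules:
        if rule_prev == previous_hop and selector(five_tuple):
            return hop           # next middlebox in the policy sequence
    return final_destination     # no policy matched: normal forwarding

# hypothetical policy: web traffic arriving from outside must hit the firewall
rules = [(("external", lambda ft: ft["dst_port"] == 80), "firewall-1")]
print(next_hop(rules, "external", {"dst_port": 80}, "web-server"))  # firewall-1
```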
Mechanisms
  • decoupling the switch core (which can be a traditional core, using the usual techniques to learn MAC addresses and build spanning trees) from the policy core (where the new rules are applied), plus additional failure detection
  • encapsulate Ethernet frames in Ethernet-II frames (which carry the policy version): the spanning-tree algorithm needs a different source MAC (a more physical one) than the firewall does (a more logical one)
  • SrcMacRewrite decapsulates Ethernet-II frames before they enter a (legacy) middlebox
  • in some situations pswitches require per-flow state
Other
  • implementation in Click, results
  • detailed discussions on various topologies
  • formal analysis
I like this more than the "Middleboxes no longer considered harmful" because
  • it integrates better into existing Layer II structures instead of adding yet another identifier
  • in "Middleboxes no longer considered harmful", nodes had to be configured to use the off-path firewall, and it was unclear how an IP packet destined for a computer behind an off-path firewall would actually be prevented from reaching that computer. In this paper this is clear.
I got a bit confused because MAC addresses are used in two different ways: one address is used for forwarding between pswitches, but another can be used when the packet is handed to a middlebox. I think the paper should have given these MACs different names, if possible.

Monday, November 17, 2008

Internet Data Transfer

Niraj Tolia, Michael Kaminsky, David G. Andersen, Swapnil Patil, "An Architecture for Internet Data Transfer"

Principles
  • separating content negotiation from data transfer (motivation: 70% of Internet traffic is bulk transfer)
  • a common interface between various applications and the transfer service, so an application developer does not need to reinvent the transfer service for each new application
  • Figure 3 explains how an application works in the DOT framework: the application calls the GTC for the data transfer
Data transfer methods can even include physically moving hard disks. One benefit is that the same data can be cached across different applications like Web and electronic email.
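To make that split concrete, here is my toy mental model (the method names are mine; the point from the paper is that the transfer service names data by its content, so any transfer method or cache can serve it):

```python
import hashlib

class TransferService:
    """Toy stand-in for DOT's GTC: a content-addressed store shared by all
    applications, so the same data is cached across, e.g., Web and mail."""
    def __init__(self):
        self.store = {}

    def put(self, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()  # the object's name is its hash
        self.store[oid] = data
        return oid   # the application sends only this small OID in-band

    def get(self, oid: str) -> bytes:
        # a real service would pick among transfer plugins (TCP, multi-path,
        # even shipping disks) and verify the hash of the received data
        return self.store[oid]
```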

Delay Tolerant Networks

Kevin Fall, "A Delay-Tolerant Network Architecture for Challenged Internets"

This paper is abstract. It suggests an overlay network that runs over networks very different from the usual Internet and hence provides a very different service model. But there is no implementation (except one mentioned in a single sentence in the conclusion), no results section, and no new algorithms.
I wonder whether the picture in Figure 1 really is a typical example: are there really applications where challenged networks are used for transit, or don't most applications (e.g. sensor networks, satellite connections, ...) just have challenged networks at one or both ends?
And regarding the US Postal System as model for challenged Internets: looking back at my experiences so far, I kind of doubt whether this is a good idea...

Targeted networks
  • Terrestrial Mobile Networks
  • Exotic Media Networks (e.g. deep space, acoustic links)
  • Military Ad-Hoc Networks
  • Sensor Networks
Violated Assumptions
  • an E2E path exists, no disconnections
  • max RTT is not excessive, queuing times are reasonable
  • drop probability is small (<< 1)
Possibilities to deal with the violated assumptions
  • fix link-layer
  • attach challenged networks at the edge of the Internet through proxies (e.g. for sensor networks) or protocol boosters (e.g. an HTTP proxy for satellite connections)
  • electronic mail (mostly reliable, likely failure notification)
  • DTN

New principles of DTN
  • Message-switching (instead of packets)
  • Regions with DTN gateways in between
  • a name includes the region and an entity name within that region (see the toy sketch after this list)
  • several services (motivated by USPS)
  • path selection and scheduling, trying to predict when the next contact with a certain region will take place
  • custody transfer and reliability: e.g. if a message is sent from one node to another, does the old node need to keep a copy until it receives confirmation that the message was delivered?
  • convergence layers
  • time synchronization
  • congestion and flow control
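A toy rendering of the region-based naming and store-and-forward queuing (my simplification; field names are mine):

```python
class DTNGateway:
    """Names are (region, entity) tuples; routing looks only at the region,
    and the entity name is bound late, inside the destination region.
    Messages wait in per-contact queues until the next predicted contact."""
    def __init__(self, region, routing_table):
        self.region = region
        self.routing_table = routing_table   # region -> next-hop gateway
        self.queues = {}                     # next hop -> pending messages
        self.delivered = []

    def forward(self, name, message):
        region, entity = name
        if region == self.region:
            self.delivered.append((entity, message))   # late binding
        else:
            nxt = self.routing_table[region]
            self.queues.setdefault(nxt, []).append((name, message))
```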

Internet Measurements

Vern Paxson, "End-to-End Internet Packet Dynamics"

I found this paper hard to read, not because it is badly written, but because it discusses the measurements and results in extraordinary depth: for each measured effect it comes up with potential explanations, possible effects on higher layers, and even suggestions for implementations. The lecture on Thursday luckily turned my frustration when reading the paper into admiration for the pioneering role Paxson played in introducing Internet measurement methodology and for the enormous body of results he produced.

Methodology
  • large-scale experiment: 37 sites, N-to-N bulk transfers, done in Dec 1994 and in 1995
  • TCP bulk transfer (versus ICMP packet probing): 1. this is what most Internet traffic really looks like, 2. TCP adapts to avoid unduly loading the network
  • TCP measurements intertwine network and transport behavior; tcpanaly separates them, which requires knowing the specific TCP implementation (Vegas versus Reno, Windows NT versus Linux) and cannot be recovered from time-series analysis alone
Measurements
  • Pathologies: Reordering (on certain paths as high as 36%, due to route flaps), Replication (rare) and Corruption (1 in 5000 packets)
  • Bottleneck bandwidth (difficult to estimate; basic idea: send two consecutive packets, and at the bottleneck the second packet has to wait for the transmission of the first; limited by timer resolution, refined through packet bunches; see the sketch after this list)
  • packet loss (distinguishes between data and ack loss, loss at loaded and unloaded links) and effect on TCP (causes of a redundant retransmission: unavoidable, coarse feedback (no SACK), bad RTO)
  • packet delay (due to queuing/bandwidth, but even compression was observed)
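The packet-pair idea in a few lines, as a sketch (Paxson's actual estimator uses packet bunches and extensive filtering to cope with timer resolution and noise):

```python
def packet_pair_estimate(send_gap, arrival_gap, packet_size):
    """If two back-to-back packets queue at the bottleneck, the second is
    delayed by one transmission time, so the receiver-side spacing
    reflects the bottleneck bandwidth."""
    gap = max(arrival_gap, send_gap)   # the pair must actually have queued
    return packet_size / gap           # bytes per second

# e.g. 1500-byte packets arriving 1.2 ms apart -> 1.25e6 B/s (~10 Mbit/s)
print(packet_pair_estimate(0.0001, 0.0012, 1500))
```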
Conclusion
  • Justification of measurement (TCP) methodology over other measurement techniques
  • wide variety of path properties
  • underlying assumptions about the network often broken
  • robust TCP implementations can work nonetheless

Thursday, November 13, 2008

X-Trace

Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica, "X-Trace: A Pervasive Network Tracing Framework"

Motivation
  • current diagnostic tools are limited to one particular protocol, e.g. traceroute
  • need for comprehensive view of the system's behavior
  • complex systems: e.g. Wikipedia has different sites, web caches, DNS round-robin, load balancers, web servers, database servers (and memcached)
  • tracing across different administrative domains needed
Ideas and design principles
  • integrated tracing framework
  • network protocols modified to propagate X-Trace metadata
  • works inter-layer
  • works inter-Administrative Domains
  • decouples the client of the application from the recipient of the tracing data (Design principle 3); the destination is part of the X-Trace metadata
  • a trace is initiated by the user application or a network operator inserting X-Trace metadata
  • a trace is identified by its task identifier
  • X-Trace data is sent to a report server (which can be the client application or a delegated server)
  • X-Trace constructs the task tree offline, along two axes: one across "layers" (an event causes another event in a lower layer), one across "time" (an event causes another in the same layer); each node in the task tree has an ID, and children link to their parents
  • Design principle 1: trace request are sent in-band
  • Design principle 2: trace data are sent out-of-band
  • ASCII report format
  • report library, report collection through e.g. Postgres
  • visualization of task tree
Deployment
  • the API for applications has pushNext() and pushDown() to propagate X-Trace metadata along the two axes (see the toy sketch after this list); a device reports information accessible at its own layer and can include additional information like load
  • gradual deployment: for legacy clients, devices in the network can add X-Trace metadata
  • retrofitting X-Trace into existing applications faces difficulties: changes to various protocols (IP options, TCP, HTTP headers, SQL), partial deployment impairs the ability to trace parts of the network, lost trace reports can be interpreted as false positives
  • certain request topologies cannot be captured, e.g. a request that spreads through the network and rendezvouses at a node
  • the unique() function returning identifiers for the task tree is not specified in the paper
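A toy version of the metadata propagation, to make the two axes concrete (field names are my guesses at the spirit of the API, not the real library):

```python
import os

class XTraceMetadata:
    """A TaskID shared by the whole request plus a per-operation ID;
    each child records its parent's operation ID, so the report server
    can rebuild the task tree offline."""
    def __init__(self, task_id, op_id, parent_op_id=None):
        self.task_id, self.op_id, self.parent_op_id = task_id, op_id, parent_op_id

    def _child(self):
        return XTraceMetadata(self.task_id, os.urandom(4).hex(), self.op_id)

    def push_next(self):
        """Propagate to the next event in the same layer (the 'time' axis)."""
        return self._child()

    def push_down(self):
        """Propagate to an event caused in a lower layer (the 'layer' axis)."""
        return self._child()
```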
Uses and Experiences
  • low performance overhead
  • Web request and recursive DNS queries
  • Web hosting site (LAMP), users could initiate traces through a JavaScript/PHP library
  • overlay network
  • Tunnels, ISP connectivity
I really liked the framework this paper suggests, and I think it is very useful. Though there is a lot of accumulated experience, scalable websites are still non-trivial to set up, still require manual work to tune and integrate caches, and work across a lot of different layers, as mentioned in the introduction. Some difficulties not mentioned there: there are even more caching layers, like the SQL query cache and memcached; relational databases don't scale, so huge websites shard their data across multiple machines; and there is Brewer's theme of performance versus consistency (a website is not truly transactional, just enough so that users don't perceive it as bad, but when a user does see an inconsistency, it is hard to track down). This paper introduces the only debugging tool I am aware of that addresses all these things together.
The only thing I would add to this framework is the ability to send encrypted X-Trace data.

Post-lecture: Middleboxes done right?

Last lecture on Thursday again ended in a very interesting discussion.

Randy commented on my last post: "I always like the format you use for these comments! Any thoughts on whether the idea of outsourcing middlebox functions makes any sense?"

Here is my answer:
Yes:
  • framework provides the same functionality without breaking the two principles mentioned in the introduction
  • performance will be competitive only if the middlebox is close to or on the network path, i.e. where the middlebox would be anyway
  • user benefits from being able to choose between the middlebox service and direct service
No:
  • extra conceptual overhead
  • how to gradually deploy?
  • because the middlebox is not physically on the network path blocking IP packets, certain attacks on computers are still possible
  • users need to configure their computers to use the middlebox service (half will forget and leave the network vulnerable)
  • middleboxes might be deployed precisely to prevent certain kinds of services; with the middlebox outsourced, this can be circumvented
  • NAT on traditional boxes offers anonymity and invisibility from the outside

Wednesday, November 5, 2008

Middleboxes done right?

Michael Walfish, Jeremy Stribling, Maxwell Krohn, Hari Balakrishnan, Robert Morris, Scott Shenker, "Middleboxes No Longer Considered Harmful"

Benefits of Intermediate Network Elements like NAT, firewall, transparent cache
  • NAT: private IP spaces allow protection, more hosts than available IPs
  • Firewalls prevent attacks on endhosts
  • security, flexibility, convenience
  • they exist for an "important and permanent reason"
They violate the following internetworking principles
  • 1. "Every Internet entity has a unique network-layer identifier that allows others to reach it."
  • 2."Network elements should not process [the payload of IP] packets that are not addresses to them."
Consequences
  • "scorn" and "dismay"
  • halts the spread of newer protocols and P2P systems
  • layer violations, rigidity in the network infrastructure, may not accommodate new traffic classes
New Idea: Delegation-Oriented Architecture (DOA)
  • implement intermediaries without violating principles
  • Extra DOA Header between IP and TCP
  • the firewall does not need to be on the "physical path"; hosts can "outsource" to "off-path" hosts, and an end host has a primitive to choose a machine to delegate NAT or firewall functionality to
  • the DOA header has: 1. references to persistent host identifiers (in a globally flat namespace, staying with a host even when its IP changes), 2. a way to resolve these references to the delegated machine
  • does not require changes to IP (routers), allows incremental deployment
  • but it cannot circumvent tenet-violating middleboxes (e.g. those of a censorious government)
  • the persistent host identifier, the EID (endpoint identifier), is 160 bits long and has cryptographic meaning
  • mapping service: EID -> IP of the delegated host, or further EIDs (to chain several intermediaries, loose source routing); see the sketch below
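A toy resolution loop for the delegation chain (my own sketch of the loose source-routing idea; the mapping service is just a dict here):

```python
def resolve(eid, mappings):
    """Follow EID -> (IP | list of EIDs) mappings until only IPs remain;
    a host that outsources its firewall maps its EID to the firewall's
    EID followed by its own 'inner' EID."""
    hops, pending = [], [eid]
    while pending:
        entry = mappings[pending.pop(0)]
        if isinstance(entry, str):       # an IP address: a concrete waypoint
            hops.append(entry)
        else:                            # further EIDs to traverse in order
            pending = list(entry) + pending
    return hops

# hypothetical example: host H delegates to firewall F
mappings = {"eid:H": ["eid:F", "eid:H2"], "eid:F": "10.0.0.1", "eid:H2": "10.0.0.2"}
print(resolve("eid:H", mappings))   # ['10.0.0.1', '10.0.0.2']
```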

Internet Indirection Infrastructure

Ion Stoica, Daniel Adkins, Shelley Zhuang, Scott Shenker, Sonesh Surana, "Internet Indirection Infrastructure"

What kind of flavor of network is this?
  • It is a little bit like circuit-switched networking (allowing packet processing on top of it) in the sense that the receivers set up triggers which will then identify how a flow will be routed through the Internet.
  • It is not at all like Active Networks because the "routers" are not "programmed" by the senders through injecting code into the packet, but actually by the receivers through setting up triggers.
  • It uses DHTs to load balance.

Goals
  • generalize the Internet's point-to-point communication abstraction
  • implement features like multicast, anycast, mobility, and load-balancing at the application layer (challenging at the IP layer; previous proposals provided disjoint solutions for each problem separately)
  • data processing (H.263->MPEG conversion) while routing
  • decoupling the act of sending and receiving through rendezvous-based communication abstraction
  • receivers can control routing, can build trees for multicast
Concepts
  • each packet has an identifier; a receiver asks the infrastructure for delivery of the packets with a given identifier
  • a receiver addr expresses interest in identifier id through triggers (id, addr)
  • triggers are matched to packets by largest prefix overlap, which has to exceed an exact-match threshold (see the sketch after this list)
  • sending and receiving hosts don't need to know each others identity or location
  • indirection, e.g. the home agent in Mobile IP
  • private versus public identifier, e.g. for flows versus name lookups
  • Advanced: a stack of identifiers is a list of identifiers or addresses, written id_stack. A trigger can contain a stack (id, id_stack), and a packet can contain a stack (id_stack, payload)
  • soft-state of maintaining triggers (and possibly redundancy)
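A sketch of the matching rule (the 64-bit exact-match threshold is my placeholder, not the paper's exact constant; triggers are (identifier, address) pairs here):

```python
def common_prefix(a: str, b: str) -> int:
    """Length of the common leading bits of two identifiers (bit strings)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_trigger(packet_id, triggers, exact_threshold=64):
    """Among triggers whose identifier shares at least `exact_threshold`
    leading bits with the packet's identifier, pick the longest match."""
    best = max(triggers, key=lambda t: common_prefix(packet_id, t[0]),
               default=None)
    if best and common_prefix(packet_id, best[0]) >= exact_threshold:
        return best[1]   # the address (or id_stack) to forward to
    return None
```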
Main Ideas
  • identifiers map to a unique i3 node through Chord, inheriting Chord's robustness, scalability, efficiency and stability
  • senders send packet to i3 node corresponding to identifier
  • receivers install triggers on the i3 node corresponding to identifier
  • i3 nodes route packets matching triggers to the receivers
  • Advanced: if a trigger has a stack, a packet can be routed to another identifier (e.g. another i3 node) which will then essentially route it further down the stack, e.g. loose source routing. The algorithm is given in Figure 3.
  • Advanced: identifiers can be routed through applications, e.g. HTML-> WML transformation
Details and further ideas
  • server selection through least significant bits in identifier, random for load-balancing, through geographic properties for CDN
  • private and public triggers, example in 4.2
  • scalability, because per-flow state is distributed among many servers
  • avoiding hot-spots in multicast through pushing copies of triggers
  • an i3 proxy supports legacy UDP applications

Saturday, November 1, 2008

DNS more contemporary

Jaeyeon Jung, Emil Sit, Hari Balakrishnan, Robert Morris, "DNS Performance and the Effectiveness of Caching"

The paper presents a lot of interesting measurements and results. There are two prime investigations:
  • overall DNS performance (from a user perspective and from an amount-of-traffic perspective)
  • the impact of caches and TTLs. The motivation comes from a dichotomy: load-balancing applications answer queries with short validity, limiting the usefulness of caching, BUT the scalability of DNS is said to arise (besides from its hierarchical organization) from queries being answered out of caches
Methodology
  • collect Internet traces: DNS packets and TCP SYN/FIN/RST packets
  • trace, through a 60-second window, the process of iterating lookups until the answer is found
  • track TCP connections associated with a DNS query
  • group clients' IP addresses and simulate a common DNS cache for them (sketched below)
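A sketch of that trace-driven simulation as I picture it (the trace format is my assumption):

```python
def shared_cache_hit_rate(trace, client_group):
    """Replay DNS lookups of one client group against a single shared cache;
    a lookup hits if the name was answered before and its TTL is still valid.
    trace: iterable of (time, client, name, ttl) tuples."""
    cache, hits, misses = {}, 0, 0
    for t, client, name, ttl in trace:
        if client not in client_group:
            continue
        if cache.get(name, 0) > t:      # answer still valid in the cache
            hits += 1
        else:
            misses += 1
            cache[name] = t + ttl       # (re)fill with the answer's TTL
    return hits / max(hits + misses, 1)
```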
Performance results
  • distribution of types of DNS lookups (mostly A records hostname -> IP address)
  • half of the DNS lookups are associated with a TCP connection
  • DNS query latency has a median of 1/10 second, but a significant portion takes tens of seconds; distribution of the number of referrals
  • 70% of the queries do not hit a root/gTLD server (i.e. cached NS records improve performance and greatly reduce the load on root servers)
  • a successful DNS query needs on average ~1.3 packets
  • unanswered queries (due to e.g. NS records pointing to no-longer-existing servers) might cause substantially more traffic per query, due to loops and retransmissions
Effect of Caching
  • name popularity follows a Zipf distribution: 10% of names account for 68% of answers, plus a long tail
  • current TTL distributions
  • most caching benefit is achieved with 10-20 clients per cache
  • most caching benefit is achieved with TTLs of several minutes for the A records of web servers (I think, though, that in Figure 13 they should plot the cache miss rate: hit rates of 97% and 99% sound the same, but the latter means a three times lower miss rate, which implies we can serve three times as many clients from, e.g., the same gTLD server)
  • effect of eliminating A-record caching, per-client caching...
Concluding Remark: "The reasons for the scalability of DNS are due less to the hierarchical design of its name space or good A-record caching (as seems to be widely believed); rather, the cacheability of NS-records efficiently partition the name space and avoid overloading any single name server in the Internet." I think that is a rather provocative statement. If no single IP address were cached, every query would have to start at a root server, which is not scalable, so how am I supposed to understand their remark?

I think the numbers in the abstract are fairly useless unless one has read the paper, because only later in the paper does one find the answers to the following questions:
  • it is mentioned what fraction of lookups are unsuccessful: but are these network failures due to overload or dropped packets (which would be bad), or just a user typing a domain name wrong (for which we would expect the lookup to yield no answer)?
  • is the cache miss rate meant for queries destined for a root, gTLD, or domain server? E.g., is the conclusion we can draw from the measurements that if we didn't have caches anymore, the 13 root servers would suddenly be horribly underscaled, or that the name servers of some popular sites would just take a little more load?
  • The browser also has a cache for DNS entries, so is the cache miss rate with respect to each TCP connection the browser makes, or just for every DNS lookup the browser cannot answer itself?
One thing which doesn't make sense to me is the fact that root servers get more lookups than gTLD servers (Table VI).

DNS Intro

Paul V. Mockapetris, Kevin J. Dunlap, "Development of the Domain Name System"

Basic design assumptions
  • provide at least all the same information as hosts.txt
  • allow the database to be maintained in a distributed manner
  • no obvious size limits for names, name components, data associated with a name, etc.
  • interoperate across the DARPA Internet and in as many other environments as possible
  • tolerable performance
  • Lean service versus general distributed database
Design
  • Hierarchical organization (zones) and namespace
  • Caching, negative caching
  • Resource Record: Type, Class, TTL, data of variable types (see the example after this list)
  • servers and resolvers
  • root servers (with rates of 1 query/sec in 1988)
  • datagram (UDP) access
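For concreteness, here are two present-day resource records with the usual fields; "IN" is the Internet class, the one namespace out of several that the Class field was meant to allow:

```python
# (name, TTL, class, type, data): the A record maps a hostname to an IP,
# the MX record names the mail exchanger with a preference value
records = [
    ("www.example.com.", 3600, "IN", "A",  "192.0.2.1"),
    ("example.com.",     3600, "IN", "MX", (10, "mail.example.com.")),
]
```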
Things I was surprised about
  • Pre-DNS: hosts.txt was used for quite a long time
  • it took time to convert hosts from the pre-DNS hosts.txt to DNS and to delegate domains
  • the importance of Berkeley UNIX's BIND
  • applications had to be modified to handle transient failures when using DNS instead of hosts.txt lookups
  • DNS was intended to be far more general, to look up names of anything. Today it is used almost exclusively to map "hostname <-> IP (+MX)"
  • in the early days, people controlling a domain didn't necessarily have the expertise to configure DNS correctly (and I thought in the good old days only people who knew what they were doing had access to the Internet)
  • RRs had a Class field which would have allowed different namespaces for the DARPA net, the ISO directory service, ...