Tuesday, December 2, 2008

Power Laws again

I just read that wars are also scale-free: the distribution of the number of deaths in a conflict follows a power law with exponent 2.5.

This result was found by Neil Johnson (Miami). http://www.dradio.de/dlf/sendungen/forschak/885102/
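A power law with exponent 2.5 has a concrete reading: doubling the death toll makes a conflict roughly 2^-2.5 ≈ 0.18 times as frequent. A minimal sketch (the function name is hypothetical):

```python
# Sketch: relative frequency under a power law p(x) ~ x^(-alpha) with
# alpha = 2.5 (the exponent reported for war death tolls).
# The function name is hypothetical.

def relative_frequency(x1, x2, alpha=2.5):
    """How likely a conflict of severity x2 is relative to one of severity x1."""
    return (x2 / x1) ** (-alpha)

# A conflict twice as deadly is only ~18% as frequent:
print(round(relative_frequency(1000, 2000), 3))  # 0.177
```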

Monday, November 24, 2008

MapReduce Again!

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, "Improving MapReduce Performance in Heterogeneous Environments"

  • Hadoop is an open-source implementation of MapReduce (together with a distributed filesystem, HDFS)
  • new deployments: virtualized environments and datacenters with several generations of hardware exhibit heterogeneity
  • goal: optimize the scheduler to finish fastest without wasting resources
Scheduling the next empty task slot
  • Priority 1: failed tasks
  • Priority 2: non-running tasks
  • Priority 3: Speculative execution
Speculative execution, the native Hadoop way
  • speculative execution depends on a progress score in [0,1], divided equally between the copy, sort and reduce phases
  • a task is marked a straggler if its progress score falls below the average for its category minus 0.2
  • a straggler's task is given to another machine (but at most one speculative copy per task), chosen based on data locality
(underlying assumptions see 2.2)

New speculative execution: LATE=Longest Approximate Time to End
  • idea: speculatively execute the task that will finish farthest into the future (greatest opportunity to save time)
  • different methods for estimating the time left: the basic one is linear extrapolation from the progress score
  • speculative tasks are scheduled on fast nodes (via a threshold on how much work a node has already performed)
  • rank tasks by estimated finish time; schedule a speculative copy only if it is likely to finish before the original
  • refinements to account for the costs of speculative execution: a SpeculativeCap on the number of concurrent speculative tasks and a SlowTaskThreshold
Result: LATE reduces running time by a quarter to a half
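The basic LATE heuristic described above can be sketched as follows; the task representation, the fixed rate cutoff standing in for SlowTaskThreshold, and all names are simplified assumptions, not the paper's implementation:

```python
# Sketch of LATE: estimate time left by linear extrapolation from the
# progress score, then speculate on the slow task farthest from finishing.
# Task fields and the rate cutoff are hypothetical simplifications.

def time_left(progress_score, elapsed_seconds):
    """Linear extrapolation: progress_rate = progress / elapsed."""
    if progress_score <= 0:
        return float("inf")  # no progress yet: assume the worst
    rate = progress_score / elapsed_seconds
    return (1.0 - progress_score) / rate

def pick_speculative_task(running_tasks, slow_rate_cutoff=0.25):
    """Consider only slow tasks (progress rate below a cutoff; the paper uses
    a SlowTaskThreshold percentile instead), then pick the one estimated to
    finish farthest in the future."""
    candidates = [
        t for t in running_tasks
        if t["progress"] / t["elapsed"] < slow_rate_cutoff
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda t: time_left(t["progress"], t["elapsed"]))

tasks = [
    {"id": "a", "progress": 0.9, "elapsed": 2.0},   # fast task
    {"id": "b", "progress": 0.2, "elapsed": 10.0},  # straggler
]
print(pick_speculative_task(tasks)["id"])  # "b"
```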

Policy-aware Switching Layer for Data Centers

Dilip Antony Joseph, Arsalan Tavakoli, Ion Stoica, "A Policy-aware Switching Layer for Data Centers"

  • shares some ideas with "Middleboxes no longer considered harmful": traffic can be directed to traverse middleboxes without the middlebox being on the physical path
  • but it differs: instead of introducing new properties to IP, it replaces Layer II by PLayer
  • usual approach: put middleboxes on the physical path, hard to configure, possibly multiple physical paths between two nodes, different traffic supposed to go through different middleboxes
  • new approach: among Layer II switches, add (policy-aware) pswitches which route the packets through the middleboxes on Layer II based on policy specification
  • Correctness: traffic should traverse middleboxes in the sequence specified by the network administrator
  • Flexibility: sequence of middleboxes should be easily reconfigured
  • Efficiency: traffic should not traverse unnecessary middleboxes
  • Separating policy from (physical) reachability
  • taking middleboxes off the physical network data path
  • policies: [Start,(5-tuple traffic selector)] -> sequence
  • translated into rules ([Previous Hop, Traffic Selector]: Next Hop) which are used by the pswitches
  • policy schemas (versioned) are created by administrator and pushed onto the pswitches
  • decoupling the switch core (can use a traditional core with the usual techniques to learn MAC addresses and build spanning trees) from the policy core (where the new rules are applied), with additional failure detection
  • Ethernet frames are encapsulated in Ethernet-II frames (which contain the policy version): the spanning-tree algorithm needs a different source MAC (a more physical one) than the firewall (a more logical one)
  • SrcMacRewrite to decapsulate Ethernet-II frames before entering (legacy) middlebox
  • in some situations pswitches require per-flow state
  • implementation in Click, results
  • detailed discussions on various topologies
  • formal analysis
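The policy-to-rule translation above can be illustrated with a toy lookup; the hop names, the single-field selector, and the default drop action are all hypothetical simplifications:

```python
# Toy sketch of pswitch rule lookup. A policy like
#   [Start, traffic selector] -> firewall, load balancer
# is pre-translated into per-hop rules
#   [previous hop, selector] -> next hop.
# The hop names, the port-only selector (real selectors match the 5-tuple),
# and the default "drop" are hypothetical.

RULES = {
    # (previous_hop, dst_port): next_hop
    ("ingress", 80): "firewall",
    ("firewall", 80): "load_balancer",
    ("load_balancer", 80): "web_server",
}

def next_hop(previous_hop, dst_port):
    """Where the pswitch forwards the frame next, per the installed rules."""
    return RULES.get((previous_hop, dst_port), "drop")

print(next_hop("ingress", 80))   # firewall
print(next_hop("firewall", 80))  # load_balancer
```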
I like this more than the "Middleboxes no longer considered harmful" because
  • it integrates better into existing Layer II structures instead of adding yet another identifier
  • in "Middleboxes no longer considered harmful", nodes had to be configured to use the off-path firewall, and it was unclear how an IP packet destined for a computer behind an off-path firewall would actually be prevented from reaching it; in this paper this is clear.
I got a bit confused because MAC addresses are used in two different ways: one address is used for forwarding between pswitches, but another can be used when the packet is handed to a middlebox. I think the paper should have given these MACs different names, if possible.

Monday, November 17, 2008

Internet Data Transfer

Niraj Tolia, Michael Kaminsky, David G. Andersen, Swapnil Patil, "An Architecture for Internet Data Transfer"

  • separating content negotiation from data transfer (motivation: 70% of Internet traffic is bulk transfer)
  • common interface between various applications and transfer service architecture, application developer does not need to reinvent transfer service for each new application
  • Figure 3 explains how the application works in the DOT framework. The application calls the GTC for the data transfer
Data transfer methods can even include physically moving hard disks. One benefit is that the same data can be cached across different applications, like the Web and email.
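The cross-application caching idea can be sketched as a content-addressed store; DOT identifies data by content hashes, but the class and method names below are hypothetical:

```python
# Sketch: cross-application caching via content addressing, the idea behind
# DOT's transfer service. Chunks are keyed by their hash, so the same bytes
# fetched by a web browser and a mail client hit the same cache entry.
# The API names are hypothetical.
import hashlib

class ChunkCache:
    def __init__(self):
        self._store = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self._store[key] = data
        return key  # the receiver learns this id during content negotiation

    def get(self, key):
        return self._store.get(key)  # None on a cache miss

cache = ChunkCache()
oid = cache.put(b"same attachment sent by mail and fetched over HTTP")
assert cache.get(oid) == b"same attachment sent by mail and fetched over HTTP"
```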

Delay Tolerant Networks

Kevin Fall, "A Delay-Tolerant Network Architecture for Challenged Internets"

This paper is abstract. It suggests an overlay network that runs on networks very different from the usual Internet and hence provides a very different service model. But there is no implementation (except one mentioned in a single sentence in the conclusion), no results section, and no new algorithms.
I wonder whether the picture in Figure 1 really is a typical example: are there really applications where challenged networks are used for transit, or don't most applications (e.g. sensor networks, satellite connections, ...) just have challenged networks at one or both ends?
And regarding the US Postal System as model for challenged Internets: looking back at my experiences so far, I kind of doubt whether this is a good idea...

Targeted networks
  • Terrestrial Mobile Networks
  • Exotic Media Networks (e.g. deep space, acoustic links)
  • Military Ad-Hoc Networks
  • Sensor Networks
Violated Assumptions
  • an E2E path exists, no disconnections
  • max RTT is not excessive, queuing time is reasonable
  • drop probability << 1
Possibilities to deal with the violated assumptions
  • fix link-layer
  • attach challenged networks at the edge of Internet through proxy, e.g. sensor networks, protocol boosters (e.g. HTTP proxy for satellite connections)
  • electronic mail (mostly reliable, likely failure notification)
  • DTN

New principles of DTN
  • Message-switching (instead of packets)
  • Regions with DTN gateways in between
  • name includes region and entity name within that region
  • several services (motivated by USPS)
  • path selection and scheduling, trying to predict when the next contact to a certain region takes place
  • custody transfer and reliability, e.g. if a message is sent from one node to another node, does the old node need to keep a copy until confirmation the message was delivered
  • convergence layers
  • time synchronization
  • congestion and flow control
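Custody transfer can be sketched as store-and-forward with explicit acknowledgment of responsibility; the node API below is a hypothetical illustration, not the DTN specification:

```python
# Sketch of custody transfer: a node retains a copy of a message until the
# next custodian acknowledges taking responsibility for it. The node API is
# a hypothetical illustration.

class DTNNode:
    def __init__(self, name):
        self.name = name
        self.custody = {}  # message_id -> payload held until handoff is acked

    def send_with_custody(self, msg_id, payload, next_node):
        self.custody[msg_id] = payload       # keep a copy while in flight
        if next_node.accept_custody(msg_id, payload):
            del self.custody[msg_id]         # responsibility transferred
        # else: keep the copy and retry at the next contact opportunity

    def accept_custody(self, msg_id, payload):
        self.custody[msg_id] = payload       # this node is now the custodian
        return True                          # custody acknowledgment

a, b = DTNNode("a"), DTNNode("b")
a.send_with_custody("m1", b"bundle", b)
print("m1" in a.custody, "m1" in b.custody)  # False True
```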

Internet Measurements

Vern Paxson, "End-to-End Internet Packet Dynamics"

I found this paper hard to read, not because it is badly written, but because it discusses the measurements and results in extraordinary depth: discussing each measured effect, coming up with potential explanations and possible effects on higher layers, and even making suggestions for implementations. The lecture on Thursday luckily turned my frustration when reading the paper into admiration for the pioneering role Paxson played in introducing Internet measurement methodology and the enormous body of results he produced.

  • large-scale experiment: 37 sites, N-to-N bulk transfers, done in Dec 1994 and 1995
  • TCP bulk transfer (versus ICMP packet probing): 1. this is what most Internet traffic really looks like, 2. TCP adapts to avoid unduly loading the network
  • TCP measurements intertwine network and transport behavior; tcpanaly separates these behaviors, but is TCP-implementation specific (Vegas versus Reno, Windows NT versus Linux) and cannot recover a full time series analysis
  • Pathologies: Reordering (on certain paths as high as 36%, due to route flaps), Replication (rare) and Corruption (1 in 5000 packets)
  • bottleneck bandwidth (difficult to estimate; basic idea: send two consecutive packets, at the bottleneck the second packet has to wait for the transmission of the first; limited by timing resolution, refined through bunches of packets)
  • packet loss (distinguishes between data and ack loss, loss at loaded and unloaded links) and effect on TCP (causes of a redundant retransmission: unavoidable, coarse feedback (no SACK), bad RTO)
  • packet delay (due to queuing/bandwidth, but even compression was observed)
  • Justification of measurement (TCP) methodology over other measurement techniques
  • wide variety of path properties
  • underlying assumptions about the network often broken
  • robust TCP implementations can work nonetheless
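The packet-pair idea for bottleneck bandwidth can be sketched in a few lines; the example numbers are hypothetical:

```python
# Sketch of the packet-pair idea: two back-to-back packets are spread out by
# the bottleneck link, so the receiver-side gap between them approximates
# packet_size / bottleneck_bandwidth. Paxson refines this with bunches of
# packets; the single-pair version and the numbers here are simplifications.

def bottleneck_bandwidth(packet_size_bytes, arrival_gap_seconds):
    """Estimate the bottleneck bandwidth in bits per second from one pair."""
    return packet_size_bytes * 8 / arrival_gap_seconds

# 1500-byte packets arriving 1.2 ms apart suggest a ~10 Mbit/s bottleneck:
print(round(bottleneck_bandwidth(1500, 0.0012)))  # 10000000
```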

Thursday, November 13, 2008


Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica, "X-Trace: A Pervasive Network Tracing Framework"

  • current diagnostic tools are limited to one particular protocol, e.g. traceroute
  • need for comprehensive view of the system's behavior
  • complex systems: e.g. wikipedia has different sites, web caches, DNS round-robin, load balancers, web servers, database servers (and memcached)
  • tracing across different administrative domains needed
Ideas and design principles
  • integrated tracing framework
  • network protocols modified to propagate X-Trace metadata
  • works inter-layer
  • works inter-Administrative Domains
  • decouples client of application and recipient of tracing data (Design principle 3), destination part of the X-Trace metadata
  • trace initiated by inserting X-Trace metadata by user application or network operator
  • trace identified by task identifier
  • X-Trace data is sent to a report server (can be the client application or a delegated server)
  • X-Trace constructs the task tree offline along two axes: one across "layers" (an event causes another event in a lower layer), one across "time" (an event causes another in the same layer); each node in the task tree has an ID, and children link to their parents
  • Design principle 1: trace requests are sent in-band
  • Design principle 2: trace data are sent out-of-band
  • ASCII report format
  • report library, report collection through e.g. Postgres
  • visualization of task tree
  • the API for applications has pushNext() and pushDown() to propagate X-Trace metadata along the two axes; a device reports information accessible at its own layer and can include additional information like load
  • gradual deployment: for legacy clients, devices in the network can add X-Trace metadata
  • retrofitting X-Trace into existing applications faces difficulties: changes to various protocols (IP options, TCP, HTTP headers, SQL), partial deployment impairs the ability to trace parts of the network, lost trace reports can be interpreted as false positives
  • certain request topologies cannot be captured, e.g. a request that spreads through the network and rendezvouses at a node
  • the unique() function returning the identifier for the task tree is not specified in the paper
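The two propagation axes can be sketched as follows; the metadata layout, counter-based IDs, and report format are hypothetical simplifications of the paper's pushNext()/pushDown() API:

```python
# Sketch of X-Trace metadata propagation along the two task-tree axes:
# push_next() for the causally next event in the same layer, push_down() for
# an event in the layer below. IDs here are simple counters; real X-Trace
# carries opaque operation IDs inside the metadata.
import itertools

_ids = itertools.count(1)

REPORTS = []

def report(md, parent):
    # Out of band: the device sends this to the report server
    # (design principle 2). The tuple format is hypothetical.
    REPORTS.append((md.task_id, parent.op_id, md.op_id))

class XTraceMetadata:
    def __init__(self, task_id):
        self.task_id = task_id       # identifies the whole trace (in-band)
        self.op_id = next(_ids)      # identifies this event / tree node

    def push_next(self):
        """Same layer, next event: the child links back to this node."""
        child = XTraceMetadata(self.task_id)
        report(child, parent=self)
        return child

    def push_down(self):
        """Lower-layer event caused by this one."""
        child = XTraceMetadata(self.task_id)
        report(child, parent=self)
        return child

http = XTraceMetadata(task_id="T1")  # trace initiated at the client
tcp = http.push_down()               # the HTTP request causes a TCP segment
http2 = http.push_next()             # next HTTP event in the same layer
print(len(REPORTS))  # 2
```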
Uses and Experiences
  • low performance overhead
  • Web request and recursive DNS queries
  • Web hosting site (LAMP), users could initiate traces through a JavaScript/PHP library
  • overlay network
  • Tunnels, ISP connectivity
I really liked the framework this paper suggests; I think it is very useful. Despite a lot of accumulated experience, scalable websites are still non-trivial to set up, still require manual work to tune and integrate caches, and work across many different layers, as mentioned in the introduction. Some difficulties not mentioned in the introduction: there are even more caching layers, like the SQL query cache and memcached; relational databases don't scale, so huge websites shard their data across multiple machines; and there is Brewer's trade-off between performance and consistency (a website is not truly transactional, just enough that users don't perceive it as bad, but when a user does see an inconsistency, it is hard to track down). This paper introduces the only debugging tool I am aware of that addresses all these things together.
The only thing I would add to this framework is the ability to send encrypted X-Trace data.