Tuesday, December 2, 2008

Power Laws again

I just read that wars are also scale-free: the distribution of the number of deaths in a conflict follows a power law with exponent 2.5.

This result was found by Neil Johnson (Miami). http://www.dradio.de/dlf/sendungen/forschak/885102/

Monday, November 24, 2008

MapReduce Again!

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, "Improving MapReduce Performance in Heterogeneous Environments"


Intro
  • Hadoop is an open-source implementation of MapReduce (and a distributed filesystem?)
  • new deployments: virtualized environments and datacenters with several generations of hardware display heterogeneity
  • want to optimize the scheduler to finish fastest without wasting unnecessary resources
Scheduling the next empty task slot
  • Priority 1: failed tasks
  • Priority 2: non-running tasks
  • Priority 3: Speculative execution
Speculative execution, the native Hadoop way
  • Speculative execution depends on a progress score in [0,1], divided equally between the copy, sort and reduce phases.
  • a task is marked a straggler if its progress score is below the average for its category minus 0.2 (and it has run for at least a minute)
  • a straggler's task is given to another machine (but at most one speculative copy per task), chosen based on data locality
(for the underlying assumptions, see Section 2.2)

New speculative execution: LATE = Longest Approximate Time to End
  • Idea: speculatively execute the task that will finish farthest into the future (greatest opportunity to save time)
  • different methods for estimating the time left: the basic one is linear extrapolation from the progress score (see the sketch below)
  • speculative tasks are scheduled on fast nodes (via a threshold on how much work a node has already performed)
  • rank tasks by estimated time left and schedule a speculative copy if it is likely to finish before the original
  • refinements to account for costs of speculative execution: SpeculativeCap on number of concurrent speculative tasks and SlowTaskThreshold
Result: LATE reduces running time by a quarter to a half
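
Here is a minimal sketch (Python, not the authors' code) of the basic time-left estimation and of picking the task to speculate on; the SlowTaskThreshold handling is a simplified stand-in and all names are illustrative:

    from collections import namedtuple

    Task = namedtuple("Task", ["name", "progress", "elapsed"])

    def time_left(task):
        """Linear extrapolation: rate = progress / elapsed,
        time_left = (1 - progress) / rate."""
        if task.progress <= 0:
            return float("inf")                   # no progress yet
        rate = task.progress / task.elapsed
        return (1 - task.progress) / rate

    def pick_speculative_task(tasks, slow_fraction=0.25):
        """Among tasks progressing slower than a threshold relative to the mean
        rate (an assumed simplification of SlowTaskThreshold), speculate on the
        one estimated to finish farthest in the future."""
        rates = {t: t.progress / t.elapsed for t in tasks if t.elapsed > 0}
        mean_rate = sum(rates.values()) / len(rates)
        slow = [t for t in rates if rates[t] < slow_fraction * mean_rate]
        return max(slow, key=time_left, default=None)

    tasks = [Task("t1", 0.9, 60), Task("t2", 0.2, 90), Task("t3", 0.003, 80)]
    print(pick_speculative_task(tasks))           # -> t3, the worst straggler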

Policy-aware Switching Layer for Data Centers

Dilip Antony Joseph, Arsalan Tavakoli, Ion Stoica, "A Policy-aware Switching Layer for Data Centers"

Ideas
  • shares some ideas with "Middleboxes no longer considered harmful": traffic can be directed to traverse middleboxes without the middlebox being on the physical path
  • but: differs because it is not introducing new properties to IP, but instead replaces Layer II by PLayer
  • usual approach: put middleboxes on the physical path, hard to configure, possibly multiple physical paths between two nodes, different traffic supposed to go through different middleboxes
  • new approach: among Layer II switches, add (policy-aware) pswitches which route the packets through the middleboxes on Layer II based on policy specification
Goals
  • Correctness: Traffic should traverse middleboxes in the sequence specified by the network administrator
  • Flexibility: sequence of middleboxes should be easily reconfigured
  • Efficiency: traffic should not traverse unnecessary middleboxes
Principles
  • Separating policy from (physical) reachability
  • taking middleboxes off the physical network data path
  • policies: [Start, (5-tuple traffic selector)] -> middlebox sequence
  • translated into rules ([Previous Hop, Traffic Selector]: Next Hop) which the pswitches use (see the sketch after this list)
  • policy schemas (versioned) are created by the administrator and pushed onto the pswitches
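
A minimal sketch (Python, not the paper's implementation) of how such a policy could be compiled into per-hop rules and looked up by a pswitch; all names and the wildcard matching are illustrative assumptions:

    def compile_policy(start, selector, middlebox_sequence, destination):
        """Turn one policy into the per-hop rules the pswitches match on."""
        hops = [start] + middlebox_sequence + [destination]
        return {(prev_hop, selector): next_hop
                for prev_hop, next_hop in zip(hops, hops[1:])}

    def matches(selector, pkt_5tuple):
        """Wildcard-aware match of a packet's 5-tuple against a selector."""
        return all(s == "*" or s == p for s, p in zip(selector, pkt_5tuple))

    def pswitch_next_hop(rules, prev_hop, pkt_5tuple):
        for (rule_prev, sel), nxt in rules.items():
            if rule_prev == prev_hop and matches(sel, pkt_5tuple):
                return nxt
        return None                               # no policy: normal forwarding

    # Example: inbound web traffic must traverse firewall -> load balancer.
    selector = ("*", "*", "10.0.0.5", 80, "tcp")
    rules = compile_policy("core-router", selector,
                           ["firewall-1", "loadbalancer-1"], "webserver")
    pkt = ("198.51.100.9", 34567, "10.0.0.5", 80, "tcp")
    print(pswitch_next_hop(rules, "firewall-1", pkt))   # -> 'loadbalancer-1'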
Mechanisms
  • decoupling the switch core (can use a traditional core with the usual techniques to learn MAC addresses and build spanning trees) from the policy core (where the new rules are applied), with additional failure detection
  • encapsulate Ethernet frames in Ethernet-II frames (which contain the policy version): the algorithm to build the spanning tree needs a different source MAC (a more physical one) than the firewall (a more logical one)
  • SrcMacRewrite to decapsulate Ethernet-II frames before they enter a (legacy) middlebox
  • in some situations pswitches require per-flow state
Other
  • implementation in Click, results
  • detailed discussions on various topologies
  • formal analysis
I like this more than the "Middleboxes no longer considered harmful" paper because
  • it integrates better into existing Layer II structures instead of adding yet another identifier
  • in "Middleboxes no longer considered harmful", nodes had to be configured to use the off-path firewall, and it was unclear how an IP packet destined for a computer behind an off-path firewall would actually be prevented from reaching that computer. In this paper this is clear.
I got a bit confused because MAC addresses are used in two different ways: one address is used for forwarding between pswitches, but another can be used when the packet is handed to a middlebox. I think the paper should have given these MACs different names, if possible.

Monday, November 17, 2008

Internet Data Transfer

Nirja Tolia, Michael Kaminsky, David G. Andersen, Swapnil Patil, "An Architecture for Internet Data Transfer"

Principles
  • separating content negotiation from data transfer (motivation: 70% of Internet traffic is bulk transfer)
  • common interface between various applications and transfer service architecture, application developer does not need to reinvent transfer service for each new application
  • Figure 3 explains how the application works in the DOT framework. The application calls the GTC for the data transfer
Data transfer methods can even include physically moving hard disks. One benefit is that the same data can be cached across different applications like the Web and electronic mail.

Delay Tolerant Networks

Kevin Fall, "A Delay-Tolerant Network Architecture for Challenged Internets"

This paper is abstract. It suggests an overlay network running on networks very different from the usual Internet and hence providing a very different service model. But there is no implementation (except one mentioned in a single sentence in the conclusion), no results section, and no new algorithms.
I wonder whether the picture in Figure 1 really is a typical example: are there really applications where challenged networks are used for transit, or don't most applications (e.g. sensor networks, satellite connections, ...) just have challenged networks at one or both ends?
And regarding the US Postal System as a model for challenged Internets: looking back at my experiences so far, I kind of doubt whether this is a good idea...

Targeted networks
  • Terrestrial Mobile Networks
  • Exotic Media Networks (e.g. deep space, acoustic links)
  • Military Ad Hoc Networks
  • Sensor Networks
Violated Assumptions
  • an E2E path exists, no disconnection
  • max RTT not excessive, reasonable queuing time
  • drop probability is small (<< 1)
Possibilities to deal with the violated assumptions
  • fix link-layer
  • attach challenged networks at the edge of Internet through proxy, e.g. sensor networks, protocol boosters (e.g. HTTP proxy for satellite connections)
  • electronic mail (mostly reliable, likely failure notification)
  • DTN

New principles of DTN
  • Message-switching (instead of packets)
  • Regions with DTN gateways in between
  • name includes region and entity name within that region
  • several services (motivated by USPS)
  • path selection and scheduling, trying to predict when the next contact to a certain region takes place
  • custody transfer and reliability, e.g. if a message is sent from one node to another node, does the old node need to keep a copy until confirmation the message was delivered
  • convergence layers
  • time synchronization
  • congestion and flow control

Internet Measurements

Vern Paxson, "End-to-End Internet Packet Dynamics"

I found this paper hard to read, not because it is badly written, but because it discusses the measurements and results in extraordinary depth, discussing each measured effect, coming up with potential explanations, possible effects on higher layers, and even suggestions for implementations. The lecture on Thursday luckily turned my frustration when reading the paper into admiration for the pioneering role Paxson played in introducing Internet measurement methodology and the enormous body of results he produced.

Methodology
  • large-scale experiment: 37 sites, N-to-N bulk transfers, done in Dec 1994 and in 1995
  • TCP bulk transfer (versus ICMP packet probing): 1. this is what most Internet traffic really looks like, 2. TCP adapts to avoid unduly loading the network
  • TCP measurements intertwine network and transport behavior; tcpanaly separates these behaviors, is TCP-implementation specific (Vegas versus Reno, Windows NT versus Linux), and cannot recover time series for analysis
Measurements
  • Pathologies: Reordering (on certain paths as high as 36%, due to route flaps), Replication (rare) and Corruption (about 1 in 5000 packets)
  • Bottleneck bandwidth (difficult to estimate; basic idea: send two back-to-back packets, at the bottleneck the second packet has to wait for the transmission of the first, so their spacing reveals the bottleneck bandwidth; limited by timer resolution, refined through packet bunches; see the sketch below)
  • packet loss (distinguishes between data and ACK loss, loss at loaded and unloaded links) and effect on TCP (causes of a redundant retransmission: unavoidable, coarse feedback (no SACK), bad RTO)
  • packet delay (due to queueing/bandwidth, but even compression observed)
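
A minimal sketch of the packet-pair/packet-bunch idea (Python; this is not Paxson's tcpanaly, just the basic formula):

    def packet_pair_estimate(t1, t2, packet_size_bytes):
        spacing = t2 - t1                         # seconds between arrivals
        if spacing <= 0:
            return None                           # below timer resolution
        return packet_size_bytes * 8 / spacing    # bits per second

    def packet_bunch_estimate(arrival_times, packet_size_bytes):
        """Refinement with k back-to-back packets to get above timer resolution."""
        k = len(arrival_times)
        spacing = arrival_times[-1] - arrival_times[0]
        return (k - 1) * packet_size_bytes * 8 / spacing

    # e.g. 512-byte packets arriving 0.8 ms apart -> about 5.1 Mbit/s bottleneck
    print(packet_pair_estimate(0.0000, 0.0008, 512))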
Conclusion
  • Justification of measurement (TCP) methodology over other measurement techniques
  • wide variety of path properties
  • underlying assumptions about the network often broken
  • robust TCP implementations can work nonetheless

Thursday, November 13, 2008

X-Trace

Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica, "X-Trace: A Pervasive Network Tracing Framework"

Motivation
  • current diagnostic tools are limited to one particular protocol, e.g. traceroute
  • need for comprehensive view of the system's behavior
  • complex systems: e.g. wikipedia has different sites, web caches, DNS round-robin, load balancers, web servers, database servers (and memcached)
  • tracing across different administrative domains needed
Ideas and design principles
  • integrated tracing framework
  • network protocols modified to propagate X-Trace metadata
  • works inter-layer
  • works inter-Administrative Domains
  • decouples client of application and recipient of tracing data (Design principle 3), destination part of the X-Trace metadata
  • trace initiated by inserting X-Trace metadata by user application or network operator
  • trace identified by task identifier
  • X-Trace data is sent to a report server (can be the client application or a delegated server)
  • X-Trace constructs the task tree offline, along two axes: one across "layers" (an event causes another event in a lower layer), one across "time" (an event causes another in the same layer); each node in the task tree has an ID, and children link to their parents
  • Design principle 1: trace request are sent in-band
  • Design principle 2: trace data are sent out-of-band
  • ASCII report format
  • report library, report collection through e.g. Postgres
  • visualization of task tree
Deployment
  • the API for applications has pushNext() and pushDown() to propagate the X-Trace metadata along the two axes (see the sketch after this list); a device reports information accessible at its own layer and can include additional information like load
  • gradual deployment: for legacy clients, devices in the network can add X-Trace metadata
  • retrofitting X-Trace into existing applications faces difficulties: changes to various protocols (IP options, TCP, HTTP headers, SQL), partial deployment impairs the ability to trace parts of the network, lost trace reports can be interpreted as false positives
  • certain request topologies cannot be captured, e.g. a request that spreads through the network and rendezvouses at a node
  • unique() function returning identifier for task tree not specified in paper
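
A minimal sketch of how the metadata propagation could look (Python; field names and the report format are my own illustrative assumptions, not the paper's exact wire format):

    import random

    def unique():
        return random.getrandbits(32)             # opaque operation ID

    class XTraceMetadata:
        def __init__(self, task_id, parent_id=None, edge=None):
            self.task_id = task_id                # identifies the whole task tree
            self.parent_id = parent_id            # link back to the causing event
            self.edge = edge                      # "next" (same layer) or "down"
            self.op_id = unique()

        def push_next(self):
            """Propagate to the next event in the same layer (the 'time' axis)."""
            return XTraceMetadata(self.task_id, self.op_id, "next")

        def push_down(self):
            """Propagate to an event caused in the layer below (the 'layer' axis)."""
            return XTraceMetadata(self.task_id, self.op_id, "down")

    def report(md, where):
        """Out-of-band report; offline, records with the same task_id are joined
        on their parent links to rebuild the task tree."""
        print(f"task={md.task_id} op={md.op_id} parent={md.parent_id} "
              f"edge={md.edge} at={where}")

    # e.g. an HTTP request causing a TCP send on the same host
    http_md = XTraceMetadata(task_id=unique())
    report(http_md, "http")
    report(http_md.push_down(), "tcp")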
Uses and Experiences
  • low performance overhead
  • Web request and recursive DNS queries
  • Web hosting site (LAMP), users could initiate traces through a JavaScript/PHP library
  • overlay network
  • Tunnels, ISP connectivity
I really liked the framework this paper suggests. I think it is very useful. Though there has been a lot of experience, scalable websites are still non-trivial to set up, still require manual work to tune and integrate caches, and work across a lot of different layers, as mentioned in the introduction. Some difficulties not mentioned in the introduction: there are even more caching layers like the SQL query cache and memcached; relational databases don't scale, so huge websites shard their data across multiple machines; and there is Brewer's theme of performance versus consistency (a website is not truly transactional, just enough so that users don't perceive it as bad, but when a user does see an inconsistency, it is hard to track down). This paper introduces the only debugging tool I am aware of which addresses all these things together.
The only thing I would add to this framework is the ability to send encrypted X-Trace data.

Post-lecture: Middleboxes done right?

Last lecture on Thursday again ended in a very interesting discussion.

Randy commented on my last post: "I always like the format you use for these comments! Any thoughts on whether the idea of outsourcing middlebox functions makes any sense?"

Here is my answer:
Yes:
  • framework provides the same functionality without breaking the two principles mentioned in the introduction
  • performance will be competitive only if the middlebox is close to or on the network path, i.e. where the middlebox would be anyway
  • user benefits from being able to choose between the middlebox service and direct service
No:
  • extra conceptual overhead
  • how to gradually deploy?
  • because the middlebox is not physically on the network path blocking IP packets, certain attacks on computers are still possible
  • users need to configure their computers to use the middlebox service (half will forget and leave the network vulnerable)
  • middleboxes might be deployed to actually prevent certain kinds of services; with the middlebox being outsourced, this can be circumvented
  • NAT on traditional boxes offers anonymity and invisibility from the outside

Wednesday, November 5, 2008

Middleboxes done right?

Michael Walfish, Jeremy Stribling, Maxwell Krohn, Hari Balakrishnan, Robert Morris, Scott Shenker, "Middleboxes No Longer Considered Harmful"

Benefits of Intermediate Network Elements like NAT, firewall, transparent cache
  • NAT: private IP spaces allow protection, more hosts than available IPs
  • Firewalls prevent attacks on endhosts
  • security, flexibility, convenience
  • exist for an "important and permanent reason"
They violate the following internetworking principles
  • 1. "Every Internet entity has a unique network-layer identifier that allows others to reach it."
  • 2. "Network elements should not process [the payload of IP] packets that are not addressed to them."
Consequences
  • "scorn" and "dismay"
  • halts spread of newer protocols, P2P systems
  • layer violation, rigidity in the network infrastructure, may not accommodate new traffic classes
New Idea: Delegation-Oriented Architecture (DOA)
  • implement intermediaries without violating principles
  • Extra DOA Header between IP and TCP
  • Firewall does not need to be in the "physical path", but hosts can "outsource" to "off-path" hosts, end host has primitive to choose a machine to delegate NAT or Firewall functionality to
  • DOA header has: 1. references to persistent host identifier (in globally flat name space, stays with host even when IP changes), 2. a way to resolve these references to delegated machine
  • does not require changes to IP (routers), allows incremental deployment
  • but cannot: circumvent tenet-violating middleboxes (e.g. by a censorious government)
  • persistent host identifier = EID (endpoint identifier), 160 bits, with cryptographic meaning
  • mapping service: EID -> IP of the delegated host, or more EIDs (to chain several intermediaries, like loose source routing); see the sketch below
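
A minimal sketch of DOA-style resolution (Python), assuming a flat EID -> record mapping service (the paper uses a DHT); all names are illustrative:

    import hashlib

    def eid_from_public_key(pubkey_bytes):
        # Illustrative: one way an EID gets its cryptographic meaning is as a
        # 160-bit hash of the host's public key (self-certifying).
        return hashlib.sha1(pubkey_bytes).hexdigest()

    # The mapping service resolves an EID either to an IP address (deliver here)
    # or to another EID (the delegated intermediary, e.g. an off-path firewall).
    mapping = {
        "eid-host":     {"delegate_eid": "eid-firewall"},  # host outsources to firewall
        "eid-firewall": {"ip": "192.0.2.7"},               # firewall's actual location
    }

    def resolve(eid, depth=0):
        """Follow the delegation chain (loose source routing over EIDs)."""
        record = mapping[eid]
        if "ip" in record:
            return record["ip"]
        if depth > 8:                             # guard against delegation loops
            raise RuntimeError("delegation chain too long")
        return resolve(record["delegate_eid"], depth + 1)

    print(resolve("eid-host"))                    # packets for the host go to 192.0.2.7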

Internet Indirection Infrastructure

Ion Stoica, Daniel Adkins, Shelley Zhuang, Scott Shenker, Sonesh Surana, "Internet Indirection Infrastructure"

What kind of flavor of network is this?
  • It is a little bit like circuit-switched networking (allowing packet processing on top of it) in the sense that the receivers set up triggers which will then identify how a flow will be routed through the Internet.
  • It is not at all like Active Networks because the "routers" are not "programmed" by the senders through injecting code into the packet, but actually by the receivers through setting up triggers.
  • It uses DHTs to load balance.

Goals
  • generalize Internet P2P communication abstraction
  • implement features like multicast, anycast, mobility at the application layer (challenging at the IP layer; previous proposals provided disjoint solutions for each problem separately); load balancing
  • data processing (H.263->MPEG conversion) while routing
  • decoupling the act of sending and receiving through rendezvous-based communication abstraction
  • receivers can control routing, can build trees for multicast
Concepts
  • each packet has an identifier; the receiver asks the infrastructure for delivery of the packets with a given identifier
  • receiver addr expresses interest in identifier id through triggers (id,addr)
  • triggers are matched to packets by longest prefix match, and the match has to exceed an exact-match threshold (see the sketch after this list)
  • sending and receiving hosts don't need to know each others identity or location
  • indirection, e.g. the home agent in Mobile IP
  • private versus public identifier, e.g. for flows versus name lookups
  • Advanced: a stack of identifiers is a list of identifiers or addresses written as id_stack. A trigger can contain stacks (id,id_stack), a packet can contain a stack (id_stack,payload)
  • soft-state of maintaining triggers (and possibly redundancy)
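
A minimal sketch of the trigger matching (Python); the 256-bit identifiers and the 128-bit exact-match threshold are assumptions for illustration:

    ID_BITS = 256
    EXACT_MATCH_BITS = 128

    def prefix_len(a, b):
        """Length of the common bit prefix of two integer identifiers."""
        diff = a ^ b
        return ID_BITS if diff == 0 else ID_BITS - diff.bit_length()

    def match_trigger(packet_id, triggers):
        """triggers: list of (trigger_id, addr). Return the addr of the best
        match (longest prefix above the exact-match threshold) or None."""
        best, best_len = None, -1
        for trig_id, addr in triggers:
            plen = prefix_len(packet_id, trig_id)
            if plen >= EXACT_MATCH_BITS and plen > best_len:
                best, best_len = addr, plen
        return best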
Main Ideas
  • identifiers map to a unique i3 node through Chord, inheriting Chord's robustness, scalability, efficiency and stability
  • senders send packet to i3 node corresponding to identifier
  • receivers install triggers on the i3 node corresponding to identifier
  • i3 nodes route packets matching triggers to the receiver
  • Advanced: if a trigger has a stack, a packet can be routed to another identifier (e.g. i3 node) who will then essentially further route it down the stack, e.g. loose source routing. The algorithm is given in Figure 3.
  • Advanced: identifiers can be routed through applications, e.g. HTML-> WML transformation
Details and further ideas
  • server selection through least significant bits in identifier, random for load-balancing, through geographic properties for CDN
  • private and public triggers, example in 4.2
  • scalability because per-flow state is distributed among many servers
  • avoiding hot-spots in multicast through pushing copies of triggers
  • i3 proxy allows legacy UDP-applications

Saturday, November 1, 2008

DNS more contemporary

Jaeyeon Jung, Emil Sit, Hari Balakrishnan, Robert Morris, "DNS Performance and the Effectiveness of Caching"

The paper makes and mentions a lot of interesting measurements and results. There are two prime investigations:
  • overall DNS performance (from a user perspective and from an amount of traffic perspective)
  • the impact of caches and TTL. The motivation comes from a dichotomy: load-balancing applications answer queries with responses of short validity, limiting the use of caching, BUT the scalability of DNS is said to arise (besides the hierarchical organization) from queries being answered from caches
Methodology
  • collect Internet traces: DNS packets and TCP SYN/FIN/RST packets
  • trace, through a 60-second window, the process of iterating lookups until the answer is found
  • track TCP connections associated to a DNS query
  • group clients' IP addresses and simulate a common DNS cache for them
Performance results
  • distribution of types of DNS lookups (mostly A records hostname -> IP address)
  • half of the DNS lookups are associated to a TCP connection
  • DNS query latency has a median of 1/10 second, but a significant portion takes up to tens of seconds; distribution of the number of referrals
  • 70% of the queries do not hit a root/gTLD server (i.e. cached NS records improve performance and greatly reduce load on the root servers)
  • a successful DNS query needs on average ~1.3 packets
  • unanswered queries (due to e.g. NS records to no longer existing servers) might cause substantially more traffic per query due to loops and retransmissions
Effect of Caching
  • name popularity follows a Zipf distribution: 10% of names account for 68% of answers, plus a long tail
  • current TTL distributions
  • most caching benefit is achieved with 10-20 clients per cache
  • most caching benefit is achieved for TTL ~ several minutes for A records of web servers (I think though that in Figure 13 they should plot the cache miss rate: hit rates of 97% and 99% sound the same, but the latter means a three times lower miss rate, which implies we can serve three times as many clients from the same e.g. gTLD server; see the sketch below)
  • effect of eliminating A-record caching, per-client caching...
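
A quick back-of-the-envelope for the Figure 13 point (Python):

    # Hit rates of 97% and 99% sound similar, but the load that escapes the
    # cache (and hence hits e.g. a gTLD server) differs by a factor of three.
    baseline_miss = 1.0 - 0.97
    for hit_rate in (0.97, 0.99):
        miss_rate = 1.0 - hit_rate
        # Upstream load per client is proportional to the miss rate, so the
        # clients one upstream server can handle scale as 1 / miss_rate.
        print(f"hit {hit_rate:.0%}: miss {miss_rate:.0%}, "
              f"clients served vs. the 97% cache: {baseline_miss / miss_rate:.1f}x")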
Concluding Remark: "The reasons for the scalability of DNS are due less to the hierarchical design of its name space or good A-record caching (as seems to be widely believed); rather, the cacheability of NS-records efficiently partition the name space and avoid overloading any single name server in the Internet." I think that is a rather provocative statement. If no single IP address were cached, every query would have to start at a root server, which does not scale, so how am I supposed to understand their remark?

I think the numbers in the abstract are fairly useless unless one has read the paper, because only later in the paper does one find the answers to the following questions:
  • it is mentioned what fraction of lookups are unsuccessful: but are these network failures due to overload or dropped packets (which would be bad) or just because a user typed a domain name wrong (for which we expect the lookup to result in no answer)?
  • is the cache miss rate meant for a query destined for a root, gTLD or domain server? E.g. is the conclusion we can draw from the measurements that if we didn't have caches anymore, suddenly the 13 root servers would be just horribly underscaled, or that the name servers for some popular sites would just hit a little more load?
  • The browser also has cache for DNS entries, so is the cache miss rate with respect to each TCP connection the browser makes, or just for every DNS lookup the browser cannot answer itself?
One thing which doesn't make sense to me is the fact that the root servers get more lookups than the gTLD servers (Table VI).

DNS Intro

Paul V. Mockapetris, Kevin J. Dunlap, "Development of the Domain Name System"

Basic design assumptions
  • provide at least all the same information as hosts.txt
  • allow the database to be maintained in a distributed manner
  • no obvious size limits for names, name components, data associated with a name, etc.
  • interoperate across the DARPA Internet and in as many other environments as possible
  • tolerable performance
  • Lean service versus general distributed database
Design
  • Hierarchical organization (zones) and namespace
  • Caching, negative caching
  • Resource Record: Type, Class, TTL, data of variable types
  • servers and resolvers
  • root servers (with rates of 1 query/sec in 1988)
  • datagram (UDP) access
Things I was surprised about
  • Pre-DNS: hosts.txt was used for quite a long time
  • it took time to convert more hosts from pre-DNS hosts.txt to DNS and delegate domains
  • the importance of Berkeley UNIX's bind
  • applications had to be modified to handle transient failures when using DNS instead of hosts.txt lookups
  • DNS was intended to be far more general, to look up names of anything. Today it is used almost exclusively to map "hostname <-> IP (+MX)"
  • in the early days, people controlling a domain didn't necessarily have the expertise to configure DNS correctly (and I thought in the good old days, only people who knew what they were doing had access to the Internet)
  • RR had Class field which would allow different namespaces for DARPA net, ISO directory service, ...

Thursday, October 30, 2008

DHT versus DNS

Today was a great class discussion about Chord, DNS and co. Here is my list of questions and thoughts I never got to ask in class; some got at least partially answered during the discussion.

Pro DHT:
  • Google does not use Chord and uses more traditional hierarchical systems, but should we trust all of our data to single companies, as hierarchical systems require?
  • Authority: Can we trust the data if we replaced hierarchical DNS by Chord? Yes, if we have a hierarchical system of digital signatures (data and public keys signed by the root, country registrars and domain, subdomain owners), bootstrapped by root signatures, which corresponds to us needing to know the IP addresses of the root servers.
  • DHTs like Chord do (theoretically) a beautiful job on load balancing and scalability (compare to BGP, where each router needs routing information for about 200,000 prefixes, whereas Chord only needs 32 fingers for 2^32 IDs)
Pro DNS:
  • Malicious nodes can still spoof the non-existence of a certain domain in a DHT.
  • A DHT provides no geographic locality and doesn't allow for techniques like anycast
  • DNS recursers can be placed at strategic places, e.g. my provider can put a DNS recurser between its clients and the DNS name servers, so DNS queries would follow the paths packets flow along instead of detouring, which would add latency
  • DNS has designated servers
  • DNS caching is very efficient because 99% of all queries are probably for the same 1000 domains. This also speaks to the importance of putting caches at good strategic places, i.e. each ISP should put its own cache close to its edge routers

Some more words on Active Networks

  • It seems to me that Active Networks can only be a compromise between security and usability.
  • In the paper the API was so restricted that the only thing an active node could do was really only forward packets (maybe modify the header a little bit). Using code for this decision is superfluous; source routing would actually give the client the same control.
  • This would be more interesting if active nodes could actually in some way process data like aggregate information, so maybe Active Networks are interesting for Sensor-Networks like applications.

Wednesday, October 29, 2008

Chord and tree-like diagrams

I made a picture to show how tree-like DHTs forward queries.



and how Chord forwards queries

The Beauty of Chord

Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications"

Chapter 3 motivates applications.

Underlying geometry
  • ID space is a ring
  • each ID in ID space is mapped to a node. A node corresponds to a block of consecutive IDs. For redundancy, these blocks can overlap.
  • a node maintains a "routing" table (called fingers) which has an entry to reach the node responsible for the ID at distance 1, 2, 4, 8, 16, 32, 64, 128, ... (see the sketch after this list)
  • for handling node failures, a node also maintains a table of its k immediate successors in the ID space
  • this really is a small-world network: the entries to its immediate successors make it highly clustered, and the routing table entries with long distances keep the path length short (log of the size of the ID space)
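
A minimal sketch of a Chord-style lookup over such a finger table (Python; static node set, no joins or stabilization, so this is not the paper's pseudocode):

    M = 8
    RING = 2 ** M
    nodes = sorted([5, 40, 99, 150, 200])         # node IDs on the ring

    def successor(ident):
        """First node at or clockwise after ident."""
        for node in nodes:
            if node >= ident:
                return node
        return nodes[0]                           # wrap around

    def finger_table(n):
        """Entry i points to the node responsible for ID n + 2^i (mod 2^m)."""
        return [successor((n + 2 ** i) % RING) for i in range(M)]

    def between(x, a, b):
        """True if x lies strictly clockwise between a and b on the ring."""
        return a < x < b if a < b else (x > a or x < b)

    def lookup(n, key, hops=0):
        """Resolve key starting at node n; returns (responsible node, hops)."""
        succ = successor((n + 1) % RING)          # n's immediate successor
        if key == succ or between(key, n, succ):  # key in (n, succ]
            return succ, hops
        # forward to the closest preceding finger of key
        for f in sorted(finger_table(n), key=lambda f: (f - n) % RING, reverse=True):
            if between(f, n, key):
                return lookup(f, key, hops + 1)
        return succ, hops + 1                     # fall back to the successor

    print(lookup(5, 120))                         # -> (150, 1)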
Mechanisms
  • Node (say A) joins: it is assigned a block in ID space. It computes the IDs at distance 1, 2, 4, 8, 16, 32, ... from its own block. It uses the existing routing tables of other nodes to query which nodes are responsible for these IDs and constructs a routing table. It also traces back which routing tables of other nodes have to be changed (at most log n, querying each costs log n) and updates those existing nodes. Lastly, data has to be moved, which is left to the application.
  • every node periodically runs stabilization to correct routing tables to account for new nodes
The routing table could list several nodes per n whose distances in ID space are less than but close to 2^n. That way lookup failures could be reduced if nodes fail.

P2P DHT Intro

Hari Balakrishnan, Frank Kaashoek, David Karger, Robert Morris, Ion Stoica, "Looking up Data in P2P Systems"

General
  • A really nice introduction to P2P systems. It mostly focuses on how to look up data in a decentralized and scalable manner without discussing how to build a system like file-sharing on top of it.
  • P2P has other aspects than illegal music sharing to it
  • Some more words on resilience and the definition of correctness would be nice: the network will fail; can any key be manipulated, or is only the failing or malicious node's share of keys lost or manipulated?
  • the paper mixes desired features of DHTs (like scalability) with mechanisms (like the distance function)
Desired Features of DHT
  • hash function: name -> hash (in "ID space"), DHT: hash -> node, node stores value for key, application decides what value means (value could be reference where file is stored)
  • no centralized servers, hierarchy
  • load-balanced key lookup
  • "routing" information at each hop scales logarithmically
  • "routing" hops scale logarithmically
  • adaptive to nodes concurrently leaving and joining
  • low overhead
  • correctness while "routing" tables are still converging, but performance penalty allowed
  • fault tolerance
  • advanced: proximity routing
  • advanced: isolating the impact of malicious nodes
  • advanced: queries like keyword search
How are DHTs similar to or different from normal routing?
  • routing forwards packets until they reach the destination; DHTs redirect key lookups to a different node until the right node is found
  • the table of a DHT maps the ID space to nodes, whereas the routing table at A maps the space of nodes to nodes connected to A
  • the table of a DHT can map an ID to an arbitrary node, while a routing table is constrained to connected nodes
Mechanisms
  • All DHT algorithms are inspired by putting structure on the ID space and equipping with a distance function
  • chord: one-dimensional ring, distance is numeric difference (not absolute value, i.e. unidirectional), "routing" table points to nodes at distance 1, 2, 4, 8, 16, 32, 64, 128, ...
  • tree-like routing: nodes are leaves of (binary) tree, distance is prefix-length with numerical difference as tie-breaker, "routing" table at node has an entry for each prefix-length pointing to a random node who is not in that prefix
  • n-dimensional space: each node covers a hypercube of ID's, "routing" table contains neighboring hypercubes, distance probably=Euclidean distance in n-space (was not explained in paper)
  • algorithms maintaining keys, routing information for node joins, leaves, load-balancing. Algorithms maintain invariants for correctness, converge to efficient routing tables for performance

Monday, October 27, 2008

Are Active Networks just Computer Viruses?

David Wetherall, "Active network vision and reality: lessons from a capsule-based system"

If I think about it: the way capsules propagate through a network is by injecting code into the router software to spread to the next router. So this is really how a worm/virus spreads through a network. Several active networks have already been implemented: the Code Red worm, Melissa, happy99.exe, Chernobyl. Some of this research eventually led to imprisonment; apparently the FBI doesn't like research on Active Networks. There are even organizations fighting Active Networks, for example Symantec. And unfortunately, Windows is a little ahead of Linux in terms of Active Network implementations.

Active Network basics
  • a packet is called a capsule and carries code (perhaps by reference) executed on the router (active node) to process and eventually forward it
  • benefits: flexible network layer, decoupling services from the underlying infrastructure, enables fast deployment of new features across WANs
  • concerns: mobile code raises security and performance issues
Characterization of Active networks
  • capsule model of programmability
  • accessibility of that model to all users
  • applications that can be constructed
Mechanisms in the paper
  • code is carried by reference, an MD5 hash which serves as the "type of service" field; this reduces the capsule header to a size comparable to an IP header
  • code also given a human readable name, distributed through DNS-like directory service
  • code cached locally
  • capsules are forwarded as IP packets between conventional routers and processed as capsules by active routers: this allows for gradual deployment and deployment on heterogeneous networks, and a router can be split into a layer processing some packets fast in hardware and some slowly in software
  • code is written in type-safe language (e.g. Java ) or with proof
  • code is executed in isolation from capsules of different type, in sandbox
  • code's runtime and resources are limited
  • code may not change type of service when routing or creating new packets
  • code must lower TTL
  • code can only create limited number of new capsules (e.g. limited by total TTL)
  • but still: code is certified by trusted authority (IETF)
  • API for capsule's code: query node environment, soft-state storage, routing, possibly loss information
  • tests run on the ANTS toolkit, in Java, as an overlay network on UDP
Findings
  • code distribution adds insignificant overhead
  • the analysis of protection attempts to answer the following questions: 1. can the node runtime be corrupted by service code? 2. can distributed service code be corrupted or spoofed? 3. Can cached state be inadvertently manipulated by another service?
  • the analysis of resource management: 1. Can a capsule consume a large, possibly unbounded amount of resources at a single node? 2. Across a large number of nodes? 3. Can an application running on an end-system be used to inject a large, possibly unbounded number of capsules into the network?
Possible Applications
  • Characterization of services: expressible, compact, fast, incrementally deployable
  • new services based on network layer state
  • multicast, reliable multicast support, ECN, PIP, anycast
Comments
  • The idea of active networks is very interesting, particularly regarding the many algorithms and techniques like fair queueing, ECN, XCP, RED, multicast, QoS, anycast which exist but are difficult to deploy on existing infrastructure.
  • But unfortunately the security and resource concerns restrict the abilities of the capsule's code (e.g. isolation from other flows) to implement some of these algorithms. The list of remaining techniques did not convince me. Anycast works reasonably well implemented in BGP and through DNS. Maybe it is the revival of multicast?

RON

David Andersen, Hari Balakrishnan, Frans Kaashoek, Robert Morris, "Resilient Overlay Networks"

Shortcomings of today's routing structure
  • A quarter of randomly picked connections between two hosts suffer outages of more than half an hour in a three-day period
  • BGP filters and summarizes routing information in the interest of scalability and policy enforcement, and delays information to prevent oscillations
  • BGP deems a connection alive if the BGP session is still alive; however, persistent congestion on a link might degrade E2E performance
  • BGP policies often do not use private peering links to provide failover capabilities
Proposal
  • Overlay Network of nodes routing packets between them based on RON header
  • Size: a couple of dozen nodes (cluster maintained through RON membership manager)
  • Virtual link between each pair of distinct nodes
  • optimizes different application-specific metrics: API provides explicit way for application to prioritize among throughput, delay, packet loss rate and to define failures
  • fast failure detection and recovery (<20sec) through constant active and passive monitoring of virtual links
  • expressive policy routing
  • Data path: application --conduit API--> entry node --> RON routers --> exit node
  • link state, hop-by-hop routing
  • RON routers and RON membership managers communicate through RON network (versus IP)
  • RON headers have tag identifying flow, keeps flow to a chosen path
  • hysteresis to avoid route flapping
Results
  • outages for no longer than 20secs
  • maximal 30% packet loss rate
  • in most cases RON reduces latency
  • in some cases, RON can significantly improve TCP performance

Comments
  • The exact routing mechanism is unclear; it looks like hop-by-hop routing, but wouldn't source routing be better to optimize the application-specific metric, and optimize it only once at the source? (doesn't really matter because most of the time the chosen route has only one RON hop between the endpoints)
  • What would large-scale deployment of RON mean for the Internet? Will RON further inflate data paths? Will RON nodes deployed in two peering ASes turn these into ASes de facto transiting traffic?
  • How does RON's forwarding compare to a TCP split connection (split at a RON forwarder)?
  • The user experience does contradict the statement that 1/4 of the routes experience outages of half an hour in three days. Does this mean that the Internet is really partitioned into clients and servers with reliable routes between clients and servers, but actually less reliable communication between clients and clients? Or is the user experience due to the fact that websites deploy servers across multiple different networks and our queries in fact do not go through the "Inter-"net?

Monday, October 20, 2008

Internet Topology

Michalis Faloutsos, Petros Faloutsos, Christos Faloutsos, "On Power-Law Relationships of the Internet Topology"

Lun Li, David Alderson, Walter Willinger, John Doyle, "A First-Principles Approach to Understanding the Internet's Router-level Topology"

Introduction

Various measures have been developed to classify graphs (= networks), coarser than graph isomorphism, along with various algorithms to generate graphs with different characteristics. These have been applied in numerous areas like neuroscience, to understand the wiring of neurons, or the web as hypertext, where one particular eigenvector problem called PageRank had tremendous success as a search engine. Among them are:

Measures:
  • Diameter = longest shortest path between any two vertices, can be an unstable measure
  • Average path length = shortest path averaged between any pair of vertices
  • Shortest path length distribution
  • Clustering Coefficient, CC of a vertex is defined by taking the subgraph generated by the vertex and its neighbors and divide its number of edges by the number of edges of a complete graph. CC of a graph is the CC averaged over the vertices
  • degree distributions of vertices
  • eigenvalues of the wiring (adjacency) matrix (symmetric with nonnegative entries, so diagonalizable over R with real eigenvalues)
Typical graphs:
  • random graph (short average path length, low CC, binomial/Poisson degree distribution)
  • small-world (short average path length, high CC)
  • hierarchical networks
  • scale-free networks where the degree distribution follows a power law
Algorithms:
  • small-world network generation due to Watts/Strogatz observation that little rewiring of a highly clustered network drops the path length [Watts, Strogatz, "Collective dynamics of 'small-world' networks"]
  • scale-free networks can be achieved using degree-preserving rewiring: pick four vertices A, B, C, D with edges A-B and C-D. Drop edges A-B and C-D and create edges A-C and B-D. This procedure is ergodic.
On Power-Law Relationships of the Internet Topology

  • Given that there have been hundreds of papers that take some real-world graph, like the highway system or a friendship network, and show that it is a scale-free network, it came as little surprise to me that the degree distributions of the Internet router graph (as of 1995) and the AS graph (as measured three times in 1997-1998) are also well approximated as scale-free.
  • The paper presents the rank-degree distribution and the outdegree exponent as two power laws when in fact they are the same: the rank of a node is obtained by integrating the frequency of an outdegree over all outdegrees higher than the node's. But integrating and inverting functions both turn power laws into power laws.
  • P(h), the number of pairs of nodes within distance h hops, also follows a power law (hop-plot exponent)
  • The eigenvalues follow a power law (it is not clear to me what the eigenvalues of a random graph are, so I don't know how meaningful this is)
A First-Principles Approach to Understanding the Internet's Router-level Topology

  • shows that the degree distribution is an incomplete measure of a network: by rewiring, graphs with the same degree distribution can be obtained which have very different routing properties
  • stresses that low-level (physical) links are more important than virtual (IP, AS) links: they determine performance, reliability and robustness
  • generation of realistic graphs has to go beyond creating abstract graphs and consider technological constraints: edges have bandwidths, affordable routers only support a certain number of links, there is a trade-off between the number of links and the bandwidth per link, flow conservation. And economic considerations: node degree decreases and bandwidth increases from edge to core, few long-haul connections
  • new measures:

    • network performance: maximal (in alpha) total throughput where traffic is modeled as X_ij = alpha*x_i*x_j [Criticism: unrealistic given that most traffic goes from clients to websites]
    • network likelihood: give each edge a weight = the product of the degrees of its vertices; likelihood = total weight, so networks where edges run between high-degree nodes score higher (see the sketch below)
  • tested networks:
    • new: heuristically optimal (tree-like at edge, low-degree loose mesh at core) inspired by real networks
    • real networks
    • preferential attachment
    • random network with scale-free distribution
    • suboptimal networks
  • Results:
    • all networks have the same degree-distribution
    • but score differently with respect to new metric
    • performance decreases drastically with the likelihood
    • real networks have low likelihood and high performance
    • a "generic" graph with a given degree distribution has bad performance
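
A minimal sketch of the likelihood-style metric (Python; un-normalized, whereas the paper normalizes by the maximum over the degree sequence):

    from collections import Counter

    def likelihood(edges):
        """Weight each edge by the product of its endpoints' degrees and sum."""
        degree = Counter()
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        return sum(degree[u] * degree[v] for u, v in edges)

    # Two graphs with the same degree sequence (3, 3, 2, 2, 1, 1), related by
    # one degree-preserving rewiring step:
    hubs_adjacent = [("A", "B"), ("A", "C"), ("A", "E"), ("B", "D"), ("B", "F"), ("C", "D")]
    hubs_apart    = [("A", "D"), ("A", "C"), ("A", "E"), ("B", "D"), ("B", "F"), ("B", "C")]
    print(likelihood(hubs_adjacent), likelihood(hubs_apart))   # 31 vs 30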
This was a really great paper to read!!! It breaks down the assumption of how relevant the degree distribution really is, and then comes up with new measures which yield good results - results which actually correspond to real (physical) links and not abstracted networks like the IP layer.

Wednesday, October 8, 2008

XORs in the Air

Sachin Katti, Hariharan Rahul, Wenjun Hu, Dina Katabi, Muriel Medard, Jon Crowcroft, "XORs in The Air: Practical Wireless Network Coding"

Principles
  • use broadcast nature of wireless (as previous paper)
  • network coding (here: use XORing of packets, pad packets to make them equal length)
The key example to illustrate the potential benefit is in Figure 1. The caption summarizes it so well that I will only refer to it here. COPE sits between the IP and MAC layers.

Main techniques
  • opportunistic listening: overhear packets even if they are not destined for you, send reception reports to tell others which packets you received (broadcast, extra transmission or piggy-backed on another)
  • opportunistic coding: "maximize the number of native packets [i.e. not coded ones] delivered in a single transmission, while ensuring that each intended nexthop has enough information to decode its native packet [i.e. send A XOR B XOR C only if the nexthop already has two of the packets A, B, C so that it can decode]." The nexthop considered is that of the packet at the head of the queue, so that no delay is introduced (though potentially reordering); see the sketch after this list
  • learning neighbor state: leverage probabilistic knowledge for coding decision when you lack feedback from reception reports
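
A minimal sketch of the coding check and of XOR decoding (Python, not the authors' code; the packet/neighbor bookkeeping is an illustrative simplification):

    from functools import reduce

    def xor_packets(packets):
        """XOR byte strings together, padding to the longest length."""
        length = max(len(p) for p in packets)
        padded = [p.ljust(length, b"\x00") for p in packets]
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), padded)

    def can_code_together(packets, nexthop_has):
        """packets: list of dicts with 'id' and 'nexthop'. nexthop_has maps a
        nexthop to the set of packet ids it already overheard (from reception
        reports). Each nexthop must already have all packets except its own."""
        ids = {p["id"] for p in packets}
        return all(ids - {p["id"]} <= nexthop_has[p["nexthop"]] for p in packets)

    # Decoding at a nexthop that already overheard packet b:
    a, b = b"hello from A", b"hi from B"
    coded = xor_packets([a, b])
    assert xor_packets([coded, b])[:len(a)] == a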
Refinement
  • have separate queues for different packet lengths to avoid excessive padding
  • MAC-allocated bandwidth alters the dynamics beneficially
  • no delay: do not wait for coding opportunities before sending packet
  • never "code together packets heading for the same nexthop", therefore maintain virtual queues for each nexthop (and packet length)
  • reordering agent to fix reordering
  • details: see the pseudo code in the paper
  • pseudo-broadcast: uses MAC unicast to send and puts the other nodes into promiscuous mode; this enables the MAC's backoff and ACK mechanisms
Results
  • random UDP flows: 3-4x increase in throughput (even when congested)
  • TCP, gateway to Internet: 5% to 70% more throughput depending on upload/download ratio
  • hidden terminals are spoilers for TCP

ExOR

Sanjit Biswas, Robert Morris, "ExOR: Opportunistic Multi-Hop Routing for Wireless Networks"

Departure from traditional wireless routing
  • do not model wireless network as point-to-point links
  • use the broadcasting nature of wireless
  • Idea: to route packet from A to B, broadcast packet, see how far it went, i.e. which node N received it and is closest to B, repeat process from N
Example
  • Take sequence Src-A-B-C-D-E-Dst
  • It would be great if Src sends it, then the furthest station (which received the packet correctly) forwards it and so on
  • this works better because, at best, only the probabilities of which stations are likely to receive the packet are known in advance; the routing decision is deferred until the packet is sent and it is known which links worked and which didn't
  • each forwarding step is an independent random event
  • if there are parallel structures, there are independent chances of which link will succeed, hence the total chance of any of the links succeeding will be higher

Challenge
  • algorithm must find a node N or a set of nodes N which will forward the packets, all other nodes who received the packet will drop it
  • if this algorithm gives no such node N, the packet is lost; if the algorithm is likely to give more than one such node N, there will be a chain reaction flooding the network and wasting capacity: we need a robust algorithm giving exactly one node in almost all situations
  • algorithm is distributed and has to arrive at consensus
  • has to have little overhead
Proposed solution
  • collect packets together into a batch
  • each node maintains a routing table which has the path length (by some metric, e.g. ETX) from its neighbors to the destination
  • a node uses the table to determine and prioritize (by closeness to destination) a list of forwarders (small to keep cost to arrive at agreement low)
  • the highest-priority forwarder will broadcast all packets in the batch it received correctly, it will include the list of these packets as "batch map"
  • the next forwarders will send the packets which were not acknowledged through the "batch map" of higher-priority nodes (see the sketch after this list)
  • traditional routing as fall back mechanism for remaining 10% of packets
  • scheduler so that only one node sends at a time
  • link-state protocol to give input to the forwarder-list algorithm
  • implemented in Click toolkit
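
A minimal sketch of one batch round with prioritized forwarders and batch maps (Python; a toy model, not the authors' Click implementation):

    def exor_round(batch_size, forwarders, received):
        """forwarders: node names, highest priority (closest to dst) first.
        received: dict node -> set of packet ids that node heard from the source."""
        batch_map = set()                         # packets some forwarder already sent
        schedule = []
        for node in forwarders:                   # one node transmits at a time
            to_send = received[node] - batch_map
            schedule.append((node, sorted(to_send)))
            batch_map |= to_send                  # lower-priority nodes learn this map
        remaining = set(range(batch_size)) - batch_map
        return schedule, remaining                # remainder -> traditional routing

    schedule, remaining = exor_round(
        batch_size=10,
        forwarders=["C", "B", "A"],               # C is closest to the destination
        received={"C": {0, 3, 7}, "B": {0, 1, 3, 4, 7, 8}, "A": set(range(10))},
    )
    # C sends {0,3,7}; B adds {1,4,8}; A sends the rest {2,5,6,9}; nothing remains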
Results
  • end-to-end throughput boosted by around 2 (even higher for multi-hop routes)
  • but bad TCP performance (fixed by split connection)
Criticism
  • algorithmic overhead
  • promiscuous mode: extra energy?
  • there is no use in allowing only one station to send at a time if the stations are so far apart that no intermediate node can hear both; the scheduler should take this into account

Monday, October 6, 2008

Four Ad Hoc Routing Protocols

Josh Broch, David Maltz, David Johnson, Yih-Chun Hu, Jorjeta Jetcheva, "A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols"

Protocols
  • Destination-Sequenced Distance Vector (DSDV)
    • hop-by-hop protocol
    • sequence numbers to prevent aging of routes
    • each time step, every node sends a full dump of its routing table (destination, next hop, last sqn, metric) to its neighbors with route to itself attached as routing advertisement
    • routes are imported if they are newer (by sqn) or have better metric and not too old (? I did not really understand the WST fudge factor in this protocol ?)
  • Temporally Ordered Routing Algorithm (TORA)
    • hop-by-hop protocol
    • assigns height per node and per each destination
    • packets are routed "downhill" to destination
    • routes are requested through "QUERY" which initiates "UPDATES" which will set the heights
  • Dynamic Source Routing (DSR)
    • source-routing
    • route requests with unique ID flooded, intermediate nodes append own address to request re-broadcast, flooded route requests are repeated with increasing back-off time and TTL
    • destination replies to each route request with "route reply" containing the route sent back to originator of request along the reverse route
    • store overheard route replies in a link cache, run Dijkstra's algorithm for source routing, send a route request if no route is in the link cache
    • send route error back to origin of source-routed packet if 802.11 ACK fails, source has to recover
    • maintain blacklist for detected unidirectional links (per node)

  • Ad Hoc On-Demand Distance Vector (AODV)
    • hop-by-hop routing
    • combination of DSR and DSDV

Simulation
  • ns-2 network simulator extended to model MAC (RTS/CTS ACK) and physical layer (delay, capture, carrier sense, transmission power, antenna gain, receiver sensitivity) behavior of 802.11 and node mobility, ARP, packet buffering
  • ns is a discrete event simulator
Performance metrics
  • Goal: "measure the ability of [...] routing protocols to react to network topology change while continuing to [...] deliver data packets [...]"
  • 50 nodes in a 1500m x 300m space, 210 different scenario files, CBR sources
  • No TCP measurements (this would probably be a more relevant number as TCP is the most widely deployed, but maybe the TCP dynamics have to change to accommodate different link-layer characteristics)
Results
  • DSDV-SQ fails to converge for high mobility, has high loss rates then
  • TORA has largest overhead (by almost 10 for high mobility), DSDV-SQ has almost constant overhead independent of number of sources and degree of mobility
  • TORA has high loss rates and bad overhead when >30 sources
  • DSDV-SQ and DSR give slightly better paths

The ETX Path Metric for multi-hop wireless

Douglas S.J. De Couto, Daniel Aguayo, John Bicket, Robert Morris, "A High-Throughput Path Metric for Multi-Hop Wireless Routing"

Computation of ETX metric
  • total metric = sum over the links of the path
  • for one link, the metric is 1/(forward delivery ratio * reverse delivery ratio) (probability that the data packet was received, probability that the ACK was received); see the sketch after this list
  • estimation through 1 probe/sec broadcast by each node with jitter to avoid collision
  • gives expected value of total transmissions (with retransmissions) of packet along that path
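
A minimal sketch of the metric (Python):

    def link_etx(forward_delivery_ratio, reverse_delivery_ratio):
        # Expected number of transmissions (incl. retransmissions) until the
        # data packet gets through and its link-layer ACK comes back.
        return 1.0 / (forward_delivery_ratio * reverse_delivery_ratio)

    def path_etx(links):
        """links: list of (d_f, d_r) pairs along the path."""
        return sum(link_etx(df, dr) for df, dr in links)

    # A two-hop path over good links can beat a single lossy hop:
    print(path_etx([(0.9, 0.9), (0.9, 0.9)]))     # ~2.47 expected transmissions
    print(path_etx([(0.5, 0.6)]))                 # ~3.33 expected transmissions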
Objectives (motivated by measurements of throughput through different routes)
  • accounts for a wide range of link loss ratios
  • accounts for asymmetric loss ratios
  • accounts for interference between successive hops of multi-hop paths (one can't send until one has completely received)
  • tends to minimize spectrum use and energy, and thereby maximize overall system capacity
Results
  • Tested with two modified routing algorithms: DSDV and DSR in office environment
  • significantly higher throughput through multi-hop routes
  • the algorithm does not account for different loss ratios for different packet sizes
I was wondering whether (when there are several packets on their way) the network actually becomes a pipeline which can work (at least) partially in parallel, and therefore the ETX metric is not optimal for maximum throughput.
I also still don't understand why the graphs in Figure 6 cross: this clearly contradicts the fact that the best route should perform at least as well as any other route, so why does its graph suddenly lie above another route's?

Wednesday, October 1, 2008

Roof Network

John Bicket, Daniel Aguayo, Sanjit Biswas, Robert Morris, "Architecture and Evaluation of an Unplanned 802.11b Mesh Network"

I really liked that hands-on project. Ashima has written a nice summary for it on her blog.
I will only write about a couple of details I found interesting about the mechanisms:
  • standard 802.11b hardware with RTS/CTS disabled, really? What is the benefit of disabling RTS/CTS?
  • single-channel
  • self-configuring network on preinstalled Linux boxes
  • network is a mesh with many routing choices
  • routing protocol: each node maintains partial matrix of link metrics between pairs of nodes, each node uses Dijkstra's algorithm for source-routing (not scalable)
  • link metrics communicated through 1. packet header when forwarding, 2. flood queries when route to destination is missing, 3. nodes overhearing responses to flood queries
  • probing of link quality through periodic broadcasts
  • link metric given by adding inverses of throughput
  • overriding default selection of transmit bit-rates of wireless hardware for optimization
What I found interesting about the evaluation were the detailed figures on e.g. the relation between throughput and distance, especially the figures on how the network performance drops when nodes are removed.

I was missing a discussion on packet collisions which is relevant because of single-channel and omni-directional antennas.

Modeling Wireless links for Transport Protocols

Andrei Gurtov, Sally Floyd, "Modeling Wireless Links for Transport Protocols"

Interesting idea: cross-layer communication

Motivation
  • wrong assumptions about wireless links lead to bad designs of link-layer and transport-layer protocols with low performance (the paper shows examples like tight assumptions on inter-packet delays to distinguish between loss and congestion)
  • accurate evaluation of mechanisms like Forward Error Correction, link-level error recovery
  • discusses the tussle whether changes should be implemented in the link layer or the transport layer and which one should consider the dynamics of the other (e.g. implementing error recovery on both layers might lead to unnecessary retransmissions due to competing timeouts): there is interplay between the two layers
  • TCP assumes that packet loss means congestion
  • TCP Reno assumes that Dup ACKs mean packet loss (does not allow (any) reordering)
  • Metrics: goodput, also (for cell phones) power consumption
Types of wireless networks
  • wlan
  • wide-area cellular links (GSM, UMTS)
  • satellite links
Other aspects
  • network topology
  • shape of traffic (e.g. short Web requests versus Megabyte file sharing)
Link characteristics (for each one, the physical cause, the direct impact on the link layer and indirect impact on the transport layer is explained and a model is discussed)
  • error losses and corruption (FEC and limited number of retransmissions on the link layer might not hide them in really poor radio conditions, losses due to handovers between stations might not be recoverable on the link layer): modeled by probabilistic packet dropping, might have a certain distribution or state
  • delay variation (messes up delay-based congestion algorithms, rate-based protocols more affected than window based): model delay spikes by suspending data transmission in simulation
  • on-demand resource allocation (a radio frequency is only allocated after a link layer buffer has passed a threshold or timeout, channel allocation delay causes TCP timeouts): modeled by additional delay at packet arrival to queue
  • bandwidth variation (timeouts, congestion and underutilization)
  • asymmetry in bandwidth (congested TCP ACKs), example satellite 10-100x difference
  • Queue Management (used to accommodate bursty traffic, can cause high queueing delay and poor loss recovery (high RTT)): simulate tail-drop or RED
  • effects of mobility: deeper understanding of inter-system handover necessary
Impact
  • currently: the link layer is designed to deal with TCP's bad performance under packet reordering or loss (TCP could be changed), and TCP is designed to work over existing link layers (not necessarily having explicit loss notification)
  • explicit loss notification instead of timely link-layer recovery
  • buffers to fix reordering or make TCP robust against reordering
  • reduce delays on link layer or make TCP more robust
  • cross-layer communication

Monday, September 29, 2008

Improving TCP Performance over Wireless Links

Hari Balakrishnan, Venkata Padmanabhan, Srinivasan Seshan, Randy Katz, "A Comparison of Mechanisms for Improving TCP Performance over Wireless Links"

In TCP, Packet loss = congestion, so lossy links like wireless make TCP perform badly.
The paper does not introduce new ideas how to improve this, but does a great job in reviewing, systematizing and comparing the approaches which are listed in Table 1.

Two approaches:
  • Hide losses from TCP sender, local problem solved locally, higher quality link with reduced effective bandwidth
  • make sender aware of lossy links
I found the snoop protocol interesting: it works on the link layer but is TCP-aware and, for example, suppresses non-congestion-related duplicate ACKs (see the sketch below).
Also interesting: there can be retransmission timers in both the TCP layer and the link layer that are redundant and trigger unnecessary retransmissions.
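
Roughly how I picture a snoop-style agent at the base station (this is my own toy rendition, not the protocol from the paper): it caches unacknowledged TCP segments flowing towards the mobile host, retransmits them locally when it sees duplicate ACKs, and suppresses those duplicate ACKs so the sender's congestion control is never triggered.

```python
class SnoopAgent:
    """Toy model of a TCP-aware link-layer agent at a base station
    (my own simplification: one flow, integer sequence numbers)."""

    def __init__(self):
        self.cache = {}      # seq -> payload of segments not yet ACKed
        self.last_ack = -1   # highest ACK seen from the mobile host

    def from_sender(self, seq, payload, send_wireless):
        self.cache[seq] = payload            # cache, then forward over the air
        send_wireless(seq, payload)

    def from_receiver(self, ack, send_wired, send_wireless):
        if ack > self.last_ack:
            # new ACK: clean the cache and pass the ACK on to the fixed sender
            for seq in [s for s in self.cache if s <= ack]:
                del self.cache[seq]
            self.last_ack = ack
            send_wired(ack)
        else:
            # duplicate ACK: treat it as a wireless loss, retransmit locally
            # and suppress the dup ACK so the sender never backs off
            missing = ack + 1
            if missing in self.cache:
                send_wireless(missing, self.cache[missing])

agent = SnoopAgent()
events = []
wl = lambda seq, payload: events.append(("wireless", seq))
wd = lambda ack: events.append(("wired_ack", ack))
agent.from_sender(1, b"a", wl)
agent.from_sender(2, b"b", wl)
agent.from_receiver(1, wd, wl)   # new ACK: forwarded to the sender
agent.from_receiver(1, wd, wl)   # dup ACK: local retransmit of seq 2, suppressed
print(events)
```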

The conclusion
  • a reliable link-layer protocol with TCP awareness (e.g. in-order packet delivery) improves performance greatly
  • split-connection approaches can be avoided without losing performance
  • TCP mechanisms like SACK help to recover quicker from multiple packet losses
  • explicit loss notification improves performance

Media Access Protocol for Wireless

Vaduvur Bharghavan, Alan Demers, Scott Shenker, Lixia Zhang, "MACAW: A Media Access Protocol for Wireless LAN's"

Assumptions
  • single channel
  • no multiple-path effects
  • two stations are either in range or have no mutual interference
  • symmetry: if A receives B, B receives A
  • very small delay time (short range)
Possible approaches
  • carrier sense (sensing happens at the sender, but congestion happens at the receiver; the whole paper is in a sense "receiver-driven", with the receiver using the CTS and RRTS signals to drive the senders)
  • token-based
  • multiple access collision avoidance (taken here)
Basic Idea
  • Request-To-Send Clear-To-Send Handshake
Refinements
  • backoff algorithm: stations share congestion (backoff) state, later decoupled again into per-stream state (see the backoff sketch after this list)
  • separate queues for separate streams
  • ACKnowledgements (link-layer recovery is faster than TCP's), a Data-Sending packet (so that other stations defer instead of sending an RTS), and a Request-for-Request-To-Send
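
As I read the paper, the backoff refinement replaces binary exponential backoff with a gentler multiplicative-increase, linear-decrease rule; here is a minimal sketch of that idea, with the exact factor and bounds treated as illustrative rather than authoritative.

```python
# Sketch of a MILD-style backoff counter (multiplicative increase on
# collision, linear decrease on success), as I understand MACAW's refinement.
# The factor 1.5 and the bounds are illustrative.
BO_MIN, BO_MAX = 2, 64

def on_collision(backoff: float) -> float:
    return min(backoff * 1.5, BO_MAX)

def on_success(backoff: float) -> float:
    return max(backoff - 1, BO_MIN)

bo = float(BO_MIN)
for event in ["collision", "collision", "success", "success", "collision"]:
    bo = on_collision(bo) if event == "collision" else on_success(bo)
    print(event, "->", bo)
```

Compared to binary exponential backoff, the state changes slowly, which is also what makes copying the backoff value between stations reasonable.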
A collision happens when two stations send simultaneously, and collisions can also happen between RTS packets. So how does the RTS-CTS handshake bring an improvement? An RTS packet is short (30 bytes) compared to a data packet (512 bytes), so a collision between RTSs wastes only about 30/512 ≈ 6% of the time a data-packet collision would.

As far as I understand, the rule for when to send an RTS is to wait until the current data transmission is finished plus a random time whose distribution depends on the congestion state. Does this really deserve the term "collision avoidance"? Could there be a better mechanism that collaboratively develops a schedule of when which station sends to which other station?
There is also no explicit communication among the stations about which station is in range of which other station; a graph of ranges and crosstalk could be constructed, which could potentially give a better schedule for data transmissions. Instead the algorithm is fairly ad hoc. Is this a weakness or a strength of the algorithm, because it is simpler, more flexible and more robust?

Monday, September 22, 2008

Routers with two switched backplanes and optics

Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu, David Miller, Mark Horowitz, Olav Solgaard, Nick McKeown, "Scaling Internet Routers Using Optics"

This paper blew my mind: 100 Terabit/s.
  • When I read the other router paper, I noticed that there are two things about a router we might want to increase: the number of ports N, and the per-port/total throughput. The other paper still has a centralized structure: the scheduler. Luckily, because whole cells are transferred through the crossbar before reconfiguration, the scheduler is given a relatively large amount of time to come up with a scheduling decision, but I don't see how the scheduler can be implemented with a running time less than O(N^2), so that router would not really scale in N.
  • This paper actually analyzes how the components scale with N and overcomes the centralized scheduler by introducing a "load balancer" in front of the VOQs (see previous post): it has two crossbars in total (though they can be replaced by one at twice the speed), and each crossbar follows a fixed scheme which cyclically connects every input with every output, as shown in Figure 2 (a tiny sketch of such a cyclic schedule follows after this list). They show that this achieves 100% throughput. I think memory scales with N^2, but I am not sure.
  • The paper next analyzes how to use an optical backplane. The suggested switched backplane is a two-tier hierarchy of optical switches. I lost track when what optical switch is doing what to which frequency in WDM.
  • The paper discusses how to use this scheme so that some line cards can be removed (the VOQs are distributed across line cards, so if a line card is missing or has failed, traffic needs to be rerouted around it)
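
The fixed cyclic schedule can be written down in a few lines. This is my own toy sketch of such a deterministic round-robin permutation (input i connected to output (i+t) mod N in slot t), not code from the paper:

```python
N = 4  # number of linecards (illustrative)

def output_for(t: int, i: int) -> int:
    """Output that input i is connected to in time slot t under a fixed,
    cyclically shifting permutation -- no scheduling decision is needed."""
    return (i + t) % N

for t in range(N):
    print("slot", t, [output_for(t, i) for i in range(N)])
# Over N consecutive slots every input meets every output exactly once,
# which is why the centralized scheduler can be dropped.
```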

Switched backplane with centralized scheduler

Nick McKeown, "Fast Switched Backplane for a Gigabit Switched Router"

Basic Observations:
  • A router has to forward traffic from every port to every other port
  • The total traffic is the product of the traffic at one port times the number of ports N
  • A structure where all traffic goes through one central element fails to scale, because that element would have to be so much faster than a single port
  • An N-to-N connection scheme can be used so that the speed needed at each crosspoint does not exceed the speed of a port
  • Because the number of crosspoints is O(N^2), they have to be cheap, i.e. they cannot make routing decisions and should only be on-off switches driven by a scheduling scheme; this structure is called a "crossbar"
  • The scheduler closes N of the N^2 crosspoints at a time (a matching of inputs to outputs), so we get maximal throughput
Basic concepts:
  • 1 Line card per port, backplane connecting all line cards
  • Components: forwarding decision (next-hop MAC, TTL decrement, IP checksum), backplane (forward to output linecard, queue or drop packets), output link scheduling (FIFO or QoS scheduling)
  • dataplane (per packet) versus control plane
  • specific hardware (ASIC) for dataplane
  • parallelism
  • shared versus switched backplane
  • fixed packet length through splitting into fixed size chunks (cells, around 64Bytes)
Problems with switched backplanes:
  • head-of-line blocking (a packet in a line card has to wait until the crossbar has switched to the right output) is solved by queues (one per output) between line card and crossbar, called virtual output queues (VOQs)
  • input and output blocking (e.g. several line cards want to use the same output), cause delay, need prioritization or speed-up for QoS
Scheduling algorithm:
  • Goals: high throughput, starvation free, fast, simple to implement
  • iSLIP is an iterative 3-step algorithm; underlying it is a fixed round-robin priority scheme, and the algorithm matches inputs to outputs so as to fulfill the highest-priority requests (a simplified sketch follows this list)
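
Here is a much-simplified, single-iteration sketch of an iSLIP-style round-robin match, just to convey the request/grant/accept structure; the real algorithm iterates and only updates the pointers on grants accepted in the first iteration, so treat this as my own toy version.

```python
N = 4
grant_ptr = [0] * N    # one round-robin pointer per output
accept_ptr = [0] * N   # one round-robin pointer per input

def islip_round(requests):
    """requests[i][j] is True if input i has a cell queued (in its VOQ) for
    output j. Returns a dict mapping matched inputs to outputs."""
    # Grant: each output picks, round-robin from its pointer, one requesting input.
    grants = {}  # output -> input
    for j in range(N):
        for k in range(N):
            i = (grant_ptr[j] + k) % N
            if requests[i][j]:
                grants[j] = i
                break
    # Accept: each input picks, round-robin from its pointer, one granting output.
    match = {}
    for i in range(N):
        offers = [j for j, g in grants.items() if g == i]
        for k in range(N):
            j = (accept_ptr[i] + k) % N
            if j in offers:
                match[i] = j
                accept_ptr[i] = (j + 1) % N   # advance past the matched pair
                grant_ptr[j] = (i + 1) % N
                break
    return match

reqs = [[True, True, False, False],
        [True, False, True, False],
        [False, False, False, True],
        [False, True, False, False]]
print(islip_round(reqs))
```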
Other features:
  • Unicast and multicast support
  • QoS
Scale (as of 1997):
  • the Cisco 12000 had 16 ports of 2.4 Gb/s each
  • maximal shared-bus capacity: 20 Gb/s
  • a forwarding decision (IP prefix lookup, ...) can be done in 2 ns
  • scheduler reconfigures crossbar every 40ns

Gigabit- and Terrabit-Router Questions

Here are a bunch of questions which I don't really know the answer to, but would like to know to understand the next two readings better.
  • What is the maximal and economical switching rate of a current transistor? (it might be cheaper to produce several slower transistors that together have the same computing power)
    Since modern processors run at around 3 GHz and a single clock cycle passes through a chain of transistors, I guess the answer is around 30-300 GHz (about 1/1000 of the frequency of visible light). This is a question about the fundamental limit on how much information can be processed on a single electrical wire; to process more, you need parallelism.
  • Since light passing through a fibre is an electromagnetic wave, it can (in theory) be modulated in the same ways radio waves are: amplitude modulation/amplitude-shift keying, frequency modulation/frequency-shift keying, phase modulation/phase-shift keying, combinations of amplitude and phase modulation, OFDM. Which of these modulations are used in practice?
    I know that by detuning lasers one could probably achieve frequency modulation and demodulate it with a phase-locked or frequency-locked loop, but I guess most current systems only use on-off/amplitude modulation on different wavelengths (WDM), because you just switch the laser, or a filter in front of it, on and off.
  • If electronics can only operate at a certain maximum frequency (see question 1), how can a single fibre transport more information than that?
    I guess the only way, implemented right now, is through WDM.

Sunday, September 21, 2008

High-performance routers

I am still reading the papers, but if somebody is already looking for other blog posts to get an idea, here is my comment: read the "Fast Switched Backplane for a Gb/s router" first!

Wednesday, September 17, 2008

Real Time Applications in packet switched networks

David Clark, Scott Shenker, Lixia Zhang, "Supporting Real-Time Applications in an Integrated Services Packet Network: Architecture and Mechanism"

Goal
  • packet switched network which supports real-time applications (ISPN)
    applications are divided into adaptive versus rigid, tolerant versus intolerant
    (ISDN is mentioned as an intermediate form of digital network, used for telephony)
  • preschedule services, packet delay bound known a priori
  • support of guaranteed and predicted service (statistical, dynamic)
    (Motivating observation: video applications can drop frames, audio conversations can introduce small silence in case of increasing delay times, ...)
Architecture
  • a service interface that communicates the quality of service the network can deliver to the end host, and the end host's traffic requirements to the routers for resource allocation
  • packet scheduling algorithms meeting service commitments, algorithm supported by additional scheduling information in packet header

Guaranteed traffic is characterized by a token bucket filter (a small sketch follows below)
Techniques: FIFO, Weighted Fair Queueing, ...
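
Since the token bucket is doing most of the work in the traffic characterization, here is my own minimal sketch of one (rate and depth values are illustrative): a flow conforms as long as it never sends more than rate*t + depth bytes in any interval of length t.

```python
import time

class TokenBucket:
    """Minimal token-bucket filter; values are illustrative."""

    def __init__(self, rate_bytes_per_s: float, depth_bytes: float):
        self.rate = rate_bytes_per_s
        self.depth = depth_bytes
        self.tokens = depth_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        # refill tokens at the configured rate, capped at the bucket depth
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False  # non-conforming: delay, drop, or degrade to best effort

bucket = TokenBucket(rate_bytes_per_s=125_000, depth_bytes=10_000)  # ~1 Mb/s
print(bucket.allow(1500))
```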

Assumption that flows are well-behaved
No discussion on gradual deployment

I got lost in the middle of the paper in technical details.

The future of the Internet as of 1995

Scott Shenker, "Fundamental Design Issues for the Future Internet"

I was amused by the beginning of the article: 3000000 hosts and astonishment that a newspaper has an online column...

The goal of the paper is to build a framework to analyze future techniques deployed on the Internet to solve problems arising from "real-time" applications.

Current (as of 1995):
  • best-effort, no admissions control, no assurance about delay or delivery
  • killer apps: traditional data uses (telnet, FTP, HTTP, DNS, SMTP)
  • apps are elastic, i.e. tolerant to delay or loss, gracefully degrade in these cases
  • apps can change transmission rates in case of congestion
Future:
  • "Real-time" apps: video & audio
  • apps not elastic, bad performance in case of variation of delays (jitter) or loss
  • apps without congestion control
  • "Realt-time" apps interfere badly with traditional data uses, no fairness
Possible Solutions:
  • changing the router implementation, e.g. FQ
  • changing application implementation, e.g. delay adaptive techniques
  • introduce "Qualitiy of Service": implicit (router analyzes traffic patterns) or explicit (in header)
  • Admissions Control
  • Overprovisioning
Metrics:
  • service delivered to application i encoded in vector s_i
  • U_i(s_i) gives the performance of application i, e.g. U_i for FTP would depend smoothly and mostly linearly on bandwidth, but for a video conference it would degrade sharply when there is not enough bandwidth
  • the total performance V = sum_i U_i(s_i) is to be maximized (see the toy example after this list)
  • V does not account for fairness!
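
To make the V = sum U_i(s_i) bookkeeping concrete, here is a toy example of my own, with an elastic (concave) utility for a file transfer and a steep, inelastic one for a video call; the utility functions themselves are invented for illustration, not taken from the paper.

```python
import math

# Illustrative utility functions: an elastic application degrades gracefully,
# an inelastic one falls off a cliff below the bandwidth it needs.
def u_elastic(bw_mbps: float) -> float:
    return math.log(1 + bw_mbps)            # diminishing returns

def u_video(bw_mbps: float, needed: float = 2.0) -> float:
    return 1.0 if bw_mbps >= needed else 0.2 * bw_mbps / needed

def total_performance(allocations):
    """V = sum of U_i(s_i) over all applications."""
    return sum(u(bw) for u, bw in allocations)

# Splitting a 3 Mb/s link: giving the video call the 2 Mb/s it needs yields a
# higher V than the equal split, even though it is less "fair".
print(total_performance([(u_elastic, 1.5), (u_video, 1.5)]))
print(total_performance([(u_elastic, 1.0), (u_video, 2.0)]))
```

Which is exactly the point of the last bullet: maximizing V by itself says nothing about fairness.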
The possible solutions are discussed in light of these metrics through thought experiments ("Gedankenexperimente").

Monday, September 15, 2008

Congestion Control for High Bandwidth-Delay Product Networks

Dina Katabi, Mark Handley, Charlie Rohrs, "Congestion Control for High Bandwidth-Delay Product Networks"

Criticism of current TCP:
  • oscillatory and unstable
  • AIMD limits the sender to acquiring only one more packet's worth of bandwidth per RTT
  • biased against long-RTT flows (e.g. satellite connections need ad hoc methods like proxies)
  • congestion is signaled only implicitly through packet loss: no distinction between error loss and congestion loss, and it takes a long time until a packet loss is detected
Methods:
  • Arguments from Control Theory (though more often cited than used ;-)
  • Simulation
  • Rethinking the transport protocol without requiring backward compatibility
Features:
  • no per-flow state
  • separation of utilization/efficiency and fairness control
  • detects misbehaving sources
  • provides incentive to both: users & ISPs
  • gradual deployment schemes discussed
Idea:
  • new protocol supersedes TCP: eXplicit Control Protocol
  • XCP congestion header informs sender about degree of congestion rather than having only 1-bit explicit congestion notification or packet loss
  • the XCP congestion header includes the congestion window size and RTT (set by the sender); gateways can modify its feedback field
  • receiver sends XCP header back to sender, sender uses feedback to set new congestion window
  • if a gateway sees spare bandwidth or overutilization, it picks a few packets at random and sets their feedback fields to compensate (not necessarily in a fair manner): efficiency control (see the sketch after this list)
  • pairs of flows are picked and their feedback fields are set so as to take some bandwidth from the higher-bandwidth flow and give it to the lower-bandwidth flow; this eventually converges to fairness: fairness control
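
As I understand the efficiency controller, a gateway computes, once per control interval, an aggregate feedback from the spare bandwidth and the persistent queue, roughly phi = alpha*d*S - beta*Q (with alpha ~ 0.4 and beta ~ 0.226 reported in the paper), and then divides phi over the feedback fields of randomly sampled packets. A rough sketch of the aggregate computation, with my own variable names and illustrative numbers:

```python
# Rough sketch of an XCP-style aggregate feedback computation at a gateway.
# The constants are the ones reported in the paper; everything else here
# (names, example numbers) is my own illustration.
ALPHA = 0.4
BETA = 0.226

def aggregate_feedback(capacity_Bps: float,
                       input_rate_Bps: float,
                       persistent_queue_bytes: float,
                       avg_rtt_s: float) -> float:
    """Bytes of extra traffic the gateway asks for per control interval.
    Positive: senders may speed up; negative: senders should slow down."""
    spare_Bps = capacity_Bps - input_rate_Bps
    return ALPHA * avg_rtt_s * spare_Bps - BETA * persistent_queue_bytes

# Example: a 10 Mb/s link that is 20% idle, a small standing queue, 100 ms RTT.
phi = aggregate_feedback(capacity_Bps=10e6 / 8, input_rate_Bps=8e6 / 8,
                         persistent_queue_bytes=5_000, avg_rtt_s=0.1)
print(phi)  # positive -> distribute positive feedback over sampled packets
```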

Random Early Detection Gateways for Congestion Avoidance

Sally Floyd and Van Jacobson, "Random Early Detection Gateways for Congestion Avoidance"

The proposed algorithm is implemented in the gateway (see 16. of the previous post) and uses FIFO scheduling together with a random drop policy. The gateway can signal congestion by dropping or marking packets, and does so before the queue overflows (see the previous post for motivation). It does so in a manner where the packet drop probability depends continuously on the average queue size (see 19. of the previous post) and where the drop probability is proportional to the bandwidth of the corresponding flow, to achieve fairness.

The design goals were:
  • cooperate with current transport protocol implementations and gateways
  • congestion avoidance: avoid long queues/delays and low throughput
  • avoid global synchronization (see 6. of the previous post)
  • no negative bias against bursty traffic (see 5.)
  • fair bandwidth allocation (though not complete isolation from misbehaving senders)
  • no per-flow state
The algorithm has several parts:
  • Computing the average queue size (through an exponentially weighted low-pass filter: this distinguishes a persistent queue from bursty traffic)
  • Using the average queue size to determine the drop/congestion-notification probability; this is a piecewise linear function.
  • Dropping a packet with that probability, with a little math trick: if no packet has been dropped for a long time (relative to the drop probability), bias the probability upward so that drops are spread more uniformly over time. Since the packet is chosen randomly, a flow with higher bandwidth has a higher probability of being hit (see the sketch below).
Since packets might have very different sizes, the probability might be multiplied by the packet size to ensure fairness.
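
Condensed into code, the per-packet decision looks roughly like this; this is my own sketch, and the parameter values are only illustrative (the paper spends considerable space on choosing them well).

```python
import random

# Sketch of the RED drop decision; parameters are illustrative.
W_Q = 0.002              # weight of the exponential low-pass filter
MIN_TH, MAX_TH = 5, 15   # thresholds on the *average* queue size (packets)
MAX_P = 0.02             # drop probability as the average approaches MAX_TH

avg = 0.0                # filtered (persistent) queue size
count = -1               # packets since the last drop/mark

def on_packet_arrival(queue_len: int, rng=random) -> bool:
    """Return True if the arriving packet should be dropped (or marked)."""
    global avg, count
    avg = (1 - W_Q) * avg + W_Q * queue_len   # persistent queue, not bursts
    if avg < MIN_TH:
        count = -1
        return False
    if avg >= MAX_TH:
        count = -1
        return True
    count += 1
    p_b = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH)
    # bias by the time since the last drop so drops are spread more uniformly
    p_a = p_b / max(1e-9, 1 - count * p_b)
    if rng.random() < p_a:
        count = 0
        return True
    return False

# (To account for packet sizes, p_b could additionally be scaled by
# packet_size / max_packet_size, as mentioned above.)
```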

The paper discusses good ranges for the various parameters and shows simulations.

Thoughts on Queues and Congestion Avoidance


  1. Ideally, each node knows its bandwidth share and sends exactly at that rate.
  2. Ideally, the nodes would send their packets in a round-robin fashion, i.e. the packets arrive at the bottleneck gateway in equal time intervals, at the same rate at which they go through the bottleneck link. E.g. if a packet needs 1 ms for transmission on the bottleneck, node 1 sends now, node 2 sends 1 ms later, node 3 2 ms later, ..., node n n-1 ms later, and node 1 sends its next packet n ms later.
  3. Provided 1 and 2, no queue would be necessary.
  4. A congestion protocol should try to maintain 1 and 2 as well as possible; that way no queue is necessary (by 3) and delays stay low.
  5. Given 4, the function of a queue is to absorb bursty traffic.
  6. The opposite of 2 is global synchronization: all nodes send their packets at the same time. The queue will be long and delays will be long.
  7. 2 is hard to accomplish; the best situation to hope for is that each node sends its packet at a random time, so that the random distribution has relatively small clusters which can be buffered in the queue.
  8. There are two quantities to control: the queue size and the arrival rate of packets at a gateway. The first is to be kept low (to have low delay and be able to deal with bursts), the second is to be kept at the output link rate of the gateway.
  9. The queue size is the integral over time of the difference between arrival rate and output rate.
  10. The necessary reactions to a too-large queue size and a too-high arrival rate have to be different: the first requires that the senders pause sending for a short while, the second requires that the senders change their sending rate. This is similar to driving behind a car on a highway: when you see that the car is slower than you, you reduce your target speed until it matches the car's speed; when you see that the car is too close, you keep the same target speed and just briefly slow down below it until there is sufficient distance.
  11. Because traffic demands change, 1 can only be achieved through feedback loops.
  12. The success of the feedback loop (in terms of efficient use of bandwidth, low delay and fairness) depends on how soon and how precisely the senders can sense a congestion situation.
  13. The original Jacobson paper on congestion avoidance assumes that the only way for the gateway to signal congestion is to drop packets when the queue is full. It is even unclear what this signals with respect to point 10. A too-high arrival rate is signaled only indirectly, because it will eventually cause a too-large queue.
  14. In the original Jacobson paper (Van Jacobson, Michael J. Karels, "Congestion Avoidance and Control"), congestion is sensed through packet loss, which happens "late".
  15. "In the absence of explicit feedback from the gateway, transport-layer protocols could infer congestion from ... changes in end-to-end delay [or] packet drops [...]. Nevertheless, the view of an individual connection is limited by the timescales of the connection, the lack of knowledge of the number of congested gateways, the possibilities of routing changes, as well as by other difficulties in distinguishing propagation delay from persistent queueing delay." (Sally Floyd, Van Jacobson, "Random Early Detection Gateways for Congestion Avoidance")
  16. Because of 15, gateways need to give feedback to the nodes. "The gateway can reliably distinguish between propagation delay and persistent queueing delay." (Sally Floyd, Van Jacobson, "Random Early Detection Gateways for Congestion Avoidance")
  17. Because of 12, the situation gets worse with a growing throughput-delay product.
  18. It is better to buffer a packet at the sender than in a queue of a gateway inside the network, to keep delays low.
  19. The oscillatory behavior of a congestion avoidance algorithm will be stronger if the feedback times are long and if the actions of the algorithm are discontinuous in its input.
  20. The difference in necessary behavior in 10 is also characterized by time scales: "Using tools from control theory, we conjecture that congestion feedback based on rate-mismatch should be inversely proportional to delay, and feedback based on queue-mismatch should be inversely proportional to the square of the delay." (Dina Katabi, Mark Handley, Charlie Rohrs, "Congestion Control for High Bandwidth-Delay Product Networks")
  21. The optimal size of the congestion window is the maximal amount of traffic on the network links (not in the queues) when using the fair share of bandwidth, i.e. bandwidth times transit time (not including queueing delays); see the small example below.
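
Point 21 in numbers, with example values of my own choosing:

```python
# Bandwidth-delay product example for point 21 (illustrative numbers):
fair_share_bps = 10e6   # 10 Mb/s fair share of the bottleneck
transit_rtt_s = 0.100   # 100 ms of pure transit time, no queueing delay
bdp_bytes = fair_share_bps / 8 * transit_rtt_s
print(bdp_bytes)        # 125000 bytes, i.e. about 83 segments of 1500 bytes
```

Anything beyond that window only sits in gateway queues and adds delay (point 18).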

Thursday, September 11, 2008

Fair Queueing Algorithm - Drawings

Alan Demers, Srinivasan Keshav, "Analysis and Simulation of a Fair Scheduling Algorithm"

Here are my drawings for the Fair Scheduling algorithm, see previous post. Also available as pdf

Wednesday, September 10, 2008

Fair Queueing Algorithms

Alan Demers, Srinivasan Keshav, Scott Shenker, "Analysis and Simulation of a Fair Queuing Algorithm"
Ion Stoica, Scott Shenker, Hui Zhang, "Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks"

FQ Algorithm

There are two "points of implementation" for congestion control: source and gateway. The paper is looking at the gateway and is suggestion a Queuing algorithm. Whereas a Queueing algorithm is not changing the total traffic, so not directly avoiding congestion, it "affects the collective behavior of flow control algorithms".

The algorithm has a queue per flow (per-flow state).

The basic algorithm is best explained by a sequence of drawings I made and forgot to scan; I found that the paper's notation obfuscates the basic idea. A rough code sketch of the finish-number idea follows below.
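
In lieu of the drawings, here is my own toy sketch of that idea: pretend the gateway serves all active flows one bit (or byte) at a time, give each packet the "round number" at which it would finish under that discipline, and always send the packet with the smallest finish number. The round number is tracked only very crudely here; the real algorithm is more careful.

```python
import heapq

class ToyFairQueue:
    """Toy sketch of fair queueing via finish numbers (my own simplification;
    packet lengths are in arbitrary units, the round number is approximate)."""

    def __init__(self):
        self.last_finish = {}   # flow id -> finish number of its last packet
        self.heap = []          # (finish number, seq, flow, length)
        self.round = 0.0
        self.seq = 0            # tie-breaker for equal finish numbers

    def enqueue(self, flow, length):
        start = max(self.last_finish.get(flow, 0.0), self.round)
        finish = start + length
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, self.seq, flow, length))
        self.seq += 1

    def dequeue(self):
        finish, _, flow, length = heapq.heappop(self.heap)
        self.round = max(self.round, finish)   # crude round-number update
        return flow, length

fq = ToyFairQueue()
for flow, length in [("ftp", 1000), ("ftp", 1000), ("telnet", 40), ("ftp", 1000)]:
    fq.enqueue(flow, length)
while fq.heap:
    print(fq.dequeue())   # the short telnet packet jumps ahead of the FTP backlog
```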

I found the quantities used when evaluating the algorithm interesting: "bandwidth", "promptness" and "buffer space" are supposed to be nearly independent.
The design requirements were:
- has to work in a heterogenous environment with different flow control algorithms
- fairness (using the max-min definition: a common limit is introduced for everybody so that the sum of all limited bandwidth demands equals the full bandwidth) despite misbehaving sources
- good delay-throughput curve for each flow

The performance was evaluated through computation and simulation. The basic setup had a few FTP-like and telnet-like flows. Simulations were run on different network topologies. Flows which use less than their share of bandwidth are rewarded with low delay.

The paper was great: it contained a lot of ideas about the environment the algorithm has to work in, theoretical concepts about which conditions an algorithm has to fulfill to provide stable flow control (e.g. continuity), definitions of fairness and other desired criteria, metrics for performance, analysis of performance and simulation. I don't feel I can write a summary that does the paper justice.

Core-stateless FQ algorithm

In the previous paper, a gateway routing traffic has to maintain per-flow state. For a gateway routing thousands or millions of different flows, this might be prohibitively expensive. The paper therefore divides routers into two categories: core and edge. Edge routers maintain per-flow state and write the rate of the respective flow into the packet headers. Core routers only compute global statistics and drop packets probabilistically to achieve a fair bandwidth allocation: if a core router sees a packet whose header indicates that the flow's arrival rate is beyond its fair share of bandwidth, the packet is dropped with exactly the probability needed to push the flow back to its fair share (see the sketch below). The fair share of bandwidth is the same for all flows and is estimated by raising it until the bandwidth is fully utilized and lowering it in case of congestion.
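
The core-router drop decision can be summarized in one formula: drop with probability max(0, 1 - fair_rate / labeled_rate). A small sketch of my own:

```python
import random

def forward(labeled_rate: float, fair_rate: float, rng=random) -> bool:
    """Core-router decision in core-stateless fair queueing (my condensation):
    the edge router labeled the packet with its flow's arrival rate; drop with
    probability max(0, 1 - fair_rate / labeled_rate)."""
    drop_prob = max(0.0, 1.0 - fair_rate / labeled_rate)
    return rng.random() >= drop_prob

# A flow sending at twice the fair share loses about half of its packets,
# so its rate through the core is pushed back down to the fair share.
kept = sum(forward(labeled_rate=2.0, fair_rate=1.0) for _ in range(10_000))
print(kept / 10_000)   # ~0.5
```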

I imagine that the network might be an ISP providing Internet access to PCs at home, an edge router is the router connecting such a PC to the ISP and is responsible for calculating the flow rate of that customer. The core routers are the internal routers of the ISP and are responsible for maintaining fair use of bandwidth.

Friday, September 5, 2008

Congestion avoidance

Van Jacobson, Michael J. Karels, "Congestion Avoidance and Control"
Dah-Ming Chiu, Raj Jain, "Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks"

Underlying both papers is the observation that when a link is overutilized there is a very drastic degradation and the throughput can drop suddenly by several orders of magnitude. This can happen when several links converge to one such that the sum of the bandwidths is larger than the bandwidth of the link they converge to. The buffer at the respective router overflows, and it is by pure chance which packets are dropped or make it onto the buffer. ACK packets are dropped, forcing retransmissions, making more traffic. This phenomenon is called congestion.

Both papers are trying to use close to the full bandwidth of links and at the same time avoid congestion. Both papers work at the TCP level on the end-points of a connection. These end-points perceive whether a link is congested and act by changing the amount of data they send through different mechanisms. Both papers assume that the end-points behave cooperatively to avoid congestion and converge to fair resource usage.
The "Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks" makes simplyfing assumptions to make a precise mathematical model of the abstracted network.
Some of these assumptions are:
- time is discretized into time steps
- the endpoint perceives that a link is congested through a simple, reliable binary signal sent by the router within one time step
- a host simply sends x_i(t) amount of data per time unit, where x_i(t) is the main subject of discussion of the paper
The paper analyzes a linear model, i.e. x_i(t) is one of two affine functions of x_i(t-1), depending on whether the link was congested in the previous step. It then shows for which pairs of functions the model converges to fairness and optimal resource usage. The simplest combination that works is additive increase, multiplicative decrease: add a constant amount while the link is underutilized, and multiply by a factor smaller than 1 when it is congested (a small simulation sketch of this dynamic appears at the end of this post).

"Congestion Avoidance and Control" implements seven changes to TCP. They mostly come from the principle of "conservation of packets": the number of packets traveling through the network from one machine to another should stay constant, i.e. a new packet is put into the network exactly when an old one leaves.
Criticism: the equations in Figure 1 are also true if I just send a packet every ten minutes, which is far below the equilibrium.

TCP can be self-clocked: if I measure the frequency of ACK packets, I know at what frequency packets leave the pipe, and hence at what frequency I should send. To get to this equilibrium point, I should steadily increase the sending frequency just until I no longer get as many ACK packets per time unit as I am sending packets out. They call this slow start.

Retransmission is a very sensitive issue. If retransmits are done too frequently, they further congest the network, leading to more retransmits and ending in a chain reaction. So first: get a good timeout for when to retransmit (chapter 2); second: avoid a chain reaction by using exponentially increasing retransmit intervals. Assume that a packet gets lost every second and has to be retransmitted infinitely often. Since the retransmit intervals increase exponentially, the average bandwidth for retransmitting one packet declines exponentially, meaning that even in this case, where I have to retransmit infinitely often, the total bandwidth is still finite. Criticism: why does the paper talk about linear system theory and Lyapunov exponents when you can do this calculation in three lines?

Criticism: the paper assumes that all hosts are cooperative. When an ACK packet is lost, it causes unnecessary retransmits; in a congested network, ACK packets should be preferred. Assuming your retransmits are far enough apart that no redundant data traverses the pipe, you still get the full effective bandwidth if the ACKs are delivered in a congested network.
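
Going back to the Chiu/Jain model, here is the promised minimal sketch of the additive-increase/multiplicative-decrease dynamic; the capacity and the increase/decrease constants are arbitrary choices of mine, only the shape of the rule comes from the paper.

```python
# Two flows sharing one link under additive increase / multiplicative
# decrease, driven by the binary congestion signal of the Chiu/Jain model.
CAPACITY = 100.0
INCREASE = 1.0     # additive increase while the link is not congested
DECREASE = 0.5     # multiplicative decrease factor when it is

x = [10.0, 80.0]   # deliberately unequal starting rates
for t in range(200):
    congested = sum(x) > CAPACITY          # the one-bit feedback signal
    if congested:
        x = [xi * DECREASE for xi in x]
    else:
        x = [xi + INCREASE for xi in x]
print([round(xi, 1) for xi in x])
# By now the two rates are nearly equal, oscillating around the fair share:
# the decrease step shrinks their difference, the increase step preserves it.
```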

Wednesday, September 3, 2008

On Inferring Autonomous System Relationships in the Internet/Interdomain Internet Routing

Hari Balakrishnan, Nick Feamster, "Interdomain Internet Routing"
Lixin Gao, "
On Inferring Autonomous System Relationships in the Internet"

The notion of the "Autonomous System" is central. A routing protocols inside an AS are called IGP and all routers are programmed to find the optimal path through the network. All these routers are run by the same administration.

A routing protocol for routing between ASs is called an EGP; BGP is the EGP deployed on the Internet today. The design of such a routing protocol has to take into account that different ASs are run by different enterprises which are competitors or have customer-provider relationships. Hence a router is driven by commercial interests when deciding whose packets to forward to whom. The goal of BGP is to honor these interests and at the same time establish routing tables allowing global reachability if possible (cooperation between competitors). Notice that this prioritizes accountability in the original list of goals of the DARPA Internet (Clark). This means that the AS is the dividing line between two very different kinds of routing, which is why I said that the notion of the AS is central.

Technically speaking, BGP routers enter BGP sessions and exchange reachability information in the form of routes (lists of AS numbers), by enumerating the routes during initialization and then sending updates about new, changed or deleted routes. When routing information is advertised, i.e. sent from one router to another, the advertising router adds its own AS number to the route. Each BGP router has a choice of which of these routes to incorporate into its routing table, and also a choice of which routes from the routing table to advertise. "A route advertisement from B to A for a destination prefix is an agreement by B that it will forward packets sent via A destined for any destination in the prefix." To avoid cycles, a router has to drop a route when it notices that its own AS is already on the path. There are two kinds of sessions (which speak the same "language"): iBGP and eBGP. The first synchronizes routing information between different BGP routers in the same AS, the latter exchanges routes with routers in other ASs. iBGP is for merging routing information, but in eBGP routers have the choice to advertise or import only certain routes. There are further attributes (MED) in BGP to decide which exchange point to use in case two ASs are linked at multiple places.

Each time a router exports a route, it can draw more traffic towards itself, traffic which may flow through other networks. Hence, to understand which routes a BGP router imports and exports, one has to look at the relationships between ASs; the major ones are customer-provider transit and peering. In a customer-provider relationship, the provider will export routes from its customers to all other ASs the provider is connected to and give the customers routes to all other ASs, since the customer pays for being reachable.
Peering means that two providers A and B agree to forward packets going between each other's customers, to avoid paying for transit through a third, higher-level provider C. Peer A will not export a route to B's customers to, say, C, because that would mean A pays for the traffic of B's customers to C.

ASs are roughly categorized as Tier-1, Tier-2 and Tier-3 networks, corresponding to their size and hence to their relationships. For example, Tier-3 networks are small and are customers of Tier-2 networks.

I like the Interdomain Internet Routing paper for illustrating the different business relationships between ASs. It could make clearer which parts of the text are business-driven policies and which parts are really protocol specifications a router has to adhere to.

The information about the relationships between ASs is in general not public, sometimes being under non-disclosure agreements.
The second paper discusses in detail certain heuristics (like a larger AS probably being a provider of a smaller AS) and how the relationships influence which routers export which routes (III A). It is important to note that assumptions are made about how a router behaves. Under these assumptions certain theorems can be shown, essentially that when looking at the customer-provider, peer-peer and provider-customer edges of an AS path, a path can only "bounce" once, e.g. an AS path like provider-customer-provider-customer is impossible (see the small sketch at the end of this post).
Using this knowledge, an algorithm is given which uses routing information from several routers to reverse-engineer the relationships between ASs. The algorithm is refined to be more robust, because not all routers behave as assumed; e.g. a misconfigured router of a multi-homed customer exported a route linking its providers.
Experimental results are compared to internal data of AT&T.
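
To make the "bounce at most once" (valley-free) property concrete, here is a tiny sketch of my own: annotate each AS-to-AS edge of a path as customer-to-provider ("up"), peer-to-peer ("peer") or provider-to-customer ("down"); a valid path first climbs up, crosses at most one peering edge, and then only goes down.

```python
def valley_free(edges) -> bool:
    """Check the valley-free property of an annotated AS path (toy sketch)."""
    seen_peer_or_down = False
    for e in edges:
        if e == "up":
            if seen_peer_or_down:
                return False   # climbing up again after a peer/down edge
        elif e in ("peer", "down"):
            if e == "peer" and seen_peer_or_down:
                return False   # at most one peering edge, and not after a descent
            seen_peer_or_down = True
        else:
            raise ValueError(e)
    return True

print(valley_free(["up", "up", "peer", "down"]))   # True
print(valley_free(["down", "up"]))                 # False: a valley, the path "bounces" twice
```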