Monday, September 29, 2008

Improving TCP Performance over Wireless Links

Hari Balakrishnan, Venkata Padmanabhan, Srinivasan Seshan, Randy Katz, "A Comparison of Mechanisms for Improving TCP Performance over Wireless Links"

TCP interprets packet loss as a sign of congestion, so lossy links like wireless make TCP perform badly.
The paper does not introduce new ideas for improving this, but does a great job of reviewing, systematizing and comparing the existing approaches, which are listed in Table 1.

Two approaches:
  • Hide losses from the TCP sender: a local problem is solved locally, presenting a higher-quality link with reduced effective bandwidth
  • make the sender aware of the lossy link
I found the snoop protocol interesting: it works on the link layer but is TCP-aware and suppresses, for example, non-congestion-related duplicate acks (a sketch follows below).
Another interesting point: there may be retransmission timers in both the TCP layer and the link layer, which can be redundant and issue unnecessary retransmissions.
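
As an illustration, here is a minimal sketch of the snoop idea in Python. The overall structure (cache segments going out over the wireless link, retransmit locally on a duplicate ack, suppress the duplicate ack) follows the paper's description; the class and the link interface are my own illustrative assumptions, not the paper's implementation:

    class SnoopAgent:
        # Sits at the base station between the wired TCP sender and the
        # wireless receiver. All names here are illustrative.
        def __init__(self, wireless_link):
            self.link = wireless_link   # assumed to offer .send(segment)
            self.cache = {}             # seq -> segment, not yet acked
            self.last_ack = -1
            self.dup_acks = 0

        def on_data_from_sender(self, seq, segment):
            self.cache[seq] = segment   # keep a copy for local retransmission
            self.link.send(segment)

        def on_ack_from_receiver(self, ack):
            if ack > self.last_ack:     # new ack: clean the cache, forward it
                for seq in [s for s in self.cache if s <= ack]:
                    del self.cache[seq]
                self.last_ack, self.dup_acks = ack, 0
                return ack
            self.dup_acks += 1
            lost = ack + 1
            if lost in self.cache:      # the loss was local to the wireless hop:
                self.link.send(self.cache[lost])   # retransmit here and
                return None             # suppress the dup ack, so the TCP
                                        # sender never mistakes it for congestion
            return ack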

The conclusions:
  • a reliable link-layer protocol with TCP awareness (e.g. in-order packet delivery) improves performance greatly
  • split-connection approaches can be avoided without performance loss
  • TCP mechanisms like SACK help recover more quickly from multiple packet losses
  • explicit loss notification improves performance

Media Access Protocol for Wireless

Vaduvur Bharghavan, Alan Demers, Scott Shenker, Lixia Zhang, "MACAW: A Media Access Protocol for Wireless LAN's"

Assumptions
  • single channel
  • no multiple-path effects
  • two stations are either in range or have no mutual interference
  • symmetry: if A receives B, B receives A
  • very small delay time (short range)
Possible approaches
  • carrier sense (sensing happens at the sender, but congestion happens at the receiver; the whole paper is "receiver"-driven: the receiver has two signals, CTS and RRTS, to drive senders)
  • token-based
  • multiple access collision avoidance (taken here)
Basic Idea
  • Request-To-Send Clear-To-Send Handshake
Refinements
  • backoff algorithm, shared congestion state, later decoupling of the shared congestion state (see the backoff sketch after this list)
  • separate queues for separate streams
  • ACKnowledgements (link-layer recovery is faster than TCP's), Data-Sending packets (defer others from sending RTS), Request-for-Request-To-Send
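The backoff refinement is MILD (multiplicative increase, linear decrease). A small sketch; the factor 1.5 and the decrease by one follow the paper, while the bounds and the random slot draw are illustrative assumptions:

    import random

    BO_MIN, BO_MAX = 2, 64          # assumed bounds on the backoff counter

    def attempt_delay(backoff):
        # wait a random number of slots drawn from the current window
        return random.randint(1, backoff)

    def after_collision(backoff):
        return min(int(backoff * 1.5), BO_MAX)   # multiplicative increase

    def after_success(backoff):
        return max(backoff - 1, BO_MIN)          # gentle linear decrease

    def copy_backoff(own, heard):
        # sharing congestion state: adopt the backoff value carried in an
        # overheard packet, so all stations converge to a similar view
        return heard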
Collision happens when two stations send simultaneously. A collision can also happen with the RTS packets. So how does this RTS-CTS handshake bring improvement? An RTS packet is short (30 bytes) compared to a data packet (512 bytes), so a collision between RTSs wastes only about 30/512 ≈ 6% of the airtime a data-packet collision would waste.

As far as I understand, the mechanism for when to send an RTS is to wait until the current data transmission has finished, plus a random time whose distribution depends on the congestion state. Does this really deserve the term "Collision Avoidance"? Could there be a better mechanism which collaboratively develops a schedule for when which station sends to which other station?
There is also no explicit communication among the stations about which station is in range of which other station, although from such information a graph of ranges and crosstalk could be constructed, which could potentially give a better scheduling scheme for data transmissions. Instead the algorithm is pretty ad hoc. Is this a weakness or a strength of the algorithm, because it is simpler, more flexible and more robust?

Monday, September 22, 2008

Routers with two switched backplanes and optics

Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu, David Miller, Mark Horowitz, Olav Solgaard, Nick McKeown, "Scaling Internet Routers Using Optics"

This paper blew my mind. 100 Terabit/s.
  • When I read the other paper on routers, I noticed that there are two things about a router we might want to increase: the number of ports N or the per-port/total throughput. The other paper still has a centralized structure: the scheduler. Luckily, because whole cells are transferred through the crossbar before reconfiguration, the scheduler is given a relatively high amount of time to come up with a scheduling decision, but I don't see how the scheduler can be implemented with a run time of less than O(N^2), so that router would not really scale in N.
  • This paper actually analyzes how the components scale with N and overcomes the centralized scheduler by introducing a "load balancer" before the VOQs (see previous post): it has two crossbars in total (though they can be replaced by one at twice the speed), and each crossbar follows a fixed scheme which cyclically connects every input with every output, as shown in Figure 2 and sketched after this list. They show that this achieves 100% throughput. I think memory scales with N^2, but I am not sure.
  • The paper next analyzes how to use an optical backplane. The suggested switched backplane is a two-tier hierarchy of optical switches. I lost track of which optical switch is doing what to which wavelength in WDM.
  • The paper discusses how to use this scheme so that some line cards can be removed (the VOQs are distributed across line cards, so if a line card is missing or has failed, traffic needs to be rerouted around it)
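
The fixed schedule is simple enough to sketch. Below is a toy model of the two-stage datapath in Python; the cyclic connection pattern is the paper's, while the port count and data structures are illustrative:

    import collections

    N = 4  # number of ports (illustrative)

    def permutation(t):
        # fixed, traffic-independent configuration at slot t:
        # input i is connected to output (i + t) % N
        return [(i + t) % N for i in range(N)]

    # mid_voq[m][d]: cells parked at intermediate input m for final output d
    mid_voq = [[collections.deque() for _ in range(N)] for _ in range(N)]

    def slot(t, arrivals):
        # arrivals[i] is None or the destination port of the cell at input i
        conn = permutation(t)
        # stage 1 spreads arrivals over the middle stage regardless of
        # destination -- this is the load balancing
        for i, dest in enumerate(arrivals):
            if dest is not None:
                mid_voq[conn[i]][dest].append(dest)
        # stage 2 uses the same cyclic pattern; a cell departs when its
        # intermediate input happens to be connected to its output
        departures = []
        for m in range(N):
            d = conn[m]
            if mid_voq[m][d]:
                departures.append(mid_voq[m][d].popleft())
        return departures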

Switched backplane with centralized scheduler

Nick McKeown, "Fast Switched Backplane for a Gigabit Switched Router"

Basic Observations:
  • A router has to forward traffic from every port to every other port
  • The total traffic is the product of the traffic at one port times the number of ports N
  • A structure where all traffic goes through one central element has to fail, because that element would have to be N times faster than a port
  • An N-to-N connection scheme can be used so that the speed needed at each crosspoint doesn't exceed the speed of a port
  • Because the number of crosspoints is O(N^2), these crosspoints have to be cheap, i.e. they cannot make routing decisions; they should be only on-off switches driven by a scheduling scheme. This structure is called a "crossbar"
  • The scheduler closes N of the N^2 crosspoints at a time, so we have maximal throughput
Basic concepts:
  • 1 Line card per port, backplane connecting all line cards
  • Components: forwarding decision (next-hop MAC, TTL decrement, IP checksum), backplane (forward to output linecard, queue or drop packets), output link scheduling (FIFO or QoS scheduling)
  • dataplane (per packet) versus control plane
  • specific hardware (ASIC) for dataplane
  • parallelism
  • shared versus switched backplane
  • fixed packet length through splitting into fixed size chunks (cells, around 64Bytes)
Problems with switched backplanes:
  • head-of-line blocking (a packet in the line card has to wait until the crossbar switches to the right output) is solved by queues (one per output) between line card and crossbar; these queues are called virtual output queues (VOQ)
  • input and output blocking (e.g. several line cards want to use the same output) causes delay; prioritization or speed-up is needed for QoS
Scheduling algorithm:
  • Goals: high throughput, starvation free, fast, simple to implement
  • iSLIP is an iterative 3-step algorithm; underlying it is a rotating round-robin priority scheme. The algorithm matches inputs to outputs to fulfill the highest-priority requests (see the sketch after this list)
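
Here is a minimal single-iteration sketch of iSLIP's request-grant-accept round in Python. The round-robin pointers and the "slip" update rule follow the paper; the data layout is my own:

    def islip_once(voq, grant_ptr, accept_ptr, n):
        # voq[i][j]: number of cells queued at input i for output j
        # Request + grant: each output grants the requesting input that is
        # nearest to its round-robin pointer.
        grant = {}
        for j in range(n):
            reqs = [i for i in range(n) if voq[i][j] > 0]
            if reqs:
                grant[j] = min(reqs, key=lambda i: (i - grant_ptr[j]) % n)
        # Accept: each input accepts the granting output nearest its pointer.
        match = []
        for i in range(n):
            offers = [j for j, g in grant.items() if g == i]
            if offers:
                j = min(offers, key=lambda o: (o - accept_ptr[i]) % n)
                match.append((i, j))
                voq[i][j] -= 1
                # pointers advance one past the matched partner; this
                # desynchronizes the pointers and prevents starvation
                grant_ptr[j] = (i + 1) % n
                accept_ptr[i] = (j + 1) % n
        return match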
Other features:
  • Unicast and multicast support
  • QoS
Scale (as of 1997):
  • CISCO 12000 had 16 2.4Gb/s ports
  • maximal shared bus capacity 20Gb/s
  • forwarding decision (IP prefix lookup, ...) can be done in 2ns
  • scheduler reconfigures crossbar every 40ns

Gigabit- and Terabit-Router Questions

Here are a bunch of questions which I don't really know the answer to, but would like to know to understand the next two readings better.
  • What is the maximal and economical switching rate of a current transistor? (it might be cheaper to produce several slower transistors that together have the same computing power)
    Since modern processors run at around 3 GHz and there are a bunch of transistors in a line within one clock cycle, I guess the answer is around 30-300 GHz (which is 1/1000 of the frequency of visible light). This question is about a fundamental limit on how much information can be processed in a single electrical wire; to process more, you need parallelism.
  • As light passing through a fibre is an electromagnetic wave, it can (in theory) be modulated the same ways radio waves are modulated: amplitude modulation/amplitude-shift keying, frequency modulation/frequency-shift keying, phase modulation/phase-shift keying (or a combination of amplitude and phase modulation), OFDM. Which of these modulations are used in practice?
    I know that by detuning lasers one could probably achieve frequency modulation and demodulate it through a phase-locked or frequency-locked loop, but I guess currently most systems only use on-off/amplitude modulation on different wavelengths (WDM), because you just switch the laser, or some filter in front of it, on and off.
  • If electronics can only operate at a certain maximum frequency (see question 1), how can a single fibre transport more information than that?
    I guess the only way, implemented right now, is through WDM.

Sunday, September 21, 2008

High-performance routers

I am still reading the papers, but if somebody is already looking for other blog posts to get an idea, here is my comment: read the "Fast Switched Backplane for a Gb/s router" first!

Wednesday, September 17, 2008

Real Time Applications in packet switched networks

David Clark, Scott Shenker, Lixia Zhang, "Supporting Real-Time Applications in an Integrated Services Packet Network: Architecture and Mechanism"

Goal
  • packet switched network which supports real-time applications (ISPN)
    applications are divided into adaptive versus rigid, tolerant versus intolerant
    (ISDN is mentioned as an intermediate form of digital network, but one used for telephony)
  • preschedule services, packet delay bound known a priori
  • support of guaranteed and predicted service (statistical, dynamic)
    (Motivating observation: video applications can drop frames, audio conversations can insert short silences when delays increase, ...)
Architecture
  • service interface communicating quality of service the network can deliver to end host and traffic requirements of end host to routers for resource allocation
  • packet scheduling algorithms meeting service commitments, algorithm supported by additional scheduling information in packet header

Guaranteed traffic is characterized by a token bucket filter (a minimal sketch follows below).
Techniques: FIFO, Weighted FQ, ...
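
A token bucket with rate r and depth b admits bursts of at most b and a long-term rate of at most r. A minimal sketch using the usual (r, b) parameters; the class interface is my own:

    class TokenBucket:
        def __init__(self, rate, depth):
            self.rate, self.depth = rate, depth   # r tokens/s, bucket size b
            self.tokens, self.last = depth, 0.0

        def conforms(self, packet_size, now):
            # refill for the elapsed time, capped at the bucket depth
            self.tokens = min(self.depth,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= packet_size:
                self.tokens -= packet_size
                return True    # packet is within the traffic contract
            return False       # non-conforming: queue, drop or mark it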

Assumption that flows are well-behaved
No discussion on gradual deployment

I got lost in the middle of the paper in technical details.

The future of the Internet as of 1995

Scott Shenker, "Fundamental Design Issues for the Future Internet"

I was amused by the beginning of the article: 3000000 hosts and astonishment that a newspaper has an online column...

The goal of the paper is to build a framework to analyze future techniques deployed on the Internet to solve problems arising from "real-time" applications.

Current (as of 1995):
  • best-effort, no admissions control, no assurance about delay or delivery
  • killer apps: traditional data uses (telnet, FTP, HTTP, DNS, SMTP)
  • apps are elastic, i.e. tolerant to delay or loss, gracefully degrade in these cases
  • apps can change transmission rates in case of congestion
Future:
  • "Real-time" apps: video & audio
  • apps not elastic, bad performance in case of variation of delays (jitter) or loss
  • apps without congestion control
  • "Realt-time" apps interfere badly with traditional data uses, no fairness
Possible Solutions:
  • changing the router implementation, e.g. FQ
  • changing application implementation, e.g. delay adaptive techniques
  • introduce "Qualitiy of Service": implicit (router analyzes traffic patterns) or explicit (in header)
  • Admissions Control
  • Overprovisioning
Metrics:
  • service delivered to application i encoded in vector s_i
  • U_i(s_i) gives performance of application, e.g. U_i for FTP would depend smoothly and mostly linearly on bandwidth, but for a video conference it would degrade when there is not enough bandwidth
  • total performance V=sum U_i(s_i) is to be maximized
  • V does not account for fairness!
The possible solutions are discussed in view of the chosen metrics in thought experiments.
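
The shape of U_i drives the whole argument. A toy comparison in Python; the exact curve shapes are mine, the paper only needs elastic utilities to be concave and rigid real-time utilities to have a sharp knee:

    import math

    def u_elastic(bw):
        # e.g. FTP: diminishing returns, no sharp knee
        return math.log(1 + bw)

    def u_rigid(bw, required=1.0):
        # e.g. fixed-rate video: nearly worthless below its required rate
        return 1.0 if bw >= required else 0.0

    # Consequence for V = sum of U_i: splitting bandwidth evenly among too
    # many rigid flows can drive every one of their utilities to zero,
    # which is the argument for admission control over equal sharing.
    for bw in (0.5, 1.0, 2.0):
        print(bw, round(u_elastic(bw), 2), u_rigid(bw))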

Monday, September 15, 2008

Congestion Control for High Bandwidth-Delay Product Networks

Dina Katabi, Mark Handley, Charlie Rohrs, "Congestion Control for High Bandwidth-Delay Product Networks"

Criticism of current TCP:
  • oscillatory and unstable
  • AIMD limits a sender to acquiring only one more packet of bandwidth per RTT
  • biased against long-RTT flows (e.g. satellite connections need ad hoc methods like proxies)
  • congestion notification only implicit through packet loss: no distinction between error loss and congestion loss, and it takes a long time until a packet loss is identified
Methods:
  • Arguments from Control Theory (though more often cited than used ;-)
  • Simulation
  • Rethinking the transport protocol without requiring backward compatibility
Features:
  • no per-flow state
  • separation of utilization/efficiency and fairness control
  • detects misbehaving sources
  • provides incentives to both users & ISPs
  • gradual deployment schemes discussed
Idea:
  • new protocol supersedes TCP: eXplicit Control Protocol
  • the XCP congestion header informs the sender about the degree of congestion, rather than having only a 1-bit explicit congestion notification or packet loss
  • the XCP header includes the congestion window size and RTT (set by the sender); gateways can modify the feedback field
  • the receiver sends the XCP header back to the sender; the sender uses the feedback to set its new congestion window
  • if a gateway sees spare bandwidth or overutilization, it picks a couple of packets randomly and sets their feedback fields to compensate (not necessarily in a fair manner): efficiency control
  • pairs of flows are picked, and the feedback fields are set to take some bandwidth from the higher-bandwidth flow and give it to the lower-bandwidth flow; this eventually converges to fairness: fairness control (see the sketch after this list)
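
A sketch of the gateway's efficiency controller in Python. The form of the aggregate feedback and the constants 0.4 and 0.226 are from the paper; the per-packet bookkeeping is heavily simplified, and the fairness shuffling is omitted:

    ALPHA, BETA = 0.4, 0.226    # stability constants from the paper

    def aggregate_feedback(spare_bw, persistent_queue, avg_rtt):
        # desired change in aggregate traffic over one control interval:
        # proportional to spare bandwidth, damped by the standing queue
        return ALPHA * avg_rtt * spare_bw - BETA * persistent_queue

    def apply_feedback(phi, packets):
        # efficiency control only: split phi over a sample of packets;
        # each sender later adjusts its congestion window by the feedback
        # carried in its header
        share = phi / max(len(packets), 1)
        for pkt in packets:
            pkt["feedback"] = share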

Random Early Detection Gateways for Congestion Avoidance

Sally Floyd and Van Jacobson, "Random Early Detection Gateways for Congestion Avoidance"

The proposed algorithm is implemented in the gateway (see 16. of the previous post) and uses FIFO scheduling together with a random drop policy. The gateway can signal congestion by dropping or marking packets, and will do so before the queue overflows (see previous post for motivation). It does so in a manner where the packet drop probability depends continuously on the queue size (see 19. of previous post) and where the drop probability is proportional to the bandwidth of the corresponding flow, to achieve fairness.

The design goals were:
  • cooperate with current transport protocol implementations and gateways
  • congestion avoidance: avoid long queues/delays and low throughput
  • avoid global synchronization (see 6. of previous post)
  • no negative bias against bursty traffic (see 5.)
  • fair bandwidth allocation (though not complete isolation from misbehaving senders)
  • no per-flow state
The algorithm has several parts:
  • Computing the average queue size (through an exponentially weighted low-pass filter: this distinguishes between a persistent queue and bursty traffic)
  • Using the average queue size to determine the drop/congestion-notification probability. This is a piecewise linear function.
  • Dropping packets with that probability, with a little math trick: if no packet has been dropped for a long time (with respect to the drop probability), bias the probability upward; that way packets are dropped more uniformly. Since the packet is chosen randomly, a flow with a higher bandwidth has a higher probability of being impacted.
Since packets might have very different sizes, the probability can be multiplied by the packet size to ensure fairness. All three parts are sketched below.
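
A sketch of the three parts in Python; the parameter values are typical examples, not prescriptions from the paper:

    import random

    W_Q, MIN_TH, MAX_TH, MAX_P = 0.002, 5, 15, 0.02

    def red_arrival(state, queue_len, pkt_size=1.0, mean_size=1.0):
        # part 1: exponentially weighted moving average of the queue size,
        # so bursts pass but a persistent queue is detected
        state["avg"] = (1 - W_Q) * state["avg"] + W_Q * queue_len
        avg = state["avg"]
        if avg < MIN_TH:
            state["count"] = -1
            return False                 # enqueue normally
        if avg >= MAX_TH:
            return True                  # drop/mark for sure
        # part 2: piecewise-linear base probability, scaled by packet size
        # so higher-bandwidth flows are hit proportionally more often
        p_b = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH) * (pkt_size / mean_size)
        # part 3: bias by the count of packets since the last drop, which
        # spreads drops out roughly uniformly instead of geometrically
        state["count"] += 1
        p_a = p_b / max(1 - state["count"] * p_b, 1e-9)
        if random.random() < p_a:
            state["count"] = 0
            return True
        return False

    # usage: state = {"avg": 0.0, "count": -1}; drop = red_arrival(state, 7)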

The paper discusses good ranges for the various parameters and shows simulations.

Thoughts on Queues and Congestion Avoidance


  1. Ideally, each node knows its bandwidth share and sends exactly with that bandwidth share.
  2. Ideally, the nodes would send their packets in a round-robin fashion, i.e. the packets arrive at the bottleneck gateway in equal time intervals, at the same rate at which they go through the bottleneck link. E.g. if a packet needs 1 ms for transmission on the bottleneck, node 1 sends now, node 2 sends 1 ms later, node 3 2 ms later, ..., node n (n-1) ms later, and node 1 sends its next packet n ms later.
  3. Provided 1 and 2, no queue would be necessary.
  4. A congestion protocol should try to maintain 1 and 2 as well as possible; that way no queue is necessary (by 3) and delays stay low.
  5. Given 4, the function of a queue is to absorb bursty traffic.
  6. The opposite of 2 is global synchronization: all nodes send their packets at the same time. The queue will be long, and there will be long delays.
  7. 2 is hard to accomplish; the best situation to hope for is that each node sends at a random time, where the random distribution produces relatively small clusters which can be buffered in the queue.
  8. There are two quantities to control: the queue size and the arrival rate of packets at a gateway. The first is to be kept low (to have low delay and be able to absorb bursts), the second is to be kept at the output link rate of the gateway.
  9. The queue size is the integral of the difference between arrival rate and output rate over time: Q(t) = integral of (arrival rate - output rate) dt.
  10. The necessary reactions to a too-large queue size and a too-high arrival rate have to be different: the first requires that the senders briefly pause sending packets, the second requires that the senders change their packet rate. This is similar to driving behind a car on a highway: when you see that the car is slower than you, you reduce your target speed until it matches the car's speed; when you see that the car is too close, you keep the same target speed and just briefly slow down below it until there is sufficient distance.
  11. Because traffic demands are changing, 1 can only be achieved through feedback loops.
  12. The success of the feedback loop (in terms of efficient use of bandwidth, low delay and fairness) will depend on how soon and how precisely the senders can sense a congestion situation.
  13. The original Jacobson paper on congestion avoidance assumes that the only way for the gateway to signal congestion is dropping packets when the queue is full. It is even unclear what this signals with respect to point 10; a too-high arrival rate is signaled only indirectly, because it will eventually cause a too-large queue.
  14. In the original Jacobson paper on congestion avoidance (Van Jacobson, Michael J. Karels, "Congestion Avoidance and Control"), congestion was sensed through packet loss, which happens "late".
  15. "In the absence of explicit feedback from the gateway, transport-layer protocols could infer congestion from ... changes in end-to-end delay [or] packet drops [...]. Nevertheless, the view of an individual connection is limited by the timescales of the connection, the lack of knowledge of the number of congested gateways, the possibilities of routing changes, as well as by other difficulties in distinguishing propagation delay from persistent queueing delay." (Sally Floyd, Van Jacobson, "Random Early Detection Gateways for Congestion Avoidance")
  16. Because of 15, gateways need to give feedback to the nodes. "The gateway can reliably distinguish between propagation delay and persistent queueing delay." (Sally Floyd, Van Jacobson, "Random Early Detection Gateways for Congestion Avoidance")
  17. Because of 12, the situation gets worse with the throughput-delay product.
  18. It is better to buffer a packet at the sender than in a queue of a gateway of the network, to keep delays low.
  19. The oscillatory behavior of a congestion avoidance algorithm will be stronger if the feedback times are long and if the actions of the algorithm are discontinuous in its input.
  20. The difference in necessary behavior in 10 is also characterized by time scales: "Using tools from control theory, we conjecture that congestion feedback based on rate-mismatch should be inversely proportional to delay, and feedback based on queue-mismatch should be inversely proportional to the square of the delay." (Dina Katabi, Mark Handley, Charlie Rohrs, "Congestion Control for High Bandwidth-Delay Product Networks")
  21. The optimal size for the congestion window is the maximal amount of traffic on the network links (not the queues) when using the fair share of bandwidth, i.e. bandwidth × transit time (not including queueing delays).

Thursday, September 11, 2008

Fair Queueing Algorithm - Drawings

Alan Demers, Srinivasan Keshav, Scott Shenker, "Analysis and Simulation of a Fair Queueing Algorithm"

Here are my drawings for the fair queueing algorithm, see previous post. Also available as pdf

Wednesday, September 10, 2008

Fair Queueing Algorithms

Alan Demers, Srinivasan Keshav, Scott Shenker, "Analysis and Simulation of a Fair Queuing Algorithm"
Ion Stoica, Scott Shenker, Hui Zhang, "Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks"

FQ Algorithm

There are two "points of implementation" for congestion control: source and gateway. The paper is looking at the gateway and is suggestion a Queuing algorithm. Whereas a Queueing algorithm is not changing the total traffic, so not directly avoiding congestion, it "affects the collective behavior of flow control algorithms".

The algorithm has a queue per flow (per-flow state).

The basic algorithm is best explained in a sequence of drawings I made and forgot to scan. I found that the notation obfuscated the basic idea; the core bookkeeping is sketched below.
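
For reference, here is that core bookkeeping as a Python sketch: each packet is stamped with the virtual time at which a bit-by-bit round robin would finish sending it, and the gateway always transmits the packet with the smallest stamp. The round-number computation R(t) is simplified here to a supplied value:

    import heapq

    class FairQueue:
        def __init__(self):
            self.last_finish = {}   # flow -> finish number of its last packet
            self.heap = []          # (finish_number, flow, size)

        def enqueue(self, flow, size, round_number):
            # F = max(finish number of previous packet, current round) + size
            start = max(self.last_finish.get(flow, 0.0), round_number)
            finish = start + size
            self.last_finish[flow] = finish
            heapq.heappush(self.heap, (finish, flow, size))

        def dequeue(self):
            # send whatever the bit-by-bit emulation would finish first
            return heapq.heappop(self.heap) if self.heap else None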

I found the quantities used when evaluating the algorithm interesting: "bandwidth", "promptness" and "buffer space" are supposed to be nearly independent.
The design requirements were:
- has to work in a heterogenous environment with different flow control algorithms
- fairness (using the max-min definition: we introduce a common limit for everybody so that the sum of all limited bandwidth usages equals the full bandwidth) despite misbehaving sources
- good delay-throughput curve for each flow

The performance was evaluated through computation and simulation. The basic setup had a couple of FTP-like and telnet-like flows. Simulations were run on different network topologies. Flows which use less than their share of bandwidth gain low delay times.

The paper was great: it contained a lot of ideas about the environment the algorithm has to work in, theoretical concepts about what conditions an algorithm has to fulfill to provide stable flow control (e.g. continuity), definitions of fairness and other desired criteria, metrics for performance, analysis of performance, and simulation. I don't feel I can write a summary doing the paper justice.

Core-stateless FQ algorithm

In the previous paper, a gateway routing traffic has to maintain per-flow state. For a gateway routing thousands or millions of different flows, this might be prohibitively expensive. The paper therefore divides routers into two categories: core and edge. Edge routers maintain per-flow state and write the rate of the respective flow into the headers of packets. Core routers only compute global statistics and drop packets probabilistically to achieve fair bandwidth allocation. If a core router sees a packet whose header indicates that the respective flow's arrival rate is beyond its fair share of bandwidth, the packet is dropped with exactly the probability that brings the flow back to its fair share (see the sketch below). The fair share of bandwidth is the same for all flows and is estimated by raising it until the bandwidth is fully utilized and lowering it in case of congestion.
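
The core router's drop decision reduces to a one-liner. A sketch, assuming the edge has already labeled the packet with its flow's rate and that the fair share estimate is given:

    import random

    def core_forward(rate_label, fair_share):
        # drop with probability max(0, 1 - fair_share/rate), so the flow's
        # accepted rate falls back to the fair share
        if rate_label <= 0:
            return True
        p_drop = max(0.0, 1.0 - fair_share / rate_label)
        return random.random() >= p_drop   # True: forward, False: drop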

I imagine that the network might be an ISP providing Internet access to PCs at home, an edge router is the router connecting such a PC to the ISP and is responsible for calculating the flow rate of that customer. The core routers are the internal routers of the ISP and are responsible for maintaining fair use of bandwidth.

Friday, September 5, 2008

Congestion avoidance

Van Jacobson, Michael J. Karels, "Congestion Avoidance and Control"
Dah-Ming Chiu, Raj Jain, "Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks"

Underlying both papers is the observation that when a link is overutilized there is a very drastic degradation: the throughput can suddenly drop by several orders of magnitude. This can happen when several links converge into one such that the sum of their bandwidths is larger than the bandwidth of the link they converge to. The buffer at the respective router overflows, and it is pure chance which packets are dropped and which make it into the buffer. ACK packets are dropped, forcing retransmissions and creating even more traffic. This phenomenon is called congestion.

Both papers are trying to use close to the full bandwidth of links and at the same time avoid congestion. Both papers work at the TCP level on the end-points of a connection. These end-points perceive whether a link is congested and act by changing the amount of data they send through different mechanisms. Both papers assume that the end-points behave cooperatively to avoid congestion and converge to fair resource usage.
The "Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks" makes simplyfing assumptions to make a precise mathematical model of the abstracted network.
Some of these assumptions are:
- time is discretized into time steps
- the endpoint perceives that a link is congested through a simple, reliable binary signal sent by the router in one time step
- a host is simply sending x_i(t) amount of data/time unit where x_i(t) is main subject of discussion of the paper
The paper analyzes a linear model, i.e. x_i(t) is equal to one of two linear functions of x_i(t-1), depending on whether the link is congested or not. It then shows for which pairs of linear functions the model converges to fairness and optimal resource usage. The easiest such example is to add a constant to x_i(t-1) while the link is uncongested and to multiply x_i(t-1) by a factor < 1 when it is congested, i.e. additive increase, multiplicative decrease (a toy simulation of this case follows below).

The Jacobson/Karels paper, "Congestion Avoidance and Control", implements seven changes to TCP. They mostly come from the principle of "conservation of packets", meaning that the number of packets traveling through the network from one machine to another should stay constant: a new packet is put into the network exactly when an old one leaves.
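
A toy two-flow simulation of the AIMD case (the numbers and the loop are mine; the synchronous binary feedback matches the paper's model):

    C, INCR, DECR = 10.0, 0.5, 0.5   # link capacity, additive step, multiplier

    def step(x1, x2):
        if x1 + x2 > C:              # congestion bit is set for everyone
            return x1 * DECR, x2 * DECR
        return x1 + INCR, x2 + INCR  # additive increase otherwise

    x1, x2 = 8.0, 1.0                # start far away from fairness
    for _ in range(60):
        x1, x2 = step(x1, x2)
    print(round(x1, 2), round(x2, 2))   # converges to roughly equal shares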
Criticism: the equations in Figure 1 are also true if I just send a packet every ten minutes, which is below the equilibrium.

TCP can be self-clocked: if I measure the frequency of ACK packets, I know the frequency at which packets leave the pipe, and hence the frequency at which I should send.

To get to this equilibrium point, I should steadily increase the packet frequency just until I no longer get as many ACK packets per time unit as I am sending packets out. They call this slow start.

The retransmit timer is a very sensitive issue. If retransmits are done too frequently, they further congest the network, leading to more retransmits and ending in a chain reaction. First: get a good timeout for when to retransmit (chapter 2); second: avoid a chain reaction by using exponentially increasing retransmit intervals. Assume that every second a packet gets lost and has to be retransmitted infinitely often. Since the retransmit intervals increase exponentially, the average bandwidth for retransmitting one packet declines exponentially, meaning that even in this case, where I have to retransmit infinitely often, the total retransmission bandwidth is still finite. Criticism: why does the paper talk about linear system theory and Lyapunov exponents when you can do the calculation in three lines? (Both timer mechanisms are sketched below.)

Criticism: the paper assumes that all hosts are cooperative. Also, when an ACK packet is lost, it causes unnecessary retransmits; in a congested network, ACK packets should be preferred. Assuming your retransmits are spaced far enough apart that no redundant data traverses the pipe, you still get the full effective bandwidth if ACKs are delivered in a congested network.
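
A sketch of the two timer mechanisms in Python: the mean-plus-deviation RTT estimator (gains 1/8 and 1/4 and RTO = srtt + 4*rttvar, as in the paper) and the exponential backoff between successive retransmits; the class wrapper is mine:

    class RetransmitTimer:
        def __init__(self, first_rtt):
            self.srtt, self.rttvar = first_rtt, first_rtt / 2
            self.backoff = 1

        def on_rtt_sample(self, m):
            err = m - self.srtt
            self.srtt += err / 8                         # smoothed mean
            self.rttvar += (abs(err) - self.rttvar) / 4  # smoothed deviation
            self.backoff = 1          # a fresh sample resets the backoff

        def rto(self):
            return (self.srtt + 4 * self.rttvar) * self.backoff

        def on_timeout(self):
            # doubling the interval keeps the total retransmission
            # bandwidth finite even under persistent loss
            self.backoff *= 2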

Wednesday, September 3, 2008

On Inferring Autonomous System Relationships in the Internet/Interdomain Internet Routing

Hari Balakrishnan, Nick Feamster, "Interdomain Internet Routing"
Lixin Gao, "
On Inferring Autonomous System Relationships in the Internet"

The notion of the "Autonomous System" (AS) is central. Routing protocols inside an AS are called IGPs, and all routers are programmed to find the optimal path through the network. All these routers are run by the same administration.

A routing protocol for routing between ASs is called an EGP. BGP is the EGP deployed today on the Internet. The design of such a routing protocol has to take into account that different ASs are run by different enterprises which are competitors or have customer-provider relationships. Hence a router is driven by commercial interests when deciding whose packets to forward to whom. The goal of BGP is to honor these interests and at the same time establish routing tables allowing global reachability where possible (cooperation between competitors). Notice that this prioritizes accountability from the original list of goals of the DARPA Internet (Clark). This is why the AS is the dividing line between two very different kinds of routing, and why I said that the notion of the AS is central.

Technically speaking, BGP routers enter BGP sessions and exchange reachability information in the form of routes (sequences of AS numbers), by enumerating the routes (during initialization) and then sending updates about new, changed or deleted routes. When routing information is advertised, i.e. sent from one router to another, the advertising router has to prepend its AS number to the route. Each BGP router has a choice of which of these routes to incorporate into its routing table. It also has a choice of which of the routes from the routing table to advertise. "A route advertisement from B to A for a destination prefix is an agreement by B that it will forward packets sent via A destined for any destination in the prefix." To avoid cycles, a router has to drop a route when it notices that it is already on the path. There are two kinds of sessions (which speak the same "language"): iBGP and eBGP. The first is used to synchronize routing information between different BGP routers in the same AS, the latter to exchange routes with routers in other ASs. iBGP is for merging routing information, but in eBGP, routers have the choice to advertise or import only certain routes. There are further attributes (e.g. MED) in BGP to decide which exchange point to use in case two ASs are linked at multiple places.

Each time a router exports a route, it can draw toward itself traffic that would otherwise flow through other networks. Hence, to understand which routes a BGP router imports and exports, one has to look at the relations between ASs; the major ones are customer-provider transit and peering. In a customer-provider relation, the provider will export routes from its customers to all other ASs the provider is connected to and give the customers routes to all other ASs, since the customer pays for being reachable.
Peering means that two providers A and B agree to forward packets traveling between each other's customers directly, to avoid paying for traffic through a third, higher provider C. Peer A will not export a route for B's customers to, say, C, because that would mean A pays for B's customers' traffic to C.

ASs are roughly categorized as Tier-1, Tier-2, Tier-3 networks which corresponds to their size and hence to their relationships. For example, Tier-3 networks are small and are customers of Tier-2 networks.

I like the Interdomain Internet Routing paper for illustrating the different business relationships between ASs. It could, however, make clearer which parts of the text are business-driven policies and which parts are actual protocol specifications a router has to adhere to.

The information about the relationships between ASs is in general not public, sometimes being under non-disclosure agreements.
The second paper discusses in detail certain heuristics (e.g. a larger AS is probably a provider of a smaller AS) and how the relationships influence which routers export which routes (III A). It is important to note that assumptions are made about how a router behaves. Under these assumptions certain theorems can be shown, essentially that when looking at the customer-provider, peer-peer and provider-customer edges of an AS path, a path can only "bounce" once, e.g. an AS path like provider-customer-provider-customer is impossible (a checker for this property is sketched at the end of this post).
Using this knowledge, an algorithm is given which uses routing information from several routers to reverse-engineer the relationships between ASs. This algorithm is then refined to be more robust, because not all routers behave as assumed, e.g. a misconfigured router of a multi-homed customer exported a route linking its providers.
Experimental results are compared to internal data of AT&T.
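
The "bounce once" theorem is easy to state as code. Here is a sketch of a checker for this valley-free property; the relationship labels and the rel() lookup are assumed inputs, not part of the paper's inference algorithm:

    UP, PEER, DOWN = "c2p", "p2p", "p2c"

    def valley_free(path, rel):
        # path: list of AS numbers; rel(a, b): label of the edge a -> b.
        # A legal path climbs customer->provider edges, crosses at most one
        # peering edge at the top, and then only descends.
        descended = False
        for a, b in zip(path, path[1:]):
            edge = rel(a, b)
            if edge in (UP, PEER) and descended:
                return False   # climbing or peering again: a "valley"
            if edge in (PEER, DOWN):
                descended = True
        return True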