Thursday, November 13, 2008

X-Trace

Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica, "X-Trace: A Pervasive Network Tracing Framework"

Motivation
  • current diagnostic tools are limited to one particular protocol, e.g. traceroute
  • need for comprehensive view of the system's behavior
  • complex systems: e.g. Wikipedia has different sites, web caches, DNS round-robin, load balancers, web servers, database servers (and memcached)
  • tracing across different administrative domains needed
Ideas and design principles
  • integrated tracing framework
  • network protocols modified to propagate X-Trace metadata
  • works inter-layer
  • works inter-Administrative Domains
  • decouples the client that requests tracing from the recipient of the tracing data (Design principle 3); the report destination is part of the X-Trace metadata
  • trace initiated by inserting X-Trace metadata by user application or network operator
  • trace identified by task identifier
  • X-Trace data sent to a report server (can be the client application or a delegated server)
  • X-Trace constructs the task tree offline along two axes: one across "layers" (an event causes another event in a lower layer), one across "time" (an event causes another in the same layer); each node in the task tree has an ID, and children link to their parents
  • Design principle 1: trace requests are sent in-band
  • Design principle 2: trace data are sent out-of-band
  • ASCII report format
  • report library, report collection through e.g. Postgres
  • visualization of task tree
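The offline task-tree construction described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the Report fields, the "next"/"down" edge labels, and the function names are assumptions made for illustration. It only shows how reports that arrive out of order can be grouped by task identifier and linked child-to-parent.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    """Hypothetical report record; field names are illustrative only."""
    task_id: str              # identifies the whole trace
    op_id: str                # ID of this node in the task tree
    parent_id: Optional[str]  # None for the root of the task
    edge: Optional[str]       # "next" (same layer) or "down" (lower layer)
    label: str                # free-form description of the event


def build_task_tree(reports, task_id):
    """Group the reports of one task into roots and a parent -> children map."""
    children = defaultdict(list)
    roots = []
    for r in reports:
        if r.task_id != task_id:
            continue  # report belongs to a different trace
        if r.parent_id is None:
            roots.append(r)
        else:
            children[r.parent_id].append(r)
    return roots, children


def render(roots, children):
    """Return the tree as indented text lines, one line per node."""
    lines = []

    def walk(node, depth):
        prefix = "" if node.edge is None else f"[{node.edge}] "
        lines.append("  " * depth + prefix + node.label)
        for child in children.get(node.op_id, []):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


if __name__ == "__main__":
    # Reports may reach the report server in any order.
    demo = [
        Report("t1", "c", "a", "down", "TCP segment"),
        Report("t1", "a", None, None, "HTTP client"),
        Report("t1", "b", "a", "next", "HTTP proxy"),
    ]
    print("\n".join(render(*build_task_tree(demo, "t1"))))
```

Because each report carries only its own ID and its parent's ID, a lost report leaves its descendants detached from the tree, which is the partial-trace problem noted under Deployment.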
Deployment
  • API for applications has pushNext() and pushDown() to propagate X-Trace metadata across the two axes; a device reports information accessible at its own layer and can include additional information like load
  • gradual deployment: for legacy clients, devices in the network can add X-Trace metadata
  • retrofitting X-Trace into existing applications faces difficulties: changes to various protocols (IP options, TCP, HTTP headers, SQL), partial deployment impairs ability to trace parts of the network, lost trace reports can be interpreted as false positives
  • certain request topologies cannot be captured, e.g. requests that spread through the network and rendezvous at a node
  • unique() function returning identifier for task tree not specified in paper
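To make the two propagation calls concrete, here is a rough Python sketch of how pushNext() and pushDown() might derive new metadata from existing metadata. Only the two operation names come from the paper (written here as push_next and push_down); the Metadata fields, start_task(), and the random hex identifiers are assumptions. In particular, the random IDs merely stand in for the unspecified unique() function.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metadata:
    """Hypothetical X-Trace metadata; the field layout is an assumption."""
    task_id: str                     # shared by every operation of one trace
    op_id: str                       # ID of the current operation
    parent_id: Optional[str] = None  # ID of the operation that caused this one
    edge: Optional[str] = None       # "next" or "down"


def new_op_id():
    # Stand-in for unique(), which the paper does not specify.
    return secrets.token_hex(4)


def start_task():
    """Begin a trace by creating root metadata with a fresh task identifier."""
    return Metadata(task_id=secrets.token_hex(8), op_id=new_op_id())


def _push(md, edge):
    # The task ID is copied unchanged; the new operation points at its parent.
    return Metadata(md.task_id, new_op_id(), parent_id=md.op_id, edge=edge)


def push_next(md):
    """Metadata for the next operation in the same layer (the "time" axis)."""
    return _push(md, "next")


def push_down(md):
    """Metadata for the operation caused in the layer below (the "layers" axis)."""
    return _push(md, "down")


if __name__ == "__main__":
    http_request = start_task()
    tcp_segment = push_down(http_request)    # HTTP hands the request to TCP
    proxied_request = push_next(http_request)  # HTTP proxy forwards the request
    print(tcp_segment)
    print(proxied_request)
```

Both derived records keep the task identifier and name the originating operation as their parent, which is what lets the report server link children to parents later.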
Uses and Experiences
  • low performance overhead
  • Web request and recursive DNS queries
  • Web hosting site (LAMP), user could initiate traces through JavaScript/PHP library
  • overlay network
  • Tunnels, ISP connectivity
I really liked the framework this paper suggests. I think it is very useful. Although there is a lot of accumulated experience, scalable websites are still non-trivial to set up, still require manual work to tune and integrate caches, and span many different layers, as mentioned in the introduction. Some difficulties not mentioned in the introduction: there are even more caching layers, such as the SQL query cache and memcached; relational databases don't scale, so huge websites shard their data across multiple machines; and there is Brewer's theme of performance versus consistency (a website is not truly transactional, just transactional enough that users don't perceive it as bad, but when a user does see an inconsistency, it is hard to track down). This paper introduces the only debugging tool I am aware of that addresses all these things together.
The only thing I would add to this framework is the ability to send encrypted X-Trace data.
