You Can Change It Later

The Blog Of David Marks

How does Google debug problems in a large, distributed infrastructure?

Posted on | April 27, 2010 | 8 Comments

It’s incredibly difficult to debug errors and performance bottlenecks in a large, distributed computing system. Many different components and computing nodes touch each transaction, and it’s extremely time-consuming to determine what, exactly, is going on inside the system.

Basic questions like “What kind of requests do we handle poorly?” and “Which component of the system is responsible?” are surprisingly hard to answer.

Google just published a paper on Dapper, one of the tools they use to debug problems in their massive search infrastructure. It’s a worthwhile read.

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

“Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met.”
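
To give a rough feel for what request tracing involves, here is a minimal Python sketch. It is not Dapper's API or implementation: the names `Span`, `traced`, `self_times`, and `handle_request`, the in-memory `SPANS` list, and the three fake components are all made up for illustration. The only idea borrowed from the paper is that every unit of work records a span carrying a shared trace ID and a pointer to its parent span, so the pieces of one request can be stitched back together later.

```python
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    component: str
    duration_s: float = 0.0


# Hypothetical in-memory collector; a real system would ship spans elsewhere.
SPANS = []


@contextmanager
def traced(component, parent=None):
    """Time a block of work and record it as a span under `parent`."""
    span = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex,
        parent_id=parent.span_id if parent else None,
        component=component,
    )
    start = time.perf_counter()
    try:
        yield span
    finally:
        span.duration_s = time.perf_counter() - start
        SPANS.append(span)


def self_times(trace_id):
    """Per-component time for one trace, excluding time spent in child spans."""
    spans = [s for s in SPANS if s.trace_id == trace_id]
    child_total = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_s
    totals = defaultdict(float)
    for s in spans:
        totals[s.component] += s.duration_s - child_total[s.span_id]
    return dict(totals)


def handle_request():
    """Fake request that touches three made-up components."""
    with traced("frontend") as root:
        with traced("auth", parent=root):
            time.sleep(0.005)
        with traced("index", parent=root):
            time.sleep(0.05)
    return root.trace_id
```

Calling `self_times(handle_request())` returns a dict of seconds per component, and with the sleeps above `index` should dominate. Even in a toy like this, every call site has to pass the parent span along by hand, which hints at why the paper treats application-level transparency as a design goal.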

When I was at Loomia, I ran into many situations where debugging a performance bottleneck or bug required a large effort. In many cases, finding the issue takes longer than fixing it, and the tools available to developers don’t cut it when you have a big, n-tier application with dozens, hundreds, or thousands of nodes.

This is a real challenge for anyone with a large, compute-intensive cloud application.

Kudos to Google for publishing some of their learnings in this paper. Hopefully, we’ll see an open source tool in this space soon.

Comments

8 Responses to “How does Google debug problems in a large, distributed infrastructure?”

  1. dmarks007
    April 27th, 2010 @ 8:06 pm

    New post! : How does Google debug problems in a large, distributed infrastructure? http://youcanchangeitlater.com/2010/04/2

  2. dmarks007
    April 27th, 2010 @ 1:06 pm

    New post! : How does Google debug problems in a large, distributed infrastructure? http://youcanchangeitlater.com/2010/04/2

  3. xamat
    April 27th, 2010 @ 9:02 pm

    RT @dmarks007: New post! : How does Google debug problems in a large, distributed infrastructure? http://youcanchangeitlater.com/2010/04/2

  4. xamat
    April 27th, 2010 @ 2:02 pm

    RT @dmarks007: New post! : How does Google debug problems in a large, distributed infrastructure? http://youcanchangeitlater.com/2010/04/2

  5. conikeec
    April 27th, 2010 @ 9:04 pm

    RT @dmarks007: New post! : How does Google debug problems in a large, distributed infrastructure? http://youcanchangeitlater.com/2010/04/2

  6. conikeec
    April 27th, 2010 @ 2:04 pm

    RT @dmarks007: New post! : How does Google debug problems in a large, distributed infrastructure? http://youcanchangeitlater.com/2010/04/2

  7. eddieyoon
    April 27th, 2010 @ 10:50 pm

    RT @dmarks007: New post! : How does Google debug problems in a large, distributed infrastructure? http://youcanchangeitlater.com/2010/04/2

  8. eddieyoon
    April 27th, 2010 @ 3:50 pm

    RT @dmarks007: New post! : How does Google debug problems in a large, distributed infrastructure? http://youcanchangeitlater.com/2010/04/2
