Fred Glass

Imperfect Network Calls

I recently audited our estate and noticed many inconsistencies in how we make service calls. Who knew there could be so much variation in such a commonplace task? Some pitfalls include...

No Logging

In particular, pre-request and post-response logging. Many times I have been debugging a production issue and found it difficult to know what the state of execution was. Did the code reach this point? Did it hang on this network request? Without sufficient logging, these questions can be a pain to answer. Even when the logs are there, contextual information is frequently missing, e.g., request duration and status codes. Consider adding this in a structured way (using structlog in Python) so that the info can be easily parsed if needed.
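
As a rough sketch, here is what that might look like with structlog and requests (the function name and endpoint are made up purely for illustration):

    import time

    import requests
    import structlog

    log = structlog.get_logger()

    def fetch_profile(user_id: str) -> dict:
        # Hypothetical endpoint, purely for illustration.
        url = f"https://profile-service.internal/users/{user_id}"

        log.info("profile_request_started", url=url, user_id=user_id)
        start = time.monotonic()

        response = requests.get(url, timeout=5)
        duration_ms = (time.monotonic() - start) * 1000

        # Structured key/value fields are easy to parse and filter later.
        log.info(
            "profile_request_finished",
            url=url,
            status_code=response.status_code,
            duration_ms=round(duration_ms, 1),
        )
        return response.json()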

No Metrics

Metrics, another pillar of observability. These should be published by default, not in reaction to a production issue. The bare minimum should be a metric to track the request duration (in such a way that percentiles can be calculated) and an error counter. I usually prefer generic metric names (mysvc.external.duration.ms) and offload contextual information to tags, being careful with cardinality in order to avoid an explosion in the number of time series generated.
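
As a sketch of what that might look like with a DogStatsD-style client (I'm assuming the datadog package here, and the error counter name and tags are invented; the same idea applies to any metrics library):

    import time

    import requests
    from datadog import statsd  # any StatsD-style client works similarly

    def charge_payment() -> requests.Response:
        # Hypothetical endpoint and tags, purely for illustration.
        url = "https://payments.internal/charge"
        tags = ["dependency:payments", "operation:charge"]  # low-cardinality tags only

        start = time.monotonic()
        try:
            response = requests.post(url, timeout=5)
            tags.append(f"status_code:{response.status_code}")
            return response
        except requests.RequestException:
            statsd.increment("mysvc.external.error.count", tags=tags)
            raise
        finally:
            duration_ms = (time.monotonic() - start) * 1000
            # Timing/histogram metrics let the backend calculate percentiles.
            statsd.timing("mysvc.external.duration.ms", duration_ms, tags=tags)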

Be wary of any rollups done by your metrics backend. Often the local agent buffers for 30s and aggregates over that window, which can cause confusion when calculating values or plotting time series, where further runtime consolidation can happen. There are a number of other pitfalls here, like using the mean for long-tailed distributions (prefer percentiles) or neglecting the worst case (the "95% lie"). I recommend watching "How NOT to Measure Latency", where Gil Tene goes into greater depth on some of these.

No Retry Strategy

Retries are a powerful tool for improving robustness against flaky dependencies. However, they are easily misused and can result in work amplification ("retry storms"). There is no one-size-fits-all solution here, but I generally use exponential backoff and jitter with sensible defaults (using tenacity in Python) and then tune as needed. One mistake I often see is retrying every possible error scenario rather than only transient errors. For an HTTP request, an easy way to do this is by checking status codes. If you receive a 422 Unprocessable Content then retries are probably not going to help, whereas they can for a 429 Too Many Requests (using the "Retry-After" header, if provided).
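
A minimal sketch with tenacity, assuming we only want to retry connection errors, timeouts, and a handful of retryable status codes (the endpoint, attempt count, and backoff values are illustrative):

    import requests
    from tenacity import (
        retry,
        retry_if_exception,
        stop_after_attempt,
        wait_random_exponential,
    )

    # Status codes where a retry has a reasonable chance of succeeding.
    RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}

    def _is_transient(error: BaseException) -> bool:
        # Connection errors and timeouts are worth retrying; most HTTP errors are not.
        if isinstance(error, (requests.ConnectionError, requests.Timeout)):
            return True
        if isinstance(error, requests.HTTPError) and error.response is not None:
            return error.response.status_code in RETRYABLE_STATUS_CODES
        return False

    @retry(
        retry=retry_if_exception(_is_transient),
        wait=wait_random_exponential(multiplier=0.5, max=10),  # exponential backoff with jitter
        stop=stop_after_attempt(4),
    )
    def fetch_orders() -> dict:
        # Hypothetical endpoint, purely for illustration.
        response = requests.get("https://orders.internal/orders", timeout=5)
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
        return response.json()

Honouring a "Retry-After" header would need a custom wait callable, which tenacity supports; I've left that out to keep the sketch short.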

Another common oversight is not tracking retry counts. A service that occasionally fails represents a significantly different operational risk from one that consistently trips up. At the very least, log a warning when a retry happens so that there is an audit trail.
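
tenacity's before_sleep hook is a convenient place for this: before_sleep_log gives you a warning per retry for free, or you can write your own callback, sketched below (the metric name in the comment is made up):

    import logging

    from tenacity import RetryCallState, retry, stop_after_attempt, wait_random_exponential

    logger = logging.getLogger(__name__)

    def _log_retry(retry_state: RetryCallState) -> None:
        # Runs before every sleep, i.e. once per retry attempt.
        logger.warning(
            "retrying %s (attempt %d) after error: %s",
            retry_state.fn.__name__,
            retry_state.attempt_number,
            retry_state.outcome.exception(),
        )
        # Incrementing a counter here (e.g. mysvc.external.retry.count) makes
        # "occasionally flaky" vs "consistently failing" visible on a dashboard.

    @retry(
        wait=wait_random_exponential(multiplier=0.5, max=10),
        stop=stop_after_attempt(4),
        before_sleep=_log_retry,
    )
    def flaky_call() -> None:
        ...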

No Response Validation

Hope is not a strategy. Validate the data you receive using libraries like pydantic (Python) or zod (TypeScript). Consider following Postel's law and being "liberal in what you accept" from others, like ignoring response keys that you never consume, while still failing fast on invalid input.
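
A small sketch with pydantic (v2 syntax; the model fields and endpoint are made up): unknown keys are dropped, but the fields we actually rely on are strictly validated.

    import requests
    from pydantic import BaseModel, ConfigDict

    class UserProfile(BaseModel):
        # Ignore response keys we never consume ("liberal in what you accept").
        model_config = ConfigDict(extra="ignore")

        id: str
        email: str
        display_name: str | None = None

    def fetch_profile(user_id: str) -> UserProfile:
        # Hypothetical endpoint, purely for illustration.
        response = requests.get(f"https://profile-service.internal/users/{user_id}", timeout=5)
        response.raise_for_status()
        # Raises pydantic.ValidationError if a field we depend on is missing or mistyped.
        return UserProfile.model_validate(response.json())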

Inconsistent Error Handling

I've seen this done in so many different ways. Rather than trying to enumerate all error paths, it's easiest to catch a base exception (e.g., RequestException in Python's requests) unless you need finer granularity. I like to wrap the exception in a custom exception (SomeServiceNameError) and re-raise it, to avoid leaking implementation details of the underlying networking library.
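
For example (the service name and endpoint are invented), something along these lines keeps requests out of the caller's error handling:

    import requests

    class PaymentServiceError(Exception):
        """Raised for any failure while talking to the (hypothetical) payment service."""

    def charge(amount_pence: int) -> dict:
        try:
            response = requests.post(
                "https://payments.internal/charge",  # illustrative endpoint
                json={"amount_pence": amount_pence},
                timeout=5,
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException as error:
            # Callers depend on PaymentServiceError, not on the fact that we
            # happen to use requests underneath.
            raise PaymentServiceError("payment service call failed") from error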

No Correlation ID Propagation

Hopefully you already have correlation IDs implemented, allowing you to reconstruct the request lifecycle by joining logs together. One thing that is often forgotten is the propagation of these IDs to internal dependencies, e.g., via an X-Request-Id header. This propagation allows you to join logs across service boundaries, which is valuable when debugging. Of course, you will need to extract and use these IDs on the callee side, and this may be outside your control if the service is owned by another team (which it probably is, as per Conway's law). This is a good example of something that should be an organisational standard ("every service should send and receive the correlation ID using header XYZ").
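
As a framework-agnostic sketch (the downstream service name is made up), a contextvar can carry the ID from the inbound request to any outgoing calls:

    import uuid
    from contextvars import ContextVar

    import requests

    # Set once per inbound request, e.g. in web framework middleware.
    correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

    def handle_inbound_request(headers: dict) -> None:
        # Reuse the caller's ID if present, otherwise start a new one.
        correlation_id.set(headers.get("X-Request-Id", str(uuid.uuid4())))

    def call_downstream(path: str) -> requests.Response:
        # Propagate the same ID so logs can be joined across service boundaries.
        return requests.get(
            f"https://inventory.internal{path}",  # illustrative endpoint
            headers={"X-Request-Id": correlation_id.get() or str(uuid.uuid4())},
            timeout=5,
        )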


In summary, there are a number of pitfalls even for a simple task like sending a network request. Consider incorporating some of the recommendations above and defining standards or templates for your team to ensure correctness and consistency. I've probably missed a ton of other things here, so drop me a message if you think of more or if you disagree with anything I've listed.