Post-mortem: Photoroom API service degradation – October 11th, 2024

For the first time in its years of existence, the Photoroom API experienced a major outage. In this post, we’ll cover the causes as well as the actions we’re taking to ensure it doesn’t happen again.

Beginning on Friday, October 11th, 2024 at 15:33 UTC, our backend services experienced a disruption that degraded performance and made our Background Removal API and Image Editing API temporarily unavailable for approximately 45 minutes.

The issue originated from an unexpected latency spike in one of our internal services that handles monitoring and analytics. Our backend workers became unresponsive as they waited for responses from the impacted service. As a result, new requests were queued, leading to increased wait times and eventual service unavailability.

Root Cause

The latency spike was traced back to a call made to an auxiliary monitoring and analytics service. While our code was designed to handle errors from this service, it did not include a strict enough timeout. This caused our backend workers to remain in a waiting state, unable to process other requests.
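
To illustrate the failure mode, here is a minimal sketch of a non-essential call wrapped in a strict timeout. It is not our production code: it assumes an asyncio-based worker, and `send_analytics_event`, `report_event`, `handle_request`, and the 0.2 second budget are hypothetical names and values chosen for the example.

```python
import asyncio


async def send_analytics_event(payload: dict) -> None:
    """Hypothetical stand-in for the auxiliary service call; here it simply stalls."""
    await asyncio.sleep(5)


async def report_event(payload: dict, timeout_s: float = 0.2) -> bool:
    """Best-effort reporting: give up after timeout_s seconds instead of blocking.

    Returns True if the event was delivered, False otherwise.
    """
    try:
        await asyncio.wait_for(send_analytics_event(payload), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # The dependency is too slow: drop the event and free the worker.
        return False
    except Exception:
        # Errors from a non-essential service must not fail the request either.
        return False


async def handle_request(image_id: str) -> str:
    """Hypothetical request handler: the analytics call can no longer stall it."""
    await report_event({"image_id": image_id})
    return f"processed {image_id}"
```

With a bound like this, a stalled dependency costs each request at most the timeout budget, rather than holding the worker for as long as the dependency takes to answer.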

Moreover, our API includes a buffer queue to absorb temporary spikes: when all workers are busy, incoming requests wait in the queue. With the workers blocked, this aggravated the problem, as requests kept piling up in the queue.
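
A rough back-of-the-envelope model shows how quickly such a queue fills once workers stall. The sketch below is a simplified fluid model, and every number in it (request rate, worker count, service times) is hypothetical and chosen only for illustration, not taken from our systems.

```python
def capacity(workers: int, service_time_s: float) -> float:
    """Requests per second the worker pool can complete."""
    return workers / service_time_s


def backlog_after(seconds: float, arrival_rate: float,
                  workers: int, service_time_s: float) -> float:
    """Requests waiting in the queue after `seconds`, starting from an empty queue."""
    growth = max(0.0, arrival_rate - capacity(workers, service_time_s))
    return growth * seconds


def drain_time(backlog: float, arrival_rate: float,
               workers: int, service_time_s: float) -> float:
    """Seconds needed to empty `backlog` while new requests keep arriving."""
    spare = capacity(workers, service_time_s) - arrival_rate
    return backlog / spare if spare > 0 else float("inf")


# Hypothetical figures: 100 requests/s, 50 workers.
healthy = backlog_after(60, arrival_rate=100, workers=50, service_time_s=0.4)
stalled = backlog_after(60, arrival_rate=100, workers=50, service_time_s=30)
recovery = drain_time(stalled, arrival_rate=100, workers=50, service_time_s=0.4)
```

In this toy model, the healthy pool keeps the queue empty, while a pool whose workers each wait 30 seconds on a slow dependency accumulates a backlog within a minute that takes several minutes to drain even after the workers are fast again.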

Resolution

Once we identified the root cause, we quickly shut down the auxiliary service. However, as requests were waiting in the buffer queue, this did not resolve the issue immediately. We purged the queues by disabling the traffic coming from our apps, favoring the traffic from our API customers.

The issue was fully resolved by 16:10 UTC, with normal operations resuming shortly thereafter.

Next steps

To prevent similar issues from occurring in the future, we are implementing the following actions:

  1. Timeouts: We will enforce strict timeout settings for all dependencies and services to prevent workers from being blocked for too long.

  2. Testing: Tests have already been added to ensure the API remains responsive even if non-essential services become slow or unavailable.

  3. Deployment procedure: While the timeout behavior was specified before the implementation, it was not tested in a real-life scenario. When adding dependencies on external services, we will more thoroughly test high-latency and failure behaviors to ensure they conform with our architecture design.
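
As an illustration of the kind of test described above, here is a minimal sketch that injects a slow or failing dependency and checks that the handler still answers quickly. It is not our actual test suite: `handle_request`, `slow_analytics`, `failing_analytics`, and the latency budgets are hypothetical, and the handler is assumed to be asyncio-based.

```python
import asyncio
import time


async def handle_request(image_id, analytics_call, timeout_s=0.1):
    """Hypothetical handler: essential work succeeds, analytics is best-effort."""
    try:
        await asyncio.wait_for(analytics_call(image_id), timeout=timeout_s)
    except Exception:
        # Covers both timeouts and errors raised by the non-essential service.
        pass
    return {"image_id": image_id, "status": "ok"}


async def slow_analytics(image_id):
    """Simulates a dependency with a large latency spike."""
    await asyncio.sleep(10)


async def failing_analytics(image_id):
    """Simulates a dependency that is down."""
    raise ConnectionError("analytics unavailable")


def measure(analytics_call):
    """Run one request and return the response with its wall-clock duration."""
    start = time.monotonic()
    response = asyncio.run(handle_request("img-1", analytics_call))
    return response, time.monotonic() - start


def test_responsive_when_analytics_is_slow():
    response, elapsed = measure(slow_analytics)
    assert response["status"] == "ok"
    assert elapsed < 1.0  # far below the simulated 10 s stall


def test_responsive_when_analytics_fails():
    response, elapsed = measure(failing_analytics)
    assert response["status"] == "ok"
    assert elapsed < 1.0
```

The important property is that the test asserts on latency as well as on the response, since a handler that eventually succeeds after a long wait would still block its worker.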

We take the reliability of the API as seriously as our customers do: the Photoroom apps - used by tens of millions - are also powered by the Photoroom API and this outage affected our users. We sincerely apologize for any inconvenience caused to our API customers.

Timeline of Events (UTC time):

  • 15:33: Latency spikes begin affecting backend services.

  • 15:37: First system alert received.

  • 15:50: Issue is identified, and the analytics service is shut down.

  • 16:05: System begins recovering.

  • 16:10: Services return to normal.

Eliot Andres, Co-founder @ Photoroom