
Post-mortem: Photoroom API service degradation – October 11th, 2024

For the first time in its years of existence, the Photoroom API experienced a major outage. In this post, we’ll cover the causes as well as the actions we’re taking to ensure it doesn’t happen again.

Beginning on Friday, October 11th, 2024 at 15:33 UTC, our backend services experienced a disruption that degraded performance and made our Background Removal API and Image Editing API temporarily unavailable for approximately 45 minutes.

The issue originated from an unexpected latency spike in one of our internal services that handles monitoring and analytics. Our backend workers became unresponsive as they waited for responses from the impacted service. As a result, new requests were queued, leading to increased wait times and eventual service unavailability.

Root Cause

The latency spike was traced back to a call made to an auxiliary monitoring and analytics service. While our code was designed to handle errors from this service, it did not include a strict enough timeout. This caused our backend workers to remain in a waiting state, unable to process other requests.
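As an illustration of the failure mode described above, here is a minimal Python sketch of a request handler that calls a non-essential service under a strict time budget. It is not Photoroom's actual code: `call_with_timeout`, `handle_request`, `record_analytics`, the pool size, and the 50 ms budget are all hypothetical names and values chosen for the example.

```python
import concurrent.futures
import time

# Shared pool for non-essential calls (hypothetical; size chosen arbitrarily).
_AUX_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in the pool and wait at most timeout_s seconds.

    Returns (True, result) on success, or (False, None) if the call
    timed out or raised an exception.
    """
    future = _AUX_POOL.submit(fn, *args, **kwargs)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started running yet;
        # a call that is already running keeps its pool thread busy.
        future.cancel()
        return False, None
    except Exception:
        # Errors from a non-essential service must not fail the request.
        return False, None


def handle_request(image_id, record_analytics, timeout_s=0.05):
    """Process a request; analytics is best-effort and time-boxed."""
    result = f"processed:{image_id}"  # stand-in for the real image work
    recorded, _ = call_with_timeout(record_analytics, timeout_s, image_id)
    return {"result": result, "analytics_recorded": recorded}
```

With this shape, a slow analytics call costs the worker at most `timeout_s` instead of blocking it indefinitely. The sketch only bounds how long the caller waits, so the underlying network call would still need its own timeout to avoid exhausting the pool's threads.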

Moreover, our API includes a buffer queue to absorb temporary spikes: when all workers are busy, incoming requests wait in the queue. During this incident, the queue aggravated the problem, because requests kept piling up while the workers were blocked.
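One common way to keep a buffer queue from amplifying an outage is to bound both its size and the time a request may wait in it. The sketch below illustrates that general technique and is not a description of how the Photoroom API queue works; the class name `BoundedRequestQueue` and its parameters are assumptions made for the example.

```python
import queue
import time


class BoundedRequestQueue:
    """Buffer queue with a size cap and a maximum waiting time per request."""

    def __init__(self, max_size, max_wait_s, clock=time.monotonic):
        self._queue = queue.Queue(maxsize=max_size)
        self._max_wait_s = max_wait_s
        self._clock = clock  # injectable so the behavior can be tested

    def offer(self, request):
        """Try to enqueue a request.

        Returns False when the queue is full, so the caller can reject
        the request right away instead of letting it pile up.
        """
        try:
            self._queue.put_nowait((self._clock(), request))
            return True
        except queue.Full:
            return False

    def take_fresh(self):
        """Return the next request that has not waited too long.

        Requests older than max_wait_s are skipped, since their clients
        have most likely given up. Returns None when the queue is empty.
        """
        while True:
            try:
                enqueued_at, request = self._queue.get_nowait()
            except queue.Empty:
                return None
            if self._clock() - enqueued_at <= self._max_wait_s:
                return request
```

Rejecting early and dropping stale entries trades a few fast failures for a queue that drains quickly once the workers recover.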

Resolution

Once we identified the root cause, we quickly shut down the auxiliary service. However, because requests were still waiting in the buffer queue, this did not resolve the issue immediately. We purged the queues by disabling the traffic coming from our apps, prioritizing the traffic from our API customers.

The issue was fully resolved by 16:10 UTC, with normal operations resuming shortly thereafter.

Next steps

To prevent similar issues from occurring in the future, we are implementing the following actions:

  1. Timeouts: We will enforce strict timeout settings for all dependencies and services to prevent workers from being blocked for too long.

  2. Testing: Tests have already been added to ensure the API remains responsive even if non-essential services become slow or unavailable.

  3. Deployment procedure: While the timeout behavior was specified before the implementation, it was not tested in a real-life scenario. When adding dependencies on external services, we will more thoroughly test high-latency and failure behaviors to ensure they conform with our architecture design.
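The kind of test described in the list above can be sketched as follows. This is a hypothetical example, not the tests that were added to the Photoroom API: the handler, the fake dependencies, and the time budgets are assumptions. It uses `asyncio.wait_for` to time-box the dependency and checks that the handler still answers quickly when the dependency is slow or failing.

```python
import asyncio
import time


async def handle_request(image_id, record_analytics, timeout_s=0.05):
    """Hypothetical handler: analytics is best-effort and time-boxed."""
    result = f"processed:{image_id}"  # stand-in for the real image work
    try:
        await asyncio.wait_for(record_analytics(image_id), timeout=timeout_s)
    except Exception:
        # Timeouts and errors from the non-essential service are ignored.
        pass
    return result


async def slow_dependency(_image_id):
    """Simulates a dependency with a severe latency spike."""
    await asyncio.sleep(5)


async def failing_dependency(_image_id):
    """Simulates a dependency that is unavailable."""
    raise ConnectionError("analytics service unavailable")


def run_and_time(dependency):
    """Run the handler once and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = asyncio.run(handle_request("img-1", dependency))
    return result, time.monotonic() - start


def test_slow_dependency_does_not_block():
    result, elapsed = run_and_time(slow_dependency)
    assert result == "processed:img-1"
    assert elapsed < 1.0  # far below the 5 s the dependency would take


def test_failing_dependency_does_not_fail_request():
    result, elapsed = run_and_time(failing_dependency)
    assert result == "processed:img-1"
    assert elapsed < 1.0
```

The functions follow the `test_` naming convention, so a runner such as pytest would collect them; they can also be called directly.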

We take the reliability of the API as seriously as our customers do: the Photoroom apps, used by tens of millions of people, are also powered by the Photoroom API, so this outage affected our own users as well. We sincerely apologize for any inconvenience caused to our API customers.

Timeline of Events (UTC time):

  • 15:33: Latency spikes begin affecting backend services.

  • 15:37: First system alert received.

  • 15:50: Issue is identified, and the analytics service is shut down.

  • 16:05: System begins recovering.

  • 16:10: Services return to normal.

Eliot Andres, Co-founder & CTO @ Photoroom
