
Post-mortem: Photoroom API service degradation – October 11th, 2024

For the first time in its years of existence, the Photoroom API experienced a major outage. In this post, we’ll cover the causes as well as the actions we’re taking to ensure it doesn’t happen again.

Beginning on Friday, October 11th, 2024 at 15:33 UTC, our backend services experienced a disruption that degraded performance and made our Background Removal API and Image Editing API temporarily unavailable for approximately 45 minutes.

The issue originated from an unexpected latency spike in one of our internal services that handles monitoring and analytics. Our backend workers became unresponsive as they waited for responses from the impacted service. As a result, new requests were queued, leading to increased wait times and eventual service unavailability.

Root Cause

The latency spike was traced back to a call made to an auxiliary monitoring and analytics service. While our code was designed to handle errors from this service, it did not include a strict enough timeout. This caused our backend workers to remain in a waiting state, unable to process other requests.
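To illustrate the failure mode, here is a minimal Python sketch of a request handler that bounds its wait on a non-essential call. It is not our production code: `send_analytics`, `handle_request`, and the 50 ms timeout are hypothetical names and values chosen for the example.

```python
import asyncio


async def send_analytics(event: dict, delay_s: float) -> None:
    # Hypothetical stand-in for the auxiliary monitoring and analytics call.
    # delay_s simulates how long that service takes to answer.
    await asyncio.sleep(delay_s)


async def handle_request(image_id: str, analytics_delay_s: float,
                         timeout_s: float = 0.05) -> dict:
    # Bound the time spent on the non-essential call. If it is too slow,
    # give up on analytics instead of blocking the worker.
    try:
        await asyncio.wait_for(
            send_analytics({"image_id": image_id}, analytics_delay_s),
            timeout=timeout_s,
        )
        analytics_status = "recorded"
    except asyncio.TimeoutError:
        analytics_status = "skipped"

    # The essential work (image processing) would happen here regardless.
    return {"image_id": image_id, "analytics": analytics_status}
```

With a fast dependency, `asyncio.run(handle_request("img-1", 0.0))` records the event. With a dependency that takes two seconds, the same call gives up after the timeout and still returns a response.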

Moreover, our API includes a buffer queue to absorb temporary spikes: when all workers are busy, incoming requests wait in the queue. During the incident this aggravated the problem, because requests kept piling up behind the blocked workers.
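A back-of-the-envelope model shows why a queue that helps with short spikes hurts when workers are blocked. The sketch below uses purely illustrative numbers (100 requests per second, 50 workers), not figures from the incident, and `backlog_after` is a hypothetical helper.

```python
def backlog_after(arrival_per_s: float, workers: int,
                  service_time_s: float, duration_s: float) -> float:
    """Approximate number of queued requests after duration_s seconds."""
    # Each worker completes 1 / service_time_s requests per second.
    capacity_per_s = workers / service_time_s
    # The queue only grows when arrivals exceed capacity.
    growth_per_s = arrival_per_s - capacity_per_s
    return max(0.0, growth_per_s * duration_s)


# Healthy: each request takes 0.4 s, so capacity exceeds arrivals.
healthy = backlog_after(100, 50, 0.4, 60)
# Degraded: each request waits 5 s on a slow dependency, so capacity
# drops to 10 requests per second and the queue grows by 90 per second.
degraded = backlog_after(100, 50, 5.0, 60)
```

In this model the healthy case leaves nothing queued, while one minute in the degraded case leaves 5,400 requests waiting. That backlog still has to be processed or purged after the slow dependency is removed.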

Resolution

Once we identified the root cause, we quickly shut down the auxiliary service. However, because requests were still waiting in the buffer queue, this did not resolve the issue immediately. We purged the queues by disabling the traffic coming from our apps, favoring the traffic from our API customers.

The issue was fully resolved by 16:10 UTC, with normal operations resuming shortly thereafter.

Next steps

To prevent similar issues from occurring in the future, we are implementing the following actions:

  1. Timeouts: We will enforce strict timeout settings for all dependencies and services to prevent workers from being blocked for too long.

  2. Testing: Tests have already been added to ensure the API remains responsive even if non-essential services become slow or unavailable.

  3. Deployment procedure: While the timeout behavior was specified before the implementation, it was not tested in a real-life scenario. When adding dependencies on external services, we will more thoroughly test high-latency and failure behaviors to ensure they conform with our architecture design.
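As an illustration of the kind of test described above, here is a minimal sketch using Python's `unittest`. It assumes a hypothetical `process_image` handler that treats its dependency as non-essential; the names, the 100 ms timeout, and the one-second responsiveness budget are assumptions made for the example, not our actual test suite.

```python
import asyncio
import time
import unittest


async def slow_dependency() -> None:
    # Simulates a non-essential service that takes far too long to answer.
    await asyncio.sleep(5)


async def unavailable_dependency() -> None:
    # Simulates a non-essential service that cannot be reached at all.
    raise ConnectionError("dependency unavailable")


async def process_image(dependency, timeout_s: float = 0.1) -> str:
    # Hypothetical handler: the dependency must never block the response.
    try:
        await asyncio.wait_for(dependency(), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        pass  # Non-essential: continue without it.
    return "ok"


class ResponsivenessTest(unittest.TestCase):
    def _timed(self, dependency):
        # Run the handler once and measure how long it takes to respond.
        start = time.monotonic()
        result = asyncio.run(process_image(dependency))
        return result, time.monotonic() - start

    def test_slow_dependency_does_not_block(self):
        result, elapsed = self._timed(slow_dependency)
        self.assertEqual(result, "ok")
        self.assertLess(elapsed, 1.0)

    def test_unavailable_dependency_is_tolerated(self):
        result, elapsed = self._timed(unavailable_dependency)
        self.assertEqual(result, "ok")
        self.assertLess(elapsed, 1.0)
```

Saved in a test module, these cases can be run with `python -m unittest`. The important property is that the assertions cover elapsed time as well as the result, so a missing or overly loose timeout fails the test.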

We take the reliability of the API as seriously as our customers do: the Photoroom apps, used by tens of millions, are also powered by the Photoroom API, and this outage affected our users. We sincerely apologize for any inconvenience caused to our API customers.

Timeline of Events (UTC time):

  • 15:33: Latency spikes begin affecting backend services.

  • 15:37: First system alert received.

  • 15:50: Issue is identified, and the analytics service is shut down.

  • 16:05: System begins recovering.

  • 16:10: Services return to normal.

Eliot Andres, Co-founder & CTO @ Photoroom