BentoBox websites, backends and e-commerce stores are currently not loading

Incident Report for BentoBox

Postmortem

Post-Mortem: Snowflake outage.

Date: 03/20/2025

Summary of event

At 12:15 PM EST, BentoBox response time spiked to a critical 15.2 seconds, causing timeouts on pages such as Sushi, Online Ordering, and Kitchen. At 1:27 PM EST, Snowflake identified the root cause: throttling of requests from clients hosted in AWS - US East. At 1:58 PM EST, we deployed a hotfix that partially disabled features related to reporting and upsell items to mitigate the issue. Users experienced a service interruption lasting 3 hours and 15 minutes.

Timeline of events

  • 11:25 AM EST: Snowflake issues started. Outage not reported yet.
  • 11:45 AM EST: Alerts were received in #bot-pse-alerts about Celery queue jobs not being processed.
  • 12:15 PM EST: BentoBox response time spiked from 3.52 seconds to 15.2 seconds [critically high response time]. Customers began noticing slow load times on websites.
  • 12:16 PM EST: A timeout error was reported in #info-plataform-updates. As a result, sites like Kitchen, TOAD, and Sushi were not available in production.
  • 12:21 PM EST: Snowflake reported an issue with DataCloud as a Partial Outage, impacting customers with intermittent delays or timeouts.
  • 12:59 PM EST: After initial investigation, we suspected a DoS attack caused by requests coming from IP addresses in Europe. As a preventive action, we blocked these IPs.
  • 01:27 PM EST: Snowflake identified the issue, which was throttling clients hosted in AWS - US East. Snowflake status changed from Partial Outage to Outage.
  • 01:58 PM EST: A hotfix was implemented, where we partially disabled features related to reporting and upsell items.
  • 02:40 PM EST: BentoBox response time dropped to 313.71 milliseconds [healthy response time]. Customers regained access to the websites after an interruption lasting 3 hours and 15 minutes.
  • 03:45 PM EST: Snowflake recovered from the outage.
  • 06:14 PM EST: General users experienced degraded loading of e-commerce websites.
  • 07:50 PM EST: A fix was applied, restoring the expected loading behavior of the e-commerce websites.
  • 08:11 PM EST: The BentoBox status page was updated to reflect a "Resolved" status.
  • March 20, 12:00 PM EST: The hotfix was reverted, and BentoBox's reporting and upsell item features returned to normal.

Root cause identification

  • Snowflake outage.

    • An unexpected full outage in Snowflake was the primary cause of the incident. A flaw in Snowflake’s automatic scaling system triggered a code issue, leading to resource exhaustion and throttling in systems processing customer requests. As a client hosted in AWS - US East, we experienced delays and timeouts when executing queries or using Snowflake services and features via Snowsight, with queries appearing to be stuck in a running state.
  • BentoBox <> Snowflake

    • BentoBox relies on Snowflake primarily for gathering and reporting data on sites such as Sushi and Online Ordering, which were significantly impacted by the Snowflake outage. As a result, attempts to access key metrics—such as upsell items, revenue, and others—led to delays in BentoBox's response time, ultimately causing timeouts on these pages.
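
One way to keep a stuck warehouse query from turning into a page timeout is to put a deadline on the call and return a fallback when it is exceeded. The following is only a minimal sketch under assumptions: `call_with_deadline`, `slow_upsell_query`, and `fast_upsell_query` are hypothetical names for illustration and are not BentoBox code or a Snowflake API.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

# Shared pool so a slow warehouse call never runs on the request thread.
_pool = ThreadPoolExecutor(max_workers=4)

def call_with_deadline(fn, timeout_s, fallback):
    """Run fn() in the pool; return fallback if it fails or exceeds timeout_s."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort: a call that is already running keeps running
        return fallback
    except Exception:
        return fallback

def slow_upsell_query():
    # Hypothetical stand-in for a query that appears stuck in a running state.
    time.sleep(0.5)
    return ["miso soup"]

def fast_upsell_query():
    # Hypothetical stand-in for a healthy query.
    return ["edamame"]

# The page renders without upsell items instead of waiting on the slow query.
items = call_with_deadline(slow_upsell_query, timeout_s=0.05, fallback=[])
```

A deadline only bounds how long the caller waits. Calls that are already running still occupy pool workers until they finish, so a long outage can exhaust the pool unless the feature is also switched off or served from cached data.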

User impact of the incident

  • Customers/diners impacted?

The incident resulted in temporary unavailability or increased load times for our backend (Sushi) and client sites.

Diners were also impacted by the outage: they were unable to add items to carts.

  • Revenue impacting?

Yes. Since diners were unable to add items to carts, several orders were not placed.

Impacted customers

  • List of impacted customers or link to report with customer information

This outage impacted all the customers.

Lessons Learned

  • What went wrong

During the Snowflake outage, we had no mechanism to disable the features that depend on it, which led to service disruptions. Because mitigation required code changes, it took longer than necessary. In addition, Snowflake was not on our high-priority monitoring list, which delayed incident detection.
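
A kill switch is one possible shape for the missing mechanism: a flag checked before every call to the dependency, so a feature can be turned off without a code change. The sketch below is an illustration under assumptions. The `FEATURE_<NAME>` variable convention and the `feature_enabled` and `upsell_items` helpers are hypothetical, and a real setup would more likely read flags from a store that can change at runtime rather than from environment variables.

```python
import os

def feature_enabled(name, env=None):
    """Return False when the flag FEATURE_<NAME> is set to an 'off' value."""
    env = os.environ if env is None else env
    value = env.get(f"FEATURE_{name.upper()}", "on")  # features default to on
    return value.strip().lower() not in {"0", "off", "false", "no"}

def upsell_items(fetch, env=None):
    """Return upsell items, or an empty list when the feature is switched off."""
    if not feature_enabled("upsell", env):
        return []  # skip the warehouse call entirely
    return fetch()

# With the flag off, the hypothetical fetch function is never called.
print(upsell_items(lambda: ["gyoza"], env={"FEATURE_UPSELL": "off"}))
```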

  • What worked

All teams across BentoBox responded quickly. With everyone on the call, we were able to identify the specific endpoints affected by the outage.

The developers who implemented the feature helped by giving us context on how BentoBox integrates with Snowflake.

  • For the future

A subscription to Snowflake’s status page was added to our monitoring tools.

We will refactor our API to eliminate its real-time dependency on Snowflake. Instead of querying Snowflake live when resources like menus or upsell revenue are requested, we will query Snowflake asynchronously and store the results in a cache or database. This will prevent Snowflake outages from directly impacting our API availability.
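
The planned refactor can be sketched roughly as follows, assuming a background job (for example a periodic task) calls `refresh()` while API handlers only call `get()`. `CachedMetric` and its `loader` argument are hypothetical names for illustration, and an in-process value stands in for the cache or database mentioned above.

```python
import threading
import time

class CachedMetric:
    """Serve the last known value; refresh it outside the request path."""

    def __init__(self, loader, default=None):
        self._loader = loader      # hypothetical function that queries the warehouse
        self._value = default
        self._updated_at = None    # monotonic time of the last successful refresh
        self._lock = threading.Lock()

    def refresh(self):
        """Called by a background job, never by a request handler."""
        try:
            value = self._loader()
        except Exception:
            return False           # keep serving the previous value
        with self._lock:
            self._value = value
            self._updated_at = time.monotonic()
        return True

    def get(self):
        """Called by the API handler; never touches the warehouse."""
        with self._lock:
            return self._value

    def age_seconds(self):
        """Seconds since the last successful refresh, or None if there was none."""
        with self._lock:
            if self._updated_at is None:
                return None
            return time.monotonic() - self._updated_at
```

With this shape, a failed refresh leaves the previous value in place, so an outage shows up as stale data rather than as a timeout. Exposing `age_seconds()` lets monitoring alert when the data becomes too old.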

The SRE team will investigate the lack of caching for asset files.

Additional monitors will be created for our web server resources.

Posted Mar 20, 2025 - 16:03 EDT

Resolved

As of 2:50 PM EDT, the issue affecting the ability of customer websites, backends, and e-commerce stores to load has been resolved. Due to an outage at one of our third-party platforms, Snowflake, features like Best Sellers and Upsell on our e-commerce products, as well as revenue and diner insights on the BentoBox dashboard, are still affected. If you'd like to monitor Snowflake's status, you can visit this page: https://status.snowflake.com/
Posted Mar 19, 2025 - 17:34 EDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 19, 2025 - 14:50 EDT

Identified

This issue has been identified and a fix is being implemented.
Posted Mar 19, 2025 - 13:53 EDT

Update

We are continuing to investigate this issue.
Posted Mar 19, 2025 - 13:36 EDT

Investigating

BentoBox is currently experiencing an issue that is affecting customer websites, backends, and e-commerce stores' ability to load.

Our engineering team is actively working to investigate the issue as quickly as possible. We're very sorry for the inconvenience and appreciate your patience as we work through this.
Posted Mar 19, 2025 - 12:30 EDT
This incident affected: Websites, CMS/Backend, and E-commerce, Online Ordering, or Pre-Order & Catering.