Google Workspace Status Dashboard
Incident affecting NotebookLM
Incident began at 2025-09-29 17:53 and ended at 2025-09-29 18:42 (times are in Coordinated Universal Time (UTC)).
Date | Time | Description | |
---|---|---|---|
| Oct 2, 2025 | 8:17 PM UTC | Incident ReportSummaryOn Monday, 29 September 2025, several AI products experienced elevated 500 (internal) errors globally for 49 minutes, between 10:53 and 11:42 US/Pacific. To our customers whose operations were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. Root CauseThe incident occurred on a core machine learning (ML) serving platform at Google, responsible for hosting and delivering the Large Language Models that power many of our AI products and services. A software update intended to improve the platform's performance and scalability introduced a bug in a central control system. This system directs traffic to the correct machine learning models, and the bug caused it to incorrectly determine that some models were unavailable, leading to errors for our customers. The impact of this issue was unintentionally magnified because the faulty component was responding more quickly, which caused our traffic management system to direct more requests to that component. Remediation and PreventionGoogle engineers were alerted to the outage via manual escalation from multiple products on Monday, 29 September 2025 at 11:09 US/Pacific and immediately started an investigation. Once the underlying cause was identified, the problematic change was rolled back. All the errors returned by ML serving were then reduced to pre-fault levels, mitigating customer impact at 11:42 US/Pacific. Google is committed preventing a repeat of this issue in the future and is completing the following actions:
We apologize for the length and severity of this incident. We are taking immediate steps to prevent a recurrence and improve reliability in the future. Detailed Description of ImpactOn Monday, 29 September 2025, from 10:53 to 11:42 US/Pacific, Google Workspace products with a dependency on the LM Serving infrastructure experienced routing issues globally. The incident resulted in 21 percent of the request failing due to [model] NOT_FOUND error, which cascaded to customers as an internal error. |
| Sep 30, 2025 | 5:15 AM UTC | Mini Incident ReportWe apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support or to Google Workspace Support using help article https://support.google.com/a/answer/1047213. (All Times US/Pacific) Incident Start: 29 September, 2025 10:53 Incident End: 29 September, 2025 11:42 Duration: 49 minutes GWS Affected Services and Features: Workspace Gemini and NotebookLM Regions/Zones: Global Description: Various AI products experienced elevated 500 (internal) errors due to the rollout of a faulty change to production. Google will complete a full IR in the following days that will provide a full root cause. Customer Impact: Customers affected by this issue may have observed up to a 100% of requests failing with '500' errors during the impacted time, between 10:53 and 11:42 US/Pacific on 29 September 2025. Additional details: A faulty change was rolled out to the control plane of the canonical Machine Learning (ML) serving stack in one cluster. Traffic routers that received their configuration from the faulty cluster assumed that some models were deleted and returned errors to the clients. Due to the error, the faulty control plane cluster had a significantly lower load, which resulted in more routers getting their configuration from the faulty cluster. This exacerbated the impact of the change. The bad change was rolled back and all the errors returned by ML serving were reduced to pre-fault levels, mitigating customer impact. |
| Sep 29, 2025 | 7:56 PM UTC | Summary: WorkSpace NotebookLM customers may experience elevated errors for online users Description: The issue with Workspace NotebookLM has been mitigated as of Monday, 2025-09-29 11:49 PDT. Based on preliminary investigation, the root cause of the issue was due to a recent change, which has been rolled back. Customer Symptoms: Customers are experiencing elevated "INTERNAL_ERROR" with error code 500s |
| Sep 29, 2025 | 7:39 PM UTC | Summary: NotebookLM customers may experience elevated errors for online users Description: Our engineering team has identified the cause of the issue and a mitigation has been put in place. We are currently monitoring and believe we are seeing full recovery. Customer Symptoms: Customers are experiencing elevated "INTERNAL_ERROR" with error code 500s |
| Sep 29, 2025 | 6:57 PM UTC | We're investigating reports of an issue with NotebookLM. We will provide more information shortly. Some users would observe elevated errors from Gemini for Workspace. |
- Times are listed in Coordinated Universal Time (UTC)