Mar 18, 19:20:50 GMT+0
Investigating -
We're investigating an incident which was automatically triggered by a `health-check` failure.
Mar 18, 19:21:49 GMT+0
Resolved -
A previously reported automated failure on `health-check` has been resolved.
Mar 18, 19:18:54 GMT+0
Investigating -
We're investigating an incident which was automatically triggered by a `health-check` failure.
Mar 18, 19:19:03 GMT+0
Resolved -
A previously reported automated failure on `health-check` has been resolved.
Feb 26, 18:57:00 GMT+0
Investigating -
We're investigating an incident which was automatically triggered by a `health-check` failure.
Starting at 10:57 AM, we received reports of higher-than-expected p99 latency and 500s on both the Blocks and Users APIs.
Feb 26, 19:01:00 GMT+0
Resolved -
At 10:59 AM, we re-deployed services that had deadlocked while waiting for other services to come up. This resolved all issues with higher-than-expected latencies and errors.
At the peak (~10:58 AM), less than 3% of requests had to be retried. All systems are back to normal post re-deploy.
We're actively working on mitigating deadlocks and improving k8s startup coordination.
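For reference, one common way to avoid this class of startup deadlock is to bind and serve health checks immediately and attach to dependencies in the background, rather than blocking boot until every peer is reachable. The sketch below illustrates that pattern; the service names, ports, and endpoints are illustrative, not Dopt's actual internals.

```ts
import http from "node:http";

let dependenciesReady = false;

async function connectDependencies(): Promise<void> {
  // Retry with a capped backoff; the process keeps serving health checks
  // while it waits, so a peer doing the same thing cannot wedge us at boot.
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch("http://peer-service:8080/healthz", {
        signal: AbortSignal.timeout(1_000),
      });
      if (res.ok) break;
    } catch {
      // peer not reachable yet; fall through to the backoff below
    }
    await new Promise((r) => setTimeout(r, Math.min(250 * 2 ** attempt, 5_000)));
  }
  dependenciesReady = true;
}

http
  .createServer((req, res) => {
    if (req.url === "/healthz") return void res.writeHead(200).end("live");
    if (req.url === "/readyz")
      return void res.writeHead(dependenciesReady ? 200 : 503).end();
    res.writeHead(dependenciesReady ? 200 : 503).end("starting");
  })
  .listen(8080, () => void connectDependencies()); // serve first, attach to peers second
```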
Feb 7, 05:00:00 GMT+0
Resolved -
A previously reported automated failure on `health-check` has been resolved.
At 8:50 PM, we received a `health-check` failure notice (and internal incident alarms) indicating that our Blocks and Users APIs were experiencing higher-than-expected internal error rates and latencies.
This was due to a slow rollout of routine maintenance upgrades to the underlying GCP infrastructure. Our infrastructure self-healed at 9:00 PM once the rollout completed; latencies subsequently returned to normal levels, and internal error rates dropped back to ~0.
For the ten minutes between 8:50 PM and 9:00 PM, we identified that roughly 0.43% of requests experienced higher-than-expected latencies (and potentially timeouts). Our clients and SDKs retry these requests with exponential backoff, and we have not recorded any errors after 9:00 PM.
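For reference, the retry behavior described above follows the standard exponential-backoff-with-jitter pattern. A minimal sketch is below; the function name and tuning values are illustrative, not the actual Dopt SDK internals.

```ts
async function fetchWithRetry(
  url: string,
  maxAttempts = 5,
  baseDelayMs = 100
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(2_000) });
      if (res.ok || res.status < 500) return res; // only retry server-side failures
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err; // timeout or network error: worth retrying
    }
    // Exponential backoff with full jitter spreads retries out so a brief
    // blip doesn't turn into a synchronized retry storm.
    const delayMs = Math.random() * baseDelayMs * 2 ** attempt;
    await new Promise((r) => setTimeout(r, delayMs));
  }
  throw lastError;
}
```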
Feb 7, 04:50:00 GMT+0
Investigating -
We're investigating an incident which was automatically triggered by a `health-check` failure.
Jan 6, 17:52:42 GMT+0
Investigating -
We're investigating an incident which was automatically triggered by a `health-check` failure.
Jan 6, 17:57:33 GMT+0
Resolved -
A previously reported automated failure on `health-check` has been resolved.
Dec 8, 21:15:29 GMT+0
Investigating -
We're investigating an incident which was automatically triggered by a `health-check` failure.
Dec 8, 21:16:03 GMT+0
Resolved -
A previously reported automated failure on `health-check` has been resolved.
Nov 28, 19:47:49 GMT+0
Investigating -
We're investigating an incident which was automatically triggered by a `health-check` failure.
Nov 28, 19:47:00 GMT+0
Resolved -
A previously reported automated failure on `health-check` has been resolved.
Nov 15, 20:07:21 GMT+0
Investigating -
We're investigating an incident which was automatically triggered by a `health-check` failure.
Nov 15, 20:07:00 GMT+0
Resolved -
A previously reported automated failure on `health-check` has been resolved.
Oct 31, 22:01:00 GMT+0
Investigating -
The Blocks API is reporting degraded performance with intermittent 5xx errors on block transitions and flow intents.
Oct 31, 22:44:00 GMT+0
Identified -
We have started rolling back several potentially problematic updates we made this afternoon to the Blocks API to mitigate the degraded performance.
We are updating the status of this incident to a partial outage until we can fully remedy the 5xx errors.
Oct 31, 22:51:00 GMT+0
Monitoring -
We've completed all our rollbacks.
Additionally, we're beginning to redeploy our earlier performance and logging updates sequentially. We've added further redundancy checks to our code to prevent the Blocks API from crashing on transitions and intents.
Oct 31, 22:56:00 GMT+0
Resolved -
This incident has been resolved.
After monitoring, we determined that a version of the Blocks API which began rolling out at approximately 3:00 PM PDT had an incorrectly configured client for one of our new internal services. This client began throwing errors because it could not reach its internal service due to an incorrect internal DNS reference.
We first rolled back to a previous version of the Blocks API (from 9:00 AM this morning) and confirmed that the API was functional (no 5xx errors reported and all e2e tests passing).
We subsequently rolled out updates correcting the internal DNS reference, and re-configured the client so that it cannot crash the API when misconfigured.
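For reference, the redundancy check described above typically takes the shape of a guarded wrapper: if an optional internal client fails to construct or to call out, that one feature degrades instead of crashing the API. A hedged sketch, with all names illustrative rather than Dopt's internals:

```ts
interface InternalClient {
  send(event: object): Promise<void>;
}

// Wrap an optional internal client so that a misconfiguration (e.g. a bad
// DNS name) disables the feature instead of taking down the whole API.
function guardedClient(make: () => InternalClient): InternalClient {
  let client: InternalClient | null = null;
  try {
    client = make(); // construction may throw on bad config
  } catch (err) {
    console.error("internal client disabled; continuing without it:", err);
  }
  return {
    async send(event) {
      if (!client) return; // feature is off; request handling proceeds
      try {
        await client.send(event);
      } catch (err) {
        console.error("internal client call failed; ignoring:", err);
      }
    },
  };
}
```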
Oct 31, 02:32:00 GMT+0
Investigating -
We are actively investigating the Blocks and Users APIs returning 502s.
Oct 31, 03:01:00 GMT+0
Identified -
We've identified the issue.
Several of our Redis clusters, which are responsible for caching and message passing, did not come up correctly after scheduled GKE maintenance.
We're rolling out restarts to our Redis clusters now.
Oct 31, 03:49:00 GMT+0
Monitoring -
We redeployed our Redis clusters (workers and sentinels) and all downstream dependencies.
All 502 errors were resolved at 8:49:24 PM PDT.
We'll continue to monitor.
Oct 31, 04:19:57 GMT+0
Resolved -
We're resolving this incident after redeploying and monitoring our Redis clusters. We'll update this notice with the next steps we're taking to improve our Redis cluster and client health and resiliency.
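For reference, client-side resiliency for a Sentinel-managed Redis topology commonly looks like the sketch below, shown here with the ioredis client; the hosts, master group name, and tuning values are illustrative, not Dopt's actual configuration.

```ts
import Redis from "ioredis";

const redis = new Redis({
  name: "mymaster", // Sentinel master group to follow across failovers
  sentinels: [
    { host: "sentinel-0", port: 26379 },
    { host: "sentinel-1", port: 26379 },
    { host: "sentinel-2", port: 26379 },
  ],
  // Back off between reconnect attempts instead of hammering a recovering node.
  retryStrategy: (times) => Math.min(times * 200, 2_000),
  // Fail fast while disconnected rather than queueing commands unboundedly.
  enableOfflineQueue: false,
  maxRetriesPerRequest: 2,
});

redis.on("error", (err) => console.error("redis error:", err.message));
```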
Oct 4, 08:49:00 GMT+0
Investigating -
We noticed that blocks.dopt.com (the Blocks API) experienced high latencies (> 1s) during the following periods (Pacific time):
- 1:50 - 1:55 AM
- 2:10 - 2:15 AM
- 2:30 - 2:35 AM
- 10:50 - 10:55 AM
- 11:10 - 11:15 AM
Oct 4, 18:40:00 GMT+0
Monitoring -
At 11:40 AM, we started proactively rolling out a more protective load balancer policy to isolate the requests that were triggering latency spikes, and we have not observed any spikes since then.
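The load balancer policy itself isn't public; as a hedged illustration of the same isolation idea at the application layer, the sketch below caps how much concurrency one expensive class of request can consume, so latency spikes there can't starve everything else. Names and limits are illustrative.

```ts
// A tiny bulkhead: at most `limit` tasks of one class run at once; the rest
// queue instead of competing with every other request for resources.
class Bulkhead {
  private active = 0;
  private queue: Array<() => void> = [];
  constructor(private readonly limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    while (this.active >= this.limit) {
      // Park until a slot frees up, then recheck the limit.
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake one waiter; it rechecks the limit
    }
  }
}

// e.g. cap the request class that was spiking and leave the fast path alone:
const expensiveOps = new Bulkhead(8);
// await expensiveOps.run(() => handleExpensiveRequest(req));
```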
Oct 4, 18:00:00 GMT+0
Identified -
We've identified a set of mitigations and are planning a load balancer policy change.
Oct 5, 01:55:43 GMT+0
Resolved -
We've deployed updates to our internal block state machines, which suffered from high latencies when flows contained a large number (> 100k) of cycles.
We've also rolled back all load balancer policies related to this incident.
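The internal fix isn't public; as a general illustration of the failure shape, recomputing block state by replaying every recorded transition is O(n) in the number of cycles, while folding each transition into a stored summary as it happens makes reads O(1). A hedged sketch with illustrative types:

```ts
// Illustrative types only; not Dopt's actual state machine.
type BlockState = { id: string; cycles: number };

// Slow shape: recompute state by replaying every recorded transition.
// With > 100k cycles in a flow, every read pays the full replay cost.
function stateByReplay(transitions: string[]): BlockState {
  let state: BlockState = { id: "start", cycles: 0 };
  for (const t of transitions) {
    state = applyTransition(state, t);
  }
  return state;
}

// Fast shape: fold each transition into the stored summary as it happens,
// so reads stay O(1) no matter how many cycles the flow has made.
function applyTransition(state: BlockState, t: string): BlockState {
  return t === "loop" ? { ...state, cycles: state.cycles + 1 } : state;
}
```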
Sep 22, 21:32:00 GMT+0
Investigating -
Starting at 2:32 PM (Pacific), app.dopt.com redirected to a static nginx page rather than Dopt's web app.
Sep 22, 21:55:00 GMT+0
Identified -
At 2:55 PM, we began rolling out a revert to the previous version of app.dopt.com's client.
Sep 22, 21:58:00 GMT+0
Resolved -
At 2:58 PM, once the revert was completed, we confirmed that app.dopt.com functions as expected. We've also rolled back the broken nginx image and its assets internally.
Aug 14, 20:28:51 GMT+0
Investigating -
We're investigating an incident which was automatically triggered by a `health-check` failure.
Aug 14, 20:29:02 GMT+0
Resolved -
A previously reported automated failure on `health-check` has been resolved.
Aug 14, 20:48:55 GMT+0
Resolved -
This automated health-check failure resulted from 502s at our blocks.dopt.com gateway, caused by an abnormally slow rollout of new nodes in k8s.
We're investigating why this rollout was so slow, but all systems are operational and functioning as expected.
Aug 15, 03:32:54 GMT+0
Resolved -
Blocks API deploys took an abnormally long time because of a pre-existing dependency between a few internal services, all of which were simultaneously updated between 1:20 and 1:30 PM Pacific time today.
We've rolled out a change which decouples these services so that these 502s won't recur in the future.
Aug 7, 21:58:58 GMT+0
Resolved -
All systems are operational.
This incident is a test to confirm that our status.dopt.com-related webhooks trigger appropriately.
Aug 3, 20:05:00 GMT+0
Identified -
app.dopt.com deployed with a broken configuration for our internal API, gateway.dopt.com. We have fixed the issue and are deploying a new version right now.
Aug 3, 20:34:11 GMT+0
Monitoring -
We've deployed a fix to app.dopt.com which correctly communicates with gateway.dopt.com over HTTPS.
Aug 3, 20:37:29 GMT+0
Resolved -
app.dopt.com loads and communicates with our internal APIs as expected.
Dec 3, 08:44:47 GMT+0
Investigating -
users.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Dec 3, 08:49:44 GMT+0
Resolved -
users.dopt.com is now operational! This update was created by an automated monitoring service.
Dec 3, 08:35:01 GMT+0
Investigating -
users.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Dec 3, 08:39:44 GMT+0
Resolved -
users.dopt.com is now operational! This update was created by an automated monitoring service.
Oct 4, 00:02:30 GMT+0
Investigating -
app.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Oct 4, 00:07:27 GMT+0
Resolved -
app.dopt.com is now operational! This update was created by an automated monitoring service.
Oct 1, 05:27:02 GMT+0
Resolved -
www.dopt.com is now operational! This update was created by an automated monitoring service.
Oct 1, 05:12:06 GMT+0
Investigating -
www.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Oct 1, 05:07:36 GMT+0
Investigating -
blog.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Oct 1, 05:27:33 GMT+0
Resolved -
blog.dopt.com is now operational! This update was created by an automated monitoring service.
Oct 1, 04:04:47 GMT+0
Investigating -
users.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Oct 1, 06:29:44 GMT+0
Resolved -
users.dopt.com is now operational! This update was created by an automated monitoring service.
Oct 1, 04:02:36 GMT+0
Investigating -
blocks.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Oct 1, 06:07:32 GMT+0
Resolved -
blocks.dopt.com is now operational! This update was created by an automated monitoring service.
Oct 1, 04:02:29 GMT+0
Investigating -
app.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Oct 1, 06:27:26 GMT+0
Resolved -
app.dopt.com is now operational! This update was created by an automated monitoring service.
Oct 1, 00:37:36 GMT+0
Investigating -
blog.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Oct 1, 00:42:33 GMT+0
Resolved -
blog.dopt.com is now operational! This update was created by an automated monitoring service.
Sep 30, 23:37:30 GMT+0
Investigating -
app.dopt.com cannot be accessed at the moment. This incident was created by an automated monitoring service.
Sep 30, 23:42:26 GMT+0
Resolved -
app.dopt.com is now operational! This update was created by an automated monitoring service.