At 10:59 AM, we re-deployed the services that had deadlocked waiting for other services to come up. This resolved the higher-than-expected latencies and errors.
At the peak (~10:58 AM), less than 3% of requests had to be retried. All systems returned to normal after the re-deploy.
We're actively working on mitigating these startup deadlocks and improving k8s coordination so services come up in a safe order.
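One common mitigation for this class of startup deadlock is a Kubernetes readiness probe: a pod that is still waiting on its dependencies simply receives no traffic until its dependencies are reachable, instead of wedging the whole rollout. The sketch below is illustrative only; the service name, image, port, and probe endpoint are hypothetical, not our actual configuration.

```yaml
# Hypothetical Deployment fragment. The container is only added to the
# Service's endpoints once /healthz/ready succeeds, which is assumed to
# check that upstream dependencies are reachable.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service          # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: app
          image: example-service:latest   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz/ready        # assumed dependency-check endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
```

Pairing a probe like this with retry-and-backoff in each service's startup path (rather than blocking indefinitely on a dependency) removes the circular wait that caused this incident.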