Table of contents
It was a typical evening …, except that close to the end of the work day we’ve received a notification, saying:
“An issue is affecting 110.8% of sessions in the past hour”.
“Well, 110.8%… Not bad … How could it be over 100%?” — was the first thought and, at first, we assumed it’s a bug in the statistics server.
Nevertheless, we opened an app to check and make sure it’s OK and … saw that the app crashes in 100% of times.
However, we were able to identify an issue and fix it within an hour.
Here are a few practices that helped us to do it:
1.Monitor, monitor, and monitor again. We’re using services that provide a real-time statistic of the app and instantly notify as soon as an anomaly is found. And yes, the situation was an anomaly for sure, as typically we have 98+% of crash free users. There are plenty of such, both free and paid. At that moment we were using Fabric.
2. Be proactive — don’t hesitate to act. Instead of trying to set up meetings/conversations to discuss what might be wrong and create a plan for fixing it — see what can be done within your power of influence. We’ve acted immediately after noticing it. Tracked the bug, identified the potential source, and wrote directly to all people involved, proposing several plans for resolving it.
3. Effective communication. Develop effective communication with all parties involved. And develop it from the very beginning of the project — so that when it’s hot you know exactly who to contact and what to say to achieve the best results. Otherwise, it would be too late.
4. Got your back. Try to have alternative plan “B” for situations like this. After proposing our plan “A” to people who have an authority to execute it — we’ve already started the work on our plan “B” (that was worse, but within our power of influence, as we were not sure the other party would be able to execute plan “A” fast enough). So that even if the plan “A” wouldn’t work out for us, we would still resolve the issue, just a few hours later.
5. Never cease for exploration. The situation like this might scare the s*** out of anyone, and can make you start triple-check every single step forward, try to plan years ahead and being skeptical to every innovation. However, this road won’t drive you to any significant result. Fail fast. Always do your best, be risky sometimes, but never too much. Listen your gut.
Let us know if the advice was helpful.
And how do you deal with hot exceptional situations?
P. S. The cause of the issue was the third party integration that started crashing out of a sudden.