Liqwid App Outage Incident Report
Summary
On the 13th of September, 2023, the Liqwid app entered a state in which users were at first unable to submit transactions and, later, market data stopped displaying for users who had their wallets connected to the app. For end users, this meant that from the morning until the end of the afternoon of September 13th, the following actions could not be performed in the app:
Supply & borrow assets
Modify collateral in open loans
Repay loans & withdraw assets
Stake & unstake LQ
Modify the amount of LQ in a stake
Vote on proposals & unlock stakes locked against past proposal votes
*No loans were liquidated during this incident and the complete list of all liquidations can be viewed on the Liquidations page.
Incident Timeline
The incident was first discovered and reported by Discord community member @Ruskyy on the 12th of September at 23:40 UTC here.
Response Time: The first team member to acknowledge the incident responded by contacting other team members at 05:27 UTC. Once contact was established between the infrastructure and development teams at Liqwid Labs, the team spent the following 11 and a half hours pinpointing the root of the problem and then implementing a solution to mitigate the issue.
Resolution Time: The incident was completely resolved by 19:01 UTC.
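As a quick arithmetic check on the timestamps above, the intervals between them can be worked out as follows. This is only a sketch: it assumes the 05:27 UTC acknowledgement and the 19:01 UTC resolution both fall on the 13th of September, 2023, and the variable and function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Timestamps taken from the timeline above. The dates for the
# acknowledgement and the resolution are assumed to be 13 Sept. 2023.
reported = datetime(2023, 9, 12, 23, 40, tzinfo=timezone.utc)
acknowledged = datetime(2023, 9, 13, 5, 27, tzinfo=timezone.utc)
resolved = datetime(2023, 9, 13, 19, 1, tzinfo=timezone.utc)


def fmt_duration(delta: timedelta) -> str:
    """Format a timedelta as whole hours and minutes."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


print("Report to acknowledgement:    ", fmt_duration(acknowledged - reported))  # 5h 47m
print("Acknowledgement to resolution:", fmt_duration(resolved - acknowledged))  # 13h 34m
print("Report to resolution:         ", fmt_duration(resolved - reported))      # 19h 21m
```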
Root Causes
The outage was caused by an overload of Liqwid’s public Ogmios instances, which we suspect was the result of a denial-of-service (DoS) attack on those instances. Some technical background on how the Liqwid off-chain and infrastructure worked prior to this incident can be found below:
The way the Liqwid app worked before the incident relied on a WebSocket connection to Ogmios on start-up (from the moment a wallet is connected), which would be used for performing a subset of the front-end-oriented on-chain data queries, in parallel with performing transaction and UTxO balancing for submitting transactions.
For this to work, the Ogmios instance that the client had to connect to in order to use the app needed to be strictly public.
All of the Ogmios instances were public at the time of this incident. Although they sat behind an Amazon Web Services (AWS) Web Application Firewall (WAF), this firewall was lenient and bypassable. As a result, the attack caused the Ogmios instances to forward too many requests to Liqwid’s in-house Cardano nodes, which then crashed.
To sum up what’s described above (without all the technical jargon), we believe that the root causes of this incident lay in:
An off-chain design that forces Ogmios to be exposed to the internet.
An Ogmios firewall configuration that was unprepared to handle this DoS attack vector.
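To illustrate the kind of protection that a stricter configuration in front of a public service can provide, here is a minimal sketch of per-client rate limiting using a token bucket. It is not Liqwid's firewall configuration, nor a description of how AWS WAF or Cloudflare work internally. The class names, the rate, and the capacity are all hypothetical.

```python
class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client so one noisy client cannot starve the rest."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, client_id: str, now: float) -> bool:
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client_id] = bucket
        return bucket.allow(now)


# Hypothetical limits: 1 request per second, bursts of up to 3.
limiter = PerClientLimiter(rate=1.0, capacity=3.0)
burst = [limiter.allow("client-a", now=0.0) for _ in range(5)]
print(burst)  # [True, True, True, False, False]
```

Requests rejected at this layer never reach the backend, which is the property a lenient firewall fails to provide. A limiter keyed on a single client identifier does not stop an attack spread across many sources on its own, so it is one layer among several rather than a complete defence.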
Incident Response
Although the core team at Liqwid Labs worked their hardest to solve this issue in a timely manner, we’re aware that we can improve the effectiveness of our incident response by:
Iterating on our off-chain design so that it doesn’t depend on publicly exposed services that can easily be overloaded with requests.
Improving our internal communication and coordination.
Implementing a stricter and smarter self-healing infrastructure.
Preparing a more formal risk and incident management structure to rely on for mitigating future issues.
We believe that our strongest assets in responding effectively to this type of incident are:
The fact that we’re a totally decentralized team, with developers and team members living and working in different countries and time zones all over the world, which allows for a faster initial response to incidents and problems.
Our incredibly capable engineers, who readily understand the technicalities and identify problems that might arise.
Lessons Learned
The key takeaways from the incident were:
We need to identify blind spots like the ones that allowed this incident to happen (the public Ogmios instances, the lack of stronger firewall rules, a less-than-ideal incident management framework).
We need to place an emphasis on improving the app and all of its dependencies (whether they are software or not).
Action Items
Actions that we’re dedicated to working on in order to mitigate future incidents and app attack vectors like the one from September 13th include:
Improving our incident management framework (already started, long-term).
Periodically identifying possible risks and implementing mitigation strategies for those risks (continuous risk review).
Replacing the need for public Ogmios instances in our off-chain and infrastructure with Blockfrost (completed).
Moving to Cloudflare and enforcing much stricter firewall rules (completed).
Rewriting our off-chain with infrastructure in mind (already started with the implementation of Aqueduct and the off-chain rewrite, mid-term).
Improving our communication channels and educating our engineers on the workings of our current infrastructure (completed).
Deploying a smarter, more efficient, self-healing infrastructure using Kubernetes and further improving the rules on our firewalls (already started, mid-term).
Additional Comments
The biggest risk (for end users) that this incident caused was:
Possibility of liquidation due to the inability to adjust borrow positions. This risk was mitigated by the fact that all the assets in Liqwid-deployed markets remained stable enough that no loan fell below a health factor (HF) of 1.0 (the liquidation criterion: a loan with HF < 1.0 is eligible for liquidation) during the time window from when the issue was originally identified (23:40 UTC, 12 Sept.) to when it was resolved (19:01 UTC, 13 Sept.).
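The liquidation criterion and the time window described above can be expressed as a short check. This is a sketch only: the sampled health factor values are made up for illustration, the function names are hypothetical, and the way Liqwid actually computes or records health factors is not covered here.

```python
from datetime import datetime, timezone

# Criterion from the text above: a loan with HF < 1.0 is eligible for liquidation.
LIQUIDATION_HF = 1.0

# Incident window from the text above (UTC).
WINDOW_START = datetime(2023, 9, 12, 23, 40, tzinfo=timezone.utc)
WINDOW_END = datetime(2023, 9, 13, 19, 1, tzinfo=timezone.utc)


def is_liquidatable(health_factor: float) -> bool:
    """True when the health factor is strictly below 1.0."""
    return health_factor < LIQUIDATION_HF


def became_liquidatable(samples, start=WINDOW_START, end=WINDOW_END) -> bool:
    """True if any (timestamp, health_factor) sample inside the window is below 1.0."""
    return any(start <= ts <= end and is_liquidatable(hf) for ts, hf in samples)


# Hypothetical health factor samples for a single loan (illustrative values only).
loan_samples = [
    (datetime(2023, 9, 13, 8, 0, tzinfo=timezone.utc), 1.42),
    (datetime(2023, 9, 13, 15, 0, tzinfo=timezone.utc), 1.37),
]
print(became_liquidatable(loan_samples))  # False
```

Note that a health factor of exactly 1.0 does not meet the criterion, since the comparison is strict.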
Closing Remarks
To conclude this app outage incident report, the team at Liqwid Labs would like to thank the community and Liqwid user-base for your support and patience during this incident. We are extremely committed to avoiding any further issues down the line and are currently working very hard to improve our infrastructure and off-chain, and to improve the usability of our app and protocol to the very best of our abilities.