The Knight Capital Disaster

Nov 22, 2023 · 9 min read · Software Engineering

This account comes from several publicly available sources as well as accounts from insiders who worked at Knight Capital Group at the time of the issue. I am telling it second- or third-hand.

On August 1, 2012, Knight Capital fell on its sword. It experienced a software glitch that all but bankrupted the company. Between 9:30 am and 10:15 am ET, the employees of Knight Capital watched in disbelief and scrambled to figure out what went wrong as the company acquired massive long and short positions, largely concentrated in 154 stocks, totaling 397 million shares and $7.65 billion. At 10:15, the kill switch was flipped, stopping the company's trading operations for the day. By early afternoon, many of Knight Capital's employees had already sent out resumes, expecting to be unemployed by the end of the week.

The root cause of the failure? A comedy of errors in several parts of their development and ops processes.

Knight's Order-entry Software

Knight Capital had an application called "SMARS" which sent orders to the stock exchange. It contained some logic to break up large orders into much smaller orders that would have less of an effect on the market. The SMARS software accepted orders from trading strategies using a binary protocol, and contained logic to make sure that those orders got filled at a desired price. The protocol contained fields for price, desired size, time in force, etc. The protocol also had a flag field, which set options for a given order. Nanoseconds counted, so protobufs and JSON were too heavy here: these commands were sent over the wire as serialized structs.
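To make the "serialized struct" idea concrete, here is a minimal sketch of a fixed-size order record with a flag word. The field layout, field widths, and the names ORDER_FORMAT, FLAG_EXAMPLE_OPTION, pack_order, and unpack_order are all my own assumptions for illustration; this is not Knight's actual wire format.

```python
import struct

# Hypothetical layout, little-endian with no padding ("<"):
#   8-byte symbol, int64 price in 1/10000 dollars, uint32 size,
#   uint8 time in force, uint32 flag word.
ORDER_FORMAT = struct.Struct("<8sqIBI")

FLAG_EXAMPLE_OPTION = 1 << 0  # hypothetical option bit in the flag word


def pack_order(symbol, price_ticks, size, time_in_force, flags):
    """Serialize one order into a fixed-size binary record."""
    return ORDER_FORMAT.pack(
        symbol.encode("ascii"), price_ticks, size, time_in_force, flags
    )


def unpack_order(payload):
    """Decode a record produced by pack_order back into a dict."""
    symbol, price_ticks, size, tif, flags = ORDER_FORMAT.unpack(payload)
    return {
        "symbol": symbol.rstrip(b"\x00").decode("ascii"),
        "price_ticks": price_ticks,
        "size": size,
        "time_in_force": tif,
        "flags": flags,
    }


wire = pack_order("XYZ", 123400, 500, 0, FLAG_EXAMPLE_OPTION)
order = unpack_order(wire)
```

The appeal of this style is that there is no parsing to speak of: the receiver reads a fixed number of bytes and knows where every field is. The cost is that both sides must agree exactly on what every bit means.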

One such option, from the very early 2000s, was called "power peg." Power peg was an order type used for manual market making: a power peg order would stay open at a given price, effectively "pegging" the stock to that price. If a power peg order was filled, SMARS would refresh it at the same price. It kept a count of how many shares were filled for a power peg order, and when a certain (very large) cumulative number was hit, the power peg order would be automatically canceled. The intended user flow was for a market maker to open a power peg order, get it filled as many times as needed, and then cancel the order when the market was about to move.
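The refresh-until-threshold behavior described above can be sketched in a few lines. This is only an illustration of the logic as I have described it; the class name PowerPegOrder and its fields are hypothetical and not taken from SMARS.

```python
class PowerPegOrder:
    """Illustrative power peg: refresh at the same price until a share cap is hit."""

    def __init__(self, price, size, cancel_after_shares):
        self.price = price
        self.size = size
        self.cancel_after_shares = cancel_after_shares  # the (very large) cap
        self.filled_shares = 0
        self.open = True

    def on_fill(self, shares):
        """Record a fill; return the refreshed order, or None once canceled."""
        if not self.open:
            return None
        self.filled_shares += shares
        if self.filled_shares >= self.cancel_after_shares:
            self.open = False  # cumulative cap reached: cancel automatically
            return None
        # Otherwise re-post the same order at the same price.
        return {"price": self.price, "size": self.size}


peg = PowerPegOrder(price=10.0, size=100, cancel_after_shares=300)
first = peg.on_fill(100)   # refreshed
second = peg.on_fill(100)  # refreshed
third = peg.on_fill(100)   # cap reached, order canceled
```

Note that the only thing stopping the refresh loop, short of a human canceling, is the cumulative fill count. If that count is ever wrong, the order keeps coming back.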

In 2003, Knight Capital deprecated the power peg option. They followed almost all the steps for flag deprecation that you would expect from a disciplined engineering department: they marked the flag as deprecated, they switched users away from using it, and they defaulted the clients to prevent use of the option. However, they never took the last step: removing the server code. SMARS had been written somewhat hastily, and some of the power peg code was deeply entwined with other code in the server. As long as the tests kept passing, leaving it in place didn't seem like a big deal. During a refactor two years later, the power peg tests started breaking, so they were deleted. Nobody was using the long-deprecated option, so there seemed to be no need to check its correctness.

Adding a New Feature

In July 2012, Knight Capital needed a flag for orders from their new Retail Liquidity Program (RLP). Knight had opened a new line of business: buying order flow from retail brokerages and executing those orders. Retail orders needed special handling, so SMARS needed a new flag. However, the flag word was out of new bits for flags, so an engineer reused a bit from a deprecated flag: the power peg flag. The remaining power peg code in SMARS was disconnected from the flag, and new RLP code was added. The code went through review successfully, and passed a battery of automated tests.
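The hazard of reusing a flag bit is easiest to see side by side. In the sketch below, the bit position and the two handler functions are hypothetical; the point is only that the same bytes on the wire mean different things to two versions of the server.

```python
# Hypothetical bit position. The OLD binary reads this bit as "power peg";
# the NEW binary reads the very same bit as "RLP".
POWER_PEG_BIT = 1 << 7
RLP_BIT = 1 << 7


def old_server_handler(flags):
    """How a server still running the old binary would route the order."""
    return "power_peg" if flags & POWER_PEG_BIT else "normal"


def new_server_handler(flags):
    """How a server running the new binary would route the order."""
    return "rlp" if flags & RLP_BIT else "normal"


rlp_order_flags = RLP_BIT  # a client marks an order as RLP
seen_by_new = new_server_handler(rlp_order_flags)
seen_by_old = old_server_handler(rlp_order_flags)
```

Reusing a bit is only safe if every receiver is guaranteed to be running code that agrees on the new meaning. Nothing in the message itself can tell the two interpretations apart.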

The new RLP code was deployed to the SMARS system on July 27, 2012. Knight Capital officially ran a manual deployment process: the person assigned to deploy the code would SSH into each SMARS machine, rsync the new binary to that machine, and update some configuration to set it up to run instead of the old binary. Knight's operations team had seen the danger of this: to avoid missing a machine, they set up a script to perform the process for each machine. On July 27, 2012, one member of the operations team ran the deployment script for the new version of SMARS.

Unbeknownst to the team, the deployment script had a small bug: when it failed to open an SSH connection to a machine, it would fail silently, continue to update the other machines, and report success. It was never tested or checked in like a piece of software because it was a script that one person wrote for convenience.
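For contrast, here is a minimal sketch of a deployment loop that refuses to report success when any machine was skipped. It is not a reconstruction of Knight's script: deploy_one is a hypothetical callable standing in for the SSH-and-copy step, assumed to raise an exception on any failure, and the host names are made up.

```python
def deploy_to_fleet(hosts, deploy_one):
    """Run deploy_one(host) for every host and collect every failure."""
    failures = {}
    for host in hosts:
        try:
            deploy_one(host)
        except Exception as exc:
            # Keep updating the other machines, but remember what failed.
            failures[host] = str(exc)
    return failures


def run_deployment(hosts, deploy_one):
    """Return a process-style exit status: 0 only if every host succeeded."""
    failures = deploy_to_fleet(hosts, deploy_one)
    for host, reason in failures.items():
        print(f"FAILED {host}: {reason}")
    return 1 if failures else 0


def fake_deploy(host):
    """Stand-in for the real copy step; one machine refuses the connection."""
    if host == "smars-08":
        raise ConnectionError("connection refused")


hosts = [f"smars-{n:02d}" for n in range(1, 11)]
status = run_deployment(hosts, fake_deploy)
```

The design choice that matters is small: continuing past a failed host is fine, but the failure has to end up in the final status instead of being swallowed.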

That day, one of ten SMARS machines was down for maintenance during the software upgrade, and rejected an SSH connection. After its planned maintenance, the server came back up with an old version of SMARS.

Knight allowed the new SMARS binary to soak for 3 days before turning on RLP trades, which was to happen on August 1, and caught no errors. They also did a limited test of RLP orders on one of the production SMARS servers to make sure that the new logic was working correctly. The server they tested had received the new software version, and the test was successful.
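Testing one server only proves something about that server. A fleet-wide check before enabling the feature might look like the sketch below, which assumes each machine can report the version it is running; the function name, host names, and version strings are all hypothetical.

```python
def find_stale_hosts(reported_versions, expected_version):
    """Return the hosts whose reported version differs from the expected one."""
    return sorted(
        host
        for host, version in reported_versions.items()
        if version != expected_version
    )


# Hypothetical data: one machine missed the upgrade.
reported = {f"smars-{n:02d}": "new-build" for n in range(1, 11)}
reported["smars-08"] = "old-build"

stale = find_stale_hosts(reported, "new-build")
safe_to_enable = not stale  # only enable the feature on a uniform fleet
```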

August 1, 2012

Beginning at 8:01 ET on August 1, Knight began receiving retail orders through the RLP. Things were going well, and the internal servers handling RLP orders were working exactly as they had in testing the prior days.

At 9:30, the market opened. Initially, trading in about 150 stocks looked like it was going wrong. Engineers and quants were called to figure out what the problem was. New and experimental trading algorithms were shut off. Quantitative researchers, not known for their programming prowess, were thought to have created the bug. The RLP, now past its experimental soaking period, was allowed to continue operating.

From debug logs, engineers later narrowed down the problem to a bug in SMARS: orders were leaving trading servers correctly, but somehow the firm was starting to accrue large positions on these orders, filling them many times over. Noticing the flaw, engineers decided to roll back SMARS to its previous version, hoping to continue trading with a known-good version.

After the rollback, the abnormal behavior accelerated and spread to seemingly every stock on the market. The losses mounted faster, and the SMARS software kept acquiring massive positions that were not allocated to any trading strategy. Trading algorithms also continued to be rolled back, as bugs in those machines could have caused the same issue, but none of this helped. Unknown to the operations department, they hadn't rolled back to a good version of SMARS: they had rolled back to the same bad version that had been the cause of their problems.

At 10:15, the call was made to shut down trading for the day. Knight had been losing money and accruing positions so quickly that the computers took a while to figure out exactly how bad it was.

Knight's executive team now needed to figure out how to cover these positions. Some positions could be closed manually, and some of these trades even made money. However, most of Knight's positions were too large for this approach. Talks began with banks and other trading partners to figure out how to get out of the hole. Exchanges were asked if they could reverse the trades. It appeared that Knight would have a $1 billion loss on their hands, and not anywhere near enough cash to cover it.

Line employees caught wind of the trouble, and many started to answer the emails from recruiters that they had long ignored. In the afternoon of August 1, more Knight employees were working on resumes than anything else.

The Final Fatal Flaws

During the 2005 refactor of SMARS, the code for reporting power peg positions back to trading strategies had broken, which was what caused the test failures (and the subsequent deletion of tests). If not for this breakage, the strategies would have been allocated correct positions, and the correct trading algorithms could have been shut down.

Finally, SMARS was built to be fast, and did not conduct a lot of pre-trade risk checks. That was the job of the trading servers, and they were very accurate at it. SMARS simply accepted orders and executed them, regardless of whether the strategy (or the firm) had the requisite capital. Since Knight was a broker-dealer and had a direct connection to the exchange, the exchange didn't know whether Knight had the money for its trades either, and continued accepting orders. This type of check is the responsibility of the broker, and Knight was its own broker. Trading strategies, whose risk management code had an inaccurate view of their own positions, continued to send orders like nothing was wrong. Nobody at Knight had built any infrastructure to manage the financial risks related to rogue order entry servers.
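A last-hop risk check does not need to be elaborate to catch a runaway order router. The sketch below is a deliberately crude illustration, not a description of what any firm actually runs: the OrderGateway class, its single gross-notional limit, and the halt-on-first-breach policy are all assumptions of mine.

```python
class OrderGateway:
    """Illustrative last-hop check: cap firm-wide gross notional, halt on breach."""

    def __init__(self, max_gross_notional):
        self.max_gross_notional = max_gross_notional
        self.gross_notional = 0.0
        self.halted = False

    def submit(self, price, shares):
        """Return True if the order may go to the exchange, False if rejected."""
        notional = abs(price * shares)
        if self.halted or self.gross_notional + notional > self.max_gross_notional:
            # Trip the kill switch: once the limit is breached, nothing else
            # goes out until a human resets the gateway.
            self.halted = True
            return False
        self.gross_notional += notional
        return True


gateway = OrderGateway(max_gross_notional=1_000_000)
accepted_first = gateway.submit(price=10.0, shares=50_000)   # 500,000 notional
accepted_second = gateway.submit(price=10.0, shares=60_000)  # would exceed the cap
accepted_third = gateway.submit(price=1.0, shares=1)         # gateway is halted
```

The important property is where the check lives: at the point where orders leave the firm, it does not depend on any upstream system having a correct view of its own positions.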

Aftermath

When the dust settled, Knight was able to close its positions at a $440 million loss. On August 5, 2012, Knight received $400 million of rescue financing that allowed them to continue operations. They rebranded as "KCG," and were acquired in 2013 by GETCO, another algorithmic trading company, to form KCG Holdings. They were later acquired by Virtu Financial in 2017.

The story of Knight Capital prompted other trading firms to review their processes and adopt new layers of risk checks as well as modern DevOps practices to protect themselves from being the next Knight. Some of them quietly admitted that they were lucky: their practices were similar to the ones that brought down their competitor. Adding risk checks to the last stage of an order's life became universal in the industry, and testing and deployment practices were largely brought into the modern era.

Knight Capital, the SEC, the exchanges, and FINRA conducted thorough postmortem reviews of what happened during this incident. There was a lot of blame to go around at Knight, and they ended up paying an additional $12 million in fines for failing to uphold their responsibilities as a broker. A lot of Knight's development practices were changed. As of 2016, the engineer who did the update still worked at KCG. His entire management chain had been replaced (resigned or fired) in light of this incident, all the way up to the CTO.

Today, the story of Knight Capital serves as a cautionary tale for trading firms that ask "what's the worst that could happen?"