I find it odd that knowledgeable people are giving these companies a pass. Having been in the software industry myself, I see these as events that SHOULD have been anticipated, with stronger measures put in place beforehand to prevent them.
Let’s start with Cloudflare. Here’s their official explanation: https://blog.cloudflare.com/cloudflare-outage-on-july-17-202… More interesting to me was CEO Matthew Prince’s tweeted explanation: “The root cause was a typo in a router configuration on our private backbone.” https://twitter.com/eastdakota/status/1284298908156346368 The operative word there is “typo.”
This strongly indicates an unacceptable lack of testing for production changes. Whether you’re building games or operating in a regulated industry like automotive, healthcare, or finance, you must test everything before it is rolled out to production. I’ve never been at a company that didn’t have layers of testing: code reviews, compiler warnings and lint runs, black box testing, simulation testing, white box testing, staged/gradual roll-outs - there should be many layers in place.
Prince acknowledges this in a subsequent tweet: “The root problem was we didn’t have systems in place to keep them from causing a widespread issue. That’s a problem of leadership that I am more responsible for than the engineer who made the typo.” It’s great that he’s taking the blame, but the question remains as to whether he’s also the CTO, or CISO, or Security Architect, etc. I sure hope not.
Now, there are some situations that are just impossible to fully test, since replicating the production environment is simply not feasible. I suspect that a global network topology is one such situation. However, that does not excuse skipping a simulation environment, skipping simple “lint” checks on interpreted languages and configuration files, and especially skipping a second set of eyes on every change (known as “code reviews”). At my last company I put many layers of testing in place, and it was made clear that even though we had a QA department, engineers were still responsible for code reviews and unit testing.
We did have one instance in which we were not equipped to perform a pre-production or simulation test - a small cellular network configuration change. In some ways Cloudflare’s typo seems to be of a similar nature. We agonized over this change, performing a lint and multiple code reviews (including an external one with the cellular provider), and when we rolled out the change we did so at 3am and immediately had our black box testers run through various scenarios to ensure things were working. This was a big deal for us. I suspect that Cloudflare makes changes like this much more frequently, which means they should be better at it.
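Even when a change like this cannot be exercised in a replica of production, a cheap automated sanity check on the configuration itself will catch the most common class of typo before rollout. Here is a minimal sketch of such a “lint” in Python - the config format, field names, and rules are entirely made up for illustration; Cloudflare’s actual tooling is not public.

    # Hypothetical pre-deployment lint for a router configuration change.
    # The format and validation rules are illustrative only.
    import ipaddress
    import sys

    REQUIRED_FIELDS = {"router_id", "prefix", "next_hop"}

    def lint_route_entry(entry):
        """Return a list of problems found in a single route entry."""
        problems = []
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"missing fields: {sorted(missing)}")
        try:
            ipaddress.ip_network(entry.get("prefix", ""))
        except ValueError:
            problems.append(f"invalid prefix: {entry.get('prefix')!r}")
        try:
            ipaddress.ip_address(entry.get("next_hop", ""))
        except ValueError:
            problems.append(f"invalid next_hop: {entry.get('next_hop')!r}")
        return problems

    if __name__ == "__main__":
        # A misspelled key ("next_hpo") and a malformed prefix are both caught
        # long before the change touches a live router.
        proposed_change = [
            {"router_id": "pop-01", "prefix": "203.0.113.0/24", "next_hop": "198.51.100.1"},
            {"router_id": "pop-02", "prefix": "203.0.113.0/33", "next_hpo": "198.51.100.2"},
        ]
        failures = {}
        for entry in proposed_change:
            problems = lint_route_entry(entry)
            if problems:
                failures[entry["router_id"]] = problems
        if failures:
            print("config lint failed:", failures)
            sys.exit(1)
        print("config lint passed")

A check like this runs in milliseconds, needs no production environment, and flatly refuses to ship a change with a misspelled key or an invalid prefix.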
So, while it’s fine to say that Cloudflare responded well to the problem, the fact that a typo could cause this indicates to me that they have a very immature development/testing process. Too often I have seen tech companies grow from a development-oriented base without growing their discovery/testing regimen to match. This isn’t the first time something like this has happened to Cloudflare. Just a year ago, they had something similar: https://blog.cloudflare.com/cloudflare-outage/
From that post-mortem: “The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules… We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage… Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.”
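The remedy Cloudflare itself names there - deploying progressively instead of “globally in one go” - is not exotic. Here is a rough sketch of a staged (canary-style) rollout loop; the stage fractions, health check, and rollback hook are hypothetical placeholders, not how Cloudflare’s deployment system actually works.

    # Hypothetical staged rollout: push a change to a small slice of the fleet
    # first, verify health, and abort before it can become a global outage.
    import time

    STAGES = [("canary", 0.01), ("early", 0.10), ("global", 1.00)]

    def deploy(change_id, fraction):
        # Placeholder for pushing change_id to a fraction of the fleet.
        print(f"deploying {change_id} to {fraction:.0%} of the fleet")

    def rollback(change_id):
        print(f"rolling back {change_id}")

    def staged_rollout(change_id, observed_error_rate, threshold=0.01):
        for stage, fraction in STAGES:
            deploy(change_id, fraction)
            time.sleep(1)  # stand-in for a real soak/bake period
            if observed_error_rate(stage) >= threshold:
                rollback(change_id)
                return False
        return True

    if __name__ == "__main__":
        # A bad WAF rule that spikes error rates is caught at the 1% canary
        # stage instead of taking down the entire network.
        ok = staged_rollout("waf-rule-update", observed_error_rate=lambda stage: 0.50)
        print("rollout completed" if ok else "rollout aborted early")

With a gate like this in the pipeline, a misconfigured rule degrades a small fraction of traffic for one soak interval rather than everything at once.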
Clearly, they still have a ways to go on their “testing and deployment process,” even a year later. How many apologies do we accept?
Which brings us to Twitter. The NY Times has the best reporting on this hack so far: https://www.nytimes.com/2020/07/17/technology/twitter-hacker… and as of this morning I see that Twitter has posted its own information: https://blog.twitter.com/en_us/topics/company/2020/an-update…
I won’t summarize it all, but long story short: Twitter apparently has Admin accounts that can make far-reaching changes, and a single Admin/Super-user (named “Kirk” in the NY Times story) was able to wreak havoc for days. As the NY Times says:
As the morning went on, customers poured in and the prices that Kirk demanded went up. He also demonstrated how much access he had to Twitter’s systems. He was able to quickly change the most fundamental security settings on any user name and sent out pictures of Twitter’s internal dashboards as proof that he had taken control of the requested accounts.
How did Kirk do this? Twitter says: “The attackers successfully manipulated a small number of employees and used their credentials to access Twitter’s internal systems, including getting through our two-factor protections.”
The NY Times provides some additional details: “Kirk got access to the Twitter credentials when he found a way into Twitter’s internal Slack messaging channel and saw them posted there, along with a service that gave him access to the company’s servers.”
It appears the hackers used phishing to get access to Twitter’s internal Slack channel, where a Twitter admin had posted credentials. Now, beyond training admins to keep credentials private - not posting them even on internal messaging boards - and not to fall for phishing attacks, this is still not acceptable in my view. Large-impact, security-related changes should adhere to a Separation of Duties process (https://en.wikipedia.org/wiki/Separation_of_duties). You’ve seen the missile launch sequence at the beginning of “War Games” - I don’t know if that’s the exact setup, but having two people each enter their own credentials in order to perform critical changes is standard operating procedure for security and critical administration. With that in place, a phishing/social-engineering attack on one admin would not be sufficient.
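As a sketch of what that two-person gate can look like in code: the action names and approval records below are hypothetical, and a real system would enforce this through an audited workflow tool rather than an in-process check, but the principle is that one phished credential is simply not enough.

    # Hypothetical two-person ("separation of duties") gate for high-impact
    # admin actions: two distinct, MFA-verified approvers are required.

    def execute_admin_action(action, approvals):
        approvers = {a["admin_id"] for a in approvals if a.get("mfa_verified")}
        if len(approvers) < 2:
            raise PermissionError(
                f"{action!r} requires two distinct MFA-verified approvers, got {len(approvers)}"
            )
        print(f"executing {action!r}, approved by {sorted(approvers)}")

    if __name__ == "__main__":
        # A single compromised admin cannot change account settings alone.
        try:
            execute_admin_action(
                "change_account_email",
                approvals=[{"admin_id": "admin-1", "mfa_verified": True}],
            )
        except PermissionError as err:
            print("blocked:", err)

        # With a second, independent approver the action goes through.
        execute_admin_action(
            "change_account_email",
            approvals=[
                {"admin_id": "admin-1", "mfa_verified": True},
                {"admin_id": "admin-2", "mfa_verified": True},
            ],
        )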
It appears that the Bitcoin aspects may have been a diversion from the real intent - Twitter confirmed that private data was downloaded from 8 accounts, and that phone numbers and emails from 130 accounts were exposed. At least it appears Twitter hashes user passwords, so those were not compromised, but we still don’t know the full extent of what the hackers did or what may come later.
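For readers wondering why hashing helps: the server stores only a salted, deliberately slow one-way hash, so even someone who downloads the table cannot read passwords back out of it. A small illustration of the general technique using the standard library’s PBKDF2 - Twitter’s actual scheme is not public, and this is not a claim about it.

    # Salted, slow password hashing with the standard library (PBKDF2-HMAC).
    import hashlib
    import hmac
    import os

    def hash_password(password, salt=None):
        salt = salt or os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
        return salt, digest  # store both; the salt is not secret

    def verify_password(password, salt, expected_digest):
        _, digest = hash_password(password, salt)
        return hmac.compare_digest(digest, expected_digest)

    if __name__ == "__main__":
        salt, stored = hash_password("correct horse battery staple")
        print(verify_password("correct horse battery staple", salt, stored))  # True
        print(verify_password("not the password", salt, stored))              # False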
I fear these kinds of mistakes and attacks are going to become more and more widespread. It’s too easy for technology-pushing, fast-growing companies to neglect security and testing, as there’s constant pressure to do more, too quickly, with too few resources. It’s too easy to push security concerns to the side, since you can get lucky for long periods of time without being hacked. But while these two and the earlier Zoom incidents gathered widespread attention, there are many more incidents that receive far less notice. For instance, for just this year so far, see: https://www.fintechfutures.com/2020/04/2020-review-top-five-…
1) Travelex quarantines website, internal systems after New Year’s Eve cyber-attack (duration: 3 weeks)
2) Lloyds, Halifax and Bank of Scotland customers hit by system outage (duration: 8 hours)
3) Robinhood outage locks users out of recovering US market (duration: Whole day)
4) Finastra brings servers back online after ransomware attack
5) Greece’s major banks cancel 15,000 cards after travel website breach