A different perspective on mistakes & hacks

I find it odd that knowledgeable people are giving these companies a pass. Having been in the software industry myself, I see these as events that SHOULD have been anticipated, with stronger measures put in place to prevent them before they occurred.

Let’s start with Cloudflare. Here’s their official explanation: https://blog.cloudflare.com/cloudflare-outage-on-july-17-202… More interesting to me was CEO Matthew Prince’s tweeted explanation: “The root cause was a typo in a router configuration on our private backbone.” https://twitter.com/eastdakota/status/1284298908156346368 The operative word there is “typo.”

This strongly indicates an unacceptable lack of testing for production code. Whether you’re building games or operating in a regulated industry like automotive, healthcare, or finance, you must test everything before it is rolled out to production. I’ve never been at a company that didn’t have layers of testing: code reviews, compiler warnings and lint runs, black box testing, simulation testing, white box testing, and staged/gradual roll-outs. There should be many layers in place.
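
To make that concrete for the non-coders here, below is a minimal sketch in Python of what a layered pre-production gate might look like. Everything in it is illustrative: the commands, the stage percentages, and the simulate_topology.py script are made up, and this is not anyone’s actual pipeline. The point is only that a change has to clear several independent checks before it can touch all of production.

    # Illustrative sketch only: run cheap checks first, then roll out in stages.
    import subprocess
    import sys
    import time

    def run_check(name, cmd):
        """Run one validation layer (lint, unit tests, simulation); stop on failure."""
        print(f"running {name} ...")
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            sys.exit(f"{name} failed:\n{result.stdout}{result.stderr}")

    def healthy(percent):
        """Placeholder health probe; a real one would query error rates and latency."""
        return True

    def staged_rollout(stages=(1, 10, 50, 100)):
        """Promote the change to a growing share of traffic, checking health between steps."""
        for percent in stages:
            print(f"deploying to {percent}% of traffic")
            time.sleep(1)  # stand-in for the actual deploy and soak time
            if not healthy(percent):
                sys.exit(f"health check failed at {percent}%, rolling back")

    if __name__ == "__main__":
        run_check("lint", ["python", "-m", "pylint", "service/"])    # hypothetical target
        run_check("unit tests", ["python", "-m", "pytest", "-q"])
        run_check("simulation", ["python", "simulate_topology.py"])  # hypothetical script
        staged_rollout()

The specific tools don’t matter; what matters is that no single typo should be able to reach 100% of production without passing several independent gates.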

Prince acknowledges this in a subsequent tweet: “The root problem was we didn’t have systems in place to keep them from causing a widespread issue. That’s a problem of leadership that I am more responsible for than the engineer who made the typo.” It’s great that he’s taking the blame, but the question remains as to whether he’s also acting as the CTO, or CISO, or Security Architect, etc. I sure hope not.

Now, there are some situations that are just impossible to fully test, since replicating the production environment is simply not feasible. I suspect that a global network topology is one such situation. However, that does not excuse the absence of a simulation environment, of simple “lint” checks on interpreted languages, and especially of a second set of eyes on every code change (i.e., code reviews). At my last company I put many layers of testing in place, and it was made clear that even though we had a QA department, engineers were still responsible for code reviews and unit testing.

We did have one instance in which we were not equipped to perform a pre-production or simulation test: a small cellular network configuration change. In some ways Cloudflare’s typo seems to be of a similar nature. We agonized over this change, performing a lint pass and multiple code reviews (including an external one with the cellular provider), and when we rolled out the change we did so at 3am and immediately had our black box testers run through various scenarios to ensure things were working. This was a big deal for us. I suspect that Cloudflare makes changes like this much more frequently. That means they should be better at it.

So, while it’s fine to say that Cloudflare responded well to the problem, the fact that a typo could cause this indicates to me that they have a very immature development/testing process. Too often I have seen tech companies grow from a development-oriented base without growing their discovery/testing regimen. This isn’t the first time something like this has happened to Cloudflare. Just a year ago, they had something similar: https://blog.cloudflare.com/cloudflare-outage/

The cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules…We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents. Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage…Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.

Clearly, they still have a ways to go on their “testing and deployment process,” even a year later. How many apologies do we accept?


Which brings us to Twitter. The NY Times has the best reporting on this hack so far: https://www.nytimes.com/2020/07/17/technology/twitter-hacker… and as of this morning I see that Twitter has posted its own information: https://blog.twitter.com/en_us/topics/company/2020/an-update…

I won’t summarize it all, but, long story short, Twitter apparently has Admin accounts that can make far-reaching changes, and a single Admin/super-user (named “Kirk” in the NY Times story) was able to wreak havoc for days. As the NY Times says:

As the morning went on, customers poured in and the prices that Kirk demanded went up. He also demonstrated how much access he had to Twitter’s systems. He was able to quickly change the most fundamental security settings on any user name and sent out pictures of Twitter’s internal dashboards as proof that he had taken control of the requested accounts.

How did Kirk do this? Twitter says: “The attackers successfully manipulated a small number of employees and used their credentials to access Twitter’s internal systems, including getting through our two-factor protections.”
The NY Times provides some additional details: “Kirk got access to the Twitter credentials when he found a way into Twitter’s internal Slack messaging channel and saw them posted there, along with a service that gave him access to the company’s servers.”

It appears the hackers used phishing to get access to Twitter’s internal Slack channel, where a Twitter admin had posted credentials. Now, beyond training admins to keep credentials private (not sharing them even on internal messaging boards) and to be less susceptible to phishing attacks, this is still not acceptable in my view. Large-impact, security-related changes should adhere to a Separation of Duties process (https://en.wikipedia.org/wiki/Separation_of_duties). You’ve seen the missile launch sequence at the beginning of “WarGames” - I don’t know if that’s the exact setup, but having two people each enter their own credentials in order to perform critical changes is standard operating procedure for security and critical administration. So a phishing/social hack on one admin would not be sufficient.
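
For illustration only, here is roughly what that two-person rule looks like in code. The operators and tokens are hypothetical, and a real system would verify credentials against an auth service rather than a dictionary, but the idea is the same: one phished admin should not be enough.

    import hmac

    # Stand-in credential store, purely for illustration.
    OPERATOR_TOKENS = {"alice": "token-a", "bob": "token-b"}

    def verified(operator, token):
        expected = OPERATOR_TOKENS.get(operator, "")
        return hmac.compare_digest(expected, token)

    def critical_change(action, approvals):
        """approvals is a list of (operator, token); require two distinct valid approvers."""
        valid = {op for op, tok in approvals if verified(op, tok)}
        if len(valid) < 2:
            raise PermissionError("two distinct approvers required for this change")
        return action()

    # One compromised admin is rejected:
    try:
        critical_change(lambda: print("reset @account email"), [("alice", "token-a")])
    except PermissionError as err:
        print(err)

    # Two independent approvals go through:
    critical_change(lambda: print("reset @account email"),
                    [("alice", "token-a"), ("bob", "token-b")])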

It appears that the Bitcoin aspects may have been a diversion from the real intent - Twitter confirmed that private data was downloaded from 8 accounts, and that phone numbers and emails from 130 accounts were exposed. At least it appears Twitter hashes user passwords, so those were not compromised, but we still don’t know the full extent of what the hackers did or what may come later.
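
(For anyone wondering what “hashes user passwords” buys you, here is a sketch of salted, iterated hashing. This is not Twitter’s actual scheme, just the general idea: even if the stored values leak, the passwords themselves are not directly revealed.)

    import hashlib
    import hmac
    import os

    def hash_password(password):
        """Store only a random salt and a slow, salted digest, never the password."""
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
        return salt, digest

    def check_password(password, salt, stored):
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
        return hmac.compare_digest(candidate, stored)

    salt, stored = hash_password("correct horse battery staple")
    assert check_password("correct horse battery staple", salt, stored)
    assert not check_password("a wrong guess", salt, stored)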


I fear these kinds of mistakes and attacks are going to become more and more widespread. It’s too easy for technology-pushing, fast-growing companies to neglect security and testing, as there’s constant pressure to do too much, too quickly, with too few resources. It’s too easy to push security concerns to the side, since you can get lucky for long periods of time without hacks. But while these two incidents and the earlier Zoom incidents gathered widespread attention, there are many others that get far less coverage. For instance, for just this year so far, read: https://www.fintechfutures.com/2020/04/2020-review-top-five-…

1) Travelex quarantines website, internal systems after New Year’s Eve cyber-attack (duration: 3 weeks)
2) Lloyds, Halifax and Bank of Scotland customers hit by system outage (duration: 8 hours)
3) Robinhood outage locks users out of recovering US market (duration: Whole day)
4) Finastra brings servers back online after ransomware attack
5) Greece’s major banks cancel 15,000 cards after travel website breach

54 Likes

I think it’s safe to assume any company will be susceptible to this issue. The frequency of events and how they are handled are probably the best barometer of their significance. I was wondering whether Fastly had similar problems. I could find one significant Fastly “global” outage which occurred 3 years ago. According to the link, this was handled in a “rapid” manner.

https://bgr.com/2017/06/28/internet-outage-fastly-cnn-nyt/

As for our companies, it demonstrates that demand for security and for monitoring of systems will remain strong.

Dave

As for our companies, it demonstrates that demand for security and for monitoring of systems will remain strong.

Security must be built into the software. A typo bringing the system down is a sign of negligence. I wonder how much of the code I have written is about defeating typos, 25%? 30%?

An insurance company I worked with would issue documents where the ending date was earlier than the start date. That’s gross negligence.

I’m the first to admit that there is no bug free software but some kinds of bugs are unacceptable.

BTW, the TMF issue with Cloudflare was about an excessively high security setting for login. Cloudflare presented a PITA captcha. This is what I had to say:

captcha is one of the most stupid and annoying inventions I have ever had to deal with. If that’s what Cloudflare (NET) does for a living I’m not investing a red penny in that crap. I wish TMF would find a better security service.

https://discussion.fool.com/out-of-the-blue-34555727.aspx?sort=w…

Ever since I started writing code for the Mac I have been a great fan of Apple’s User Interface Guidelines. Captcha punishes every user because there are some bad apples out there. Captcha is a terrible user interface.

Denny Schlesinger

7 Likes

The root cause was a typo in a router configuration on our private backbone.

Many companies don’t treat configuration files as in any way similar to code. Many don’t even have them under source code control, and often changes are done by hand on the fly and without review, especially if they are changed often. This is partly a status thing – the folks who deal with such things are often system administrators who are not considered to be engineers or even coders. Their work is considered to be relatively trivial, so often the people responsible for making things work right don’t take it very seriously.

Of course, errors in trivial stuff can break the world just as easily as errors in anything else. In general computer systems are fragile, and small anomalies can have large consequences. But when things are simple and almost always gotten right, it’s hard to justify full-blown configuration management and test procedures being run around every change. For the non-technical among you, think of this sort of thing as similar to getting a single digit wrong when writing down a phone number. Then imagine there’s some small chance that writing down the number (even if you got it right) might cause every other number you’ve got to not work either. Are you really going to call them all to make sure they still work?
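
For contrast, here’s a minimal sketch (Python, with a made-up JSON config format) of what treating configuration like code can look like: a check that runs from a pre-commit hook or CI job, so a hand edit can’t quietly ship.

    # Illustrative only: validate config files before any change is deployed.
    import json
    import sys

    REQUIRED_KEYS = {"route", "next_hop", "description"}

    def validate(path):
        try:
            with open(path) as fh:
                entries = json.load(fh)
        except (OSError, json.JSONDecodeError) as exc:
            return [f"{path}: cannot parse: {exc}"]
        errors = []
        for i, entry in enumerate(entries):
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                errors.append(f"{path}: entry {i} is missing {sorted(missing)}")
        return errors

    if __name__ == "__main__":
        problems = [p for path in sys.argv[1:] for p in validate(path)]
        for p in problems:
            print(p)
        sys.exit(1 if problems else 0)

Whether a given shop can justify that overhead for every small change is exactly the trade-off described above.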

-IGU-
(used to test critical software stuff for Apple)

6 Likes

Security must be built into the software. A typo bringing the system down is a sign of negligence. I wonder how much of the code I have written is about defeating typos, 25%? 30%?

An insurance company I worked with would issue documents where the ending date was earlier than the start date. That’s gross negligence.

I’m the first to admit that there is no bug free software but some kinds of bugs are unacceptable.

BTW, the TMF issue with Cloudflare was about an excessively high security setting for login. Cloudflare presented a PITA captcha. This is what I had to say:

captcha is one of the most stupid and annoying inventions I have ever had to deal with. If that’s what Cloudflare (NET) does for a living I’m not investing a red penny in that crap. I wish TMF would find a better security service.

https://discussion.fool.com/out-of-the-blue-34555727.aspx?sort=w…

Ever since I started writing code for the Mac I have been a great fan of Apple’s User Interface Guidelines. Captcha punishes every user because there are some bad apples out there. Captcha is a terrible user interface.

Denny Schlesinger

I appreciate your informed perspective. So you’re not invested in NET? Are you an investor in FSLY? It had a similar problem, possibly more significant, and it didn’t slow them down. Or how about ZM? It had a very large security issue, one that also seemed to be unacceptable, something that shouldn’t happen, one that cost them many customers and wasn’t some temporary glitch. But that’s also not slowing them down.

So while I can understand your perspective, I’m not sure it’s a game changer from an investment perspective. That’s what really matters to me.

Dave

2 Likes

So while I can understand your perspective, I’m not sure it’s a game changer from an investment perspective. That’s what really matters to me.

I don’t think any of us are saying that these incidents in and of themselves should change our current investing theses. However, I do believe it’s worth following them and understanding them and seeing if they’re getting out of hand and especially how the company is responding.

For instance, as has been brought up, Zoom had multiple security issues, including some they intentionally didn’t fix. That was a game changer for me and I sold out of ZM. However, when it became apparent that Zoom’s CEO, Eric Yuan, was truly serious about addressing security issues, I bought back in (Saul’s patient explanations of Zoom’s growth were crucial there as well).

I agree that this CloudFlare incident is not in and of itself a reason to change one’s investment in NET. If TMF ever interviews Matthew Prince, I’d love to submit a couple of questions to be asked.

BTW, someone pointed out that Fastly had an outage in 2017, which is true. Looks like they fixed it in 19 minutes: https://news.ycombinator.com/item?id=14654231

1 Like

So while I can understand your perspective, I’m not sure it’s a game changer from an investment perspective.

Trying to connect the dots from a bug to the stock’s performance is simply not possible. I’m making no prediction about NET, the stock. The inferior x86 beat out the superior 680x0. That’s how high tech works.

That’s what really matters to me.

As it should.

Denny Schlesinger

2 Likes

Even though remmdawg, ItsGoingUp and captainccs already brought voices of reason here, this thread contains some dangerous assumptions based on the word “typo” in a tweet (and some surface-level reporting). I felt the need to post myself to call this out more aggressively.

To present the assumption of gross negligence as fact, based only on the information available here, is totally irresponsible. There is inside information here that we do not have access to.

There are bugs/issues in systems found and fixed all day, every day (those big R&D budgets are mostly this). It is a mistake to think that just because this one got some limelight, it is somehow avoidable and unique.

  • Code is never done.
  • Test-coverage is never 100%.
  • Linters (programs that find simple formatting and syntax errors) are far from perfect.
  • Code is text and a typo is just typing text wrong.
  • Bugs are unavoidable no matter how well intentioned a developer is.
  • Security always has holes, which is why CRWD is needed in this world.
  • Even if NONE of the above were true, code systems still have to interact with each other and “integration testing” is also a hard problem and not always possible to test in an automated way. There is also a continuum of granularity in this testing, making it even harder to cover than simple testing, and may even include 3rd party systems and the world at large.

None of this is new. It has been this way since code was invented and it will never go away, only mutate. One day we might have AI bots constantly testing things and letting us know of issues, or even fixing them on the fly, but issues will still exist to be found and fixed. Every now and then one will get increased visibility, either internally or externally to the department or company.

The Zoom security conversation is old. I’m fatigued by the regurgitated spin. Only the news and short sellers made it a big deal. Looking at it directly, it was a non-issue, or growing pains due to different user types. I never saw any real security hole mentioned or found, so I bought more, and more. I only bother mentioning this because there is actually a funny parallel between the Zoom issues and this thread. Several “security flaws” were just configuration defaults they then changed, or one-time events like the China routing.

TO BE CLEAR. I totally agree with Smorgasbord1 on this point: “I don’t think any of us are saying that these incidents in and of themselves should change our current investing theses. However, I do believe it’s worth following them and understanding them and seeing if they’re getting out of hand and especially how the company is responding.” I just felt like this thread had more conclusions and statements than questions and exploration. Perhaps I’m just reading it wrong, but I felt compelled to speak up anyway.

27 Likes

There is inside information here that we do not have access to.

Thanks to CloudFlare’s transparency (kudos to them for that!), we actually have a lot of information. CloudFlare’s blog posted the actual incorrect line of configuration code and explained why it was wrong: https://blog.cloudflare.com/cloudflare-outage-on-july-17-202…

As there was backbone congestion in Atlanta, the team had decided to remove some of Atlanta’s backbone traffic. But instead of removing the Atlanta routes from the backbone, a one line change started leaking all BGP routes into the backbone…The correct change would have been to deactivate the term instead of the prefix-list.

This is the kind of thing that a code review with another engineer would have a good shot at uncovering. The fixes described in the blog and by the CEO are, IMO, just bandaids.
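
To see why that one line was so destructive, here is a toy model in Python (not actual router software, and the prefixes are made up): a policy term whose match condition has been deactivated simply matches everything.

    ATLANTA_PREFIXES = {"192.0.2.0/24"}  # made-up example prefix
    ALL_ROUTES = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}

    def routes_sent_to_backbone(term_active, prefix_list_active):
        if not term_active:
            return set()                          # the intended fix: the whole term is off
        if prefix_list_active:
            return ALL_ROUTES & ATLANTA_PREFIXES  # normal behavior: only Atlanta routes match
        return set(ALL_ROUTES)                    # no match condition left, so everything matches

    # Deactivating the term keeps Atlanta's routes off the backbone:
    assert routes_sent_to_backbone(term_active=False, prefix_list_active=True) == set()
    # Deactivating the prefix-list instead leaks every route into the backbone:
    assert routes_sent_to_backbone(term_active=True, prefix_list_active=False) == ALL_ROUTES

That is exactly the sort of semantic trap (deactivating a match condition widens the match instead of narrowing it) that a second pair of eyes is good at catching.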

None of this is new. It has been this way since code was invented and it will never go away, only mutate.

The problem at most companies is budget and time. Company executives want features and performance under tight deadlines and with low development costs. That culture encourages managers to emphasize code development over validation, and that sends a message to engineers as well. Where lives and millions of dollars are on the line, validation is almost always much more thorough. SpaceX, for instance, does a ton of software validation (https://space.stackexchange.com/questions/9243/what-computer…).

Sure, not all bugs can be found, but the kind of simple mistake Cloudflare just made is very much preventable, and not at great cost. How much money did Shopify and its customers lose during the outage? Do you not think Shopify will remember two such outages in two years when it goes about renewing subscriptions or expanding usage with Cloudflare?

And remember, this outage cost Cloudflare money. They have business SLAs (Service Level Agreements) that literally guarantee 100% uptime, with monetary penalties of 25 times the monthly fee if they fail:

100% uptime and 25x Enterprise SLA: In the rare event of downtime, Enterprise customers receive a 25x credit against the monthly fee, in proportion to the respective disruption and affected customer ratio.
https://www.cloudflare.com/plans/enterprise/
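
Back-of-the-envelope, under one possible reading of that clause (the exact formula is Cloudflare’s, and the numbers below are invented):

    def sla_credit(monthly_fee, downtime_minutes, affected_ratio,
                   minutes_in_month=30 * 24 * 60, multiplier=25.0):
        """Credit = 25x the fee, scaled by the fraction of the month down and customers affected."""
        return multiplier * monthly_fee * (downtime_minutes / minutes_in_month) * affected_ratio

    # A 27-minute outage affecting half the traffic of a $5,000/month enterprise customer:
    print(round(sla_credit(5000, 27, 0.5), 2))  # about 39.06

Small per customer, but multiplied across every affected enterprise customer it adds up, and that’s before counting the reputational cost.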

The Zoom security conversation is old.

Seems like you didn’t actually read what I wrote. I was using Zoom as a positive example of a company that changed its development and validation procedures and, more importantly, changed the culture of the company to emphasize security as much as reducing user friction. This is the kind of change that companies like Cloudflare apparently need to undergo. Most companies don’t have to go all triple-redundancy like SpaceX, but they sure do need to prevent the kind of mistake Cloudflare just made.

18 Likes