New DDOG Article on Seeking Alpha

Kayode Omotosho released a new article on Seeking Alpha on Sunday. Kayode seems to be well informed about the general technology landscape and the long-term growth prospects of the tech companies he writes about and comments on. He does think that a lot of companies are overvalued, and commits what some here might call “the error of trying to buy the cheap,” but his articles are interesting to read and he does offer a balanced perspective. He is simply risk averse, and he believes high valuations are inherently risky. :::shrug::: fair enough.

Every once in a while, he will point out that one of our “overvalued” stocks is actually an interesting prospect. In this case, it’s Datadog. He likes the stock, but he does see important risks to keep an eye on.

I want to share the article here because I think it gives a good idea of what to pay attention to as Datadog’s market develops. The main risks he sees are that Datadog can get expensive, that open source alternatives may be a threat, and that firms might switch to a cheaper, “good enough” solution if they want to save money.

Here is the article: https://seekingalpha.com/article/4328632-datadog-stay-long-s…

Personally, I have really struggled with this name. I’ve traded in and out of it several times. I think I feel about Datadog the same way Saul feels about The Trade Desk. I hate the business it’s in. It is hard to understand and seems very competitive and susceptible to disruption at any time (like security). It’s also an infrastructure play and doesn’t get wired into the company culture the way software like Alteryx does. But I can’t ignore its results. I have taken a 5% position this morning. I wonder if I will hang on to it this time…

16 Likes

I hate the business it’s in. It is hard to understand and seems very competitive and susceptible to disruption at any time

You are the Fool IT guy. Imagine Fool has 10,000 servers in a warehouse to handle all the traffic reading and posting on Fool. Each of those servers has an “operating system” (OS), kind of like how Windows is probably the operating system on your PC. Operating systems write logs to hard drives to record all the things the OS does. Each OS reports things it is programmed to notice on the underlying PC server hardware and software. Temperature sensors in each server are one for sure. How many threads on the CPUs are active. How many users each second, what board they are on, etc., etc. Another server somewhere is running the air conditioning in the room.

Now at 11 AM Eastern, the whole Fool website slows to a crawl. David can’t post his stuff, Tom can’t upload the corporate taxes, and coincidentally, the server room gets hot and the alarms go off. Tom and David come looking for you to see what you did to them. But by 2 PM, the problem is gone and nobody knows what happened. What do you tell Tom and David, and how will you protect them from this unknown thing “next time”?

Some options:

1: Look at the air conditioner controller server log to see if anything seems broken. Nope, but you see it got overworked. Look at the fire alarms to see if something is smoldering in the walls. Nope. Look at all 10,000 server logs to see what things were reported in computer-geek language. Spend a week analyzing lines of cryptic log text and guessing.
2: Hire a guy to write a Perl or Python program to open all those logs and print stuff that might be interesting (a rough sketch of that kind of script follows this list). Then spend a couple of days analyzing and guessing.
3: Open the Datadog log system. It automatically reads and stores all the things you thought might be “interesting” from all 10,000 servers and the AC server, fire monitor, etc. You see a graph that says 15 million people all went to read Saul’s monthly portfolio update at the same time. You see another graph, with a link to the first one, showing the air conditioner and room temperature spikes. You see that the servers all slowed down because the extra traffic caused the CPUs’ internal power-management microcode to slow the clocks to keep the CPUs from overheating. You call Tom and David, tell them “Better Call Saul,” and go back to lunch. 20 minutes.
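For the curious, here is roughly what the option 2 script might look like. This is just a minimal sketch: the directory layout, file names, and keywords below are hypothetical, and every real server family logs in its own format.

```python
# Minimal sketch of the "option 2" approach: scan a pile of server logs for
# anything interesting around the time of the slowdown. Paths, file names,
# and keywords are hypothetical.
import glob
import re

KEYWORDS = re.compile(r"error|timeout|overheat|throttl|temperature", re.IGNORECASE)
WINDOW = ("11:00", "14:00")  # rough local-time window of the incident

def interesting_lines(path):
    with open(path, errors="ignore") as f:
        for line in f:
            # Assumes each line starts with a timestamp like "2020-02-23 11:03:41"
            if WINDOW[0] <= line[11:16] <= WINDOW[1] and KEYWORDS.search(line):
                yield line.rstrip()

for logfile in glob.glob("/var/log/fool-servers/*/syslog*"):
    for line in interesting_lines(logfile):
        print(logfile, line)
```

Even the toy version shows why option 2 takes days: you still have to eyeball the output from 10,000 machines and guess which of those lines actually matter.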

This is an entertaining scenario, but it is also a simple example of why Datadog’s product has traction.

146 Likes

Thanks ibuildthings. You made that “What does Datadog Do” example so clear and understandable that even I could grasp it.
Best,
Saul

2 Likes

Thank you so very much, ibuildthings!

You are the Fool IT guy. Imagine Fool has 10,000 servers in a warehouse to handle all the traffic reading and posting on Fool.

ibuildthings,

This was an excellent example! Very well done. Please allow me to take your example one step further (because I build things too :wink:)

Not only do you possibly have 10,000 servers in a data center somewhere, along with a huge, complex, and ever-expanding database containing every iota of data about every Fool who has an account with TMF, but you also have a tremendous amount of storage (disk drives, RAID arrays, etc.) to store that database on.

Now imagine that FoolHQ is in the middle of a migration from those 10,000 servers, with databases and storage, to a cloud-native re-design of everything! On the one hand you have 10,000 servers, which in reality are probably a combination of “bare-metal” machines and VMs (virtual machines), and on the other, you have a smattering of systems up in “the cloud”. These systems in the cloud are likely a mix of VMs, possibly controlled by auto-scaling groups to better handle dynamic load, and things like Kubernetes clusters orchestrating a bunch of disparate Docker containers, each of which may be running smaller micro-services. And, layered on top of that, there is the further complexity of many components moving into a “serverless” framework, which may be a combination of things like Lambdas, static content stored in S3 buckets and distributed via CDN, Application Gateways, etc. And you’ve got a variety of databases, from Postgres clusters, to DynamoDB, to Redis, etc.

Now, with all that complexity, how do you find where the slowdown is? Or even what it is? Some of these things can’t be “logged into” to check logs. They do have logs, and they store them “somewhere”. But where? And how do you read them? How do you correlate that the lambda over here that took a long time to spin up and hit the API Gateway is at the root of the problem? Or was it the lambda that launched a long-running query against 3 different types of databases that timed out before the query returned? Or is the problem that the instance running the giant Java app that was merely forklifted to the cloud lacks the permissions to access the DB?

Maybe the instance lacking permissions is allowed to query, but it returns the results incorrectly, causing the lambda to spin until it times out, and the whole thing manifests as if it’s a CDN problem distributing the next page.

Datadog (and Splunk before it) allows you to have all of your cloud widgets log to a central location, then ingests those logs and lets you do something called “time-event correlation”. It correlates events based on time stamps. It answers the question, “What was happening at this moment in time across my entire infrastructure?”
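To make “time-event correlation” a bit more concrete, here is a toy illustration of the idea (emphatically not Datadog’s actual implementation): collect events from several sources, normalize their timestamps, and then ask what was happening across everything within a few seconds of a given moment. The event data below is made up.

```python
# Toy time-event correlation: merge events from several sources and show
# everything that happened near a moment of interest. Data is made up;
# a real system ingests this from agents and logs at enormous volume.
from datetime import datetime, timedelta

events = [
    {"source": "cdn",        "time": "2020-02-23T11:00:05", "msg": "cache miss spike"},
    {"source": "api-gw",     "time": "2020-02-23T11:00:07", "msg": "p99 latency 4.2s"},
    {"source": "lambda-foo", "time": "2020-02-23T11:00:09", "msg": "timed out after 3s"},
    {"source": "postgres",   "time": "2020-02-23T11:00:08", "msg": "slow query: 2.9s"},
    {"source": "k8s",        "time": "2020-02-23T10:30:00", "msg": "pod web-7f scaled up"},
]

def around(events, moment, window_seconds=5):
    """Return events within +/- window_seconds of `moment`, sorted by time."""
    t0 = datetime.fromisoformat(moment)
    window = timedelta(seconds=window_seconds)
    hits = [e for e in events
            if abs(datetime.fromisoformat(e["time"]) - t0) <= window]
    return sorted(hits, key=lambda e: e["time"])

for e in around(events, "2020-02-23T11:00:07"):
    print(e["time"], e["source"], e["msg"])
```

The hard part at scale is not this loop; it is ingesting, normalizing, and indexing billions of events from sources that each timestamp and format things differently, and that is exactly what Datadog sells.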

This is not an easy problem to solve. And, with so many different moving parts between cloud-based infrastructures, data centers, hybrid-cloud, etc. it is a NIGHTMARE to debug!

At one time you had “a” server to check for a particular type of problem. Now, you could literally have thousands. Additionally, the system that actually caused the problem may not even exist anymore by the time you start debugging. Things like auto-scaling groups and container orchestration tools like Kubernetes constantly spin things up and down based on a variety of criteria and demand. So it’s very possible, probable even, that you need to be able to debug code that ran somewhere that no longer exists!

Time-event correlation is an absolute necessity in this environment. It used to be that data centers could get away with minimal monitoring of only the essentials. But now, with products like Datadog, we can, and even have to in some cases, add instrumentation and logging to every bit of code so that when something goes wrong, we can easily pinpoint when, where, why, and how, and then figure out how to fix it. Sometimes the fix is actually easier than identifying the problem itself, but we can’t fix it because we don’t know the when, where, why, and how!

BobbyBe: I hate the business it’s in. It is hard to understand and seems very competitive and susceptible to disruption at any time (like security). It’s also an infrastructure play and doesn’t get wired into the company culture the way software like Alteryx does.

While I understand your concerns about this being an infrastructure play, and feeling like it’s very competitive and susceptible to disruption, take a look around. It’s Top Dog in its area. And it’s a VERY VERY HARD problem to solve, never mind solve it well!

There are absolutely open source solutions which could compete with it. One can build out a complete ELK (Elasticsearch, Logstash, Kibana) stack using Elastic’s (ESTC) products free of charge. But this is even harder. It will require at least one full-time person designing and implementing this infrastructure. And, being exactly the person who builds out that sort of infrastructure, let me assure you, it’s not simple! Elastic’s products are well designed, but they are incredibly complex! You more likely need a small team of people who are Elastic experts to work with a team of infrastructure people. At that point, you’re looking at at least 3, maybe 4 FTEs… That gets very expensive very fast.
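Just to give a flavor of the DIY route, here is a hedged sketch of the very first baby step: shipping one log event into Elasticsearch and querying it back by time, using the official Python client. The host, index name, and field names are hypothetical, and the exact client keyword arguments vary by client version.

```python
# Tiny taste of the DIY (ELK) route: index one log event and query recent
# errors back. Host, index, and fields are hypothetical; client kwargs vary
# by elasticsearch-py version.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# In production, Logstash/Beats do this ingestion for you, at volume.
es.index(index="fool-logs", document={
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "host": "web-0042",
    "level": "ERROR",
    "message": "CPU thermal throttling engaged",
})

# Ask: which ERRORs happened in the last 15 minutes?
resp = es.search(index="fool-logs", query={
    "bool": {"must": [
        {"match": {"level": "ERROR"}},
        {"range": {"@timestamp": {"gte": "now-15m"}}},
    ]}
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["host"], hit["_source"]["message"])
```

And that is before you have sized and secured the cluster, set up retention, built the dashboards, and wired up alerting; the 3, maybe 4 FTEs go into everything around this snippet, not the snippet itself.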

Or, you can buy Datadog’s product (or Splunk), which is expensive, but it’s probably worth every single penny! And, maybe, being a small start-up strapped for cash, you decide to build your ELK stack at first. And maybe it’s “good enough” and you don’t buy Datadog. At some point, if your company is growing and successful, you will outgrow your ELK stack, or want it to do more, but run out of bandwidth on your team to grow it. And you will quickly realize, or be pressured by those people into realizing, that Datadog is the better solution.

People like me, who build infrastructure, do not enjoy monitoring. And we don’t enjoy building the infrastructure to deal with monitoring. It’s tedious, boring, sucks up a lot of time and resources, and our skills can be better put to use elsewhere. We want to use things like Datadog, if for no other reason than to allow us to do more interesting things than deal with monitoring :slight_smile:

While DDOG is technically an infrastructure play, it is so at a level higher than ESTC. DDOG is a company that benefits from the infrastructure that ESTC lays down. ESTC is like dark fiber laid all over the place that is there and ready for anyone to use. Whereas DDOG is like the service running over the newly lit-up fiber. DDOG benefits from the fiber being laid, which allows it to sell very useful things to people who don’t have to, or don’t want to, know how to light up the fiber!

I don’t really see a competitive threat to DDOG at this point. Sure, there are other companies in the space, but they’re either legacy companies like Splunk trying to re-position themselves, or they’re smaller start-ups trying to attack the exact same space with an inferior product.

It is entirely possible they’ll be challenged. It’s high tech. Everything and everyone is constantly being challenged. And when another new upstart comes along to steal market share from DDOG, if they have a better mousetrap and better numbers, we’ll all move over there, just as those customers who first used Splunk, and moved to DDOG, are now going to Latest-Shiny-Thing.


Paul - who hates building monitoring infrastructure…

62 Likes

First, thanks ibuildthings for your entertaining example :slight_smile:

I had mentioned in my AYX post that I would write up my thoughts on DDOG, but instead of waiting for a detailed writeup, here are some quick comments. Please note that I haven’t personally used Datadog yet, but I understand the monitoring space and can see why it’s doing well.

“Now at 11 AM Eastern, the whole Fool website slows to a crawl.”

The above scenario is what typically causes the primary (and maybe the secondary) on-calls to get paged, and sometimes pulls in a lot of other folks to douse the fire (not a happy situation for any dev or ops person). I’ve been there many times.

Here’s my perspective: That Fool website slowing to a crawl is exactly what Datadog is intended to prevent and fix asap.

The Datadog Agent appears to do a really smart job of monitoring compared to the public cloud APIs. I believe AWS updates metrics every minute with CloudWatch (which is a pretty long time in today’s context, when users expect a page to load or a service to provide information almost instantaneously). The Datadog Agent can be installed locally and stores metrics at 1-second resolution. This allows you to detect service/application anomalies and spikes (like CPU usage, etc.) in greater detail. Also, integration with tools like PagerDuty makes it possible to promptly notify on-call personnel as soon as issues appear.
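To give a feel for where the Agent sits: it runs on each host, collects system metrics on its own, and applications can also push custom metrics to it locally over DogStatsD. Here is a minimal sketch using Datadog’s Python client (the datadog package); the metric names and tags are made up.

```python
# Minimal sketch: push custom metrics to a local Datadog Agent over DogStatsD.
# Assumes the `datadog` package and an Agent listening on the default port 8125.
# Metric names and tags are hypothetical.
import time
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

def handle_request():
    start = time.time()
    statsd.increment("fool.web.requests", tags=["board:saul"])
    # ... do the actual request work here ...
    statsd.histogram("fool.web.request_time", time.time() - start,
                     tags=["board:saul"])

handle_request()
```

From there you graph those metrics, put monitors on them, and point the monitors at PagerDuty so the on-call gets paged the moment latency or error rates cross a threshold.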

Availability and latency are two super important metrics for any service, and you want to monitor them all the time. The key questions are “how often are the metrics updated,” “how easy is it to see trends,” and “how much granularity is provided for debugging an issue.” Those provide the invaluable information needed to ensure your application/service meets its SLAs and keeps your customers happy.

This was about 3-4 years back… I worked very hard to achieve 99.9% (just three nines, forget about four nines) consistent availability of a service over a few months’ time. That required a lot of plumbing and weaving to put together a system that was fragile and needed oversight and maintenance. So if Datadog will take care of all of that (with a subscription fee, of course) and help you focus on building the product/features that your customers want, which option would you choose?
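For a sense of what those nines mean in practice, here is the downtime budget they translate to; this is simple arithmetic, nothing Datadog-specific.

```python
# Downtime budget per 30-day month for a given availability target.
def monthly_downtime_minutes(availability):
    return (1 - availability) * 30 * 24 * 60

for target in (0.999, 0.9999):  # three nines vs. four nines
    print(f"{target:.2%}: {monthly_downtime_minutes(target):.1f} minutes/month")
# 99.90%: 43.2 minutes/month
# 99.99%: 4.3 minutes/month
```

Chasing down a mystery slowdown by hand can burn most of that three-nines budget in a single afternoon, which is part of why good monitoring pays for itself.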

Even if you’re not technically inclined, it’s fun to watch these short videos to get a high-level overview (though the videos are a little old): https://www.youtube.com/watch?v=uI3YN_cnahk

Also Saul and muji have done great deep dives on DDOG which are worth reading if you’re an investor in DDOG.

For the Fool’s purposes, you could mount some displays in your office, corridors, or wherever, with screenboards that display the latest health of your services. That’s visibility!

Caveat: The only thing with a monitoring service like DDOG is that it is not very difficult to replace it with another one. So, as long as DDOG keeps adding features and enhancing itself to stay ahead of the competition, it will be fine. I’ll be keeping a close watch on DDOG (how about a monitoring service to monitor DDOG? :)). Modern-day services cannot survive without great monitoring.

They also have a Watchdog feature that looks for irregularities in metrics, like a sudden spike in the hit rate, etc.
Here’s a quote from DDOG’s site from one of their users who uses that feature…

“Watchdog is giving us faster incident response. It’s showing us where the problems are in our system that we wouldn’t have otherwise seen. And it’s showing us where other impacts are throughout the system… it’s allowing us to essentially deliver a better level of service to our customers.”
–Joe Sadowski, Engineering Manager, Square
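For intuition about what a feature like Watchdog is doing conceptually, it boils down to flagging a metric that suddenly deviates from its own recent behavior. Here is a toy z-score version; this is emphatically not Datadog’s actual algorithm, just the general shape of the idea.

```python
# Toy anomaly flagging: mark points that deviate sharply from the trailing mean.
# For intuition only; NOT Watchdog's actual algorithm.
from statistics import mean, stdev

def spikes(series, window=10, threshold=3.0):
    """Yield (index, value) where value is > threshold std devs from the trailing window."""
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            yield i, series[i]

hit_rate = [100, 98, 102, 101, 99, 100, 103, 97, 101, 100, 100, 480, 102, 99]
print(list(spikes(hit_rate)))  # flags the sudden jump to 480
```

The appeal of a built-in feature like Watchdog is that you don’t have to decide in advance which metrics to watch or hand-tune a threshold for each one.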

NOTE: Their recent integration with Azure DevOps shows that they are staying on top of their game: https://seekingalpha.com/news/3526024-datadog-announces-azur…

And from a dev perspective, their HTTP API allows devs to build custom features on top of what’s offered out of the box.
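As an example of what building on that API looks like, here is a hedged sketch of pulling a metric timeseries with plain requests; the endpoint and parameters follow Datadog’s v1 metrics query API as I recall it, so check their current API reference before relying on it. The environment variable names and the metric query are assumptions.

```python
# Sketch: query a metric timeseries from Datadog's HTTP API.
# Endpoint/params per the v1 query API as I recall it; verify against the
# current Datadog API docs. Env var names and the metric query are assumptions.
import os
import time
import requests

now = int(time.time())
resp = requests.get(
    "https://api.datadoghq.com/api/v1/query",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    params={
        "from": now - 3600,                 # last hour
        "to": now,
        "query": "avg:system.cpu.user{*}",  # example metric query
    },
)
resp.raise_for_status()
for series in resp.json().get("series", []):
    print(series["metric"], len(series["pointlist"]), "data points")
```

That is the kind of thing teams use to feed homegrown dashboards, release gates, or capacity reports on top of what Datadog offers out of the box.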

Cheers!

ron

<I have a 10% position in DDOG>

35 Likes

Please allow me to take your example one step further (because I build things too :wink:)

FlixFool, that was a simply awesome example. Thank you so much for taking the time to help us understand what we are investing in with Datadog!!!

Saul

9 Likes

I had mentioned in my AYX post that I would write up my thoughts on DDOG, but instead of waiting for a detailed writeup, here are some quick comments.

My God Ron, thanks to you too. It’s amazing, the caliber of people we have on this board. We really appreciate it.

Saul

15 Likes

There are absolutely open source solutions which could compete with it. One can build out a complete ELK (Elasticsearch, Logstash, Kibana) stack using Elastic’s (ESTC) products free of charge. But this is even harder. It will require at least one full-time person designing and implementing this infrastructure. And, being exactly the person who builds out that sort of infrastructure, let me assure you, it’s not simple! Elastic’s products are well designed, but they are incredibly complex! You more likely need a small team of people who are Elastic experts to work with a team of infrastructure people. At that point, you’re looking at at least 3, maybe 4 FTEs… That gets very expensive very fast.

Or, you can buy Datadog’s product (or Splunk), which is expensive, but it’s probably worth every single penny!

FlixFool,

With hosted Elastic, is it still this complicated? I figured that was their way of doing it for you.

Also, as far as Splunk goes, how is Datadog better? You mentioned later “those customers who first used Splunk, and moved to DDOG.”

Thanks,
Bear

3 Likes