DDOG Customer and Internal Discussion from Devel

Datadog is my largest position. It has grown since my early purchases at ~ $32/share (yes, I got in during the early days). I have been buying more DDOG (probably more than I should), as it has been dropping these past few months.

Now to the story.

I work at a tech company in the SF Bay Area. The company is small to midsize. This morning, I was searching our Slack channels and randomly came across a conversation about DDOG among a few of our software developers. After reading through the conversation, I was concerned, then not concerned, then somewhat confused. I thought I would share the transcript with the board, as the conversation may offer food for a lively discussion.

Employee #1: Bringing back the topic of Datadog… I came across this sentence while reading through their docs:

“Any metrics with fractions of a second timestamps are rounded to the nearest second. If any points have the same timestamps, the latest point overwrites the previous ones.”

The word “overwrites” here is concerning: does anyone know whether data points are dropped or just grouped into the same 1-second bucket for querying?

Employee #2: I tried this out locally with a test metric using the XXXX API, and it does in fact seem to overwrite the value for the same second-based timestamp. And I am not able to retrieve the “older” value.

Sample command:

  • X = 1 for the first submission
  • X = 5 for the second comm

Employee #2: @Employee #3 and @Employee #4 – would you have any insights and/or thoughts on this?

Employee #1: Yikes! Thanks for testing this out, @Employee #2. If this is indeed the case, then I think we’ll want to quickly find another solution.

Employee #3: Hi. Yes, this is expected. Dogstatsd is based on statsd and metrics do not accumulate. I would expect that you would have to use additional dimensions (for example machine-name or the like) to get two different points of the same metric at the same timestamp.

Employee #3: In typical statsd implementations the path includes metadata such as the machine name or IP, and then metric math is done to combine those. I can’t imagine dogstatsd is different here. It’s hard for the backend to know how to combine metrics (avg, sum) for counts, let alone gauges.
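
To make the overwrite behavior and the tag workaround described above concrete, here is a minimal sketch in Python. It is a toy model only, not Datadog's implementation: the class `ToyMetricStore`, its methods, and the tag strings are hypothetical names invented for illustration. It assumes a point is identified by metric name, whole-second timestamp, and tag set, and that a later point with the same identity replaces the earlier one.

```python
import math
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

# Hypothetical identity of a stored point: (metric name, whole second, tag set).
Key = Tuple[str, int, FrozenSet[str]]


class ToyMetricStore:
    """Toy last-write-wins store; an illustration, not Datadog's backend."""

    def __init__(self) -> None:
        self._points: Dict[Key, float] = {}

    def submit(self, name: str, timestamp: float, value: float,
               tags: Iterable[str] = ()) -> None:
        # Round fractional timestamps to the nearest whole second.
        second = math.floor(timestamp + 0.5)
        # A later point with the same key replaces the earlier one.
        self._points[(name, second, frozenset(tags))] = value

    def query(self, name: str, second: int,
              tags: Iterable[str] = ()) -> Optional[float]:
        return self._points.get((name, second, frozenset(tags)))


store = ToyMetricStore()

# Same metric, same rounded second, no tags: the first value is lost.
store.submit("X", 100.2, 1)
store.submit("X", 100.4, 5)
untagged = store.query("X", 100)

# Same metric and second, but distinct tags: both values survive.
store.submit("X", 100.2, 1, tags=["host:a"])
store.submit("X", 100.4, 5, tags=["host:b"])
from_a = store.query("X", 100, tags=["host:a"])
from_b = store.query("X", 100, tags=["host:b"])
```

In this model the untagged query returns only the second value, while the tagged submissions remain separately retrievable, which is the point Employee #3 makes about adding dimensions.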

Employee #1: DogStatsD is different than their API though. @Employee #2’s testing was through the API afaik.

Employee #1: There’s some available here (link to doc.datadoghq.com). The short version is that it looks like you’re required to use their custom agent if you want support for HISTOGRAM and DISTRIBUTION metric types.

Employee #1: Their docs also seem to indicate that all statistical analysis (e.g. calculation of p95) has to be done up-front, meaning you can’t just submit metrics and run arbitrary analysis later. If that’s true, then this seems extremely restrictive.

Employee #3: DogStatsD is StatsD with dimensions added. StatsD supports HISTOGRAM and DISTRIBUTION

Employee #3: That is true (the p95 thing) and pretty typical of statsd based systems. It’s one reason (but far from the only one) I dislike ddog (the real reason is that they just had high latency on alerts or forgot to send them).
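
The concern about percentiles being computed up-front can be illustrated with a small, self-contained Python sketch. The numbers and host names are invented for illustration, and `p95` here is a simple nearest-rank percentile, not any vendor's algorithm. The sketch shows why pre-aggregated percentiles cannot generally be recombined later: averaging two per-host p95 values does not give the p95 of the combined raw data.

```python
from typing import List, Sequence


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Invented latencies in milliseconds for two hypothetical hosts.
host_a: List[float] = [10] * 99 + [1000]  # busy host, almost always fast
host_b: List[float] = [500] * 20          # quiet host, always slow

# What you can do if only per-host p95 values were stored.
avg_of_p95s = (p95(host_a) + p95(host_b)) / 2

# What you could do if the raw points were still available.
true_p95 = p95(host_a + host_b)
```

With these inputs, `avg_of_p95s` is 255 while `true_p95` is 500, so an analysis done after the fact on pre-aggregated values can be far from the real figure.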

Employee #1: Soo does this mean we can migrate to something more suited to our needs? :slight_smile:

Employee #3: That’s a good question that I can’t just blatantly answer. I have had very good success with custom statsd hosting with graphite aggregators or Prometheus. But nowadays I would personally (if it were up to me) advocate we use cloudwatch metrics:

Employee #1: +1 for cloudwatch, though I’m admittedly biased.

Employee #3: It also supports log based metrics which is a hugely beneficial feature.

Employee #3: I’m 100% biased. My team at Twitch started on statsd, moved to datadog, moved to signal and then moved to cloudwatch.

Employee #3: signal fx was bought by splunk and possibly murdered.

Employee #2: One of the things I do like about DataDog is centralized visualization across all applications/systems.

Employee #3: Cloudwatch allows that. The typical pattern is to create a new AWS account, share cw from all other accounts with it and use it as a centralized dashboard app. Gives you the exact same thing. As does using Grafana for dashboards.

Employee #1: As @Employee #3 mentioned, one benefit of cloudwatch is that it’s a supported data source for many viz tools (DD, Grafana, others I’m not aware of), so we could reasonably maintain nice dashboards on another platform.

Employee #3: One thing I am sure of is that cloudwatch alerts are rock solid and that matters more to me.

Employee #3: (We also spend a lot of money sending all our cw data to ddog, but that shouldn’t be a primary decision factor)

Employee #1: Agreed, the built-in monitoring is hugely beneficial. I know many platforms (including DD) have something similar, though I’d wager that CW is the most robust and friendly.

Employee #1: How can we come to a consensus on this? Do we know who the right stakeholders are?

Employee #2: For the centralized CW AWS account, how easy has it been for engineers to add/modify/configure existing and new dashboards, metrics, alarms (etc.) in your past experience? Has it been fairly seamless or are there more hoops to jump through?

Employee #2: @Employee #1, we should get opinions from Employee #5 and Employee #6. There may be others – but those are folks that I am not aware of.

Employee #3: Well also people like @Employee #7 and @Employee #8.

Employee #3: CW dashboards are easy but somewhat barebones.

Employee #1: We essentially did the centralized CW on my last team at Amazon. Once you have the initial PoC of streaming metrics from one account to another, adding additional source accounts and setting up dashboards/monitors is pretty straightforward (obviously you need some familiarity with CW). The main issue we ran into was ensuring that metrics were properly tagged and dimensionalized, but that’s something that e.g. a common metrics library could help greatly with.

Employee #3: At Twitch someone built a “statsd to cloudwatch” library since people were so ingrained with using (link to github.com)

Employee #1: Is anyone opposed to me setting up a meeting with the stakeholders called out here so that we can start working towards consensus? I’d like to make headway as soon as we can so that we’re not digging a deeper hole.

Employee #3: a good common library is recommended, cloudwatch charges by the call, not by the metric, so batching up metrics is a huge money saver
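
The batching idea mentioned here can be sketched in a few lines of Python. This is only an illustration of grouping metrics so that fewer calls are made. The function `batch_metrics`, the metric dictionaries, and the batch size of 20 are hypothetical; the real per-call limit and pricing should be checked against the AWS documentation, and no AWS call is made in this sketch.

```python
from typing import Any, Dict, Iterable, Iterator, List

Metric = Dict[str, Any]


def batch_metrics(metrics: Iterable[Metric],
                  max_per_call: int) -> Iterator[List[Metric]]:
    """Yield lists of at most max_per_call metrics, preserving order."""
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    batch: List[Metric] = []
    for metric in metrics:
        batch.append(metric)
        if len(batch) == max_per_call:
            yield batch
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield batch


# 45 invented data points; each batch would become one API call.
pending = [{"name": "requests", "value": i} for i in range(45)]
calls = list(batch_metrics(pending, max_per_call=20))
call_sizes = [len(call) for call in calls]
```

Here 45 data points turn into 3 calls instead of 45, which is the saving Employee #3 is describing if billing is indeed per call.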

Employee #3: Please do @Employee #1

Employee #2: Nope. Go ahead.

Employee #1: Invite has been sent. Please let me know if anyone would like to be added!

Employee #2: @Employee #1 – for my edification, could you share the other issues you noticed with CW?

Employee #3: The only issue I ever had with cloudwatch was the 5 minute window, but they changed it.

Employee #3: they added high resolution metrics at 1 second, and regular metrics are now at 1 minute.

Employee #3: Oh and I guess like any AWS service, there are limits on it. I think CW has a limit of 150 calls a second. This is another good reason to have more and smaller accounts (and batch metrics)

Employee #1: I’m also not a fan of their paginated log UI, but there are ways to mitigate it.

Employee #1: The main issues I’ve had with it in the past were all related to our own data. I think the platform itself is good.

Employee #3: I never use their stupid console logs UI

Employee #3: I rarely use a web browser when I’m coding. Haha who am I kidding, I haven’t coded in 2 years.

Employee #1: haha my last team ended up writing a greasemonkey plugin to remove the pagination, but getting it out of the browser entirely is even better.

Employee #3: That package I pasted is very good, that’s what I used.

Employee #5: My 2 cents around this – we spent a long time building this out, and I don’t know if it’s a good use of our time to migrate out from it at this stage. If we have concerns about DD, let’s list them out, get DD to investigate them/come back with an answer, and then evaluate the impact on our platforms.

Employee #5: Open to discussing it, but don’t want to spend another few months to move to something that is marginally better. I think there are a lot of other areas that could use improvement and have higher impact than this.

Employee #3: Almost all the data in DD is just from CW, so the data is already in both places.

Employee #1: Want to make sure your concerns are heard @Employee #5. Nothing has been decided yet; let’s discuss on Friday.

I hope you found this helpful. I found it to be a bit concerning, but the sample size is smaller than small.

Note: this conversation occurred a couple of months back, and I do not know the outcome of the meeting.

Best,

~ Assiduous

14 Likes

I hope you found this helpful. I found it to be a bit concerning, but the sample size is smaller than small.

Note: this conversation occurred a couple of months back, and I do not know the outcome of the meeting.

Hi Assiduous,
I think one would have to be a real super techie to understand what the heck they are talking about and whether it has any significance. As you point out “the sample size is smaller than small” (it’s a discussion between three people). I think you are getting lost in the tiny details. I would suggest following the numbers (how is the business doing) where as of now I see no need for concern at present.
Best,
Saul

35 Likes

I think one would have to be a real super techie to understand what the heck they are talking about…

Sampling data loss, might or might not be significant. To know that “one would have to be a real super techie.” :wink:

Denny Schlesinger

2 Likes

I think the important question is: is it a bug or by design?

Hi, after a quick conversation with a friend using Datadog here’s what he said.

“This is no issue, it’s that way by design. That’s why you use unique tags as an identifier. Also distribution metrics solve that”.

So, I never used it myself and can’t really say more to that. But this doesn’t look like a problem at all.

16 Likes

I think one would have to be a real super techie to understand what the heck they are talking about and whether it has any significance. – Saul

There are oddities and blemishes with every company.

One beauty of quarterly reports is that they provide clarity on what is really important. A lot of the navel gazing is not important.

Rob
Former RB and BL Home Fool, Supernova Portfolio Contributor & Maintenance Fool
He is no fool who gives what he cannot keep to gain what he cannot lose.

1 Like

Saul → I would suggest following the numbers (how is the business doing) where as of now I see no need for concern at present.

Clearly the present (or recent past) has been very good operationally for DDOG. Curious what the group thinks the numbers might look like in the near future (say next 3 years or so).

Revenue over the last 4 quarters has been growing impressively, up to $866M, with a positive (but still very low) operating margin. Is anyone forecasting what revenue and margins might be in 3 years? Could they get to $3.5B? Similarly for operating margin: is this a business that can sustain 20%+ margins?

tecmo

Tecmo,

You might want to check the source for your info. DDOG TTM revenue is $1.19B (almost 40% higher than what you note in your post).

Could they get to $3.5B in 3 years? Yes, I believe it’s likely. It would take a 43% revenue growth CAGR over that time. That is not out of the question when growth last Q was in the mid 80%s.
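
For anyone who wants to check the arithmetic, here is a short Python sketch using only the figures quoted in this thread ($866M, $1.19B, $3.5B, 3 years). The helper name `cagr` is just an illustrative choice, and the formula is the standard compound annual growth rate.

```python
def cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate needed to go from start to end."""
    return (end / start) ** (1 / years) - 1


# Revenue figures in billions of dollars, as quoted in the thread.
ttm_revenue = 1.19
target_revenue = 3.5
quoted_in_earlier_post = 0.866

# Growth rate required to reach the target in 3 years.
required_growth = cagr(ttm_revenue, target_revenue, 3)

# How much higher the TTM figure is than the earlier quoted figure.
gap = ttm_revenue / quoted_in_earlier_post - 1
```

`required_growth` comes out at roughly 0.43 (43% per year) and `gap` at roughly 0.37, consistent with the "43%" and "almost 40%" figures above.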

Yes, SaaS businesses at scale can produce FCF margins >30%.

Bnh

8 Likes

Been part of 100s of these types of conversations and was waiting for the punchline to finally come, and it came from Employee #5 when he/she ultimately stated (paraphrased):

“Let’s document our concerns and then get DDOG in here to respond.”

I’d be blown away if these concerns were not addressed by DDOG. They are taking way too much market share for this not to be the case. In the world of software, my experience would lean strongly toward DDOG being able to help this team better implement their use cases.

11 Likes