Datadog is my largest position. It has grown since my early purchases at ~ $32/share (yes, I got in during the early days). I have been buying more DDOG (probably more than I should), as it has been dropping these past few months.
Now to the story.
I work at a tech company in the SF Bay Area. It is a small to midsize company. This morning, I was searching our Slack channels and randomly came across a conversation regarding DDOG among a few of our software developers. After reading through the conversation, I was concerned, then not concerned, then somewhat confused. I thought I would share the transcript with the board, as the conversation may offer food for thought and a lively discussion.
Employee #1: Bringing back the topic of Datadog… I came across this sentence while reading through their docs:
“Any metrics with fractions of a second timestamps are rounded to the nearest second. If any points have the same timestamps, the latest point overwrites the previous ones.”
The word “overwrites” here is concerning: does anyone know whether data points are dropped or just grouped into the same 1-second bucket for querying?
Employee #2: I tried this out locally with a test metric using the XXXX API and it does in fact seem to overwrite the value for the same second-based timestamp. And I am not able to retrieve the “older” value.
- X = 1 for the first submission
- X = 5 for the second comm
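For readers who want to picture what Employee #2 observed, here is a minimal sketch of the behavior the quoted docs describe: timestamps rounded to the nearest second, with the latest point overwriting earlier ones. This is only a toy model, not Datadog's implementation. The function name `store_points` and the sample timestamps are hypothetical, and the exact tie-breaking rule for rounding is an assumption.

```python
from typing import Dict, List, Tuple


def store_points(points: List[Tuple[float, float]]) -> Dict[int, float]:
    """Toy model: one stored value per whole-second timestamp, last write wins."""
    stored: Dict[int, float] = {}
    for timestamp, value in points:
        # Round to the nearest second (half rounds up; assumes non-negative timestamps).
        second = int(timestamp + 0.5)
        # A later point with the same rounded timestamp replaces the earlier one.
        stored[second] = value
    return stored


# Two submissions that land in the same second: X = 1, then X = 5.
submissions = [(1700000000.10, 1.0), (1700000000.40, 5.0)]
print(store_points(submissions))  # only the value 5.0 survives for that second
```

In this model the first value (1) cannot be recovered afterwards, which matches what Employee #2 reported.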
Employee #2: @Employee #3 and @Employee #4 – would you have any insights and/or thoughts on this?
Employee #1: Yikes! Thanks for testing this out, @Employee #2. If this is indeed the case, then I think we’ll want to quickly find another solution.
Employee #3: Hi. Yes, this is expected. Dogstatsd is based on statsd and metrics do not accumulate. I would expect that you would have to use additional dimensions (for example machine-name or the like) to get two different points of the same metric at the same timestamp.
Employee #3: In typical statsd implementations the path includes metadata such as the machine name or IP, and then metric math is done to combine those. I can’t imagine dogstatsd is different here. It’s hard for the backend to know how to combine metrics (avg, sum) for counts, let alone gauges.
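Employee #3's point about dimensions can be sketched as follows. If the stored series is keyed by metric name plus tags (for example the host name), two points with the same timestamp no longer collide, and the combining (here a sum) is done explicitly at query time. This is a simplified illustration under those assumptions. The names `store` and `sum_across_tags` and the `host` tag are hypothetical, not part of any Datadog or statsd API.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Tags = FrozenSet[Tuple[str, str]]
SeriesKey = Tuple[str, Tags, int]  # (metric name, tags, whole-second timestamp)
Point = Tuple[str, Dict[str, str], int, float]


def store(points: List[Point]) -> Dict[SeriesKey, float]:
    """Keep one value per (metric, tags, timestamp); last write wins per key."""
    stored: Dict[SeriesKey, float] = {}
    for name, tags, timestamp, value in points:
        stored[(name, frozenset(tags.items()), timestamp)] = value
    return stored


def sum_across_tags(stored: Dict[SeriesKey, float], name: str) -> Dict[int, float]:
    """Query-time 'metric math': sum every tagged series of one metric per timestamp."""
    totals: Dict[int, float] = defaultdict(float)
    for (metric, _tags, timestamp), value in stored.items():
        if metric == name:
            totals[timestamp] += value
    return dict(totals)


points = [
    ("X", {"host": "machine-a"}, 1700000000, 1.0),
    ("X", {"host": "machine-b"}, 1700000000, 5.0),
]
stored = store(points)
print(len(stored))                    # both points are kept
print(sum_across_tags(stored, "X"))   # combined explicitly at query time
```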
Employee #1: DogStatsD is different than their API though. @Employee #2’s testing was through the API afaik.
Employee #1: There’s some info available here (link to doc.datadoghq.com). The short version is that it looks like you’re required to use their custom agent if you want support for HISTOGRAM and DISTRIBUTION metric types.
Employee #1: Their docs also seem to indicate that all statistical analysis (e.g. calculation of p95) has to be done up-front, meaning you can’t just submit metrics and run arbitrary analysis later. If that’s true, then this seems extremely restrictive.
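To show why computing percentiles up-front can be restrictive, here is a small sketch with made-up numbers. Once each source has reduced its raw values to a p95, those p95s cannot be combined into the true p95 of the merged data. The `percentile` helper uses the nearest-rank method, which is one common definition and not necessarily the one any vendor uses. The host names and values are hypothetical.

```python
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100 and a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, so there is no floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


host_a = [float(v) for v in range(1, 101)]  # 100 fast requests: 1..100 ms
host_b = [1000.0] * 10                      # 10 slow requests: 1000 ms each

p95_a = percentile(host_a, 95)
p95_b = percentile(host_b, 95)
averaged = (p95_a + p95_b) / 2              # what you get from pre-aggregated p95s
true_p95 = percentile(host_a + host_b, 95)  # what you get from the raw values

print(p95_a, p95_b, averaged, true_p95)
```

With these inputs the averaged figure and the p95 of the merged raw values disagree, which is the kind of after-the-fact analysis Employee #1 is worried about losing.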
Employee #3: DogStatsD is StatsD with dimensions added. StatsD supports HISTOGRAM and DISTRIBUTION.
Employee #3: That is true (the p95 thing) and pretty typical of statsd based systems. It’s one of the reasons (but far from the only one) I dislike ddog (the real reason is that they just had high latency on alerts or forgot to send them).
Employee #1: Soo does this mean we can migrate to something more suited to our needs?
Employee #3: That’s a good question that I can’t just blatantly answer. I have had very good success with a custom statsd hosting with graphite aggregators or Prometheus. But nowadays I would personally (if it were up to me) advocate we use cloudwatch metrics:
Employee #1: +1 for cloudwatch, though I’m admittedly biased.
Employee #3: It also supports log based metrics which is a hugely beneficial feature.
Employee #3: I’m 100% biased. My team at Twitch started on statsd, moved to datadog, moved to signal and then moved to cloudwatch.
Employee #3: signal fx was bought by splunk and possibly murdered.
Employee #2: One of the things I do like about DataDog is centralized visualization across all applications/systems.
Employee #3: Cloudwatch allows that. The typical pattern is to create a new AWS account, share cw from all other accounts with it and use it as a centralized dashboard app. Gives you the exact same thing. As does using Grafana for dashboards.
Employee #1: As @Employee #3 mentioned, one benefit of cloudwatch is that it’s a supported data source for many viz tools (DD, Grafana, others I’m not aware of), so we could reasonably maintain nice dashboards on another platform.
Employee #3: One thing I am sure of is that cloudwatch alerts are rock solid and that matters more to me.
Employee #3: (We also spend a lot of money sending all our cw data to ddog, but that shouldn’t be a primary decision factor)
Employee #1: Agreed, the built-in monitoring is hugely beneficial. I know many platforms (including DD) have something similar, though I’d wager that CW is the most robust and friendly.
Employee #1: How can we come to a consensus on this? Do we know who the right stakeholders are?
Employee #2: For the centralized CW AWS account, how easy has it been for engineers to add/modify/configure existing and new dashboards, metrics, alarms (etc.) in your past experience? Has it been fairly seamless or are there more hoops to jump through?
Employee #2: @Employee #1, we should get opinions from Employee #5 and Employee #6. There may be others – but those are folks that I am not aware of.
Employee #3: Well also people like @Employee #7 and @Employee #8.
Employee #3: CW dashboards are easy but somewhat barebones.
Employee #1: We essentially did the centralized CW on my last team at Amazon. Once you have the initial PoC of streaming metrics from one account to another, adding additional source accounts and setting up dashboards/monitors is pretty straightforward (obviously you need some familiarity with CW). The main issue we ran into was ensuring that metrics were properly tagged and dimensionalized, but that’s something that e.g. a common metrics library could help greatly with.
Employee #3: At Twitch someone built a “statsd to cloudwatch” library since people were so ingrained with using (link to github.com)
Employee #1: Is anyone opposed to me setting up a meeting with the stakeholders called out here so that we can start working towards consensus? I’d like to make headway as soon as we can so that we’re not digging a deeper hole.
Employee #3: a good common library is recommended, cloudwatch charges by the call, not by the metric, so batching up metrics is a huge money saver
Employee #3: Please do @Employee #1
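The batching idea Employee #3 describes could look roughly like this. It is a sketch, not the library mentioned in the chat. The helper `batch_metrics`, the `MyApp` namespace, and the default batch size of 20 are assumptions, and the real per-call limit should be checked against the current CloudWatch documentation. The dictionaries follow the shape that boto3's `put_metric_data` accepts in its `MetricData` list, and the actual AWS call is left as a comment so the example runs without credentials.

```python
from typing import Dict, Iterator, List


def batch_metrics(metric_data: List[Dict], batch_size: int = 20) -> Iterator[List[Dict]]:
    """Split buffered metric datums into chunks so each chunk needs only one API call."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(metric_data), batch_size):
        yield metric_data[start:start + batch_size]


# 45 buffered datums (hypothetical metric names and values).
buffered = [
    {"MetricName": f"requests_{i}", "Value": float(i), "Unit": "Count"}
    for i in range(45)
]

calls = 0
for batch in batch_metrics(buffered):
    # With boto3 this would be something like:
    #   cloudwatch = boto3.client("cloudwatch")
    #   cloudwatch.put_metric_data(Namespace="MyApp", MetricData=batch)
    calls += 1

print(calls)  # 3 calls instead of 45 one-datum calls
```

If pricing really is per call, as Employee #3 says, then 45 datums sent in 3 calls instead of 45 would cut that part of the bill accordingly.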
Employee #2: Nope. Go ahead.
Employee #1: Invite has been sent. Please let me know if anyone would like to be added!
Employee #2: @Employee #1 – for my edification, could you share the other issues you noticed with CW?
Employee #3: The only issue I ever had with cloudwatch was the 5 minute window, but they changed it.
Employee #3: they added high resolution metrics for 1 secondly and regular are now 1 minutely.
Employee #3: Oh and I guess like any AWS service, there are limits on it. I think CW has 150 calls a second limit. This is another good reason to have more and smaller accounts (and batch metrics)
Employee #1: I’m also not a fan of their paginated log UI, but there are ways to mitigate it.
Employee #1: The main issues I’ve had with it in the past were all related to our own data. I think the platform itself is good.
Employee #3: I never use their stupid console logs UI
Employee #3: I rarely use a web browser when I’m coding. Haha who am I kidding, I haven’t coded in 2 years.
Employee #1: haha my last team ended up writing a greasemonkey plugin to remove the pagination, but getting it out of the browser entirely is even better.
Employee #3: That package I pasted is very good, that’s what I used.
Employee #5: My 2 cents around this – we spent a long time building this out, and I don’t know if it’s a good use of our time to migrate out from it at this stage. If we have concerns about DD, let’s list them out, get DD to investigate them/come back with an answer, and then evaluate the impact on our platforms.
Employee #5: Open to discussing it, but don’t want to spend another few months to move to something that is marginally better. I think there are a lot of other areas that could use improvement and have higher impact than this.
Employee #3: Almost all the data in DD is just from CW, so the data is already in both places.
Employee #1: Want to make sure your concerns are heard @Employee #5. Nothing has been decided yet; let’s discuss on Friday.
I hope you found this helpful. I found it to be a bit concerning, but the sample size is smaller than small.
Note: this conversation occurred a couple of months back, and I do not know the outcome of the meeting.