DDOG shines at managing chaos

I don’t post much on this forum, I mostly lurk and learn. I am thankful for the many here who post great insights and perspective, it has really helped me and my returns. This is one opportunity where I thought I may be able to help others with some of my thoughts.

In my IT-related role, I gain first-hand experience or research topics for strategic planning purposes. In a thread on another forum comparing CRWD and DDOG as investments, multiple respondents chimed in about the virtues of CRWD and why it is a good investment. I felt the urge to share some thoughts about what DDOG does for companies like mine, so I posted the following.

I am invested in both [CRWD and DDOG] for different reasons. These players operate in different niches, but they are two sides of the same coin.

As [name withheld] noted, CRWD is focused on endpoint security, which is growing increasingly important with the proliferation of mobile and IoT. With all sorts of intelligent computing devices popping up everywhere, security is very important, so CRWD is benefiting from the trend. But just as important is monitoring, tracking, and managing all those endpoints as they interact with each other. That is where DDOG is essential.

As anyone even minimally involved in IT knows, the above proliferation has led to IT sprawl. Servers, workstations, laptops, phones, and millions of different devices are contributing and consuming data at exponential rates. Knowing what all these endpoints are up to, how they are performing, and if they are doing something unexpected is where DataDog shines.

These days there are literally billions, if not trillions, of lines of logs being generated every day in a typical enterprise by the sprawling endpoints mentioned above. No human can track even a tiny fraction, let alone all of it, even in the smallest enterprise. You need tools that can rapidly collect, ingest, categorize, analyze and act on all that noise in near real time. That’s what DataDog does.

DataDog provides visibility into enterprise systems: IT personnel can instantly assess how their systems are behaving through easy-to-understand visual dashboards. Out of the box, the tooling is already smart enough to ingest and understand feeds from common systems, such as AWS and Microsoft Azure services and other widely used servers, appliances, and IT endpoints. It also has easy-to-use scripting and ingest capabilities for creating custom monitors and interfaces for less common endpoint types.
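To make the custom-ingest idea concrete, here is a minimal sketch of what such a script might look like: parsing a log line from an uncommon appliance into a (metric, value, tags) sample that a monitoring agent could forward. The log format, field names, and metric name are all invented for illustration; DataDog's real custom checks use the Agent's own Python API rather than this standalone function.

```python
# Hypothetical custom ingest script: turn an appliance's log line
# into a metric sample. Format and names are invented for the example.
import re

LINE_RE = re.compile(
    r"(?P<ts>\S+) device=(?P<device>\S+) temp_c=(?P<temp>\d+(\.\d+)?)"
)

def parse_line(line: str):
    """Turn one log line into a metric sample, or None if it doesn't match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    return ("appliance.temp_c", float(m.group("temp")),
            [f"device:{m.group('device')}"])

sample = parse_line("2020-11-03T08:00:00Z device=hvac-12 temp_c=71.5")
# sample == ("appliance.temp_c", 71.5, ["device:hvac-12"])
```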

Another major benefit of a tool set like DataDog is the rule-sets you can create with it, enabling IT to highlight and act upon anomalies and exceptions, triggering alerts, or even automating responses to events as they occur in real time. It gives the IT team the power to react instantly to threats and irregularities with the greatest of ease.
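The core of such a rule-set can be sketched in a few lines. This is a toy model, not DataDog's actual API; the rule names, metrics, and thresholds are invented for illustration.

```python
# Toy threshold rule-set: each rule names a metric, a threshold, and
# an action to fire when the threshold is breached.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    metric: str
    threshold: float
    action: Callable[[str, float], None]

def evaluate(rules: list[Rule], readings: dict[str, float]) -> list[str]:
    """Fire the action for every rule whose metric exceeds its threshold."""
    triggered = []
    for rule in rules:
        value = readings.get(rule.metric, 0.0)
        if value > rule.threshold:
            rule.action(rule.name, value)
            triggered.append(rule.name)
    return triggered

alerts = []
rules = [
    Rule("high-latency", "p95_latency_ms", 500.0,
         lambda name, v: alerts.append(f"{name}: {v}")),
    Rule("error-spike", "errors_per_min", 50.0,
         lambda name, v: alerts.append(f"{name}: {v}")),
]

# Only the latency rule fires for these readings.
triggered = evaluate(rules, {"p95_latency_ms": 820.0, "errors_per_min": 12.0})
```

In a real deployment the action would page someone or hit a webhook rather than append to a list, but the evaluate-and-fire loop is the essential shape.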

Let’s take an example: suppose the marketing team releases a new campaign to boost sales of some widget, and the campaign suddenly drives millions of new buyers to your e-commerce site. Your DataDog monitors have been tracking usage in real time, and the traffic rule-set you built highlights that website traffic is up 20% over normal, and customers are starting to experience delays exceeding 20 seconds while trying to navigate your online catalog and check out. Worse, your mobile traffic is up 90% over normal, those customers are experiencing one-minute delays, and it is only 8:00am.

Your DataDog monitors automatically trigger alerts to key personnel and management, highlight the abnormal activity on the dashboard, and instantly fire off automated events to start up two additional web servers and five additional mobile servers. Within one minute of the first alert, your systems have automatically addressed the surge, and a harmful site crash is averted.
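The remediation step in that story can be sketched as a pure decision function. Everything here is invented for illustration: the delay cutoff, the capacity-per-server figures, and the function itself. A real setup would wire DataDog monitors to webhooks or an orchestrator rather than call anything like this directly.

```python
# Illustrative auto-remediation logic for the surge scenario above.
def servers_to_add(traffic_pct_over_normal: float,
                   avg_delay_seconds: float,
                   pct_per_server: float) -> int:
    """Decide how many extra servers to start during a traffic surge.

    Scale only when users are actually seeing delays, then add one
    server per `pct_per_server` points of excess traffic.
    """
    if avg_delay_seconds < 5.0:  # users not yet hurting; do nothing
        return 0
    return max(1, round(traffic_pct_over_normal / pct_per_server))

# The surge from the story: web traffic +20% with 20 s delays, and
# mobile traffic +90% with 60 s delays. Assumed capacity: one web
# server absorbs ~10 points of excess traffic, one mobile server ~18.
web_servers = servers_to_add(20.0, 20.0, pct_per_server=10.0)     # 2
mobile_servers = servers_to_add(90.0, 60.0, pct_per_server=18.0)  # 5
```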

That’s a simple use case, but there are far more complex situations where the power of a tool set like DataDog can thwart hackers, trigger alerts and notifications, fire off automated responses to complex irregularities, and allow a small group of technologists to monitor vast, sprawling networks of systems and endpoints that would otherwise be impossible to oversee. Besides powerful real-time monitoring, these tools are also excellent at producing metrics and giving corporations insight into their IT landscape, so they can better plan their growth and react to changing conditions and demands. Imagine being able to know instantly exactly how many users are currently browsing your catalog, how many have a certain item in their shopping cart, how quickly your inventory is being depleted, how many are abandoning their carts, how many are using fraudulent credit cards, and on and on.

In the complex world of modern IT, the proliferation of technology would not be possible if there weren’t tools like DataDog or Sumo Logic and their ilk to enable enterprises to monitor and manage it effectively. These tool sets may be less visible in the headlines, and they may not get a lot of credit for stopping a state-backed hacking attack, but modern IT operations would not be possible without them. Think of them as the ECU in your car, monitoring combustion events, ignition, valves, injectors, pumps, heat, cooling, and all the other things going on in a modern engine. These new engines would blow up in seconds if it weren’t for modern engine management systems that keep things just shy of the explosive edge of self-destruction.

CRWD and DDOG, among others, tame all that chaos, and enable innovators to extend the digital frontier.


DataDog and CrowdStrike each offer the best product in their respective niches, which is allowing them to grow at tremendous rates. As the old saying goes, ‘Nobody ever got fired for buying IBM.’ That logic now applies to AWS as well, and in turn to companies such as DataDog and CrowdStrike.

In addition to having personal experience using and configuring DataDog, what stood out most to me in DataDog’s S-1 was that some customers had more DataDog users than engineers. They highlighted an unnamed company with 800 engineers and over 1,000 DataDog users. The extra users are management, who get a single pane of glass to see how systems are doing. For example, a VP can log in to DataDog, see that one of their systems is generating a lot of errors, and divert resources to help that division of the company.

I am noticing that Cloudflare also advertises a single pane of glass, where any user can get a global view of how security is performing. I have heard from a few engineers over the years that working with Cloudflare is superior to Fastly and AWS’s offering, which led me to wonder why Fastly was outperforming Cloudflare so handily earlier this year. It has taken years to play out in the marketplace, but now Cloudflare is pulling ahead and Fastly is being put up for sale.

CrowdStrike sounds like a massive hive mind where every threat against the system only makes the entire system stronger. It is telling that, in the wake of the SolarWinds hack, SolarWinds is securing its own internal systems with CrowdStrike.


Someone asked me off board:
what role (either direct or indirect) might DataDog have in counteracting state-backed hacking attacks?

I’m not an expert on countering cyber attacks, but I do work with folks who are very good at it, and they use these tools in their day to day work.

One of the great features of log aggregation and analysis platforms like DataDog is the ability to scan live data, incrementally or en masse, and graph the results visually for easy interpretation. From my own hands-on experience, I’ve used these tools to create rules that alert us in real time when usage or behavior outside normal thresholds is detected.

For example: one of the sensitive systems I monitor logs over 200 million events per month. We actively monitor the origins, frequency, and volume of interactions with our systems, both incoming and outgoing. We also monitor failed versus successful attempts, retries, and trends in the activity. The tools also provide sophisticated graphics to identify outliers, either outside thresholds or where a measured value changes too rapidly. For example, the normal range of activity for a certain endpoint might be a gradual buildup from 20 events/minute early in the morning to a peak of 200 events/minute at midday, tapering off slowly through the afternoon. But if the endpoint starts at 20 and then suddenly spikes to 150 events/minute at 6:30am, that would be abnormal behavior and it would get flagged as suspicious. In our case we would do some deeper analysis — we might zoom in on the actual packets, and we might inspect other metrics from that endpoint — but more aggressive responses could be warranted for certain systems.
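The "sudden spike" check just described boils down to a rate-of-change rule. Here is a toy version; the jump factor and the zero-activity handling are invented for illustration, and a real monitor would smooth over more than two samples.

```python
# Toy spike detector: flag a reading that jumps too far, too fast,
# relative to the previous sample.
def is_suspicious(prev_rate: float, curr_rate: float,
                  max_jump_factor: float = 3.0) -> bool:
    """Flag a jump of more than `max_jump_factor` times between samples."""
    if prev_rate <= 0:
        # Any activity after total silence gets a closer look.
        return curr_rate > 0
    return curr_rate / prev_rate > max_jump_factor

# Normal morning ramp: 20 -> 25 events/minute is fine.
assert not is_suspicious(20, 25)
# The example above: 20 events/minute spiking to 150 at 6:30am.
assert is_suspicious(20, 150)
```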

When things are normal and activity is within acceptable thresholds, the charts on the dashboard are really just for a quick-glance confirmation. A lot of these tools usually run in passive autopilot mode. But when the upper or lower thresholds are breached, things begin to light up, and notifications are immediately dispatched to certain team members over chat, text messages, and email. In some more serious situations, we use webhooks to trigger automated responses and countermeasures. Sometimes it’s like watching a science fiction movie: all sorts of visual cues go off, people’s phones start buzzing, and a flurry of activity occurs. When we were still working in our offices, people would gather around the monitors, furiously typing on their laptops.

My interest is in responsiveness and performance, but the same tools and methods can be used to detect malicious behavior. Those who earn their living countering cyber attacks use these tools in even more sophisticated ways. For example, they may be tracking certain TCP/IP ranges for suspicious activity from external threats. They may be looking for error rates that suggest someone or something is attempting to penetrate or attack a certain endpoint. They could monitor someone’s activity and compare it, in an instant, to that user’s activity over the past few weeks to see if their behavior has changed. There are all sorts of things you can do when you have all this data at your fingertips and a tool that can monitor it, analyze it, and track it. It is astonishing how fast and effective these tools are.
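The "compare a user to their own recent history" idea can be sketched as a simple z-score against a few weeks of daily activity counts. Real behavioral analytics are far more sophisticated; the data, threshold, and function here are made up for illustration.

```python
# Toy baseline check: flag today's activity if it sits too many
# standard deviations from the user's own historical mean.
from statistics import mean, stdev

def behavior_changed(history: list[float], today: float,
                     z_threshold: float = 3.0) -> bool:
    """True if `today` is more than `z_threshold` standard deviations
    from the user's historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

# Two weeks of a user's daily event counts, hovering around 40/day,
# then one day with a sudden burst of activity.
baseline = [42, 38, 45, 40, 37, 44, 41, 39, 43, 40, 38, 42, 44, 41]
normal_day = behavior_changed(baseline, 43)   # within normal range
burst_day = behavior_changed(baseline, 400)   # flagged
```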

On a weekly, monthly, and quarterly basis, I produce reports and charts from these tools on various metrics for my leadership, and you would be shocked at how quickly these tools can analyze hundreds of GB of data, and how granular and graphical the results can get. These are not the old clunky command-line green-screen tools of years past.