Enabling Better Detections with Data Pipelines
Smarter data for smarter detections
Every security practitioner would agree that a core pillar of a mature SOC is strong Detection Engineering. Teams want high-fidelity alerts, fewer false positives, and less overall noise. SOAR and hyper-automation have attempted to address many of the redundancies we see today by introducing AI and automating where it makes sense. But in many cases, those same workflows would be far more effective if the logic were built directly into the data pipeline, rather than bolted on downstream.
The "left shift"
The point in time when a security alert is created can be considered our anchor point for this discussion. Logs are ingested. Events are correlated. Alerts are generated. For years, the approach to noise reduction has generally been to tune with allow/block lists. What remains is the noise we can't outright block or drop, yet it still provides little context. SOAR and hyper-automation are a shift-right approach to righting the ship: by working alerts after they have surfaced, analysts use workflows to streamline repetitive tasks and portions of the triage process.
A typical workflow looks something like this:
Get alert -> Enrich assets -> Map hosts/usernames -> Scan IOCs -> Append results
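To make that sequence concrete, here is a minimal Python sketch of the same post-alert flow. The inventory and threat-intel dictionaries are hypothetical stand-ins for whatever connectors a SOAR platform would actually call (a ticketing API, a CMDB, a threat-intel service), and the alert is assumed to have already been fetched from the SIEM.

```python
# Hypothetical stand-ins for SOAR connectors (CMDB, threat-intel service).
ASSET_INVENTORY = {"web-01": {"owner": "jdoe", "criticality": "high"}}
THREAT_INTEL = {"203.0.113.10": "known-c2"}

def triage(alert: dict) -> dict:
    # "Get alert" has already happened; the alert dict is the input.
    asset = ASSET_INVENTORY.get(alert["host"], {})              # enrich assets
    alert["owner"] = asset.get("owner", "unknown")              # map hosts/usernames
    alert["criticality"] = asset.get("criticality", "unknown")
    alert["ioc_verdicts"] = {                                   # scan IOCs
        ioc: THREAT_INTEL.get(ioc, "clean") for ioc in alert.get("iocs", [])
    }
    return alert                                                # append results

print(triage({"host": "web-01", "iocs": ["203.0.113.10"]}))
```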
To reiterate, these steps all happen to the "right" of the alert. This workflow offloads a significant number of tasks from the analyst, but it does very little to address the original issues of noise and false positives. Those issues haven't gone away; they are just hidden under the umbrella of automation and have become "out of sight, out of mind."
Instead of sweeping the problem under the rug, SOC teams are beginning to address these issues closer to their source. Until recently, very few capabilities existed to take a common automation workflow and apply it prior to ingestion. Security Data Pipeline Platforms (SDPPs) like Cribl have created a data-plane automation layer that now allows us to shift left of the alert.
Efficient data pipelining
If we examine a typical organization, we won't find a significant increase in new alerts and compromises over time. What we will find is an explosion in the amount of telemetry data. We can automate all day, but that doesn't address our ingestion costs or noise. It simply pushes the real issues further down the road.
Data pipelines began as a means to reduce ingestion and storage costs for a SIEM. Capabilities have accelerated quickly since then, and the original core functions of transforming and reducing data now feel like table stakes. Capabilities such as dynamic routing, deduplication, tagging, normalization, enrichment, and lookups have become the new standard. We could delve even further into topics like federated search, storage, rehydrating data, and SIEM migrations, but we'll save that discussion for another day. Instead, I want to focus on how these new standards are changing the workflow for SOCs.
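As a rough illustration of two of those capabilities, here is what dynamic routing and deduplication can look like at the pipeline layer. This is a simplified sketch, not any vendor's actual API; the destination names and the severity field are assumptions.

```python
import hashlib
import json

seen_hashes: set = set()

def route(event: dict):
    """Deduplicate an event, then decide where it should be delivered."""
    digest = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
    if digest in seen_hashes:
        return None                          # drop exact duplicates
    seen_hashes.add(digest)

    # Dynamic routing: security-relevant events go to the SIEM,
    # everything else goes to cheap object storage for later searching.
    if event.get("severity") in ("high", "critical"):
        return "siem"
    return "object_storage"

print(route({"severity": "critical", "msg": "failed logon burst"}))  # siem
print(route({"severity": "info", "msg": "heartbeat"}))               # object_storage
print(route({"severity": "info", "msg": "heartbeat"}))               # None (duplicate)
```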
Examples from the field
"Shift Left" is a relatively new term that describes the dynamics between components of data ingestion, alerting, and workflow actions. However, today I am not trying to coin a new phrase. I simply want to explain a concept, and the best way to do that is through the examples we are seeing in the field.
The first example illustrates how customers are starting to utilize their asset inventories in conjunction with their data pipelines. This allows organizations to provide much more context, not just to alerts, but to logs in general. Not only does this reduce the amount of work required during triage, but it also introduces entirely new tagging schemas that help make alerts smarter and more relevant. Host and user context applied early on results in fewer false alerts and higher fidelity.
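Here is a minimal sketch of what that can look like in practice, assuming a simple in-memory asset inventory (in reality, a CMDB export or API-backed lookup refreshed on a schedule) and hypothetical field names:

```python
# Hypothetical asset inventory; in practice, a CMDB export kept current.
ASSET_INVENTORY = {
    "10.1.4.22": {"hostname": "fin-db-01", "owner": "finance", "tier": "crown-jewel"},
    "10.1.9.80": {"hostname": "lab-test-03", "owner": "qa", "tier": "lab"},
}

def enrich(log: dict) -> dict:
    """Stamp host context onto a raw log before it ever reaches the SIEM."""
    asset = ASSET_INVENTORY.get(log.get("src_ip"), {})
    log["asset_hostname"] = asset.get("hostname", "unknown")
    log["asset_owner"] = asset.get("owner", "unknown")
    log["asset_tier"] = asset.get("tier", "unknown")
    # Tags like these let rules ignore lab noise and prioritize crown jewels.
    log["tags"] = [asset["tier"]] if "tier" in asset else []
    return log

print(enrich({"src_ip": "10.1.4.22", "action": "login_failed"}))
```

Downstream, a detection rule can key off a field like asset_tier instead of maintaining its own host lists.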
Another example is the use of lookup tables early in the data pipeline. One of the most common triage steps is appending information to users, hosts, and IPs through lookup tables or API calls. These lookups play a significant role in the outcome of investigations, but until recently the process has been strictly reactive. As with asset inventories, data pipelines can enrich logs, not just alerts. Teams not only achieve higher fidelity and reduced noise; full control over labeling and tagging also allows for much quicker rule creation tailored to specific environments.
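A rough sketch of that idea using a team-maintained lookup table follows. The CSV content and field names are made up for illustration; a real deployment would load the table from a file or an API and refresh it on a schedule.

```python
import csv
import io

# Hypothetical lookup table a team might maintain for VPN users.
VPN_USERS_CSV = """user,department,mfa_enrolled
asmith,engineering,true
jdoe,finance,false
"""

USERS = {row["user"]: row for row in csv.DictReader(io.StringIO(VPN_USERS_CSV))}

def apply_lookup(log: dict) -> dict:
    """Join user context onto a log at ingest instead of during triage."""
    match = USERS.get(log.get("user"), {})
    log["department"] = match.get("department", "unknown")
    log["mfa_enrolled"] = match.get("mfa_enrolled", "unknown")
    return log

print(apply_lookup({"user": "jdoe", "event": "vpn_login"}))
```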
Lastly, log grouping and reduction. Many SIEMs today can group alerts together by entities, IPs, and other IOCs. That solves the noise issue, but it doesn't solve the cost issue. Yes, this is a conversation about improving our detections, but data pipelines have cost-reduction strengths that a lot of teams aren't aware of. Aside from dynamic routing as a cost-control mechanism, logs can also be grouped or counted together at the ingestion layer. The result: lower ingestion rates and less workload on the SIEM.
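As a simple sketch of grouping at the ingestion layer, the snippet below rolls repetitive events up into counts per source and event type; the batch window and field names are illustrative assumptions.

```python
from collections import Counter

def rollup(events: list) -> list:
    """Collapse repetitive events into per-(source, type) counts before forwarding."""
    counts = Counter((e["src_ip"], e["event_type"]) for e in events)
    return [
        {"src_ip": ip, "event_type": etype, "count": n}
        for (ip, etype), n in counts.items()
    ]

batch = [
    {"src_ip": "10.0.0.5", "event_type": "firewall_deny"},
    {"src_ip": "10.0.0.5", "event_type": "firewall_deny"},
    {"src_ip": "10.0.0.9", "event_type": "firewall_deny"},
]
print(rollup(batch))  # three raw events become two summarized events
```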
Better data outcomes
Hopefully, by now, you have a better understanding of what it means to shift left and the impact it can have. Data pipelines have grown from a "nice-to-have" addition into a core pillar of most SOCs today. They are no longer solely focused on directing traffic; they are also focused on increasing efficiency across many areas.
- Higher Detection Accuracy: How many alerts have you come across where one extra piece of enrichment could have negated an entire investigation? How many workflows have completed, only for a "human in the loop" step to still require analysts to confirm and close them out? If we shifted some of those workflows to before ingestion, logs would carry much more context, and detection engineering could focus on accuracy rather than just reduction.
- Quicker Rule Development: A long-standing problem with creating rules is that the logic itself is rarely the issue; the diversity of log formats is. Parsers help to an extent, but in my experience, SIEM vendors don't provide many solutions for unrecognized formats, unstructured data, or missing fields. Data pipeline vendors offer much more robust out-of-the-box support along with granular customization. Instead of detection engineers spending 25-50 lines of code cross-correlating data sources, they can focus on common schemas with pre-enriched data (see the sketch after this list).
- Noise Reduction: Controlling how data is processed, filtered, and shaped upstream addresses a large share of "non-actionable" alerts and noise. There is no magic button that eliminates all false positives, but noise reduction is a key benefit of proper data management.
- Faster Response and Remediation: A byproduct of noise reduction and higher-fidelity alerts is that analysts almost always get to the alerts that matter faster and remediate much sooner.
- Cost Optimization: Detection engineers can often feel under the microscope now that many SIEM pricing models have moved to workload-based subscriptions. Large search scopes, subqueries, transforms, and entity stitching are all common inputs within detection logic. Much of the pre-processing that happens in the query or through a workflow can now be completed before ingestion. Queries become more efficient, and workflow actions can be reduced by shifting those processes to the left of the alert.
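To illustrate the rule-development point above, here is a rough sketch of what a detection can look like once events arrive pre-enriched and normalized to a common schema. The field names (action, asset_tier, mfa_enrolled) are assumptions carried over from the earlier sketches, not any specific SIEM schema.

```python
def is_suspicious(event: dict) -> bool:
    """A detection that needs no joins at query time: the pipeline already
    stamped asset and identity context onto the event."""
    return (
        event.get("action") == "login_failed"
        and event.get("asset_tier") == "crown-jewel"
        and event.get("mfa_enrolled") == "false"
    )

print(is_suspicious({
    "action": "login_failed",
    "asset_tier": "crown-jewel",
    "mfa_enrolled": "false",
}))  # True
```

The equivalent query against raw, unenriched logs would have to join the CMDB and identity data itself at search time, which is where those 25-50 lines of cross-correlation tend to come from.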
Conclusion
Many SOC teams still heavily rely on automation tools to perform alert enrichment, which only pushes many of the underlying issues further down the road. As data pipeline vendors evolve and their capabilities grow, shifting our workflows to enrich logs with the "who, what, where, when, and why" prior to ingestion can greatly enhance our detections, reduce noise, and provide higher-fidelity alerts. Better data management will always lead to a more efficient security practice.