AI In Security Data Pipelines 

Security Data Pipeline Platforms (SDPPs) have become a fast-growing market, driven by high SIEM costs and the explosion of data. Log reduction and transformation were the original answer: they let organizations lower their expenses while still ingesting the data they need to detect the right threats. Like most tools, though, pipeline platforms have evolved to meet the new demands the market places on them.

It's no secret that organizations are embedding AI wherever it adds value, solving historically complex problems without the burden of additional staff or resources. Top vendors recognize their customers' pain and are leaning into hard problems by integrating AI into day-to-day functions. Data pipelines are no longer tools that simply reduce and transform logs; they are now the foundation of a true data strategy.

Whether organizations are building in-house LLMs, leveraging robust data schemas, or ingesting data for cross-organizational use, a solid data strategy is the backbone of every initiative. With AI in the pipeline, customers can be far more agile and move quickly on their priorities. What once took months of pain and resources can now be done in weeks, and far more efficiently. So where exactly are we seeing AI assist in the data pipeline space? Let's talk about the three main findings our team uncovered during a three-month vendor exercise.

Log Reduction & SIEM Feedback Loops 

We can't start the conversation anywhere other than log reduction. Vendors have provided the means to reduce logs for the last few years, but doing it well still requires a deep understanding of your environment and your data. It's something any analyst or engineer would struggle with in year one.

Now, AI is here to lend a helping hand. Databahn, which shocked us all with its deep, AI-first capabilities, is tackling this problem in a very targeted way. Through its bidirectional feedback loop, the platform can reach into your SIEM, find your rules, detections, and dashboards, and map them to the MITRE ATT&CK framework to make logical recommendations.

From day one, before anyone has built an understanding of the data, it can insert itself into daily SOC life and begin to provide tuning recommendations with scalpel-like precision. Customers can execute these actions automatically or route them through a human-in-the-loop approval step. The weeks and months that employees spent combing through queries and detections to make sure they wouldn't drop a needed field are coming to an end.
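
To make the idea behind a SIEM feedback loop concrete, here is a minimal Python sketch of the core check: which fields in a log source are never referenced by any detection? This is not Databahn's implementation. The query strings, field names, and the `recommend_drops` helper are all hypothetical, and a real platform would parse the SIEM's actual query language rather than match tokens.

```python
import re

def referenced_fields(queries, known_fields):
    """Return the known fields that appear as whole tokens in any query."""
    known = set(known_fields)
    used = set()
    for query in queries:
        # Tokenize on identifier-like words; quoted values that look like
        # identifiers are tokenized too, which only makes the result more
        # conservative (a field is kept, never wrongly dropped).
        tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", query))
        used |= tokens & known
    return used

def recommend_drops(queries, known_fields):
    """Fields that no detection references are candidates for removal."""
    return sorted(set(known_fields) - referenced_fields(queries, known_fields))

# Hypothetical detections exported from a SIEM (assumed syntax).
detections = [
    "src_ip = '10.0.0.5' AND action = 'deny'",
    "user = 'admin' AND action = 'login_failed'",
]
# Hypothetical fields present in a firewall log source.
firewall_fields = ["src_ip", "dst_ip", "action", "user", "session_id", "nat_pool"]

drop_candidates = recommend_drops(detections, firewall_fields)
print(drop_candidates)
```

In this toy example, `dst_ip`, `nat_pool`, and `session_id` come back as candidates. A human-in-the-loop step would review that list before anything is dropped, since dashboards or investigations may still rely on a field that no detection touches.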

An additional capability worth mentioning is the ability to "patternize" log sets. Tools like Cribl Guard and Databahn use it to automatically detect PII, while Observo.AI (now called "SentinelOne AI Pipeline") and Abstract use it to find high-volume logs that can be grouped. Patterning is a new capability that's useful not just for log reduction, but also for finding a needle in a haystack during investigations.
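
For readers who haven't seen patterning in action, the following is a rough Python sketch of the general technique: mask the variable parts of each log line, then count how many lines collapse into the same template. The masking rules and sample lines are assumptions made for illustration, and none of the vendors named above necessarily work this way.

```python
import re
from collections import Counter

# Ordered masking rules (assumed for this sketch): most specific first.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b[0-9a-fA-F]{8,}\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_pattern(line):
    """Replace variable tokens with placeholders to get a template."""
    for regex, placeholder in MASKS:
        line = regex.sub(placeholder, line)
    return line

def patternize(lines):
    """Count how many raw lines share each template."""
    return Counter(to_pattern(line) for line in lines)

# Hypothetical log lines.
sample = [
    "Accepted connection from 10.1.2.3 port 50522",
    "Accepted connection from 10.1.2.9 port 50871",
    "Accepted connection from 192.168.0.4 port 443",
    "Disk usage at 91 percent on volume 2",
]

patterns = patternize(sample)
top_pattern, top_count = patterns.most_common(1)[0]
print(top_pattern, top_count)
```

The high-count templates are the grouping and reduction candidates, while the templates that appear only once are often the interesting needles during an investigation.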

Data Pipeline Health Monitoring and Troubleshooting 

Monitoring log ingestion has long been a binary process. The source is either up or down; green or red. In the past, if a source went down, most teams wouldn't know until someone happened to look or analysts noticed that detections had stopped firing. Today it's a little more nuanced, thanks to anomaly detection in streams. However, an alert still only reports the end result. Customers want to know the "why". What happened to the firewall that stopped sending traffic? Why did an endpoint quit logging?

This is where we start to see AI step in. With vendors like Edge Delta, the platform doesn't wait for an analyst to notice when a source goes down. It kicks into an autonomous investigation mode to find the root cause of the issue. If an action occurred, like a recent patch or a shutdown, the agent explains this and provides remediation guidance if necessary. Not all pipeline vendors are there yet, but it's a feature that should soon be adopted across the board.
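
The trigger for an investigation like this is usually a volume anomaly on a source. As a simple illustration of moving beyond green or red, here is a Python sketch that classifies a source's current event count against its recent history using a z-score. The threshold, the counts, and the `volume_status` function are assumptions for the example, not any vendor's method, and the root-cause step itself is out of scope here.

```python
from statistics import mean, stdev

def volume_status(history, current, threshold=3.0):
    """Classify the current per-interval event count against recent history."""
    if current == 0:
        return "silent"        # the classic "red" state: nothing arriving
    if len(history) < 2:
        return "unknown"       # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return "normal" if current == mu else "anomalous"
    z = (current - mu) / sigma
    return "anomalous" if abs(z) > threshold else "normal"

# Hypothetical events-per-minute counts for a firewall source.
firewall_counts = [1180, 1225, 1190, 1210, 1205, 1195]

print(volume_status(firewall_counts, 1200))  # close to the baseline
print(volume_status(firewall_counts, 400))   # sharp drop, still sending
print(volume_status(firewall_counts, 0))     # source has gone quiet
```

A sharp drop that is not a full outage is exactly the case binary monitoring misses, and it is a natural point to hand off to an automated investigation.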

Data Onboarding 

Ingesting log sources has long been an extremely tedious process. We all know the pain of trying to bring in a custom log format that a SIEM doesn't support, a third-party application without an out-of-the-box (OOTB) integration, or an application built in-house. Simply setting up an HTTP source, syslog endpoint, or file collector is easier said than done. And once the data finally starts flowing, you are left with the task of parsing and normalizing all of it to make it meaningful to your security products.

Databahn has made a big push to solve this pain point. With AI, you can now simply connect to a source and let the platform do the rest. But what about all of the parsing? Check. How about normalizing all of the fields? Done. Maybe you are even asking how the sources are defined? Done and done.

Capabilities are emerging that let you simply connect to the source and have AI perform the tasks that used to take hours and sometimes days. Platforms are gaining the ability to inspect the logs to learn the source type, parse them based on the log type, and normalize them against your detections and rules. Sounds too good to be true? I thought so too, until we did the deep dive.
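
To show what "learn the source type, parse, and normalize" means mechanically, here is a small rule-based Python sketch of those three steps. It is a stand-in for the idea only: the platforms discussed here use AI for this, and the alias table, target field names, and sample lines below are all hypothetical.

```python
import json
import re

# Hypothetical mapping from vendor-specific names to a common schema.
FIELD_ALIASES = {
    "src": "source_ip", "src_ip": "source_ip",
    "dst": "destination_ip", "dst_ip": "destination_ip",
    "act": "action",
}

KV_PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')

def detect_format(line):
    """Guess the log format from the shape of the line."""
    stripped = line.strip()
    if stripped.startswith("{"):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    if KV_PAIR.search(stripped):
        return "kv"
    return "unknown"

def parse(line):
    """Parse a line into a dict according to its detected format."""
    fmt = detect_format(line)
    if fmt == "json":
        return json.loads(line)
    if fmt == "kv":
        return {key: value.strip('"') for key, value in KV_PAIR.findall(line)}
    return {"raw": line}

def normalize(record):
    """Rename known aliases to the common schema; keep other fields as-is."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}

json_line = '{"src_ip": "10.0.0.5", "act": "deny"}'
kv_line = 'src=10.0.0.5 dst=10.0.0.9 action="allow"'

print(normalize(parse(json_line)))
print(normalize(parse(kv_line)))
```

Both lines end up with the same `source_ip` and `action` field names even though they arrived in different formats. That consistency is what makes the data usable by detections downstream, and maintaining tables like `FIELD_ALIASES` by hand is the tedious part AI-assisted onboarding aims to remove.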

Conclusion 

The data pipeline space has come a long way from being just a single stop for data ingest. From enabling in-house LLMs, to establishing true data strategies, to providing in-stream detection capabilities, these platforms are changing the way customers interact with their data. They own it, they shape it, they search it, and they leverage it. The data pipeline has become a true backbone for the SOC of the future.