Data Pipelines - From Past to Future
The conception of SIEM
In the early 2000s, an industry shift took place where we saw SEM (Security Event Management) and SIM (Security Information Management) coalesce. As if the long list of English homophones didn't make the language confusing enough, the two concepts were combined to create SIEM (Security Information and Event Management). That's right: SEM, SIM and SIEM. So, what does this have to do with data pipelines? Well, before we begin to understand the concept of Security Data Pipelines, it's important that we know the whole story.
SEM, SIM and SIEM: Know the difference
I remember when I started my career as an analyst. ArcSight was the seemingly dominant SEM vendor in the market, and everyone I knew was using its product, Enterprise Security Manager (ESM). If you think "query, pivot, query, pivot" is a burden now, try it with the tool sets of the early 2000s. ESM focused on monitoring data streams and applied simple, user-defined correlation rules to create alerts.
Sounds similar to today's workflow, right? While SEM may have been the initial building blocks for the SOC world today, one major aspect was missing: log management!
When alerts were triggered in a SEM, you couldn't stay in a single product to investigate. The luxuries of indexed logs, data models and normalization hadn't stepped into the light yet. Instead, most of the searching was done on the directories, firewalls/switches and endpoints themselves.
As an alternative, organizations could invest in a SIM to solve the issue of log decentralization. The downside, however, was that a SIM had very few correlation capabilities and was mainly built for compliance and historical searching. Think of it as yesterday's "cold storage."
At the time, joining the two products together fit the bill. However, growing sophistication in cyber-attacks and the beginning of the data explosion created a need for a single solution that combined the best of both worlds.
Analytics (SEM) + Storage (SIM) = SIEM
SIEM-centric: A new era of pain
Around 2005, Gartner coined the term "SIEM," and a new era was born. SOC teams started buying in, and organizations started rebuilding their architecture to fit the needs of on-prem storage, computing and routing. It appeared to be the answer to all the current pain points.
Little did we know that we were creating a whole new beast of problems. The hype was real, but as years went by, organizations literally started to pay the price of using a SIEM as a log dumping ground. Anything and everything was frivolously ingested, and the next era of pain was born: SIEMs became expensive!
However, in a world where protecting an organization's data is critical, what is the solution? Nearly every security program asked this question, and unfortunately, almost every answer had a "gotcha."
Continue to ingest everything and run up costs, or pick and choose what to send and miss important data. Every action has a reaction, and both choices come with repercussions: extreme bill overages on one side, critical gaps in detection on the other. We did our best with the information we had, but all we really did was create new mountains to climb.
The rise of data pipelines
Let's rewind a few years to the beginning of SIM and SEM. In the earliest days, Batch ETL (Extract, Transform, Load) attempted to solve some of the issues that plagued data storage. Logs were pushed to stores overnight after undergoing some type of reduction and were ready for historical searches and dashboarding the next day. But this was far from a perfect fix: latency, the lack of real-time insight and rigid data schemas all remained problems.
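To make the Batch ETL pattern a bit more concrete, here is a minimal, hypothetical sketch of an overnight job: the paths, the "reduction" step and the schema are all invented for illustration and not taken from any particular product of the era.

```python
import gzip
import json
from datetime import date
from pathlib import Path

# Hypothetical locations; a real deployment would point at its own shares/stores.
RAW_DIR = Path("/var/log/collected")   # raw logs dropped off during the day
STORE_DIR = Path("/mnt/log-store")     # central store used for historical search

def extract(raw_dir: Path):
    """Read every raw log file collected since the last run."""
    for path in raw_dir.glob("*.log"):
        with path.open() as fh:
            for line in fh:
                yield line.rstrip("\n")

def transform(lines):
    """Toy 'reduction': drop debug noise and map onto a fixed schema."""
    for line in lines:
        if " DEBUG " in line:
            continue                    # reduce volume before storage
        host, _, message = line.partition(" ")
        yield {"host": host, "message": message}

def load(records, store_dir: Path):
    """Write the day's reduced records as one compressed batch."""
    out = store_dir / f"{date.today()}.json.gz"
    with gzip.open(out, "wt") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    # Typically wired to cron and run overnight.
    load(transform(extract(RAW_DIR)), STORE_DIR)
```

The defining trait is the schedule: nothing is searchable until the nightly run completes, which is exactly the latency problem described above.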
The explosion of cloud data and the convergence of logs, metrics and traces have only magnified the pains of Batch ETL: like having a bruise just to have your sibling come up and punch it! As expected, the next requirement for a data pipeline was low-latency ingestion, which created a shift from batch-scheduled data movement to an event-driven ingestion model.
Thanks to tools like Apache Kafka, Logstash and Fluentd, organizations could regain control over their logs. In addition to the ETL capabilities, data normalization and data routing stepped onto the scene. In my opinion, this is the spark that would soon lead to the creation of Security Data Pipeline Platforms (SDPP).
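To illustrate the event-driven shift, here is a rough sketch using the open-source kafka-python client: events are normalized and routed as they arrive instead of waiting for a nightly batch. The topic names, broker address and normalization logic are assumptions made for this example.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

# Hypothetical topics and broker; substitute whatever your environment exposes.
consumer = KafkaConsumer(
    "raw-logs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def normalize(event: dict) -> dict:
    """Map vendor-specific fields onto a common schema (toy example)."""
    return {
        "timestamp": event.get("ts") or event.get("time"),
        "source_ip": event.get("src") or event.get("source_ip"),
        "action": event.get("act", "unknown"),
    }

for message in consumer:
    event = normalize(message.value)
    # Route per event instead of per nightly batch: likely detections go one way,
    # everything else goes to cheaper storage.
    topic = "detections" if event["action"] == "deny" else "archive"
    producer.send(topic, event)
```

The point is not the specific broker; it is that normalization and routing now happen in the stream, per event, rather than in an overnight batch window.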
Now, let's fast-forward again to the point where SIEMs are becoming extremely expensive. Although we've implemented some controls (which, to be honest, are just band-aids for a bigger problem), costs are still rising rapidly. It took this entire journey for organizations to finally realize that a SIEM is not a data strategy; it's merely a facilitator.
So, what does a data strategy look like? While the processes may differ from company to company because of different business goals, I think we can level-set on one thing: a data strategy is a comprehensive plan to reduce costs, drive compliance, facilitate business requirements and enable the SOC to respond better and faster. By that definition, ETL, normalization and routing are no longer enough. We need more!
The pipeline-centric SOC
Organizations like Gartner and Forrester have traditionally shaped the language and direction of our industry. This time, however, the signal did not come from the analyst giants. It came from Francis Odum. Through Software Analyst Cyber Research, he has not just commented on the data pipeline evolution in the SOC; he has forged a new direction entirely.
While most were still trying to fit modern architectures into old SIEM frameworks, Francis introduced a new vocabulary and mental model that better reflects where security operations are actually heading. He did not wait for the market to validate his thinking; he defined it, published it and let the industry catch up.
Beyond the vision, his continuous work in research, content development, advisory guidance and technical translation has done more than shape opinion; it is steadily becoming the standard others will reference.
Welcome to the stage, Security Data Pipeline Platforms!
Up until now, I'd argue that we've had data management concepts but lacked a solid data strategy. We lacked the ability to route to multiple destinations to meet different storage requirements. We couldn't choose routing pipelines based on the context of the raw data and alerts. Cold-storage logs were kept without any context, and enrichment didn't take place until after alert creation. These were all pain points that needed to be solved to truly take down the expensive SIEM beast.
As the pioneer, Cribl has not only laid the groundwork for companies like Abstract, Databahn and Observo AI, but it has also set the standard and proven the need for a true data management solution. These solutions deliver what we have been missing all these years: the ability to take back control of our data and our wallets.
Initially, SDPP was used simply as a tool to cut costs or to switch between SIEM technologies, but that's like driving a Ferrari to the grocery store. It works, but there is so much being left on the table! If you are still asking yourself, "What is SDPP?", let's quickly cover what it was created to do and what it does well.
- Data tagging
- Dynamic routing
- Multi-destination delivery
- Log collection health monitoring
- Noise filtering
- Correlation capabilities
- Enrichment with TI/Asset management
While this is by no means an exhaustive list, and every vendor has their own strengths, the pieces of the puzzle we've been missing for so long are now evident.
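To make a few of these capabilities less abstract, below is a minimal, vendor-neutral sketch of noise filtering, enrichment and multi-destination routing inside a pipeline. The field names, threat-intel set and destination labels are all invented for illustration; no specific SDPP product works exactly this way.

```python
from typing import Iterable

# Hypothetical threat-intel set; a real pipeline would enrich from TI/asset feeds.
KNOWN_BAD_IPS = {"203.0.113.10", "198.51.100.7"}

def noise_filter(events: Iterable[dict]) -> Iterable[dict]:
    """Drop low-value events before they ever reach the SIEM."""
    for event in events:
        if event.get("action") == "heartbeat":
            continue
        yield event

def enrich(events: Iterable[dict]) -> Iterable[dict]:
    """Tag events with threat-intel context *before* alert creation."""
    for event in events:
        event["ti_match"] = event.get("source_ip") in KNOWN_BAD_IPS
        yield event

def route(event: dict) -> list:
    """Multi-destination delivery: one event can land in several places."""
    destinations = ["data-lake"]          # everything lands in cheap storage
    if event["ti_match"] or event.get("severity", 0) >= 7:
        destinations.append("siem")       # only high-value data pays SIEM rates
    return destinations

if __name__ == "__main__":
    sample = [
        {"source_ip": "203.0.113.10", "action": "deny", "severity": 8},
        {"source_ip": "10.0.0.5", "action": "heartbeat"},
    ]
    for event in enrich(noise_filter(sample)):
        print(route(event), event)
```

Even in this toy form, the cost lever is visible: the heartbeat never touches the SIEM, and the enriched, high-severity event is the only one that does.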
The SOC of the future
Many of the conversations happening across SecOps practices revolve around the evolution of the SOC and where it is heading. While there are a few different ways to get there, SDPP is increasingly seen as the backbone. If you're going on vacation and the beach is the destination, the roads are what get you there. SDPP is the road for your data strategy.
Simply leveraging the filter and reduction features offered by some SIEM vendors has proven insufficient. Instead, we need solutions that deliver a long list of benefits while preserving complete data ownership.
The future of data pipelines
AI has become the most prevalent topic among security professionals. The most common conversations revolve around alerts and incidents and how AI can drive a more efficient SOC.
Since SDPP is still a relatively new subject to most, I doubt the conversation of "Where is this headed?" has surfaced much. When we go back and think about SIM and SEM, it's exciting to consider how much this space has evolved and where it is going. Whether you talk to vendors or read industry research, a few common themes keep floating around for the future. As the market evolves, new demands are being placed on vendors. The future of SDPP looks promising, and below are several reasons why.
Self-healing Pipelines: Most solutions in the past have simply provided "up or down" statuses for collection sources. While important, this provided very little context into why something stopped working, leaving admins with hours of troubleshooting and periods of data loss. Self-healing routes will attempt to identify the failure the moment something goes wrong, automatically retry paths, re-route data accordingly and notify an analyst as needed. This is a form of health monitoring, fused with AI, that will drastically reduce troubleshooting time and potential data loss.
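Here is a minimal sketch of that retry-and-re-route idea; the delivery call, fallback destinations and notification hook are placeholders for whatever a given platform actually provides.

```python
import random
import time

def deliver(event: dict, destination: str) -> bool:
    """Placeholder for a real delivery call; randomly fails to simulate outages."""
    return random.random() > 0.3

def notify_analyst(message: str) -> None:
    """Placeholder for an alerting hook (email, chat, ticket, etc.)."""
    print(f"[NOTIFY] {message}")

def self_healing_send(event: dict, primary: str, fallbacks: list,
                      retries: int = 3, backoff: float = 2.0) -> None:
    # Retry the primary route a few times with increasing backoff.
    for attempt in range(retries):
        if deliver(event, primary):
            return
        time.sleep(backoff * (attempt + 1))
    # Primary looks unhealthy: re-route so data keeps flowing while someone investigates.
    for destination in fallbacks:
        if deliver(event, destination):
            notify_analyst(f"{primary} unreachable; re-routed to {destination}")
            return
    notify_analyst(f"All routes failed for event from {event.get('host', 'unknown')}")

if __name__ == "__main__":
    self_healing_send({"host": "fw-01", "action": "deny"},
                      primary="siem", fallbacks=["data-lake", "local-buffer"])
```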
Detection at the Edge: While some vendors have begun to implement this, one of the future goals is to offload run-time compute from the analytics engine and push more correlation rules out to the edge. This will drastically improve real-time detection while decreasing MTTD/MTTR.
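As a rough illustration, here is a toy correlation rule (repeated failed logins from one source) evaluated at the collection point, so only the resulting alert has to travel upstream to the analytics engine. The threshold, window and field names are invented for the example.

```python
from collections import defaultdict, deque
from time import time

# Toy rule: N failed logins from one source within a sliding time window.
WINDOW_SECONDS = 60
THRESHOLD = 5

failed_logins = defaultdict(deque)  # source_ip -> timestamps of recent failures

def evaluate_at_edge(event: dict):
    """Run the correlation rule on the collector and emit only alerts upstream."""
    if event.get("action") != "login_failure":
        return None
    now = time()
    window = failed_logins[event["source_ip"]]
    window.append(now)
    # Expire timestamps that have fallen out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= THRESHOLD:
        window.clear()
        return {"alert": "brute_force_suspected", "source_ip": event["source_ip"]}
    return None
```

The raw failed-login events never need to leave the edge; only the single correlated alert does, which is where the MTTD and compute savings come from.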
AI Parsing: A large benefit of leveraging an SDPP solution is the ability to hand off tedious tasks like parser creation and normalization. Even today, there are log formats and third-party applications for which no pre-built parser exists. Detecting new log formats on the fly with AI parsers will drastically reduce the human effort it currently takes to make these "one-off" data formats usable in a security context.
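A minimal sketch of what that fallback flow could look like; the generate_parser_with_llm helper is purely hypothetical and stands in for whatever model-assisted parser builder a vendor might offer.

```python
import json
import re

# Formats the pipeline can already parse.
KNOWN_PATTERNS = {
    "json": re.compile(r"^\s*\{"),
    "cef": re.compile(r"^CEF:\d+\|"),
}

def generate_parser_with_llm(sample_lines: list):
    """Hypothetical stand-in for an AI-assisted parser builder."""
    raise NotImplementedError("placeholder for a vendor's model-assisted parser")

def parse(line: str) -> dict:
    if KNOWN_PATTERNS["json"].match(line):
        return json.loads(line)
    if KNOWN_PATTERNS["cef"].match(line):
        # Real CEF parsing is more involved; this only splits the header fields.
        return {"cef_header": line.split("|")[:7]}
    # Unknown "one-off" format: fall back to AI-assisted parser creation
    # instead of waiting for a human to write a parser by hand.
    parser = generate_parser_with_llm([line])
    return parser(line)
```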
Adaptive Routing: Currently, most data routes are pre-defined based on correlation logic or tagging. In the future, SDPP aims to intelligently route data to destinations by reasoning about its content, without the need to pre-build routes.
Final thoughts
The future of SDPP is the shift from simply being data movers to intelligent, autonomous brokers that adapt, heal, enrich, govern and orchestrate across the entire security ecosystem. If you are considering how SDPP can help guide and build the SOC of the Future or simply want to discuss the capabilities further, please contact GSASecOps@wwt.com. We are happy to assist in your cyber journey.