Multiline log parsing with regex: Keeping multiline events intact for your SIEM

Most telemetry pipelines treat every newline as the end of an event. That assumption holds for a tidy syslog stream but breaks the moment a Java stack trace, a Python traceback, or a pretty-printed JSON payload lands in the file. One event becomes forty lines, and your SIEM ingests forty fragments instead of one record.

For a SecOps team, the cost is operational. Detection rules match on fragments or miss the event entirely, correlation loses the context that made the event worth alerting on, and the event count balloons against a volume-based license. The fix is to define where each event starts and ends with a regular expression, and to reassemble events at the collection layer, before the data reaches the SIEM.

Why "one line, one event" breaks

Plenty of the sources a SecOps team cares about emit events across several lines:

JVM exceptions and Java stack traces, where the message line is followed by dozens of at … frames.
Python tracebacks, which wrap the error in Traceback (most recent call last): and an indented call chain.
Application logs that print a request, a response body, and a result across separate lines.
Pretty-printed XML or JSON, indented for humans and spread over many lines.

When a line-oriented collector splits these, three things go wrong. Your detection logic sees at com.example.Service.run() as a standalone event and has nothing to match against. Correlation rules that depend on the error message and its stack don’t see them together. And a single 40-line trace counts as 40 events — inflating dashboards, skewing baselines, and burning ingest quota you’re paying for by volume.
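To make the failure concrete, here is a minimal Python sketch. The four-line trace is made up, and RULE is a hypothetical detection rule that needs the exception name and the throwing frame in the same event. It is an illustration of the problem, not how any particular SIEM evaluates rules.

```python
import re

# Made-up sample: one logical event spread over four physical lines.
RAW = (
    "2026-06-11 14:22:01 ERROR java.lang.NullPointerException: id is null\n"
    "    at com.example.Service.run(Service.java:42)\n"
    "    at com.example.Worker.call(Worker.java:17)\n"
    "    at java.base/java.lang.Thread.run(Thread.java:840)\n"
)

# Hypothetical rule: the exception name AND the frame that threw it.
RULE = re.compile(r"NullPointerException.*Service\.run", re.DOTALL)

# What a line-oriented collector forwards: one "event" per newline.
fragments = RAW.splitlines()

# The rule evaluated against each fragment, then against the whole event.
fragment_hits = [f for f in fragments if RULE.search(f)]
whole_hit = RULE.search(RAW) is not None

print(f"fragments forwarded: {len(fragments)}")
print(f"fragments matching the rule: {len(fragment_hits)}")
print(f"whole event matches the rule: {whole_hit}")
```

The collector forwards four fragments, none of them satisfies the rule, and the intact event does.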

Defining event boundaries with regex

Multiline log parsing with regex comes down to one decision: how do you tell the parser where an event begins and ends? There are three strategies, and the right one depends on what your log source gives you.

Match the header line

The most dependable approach matches the first line of each event — usually a leading timestamp. Everything after it belongs to the current event until the next header line appears.

A timestamp anchored to the start of the line makes a reliable boundary:

/^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/

This matches 2026-06-11 14:22:01 at the beginning of a line and treats every indented stack frame that follows as part of the same record. Most structured application logs lead with a timestamp, which makes this the default choice.
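The header-line logic is easy to sketch outside any collector. The Python below is an illustration under stated assumptions, not NXLog's implementation: group_by_header is a hypothetical helper, the sample lines are made up, and any lines that appear before the first header are simply folded into the first event.

```python
import re
from typing import Iterable, Iterator, List

# The timestamp header from above: "YYYY-MM-DD HH:MM:SS" at the start of a line.
HEADER = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def group_by_header(lines: Iterable[str]) -> Iterator[str]:
    """Yield one string per event; a header line starts a new event."""
    current: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if HEADER.match(line) and current:
            # A new header closes the event collected so far.
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        # End of input: flush the buffered lines as the last event.
        yield "\n".join(current)


sample = [
    "2026-06-11 14:22:01 ERROR Unhandled exception",
    "java.lang.IllegalStateException: boom",
    "    at com.example.Service.run(Service.java:42)",
    "2026-06-11 14:22:02 INFO Recovered",
]

events = list(group_by_header(sample))
for number, event in enumerate(events, start=1):
    print(f"event {number}: {len(event.splitlines())} line(s)")
```

Note that the last event is only emitted because the input ends. On a live file there is no such signal, which is the weakness of a header-only boundary.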

Match the header and end line

When the source has both a reliable header and a distinct terminator, you can match both. The event opens on the header pattern and closes as soon as the end pattern matches, so the parser does not have to wait for the next header. This is useful when events end with an unmistakable footer, for example, a closing </event> tag.
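A rough Python sketch of the header-plus-end strategy, assuming an XML-style source. The two patterns, the group_by_header_and_end helper, and the sample lines are all hypothetical. In this sketch, lines outside an event and an event that never closes are dropped, which is a simplification rather than a description of how any collector behaves.

```python
import re
from typing import Iterable, Iterator, List

# Hypothetical delimiters for an XML-style source.
HEADER = re.compile(r"^<event\b")
END = re.compile(r"^</event>")


def group_by_header_and_end(lines: Iterable[str]) -> Iterator[str]:
    """Open an event on HEADER and emit it as soon as END matches."""
    current: List[str] = []
    in_event = False
    for line in lines:
        line = line.rstrip("\n")
        if not in_event:
            if HEADER.match(line):
                in_event = True
                current = [line]
            # Lines outside any event are dropped in this sketch.
            continue
        current.append(line)
        if END.match(line):
            # The terminator closes the event immediately.
            yield "\n".join(current)
            current = []
            in_event = False


sample = [
    '<event id="1">',
    "  <user>alice</user>",
    "</event>",
    '<event id="2">',
    "  <user>bob</user>",
    "</event>",
]

events = list(group_by_header_and_end(sample))
# Feeding only the first three lines is enough to get the first event out.
first_only = list(group_by_header_and_end(sample[:3]))
print(len(events), len(first_only))
```

The second call shows the practical benefit: the first event is complete the moment its closing tag is read, with no following header required.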

Fixed line count

If every event is the same number of lines, you can count instead of pattern-match. This one is brittle: one source change and your boundaries silently drift. Reserve it for rigid, machine-generated formats where the line count won’t change.
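The drift is easy to reproduce. In the Python sketch below, group_by_count is a hypothetical helper and the three-line request/response/result records are made up. One extra line in the first record is enough to shift every boundary after it.

```python
from typing import Iterable, Iterator, List


def group_by_count(lines: Iterable[str], count: int) -> Iterator[str]:
    """Emit one event for every `count` consecutive lines."""
    if count < 1:
        raise ValueError("count must be at least 1")
    current: List[str] = []
    for line in lines:
        current.append(line.rstrip("\n"))
        if len(current) == count:
            yield "\n".join(current)
            current = []
    # A trailing partial record is left unemitted in this sketch.


sample = [
    "REQUEST id=1", "RESPONSE 200", "RESULT ok",
    "REQUEST id=2", "RESPONSE 500", "RESULT fail",
]

good = list(group_by_count(sample, 3))

# The source starts printing one extra line in the first record.
drifted = sample[:2] + ["WARNING slow"] + sample[2:]
bad = list(group_by_count(drifted, 3))

print([event.splitlines()[0] for event in good])
print([event.splitlines()[0] for event in bad])
```

With the clean input, both events start with a REQUEST line. With the drifted input, the second event starts with the first record's RESULT line, and nothing in the parser reports a problem.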

A header-only pattern can’t tell that a file’s last event is complete until the next header arrives. If you need the last record emitted promptly, add an end-line pattern or a fixed line count.

Regex that holds up in production

The pattern that works on three sample lines often falls over on a million. Three habits keep multiline regex dependable at scale:

Anchor your patterns: Start header and end patterns with ^. Without the caret, the engine searches for a match anywhere in the line — a costly operation on every line of a high-volume source. Forgetting the anchor is an easy mistake to make, and an expensive one.
Set the right flags: Once lines are joined into a single event string, . stops at the first newline unless you set the /s (dotall) flag, which lets . match line terminators too. Use /m (multiline) when you want ^ and $ to match at internal line breaks. Mixing these up is why a field-extraction pattern "works" on one line and returns nothing on a reassembled event.
Watch for catastrophic backtracking: Nested or ambiguous quantifiers like (.*)+ can send a regex engine into exponential backtracking. A short line hides the cost. A 2,000-line stack trace turns it into a CPU spike and a stalled pipeline. Write specific patterns, prefer explicit character classes over broad wildcards, and test against your largest real events, not your tidiest ones.
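The flag behavior in the second habit can be checked in a few lines. This sketch uses Python's re module, where re.DOTALL and re.MULTILINE correspond to the /s and /m modifiers. The sample event is made up.

```python
import re

# A reassembled event: three physical lines joined into one string.
event = (
    "2026-06-11 14:22:01 ERROR boom\n"
    "    at a.b.C.run(C.java:1)\n"
    "    at a.b.D.call(D.java:2)"
)

# Without DOTALL, "." stops at the first newline.
message_plain = re.search(r"ERROR (.*)", event).group(1)
# With DOTALL, "." also matches newlines, so the frames are captured too.
message_dotall = re.search(r"ERROR (.*)", event, re.DOTALL).group(1)

# Without MULTILINE, "^" matches only at the very start of the string.
frames_plain = re.findall(r"^\s+at (\S+)", event)
# With MULTILINE, "^" also matches after each internal line break.
frames_multi = re.findall(r"^\s+at (\S+)", event, re.MULTILINE)

print(repr(message_plain))
print(len(message_dotall.splitlines()))
print(frames_plain)
print(frames_multi)
```

The same two patterns return either the full event and every frame, or a single word and an empty list, depending only on the flags.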

How NXLog Agent handles multiline parsing

NXLog Agent handles all three boundary strategies through one module, the Multiline Parser extension. You define the module once and point an input at it.

The module uses the PCRE engine, so the regex syntax matches what you already write in Perl: patterns quoted with slashes, the =~ and !~ operators, and the /s and /m modifiers described above.

Here’s a configuration that reassembles Java stack traces from a log file using a timestamp header:

<Extension multiline>
    Module        xm_multiline
    HeaderLine    /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/
</Extension>

<Input app_logs>
    Module        im_file
    File          '/var/log/app/application.log'
    InputType     multiline
</Input>

The HeaderLine directive sets the boundary, and InputType tells the im_file instance to read events through the xm_multiline instance named multiline. Need an end pattern too? Pair it with EndLine. Fixed-size records? Use FixedLineCount.

Once the event is whole, you can pull fields out of it. Add an Exec block to the input — it runs on the full, reassembled record, which is where /s does its work: it lets . match across the newlines you just joined.

<Exec>
    # Runs once per reassembled event
    if $raw_event =~ /^(\S+ \S+) (\w+) (.*)$/s
    {
        $EventTime = parsedate($1);
        $Severity = $2;
        $Message = $3;
    }
</Exec>
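If you want to check what the three capture groups hold before deploying, the same pattern can be tried in Python. This is only a sanity check under assumptions: the sample event is made up, re.DOTALL stands in for /s, and datetime.strptime stands in for parsedate, which is a different function with its own format handling.

```python
import re
from datetime import datetime

# The field-extraction pattern from the Exec block; DOTALL plays the role of /s.
PATTERN = re.compile(r"^(\S+ \S+) (\w+) (.*)$", re.DOTALL)
# The same pattern without the flag, for comparison.
PATTERN_NO_FLAG = re.compile(r"^(\S+ \S+) (\w+) (.*)$")

# Made-up reassembled event.
event = (
    "2026-06-11 14:22:01 ERROR Unhandled exception\n"
    "java.lang.IllegalStateException: boom\n"
    "    at com.example.Service.run(Service.java:42)"
)

match = PATTERN.match(event)
assert match is not None
raw_time, severity, message = match.groups()

# Stand-in for parsedate(): parse the captured timestamp.
event_time = datetime.strptime(raw_time, "%Y-%m-%d %H:%M:%S")

# Without the flag, ".*" stops at the first newline and "$" cannot match there.
no_flag_match = PATTERN_NO_FLAG.match(event)

print(event_time.isoformat(), severity)
print(len(message.splitlines()), "message line(s)")
print("match without the flag:", no_flag_match)
```

The third group carries the message line together with the exception and its frame, which is why the flag matters on a reassembled record.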

This is the part that matters for your SIEM: because NXLog Agent reassembles the event at the source, that 40-line stack trace increments your event counter once, not forty times. What reaches Splunk, Microsoft Sentinel, or Elasticsearch is a single record, already parsed and intact. The same approach applies to other line-oriented inputs, such as a TCP listener, and not only to flat files.

Conclusion

Many of the application logs you collect contain multiline events. Get the boundary right with a well-anchored regex, choose the strategy that fits your source, and reassemble events at the collection layer so your SIEM only ever sees whole records. Your detection rules, your correlation logic, and your ingest bill all benefit.

NXLog Agent supports 100+ input and output modules for SecOps, DevOps, and compliance pipelines, and NXLog Platform is how you deploy and manage agents across a fleet. To try this on your own logs, start with the Multiline Parser documentation. You can try the full pipeline for free before you scale.
