Regex in retrograde

VP of Product, Security, Fastly

May 31, 2023

“Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.” – Jamie Zawinski

Regular expressions are a concise way of describing how to match specific patterns or sequences of tokens, usually text. But is the approach of matching sequences of text effective at detecting malware or web application attacks? More precisely, for what types of web app attacks is the regular expression (regex) approach effective? We don’t want to succumb to status quo bias and sabotage our security outcomes by sticking with something ineffective, after all.

Regex is useful for the copy-and-paste stuff beloved by novice attackers – when the attacker doesn’t understand how the attack works because someone else built it for them. The “novice attacker” (i.e. amateur) simply operationalizes it and marauds around the internet through a tool like Shodan. But the simple stuff is where regex’s effectiveness ceases.

The people who craft the reusable bits of the attack know they could morph any of it to circumvent the simplistic pattern matching mechanism regex offers. In fact, the precise sequence of tokens really isn’t that meaningful – it’s the full expression of those tokens that matters.
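As a minimal sketch of that morphing, consider a hypothetical signature for `UNION SELECT`. The pattern and payloads below are illustrative assumptions, not taken from any real rule set. Two trivial rewrites that common SQL dialects treat as equivalent to the original slip straight past it:

```python
import re

# Hypothetical signature of the kind found in simple rule sets (an assumption).
SIGNATURE = re.compile(r"union\s+select", re.IGNORECASE)

payloads = {
    "copy-paste": "1 UNION SELECT password FROM users",
    # An inline comment stands in for the whitespace the signature expects.
    "inline comment": "1 UNION/**/SELECT password FROM users",
    # An extra keyword sits between the two tokens the signature expects.
    "extra keyword": "1 UNION ALL SELECT password FROM users",
}

# Map each variant to whether the signature fires on it.
detected = {name: bool(SIGNATURE.search(p)) for name, p in payloads.items()}

for name, hit in detected.items():
    print(f"{name:15} detected={hit}")
```

Patching the signature for these two variants only invites the next one; the token sequence changed while the attack behavior did not.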

What matters is the attack behavior – how actions unfurl over spacetime. The specific bits or text are an implementation detail rather than descriptive of how the attack works.

As a real-world analogy, consider a TV being taken off the wall. Is it useful for us to know the specific type of screwdriver used? Maybe we’re the ones using the screwdriver to adjust the TV’s height, rather than a burglar pilfering it. But if the living room window is smashed open and then someone rips the TV off the wall – that behavior is oozing malice.

In essence, patterns of text in different settings mean different things and produce different outcomes. Context matters. A textual description of a SQL injection (SQLi) exploit in a blog post is very different from someone hurling a SQLi attack against the hoster of that blog. Both will contain the pattern of text representative of that attack class, but only one instance reflects a real attack.
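To make the blog-post example concrete, here is a small sketch with a hypothetical tautology signature and two made-up inputs (all of the names and strings are assumptions for illustration). The regex sees the same characters in both and cannot tell the discussion from the attack:

```python
import re

# Hypothetical SQLi signature for the classic "' OR '1'='1" tautology.
SIGNATURE = re.compile(r"'\s*or\s+'?1'?\s*=\s*'?1", re.IGNORECASE)

# The same text in two very different contexts.
blog_comment_body = "Classic SQLi looks like ' OR '1'='1 appended to a login field."
login_username = "admin' OR '1'='1"

def flagged(text: str) -> bool:
    """Return True when the signature appears anywhere in the text."""
    return SIGNATURE.search(text) is not None

print("blog comment flagged:", flagged(blog_comment_body))
print("login field flagged: ", flagged(login_username))
```

Both inputs are flagged, so the defender must either block a harmless comment or loosen the rule and miss the login attempt. Deciding between them requires knowing where the text lands and how it is processed, which the pattern alone does not encode.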

This is not to say that regular expressions are useless always and everywhere. Regexes are thrifty representations for pattern matching: they’re idiomatic and comparatively readable. The problem is the pattern matching itself: matching tokens, like text in query parameters or POST body fields, only catches the most rudimentary attacks. The same goes outside of web app land, too; a regex for a specific file hash assumes the attacker is so unmotivated and careless as to forgo the effort to transmogrify the file to elude such simple detection. Sometimes that will be true; much of the time it will not.
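The file hash case is easy to sketch. The sample bytes and the `blocklist` below are invented for illustration, and the assumption that a trailing padding byte leaves a file’s behavior unchanged holds for many formats but not all of them:

```python
import hashlib

# Stand-in bytes for a known-bad file (made up for this example).
original = b"MZ\x90\x00 pretend this is a malicious binary"

# A hash blocklist containing the known-bad sample.
blocklist = {hashlib.sha256(original).hexdigest()}

# The attacker appends one padding byte; the digest changes completely.
morphed = original + b"\x00"

def is_blocked(sample: bytes) -> bool:
    """Return True when the sample's SHA-256 digest is on the blocklist."""
    return hashlib.sha256(sample).hexdigest() in blocklist

print("original blocked:", is_blocked(original))
print("morphed blocked: ", is_blocked(morphed))
```

The cost to the attacker is one byte; the cost to the defender is a new blocklist entry for every variant.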

Even if we put accuracy aside, regex can foment overwhelming overhead for defenders. Regex is often inflexible, unwieldy, and difficult to maintain – especially when you try to bake more context into it. For instance, a regex that captures the list of valid Linux / Windows commands in order to detect command injection would be difficult, if not impossible, to write well – and the same goes for SQLi. The resulting pattern would be unreadable to a human, which isn’t sustainable (even if you construct regex grammars).

Consider this “simple” regex for validating email addresses:

`(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])`

Do you enjoy parsing it with your eyeballs or does it feel a bit uncivilized?


Keeping up with emerging or evolving attack techniques and patterns carries so much overhead (time, effort, caffeine, etc.) that rules change less frequently, and thus the solution’s efficacy decays. Regex can also be slow and resource intensive, especially when dealing with complex expressions that are difficult to optimize. Whether the overhead is human or computational, regex for attack detection often does not deliver a worthwhile return on investment.
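The performance overhead can be illustrated with a textbook case of catastrophic backtracking. The pattern below is a deliberately pathological assumption, not a rule from any real product, and the timings depend on the machine and on the regex engine (this sketch uses Python’s backtracking `re` module):

```python
import re
import time

# Nested quantifiers: a textbook pattern prone to catastrophic backtracking.
PATTERN = re.compile(r"(a+)+$")

def seconds_to_reject(n: int) -> float:
    """Time a failed match against n 'a' characters followed by a 'b'."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    # The trailing 'b' makes the match fail, forcing the engine to backtrack.
    assert PATTERN.match(subject) is None
    return time.perf_counter() - start

short = seconds_to_reject(14)
longer = seconds_to_reject(20)
print(f"n=14: {short:.4f}s, n=20: {longer:.4f}s")
```

Six extra characters of input should produce a disproportionate jump in matching time, which is the kind of behavior that makes complex expressions hard to run safely on every request.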

Regex has its place, and attack detection isn’t it. Many vendors do not want buyers to discover this because it is much less expensive to maintain existing technology than to invest in innovation. Nevertheless, regex doesn’t worry attackers or make it more expensive for them to attack us.

Techniques like parsing HTTP requests can offer a thriftier investment, allowing defenders to bake in more context that starts to pattern-match attack behaviors while reducing the effort required by humans and machines alike. Dissecting attack behavior will catch both rudimentary and more effortful attacks – and any variants of them – without requiring a new rule for every new vulnerability that emerges (which is often). If we look at the attack context and how a request is processed at runtime, we can make more accurate decisions than with regex.
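One way to picture the difference is a toy structure check. The sketch below is not how Fastly’s products work; it assumes, purely for illustration, that a request parameter ends up inside a single-quoted string literal in a hypothetical `TEMPLATE` query. It decodes the query string, lexes the resulting SQL with a tiny hand-written scanner, and flags parameters that change the shape of the query. All function names here are hypothetical.

```python
from urllib.parse import parse_qsl

def sql_token_kinds(sql: str) -> list[str]:
    """Tiny SQL scanner: returns the kind of each token, not its value."""
    kinds, i, n = [], 0, len(sql)
    while i < n:
        ch = sql[i]
        if ch.isspace():
            i += 1
        elif ch == "'":
            i += 1
            while i < n:
                if sql[i] == "'":
                    if i + 1 < n and sql[i + 1] == "'":
                        i += 2  # '' is an escaped quote inside the literal
                        continue
                    break
                i += 1
            kinds.append("STRING" if i < n else "UNTERMINATED_STRING")
            i += 1
        elif sql.startswith("--", i):
            while i < n and sql[i] != "\n":
                i += 1  # line comment runs to the end of the line
            kinds.append("COMMENT")
        elif sql.startswith("/*", i):
            end = sql.find("*/", i + 2)
            i = n if end == -1 else end + 2
            kinds.append("COMMENT")
        elif ch.isalnum() or ch == "_":
            while i < n and (sql[i].isalnum() or sql[i] == "_"):
                i += 1
            kinds.append("WORD")
        else:
            kinds.append(ch)  # operators and punctuation stand for themselves
            i += 1
    return kinds

# Hypothetical query the application is assumed to build from the parameter.
TEMPLATE = "SELECT * FROM users WHERE name = '{}'"
BASELINE = sql_token_kinds(TEMPLATE.format("x"))

def changes_query_structure(value: str) -> bool:
    """True when the value alters the token structure of the query."""
    return sql_token_kinds(TEMPLATE.format(value)) != BASELINE

def suspicious_params(query_string: str) -> list[str]:
    """Names of decoded query parameters that would reshape the query."""
    pairs = parse_qsl(query_string, keep_blank_values=True)
    return [name for name, value in pairs if changes_query_structure(value)]

print(suspicious_params("q=how+to+prevent+union+select+attacks"))
print(suspicious_params("q=x%27+OR+%271%27%3D%271"))
print(suspicious_params("id=7&q=x%27%2F**%2FuNiOn%2F**%2FsElEcT+1--+"))
```

Under these assumptions, the search phrase that merely mentions “union select” stays inside the string literal and is left alone, while the tautology and the comment-obfuscated variant are both flagged without a dedicated rule for either, because the check looks at what the input does to the query rather than which characters it contains.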

Thus, we arrive at the market distortion today: security buyers’ checklists still include “regex matching” as a key capability in their web app sec products (and often in other products, like YARA rules in endpoint tools) despite its inadequacy. That requirement served customers well for a time, when most vendors’ products relied on regexes – when there weren’t better ideas around. We are no longer in that world.


Take your blinders off and fix your checklists

Imagine the following scene for illustrative purposes: a customer walks into a car dealership and asks if the car has blinders.

“No,” the salesperson replies, confused. “You don’t need blinders on a car.” The customer shakes their head disapprovingly, tapping their checklist.

“What about buggy whips?” asks the customer.

“You don’t need buggy whips, either. We don’t do whippings anymore.”

“I suppose next you’ll tell me there’s no harness or bridle, either!”

“That’s correct. It’s a car.” The salesperson pinches the bridge of their nose with a sigh and asks, “What exactly do you need to accomplish?”

“To travel from point A to B posthaste,” replies the confident customer.

“Right, a car is much better for that than a horse carriage. It’s designed this way for a reason. It’s innovation to make your life better.”

“And yet all these features are missing!”

This example is obviously absurd, and yet it is not dissimilar to conversations overheard in cybersecurity today. It is understandable why we might cling to our familiar list of features as buyers; questions about blinders, buggy whips, and bridles are all very wise inquiries when buying a horse carriage. But these checklists are inherently static, mooring us to an often obsolescent past.

We bemoan fast, ever-evolving attackers and yet too often we select products based on tech features that were passable only in the days of yore. Attackers keep their options open and choose the right tools to accomplish their goals; we should do the same as defenders. While not the only option, we can consider newer but well-vetted defensive techniques like parsing request parameters to analyze the results of a request at runtime – gaining more speed, accuracy, and flexibility.

It’s tough to realize our mental models about a problem area are incorrect and outdated – that reality has evolved faster than our conception of it, even though we’ve sunk so many resources into these ultimately fruitless pastures. It is precisely that staleness, those unchallenged assumptions, that attackers exploit. If we want to succeed in our goal of outmaneuvering attackers, then we must progress beyond regex into a more modern era of detection.

To learn more about how Fastly helps defenders evolve past regex with parsing, read our datasheet about SmartParse detection or our blog about how we leveraged parsing for the Log4Shell attack. Or request a demo – we’d love to show you how it works.