While Haskell is great for writing parsers, sometimes the simplest solution is just to do some text munging with regular expressions. Whenever I need to do some simple pattern matching on strings, I reach for regex-tdfa; it’s fast and supports all the directives I need. But every time I do, I end up rereading the documentation to figure out how to use anything. So here’s a cheatsheet for the most common use cases for regexes in Haskell.
Importing and using
Add regex-tdfa to your project’s dependencies (for example, in package.yaml or your .cabal file).
In modules where you need to use regexes:
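As a rough sketch, the imports usually look like the following. This assumes a regex-tdfa release that ships the Text.Regex.TDFA.Text module itself; on older releases that module came from the separate regex-tdfa-text package. The hasDigits helper is hypothetical and only there to show a Text value being matched.

```haskell
import qualified Data.Text as T
import Text.Regex.TDFA          -- (=~), (=~~) and the result-type instances
import Text.Regex.TDFA.Text ()  -- instances only: lets Data.Text values be matched

-- Hypothetical helper: True if the Text contains at least one digit.
-- The pattern is a plain String; the subject is a Text.
hasDigits :: T.Text -> Bool
hasDigits t = t =~ ("[0-9]+" :: String)
```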
Note that the core regex-tdfa package only lets you match on String and ByteString; hence the import of Text.Regex.TDFA.Text, which adds the instances for Text.
(=~) and (=~~) are polymorphic in their return type. This is so that regex-tdfa can pick the most efficient way to give you your result based on what you need. For instance, if all you want is to check whether the regex matched or not, there’s no need to allocate a result string. If you only want the first match, rather than all the matches, then the matching engine can stop after finding a single hit.
This does mean, though, that you may sometimes have to explicitly specify the type you want, especially if you’re trying things out at the REPL.
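To make the return-type polymorphism concrete, here is a small sketch in which the same match expression is given two different types. The names emailRegex, sample, hasEmail and firstEmail are made up for illustration, and the pattern is only a loose approximation of an email address.

```haskell
import Text.Regex.TDFA

-- Illustrative pattern, not a real email validator.
emailRegex :: String
emailRegex = "[a-zA-Z0-9+._-]+@[a-zA-Z-]+\\.[a-z]+"

sample :: String
sample = "my email is email@email.com"

-- Asking for a Bool only checks whether there is a match.
hasEmail :: Bool
hasEmail = sample =~ emailRegex      -- True

-- Asking for a String returns the text of the first match.
firstEmail :: String
firstEmail = sample =~ emailRegex    -- "email@email.com"
```

At the REPL there is no signature to drive inference, so the annotation goes on the expression itself, as in `sample =~ emailRegex :: Bool`.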
Common use cases
Get the first match
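A minimal sketch: annotate the result as String (ByteString or Text work the same way for subjects of those types). The binding names are made up for illustration.

```haskell
import Text.Regex.TDFA

-- First match as a String; the empty string if nothing matched.
firstWord :: String
firstWord = "alexis-de-tocqueville" =~ "[a-z]+"     -- "alexis"

noDigits :: String
noDigits = "alexis-de-tocqueville" =~ "[0-9]+"      -- ""

-- (=~~) runs in a failure-aware monad such as Maybe, so a miss is explicit.
maybeDigits :: Maybe String
maybeDigits = "alexis-de-tocqueville" =~~ "[0-9]+"  -- Nothing
```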
Check if it matched at all
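A minimal sketch: annotate the result as Bool. The binding names are made up for illustration.

```haskell
import Text.Regex.TDFA

hasLetters :: Bool
hasLetters = "alexis-de-tocqueville" =~ "[a-z]+"  -- True

hasNumbers :: Bool
hasNumbers = "alexis-de-tocqueville" =~ "[0-9]+"  -- False
```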
Get first match + text before/after
-- if no match, will just return whole
-- string in the first element of the tuple
a =~ b :: (String, String, String)

λ> "alexis-de-tocqueville" =~ "de" :: (String, String, String)
>>> ("alexis-", "de", "-tocqueville")
λ> "alexis-de-tocqueville" =~ "kant" :: (String, String, String)
>>> ("alexis-de-tocqueville", "", "")
Get first match + submatches
-- same as above, but also returns a list of /just/ submatches
-- submatch list is empty if regex doesn't match at all
a =~ b :: (String, String, String, [String])

λ> "div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]" :: (String, String, String, [String])
>>> ("", "div[attr=1234]", "", ["attr","1234"])
Get all matches
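A minimal sketch: wrap the match in getAllTextMatches and ask for a list. The binding name is made up for illustration.

```haskell
import Text.Regex.TDFA

-- getAllTextMatches unwraps the "all matches" result into a plain list.
names :: [String]
names = getAllTextMatches ("john anne yifan" =~ "[a-z]+")
-- ["john","anne","yifan"]
```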
regex-tdfa only supports a small set of special characters and is much less featureful than some other regex engines you might be used to, such as PCRE.
\`: Match start of entire text (similar to ^ in other regex engines)
\': Match end of entire text (similar to $ in other regex engines)
\<: Match beginning of word
\>: Match end of word
\b: Match beginning or end of word
\B: Match neither beginning nor end of word
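As a small sketch of the word anchors in the list above, the following counts matches with and without \< and \>. It relies on the Int result type, which returns the number of matches; the binding names are made up for illustration, and the backslashes are doubled because these are ordinary Haskell string literals.

```haskell
import Text.Regex.TDFA

subject :: String
subject = "cat concatenate"

-- Every occurrence of "cat", including the one inside "concatenate".
anyCat :: Int
anyCat = subject =~ "cat"              -- 2

-- Only "cat" as a standalone word.
wholeWordCat :: Int
wholeWordCat = subject =~ "\\<cat\\>"  -- 1
```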
Less common stuff
Get match indices
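A minimal sketch: getAllMatches with a list of pairs gives the (offset, length) of each match. The binding name is made up for illustration.

```haskell
import Text.Regex.TDFA

-- Each pair is (start index, length) of one match.
wordSpans :: [(Int, Int)]
wordSpans = getAllMatches ("john anne yifan" =~ "[a-z]+")
-- [(0,4),(5,4),(10,5)]
```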
Get submatch indices
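A minimal sketch: getAllSubmatches gives the (offset, length) of the whole first match followed by each capture group. The binding name is made up for illustration.

```haskell
import Text.Regex.TDFA

-- First pair is the whole match; the rest are the capture groups in order.
attrSpans :: [(Int, Int)]
attrSpans = getAllSubmatches ("div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]")
-- [(0,14),(4,4),(9,4)]
```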
regex-tdfa doesn’t seem to provide functionality to do find-and-replace.
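If you only need something simple, a replace-all can be pieced together from the (before, match, after) result shown earlier. The replaceAll function below is a hypothetical helper, not part of regex-tdfa, and it is only a sketch: it recompiles the pattern on every step, stops at an empty match, and anchors such as ^ or \< are evaluated against the remaining text rather than the original string.

```haskell
import Text.Regex.TDFA

-- Hypothetical helper: replace every non-empty match of a pattern.
replaceAll :: String -> String -> String -> String
replaceAll pat replacement = go
  where
    go "" = ""
    go input =
      case input =~ pat :: (String, String, String) of
        -- No match (or an empty match): keep the rest unchanged.
        (_, "", _)         -> input
        -- Keep the text before the match, substitute, and continue after it.
        (before, _, after) -> before ++ replacement ++ go after
```

For example, `replaceAll "[0-9]+" "N" "a1b22c"` evaluates to `"aNbNc"`.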
If you find yourself writing a lot of regexes, take a look at raw-strings-qq. It’ll let you write regexes without needing to escape all your backslashes.
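As a sketch, here is the submatch regex from earlier written with the r quasiquoter from Text.RawString.QQ (in the raw-strings-qq package on Hackage). It assumes the QuasiQuotes extension is enabled; divAttrs is a made-up name for illustration.

```haskell
{-# LANGUAGE QuasiQuotes #-}
import Text.RawString.QQ (r)
import Text.Regex.TDFA

-- Hypothetical helper: pull the attribute name and value out of "div[name=value]".
-- Inside [r|...|] a backslash is just a backslash, so no doubling is needed.
divAttrs :: String -> [String]
divAttrs input = subs
  where
    (_, _, _, subs) =
      input =~ [r|div\[([a-z]+)=([^]]+)\]|] :: (String, String, String, [String])
```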
If you find that you need to do something more complicated with your text, it may be that you’re trying to use the wrong tool; take a look at using parser combinators instead. For parsing human-generated files, take a look at megaparsec. If you need maximum speed for over-the-wire formats, attoparsec is probably what you’re looking for.