A cheatsheet to regexes in Haskell

April 11, 2019


UPDATE: This cheatsheet is now part of the documentation for regex-tdfa!

While Haskell is great for writing parsers, sometimes the simplest solution is just to do some text munging with regular expressions. Whenever I need simple pattern matching on strings, I reach for regex-tdfa; it's fast and supports all the directives I need. But every time, I end up rereading the documentation to remember how to use it. So here's a cheatsheet for the most common regex use cases in Haskell.

Importing and using

Add to your package.yaml (or the equivalent build-depends entry in your cabal file):

dependencies:
  - regex-tdfa
  - regex-tdfa-text

In modules where you need to use regexes:

import Text.Regex.TDFA
import Text.Regex.TDFA.Text ()

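At the time of writing, the regex-tdfa package itself only lets you match on String/ByteString; hence the regex-tdfa-text dependency and import, which add Text support. Newer regex-tdfa releases (1.3.1 onward, as far as I know) ship the Text instances themselves, so the extra package should not be needed there.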

Basics

λ> emailRegex = "[a-zA-Z0-9+._-]+@[a-zA-Z-]+\\.[a-z]+"
λ> "my email is email@email.com" =~ emailRegex :: Bool
>>> True

-- non-monadic
<to-match-against> =~ <regex>

-- monadic; calls fail (via MonadFail) when there is no match,
-- e.g. gives Nothing in Maybe
<to-match-against> =~~ <regex>

(=~) and (=~~) are polymorphic in their return type. This is so that regex-tdfa can pick the most efficient way to give you your result based on what you need. For instance, if all you want is to check whether the regex matched or not, there's no need to allocate a result string. If you only want the first match, rather than all the matches, then the matching engine can stop after finding a single hit.

This does mean, though, that you may sometimes have to explicitly specify the type you want, especially if you're trying things out at the REPL.
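To make the return-type polymorphism concrete, here is a small sketch that runs the same match under several type annotations. The names sample, wordRegex, matched, firstWord, matchCount and noDigits are made up for illustration; only (=~) and (=~~) come from regex-tdfa.

```haskell
import Text.Regex.TDFA

-- Hypothetical sample values, used only for illustration.
sample :: String
sample = "john anne yifan"

wordRegex :: String
wordRegex = "[a-z]+"

-- Same operands, different annotations, different kinds of result.
matched :: Bool
matched = sample =~ wordRegex

firstWord :: String
firstWord = sample =~ wordRegex

-- Asking for an Int gives the number of matches.
matchCount :: Int
matchCount = sample =~ wordRegex

-- (=~~) reports a failed match through the monad; in Maybe that is Nothing.
noDigits :: Maybe String
noDigits = sample =~~ "[0-9]+"
```

In a source file the type signatures do the job of the inline `:: Bool` annotations you'd need at the REPL.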

Common use cases

Get the first match

-- returns empty string if no match
a =~ b :: String  -- or ByteString, or Text...

λ> "alexis-de-tocqueville" =~ "[a-z]+" :: String
>>> "alexis"

λ> "alexis-de-tocqueville" =~ "[0-9]+" :: String
>>> ""

Check if it matched at all

a =~ b :: Bool

λ> "alexis-de-tocqueville" =~ "[a-z]+" :: Bool
>>> True

Get first match + text before/after

-- if no match, will just return whole
-- string in the first element of the tuple
a =~ b :: (String, String, String)

λ> "alexis-de-tocqueville" =~ "de" :: (String, String, String)
>>> ("alexis-", "de", "-tocqueville")

λ> "alexis-de-tocqueville" =~ "kant" :: (String, String, String)
>>> ("alexis-de-tocqueville", "", "")

Get first match + submatches

-- same as above, but also returns a list of /just/ submatches
-- submatch list is empty if regex doesn't match at all
a =~ b :: (String, String, String, [String])

λ> "div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]"
     :: (String, String, String, [String])
>>> ("", "div[attr=1234]", "", ["attr","1234"])
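In a program, you'd typically pattern match on the submatch list rather than inspect the whole tuple. A minimal sketch, assuming you only care about the two captures; parseAttr is a hypothetical helper, not something regex-tdfa provides:

```haskell
import Text.Regex.TDFA

-- Hypothetical helper: pull (name, value) out of "div[name=value]".
parseAttr :: String -> Maybe (String, String)
parseAttr input =
  case input =~ "div\\[([a-z]+)=([^]]+)\\]"
         :: (String, String, String, [String]) of
    -- exactly two captures means the regex matched
    (_, _, _, [name, value]) -> Just (name, value)
    -- the submatch list is empty when there is no match
    _                        -> Nothing
```

With this, parseAttr "div[attr=1234]" gives Just ("attr","1234"), and an input that doesn't match gives Nothing.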

Get all matches

-- can also return Data.Array instead of List
getAllTextMatches (a =~ b) :: [String]

λ> getAllTextMatches ("john anne yifan" =~ "[a-z]+") :: [String]
>>> ["john","anne","yifan"]

Special characters

regex-tdfa only supports a small set of special characters and is much less featureful than some other regex engines you might be used to, such as PCRE.

\` — Match start of entire text (similar to ^ in other regex engines)
\' — Match end of entire text (similar to $ in other regex engines)
\< — Match beginning of word
\> — Match end of word
\b — Match beginning or end of word
\B — Match neither beginning nor end of word
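A few of these in action, as a rough sketch. The inputs and binding names are invented for illustration. Note that in an ordinary Haskell string literal each backslash has to be doubled.

```haskell
import Text.Regex.TDFA

-- \< and \> restrict the match to a whole word, so the "cat" inside
-- "concatenate" is skipped.
wholeWord :: (String, String, String)
wholeWord = "concatenate the cat food" =~ "\\<cat\\>"

-- \` only matches at the very start of the text, even if a later
-- line begins with the pattern.
atTextStart :: Bool
atTextStart = "one\ntwo" =~ "\\`one"

secondLineAtTextStart :: Bool
secondLineAtTextStart = "one\ntwo" =~ "\\`two"

-- \' only matches at the very end of the text.
atTextEnd :: Bool
atTextEnd = "one\ntwo" =~ "two\\'"
```

Here wholeWord comes out as ("concatenate the ", "cat", " food"), atTextStart and atTextEnd are True, and secondLineAtTextStart is False.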

Less common stuff

Get match indices

-- can also return Data.Array instead of List
getAllMatches (a =~ b) :: [(Int, Int)]  -- (index, length)

λ> getAllMatches ("john anne yifan" =~ "[a-z]+") :: [(Int, Int)]
>>> [(0,4), (5,4), (10,5)]

Get submatch indices

-- match of __entire__ regex is first element, not first capture
-- can also return Data.Array instead of List
getAllSubmatches (a =~ b) :: [(Int, Int)]  -- (index, length)

λ> getAllSubmatches ("div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]")
     :: [(Int, Int)]
>>> [(0,14), (4,4), (9,4)]

Replacement

Unfortunately, regex-tdfa doesn't seem to provide functionality to do find-and-replace.
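If you only need something simple, you can build a rough replace-all out of the (before, match, after) result shown earlier. This is a sketch under stated assumptions, not a library function: replaceAll is a hypothetical helper, it replaces with a fixed string (no references to capture groups), and it stops at the first empty match to avoid looping forever.

```haskell
import Text.Regex.TDFA

-- Hypothetical helper, not part of regex-tdfa: replace every
-- non-overlapping match of the regex with a fixed string.
replaceAll :: String -> String -> String -> String
replaceAll regex replacement = go
  where
    go input =
      case input =~ regex :: (String, String, String) of
        -- An empty match means "no match" or a zero-width match;
        -- either way, leave the rest of the input untouched.
        (_, "", _)         -> input
        (before, _, after) -> before ++ replacement ++ go after
```

For example, replaceAll "[0-9]+" "N" "a1b22c" gives "aNbNc". Two caveats: each step matches against the remaining suffix, so context-sensitive patterns (anchors, word boundaries) may see different surroundings than they would in the full text, and the regex is handed to (=~) again on every step, so this isn't the fast path for large inputs.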

Avoiding backslashes

If you find yourself writing a lot of regexes, take a look at raw-strings-qq. It'll let you write regexes without needing to escape all your backslashes.

{-# LANGUAGE QuasiQuotes #-}

import Text.RawString.QQ
import Text.Regex.TDFA

λ> "2 * (3 + 1) / 4" =~ [r|\([^)]+\)|] :: String
>>> "(3 + 1)"

If you find that you need to do something more complicated with your text, it may be that you're trying to use the wrong tool; take a look at using parser combinators instead. For parsing human-generated files, take a look at megaparsec. If you need maximum speed for over-the-wire formats, attoparsec is probably what you're looking for.
