UPDATE: This cheatsheet is now part of the documentation for regex-tdfa!
While Haskell is great for writing parsers, sometimes the simplest solution is just to do some text munging with regular expressions. Whenever I find myself needing to do some simple pattern matching on strings, I always reach for regex-tdfa; it's fast and supports all the directives I need. But whenever I do, I inevitably find myself having to reread the documentation all over again to figure out how to use anything. So here's a cheatsheet for the most common use cases for regexes in Haskell.
Importing and using
Add to your package.yaml/cabal file:
dependencies:
- regex-tdfa
- regex-tdfa-text
In modules where you need to use regexes:
import Text.Regex.TDFA
import Text.Regex.TDFA.Text ()
The regex-tdfa package only lets you match on String/ByteString; hence the import of regex-tdfa-text.
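As a small sketch of what that extra import buys you, here is matching against Data.Text values instead of String. The helper names firstWord and hasDigits are made up for illustration, and this assumes the text package is also in your dependencies.

```haskell
import qualified Data.Text as T
import Text.Regex.TDFA
import Text.Regex.TDFA.Text ()  -- instances only: lets (=~) accept and return Text

-- Hypothetical helper: first run of lowercase letters, or empty Text if none.
firstWord :: T.Text -> T.Text
firstWord t = t =~ "[a-z]+"

-- Hypothetical helper: does the Text contain at least one digit?
hasDigits :: T.Text -> Bool
hasDigits t = t =~ "[0-9]"
```

The regex itself is still a plain String literal here; only the text being searched (and the result) is Text. If you turn on OverloadedStrings, you will probably need to annotate the regex literal so its type is not ambiguous.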
Basics
λ> emailRegex = "[a-zA-Z0-9+._-]+@[a-zA-Z-]+\\.[a-z]+"
λ> "my email is email@email.com" =~ emailRegex :: Bool
>>> True
-- non-monadic
<to-match-against> =~ <regex>
-- monadic, uses MonadFail on lack of match
<to-match-against> =~~ <regex>
(=~) and (=~~) are polymorphic in their return type. This is so that regex-tdfa can pick the most efficient way to give you your result based on what you need. For instance, if all you want is to check whether the regex matched or not, there's no need to allocate a result string. If you only want the first match, rather than all the matches, then the matching engine can stop after finding a single hit.
This does mean, though, that you may sometimes have to explicitly specify the type you want, especially if you're trying things out at the REPL.
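To make the return-type polymorphism concrete, here is a minimal sketch that runs one input and one pattern at several result types. In a source file the type signatures do the job that the `:: Type` annotations do at the REPL. All the names are invented for this example, and the Int result (a count of matches) is an instance that isn't covered elsewhere in this cheatsheet.

```haskell
import Text.Regex.TDFA

-- One input and one pattern, reused at several result types below.
input, digits :: String
input  = "order 66 shipped, order 99 pending"
digits = "[0-9]+"

-- Bool: did the regex match anywhere?
matched :: Bool
matched = input =~ digits

-- String: text of the first match ("" if there is none).
firstHit :: String
firstHit = input =~ digits

-- Int: number of matches.
hitCount :: Int
hitCount = input =~ digits

-- (=~~) signals a failed match through MonadFail; in Maybe that is Nothing.
noHit :: Maybe String
noHit = "no digits here" =~~ digits

someHit :: Maybe String
someHit = input =~~ digits
```

Using (=~~) at Maybe is handy when an empty string could be a legitimate match and you need to tell "matched nothing" apart from "did not match".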
Common use cases
Get the first match
-- returns empty string if no match
a =~ b :: String -- or ByteString, or Text...
λ> "alexis-de-tocqueville" =~ "[a-z]+" :: String
>>> "alexis"
λ> "alexis-de-tocqueville" =~ "[0-9]+" :: String
>>> ""
Check if it matched at all
a =~ b :: Bool
λ> "alexis-de-tocqueville" =~ "[a-z]+" :: Bool
>>> True
Get first match + text before/after
-- if no match, will just return whole
-- string in the first element of the tuple
a =~ b :: (String, String, String)
λ> "alexis-de-tocqueville" =~ "de" :: (String, String, String)
>>> ("alexis-", "de", "-tocqueville")
λ> "alexis-de-tocqueville" =~ "kant" :: (String, String, String)
>>> ("alexis-de-tocqueville", "", "")
Get first match + submatches
-- same as above, but also returns a list of /just/ submatches
-- submatch list is empty if regex doesn't match at all
a =~ b :: (String, String, String, [String])
λ> "div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]"
     :: (String, String, String, [String])
>>> ("", "div[attr=1234]", "", ["attr","1234"])
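In real code you'll usually want to pattern match on that submatch list rather than index into it. A minimal sketch, reusing the regex from above; parseAttr is a hypothetical helper, not something regex-tdfa provides.

```haskell
import Text.Regex.TDFA

-- Hypothetical helper: split "div[name=value]" into (name, value).
-- Returns Nothing when the regex does not match, because the
-- submatch list is then empty and the first case alternative fails.
parseAttr :: String -> Maybe (String, String)
parseAttr s =
  case s =~ "div\\[([a-z]+)=([^]]+)\\]" :: (String, String, String, [String]) of
    (_, _, _, [name, value]) -> Just (name, value)
    _                        -> Nothing
```

Matching on the exact list shape `[name, value]` means a regex with a different number of capture groups falls through to Nothing instead of crashing.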
Get all matches
-- can also return Data.Array instead of List
getAllTextMatches (a =~ b) :: [String]
λ> getAllTextMatches ("john anne yifan" =~ "[a-z]+") :: [String]
>>> ["john","anne","yifan"]
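If you need every match together with its submatches, one option is to ask for a `[[String]]` result, where each inner list holds the whole match followed by its captures. This is a sketch under that assumption; keyValues is a made-up helper name.

```haskell
import Text.Regex.TDFA

-- Hypothetical helper: collect every key=value pair in a string.
-- Each row is [whole match, capture 1, capture 2].
keyValues :: String -> [(String, String)]
keyValues s = [ (k, v) | [_, k, v] <- rows ]
  where
    rows :: [[String]]
    rows = s =~ "([a-z]+)=([0-9]+)"
```

For example, `keyValues "a=1, bb=22; c=x"` picks up the first two pairs and skips `c=x`, since `x` is not a digit.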
Special characters
regex-tdfa only supports a small set of special characters and is much less featureful than some other regex engines you might be used to, such as PCRE.
\` : Match start of entire text (similar to ^ in other regex engines)
\' : Match end of entire text (similar to $ in other regex engines)
\< : Match beginning of word
\> : Match end of word
\b : Match beginning or end of word
\B : Match neither beginning nor end of word
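Here's a short sketch of two of these in use: anchoring a pattern to the whole input, and searching for a whole word. Both helper names are invented for the example, and remember that each backslash has to be doubled inside an ordinary Haskell string literal.

```haskell
import Text.Regex.TDFA

-- Hypothetical helper: the pattern must span the entire input,
-- from the very start (\`) to the very end (\').
isIdentifier :: String -> Bool
isIdentifier s = s =~ "\\`[a-z_][a-z0-9_]*\\'"

-- Hypothetical helper: "cat" must stand alone as a word,
-- not appear inside a longer one.
mentionsCat :: String -> Bool
mentionsCat s = s =~ "\\<cat\\>"
```

With these, `isIdentifier "foo_1"` holds but `isIdentifier "foo bar"` does not, and `mentionsCat "a cat sat"` holds while `mentionsCat "concat"` does not.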
Less common stuff
Get match indices
-- can also return Data.Array instead of List
getAllMatches (a =~ b) :: [(Int, Int)] -- (index, length)
λ> getAllMatches ("john anne yifan" =~ "[a-z]+") :: [(Int, Int)]
>>> [(0,4), (5,4), (10,5)]
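The (index, length) pairs are enough to slice the matched text back out of the input, which is useful when you want both the position and the content. A minimal sketch; matchesWithOffsets is a hypothetical helper.

```haskell
import Text.Regex.TDFA

-- Hypothetical helper: pair each match's starting offset with its text,
-- recovering the text by slicing the input with (index, length).
matchesWithOffsets :: String -> String -> [(Int, String)]
matchesWithOffsets regex s =
  [ (off, take len (drop off s)) | (off, len) <- spans ]
  where
    spans :: [(Int, Int)]
    spans = getAllMatches (s =~ regex)
```

Slicing a String with take/drop is linear in the offset, so for large inputs you'd probably want Text or ByteString instead.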
Get submatch indices
-- match of __entire__ regex is first element, not first capture
-- can also return Data.Array instead of List
getAllSubmatches (a =~ b) :: [(Int, Int)] -- (index, length)
λ> getAllSubmatches ("div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]")
     :: [(Int, Int)]
>>> [(0,14), (4,4), (9,4)]
Replacement
Unfortunately, regex-tdfa doesn't seem to provide functionality to do find-and-replace.
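For simple cases you can roll your own on top of the (before, match, after) result shown earlier. The following is only a sketch, not a library function: replaceAll is a made-up name, it inserts a literal replacement string (no backreferences), and it assumes the regex never matches the empty string.

```haskell
import Text.Regex.TDFA

-- Hypothetical helper: replace every match of `regex` in the input with `new`.
-- An empty match part means either "no match" or "matched the empty string";
-- in both cases we stop and return the remaining text unchanged.
replaceAll :: String -> String -> String -> String
replaceAll regex new = go
  where
    go s = case s =~ regex :: (String, String, String) of
      (before, "", after) -> before ++ after
      (before, _,  after) -> before ++ new ++ go after
```

For instance, `replaceAll "[0-9]+" "N" "a1b22c"` gives `"aNbNc"`. Two caveats: each step searches only the remaining suffix, so anchors and word boundaries are evaluated relative to that suffix rather than the original string, and the regex is handed to (=~) again on every step, so this is not the fastest way to do it.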
Avoiding backslashes
If you find yourself writing a lot of regexes, take a look at raw-strings-qq. It'll let you write regexes without needing to escape all your backslashes.
{-# LANGUAGE QuasiQuotes #-}
import Text.RawString.QQ
import Text.Regex.TDFA
λ> "2 * (3 + 1) / 4" =~ [r|\([^)]+\)|] :: String
>>> "(3 + 1)"
If you find that you need to do something more complicated with your text, it may be that you're trying to use the wrong tool; take a look at using parser combinators instead. For parsing human-generated files, take a look at megaparsec. If you need maximum speed for over-the-wire formats, attoparsec is probably what you're looking for.