Regular Expressions in Haskell
I just spent a few hours trying to figure out how to use basic regular
expressions in Haskell. All of the tutorials in the search results (even Bryan
O'Sullivan’s) are several years out of date, and the Hackage docs are
singularly unhelpful: Should you use Text.Regex, Text.Regex.Posix,
Text.Regex.PCRE, or Text.Regex.TDFA? And how do you use them?
Which one should I use?
I’m here to help: Use Text.Regex.PCRE (from the regex-pcre package) and its
=~ operator. You only need to import Text.Regex.PCRE.
For example, in GHCi:
    > import Text.Regex.PCRE
    > import qualified Data.ByteString.Char8 as BC
    >
    > "hello there" =~ "e" :: Bool
    True
    > "hello there" =~ "\\bthere\\b" :: Bool
    True
    > let stringResult = "hello there" =~ "e" :: AllTextMatches [] String
    > getAllTextMatches stringResult
    ["e","e","e"]
    -- Use a ByteString for efficiency
    > let bsResult = (BC.pack "hello there") =~ "e" :: AllTextMatches [] BC.ByteString
    > :t bsResult
    bsResult :: AllTextMatches [] BC.ByteString
    > :t getAllTextMatches bsResult
    getAllTextMatches bsResult :: [BC.ByteString]
=~ returns a different type depending on the type annotation we give it. For
example, we can annotate it as Bool to get a simple yes/no on whether it
matched. The various types that =~ can return are all instances of
RegexContext.
I found those RegexContext instance types hard to read, so here are some tips.
When looking at the instances, look at the last type in the line.
For example, given RegexContext a b MatchArray, =~ will return a MatchArray.
The b in the various signatures is the type of the left side of =~. For
stringResult above, it’s String. For bsResult above, it’s ByteString, and we
can see that the type signature changes accordingly.
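To make the "last type in the line" tip concrete, here is a small sketch that pins =~ to a few of the result types I am confident exist as RegexContext instances for String input. The function names (hasMatch, countMatches, and so on) are my own illustrative choices, not part of regex-pcre.

```haskell
import Text.Regex.PCRE ((=~))

-- NOTE: every function name below is hypothetical; only (=~) comes from
-- regex-pcre. Each signature picks a different RegexContext instance.

-- Did the pattern match at all?
hasMatch :: String -> String -> Bool
hasMatch input pat = input =~ pat

-- How many matches are there?
countMatches :: String -> String -> Int
countMatches input pat = input =~ pat

-- Text of the first match ("" if there is none).
firstMatch :: String -> String -> String
firstMatch input pat = input =~ pat

-- (text before the first match, the match itself, text after it)
splitAround :: String -> String -> (String, String, String)
splitAround input pat = input =~ pat

-- One inner list per match: the full match followed by its capture groups.
matchesWithGroups :: String -> String -> [[String]]
matchesWithGroups input pat = input =~ pat
```

Loaded into GHCi, splitAround "hello there" " " should give the three pieces around the space, and matchesWithGroups is handy when the pattern has parenthesized groups.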
For more documentation on the information that each instance contains, I
recommend this link. It says that it’s out of date, but it’s
fairly accurate nevertheless, and gives an idea of what (for example) a
MatchArray contains.
And if you’re worried about UTF-8 support: it all just works.
More control over matching
=~ is fine and dandy for the 80% use case, but sometimes we need more
control. For example, we may want to do case-insensitive matching. Then we must
use the lower-level RegexMaker methods, like makeRegex and makeRegexOpts.
They’re not as nice as =~, but they offer some things that =~ doesn’t.
    > import Text.Regex.PCRE
    > import Data.Bits ((.|.))
    >
    > let compilationOptions = defaultCompOpt .|. compCaseless
    -- In real programs, the `:: Regex` should be unnecessary
    > let regex = makeRegex "hey" :: Regex
    > let caselessRegex = makeRegexOpts compilationOptions defaultExecOpt "hey"
    >
    > matchTest regex "HEY"
    False
    > matchTest caselessRegex "HEY"
    True
    > matchCount caselessRegex "HEY hEy HeY"
    3
    -- For the specific case of case insensitivity, we can still use =~ with (?i).
    -- But note that `=~` cannot be used with a Regex, only a String.
    > "HEY hEy HeY" =~ "(?i)hey" :: Int
    3
We combine the default compilation options (defaultCompOpt) with the
compCaseless option to make the regex case-insensitive.
The options are a bitmask, so we must combine them with the bitwise “or”
operator, .|. from Data.Bits. (You can see the available CompOptions in the
docs.)
The matchTest and matchCount functions come from Text.Regex.Base.RegexLike.
There are a few more handy functions in there too.
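Outside GHCi, the same idea might look like the sketch below: compile the pattern once with the caseless option, then reuse it. The helper names caseless, mentionsHey, and countHeys are hypothetical, and I am assuming String input throughout.

```haskell
import Data.Bits ((.|.))
import Text.Regex.PCRE

-- `caseless` is a hypothetical helper, not something regex-pcre provides.
-- It compiles a pattern with case-insensitive matching switched on.
caseless :: String -> Regex
caseless = makeRegexOpts (defaultCompOpt .|. compCaseless) defaultExecOpt

-- Reuse the compiled regex to ask different questions about an input.
mentionsHey :: String -> Bool
mentionsHey = matchTest (caseless "hey")

countHeys :: String -> Int
countHeys = matchCount (caseless "hey")
```

Giving caseless an explicit Regex result type keeps the option types unambiguous, which is the compiled-program counterpart of the `:: Regex` annotation in the GHCi session.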
The =~ operator is great for finding matches. How about replacing text, as Ruby’s string methods can?
Unfortunately, we can’t use
Text.Regex.PCRE for this. The only available regex
search-and-replace function is subRegex from
Text.Regex, which uses the
Text.Regex.Posix backend. Therefore we must construct our regexes using
mkRegex or mkRegexWithOpts.
Also note that we can’t use PCRE syntax (like
\b for word boundaries), because the backend
uses Posix regexes.
    > import Text.Regex
    >
    -- Use default regex options
    > let regex = mkRegex "hello"
    -- The first Bool controls multi-line matching,
    -- the second Bool controls case sensitivity (True = case-sensitive).
    -- False/False means "match across lines" and "case-insensitive".
    -- The default is True/True.
    > let caseInsensitiveRegex = mkRegexWithOpts "hello" True False
    > subRegex regex "hello there" "hi"
    "hi there"
    > subRegex caseInsensitiveRegex "HELLO there" "case says hi"
    "case says hi there"
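subRegex also understands backreferences in the replacement string: \1, \2, and so on refer to capture groups, and \0 to the whole match. Here is a minimal sketch; lastFirst and swapNames are hypothetical names of my own, and the pattern sticks to POSIX syntax since that is what the backend accepts.

```haskell
import Text.Regex (Regex, mkRegex, subRegex)

-- Hypothetical example pattern: "Last, First", written in POSIX
-- extended syntax (no PCRE shorthands like \w or \b).
lastFirst :: Regex
lastFirst = mkRegex "([A-Za-z]+), ([A-Za-z]+)"

-- Rewrite every "Last, First" as "First Last".
-- In the replacement, \1 and \2 are the capture groups.
swapNames :: String -> String
swapNames input = subRegex lastFirst input "\\2 \\1"
```

Because subRegex replaces every occurrence, an input with several "Last, First" pairs gets all of them swapped in one pass.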
Why so many modules?
It’s a good question: why does Haskell have Text.Regex, Text.Regex.Base,
Text.Regex.Posix, and Text.Regex.PCRE? According to the maintainer:
The regex-compat module that provides Text.Regex is the compatibility that provides the “ancient” API that I superceded. The new system, which also underpins regex-compat, has the API in regex-base. This only defines the API, the implementation usually comes from regex-posix, regex-pcre, or regex-tdfa. The regex-compat uses regex-posix which agrees with the “ancient” API it preserves. The regex-pcre module wraps PCRE itself.
Text.Regex.Base defines an API (the RegexContext bits) that is implemented by
backends like Text.Regex.Posix and Text.Regex.PCRE. The Text.Regex functions
like subRegex are old, and almost entirely superseded by Text.Regex.Base. In a
perfect world, we wouldn’t use them at all. But nothing else provides
subRegex, so we need to drop down to its ancient API to search-and-replace.
The Programming Language Examples Alike Cookbook, or PLEAC, is a project that solves the same problems in different languages for comparison purposes. It’s very handy when you know how to do something in one language but need to do it in another language. Their page on regular expressions is instructive.