Regular Expressions in Haskell
I just spent a few hours trying to figure out how to use basic regular
expressions in Haskell. All of the tutorials in the search results (even Bryan
O'Sullivan’s) are several years out of date, and the Hackage docs are
singularly unhelpful: Should you use Text.Regex, Text.Regex.Base,
Text.Regex.PCRE or Text.Regex.TDFA? And how do you use them?
Which one should I use?
I’m here to help: Use Text.Regex.PCRE (the regex-pcre package) and its
polymorphic =~ operator. You only need to import Text.Regex.PCRE.
For example, in GHCi:
> import Text.Regex.PCRE
> import qualified Data.ByteString.Char8 as BC
>
> "hello there" =~ "e" :: Bool
True
> "hello there" =~ "\\bthere\\b" :: Bool
True
> let stringResult = "hello there" =~ "e" :: AllTextMatches [] String
> getAllTextMatches stringResult
["e","e","e"]
-- Use a ByteString for efficiency
> let bsResult = (BC.pack "hello there") =~ "e" :: AllTextMatches [] BC.ByteString
> :t bsResult
bsResult :: AllTextMatches [] BC.ByteString
> :t getAllTextMatches bsResult
getAllTextMatches bsResult :: [BC.ByteString]
Note that =~ returns a different type depending on the type annotation we give
it. For example, we can annotate the result as Bool to get a simple yes/no on
whether it matched. The various types that =~ can return are all instances of
Text.Regex.Base’s RegexContext.
I found those RegexContext instance types hard to read, so here are some tips.
When looking at the instances, look at the last type in the line.
For example, given RegexContext a b MatchArray, =~ will return MatchArray.
The b in the various RegexContext type signatures is the type of the left side
of =~. In stringResult above, it’s String. In bsResult above, it’s ByteString,
and we can see that the type signature changes accordingly.
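To make those tips concrete, here is a small sketch of a few result types that =~ can produce. It assumes the regex-pcre package is installed; all of the binding names (input, anyMatch, and so on) are made up for illustration.

```haskell
import Text.Regex.PCRE ((=~))

-- Illustrative input; the left side of =~ is a String here.
input :: String
input = "hello there"

-- Bool: did the pattern match at all?
anyMatch :: Bool
anyMatch = input =~ "e"

-- Int: how many matches were found?
matchTotal :: Int
matchTotal = input =~ "e"

-- String: the text of the first match.
firstMatch :: String
firstMatch = input =~ "t\\w+"

-- (before, match, after) around the first match.
context :: (String, String, String)
context = input =~ "t\\w+"

-- [[String]]: every match, each followed by its capture groups.
withGroups :: [[String]]
withGroups = input =~ "(\\w)e"

main :: IO ()
main = do
  print anyMatch
  print matchTotal
  print firstMatch
  print context
  print withGroups
```

The pattern is the same kind of String in every case; only the annotated result type changes what =~ hands back.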
For more documentation on the information that each instance contains, I
recommend this link. It says that it’s out of date, but it’s
fairly accurate nevertheless, and gives an idea of what (for example) [[b]]
actually contains.
And if you’re worried about UTF-8 support: it all just works.
More control over matching
=~ is fine and dandy for the 80% use case, but sometimes we need more
control. For example, we may want to do case-insensitive matching. Then we must
use the lower-level RegexMaker methods, like makeRegexOpts.
They’re not as nice as =~, but there are some things that =~ doesn’t offer.
Here’s how:
> import Text.Regex.PCRE
> import Data.Bits ((.|.))
>
> let compilationOptions = defaultCompOpt .|. compCaseless
-- In real programs, the `:: Regex` should be unnecessary
> let regex = makeRegex "hey" :: Regex
> let caselessRegex = makeRegexOpts compilationOptions defaultExecOpt "hey"
>
> matchTest regex "HEY"
False
> matchTest caselessRegex "HEY"
True
> matchCount caselessRegex "HEY hEy HeY"
3
-- For the specific case of case insensitivity, we can still use =~ with (?i).
-- But note that the pattern given to `=~` must be a String or ByteString,
-- not a compiled Regex.
> "HEY hEy HeY" =~ "(?i)hey" :: Int
3
We combine the default compilation options (defaultCompOpt) with the
compCaseless option to make the regex case-insensitive.
The options are a bitmask so we must combine them with the bitwise “or”
operator (.|.).
(You can see the available CompOptions in the docs.)
The matchTest and matchCount functions come from RegexLike.
There are a few more handy functions in there too, like matchAll.
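Since =~ does not accept a compiled Regex, one option is the match method of RegexContext, which takes a Regex and is polymorphic in its result in the same way. The sketch below assumes regex-pcre is installed; the helper names caseless and allMatches are hypothetical, not part of the library.

```haskell
import Data.Bits ((.|.))
import Text.Regex.PCRE

-- Hypothetical helper: compile a pattern with case-insensitive matching.
caseless :: String -> Regex
caseless = makeRegexOpts (defaultCompOpt .|. compCaseless) defaultExecOpt

-- Hypothetical helper: every matched substring for a precompiled Regex.
-- `match` is the RegexContext counterpart of =~ for compiled regexes.
allMatches :: Regex -> String -> [String]
allMatches regex input = getAllTextMatches (match regex input)

main :: IO ()
main = do
  let regex = caseless "hey"
  print (matchTest regex "HEY")
  print (matchCount regex "HEY hEy HeY")
  print (allMatches regex "HEY hEy HeY")
```

Compiling once and reusing the Regex also avoids recompiling the pattern on every call, which =~ has to do.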
Replacing
The =~ operator is great for finding matches. How about replacing, like Ruby’s
"something".gsub(/old/, "new")?
Unfortunately, we can’t use Text.Regex.PCRE for this. The only available regex
search-and-replace function is subRegex from Text.Regex, which uses the
Text.Regex.Posix backend. Therefore we must construct our regexes using
functions from Text.Regex, like mkRegex and mkRegexWithOpts.
Also note that we can’t use PCRE syntax (like \b for word boundaries), because
the backend uses Posix regexes.
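As a rough sketch of the gsub equivalent, subRegex can be wrapped in a small helper. The name gsub and its argument order are invented here to mirror Ruby; only mkRegex and subRegex come from Text.Regex (the regex-compat package, assumed to be installed).

```haskell
import Text.Regex (mkRegex, subRegex)

-- Hypothetical helper: pattern, replacement, then input, as in Ruby's gsub.
-- subRegex itself takes the input before the replacement.
gsub :: String -> String -> String -> String
gsub pat replacement input = subRegex (mkRegex pat) input replacement

main :: IO ()
main = do
  -- Every occurrence is replaced, not just the first.
  putStrLn (gsub "old" "new" "old habits, old friends")
  -- "\\1" in the replacement refers to the first capture group.
  putStrLn (gsub "(h)ello" "\\1i" "hello there")
```

Patterns passed to this helper must use Posix syntax, for the reason given above.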
> import Text.Regex
>
-- Use default regex options
> let regex = mkRegex "hello"
-- The first Bool controls multi-line matching (True: ^ and $ match at line breaks),
-- the second Bool controls case sensitivity (True: case-sensitive).
-- True/False below means "multi-line" and "case-insensitive".
-- The default (what mkRegex uses) is True/True.
> let caseInsensitiveRegex = mkRegexWithOpts "hello" True False
> subRegex regex "hello there" "hi"
"hi there"
> subRegex caseInsensitiveRegex "HELLO there" "case says hi"
"case says hi there"
Why so many modules?
It’s a good question: why does Haskell have Text.Regex, Text.Regex.Base,
and Text.Regex.PCRE? According to the maintainer:
The regex-compat module that provides Text.Regex is the compatibility that provides the “ancient” API that I superceded. The new system, which also underpins regex-compat, has the API in regex-base. This only defines the API, the implementation usually comes from regex-posix, regex-pcre, or regex-tdfa. The regex-compat uses regex-posix which agrees with the “ancient” API it preserves. The regex-pcre module wraps PCRE itself.
So Text.Regex.Base defines an API (the RegexLike/RegexContext bits) that
is implemented by backends like Text.Regex.PCRE and Text.Regex.Posix.
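One way to see the shared API is to run the same expression against two backends. This sketch assumes both regex-pcre and regex-posix are installed; the names pcreCount and posixCount are illustrative, and the pattern is chosen to be valid in both syntaxes.

```haskell
import qualified Text.Regex.PCRE as PCRE
import qualified Text.Regex.Posix as Posix

-- Count vowels using the PCRE backend.
pcreCount :: Int
pcreCount = "hello there" PCRE.=~ "[aeiou]"

-- The same expression using the Posix backend; only the module differs.
posixCount :: Int
posixCount = "hello there" Posix.=~ "[aeiou]"

main :: IO ()
main = do
  print pcreCount
  print posixCount
```

Each backend exports its own =~, so the imports are qualified to avoid a name clash; the result types on offer are the same because both come from regex-base.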
The Text.Regex functions like mkRegex and subRegex are old, and almost
entirely superseded by Text.Regex.Base. In a perfect world, we wouldn’t use
them at all. But nothing else provides subRegex, so we need to drop down to
its ancient API to search-and-replace.
Whew!
Further reading
The Programming Language Examples Alike Cookbook, or PLEAC, is a project that solves the same problems in different languages for comparison purposes. It’s very handy when you know how to do something in one language but need to do it in another language. Their page on regular expressions is instructive.