Regular Expressions in Haskell

I just spent a few hours trying to figure out how to use basic regular expressions in Haskell. All of the tutorials in the search results (even Bryan O'Sullivan’s) are several years out of date, and the Hackage docs are singularly unhelpful: Should you use Text.Regex, Text.Regex.Base, Text.Regex.PCRE or Text.Regex.TDFA? And how do you use them?

Which one should I use?

I’m here to help: Use Text.Regex.PCRE (the regex-pcre package) and its polymorphic =~ operator. You only need to import Text.Regex.PCRE.

For example, in GHCi:

> import Text.Regex.PCRE
> import qualified Data.ByteString.Char8 as BC
>
> "hello there" =~ "e" :: Bool
True
> "hello there" =~ "\\bthere\\b" :: Bool
True
> let stringResult = "hello there" =~ "e" :: AllTextMatches [] String
> getAllTextMatches stringResult
["e","e","e"]
-- Use a ByteString for efficiency
> let bsResult = (BC.pack "hello there") =~ "e" :: AllTextMatches [] BC.ByteString
> :t bsResult
bsResult :: AllTextMatches [] BC.ByteString
> :t getAllTextMatches bsResult
getAllTextMatches bsResult :: [BC.ByteString]

Note that =~ returns a different type depending on the type annotation we give it. For example, we can annotate it as Bool to get a simple yes/no on whether it matched. The various types that =~ can return are all instances of Text.Regex.Base’s RegexContext.

I found those RegexContext instance types hard to read, so here are some tips. When looking at the instances, look at the last type in the line. For example, given RegexContext a b MatchArray, =~ will return MatchArray. The b in the various RegexContext type signatures is the type of the left side of =~. In stringResult above, it’s String. In bsResult above, it’s ByteString, and we can see that the type signature changes accordingly.
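
To make the result-type polymorphism concrete, here is a minimal sketch that annotates the same kind of match as several different RegexContext result types. It assumes the regex-pcre package is installed; the sample string and the top-level names are mine, chosen only for illustration.

```haskell
import Text.Regex.PCRE ((=~))

-- Hypothetical sample input, chosen for illustration.
sample :: String
sample = "hello there"

-- Bool: did the pattern match anywhere?
matched :: Bool
matched = sample =~ "l+"

-- Int: how many matches are there?
matchTotal :: Int
matchTotal = sample =~ "e"

-- String: the text of the first match ("" if there is none).
firstMatch :: String
firstMatch = sample =~ "t[a-z]+"

-- (before, match, after): the first match and the text around it.
parts :: (String, String, String)
parts = sample =~ "l+"

-- [[String]]: one inner list per match, holding the whole match
-- followed by its capture groups.
groups :: [[String]]
groups = sample =~ "(h)(e)"

main :: IO ()
main = do
  print matched     -- True
  print matchTotal  -- 3
  print firstMatch  -- "there"
  print parts       -- ("he","ll","o there")
  print groups      -- [["he","h","e"],["he","h","e"]]
```

In each case the left side of =~ is a String, so the b in the instance is String; swapping in a ByteString on the left would change the result types accordingly.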

For more documentation on the information that each instance contains, I recommend this link. It says that it’s out of date, but it’s fairly accurate nevertheless, and gives an idea of what (for example) [[b]] actually contains.

And if you’re worried about UTF-8 support: it all just works.

More control over matching

=~ is fine and dandy for the 80% use case, but sometimes we need more control. For example, we may want to do case-insensitive matching. Then we must use the lower-level RegexMaker methods, like makeRegexOpts. They’re not as nice as =~, but there are some things that =~ doesn’t offer.

Here’s how:

> import Text.Regex.PCRE
> import Data.Bits ((.|.))
>
> let compilationOptions = defaultCompOpt .|. compCaseless
-- In real programs, the `:: Regex` should be unnecessary
> let regex = makeRegex "hey" :: Regex
> let caselessRegex = makeRegexOpts compilationOptions defaultExecOpt "hey"
>
> matchTest regex "HEY"
False
> matchTest caselessRegex "HEY"
True
> matchCount caselessRegex "HEY hEy HeY"
3
-- For the specific case of case insensitivity, we can still use =~ with (?i).
-- But note that the right side of `=~` must be an uncompiled pattern (such as a String), not a Regex.
> "HEY hEy HeY" =~ "(?i)hey" :: Int
3

We combine the default compilation options (defaultCompOpt) with the compCaseless option to make the regex case-insensitive. The options are a bitmask so we must combine them with the bitwise “or” operator, .|.. (You can see the available CompOptions in the docs.)

The matchTest and matchCount functions come from RegexLike. There are a few more handy functions in there too, like matchAll.
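
As a rough sketch of what else is available once you hold a compiled Regex: the match function from RegexContext selects its result type from the annotation, much like =~ does, and matchAllText from RegexLike returns each match together with its offset and length. This assumes regex-pcre is installed; the input string and the names firstHit, allHits and offsets are made up for the example.

```haskell
import Data.Bits ((.|.))
import Data.Foldable (toList)
import Text.Regex.PCRE

-- A compiled, case-insensitive regex, built the same way as in the session above.
caselessRegex :: Regex
caselessRegex =
  makeRegexOpts (defaultCompOpt .|. compCaseless) defaultExecOpt "hey"

-- `match` takes a compiled Regex and picks its result from the annotation.
firstHit :: String
firstHit = match caselessRegex "say HEY hEy"

-- The same call, annotated to return every match.
allHits :: [String]
allHits = getAllTextMatches (match caselessRegex "say HEY hEy")

-- matchAllText yields one array per match; each element pairs the
-- matched text with its (offset, length) in the input.
offsets :: [(String, (Int, Int))]
offsets = concatMap toList (matchAllText caselessRegex "say HEY hEy")

main :: IO ()
main = do
  print firstHit  -- "HEY"
  print allHits   -- ["HEY","hEy"]
  print offsets   -- [("HEY",(4,3)),("hEy",(8,3))]
```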

Replacing

The =~ operator is great for finding matches. How about replacing, like Ruby’s "something".gsub(/old/, "new")?

Unfortunately, we can’t use Text.Regex.PCRE for this. The only available regex search-and-replace function is subRegex from Text.Regex, which uses the Text.Regex.Posix backend. Therefore we must construct our regexes using functions from Text.Regex, like mkRegex and mkRegexWithOpts.

Also note that we can’t use PCRE syntax (like \b for word boundaries), because the backend uses Posix regexes.

> import Text.Regex
>
-- Use default regex options
> let regex = mkRegex "hello"
-- The first Bool controls multi-line matching: True makes ^ and $ match
-- at line boundaries. The second Bool controls case sensitivity: True
-- means case-sensitive, False means case-insensitive.
-- mkRegex behaves like True/True.
> let caseInsensitiveRegex = mkRegexWithOpts "hello" True False
> subRegex regex "hello there" "hi"
"hi there"
> subRegex caseInsensitiveRegex "HELLO there" "case says hi"
"case says hi there"

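subRegex also understands backreferences in the replacement string: \1, \2 and so on refer to capture groups, and \0 refers to the whole match. Here is a small sketch, assuming the regex-compat package is installed; the helper names swapWords and bracketDigits are hypothetical.

```haskell
import Text.Regex (mkRegex, subRegex)

-- Swap two space-separated lowercase words.
-- \1 and \2 in the replacement refer to the capture groups.
swapWords :: String -> String
swapWords input = subRegex (mkRegex "([a-z]+) ([a-z]+)") input "\\2 \\1"

-- Wrap every run of digits in square brackets.
-- \0 in the replacement refers to the entire match.
bracketDigits :: String -> String
bracketDigits input = subRegex (mkRegex "[0-9]+") input "[\\0]"

main :: IO ()
main = do
  putStrLn (swapWords "hello there")           -- there hello
  putStrLn (bracketDigits "room 42, floor 7")  -- room [42], floor [7]
```

Both patterns stick to Posix extended syntax, since that is what the backend accepts.
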
Why so many modules?

It’s a good question — why does Haskell have Text.Regex, Text.Regex.Base, and Text.Regex.PCRE? According to the maintainer:

The regex-compat module that provides Text.Regex is the compatibility that provides the “ancient” API that I superceded. The new system, which also underpins regex-compat, has the API in regex-base. This only defines the API, the implementation usually comes from regex-posix, regex-pcre, or regex-tdfa. The regex-compat uses regex-posix which agrees with the “ancient” API it preserves. The regex-pcre module wraps PCRE itself.

So Text.Regex.Base defines an API (the RegexLike/RegexContext bits) that is implemented by backends like Text.Regex.PCRE and Text.Regex.Posix.
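
One way to see the shared API in action is to run the same =~ expression against two backends. This is only a sketch, assuming both regex-pcre and regex-posix are installed; the pattern is deliberately limited to syntax that both backends understand.

```haskell
import qualified Text.Regex.PCRE as PCRE
import qualified Text.Regex.Posix as Posix

-- The same pattern and input, matched by the PCRE backend...
viaPcre :: String
viaPcre = "hello there" PCRE.=~ "th[a-z]+"

-- ...and by the Posix backend. Only the imported module differs.
viaPosix :: String
viaPosix = "hello there" Posix.=~ "th[a-z]+"

main :: IO ()
main = do
  print viaPcre   -- "there"
  print viaPosix  -- "there"
```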

The Text.Regex functions like mkRegex and subRegex are old, and almost entirely superseded by Text.Regex.Base. In a perfect world, we wouldn’t use them at all. But nothing else provides subRegex, so we need to drop down to its ancient API to search-and-replace.

Whew!

Further reading

The Programming Language Examples Alike Cookbook, or PLEAC, is a project that solves the same problems in different languages for comparison purposes. It’s very handy when you know how to do something in one language but need to do it in another language. Their page on regular expressions is instructive.