Regular expressions

To process data you very often need to understand and use regular expressions. In this article we introduce them and use them to solve the Advent of Code Day 1 challenge.

Even if you are not familiar with the term regular expression you will almost certainly have come across them when searching or doing basic file operations. For example, using a “star” in place of part of a filename such as *.xlsx to refer to all Excel files. Once you get the hang of them you will find all sorts of uses for them, even as a handy tool in everyday life using spreadsheets etc for you personal admin.

A language within a language

That * is just one “word” in a much wider vocabulary of so-called regular expressions, regex for short. All programming languages offer some form of regex and aside from what might be called dialect differences, they are all the same.

The “expression” is a sequence of letters and symbols, each of which is an instruction to a mechanism. That mechanism takes a string as input and follows the instructions in the regular expression, looking character by character in the string for a match. For example, the expression “ch” tells the matcher to look for the letter “c” followed by the letter “h”. Given that regex and the string “church” the matcher would find two matches, one at the start and one at the end of the word. If we wanted to determine whether a document contained the word “pneumonia” we would give the matcher the document and the regular expression pneumonia and see whether there are any matches.

What about matching on the word “cell”? Obviously the expression cell works but it also matches cellulitis and varicella . We need to instruct the matcher to only match the word standing on its own. There are several ways to do this. One is to assume that words are surrounded by space, so we could add a space before and after our word in the expression, like this -> cell <-. That works sometimes but misses the cases where cell is the last word in a sentence and is therefore followed by a full stop. Needing to match on words is a common case that regex’s have evolved a special instruction to deal with: \b. This uses the notion of an “escape” character, \, which says the following character does not take on its usual meaning but instead should be treated in a special way. We say it is “escaped”. In this case, \b does not mean match the letter “b” but instead match the boundary of a word. So a better regex would be \bcell\b.

As well as word boundaries we can match the start of a line ^ and the end of a line $. These kind of matches include full stops, spaces, tabs, a whole class of things the matcher might encounter. The idea of a class of characters is common in regex’s. The class of all digits: 0, 1, 2, …, 8, 9 is represented by [0-9]. So AB[0-9]cd matches the captial letters AB followed by any single digit followed by the lowercase letters cd. All the lowercase letters are: [a-z] and all the uppercase ones [A-Z]. [a-zA-Z] is all letters in either case. Another concept is repetition: zero or more *, one or more +, at most once ?. So, ab+ matches a followed by one or more bs: ab, abb, abbbbbb would all match but just a on its own would not. ab? matches either a or ab. Sometimes you just want to match anything in a list, in which case we use the | or operator between each item. For example, we make a regex that matches some medical terms (and ignores their case).

things_to_match = 'spine|cervical|lung|mediastinum|liver|kidney|viscera|organ'
regex = re.compile(things_to_match, re.I) # Make a regex that ignores case

Once we have the regex we can put it to use. As the matcher can finds lots of matches there are ways to ask the matcher for just the first match(), or the next finditer() or all at once findall(). Here we choose findall():

re.findall(regex, 'The lungs and mediastinum are normal')
['lung','mediastinum']
re.findall(regex, 'The elfBrain is inside the elfSkull')
[]
re.findall(regex, 'Spines in elves?')
['spine']

It’s complicated

As you can see, regular expressions get very complicated very quickly. Don’t worry if they look baffling, even seasoned programmers experience this and depend heavily on documentation. There are also lots of tools and websites to help you build-up and explain regexs, such as Pythex. And as ever, stackoverflow is a good source of ready made regexs, see the Oven ready section for examples.

Truthiness

What comes out of the matcher is very often used as the condition in an if statement or while statement: if there is a match …, or while we are not finding a match …. To support this, the output of the matcher has “truthiness” which means it can be used where you would expect a truth value such as True or False, in conditions for IF and while statements.

Oven ready

Being able to build up your own regexs is a skill you should certainly develop. Being able to copy other people’s is another. Regexs have been around for decades so almost everything has been matched at some point. And yes, there is a regex to match a valid regex. We list some examples here and it is always worth a quick search for answers to your current problem before building something complicated. Finally, you can ask ChatGPT, although do verify the response in a checker such as Pythex.

UK mobile phone number:

^(07[\d]{8,12}|447[\d]{7,11})$

Email address:

([A-Za-z0-9]+[.-_])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+

NHS Number:

\b\d{3}\s\d{3}\s\d{4}\b

UK National Insurance Number:

^(?!BG)(?!GB)(?!NK)(?!KN)(?!TN)(?!NT)(?!ZZ)(?:[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z])(?:\s*\d\s*){6}([A-D]|\s)$

Putting it all together: Advent of Code

Let’s solve a problem with regular expressions. The Advent of Code is an annual programming challenge and as such is a great source of interesting programming puzzles. Here we look at Avent of Code 2023 day 1, which comes in two parts.

Part 1

In a list of strings, each line contains digits. Combine the first digit with the last digit to form a two digit number. Add all the two digit numbers together to find a total.

For example, in:
1abc2
pqr3stu8vwx
a1b2c3d4e5f
treb7uchet
The numbers are: 12, 38, 15, 77. Adding them gives 142.

We have to write a small program to solve the example and then there is a 1,000 line file to try out, with the answer checked by the Advent of Code site.

We will need to open and read a file, loop over all the lines and print a result but the core problem is extracting the first and last digits from a given line. This is clearly a task for regular expressions. Often the best way to build up a regex is by openning the command line and playing. Let’s do that and start by importing the re library containing regular expression functionality:

$ python
Python 3.10.9 (main, Mar  1 2023, 12:33:47) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>>

We use the examples to get a feel for the problem. The first one is 1abc2. So we need to get the 1 and the 2, then put them together to make 12. If we have the 1 and 2 so long as they are strings we can just combine them then use Python’s int() function to get the number twelve.

We need a regular expression to get the 1 and 2. Obviously ‘1’ will match 1 and ‘2’ will match 2, but that is too specific to this example; we want to match any digit. For our previous discussion we know there is a regex for that: ‘[0-9]’:

>>> re.search('[0-9]', '1abc2')
<re.Match object; span=(0, 1), match='1'>

That has worked but given us back a re.Match object which we need to further process. If instead we use the findall() function from the re module we get something more useful:

>>> re.findall('[0-9]','1and2')
['1','2']

A list of strings, which is great because we just need to combine the first (at index 0) with the last (at index -1) then convert the two character string to an integer:

>>> matches = re.findall('[0-9]','1and2')
>>> int( matches[0] + matches[-1] )
12

Often with regualar expressions there is more than one way to do something. In this case there is a shortcut for the class of all digits: \d is equivalent to [0-9]:

>>> matches = re.findall('\d','1and2')
>>> int( matches[0] + matches[-1] )
12

Let’s try it on the other examples. If we get a list containing the first and last digit, our regex has worked. We know the combination of strings and conversion to an integer works so we don’t need to test that again:

>>> re.findall('\d','1and2')
['1','2']
>>> re.findall('\d', 'pqr3stu8vwx')
['3','8']
>>> re.findall('\d', 'a1b2c3d4e5f')
['1', '2', '3', '4', '5']
>>> re.findall('\d', 'treb7uchet')
['7']

That last one is tricky. It works because the first and the last element are the same, so we will ultimately get 77.

Now all we need to do is open the file, loop through the lines and add the results. Let’s do a naive implementation with the examples in a file we create called test.txt then refactor the code to something more Pythonic before running it on the 1,000 line competition file. To ensure we are opening the file, looping and calling a function on each line we start with a “Hello, world!” program. It doesn’t say hello world, but it is just as simple, returning the number 1 for each line in the input file, so we expect to get the number four for a four line file:

def my_func(arg):
    return 1

total = 0
for l in open("test.txt"):
   total += my_func(l)

assert total == 4

There are no complaints from assert when we run it, so that is all the uninteresting stuff working. Now we can focus on adding our regex’s to my_func

def my_func(arg):
    matches = re.findall('\d', arg)
    return int( matches[0] + matches[-1])
# And change the assertion at the bottom
assert total == 142

Great, that works too. Let’s refactor before trying it on the competition input, by turning that loop into a sum over a list comprehension:

total = sum ( [my_func(l) for l in open("test.txt")] )
assert total == 142

Now we want the answer for the competition so we replace the assertion with a print:

print( sum( [my_func(l) for l in open("competition.txt")]) )

And run that:

$ python aoc23_q1.py
53921

Which is the correct answer.

Part 2

In the second part of the challenge we are told that the file contains numbers as words. For example:

two1nine
eightwothree
abcone2threexyz
yields 29, 83, 13.

This opens the possibiliy of overlaps because eight starts with e but one, three, five and nine all end with e too. However it does not give any examples such as fiveight or twone. Having solved it with and without overlaps, I know that it considers the solution with overlaps to be correct.

As with part 1, we open Python and play around with some regexs. Matching a word is easy, the regex is just the word itself. To match one word or another we need to use the | instruction which creates an either/or match. So A|B matches either A or B.

>>> re.findall('two|nine','two1nine')
['two', 'nine']

We can see how that would extend to all ten digits. But what about overlaps?

>>> re.findall('five|eight','fiveight')
['five']

That only matches five because the e has been consumed by the matcher and is not available alongside ight to yield a match on eight. There are two ways round this:

Use the (?=...) construct. This instructs the matcher to look ahead for ... and include it in the match but not to consume those characters, leaving them available for further matches.

Use a different regular expression library, regex rather than re, that allows the option of matching on overlaps.

>>> # Option 1
>>> re.findall('(?=(five|eight))', 'fiveight')
['five', 'eight']
>>> # Option 2
>>> import regex
>>> regex.findall('five|eight', 'fiveight', overlapped=True)
['five', 'eight']

We can find the words representing numbers but how do we convert them into actual numbers so that we can combine and sum them? This looks like another very common use for regular expressions: finding and replacing. We could find “one” and replace it with “1”, “two” with “2” and so on. In the re and regex modules there is a sub() function that substitutes a replacement string everywhere it finds a match. This all looks very useful but how would it cope with the overlaps?

>>> re.sub('one', '1', 'twone')
'tw1'

The substitution replaces one with 1 leaving tw1 which will not match two. So although regular expression match and replace is very often the solution, in this case it is not because of the overlaps.

We return to the findall() function and think of a way to deal with the mix of numbers and words that we get from that. We can use a Python dict to convert the strings we have into the strings we need. We set it up like this:

lookup = {"one": "1", "two": "2", ...}

The trick is to also have entries for “0”, “1”, … “9” that just converts them into themselves:

>>> lookup = {"one": "1", ..., "nine": "9", "1": "1", ... "9": "9"}
>>> lookup['three']
'3'
>>> lookup['3']
'3'

Along with that dict, the final code looks like this:

def fn(l):
    m = regex.findall("one|two|three|four|five|six|seven|eight|nine|[0-9]",
                      l, overlapped=True)
    return int( lookup[m[0]] + lookup[m[-1]] )

print( sum( [ fn(l) for l in open("input.txt") ] ) )