Regular expressions
To process data you very often need to understand and use regular expressions. In this article we introduce them and use them to solve the Advent of Code Day 1 challenge.
Even if you are not familiar with the term regular expression you will almost
certainly have come across them when searching or doing basic file operations.
For example, using a “star” in place of part of a filename such as *.xlsx
to
refer to all Excel files. Once you get the hang of them you will find all sorts
of uses for them, even as a handy tool in everyday life using spreadsheets etc
for you personal admin.
A language within a language
That *
is just one “word” in a much wider vocabulary of so-called regular
expressions, regex for short. All programming languages offer some form of regex
and aside from what might be called dialect differences, they are all the same.
The “expression” is a sequence of letters and symbols, each of which is an
instruction to a mechanism. That mechanism takes a string as input and follows
the instructions in the regular expression, looking character by character in
the string for a match. For example, the expression “ch” tells the matcher to
look for the letter “c” followed by the letter “h”. Given that regex and the
string “church” the matcher would find two matches, one at the start and one at
the end of the word. If we wanted to determine whether a document contained the
word “pneumonia” we would give the matcher the document and the regular
expression pneumonia
and see whether there are any matches.
What about matching on the word “cell”? Obviously the expression cell
works
but it also matches cellulitis and varicella . We need to instruct the matcher
to only match the word standing on its own. There are several ways to do this.
One is to assume that words are surrounded by space, so we could add a space
before and after our word in the expression, like this -> cell
<-. That works
sometimes but misses the cases where cell is the last word in a sentence and is
therefore followed by a full stop. Needing to match on words is a common case
that regex’s have evolved a special instruction to deal with: \b
. This uses
the notion of an “escape” character, \
, which says the following character
does not take on its usual meaning but instead should be treated in a special
way. We say it is “escaped”. In this case, \b
does not mean match the letter
“b” but instead match the boundary of a word. So a better regex would be
\bcell\b
.
As well as word boundaries we can match the start of a line ^
and the end of a
line $
. These kind of matches include full stops, spaces, tabs, a whole class
of things the matcher might encounter. The idea of a class of characters is
common in regex’s. The class of all digits: 0, 1, 2, …, 8, 9 is represented by
[0-9]
. So AB[0-9]cd
matches the captial letters AB
followed by any single
digit followed by the lowercase letters cd
. All the lowercase letters are:
[a-z]
and all the uppercase ones [A-Z]
. [a-zA-Z]
is all letters in either
case. Another concept is repetition: zero or more *
, one or more +
, at most
once ?
. So, ab+
matches a
followed by one or more b
s: ab
, abb
,
abbbbbb
would all match but just a
on its own would not. ab?
matches
either a
or ab
. Sometimes you just want to match anything in a list, in
which case we use the |
or operator between each item. For example, we make a
regex that matches some medical terms (and ignores their case).
things_to_match = 'spine|cervical|lung|mediastinum|liver|kidney|viscera|organ'
regex = re.compile(things_to_match, re.I) # Make a regex that ignores case
Once we have the regex we can put it to use. As the matcher can finds lots of
matches there are ways to ask the matcher for just the first match()
, or the
next finditer()
or all at once findall()
. Here we choose findall()
:
re.findall(regex, 'The lungs and mediastinum are normal')
['lung','mediastinum']
re.findall(regex, 'The elfBrain is inside the elfSkull')
[]
re.findall(regex, 'Spines in elves?')
['spine']
It’s complicated
As you can see, regular expressions get very complicated very quickly. Don’t worry if they look baffling, even seasoned programmers experience this and depend heavily on documentation. There are also lots of tools and websites to help you build-up and explain regexs, such as Pythex. And as ever, stackoverflow is a good source of ready made regexs, see the Oven ready section for examples.
Truthiness
What comes out of the matcher is very often used as the condition in
an if
statement or while
statement: if there is a match …, or
while we are not finding a match …. To support this, the output of
the matcher has “truthiness” which means it can be used where you
would expect a truth value such as True
or False
, in conditions
for IF
and while
statements.
Oven ready
Being able to build up your own regexs is a skill you should certainly develop. Being able to copy other people’s is another. Regexs have been around for decades so almost everything has been matched at some point. And yes, there is a regex to match a valid regex. We list some examples here and it is always worth a quick search for answers to your current problem before building something complicated. Finally, you can ask ChatGPT, although do verify the response in a checker such as Pythex.
UK mobile phone number:
^(07[\d]{8,12}|447[\d]{7,11})$
Email address:
([A-Za-z0-9]+[.-_])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Z|a-z]{2,})+
NHS Number:
\b\d{3}\s\d{3}\s\d{4}\b
UK National Insurance Number:
^(?!BG)(?!GB)(?!NK)(?!KN)(?!TN)(?!NT)(?!ZZ)(?:[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z])(?:\s*\d\s*){6}([A-D]|\s)$
Putting it all together: Advent of Code
Let’s solve a problem with regular expressions. The Advent of Code is an annual programming challenge and as such is a great source of interesting programming puzzles. Here we look at Avent of Code 2023 day 1, which comes in two parts.
Part 1
In a list of strings, each line contains digits. Combine the first digit with the last digit to form a two digit number. Add all the two digit numbers together to find a total.
For example, in:
1abc2 pqr3stu8vwx a1b2c3d4e5f treb7uchet
The numbers are: 12, 38, 15, 77. Adding them gives 142.
We have to write a small program to solve the example and then there is a 1,000 line file to try out, with the answer checked by the Advent of Code site.
We will need to open and read a file, loop over all the lines and print a result
but the core problem is extracting the first and last digits from a given line.
This is clearly a task for regular expressions. Often the best way to build up a
regex is by openning the command line and playing. Let’s do that and start by
importing the re
library containing regular expression functionality:
$ python
Python 3.10.9 (main, Mar 1 2023, 12:33:47) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>>
We use the examples to get a feel for the problem. The first one is 1abc2
. So
we need to get the 1 and the 2, then put them together to make 12. If we have
the 1 and 2 so long as they are strings we can just combine them then use
Python’s int()
function to get the number twelve.
We need a regular expression to get the 1 and 2. Obviously ‘1’ will match 1 and ‘2’ will match 2, but that is too specific to this example; we want to match any digit. For our previous discussion we know there is a regex for that: ‘[0-9]’:
>>> re.search('[0-9]', '1abc2')
<re.Match object; span=(0, 1), match='1'>
That has worked but given us back a re.Match
object which we need to further
process. If instead we use the findall()
function from the re
module we get
something more useful:
>>> re.findall('[0-9]','1and2')
['1','2']
A list of strings, which is great because we just need to combine the first (at index 0) with the last (at index -1) then convert the two character string to an integer:
>>> matches = re.findall('[0-9]','1and2')
>>> int( matches[0] + matches[-1] )
12
Often with regualar expressions there is more than one way to do something. In
this case there is a shortcut for the class of all digits: \d
is equivalent to
[0-9]
:
>>> matches = re.findall('\d','1and2')
>>> int( matches[0] + matches[-1] )
12
Let’s try it on the other examples. If we get a list containing the first and last digit, our regex has worked. We know the combination of strings and conversion to an integer works so we don’t need to test that again:
>>> re.findall('\d','1and2')
['1','2']
>>> re.findall('\d', 'pqr3stu8vwx')
['3','8']
>>> re.findall('\d', 'a1b2c3d4e5f')
['1', '2', '3', '4', '5']
>>> re.findall('\d', 'treb7uchet')
['7']
That last one is tricky. It works because the first and the last element are the
same, so we will ultimately get 77
.
Now all we need to do is open the file, loop through the lines and add the
results. Let’s do a naive implementation with the examples in a file we create
called test.txt
then refactor the code to something more Pythonic before
running it on the 1,000 line competition file. To ensure we are opening the
file, looping and calling a function on each line we start with a “Hello,
world!” program. It doesn’t say hello world, but it is just as simple, returning
the number 1 for each line in the input file, so we expect to get the number
four for a four line file:
def my_func(arg):
return 1
total = 0
for l in open("test.txt"):
total += my_func(l)
assert total == 4
There are no complaints from assert
when we run it, so that is all the
uninteresting stuff working. Now we can focus on adding our regex’s to my_func
def my_func(arg):
matches = re.findall('\d', arg)
return int( matches[0] + matches[-1])
# And change the assertion at the bottom
assert total == 142
Great, that works too. Let’s refactor before trying it on the competition input, by turning that loop into a sum over a list comprehension:
total = sum ( [my_func(l) for l in open("test.txt")] )
assert total == 142
Now we want the answer for the competition so we replace the assertion with a print:
print( sum( [my_func(l) for l in open("competition.txt")]) )
And run that:
$ python aoc23_q1.py
53921
Which is the correct answer.
Part 2
In the second part of the challenge we are told that the file contains numbers as words. For example:
two1nine eightwothree abcone2threexyz
yields 29, 83, 13.
This opens the possibiliy of overlaps because eight starts with e but one,
three, five and nine all end with e too. However it does not give any examples
such as fiveight
or twone
. Having solved it with and without overlaps, I
know that it considers the solution with overlaps to be correct.
As with part 1, we open Python and play around with some regexs. Matching a word
is easy, the regex is just the word itself. To match one word or another we need
to use the |
instruction which creates an either/or match. So A|B
matches
either A or B.
>>> re.findall('two|nine','two1nine')
['two', 'nine']
We can see how that would extend to all ten digits. But what about overlaps?
>>> re.findall('five|eight','fiveight')
['five']
That only matches five because the e has been consumed by the matcher and is
not available alongside ight
to yield a match on eight
. There are two ways
round this:
- Use the
(?=...)
construct. This instructs the matcher to look ahead for...
and include it in the match but not to consume those characters, leaving them available for further matches. - Use a different regular expression library,
regex
rather thanre
, that allows the option of matching on overlaps.>>> # Option 1 >>> re.findall('(?=(five|eight))', 'fiveight') ['five', 'eight'] >>> # Option 2 >>> import regex >>> regex.findall('five|eight', 'fiveight', overlapped=True) ['five', 'eight']
We can find the words representing numbers but how do we convert them into
actual numbers so that we can combine and sum them? This looks like another very
common use for regular expressions: finding and replacing. We could find “one”
and replace it with “1”, “two” with “2” and so on. In the re
and regex
modules there is a sub()
function that substitutes a replacement string
everywhere it finds a match. This all looks very useful but how would it cope
with the overlaps?
>>> re.sub('one', '1', 'twone')
'tw1'
The substitution replaces one
with 1
leaving tw1
which will not match two.
So although regular expression match and replace is very often the solution, in
this case it is not because of the overlaps.
We return to the findall()
function and think of a way to deal with the mix of
numbers and words that we get from that. We can use a Python dict to convert the
strings we have into the strings we need. We set it up like this:
lookup = {"one": "1", "two": "2", ...}
The trick is to also have entries for “0”, “1”, … “9” that just converts them into themselves:
>>> lookup = {"one": "1", ..., "nine": "9", "1": "1", ... "9": "9"}
>>> lookup['three']
'3'
>>> lookup['3']
'3'
Along with that dict, the final code looks like this:
def fn(l):
m = regex.findall("one|two|three|four|five|six|seven|eight|nine|[0-9]",
l, overlapped=True)
return int( lookup[m[0]] + lookup[m[-1]] )
print( sum( [ fn(l) for l in open("input.txt") ] ) )