Fizzbuzz in RegEx
During some internets, I was accosted by this smug anime girl's whiteboard:
If you haven't seen this problem before, give it a shot. If you find an interesting and different way to solve it, let me know!
This took me a couple days. It's not that difficult, I've just had a job at a bank for too long.
Let it be known the reason I pursued this goal was not the taunt of this 2D scientist.
It was for a dumb idea.
The classic fizzbuzz problem is to emit numbers (usually specifically non-negative integers) unless they are completely divisible by 3 (emit fuzz), 5 (emit buzz), or both (emit fizzbuzz).
Whether a number is divisible by 5 is very easy to determine in regex (proving this is left as an exercise to the reader) so with a tool for divisibility by 3 we can both fizz and buzz and even fizzbuzz with nary a modulo operator in sight.
A non-zero part of my dayjob is trivially reducing compute costs against
trillions of records a day by explaining why chaining .*?
10 times to match
against strictly defined patterns is a bad idea actually (in most cases
a really bad idea),
so it seemed like a good way to let off some steam.
If you understand the logic, you can build a regex for the divisibility of any digit (or any number, even fairly simply if you use a different order numeric system to base 10).
This is a terrible idea, but also funny. If you can memorize the equivalent of the below incantation in your language of choice you can inflict it upon the next unwitting interviewer for great profit:
package main
import (
"fmt"
"regexp"
"bufio"
"os"
)
func main() {
// Patterns are self explanatory:
div3_re := regexp.MustCompile("^(?:[3690]+|[147][3690]*(?:[147][3690]*[258][3690]*)*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[258][3690]*[147][3690]*)*(?:[147]|[258][3690]*[258]))+$")
div5_re := regexp.MustCompile("^[0-9]*[05]$")
// Here we are taking input from IO...a char sequence we would have to convert to an int.
// No allocating unnecessary numeric types required! User input validation built in!
// Actually, we don't know if the user input is a number. We spit anything back out. This is fine.
scanner := bufio.NewScanner(os.Stdin)
for scanner.Scan() {
chars := scanner.Bytes()
div3 := div3_re.Match(chars)
div5 := div5_re.Match(chars)
switch {
case div3 && div5:
fmt.Println("fizzbuzz")
case div3:
fmt.Println("fizz")
case div5:
fmt.Println("buzz")
default:
fmt.Printf("%s\n", chars)
}
}
}
NOTE: To clarify, I will be making 2 deviations from the rules in the image. The
first is that +
is allowed, because for a regex A
, A+ == AA*
, which is
just more tedious to write for sufficiently large A
. Second, judging from the
image, the anime girl in the lab coat probably partially matches numbers not
divisible by 3 that contain a number divisible by 3, and doesn't match a number
in a single whole match but in multiple - we can do better by matching word
boundaries.
RegEx to Check If Multiple of 3
Starting simple; all numbers that match the below regex are divisible by 3:
\b[3690]+\b
For each regex a regexr page will be linked to demonstrate and explain the syntax, see https://regexr.com/7s4p6 for this one.
All of [3690]
when divided by 3 have remainder 0. The
important bit is that this is true no matter where the digit occurs in the
string representing the number. For example: 3690 = 3 * 1000 + 3 * 200 + 3 * 30 + 3 * 0
.
We use the \b
either side to make sure we're not matching the middle of some
other number. If you're only matching 1 string or line you can use ^$
instead.
Double Digit Pairs
Looking at the integers from 0-100 and pattern matching with our brains, we can observe that we match every 3rd number with the following regex:
\b(?:[3690]+|[147][258]|[258][147])\b
We can obtain this empirically, but why does this work?
The importance of the groups [147]
and [258]
are that their remainders when
divided by 3 are 1 and 2 respectively.
If the total remainder of a number divided by 3 is a multiple of 3, we have a
whole number. You can also consider this fractionally, e.g. for 12: 1/3+2/3=1
.
For the same reasons as [3690]+
, the effect on the remainder is the same no
matter what position in the string the digit is. There is likely some deeper
mathematical principle about congruence but I'm just here to dunk on
interviewers.
Since [3690]
does not change the remainder, we can sprinkle it in the
middle too. This is important for our first multiples
of 3 after 100, starting with 102.
\b(?:[3690]+|[147][3690]*[258]|[258][3690]*[147])\b
Triple Digit Trios
Continuing from 100, the next false negative cases are three in a row of [147]
or [258]
, e.g. 111, 114, 117. Once again, the sum of remainders is divisible
by 3, so we can include this as an alternative terminator for both groups:
\b(?:[3690]+|[147][3690]*(?:[258]|[147]{2})|[258][3690]*(?:[147]|[258]{2}))\b
Since any amount of [3690]
could occur in-between without effecting the
remainder, we should account for this too:
\b(?:[3690]+|[147][3690]*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[147]|[258][3690]*[258]))\b
The Cornerstone
It's time to uncork the character that blows the whole problem wide open.
Although we're starting to account for certain higher digit counts, a string
like 111 is still immediately thwarted simply by placing [3690]
before or
after it, which we know doesn't effect the remainder.
The key is to notice that we have a regex that matches any string that has
a remainder divisible by 3 when divided by 3, even if it's a substring in a
larger number. This is the same thing as having a digit that's divisble by 3,
i.e [3690]
, only with more characters. There is no difference between 63 and
123 for our purposes - as long as we only match against numbers divible by 3,
then we can just repeatedly apply our existing expression:
\b(?:[3690]+|[147][3690]*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[147]|[258][3690]*[258]))+\b
The technical way to say this is probably like: the effect of digits on the remainder is additive, and since addition is associative and commutative we don't need to continue to evaluate a substring whose remainder is a multiple of 3. I'm not such a maths guy but associativity and commutativity of functions shows up a lot even if your day-to-day engineering doesn't strictly involve maths. Pretty handy concept.
Finishing Touches
Our current expression has no false positives or negatives for values between 0
and 1000, but has what looks like one final pattern to match in 4+ digit
strings. Consider 1122 and 11211 - the expression isn't able to loop back around
as although we can capture [147][258]
or [258][147]
pairs, we need a way to handle them
when they leave a hanging [147] or [258] which means they can't terminate early
as mod 0.
The final regex spaghetti reveals itself:
\b(?:[3690]+|[147][3690]*(?:[147][3690]*[258][3690]*)*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[258][3690]*[147][3690]*)*(?:[147]|[258][3690]*[258]))+\b
Let's test for larger numbers:
package main
import (
"fmt"
"regexp"
"strconv"
)
func main() {
div3_re := regexp.MustCompile("^(?:[3690]+|[147][3690]*(?:[147][3690]*[258][3690]*)*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[258][3690]*[147][3690]*)*(?:[147]|[258][3690]*[258]))+$")
for i := 0; i <= 100000000; i++ {
mod3 := (i % 3) == 0
i_str := strconv.Itoa(i)
div3 := div3_re.MatchString(i_str)
if (div3 && !mod3) {
fmt.Println(i_str, "False positive")
} else if (!div3 && mod3) {
fmt.Println(i_str, "False negative")
}
}
}
Looks decent. As in another post about numeric representations I wrote, I never learned how to do rigorous proofs so I don't know how to prove this regex works for all non-negative integers in base 10, but I'm pretty sure it's correct. If you have proper maths chops let me know what that would look like.
Conclusion
Part of what I wanted to do with this post is show off, I mean show that regexes are not indecipherable magic. Creating regexes can be finicky, but both in reading and writing them there are great tools like regexr, regex101, or debuggex available online. A lot of very bad, no good code persists for a long time because people don't build a deep understanding of this tool and what it's good at and what it's not.
For the record, once your regex gets reasonably complex, write an actual parser instead. Whoever needs to alter it after will thank you, and you are much less likely to have nonsensical performance issues and bugs due to a single character a human could otherwise catch at a glance.
Speaking of performance, here are some simple pprof top10 readings for some fizzbuzz programs (regex vs modulo, stdin vs for loop) in Go for 0 to 100000000 inclusive. Like the rest of the post nothing rigorous, only for the curious. Note that Go uses a fundamentally different Regex implementation to many popular languages even today which is much faster in the general case and scales linearly with input, see this excellent post by Russ Cox and mind that the scales of the units are different in the 1st graphs. If implementing in a different implementation expect different performance.
> go run fizzbuzz_stdin_modulo.go < nums.txt > /dev/null
Duration: 133.38s, Total samples = 133.30s (99.94%)
Showing nodes accounting for 118.22s, 88.69% of 133.30s total
Showing top 10 nodes out of 44
flat flat% sum% cum cum%
107.42s 80.59% 80.59% 107.42s 80.59% runtime/internal/syscall.Syscall6
2.24s 1.68% 82.27% 4.79s 3.59% runtime.mallocgc
2.05s 1.54% 83.80% 2.16s 1.62% runtime.casgstatus
1.04s 0.78% 84.58% 1.04s 0.78% internal/poll.(*fdMutex).rwlock
0.97s 0.73% 85.31% 0.97s 0.73% internal/poll.(*fdMutex).rwunlock
0.92s 0.69% 86.00% 3.47s 2.60% runtime.exitsyscall
0.92s 0.69% 86.69% 1.47s 1.10% runtime.exitsyscallfast
0.91s 0.68% 87.37% 132.99s 99.77% main.fizzbuzz_stdin_modulo
0.91s 0.68% 88.06% 0.91s 0.68% runtime.nextFreeFast (inline)
0.84s 0.63% 88.69% 1.83s 1.37% fmt.(*pp).printArg
> go run fizzbuzz_stdin_regex.go < nums.txt > /dev/null
Duration: 233.59s, Total samples = 233.27s (99.87%)
Showing nodes accounting for 205.47s, 88.08% of 233.27s total
Showing top 10 nodes out of 43
flat flat% sum% cum cum%
109.39s 46.89% 46.89% 109.39s 46.89% runtime/internal/syscall.Syscall6
42.98s 18.43% 65.32% 93.70s 40.17% regexp.(*Regexp).tryBacktrack
17.19s 7.37% 72.69% 17.19s 7.37% regexp/syntax.(*Inst).MatchRunePos
14.54s 6.23% 78.92% 14.54s 6.23% regexp.(*bitState).shouldVisit (inline)
7.77s 3.33% 82.25% 8.23s 3.53% regexp.(*bitState).push (inline)
7.43s 3.19% 85.44% 7.43s 3.19% regexp.(*inputBytes).step
1.86s 0.8% 86.23% 19.05s 8.17% regexp/syntax.(*Inst).MatchRune (inline)
1.83s 0.78% 87.02% 1.98s 0.85% runtime.casgstatus
1.26s 0.54% 87.56% 100.81s 43.22% regexp.(*Regexp).backtrack
1.22s 0.52% 88.08% 2.23s 0.96% sync.(*Pool).Put
> go run fizzbuzz_iter_modulo.go > /dev/null
Duration: 125.13s, Total samples = 124.89s (99.81%)
Showing nodes accounting for 114.95s, 92.04% of 124.89s total
Showing top 10 nodes out of 36
flat flat% sum% cum cum%
105.62s 84.57% 84.57% 105.62s 84.57% runtime/internal/syscall.Syscall6
1.66s 1.33% 85.90% 1.82s 1.46% runtime.casgstatus
1.25s 1.00% 86.90% 1.68s 1.35% fmt.(*fmt).fmtInteger
1.21s 0.97% 87.87% 1.88s 1.51% runtime.exitsyscallfast
1s 0.8% 88.67% 1s 0.8% internal/poll.(*fdMutex).rwlock
0.97s 0.78% 89.45% 116.31s 93.13% internal/poll.(*FD).Write
0.96s 0.77% 90.22% 3.77s 3.02% runtime.exitsyscall
0.89s 0.71% 90.93% 0.89s 0.71% internal/poll.(*fdMutex).rwunlock
0.71s 0.57% 91.50% 2.63s 2.11% runtime.reentersyscall
0.68s 0.54% 92.04% 3.04s 2.43% fmt.(*pp).printArg
> go run fizzbuzz_iter_regex.go > /dev/null
Duration: 234.25s, Total samples = 233.96s (99.88%)
Showing nodes accounting for 204.34s, 87.34% of 233.96s total
Showing top 10 nodes out of 46
flat flat% sum% cum cum%
110.06s 47.04% 47.04% 110.06s 47.04% runtime/internal/syscall.Syscall6
40.90s 17.48% 64.52% 90.57s 38.71% regexp.(*Regexp).tryBacktrack
16.22s 6.93% 71.46% 16.23s 6.94% regexp/syntax.(*Inst).MatchRunePos
14.44s 6.17% 77.63% 14.44s 6.17% regexp.(*bitState).shouldVisit (inline)
7.60s 3.25% 80.88% 8.03s 3.43% regexp.(*bitState).push (inline)
7.56s 3.23% 84.11% 7.56s 3.23% regexp.(*inputString).step
2.18s 0.93% 85.04% 2.24s 0.96% runtime.casgstatus
1.96s 0.84% 85.88% 4.92s 2.10% runtime.mallocgc
1.88s 0.8% 86.68% 18.11s 7.74% regexp/syntax.(*Inst).MatchRune (inline)
1.54s 0.66% 87.34% 97.88s 41.84% regexp.(*Regexp).backtrack