Fizzbuzz in RegEx

During some internets, I was accosted by this smug anime girl's whiteboard:

You should be able to solve this.

If you haven't seen this problem before, give it a shot. If you find an interesting and different way to solve it, let me know!

This took me a couple days. It's not that difficult, I've just had a job at a bank for too long.

Let it be known the reason I pursued this goal was not the taunt of this 2D scientist.

It was for a dumb idea.

The classic fizzbuzz problem is to emit numbers (usually specifically non-negative integers) unless they are completely divisible by 3 (emit fuzz), 5 (emit buzz), or both (emit fizzbuzz).

Whether a number is divisible by 5 is very easy to determine in regex (proving this is left as an exercise to the reader) so with a tool for divisibility by 3 we can both fizz and buzz and even fizzbuzz with nary a modulo operator in sight.

A non-zero part of my dayjob is trivially reducing compute costs against trillions of records a day by explaining why chaining .*? 10 times to match against strictly defined patterns is a bad idea actually (in most cases a really bad idea), so it seemed like a good way to let off some steam.

If you understand the logic, you can build a regex for the divisibility of any digit (or any number, even fairly simply if you use a different order numeric system to base 10).

This is a terrible idea, but also funny. If you can memorize the equivalent of the below incantation in your language of choice you can inflict it upon the next unwitting interviewer for great profit:

package main

import (
	"fmt"
	"regexp"
	"bufio"
	"os"
)

func main() {
	// Patterns are self explanatory:
	div3_re := regexp.MustCompile("^(?:[3690]+|[147][3690]*(?:[147][3690]*[258][3690]*)*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[258][3690]*[147][3690]*)*(?:[147]|[258][3690]*[258]))+$")
	div5_re := regexp.MustCompile("^[0-9]*[05]$")

	// Here we are taking input from IO...a char sequence we would have to convert to an int.
	// No allocating unnecessary numeric types required! User input validation built in!
	// Actually, we don't know if the user input is a number. We spit anything back out. This is fine.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		chars := scanner.Bytes()
		div3 := div3_re.Match(chars)
		div5 := div5_re.Match(chars)

		switch {
		case div3 && div5:
			fmt.Println("fizzbuzz")
		case div3:
			fmt.Println("fizz")
		case div5:
			fmt.Println("buzz")
		default:
			fmt.Printf("%s\n", chars)
		}
	}
}

RegEx to Check If Multiple of 3

Starting simple; all numbers that match the below regex are divisible by 3:

\b[3690]+\b

For each regex a regexr page will be linked to demonstrate and explain the syntax, see https://regexr.com/7s4p6 for this one.

All of [3690] when divided by 3 have remainder 0. The important bit is that this is true no matter where the digit occurs in the string representing the number. For example: 3690 = 3 * 1000 + 3 * 200 + 3 * 30 + 3 * 0.

We need the \b either side to make sure we're not matching the middle of some other number. If you're only matching 1 string or line you can use ^$ instead.

Double Digit Pairs

Looking at the integers from 0-100 and pattern matching with our brains, we can observe that we match every 3rd number with the following regex:

\b(?:[3690]+|[147][258]|[258][147])\b

We can obtain this empirically, but why does this work?

The importance of the groups [147] and [258] are that their remainders when divided by 3 are 1 and 2 respectively.

If the total remainder of a number divided by 3 is a multiple of 3, we have a whole number. You can also consider this fractionally, e.g. for 12: 1/3+2/3=1. For the same reasons as [3690]+, the effect on the remainder is the same no matter what position in the string the digit is. There is likely some deeper mathematical principle about congruence but I'm just here to dunk on interviewers.

Since [3690] does not change the remainder, we can sprinkle it in the middle too. This is important for our first multiples of 3 after 100, starting with 102.

\b(?:[3690]+|[147][3690]*[258]|[258][3690]*[147])\b

Triple Digit Trios

Continuing from 100, the next false negative cases are three in a row of [147] or [258], e.g. 111, 114, 117. Once again, the sum of remainders is divisible by 3, so we can include this as an alternative terminator for both groups:

\b(?:[3690]+|[147][3690]*(?:[258]|[147]{2})|[258][3690]*(?:[147]|[258]{2}))\b

Since any amount of [3690] could occur in-between without effecting the remainder, we should account for this too:

\b(?:[3690]+|[147][3690]*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[147]|[258][3690]*[258]))\b

The Cornerstone

It's time to uncork the character that blows the whole problem wide open. Although we're starting to account for certain higher digit counts, a string like 111 is still immediately thwarted simply by placing [3690] before or after it, which we know doesn't effect the remainder.

The key is to notice that we have a regex that matches any string that has a remainder divisible by 3 when divided by 3, even if it's a substring in a larger number. This is the same thing as having a digit that's divisble by 3, i.e [3690], only with more characters. There is no difference between 63 and 123 for our purposes - as long as we only match against numbers divible by 3, then we can just repeatedly apply our existing expression:

\b(?:[3690]+|[147][3690]*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[147]|[258][3690]*[258]))+\b

The technical way to say this is probably like: the effect of digits on the remainder is additive, and since addition is associative and commutative we don't need to continue to evaluate a substring whose remainder is a multiple of 3. I'm not such a maths guy but associativity and commutativity of functions shows up a lot even if your day-to-day engineering doesn't strictly involve maths. Pretty handy concept.

Finishing Touches

Our current expression has no false positives or negatives for values between 0 and 1000, but has what looks like one final pattern to match in 4+ digit strings. Consider 1122 and 11211 - the expression isn't able to loop back around as although we can capture [147][258] or [258][147] pairs, we need a way to handle them when they leave a hanging [147] or [258] which means they can't terminate early as mod 0.

The final regex spaghetti reveals itself:

\b(?:[3690]+|[147][3690]*(?:[147][3690]*[258][3690]*)*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[258][3690]*[147][3690]*)*(?:[147]|[258][3690]*[258]))+\b

Let's test for larger numbers:

package main

import (
	"fmt"
	"regexp"
	"strconv"
)

func main() {
	div3_re := regexp.MustCompile("^(?:[3690]+|[147][3690]*(?:[147][3690]*[258][3690]*)*(?:[258]|[147][3690]*[147])|[258][3690]*(?:[258][3690]*[147][3690]*)*(?:[147]|[258][3690]*[258]))+$")

	for i := 0; i <= 100000000; i++ {
		mod3 := (i % 3) == 0
		i_str := strconv.Itoa(i)
		div3 := div3_re.MatchString(i_str)

		if (div3 && !mod3) {
			fmt.Println(i_str, "False positive")
		} else if (!div3 && mod3) {
			fmt.Println(i_str, "False negative")
		}
	}
}

Looks decent. As in another post about numeric representations I wrote, I never learned how to do rigorous proofs so I don't know how to prove this regex works for all non-negative integers in base 10, but I'm pretty sure it's correct. If you have proper maths chops let me know what that would look like.

Conclusion

Part of what I wanted to do with this post is show off, I mean show that regexes are not indecipherable magic. Creating regexes can be finicky, but both in reading and writing them there are great tools like regexr, regex101, or debuggex available online. A lot of very bad, no good code persists for a long time because people don't build a deep understanding of this tool and what it's good at and what it's not.

For the record, once your regex gets reasonably complex, write an actual parser instead. Whoever needs to alter it after will thank you, and you are much less likely to have nonsensical performance issues and bugs due to a single character a human could otherwise catch at a glance.

Speaking of performance, here are some simple pprof top10 readings for some fizzbuzz programs (regex vs modulo, stdin vs for loop) in Go for 0 to 100000000 inclusive. Like the rest of the post nothing rigorous, only for the curious. Note that Go uses a fundamentally different Regex implementation to many popular languages even today which is much faster in the general case and scales linearly with input, see this excellent post by Russ Cox and mind that the scales of the units are different in the 1st graphs. If implementing in a different implementation expect different performance.

> go run fizzbuzz_stdin_modulo.go < nums.txt > /dev/null
Duration: 133.38s, Total samples = 133.30s (99.94%)
Showing nodes accounting for 118.22s, 88.69% of 133.30s total
Showing top 10 nodes out of 44
      flat  flat%   sum%        cum   cum%
   107.42s 80.59% 80.59%    107.42s 80.59%  runtime/internal/syscall.Syscall6
     2.24s  1.68% 82.27%      4.79s  3.59%  runtime.mallocgc
     2.05s  1.54% 83.80%      2.16s  1.62%  runtime.casgstatus
     1.04s  0.78% 84.58%      1.04s  0.78%  internal/poll.(*fdMutex).rwlock
     0.97s  0.73% 85.31%      0.97s  0.73%  internal/poll.(*fdMutex).rwunlock
     0.92s  0.69% 86.00%      3.47s  2.60%  runtime.exitsyscall
     0.92s  0.69% 86.69%      1.47s  1.10%  runtime.exitsyscallfast
     0.91s  0.68% 87.37%    132.99s 99.77%  main.fizzbuzz_stdin_modulo
     0.91s  0.68% 88.06%      0.91s  0.68%  runtime.nextFreeFast (inline)
     0.84s  0.63% 88.69%      1.83s  1.37%  fmt.(*pp).printArg


> go run fizzbuzz_stdin_regex.go < nums.txt > /dev/null
Duration: 233.59s, Total samples = 233.27s (99.87%)
Showing nodes accounting for 205.47s, 88.08% of 233.27s total
Showing top 10 nodes out of 43
      flat  flat%   sum%        cum   cum%
   109.39s 46.89% 46.89%    109.39s 46.89%  runtime/internal/syscall.Syscall6
    42.98s 18.43% 65.32%     93.70s 40.17%  regexp.(*Regexp).tryBacktrack
    17.19s  7.37% 72.69%     17.19s  7.37%  regexp/syntax.(*Inst).MatchRunePos
    14.54s  6.23% 78.92%     14.54s  6.23%  regexp.(*bitState).shouldVisit (inline)
     7.77s  3.33% 82.25%      8.23s  3.53%  regexp.(*bitState).push (inline)
     7.43s  3.19% 85.44%      7.43s  3.19%  regexp.(*inputBytes).step
     1.86s   0.8% 86.23%     19.05s  8.17%  regexp/syntax.(*Inst).MatchRune (inline)
     1.83s  0.78% 87.02%      1.98s  0.85%  runtime.casgstatus
     1.26s  0.54% 87.56%    100.81s 43.22%  regexp.(*Regexp).backtrack
     1.22s  0.52% 88.08%      2.23s  0.96%  sync.(*Pool).Put

> go run fizzbuzz_iter_modulo.go > /dev/null
Duration: 125.13s, Total samples = 124.89s (99.81%)
Showing nodes accounting for 114.95s, 92.04% of 124.89s total
Showing top 10 nodes out of 36
      flat  flat%   sum%        cum   cum%
   105.62s 84.57% 84.57%    105.62s 84.57%  runtime/internal/syscall.Syscall6
     1.66s  1.33% 85.90%      1.82s  1.46%  runtime.casgstatus
     1.25s  1.00% 86.90%      1.68s  1.35%  fmt.(*fmt).fmtInteger
     1.21s  0.97% 87.87%      1.88s  1.51%  runtime.exitsyscallfast
        1s   0.8% 88.67%         1s   0.8%  internal/poll.(*fdMutex).rwlock
     0.97s  0.78% 89.45%    116.31s 93.13%  internal/poll.(*FD).Write
     0.96s  0.77% 90.22%      3.77s  3.02%  runtime.exitsyscall
     0.89s  0.71% 90.93%      0.89s  0.71%  internal/poll.(*fdMutex).rwunlock
     0.71s  0.57% 91.50%      2.63s  2.11%  runtime.reentersyscall
     0.68s  0.54% 92.04%      3.04s  2.43%  fmt.(*pp).printArg

> go run fizzbuzz_iter_regex.go > /dev/null
Duration: 234.25s, Total samples = 233.96s (99.88%)
Showing nodes accounting for 204.34s, 87.34% of 233.96s total
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
   110.06s 47.04% 47.04%    110.06s 47.04%  runtime/internal/syscall.Syscall6
    40.90s 17.48% 64.52%     90.57s 38.71%  regexp.(*Regexp).tryBacktrack
    16.22s  6.93% 71.46%     16.23s  6.94%  regexp/syntax.(*Inst).MatchRunePos
    14.44s  6.17% 77.63%     14.44s  6.17%  regexp.(*bitState).shouldVisit (inline)
     7.60s  3.25% 80.88%      8.03s  3.43%  regexp.(*bitState).push (inline)
     7.56s  3.23% 84.11%      7.56s  3.23%  regexp.(*inputString).step
     2.18s  0.93% 85.04%      2.24s  0.96%  runtime.casgstatus
     1.96s  0.84% 85.88%      4.92s  2.10%  runtime.mallocgc
     1.88s   0.8% 86.68%     18.11s  7.74%  regexp/syntax.(*Inst).MatchRune (inline)
     1.54s  0.66% 87.34%     97.88s 41.84%  regexp.(*Regexp).backtrack