Regular Expression That Checks If A Number Is Prime (iluxonchik.github.io)
294 points by iluxonchik on Sept 8, 2016 | 109 comments



It should be noted that this is not possible with regular expressions in the traditional sense (i.e. regular expressions only matching regular languages): http://math.stackexchange.com/a/181233/21405

Because of back references, PCRE regular expressions can match non-regular languages as well.
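For instance, the language of doubled words { ww : w non-empty } is a textbook non-regular language, yet a single back-reference handles it. A minimal Perl sketch:

    # ww (a word repeated twice) is not regular, but a back-reference matches it
    print "abcabc" =~ /^(.+)\1$/ ? "match\n" : "no match\n";    # match
    print "abcabd" =~ /^(.+)\1$/ ? "match\n" : "no match\n";    # no match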


For some great exposition on what "regular" actually means, check out http://nikic.github.io/2012/06/15/The-true-power-of-regular-...

"Thus the question arises: Can regular expressions match only regular grammars, or can they also match more? The answer to this is both yes and no"...


"regular" has a fairly simple mathematical definition: the set of languages that can be matched with a finite state automaton.

You can think of this as the languages that can be matched with an algorithm that is bounded (i.e., O(1)) in memory, no matter what string is to be matched.
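For example, "strings with an even number of a's" is regular, and a matcher needs exactly one bit of state however long the input gets. A minimal Perl sketch (names are mine):

    # One bit of extra memory suffices, fed the input one character at a time
    sub even_as {
        my ($s) = @_;
        my $parity = 0;                  # the matcher's entire state
        $parity ^= 1 for grep { $_ eq 'a' } split //, $s;
        return $parity == 0;
    }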

The following pseudocode is not bounded in memory -- can you guess why?

    bool is_prime(Number n)
    {
        for (Number i = 2; i < n; ++i) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }


The value i can require an a priori unbounded amount of memory.

It should be noted that so can the value n; but in defining regular languages this way, we allow our O(1)-memory algorithms to be fed the characters of an a priori unbounded input string one by one, sequentially.


Yeah, the O(1) refers to the size of the EXTRA memory you need, aside from the input. Defining it any other way would be strange. So, for instance, binary search over a sorted list needs O(log n) memory (for the same reason as in the prime algorithm: a pointer into an array of size n takes O(log n) bits), even though the input is O(n).


Yup. Note that the method of specifically being fed the input sequentially, one by one, means that, for example, "Strings of odd length whose middle character is 'X'" does not comprise a regular language, even though one might naively reason this to be trivial to detect with O(1) memory ("Just go look at the middle character!").


FWIW, (i < n) can be replaced by (i <= floor(sqrt(n))) to avoid unnecessary iterations.


Yeah, obviously, there are lots of ways to improve that code. First of all, it's much more efficient to do (i * i <= n) instead of (i <= sqrt(n)), or at the very least you can cache sqrt(n) so you don't recalculate it every step.

You can also do a separate test for 2, then start at 3 and use i+=2 instead of i++, that cuts the test time in half. Once you've done that, you can observe that all primes are 6k +/- 1 except 2 and 3, which makes your code 3 times faster than the naive version.

At this point, you've basically got a simple wheel primality testing algorithm going, so why not go all the way?
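For reference, here's roughly what that looks like; a sketch (mine, not the parent's code) of trial division with the 6k +/- 1 wheel in Perl:

    # Trial division skipping multiples of 2 and 3 (the "6k +/- 1" wheel)
    sub is_prime {
        my ($n) = @_;
        return 0 if $n < 2;
        return 1 if $n == 2 || $n == 3;
        return 0 if $n % 2 == 0 || $n % 3 == 0;
        for (my $i = 5; $i * $i <= $n; $i += 6) {
            return 0 if $n % $i == 0 || $n % ($i + 2) == 0;
        }
        return 1;
    }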

I don't think the parent poster was making the point that his was the most efficient algorithm for testing primes. I think his point was that ordinary numbers take O(log n) space in CS terms, not O(1), which is not obvious at first.


One of my pet peeves is how PCRE destroyed the definition of "regular" in regular expressions. It has basically made a huge number of programmers illiterate as to what truly is regular in the formal sense.


That wasn't PCRE. That was Perl. PCRE just copied the features that people liked in Perl, and made them available elsewhere.

The same features have also been implemented independently many times. For example Ruby, Java and Python all have independent implementations of regular expressions with a lot of the same features.


When were back-references introduced in sed?


http://sed.sourceforge.net/sedfaq3.html

> Grouping and backreferences: All versions of sed support grouping and backreferences on the LHS and backreferences only on the RHS.


I saw that. I wasn't sufficiently confident as to whether it meant "all historical versions" or just "all current versions."

If the former, then it predated Perl by quite a bit, of course.


Perl started with Henry Spencer's regular expression library, which was written in the mid-1980s. That got used in a lot of places, and may have been used in sed. It includes backreferences. I don't know what predates that.

Perl used that as a starting place and began adding extensions such as the difference between .* and .*?.

So using "non-regular thing" as a regular expression predates Perl. But the dialect we standardized on with the features people now expect is due to Perl's influence.


Ah I get you.

I've tried finding some source for early versions but all the links I've found are dead. It's an interesting question!


But why should people care about what is regular in the formal sense? Rather, regular in this context would mean it can be recognized with a restricted type of algorithm, which resembles the formalism.


http://davidvgalbraith.com/how-i-fixed-atom/ http://stackstatus.net/post/147710624694/outage-postmortem-j... https://swtch.com/~rsc/regexp/regexp3.html

The difference isn't academic. It has huge practical implications, from software that is just annoyingly slow to DoS attacks.

People need to understand that. And stop using non-regular regexes (or better yet: regex in general) in production code. It increases world-suck.


It destroys mechanical sympathy. Many people can no longer understand why some of their regex calls are so slow or why some are so much quicker.


PCRE doesn't implement that "restricted type of algorithm", and I think that's what the parent is suggesting is wrong, not so much the difference between formal understanding vs informal, intuition of the same subject (I would argue that the latter would be sufficient, but few even have that).

Understanding REs (a little more) formally, one can exploit their properties for e.g. performance and/or concision. For an example of what that understanding gives you, I'd recommend checking out the Ragel state machine compiler.


They were named "regular" in the formal sense. Thus if they are not regular, it is a misnomer. It would have been just as useful to call them "Perl Matching Expressions", but without all the additional confusion about what a regular expression can do.


Is there a standard for which additional features one can add on top of regular expressions in the original limited sense and still be considered a regex?


"regular language" is a mathematical term. You can add any features you want to your matcher, but some features allow you to match languages that are not "regular".

However, most programmers are not mathematicians or computer scientists, and so are effectively using lay terminology which is similar to but not precisely the same as the mathematical terminology. Most programmers only care that their regex library allows them to concisely match strings, not that it allows them to match regular languages.

Thus there are two standards you could follow: the precise mathematical definition of "regular", or the lay definition.


I don't buy the layman argument. Programmers might not be mathematicians or computer scientists, but they should be professionals and held to the same standards as professionals in other fields. For example, a mechanical engineer doesn't use a "layman's" definition of tensile strength when choosing materials for a design.


No, but he probably does use layman's terms when describing molecular lattices and electron shells. If he ever does.

The point here is that users of (colloquial) regular expressions, however professional they are or aren't, are just engaging with the surface of a deeper set of mathematical principles that require greater nuances of meaning to discuss than their applications do.


I basically agree, but I would say that a slightly deeper understanding of regular expressions is a really useful thing to know for basically all professional programmers. If for no other reason they can recognize the "pathological" cases where backtracking implementations (i.e. almost all modern regex implementations) would need until the heat death of the universe to evaluate.
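A classic pathological shape, for the curious (a sketch; exact behavior varies by engine and version, but most backtracking implementations choke on patterns like this):

    # Nested quantifiers plus a forced failure = catastrophic backtracking
    my $s = ('a' x 30) . 'b';
    print "matched\n" if $s =~ /^(a+)+$/;    # exponential time spent failing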


Programmers might not be mathematicians or computer scientists, but they should be professionals

What? No. Many programmers out there are just working on fun projects in their spare time. I'm sure there exist some aerospace engineers doing that, but it's not many. I don't see why your average bedroom open source programmer should be treated like an aerospace engineer.


Programmers have more freedom from the constraints of reality than mechanical or civil engineers.


What the fuck are you talking about? Those aren't even remotely the same thing at all.


Right. My point (which may not be a particularly great point, but it was what I was trying to convey) is that the lay definition doesn't have any particular tight borders; presumably different regex libraries are capable of matching different things, use different syntax, etc.

(And this is part of why I find it confusing terminology and would prefer "regular expression" only to mean specifications of regular languages in the original formal sense. But, of course, I'm not the king of language, and it's natural that people would speak the way they speak, given the circumstances.)


The lay definition isn't a definition, per se, it's a misappropriation of the term. Since the lay user only cares about text matching, the term for what they need is "text matching syntax" or some such thing.

And surely you don't think those implementing the regular expression engine are lay people. They should know better.


> there are two standards you could follow: the precise mathematical definition of "regular", or the lay definition

Considering this writeup leads with a giant headline saying "Regular Expressions - The Theory"... I think the complaints about how it's not actually a regular expression are even more valid than usual.


I always understood it as being whatever can be straightforwardly tacked onto a typical DFA implementation. I thought that's how people came up with it—whichever extras were the easiest to implement without mucking anything else up, so in a way they "came for free". (It's possible I misunderstood, I don't know for sure.)


Well, sure, but "whatever can be straightforwardly tacked on" is highly subjective, no?


No, it is not.

You can invent a variety of notations to describe a DFA. But you can't change the formal definition of a DFA, or what kinds of things a DFA can match without making it not a DFA.


I'll say to you what I said above, but even moreso: Er... I'm not sure what you're saying here, so I'll clarify what I was saying. What I mean is, we all agree on what DFAs are, and so we all agree on what regular languages in the original, Kleene sense are. But once we start admitting non-regular languages into our notion of "regular expression" because they're "straightforwardly tacked on", it's unclear what the limits are supposed to be.

I'm not actually a fan of using "regular expressions" in a sense broader than that of Kleene's regular, DFA-accepted languages, to be clear. That is exactly what I've been questioning.


You can actually quite easily add features to regular expressions that are typically excluded, while still being actually regular. Examples would be an `and` or a `not` operator.


What I am saying is that you can come up with a lot of extensions of the basic regular expression language. But there is no subjectivity at all in whether the extensions can be implemented by a DFA. Either you can, or you can't.


Not really. A graph is either a valid DFA or it is not. At that point, it's a matter of tooling in terms of "straightforward"ness.


Er... I'm not sure what you're saying here, in that it doesn't seem to be engaging with what I was saying, or trying to say, so I'll clarify what I was saying. What I mean is, we all agree on what DFAs are, and so we all agree on what regular languages in the original, Kleene sense are. But once we start admitting non-regular languages into our notion of "regular expression" because they're "straightforwardly tacked on", it's unclear what the limits are supposed to be.

I'm not actually a fan of using "regular expressions" in a sense broader than that of Kleene's regular, DFA-accepted languages, to be clear. That is exactly what I've been questioning.


I think the intent of the parent comments was to say that it's fine to consider a regex a standard "regular expression" in the mathematical sense if it still translates to a DFA and matches a regular language. For instance, mathematical regular expressions typically only include '*' (0 or more), not '+' (1 or more). However, you can still translate a regex including '+' to a DFA. Likewise arbitrary repetition mechanisms like {m,n}, case-insensitivity, and other features that add convenience without actually matching non-regular languages.

On the other hand, backreferences can allow matching non-regular languages.
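Concretely, those conveniences desugar into the primitive regular operators (concatenation, alternation, star) without leaving regular territory, e.g.:

    a+        =  aa*
    a{2,4}    =  aaa?a?
    (?i)ab    =  (a|A)(b|B)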


Ah, yeah, I see now that that is indeed probably what they meant; thanks! In that case, yes, I'm sympathetic to that, using regex to just mean "Anything that ultimately describes a regular language".


I don't think so, not too much anyway, I think the "software engineering" aspects of it would place reasonably strong constraints on what is desirable and on what is feasible (for a DFA). That's probably how the (informal?) consensus emerged.


If you can compile any expression in the notation to a DFA, then it's equivalent in power to regular expressions.


Maybe we should stop talking about regular expressions and talk about pattern matching instead...


Sure, but on the pragmatic hand, "regular" isn't super useful unless you're designing a language yourself. If you're parsing something, not understanding "regular" is going to hurt a lot less than attempting to use PCRE to do something it can't.


Regular expressions have implications for space and time complexity. PCRE tosses that away.


Indeed. To support backreferences, "regex" libraries are forced to use algorithms that can be very slow in the worst case.

The sad thing is that the libraries use the same algorithms even if the expression doesn't contain backreferences. A while ago, Stack Overflow had a brief outage because of regular expression performance, although the expression that caused it didn't even use backreferences or other non-regular features:

http://stackstatus.net/post/147710624694/outage-postmortem-j...

In contrast, Google Code Search - when it still existed - supported regular expression searches over the world's public codebases. One key ingredient making this possible was to only use proper regular expressions:

https://swtch.com/~rsc/regexp/regexp4.html


Regular is actually incredibly useful because it allows you to confidently parse or validate (large amounts of) untrusted data without having to worry about DoS attacks.

I'd argue that this is useful, even if you use it as part of a parser that doesn't just parse regular languages, as you'll have a slightly easier time reasoning about the time and space complexity of the construction.


That's true, but it is actually possible to construct true regular expressions that check if a binary string of digits is divisible by an arbitrary integer. Obviously not the quickest or most intuitive way to do it, though.

There's an interesting book that covers some unusual aspects of automata and regular expressions: https://www.amazon.com/Finite-Automata-Regular-Expressions-S...


Sure, for a fixed arbitrary integer M. (Supposing the input is fed in in big-endian order, you just keep track of the remainder modulo M as each new bit comes in, updating it by turning r into (r * 2 + new bit) mod M. Since this only requires finite state, it is regular).

But that won't get you a primality checker. You can't get a primality checker. Primes don't comprise a regular language, neither in unary nor in any nontrivial base.
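In code, the finite-state divisibility check described above looks like this (a minimal Perl sketch):

    # O(1)-state check: is a big-endian binary string divisible by $m?
    sub divisible_by {
        my ($bits, $m) = @_;
        my $r = 0;                               # remainder so far: the whole state
        $r = ($r * 2 + $_) % $m for split //, $bits;
        return $r == 0;
    }
    print divisible_by('10010', 3);              # 18 is divisible by 3, prints 1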


What do you mean by "non trivial base?"


Oh, just as opposed to unary. Writing numbers in the ordinary way in any ordinary base, which is to say, bases greater than 1; base 2, base 3, base ten, whatever.


It's worth pointing out that we are talking about representing numbers in unary, though. Or even more directly, plain ol' string length since the RE under discussion doesn't even care what symbols you are using, so "11111" = "abcde" under this RE.


Back-references may not be part of the mathematical definition of regular language, but they are definitely part of the traditional (UNIX) definition of regular expressions.

The regexp in the article is perfectly compatible with UNIX grep:

    $ s=.; for i in `seq 50`; do echo $s | grep -qE '^.?$|^(..+)\1+$' || echo -n $i\ ; s=$s.; done; echo
    2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 
(Tested on Mac OS X)

In this case, bashing PCRE is completely off-topic.


Nobody is bashing PCRE.

>but they are definitely part of the traditional (UNIX) definition of regular expressions.

This is not exactly true. In 1968 or earlier, Ken Thompson wrote the first grep implementation using NFA simulation, via an algorithm now known as Thompson's construction [1][2]. There were no back-references in that grep.

In 1975, Al Aho wrote egrep, which used a DFA instead of an NFA [3] (both NFAs and DFAs accept exactly the regular languages, but in some cases a DFA will have exponentially more states than an NFA accepting the same language [4]). egrep added features such as alternation and grouping, which grep did not support, but it did not support back-references.

Current GNU grep has a -E switch, which accepts extended regular expressions as described in the POSIX standard; it also supports back-references there, as does PCRE, available in grep with the -P switch.

So no, back-references are not part of the traditional Unix definition of regular expressions.

[1] https://en.wikipedia.org/wiki/Thompson%27s_construction

[2] http://dl.acm.org/citation.cfm?doid=363347.363387

[3] http://dl.acm.org/citation.cfm?id=55333

[4] http://cs.stackexchange.com/questions/3381/nfa-with-exponent...


tl;dr if you know a little regex

   - convert your number to a string of that length (1 = 1, 2 = 11, 3 = 111)
   - handle 0 / 1 separately
   - ^(..+?)\1+$
   - trick; the WHOLE thing has to match to return (^$)
   - first \1 candidate is "11", ergo the string must be "1111", "111111", ... (even length, at least 4) to match
   - second \1 candidate is "111", ergo the string must be "111111", "111111111", ... to match
   - and so on.  if some \1 tiles the whole string, the number was not prime (runnable sketch below).
Clever trick. Look forward to being asked it in your next google interview :)
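Spelled out as runnable Perl (a sketch following the steps above):

    # A string of $n characters can be tiled by 2+ copies of a block of
    # length >= 2 exactly when $n is composite
    sub is_prime {
        my ($n) = @_;
        return 0 if $n < 2;                      # handle 0 and 1 separately
        return ('1' x $n) !~ /^(..+?)\1+$/;
    }
    print join(' ', grep { is_prime($_) } 1 .. 50), "\n";   # 2 3 5 7 11 13 ...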


That'd be right - a 20 year old trick becomes Google's hottest new hiring roadblock/challenge...

newsgroup comp.lang.perl.misc, October 1997. Message-ID slrn64sudh.qp.abigail@betelgeuse.wayne.fnx.com

(I don't know if abigail "invented" that, but I remember discussing it on usenet when it appeared in her .sig back in the mid/late '90s...)


In other words, it's the Sieve of Eratosthenes implemented via regex: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes


Well, not really. It's closer to trial division, though even that doesn't go through every single multiple from 1-n.

The difference is the sieve works on a large batch of numbers, and only tests divisibility with numbers not known to be composite, and less than sqrt(n).

https://en.wikipedia.org/wiki/Trial_division


No, that's completely different. This is just trial division, with really slow division.


Filling the Sieve has a time complexity of O(n log log n), and accessing a value is clearly O(1). This is far away from that.
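For comparison, the sieve under discussion, as a minimal Perl sketch:

    # Sieve of Eratosthenes: mark composites for a whole batch, O(n log log n)
    sub sieve {
        my ($n) = @_;
        my @is_prime = (1) x ($n + 1);
        @is_prime[0, 1] = (0, 0);
        for (my $i = 2; $i * $i <= $n; $i++) {
            next unless $is_prime[$i];
            for (my $j = $i * $i; $j <= $n; $j += $i) {
                $is_prime[$j] = 0;               # cross off multiples of $i
            }
        }
        return grep { $is_prime[$_] } 2 .. $n;
    }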


That succinct explanation was excellent. Something like that should be a sidebar to the original article.

(I was curious about how it worked but didn't need a long article about the basics of RE's, prime numbers, and other things I understand well enough.)


"clever trick" is something we explicitly avoid asking these days.


Is the applicant asked to explain the regex or to come up with the regex and the conversion to unary?


Doesn't this have some relation to https://en.wikipedia.org/wiki/Church_encoding


It's basically a prime sieve done on a string encoding.


Invented in 1998 by Perl hacker Abigail:

http://neilk.net/blog/2000/06/01/abigails-regex-to-test-for-...

(check out Abigail's other JAPHs if you like stuff like this)


Yep, you are correct :)


This regexp is totally brilliant. It solves this problem in an unexpected way, because you think numbers, but the solution thinks strings. Not only that, the paradox, or its beauty, is that it goes to the root of the number abstraction: counting sticks.

I explored this technique some years later to solve a bunch of other problems, coprimality, prime factorization, Euler's phi, continued fractions, etc. (see https://github.com/fxn/math-with-regexps/blob/master/one-lin...).
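As a taste of that style (my sketch, not necessarily how the linked repo does it), greedy matching even extracts the gcd, because the engine tries the longest candidate block first:

    # gcd via regex: the longest block of 1's that tiles both unary strings
    sub gcd {
        my ($a, $b) = @_;
        my $s = ('1' x $a) . ',' . ('1' x $b);
        $s =~ /^(1+)\1*,\1+$/ or return;         # assumes $a, $b >= 1
        return length $1;
    }
    print gcd(12, 18), "\n";    # 6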


I wish they'd stop calling these expressions "regular".


Okay, technically it's regex, but I didn't want to get into this distinction, since "regex" and "regular expression" are used pretty much interchangeably (unless you're in academia :) ).


> How would we go about that? Well, all we have to do is add ? in front of the +. This will lead us to the <.+?> regex.

I was very confused until I realized the author's definition of 'in front' wasn't the same as mine...


What do you mean? Could you please clarify that? What did you understand by "in front"?


Presumably, they thought of A as in front of B in the expression "AB" rather than in the expression "BA".

(Which is also how I would normally think of "in front", for what it's worth!)


Oh, that was probably it. That would be true if we were talking about FIFO queues :)


Which, in my experience, is how writing then reading text works.


I'm another one who thinks the letter "in front" is the one that comes first. May I ask how you were thinking about the "front" and "back" of text?


Well, we read from left to right, naturally that is the order of the letters, so the one on the right is "in front" of the left one.


I guess if you imagine a person walking from the beginning of the text to the end, their front (the side with the face on it) would face the end of the text.

But English never conceives of text in this manner; we view text as being arranged in a chronological order, where text that occurs chronologically earlier comes "before" text that occurs chronologically later. This mirrors the application of "before" and "after" to time in the rest of the language. Whether you conceive of reading as the reader traveling through text from the beginning to the end, or as text arriving at the reader, the reader will always encounter text on the left before text on the right, and therefore the text on the left is in front of the text on the right.

(In the only other language I'm qualified to talk about this for, mandarin chinese, earlier and later time might be indicated by either of two spatial metaphors: "up" for the past and "down" for the future ["up" is also used as a metaphor for beginning things]; or "front" for earlier and "back" for later. When text is read from left to right, "front" is used to indicate text on the left.)

Here ( http://www.friesian.com/egypt.htm ) is someone writing about the ancient Egyptian writing system, inadvertently assuming that the front of text is its beginning and the back of text is its end:

> Note that Egyptian glyphs have a front and a back. All the images above and below face to the left, [...] which indicates that the text is to be read from left to right. This is conformable with the usage of English and other European languages. However, although this would be familiar and agreeable to the Egyptians, Egyptian usage was ordinarily to write from right to left, as today is done in Hebrew and Arabic. They indicated this direction by having all the glyphs face to the right instead of to the left

(Egyptian glyphs often depict a person or an animal with an actual face. They face towards the beginning of the text, not the end.)

You seem to speak English at a fully native level, based on your writeup here. (Although you don't seem to have picked up on the idea that if a quantifier "precedes" a '?', it must be "in front" of that '?'.) Do you have another native language? Are you based in a country that primarily speaks some other language? What is the metaphor that determines that later words are "in front" of earlier words?


>Well, we read from left to right, naturally that is the order of the letters

So that means "W", "e", "l", and "l" are in positions 1, 2, 3, and 4 respectively? The "W" comes first?

Isn't the first position always in front of the second by definition?


So your definition of "in front" is synonymous with "after" or "behind"?


Is it just me, or does the author of this blog post have a very annoying way of writing? He keeps explaining things and then a couple of lines below he writes something like "remember the discussion above?" or "if you remember correctly ...". Hell, sure I remember, because I just read about it 10 seconds ago. Do people really have such a short attention span that they can't remember what was in the previous paragraph? Repeating stuff over and over seems almost like content farming to me.


It's SEO. The entire article is optimised for search terms relating to regular expressions, primes, etc.

On a tangent, I really wish Google would release a new algorithm that punished this stuff. It's killing articles about common search terms. :/


Very interesting. Thanks for writing this up.

This is a bit more Perlish (not that I'm an authority):

    sub is_prime {
      (1x$_[0]) !~ /^(?:.?|(.{2,}?)\1+)$/;
    }
But perhaps less clear to a novice reader. You can also leave off the semicolon.


perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' [number]

(My first exposure to that was in '97 or so as a .sig from abigail in comp.lang.perl.misc)


Some asshole gave me this in an interview for a Perl dev job, and asked me what it did.


Wasn't at an interview, but I remember quizzing the "new guy" Perl dev with "So explain what this one does":

  @sorted = map  { $_->[0] }                # 3. strip the decoration
           sort { $a->[1] cmp $b->[1] }     # 2. sort on the cached key
           map  { [$_, foo($_)] }           # 1. decorate: compute foo($_) once per element
                @unsorted;
(Twenty years or so back, I could occasionally be "that asshole"... I'm better now, honest...)


Thank you :)

You are probably right, but this function was my first head-to-head encounter with Perl; I researched just a little bit of what I needed to know in order to write it.



Performing a Miller-Rabin primality test or filling a Sieve of Eratosthenes will almost always be the way to go. However, this line of code seems like magic and that's kind of impressive.


how many problems do we have now?


Approximately n / ln n according to https://en.wikipedia.org/wiki/Prime-counting_function


98. Checking for primality ain't one.


Why does the example given by the author with L=15 state that the regex matches once it reaches 5, instead of 3?

> So first, we’ll be testing the divisibility by 2, then by 3, then by 4 and then by 5, after which we would have a match.


It tests from the highest number first I believe.

>> As a heads-up, I just want to say that I’m lying a little in the explanation in the paragraph about the ^(..+?)\1+$ regex. The lie has to do with the order in which the regex engine checks for multiples, it actually starts with the highest number and goes to the lowest, and not how I explain it here. But feel free to ignore that distinction here, since the regular expression still matches the same thing


Right, that makes sense. But in the explanation they go from 2, to 3 to 4 to 5, following along with the lie, yet picking the real solution. It's a bit confusing.


have regular expressions gone too far?


It's basically the Sieve of Eratosthenes[1] algorithm, and it's possible to do it with regular expressions because numbers here are represented in the unary[2] number system, where the number of characters/tokens equals the number itself. It's a common trick for testing various Turing-machine stuff.

[1] https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes [2] https://en.wikipedia.org/wiki/Unary_numeral_system


No it's not, this is the trial division algorithm for primality testing.


Can't quite get this to work in vim [1]:

  \(^1\{-,1}$\)\|\(^\(11\+\)\1\+$\)
The two parts match what you'd expect on their own but the OR-ing screws it up: it means the whole regex matches everything.

Is this something vim gets right and every other engine wrong, the other way around, or...?

_____________

[1] I'm matching 1's rather than dots to avoid highlighting every bit of text ever anywhere all the time.

Also, that way it's so much more pleasing to the eye and easy to read, don't you think?
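FWIW, a likely culprit (assuming vim's usual numbering): capture groups are numbered across the whole pattern, so \1 in the second branch refers to the first \(...\), which never matches anything in that branch; the back-reference then matches the empty string, and \(11\+\)\1\+$ accepts any string of two or more 1's. Pointing the back-reference at the right group, or just dropping the outer groups, should fix it:

  ^1\{-,1}$\|^\(11\+\)\1\+$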


> Also, that way it's so much more pleasing to the eye and easy to read, don't you think?

No, it's harder to read. For ease of reading, you need to match something that isn't already part of the expression, like 2s or Ks.


>> No, it's harder to read.

Don't worry, I had my humour removed at birth also. It grows back, eventually.


Our hubris will be our undoing.


how does it fare on time complexity compared to normal tests like say AKS?


It's very clever, but obnoxiously slow. It's useful for code golf and as a pretty impressive party trick. But just as your banker will not be impressed with your college funding plan of pulling a quarter out of his ear, this is not going to make it into any real use.

Imagine naive absolute-beginner-programmer trial division. This is worse. Now add the overhead of counting via regex backtracking and integer comparison via matching strings. A fair number of regex engines will also start using enormous amounts of memory.

AKS is of theoretical interest, but not really a "normal test." It's very slow in practice, being beaten by even decent trial division for 64-bit inputs (it's eventually faster, as expected, but it takes a while). But it is quickly faster than this exponential-time method. The regex is in another universe of time taken when compared to the methods typically used for general-form inputs (e.g. pretests + Miller-Rabin or BPSW, with APR-CL or ECPP for proofs).

As others have noted, "has been popularized by Perl" is because it was created by Abigail, who is a well-known Perl programmer (though almost certainly a polyglot). It's also been brought up many times, though it's a nice new blog article. I hope the OP found something better when "researching the most efficient way to check if a number is prime." In general the answer is a hybrid of methods depending on the input size, form, input expectation, and language. The optimal method for a 16-bit input is different than for a 30k-digit input, for example.


Why is AKS only of theoretical interest? Isn't it proven to be a deterministic test of primality?

Also, what is the fastest way to test for primality that's practically feasible?


It is a proven deterministic test of primality. We already had those before AKS, and they are significantly faster than AKS (even the various improvements). But they don't check all the boxes that are useful for stating things in computational complexity proofs without a paragraph of weasel words. So from the theory side of things, it's great since we don't particularly care about the actual execution time, but the asymptotic behavior and whether it is solidly shown to be in P.

Lots of math packages have deterministic primality tests, but none use AKS as a primary method, because AKS offers no benefits over other methods and is many orders of magnitude slower.

For inputs of special form, there are various fast tests. E.g. Mersenne numbers, Proth numbers, etc.

The fastest method depends on the input size and how much work you want to put in. For tiny inputs, say less than 1M, trial division is typically fastest. For 32-bit inputs, a hashed single Miller-Rabin test is fastest. For 64-bit inputs, BPSW is fastest (debatable vs. hashed 2/3 test). The BLS methods from 1975 are significantly faster than AKS up to at least 80 digits, but ECPP and APR-CL easily beat those. ECPP is the method of choice for 10k+ digit numbers, with current records a little larger than 30k digits.


Horrific. It's implicitly checking every candidate divisor (O(n) of them), and each attempt costs the regex engine O(n) work to try tiling the string. So the overall complexity is of the order O(n^2) in the value of n, which is exponential in the number of digits.


it's highly inefficient tho



