Nice find. It works because it rejects "2 or more x's" repeated "2 or more times". So xx doesn't get rejected, but any multiple of that (xxxx, xxxxxx, ...) will be. The same way xxx doesn't get rejected, but any multiple of that (xxxxxx, xxxxxxxxx, ...) will be.
You've solved it using the actual definition of prime numbers, no trickery needed. Well played.
FYI, you don't need brackets around the \1, so can score 286.
More interesting than the definition of primes, it's almost the definition of multiplication that is embedded in this regex. We have two numbers(of occurrences) being multiplied:
- the first one is represented by the group (..+) it represents the number of occurrences n between 2 and +∞
- the second one is represented by (\1)+. We will repeat the first number m times, between 1 and +∞ times.
So the result of the multiplication is n*(m+1), which cannot be a prime. We just have to take the opposite with negative lookahead. It's very beautiful indeed.
Thanks for the web site links! Both are pretty interesting and I actually learned something from the detailed description(s) that regex101 provides.
(I learned that for (\1)+, "Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data")
Later, I realized that in a regular program / on the command line, the negative lookahead can be avoided by using the !~ (doesn't match) operator,i.e., we just check that the number is not a non-prime using a simpler regex:
DB<63> print "Matches!" if (("x" x 31) !~ /^(..+)\1+$/)
Matches!
DB<64> print "Matches!" if (("x" x 18) !~ /^(..+)\1+$/)
The Regex Golf site only asserts matches, i.e., it's using =~. That's why the negation using negative lookahead was needed.
(The simpler regex merely looks for non-primes by matching any number of characters which are a multiple of two numbers, n x m, i.e., those which can be factorized. n comes from (..+), m comes from \1+).
Thanks, I was going for the definition of primes but was just going a bit loopy about the negation. I didn't want to trim surplus characters until I understood it but yes I can see those brackets are unnecessary.
Therefore, the score it provides should be a running tally of how many characters you've used to make it match the scoring system of golf; which is the lowest number of strokes wins.
The scoring system for this is incremental which is the opposite of golf.
A proper scoring system with this would provide a character limit (par) for each section and the goal would be to write a shorter regex formula to complete the task. Final score would be how many characters under or over the total character limits (course par) you scored.
Seems this is more like Regex Darts or something like that.
The title was (most probably) derived from Code Golf, which is a competition in coding something with as few characers as possible. Code Golf was derived from Golf, where you want to use as few turns as possible.
The score going up and not down, which is done because you get more points the more objectives you fulfill, does not change the objective of this game or what it is based on.
> The score going up and not down, which is done because you get more points the more objectives you fulfill, does not change the objective of this game or what it is based on.
Golf scoring penalizes you with more "points" by how many strokes you take.
If you are playing a Par 4 and it takes you 6 swings to get in the cup you just got _penalized_ +2
If your partner gets in the cup in 2 swings he is awarded -2
From my quick read, Code Golf uses a similar scoring system as to the sport of golf, the lower the score the better.
This game is the opposite, meaning the higher the score the better.
This has nothing to do with the objectives of the games in question, just the scoring method.
To make the sport of golf have a similar scoring system as this game you would grant ten points for every stroke under par, deduct ten points for every stroke over par, and deduct one point for every penalty stroke.
Alternatively, it's a loose analogy and the main point of the game is to save keystrokes (with the positive scoring system added to reward partial effort; if it came down to a scored competition the leaders would anyway all probably have full credit for solving the problem).
Regex golf is to code golf as paintball golf is to normal golf. Or something.
The golf part is that you can only get the most points by writing the shortest regex possible. However, I think the game avoids just giving a score based on the number of characters. To do that, you would be required to match all on the left and none on the right (the equivalent of actually getting the golf ball in the hole). This game lets you pick up your ball whenever you want and scores you based on distance to the hole. For the golfers, think of it more like a closest-to-the-hole competition than the more common stroke play.
Many times I would match the exact opposite opposite of what I wanted. Is there a general rule for inverting regexps? ^ and ?! don't seem general purpose.
You need to pin the match to the start and end of the string with ^ and $ respectively otherwise the negation just matches an empty string or other irrelevant string.
As you say, the negation matches an empty string. So I put on anchors:
^(?!(.)(.)\3\2)$
But now it matches nothing. I'm not even sure what that regexp says. The entire string is a negation? Would anything match that regexp?
I see elsewhere on this page that the answer involves putting in an extra dummy character, putting that new negation-and-dummy-character in parens, and then requiring that, between the anchors, there be 0-or-more of negation-and-dummy-character.
^((?!(.)(.)\3\2).)∗$
Two questions:
1. Why are my backrefs still \3 and \2? I added another pair of parens. (I thought ?! might not count, but it counted in my second example above.
2. Why does abba no longer match? It has no matches to the negation-and-dummy character construct, which ∗ should match, right?
Things like (?!..) are called assertions. I like to think of assertions as patterns that match the space between two characters. So (?!a) means that it matches the space, where the next character is not 'a'.
So in this case, your regex matches an empty string (since it doesn't match any character at all between ^ and $, only spaces.)
> ^((?!(.)(.)\3\2).)∗$
> 1. Why are my backrefs still \3 and \2?
Well, you're referencing the second and third opening parenthesis.
> but it counted in my second example above
Maybe that's because your second example is wrong? I mean, I don't really know what you are trying to achieve in your second example.
By the way, my solution for abba is
^(?!.*(.)(.)\2\1)
I guess you'll will be able to figure out what this regex means by now :)
Edit:
((.(?!\1))+|\1) is used to conditionally match .+ iff a * has been found.
.(?!\1) Matches any character if it is followed by \1. When * has been found then it matches no character, when * is not found it matches every character.
Edit 2:
Formatting to avoid the *s becoming italics :/
I got all but the "do not match" for "Ternstroemiaceae"
The challenge appeared to be to match words with four instances of the same letter. "Ternstroemiaceae" contains four 'e's, and thus should be in the "match" column, instead of the "don't match" column, no? Did I miss something?
I got a 3 character pattern for the first puzzle and a 4 character pattern for the second puzzle (and reading a comment here I see there is a 2 char pattern for the second puzzle).
Only used 1 special char between them (anchoring something). So the first two sets have repeated 3 character sequences, and in one of them they follow an additional pattern.
For the third sequence, 8 character solution, lots of specials.
I haven't figured out how to solve this with the parts of ERE and PCRE that I know. (I definitely don't know the entirety of PCRE.)
It's straightforward for me to write a substitution using regular expressions to create a pattern-matcher for a given glob (just anchor the ends and replace literal ? with . and literal * with .*) but here we have to do it inside a single regular expression.
I don't think there's a BRE solution if the number of stars is unbounded because I don't think this is a regular language.
If you look at the revisions, you'll see my 1st iteration was mostly identifying patterns, then with more and more cheating (and looking at this thread) to squeeze every point possible.
What are the rules? That is, are these Perl regexes, POSIX regexes, …? (Come to that, what is this site? Going up one level to alf.nu gives me a lot of suggestions for what I can do by modifying the address, but no clue of who's doing it on my behalf.)
challenge: use machine learning to find the best solutions.
They might improve on those intended by exploiting accidental regularity in the corpus - though charmingly, the golf-cost of regex length helps combat this overfitting. They might also find genuinely cleverer solutions.
Refined:
[0369] can appear any number of times
[147] and [258] must appear an equal number of times, or for any remaining:
[147] must appear a multiple of 3 times
[258] must appear a multiple of 3 times
Ternstroemiaceae contains four es; anyone else hit that issue / know what that's not in the valid results for Four? I'm guessing there's a pun in there that I didn't get :/
Learning, kinda. I've often been motivated to learn something after being given a problem that can only be solved (or be solved much more easily) using it. Every other Excel trick I know is a result of that.
I like the concept but the word choice doesn't seem too "regular", it is more about catching all the particulars rather than finding a pattern as far as I can tell.
Wait, so what's with the 'point system' here? Why's shdon answer for backrefs 199 and yours 201? (and while you're at it, could you briefly explain (...).*\1 for the regex newbies?)
(...) # Match exactly 3 (the dots) characters and save them as a group (the parenthesis)
.* # Match any character (the dot) 0 or more times (the asterisk)
\1 # Reuse the first group
It isn't working because (...) is just matching 3 characters. What your regex is saying, is basically:
Match any 3 characters, then any character 0 or more times, then any 3 characters.
You're reusing the regex and not the match. Back referencing essentially means that you'll match whatever is matched the first time, rather than reusing the regex pattern.
I hope that made a little sense. Grouping, capturing, matching and backreferences is part of what makes regex tricky and very powerful. You might want to use some of the online tools, which can help explain the regex visually.
You're pretty much correct. True regular expressions don't have the expressive power to decide whether arbitrary-length strings are or aren't palindromes.
That said, 'regular expressions', as used in most programming languages, have extensions that extend the expressive power. One such extension is the matched group & backreference, used in other commentor's answers.
From a theoretic stance, these aren't really 'regular expressions', but that's what we call them in practice.
what's the pattern on "Abba"? I thought it was just to exclude doubled letters but I have doubles on two words on the left hand side as well (noisefully and effusive, in case the word lists are the same)
Because for example "beam" matches [^g-z]. That is, it has a letter somewhere that is not between g and z (namely e and a). I came up with ^[a-f]+$ but I'm guessing it could be shorter :)
Edit: ah, every word in left column has 4 a-f letters. So [a-f]{4} is a shorter match.
They are not redundant. Any string of one or more exes is going to constrain a substring of exes of length 2^n (consider n=0 for a trivial proof), so you do need those anchors!