Here is another one I just did tonight - I wanted to match IPv4 addresses, but d...

MichaelSalib · on Sept 29, 2013

Maybe something like this:

  def is_ipv4_addr(s):
     try:
        octets = s.split('.')
        assert len(octets) == 4
        for o in octets:
            assert 0 <= int(o.lstrip(0) or '0') < 256
     except:
        return False
     return True

It is longer; on the other hand, it is easier to read and, more importantly, easier to verify for correctness.

ghshephard · on Sept 30, 2013

Would:

  1. 12 .13. 14
  089.23.45.67

both match that? (Your general point is made, though: regexes look fine to the person who just crafted them, but are opaque to the casual observer.)

clarry · on Sept 29, 2013

I think you forgot to verify that an octet doesn't have leading zeros (unless its value actually is zero).

MichaelSalib · on Sept 29, 2013

I didn't forget: the (o.lstrip(0) or '0') expression does that.

Actually, that should be o.lstrip('0')...

clarry · on Sept 30, 2013

Wrong.

  >>> is_ipv4_addr("01.0.0.0")
  True

It should reject that (i.e. return False) because the first octet contains a leading zero. But you're just stripping the zero away and ignoring its existence. The stripping has no effect anyway, because int() already ignores leading zeros for you.

Your code is also ok with bizarre inputs like "0..." :-)
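For illustration, here is a rough sketch of what a stricter split-based check might look like. The name is_ipv4_addr_strict is made up for this example, and the rules it enforces (exactly four octets, ASCII digits only, no leading zeros unless the octet is exactly "0", value at most 255) are my reading of the requirements discussed above, not code from anyone in the thread.

```python
def is_ipv4_addr_strict(s):
    """Hypothetical stricter variant of is_ipv4_addr (assumed rules, see above)."""
    octets = s.split('.')
    if len(octets) != 4:
        return False
    for o in octets:
        # Each octet must be 1 to 3 ASCII digits; this rejects "" and " 12 ".
        if not (1 <= len(o) <= 3) or not all(c in '0123456789' for c in o):
            return False
        # A leading zero is only allowed when the octet is exactly "0".
        if len(o) > 1 and o[0] == '0':
            return False
        if int(o) > 255:
            return False
    return True
```

With these checks, "01.0.0.0" and "0..." are both rejected, while "0.0.0.0" is still accepted.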

Regexes really do have their strengths -- they compactly express a state machine, and you can always break the expression into parts which'll show exactly what the state machine will accept. They could also be much more readable if people bothered to break them into parts instead of typing them all out inside one long string that becomes really difficult to parse visually. There are other notations to improve readability, for example rx in emacs: http://www.emacswiki.org/emacs/rx
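As a sketch of the "break it into parts" idea, something like the following could work in Python. The pattern is my own and is not the regex from the original post; the names OCTET, IPV4 and matches_ipv4 are made up for the example.

```python
import re

# One octet in the range 0-255 with no leading zeros.
# Alternatives run from the largest range down to a single digit.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])'

# A full address: one octet followed by three ".octet" groups.
IPV4 = re.compile(OCTET + r'(?:\.' + OCTET + r'){3}')

def matches_ipv4(s):
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return IPV4.fullmatch(s) is not None
```

Each named piece can be read and tested on its own, which is most of the readability win; re.VERBOSE with inline comments would be another way to get a similar effect.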

A seemingly simple regex can be implemented in imperative code and it might look clean and pretty until you get the logic exactly right and amend it to handle all the corner cases that are not obvious at first sight. For comparison I did the exercise in old-fashioned C (and the indentation got messed up along the way, sigh).

https://pastebin.mozilla.org/3171656

A state machine would be more appropriate in my opinion.
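To make the state machine idea concrete, here is a minimal hand-written sketch in Python. It is not a port of the C code linked above; the state names, the function name is_ipv4_addr_fsm, and the acceptance rules (four octets, values 0-255, no leading zeros) are assumptions for the example. Strictly speaking it is a three-state machine plus two small bounded counters.

```python
def is_ipv4_addr_fsm(s):
    # START:  expecting the first digit of an octet
    # ZERO:   the octet so far is exactly "0" (no further digits allowed)
    # DIGITS: inside an octet that began with a nonzero digit
    START, ZERO, DIGITS = range(3)
    state = START
    value = 0   # numeric value of the current octet
    dots = 0    # number of separators seen so far
    for c in s:
        if c in '0123456789':
            if state == START:
                value = int(c)
                state = ZERO if c == '0' else DIGITS
            elif state == DIGITS:
                value = value * 10 + int(c)
                if value > 255:
                    return False
            else:
                # ZERO: a digit after a leading zero, e.g. "01"
                return False
        elif c == '.':
            # A dot needs a finished octet before it, and at most 3 dots.
            if state == START or dots == 3:
                return False
            dots += 1
            state = START
        else:
            return False
    # Accept only after exactly three dots and a non-empty final octet.
    return dots == 3 and state != START
```

Every rejection corresponds to one missing transition, so the corner cases ("01.0.0.0", "0...", a trailing dot) are visible directly in the branches.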

MichaelSalib · on Sept 30, 2013

You're right. My mistake.

I like automata and I think regexes are good for some things, but I definitely agree about the crappy syntax. When working in CL, I loved Edi Weitz's CL-PPCRE package, which allowed you to specify regexes using either the traditional broken string form or an s-exp syntax. Much cleaner.

mjhoy · on Sept 29, 2013

> How else would you do it?

Taking your question generally, I was curious to see what it might look like as a parser, since I find that regex a little hard to read. Here's an implementation with Haskell's parsec:

https://gist.github.com/mjhoy/6751909