I attempted this a few years back, didn't get far before I got hung up on the re...

progre · on April 24, 2021

I also could not get the regex to work, so I ended up writing a custom tokenizer. It came to about 30 lines of code and, unlike the regex, I could understand how it works.
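For readers curious what such a hand-written tokenizer might look like, here is a minimal Python sketch. It is not progre's code: the function name `tokenize`, the `SPECIALS` constant, and the exact token rules are assumptions chosen to roughly mirror the token classes mal's regex recognises.

```python
# Hypothetical hand-rolled tokenizer; names and rules are illustrative assumptions.
SPECIALS = "[]{}()'`~^@"

def tokenize(source):
    """Split Lisp source text into a list of token strings."""
    tokens = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace() or ch == ",":
            # Whitespace and commas only separate tokens.
            i += 1
        elif ch == ";":
            # Comment: skip to the end of the line.
            while i < n and source[i] != "\n":
                i += 1
        elif source.startswith("~@", i):
            # The only two-character special token.
            tokens.append("~@")
            i += 2
        elif ch in SPECIALS:
            tokens.append(ch)
            i += 1
        elif ch == '"':
            # String literal: a backslash escapes the following character.
            j = i + 1
            while j < n and source[j] != '"':
                j += 2 if source[j] == "\\" else 1
            j = min(j + 1, n)  # include the closing quote if there is one
            tokens.append(source[i:j])
            i = j
        else:
            # Symbol, number, keyword: read until a delimiter.
            j = i
            while j < n and not (source[j].isspace() or source[j] in SPECIALS + '",;'):
                j += 1
            tokens.append(source[i:j])
            i = j
    return tokens
```

Calling `tokenize('(+ 1 (* 2 3)) ; c')` yields the nine tokens of the expression with the comment dropped. The loop trades the single dense regex for one branch per token class, which is what makes it easier to step through.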

KineticLensman · on April 24, 2021

In C#, I ended up with

   static public List<string> Tokenizer(string source)
   {
      // Initialise the token list.
      List<string> tokens = new List<string>();

      // Define a regex pattern whose groups match the MAL syntax.
      string pattern = @"[\s ,]*(~@|[\[\]{}()'`~@]|""(?:[\\].|[^\\""])*""|;.*|[^\s \[\]{}()'""`~@,;]*)";
      //                 empty  ~@ | specials     |   double quotes      |;  | non-specials

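      // Break the input string into its constituent tokens.
      string[] result = Regex.Split(source, pattern);

      // Assumed completion (the snippet above stops here): Regex.Split also
      // returns empty strings, so keep only non-empty, non-comment pieces.
      // Needs System.Collections.Generic and System.Text.RegularExpressions.
      foreach (string token in result)
      {
         if (token.Length > 0 && token[0] != ';') tokens.Add(token);
      }
      return tokens;
   }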

This took a while to understand and get going but it really improved my understanding of regex.

duncaen · on April 24, 2021

Regex to parse lisp expressions?

the-smug-one · on April 24, 2021

Regexes are at least useful for parsing numbers and symbols.
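As a small illustration of that point, here is a Python sketch that uses two short regexes to classify a single token. The function name `read_atom` and the number formats accepted are assumptions for the example, not part of any particular Lisp.

```python
import re

# Assumed number formats for this sketch: optional minus sign, decimal digits.
INTEGER = re.compile(r"-?[0-9]+")
FLOAT = re.compile(r"-?[0-9]+\.[0-9]+")

def read_atom(token):
    """Turn one token into an int, a float, or leave it as a symbol name."""
    if INTEGER.fullmatch(token):
        return int(token)
    if FLOAT.fullmatch(token):
        return float(token)
    # Anything else is treated as a symbol in this sketch.
    return token
```

Using `fullmatch` means a lone `-` or a name like `x1` stays a symbol instead of being half-parsed as a number.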

But yeah, that shouldn't be where you get stuck.

macintux · on April 24, 2021

[\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)

Step 0, so I didn't get very far.

https://github.com/kanaka/mal/blob/master/process/guide.md#s...

kanaka · on April 25, 2021

It's a long regex, but it's just whitespace followed by an alternation with 5 different types of data: splice-unquote, special characters, strings, comments, symbols. The string tokenizing branch is a bit complicated because it has to allow internal escaping of quotes. Early iterations of the guide didn't explain the regex in detail, but the section now describes each of the regex components.
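To make the five branches concrete, here is the same pattern laid out with Python's `re.VERBOSE` flag, one branch per line. This is a sketch rather than code from the mal repository: the names `TOKEN_RE` and `tokenize_with_regex` are made up here, and dropping comments at this stage is an assumption made for brevity.

```python
import re

TOKEN_RE = re.compile(r"""
    [\s,]*                  # leading whitespace and commas (not captured)
    (
        ~@                  # splice-unquote
      | [\[\]{}()'`~^@]     # single special characters
      | "(?:\\.|[^\\"])*"?  # string, allowing backslash escapes inside
      | ;.*                 # comment running to end of line
      | [^\s\[\]{}('"`,;)]* # symbols, numbers, everything else
    )
""", re.VERBOSE)

def tokenize_with_regex(source):
    """Return the captured group of every match, minus empties and comments."""
    tokens = [m.group(1) for m in TOKEN_RE.finditer(source)]
    # The last branch can match the empty string (e.g. at end of input),
    # so empty captures are filtered out here.
    return [t for t in tokens if t and not t.startswith(";")]
```

For example, `tokenize_with_regex("(def! x '(1 2)) ; note")` gives `['(', 'def!', 'x', "'", '(', '1', '2', ')', ')']`.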

There are online tools to help visualize regexes. Here is a recent tweet including a visualization of mal's tokenizer regex: https://twitter.com/Mehulwastaken/status/1382292764834996230

cellularmitosis · on April 24, 2021

Whoa that regex is a monster. Try starting with simpler pieces and see if you get further this time around. Good luck! https://gist.github.com/cellularmitosis/75dc4aefe88438c14e94...

the-smug-one · on April 25, 2021

Well, you certainly don't need that regex to implement a Lisp.

bgorman · on April 24, 2021

The regex is used as a tokenizer, the outputs of which are then fed into the reader module.
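A rough Python sketch of that second stage, assuming the tokens arrive as a flat list of strings. The name `read_form` follows the mal guide's terminology, but this body is an illustrative assumption that handles only parenthesised lists and leaves atoms as strings.

```python
def read_form(tokens, pos=0):
    """Parse one form starting at tokens[pos]; return (form, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        items = []
        pos += 1
        # Recurse for each element until the matching ")" is reached.
        while pos < len(tokens) and tokens[pos] != ")":
            form, pos = read_form(tokens, pos)
            items.append(form)
        if pos >= len(tokens):
            raise SyntaxError("missing ')'")
        return items, pos + 1  # step past the closing ")"
    if token == ")":
        raise SyntaxError("unexpected ')'")
    # Atom: left as a plain string in this sketch.
    return token, pos + 1
```

The nesting is handled by the recursion in the reader, not by the regex, so the tokenizer only ever has to recognise flat, regular pieces.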

User23 · on April 25, 2021

Yeah, a little weird, since regexes can’t parse context-free languages. I suppose most so-called regexes aren’t actually regular expressions, but it still feels like driving screws with a hammer.

kanaka · on April 25, 2021

Mal uses a regex for lexing/tokenizing. I didn't want people to get hung up on the lexing step (my university compilers class spent 1/3rd of the semester just on lexing). It's certainly a worthwhile area to study but not the focus of mal/make-a-lisp.

SanFranManDan · on April 24, 2021

If you are using Erlang, you should be using pattern matching, not regex. Erlang and Elixir are among the easiest languages to build parsers in, thanks to their binary string pattern matching.

macintux · on April 24, 2021

I agree, pattern matching is what I sorely miss every time I use anything other than Erlang. It's just enough of a hurdle I set it aside and didn't return.

It's a big step.

https://github.com/kanaka/mal/blob/master/process/guide.md#s...

agumonkey · on April 24, 2021

Brown University PLT textbook used lisp/scheme and their first paragraph was something like "nobody cares about parsing, let's get down to business in sexps"

I like parsing, I like regexes but I agree it's often a waste of time :)