Parser construction

A parser is nothing more than a class that derives from Parslet::Parser. The simplest parser that one could write would look like this:


  class SimpleParser < Parslet::Parser
    rule(:a_rule) { str('simple_parser') }
    root(:a_rule)
  end

The language recognized by this parser is simply the string “simple_parser”. Parser rules do look a lot like methods and are defined by


  rule(name) { definition_block }

Behind the scenes, this really defines a method named after the rule; calling that method returns whatever the definition block returns.

Every parser has a root. This designates where parsing should start. It is like an entry point to your parser. With a root defined like this:


  root(:my_root)

you create a #parse method in your parser that will start parsing by calling the #my_root method. You’ll also have a #root (instance) method that is an alias of the root rule’s method (here #my_root). For the SimpleParser above, the following three calls are really one and the same:


  SimpleParser.new.parse(string)
  SimpleParser.new.root.parse(string)
  SimpleParser.new.a_rule.parse(string)

Knowing these things gives you a lot of flexibility; I’ll explain why at the end of the chapter. For now, just let me point out that because all of this is Ruby, your favorite editor will syntax highlight parser code just fine.

Atoms: The inside of a parser

Matching strings of characters

A parser is constructed from parser atoms (or parslets, hence the name). The atoms are what appear inside your rules (and maybe elsewhere). We’ve already encountered an atom, the string atom:


  str('simple_parser')

This returns a Parslet::Atoms::Str instance. These parser atoms all derive from Parslet::Atoms::Base and have essentially just one method you can call: #parse. So this works:


  str('foobar').parse('foobar') # => "foobar"@0

The atoms are small parsers that can recognize languages and throw errors, just like real Parslet::Parser subclasses.

Matching character ranges

The second parser atom you will have to know about allows you to match character ranges:


  match('[0-9a-f]')

The above atom matches the digits zero through nine and the letters ‘a’ to ‘f’: hexadecimal digits, as you may have guessed. The inside of such a match parslet is essentially a regular expression that matches a single character of input. Because we’ll be using ranges so much with #match and because typing (‘[]’) is tiresome, here’s another way to write the above #match atom:


  match['0-9a-f']

Character matches are instances of Parslet::Atoms::Re. Here are some more examples of character ranges:


  match['[:alnum:]']      # letters and numbers
  match['\n']             # newlines
  match('\w')             # word characters
  match('.')              # any character

The wild wild #any

The last example above corresponds to the regular expression /./ that matches any one character. There is a special atom for that:


  any 

Composition of Atoms

These basic atoms can be composed to form complex grammars. The following few sections will tell you about the various ways atoms can be composed.

Simple Sequences

Match ‘foo’ and then ‘bar’:


  str('foo') >> str('bar')    # same as str('foobar')

Sequences correspond to instances of the class Parslet::Atoms::Sequence.

Repetition and its Special Cases

To model atoms that can be repeated, you should use #repeat:


  str('foo').repeat

This will allow foo to repeat any number of times, including zero. If you look at the signature for #repeat in Parslet::Atoms::Base, you’ll see that it really takes two arguments: min and max. So the following code all makes sense:


  str('foo').repeat(1)      # match 'foo' at least once
  str('foo').repeat(1,3)    # at least once and at most 3 times
  str('foo').repeat(0, nil) # the default: same as str('foo').repeat

Repetition has a special case that is used frequently: Matching something once or not at all can be achieved by repeat(0,1), but also through the prettier:


  str('foo').maybe          # same as str('foo').repeat(0,1)

These all map to Parslet::Atoms::Repetition. Please note this little twist to #maybe:


  str('foo').maybe.as(:f).parse('')         # => {:f=>nil}
  str('foo').repeat(0,1).as(:f).parse('')   # => {:f=>[]}

When nothing matches, a named #maybe yields nil, whereas repeat(0,1) yields an empty array. This caters to the intuition that foo.maybe either gives me foo or nothing at all, not an empty array. But have it your way!

Alternation

The most important composition method for grammars is alternation. Without it, your grammars could vary only in how much they match, not in what they match. Here’s how this looks:


  str('foo') | str('bar')   # matches 'foo' OR 'bar'

This reads naturally as “‘foo’ or ‘bar’”.

Operator precedence

The operators we have chosen for parslet atom combination have the operator precedence that you would expect: >> binds more tightly than |. No parentheses are needed to express alternation of sequences:


  str('s') >> str('equence') | 
    str('se') >> str('quence')

And more

Parslet atoms are not as pretty as Treetop atoms. There you go, we said it. However, there seems to be a different kind of aesthetic about them; they are pure Ruby and integrate well with the rest of your environment. Have a look at this:


  # Also consumes the space after important things like ';' or ':'. Call this
  # giving the character you want to match as argument: 
  #
  #   arg >> (spaced(',') >> arg).repeat
  #
  def spaced(character)
    str(character) >> match['\s']
  end

or even this:


  # Turns any atom into an expression that matches a left parenthesis, the 
  # atom and then a right parenthesis.
  #
  #   bracketed(sum)
  #
  def bracketed(atom)
    spaced('(') >> atom >> spaced(')')
  end

Because parslet grammars are themselves just plain old Ruby objects (PORO ™), they allow for very tight code. Module inclusion, class inheritance, …: all your tools should work well with parslet.

Tree construction

By default, parslet will just echo back to you the strings you feed into it. Parslet will not generate a parser for you, and neither will it generate your abstract syntax tree for you. The method #as(name) allows you to specify exactly what you want your tree to look like:


  str('foo').parse('foo')             # => "foo"@0
  str('foo').as(:bar).parse('foo')    # => {:bar=>"foo"@0}

So you think: #as(name) allows me to create a hash, big deal. That’s not all. You’ll notice that annotating everything that you want to keep in your grammar with #as(name) autocreates a sensible tree composed of hashes and arrays and strings. It’s really somewhat magic: Parslet has a set of clever rules that merge the annotated output from your atoms into a tree. Here are some more examples, with the atom on the left and the resulting tree (assuming a successful parse) on the right:


  # Normal strings just map to strings
  str('a').repeat                         "aaa"@0                                 

  # Arrays capture repetition of non-strings
  str('a').repeat.as(:b)                  {:b=>"aaa"@0}                           
  str('a').as(:b).repeat                  [{:b=>"a"@0}, {:b=>"a"@1}, {:b=>"a"@2}] 

  # Subtrees get merged - unlabeled strings discarded
  str('a').as(:a) >> str('b').as(:b)      {:a=>"a"@0, :b=>"b"@1}                  
  str('a') >> str('b').as(:b) >> str('c') {:b=>"b"@1}                             

  # #maybe will return nil, not the empty array
  str('a').maybe.as(:a)                   {:a=>"a"@0}     # input: 'a'
  str('a').maybe.as(:a)                   {:a=>nil}       # input: ''

Capturing input

Advanced reading material – feel free to skip this.

Sometimes a parser needs to match against something that was already matched against. Think about Ruby heredocs for example:


  str = <<-HERE
    This is part of the heredoc.
  HERE

The key to matching this kind of document is to capture part of the input first and then construct the rest of the parser based on the captured part. This is what it looks like in its simplest form:


  match['ab'].capture(:capt) >>               # create the capture
    dynamic { |s,c| str(c.captures[:capt]) }  # and match using the capture

This parser matches either ‘aa’ or ‘bb’, but not mixed forms ‘ab’ or ‘ba’. The last sample introduced two new concepts for this kind of complex parser: the #capture(name) method and the dynamic { ... } code block.

Appending #capture(name) to any parser will capture that parser’s result in the captures hash in the parse context. If and only if the parser match['ab'] succeeds, it stores either ‘a’ or ‘b’ in context.captures[:capt].

The only way to get at that hash during the parse process is in a dynamic { ... } code block (for reasons that are out of the scope of this document). In such a block, you can:


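  dynamic { |source, context|
    # The parser returned by the block (its last expression) is applied at
    # this point in the input. The snippets below are alternatives; pick one.

    # construct parsers by using randomness
    rand < 0.5 ? str('a') : str('b')
    
    # Or by using context information 
    str( context.captures[:capt] )
    
    # Or by .. doing other kind of work (consumes 100 chars and then 'a')
    source.consume(100)
    str('a')
  }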

Scopes

What if you want to parse heredocs contained within heredocs? It’s turtles all the way down, after all. To be able to remember what string was used to construct the outer heredoc, you would use the #scope { ... } block that was introduced in parslet 1.5. Like opening a Ruby block, it allows you to capture results (assign values to variables) to the same names you’ve already used in the outer scope, without destroying the outer scope’s values for these captures.

Here’s an example for this:


  str('a').capture(:a) >> scope { str('b').capture(:a) } >> 
    dynamic { |s,c| str(c.captures[:a]) }

This parses ‘aba’ – if you understand that, you understand scopes and captures. Congrats.

And more

Now you know exactly how to create parsers using Parslet. Your parsers will output intricate structures made of endless arrays, complex hashes and a few string leftovers. But your programming skills fail you when you try to put all this data to use. Selecting keys upon keys in hash after hash, you feel like a cockroach that has just read Kafka’s works. This is no fun. This is not what you signed up for.

Time to introduce you to Parslet::Transform and its workings.