Parser construction
A parser is nothing more than a class that derives from
Parslet::Parser
. The simplest parser that one could write would
look like this:
class SimpleParser < Parslet::Parser
rule(:a_rule) { str('simple_parser') }
root(:a_rule)
end
The language recognized by this parser is simply the string “simple_parser”. Parser rules do look a lot like methods and are defined by
rule(name) { definition_block }
Behind the scenes, this really defines a method that returns whatever you return from it.
Every parser has a root. This designates where parsing should start. It is like an entry point to your parser. With a root defined like this:
root(:my_root)
you create a #parse
method in your parser that will start parsing
by calling the #my_root
method. You’ll also have a #root
(instance) method that is an alias of the root method. The following things are
really one and the same:
SimpleParser.new.parse(string)
SimpleParser.new.root.parse(string)
SimpleParser.new.a_rule.parse(string)
Knowing these things gives you a lot of flexibility; I’ll explain why at the end of the chapter. For now, just let me point out that because all of this is Ruby, your favorite editor will syntax highlight parser code just fine.
Atoms: The inside of a parser
Matching strings of characters
A parser is constructed from parser atoms (or parslets, hence the name). The atoms are what appear inside your rules (and maybe elsewhere). We’ve already encountered an atom, the string atom:
str('simple_parser')
This returns a Parslet::Atoms::Str
instance. These parser atoms
all derive from Parslet::Atoms::Base
and have essentially just
one method you can call: #parse
. So this works:
str('foobar').parse('foobar') # => "foobar"@0
The atoms are small parsers that can recognize languages and throw errors, just
like real Parslet::Parser
subclasses.
Matching character ranges
The second parser atom you will have to know about allows you to match character ranges:
match('[0-9a-f]')
The above atom would match the numbers zero through nine and the letters ‘a’
to ‘f’ – yeah, you guessed right – hexadecimal numbers for example. The inside
of such a match parslet is essentially a regular expression that matches
a single character of input. Because we’ll be using ranges so much with
#match
and because typing (‘[]’) is tiresome, here’s another way
to write the above #match
atom:
match['0-9a-f']
Character matches are instances of Parslet::Atoms::Re
. Here are
some more examples of character ranges:
match['[:alnum:]'] # letters and numbers
match['\n'] # newlines
match('\w') # word characters
match('.') # any character
The wild wild #any
The last example above corresponds to the regular expression /./
that matches
any one character. There is a special atom for that:
any
Composition of Atoms
These basic atoms can be composed to form complex grammars. The following few sections will tell you about the various ways atoms can be composed.
Simple Sequences
Match ‘foo’ and then ‘bar’:
str('foo') >> str('bar') # same as str('foobar')
Sequences correspond to instances of the class
Parslet::Atoms::Sequence
.
Repetition and its Special Cases
To model atoms that can be repeated, you should use #repeat
:
str('foo').repeat
This will allow foo to repeat any number of times, including zero. If you
look at the signature for #repeat
in Parslet::Atoms::Base
,
you’ll see that it has really two arguments: min and max. So the following
code all makes sense:
str('foo').repeat(1) # match 'foo' at least once
str('foo').repeat(1,3) # at least once and at most 3 times
str('foo').repeat(0, nil) # the default: same as str('foo').repeat
Repetition has a special case that is used frequently: Matching something
once or not at all can be achieved by repeat(0,1)
, but also
through the prettier:
str('foo').maybe # same as str('foo').repeat(0,1)
These all map to Parslet::Atoms::Repetition
. Please note this
little twist to #maybe
:
str('foo').maybe.as(:f).parse('') # => {:f=>nil}
str('foo').repeat(0,1).as(:f).parse('') # => {:f=>[]}
The ‘nil’-value of #maybe
is nil. This is catering to the
intuition that foo.maybe
either gives me foo
or
nothing at all, not an empty array. But have it your way!
Alternation
The most important composition method for grammars is alternation. Without it, your grammars would only vary in the amount of things matched, but not in content. Here’s how this looks:
str('foo') | str('bar') # matches 'foo' OR 'bar'
This reads naturally as “‘foo’ or ‘bar’”.
Operator precedence
The operators we have chosen for parslet atom combination have the operator precedence that you would expect. No parenthesis are needed to express alternation of sequences:
str('s') >> str('equence') |
str('se') >> str('quence')
And more
Parslet atoms are not as pretty as Treetop atoms. There you go, we said it. However, there seems to be a different kind of aesthetic about them; they are pure Ruby and integrate well with the rest of your environment. Have a look at this:
# Also consumes the space after important things like ';' or ':'. Call this
# giving the character you want to match as argument:
#
# arg >> (spaced(',') >> arg).repeat
#
def spaced(character)
str(character) >> match['\s']
end
or even this:
# Turns any atom into an expression that matches a left parenthesis, the
# atom and then a right parenthesis.
#
# bracketed(sum)
#
def bracketed(atom)
spaced('(') >> atom >> spaced(')')
end
You might say that because parslet is just plain old Ruby objects itself (PORO ™), it allows for very tight code. Module inclusion, class inheritance, … all your tools should work well with parslet.
Tree construction
By default, parslet will just echo back to you the strings you feed into it.
Parslet will not generate a parser for you and neither will it generate your
abstract syntax tree for you. The method #as(name)
allows you
to specify exactly how you want your tree to look like:
str('foo').parse('foo') # => "foo"@0
str('foo').as(:bar).parse('foo') # => {:bar=>"foo"@0}
So you think: #as(name)
allows me to create a hash, big deal.
That’s not all. You’ll notice that annotating everything that you want to keep
in your grammar with #as(name)
autocreates a sensible tree
composed of hashes and arrays and strings. It’s really somewhat magic: Parslet
has a set of clever rules that merge the annotated output from your atoms into
a tree. Here are some more examples, with the atom on the left and the resulting
tree (assuming a successful parse) on the right:
# Normal strings just map to strings
str('a').repeat "aaa"@0
# Arrays capture repetition of non-strings
str('a').repeat.as(:b) {:b=>"aaa"@0}
str('a').as(:b).repeat [{:b=>"a"@0}, {:b=>"a"@1}, {:b=>"a"@2}]
# Subtrees get merged - unlabeled strings discarded
str('a').as(:a) >> str('b').as(:b) {:a=>"a"@0, :b=>"b"@1}
str('a') >> str('b').as(:b) >> str('c') {:b=>"b"@1}
# #maybe will return nil, not the empty array
str('a').maybe.as(:a) {:a=>"a"@0}
str('a').maybe.as(:a) {:a=>nil}
Capturing input
Advanced reading material – feel free to skip this.
Sometimes a parser needs to match against something that was already matched against. Think about Ruby heredocs for example:
str = <<-HERE
This is part of the heredoc.
HERE
The key to matching this kind of document is to capture part of the input first and then construct the rest of the parser based on the captured part. This is what it looks like in its simplest form:
match['ab'].capture(:capt) >> # create the capture
dynamic { |s,c| str(c.captures[:capt]) } # and match using the capture
This parser matches either ‘aa’ or ‘bb’, but not mixed forms ‘ab’ or ‘ba’. The
last sample introduced two new concepts for this kind of complex parser: the
#capture(name)
method and the dynamic { ... }
code
block.
Appending #capture(name)
to any parser will capture that parsers
result in the captures hash in the parse context. If and only if the parser
match['ab']
succeeds, it stores either ‘a’ or ‘b’ in
context.captures[:capt]
.
The only way to get at that hash during the parse process is in a
dynamic { ... }
code block. (for reasons that are out of the
scope of this document) In such a block, you can:
dynamic { |source, context|
# construct parsers by using randomness
rand < 0.5 ? str('a') : str('b')
# Or by using context information
str( context.captures[:capt] )
# Or by .. doing other kind of work (consumes 100 chars and then 'a')
source.consume(100)
str('a')
}
Scopes
What if you want to parse heredocs contained within heredocs? It’s turtles all
the way down, after all. To be able to remember what string was used to
construct the outer heredoc, you would use the #scope { ... }
block that was introduced in parslet 1.5. Like opening a Ruby block, it allows
you to capture results (assign values to variables) to the same names you’ve
already used in outer scope – without destroying the outer scopes values for
these captures!.
Here’s an example for this:
str('a').capture(:a) >> scope { str('b').capture(:a) } >>
dynamic { |s,c| str(c.captures[:a]) }
This parses ‘aba’ – if you understand that, you understand scopes and captures. Congrats.
And more
Now you know exactly how to create parsers using Parslet. Your parsers will output intricate structures made of endless arrays, complex hashes and a few string leftovers. But your programming skills fail you when you try to put all this data to use. Selecting keys upon keys in hash after hash, you feel like a cockroach that has just read Kafka’s works. This is no fun. This is not what you signed up for.
Time to introduce you to Parslet::Transform and its workings.