lark(7)

Lark Documentation

Description

LARK

NAME

lark - Lark Documentation

PHILOSOPHY

Parsers are innately complicated and confusing. They're difficult to understand, difficult to write, and difficult to use. Even experts on the subject can become baffled by the nuances of these complicated state-machines.

Lark's mission is to make the process of writing them as simple and abstract as possible, by following these design principles:

Design Principles

	•		Readability matters
	•		Keep the grammar clean and simple
	•		Don't force the user to decide on things that the parser can figure out on its own
	•		Usability is more important than performance
	•		Performance is still very important
	•		Follow the Zen of Python, whenever possible and applicable

In accordance with these principles, I arrived at the following design choices:

----

Design Choices

1. Separation of code and grammar

Grammars are the de-facto reference for your language, and for the structure of your parse-tree. For any non-trivial language, the conflation of code and grammar always turns out convoluted and difficult to read.

The grammars in Lark are EBNF-inspired, so they are especially easy to read & work with.

2. Always build a parse-tree (unless told not to)

Trees are always simpler to work with than state-machines.

	•		Trees allow you to see the "state-machine" visually
	•		Trees allow your computation to be aware of previous and future states
	•		Trees allow you to process the parse in steps, instead of forcing you to do it all at once.

And anyway, every parse-tree can be replayed as a state-machine, so there is no loss of information.

See this answer in more detail here.

To improve performance, you can skip building the tree for LALR(1), by providing Lark with a transformer (see the JSON example).

3. Earley is the default

The Earley algorithm can accept any context-free grammar you throw at it (i.e. any grammar you can write in EBNF, it can parse). That makes it extremely friendly to beginners, who are not aware of the strange and arbitrary restrictions that LALR(1) places on its grammars.

As the users grow to understand the structure of their grammar, the scope of their target language, and their performance requirements, they may choose to switch over to LALR(1) to gain a huge performance boost, possibly at the cost of some language features.

Both Earley and LALR(1) can use the same grammar, as long as all constraints are satisfied.

In short, "Premature optimization is the root of all evil."

Other design features

	•		Automatically resolve terminal collisions whenever possible
	•		Automatically keep track of line & column numbers

FEATURES

Main Features

	•		Earley parser, capable of parsing any context-free grammar
			•		Implements SPPF, for efficient parsing and storing of ambiguous grammars.
	•		LALR(1) parser, limited in power of expression, but very efficient in space and performance (O(n)).
			•		Implements a parse-aware lexer that provides a better power of expression than traditional LALR implementations (such as ply).
	•		EBNF-inspired grammar, with extra features (See: Grammar Reference)
	•		Builds a parse-tree (AST) automagically based on the grammar
	•		Stand-alone parser generator - create a small independent parser to embed in your project. (read more)
	•		Flexible error handling by using an interactive parser interface (LALR only)
	•		Automatic line & column tracking (for both tokens and matched rules)
	•		Automatic terminal collision resolution
	•		Grammar composition - Import terminals and rules from other grammars
	•		Standard library of terminals (strings, numbers, names, etc.)
	•		Unicode fully supported
	•		Extensive test suite
	•		Type annotations (MyPy support)
	•		Pure-Python implementation

Read more about the parsers

Extra features

	•		Import rules and tokens from other Lark grammars, for code reuse and modularity.
	•		Support for external regex module (see here)
	•		Import grammars from Nearley.js (read more)
	•		CYK parser
	•		Visualize your parse trees as dot or png files (see example)
	•		Automatic reconstruction of input from parse-tree (see examples)
	•		Use Lark grammars in Julia and Javascript.

PARSERS

Lark implements the following parsing algorithms: Earley, LALR(1), and CYK

Earley

An Earley Parser is a chart parser capable of parsing any context-free grammar at O(n^3), and O(n^2) when the grammar is unambiguous. It can parse most LR grammars at O(n). Most programming languages are LR, and can be parsed in linear time.

Lark's Earley implementation runs on top of a skipping chart parser, which allows it to use regular expressions, instead of matching characters one-by-one. This is a huge improvement to Earley that is unique to Lark. This feature is used by default, but can also be requested explicitly using lexer='dynamic'.

It's possible to bypass the dynamic lexing, and use the regular Earley parser with a basic lexer, that tokenizes as an independent first step. Doing so will provide a speed benefit, but will tokenize without using Earley's ambiguity-resolution ability. So choose this only if you know why! Activate with lexer='basic'

SPPF & Ambiguity resolution

Lark implements the Shared Packed Parse Forest data-structure for the Earley parser, in order to reduce the space and computation required to handle ambiguous grammars.

You can read more about SPPF here

As a result, Lark can efficiently parse and store every ambiguity in the grammar, when using Earley.

Lark provides the following options to combat ambiguity:

	•		Lark will choose the best derivation for you (default). Users can choose between different disambiguation strategies, and can prioritize (or demote) individual rules over others, using the rule-priority syntax.
	•		Users may choose to receive the set of all possible parse-trees (using ambiguity='explicit'), and choose the best derivation themselves. While simple and flexible, it comes at the cost of space and performance, and so it isn't recommended for highly ambiguous grammars, or very long inputs.
	•		As an advanced feature, users may use specialized visitors to iterate the SPPF themselves.

lexer="dynamic_complete"

Earley's "dynamic" lexer uses regular expressions in order to tokenize the text. It tries every possible combination of terminals, but it matches each terminal exactly once, returning the longest possible match.

That means, for example, that when lexer="dynamic" (which is the default), the terminal /a+/, when given the text "aa", will return one result, aa, even though a would also be correct.

This behavior was chosen because it is much faster, and it is usually what you would expect.

Setting lexer="dynamic_complete" instructs the lexer to consider every possible regexp match. This ensures that the parser will consider and resolve every ambiguity, even inside the terminals themselves. This lexer provides the same capabilities as scannerless Earley, but with different performance tradeoffs.

Warning: This lexer can be much slower, especially for open-ended terminals such as /.*/

LALR(1)

LALR(1) is a very efficient, tried-and-tested parsing algorithm. It's incredibly fast and requires very little memory. It can parse most programming languages (for example: Python and Java).

LALR(1) stands for:

	•		Left-to-right parsing order
	•		Rightmost derivation, bottom-up
	•		Lookahead of 1 token

Lark comes with an efficient implementation that outperforms every other parsing library for Python (including PLY)

Lark extends the traditional YACC-based architecture with a contextual lexer, which processes feedback from the parser, making the LALR(1) algorithm stronger than ever.

The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of terminals. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It's surprisingly effective at resolving common terminal collisions, and allows one to parse languages that LALR(1) was previously incapable of parsing.

(If you're familiar with YACC, you can think of it as automatic lexer-states)

This is an improvement to LALR(1) that is unique to Lark.

Grammar constraints in LALR(1)

Due to having only a lookahead of one token, LALR is limited in its ability to choose between rules, when they both match the input.

Tips for writing a conforming grammar:

	•		Try to avoid writing different rules that can match the same sequence of characters.
	•		For the best performance, prefer left-recursion over right-recursion.
	•		Consider setting terminal priority only as a last resort.

For a better understanding of these constraints, it's recommended to learn how a SLR parser works. SLR is very similar to LALR but much simpler.

CYK Parser

A CYK parser can parse any context-free grammar at O(n^3*|G|).

It's too slow to be practical for simple grammars, but it offers good performance for highly ambiguous grammars.

JSON PARSER - TUTORIAL

Lark is a parser - a program that accepts a grammar and text, and produces a structured tree that represents that text. In this tutorial we will write a JSON parser in Lark, and explore Lark's various features in the process.

It has 5 parts.

	•		Writing the grammar
	•		Creating the parser
	•		Shaping the tree
	•		Evaluating the tree
	•		Optimizing

Knowledge assumed:

	•		Using Python
	•		A basic understanding of how to use regular expressions

Part 1 - The Grammar

Lark accepts its grammars in a format called EBNF. It basically looks like this:

rule_name : list of rules and TERMINALS to match
          | another possible list of items
          | etc.

TERMINAL: "some text to match"

(a terminal is a string or a regular expression)

The parser will try to match each rule (left-part) by matching its items (right-part) sequentially, trying each alternative (In practice, the parser is predictive so we don't have to try every alternative).

How to structure those rules is beyond the scope of this tutorial, but often it's enough to follow one's intuition.

In the case of JSON, the structure is simple: A json document is either a list, or a dictionary, or a string/number/etc.

The dictionaries and lists are recursive, and contain other json documents (or "values").

Let's write this structure in EBNF form:

value: dict | list | STRING | NUMBER
     | "true" | "false" | "null"

list : "[" [value ("," value)*] "]"

dict : "{" [pair ("," pair)*] "}"
pair : STRING ":" value

A quick explanation of the syntax:

	•		Parentheses let us group rules together.
	•		rule* means any amount. That means, zero or more instances of that rule.
	•		[rule] means optional. That means zero or one instance of that rule.

Lark also supports the rule+ operator, meaning one or more instances. It also supports the rule? operator which is another way to say optional.

Of course, we still haven't defined "STRING" and "NUMBER". Luckily, both these literals are already defined in Lark's common library:

%import common.ESCAPED_STRING -> STRING
%import common.SIGNED_NUMBER -> NUMBER

The arrow (->) renames the terminals. But that only adds obscurity in this case, so going forward we'll just use their original names.

We'll also take care of the white-space, which is part of the text, by simply matching and then throwing it away.

%import common.WS
%ignore WS

We tell our parser to ignore whitespace. Otherwise, we'd have to fill our grammar with WS terminals.

By the way, if you're curious what these terminals signify, they are roughly equivalent to this:

NUMBER : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
STRING : /".*?(?<!\\)"/
%ignore /[ \t\n\f\r]+/

Lark will accept this way of writing too, if you really want to complicate your life :)

You can find the original definitions in common.lark. They don't strictly adhere to json.org - but our purpose here is to accept json, not validate it.

Notice that terminals are written in UPPER-CASE, while rules are written in lower-case. I'll touch more on the differences between rules and terminals later.
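The rough regular expressions quoted above can be tried out directly with Python's re module, without involving Lark at all. This is only a sketch of those approximations, not of the actual definitions in common.lark.

```python
import re

# The approximate patterns quoted in the text above.
NUMBER = re.compile(r'-?\d+(\.\d+)?([eE][+-]?\d+)?')
STRING = re.compile(r'".*?(?<!\\)"')

for sample in ['3.14', '-1e10', '"key"', 'true']:
    if NUMBER.fullmatch(sample):
        kind = 'NUMBER'
    elif STRING.fullmatch(sample):
        kind = 'STRING'
    else:
        kind = 'no match'
    print(sample, '->', kind)

# The negative lookbehind keeps an escaped quote from ending the string early.
escaped = STRING.match(r'"a\"b" rest').group(0)
print(escaped)
```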

Part 2 - Creating the Parser

Once we have our grammar, creating the parser is very simple.

We simply instantiate Lark, and tell it to accept a "value":

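from lark import Lark
json_parser = Lark(r"""
    value: dict
         | list
         | ESCAPED_STRING
         | SIGNED_NUMBER
         | "true" | "false" | "null"

    list : "[" [value ("," value)*] "]"

    dict : "{" [pair ("," pair)*] "}"
    pair : ESCAPED_STRING ":" value

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS

    """, start='value')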

It's that simple! Let's test it out:

>>> text = '{"key": ["item0", "item1", 3.14]}'
>>> json_parser.parse(text)
Tree(value, [Tree(dict, [Tree(pair, [Token(STRING, "key"), Tree(value, [Tree(list, [Tree(value, [Token(STRING, "item0")]), Tree(value, [Token(STRING, "item1")]), Tree(value, [Token(NUMBER, 3.14)])])])])])])
>>> print( _.pretty() )
value
  dict
    pair
      "key"
      value
        list
          value	"item0"
          value	"item1"
          value	3.14

As promised, Lark automagically creates a tree that represents the parsed text.

But something is suspiciously missing from the tree. Where are the curly braces, the commas and all the other punctuation literals?

Lark automatically filters out literals from the tree, based on the following criteria:

	•		Filter out string literals without a name, or with a name that starts with an underscore.
	•		Keep regexps, even unnamed ones, unless their name starts with an underscore.

Unfortunately, this means that it will also filter out literals like "true" and "false", and we will lose that information. The next section, "Shaping the tree" deals with this issue, and others.

Part 3 - Shaping the Tree

We now have a parser that can create a parse tree (or: AST), but the tree has some issues:

	•		"true", "false" and "null" are filtered out (test it out yourself!)
	•		It has useless branches, like value, that clutter up our view.

I'll present the solution, and then explain it:

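?value: dict | list | string
      | SIGNED_NUMBER -> number
      | "true"        -> true
      | "false"       -> false
      | "null"        -> null

...

string : ESCAPED_STRING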

	•		Those little arrows signify aliases. An alias is a name for a specific part of the rule. In this case, we will name the true/false/null matches, and this way we won't lose the information. We also alias SIGNED_NUMBER to mark it for later processing.
	•		The question-mark prefixing value ("?value") tells the tree-builder to inline this branch if it has only one member. In this case, value will always have only one member, and will always be inlined.
	•		We turned the ESCAPED_STRING terminal into a rule. This way it will appear in the tree as a branch. This is equivalent to aliasing (like we did for the number), but now string can also be used elsewhere in the grammar (namely, in the pair rule).

Here is the new grammar:

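from lark import Lark
json_parser = Lark(r"""
    ?value: dict
          | list
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    list : "[" [value ("," value)*] "]"

    dict : "{" [pair ("," pair)*] "}"
    pair : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS

    """, start='value')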

And let's test it out:

>>> text = '{"key": ["item0", "item1", 3.14, true]}'
>>> print( json_parser.parse(text).pretty() )
dict
  pair
    string	"key"
    list
      string	"item0"
      string	"item1"
      number	3.14
      true

Ah! That is much much nicer.

Part 4 - Evaluating the tree

It's nice to have a tree, but what we really want is a JSON object.

The way to do it is to evaluate the tree, using a Transformer.

A transformer is a class with methods corresponding to branch names. For each branch, the appropriate method will be called with the children of the branch as its argument, and its return value will replace the branch in the tree.

So let's write a partial transformer, that handles lists and dictionaries:

from lark import Transformer

class MyTransformer(Transformer):
    def list(self, items):
        return list(items)
    def pair(self, key_value):
        k, v = key_value
        return k, v
    def dict(self, items):
        return dict(items)

And when we run it, we get this:

>>> tree = json_parser.parse(text)
>>> MyTransformer().transform(tree)
{Tree(string, [Token(ANONRE_1, "key")]): [Tree(string, [Token(ANONRE_1, "item0")]), Tree(string, [Token(ANONRE_1, "item1")]), Tree(number, [Token(ANONRE_0, 3.14)]), Tree(true, [])]}

This is pretty close. Let's write a full transformer that can handle the terminals too.

Also, our definitions of list and dict are a bit verbose. We can do better:

from lark import Transformer

class TreeToJson(Transformer):
    def string(self, s):
        (s,) = s
        return s[1:-1]
    def number(self, n):
        (n,) = n
        return float(n)

    list = list
    pair = tuple
    dict = dict

    null = lambda self, _: None
    true = lambda self, _: True
    false = lambda self, _: False

And when we run it:

>>> tree = json_parser.parse(text)
>>> TreeToJson().transform(tree)
{u'key': [u'item0', u'item1', 3.14, True]}

Magic!

Part 5 - Optimizing

Step 1 - Benchmark

By now, we have a fully working JSON parser, that can accept a string of JSON, and return its Pythonic representation.

But how fast is it?

Now, of course there are JSON libraries for Python written in C, and we can never compete with them. But since this is applicable to any parser you would write in Lark, let's see how far we can take this.

The first step for optimizing is to have a benchmark. For this benchmark I'm going to take data from json-generator.com/. I took their default suggestion and changed it to 5000 objects. The result is a 6.6MB sparse JSON file.

Our first program is going to be just a concatenation of everything we've done so far:

import sys
from lark import Lark, Transformer

json_grammar = r"""
    ?value: dict
          | list
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    list : "[" [value ("," value)*] "]"

    dict : "{" [pair ("," pair)*] "}"
    pair : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS
    """

class TreeToJson(Transformer):
    def string(self, s):
        (s,) = s
        return s[1:-1]
    def number(self, n):
        (n,) = n
        return float(n)

    list = list
    pair = tuple
    dict = dict

    null = lambda self, _: None
    true = lambda self, _: True
    false = lambda self, _: False

json_parser = Lark(json_grammar, start='value', lexer='basic')

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        tree = json_parser.parse(f.read())
        print(TreeToJson().transform(tree))

We run it and get this:

$ time python tutorial_json.py json_data > /dev/null

real	0m36.257s
user	0m34.735s
sys	0m1.361s

That's an unsatisfactory time for a 6MB file. It might be acceptable if we were parsing configuration or a small DSL, but we're trying to handle a large amount of data here.

Well, turns out there's quite a bit we can do about it!

Step 2 - LALR(1)

So far we've been using the Earley algorithm, which is the default in Lark. Earley is powerful but slow. But it just so happens that our grammar is LR-compatible, and specifically LALR(1) compatible.

So let's switch to LALR(1) and see what happens:

json_parser = Lark(json_grammar, start='value', parser='lalr')

$ time python tutorial_json.py json_data > /dev/null

real 0m7.554s
user 0m7.352s
sys 0m0.148s

Ah, that's much better. The resulting JSON is of course exactly the same. You can run it for yourself and see.

It's important to note that not all grammars are LR-compatible, and so you can't always switch to LALR(1). But there's no harm in trying! If Lark lets you build the grammar, it means you're good to go.

Step 3 - Tree-less LALR(1)

So far, we've built a full parse tree for our JSON, and then transformed it. It's a convenient method, but it's not the most efficient in terms of speed and memory. Luckily, Lark lets us avoid building the tree when parsing with LALR(1).

Here's the way to do it:

json_parser = Lark(json_grammar, start='value', parser='lalr', transformer=TreeToJson())

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        print( json_parser.parse(f.read()) )

We've used the transformer we've already written, but this time we plug it straight into the parser. Now it can avoid building the parse tree, and just send the data straight into our transformer. The parse() method now returns the transformed JSON, instead of a tree.

Let's benchmark it:

	real	0m4.866s
	user	0m4.722s
	sys	0m0.121s

That's a measurable improvement! Also, this way is more memory efficient. Check out the benchmark table at the end to see just how much.

As a general practice, it's recommended to work with parse trees, and only skip the tree-builder when your transformer is already working.

Step 4 - PyPy

PyPy is a JIT engine for running Python, and it's designed to be a drop-in replacement.

Lark is written purely in Python, which makes it very suitable for PyPy.

Let's get some free performance:

$ time pypy tutorial_json.py json_data > /dev/null

	real	0m1.397s
	user	0m1.296s
	sys	0m0.083s

PyPy is awesome!

Conclusion

We've brought the run-time down from 36 seconds to about 1.4 seconds, in a series of small and simple steps.

Now let's compare the benchmarks in a nicely organized table.

I measured memory consumption using a little script called memusg

I added a few other parsers for comparison. PyParsing and funcparserlib fare pretty well in their memory usage (they don't build a tree), but they can't compete with the run-time speed of LALR(1).

These benchmarks are for Lark's alpha version. I already have several optimizations planned that will significantly improve run-time speed.

Once again, shout-out to PyPy for being so effective.

Afterword

This is the end of the tutorial. I hope you liked it and learned a little about Lark.

To see what else you can do with Lark, check out the examples.

Read the documentation here: https://lark-parser.readthedocs.io/en/latest/

HOW TO USE LARK - GUIDE

Work process

This is the recommended process for working with Lark:

	•		Collect or create input samples, that demonstrate key features or behaviors in the language you're trying to parse.
	•		Write a grammar. Try to aim for a structure that is intuitive, and in a way that imitates how you would explain your language to a fellow human.
	•		Try your grammar in Lark against each input sample. Make sure the resulting parse-trees make sense.
	•		Use Lark's grammar features to shape the tree: Get rid of superfluous rules by inlining them, and use aliases when specific cases need clarification.
	•		You can perform steps 1-4 repeatedly, gradually growing your grammar to include more sentences.
	•		Create a transformer to evaluate the parse-tree into a structure you'll be comfortable to work with. This may include evaluating literals, merging branches, or even converting the entire tree into your own set of AST classes.

Of course, some specific use-cases may deviate from this process. Feel free to suggest these cases, and I'll add them to this page.

Getting started

Browse the Examples to find a template that suits your purposes.

Read the tutorials to get a better understanding of how everything works. (links in the main page)

Use the Cheatsheet (PDF) for quick reference.

Use the reference pages for more in-depth explanations. (links in the main page)

Debug

Grammars may contain non-obvious bugs, usually caused by rules or terminals interfering with each other in subtle ways.

When trying to debug a misbehaving grammar, the following methodology is recommended:

	•		Create a copy of the grammar, so you can change the parser/grammar without any worries
	•		Find the minimal input that creates the error
	•		Slowly remove rules from the grammar, while making sure the error still occurs.

Usually, by the time you get to a minimal grammar, the problem becomes clear.

But if it doesn't, feel free to ask us on gitter, or even open an issue. Post reproducing code, with the minimal grammar and input, and we'll do our best to help.

LALR

By default Lark silently resolves Shift/Reduce conflicts as Shift. To enable warnings pass debug=True. To get the messages printed you have to configure the logger beforehand. For example:

import logging
from lark import Lark, logger

logger.setLevel(logging.DEBUG)

collision_grammar = '''
start: as as
as: a*
a: "a"
'''
p = Lark(collision_grammar, parser='lalr', debug=True)

Tools

Stand-alone parser

Lark can generate a stand-alone LALR(1) parser from a grammar.

The resulting module provides the same interface as Lark, but with a fixed grammar, and reduced functionality.

Run using:

python -m lark.tools.standalone

For a play-by-play, read the tutorial

Import Nearley.js grammars

It is possible to import Nearley grammars into Lark. The Javascript code is translated using Js2Py.

See the tools page for more information.

HOW TO DEVELOP LARK - GUIDE

There are many ways you can help the project:

	•		Help solve issues
	•		Improve the documentation
	•		Write new grammars for Lark's library
	•		Write a blog post introducing Lark to your audience
	•		Port Lark to another language
	•		Help with code development

If you're interested in taking one of these on, contact us on Gitter or Github Discussion, and we will provide more details and assist you in the process.

Code Style

Lark does not follow a predefined code style. We accept any code style that makes sense, as long as it's Pythonic and easy to read.

Unit Tests

Lark comes with an extensive set of tests. Many of the tests will run several times, once for each parser configuration.

To run the tests, just go to the lark project root, and run the command:

python -m tests

pypy -m tests

For a list of supported interpreters, you can consult the tox.ini file.

You can also run a single unittest using its class and method name, for example:

## test_package test_class_name.test_function_name
python -m tests TestLalrBasic.test_keep_all_tokens

tox

To run all Unit Tests with tox, install tox and every Python interpreter from 2.7 up to the latest one supported (consult the file tox.ini). Then run the command tox in the root of this project (where the main setup.py file is located).

And, for example, if you would like to only run the Unit Tests for Python version 2.7, you can run the command tox -e py27

pytest

You can also run the tests using pytest:

pytest tests

Using setup.py

Another way to run the tests is using setup.py:

python setup.py test

RECIPES

A collection of recipes to use Lark and its various features

Use a transformer to parse integer tokens

Transformers are the common interface for processing matched rules and tokens.

They can be used during parsing for better performance.

from lark import Lark, Transformer

class T(Transformer):
    def INT(self, tok):
        "Convert the value of `tok` from string to int, while maintaining line number & column."
        return tok.update(value=int(tok))

parser = Lark("""
start: INT*
%import common.INT
%ignore " "
""", parser="lalr", transformer=T())

print(parser.parse('3 14 159'))

Prints out:

Tree(start, [Token(INT, 3), Token(INT, 14), Token(INT, 159)])

Collect all comments with lexer_callbacks

lexer_callbacks can be used to interface with the lexer as it generates tokens.

It accepts a dictionary of the form

{TOKEN_TYPE: callback}

Where callback is of type f(Token) -> Token

It only works with the basic and contextual lexers.

This has the same effect as using a transformer, but can also process ignored tokens.

from lark import Lark

comments = []

parser = Lark("""
start: INT*

COMMENT: /#.*/

%import common (INT, WS)
%ignore COMMENT
%ignore WS
""", parser="lalr", lexer_callbacks={'COMMENT': comments.append})

parser.parse("""
1 2 3 # hello
# world
4 5 6
""")

print(comments)

Prints out:

[Token(COMMENT, '# hello'), Token(COMMENT, '# world')]

Note: We don't have to return a token, because comments are ignored

CollapseAmbiguities

Parsing ambiguous texts with earley and ambiguity='explicit' produces a single tree with _ambig nodes to mark where the ambiguity occurred.

However, it's sometimes more convenient instead to work with a list of all possible unambiguous trees.

Lark provides a utility transformer for that purpose:

from lark import Lark, Tree, Transformer
from lark.visitors import CollapseAmbiguities

grammar = """
!start: x y

!x: "a" "b"
| "ab"
| "abc"

!y: "c" "d"
| "cd"
| "d"

"""
parser = Lark(grammar, ambiguity='explicit')

t = parser.parse('abcd')
for x in CollapseAmbiguities().transform(t):
    print(x.pretty())

This prints out:

start
  x
    a
    b
  y
    c
    d

start
  x	ab
  y	cd

start
  x	abc
  y	d

While convenient, this should be used carefully, as highly ambiguous trees will soon create an exponential explosion of such unambiguous derivations.

Keeping track of parents when visiting

The following visitor assigns a parent attribute for every node in the tree.

If your tree nodes aren't unique (if there is a shared Tree instance), the assert will fail.

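from weakref import proxy
from lark import Tree, Visitor

class Parent(Visitor):
    def __default__(self, tree):
        for subtree in tree.children:
            if isinstance(subtree, Tree):
                assert not hasattr(subtree, 'parent')
                # A weak reference avoids a reference cycle between child and parent
                subtree.parent = proxy(tree)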

Unwinding VisitError after a transformer/visitor exception

Errors that happen inside visitors and transformers get wrapped inside a VisitError exception.

This can often be inconvenient, if you wish the actual error to propagate upwards, or if you want to catch it.

But, it's easy to unwrap it at the point of calling the transformer, by catching it and raising the VisitError.orig_exc attribute.

For example:

from lark import Lark, Transformer
from lark.visitors import VisitError

tree = Lark('start: "a"').parse('a')

class T(Transformer):
    def start(self, x):
        raise KeyError("Original Exception")

t = T()
try:
    print( t.transform(tree))
except VisitError as e:
    raise e.orig_exc

EXAMPLES FOR LARK

How to run the examples:

After cloning the repo, open the terminal into the root directory of the project, and run the following:

[lark]$ python -m examples.<name_of_example>

For example, the following will parse all the Python files in the standard library of your local installation:

[lark]$ python -m examples.advanced.python_parser

Beginner Examples

Parsing Indentation

A demonstration of parsing indentation ("whitespace significant" language) and the usage of the Indenter class.

Since indentation is context-sensitive, a postlex stage is introduced to manufacture INDENT/DEDENT tokens.

It is crucial for the indenter that the NL_type matches the spaces (and tabs) after the newline.

from lark import Lark
from lark.indenter import Indenter

tree_grammar = r"""
?start: _NL* tree

tree: NAME _NL [_INDENT tree+ _DEDENT]

%import common.CNAME -> NAME
%import common.WS_INLINE
%declare _INDENT _DEDENT
%ignore WS_INLINE

_NL: /(\r?\n[\t ]*)+/
"""

class TreeIndenter(Indenter):
    NL_type = '_NL'
    OPEN_PAREN_types = []
    CLOSE_PAREN_types = []
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 8

parser = Lark(tree_grammar, parser='lalr', postlex=TreeIndenter())

test_tree = """
a
    b
    c
        d
        e
    f
        g
"""

def test():
    print(parser.parse(test_tree).pretty())

if __name__ == '__main__':
    test()

Lark Grammar

A reference implementation of the Lark grammar (using LALR(1))

import lark
from pathlib import Path

examples_path = Path(__file__).parent
lark_path = Path(lark.__file__).parent

parser = lark.Lark.open(lark_path / 'grammars/lark.lark', rel_to=__file__, parser="lalr")

grammar_files = [
    examples_path / 'advanced/python2.lark',
    examples_path / 'relative-imports/multiples.lark',
    examples_path / 'relative-imports/multiple2.lark',
    examples_path / 'relative-imports/multiple3.lark',
    examples_path / 'tests/no_newline_at_end.lark',
    examples_path / 'tests/negative_priority.lark',
    examples_path / 'standalone/json.lark',
    lark_path / 'grammars/common.lark',
    lark_path / 'grammars/lark.lark',
    lark_path / 'grammars/unicode.lark',
    lark_path / 'grammars/python.lark',
]

def test():
    for grammar_file in grammar_files:
        tree = parser.parse(open(grammar_file).read())
    print("All grammars parsed successfully")

if __name__ == '__main__':
    test()

Handling Ambiguity

A demonstration of ambiguity

This example shows how to get explicit ambiguity from Lark's Earley parser.

import sys
from lark import Lark, tree

grammar = """
sentence: noun verb noun -> simple
| noun verb "like" noun -> comparative

noun: adj? NOUN
verb: VERB
adj: ADJ

NOUN: "flies" | "bananas" | "fruit"
VERB: "like" | "flies"
ADJ: "fruit"

%import common.WS
%ignore WS
"""

parser = Lark(grammar, start='sentence', ambiguity='explicit')

sentence = 'fruit flies like bananas'

def make_png(filename):
    tree.pydot__tree_to_png( parser.parse(sentence), filename)

def make_dot(filename):
    tree.pydot__tree_to_dot( parser.parse(sentence), filename)

if __name__ == '__main__':
    print(parser.parse(sentence).pretty())
    # make_png(sys.argv[1])
    # make_dot(sys.argv[1])

# Output:
#
# _ambig
#   comparative
#     noun fruit
#     verb flies
#     noun bananas
#   simple
#     noun
#       fruit
#       flies
#     verb like
#     noun bananas
#
# (or view a nicer version at "./fruitflies.png")

Basic calculator

A simple example of a REPL calculator

This example shows how to write a basic calculator with variables.

from lark import Lark, Transformer, v_args

try:
    input = raw_input # For Python2 compatibility
except NameError:
    pass

calc_grammar = """
?start: sum
| NAME "=" sum -> assign_var

?sum: product
| sum "+" product -> add
| sum "-" product -> sub

?product: atom
| product "*" atom -> mul
| product "/" atom -> div

?atom: NUMBER -> number
| "-" atom -> neg
| NAME -> var
| "(" sum ")"

%import common.CNAME -> NAME
%import common.NUMBER
%import common.WS_INLINE

%ignore WS_INLINE
"""

@v_args(inline=True) # Affects the signatures of the methods
class CalculateTree(Transformer):
    from operator import add, sub, mul, truediv as div, neg
    number = float

    def __init__(self):
        self.vars = {}

    def assign_var(self, name, value):
        self.vars[name] = value
        return value

    def var(self, name):
        try:
            return self.vars[name]
        except KeyError:
            raise Exception("Variable not found: %s" % name)

calc_parser = Lark(calc_grammar, parser='lalr', transformer=CalculateTree())
calc = calc_parser.parse

def main():
    while True:
        try:
            s = input('> ')
        except EOFError:
            break
        print(calc(s))

def test():
    print(calc("a = 1+2"))
    print(calc("1+a*-3"))

if __name__ == '__main__':
    # test()
    main()

Turtle DSL

Implements a LOGO-like toy language for Python's turtle, with interpreter.

try:
    input = raw_input # For Python2 compatibility
except NameError:
    pass

import turtle

from lark import Lark

turtle_grammar = """
start: instruction+

instruction: MOVEMENT NUMBER -> movement
| "c" COLOR [COLOR] -> change_color
| "fill" code_block -> fill
| "repeat" NUMBER code_block -> repeat

code_block: "{" instruction+ "}"

MOVEMENT: "f"|"b"|"l"|"r"
COLOR: LETTER+

%import common.LETTER
%import common.INT -> NUMBER
%import common.WS
%ignore WS
"""

parser = Lark(turtle_grammar)

def run_instruction(t):
    if t.data == 'change_color':
        turtle.color(*t.children) # We just pass the color names as-is

    elif t.data == 'movement':
        name, number = t.children
        { 'f': turtle.fd,
          'b': turtle.bk,
          'l': turtle.lt,
          'r': turtle.rt, }[name](int(number))

    elif t.data == 'repeat':
        count, block = t.children
        for i in range(int(count)):
            run_instruction(block)

    elif t.data == 'fill':
        turtle.begin_fill()
        run_instruction(t.children[0])
        turtle.end_fill()

    elif t.data == 'code_block':
        for cmd in t.children:
            run_instruction(cmd)
    else:
        raise SyntaxError('Unknown instruction: %s' % t.data)

def run_turtle(program):
    parse_tree = parser.parse(program)
    for inst in parse_tree.children:
        run_instruction(inst)

def main():
    while True:
        code = input('> ')
        try:
            run_turtle(code)
        except Exception as e:
            print(e)

def test():
    text = """
        c red yellow
        fill { repeat 36 {
            f200 l170
        }}
    """
    run_turtle(text)

if __name__ == '__main__':
    # test()
    main()

Simple JSON Parser

The code is short and clear, and outperforms every other parser (that's written in Python). For an explanation, check out the JSON parser tutorial at /docs/json_tutorial.md

import sys

from lark import Lark, Transformer, v_args

json_grammar = r"""
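?start: value

?value: object
      | array
      | string
      | SIGNED_NUMBER -> number
      | "true" -> true
      | "false" -> false
      | "null" -> null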

array : "[" [value ("," value)*] "]"
object : "{" [pair ("," pair)*] "}"
pair : string ":" value

string : ESCAPED_STRING

%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.WS

%ignore WS
"""

class TreeToJson(Transformer):
    @v_args(inline=True)
    def string(self, s):
        return s[1:-1].replace('\\"', '"')

    array = list
    pair = tuple
    object = dict
    number = v_args(inline=True)(float)

    null = lambda self, _: None
    true = lambda self, _: True
    false = lambda self, _: False

### Create the JSON parser with Lark, using the Earley algorithm
# json_parser = Lark(json_grammar, parser='earley', lexer='basic')
# def parse(x):
#     return TreeToJson().transform(json_parser.parse(x))

### Create the JSON parser with Lark, using the LALR algorithm
json_parser = Lark(json_grammar, parser='lalr',
                   # Using the basic lexer isn't required, and isn't usually recommended.
                   # But, it's good enough for JSON, and it's slightly faster.
                   lexer='basic',
                   # Disabling propagate_positions and placeholders slightly improves speed
                   propagate_positions=False,
                   maybe_placeholders=False,
                   # Using an internal transformer is faster and more memory efficient
                   transformer=TreeToJson())
parse = json_parser.parse

def test():
    test_json = '''
        {
            "empty_object" : {},
            "empty_array" : [],
            "booleans" : { "YES" : true, "NO" : false },
            "numbers" : [ 0, 1, -2, 3.3, 4.4e5, 6.6e-7 ],
            "strings" : [ "This", [ "And" , "That", "And a \\"b" ] ],
            "nothing" : null
        }
    '''

    j = parse(test_json)
    print(j)
    import json
    assert j == json.loads(test_json)

if __name__ == '__main__':
    # test()
    with open(sys.argv[1]) as f:
        print(parse(f.read()))

Advanced Examples

LALR's contextual lexer

This example demonstrates the power of LALR's contextual lexer, by parsing a toy configuration language.

The terminals NAME and VALUE overlap. They can match the same input. A basic lexer would arbitrarily choose one over the other, based on priority, which would lead to a (confusing) parse error. However, due to the unambiguous structure of the grammar, Lark's LALR(1) algorithm knows which one of them to expect at each point during the parse. The lexer then only matches the tokens that the parser expects. The result is a correct parse, something that is impossible with a regular lexer.

Another approach is to use the Earley algorithm. It will handle more cases than the contextual lexer, but at the cost of performance. See examples/conf_earley.py for an example of that approach.
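
The idea behind contextual lexing can be sketched without Lark at all. The toy below is only an illustration, not Lark's implementation: the terminal table, the EXPECTED_AFTER map and the contextual_lex function are hypothetical names invented here. At each position the lexer tries only the terminals that the (imaginary) parser would accept next, so the greedy VALUE pattern never gets a chance to swallow a NAME.

```python
import re

# Hypothetical terminal table for a toy "NAME = VALUE" language.
TERMINALS = {
    "NAME": re.compile(r"\w+"),
    "EQUAL": re.compile(r"="),
    "VALUE": re.compile(r".+"),
}

# Which terminals the imaginary parser accepts after each token type.
# None stands for "start of line".
EXPECTED_AFTER = {
    None: ["NAME"],
    "NAME": ["EQUAL"],
    "EQUAL": ["VALUE"],
}

def contextual_lex(line):
    """Tokenize one line, trying only the terminals valid in the current state."""
    tokens, pos, last = [], 0, None
    while pos < len(line):
        if line[pos] in " \t":  # skip inline whitespace between tokens
            pos += 1
            continue
        for name in EXPECTED_AFTER.get(last, []):
            m = TERMINALS[name].match(line, pos)
            if m:
                tokens.append((name, m.group()))
                pos, last = m.end(), name
                break
        else:
            raise SyntaxError("unexpected input at column %d" % pos)
    return tokens

print(contextual_lex('this="that",4'))
```

A lexer without this context would have to pick between NAME and VALUE by priority alone, and VALUE (which matches any run of characters) could consume the whole line.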

from lark import Lark

parser = Lark(r"""
start: _NL? section+
section: "[" NAME "]" _NL item+
item: NAME "=" VALUE? _NL

NAME: /\w/+
VALUE: /./+

%import common.NEWLINE -> _NL
%import common.WS_INLINE
%ignore WS_INLINE
""", parser="lalr")

sample_conf = """
[bla]
a=Hello
this="that",4
empty=
"""

print(parser.parse(sample_conf).pretty())

Templates

This example shows how to use Lark's templates to achieve cleaner grammars

from lark import Lark

grammar = r"""
start: list | dict

list: "[" _seperated{atom, ","} "]"
dict: "{" _seperated{key_value, ","} "}"
key_value: atom ":" atom

_seperated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...'

atom: NUMBER | ESCAPED_STRING

%import common (NUMBER, ESCAPED_STRING, WS)
%ignore WS
"""

parser = Lark(grammar)

print(parser.parse('[1, "a", 2]'))
print(parser.parse('{"a": 2, "b": 6}'))

Earley's dynamic lexer

Demonstrates the power of Earley's dynamic lexer on a toy configuration language

Using a lexer for configuration files is tricky, because values don't have to be surrounded by delimiters. Using a basic lexer for this just won't work.

In this example we use a dynamic lexer and let the Earley parser resolve the ambiguity.

Another approach is to use the contextual lexer with LALR. It is less powerful than Earley, but it can handle some ambiguity when lexing and it's much faster. See examples/conf_lalr.py for an example of that approach.

from lark import Lark

parser = Lark(r"""
start: _NL? section+
section: "[" NAME "]" _NL item+
item: NAME "=" VALUE? _NL

NAME: /\w/+
VALUE: /./+

%import common.NEWLINE -> _NL
%import common.WS_INLINE
%ignore WS_INLINE
""", parser="earley")

def test():
    sample_conf = """
[bla]

a=Hello
this="that",4
empty=
"""

    r = parser.parse(sample_conf)
    print (r.pretty())

if __name__ == '__main__':
    test()

Error handling using an interactive parser

This example demonstrates error handling using an interactive parser in LALR

When the parser encounters an UnexpectedToken exception, it creates an interactive parser with the current parse-state, and lets you control how to proceed step-by-step. When you've achieved the correct parse-state, you can resume the run by returning True.
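
The callback protocol described above can be illustrated with a small stand-alone toy. This is a sketch only and does not use Lark: ListParser, the local UnexpectedToken class and the parse function are hypothetical stand-ins written for this illustration. What it shows is the contract: the handler receives the error, may feed tokens to repair the state, and returns True to resume or False to let the error propagate.

```python
class UnexpectedToken(Exception):
    """Hypothetical error carrying the offending token and the parser state."""
    def __init__(self, token, parser):
        super().__init__("unexpected %r" % (token,))
        self.token = token
        self.interactive_parser = parser

class ListParser:
    """Toy parser for NUMBER (COMMA NUMBER)* over (type, value) tokens."""
    def __init__(self):
        self.expect = "NUMBER"
        self.result = []

    def feed_token(self, token):
        kind, value = token
        if kind != self.expect:
            raise UnexpectedToken(token, self)
        if kind == "NUMBER":
            self.result.append(value)
            self.expect = "COMMA"
        else:
            self.expect = "NUMBER"

def parse(tokens, on_error=None):
    parser = ListParser()
    for token in tokens:
        try:
            parser.feed_token(token)
        except UnexpectedToken as e:
            # Resume only if the handler repaired the state and returned True.
            if on_error is None or not on_error(e):
                raise
    return parser.result

def ignore_errors(e):
    kind, _ = e.token
    if kind == "COMMA":
        return True  # drop the extra comma
    if kind == "NUMBER":
        e.interactive_parser.feed_token(("COMMA", ","))  # insert the missing comma
        e.interactive_parser.feed_token(e.token)         # then retry the number
        return True
    return False

tokens = [("NUMBER", 0), ("NUMBER", 1), ("COMMA", ","), ("COMMA", ","), ("NUMBER", 2)]
print(parse(tokens, on_error=ignore_errors))
```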

from lark import Token

from _json_parser import json_parser

def ignore_errors(e):
    if e.token.type == 'COMMA':
        # Skip comma
        return True
    elif e.token.type == 'SIGNED_NUMBER':
        # Try to feed a comma and retry the number
        e.interactive_parser.feed_token(Token('COMMA', ','))
        e.interactive_parser.feed_token(e.token)
        return True

    # Unhandled error. Will stop parse and raise exception
    return False

def main():
    s = "[0 1, 2,, 3,,, 4, 5 6 ]"
    res = json_parser.parse(s, on_error=ignore_errors)
    print(res) # prints [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

main()

Reconstruct a JSON

Demonstrates the experimental text-reconstruction feature

The Reconstructor takes a parse tree (already filtered from punctuation, of course), and reconstructs it into correct text, that can be parsed correctly. It can be useful for creating "hooks" to alter data before handing it to other parsers. You can also use it to generate samples from scratch.

import json

from lark import Lark
from lark.reconstruct import Reconstructor

from _json_parser import json_grammar

test_json = '''
{
"empty_object" : {},
"empty_array" : [],
"booleans" : { "YES" : true, "NO" : false },
"numbers" : [ 0, 1, -2, 3.3, 4.4e5, 6.6e-7 ],
"strings" : [ "This", [ "And" , "That", "And a \\"b" ] ],
"nothing" : null
}
'''

def test_earley():

    json_parser = Lark(json_grammar, maybe_placeholders=False)
    tree = json_parser.parse(test_json)

    new_json = Reconstructor(json_parser).reconstruct(tree)
    print (new_json)
    print (json.loads(new_json) == json.loads(test_json))

def test_lalr():

    json_parser = Lark(json_grammar, parser='lalr', maybe_placeholders=False)
    tree = json_parser.parse(test_json)

    new_json = Reconstructor(json_parser).reconstruct(tree)
    print (new_json)
    print (json.loads(new_json) == json.loads(test_json))

test_earley()
test_lalr()

Custom lexer

Demonstrates using a custom lexer to parse a non-textual stream of data

You can use a custom lexer to tokenize text when the lexers offered by Lark are too slow, or not flexible enough.

You can also use it (as shown in this example) to tokenize streams of objects.

from lark import Lark, Transformer, v_args
from lark.lexer import Lexer, Token

class TypeLexer(Lexer):
    def __init__(self, lexer_conf):
        pass

    def lex(self, data):
        for obj in data:
            if isinstance(obj, int):
                yield Token('INT', obj)
            elif isinstance(obj, (type(''), type(u''))):
                yield Token('STR', obj)
            else:
                raise TypeError(obj)

parser = Lark("""
start: data_item+
data_item: STR INT*

%declare STR INT
""", parser='lalr', lexer=TypeLexer)

class ParseToDict(Transformer):
    @v_args(inline=True)
    def data_item(self, name, *numbers):
        return name.value, [n.value for n in numbers]

    start = dict

def test():
    data = ['alice', 1, 27, 3, 'bob', 4, 'carrie', 'dan', 8, 6]

    print(data)

    tree = parser.parse(data)
    res = ParseToDict().transform(tree)

    print('-->')
    print(res) # prints {'alice': [1, 27, 3], 'bob': [4], 'carrie': [], 'dan': [8, 6]}

if __name__ == '__main__':
    test()

Transform a Forest

This example demonstrates how to subclass TreeForestTransformer to directly transform a SPPF.

from lark import Lark
from lark.parsers.earley_forest import TreeForestTransformer, handles_ambiguity, Discard

class CustomTransformer(TreeForestTransformer):

    @handles_ambiguity
    def sentence(self, trees):
        return next(tree for tree in trees if tree.data == 'simple')

    def simple(self, children):
        children.append('.')
        return self.tree_class('simple', children)

    def adj(self, children):
        return Discard

    def __default_token__(self, token):
        return token.capitalize()

grammar = """
sentence: noun verb noun -> simple
| noun verb "like" noun -> comparative

noun: adj? NOUN
verb: VERB
adj: ADJ

NOUN: "flies" | "bananas" | "fruit"
VERB: "like" | "flies"
ADJ: "fruit"

%import common.WS
%ignore WS
"""

parser = Lark(grammar, start='sentence', ambiguity='forest')
sentence = 'fruit flies like bananas'
forest = parser.parse(sentence)

tree = CustomTransformer(resolve_ambiguity=False).transform(forest)
print(tree.pretty())

# Output:
#
# simple
#   noun Flies
#   verb Like
#   noun Bananas
#   .
#

Simple JSON Parser

The code is short and clear, and outperforms every other parser (that's written in Python). For an explanation, check out the JSON parser tutorial at /docs/json_tutorial.md

(this is here for use by the other examples)

from lark import Lark, Transformer, v_args

json_grammar = r"""
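?start: value

?value: object
      | array
      | string
      | SIGNED_NUMBER -> number
      | "true" -> true
      | "false" -> false
      | "null" -> null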

array : "[" [value ("," value)*] "]"
object : "{" [pair ("," pair)*] "}"
pair : string ":" value

string : ESCAPED_STRING

%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.WS

%ignore WS
"""

class TreeToJson(Transformer):
    @v_args(inline=True)
    def string(self, s):
        return s[1:-1].replace('\\"', '"')

    array = list
    pair = tuple
    object = dict
    number = v_args(inline=True)(float)

    null = lambda self, _: None
    true = lambda self, _: True
    false = lambda self, _: False

Custom SPPF Prioritizer

This example demonstrates how to subclass ForestVisitor to make a custom SPPF node prioritizer to be used in conjunction with TreeForestTransformer.

Our prioritizer will count the number of descendants of a node that are tokens. By negating this count, our prioritizer will prefer nodes with fewer token descendants. Thus, we choose the more specific parse.
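
The scoring rule itself can be shown on plain Python data, independent of Lark's forest classes. In this sketch the (rule, children) pairs, token_count and priority are hypothetical names used only for illustration; tokens are represented as plain strings.

```python
def token_count(node):
    """Count the token (string) leaves below a node; each token counts as 1."""
    if isinstance(node, str):
        return 1
    _name, children = node
    return sum(token_count(child) for child in children)

def priority(node):
    # Negating the count makes derivations with fewer tokens rank higher.
    return -token_count(node)

# Two hypothetical derivations of "Hello World", as (rule, children) pairs.
split = ("start", [("hello", ["Hello"]), " ", ("world", ["World"])])
joined = ("start", [("hello_world", ["Hello World"])])

best = max([split, joined], key=priority)
print(best)
```

The split derivation contains three tokens and the joined one contains a single token, so the joined derivation gets the higher (less negative) priority.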

from lark import Lark
from lark.parsers.earley_forest import ForestVisitor, TreeForestTransformer

class TokenPrioritizer(ForestVisitor):

    def visit_symbol_node_in(self, node):
        # visit the entire forest by returning node.children
        return node.children

    def visit_packed_node_in(self, node):
        return node.children

    def visit_symbol_node_out(self, node):
        priority = 0
        for child in node.children:
            # Tokens do not have a priority attribute
            # count them as -1
            priority += getattr(child, 'priority', -1)
        node.priority = priority

    def visit_packed_node_out(self, node):
        priority = 0
        for child in node.children:
            priority += getattr(child, 'priority', -1)
        node.priority = priority

    def on_cycle(self, node, path):
        raise Exception("Oops, we encountered a cycle.")

grammar = """
start: hello " " world | hello_world
hello: "Hello"
world: "World"
hello_world: "Hello World"
"""

parser = Lark(grammar, parser='earley', ambiguity='forest')
forest = parser.parse("Hello World")

print("Default prioritizer:")
tree = TreeForestTransformer(resolve_ambiguity=True).transform(forest)
print(tree.pretty())

forest = parser.parse("Hello World")

print("Custom prioritizer:")
tree = TreeForestTransformer(resolve_ambiguity=True, prioritizer=TokenPrioritizer()).transform(forest)
print(tree.pretty())

# Output:
#
# Default prioritizer:
# start
#   hello Hello
#
#   world World
#
# Custom prioritizer:
# start
#   hello_world Hello World

Python 3 to Python 2 converter (tree templates)

This example demonstrates how to translate between two trees using tree templates. It parses Python 3, translates it to a Python 2 AST, and then outputs the result as Python 2 code.

Uses reconstruct_python.py for generating the final Python 2 code.
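
The mechanics of template-based translation (match a pattern containing $variables, then fill a replacement template with the bound subtrees) can be sketched on plain tuples. This is an illustration under simplifying assumptions, not the lark.tree_templates implementation: trees are hypothetical (node_name, child, ...) tuples, and match, substitute and translate are names invented for this sketch.

```python
def match(template, tree, bindings):
    """Try to match `tree` against `template`; '$x' strings are variables."""
    if isinstance(template, str) and template.startswith("$"):
        bindings[template] = tree
        return True
    if isinstance(template, str) or isinstance(tree, str):
        return template == tree
    return (len(template) == len(tree)
            and all(match(t, s, bindings) for t, s in zip(template, tree)))

def substitute(template, bindings):
    """Fill the variables of `template` with the bound subtrees."""
    if isinstance(template, str):
        return bindings.get(template, template)
    return tuple(substitute(t, bindings) for t in template)

def translate(tree, rules):
    """Rewrite children first, then apply the first rule that matches."""
    if isinstance(tree, str):
        return tree
    tree = tuple(translate(child, rules) for child in tree)
    for pattern, replacement in rules:
        bindings = {}
        if match(pattern, tree, bindings):
            return substitute(replacement, bindings)
    return tree

# Rough analogue of the '$a / $b' -> 'float($a) / $b' translation.
rules = [
    (("div", "$a", "$b"), ("div", ("call", "float", "$a"), "$b")),
]
print(translate(("gt", ("div", "a", "2"), "1"), rules))
```

The result of a rewrite is not translated again, which keeps a rule whose output contains its own pattern (as here) from looping forever.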

from lark import Lark
from lark.tree_templates import TemplateConf, TemplateTranslator

from lark.indenter import PythonIndenter
from reconstruct_python import PythonReconstructor

#
# 1. Define a Python parser that also accepts template vars in the code (in the form of $var)
#
TEMPLATED_PYTHON = r"""
%import python (single_input, file_input, eval_input, atom, var, stmt, expr, testlist_star_expr, _NEWLINE, _INDENT, _DEDENT, COMMENT, NAME)

%extend atom: TEMPLATE_NAME -> var

TEMPLATE_NAME: "$" NAME

?template_start: (stmt | testlist_star_expr _NEWLINE)

%ignore /[\t \f]+/ // WS
%ignore /\\[\t \f]*\r?\n/ // LINE_CONT
%ignore COMMENT
"""

parser = Lark(TEMPLATED_PYTHON, parser='lalr', start=['single_input', 'file_input', 'eval_input', 'template_start'], postlex=PythonIndenter(), maybe_placeholders=False)

def parse_template(s):
    return parser.parse(s + '\n', start='template_start')

def parse_code(s):
    return parser.parse(s + '\n', start='file_input')

#
# 2. Define translations using templates (each template code is parsed to a template tree)
#

pytemplate = TemplateConf(parse=parse_template)

translations_3to2 = {
    'yield from $a':
        'for _tmp in $a: yield _tmp',

    'raise $e from $x':
        'raise $e',

    '$a / $b':
        'float($a) / $b',
}
translations_3to2 = {pytemplate(k): pytemplate(v) for k, v in translations_3to2.items()}

#
# 3. Translate and reconstruct Python 3 code into valid Python 2 code
#

python_reconstruct = PythonReconstructor(parser)

def translate_py3to2(code):
    tree = parse_code(code)
    tree = TemplateTranslator(translations_3to2).translate(tree)
    return python_reconstruct.reconstruct(tree)

#
# Test Code
#

_TEST_CODE = '''
if a / 2 > 1:
    yield from [1,2,3]
else:
    raise ValueError(a) from e

'''

def test():
    print(_TEST_CODE)
    print(' -----> ')
    print(translate_py3to2(_TEST_CODE))

if __name__ == '__main__':
    test()

Grammar-complete Python Parser

A fully-working Python 2 & 3 parser (but not production ready yet!)

This example demonstrates usage of the included Python grammars

import sys
import os, os.path
from io import open
import glob, time

from lark import Lark
from lark.indenter import PythonIndenter

kwargs = dict(postlex=PythonIndenter(), start='file_input')

# Official Python grammar by Lark
python_parser3 = Lark.open_from_package('lark', 'python.lark', ['grammars'], parser='lalr', **kwargs)

# Local Python2 grammar
python_parser2 = Lark.open('python2.lark', rel_to=__file__, parser='lalr', **kwargs)
python_parser2_earley = Lark.open('python2.lark', rel_to=__file__, parser='earley', lexer='basic', **kwargs)

try:
    xrange
except NameError:
    chosen_parser = python_parser3
else:
    chosen_parser = python_parser2

def _read(fn, *args):
    kwargs = {'encoding': 'iso-8859-1'}
    with open(fn, *args, **kwargs) as f:
        return f.read()

def _get_lib_path():
    if os.name == 'nt':
        if 'PyPy' in sys.version:
            return os.path.join(sys.base_prefix, 'lib-python', sys.winver)
        else:
            return os.path.join(sys.base_prefix, 'Lib')
    else:
        return [x for x in sys.path if x.endswith('%s.%s' % sys.version_info[:2])][0]

def test_python_lib():
    path = _get_lib_path()

    start = time.time()
    files = glob.glob(path+'/*.py')
    total_kb = 0
    for f in files:
        r = _read(os.path.join(path, f))
        kb = len(r) / 1024
        print( '%s -\t%.1f kb' % (f, kb))
        chosen_parser.parse(r + '\n')
        total_kb += kb

    end = time.time()
    print( "test_python_lib (%d files, %.1f kb), time: %.2f secs"%(len(files), total_kb, end-start) )

def test_earley_equals_lalr():
    path = _get_lib_path()

    files = glob.glob(path+'/*.py')
    for f in files:
        print( f )
        tree1 = python_parser2.parse(_read(os.path.join(path, f)) + '\n')
        tree2 = python_parser2_earley.parse(_read(os.path.join(path, f)) + '\n')
        assert tree1 == tree2

if __name__ == '__main__':
    test_python_lib()
    # test_earley_equals_lalr()
    # python_parser3.parse(_read(sys.argv[1]) + '\n')

Creating an AST from the parse tree

This example demonstrates how to transform a parse-tree into an AST using lark.ast_utils.

create_transformer() collects every subclass of Ast from the module, and creates a Lark transformer that builds the AST with no extra code.
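
To make the collection step concrete, here is a simplified stand-alone sketch of the general idea. It is not the lark.ast_utils source: the Ast base class, camel_to_snake and collect_callbacks below are hypothetical stand-ins, and the explicit namespace dict stands in for a module's attributes. It assumes that class names are mapped to rule names by converting CamelCase to snake_case, and that names starting with an underscore are skipped, as the comments in the example describe.

```python
import inspect
import re

class Ast:
    """Hypothetical stand-in for the AST base class."""

class _Statement(Ast):  # skipped: the name starts with an underscore
    pass

class SetVar(_Statement):
    def __init__(self, name, value):
        self.name, self.value = name, value

class CodeBlock(Ast):
    def __init__(self, *statements):
        self.statements = list(statements)

def camel_to_snake(name):
    # "SetVar" -> "set_var"
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def collect_callbacks(namespace):
    """Map rule names to constructors for every public Ast subclass."""
    callbacks = {}
    for name, obj in list(namespace.items()):
        if name.startswith("_") or not inspect.isclass(obj):
            continue
        if issubclass(obj, Ast) and obj is not Ast:
            # Children arrive as a list, so unpack them into the constructor.
            callbacks[camel_to_snake(name)] = lambda children, cls=obj: cls(*children)
    return callbacks

# Stand-in for a module's namespace.
namespace = {cls.__name__: cls for cls in (Ast, _Statement, SetVar, CodeBlock)}
callbacks = collect_callbacks(namespace)
print(sorted(callbacks))
```

Calling callbacks["set_var"](["a", 1]) builds a SetVar node, which is the role a transformer callback for the set_var rule plays.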

This example only works with Python 3.

import sys
from typing import List
from dataclasses import dataclass

from lark import Lark, ast_utils, Transformer, v_args
from lark.tree import Meta

this_module = sys.modules[__name__]

#
# Define AST
#
class _Ast(ast_utils.Ast):
    # This will be skipped by create_transformer(), because it starts with an underscore
    pass

class _Statement(_Ast):
    # This will be skipped by create_transformer(), because it starts with an underscore
    pass

@dataclass
class Value(_Ast, ast_utils.WithMeta):
    "Uses WithMeta to include line-number metadata in the meta attribute"
    meta: Meta
    value: object

@dataclass
class Name(_Ast):
    name: str

@dataclass
class CodeBlock(_Ast, ast_utils.AsList):
    # Corresponds to code_block in the grammar
    statements: List[_Statement]

@dataclass
class If(_Statement):
    cond: Value
    then: CodeBlock

@dataclass
class SetVar(_Statement):
    # Corresponds to set_var in the grammar
    name: str
    value: Value

@dataclass
class Print(_Statement):
    value: Value

class ToAst(Transformer):
    # Define extra transformation functions, for rules that don't correspond to an AST class.

    def STRING(self, s):
        # Remove quotation marks
        return s[1:-1]

    def DEC_NUMBER(self, n):
        return int(n)

    @v_args(inline=True)
    def start(self, x):
        return x

#
# Define Parser
#

parser = Lark("""
start: code_block

code_block: statement+

?statement: if | set_var | print

if: "if" value "{" code_block "}"
set_var: NAME "=" value ";"
print: "print" value ";"

value: name | STRING | DEC_NUMBER
name: NAME

%import python (NAME, STRING, DEC_NUMBER)
%import common.WS
%ignore WS
""",
parser="lalr",
)

transformer = ast_utils.create_transformer(this_module, ToAst())

def parse(text):
    tree = parser.parse(text)
    return transformer.transform(tree)

#
# Test
#

if __name__ == '__main__':
    print(parse("""
        a = 1;
        if a {
            print "a is 1";
            a = 2;
        }
    """))

Example-Driven Error Reporting

A demonstration of example-driven error reporting with the Earley parser (See also: error_reporting_lalr.py)

from lark import Lark, UnexpectedInput

from _json_parser import json_grammar # Using the grammar from the json_parser example

json_parser = Lark(json_grammar)

class JsonSyntaxError(SyntaxError):
    def __str__(self):
        context, line, column = self.args
        return '%s at line %s, column %s.\n\n%s' % (self.label, line, column, context)

class JsonMissingValue(JsonSyntaxError):
    label = 'Missing Value'

class JsonMissingOpening(JsonSyntaxError):
    label = 'Missing Opening'

class JsonMissingClosing(JsonSyntaxError):
    label = 'Missing Closing'

class JsonMissingComma(JsonSyntaxError):
    label = 'Missing Comma'

class JsonTrailingComma(JsonSyntaxError):
    label = 'Trailing Comma'

def parse(json_text):
    try:
        j = json_parser.parse(json_text)
    except UnexpectedInput as u:
        exc_class = u.match_examples(json_parser.parse, {
            JsonMissingOpening: ['{"foo": ]}',
                                 '{"foor": }}',
                                 '{"foo": }'],
            JsonMissingClosing: ['{"foo": [}',
                                 '{',
                                 '{"a": 1',
                                 '[1'],
            JsonMissingComma: ['[1 2]',
                               '[false 1]',
                               '["b" 1]',
                               '{"a":true 1:4}',
                               '{"a":1 1:4}',
                               '{"a":"b" 1:4}'],
            JsonTrailingComma: ['[,]',
                                '[1,]',
                                '[1,2,]',
                                '{"foo":1,}',
                                '{"foo":false,"bar":true,}']
        }, use_accepts=True)
        if not exc_class:
            raise
        raise exc_class(u.get_context(json_text), u.line, u.column)

def test():
    try:
        parse('{"example1": "value"')
    except JsonMissingClosing as e:
        print(e)

    try:
        parse('{"example2": ] ')
    except JsonMissingOpening as e:
        print(e)

if __name__ == '__main__':
    test()

Example-Driven Error Reporting

A demonstration of example-driven error reporting with the LALR parser (See also: error_reporting_earley.py)

from lark import Lark, UnexpectedInput

from _json_parser import json_grammar # Using the grammar from the json_parser example

json_parser = Lark(json_grammar, parser='lalr')