GSoC 2 — The SymPy Boogaloo
In this blog post, I will discuss what my 2nd GSoC project is about.
GSoC with SymPy
In 2023, I decided to apply to just one organization for GSoC: SymPy. I was interested in the parsing, codegen, and printing modules as I had worked with all 3, and solved some interesting bugs in the parsing and printing submodules.
I knew that, in the past, SymPy has had a little trouble with finding mentors, even for projects they wanted, so I decided to send an email to the SymPy mailing list to gauge interest in having a GSoC project in either the parsing or codegen submodules.
I chose to ask about these two submodules specifically, because out of the three submodules in which I have previous experience, the codegen submodule and the parsing submodule have their own dedicated sections on the GSoC Ideas page.
In response to my email, Aaron Meurer replied with a few potential GSoC project ideas. Of those ideas, I ended up choosing to write a proposal about rewriting the LaTeX parser, because it is widely used by the community.
Wait, what is SymPy?
As the About section on SymPy’s GitHub says, SymPy is “a computer algebra system written in pure Python”.
The way I like to explain it to people is like this: If I were to ask MATLAB what the value of $d$ is, where
\[d = \int\limits_1^2 \! \dfrac{1}{x}\,\mathrm{d}x,\]
the answer I would get is d = 0.6931. On the other hand, if I compute the same integral with SymPy, I get the output d = log(2).
In MATLAB’s case, I got the approximate numerical value, whereas in SymPy’s case, I got the symbolic answer.
Note: I am aware that MATLAB has a Symbolic Math Toolbox™ which gives MATLAB symbolic capabilities. On the flip side, calling d.evalf() on the SymPy result gives 0.693147180559945. Therefore, the above example is simplistic, but it helps me get the point across.
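The SymPy side of this comparison can be sketched as follows. This is a minimal sketch, not necessarily the exact code behind the numbers above: it assumes only that SymPy is installed, and the names x and d are chosen for illustration.

```python
from sympy import Symbol, integrate

# Symbol for the integration variable (the name is an illustrative choice).
x = Symbol("x")

# Definite integral of 1/x from 1 to 2, evaluated symbolically.
d = integrate(1 / x, (x, 1, 2))
print(f"d = {d}")

# evalf() converts the exact result into a floating-point approximation.
print(d.evalf())
```

The first print shows the exact, symbolic result, and the second shows its numerical approximation, which is the same distinction the MATLAB comparison is meant to illustrate.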
The Project
My project, as I briefly mentioned, is to rewrite the LaTeX parser using Lark as the parser generator, instead of ANTLR.
My GSoC Proposal
My proposal can be found here. Unlike last time, the project doesn’t deviate much from the original plan laid out in the proposal.
What is the LaTeX Parser?
The LaTeX parser is one of the submodules in SymPy; it allows you to convert a LaTeX expression into a SymPy expression. For example, if I have some LaTeX code like \int\! x^2 \,dx, I can convert it into a SymPy expression by passing it to parse_latex, which can be imported from sympy.parsing.latex.
Once you have this expression, you can do things like substituting values. For example, calling expr.evalf(subs=dict(a=5, b=2)) on the parsed expression expr will give us 20.0000000000000 as the output.
Why Does It Need a Rewrite?
To get the full picture, we first need to understand the history of the LaTeX parser in SymPy, and how it came to be.
Some History
Here is a snippet from the documentation for the LaTeX parser:
$\mathrm{\LaTeX}$ parsing was ported from latex2sympy. While functional and its API should remain stable, the parsing behavior or backend may change in future releases.
If we look at the linked GitHub repository, we see that the project was originally started in January 2016 by augustt198.
This repository solved a long-standing feature request that people had asked for, as can be seen in this SymPy mailing list email thread and this old SymPy issue.
Soon after the repository was made, long-time SymPy contributor @moorepants mentioned in Issue #1 that the SymPy community was interested in bringing the codebase into SymPy itself.
After that, the latex2sympy code was merged into SymPy in #13706.
This history matters because it explains why ANTLR was used: originally, since the LaTeX parser was not a part of SymPy, augustt198 was free to use whatever library he wanted, and he likely felt most comfortable using ANTLR for the task.
However, after the code became a part of SymPy (and even while that process was happening), there were some concerns about having ANTLR as an optional dependency. The priority at the time, however, was to get a LaTeX parser into SymPy as a sort of reference implementation or a baseline, and worry about the rest later.
Issues
There were a few issues with ANTLR, which is why alternatives were being considered:
- The runtime package can be difficult to install: users have reported trouble installing the LaTeX parser’s runtime dependencies. There are a couple of packages on conda-forge with closely related names: antlr4-python3-runtime and antlr-python-runtime. Installing the wrong one causes hard-to-debug issues. As one user who ran into this issue pointed out, “That’s a few lost hours for each of the two characters”.
- The above issue is further exacerbated by the fact that the required package’s name is antlr4-python3-runtime on PyPI.
In #19528, sylee957 pointed out a few more shortcomings of the existing ANTLR-based parser:
- The ANTLR generated files don’t make the parser truly standalone.
- The ANTLR generated files generate huge diffs when changes are made to the grammar files. (Here is an example.)
- The ANTLR generated files contain version information that causes warnings for the user. The good news is that, in spite of the warnings, they appear to run without critical problems. However, this is bad for developers because different versions of ANTLR generate differently structured files, which exacerbates the huge-diff problem mentioned above.
- The ANTLR generated files contain personal information, which must be filtered out before committing them to version control.
All of the above shortcomings are reasons to move away from ANTLR and towards a pure Python library.
One advantage of ANTLR is its performance, but that isn’t very important for this use case.
Alternatives to ANTLR
Multiple viable alternatives were originally considered:
Note that all the libraries in the list are Python libraries.
Of these libraries, Lark was chosen as the library that fits SymPy’s needs best.
Why Lark?
Numerous advantages to using Lark were identified:
- Lark has active, receptive maintainers. For example, when I found something in the documentation that wasn’t being rendered correctly, I opened an issue for it, and it was promptly fixed.
- Lark has good documentation. The documentation is detailed and filled with examples, which makes using the library a lot easier.
- Lark has no runtime dependencies beyond Python’s standard library. This is important because, for example, Parsimonious still needs the external regex package.
- Lark shows strong performance.
- Lark handles ambiguities that PEG parsers cannot. By using the Earley parser, for example, Lark can return a tree with all the possibilities if a certain expression is ambiguous.
- Lark uses a dedicated, self-described format, which
  - cannot include implementation details (e.g. inline Python). This is good because the library itself enforces separation of concerns (i.e. keeping the grammar definition separate from the parser).
  - can be stored inline in a .py file, or stored in one or more .lark files.
  - has plugins for various editors. For example, there is a VS Code extension and a PyCharm plugin for syntax highlighting Lark files.
  - can be used to generate parsers in other languages like Julia and JavaScript (see the last point here).
  - has a “standard library” of useful tokens and expressions which can be imported into a grammar.
- Lark can generate a standalone .py file. In this case, this is not a big advantage since one of the reasons for moving away from ANTLR was to remove compiled components.
Prior Work
Some work on rewriting the LaTeX parser had already been done before this GSoC project: in #19825, costrouc started removing the ANTLR-based parser and implementing a Lark-based one.
That’s all for this blog post! I tried to give the full history behind the LaTeX parser in SymPy, explain the motivation for rewriting it in Lark, and describe where we stand currently. Stay tuned to this series for more information and for a work update on what I’ve done so far.