Towards a new SymPy: part 1 - Outline

This is the first part of a series of posts on big changes for SymPy, with a particular focus on speed. In these posts I am going to describe:

  • The issues that SymPy currently has around speed.

  • Work that I (and others) have done to speed up SymPy over the past few years.

  • Work in particular that I have done over the last year to make SymPy faster.

  • What I think are the next steps to make SymPy faster.

The other posts in this series can be found at Towards a new SymPy.

Over the last year in particular I have been working as part of a CZI-funded project that has three strands. One of those three strands is for me to work on speeding up SymPy. Now that we have come to the end of that year, I want to describe what has been done and spell out a vision for the future.

I will be writing this in a series of blog posts. This first post will outline the structure of the foundations of a computer algebra system (CAS) like SymPy, describe some problems SymPy currently has and what can be done to address them. Then subsequent posts will focus in more detail on particular components and the work that has been done and what should be done in the future.

I am writing this with the intention that it should be accessible to someone who is not a SymPy developer, although other SymPy developers are the intended audience for many of the points that I will make. Many of the things that I will describe here are not well understood even by many SymPy developers, though, and a major goal of this series of posts is to help change that.

What is a CAS?

There are many different kinds of computer algebra system (CAS) and they have different intended uses that are driven by their different kinds of authors, contributors and users. Some like SageMath are directed more at pure mathematicians, while others like SymPy are directed more at scientists and engineers.

Most scientific computing is not done with computer algebra systems. Instead numerical libraries (e.g. NumPy, SciPy, etc in the context of Python) are the usual backbone of computing in science and engineering. What those libraries do is to provide fast implementations of numerical computing based on machine precision floating point numbers. When I say machine precision floating point I mean something like a 64-bit binary floating point number like the float type in Python:

>>> from math import pi
>>> pi
3.141592653589793
>>> type(pi)
<class 'float'>

Many things can be done with machine precision floating point numbers, but there are situations where they are not good enough. Programmers learn fairly early on that floating point calculations are not exact e.g.:

>>> (1 / 49) * 49
0.9999999999999999

We can see here the limits of the precision of 64-bit floating point: we should have a result of 1 but in fact we have something that is quite close to 1 but still not exactly 1. Since the answer is quite close, most of the time this is not a huge problem even if it does cause some confusion for beginners.

However, there are many situations where this is not acceptable:

  • Small errors can accumulate and become large errors (see the short demonstration after this list).

  • To solve the error accumulation problem some calculations should be done with many more digits of precision than machine precision floating point numbers can provide.

  • Some calculations cannot be done reliably at all unless they are done exactly.

  • Sometimes we want to calculate not in terms of explicit numbers but in terms of symbols to obtain e.g. “analytic” rather than “numerical” results.
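
As a quick demonstration of the first and third points, repeated machine precision operations drift away from the exact result, whereas exact rational arithmetic (here using Python’s standard fractions module) does not:

>>> 0.1 + 0.1 + 0.1 == 0.3
False
>>> from fractions import Fraction
>>> Fraction(1, 10) + Fraction(1, 10) + Fraction(1, 10) == Fraction(3, 10)
True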

Traditionally these other situations are handled by scientists and engineers turning to a CAS, and so in the context of scientific computing the usual distinction is between working “numerically” with floating point numbers and using a CAS to work “symbolically” or “analytically”. There is a lot more to a CAS than this but in simple terms this is the role that SymPy plays within the Python scientific computing ecosystem. Of course there are many people also using SymPy for applications that are more like pure mathematics but most SymPy users and developers are from a scientific background and that has a big influence on the way that SymPy is developed.

The “core” of a CAS

Different people will have different ideas about what the “core” of a CAS like SymPy is. For example users might think of functions like solve or integrate as being the core functions, but these are high-level functions rather than the foundation of a CAS. Most SymPy developers would probably think of the sympy.core module as being the core of SymPy but this actually misses a major component of SymPy which is the sympy.polys module. In my mind these three components are the core of any CAS like SymPy:

  • A robust numerical evaluation subsystem (sympy.core.evalf).

  • A symbolic manipulation subsystem (sympy.core.basic).

  • A computational algebra subsystem (sympy.polys).

I will explain each of these in turn because it is easy to miss the importance of each of them and I think that many SymPy developers do not understand the role each of these components has and how they relate to one another. Of course there are many other components of SymPy that are important but these three are the foundation of almost all of the higher-level things that SymPy does. Importantly when we want to think about the speed of SymPy we need to understand that almost everything depends on the speed of these three components. Higher level algorithms are absolutely important but at least in SymPy those are all built on top of these three pieces as the foundation for everything else.

A robust numerical evaluation subsystem

Above I use the word “numerical” to mean calculations that are done with machine precision floating point numbers. This is the usual intended meaning of the word, but it is tricky because in the context of a CAS we can also do numerical calculations with arbitrary precision approximate or exact numbers and those are also “numerical” calculations. I will define several levels of numerical calculation ranging from machine precision floating point to exact and symbolic calculations. As we go down this list the calculations become slower but more accurate:

  • Machine precision floating point: This is the usual 64-bit binary floating point numbers that are used in most numerical libraries. They are fast and have a wide range of support in hardware and software but are limited in both range and precision. Most scientific Python libraries like NumPy, SciPy etc work (almost) exclusively with machine precision floating point numbers.

  • Arbitrary fixed precision floating point: This means using floating point numbers that can have as many digits of precision as desired. These calculations are much slower than machine precision floating point calculations but by increasing the precision we can make the calculations as accurate as we want. The mpmath library that is used by SymPy is an example of this. With mpmath you can choose to use say 100 digits and then all calculations will be done with 100 digit numbers.

  • “Robust” floating point calculations: These are calculations that are done with arbitrary precision floating point numbers but are done in such a way that the result is guaranteed (perhaps formally or perhaps heuristically) to be correct to within a specified tolerance. Making this work requires using adaptive precision so if the end result should be accurate to 100 digits then intermediate calculations might need to use many more than 100 digits to control the growth of errors. SymPy’s evalf is an example of heuristic robust numerics and the Arb library that I will discuss later is an example of a library that does formal robust numerics.

  • Exact numerical calculations: these are numerical calculations that are done using e.g. exact numbers like rational numbers or more complicated types of exact numbers.

  • Symbolic calculations: these are calculations that are done with symbols rather than just numbers. The result of a symbolic calculation is a symbolic expression that can be manipulated further. This is the kind of calculation that probably most people think of when they think of a CAS.

Let us just quickly demonstrate each of these:

  • Machine precision floating point:

    >>> from math import sqrt
    >>> sqrt(2)  # Use machine precision floating point
    1.4142135623730951
    
  • Arbitrary precision floating point:

    >>> from mpmath import mp
    >>> mp.dps = 50 # Use 50 digits for the calculation
    >>> mp.sqrt(2)
    mpf('1.4142135623730950488016887242096980785696718753769')
    
  • Robust floating point calculations:

    >>> from sympy import sqrt
    >>> sqrt(2).evalf(50) # Compute a result that is correct to 50 digits
    1.4142135623730950488016887242096980785696718753769
    
  • Exact numerical calculations:

    >>> from sympy import Rational
    >>> Rational(1, 2) + Rational(1, 3) # Compute exactly
    5/6
    
  • Symbolic calculations:

    >>> from sympy import Symbol, sqrt
    >>> x = Symbol('x')
    >>> sqrt(x)**2   # Compute with symbols
    x
    

The distinction between these different kinds of calculations can be a bit fuzzy but the first point to note is that the vast majority of scientific computing is done with machine precision floating point numbers as mentioned at the top of the list. Everything below that is what a CAS like SymPy is typically used for. It is also possible to do symbolic calculations with e.g. machine precision floating point numbers so perhaps including “symbolic” in this list does not make sense but I think that from a user’s perspective “symbolic” is in some way the next level after exact numerical calculations.

Many SymPy users do not understand the distinctions between these different kinds of numeric calculations, but many SymPy developers do not understand them either. For example I am not sure how many SymPy developers would automatically realise that mpmath and evalf are at different levels in this scheme, despite the fact that evalf works entirely by using mpmath. As an example to demonstrate the differences we will compute \(\cos(1)^2 + \sin(1)^2 - 1\) using SymPy’s evalf, mpmath and also SymEngine, which is a C++ library that recreates the “core” of SymPy in C++:

>>> import sympy as sym
>>> e = sym.cos(1)**2 + sym.sin(1)**2 - 1
>>> e.evalf()
-0.e-124
>>> from mpmath import mp
>>> mp.dps = 100
>>> mp.cos(1)**2 + mp.sin(1)**2 - 1
mpf('0.0')
>>> import symengine as se
>>> (se.cos(1)**2 + se.sin(1)**2 - 1).evalf()
1.11022302462516e-16

The correct answer here is zero because \(\sin^2(x) + \cos^2(x) = 1\) for all \(x\). SymPy’s evalf method returns the strange looking result -0.e-124. This is evalf’s way of saying that the result is smaller than \(10^{-124}\) and is possibly zero but not proven to be zero. The robust numerics in evalf should ensure that we know that the result is very close to zero but can still never prove that it is or is not exactly zero. We do not get an exact result of zero because evalf is careful not to claim that the result is exactly zero if it cannot prove that. Usually evalf will prove that a nonzero expression is nonzero but it can never prove that a zero expression is exactly zero. This is a fundamental limitation of robust (rather than exact) numerics: we can never prove that something is exactly zero. In order to compute this result evalf will have used mpmath but with higher and higher precision until it reached the maximum allowed precision (124 digits) and gave up.
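
For completeness, evalf does have documented options to adjust this behaviour: for example chop=True tells it to replace a result that it can only bound to be tiny with an exact zero:

>>> e.evalf(chop=True)
0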

The mpmath result is exactly zero, which happens to be correct, but it is only correct by “luck”. Here mpmath is using 100 decimal digits but it still does not prove that the result is zero. It does however return something that is indistinguishable from an exact zero, unlike evalf which made it clear that it could not prove that the result is exactly zero. This is a fundamental limitation of arbitrary fixed precision numerics as compared to robust numerics: we can use as many digits as we like and the result will likely be accurate but without very careful analysis we generally do not know anything about how accurate the result is. It is always possible that had we used more digits we would have found that the result was not zero after all but mpmath does not indicate any level of uncertainty about the result it returns.

By contrast SymEngine’s evalf has computed the result here using machine precision floating point numbers and so it gives a fast result but the result is not zero and there is no way to know if that is due to a rounding error or not. We can ask SymEngine’s evalf to use more digits but it will still only use arbitrary fixed precision (like mpmath) and not robust numerics (like SymPy’s evalf):

>>> (se.cos(1)**2 + se.sin(1)**2 - 1).evalf(100)
-3.9443045261050590270586428264e-31

There is nothing inherently wrong with SymEngine’s evalf and there are good reasons to use all of the different levels of numeric calculation that I have described above. Many SymPy users would actually be happier if SymPy’s evalf was faster even if less accurate and would prefer the behaviour of SymEngine’s evalf. However these levels of numeric calculation are not interchangeable and SymPy’s other two core systems (the expression system and the computational algebra subsystem) absolutely do depend on SymPy’s evalf giving robust numeric evaluation and not just arbitrary fixed precision numeric evaluation: many things would break if SymPy’s evalf was changed to behave like SymEngine’s evalf.

I will propose later a plan for how to improve SymPy’s numeric evaluation subsystem both in terms of speed and accuracy. For now I just want to note that the original author of both mpmath and SymPy’s evalf is Fredrik Johansson who subsequently went on to create Arb which is a library for doing formal robust numerics. SymPy’s evalf should change to using Arb-like formal robust numerics and should ultimately provide the option to use the Arb library as the basis for this subsystem.
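
To give a flavour of what formal robust numerics looks like, here is the same \(\cos(1)^2 + \sin(1)^2 - 1\) calculation done with Arb via the python-flint bindings (a sketch assuming python-flint is installed; the exact radius printed may differ):

>>> from flint import arb
>>> arb(1).cos()**2 + arb(1).sin()**2 - 1
[+/- 3.89e-16]

The result is a “ball” that is guaranteed to contain the true value, so this proves that the result is within the printed radius of zero, although like evalf it can never prove that the result is exactly zero.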

The symbolic expression system

SymPy’s symbolic expression system is the second component of what I call the “core” of SymPy and is located in the sympy.core package. This is what most SymPy users and contributors are used to working with. This system defines expressions in a symbolic tree representation e.g.:

>>> import sympy as sym
>>> x = sym.Symbol("x")
>>> e = x**2 + 2*x + 1
>>> e
x**2 + 2*x + 1
>>> sym.srepr(e)
"Add(Pow(Symbol('x'), Integer(2)), Mul(Integer(2), Symbol('x')), Integer(1))"
>>> sym.print_tree(e, assumptions=False)
Add: x**2 + 2*x + 1
+-One: 1
+-Pow: x**2
| +-Symbol: x
| +-Integer: 2
+-Mul: 2*x
  +-Integer: 2
  +-Symbol: x
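
Every node in this tree can be inspected and rebuilt generically through the func and args attributes, which is the basis of most generic expression manipulation in SymPy:

>>> e.func
<class 'sympy.core.add.Add'>
>>> e.args
(1, x**2, 2*x)
>>> e.func(*e.args) == e
True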

Most work that goes on in SymPy is done either on the internals of this expression system or on the other code that operates with these expressions. In many ways this system is nice but there are also many problems with it. Essentially it is designed with an emphasis on what would be nice for end users who are doing simple things and as a result is not well suited as a foundation for building more complex algorithms. What we do not have is any alternative that can be used instead of this system for the internals of SymPy.

Many of the problems with SymPy, and in particular its performance, stem from overuse of this expression system. Unfortunately the prominent exposure of the symbolic subsystem in the SymPy API makes it very difficult to change and so realistically the best path forward is to reduce the usage of this system at least in the internals of SymPy. That is difficult though because we usually do not have a clear alternative to use instead and most SymPy developers do not know how to use SymPy other than through this system.

When I say that the symbolic subsystem is overused I should be clear about what the alternatives would be:

  • Use a symbolic subsystem that has a very different design.

  • Use the computational algebra subsystem instead.

Much of the work that I have done recently in SymPy has been to try to expand the computational algebra subsystem, to make it faster and to make more use of it for heavier algorithms in the internals of things like solve, integrate etc. Many things in SymPy (e.g. matrices) can almost immediately be made a lot faster simply by having them use the computational algebra subsystem instead of the symbolic subsystem. I will talk more about this later.

Of course, as a CAS that is primarily intended for symbolics, SymPy has many things that do need to use a symbolic subsystem, but the current design of core symbolics in SymPy is not suitable for most of those things. The problems manifest in terms of:

  • speed: many things are much slower than they could be.

  • behaviour: many things are more difficult to do than they should be.

  • features: many users want to do things that the design cannot really support.

  • bugs: the system is hard to use robustly and this leads to bugs.

  • maintainability: making changes to any part of the expression system (including just fixing obvious bugs) is difficult because the effects of any change are far reaching and unpredictable.

What about SymEngine?

In terms of speed one approach that has been mooted for making the symbolic subsystem faster is to rewrite the “core” in e.g. C++ and this is essentially what SymEngine is. So one possibility to make SymPy faster would be to use SymEngine instead of SymPy’s symbolic subsystem. However SymEngine is not a drop-in replacement for SymPy’s symbolic subsystem and it could never be made to be so. There are two basic problems with attempting to use SymEngine for the core of SymPy:

  • On the one hand SymEngine is not sufficiently similar to SymPy’s existing symbolic subsystem so simply switching to use it inside SymPy would break all sorts of things.

  • On the other hand SymEngine is too much like SymPy’s symbolic subsystem so using it would not solve many of the problems that SymPy has and in fact would make it much harder to solve those problems.

The part of SymPy that SymEngine aims to emulate is literally made up of hundreds of thousands of lines of code and has poorly defined semantics in many cases. It is basically impossible to make any drop-in replacement for this system that would not differ in ways that would break things. At the same time some of the SymEngine interface is deliberately different from SymPy (e.g. evalf above) so changing SymPy’s behaviour to be like SymEngine’s would break compatibility for users and downstream projects.

It is also not possible to adapt the behaviour of SymEngine to be more like what SymPy needs because it is not extensible from Python. Previously the Julia ecosystem used both SymPy and SymEngine for symbolics but subsequently they decided to move away from both and created SymbolicUtils.jl and Symbolics.jl to be the foundation for symbolics in Julia. The reasons given at the time why neither SymPy nor SymEngine was suitable were that:

  • SymPy worked for what they needed but was too slow.

  • SymEngine was fast enough but too inflexible and not extensible from Julia.

These same two considerations actually apply to SymPy itself: SymPy’s symbolic subsystem is too slow for SymPy itself and SymEngine is too inflexible for SymPy to use it as a replacement.

There is a third problem, which the Julia developers seemed to overlook: both SymPy’s symbolic subsystem and SymEngine’s reimplementation of it are based on a design that is not suitable for the kinds of things that SymPy needs, either for its internals or for what many users and downstream libraries would like. There are many aspects to this design problem but the most fundamental one is the problem of automatic evaluation. What I mean by automatic evaluation is basically this:

>>> import sympy as sym
>>> sym.cos(sym.pi/4)
sqrt(2)/2

Quite simply you asked to create the expression \(\cos(\pi/4)\) and SymPy instead gave you the expression \(\sqrt{2}/2\). Maybe that is what you wanted but maybe it is not. The problem is that you do not have any real control over this. There is evaluate=False but it does not work in general and it cannot be made to work in general.
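
For example evaluate=False can prevent evaluation at construction time, but the unevaluated form is fragile because any subsequent operation rebuilds the expression with evaluation enabled again:

>>> e = sym.Add(1, 1, evaluate=False)
>>> e
1 + 1
>>> e + 0  # any further arithmetic re-evaluates the expression
2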

In terms of speed the problem with automatic evaluation is that it makes it impossible to control the performance of higher-level algorithms because every time any expression is created a huge amount of computational work is done in order to try to “evaluate” the expression. Every now and again someone will ask a question like “what is the computational complexity of <some SymPy function>” but if this function uses the symbolic subsystem then it is entirely impossible to answer this question: unbounded computation can occur just during the creation of an expression.

We can try to reduce the cost of the computational work during automatic evaluation but actually we really need to be able to reduce it to zero which is what it would be if there were no automatic evaluation. Creating an expression needs to be so cheap that we think of it as free compared to doing any actual computation (many SymPy contributors do think of it as free which leads them to write very inefficient code).

Many other problems with the design of SymPy’s symbolic subsystem could in principle be fixed without most users really noticing the internal changes. Almost all code that uses the symbolic subsystem relies on automatic evaluation though and so simply making a change to disable evaluation would break almost everything.

The reason this makes it difficult for SymPy to use SymEngine is that SymEngine is also based on automatic evaluation, in such a way that it is impossible to control from the outside. In the case of SymEngine evaluate=False is not an option and I believe there are even plans for SymEngine’s internal data structures to change so that it would be impossible to implement something like evaluate=False. SymEngine broadly does “less” automatic evaluation than SymPy, which is one reason why it is faster, but there is still some. Even just swapping the order of the terms here is unacceptable if there is no way to disable it:

>>> import symengine as se
>>> x = se.symbols('x')
>>> se.exp(x) + x
x + exp(x)

A new symbolic subsystem

What SymPy needs is a symbolic subsystem that is not based on automatic evaluation. Of course many users would still want automatic evaluation and there should be a way to provide that. It is easy though to build a system that has automatic evaluation on top of a system that does not whereas it is impossible to build a system that does not have automatic evaluation on top of a system that does. The current design of Basic and Expr in SymPy and all of their hundreds of subclasses builds automatic evaluation into the core data structures of the symbolic subsystem. This is a fundamental design flaw that causes all kinds of problems for speed, behaviour and extensibility. Using SymEngine (in its current form) can make this subsystem faster but it would then make all of the other problems unsolvable.
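
To illustrate why the layering only works in one direction, here is a minimal sketch (purely hypothetical, not SymPy’s actual design or API) of inert expression nodes with evaluation as a separate, optional pass built on top:

from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    head: str     # e.g. "Add", "Integer", "Symbol"
    args: tuple   # child Nodes or raw data; construction never evaluates

def Integer(n):
    return Node("Integer", (n,))

def Symbol(name):
    return Node("Symbol", (name,))

def Add(*terms):
    # O(1) construction: no flattening, sorting or collecting of terms.
    return Node("Add", tuple(terms))

def evaluate(node):
    # An *optional* pass that folds constant Integer terms in an Add.
    if node.head != "Add":
        return node
    args = [evaluate(a) for a in node.args]
    const = sum(a.args[0] for a in args if a.head == "Integer")
    rest = tuple(a for a in args if a.head != "Integer")
    if not rest:
        return Integer(const)
    return Node("Add", rest + ((Integer(const),) if const else ()))

x = Symbol("x")
expr = Add(Integer(1), Integer(1), x)  # stays exactly as written: 1 + 1 + x
print(evaluate(expr))                  # constants fold only when requested

An auto-evaluating interface could be layered on top simply by calling evaluate inside the constructors, but the reverse is impossible: once the constructors themselves evaluate, the unevaluated form is destroyed at creation time.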

The solution to this problem is to build a new symbolic subsystem but it really needs to be built from the ground up: there is no viable way to get there through incremental changes to the existing system. This can be done for the internals of SymPy to get a lot of the benefit in terms of speed and behaviour without breaking compatibility for users and downstream projects (which could still use the existing system).

At some point though it would be better to switch the user facing symbolic system to an implementation that would be based internally on that new subsystem. We can try to minimise the noticeable impact of this change but it would definitely be a breaking change. There has been some discussion of a SymPy 2.0 that would make some backwards incompatible changes and this is the number one thing that should be done. In my opinion we are not ready for SymPy 2.0 until we are ready to switch out the internals of the symbolic subsystem. I don’t see any other potential change that is important enough to warrant any major compatibility break.

Of course if SymPy were redesigned to have a hypothetical new symbolic subsystem with a different design then either SymEngine or something similar (let’s call it SymEngineX) could be designed to provide a faster implementation of that subsystem. The new system though should be designed so that it is possible for a faster implementation to be created without all of the problems that would apply to SymPy using SymEngine as it is now. In particular:

  • The parts that would be provided by SymEngineX must be a small part of the code that makes up the whole system (rather than a rewrite of everything).

  • Those parts would need to have well defined semantics such that it is actually possible to make multiple compatible implementations (at least the SymEngineX implementation would need to be compatible with the pure Python SymPy implementation).

  • The behaviour of the overall system still needs to be controllable from Python so none of the evaluation etc rules of SymEngineX could be hard-coded (in the way that SymPy and SymEngine both hard-code everything right now).

If a new symbolic subsystem was designed in this way then it would become much easier to make an alternate core that would play the role that SymEngine was intended to play in relation to SymPy. It could be SymEngine itself that still provides that core but the parts that SymPy would need from it would look very different to what SymEngine provides right now.

I want to be clear here that I do not want to criticise the excellent work done by many contributors to both SymPy and SymEngine working on these symbolic systems. There is a lot of great code in both projects making up these systems but there are also problems that are just not fixable without a complete rebuild of the internals starting from the foundations. The existing code and contributions would not be wasted but much of that code would need to be adapted over time to a new framework.

I will talk about my proposals for the symbolic subsystem more concretely in a separate blog post.

The computational algebra subsystem

SymPy’s computational algebra subsystem is the third component of what I call the “core” of SymPy and is located in the sympy.polys package. This subsystem is underused and underappreciated but is absolutely critical to much of SymPy’s ability to do anything nontrivial. In a CAS that was aimed more at pure mathematics this subsystem would be the most prominent feature but in SymPy it is mostly hidden away and not well understood. Few SymPy users or developers know how to use the lower levels of this system and few are aware of the features that it provides or why you might want to use them instead of the “higher level” symbolic subsystem.

To give a clear example of how this subsystem is used I will show how to convert an expression from the symbolic subsystem into the computational algebra subsystem using the high-level Poly representation:

>>> import sympy as sym
>>> x = sym.Symbol("x")
>>> e = (x + 1)**2
>>> p = sym.Poly(e, x)
>>> p
Poly(x**2 + 2*x + 1, x, domain='ZZ')
>>> p.as_expr()
x**2 + 2*x + 1

The Poly representation is the highest level representation in the computational algebra subsystem but its internals are still very different from the symbolic subsystem. We can see this by looking at the internal representation:

>>> p.rep.rep
[1, 2, 1]

This is a representation of the polynomial as a list of coefficients. Within the computational algebra subsystem this is referred to as the dense univariate polynomial (DUP) representation. In this representation we cannot represent the unexpanded power (x + 1)**2 but instead we have to expand it out to x**2 + 2*x + 1 because we can only store the coefficients of the expanded form. This is a very simple example but this computational algebra subsystem can represent much more complex expressions. I have described this subsystem in more detail in this page of the SymPy documentation:

https://docs.sympy.org/latest/modules/polys/domainsintro.html
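
As a small taste of that lower level, the DUP representation can be operated on directly with functions like dup_mul (the coefficients here are elements of the domain ZZ, so their exact repr depends on whether gmpy2 is installed):

>>> from sympy import ZZ
>>> from sympy.polys.densearith import dup_mul
>>> dup_mul([ZZ(1), ZZ(1)], [ZZ(1), ZZ(1)], ZZ)  # (x + 1)*(x + 1)
[1, 2, 1]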

Many important SymPy functions like factor, cancel etc work by converting from the symbolic subsystem to the computational algebra subsystem, doing some computation and then converting back to the symbolic subsystem. These conversions back and forth between the two systems are quite expensive and often just converting back to the symbolic subsystem can be more expensive than the actual calculation that is done in the computational algebra system. This is because just creating expressions is slow in the symbolic system (mainly because of automatic evaluation). It is much more efficient to perform an entire calculation in the computational algebra subsystem without converting back and forth but any code that does this needs to be written by someone who has some understanding of how to use the computational algebra subsystem. At least with the conversions someone can write code that seems like it only uses the symbolic subsystem without needing to know anything about the computational algebra subsystem. The price for that convenience is that it makes various things in SymPy slower than they need to be.
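
For example a whole calculation can be kept inside the computational algebra subsystem by using the lower-level ring interface, converting back to a symbolic expression (if at all) only at the very end:

>>> from sympy import ZZ
>>> from sympy.polys.rings import ring
>>> R, x = ring("x", ZZ)
>>> p = (x + 1)**2 * (x - 1)  # all arithmetic happens inside the ring
>>> p
x**3 + x**2 - x - 1
>>> p.as_expr()
x**3 + x**2 - x - 1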

So many things could be made faster if SymPy just used the computational algebra subsystem more. The problem is that a certain mathematical background is needed just to understand how to use it and most of the people who use and contribute to SymPy do not have that background. This is also not helped by the fact that the system is not well documented: there is a lot of documentation but mostly it has not been written so that someone who does not have a background in computational algebra could understand even what the basic functions do (I wrote the doc page linked above to try to address this partially).

Speaking of documentation, I think it is quite telling about both SymPy users and developers that, while I am here describing the sympy.polys module as one of the three pillars in the foundation of all of SymPy, the docs do not even mention it at the top level (as of SymPy 1.12):

https://docs.sympy.org/latest/reference/index.html#reference

The main front page of the API docs considers that “polynomials” should just be listed under “topics”, unlike ntheory (number theory) which gets a whole section to itself. The tutorial does not mention the polys module directly, there is nothing in the “guides” section about it, etc. To give some context, here is a very rough count of how many lines of code are in each of SymPy’s top-level modules (this includes tests, comments, docstrings etc):

polys            100651
physics           80805
core              58583
printing          48638
solvers           43966
functions         42605
matrices          32942
utilities         32248
combinatorics     26144
integrals         25588
parsing           25213
stats             22356
tensor            19663
simplify          18296
geometry          15534
assumptions       11524
ntheory           11464
series            11258
plotting          11034
sets              10400
concrete           7093
logic              6929
vector             6772
codegen            6489
categories         4730
testing            4610
holonomic          4299
crypto             3958
calculus           3665
diffgeom           3046
liealgebras        2189
external           2182
algebras           2086
discrete           1813
interactive        1416
strategies         1406
multipledispatch   1239

The other things listed as “topics” alongside polys are geometry, holonomic, liealgebras, categories, crypto, diffgeom, plotting, and stats. Some of these are barely used by anyone while others like plotting are widely used and should also not be listed here. The “polynomials” module is the largest module in SymPy, making up over 10% of all of the code and used by everything else, but it is sorely undervalued by the SymPy community.

The basic problem here is that SymPy is a project that is mostly used and developed by people who are not pure mathematicians and the polys module does not look like something that would be relevant to them. We need to understand though that a CAS needs to have a computational algebra subsystem in order to do the things that scientists and engineers want it to do as well. Higher level functions like solve, integrate etc that are used by scientists and engineers need to be built on top of a computational algebra subsystem in order to be fast and reliable.

Maintenance of the computational algebra subsystem

When I started working on SymPy I did not have a background in computational algebra and I also did not understand how to use this subsystem (the polys module). Over time I have learned more about it and I have also learned how to use it and I have made various improvements to it. To reach this point I had to read a lot of textbooks and papers and look at what other (more algebra-oriented) CASs do.

To begin with, what I found was a large, almost unmaintained subsystem that was critical to SymPy but did not have any active maintainer working on it. Kalevi Suominen was the only SymPy developer who seemed to really understand this part of the codebase; he was critical to ensuring that at least pull requests could be reviewed, but was not actively working on major improvements or any redesign. Kalevi’s guidance made it possible for me to make improvements to this subsystem while in the process of learning about it.

The code in the polys module is well designed. Of course there can always be improvements but the fundamentals of the design are good. Looking back to before my own involvement in SymPy, it looks like it was originally developed by Mateusz Paprocki and others. It is clear from the code that it was at some point well maintained and loved by people who understood it well, cared about it, and had some vision for how it should work and what the plan was for the future. It looks as though those people at some point just moved on, leaving the code with various things in a sort of work-in-progress state or in the middle of a redesign that was never finished. This is probably the hardest part about trying to work on it: there are lots of pieces there but no one remembers what the plan was supposed to be for the future.

More recently though there has been renewed interest in this subsystem and there are several maintainers who can understand the code and make improvements. The most significant recent improvements are:

  • The addition of the DomainMatrix class, which has mostly been my own work (I will talk about this more later; a brief example follows this list).

  • The significant redesign and expansion of the sympy.polys.numberfields module by Steve Keiffer who has also added some significant algorithms and improved others in parts of the code that had been untouched for years.
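
To give a rough idea, DomainMatrix is a matrix type whose elements all belong to a fixed domain such as ZZ rather than being arbitrary symbolic expressions (a quick illustration; the exact repr of the domain elements depends on the ground types in use):

>>> from sympy import Matrix
>>> from sympy.polys.matrices import DomainMatrix
>>> M = Matrix([[1, 2], [3, 4]])
>>> dM = DomainMatrix.from_Matrix(M)
>>> dM
DomainMatrix([[1, 2], [3, 4]], (2, 2), ZZ)
>>> dM.det()
-2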

There is another very relevant contribution that I want to draw attention to, which is a PR by another maintainer, Sangyub Lee.

The significance of this PR is that it transfers the calculation of a particular operation (here the Matrix.eigenvects method) from the symbolic subsystem to the computational algebra subsystem. This is a very good example of exactly what needs to be done more in SymPy and how things that many SymPy users could relate to (eigenvectors in this example) can be computed much faster and better simply by using the computational algebra subsystem more.

The main point here though is just that the polys module is not (almost-)unmaintained any more. There are multiple maintainers who can understand it and make improvements to it and we are now in a better position to improve its capabilities and expand its use within SymPy.

Improvements to the computational algebra subsystem

Above I casually suggested that the numeric subsystem should be completely rewritten to use Arb-like semantics. I then went on to say that the symbolic subsystem should be rewritten (which is a major undertaking for SymPy!). I would not say the same thing about the computational algebra subsystem:

  • The fundamentals of the design are good.

  • Much of the code is written exactly how it should be written and is well micro-optimized.

  • The more widely used parts are mature and well tested.

Of course there can be improvements like adding more features, improving algorithms and so on, but this system can be improved incrementally. There is however a major limitation here. While I say that much of the code is written exactly how it should be written, it is all Python code, and the lower level parts of this subsystem are precisely the sort of thing that should be written in a lower level language like C. It is clear from the code that this was always part of the development plan but it just didn’t happen. Happily we do not need to rewrite everything in C ourselves because there is now a C library that does exactly what we need: Flint. I have been working on making it possible for SymPy to leverage Flint, which is a library for doing fast computations with polynomials, matrices and other things in the same way that SymPy’s computational algebra subsystem does.
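
As a small taste of Flint, here is the (x + 1)**2 polynomial from earlier using the python-flint bindings (a sketch assuming python-flint is installed; the printed forms may vary between versions):

>>> from flint import fmpz_poly
>>> p = fmpz_poly([1, 2, 1])  # coefficients from lowest degree: 1 + 2*x + x**2
>>> print(p)
x^2 + 2*x + 1
>>> p.factor()
(1, [(x + 1, 2)])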

As for the other topics I mentioned above, I will talk about this more in a subsequent post.

Summary

Above I have described what I consider to be the subsystems that make up the core of SymPy and what the current state of each of them is, as well as what I think needs to be done to improve them. In subsequent posts I will talk about each of these in more detail to explain what has been done and what should be done going forward. Briefly though:

  • The numeric subsystem should be rewritten to use Arb-like semantics and should be able to use Arb directly at least as an optional backend.

  • The symbolic subsystem should be rebuilt from the ground up based on a non-evaluating representation of expressions.

  • The computational algebra subsystem should be made faster both by improving its algorithms and also by leveraging Flint.

  • The computational algebra subsystem should be used more by the rest of SymPy.