Towards a new SymPy: part 2 - Polynomials
==========================================

This is the second part of a series of posts on big changes for SymPy with particular focus on speed. The other posts in this series can be found at :ref:`new sympy`. The first post outlined what I consider to be the three core subsystems of SymPy:

1. Robust numerical evaluation (``sympy.core.evalf``)
2. The symbolic subsystem (``sympy.core.basic``)
3. The computational algebra subsystem (``sympy.polys``)

In relation to the computational algebra subsystem, there are three ways in which SymPy can be made faster:

1. Improve/extend some of the algorithms and features.
2. Make more use of the computational algebra subsystem in the rest of SymPy.
3. Make use of FLINT, which is a fast C library for computational algebra.

This post will describe SymPy's computational algebra system for polynomials and how each of these steps could be applied to speed up SymPy. I will talk a bit about FLINT and python-flint but I will also write a separate post about those because I know that some people will be more interested in using python-flint than SymPy itself and I hope to encourage them to contribute to python-flint.

As before I am writing this with the intention that it should be to some extent understandable to non-SymPy developers. The primary intended audience though is other SymPy developers because I want them to understand the significance of the work done so far and the changes that I think are needed for the future.

TLDR
----

The short summary of this post is that:

- SymPy's polynomial code is good and is well structured, although a minor redesign is needed to make use of sparse polynomials.
- There are some algorithms that could be improved like polynomial gcd, but most are pretty good although limited in speed by being pure Python.
- It is now possible to use python-flint to bring state of the art speeds to SymPy's polynomial operations.
- Some work is needed to enable big speedups from python-flint, but from here it is not difficult to do that.

The rest of this post is quite long, mainly because there are a lot of details about the computational algebra subsystem that need to be understood in order to follow what the problems are and what can or should be done to improve things.

Integers and rationals
----------------------

First, I will give a brief overview of the computational algebra subsystem. In SymPy the computational algebra subsystem is located in the ``sympy.polys`` module. The lowest level of this system are the "domains", which are objects that represent rings and fields. Briefly, a ring is a set of "elements" that can be added and multiplied, and a field is a ring in which it is also possible to divide any two elements (except by 0). The most well known examples of each are the ring of integers :math:`\mathbf{Z}` and the field of rational numbers :math:`\mathbf{Q}`.

Engineers and scientists (rather than pure mathematicians) might wonder why we need to distinguish carefully between e.g. the "ring" of integers or the "field" of rationals and worry about whether or not division is "possible". In some sense we can always divide two integers (except by 0) and just get a rational number.
Certainly in SymPy's *symbolic* subsystem there is no difficulty in dividing two integers, although we do need to be careful to divide SymPy's ``Integer`` objects and not Python's ``int`` objects (whose division operator gives ``float``)::

    >>> from sympy import Integer
    >>> Integer(1)/Integer(2)  # dividing Integers gives a Rational
    1/2
    >>> 1/2  # dividing ints gives a float
    0.5

The basic reasons that we need to distinguish carefully between rings and fields, and between :math:`\mathbf{Z}` and :math:`\mathbf{Q}` specifically, are not particularly deep or mathematical but rather purely *computational*. Efficient computing means having well defined types, representations and operations. On a very basic level the representation of an integer is a sequence of bits and the representation of a rational number is a pair of integers. Arithmetic operations manipulate these representations and so an efficient arithmetic-heavy algorithm needs to use a well defined representation for its elementary types. If we choose our representation to be :math:`\mathbf{Z}` then we cannot represent :math:`1/2` without converting to a different representation and so (inexact) division is not allowed. Of course we can always convert any element of :math:`\mathbf{Z}` to an element of :math:`\mathbf{Q}` but this conversion is not a free operation, and the representation as an element of :math:`\mathbf{Q}` would be less efficient if actually all of our elements are integers and we do not need to divide them.

The other reason that we need to distinguish between e.g. :math:`\mathbf{Z}` and :math:`\mathbf{Q}` in particular, or more generally between different "domains", is that we usually want to use different algorithms for different domains. For example, the most efficient algorithm for inverting a matrix or solving a linear system of equations over :math:`\mathbf{Z}` is different from the most efficient algorithm for doing the same over :math:`\mathbf{Q}`. Often the algorithms will convert between domains: for example, to invert a matrix of integers we might convert to :math:`\mathbf{Q}` and use a division-based algorithm. Alternatively, to invert a matrix of rational numbers we might factor out denominators and convert to a matrix over :math:`\mathbf{Z}` before using a fraction-free algorithm for inverting a matrix over :math:`\mathbf{Z}`.

In SymPy's domain system there are domain objects like ``ZZ`` and ``QQ`` that represent :math:`\mathbf{Z}` and :math:`\mathbf{Q}` respectively. These domain objects are used to construct elements of the domain and to convert between domains, e.g.::

    >>> from sympy import ZZ, QQ
    >>> QQ(2,3)  # construct an element of QQ
    MPQ(2,3)
    >>> q = QQ(2)  # construct an integer-valued element of QQ
    >>> q
    MPQ(2,1)
    >>> z = ZZ.convert_from(q, QQ)  # convert from QQ to ZZ
    >>> z
    2
    >>> type(z)
    <class 'int'>

In this example we construct the rational number :math:`2/3`. Its representation is the ``MPQ`` type, which is a pure Python implementation of rational numbers based on Python's ``int`` type (analogous to the Python standard library ``fractions.Fraction`` type). The representation for ``ZZ`` uses Python's ``int`` type.

GMP and gmpy2
-------------

I mentioned the two domains ``ZZ`` and ``QQ`` above and I will describe the other domains below. Since these two domains are absolutely foundational it is worth discussing their implementation a bit first, because it is important to understand the role that gmpy2 and the underlying GMP C library play in this.
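
One quick way to see which implementation SymPy has picked up at import time is the ``GROUND_TYPES`` constant from ``sympy.external.gmpy`` (the same constant is used in a patch shown later in this post). A minimal sketch, where the exact strings are an assumption on my part::

    >>> from sympy.external.gmpy import GROUND_TYPES
    >>> GROUND_TYPES  # e.g. 'gmpy' if gmpy2 was found, otherwise 'python'
    'gmpy'
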
If gmpy2 is installed then SymPy will use its ``mpz`` and ``mpq`` types for ``ZZ`` and ``QQ`` respectively. These are based on the underlying GMP C library::

    >>> from sympy import ZZ, QQ
    >>> ZZ(2)  # pip install gmpy2 to see this
    mpz(2)
    >>> QQ(2,3)
    mpq(2,3)

Here is a plot comparing timings for multiplication of integers with and without gmpy2 for different bit sizes:

.. image:: scripts/time_ZZ.svg

For small integers ``mpz`` is about the same speed as Python's ``int`` type. Potentially ``mpz`` is actually a little slower for very small integers just because CPython's ``int`` is heavily micro-optimised for small integers. In Python all integers use a single arbitrary precision type ``int``, which is unusual among programming languages. Since most integers are small, CPython tries to optimise for this case. Possibly also CPython can do some optimisations with ``int`` that it cannot do with "third party" types like gmpy2's ``mpz``.

For larger integers (bigger than machine precision) both ``mpz`` and ``int`` will represent an integer using multiple "digits", although the digits are referred to as "limbs" in GMP. CPython uses 30-bit limbs because its implementation of integer arithmetic is designed to be portable "generic C" that can be compiled by any compiler. By contrast GMP's implementation is designed to be as fast as possible at all costs and so it uses handcrafted assembly code for many different CPU architectures. In CPU-specific assembly it is possible to access instructions like ``mulx`` or to read the CPU's carry flag after an addition etc. These things are essential for being able to use 64-bit limbs but are not available in generic C. GMP's ``mpz`` type therefore uses 64-bit limbs (as do all widely used fast big integer implementations).

As the bit size increases, say to 1000 bits, gmpy2's ``mpz`` becomes a lot faster than CPython's ``int``. At this bit size the difference in speed to multiply two integers is something like 10x and is primarily due to the fact that ``mpz`` uses 64-bit limbs whereas ``int`` uses 30-bit limbs. This smaller limb size essentially means a 4x increase in the number of CPU-level operations needed for a big integer multiplication.

For very large integers gmpy2's ``mpz`` is *much* faster than ``int``. This is because GMP has a whole hierarchy of algorithms for different bit sizes. CPython's ``int`` type tops out at the Karatsuba algorithm, which is also used by GMP for intermediate bit sizes, but GMP has more complex algorithms that are used for larger bit sizes. At these large bit sizes the difference in speed between ``mpz`` and ``int`` can be enormous. Here we focus on multiplication, which is probably more favourable to ``int`` than some other operations would be. In fact the slowness of some algorithms used by CPython's ``int`` type was even recently considered to be a security vulnerability, to the extent that certain operations were disabled: https://discuss.python.org/t/int-str-conversions-broken-in-latest-python-bugfix-releases/18889

That did lead to some work on improving CPython's ``int`` algorithms but the disabled operations remain disabled even though the algorithms have been improved to some extent::

    >>> import math
    >>> x = math.factorial(1559)
    >>> print(x)
    ...
    ValueError: Exceeds the limit (4300 digits) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit

This plot shows timings for multiplication of rational numbers with and without gmpy2 for different bit sizes:

.. image:: scripts/time_QQ.svg

For small rational numbers gmpy2's ``mpq`` type is about 10x faster than SymPy's pure Python ``MPQ`` type when both are used from Python code for multiplication as shown here. The difference can be up to about 20x for general arithmetic-heavy algorithms. This difference is primarily due to gmpy2 being implemented in C and SymPy's ``PythonMPQ`` being implemented in pure Python. For intermediate bit sizes gmpy2's ``mpq`` is only about 2-3x faster because the bulk of the time is spent in ``gcd`` and CPython's ``math.gcd`` function is implemented in C (the Python overhead is less significant at larger bit sizes). The difference in timings here most likely reflects GMP being more micro-optimised with CPU-specific assembly, 64-bit limbs etc. At larger bit sizes gmpy2's ``mpq`` becomes asymptotically faster and this is again due to asymptotically faster integer multiplication: every integer operation including ``gcd`` will ultimately reduce to multiplication at large sizes.

What this all means is that for some operations SymPy is a lot faster when gmpy2 is installed. Some people reading this might think that it seems absurd to worry about the performance of megabyte-sized integers but many symbolic algorithms will generate much larger integers than might be expected. Also when gmpy2 is installed it will be used by mpmath and so it speeds up SymPy's numeric subsystem as well as the computational algebra subsystem.

The symbolic subsystem is the only one of the three core subsystems of SymPy that does not use gmpy2. I did recently look at changing SymPy to have the symbolic ``Integer`` and ``Rational`` types use ``mpz`` and ``mpq`` when gmpy2 is installed but hit a stumbling block: the symbolic subsystem allows "unevaluated rationals" whose gcd is not cancelled::

    >>> from sympy import Rational
    >>> Rational(3, 6, 1)
    3/6

That is apparently documented behaviour. It is not really a useful feature though because the symbolic subsystem could also represent the same thing using ``Mul`` and ``Pow``. Having the symbolic ``Rational`` type use gmpy2's ``mpq`` for its internal representation would break these "unevaluated rationals" (that should be done anyway though).

One of the nice things about the computational algebra subsystem is that it is possible to swap out the implementation of the domain objects like this. So on the one hand SymPy and its only hard dependency mpmath can be used as entirely pure Python code without gmpy2. This is useful for many people who do not need the performance of gmpy2 and do not want to install it, or cannot install it, for example if they use a different Python implementation like PyPy. On the other hand if gmpy2 is installed then SymPy will use it and will be a *lot* faster for some operations but with *no other observable change in behaviour*. (Contrast this last point with my previous comments about the difficulty of SymPy using SymEngine to speed up the symbolic subsystem in :ref:`symengine`.)

The domain system
-----------------

I have talked a lot about the two domains ``ZZ`` and ``QQ`` but there are many more domains in SymPy.
You can read more about them here: https://docs.sympy.org/latest/modules/polys/domainsintro.html

People who are familiar with computational algebra will recognise these domains:

- ``GF(n)``: integers mod ``n`` (the name ``GF`` is misleading)
- ``ZZ``: the integers
- ``QQ``: the rational numbers
- ``ZZ_I``: Gaussian integers
- ``QQ_I``: Gaussian rationals
- ``QQ(a)``: algebraic number field generated by ``a``
- ``RR``: the real numbers (floats with fixed precision provided by mpmath)
- ``CC``: the complex numbers (complex floats with fixed precision)
- ``K[x]``: polynomials in e.g. ``x`` with coefficients in another domain ``K``
- ``K[x,y]``: multivariate polynomials in ``x`` and ``y``
- ``K(x,y)``: rational functions (ratios of polynomials) in ``x`` and ``y``
- ``EX``: the expression domain (basically the symbolic subsystem)
- ``EXRAW``: the raw expression domain

There are more domains but these are the most important ones. Perhaps the easiest way to see what the domains are for is by using the ``construct_domain`` function. This function is used internally by SymPy to choose a domain that can represent some expression from the symbolic subsystem in the domain system::

    >>> from sympy import *
    >>> x, y = symbols('x, y')
    >>> construct_domain([1, 2])
    (ZZ, [mpz(1), mpz(2)])

Here we asked for a domain that could represent both ``1`` and ``2``. What was returned was the domain ``ZZ`` (meaning the integers) and a list of two elements representing ``1`` and ``2`` in that domain (as gmpy2 ``mpz`` objects in this case). We can try more examples::

    >>> construct_domain([x**2, 1])
    (ZZ[x], [x**2, 1])
    >>> construct_domain([x**2, y])
    (ZZ[x,y], [x**2, y])
    >>> construct_domain([x**2, y/x])
    (ZZ(x,y), [x**2, y/x])
    >>> construct_domain([sin(x), y])
    (ZZ[y,sin(x)], [(sin(x)), y])
    >>> construct_domain([x, 2.0])
    (RR[x], [x, 2.0])
    >>> construct_domain([x/2, 1])
    (QQ[x], [1/2*x, 1])

In each case ``construct_domain`` tries to find the simplest domain that can represent all of the expressions.

Sparse and dense polynomials
----------------------------

Importantly, the polynomial domains are always implemented as "sparse" polynomials. This means that only nonzero terms are stored. This example contrasts the sparse and dense representations of polynomials::

    >>> from sympy import QQ, symbols
    >>> x, y = symbols('x, y')
    >>> e = x**10 + y
    >>> p_sparse = QQ[x,y].convert(e)
    >>> p_dense = e.as_poly()

This is how the sparse ``PolyElement`` and the dense ``Poly`` usually look::

    >>> p_sparse
    x**10 + y
    >>> p_dense
    Poly(x**10 + y, x, y, domain='ZZ')

This is what their internal representations look like::

    >>> dict(p_sparse)  # internal sparse representation
    {(0, 1): mpq(1,1), (10, 0): mpq(1,1)}
    >>> e.as_poly().rep.rep  # internal dense representation
    [[mpz(1)], [], [], [], [], [], [], [], [], [], [mpz(1), mpz(0)]]

Here the sparse representation is a dictionary mapping exponent tuples to coefficients. The dense representation is a list of lists of coefficients. The empty lists in the dense representation represent zero terms. The dense representation is described as "dense" because it needs to store zero terms explicitly.

Integers mod ``n``
------------------

Some domains that people might expect to find in the domain system are missing, like finite fields of non-prime order e.g. ``GF(2**3)``. Essentially that is because most SymPy users are not interested in such things and the rest of the codebase does not need them.
Some people have expressed interest in these and contributions are certainly welcome, but I guess it has not happened because it is not considered high priority and most users who are interested in such things are more likely to use something like Sage rather than SymPy. Probably most SymPy users are not interested in ``GF(n)`` (integers mod ``n``) either but it is there because it is needed for the algorithms in the other domains.

Let me give an example to show how all of this is used::

    >>> from sympy import symbols, factor
    >>> x, y = symbols('x, y')
    >>> e = x**4 - y**4/16
    >>> e
    x**4 - y**4/16
    >>> factor(e)
    (2*x - y)*(2*x + y)*(4*x**2 + y**2)/16

So how does this work? First ``factor`` converts the expression to a polynomial with coefficients in some domain::

    >>> p = e.as_poly()
    >>> p
    Poly(x**4 - 1/16*y**4, x, y, domain='QQ')
    >>> p.domain
    QQ

Here the ``Poly`` identifies that we have two variables ``x`` and ``y`` and that the coefficients are in the domain ``QQ`` (the rational numbers). We are now ready to call the factorisation algorithm. The factorisation algorithm for polynomials with coefficients in ``QQ`` will first factor out the denominator ``16``, giving a polynomial ``16*x**4 - y**4`` with coefficients in ``ZZ``. We now want to factorise this, but then the algorithm for factorising polynomials over ``ZZ`` will convert the problem to factorising polynomials over ``GF(p)`` for some prime ``p``. Then we compute the factorisation over ``GF(p)`` and convert the result back to a factorisation over ``ZZ`` and so on. So the steps in the computation are (with ``EX`` representing the ordinary symbolic expressions)::

    EX -> QQ[x,y] -> ZZ[x,y] -> GF(p)[x,y] -> factored -> ... -> EX

The ``...`` here obscures a bunch of complexity that I don't want to get into. For those familiar with these things, the Zassenhaus algorithm is used by default but the main weakness is that LLL-based techniques are not implemented (so the worst case is not polynomial time). For everyone else: the algorithms used here are usually good but for certain inputs ``factor`` can be very slow when a different algorithm would be a lot faster.

The main point of this factorisation example is just to show the significance of all the different domains including ``GF(p)``. Most SymPy users do not care about ``GF(p)`` but it is crucial for things that they do care about because it is used by e.g. ``factor`` which is in turn used by ``solve`` and ``simplify`` and so on.

Algebraic number fields
-----------------------

In this example we get ``EX``, which is what ``construct_domain`` returns when it gives up::

    >>> construct_domain([sqrt(2), 1])
    (EX, [EX(sqrt(2)), EX(1)])

There is a domain for this but it will not be used by default (we have to pass ``extension=True``)::

    >>> construct_domain([sqrt(2), 1], extension=True)
    (QQ<sqrt(2)>, [ANP([mpq(1,1), mpq(0,1)], [mpq(1,1), mpq(0,1), mpq(-2,1)], QQ), ANP([mpq(1,1)], [mpq(1,1), mpq(0,1), mpq(-2,1)], QQ)])

It might not look nice but that is the domain for the algebraic number field :math:`\mathbb{Q}(\sqrt{2})`. The ``ANP`` stands for "algebraic number polynomial". The representation of algebraic number fields always uses a *primitive element*. This representation is efficient for arithmetic but computing the primitive element can be expensive, which means that it can be slow to construct the domain.
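
The same kind of domain can also be constructed directly with ``QQ.algebraic_field``. A minimal sketch (I have not shown the printed forms since they are long, and the exact reprs are not the point here)::

    >>> from sympy import QQ, sqrt
    >>> K = QQ.algebraic_field(sqrt(2), sqrt(3))  # a primitive element is computed here (the slow part)
    >>> a = K.from_sympy(sqrt(2) + sqrt(3))
    >>> b = a**3 - a                              # arithmetic on the field elements themselves is cheap

Each extra generator means computing a primitive element for a larger field, which is where the construction time goes.
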
Timings are (on a slow computer)::

    In [10]: %time ok = construct_domain([sqrt(2)], extension=True)
    CPU times: user 26 ms, sys: 0 ns, total: 26 ms
    Wall time: 25.2 ms

    In [11]: %time ok = construct_domain([sqrt(2), sqrt(3)], extension=True)
    CPU times: user 47.4 ms, sys: 0 ns, total: 47.4 ms
    Wall time: 46.3 ms

    In [12]: %time ok = construct_domain([sqrt(2), sqrt(3), sqrt(5)], extension=True)
    CPU times: user 55.2 ms, sys: 0 ns, total: 55.2 ms
    Wall time: 53.9 ms

    In [13]: %time ok = construct_domain([sqrt(2), sqrt(3), sqrt(5), sqrt(7)], extension=True)
    CPU times: user 120 ms, sys: 0 ns, total: 120 ms
    Wall time: 118 ms

    In [14]: %time ok = construct_domain([sqrt(2), sqrt(3), sqrt(5), sqrt(7), sqrt(11)], extension=True)
    CPU times: user 688 ms, sys: 0 ns, total: 688 ms
    Wall time: 686 ms

    In [15]: %time ok = construct_domain([sqrt(2), sqrt(3), sqrt(5), sqrt(7), sqrt(11), sqrt(13)], extens
        ...: ion=True)
    ^C^C
    KeyboardInterrupt

I don't know how long that last command would take but I interrupted it after about 5 minutes. I have not investigated why it is so slow but I expect that it can be made faster. I think what it really shows though is that it is a bad idea to even try to compute the primitive element and that it is better to represent algebraic number fields differently in the case of having many algebraic generators.

EX and EXRAW domains
--------------------

There are more domains than listed above but those are the ones that would usually be created automatically within SymPy when the computational algebra subsystem is used implicitly. There will always be some situations where the symbolic subsystem has some expressions that the computational algebra subsystem cannot represent using a standard ring/field from the list above. In those situations it will use the ``EX`` or ``EXRAW`` domains. In these domains the elements are actually just symbolic expressions from the symbolic subsystem. This provides an escape hatch that allows code that expects to work with the domains to fall back on using the symbolic subsystem when a more structured domain cannot be found.

The difference between the ``EX`` and ``EXRAW`` domains is that the elements of the ``EX`` domain are always simplified using the high-level ``cancel`` function, so in this domain ``c = b + a`` is equivalent to writing ``c = cancel(b + a)`` with ordinary SymPy expressions from the symbolic subsystem. The effect of ``cancel`` on a symbolic expression is that it always rearranges an expression into a ratio of expanded polynomials and then cancels the polynomial gcd of the numerator and denominator::

    >>> from sympy import symbols, cancel
    >>> x, y = symbols('x, y')
    >>> e = 1/x + x
    >>> e
    x + 1/x
    >>> cancel(e)
    (x**2 + 1)/x

Calling ``cancel`` on symbolic expressions like this is slow because every call to ``cancel`` has to go through the whole process of identifying a polynomial representation, choosing a domain and converting the expressions into the domain, and then after actually computing the cancelled fraction the result needs to be converted back to the symbolic subsystem. If this sort of simplification is wanted then it is always better to use any of the more structured domains above than to use ``EX``, because that avoids all the cost of these conversions. For some algorithms the automatic expansion and cancellation used in ``EX`` is exactly what is needed as a method of intermediate simplification to speed up a large calculation and return a result in a mostly canonical form.
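
For example, arithmetic with ``EX`` elements passes through ``cancel`` on every operation. A minimal sketch, constructing the elements via ``from_sympy`` (the exact printed form is my assumption rather than verified output)::

    >>> from sympy import symbols
    >>> from sympy.polys.domains import EX
    >>> x = symbols('x')
    >>> a = EX.from_sympy(1/x)
    >>> b = EX.from_sympy(x)
    >>> a + b  # equivalent to cancel(1/x + x)
    EX((x**2 + 1)/x)
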
In some situations though it is preferable not to have this cancellation (which in itself can be slow) and for this the ``EXRAW`` domain is provided. Operations with the ``EXRAW`` domain are precisely equivalent to operations in the symbolic subsystem (without calling ``cancel``). All the reasons that it is difficult to build heavy algorithms over the symbolic subsystem apply to the ``EXRAW`` domain as well. The ``EXRAW`` domain is only really useful for preserving existing behaviour in a situation where we want to change code that currently uses the symbolic subsystem to use the computational algebra subsystem instead. It would almost always be better to use something other than ``EXRAW`` (even if just ``EX``) but if we want to be conservative when making changes then ``EXRAW`` provides a possible compatibility mechanism.

Using the right domains
-----------------------

Having talked a lot about the domain system above, I can now explain how that relates to things sometimes being slow in SymPy and what can be done to improve that.

Firstly, when implementing any arithmetic-heavy algorithm like solving a system of linear equations, all of the domains described above apart from ``EX`` or ``EXRAW`` are almost always faster than any algorithm that could be implemented directly with symbolic expressions. The number one reason for slowness in things like computing the inverse of a matrix is just the fact that many such algorithms do not use the domain system at all and instead use the symbolic subsystem.

Secondly, in many cases the ``EX`` domain is used when it would not be difficult to choose a better domain instead. This is because the mechanism for constructing domains is quite conservative about what it will accept. An example would be::

    >>> from sympy import *
    >>> x, y = symbols('x, y')
    >>> construct_domain([x + y])
    (ZZ[x,y], [x + y])
    >>> t = symbols('t')
    >>> x = Function('x')
    >>> y = Function('y')
    >>> construct_domain([x(t) + y(t)])
    (EX, [EX(x(t) + y(t))])

Here the functions ``x(t)`` and ``y(t)`` should be treated the same as the symbols ``x`` and ``y`` were. A suitable domain can easily be created explicitly::

    >>> domain = ZZ[x(t),y(t)]
    >>> domain.from_sympy(x(t) + y(t))
    (x(t)) + (y(t))

The problem here is just that the code inside ``construct_domain`` rejects this domain because it does not want to create a polynomial ring where the generators have free symbols in common (the ``t`` in this case). The reason for rejecting this is to try to avoid something like this::

    >>> ZZ[sin(t),cos(t)]
    ZZ[sin(t),cos(t)]

This ``ZZ[sin(t),cos(t)]`` domain is invalid for many situations. The problem with it is that it is possible to create an expression that should really be zero but appears not to be zero::

    >>> R = ZZ[sin(t),cos(t)]
    >>> s = R.from_sympy(sin(t))
    >>> c = R.from_sympy(cos(t))
    >>> e = s**2 + c**2 - 1
    >>> e
    (sin(t))**2 + (cos(t))**2 - 1
    >>> R.is_zero(e)
    False
    >>> R.to_sympy(e).trigsimp()
    0

One of the reasons that arithmetic-heavy algorithms with domains are so much faster than with symbolic expressions is that in the domain system any expression that is equal to zero should be simplified automatically to zero. Many algorithms need to know whether expressions are zero or not, so this is an extremely useful property. Just treating ``sin(t)`` and ``cos(t)`` as independent variables in a polynomial ring violates this property. Sometimes that would be fine but in other situations it could lead to bugs. Therefore ``construct_domain`` refuses to create the ring ``ZZ[sin(t),cos(t)]`` to avoid bugs.
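
As far as I can tell, this same conservative check is what sends trigonometric inputs to ``EX`` in the first place, assuming the behaviour matches the ``x(t) + y(t)`` example above::

    >>> construct_domain([sin(t) + cos(t)])
    (EX, [EX(sin(t) + cos(t))])
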
This refusal leads to the ``EX`` domain being used, which is much slower and also potentially subject to precisely the same bugs. The advantage of using the ``EX`` domain here is mainly that other code can at least be aware that the domain is not well defined.

It is perfectly possible to implement a domain that can represent a ring involving both ``sin(t)`` and ``cos(t)``. There are already some kinds of domains that can do this although they are not used by default and also are not quite right for what is needed. What we really want is to be able to make a more complicated ring like this::

    QQ[sqrt(2),m1,m2,k1,k2,sin(theta),cos(theta),sin(phi),cos(phi)]

In science and engineering the need to work with ``sin`` and ``cos`` is very common, so specialised domains are needed that can handle this for many different variables and can recognise trig identities etc. SymPy does not yet have this but adding it would mostly complete the domain system in terms of being able to represent the sorts of expressions that users typically want to work with. This would be particularly beneficial for example in the case of symbolic calculations in mechanics (as in the ``sympy.physics.mechanics`` module) because those calculations involve many systems of linear equations with ``sin`` and ``cos`` in the coefficients. I have an implementation of a domain that could represent the ring above using sparse polynomials and Groebner bases but it is still incomplete.

There are then two common ways that things might become slower than they should be when using the domain system:

- Sometimes the ``EX`` domain is used conservatively when suitable alternative domains are already there and could easily be used.
- Sometimes a suitable domain is not yet implemented (e.g. ``sin/cos``).

In either case the result is that a calculation ends up using the ``EX`` domain, which is a lot slower than any of the other domains. The fixes for these problems are simple:

- Improve the logic for deciding which domains are used by default.
- Add new domains that can represent things like ``sin`` and ``cos`` for example.

The other common problem is that constructing algebraic fields can be extremely slow (as shown above). This is unfortunate because after construction algebraic fields are much faster than alternative representations for the same expressions. The fix here is to use a different representation for algebraic fields that is not based on primitive elements, but this is not a trivial change.

Most of these changes are not particularly difficult to make but in each case the impact of making such a change can be far reaching and hard to predict in full. Each time some calculation is moved from either the symbolic subsystem or the ``EX`` domain to a more structured domain the most noticeable effect is to make things (much) faster, and a secondary effect is that it potentially reduces bugs. The third effect is that it leads to the output of the calculation being in a "more canonical" form, which is a good thing, but it is a change in output in some sense and it is the impact of this change that is hard to predict. Once a calculation is moved to use a more structured domain though, it then becomes much easier to optimise things in future because the form of the output is much more predictable and can be preserved exactly under any change of algorithm.

The Poly system
---------------

The domain system is, from a user-interface perspective, the lowest level of the computational algebra subsystem.
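
To make the layering concrete, here is a minimal sketch of working directly at the domain level with ``ZZ[x]`` (the printed forms are what I would expect from the sparse polynomial printer)::

    >>> from sympy import ZZ, symbols
    >>> x = symbols('x')
    >>> R = ZZ[x]                      # the polynomial ring domain ZZ[x]
    >>> p = R.from_sympy(x**2 + 1)
    >>> q = R.from_sympy(x - 1)
    >>> p*q                            # arithmetic happens on domain elements
    x**3 - x**2 + x - 1
    >>> R.to_sympy(p*q)                # convert back to a symbolic expression
    x**3 - x**2 + x - 1
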
The next level up are the low-level poly functions for the ``dup`` (dense univariate polynomial) and ``dmp`` (dense multivariate polynomial) representations. In the ``dup`` representation a polynomial is represented as a list of coefficients. There are many functions with the ``dup_`` prefix for operating on this representation. For comparison, this is how you would multiply two polynomials using the symbolic system::

    >>> from sympy import symbols, expand
    >>> x = symbols('x')
    >>> p1 = 3*x + 4
    >>> p2 = 5*x + 6
    >>> p3 = expand(p1*p2)
    >>> p3
    15*x**2 + 38*x + 24

This is how you would do the same with ``dup_mul``::

    >>> from sympy import ZZ
    >>> from sympy.polys.densearith import dup_mul
    >>> p1 = [ZZ(3), ZZ(4)]
    >>> p2 = [ZZ(5), ZZ(6)]
    >>> p3 = dup_mul(p1, p2, ZZ)
    >>> p3
    [mpz(15), mpz(38), mpz(24)]

A multivariate polynomial involving e.g. two symbols ``x`` and ``y`` can be thought of as a polynomial whose coefficients are polynomials, e.g.::

    >>> x, y = symbols('x, y')
    >>> p = x**2*y + x**2 + x*y**2 - x - y
    >>> p.collect(x)
    x**2*(y + 1) + x*(y**2 - 1) - y

Here we could think of ``p`` as a polynomial in ``x`` whose coefficients are polynomials in ``y``. If a univariate polynomial is represented as a list of coefficients then a multivariate polynomial can be represented as a list of lists of coefficients. This is the ``dmp`` representation::

    >>> p.as_poly(x, y).rep.rep
    [[mpz(1), mpz(1)], [mpz(1), mpz(0), mpz(-1)], [mpz(-1), mpz(0)]]

There are many functions like ``dmp_mul`` etc. for operating on this representation. In a simple example like this it looks okay but actually it is a very bad idea to use this list of lists (of lists of ...) representation when we have many more than two variables::

    >>> x0, x1, x2, x3, x4, x5, x6, x7, x8, x9 = syms = symbols('x:10')
    >>> p = x0*x9 - x5*x3 + x6
    >>> p.as_poly(syms).rep.rep
    [[[[[[[[[[mpz(1), mpz(0)]]]]]]]]], [[[[[[[[[mpz(-1)]]]], [[[[]]]]]], [[[[[[mpz(1)]]], [[[]]]]]]]]]]

These deeply nested recursive lists are very inefficient to work with compared to the flatter sparse representation, which in this case looks like::

    >>> dict(ZZ[syms].from_sympy(p))
    {(0, 0, 0, 0, 0, 0, 1, 0, 0, 0): mpz(1), (1, 0, 0, 0, 0, 0, 0, 0, 0, 1): mpz(1), (0, 0, 0, 1, 0, 1, 0, 0, 0, 0): mpz(-1)}

Altogether there are about 500 ``dup_*`` and ``dmp_*`` functions. These collectively implement just about all of the operations and algorithms that you might want for doing anything with polynomials and in many cases there are multiple algorithms for the same operation. It is clear from the design of these that the intention was at some point to rewrite all of this code in C to speed things up.

The next level up are the polynomial classes like the ``DMP`` class. The ``DMP`` class wraps up a list of lists in the representation shown above along with a reference to the associated domain. The ``DMP`` class then provides around 150 methods that are implemented by calling down to the lower-level routines. The next level above that is ``Poly``, which is the main user-facing class for polynomials. The ``Poly`` class is a wrapper around the ``DMP`` class that provides a lot of convenience methods for doing things like converting to and from the symbolic subsystem and other representations.
These levels ``Poly``, ``DMP`` and the ``dup`` representation can be seen by creating a ``Poly`` and then looking at its ``rep`` attribute::

    >>> p = Poly(x**2 + 1)
    >>> p  # Poly
    Poly(x**2 + 1, x, domain='ZZ')
    >>> p.rep  # DMP
    DMP([1, 0, 1], ZZ, None)
    >>> p.rep.rep  # dup
    [1, 0, 1]

The final level is the "polys" functions like ``factor``. These are the highest level functions that are used by most SymPy users and can operate on expressions from the symbolic subsystem::

    >>> from sympy import factor
    >>> factor(x**2 - 1)
    (x - 1)*(x + 1)

The levels altogether then are:

1. Domains (e.g. ``ZZ``, ``QQ``, ``ZZ[x]``, ``QQ[x]`` etc.)
2. Low-level poly functions (e.g. ``dup_mul``, ``dmp_mul`` etc.)
3. Polynomial classes (e.g. ``DMP``)
4. High-level ``Poly`` class.
5. Polynomial functions for symbolic expressions (e.g. ``factor``)

The intention was always that some of these levels would be replaced by more efficient implementations in C but it is not obvious at which level that was intended to happen. The ``Poly`` class is too high-level and interfaces a lot with the symbolic subsystem, so it is not easy or desirable to rewrite that in C. The ``dup_`` and ``dmp_`` functions are closely tied to their own (suboptimal) data representations, so without rewriting every single one of them it would not be possible to swap out this whole layer. There are two levels that are good candidates for swapping in alternate implementations wholesale then:

- The domain level
- The ``DMP`` class

At the domain level SymPy already swaps out the implementations of ``ZZ`` and ``QQ`` when gmpy2 is installed, and the same could be done for all of the commonly used domains like ``ZZ[x]``, ``QQ[x]`` etc. if the implementations of these from e.g. python-flint were used. The ``DMP`` class is also a good candidate for swapping out just because it is nicely self-contained. A wrapper class could be made like ``FlintDMP`` that wraps a polynomial type from python-flint and provides the same interface as ``DMP`` currently does. That would make it easy to swap in a faster implementation of the ``DMP`` class when python-flint is installed.

Besides swapping out a whole layer, the easiest way to make some things faster would be to swap out a few key functions that involve expensive algorithms, e.g. ``dup_zz_factor`` or ``dmp_zz_factor`` which are the base functions for factorising univariate or multivariate polynomials in ``ZZ[x]`` or ``ZZ[x,y,...]`` respectively. Swapping out a single function like this is suboptimal because it implies potentially costly conversions from a more efficient C representation to e.g. the inefficient ``DMP`` representation, but it would be a very simple change to make and could immediately bring big speedups for important operations.

Sparse polynomials
------------------

The system of levels in the polys module that I described above is all nicely designed and mostly well implemented. There are some algorithms that would be nice to have, like asymptotically fast multiplication and division of polynomials, but mostly it has the pieces and algorithms that are needed. The big glaring problem with it though is that it is all based on dense polynomials and the ``DMP`` representation. This is very inefficient for polynomials with many variables. The system works nicely for univariate polynomials but many of the algorithms necessarily become inefficient for multivariate polynomials. Most SymPy users are not really interested in using only univariate polynomials (e.g.
solving equations with only one symbol or something) so this is a significant limitation.

At some point the mistake of basing everything on dense polynomials was realised and a new class ``PolyElement`` implementing sparse polynomials was added. The implementation of ``PolyElement`` is mostly complete and it is used in many places but not always. For example the domain system always uses (sparse) ``PolyElement`` but the ``Poly`` class always uses (dense) ``DMP``.

It would be possible to add an equivalent of the ``DMP`` class based on sparse polynomials that ``Poly`` could use internally. It is not clear if it was intended that ``PolyElement`` would be that class or if another level was expected. Looking at the code right now, the easiest thing would be to add a new class in between ``PolyElement`` and ``Poly`` that implements the same interface as ``DMP`` but uses sparse polynomials internally; let's call it ``SMP``. Then ``Poly`` could use ``DMP`` or ``SMP`` internally but ``DMP`` would only ever be used for univariate polynomials. Then the structure would look like:

1. Domains (``ZZ`` ...)
2. ``dup_*`` (dense) or ``PolyElement`` (sparse)
3. ``DMP`` (dense) or ``SMP`` (sparse)
4. ``Poly`` (dense or sparse internally)
5. ``factor`` etc.

It is not completely clear if that is the right thing to do though, because actually many ``Poly`` methods are designed tightly around what makes sense in the dense representation or around things that only make sense for univariate polynomials. Probably the interface of ``Poly`` and the signatures and behaviour of the methods and functions that operate on it would have been designed differently if sparse and multivariate polynomials had been considered from the start. In general it is not really clear to me if it even makes sense to have a generic interface that covers both univariate and multivariate polynomials, because often the operations that make sense for one case do not make sense for the other. The fact that all of the low-level code needs completely separate functions for univariate (``dup_*``) and multivariate (``dmp_*``) polynomials shows that it is not really possible to write generic code that ignores this distinction. An alternative scheme for sparse multivariate polynomials could look like:

1. Domains
2. ``PolyElement``
3. ``SPoly`` (sparse version of ``Poly``)

Then depending on context you would choose whether you wanted to use ``Poly`` for univariate polynomials or ``SPoly`` for multivariate polynomials (perhaps ``SPoly`` should be called ``MPoly`` for "multivariate"). This approach would require more careful design because it would impact more on code that end users would potentially write, rather than just being a rearrangement of internal code. We would need to decide what e.g. ``expr.as_poly()`` should return, and changing that to ``SPoly``/``MPoly`` in some cases might not be backwards compatible, so perhaps we would need new methods like ``expr.as_mpoly()``.

Polynomial GCD
--------------

Apart from actually making use of sparse polynomials in ``Poly`` (or ``SPoly``), the implementation of sparse polynomials is mostly complete, but the biggest weakness is the use of poor algorithms for polynomial gcd. The current implementation of polynomial gcd for ``PolyElement`` uses "heugcd" for ``ZZ`` or ``QQ``, or switches to the ``DMP`` representation and uses subresultant PRS for all other domains. This is bad for several reasons:

- The "heugcd" algorithm is often okay but can be much worse than other algorithms in some cases.
- Sometimes "heugcd" is "unlucky" and fails (the current code does not fall back to anything else in this case and just blows up instead). - Switching from the sparse to the dense representation can be very inefficient for all the reasons that it is usually better to use a sparse representation in the first place (for multivariate polynomials). - The subresultant PRS algorithm is not generally the best algorithm in many cases either. Ideally there would be a range of different algorithms for sparse polynomial gcd and the best one would be chosen based on sparsity, degree etc. It is not too hard to port the dense PRS algorithm to the sparse implementation but ultimately other algorithms should often be used. There is a whole module ``sympy.polys.modulargcd`` which has about 3000 lines of code implementing modular algorithms for gcd over many different domains including algebraic number fields. It looks like good code but is not used anywhere in SymPy and its status is unclear. Mostly I think that the main route to speeding up polynomial operations in SymPy is just by making use of python-flint but the pure Python implementation should at least use reasonable algorithms. The poor polynomial gcd algorithms are a major weakness that directly impacts high-level operations like ``solve``, ``integrate`` etc. I will give two examples of how polynomial gcd can slow down other things. The first is computing the inverse of the matrix described in this issue: https://github.com/sympy/sympy/issues/25403 The matrix includes ``I`` (:math:`\sqrt{-1}`) and so the domain of the elements of the matrix is something like ``ZZ_I[a,b,c,...,g]`` where ``ZZ_I`` means the Gaussian integers. The newly added fraction-free RREF algorithm for computing the inverse can compute the inverse of each component of the matrix in the form of a numerator matrix and scalar denominator relatively quickly e.g.:: In [2]: %time Mnum, den = M2parts[0].to_DM().inv_den() CPU times: user 1.2 s, sys: 10.8 ms, total: 1.21 s Wall time: 1.21 s That takes 1 second which I think is still too slow but most users will tolerate waiting that long for something like this. We want to combine the numerator matrix and the denominator though which means dividing them and ideally the gcd should be cancelled to bring the inverse matrix into its simplest form:: In [3]: %time Minv = Mnum.to_field() / den ^C^C KeyboardInterrupt I interupted that after about 5 minutes so I'm not sure how long it actually would have taken. The slow part here is 25 polynomial gcd calculations. The polynomial expressions are moderately large but better algorithms could do this a lot faster. In this case because the domain of the polynomials is ``ZZ_I`` the dense PRS algorithm is used rather than ``heugcd`` and that is just very slow in this case. We can avoid the gcd calculations here but in other cases it is definitely better to cancel the gcd and it is awkward to have to work around some extremely slow cases by returning less simplified results in most other cases. This sort of consideration applies not just to matrix inverse but e.g. solving a system of linear equations so ``solve``, ``linsolve`` etc and many other algorithms that use these operations internally (e.g. integration algorithms). Another example of slowness caused by polynomial gcd is in the "heurisch" integration algorithm. This algorithm has essentially three computational steps that take place in a loop: - Differentiate some expression. - Cancel a gcd between two polynomials. 
- Solve a sparse system of linear equations.

The system of linear equations is solved using the computational algebra subsystem (``DomainMatrix``). Theoretically the linear algebra should be the slow part but in SymPy it is not, because the other operations are computed using the symbolic subsystem and end up being much slower. It should be possible to stay entirely in the computational algebra subsystem, but the reason for using the symbolic subsystem is that the gcd calculations would be too slow in the computational algebra subsystem. Polynomial gcd is precisely the sort of thing that *should* be more efficient in the computational algebra subsystem, but it is not because the polynomial rings used here have huge numbers of symbols like ``QQ_I[x1,x2,...,x1000]``. The dense gcd algorithms are extremely slow in this case because each step needs to recurse through 1000 levels of nested lists even if the polynomials involved only have a couple of terms.

The ``heurisch`` algorithm works around slow gcd by converting to a symbolic expression, calling ``cancel`` (which internally converts to the computational algebra subsystem and back) and then converting back again to the original polynomial ring. The slow part often ends up being just the conversions back and forth between different representations, but those are only happening to avoid the slow gcd algorithms. The end result of this is that "heurisch", which is the main workhorse for ``integrate`` in SymPy, is often very slow as an *indirect* result of sparse polynomial gcd being slow. These things would all be made a lot faster by using python-flint, but I still think that it is worth improving the pure Python algorithms here as well because it is not really that hard to make big improvements.

Python-flint
------------

The python-flint library is a Python wrapper around the C library FLINT. FLINT is a C library for fast arithmetic and other operations with polynomials, integers, matrices etc. and is used by many other computer algebra systems such as SageMath, Mathematica, Maple, OSCAR etc. FLINT itself is built on top of GMP and MPFR but then provides hugely expanded functionality on top of them. The FLINT library provides fast implementations for all of the domains that I have mentioned above (``ZZ``, ``QQ``, ``GF(p)``, ``ZZ[x]``, ``QQ[x]`` etc.) and much more:

https://flintlib.org/

The FLINT library is now maintained by Fredrik Johansson, who is also the original author of SymPy's ``evalf`` and the ``mpmath`` library that is SymPy's only hard dependency. Fredrik went on to create ``Arb``, ``Calcium`` and the ``generic-rings`` project. Fredrik recently took over maintainership of FLINT and merged all of these projects together into FLINT 3.0 (AKA MegaFLINT). Altogether FLINT is around 1 million lines of C code with many contributors and provides state of the art implementations for the sorts of things that I have been discussing above and much more. The FLINT codebase has a lot of widely used mature code but is also still under active development.

The python-flint library is a wrapper around FLINT that provides a Python interface to a small subset of the functionality of FLINT (and Arb):

https://fredrikj.net/python-flint/

https://github.com/flintlib/python-flint

Fredrik originally created python-flint but it had not been actively maintained, having only a few releases in 2017 and 2018. Over time I have been working on making python-flint more usable and this has now culminated in the releases of python-flint 0.4.0-0.4.2 during the last month.
The major changes in these releases are that there are now CI-built wheels for Linux, macOS and Windows that can be installed with ``pip install python-flint``, as well as conda packages (``conda install -c conda-forge python-flint``, thanks to Isuru Fernando). There was a major Windows bug that basically made python-flint unusable on Windows and that is now fixed. Lastly I made a bunch of smaller changes to make python-flint more usable and better tested and to fix a few minor bugs. The end result of that work is that python-flint is now perfectly usable and well tested, although still very incomplete in wrapping all of FLINT's functionality.

My own interest in working on this was to make it possible to use python-flint as a drop-in replacement for the pure Python implementations of parts of SymPy's computational algebra subsystem. Now that python-flint is usable though, others have become interested and there are two new active contributors working on expanding its functionality to add finite fields ``GF(q)`` and (critical for SymPy) multivariate sparse polynomials.

To give a quick example of what it would mean in terms of speed to make use of python-flint, let's compare the time taken to factorise a polynomial between SymPy's existing implementation and python-flint::

    In [16]: nums = [1, 2, 3, 4, 5] * 50

    In [17]: nums2 = [1, 2, 3, 4, 5] * 50

    In [18]: p1 = Poly(nums, x)

    In [19]: p2 = Poly(nums2, x)

    In [20]: p3 = p1*p2

    In [21]: %time result = p3.factor_list()
    CPU times: user 7.57 s, sys: 10.1 ms, total: 7.58 s
    Wall time: 7.58 s

    In [22]: import flint

    In [23]: p1 = flint.fmpz_poly(nums[::-1])  # reverse coefficient order

    In [24]: p2 = flint.fmpz_poly(nums2[::-1])

    In [25]: p3 = p1*p2

    In [26]: p3.degree()
    Out[26]: 498

    In [27]: %time result = p3.factor()
    CPU times: user 83.1 ms, sys: 0 ns, total: 83.1 ms
    Wall time: 82.3 ms

The polynomial here has degree 498 and the factorisation takes about 8 seconds with SymPy's existing implementation and about 80 milliseconds with python-flint, a speedup of roughly 100x. This ratio is approximately what can be expected when comparing an optimised C implementation with a pure Python implementation of the same low-level algorithm: SymPy's existing algorithm is okay in this case but is limited by being implemented in Python. FLINT is not just highly optimised C code though, as it also uses state of the art algorithms, and so in the cases discussed above where SymPy's algorithms are not so good (e.g. polynomial gcd) the speedup can be *much* larger than 100x.

Here is a comparison of factorisation time vs degree:

.. image:: scripts/time_factor.svg

For the timings in this plot I generate two polynomials of degree :math:`d` with random integer coefficients in the range from 1 to 10, multiply them together and then time how long it takes to factorise the result. The plot compares the timings against :math:`3d^2` for FLINT and :math:`200d^2` for SymPy. Both curves show an approximately quadratic scaling, but with SymPy being about 60x slower than FLINT up to around degree 100. Then the SymPy curve looks like it starts to bend upwards and gets a lot slower above degree 100. The 60x speed difference is the difference between a pure Python implementation of the low-level operations vs FLINT's optimised C implementation. The upwards bend in the SymPy curve is most likely the result of some algorithmic difference and I think it means that for larger polynomials SymPy will become extremely slow.
While both curves in the plot show approximately quadratic scaling, it is worth noting that the worst case for *both* implementations is worse than :math:`O(d^2)`. For SymPy's algorithms the worst case is in fact exponential, like :math:`O(2^d)`, and I am not sure exactly what it would be for FLINT but I think maybe the best bounds are something like :math:`O(d^6)`. For randomly generated polynomials like those shown here we will tend not to see these worst cases, but that does not necessarily make them unlikely in practice. For example this issue shows a case where SymPy's ``minpoly`` is extremely slow because ``factor`` is slow for a polynomial of degree 162: https://github.com/sympy/sympy/issues/22400

Using FLINT (see the change to ``dup_zz_factor`` below) SymPy can compute the operation from that issue in about 350 milliseconds::

    In [1]: %time ok = minpoly(root(2,3)+root(3,3)+(-1+I*sqrt(3))/2*root(5,3))
    CPU times: user 345 ms, sys: 3.67 ms, total: 349 ms
    Wall time: 347 ms

The 60x speed difference above would suggest that SymPy's existing algorithm would take around 20 seconds to compute the same result, but in fact it takes much longer than that. I'm not sure how long but it is more than 5 minutes, so at least 1000x slower. In the issue I suggested a change that would bring this down to about 40 seconds, so more like 100x slower, which is worth doing. Also SymPy now has the LLL algorithm which was discussed there, so that could be used to speed this up.

We could spend a lot of time trying to optimise SymPy's existing algorithms and in some cases that is worthwhile, but ultimately it will always be better to use python-flint when speed is needed. For many of the lower-level algorithms a pure Python implementation will never be able to come close to the speed of FLINT, and so if we want SymPy users to be able to use SymPy and have state of the art speed then we just need to use something like python-flint. Of course if anyone would like to improve SymPy's factorisation algorithm then that is fine and it is impressive what previous SymPy contributors have achieved while working only in Python. For working on making SymPy faster for end users though, the immediate priority is to make more use of python-flint.

I think the right way to think about FLINT is that it is like a BLAS library for computer algebra systems. Most scientific computing is done with machine precision types like 64-bit floats etc. and in that context when you want to compute say the inverse of a matrix then you would absolutely use a BLAS library to do that. You might be working in Python, R or Julia or something but the BLAS library is written in C or Fortran and is just much faster than anything you could write directly. Even if you could write a fast implementation of BLAS in say Python there would be no benefit in doing so because it is better to be able to share things like that, e.g. the same BLAS library can be used from Python, R, Julia, Matlab etc. If you were working in C you would still not write your own code to do what BLAS does, because it is not just faster but also more reliable, more accurate, less buggy etc. than you would likely achieve without a lot of work. In fact even an optimised C implementation would likely be slower than BLAS because your BLAS library would have not just better algorithms but also things like hand-crafted assembly (much like GMP does), SIMD etc., with the aim of being *faster* than C.
Anyone who knows about BLAS libraries would say that it would be absurd to try to reimplement BLAS in Python when you can just use an off-the-shelf BLAS library (``pip install numpy`` already does this for you). This is essentially the same situation for computer algebra systems and libraries like FLINT. It is nice that SymPy has pure Python implementations of many algorithms and there are situations where those are useful in practice, but if you want anything like top-end speed for CPU-heavy operations then you need to use something like FLINT. Then if you do use FLINT there is no real reason why working in Python should be slower than anything else, because if the bulk of the time is spent in FLINT then it will be just as fast whether you call it from Python, C or anything else.

I will write more specifically about what is currently implemented in python-flint and what work is needed in a separate post. I will also write more about matrices in a separate post, but for now I will just say that FLINT provides both polynomials and matrices over all of the domains mentioned above and SymPy could use those to achieve state of the art speeds for many operations. In the same way that SymPy currently uses gmpy2 to speed up ``ZZ`` and ``QQ``, it would be possible to speed up every part of the computational algebra subsystem by using python-flint. There are also many other features like ``Arb`` and ``Calcium`` that SymPy would benefit from but that is not the topic of this post.

Using python-flint
------------------

Since python-flint is now easily installable and usable I have added support for using it in SymPy. In SymPy 1.12 and earlier the ``SYMPY_GROUND_TYPES`` environment variable can be used to specify whether or not to use gmpy2 for ``ZZ`` and ``QQ``. If the environment variable is not set to anything then the default is to use gmpy2 if it is installed and otherwise to use CPython's ``int`` and SymPy's ``PythonMPQ`` types as described above. With the latest changes on the SymPy master branch the ``SYMPY_GROUND_TYPES`` environment variable can also be set to ``flint`` to use python-flint instead of gmpy2 (if it is installed). The PR implementing this change is:

https://github.com/sympy/sympy/pull/25474

Using this looks like::

    $ SYMPY_GROUND_TYPES=flint isympy
    ...
    In [1]: type(ZZ(1))
    Out[1]: flint._flint.fmpz

The default is currently to use gmpy2 if the environment variable is not set explicitly, but I am considering whether the default should be changed to ``flint`` in SymPy 1.13. So far SymPy does not use enough of python-flint's features to derive major benefits from it and so, being conservative, it is only used if the user opts in explicitly, but that is definitely not what we would want longer term. (Changing the default would only potentially affect anything for those who already have python-flint installed, which I suspect is not many people.)

With this change SymPy can now use python-flint for ``ZZ`` and ``QQ``. Since FLINT depends on GMP and uses GMP for its integer and rational number types, using python-flint for ``ZZ`` and ``QQ`` is not really much different from using gmpy2. The main difference is that python-flint provides many more things, and using python-flint's elementary types for e.g. ``ZZ`` makes it easier to later add support for polynomials in e.g. ``ZZ[x]``.
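
For reference, this is roughly what python-flint's elementary types look like when used directly; a minimal sketch (the equality checks are just to avoid relying on exact reprs)::

    >>> import flint
    >>> a = flint.fmpz(2)**100                    # GMP-backed arbitrary precision integer
    >>> int(a) == 2**100
    True
    >>> flint.fmpq(1, 3) + flint.fmpq(1, 6) == flint.fmpq(1, 2)   # rationals kept in canonical form
    True
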
Besides ``ZZ`` and ``QQ`` the only other things that SymPy currently (on master) uses python-flint for are the internal dense implementation of ``DomainMatrix`` over ``ZZ`` and ``QQ`` and a few number theory functions that would otherwise be provided by ``gmpy2``:

https://github.com/sympy/sympy/pull/25495

https://github.com/sympy/sympy/pull/25577

I will talk more about the ``DomainMatrix`` changes in a separate post, but for now the point is that SymPy can now use python-flint for a few things and it is now very easy to use it for more things. So far SymPy does not get a major benefit from this, but the hard work is now done so that we can just flip a few switches and make some important things *much* faster.

A simple demonstration of making something faster is that this change would make factorisation of polynomials in ``ZZ[x]`` faster when python-flint is installed (and ``SYMPY_GROUND_TYPES=flint`` is set):

.. code:: diff

    diff --git a/sympy/polys/factortools.py b/sympy/polys/factortools.py
    index de1821a89f..d9a833e6c1 100644
    --- a/sympy/polys/factortools.py
    +++ b/sympy/polys/factortools.py
    @@ -1,5 +1,7 @@
     """Polynomial factorization routines in characteristic zero. """

    +from sympy.external.gmpy import GROUND_TYPES
    +
     from sympy.core.random import _randint

     from sympy.polys.galoistools import (
    @@ -76,6 +78,12 @@
     from math import ceil as _ceil, log as _log


    +if GROUND_TYPES == 'flint':
    +    from flint import fmpz_poly
    +else:
    +    fmpz_poly = None
    +
    +
     def dup_trial_division(f, factors, K):
         """
         Determine multiplicities of factors for a univariate polynomial
    @@ -662,6 +670,12 @@ def dup_zz_factor(f, K):
         .. [1] [Gathen99]_

         """
    +    if GROUND_TYPES == 'flint':
    +        f_flint = fmpz_poly(f[::-1])
    +        cont, factors = f_flint.factor()
    +        factors = [(fac.coeffs()[::-1], exp) for fac, exp in factors]
    +        return cont, factors
    +
         cont, g = dup_primitive(f, K)
         n = dup_degree(g)

This simple change does not cause any outwardly observable change in behaviour and does not result in any test failures. It just makes one operation a lot faster, e.g. we can now factorise a degree 2000 polynomial in about 2 seconds::

    In [11]: p1 = random_poly(x, 1000, 1, 5)

    In [12]: p2 = random_poly(x, 1000, 1, 5)

    In [13]: p3 = p1.as_poly() * p2.as_poly()

    In [14]: p3.degree()
    Out[14]: 2000

    In [15]: %time ok = p3.factor_list()
    CPU times: user 2.09 s, sys: 2.6 ms, total: 2.09 s
    Wall time: 2.09 s

I have not waited long enough to see how long this operation would take with SymPy's existing implementation but it is definitely a lot longer than 2 seconds.

It would be easy to speed up specific functions like ``dup_zz_factor`` piecemeal like this and it is worth doing that to some extent. All polynomial operations would be a lot faster though if python-flint were used throughout, and converting to and from ``fmpz_poly`` for each individual operation is relatively slow compared to many of the ``fmpz_poly`` operations themselves. The best approach then would be to swap out a layer somewhere in the poly system. In this case I think what makes the most sense is to make a wrapper class that holds an ``fmpz_poly`` internally but provides the same interface as ``DMP``. Then ``Poly`` could use this class internally and the result would be that all polynomial operations would be *much* faster, including basic arithmetic like addition and multiplication. The structure would then be like:

1. Domains (``ZZ`` ...)
2. ``dup_*`` (SymPy) or ``fmpz_poly`` (python-flint)
3. ``DMP`` (SymPy) or ``FlintDMP`` (python-flint)
4. ``Poly`` (holds ``DMP`` or ``FlintDMP`` internally)
5. ``factor`` etc.
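To make the wrapper idea concrete, here is a minimal sketch of what a ``FlintDMP``-style class might look like. The class and its method names are hypothetical rather than existing SymPy API; the only python-flint calls used (``fmpz_poly``, ``coeffs``, ``factor`` and the arithmetic operators) are the same ones used in the patch above:

.. code:: python

    from flint import fmpz_poly

    class FlintDMP:
        """Hypothetical wrapper holding an fmpz_poly behind a DMP-like interface.

        Coefficients go in and come out as dup-style lists in descending order
        while fmpz_poly stores them in ascending order, hence the reversals.
        """

        def __init__(self, rep, dom):
            self.rep = fmpz_poly(rep[::-1])  # rep is a descending coefficient list
            self.dom = dom                   # the coefficient domain, e.g. ZZ

        @classmethod
        def _from_flint(cls, rep, dom):
            obj = object.__new__(cls)
            obj.rep = rep
            obj.dom = dom
            return obj

        def to_list(self):
            # Back to the descending-order list used by the dup_* functions.
            return self.rep.coeffs()[::-1]

        def add(self, other):
            return self._from_flint(self.rep + other.rep, self.dom)

        def mul(self, other):
            return self._from_flint(self.rep * other.rep, self.dom)

        def factor_list(self):
            # Same conversion as in the dup_zz_factor patch above.
            cont, factors = self.rep.factor()
            return cont, [(fac.coeffs()[::-1], exp) for fac, exp in factors]

``Poly`` would then construct something like this instead of ``DMP`` whenever the flint ground types are active and the domain is ``ZZ``, and everything above that layer would be unchanged.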
If SymPy did not already have existing implementations for many of the things that python-flint provides then using python-flint would bring a massive expansion in SymPy's capabilities. There are other things that FLINT/python-flint have that SymPy does not currently have but those are not the topic here. For the sorts of things that I discussed above SymPy already has existing implementations of the algorithms, they are just a lot slower. SymPy will need to keep those pure Python implementations (so that it can still be used without python-flint), so making use of python-flint's capabilities for polynomials is really just about speed. In that sense it is not really necessary to try to use all of FLINT's features in SymPy because the speed benefit mostly comes from using it for the lowest levels: there is no big penalty in writing higher-level algorithms in Python rather than C.

The biggest bang for buck that SymPy can get from FLINT would come from using FLINT's sparse multivariate polynomials. Multivariate polynomials are crucial for many of the things that SymPy users want to do, and SymPy's existing implementations are slow because they are pure Python, or dense, or because some of the algorithms are far from optimal in cases like polynomial gcd. The FLINT library has state of the art algorithms for sparse polynomial arithmetic, gcd, factorisation etc, which are (or should be) the bottlenecks for many things in SymPy like solving systems of equations, matrix operations, integration, simplification etc. Unfortunately python-flint does not yet expose FLINT's sparse polynomials, but that is being worked on and I hope that the next release of python-flint will provide this. This one feature would bring enormous speedups to SymPy for many things that users want to do.
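To make "sparse" concrete: SymPy's existing sparse representation in ``sympy.polys.rings`` stores a polynomial as a dict mapping exponent tuples to coefficients so that only the nonzero terms are stored, and FLINT's sparse multivariate polynomials are built around the same idea. A small illustration with current SymPy (the printed form may differ slightly depending on the ground types in use)::

    >>> from sympy import ZZ
    >>> from sympy.polys.rings import ring
    >>> R, x, y = ring("x, y", ZZ)
    >>> p = (x + y)**2 + 3
    >>> p
    x**2 + 2*x*y + y**2 + 3
    >>> len(p)  # four stored terms: {exponent tuple: coefficient}
    4

A dense representation of the same polynomial would store coefficients for every monomial up to the degree in each variable, which is why dense multivariate code scales so badly as the number of variables grows.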
Probably, if SymPy were using python-flint's existing features and also its sparse polynomials, then the potential for simple game-changing speedups of relevance to SymPy users from using FLINT's other algebra features would be a lot smaller. It would definitely be worth adding those other features, but after the sparse polynomials the next biggest benefits for SymPy users would come from using other features like ``Arb`` rather than more of FLINT's algebra features.

What needs to be done
---------------------

Some significant work has already been done to make it possible to use python-flint to make SymPy faster but there is still a lot of work to do. There are also things that should be improved in SymPy's existing algorithms or code structure to make it faster without python-flint and/or to make it easier to leverage python-flint for improving SymPy's polynomial capabilities.

This is what can be done now, without python-flint or with python-flint as it stands today:

1. Some existing operations and algorithms should be improved, like polynomial gcd (port PRS to sparse polynomials, investigate the modulargcd module etc).
2. The sparse polynomial representation should be used more widely in SymPy, whether that means adding ``SMP`` at the level below ``Poly`` or introducing a new ``SPoly`` or ``MPoly`` class at the ``Poly`` level instead.
3. A wrapper class like ``FlintDMP`` should be added that can wrap python-flint's dense univariate polynomial types for use by ``Poly`` so that it can use python-flint internally for arithmetic, ``factor`` etc over ``ZZ`` and ``QQ``.
4. The ``ANP`` class used for algebraic number fields should be able to wrap a FLINT ``fmpq_poly`` internally because that is a situation where dense univariate polynomials are very important in the domain system (perhaps it should wrap ``DMP`` or ``FlintDMP``).
5. Algebraic number fields should be reimplemented based on sparse polynomials rather than primitive elements (this is a much bigger task than anything else mentioned here).
6. Domains like ``ZZ_I[x, y]`` should be reimplemented over the top of sparse polynomials like ``ZZ[x, y]`` rather than directly over ``ZZ_I`` - this would already speed them up but would also prepare them to get faster with python-flint.

There is an important design decision to be made about whether to use the ``SMP`` approach or the ``SPoly`` approach described above. Everything else listed above is unambiguous and is just work that should be done.

Another area where python-flint can immediately bring big improvements is ``GF(p)`` since python-flint provides an implementation of ``GF(p)`` as well as matrices and polynomials over ``GF(p)``. I have not listed this above just because, for what most users want to do, the need for ``GF(p)`` is mainly in the internals of things like factorisation, and if SymPy gets those algorithms from FLINT then it depends much less on its own ``GF(p)`` domain. (If anyone else is interested in this though it would not be hard to make use of python-flint here.)

These changes require the sparse polynomials to be added in python-flint and are the absolute top priority items that would make the biggest difference to the things that users want to do:

7. FLINT's sparse polynomials should be exposed in python-flint.
8. SymPy should use python-flint's sparse polynomials internally for domains like ``ZZ[x, y]``.
9. SymPy should use python-flint's sparse polynomials for expensive operations involving multivariate polynomials like ``factor``, ``gcd`` etc.
10. SymPy should use python-flint's sparse polynomials internally for ``Poly``. This would be easy if SymPy already used its own sparse representation (i.e. ``SMP`` or ``SPoly``).

All sorts of things would be made much, much faster by making these changes. In many cases the speed differences would be beyond the sort of thing that it makes sense to express as e.g. 10x faster, because a calculation that is currently so slow as to be effectively impossible would become achievable in a reasonable time. There are reasons beyond just speed for most of these changes as well.

It is also worth noting that any work on python-flint benefits the wider Python ecosystem, and that leveraging FLINT means sharing efforts across language ecosystems like Python and Julia as well as other computer algebra systems.

Some of the changes I have described above are not really that difficult to do and could be done quite soon. Others are bigger, more long-term projects. For me the top ticket is adding FLINT's sparse polynomials to python-flint and then having SymPy use them for the domains, which I don't think would be particularly difficult.

.. raw:: html