So you’ll want to try dialling up the overall cache factor a bit.
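For instance (numbers here are purely illustrative - tune to taste), in homeserver.yaml something like:

```yaml
caches:
  # Multiplies the default size of every cache (the default factor is 0.5).
  global_factor: 2.0
  # Individual caches can be bumped further if one in particular is churning;
  # the cache name below is just an example.
  per_cache_factors:
    get_users_in_room: 5.0
```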
Redis is only useful if you split the server into multiple worker processes, which you shouldn’t need to do at that size. Even then it doesn’t provide shared caching yet (there’s a PR in flight for that); we currently just use Redis as a pub/sub mechanism between the workers.
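(For reference, if you do go down the workers route later, wiring Synapse up to Redis is only a few lines in homeserver.yaml - the host/port below are the usual defaults, adjust as needed:)

```yaml
redis:
  enabled: true
  host: localhost
  port: 6379
```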
Highly recommend hooking up prometheus and grafana if you haven’t already, as it will likely show a smoking gun of whatever the failure mode is.
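Getting metrics out is roughly: enable them in homeserver.yaml with a dedicated metrics listener, then point Prometheus at it (the port below is just an example):

```yaml
# homeserver.yaml
enable_metrics: true
listeners:
  - port: 9000
    type: metrics
    bind_addresses: ['127.0.0.1']

# prometheus.yml - scrape job pointing at the listener above
scrape_configs:
  - job_name: 'synapse'
    metrics_path: '/_synapse/metrics'
    static_configs:
      - targets: ['127.0.0.1:9000']
```

On the Grafana side there’s a ready-made Synapse dashboard in the synapse repo (contrib/grafana) you can import once Prometheus is scraping.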
Are the logs filling up with slow state-res warnings? Stuff like:
2021-02-25 23:15:26,408 - synapse.state.metrics - 705 - DEBUG - None - 1 biggest rooms for state-res by CPU time: ['!YynUnYHpqlHuoTAjsp:matrix.org (34.6265s)']
2021-02-25 23:15:26,411 - synapse.state.metrics - 705 - DEBUG - None - 1 biggest rooms for state-res by DB time: ['!YynUnYHpqlHuoTAjsp:matrix.org (148.6s)']
Sounds like I should tune some caches then - I have memory to spare if it turns out to make a difference.
BTW, I just noticed there is an option to add Redis - would that be a significant improvement compared to just using the in-process caching?