So you’ll want to try dialling up the overall cache factor a bit.
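For instance (numbers here are purely illustrative - tune to taste), in homeserver.yaml something like:

```yaml
caches:
  # Multiplies the default size of every cache (the default factor is 0.5).
  global_factor: 2.0
  # Individual caches can be bumped further if one in particular is churning;
  # the cache name below is just an example.
  per_cache_factors:
    get_users_in_room: 5.0
```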
Redis is only useful if you split the server into multiple worker processes, which you shouldn’t need to do at that size. Even then it doesn’t provide shared caching yet (there’s a PR in flight for that); we currently just use Redis as a pub/sub mechanism between the workers.
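(For reference, if you do go down the workers route later, wiring Synapse up to Redis is only a few lines in homeserver.yaml - the host/port below are the usual defaults, adjust as needed:)

```yaml
redis:
  enabled: true
  host: localhost
  port: 6379
```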
Highly recommend hooking up prometheus and grafana if you haven’t already, as it will likely show a smoking gun of whatever the failure mode is.
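Getting metrics out is roughly: enable them in homeserver.yaml with a dedicated metrics listener, then point Prometheus at it (the port below is just an example):

```yaml
# homeserver.yaml
enable_metrics: true
listeners:
  - port: 9000
    type: metrics
    bind_addresses: ['127.0.0.1']

# prometheus.yml - scrape job pointing at the listener above
scrape_configs:
  - job_name: 'synapse'
    metrics_path: '/_synapse/metrics'
    static_configs:
      - targets: ['127.0.0.1:9000']
```

On the Grafana side there’s a ready-made Synapse dashboard in the synapse repo (contrib/grafana) you can import once Prometheus is scraping.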
Are the logs filling up with slow state-res warnings? Stuff like:
2021-02-25 23:15:26,408 - synapse.state.metrics - 705 - DEBUG - None - 1 biggest rooms for state-res by CPU time: ['!YynUnYHpqlHuoTAjsp:matrix.org (34.6265s)']
2021-02-25 23:15:26,411 - synapse.state.metrics - 705 - DEBUG - None - 1 biggest rooms for state-res by DB time: ['!YynUnYHpqlHuoTAjsp:matrix.org (148.6s)']
Sounds like I should tune some caches then - I have memory to spare if it turns out to make a difference.
BTW, I just noticed there is an option to add Redis - would that be a significant improvement compared to just using the in-process caching?