The goal was to extract as much parallelism as possible from a single squaring operation, using the hardware resources available on a Xilinx FPGA. A paper detailing the exact algorithm will follow in the coming weeks. In the interim, if you are interested in digging into the code, you can find some of the primitives for the design here: https://github.com/supranational/primitives
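
To give a rough idea of where parallelism inside a single squaring can come from, here is a minimal Python sketch. It is not the algorithm from the design or the forthcoming paper; it is the generic schoolbook approach, in which the operand is split into limbs and every limb-by-limb partial product is independent of the others. The function names, the 16-bit limb width, and the limb count are all assumptions made for illustration. Python evaluates the products one after another, so the code only shows the data independence that hardware could exploit, not an actual speedup.

```python
def to_limbs(x, limb_bits, n_limbs):
    """Split a non-negative integer into n_limbs limbs, least significant first."""
    mask = (1 << limb_bits) - 1
    return [(x >> (i * limb_bits)) & mask for i in range(n_limbs)]


def square_via_limbs(x, limb_bits=16, n_limbs=8):
    """Square x by summing independent limb products (hypothetical helper).

    limb_bits and n_limbs are illustrative values, not taken from the design.
    """
    if x < 0 or x >= 1 << (limb_bits * n_limbs):
        raise ValueError("x does not fit in the chosen limb layout")

    limbs = to_limbs(x, limb_bits, n_limbs)

    # columns[k] collects every partial product whose weight is 2**(k * limb_bits).
    columns = [0] * (2 * n_limbs - 1)
    for i in range(n_limbs):
        # Diagonal term: limb i times itself.
        columns[2 * i] += limbs[i] * limbs[i]
        # Off-diagonal terms appear twice in a square (i*j and j*i),
        # so each is computed once and doubled.
        for j in range(i + 1, n_limbs):
            columns[i + j] += 2 * limbs[i] * limbs[j]

    # Recombine the column sums; carries are resolved by the shifted additions.
    return sum(c << (k * limb_bits) for k, c in enumerate(columns))
```

No partial product in the two loops depends on any other, so in hardware they could all be computed at the same time and then fed into an adder tree. Only the final recombination step has to deal with carries.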