How? NPUs are going to be included in every PC in 2025. The only differentiators will be how much SRAM and memory bandwidth you have, and whether or not you use processing-in-memory. AMD is already shipping APUs with 16 TOPS, or roughly 4 TFLOPS in bfloat16, and that is more than enough for inference given the limited memory bandwidth. Strix Halo will have around 12 TFLOPS (bfloat16) and four memory channels.
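To see why compute stops mattering, here is a rough back-of-envelope sketch of the bandwidth-bound ceiling on decode speed. It assumes every generated token has to stream the full quantized weight set from memory once; the model size and bandwidth figures are illustrative, not vendor specifications.

```c
/* Back-of-envelope: decode throughput when inference is memory-bandwidth bound.
 * Assumes each generated token streams the full quantized weight set once.
 * All figures are illustrative assumptions, not measured numbers. */
#include <stdio.h>

int main(void) {
    double params_billion   = 7.0;    /* 7B-parameter model (assumption)            */
    double bytes_per_weight = 0.5;    /* ~4-bit quantization                         */
    double bandwidth_gbs    = 120.0;  /* quad-channel LPDDR5X-class APU (assumption) */

    double model_gb  = params_billion * bytes_per_weight;  /* ~3.5 GB of weights   */
    double tokens_ps = bandwidth_gbs / model_gb;           /* bandwidth-bound cap  */

    printf("Model size: %.1f GB, upper bound: ~%.0f tokens/s\n", model_gb, tokens_ps);
    return 0;
}
```

With numbers like these the ceiling sits around a few dozen tokens per second no matter how many TOPS the NPU advertises, which is why bandwidth and memory channels are the real differentiators.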
llama.cpp already supports 4-bit quantization; the quantized weights are unpacked back to bfloat16 at runtime for better accuracy. The best use case for an FPGA I have seen so far was to pair it with SK Hynix's AI GDDR, and even that could be replaced by an even cheaper inference chip specializing in multi-board communication and as many memory channels as possible.
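For intuition, here is a minimal sketch of block-wise 4-bit dequantization in the spirit of llama.cpp's Q4_0 layout (32 weights per block, one scale, value = scale * (quant - 8)). It is simplified: it uses a plain float scale and nibble ordering that may differ from the real on-disk format, and it unpacks to float rather than bfloat16.

```c
/* Minimal sketch of block-wise 4-bit dequantization (Q4_0-style).
 * Simplified relative to llama.cpp: float scale, illustrative nibble order. */
#include <stdint.h>
#include <stdio.h>

#define QK 32  /* weights per quantization block */

typedef struct {
    float   scale;        /* per-block scale factor                   */
    uint8_t qs[QK / 2];   /* 32 x 4-bit quants packed two per byte    */
} block_q4;

/* Unpack one block back to full precision at runtime. */
static void dequantize_block(const block_q4 *b, float *out) {
    for (int i = 0; i < QK / 2; ++i) {
        const int lo = (b->qs[i] & 0x0F) - 8;  /* low nibble  */
        const int hi = (b->qs[i] >> 4)   - 8;  /* high nibble */
        out[2 * i + 0] = b->scale * (float)lo;
        out[2 * i + 1] = b->scale * (float)hi;
    }
}

int main(void) {
    block_q4 b = { .scale = 0.05f };
    for (int i = 0; i < QK / 2; ++i)
        b.qs[i] = (uint8_t)((i & 0x0F) | ((15 - (i & 0x0F)) << 4));

    float w[QK];
    dequantize_block(&b, w);
    printf("first four weights: %.3f %.3f %.3f %.3f\n", w[0], w[1], w[2], w[3]);
    return 0;
}
```

The point is that the unpacking is a handful of shifts and multiplies per weight, cheap enough that the memory reads, not the arithmetic, dominate.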