6.7B is pretty small, no? Do you even need offloading for that on a 3090? I'd be curious to see what's needed to run opt-30b or opt-66b with reasonable performance. The README suggests that even opt-175b should be doable with okay performance on a single NVIDIA T4 if you have enough RAM.
It is entirely possible to run 6.7B parameter models on a 3090, although I believe you need 16-bit weights: at fp16 the weights alone come to roughly 6.7B × 2 bytes ≈ 13.4 GB, which fits comfortably in the 3090's 24 GB of VRAM. I think you can squeeze a 20B parameter model onto the 3090 if you go all the way down to 8-bit quantization, since 20B × 1 byte ≈ 20 GB, which leaves only a few GB of headroom for activations and the KV cache.
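If it helps, here's a quick Python sketch of that arithmetic (the model sizes and the 24 GB figure are just illustrative assumptions; it counts weights only, so real usage runs a few GB higher once activations and the KV cache are in the mix):

    # Rough back-of-the-envelope: VRAM needed for the weights alone,
    # ignoring activations, KV cache, and framework overhead.

    def weight_gb(params_billions: float, bytes_per_param: int) -> float:
        """Approximate weight footprint in GB (decimal)."""
        # params_billions * 1e9 params * bytes / 1e9 bytes-per-GB
        return params_billions * bytes_per_param

    VRAM_3090_GB = 24  # assumed card; a T4 would be 16 GB

    for name, params in [("6.7B", 6.7), ("20B", 20.0), ("30B", 30.0), ("66B", 66.0)]:
        for precision, nbytes in [("fp16", 2), ("int8", 1)]:
            gb = weight_gb(params, nbytes)
            verdict = "fits" if gb <= VRAM_3090_GB else "needs offloading"
            print(f"{name} @ {precision}: ~{gb:.1f} GB -> {verdict} on a 24 GB 3090")

By this estimate, 30B needs ~30 GB even at int8, so anything at that scale or above is where offloading to CPU RAM starts to earn its keep.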