Sure, you can do that kind of thing in principle. I think the vocal cord model would need to be constrained, which arguably isn't very fair, but over a decade ago I was reading papers about simulating speech with physical models of the voice. I assume that's how they guess what dinosaurs might have sounded like.
Of course, with such a complicated and underconstrained system, you might basically need to tell it roughly what a human vocal system looks like and let it calculate the parameters based on that model. Maybe not, though; neural networks are surprising sometimes.
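To make the "give it a rough model and let it calculate parameters" idea concrete, here's a toy sketch. It assumes the crudest textbook picture of a vocal tract: a uniform tube closed at the glottis and open at the lips, whose resonances (formants) sit at odd multiples of c / (4L). The only free parameter is the tube length L, and it gets fitted to a set of observed formant frequencies. The function names and the 350 m/s speed of sound are my own assumptions for illustration, and a real articulatory model would have far more parameters than this.

```python
# Toy example: fit one parameter (tube length) of a heavily constrained
# vocal tract model. All names and constants here are illustrative assumptions.

SPEED_OF_SOUND_M_S = 350.0  # assumed value for warm, humid air in the tract


def tube_formants(length_m, count=3, c=SPEED_OF_SOUND_M_S):
    """Resonances of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


def fit_tract_length(observed_hz, c=SPEED_OF_SOUND_M_S):
    """Least-squares tube length for a list of observed formants (F1, F2, ...).

    The model is f_n = k_n / L with k_n = (2n - 1) * c / 4, which is linear
    in x = 1 / L, so the best fit is x = sum(k * f) / sum(k * k).
    """
    ks = [(2 * n - 1) * c / 4.0 for n in range(1, len(observed_hz) + 1)]
    inv_length = sum(k * f for k, f in zip(ks, observed_hz)) / sum(k * k for k in ks)
    return 1.0 / inv_length


if __name__ == "__main__":
    length = fit_tract_length([500.0, 1500.0, 2500.0])
    print(f"fitted tract length: {length:.3f} m")
    print("model formants:", [round(f) for f in tube_formants(length)])
```

With formants of 500, 1500 and 2500 Hz this comes out at about 0.175 m. The point is just that once the model's shape is fixed, the fitting problem becomes small and well-posed, which is exactly the constraint that might feel like cheating.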