For a similar project, I worked with GPT to create an extensive dataset of translations from a historical language. I could then use this both to evaluate base capacity of other models in the language, i.e. giving the model the task of translating the various passages and evaluating the results with GPT, as well as for fine-tuning.