
Hugging Face has good guides on tokenization and tokenizer training. BPE (used by GPT, for example) and WordPiece (used by BERT, for example) are two commonly used methods: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
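
For a rough feel of what BPE training does, here is a minimal sketch in plain Python. It is not the Hugging Face implementation and not GPT's byte-level tokenizer. The function names (get_pair_counts, merge_pair, train_bpe) are made up for illustration, and it assumes whitespace pre-tokenization with characters as the starting symbols.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, words):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus.

    Assumption: words start as tuples of single characters. Real tokenizers
    usually start from bytes and apply a more careful pre-tokenization step.
    """
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words
```

With train_bpe("aaab aaab aab", 1), the first learned merge is ("a", "a"), since that pair occurs more often than ("a", "b"). Ties between equally frequent pairs are broken here by insertion order, which is an arbitrary choice in this sketch.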


