- if a model has been trained on shorter paragraphs, it will likely do better on those than on longer ones, and vice versa
- each model has some maximum input length (e.g. 512 tokens, or about 350 words), and might silently discard words when it's given a longer chunk
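For instance, with the sentence-transformers library (mentioned at the end) you can check a model's limit before picking a chunk size; this is just a sketch, and the exact number varies per model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Tokens beyond this limit are silently dropped when encoding longer inputs.
print(model.max_seq_length)
```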
I don't know whether embedding your docs at multiple chunk lengths is worth the effort, but you probably want some overlap between consecutive chunks when you split your docs up.
Maybe take a look at LangChain or LlamaIndex: someone has probably come up with sensible defaults for chunk size and overlap.
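As a minimal sketch of what overlapping chunking looks like (the chunk and overlap sizes here are illustrative, not tuned defaults):

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of roughly chunk_size words.

    chunk_size and overlap are illustrative values; splitters in libraries
    like LangChain ship their own defaults and smarter boundary handling.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk repeats the last `overlap` words of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.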
If you want to do embeddings locally, check out sentence-transformers/all-MiniLM-L6-v2.
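A small sketch of local embedding with that model, assuming the sentence-transformers package is installed; the example texts are made up:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Example chunks; in practice these come from your chunking step above.
chunks = [
    "The cache is invalidated whenever the index is rebuilt.",
    "Rebuilding the index clears any cached results.",
]
embeddings = model.encode(chunks)  # one 384-dim vector per chunk
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity of the pair
```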