The reason is that you do not need a fine-grained understanding of sentence structure to group documents by topic. Word order does not matter much for this task, hence the success of methods that use a bag-of-words input representation (e.g. TF-IDF).
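To make the word-order point concrete, here is a minimal sketch of a bag-of-words TF-IDF in plain Python. The helper names (`tokenize`, `tfidf_vectors`, `cosine`), the whitespace tokenizer, the smoothed IDF formula and the toy documents are all my own assumptions for illustration; a real pipeline would more likely use a library vectorizer.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real corpora need more care).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocab, vectors) with one L2-normalized TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vocab = sorted(df)

    # Smoothed inverse document frequency (one common variant, chosen here for illustration).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # counts only, so word order is discarded here
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def cosine(a, b):
    # Vectors are already L2-normalized, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


# Toy documents (made up): the first two contain the same words in a different order.
docs = [
    "the cat chased the mouse",
    "the mouse chased the cat",
    "stocks fell as markets closed",
]
vocab, vecs = tfidf_vectors(docs)

same_words = cosine(vecs[0], vecs[1])   # same bag of words, different order
other_topic = cosine(vecs[0], vecs[2])  # no shared vocabulary
```

The first two documents end up with identical vectors even though their sentences mean different things, while the third shares no terms with them. That is exactly the information a topic clusterer needs, and exactly the information about sentence structure that it throws away.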
It might be that the corpus I was trying to cluster needs better preprocessing, or perhaps better n-grams. Using only bigrams, I saw a lot of common words that looked meaningless, but adding them as stop words made the results worse. Hence my wondering whether some other vectorization would produce better results.
On a related note, as a newcomer just trying to get things done (i.e. applied NLP), I find the whole ecosystem great but frustrating: there are so many frameworks and libraries, but no clear way to compose them. Are there any resources out there that help make sense of things?
It’s not that the words are meaningless. They are common English words that are overloaded, and I think taking their position in the sentence into account would give better results.
I haven’t tried TF-IDF yet, though, so I’ll see what that does.