The reason is that you do not need to finely understand the structure of individ...

orestis · on Jan 13, 2020

It might be that the corpus I was trying to cluster needs better preprocessing, or perhaps better n-grams. Using only bigrams, I saw a lot of common words that were meaningless, but adding them as stop words made the results worse. Hence my wondering whether some other vectorization would produce better results.

On a related note, as a newcomer just trying to get things done (i.e. applied NLP), I find the whole ecosystem great but frustrating: there are so many frameworks and libraries, but no clear ways to compose them together. Are there any resources out there that help make sense of things?

nestorD · on Jan 13, 2020

If I understand your problem correctly, you can use TF-IDF to reduce the weight of meaningless words.
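
To make the idea concrete, here is a rough sketch of the textbook weighting in plain Python, assuming documents are already tokenized. The `tfidf` helper and the three toy documents are made up for illustration, and libraries usually apply a smoothed variant of this formula, so their numbers will differ:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (each doc is a list of tokens)."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical toy corpus: "the" occurs in every document.
docs = [
    "the model learns the topic".split(),
    "the corpus needs the preprocessing".split(),
    "the bigrams add the noise".split(),
]
weights = tfidf(docs)
```

With this unsmoothed form, a word that appears in every document ("the" here) gets weight 0, while a word unique to one document keeps a positive weight, so you get a soft version of stop-word removal without having to maintain a list.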

orestis · on Jan 14, 2020

It’s not that the words are meaningless; they are common English words that are overloaded, and I think taking their position in the sentence into account would give better results.

I haven’t yet tried TFIDF though so I’ll see what that will do.