A First Excercise in Natural Language Processing with Python: Counting Hapaxes

dec0dedab0de · on Sept 8, 2017

I get that the point is to be an introduction to the libraries and whatnot, but was I the only one who immediately thought of just using Counter?

    from collections import Counter
    import re

    # \w+ (not \w*) so that findall never yields empty-string "words"
    [word for word, count in Counter(re.findall(r'\w+', text.lower())).items() if count == 1]

kleiba · on Sept 9, 2017

As the article states in the introductory paragraph, this problem encompasses more than just counting strings. It also involves "some fundamental tasks of natural language processing (NLP): tokenization (dividing a text into words), stemming, and part-of-speech tagging for lemmatization", so a little more work is required here.

cristoperb · on Sept 9, 2017

Yep, if you don't need any of the fancy NLP features of a library, then something like this is the most straightforward approach. (In the article I did give a plain Python solution that uses split() to tokenize and then Counter to get the hapaxes, in the function called "word_form_hapaxes".)
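For readers who want to try the split()-plus-Counter idea directly, here is a minimal sketch. It is not the article's code: the function name `word_form_hapaxes_sketch`, the punctuation stripping, and the sample sentence are assumptions made for illustration.

```python
from collections import Counter
import string


def word_form_hapaxes_sketch(text):
    """Return the word forms that occur exactly once, in first-seen order."""
    # Lowercase, split on whitespace, then strip punctuation from both ends
    # of each token (an assumption; plain split() would keep "dog." as-is).
    tokens = [w.strip(string.punctuation) for w in text.lower().split()]
    # Skip tokens that were pure punctuation and are now empty.
    counts = Counter(t for t in tokens if t)
    return [word for word, count in counts.items() if count == 1]


sample = "The cat saw the dog. The dog saw a bird!"
print(word_form_hapaxes_sketch(sample))  # ['cat', 'a', 'bird']
```

This only counts surface word forms, so "dog" and "dogs" would be treated as different words; that gap is what the stemming and lemmatization steps are meant to close.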

newman8r · on Sept 9, 2017

For anyone interested in more good beginner resources, I really enjoyed this YouTube playlist on Python NLTK: https://www.youtube.com/watch?v=OGxgnH8y2NM&list=PLQVvvaa0Qu...

Edit: I accidentally linked to another good playlist, but here's the first video of the NLTK list from the same user: https://www.youtube.com/watch?v=FLZvOKSCkxY

hbex5 · on Sept 8, 2017

Hapax Legomenon is such a satisfying phrase to say. Even the opportunity to look at it makes my eyes happy.

visarga · on Sept 9, 2017

I counted word n-grams up to length 6 in a corpus of 6 billion words with Madoka, a Count-Min sketch algorithm.

https://pypi.python.org/pypi/madoka
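To give a rough idea of what a Count-Min sketch does, here is a small pure-Python illustration. It does not use Madoka or reflect its API; the class name `CountMinSketch`, the table sizes, and the salted-BLAKE2 hashing are all assumptions chosen for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates may be too high, never too low."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # depth rows of width counters, all starting at zero
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate a cell, so the smallest cell is the best guess.
        return min(self.table[row][col] for row, col in self._cells(key))


# Count word 1-grams and 2-grams of a tiny sample text.
words = "to be or not to be".split()
sketch = CountMinSketch()
for n in (1, 2):
    for i in range(len(words) - n + 1):
        sketch.add(" ".join(words[i:i + n]))

print(sketch.estimate("to be"))
```

The memory used is fixed by width and depth rather than by the number of distinct n-grams, which is the property that makes this kind of structure attractive for very large corpora. The trade-off is that counts are upper-bound estimates, so a reported count of 1 does not strictly prove that an n-gram is a hapax.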

cristoperb · on Sept 9, 2017

Author here. The misspelling in the title is embarrassing, but luckily not very noticeable (I've fixed it on the site).

bluzeee · on Sept 9, 2017

It's so beautifully put together and so easy to understand. Thank you very much; it helped greatly in my learning.