Hacker News

Why didn't they just download the dumps via https://dumps.wikimedia.org/enwiktionary/ (as explained in https://en.wiktionary.org/wiki/Help:FAQ#Downloading_Wiktiona...)?

Scraping, even via an API, is way less efficient IMHO.
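To illustrate the dump route: the real pages-articles dump is a multi-gigabyte bz2-compressed MediaWiki XML export, so the usual approach is to stream it rather than load it whole. A minimal sketch with only the stdlib, using a tiny in-memory stand-in for the dump (the element names follow the MediaWiki export shape, with the XML namespace omitted for brevity):

```python
import bz2, io
import xml.etree.ElementTree as ET

# Tiny stand-in for an enwiktionary pages-articles dump (the real file has
# the same page/revision/text shape, plus a namespace declaration).
sample_xml = b"""<mediawiki>
  <page>
    <title>subbureau</title>
    <revision><text>==English==</text></revision>
  </page>
</mediawiki>"""
dump = io.BytesIO(bz2.compress(sample_xml))

titles = []
# Stream pages one at a time instead of loading the whole dump into memory.
with bz2.open(dump) as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "page":
            titles.append(elem.findtext("title"))
            elem.clear()  # free already-processed pages

print(titles)  # ['subbureau']
```

The same loop scales to the full dump by passing the downloaded file's path to `bz2.open` instead of the in-memory buffer.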




They’re in wikitext, which looks to be considerably less semantic than the crawled data. I’m not sure that’s the reason, but it could be a reason.


I'd say that's not the reason, since the wikitext is pretty semantic. The wiki source of https://en.wiktionary.org/wiki/subbureau#English is:

  ==English==

  ===Etymology===
  {{prefix|en|sub|bureau}}

  ===Noun===
  {{en-noun|s|subbureaux}}

  # A [[district]]-level public security bureau in [[China]].
So as long as one can parse wikitext, it's split up pretty well!
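For the flat snippet above, even stdlib regexes pull out the structure (a real parser such as mwparserfromhell would be more robust for nested templates); a minimal sketch:

```python
import re

# The wikitext for "subbureau" quoted above.
wikitext = """\
==English==

===Etymology===
{{prefix|en|sub|bureau}}

===Noun===
{{en-noun|s|subbureaux}}

# A [[district]]-level public security bureau in [[China]].
"""

# Section headings: ==Title== (heading level = number of '=' signs).
headings = re.findall(r"^(=+)\s*(.*?)\s*\1$", wikitext, re.MULTILINE)

# Templates: {{name|param|...}} (this snippet has no nested templates).
templates = [m.split("|") for m in re.findall(r"\{\{(.*?)\}\}", wikitext)]

print(headings)   # [('==', 'English'), ('===', 'Etymology'), ('===', 'Noun')]
print(templates)  # [['prefix', 'en', 'sub', 'bureau'], ['en-noun', 's', 'subbureaux']]
```

So the etymology (`prefix`) and the headword line (`en-noun`, with the plural `subbureaux`) fall out as labeled fields rather than free text.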



