K-Nearest Neighbors from Scratch (lettier.github.io)
171 points by craigkerstiens on June 15, 2016 | hide | past | favorite | 19 comments



While the topic may seem trivial and the execution/implementation a bit too naive, it should be noted the author is a graphic designer! Check out David's homepage [0] -- original, if not very practical :-)

I personally think it's great to see such cross-pollination between various fields. It helps other newcomers to read about this stuff "in their own language", using a perspective of someone from their own field (as opposed to the usual academic / old-hand angle).

[0] http://www.lettier.com/


It's excellent – even though I somehow get the impression you think all designers are idiots :). I've met quite a few with more talent for abstract thought than some math professors.

I can only hope that we get more and more people from other disciplines involved in data science. The field is wide open and full of opportunities. Not just the stereotypical 'creative type' creating art with ANNs, but also, for example, in applying these methods to problems in the humanities.


"I've met quite a few with more talent for abstract thought than some math professors."

Such a statement can be made about almost any two fields.

I suspect "quite a few" is either hyperbole to strengthen your point, or you suffer from the Dunning-Kruger effect.

All the parent comment says is that it's good that two fields conventionally thought to be orthogonal are cross-pollinating. Nothing about "abstract thought" or intelligence was mentioned; that's your own insecurity coming through.


I came here to say that the distance array was going to be large, but yes, pretty nice for someone who hasn't been exposed to algorithms or big-O notation.


Maybe, but if the new cross-pollination just confuses the issue rather than clarifying it, what's the point?

"Suppose you have a bunch of flowers in a field, and they're planted so that the east-west direction corresponds to the diameter of the flower and the north-south direction corresponds to the length of the leaf."

"Huh?"

"Suppose you have a bunch of data points in an n-dimensional space"

"Oh, gotcha"


"Suppose you have a flower data set, and each item has two parts: one value for flower diameter, and one for leaf length."

"Oh, that makes sense."

"Makes sense to me, too."

"And me."

"I wish my professor had just said that. Thanks!"

There is a lot of middle ground between an artsy, overly wordy scenario and a perfectly abstract scenario using jargon that's defined in terms of other jargon, etc.

The cross-disciplinary sweet spot is where you can define a problem from one side in terms compatible with an existing solution from the other side. You end up with language that's somewhere in the middle and easier to understand than it was at either extreme.
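That middle-ground phrasing maps directly to code. A minimal sketch, with made-up flower measurements and a hypothetical `classify` helper (not from the article):

  import math

  # Hypothetical data set: (diameter_cm, leaf_length_cm, species)
  flowers = [
      (2.0, 3.5, 'daisy'),
      (2.2, 3.1, 'daisy'),
      (5.0, 7.0, 'sunflower'),
      (4.8, 6.5, 'sunflower'),
  ]

  def classify(query, data, k=3):
      # Sort the data set by Euclidean distance to the query point...
      nearest = sorted(data, key=lambda f: math.dist(query, f[:2]))[:k]
      # ...and take a majority vote among the k nearest species labels.
      species = [f[2] for f in nearest]
      return max(set(species), key=species.count)

The "n-dimensional space" phrasing is the same idea: here n = 2, and each flower is just a point at (diameter, leaf length).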


His version actually contains more information.


What an exclusionary thing to say. Was that your intent?


That's not even a word.


http://www.dictionary.com/browse/exclusionary

Of course, I did hear that gullible was removed from the dictionary during an annual meeting of linguists in Leeds last year.


Pretty neat for an implementation used as a tutorial. A more industry-standard implementation would probably use a tree of some kind to reduce the search time. For example https://en.wikipedia.org/wiki/K-d_tree in http://scikit-learn.org/stable/modules/generated/sklearn.nei...


Very nice write-up. If you're interested in this, the next logical step is to look at 'locality-sensitive hashing' and some simple state-of-the-art methods: https://github.com/spotify/annoy
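A hedged sketch of the random-hyperplane flavor of LSH (all names and values here are illustrative, not Annoy's actual API):

  import random

  def lsh_signature(x, planes):
      # One bit per random hyperplane: which side of each plane x falls on.
      # Nearby points tend to fall on the same side of most planes.
      return tuple(int(sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0)
                   for p in planes)

  random.seed(0)
  dim, n_bits = 3, 8
  planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

  # Index: bucket points by signature; at query time, scan only the query's
  # bucket (and perhaps a few nearby ones) instead of the whole data set.

Annoy builds trees of such random splits rather than flat hash tables, but the pruning idea is the same.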


This description of linear-scan kNN may be a bit tedious for the Hacker News crowd. Here's the gist (in Python):

  def knn(y, X, k):
      # k best (distance, point) pairs seen so far, worst in last position.
      least = [(float('inf'), None)] * k
      for x in X:
          # Manhattan (L1) distance from x to the query point y.
          dist = sum(abs(xi - yi) for xi, yi in zip(x, y))
          if dist < least[-1][0]:
              least[-1] = (dist, x)
              least.sort(key=lambda pair: pair[0])
      return least

Re-sorting the array on every insertion isn't the best idea even for a linear scan, but it keeps the algorithm terse. A data structure like a priority queue should probably be used instead of an array.

And in practice one should use a k-d tree or ball tree for exact search, and LSH-based kNN for approximate search.


You can use `heapq` from the standard library to at least go from `n k log k` to `n log k`.
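For instance (a sketch using the same Manhattan distance as the snippet above; `knn_heap` is a made-up name):

  import heapq

  def knn_heap(y, X, k):
      # Manhattan (L1) distance from a candidate point to the query y.
      dist = lambda x: sum(abs(xi - yi) for xi, yi in zip(x, y))
      # nsmallest maintains a heap of at most k items, so a full scan is
      # O(n log k) instead of the O(n k log k) of sort-on-every-insert.
      return heapq.nsmallest(k, X, key=dist)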


Why not just use a k-d tree or a Voronoi diagram?


One reason is that those implementations are much more difficult. In k-d trees you have to deal with distances to hyperrectangles; it is not something you would write in 30 minutes.

Also, k-d trees don't scale well to higher dimensions (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/...), so for high-dimensional problems (the article says 20 dimensions is already large) the naive linear search tends to actually be faster. Voronoi diagrams are even worse in more than two dimensions.
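A quick stdlib sketch of why the pruning degrades: in high dimensions, random points become nearly equidistant (`spread` is an illustrative helper, not from any library):

  import random

  random.seed(1)

  def spread(d, n=200):
      # Ratio of farthest to nearest distance from the origin
      # over n uniform random points in the unit cube [0, 1]^d.
      pts = [[random.random() for _ in range(d)] for _ in range(n)]
      dists = [sum(v * v for v in p) ** 0.5 for p in pts]
      return max(dists) / min(dists)

  # In low dimensions the ratio is large; in high dimensions it approaches 1,
  # so a k-d tree can prune very little and degenerates toward a linear scan.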


How would you learn to implement that algorithm if you just used an existing implementation?


Read the wiki page on it, run through the pseudocode with pen and paper a few times, and then implement it.

Source: I did this when I wrote a library for reverse geocoding which uses a kd tree.

https://github.com/AReallyGoodName/OfflineReverseGeocode


Implementing a spatial tree is not the same as implementing nearest neighbours. It's just a data structure that allows efficient culling.



