Enough Machine Learning to Make Hacker News Readable Again [video] (pyvideo.org)
169 points by Wingman4l7 on May 7, 2014 | 85 comments



This guy right here made watching the video worth it: http://i.imgur.com/twr2j8Y.png


Why is it that genetic algorithms never seem to be mentioned any more? Are they sub-standard, or just at a "higher level" than is typically talked about, i.e. you must implement them yourself?


Genetic algorithms are not really an off-the-shelf black box that you can just plug your data into and get results. They take a domain expert to use efficiently, and even then they aren't guaranteed to perform that well. The area that I've encountered where they are most effective is in approximation heuristics for NP-hard problems where you slowly assemble a solution from smaller pieces.
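
A minimal sketch of what that looks like for a toy 0/1 knapsack instance (everything here -- the values, weights, and GA parameters -- is invented purely for illustration):

    import random

    # Toy 0/1 knapsack instance (values/weights made up for illustration).
    VALUES   = [10, 40, 30, 50, 35, 25, 15]
    WEIGHTS  = [ 1,  4,  3,  5,  4,  2,  1]
    CAPACITY = 10

    def fitness(bits):
        """Total value carried, or 0 if the knapsack is over capacity."""
        weight = sum(w for w, b in zip(WEIGHTS, bits) if b)
        value  = sum(v for v, b in zip(VALUES, bits) if b)
        return value if weight <= CAPACITY else 0

    def crossover(a, b):
        """Assemble a child from pieces of two parents (single-point crossover)."""
        point = random.randrange(1, len(a))
        return a[:point] + b[point:]

    def mutate(bits, rate):
        """Flip each bit with probability `rate` to keep the search exploring."""
        return [1 - b if random.random() < rate else b for b in bits]

    def evolve(pop_size=50, generations=100, mutation_rate=0.05):
        pop = [[random.randint(0, 1) for _ in VALUES] for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents  = pop[:pop_size // 2]              # truncation selection
            children = [mutate(crossover(random.choice(parents),
                                         random.choice(parents)),
                               mutation_rate)
                        for _ in range(pop_size - len(parents))]
            pop = parents + children
        return max(pop, key=fitness)

    best = evolve()
    print(best, fitness(best))

Population size and mutation rate are exactly the knobs you end up tuning to keep the search from converging prematurely.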


+1. I'd also add that genetic algorithms are for optimization, and can't really be compared with most of the algorithms in that chart. They'd belong at a sub-level where different optimization techniques for finding model weights are compared within each type of approach (classification, clustering, etc.).


Most (all?) of the algorithms on the chart iteratively optimize an objective. However, most of the objectives are convex or otherwise admit an optimization strategy that performs better than a genetic algorithm.


I believe you are repeating what I said (?). All of the algorithms have different methods of arriving at an objective function and leveraging its results. Yet most share the same problem in terms of optimizing it, and yes, most choose other routes.


Could you give a concrete example where a genetic algorithm performs well? I have never been able to find any such example.


I'm not sure this is an example of a GA performing well, but it is an interesting write-up: https://web.archive.org/web/20130526010327/http://www.aelag....


[deleted]


> They are very prone to getting stuck in local minima.

That's quite a generalization. A GA's tendency to get stuck in local minima can be mitigated by adjusting population size, selection method/size and rate of mutation -- i.e. by increasing the randomness of the search.


This is not a good generalization. I've usually only seen this issue with optimization problems when:

1) You haven't tuned the parameters, or
2) The implementation is not correct (usually the case with genetic algos, since they require a reasonable amount of domain expertise vs., say, gradient descent).


Copped out when called out. The deleted comment said GAs were bad search algorithms and tend to get stuck in local minima.


What evidence is informing your opinion that genetic algorithms are a bad search algorithm? What makes you say that they are very prone to getting stuck in local minima? Do you think they suffer from local minima more than, say, gradient descent?



I find it odd that Ordinary Least Squares is missing from the map, even though it's probably more popular than all the other methods in that entire map combined.

However, it is mentioned at the top here: http://scikit-learn.org/stable/supervised_learning.html.


OLS is a special case of ElasticNet, Lasso, and ridge regression with the regularization parameters set to zero. (The latter two are also special cases of ElasticNet with one of the two regularization parameters set to zero.) In the presence of many predictors or multicollinearity among the predictors, OLS tends to overfit the data and regularized models usually provide better predictions, although OLS still has its place in exploratory data analysis.
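
A quick way to convince yourself of that with scikit-learn (synthetic data, and regularization strengths pushed to ~zero purely for illustration -- Lasso and ElasticNet will warn that you should just use LinearRegression at that point):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

    # Tiny synthetic regression problem, invented for illustration.
    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(100)

    ols = LinearRegression().fit(X, y)

    # With the regularization strength (nearly) zero, the penalized models
    # collapse to OLS and recover essentially the same coefficients.
    for model in (Ridge(alpha=0.0), Lasso(alpha=1e-8), ElasticNet(alpha=1e-8)):
        model.fit(X, y)
        print(type(model).__name__, np.allclose(model.coef_, ols.coef_, atol=1e-3))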


To add to simonster's comment [1]: confusingly, OLS is also morally equivalent to what the map calls "SGD regressor" with a squared loss function[2]. It is also nearly equivalent, with lots of caveats and many details aside, to SVR with a linear kernel and practically no regularization.

So yeah, it is confusing. There is a lot of overlap between several disciplines and it's still an emerging field.

[1] https://news.ycombinator.com/item?id=7713940

[2] http://scikit-learn.org/dev/modules/sgd.html#regression
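
To make the "SGD regressor" equivalence concrete, here's a rough sketch (synthetic data; note the loss is spelled "squared_error" in current scikit-learn while the docs linked above call it "squared_loss", and the coefficients only match approximately, after scaling and enough iterations):

    import numpy as np
    from sklearn.linear_model import LinearRegression, SGDRegressor
    from sklearn.preprocessing import StandardScaler

    # Synthetic, standardized data (SGD is sensitive to feature scaling).
    rng = np.random.RandomState(0)
    X = StandardScaler().fit_transform(rng.randn(500, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(500)

    ols = LinearRegression().fit(X, y)

    # Same squared-error objective, optimized with stochastic gradient steps
    # and (almost) no regularization, so it lands close to the OLS solution.
    sgd = SGDRegressor(loss="squared_error", alpha=1e-10, max_iter=5000,
                       tol=1e-6, random_state=0).fit(X, y)

    print(ols.coef_)
    print(sgd.coef_)  # approximately the same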


It's also odd there's no mention of Logistic Regression.


Yeah, the nomenclature is not very rigorous and there is some overlap depending on how you look at it, but roughly, and without being pedantic, the closest thing in that map would be SGD with a logistic loss function [1].

[1] http://scikit-learn.org/dev/modules/sgd.html#classification
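
Roughly, something like this (toy data; "log_loss" is the current scikit-learn spelling of the loss, older releases and the linked docs call it "log", and the two models' regularization defaults differ, so only expect approximate agreement):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
    X = StandardScaler().fit_transform(X)

    lr = LogisticRegression().fit(X, y)

    # Same logistic-loss objective, optimized with stochastic gradient descent
    # instead of a batch solver.
    sgd = SGDClassifier(loss="log_loss", max_iter=2000, random_state=0).fit(X, y)

    print(lr.score(X, y), sgd.score(X, y))  # very similar accuracy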


It would be nice if scikit-learn included an "autolearn" function based on this flow-chart.


Cool project!

But HN to me is a way to keep current on what people in tech are talking about. I don't want to live in a bubble. I want to discover new things that smart people think are cool.


Something that keeps bringing me back to HN specifically (over the likes of Reddit, Twitter, etc.) is the sheer intelligence of conversations.

More often than not I'll find myself skimming through the discussion here before exploring the linked material. The reasoning behind why people feel something deserves to be "front paged", and the insights that domain experts offer in the discussion, are what (I feel) make HN valuable. Taking away the brainy aspect of how the community works would be an interesting experiment, though one I wouldn't want to see _replace_ what we have today.

"Things smart people think are cool" is nearly an understatement.


This is exactly what people on reddit used to say about reddit six or seven years ago. I have to believe that the decline in reddit's general quality, and the quality of its front page most of all, could happen in some form to HN without vigilance on the part of users and mods.


Ironically, dang himself just got accused of posting something "not worthy" of HN: https://news.ycombinator.com/item?id=7712692

There seems to be a remarkable difference between what people feel HN should be and what it actually is. "Quality" seems to be more of a sliding scale that correlates with personal bias and, perhaps, a feeling of alienation brought about by a diverse community, than anything objective.


The community will drift toward a common form which is shaped by the community managers (mods and power users). I believe @pg gets that, and so do the people taking over. Additionally, some users may simply outgrow the community even if it stays exactly the same.

I think HN gets a lot of things right to foster what I enjoy, which is intelligent conversation about a wide variety of forward-looking topics. I think the biggest risk at the moment is spamming the site with what are essentially targeted ads/click bait because of who the user base is, but for the most part the front page is decently curated.


Interestingly, as long as the discussions continue to contain valuable — and often educated — insights, I personally think there's enough value there alone to keep coming back to HN. The subjective quality of the links themselves may shift as the community ages, but it's the well-thought-out discussions and insights within them which are arguably often more worthwhile than the linked content itself.

It's worth mentioning that comments which don't seem to add much to the conversation have often been downvoted lately, which is promising.


Agreed. This used to be the case with Slashdot, but that became a shithole with the new management. Sorry for the language, but as a long-time reader and commenter on Slashdot, I was very bitter when stuff started going downhill.

I've found HN to be a very good alternative. The comments/discussions were a bit iffy to get into at first with some trigger-happy downmods, but overall I think the conversation is very constructive. And there is a dearth of information and helpful things being said.


Where's the dearth?


[..]being said/posted on HN. Sorry, should have elaborated that sentence.


Did you mean wealth? Otherwise I'm confused by your statement.


Wow, dearth does NOT mean what I thought it does. Anyways, I meant to say "lots", which is the complete opposite of dearth.

Thanks for being confused :)


Just a subtle prod in the right direction :)

I actually went and looked up 'dearth' just to be sure it meant 'lack of'.


You are still living in a bubble, it's just the valley-tinged HN bubble and you have encoded this as "smart people".


HN is a bubble. I came here to get to know what is being discussed on the SV startup scene.

The European enterprise consulting world is another galaxy.


I would be quite interested in similar (or not so similar) discussion forums for the European Enterprise galaxy. Would you be able to either submit a list of what you find notable, or post some starting points?


Not sure. I get most of my information via Heise, Skillsmatter webcasts and occasional meetings at local JUG and .NET user groups.


Any idea where we can keep a pulse/view on that other galaxy?


As answered in another thread,

https://news.ycombinator.com/item?id=7715048


Maybe some version of this tool should be used to filter past discussions instead of search? For example, for creating a portal about easy-to-use development tools, including the discussions?


Here's the result of the presentation: http://hn.njl.us/

The classifier rejects this very submission... not sure what to think of it.

Maybe the "[video]" label killed it? The fact that it references HN?


More likely the retrieved text from the page itself contains very little of technical interest. If there were a transcript, I expect it would fare better.

Great reminder - material in a video is undiscoverable.


Just my 2c; I can often get data faster by reading than waiting for someone to explain it verbally, so I usually prefer not to watch videos.

If someone could automatically transcribe videos with key frames from the video... (google??) ... then that would be cool.


^^^agree with this


Probably just doesn't want to hear himself talk ;)


His algorithm rates this article as bad.

That is very funny.


Idea: browser extension that notices two things: when you follow links from the HN front page, and whether you upvote that story. If you read an article and don't upvote it, it labels that article "dreck". If you do upvote it, it labels that article non-"dreck". Maybe it has some subtle reminder that you should remember to upvote the article once you're done.

People who use this extension make HN better for themselves (because they're classifying articles according to their tastes as they go along) and they're also making HN better for others (by incentivizing people to upvote good material when they may otherwise have not upvoted).

If you have enough HN karma to downvote, maybe only downvotes count as dreck. Then you're still improving both your own and others' experience.

Yo dawg, I heard you like HN, so I proposed a browser extension that lets you improve HN while you improve HN.


Two general problems with this, and they're common to many content-recommendation / filtering systems.

• Explicit rating actions are only a small part of interactions with a site. Other implicit actions are often far richer in quantity and quality -- time spent on an article, interactions and discussion, the quality of that discussion (see pg's hierarchy of disagreement, for example), and other measures. As Robert Pirsig noted, defining quality is hard.

• Whose ratings you consider matters. The problem of topic and quality drift happens as general interests tend to subvert the initial focus of a site or venue. Those which can retain their initial focus will preserve their nature for a longer period of time, but even that is difficult. Increasingly, my sense is that you want to be somewhat judicious in who you provide an effective moderating voice to, but those who get that voice should be encouraged to use it copiously. Policing the moderators (to avoid collusion and other abuse) becomes a growing concern (see reddit and its recent downvote brigades against /r/technology and /r/worldnews).


Regarding the first part, granted.

Regarding the second part, the proposed scheme uses HN's built-in control of making users earn a bunch of karma before letting them downvote. I agree that topic drift happens; witness all of the bitcoin-related discussion over the past year or so.


So, there are two basic approaches you can make to this:

1. Delegate moderation powers only to a select group of individuals who know and will uphold the site's standards. Effectively: an editorial board.

2. Allow all users to moderate. But score each on how well the result of their moderation adheres to a specified goal -- that is, for a given submission, was it 1) appropriate to the site, and 2) did it create high-level engagement? Users might correlate positively or negatively, strongly or weakly. That is: some people will vote up content that's not desirable and downvote content that is. Others simply can't tell ass from teakettle. In the first case you simply reverse the sign; in the second, you set a low correlation value. And of course, those who are good and accurate predictors get a higher correlation value.

With the 2nd approach, everyone's "vote" counts, though it may not be as they expected. You've also got to re-normalize moderation against a target goal.

It's more computationally intensive, but I think it might actually be a better way of doing things.
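
A toy sketch of what that second approach could look like (all data, the choice of plain correlation as the weighting, and the names here are invented purely for illustration):

    import numpy as np

    # Hypothetical history: votes[u, s] is +1 (upvote), -1 (downvote) or 0
    # (no vote) by user u on past submission s; `quality` is whatever
    # ground-truth signal the site decides to target.
    rng = np.random.RandomState(0)
    votes   = rng.choice([-1, 0, 1], size=(6, 40))
    quality = rng.choice([0, 1], size=40)

    def voter_weight(user_votes, quality):
        """Correlation between a user's votes and the target signal.
        Negative => informative but inverted (reverse the sign);
        near zero => not predictive, so their votes count for little."""
        if user_votes.std() == 0:
            return 0.0
        return float(np.corrcoef(user_votes, quality)[0, 1])

    weights = np.array([voter_weight(v, quality) for v in votes])

    def score(new_votes):
        """Score a new submission with each vote weighted by track record."""
        return float(np.dot(weights, new_votes))

    print(weights)
    print(score(rng.choice([-1, 0, 1], size=6)))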


  > ... hn's built in control of making users
  > earn a bunch of karma before letting them
  > downvote.
Since all of this is talking about the classification of submissions, this is irrelevant, because you can't downvote submissions, only comments.

At least, I don't yet have enough karma to downvote submissions.


Submissions can be flagged.

I'd argue they should be downvotable as well, though you're right, they're not.

Incidentally, comments can also be flagged (on the comment link view only, not in the forum view).


And the most beautiful thing here is that his classification algorithm marks this entry as "Probably I shouldn't read this"

I love it: http://hn.njl.us/


Are people really serious when they talk about the fabled Hacker News of old? What could it possibly have been like? I'm imagining something like zombo.com, only with lower contrast text.


The way I remember it, it was like this site but more frequently updated and focused mostly on Haskell:

http://lambda-the-ultimate.org/

If you read a few rows down on that page, you'll see this:

"For the debate about MS being evil, you can head directly to HN where you'll also find an explanation of what bootstrapping a compiler means."

And that about sums it up. For a while I didn't even create an account because I didn't think I could add anything without sounding stupid compared to everyone else. Now I try to refrain from commenting for...different reasons.


> And that about sums it up. For a while I didn't even create an account because I didn't think I could add anything without sounding stupid compared to everyone else. Now I try to refrain from commenting for...different reasons.

Same here. Though I refrain less.

At some point, I'd like to go and find my first comment on here just to see what got me to make an account.



That is the most disappointing thing I have seen all day. Oh well.


People keep saying how much better HN was, but I'm just not seeing it: https://web.archive.org/web/20071115044647/http://news.ycomb...


Good catch, articles are saved as well.

I don't notice a shift in tenor between the crowd of old and the one we have now.


See for yourself.

    https://news.ycombinator.com/classic
Often the content on the main page is quite different.


That's biased by the fact that many people who were upvoting back then have since left the site. If you want to see for yourself, the Wayback Machine is a better sample: https://web.archive.org/web/20071115044647/http://news.ycomb...


Is there an explanation of what's going on for that page somewhere?


Articles voted up by old timers.


Is there a definition of "old timers"? An account N years old? Signed up before year X?


To my recollection, it was pretty similar, but with a lot fewer general news stories and a lot more stories about specific YC companies.


Slashdot was fun back in the day (ID 64578); most sites go through that progression. I think HN has changed in that general social stories show up more. I do think the weekends are a bit weirder now.


I actually did something like this at some point. I took all the high ranking items, tokenized them to extract features, and ran them through a bayesian classifier to do some filtering. I was just using whatever information was available on the front page and did not do any further analysis with the actual content.

The results were OK. Maybe with a bit more power it could be more useful, but the results were still hit and miss, and I didn't have a long-term strategy for not filtering myself into a bubble other than continuously re-training the model.
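
Something along these lines, for anyone curious (the titles, labels, and the CountVectorizer/MultinomialNB choices below are just illustrative, not necessarily what I actually ran):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical training data: front-page titles labeled 1 (worth reading)
    # or 0 (dreck) according to my own clicks/votes.
    titles = [
        "Show HN: a tiny Lisp interpreter in 500 lines of C",
        "Why our startup pivoted to enterprise sales",
        "Understanding B-tree indexes in PostgreSQL",
        "10 productivity hacks every founder should know",
    ]
    labels = [1, 0, 1, 0]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(titles, labels)

    print(clf.predict(["A gentle introduction to Kalman filters"]))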


As an econometrician I cannot believe how many times he said 'magic'. There is something very wrong when you put things in your model 'because, who knows, it might be helpful' (like he did with host names). Variable selection is a very hard problem, and using 'magic' is asking for problems. It is so disappointing to see machine learning, statistics, and econometrics deal with similar problems and fail to learn from one another.


It's a completely harmless toy project, so who cares if he chose his variables non-scientifically? He's not creating a cancer diagnosis tool here.


I understand this is a toy project, but he is in a position in which he is educating people on how to use these methods, and he gives the wrong impression. The next guy might use this flawed logic while creating a tool for disease prevalence prediction.


To be fair, he did explain that he thought the host name might be indicative of whether or not it would be dreck. If he knew exactly how that was the case (and if it was already known that it did have an effect), why bother with machine learning? Just write an explicit scoring mechanism.


It did seem to me that this was an interpretation he came up with after he tried many different pipelines and "flipped all the switches". There are many sources of randomness that warrant using statistical methods. But it feels strange to me to see people use these tools without giving much thought to parameter stability, parameter significance, causality, or model selection in general.


That's exactly the recent criticism of 'big data': engineers and others getting correlations they don't understand from all the data they can collect, and attempting to use them for who knows what.


I do agree.

The presentation did a wonderful job of providing a high-level introduction to the idea of machine learning, but anyone who's strongly interested in ML should pick up some of the books he mentioned.


Interestingly enough... the greens are all ones I clicked on earlier today and read. A few false negatives, but not bad!


Just because I think it's worth mentioning, I find it ironic that the link to this video got marked as bad using his algorithm =)

Saw this at the link he provided.


Excellent, first-rate presentation. Those giving technical talks might want to take note.


Very block-boxy ("and it does a whole bunch of math and voila!")


Yes. So are compilers. And web frameworks. And editors. And memory-management tools. Progress is made by no longer re-implementing and re-inventing the things that many, many people have invented and implemented in the past, and building on their work.

This doesn't mean that there is no value in learning about these things for yourself, but the packaging of knowledge in reusable tools is the only way programming progresses.


Nicely encapsulated doesn't have to imply a black-box implementation, though. I for one would like it if compilers were less black-boxy; ideally, I want to find out why my compiler does a particular thing by investigating its output, querying the API, going through the compilation steps, etc., rather than having to google some StackOverflow answer.


Isn't that the point of tools like scikit-learn? You don't need to know how to code, optimize, etc. all the algorithms, just understand how to use them.


Perhaps, but I feel like if you are trying to use a statistical tool, it would be best to know how it works. Think about what would happen if every scientist claimed a discovery whenever they found a result at a 90% confidence level. Machine learning (at least in this application) is different because the consequences are often testable and verifiable, but I still think it's better to know how it works than to treat it like a black box.


There is probably a large group that lacks the advanced linear algebra and statistics needed to learn the theory but would still be able to build useful applications using an ML library. I think the video is mainly directed at that group.


What makes you assert he is treating it like a black box? There isn't time in the presentation to go into detail, but actually linear models are inspectable, namely, you can obtain a list of features and how they are weighted. Also, as he said, the scikit-learn documentation is of high quality and explains how the models work. BTW you give an example of scientists, but like he stressed, machine learning as he applied it is a form of engineering.
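
For what it's worth, that kind of inspection is only a couple of lines (toy data; get_feature_names_out is the current scikit-learn spelling, older releases call it get_feature_names):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier

    # Toy corpus and labels, invented for illustration.
    docs = ["deep dive into garbage collection internals",
            "top 10 growth hacks for your landing page",
            "formal verification of a tiny kernel",
            "this one weird trick for more retweets"]
    labels = [1, 0, 1, 0]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    clf = SGDClassifier(random_state=0).fit(X, labels)

    # The fitted model is just one weight per token, so you can read off
    # which tokens push a title toward "read it" vs. "dreck".
    weights = sorted(zip(vec.get_feature_names_out(), clf.coef_[0]),
                     key=lambda t: t[1])
    print(weights[:3])   # most negative tokens
    print(weights[-3:])  # most positive tokens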


black box, not block box.


I didn't even see that at first, although now that I look at it... I kind of like the term "block box". It takes a black box, and defines it in terms of how it's used, not what it does. It is a block that can be implemented in a certain way. How does it work? Doesn't matter at this level. It's a building block for a differently-focused project. A block box.



