Did you look at posting times at all to exclude people who never post at certain times? (I should add that that's not my idea but something someone posted on the original thread).
I used to do cold call sales for a living and from the "pick up" you can tell what kind of person is answering and modify your voice accordingly. You basically just follow their lead.
E.g.:
* Quiet people don't want shouty salesmen so go quiet, reserved and professional.
* Some guy off a council estate/trailer park is typically not going to want to talk to a suit so in that case you'll adopt a very casual and matey tone as if you've bumped into him in a pub/bar.
* If I EVER met someone who "knew the score" (e.g. someone smart or knows how telesales work) then I'd just drop all pretences and give it to them "straight".
Those are the "obvious" kinds of NLP; there are subtler bits and pieces too, including copying their tone or turns of phrase.
It does increase your conversion rate.
Obviously (almost) everyone HATES cold calls, so one of my goals was to see how long I could keep "angrys" on the phone. NLP definitely helps here (although empathy, humility and diplomacy are probably more useful). I even managed to sell to one once! :D
For the challenge, mostly. Also because it can result in a good call back as well (apparently after I left my call-backs had good conversion rates).
As long as you combine the right amount of humility, amiability and possibly humour, you can turn almost anyone around. For example, if someone is getting A LOT of sales calls and yours is the straw that broke the camel's back, you can still help them. Provide information, give them the Telephone Preference Service number (to remove them from sales-call databases), tell them how automated diallers work or how the regulations around cold calls work. Be honest.
There is no reason why any call should have to end in a negative on either side. :)
I asked PG if swombat and onetimetoken had set off the HN sock puppet detector and he said that onetimetoken had used an IP address never before seen by HN.
So, swombat, time to prove that you are onetimetoken.
If I was onetimetoken, and I went to all this hassle to create a properly anonymous account and even laid down a challenge to you to find me out... why would I help you to find me out by giving you proof?
This is certainly an interesting problem, but without some access to the data I don't think I could really approach it.
Perhaps I am alone in saying this, but I think data mining is interesting while web crawling is boring. Could somebody make the data available so that we don't have to write a crawler? Or is this part of the challenge?
I think this is a classic example of unsupervised learning, for which I would generally use a system like Fuzzy ART. I think that might perform better than a naive Bayesian text classifier, though I can't be sure until I try it out.
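For reference, the core of Fuzzy ART is quite small. Here's a minimal sketch (not a tuned implementation) operating on feature vectors scaled to [0, 1], e.g. per-user normalised word frequencies, with the usual complement coding:

    import numpy as np

    def fuzzy_art(inputs, rho=0.75, alpha=0.001, beta=1.0):
        # inputs: 2-D array of feature vectors scaled to [0, 1]
        # complement coding keeps the norm of each input constant
        X = np.asarray(inputs, dtype=float)
        I = np.hstack([X, 1.0 - X])
        weights = []                      # one weight vector per category
        assignments = []
        for x in I:
            # choice function for every existing category
            scores = [np.minimum(x, w).sum() / (alpha + w.sum()) for w in weights]
            chosen = None
            for j in np.argsort(scores)[::-1]:
                # vigilance test: is the match with this category close enough?
                if np.minimum(x, weights[j]).sum() / x.sum() >= rho:
                    chosen = int(j)
                    break
            if chosen is None:
                weights.append(x.copy())  # no category matched: create a new one
                chosen = len(weights) - 1
            else:
                w = weights[chosen]
                weights[chosen] = beta * np.minimum(x, w) + (1.0 - beta) * w
            assignments.append(chosen)
        return assignments

Whether the anonymous text ends up in the same category as one known user is then the "match"; how well that works on this little data is exactly what I'd have to try out.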
If anyone wants to use 80legs for this challenge, just drop us a line at http://www.80legs.com/contact.html. We might be able to set up some custom free plans.
Isn't the naive Bayes classifier biased towards users with a large volume of text? I.e. if there are two users and one writes 99% of the content, it's very likely that that user will be picked as the author for almost anything? At the same time, this may be desired, since someone who contributes a lot on HN may also have wanted to have some fun.
The key calculation is: given a word w, what is the frequency with which this user uses w? That's the number of times w occurs divided by the number of words that user has written. So it doesn't matter, as long as a user has 'enough' text to cover a good portion of the overall dictionary of words in use.
The prior probability is based on the number of comments a user makes. In this case that prior is insignificant because the sample text is large.
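A rough sketch of that calculation in Python (comments_by_user is an assumed dict of user -> list of comment strings; add-one smoothing is added so unseen words don't zero a user out):

    import math
    import re
    from collections import Counter

    def tokenize(text):
        # crude lower-case word tokenizer
        return re.findall(r"[a-z']+", text.lower())

    def train(comments_by_user):
        word_counts, total_words, comment_counts, vocab = {}, {}, {}, set()
        for user, comments in comments_by_user.items():
            counts = Counter()
            for c in comments:
                counts.update(tokenize(c))
            word_counts[user] = counts
            total_words[user] = sum(counts.values())
            comment_counts[user] = len(comments)
            vocab.update(counts)
        return word_counts, total_words, comment_counts, vocab

    def score(text, word_counts, total_words, comment_counts, vocab):
        total_comments = sum(comment_counts.values())
        V = len(vocab)
        scores = {}
        for user in word_counts:
            # prior: fraction of all comments written by this user
            s = math.log(comment_counts[user] / total_comments)
            for w in tokenize(text):
                # P(w | user) with add-one smoothing
                p = (word_counts[user][w] + 1) / (total_words[user] + V)
                s += math.log(p)
            scores[user] = s
        return sorted(scores.items(), key=lambda kv: -kv[1])

With a sample text as long as the anonymous comments, the per-word likelihood term dominates that prior, which is the point above.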
Very cool, and appropriate that you're basically using PG's spam filtering to identify users on his site :)
I think the next step is to write a more complex filter that does not assume word probabilities are independent of each other, i.e. take unusual phrases like "entirely dissimilar" into account.
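One cheap way to relax that assumption a little is to treat adjacent-word bigrams as extra "words", so a phrase like "entirely dissimilar" becomes a single feature. A sketch of that idea, assuming the same per-user frequency scoring as above:

    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def features(text):
        words = tokenize(text)
        # unigrams plus adjacent-word bigrams such as "entirely dissimilar"
        bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
        return Counter(words + bigrams)

    # feed these Counters into the same per-user frequency calculation as before;
    # the model is still "naive", but the features now capture some word order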
I'd have thought that capitalisation and punctuation were key elements in any textual analysis. In the subject text there is a very unusual hyphenation "pure-ad" for example.
In authorship identification punctuation can be very important, as can non-grammatical features like the distribution of the number of syllables per word.
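For what it's worth, those features are cheap to extract. A rough sketch, with a deliberately crude vowel-group syllable counter (a proper one would need a pronunciation dictionary):

    import re
    from collections import Counter

    def punctuation_profile(text):
        # relative frequency of each punctuation mark
        punct = Counter(ch for ch in text if ch in ".,;:!?-'\"()")
        total = sum(punct.values()) or 1
        return {ch: n / total for ch, n in punct.items()}

    def capitalised_word_rate(text):
        words = re.findall(r"[A-Za-z]+", text)
        return sum(1 for w in words if w[0].isupper()) / (len(words) or 1)

    def syllable_distribution(text):
        # very rough: count groups of consecutive vowels as syllables
        dist = Counter()
        for word in re.findall(r"[A-Za-z]+", text.lower()):
            syllables = max(1, len(re.findall(r"[aeiouy]+", word)))
            dist[syllables] += 1
        total = sum(dist.values()) or 1
        return {k: v / total for k, v in dist.items()}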
That's a nice approach, but a naive Bayes classifier doesn't seem like the best method for this particular problem.
You probably want to do an N-gram analysis, like that performed by libtextcat http://software.wise-guys.nl/libtextcat/. This will perform a comparison based on common combinations of letters used (like "wo", "or", "rd"). Seems like it would be more accurate with such a relatively small sample of comments. If you had a list of 10-20 possible candidates, you could narrow it down to just a few.
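libtextcat is based on the Cavnar-Trenkle ranked n-gram method; a stripped-down sketch of that idea (profile = most frequent character n-grams in rank order, distance = sum of rank differences) could look like this:

    from collections import Counter

    def ngram_profile(text, n_max=3, top=300):
        # most frequent character n-grams (n = 1..n_max), in rank order
        counts = Counter()
        padded = " " + " ".join(text.lower().split()) + " "
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
        ranked = [g for g, _ in counts.most_common(top)]
        return {g: rank for rank, g in enumerate(ranked)}

    def out_of_place(profile_a, profile_b, penalty=300):
        # sum of rank differences; n-grams missing from b get a fixed penalty
        return sum(abs(r - profile_b.get(g, penalty)) for g, r in profile_a.items())

    # build a profile per candidate user from their comment history and
    # take the smallest distance to the anonymous text as the best match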
As I was in the list on there, I just want to confirm it wasn't me but when I read the original comments left by the anonymous commenter, I saw a lot of my own syntax mannerisms - at least the algorithm isn't too bad, eh? ;-)
So basically everybody struck out. Most likely due to sample size.
There's an interesting lesson here: the coolness of the tool used has no direct relation to the usefulness of the conclusions it produces.
I think it would be more interesting if the "guesses" actually took into consideration how successful or unsuccessful the method is with the data available. For example, how likely is each of the names he mentioned, and how likely is it that it's any one of them at all?
Edit: If someone here has a background in intelligence, I would love to hear their take on the challenge.
Great to see an analytical approach to the challenge :D Although reviewing your first list most of them actually seem unlikely (for a variety of reasons).
I'm dying to know if this turns out to be right, or not. I actually have a totally different list which is based on a different handling of punctuation. It suggests that the most likely person who also commented on that thread is zaveri.
This challenge is a bit flawed in that we have no way of knowing whether the anonymous poster will ever confirm that he/she made the post, isn't it? Not to state the obvious, but even if a great amount of technical work is put into this, you'll never know the answer unless the person in question agrees to participate.
You can test your techniques on comments (not used in training) for which you know the author. If you can achieve a high accuracy there, you can be fairly sure to be correct in the challenge, too.
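Concretely, hold out a few comments per user, train on the rest, and measure how often the top-ranked user is the true author. A sketch, where classify is whatever scorer you're testing (wrapped to return a single predicted user) and comments_by_user is an assumed dict of user -> comment strings:

    import random

    def holdout_accuracy(comments_by_user, classify, held_out=3, seed=0):
        # split each user's comments into a held-out test set and a training set
        rng = random.Random(seed)
        train_data, tests = {}, []
        for user, comments in comments_by_user.items():
            comments = comments[:]
            rng.shuffle(comments)
            tests.extend((user, c) for c in comments[:held_out])
            train_data[user] = comments[held_out:]
        # fraction of held-out comments attributed to the right author
        correct = sum(1 for user, text in tests if classify(train_data, text) == user)
        return correct / len(tests)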
I'm still convinced the lower-case use of google, facebook etc. (which occurred more than once in the comment) is important - especially as there is one "Google" at the start of a sentence - indicating it is intentional/common.
That's why I personally discounted many from your first list (plus the fact that I know a few are native English speakers).
I am aghast that people would think I'm not a native English speaker. I hope I'm numbered in the "few." Disclaimer: sometimes Windows handwriting recognition makes a real hash of my post without my noticing.
Run it through a Markovian classifier like CRM114 (or, if that's too expensive, just do it for the likely candidates identified by your naive Bayesian classifier).
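CRM114 itself is far more elaborate, but the gist of a Markovian classifier can be sketched as a per-author character bigram model scored by log-likelihood (a toy stand-in, not CRM114's actual algorithm):

    import math
    from collections import Counter, defaultdict

    def bigram_model(texts):
        # per-author character transition counts
        counts = defaultdict(Counter)
        for text in texts:
            t = " " + text.lower()
            for a, b in zip(t, t[1:]):
                counts[a][b] += 1
        return counts

    def log_likelihood(model, text, alphabet_size=100):
        t = " " + text.lower()
        ll = 0.0
        for a, b in zip(t, t[1:]):
            row = model.get(a, Counter())
            # add-one smoothing over an assumed alphabet size
            ll += math.log((row[b] + 1) / (sum(row.values()) + alphabet_size))
        return ll

    def best_author(texts_by_user, sample):
        # texts_by_user: assumed dict of user -> list of their comments
        models = {u: bigram_model(ts) for u, ts in texts_by_user.items()}
        return max(models, key=lambda u: log_likelihood(models[u], sample))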