Obvious point to raise: the reason people regularly delete their browser history is that they watch porn without turning on private browsing. How do you propose to deal with this?
You'd need to provide at least the ability to selectively delete portions of the history. But you can selectively delete portions of your browser history too, and people don't - because it would be too easy to miss something. Instead, they just nuke the whole thing. How is your tool different?
I take advantage of my browser's history with porn. When I was in college, the first bash script I wrote was one to open the movie in my porn collection that I hadn't watched in the longest time. This was great. But now with streaming porn sites I don't have a huge collection, and I often watch scenes that I can't seem to find later. There is a lot of porn out there.
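For the curious, the whole trick fits in a few lines of Python (a rough sketch rather than the original bash; the collection path and player are stand-ins, and whether access times get updated depends on your mount options):

    import os
    import subprocess

    COLLECTION = os.path.expanduser("~/movies")  # hypothetical collection dir
    PLAYER = "mpv"                               # any player that takes a path

    # Gather every file under the collection.
    paths = [os.path.join(root, name)
             for root, _dirs, names in os.walk(COLLECTION)
             for name in names]

    # The file with the oldest access time is the one unwatched the longest.
    oldest = min(paths, key=lambda p: os.stat(p).st_atime)

    # Playing it refreshes its atime, so it rotates to the back of the queue.
    subprocess.run([PLAYER, oldest])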
Sure, people clear their browser history because they're embarrassed by their porn obsession, but I think this tool could be very useful for pornaholics too.
Vinny Glennon, one of the founders here. Thanks very much for the upvotes. The Chrome extension does not work in private browsing. I have a set of porn sites (1.7 million, stored in Redis) that I check incoming links against for membership. You can selectively block sites (https://www.seenbefore.com/blacklist_items).
Do you store the entire set of 1.7 million entries in redis? Or is redis an index to data stored elsewhere, in a relational DB perhaps?
I was under the impression that Redis wouldn't be all that useful for storing a lot of data. It would be great if something as quick as Redis could work with large data sets.
Storing a list of 1.7 million strings takes us 70 MB in memory. Testing for membership is an O(1) op. Very happy with it. We use Mongo as a dumb data store, as well as a bunch of other infrastructure tools, like http://circleci.com, that we could only have dreamt of years ago.
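If anyone wants to poke at this themselves, here's a minimal sketch with redis-py (the key name and example domains are made up):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Load the blacklist into a Redis set; SADD ignores duplicates.
    r.sadd("porn_sites", "example-adult-site.com", "another-site.net")

    # SISMEMBER is the O(1) membership test mentioned above.
    def is_blacklisted(domain):
        return bool(r.sismember("porn_sites", domain))

    print(is_blacklisted("example-adult-site.com"))  # True
    print(is_blacklisted("news.ycombinator.com"))    # False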
Again, just curious, but I'd still like to know how. Someone there asks the mod how it could be O(1), and the mod replies that it's a "hash table lookup". But Wikipedia at http://en.wikipedia.org/wiki/Big_O_notation suggests that such a lookup is no faster than O(log log n). I think the Redis info is incorrect.
O(1) implies that the location of the member in the list is already known, with no search required. I don't see how that could be the case when it's a key lookup. The key could be anywhere in the list, even if the list is sorted. The key would have to be searched for, it seems.
O(1) means constant time.
Redis sets are interesting, because there are a few possible implementations under the hood, but the typical case winds up implementing it as a hash table. Hash tables have constant time lookup.
Think of it this way (this isn't literally what happens, but it's close).
1) Take the URL you're looking for. Run it through a hash function. This takes an (amortized) constant amount of time.
2) Now you have an index to check (the return value from the hash function). So index into your actual table and check to see what's stored there. If there's a value stored there, then the URL is a member of the set. This also takes a constant amount of time.
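A toy version of those two steps (deliberately simplified; real hash tables are more careful about collisions and resizing):

    # Toy hash set: a fixed-size table of buckets.
    TABLE_SIZE = 1024
    table = [[] for _ in range(TABLE_SIZE)]

    def add(url):
        index = hash(url) % TABLE_SIZE   # step 1: hash -> index
        if url not in table[index]:
            table[index].append(url)     # step 2: store in that bucket

    def contains(url):
        index = hash(url) % TABLE_SIZE   # same hash, same bucket
        return url in table[index]       # look in exactly one bucket

    add("https://example.com/page")
    print(contains("https://example.com/page"))   # True
    print(contains("https://example.com/other"))  # False

The point is that contains() inspects a single bucket chosen by the hash; it never scans the rest of the table.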
> So index into your actual table, and check to see what's stored there
But that check isn't a constant time lookup. The lookup time can vary. (Analogously, a lookup in a phone book can vary in time; we can't necessarily go to the exact spot the first time.) So the total time for both steps must vary as well. I think.
I think where you're getting confused is that you're conceptualizing this like a search problem, where you compare values and inspect each member to see if it matches the target.
That's not what's going on here. Instead, you use the value as an input to a function that tells you where to look for it, then you look to see if it's there.
If it's not there, it won't be anywhere else, so you don't have to keep looking. Things get interesting with collisions but that's a subject for another time.
The main benefit of Bloom filters is that they can be made small. Given that his database takes only 70MB or so, and he's not trying to ship this to devices with tight space constraints, there would appear to be little point.
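For scale, a bare-bones Bloom filter looks like this (sizes picked arbitrarily for illustration; it answers "possibly in the set" or "definitely not"):

    import hashlib

    M_BITS = 8 * 1024 * 1024          # 1 MB worth of bits
    K_HASHES = 7                      # number of salted hash functions
    bits = bytearray(M_BITS // 8)

    def _positions(item):
        # Derive k positions by salting one cryptographic hash.
        for i in range(K_HASHES):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % M_BITS

    def bf_add(item):
        for pos in _positions(item):
            bits[pos // 8] |= 1 << (pos % 8)

    def bf_might_contain(item):
        # No false negatives; occasional false positives.
        return all(bits[pos // 8] & (1 << (pos % 8)) for pos in _positions(item))

    bf_add("example-adult-site.com")
    print(bf_might_contain("example-adult-site.com"))  # True
    print(bf_might_contain("news.ycombinator.com"))    # almost certainly False

The whole structure is just the bit array, which is why it can be made far smaller than storing the strings themselves, at the cost of those false positives.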
Wouldn't one simply use one specific browser, say Safari, Firefox, or Chrome, and only that browser, for their... unsavory activities? I think that is a great way to keep accounts separate and keep "bad" sites from knowing about "good" sites and vice versa. Just saying. Not that I partake in any such unsavory activities.
For testing purposes I use Chrome's "Users" feature to keep an extra profile with no extensions installed handy.
The same could be done for a "Porn" profile too, I guess, sand-boxing any history, extensions, and bookmarks to that profile. You could even tie it to a Google account for portability.
This problem is nullified by private browsing. I think the idea is BRILLIANT, as Google's already tracking all my 'legitimate' searches, and I find that most of what I Google are things I've looked at on other machines, or seen already.
The noise introduced by phrasing my query differently is a real problem in search that Google hasn't fixed yet.
Beat me to it! This was something I had been planning to build on my own for a while, but didn't get around to it. Congrats!
Whenever I have tech discussions with friends, I would recall something mentioned in an article I read via HN. But it would take me a whole lot of effort to find that link. Oftentimes I simply couldn't get hold of it even after an hour of searching.
Please do get the Firefox extension out. Would love to use it. Also, please do make sure the extensions/addons are stable. Have been facing problems with Annotary's extensions [1], for instance.
By the way, do you have a crawler fetch the link content or do you send it from the user's browser?
Cofounder here. Our first version spidered out for the content, but a far more efficient way was to upload a compressed version of the data from the user, as we can then do hash checks for reference counting. The Chrome extension has been used in the wild for the last 3 months on 6 continents. The Firefox extension is too unstable at the moment (also, Mozilla's ten-day review process), but we hope to get it out within 1-2 weeks. Would love any feedback, good or bad!
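Roughly, the idea behind the hash checks: hash the compressed content, store the body once per unique hash, and bump a reference count otherwise. A sketch (dicts standing in for the real datastore):

    import hashlib
    import zlib

    store = {}      # content hash -> compressed page body
    refcounts = {}  # content hash -> number of references

    def upload(page_html):
        compressed = zlib.compress(page_html.encode())
        digest = hashlib.sha256(compressed).hexdigest()
        if digest not in store:
            store[digest] = compressed   # first time we've seen this content
        refcounts[digest] = refcounts.get(digest, 0) + 1
        return digest                    # the user keeps a cheap reference

    a = upload("<html>same article</html>")
    b = upload("<html>same article</html>")
    assert a == b and refcounts[a] == 2  # stored once, referenced twice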
Was mainly concerned about the scalability. For a large number of users, your server would have to handle a large number of concurrent connections while they uploaded their data. If you used spiders, you could push the URLs to a queue and process them at your convenience.
How do you deal with 2 users looking at the same URL but seeing different things? example.com/me would be different for user1 and user2.
Some pages would be very dynamic, e.g. Facebook. And not everyone browses Facebook/Twitter behind HTTPS (which you do not index). Do you not index social networks?
I like the fact that the extension requires no user input and works silently in the background. Has some trade-offs, but worth it. Cannot comment on the search quality yet because Chrome is only my secondary browser; not enough history to search for anything meaningful.
A few annoyances I noted in the FAQ's "What Google search sites does it support?" section: google.co.in is in English by default; you would have to explicitly set it to another language [1]. "Indian" is not a language (Hindi, Malayalam, Bengali, etc. are [2]). And Farsi is not spelt "Farsai" [3].
Fixed. Switched the example to Turkish and Iranian, as that is where most of our traffic from this post came from. Just read an 800-page book on India; can't believe I made that mistake.
You need to work on stemming and clustering terms, I think... I've visited a number of Postgres-related pages, and some of them contain only "Postgres" while others contain only "PostgreSQL", and searching for Postgres will only give me the former pages. It confused me for a little bit.
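Even a cheap alias table applied at both index and query time would fix this particular case (a sketch; the alias list is obviously incomplete):

    ALIASES = {
        "postgresql": "postgres",
        "pgsql": "postgres",
    }

    def normalize(text):
        # Lowercase, split, and map known aliases to one canonical term.
        return [ALIASES.get(tok, tok) for tok in text.lower().split()]

    print(normalize("PostgreSQL performance tips"))  # ['postgres', 'performance', 'tips']
    print(normalize("Postgres tuning"))              # ['postgres', 'tuning']

A real stemmer handles plurals and verb forms, but it wouldn't map "PostgreSQL" to "Postgres" anyway, so some synonym table is probably needed regardless.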
The main issue with Evernoting and bookmarking is that it requires an effort to say that, today, this page is useful and I want to store it. Most pages I want to find are the very things I did not think were useful at the time.
Each unique page (unique as per the content) is stored per user.
Our goal is to build the tools needed to find the information quickly, similar to what hipmunk.com did for airline search. We have the added dimension of time to use.
The main thing I use bookmarks for is categorisation. If you add the ability to tag and/or add notes that become part of the search terms, that'd be the killer feature for me - I could throw out my 3500 bookmarks and remove Xmarks (at least if we could get a way of automatically importing our existing bookmarks).
I'm a paying Xmarks user, but if you were to add a way of tagging sites or adding a note, I'd happily pay for this instead. Just a freeform text field that I could add some keywords into that gets treated as part of the search would actually be sufficient for me.
One feature that could help this: verifying that the account holder is the one using the computer, before showing results.
Without this, assuming this plugin is always on across all the computers one uses, breaking a user's privacy just becomes too easy.
And besides porn, there's a lot of data one might want to keep private (and usually doesn't post on Facebook): medical issues, sexual issues, marriage and other relationship issues, drug issues, and probably others.
But Seen Before requires less effort on my part as a user -> I am more likely to use it. I just continue to google as per normal and now I have an extra option on the right to filter results.
Definitely something we are looking into. The major barrier is the cost to someone of keeping a server running 24/7 in the cloud (a micro instance on AWS is 175 dollars a year).
Specifically, I'm not comfortable with a big web company keeping the history of my web activity, so I made it work completely locally. My project did not get much uptake; probably my lackluster marketing and other assorted issues are to blame. So good luck with this one!
Co-founder here. This took us by surprise; we were planning to have Firefox and Safari support done by launch. At this stage, it is priceless to know whether we are solving a real problem people have. Also, whether this is something people would pay for (which loops back to whether this is enough of a pain point). The moment we start charging is the moment we start learning.
See my comment elsewhere: with tagging or (simple plain-text) notes attached, absolutely. Even more so with a simple API and/or support for pushing the cached content to my own server. If it could be selectively enabled for private content too, then even better (e.g. there are several extensive private wikis I use regularly that are not sensitive enough that I'd worry about getting them indexed, and I'd love to be able to tell you to index them but perhaps disable the caching).
I don't spend much online, especially when it comes to recurring fees. However, I use Pinboard enough that it's going to be hard for me to resist, for the second time, paying the $25 fee they charge for archiving bookmarked pages.
Yeah I could code/hack together something myself and have been thinking of doing it [for fun], but ya know :p
If I remember correctly, this was the result of some research done by a startup. Or was it Google? I couldn't tell you, because this service didn't exist back when I read that article.
I get that deployment is easier when it is vendor-hosted, but this really should be a local app using local storage, with maybe transient server-side storage for syncing between machines.
Information Re-Retrieval: Repeat Queries in Yahoo’s Logs
Abstract: "This paper explores repeat search behavior through the analysis of a one-year Web query log of 114 anonymous users and a separate controlled survey of an additional 119 volunteers. Our study demonstrates that as many as 40% of all queries are re-finding queries. Re-finding appears to be an important behavior for search engines to explicitly support, and we explore how this can be done."
I am absolutely making some assumptions here, but because 40% is a large effect, you don't need as many samples to be confident.
The other way of looking at it is that maybe it's actually 35% or 45% but either way, that's still interesting, even with a rougher approximation of the actual "answer". If, for some reason, you needed to know if it was 40% or 40.01% because that mattered to you then you would absolutely be annoyed at the small sample size.
If the finding were 2%, then we would care about an uncertainty of +/- 5%, since the finding would be dwarfed by the error rate. That's a smaller effect size, so you would need more samples to separate reality from the noise.
I am, by the way, pulling all of these numbers out of my ass. Your stats 101 class will teach you the formulas to calculate the actual error bars at work here, as well as the assumptions you need to make about the distribution of the data to use those formulas.
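If you want to sanity-check the intuition, the usual back-of-the-envelope is the normal-approximation margin of error, z * sqrt(p * (1 - p) / n). Treating the study's 114 logged users as the sample (which glosses over the fact that the real unit is queries):

    import math

    # 95% margin of error for an observed proportion (normal approximation).
    def margin_of_error(p, n, z=1.96):
        return z * math.sqrt(p * (1 - p) / n)

    print(margin_of_error(0.40, 114))  # ~0.09, i.e. roughly 40% +/- 9 points

So even with ~100 users, "about 40%" holds up; it's distinguishing 40% from 40.01% that would need a vastly larger sample.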
I think this is a great idea. I've been using Opera, which has a full-text search capability for history, but it's limited to the machine you're using it on.
I often find interesting articles on Hacker News while I'm at home that I want to find again when I'm at work. Being able to search by browser history across machines is fantastic for me.
I use a system adapted from http://www.gwern.net/Archiving%20URLs to archive every page I've bookmarked (using FF) in the previous month. Then I just query with local tools.
Not ideal, several flaws, but works well enough for me so far.
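The core of it is just a query against Firefox's places.sqlite plus wget, something like this (paths are hypothetical, the schema can vary across Firefox versions, and the database is locked while Firefox runs, so copy it first):

    import sqlite3
    import subprocess

    PLACES_DB = "/home/me/places-copy.sqlite"   # copy of the live database
    ARCHIVE_DIR = "/home/me/archive"

    conn = sqlite3.connect(PLACES_DB)
    # Bookmarks added in the last 30 days; dateAdded is in microseconds.
    rows = conn.execute(
        """SELECT p.url FROM moz_places p
           JOIN moz_bookmarks b ON b.fk = p.id
           WHERE b.dateAdded > (strftime('%s','now') - 30*86400) * 1000000""")

    for (url,) in rows:
        # -p: page requisites, -k: fix links, -E: add .html extensions
        subprocess.run(["wget", "-p", "-k", "-E", "-P", ARCHIVE_DIR, url])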
I'll take it for a spin... this is something I've wanted for a long time.
I was going to hack it by making Chrome bookmark every site I visit with a tag:history; then, when I wanted to search for a site I'd already visited, I was going to just search with that tag.
Similar to my project Peerbelt.com. A notable difference is that Peerbelt runs entirely on the client to avoid privacy concerns. Vinny, let's chat and see if we can collaborate. Cheers, -Krassimir, the Peerbelt founder
There are also weekly reports that tell you what sites you've been visiting the most, what time of day/what days you visit sites most, and how many pages SeenBefore added to your file.
I'm going to give it a spin and let you know what I think (it'll take a few weeks of usage), but I can tell you right now that it's definitely solving a real problem I have.
I am going to try this out, because re-finding things seems like what I spend a large portion of my time doing. The security and privacy implications of this scare me a lot, though.
But if a small piece of software was installed on your machine it wouldn't be "in the cloud". We know that makes everything better. Ok well not application performance ... or cost ... or usability ... but still, "the cloud".