
If you took into account user, location, etc. 15% seems too low. I almost never search for the exact same thing twice in the same location.

15% of the queries themselves are unique. https://blog.google/products/search/our-latest-quality-impro...

https://www.google.com/search/howsearchworks/responses/

I work for Google (and used to work on Search).



I'd be interested in seeing how polluted that 15% of new queries is with people blasting malformed URLs or FQDNs into the omnibox of Chrome.


What's so unbelievable about 15%? I personally think it is way lower than I expected. We're clearly not googling in the same way.


I agree with you. Also, in my experience, less tech-savvy people tend to overcomplicate their queries instead of just entering the relevant keywords, which I'm sure accounts for many unique queries.


The point is not that. It’s that when you search for “cute animals”, Google shouldn’t be storing that you searched for that, or even care. Your location is arguably relevant, but except when you are searching for directions it could be coarse enough to allow at least some caching.


Hey Igor! Hate to be a bore, but I wanted to provide feedback that your comment may unintentionally come across as aggressive. OP has pretty relevant work experience that I know I’d love to hear more about, but there’s not really any room for them to respond.

I know many folks IRL who work at big tech who have no interest in posting here because the community comes off as very unwelcoming. That’s a shame, because they have insight that would be great to hear. Regardless of anyone’s opinion of their employer.

Apologies in advance if your intent was purely about the topic. I just thought I read something in your tone that might hinder discourse rather than encourage it. I wanted to point it out, in case it was unintentional.


Agreed about the tone. The comment could have been less argumentative — instead of "that's not the point," they could have said "that's not the only reason."

On the other hand, if I'm not responding, it's not because I find HN too abrasive — it's because I am afraid of leaking non-public information. That's why whenever I talk about Google, I try to cite a Google blog post or other authoritative source, or talk about my own personal experience; hence, "I rarely search for the same query twice."


I’m gonna have to disagree with the negative comments above concerning Igor's tone. He made his point with clear, respectful language that I would be happy to entertain at work, at the bar, at worship or while on a (pre-COVID) group run or golf outing. So, to me, it looks like instead of an ‘agree to disagree’ while respecting each other, you disrespect Igor by dismissing his arguments due to his tone, which handily allows you to ignore his content, such as it is. Therefore, in my judgement, you guys are being unfair to Igor while also being disingenuous about your reason for policing his tone.


> disrespect Igor by dismissing his arguments due to his tone, which handily allows you to ignore his content, such as it is

I didn't dismiss his argument; I said that he was correct right after he posted: https://news.ycombinator.com/item?id=26073488

"That's not the point" can be interpreted as respectful, but it also can be interpreted as argumentative. I chose to assume good intentions, but I offered a different phrasing that would have a higher chance of not being misinterpreted: i.e. using "yes and" instead of "no but": https://www.theheretic.org/2017/yes-and-vs-no-but/


I apologize for the tone. The start of my comment was clumsily worded and it wasn’t my intention to have it come off as argumentative. As I read it, the GP comment to mine was talking about how Google’s tracking of its users’ telemetry was what was contributing to the uniqueness of requests. Your comment, to me, boiled down to the fact that of course most requests are unique because of tracking location data and the user account. There seemed to be a disconnect because your comment took for granted that user location and account were a part of the search query while the person you were replying to specifically challenged that notion (again in my reading of both). I tried to post a concise bridge between the two concepts, and of course we all see how well I did with that :)

Having said that, I do think this is clearly a sensitive issue, not a purely technical one. I can appreciate the nuance of working for Google and doing excellent work while seeing the company criticized left and right for its business model. I think given the community, while there is opposition to how Google may at certain points conduct itself as a corporation, there is no lack of respect for any individual working there. I certainly view my comment and the discussion of privacy as having 50% to do with Google’s strategy and 50% to do with the technical aspects of whether you can build a search engine that holds user privacy as a core priority, rather than trying to launch an ad hominem attack on you or anyone. And I saw your other comment that agreed with me and the GP comment, so I think, my first sentence aside, we are on the same page :)


Thanks for responding constructively, Igor.


To me, Igor's comment is also misplaced. He injects activism into a technical discussion (which sadly happens very often here on HN). We all know by now that the bigcorps are to a large degree based on data collection. We do not need to be reminded about it each and every day. We are adults; if we don't like it, we use alternatives.


Yeah, this is a fair point. My larger point was mostly that HN misses out on some valuable comments by insiders because those people are disincentivized by some of the rhetoric and tone when an article on big tech is popular. I didn’t think the comment I replied to was particularly aggressive - it was just something that came to mind when I read it. OP was actually very kind and constructive in their response - a good ending and constructive discussion for us all!


This is right on the money — getting search results for queries that are too personalized to e.g. location means that you can't cache those search results (or if you did cache them, their entries would be useless).


Right, you can cache that query. That doesn't mean that you can cache "two bunnies playing in the snow r/aww reddit".
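
To make the caching point in this subthread concrete, here is a minimal sketch, not a description of how Google does it. It assumes a cache keyed on a normalized query plus a location rounded to a coarse grid; `cache_key`, `hit_rate`, the `precision` parameter, and the sample coordinates are all hypothetical.

```python
def cache_key(query: str, lat: float, lon: float, precision: int = 0) -> tuple:
    """Build a cache key from a normalized query and a coarsened location.

    `precision` is the number of decimal places kept for lat/lon.
    0 is an arbitrary illustrative choice (cells about one degree wide).
    """
    normalized = " ".join(query.lower().split())
    return (normalized, round(lat, precision), round(lon, precision))


def hit_rate(requests, precision: int) -> float:
    """Fraction of requests whose key already appeared earlier in the stream."""
    seen = set()
    hits = 0
    for query, lat, lon in requests:
        key = cache_key(query, lat, lon, precision)
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / len(requests)


# Made-up request stream: three nearby users asking the same short query,
# and one long-tail query that nobody else has typed.
requests = [
    ("cute animals", 37.7749, -122.4194),
    ("Cute  Animals", 37.8044, -122.2712),
    ("cute animals", 37.6879, -122.4702),
    ("two bunnies playing in the snow r/aww reddit", 37.7749, -122.4194),
]

print(hit_rate(requests, precision=0))  # coarse location
print(hit_rate(requests, precision=4))  # near-exact location
```

With the coarse key, the repeated "cute animals" requests share one entry; with near-exact coordinates every request gets its own key and nothing is reused. The long-tail query misses either way, which is the distinction the comment above draws.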


I think you mean "15% seems too high". An easy way to think about this is the following: even if you search the entire internet, you will almost never see the same sentence twice, assuming it has a certain number of words. There is a combinatorial explosion in possible sentences to write. Search queries are essentially just sentences without stopwords.
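
A rough illustration of the combinatorial argument above. The vocabulary size and query lengths are hypothetical numbers picked for the example, and the count is only an upper bound, since most word sequences are not sensible queries.

```python
def possible_queries(vocabulary_size: int, length: int) -> int:
    """Count ordered word sequences of exactly `length` words."""
    return vocabulary_size ** length


# Hypothetical vocabulary size, chosen only for illustration.
VOCAB = 10_000

for n in range(1, 6):
    print(n, possible_queries(VOCAB, n))
```

Under these assumptions a five-word query is drawn from a space of 10^20 sequences, so even a tiny meaningful fraction of that space leaves plenty of room for queries nobody has typed before.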


Removing stop words is what old school users of IT systems do, because that's what we learned worked best at the time.

Internet users who came online later, from GenZ to many boomers, will often just write conversational sentences and questions.


I don't understand how you compute that estimate.

I doubt you store the history of all searches ever? People don't need a google account to query the engine, others disable history, etc.

Are you saying you still have every search ever made? Because you would need that to say a query hasn't been made before, wouldn't you?


Why would you not store every search ever? It's only a few petabytes, and you can find out all sorts of useful info from it.


I don't know how they did it, but I suspect that it wouldn't be very hard to model the distribution by sampling a few million queries and extrapolating from that.
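
One standard way to do that kind of extrapolation is the Good-Turing estimate of the "missing mass": the chance that the next query is one not present in the sample is approximately the share of the sample made up of queries seen exactly once. Nothing in this thread says Google computes its figure this way; the function name and the toy sample below are assumptions for illustration.

```python
from collections import Counter


def good_turing_unseen_mass(sample) -> float:
    """Estimate the probability that the next query is not in `sample`.

    Good-Turing estimate: N1 / N, where N1 is the number of distinct
    queries that occur exactly once and N is the sample size.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Toy sample: a few popular queries plus a long tail of one-offs.
sample = [
    "weather", "weather", "news", "cute animals", "news",
    "weather", "two bunnies playing in the snow",
    "python counter docs", "weather", "news",
]

print(good_turing_unseen_mass(sample))
```

Note that this estimates novelty relative to the sampled window, not relative to every query ever issued, so it only approximates the kind of figure being discussed.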


You'd only need to store the list of unique searches, but even if that's true and the 15% number is true, that must be a huge amount of data.



