I remember being asked this during my interview at Google. It was the first time I heard it and I gave an answer that iterated over the list twice. The interviewer said that it wasn't good enough and I am only allowed to iterate over it once. He didn't let me write my O(2n) solution down so he returned a strong no as feedback.
This type of interviewing style is bullshit. It means the interviewer knows a better solution that is "clever" and expects you to either have the same cleverness epiphany on the spot or to have studied this question. Neither is actually very useful as a hiring criterion.
Guy Steele gave an incredibly interesting guest talk[1] at Google about four different ways to solve this exact problem, and the fact that it's an interesting enough topic for an hour-long Google talk should probably be a clue that you shouldn't be expected to invent the best solution on the whiteboard in 40 minutes.
With a comment that it could "be fully parallelizable and run fast on the GPU, as it's based on a couple scan (generalized prefix sum) operations". Some explanation in my reply there.
I have interviewed hundreds of scientists-engineers-developers, and I never use esoteric quiz questions or logic riddles.
I want to screen people for the skills that they will really need in their daily lives, so that's where I draw my pool of questions from. For example, processing data in the most simple ways (sorting, searching, extracting) quickly reveals whether the candidates have actually _done_ what they claim they did. You'd be surprised how many Oxford PhDs struggle to write down a pipeline for extracting a simple word frequency list (I've sketched the kind of thing I mean below).
Now you might say "Maybe they're Windows people, or prefer Python," and my response is "I don't mind what tools you use - but you need to demonstrate that you can solve easy/common tasks on the spot, without wasting half a calendar day re-inventing the wheel."
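For concreteness, here is roughly the level I mean: a quick, untested Rust sketch of the word-frequency task (any language or a shell one-liner would do just as well):

```rust
use std::collections::HashMap;
use std::io::{self, Read};

fn main() -> io::Result<()> {
    // read all of stdin, count lowercased words, print them by frequency
    let mut text = String::new();
    io::stdin().read_to_string(&mut text)?;

    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in text.split(|c: char| !c.is_alphanumeric()) {
        if !word.is_empty() {
            *counts.entry(word.to_lowercase()).or_insert(0) += 1;
        }
    }

    let mut freq: Vec<(String, usize)> = counts.into_iter().collect();
    freq.sort_by(|a, b| b.1.cmp(&a.1)); // most frequent first

    for (word, n) in freq {
        println!("{:>8}  {}", n, word);
    }
    Ok(())
}
```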
Thanks for the link, that made for entertaining watching!
Curious that he gave that talk about parallelism at the end of 2015, and talked about how we'll engineer more systems to enable parallelism.
First question at the end was actually whether we can get this into existing languages because new languages are "notoriously hard to get accepted."
I'm currently learning Rust and now I'm wondering how the iterator map() and other "accumulation style" functions are implemented and whether there's a way to make these parallel, since the map() call treats things independently and a sum() could be done in the proposed tree style way.
Guess I have a piece of code to look up in the standard library :)
Not sure if there's anything in the standard library, but I recall this as the definitive library for data parallelism in Rust: https://github.com/rayon-rs/rayon
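As a rough, untested sketch of what using it looks like (assuming rayon is added as a dependency):

```rust
use rayon::prelude::*;

fn sum_of_squares(data: &[i64]) -> i64 {
    // par_iter() splits the slice across a work-stealing thread pool;
    // each map() runs independently, and sum() combines partial results.
    data.par_iter().map(|&x| x * x).sum()
}

fn main() {
    let data: Vec<i64> = (1..=1_000_000).collect();
    println!("{}", sum_of_squares(&data));
}
```

As far as I understand, the partial sums get combined in a divide-and-conquer fashion under the hood, which is essentially the tree-style reduction you're describing.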
This could be a good question. There's nothing wrong with a candidate giving a less efficient answer and then progressing to a better solution with some small hints or discussion.
If I'm giving this sort of question, I expect you to solve it in an obvious way, write the code for it, then talk about potential improvements. Writing it twice while having an epiphany between the first and second is simply not reasonable. And selecting for candidates who have studied these problems well enough to already know the optimal solutions is not going to get you good developers, it's going to get you leetcode champs. If you're building a competitive leetcode team, then great. If not, you're just focusing on the wrong hiring bar.
It happens… I’ve chastised an interviewer before. Their job was to identify:
1) Can the candidate do the job?
2) Will the candidate do the job?
3) Will the candidate work well with the team?
Their job was not to emulate FAANG, intimidate the candidate, or see if the candidate studied Leetcode.
Why this is interesting: a major defense against mass account takeovers (ATOs) at large scale companies has been fingerprinting browsers. You as a normal user see this most when you use something like reCaptcha, but it's actually happening on nearly every login flow for major websites. By blocking automation like evilginx, you stop a lot of phishing and credential stuffing attacks against your users.
Using VNC here is super clever. This means that the "automation" part of the phishing attack is actually a browser just like the user is using, so you can't fingerprint it. In fact, the victim is really typing in their password into a real Google login page, but the attacker is logging everything through VNC. It's going to be very hard for Google (or anyone else) to detect this.
The solution to this (like all phishing attacks) is still WebAuthn. However, many of us in security were hoping we could get by with bandaids like fingerprinting until WebAuthn was more widespread.
I really don't get the hype about WebAuthn. Its only real protection against phishing is that credentials are associated with a particular domain, which has been a feature of every password manager, including the OS/browser built-in ones, since forever. The thing requesting the password (i.e. the browser) is still ultimately the source of trust. The threat model these things protect against is so narrow, and now narrower since phones have built-in secure storage, that it can't be worth the effort compared to a marketing push for people to use Bitwarden, Lastpass, 1Password, KeypassX, browsers, or iCloud password saving. And if you really care about accidental logging of plaintext passwords, PAKE already has your back.
If we have the political capital to somehow get everyone on board with changing their flow, I really don't see why it should be WebAuthn. It's ultimately just a key stored somewhere controlled by the client presenting it, but with more red tape, pseudo-DRM, and e-waste.
^ If you're in a high-security setting then go for it, but for the masses nah.
You could be right, but at the same time, many people seem to expect this for free. Perhaps they underestimate the amount of work required to implement quality core functionality, or maybe they feel that all core software should be free and open source.
I recently lowered the price in hopes that more people would be willing to follow through on buying. There have been many submissions but no one seems willing to pay. I may increase the prices again later on, but it really depends on how things go.
Ultimately, I want this to be accessible to the average developer so more people have the opportunity to quickly build awesome things and bring their ideas to life. It may prove difficult to find the optimal price point.
There's still a lot of work to be done, especially on the marketing/convincing side of things. Maybe I'll look for funding, but I would prefer to bootstrap it. I'll probably need to find contract work soon if no buyers follow through, however. Ideally, clients would submit via assemble.molecule.dev and we'd move forward from there.
Second this. This product sounds amazing, but it's hard to believe without seeing it. What can you charge for that enterprises/funded entities would pay for, but side projects wouldn't need until they start making money? Like: Electron, Oracle/SQL Server support, Exchange integration, etc. Good luck! As someone who has been "working" on a SaaS project for a year, I would love to try this, but I can't shell out $400 out of pocket for something I can't touch.
A better business model might be to make some API or layer that you control in all this... basically it becomes like a headless CMS... could be as simple as centralized auth/data hosting, and then the other version for $400+ could be on-prem/self-hosted...
Then people might get a 1-month trial, and then pay $19/month/app.
This way you start building up a lot of MRR, and can maybe add price points for # of average monthly users or something like Auth0 has.
A suggestion: maybe you should do a document/video explaining what your app does for people like me, because while I was reading I got curious, but I don't really know what it does. After spending 18 months developing my SaaS science algo in Python (self-taught), I figured it shouldn't be too hard to make it available on a website. I only knew very basic CSS, JS (jQuery), and HTML, so I started to learn VueJS. It was so freaking hard that I realized I wouldn't be able to do it in the timeframe I had, so we decided to launch our product with an animated wireframe instead. At that point I also knew there would be challenges in two other areas where I wouldn't have the necessary time/expertise: cloud/server and security.
Oh man, I hate to hear that you are struggling to find traction since I think you're building the future here. I really hope you find a way to make this work so that you can be a trailblazer on this front. The sooner these types of things become popular, the sooner there will be competition and more contributions to this space, and it will become more accessible.
A while ago I wrote a Python library called LiveStats[1] that computed any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm I found in an old paper[2] called P^2, which uses a polynomial fit to find good approximations.
The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.
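In sketch form (untested Rust, and much cruder than P^2, which adapts how far it moves its markers), the candidate's idea was roughly:

```rust
// Nudge a single running estimate toward each new observation.
// With a symmetric step, it drifts toward the point where roughly half
// the data is above and half below, i.e. an approximate median, in O(1) memory.
fn approximate_median(samples: &[f64], step: f64) -> Option<f64> {
    let mut iter = samples.iter();
    let mut est = *iter.next()?; // seed with the first observation
    for &x in iter {
        if x > est {
            est += step;
        } else if x < est {
            est -= step;
        }
    }
    Some(est)
}

fn main() {
    let data: Vec<f64> = (0..10_000).map(|i| (i % 100) as f64).collect();
    // the estimate should land close to the true median (~49.5)
    println!("{:?}", approximate_median(&data, 0.1));
}
```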
There are some newer data structures that take this to the next level such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%)
Yeah, that was one of the reasons we chose it as one of the ones to implement; it seemed like a really interesting tradeoff. We also used uddsketch[1], which provides relative error guarantees, which is pretty nifty. We thought they provided different enough tradeoffs that we wanted to implement both.
Hi, an unrelated nitpick: the relative error should be calculated by dividing the error by the true value, not by its approximation. Still, a very nice writeup!
Not an expert on this topic, but I noticed that the KLL algorithm (published in 2016), which provides theoretically optimal performance for streaming quantiles with guaranteed worst-case bounds, was not mentioned in this thread: http://courses.csail.mit.edu/6.854/20/sample-projects/B/stre... (and it's pretty fast in practice).
The UDDSketch (default) implementation will allow rolling percentiles, though we still need a bit of work on our end to support it. There isn't a way to do this with TDigest however.
Sure there is. You simply maintain N phases of digests, and every T units of time you evict a phase and recompute the summary (because T-digests are easily merged).
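Roughly, as an untested sketch (MergeableDigest here is just a stand-in trait for whatever digest implementation you use, not a real library API; a real digest would also expose quantile queries):

```rust
use std::collections::VecDeque;

// Stand-in for a mergeable sketch such as a t-digest.
trait MergeableDigest {
    fn new() -> Self;
    fn add(&mut self, value: f64);
    fn merge(&mut self, other: &Self);
}

// Keep N per-phase digests; every T units of time, drop the oldest phase,
// start a fresh one, and rebuild the summary by merging what's left.
struct WindowedDigest<D: MergeableDigest> {
    phases: VecDeque<D>,
    max_phases: usize,
}

impl<D: MergeableDigest> WindowedDigest<D> {
    fn new(max_phases: usize) -> Self {
        let mut phases = VecDeque::new();
        phases.push_back(D::new());
        WindowedDigest { phases, max_phases }
    }

    fn add(&mut self, value: f64) {
        // new data always lands in the newest phase
        self.phases.back_mut().unwrap().add(value);
    }

    // call this every T units of time
    fn rotate(&mut self) {
        self.phases.push_back(D::new());
        if self.phases.len() > self.max_phases {
            self.phases.pop_front(); // evict the oldest phase
        }
    }

    // recompute the summary over the remaining phases
    fn summary(&self) -> D {
        let mut merged = D::new();
        for phase in &self.phases {
            merged.merge(phase);
        }
        merged
    }
}
```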
I think this would be a tumbling window rather than a true "rolling" tdigest. I suppose you could decrement the buckets, but it gets a little weird, as splits can't really be unsplit. The tumbling-window one would probably work, though Tdigest is a little weird on merge: it's not completely deterministic with respect to ordering and merging (Uddsketch is), so you'd likely get something that is more than good enough, but it wouldn't be the same as if you had calculated it directly, which gets a little confusing and difficult.
I think the new ones started with Greenwald-Khanna. But I definitely agree - P^2 can be a little silly and misleading; in particular, it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.
That sounds familiar; I remember reading about Greenwald-Khanna before I found T-Digest, after I ran into the "how to find a percentile of a massive data set" problem myself.
Yes. The goal is to find (or approximate) the median without storing all the elements. Instead, it approximates the median by finding the median of randomly selected samples from the elements.
Thanks for sharing! I hadn't heard of that algorithm. We've seen a number of other ones out there and chose a couple that we knew about or that were requested by users. (And we are open to more requests if folks want other ones! Head over to https://github.com/timescale/timescaledb-toolkit and open an issue!)
Did the candidate get an offer? Genuinely curious.
I had a basic screening call fail once because the expected answer was (from my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.
Whether they got it or not probably isn't useful information. Having a good/brilliant answer probably isn't the only point of the question, this probably wasn't the only question of the interview, and this probably wasn't the only interview.
You can actually construct the heap from unsorted data in O(n) time, so constructing the heap is definitely not sorting. However, yeah, to actually use the heap to find median in O(n) time, you need to do something similar to magic-five (median of medians) algorithm.
Is the minheap-maxheap approach faster than sorting the data? The obvious approach (process each element, one by one, into the appropriate heap, and rebalance the heaps so they are of equal size) takes n log n time and linear space. You can use the same resources to just produce a sorted copy of the input, which is a much better thing to have than two heaps that center on the median.
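To be concrete, the obvious approach I mean looks roughly like this (an untested Rust sketch; each insert is O(log n), and the median is read off the heap tops):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// lower: max-heap of the smaller half; upper: min-heap of the larger half
struct RunningMedian {
    lower: BinaryHeap<i64>,
    upper: BinaryHeap<Reverse<i64>>,
}

impl RunningMedian {
    fn new() -> Self {
        RunningMedian { lower: BinaryHeap::new(), upper: BinaryHeap::new() }
    }

    fn push(&mut self, x: i64) {
        // place the element on the side it belongs to...
        match self.lower.peek() {
            Some(&top) if x > top => self.upper.push(Reverse(x)),
            _ => self.lower.push(x),
        }
        // ...then rebalance so the halves differ by at most one element
        if self.lower.len() > self.upper.len() + 1 {
            let m = self.lower.pop().unwrap();
            self.upper.push(Reverse(m));
        } else if self.upper.len() > self.lower.len() + 1 {
            let Reverse(m) = self.upper.pop().unwrap();
            self.lower.push(m);
        }
    }

    fn median(&self) -> Option<f64> {
        match (self.lower.peek(), self.upper.peek()) {
            (Some(&a), Some(&Reverse(b))) if self.lower.len() == self.upper.len() => {
                Some((a + b) as f64 / 2.0)
            }
            _ if self.lower.len() > self.upper.len() => self.lower.peek().map(|&a| a as f64),
            _ => self.upper.peek().map(|&Reverse(b)| b as f64),
        }
    }
}

fn main() {
    let mut rm = RunningMedian::new();
    for x in [5, 1, 9, 2, 7] {
        rm.push(x);
    }
    println!("{:?}", rm.median()); // Some(5.0)
}
```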
> The minheap-maxheap approach is better for streaming data, to get the median as data comes in.
I see that it's better if you need to know "what is the median of the amount of the list that I've consumed so far?"
But if what you want is the median of the whole list, which might be in a random order, the medians of random prefixes of the list don't seem especially relevant. And if you do have an indefinite amount of data coming in, so that you need a "well, this is what we've seen so far" data point, the minheap-maxheap approach doesn't seem very well suited since it requires you to remember the entirety of the data stream so far.
My first instinct is to divide the possible data values into buckets, and just count the number of datapoints that fall into each bucket. This gives you a histogram with arbitrary resolution. You won't know the median value, but you will know which bucket contains the median value, and your storage requirements depend only on the number of buckets you want to use.
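Roughly, something like this (an untested Rust sketch; you have to pick the value range and bucket count up front, and you only learn which bucket holds the median, not the exact value):

```rust
struct Histogram {
    lo: f64,
    hi: f64,
    counts: Vec<u64>,
}

impl Histogram {
    fn new(lo: f64, hi: f64, buckets: usize) -> Self {
        Histogram { lo, hi, counts: vec![0; buckets] }
    }

    fn add(&mut self, x: f64) {
        // clamp out-of-range values into the edge buckets
        let n = self.counts.len();
        let frac = ((x - self.lo) / (self.hi - self.lo)).clamp(0.0, 1.0);
        let i = ((frac * n as f64) as usize).min(n - 1);
        self.counts[i] += 1;
    }

    // index of the bucket containing the (lower) median; None if no data yet
    fn median_bucket(&self) -> Option<usize> {
        let total: u64 = self.counts.iter().sum();
        if total == 0 {
            return None;
        }
        let mut seen = 0u64;
        for (i, &c) in self.counts.iter().enumerate() {
            seen += c;
            if 2 * seen >= total {
                return Some(i);
            }
        }
        None // unreachable once total > 0
    }
}

fn main() {
    let mut h = Histogram::new(0.0, 100.0, 50); // buckets 2 units wide
    for x in 0..1000 {
        h.add((x % 100) as f64);
    }
    // prints Some(24): the bucket covering [48, 50), which contains the lower median (49)
    println!("{:?}", h.median_bucket());
}
```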
I think the simplest solution to this would be to simply hide comment/retweet/like counts. It will be possible to sort of figure this out from the engagement, but it won't be easy to figure out if a tweet is popular or wildly popular.
Didn't know that was a thing. I'm guessing they just take whiskey, heat it up enough to boil off the alcohol but not enough to boil off the water, and then they're done?
I'd guess that they burn it off. It's fairly easy, though you can't get 100% of the alcohol out this way at home. You can make a vinegar out of liquor this way by burning the alcohol off of a handle of liquor, pouring a fifth back in, and adding a vinegar mother with some live vinegar to kick it off before waiting 4-6 weeks.
Mediabiasfactcheck.com claims that hereistheevidence.com is "low quality" because it links to "low quality" sources. But Mediabiasfactcheck has links to those same sources, so by its own argument Mediabiasfactcheck is "low quality".
Please don't take the above argument too seriously, my point is that so many of the fact check orgs are riddled with logical fallacies. In this case we have guilt by association, ad hominem, argument from authority, and appeal to motive.
I think a fact-check system that did not rely on logical fallacies would be quite useful; however, I have yet to find one.
> They pay for expensive ($$$$$$) cloud anti-bot/anti-scraping, captcha you after a few requests if you run adblock, they pay for extensive browser/device (attempting to reidentify a user across multiple devices/multiple browsers) fingerprinting services and Fastly.
This sounds like a great anti-account takeover program. I imagine with all of the various compliance programs they have to deal with and the legal risk, these are prudent measures.
The parent commenter was discussing revenue, not profit. If they were aggressively expanding, I would expect profits to remain small or negative, but I'd expect revenue to grow as a result.
Just to make it more clear what bqe is saying: they are selling the ~same amount of cars as 2 years ago. Both in terms of $ and in terms of #. Growth is completely flat.
Tesla production will be around 100k+ per quarter until they open a new factory (Berlin, July 2021) or expand current ones.
If they're still production-limited, the only growth in production numbers for the next 12 months will be in their China factory and maybe a bit in Fremont (p7 of the PDF).
Tesla announced they hope to be close to 500k produced vehicles in 2020 (p10 of the PDF), so that makes 157k/quarter for the next two quarters. I don't think they'll reach 500k in 2020.
But of course the thing you have to look at is results from other automakers (hint: ugly).
Referencing your other comment, this claim really starts to fall apart.
* You're cherry-picking 2018Q3. So it's not exactly 2 years, it's actually 1.75 years (2018Q3 - 2020Q2).
* But now you're saying pre-covid too, so now it's actually 1.25 years (2018Q3 - 2019Q4).
That period in question is the time after Tesla finished ramping Model 3 production (using a tent!) at Fremont in 2018Q3, and before they finished building the factory in Shanghai in 2020Q1.
So... doesn't it seem reasonable that production gains would be a bit "lumpy"? They go up every time a new factory is finished, and they stay flat until the next one.
Yea, if they are adding product lines without an increase in revenue that seems (possibly) concerning, though obviously the current economic situation makes it hard to know how to value year over year comparisons.
It's important to know the limits of your knowledge and reasoning capabilities and defer to relevant experts, which is known as epistemic learned helplessness[1]. This is also much faster, as a lot of bullshit sounds like truth if you're unfamiliar with the field.
Better to just account for expert opinions rather than defer to them; experts typically don't have a magical ability to see through uncertainty, and it is possible to find experts who support a very wide range of opinions. This can be compounded; e.g., government officials have a fair incentive not to report bad news unless they are completely certain that things have gone wrong. So it would be a mistake to do your recession planning while deferring to the experts at the US Fed.
Another topical example is the WHO in this COVID-19 crisis. The WHO was pretty consistently reporting on what had certainly gone wrong rather than what had likely gone wrong so it was preempted by a bunch of countries closing their borders against the WHO's advice.