I remember being asked this during my interview at Google. It was the first time I heard it and I gave an answer that iterated over the list twice. The interviewer said that it wasn't good enough and I am only allowed to iterate over it once. He didn't let me write my O(2n) solution down so he returned a strong no as feedback.
This type of interviewing style is bullshit. It means the interviewer knows a better solution that is "clever" and expects you to either have the same cleverness epiphany on the spot or to have studied this question. Neither is actually very useful as a hiring criterion.
Guy Steele gave an incredibly interesting guest talk[1] at Google about four different ways to solve this exact problem, and the fact that it's an interesting enough topic for an hour-long Google talk should probably be a clue that you shouldn't be expected to invent the best solution on the whiteboard in 40 minutes.
With a comment that it could "be fully parallelizable and run fast on the GPU, as it's based on a couple scan (generalized prefix sum) operations". Some explanation in my reply there.
I have interviewed hundreds of scientists-engineers-developers, and I never use esoteric quiz questions or logic riddles.
I want to screen people for the skills that they will really need in their daily lives, so that's where I draw my pool of questions from. For example, processing data in the most simple ways (sorting, searching, extracting) quickly reveals whether the candidates have actually _done_ what they claim they did. You'd be surprised how many Oxford PhDs struggle to write down a pipeline for extracting a simple word frequency list (I've sketched the kind of thing I mean below).
Now you might say "Maybe they're Windows people, or prefer Python," and my response is "I don't mind what tools you use - but you need to demonstrate that you can solve easy/common tasks on the spot, without wasting half a calendar day re-inventing the wheel."
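For concreteness, here is roughly the level I mean: a quick, untested Rust sketch of the word-frequency task (any language or a shell one-liner would do just as well):

```rust
use std::collections::HashMap;
use std::io::{self, Read};

fn main() -> io::Result<()> {
    // read all of stdin, count lowercased words, print them by frequency
    let mut text = String::new();
    io::stdin().read_to_string(&mut text)?;

    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in text.split(|c: char| !c.is_alphanumeric()) {
        if !word.is_empty() {
            *counts.entry(word.to_lowercase()).or_insert(0) += 1;
        }
    }

    let mut freq: Vec<(String, usize)> = counts.into_iter().collect();
    freq.sort_by(|a, b| b.1.cmp(&a.1)); // most frequent first

    for (word, n) in freq {
        println!("{:>8}  {}", n, word);
    }
    Ok(())
}
```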
Thanks for the link, that made for entertaining watching!
Curious that he gave that talk about parallelism at the end of 2015, and talked about how we'll engineer more systems to enable parallelism.
First question at the end was actually whether we can get this into existing languages because new languages are "notoriously hard to get accepted."
I'm currently learning Rust and now I'm wondering how the iterator map() and other "accumulation style" functions are implemented and whether there's a way to make these parallel, since the map() call treats things independently and a sum() could be done in the proposed tree style way.
Guess I have a piece of code to look up in the standard library :)
Not sure if there's anything in the standard library, but I recall this as the definitive library for data parallelism in Rust: https://github.com/rayon-rs/rayon
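As a rough, untested sketch of what using it looks like (assuming rayon is added as a dependency):

```rust
use rayon::prelude::*;

fn sum_of_squares(data: &[i64]) -> i64 {
    // par_iter() splits the slice across a work-stealing thread pool;
    // each map() runs independently, and sum() combines partial results.
    data.par_iter().map(|&x| x * x).sum()
}

fn main() {
    let data: Vec<i64> = (1..=1_000_000).collect();
    println!("{}", sum_of_squares(&data));
}
```

As far as I understand, the partial sums get combined in a divide-and-conquer fashion under the hood, which is essentially the tree-style reduction you're describing.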
This could be a good question. There's nothing wrong with a candidate giving a less efficient answer and then progressing to a better solution with some small hints or discussion.
If I'm giving this sort of question, I expect you to solve it in an obvious way, write the code for it, then talk about potential improvements. Writing it twice while having an epiphany between the first and second is simply not reasonable. And selecting for candidates who have studied these problems well enough to already know the optimal solutions is not going to get you good developers, it's going to get you leetcode champs. If you're building a competitive leetcode team, then great. If not, you're just focusing on the wrong hiring bar.
It happens… I’ve chastised an interviewer before. Their job was to identify:
1) Can the candidate do the job?
2) Will the candidate do the job?
3) Will the candidate work well with the team?
Their job was not to emulate FAANG, intimidate the candidate, or see if the candidate studied Leetcode.
Why this is interesting: a major defense against mass account takeovers (ATOs) at large scale companies has been fingerprinting browsers. You as a normal user see this most when you use something like reCaptcha, but it's actually happening on nearly every login flow for major websites. By blocking automation like evilginx, you stop a lot of phishing and credential stuffing attacks against your users.
Using VNC here is super clever. This means that the "automation" part of the phishing attack is actually a browser just like the user is using, so you can't fingerprint it. In fact, the victim is really typing in their password into a real Google login page, but the attacker is logging everything through VNC. It's going to be very hard for Google (or anyone else) to detect this.
The solution to this (like all phishing attacks) is still WebAuthn. However, many of us in security were hoping we could get by with bandaids like fingerprinting until WebAuthn was more widespread.
I really don't get the hype about WebAuthn. Its only real protection against phishing is that credentials are associated with a particular domain, which has been a feature of every password manager, including the OS/browser built-in ones, since forever. The thing requesting the password (i.e. the browser) is still ultimately the source of trust. The threat model these things protect against is so narrow, and now narrower since phones have built-in secure storage, that it can't be worth the effort compared to a marketing push for people to use Bitwarden, Lastpass, 1Password, KeypassX, browsers, or iCloud password saving. And if you really care about accidental logging of plaintext passwords, PAKE already has your back.
If we have the political capital to somehow get everyone on board with changing their flow, I really don't see why it should be WebAuthn. It's ultimately just a key stored somewhere controlled by the client presenting it, but with more red tape, pseudo-DRM, and e-waste.
^ If you're in a high-security setting then go for it, but for the masses nah.
You could be right, but at the same time, many people seem to expect this for free. Perhaps they underestimate the amount of work required to implement quality core functionality, or maybe they feel that all core software should be free and open source.
I recently lowered the price in hopes that more people would be willing to follow through on buying. There have been many submissions but no one seems willing to pay. I may increase the prices again later on, but it really depends on how things go.
Ultimately, I want this to be accessible to the average developer so more people have the opportunity to quickly build awesome things and bring their ideas to life. It may prove difficult to find the optimal price point.
There's still a lot of work to be done, especially on the marketing/convincing side of things. Maybe I'll look for funding, but I would prefer to bootstrap it. I'll probably need to find contract work soon if no buyers follow through, however. Ideally, clients would submit via assemble.molecule.dev and we'd move forward from there.
Second this. This product sounds amazing, but it's hard to believe without seeing it. What can you charge for that enterprises/funded entities would pay for, but side projects wouldn't need until they start making money? Like: Electron, Oracle/SQL Server support, Exchange integration, etc. Good luck! As someone who has been "working" on a SaaS project for a year, I would love to try this, but I can't shell out $400 out of pocket for something I can't touch.
A better business model might be to make some API or layer that you control in all this... basically it becomes like a headless CMS... could be as simple as centralized auth/data hosting, and then the other version for $400+ could be on-prem/self-hosted...
Then people might get a 1-month trial, and then pay $19/month/app.
This way you start building up a lot of MRR, and can maybe add price points for # of average monthly users or something like Auth0 has.
A suggestion: maybe you should do a document/video explaining what your app does for people like me, because while I was reading I got curious, but I don't really know what it does. After spending 18 months developing my SaaS science algo in Python (self-taught), I figured it shouldn't be too hard to make it available on a website. I only knew very basic CSS, JS (jQuery), and HTML, so I started to learn VueJS. It was so freaking hard that I realized I wouldn't be able to do it in the timeframe I had, so we decided to launch our product with an animated wireframe instead. At that point I also knew there would be challenges in two other areas where I wouldn't have the necessary time/expertise: cloud/server and security.
Oh man, I hate to hear that you are struggling to find traction since I think you're building the future here. I really hope you find a way to make this work so that you can be a trailblazer on this front. The sooner these types of things become popular, the sooner there will be competition and more contributions to this space, and it will become more accessible.
A while ago I wrote a Python library called LiveStats[1] that computed any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm I found in an old paper[2] called P^2, which uses a polynomial fit to find good approximations.
The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.
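In sketch form (untested Rust, and much cruder than P^2, which adapts how far it moves its markers), the candidate's idea was roughly:

```rust
// Nudge a single running estimate toward each new observation.
// With a symmetric step, it drifts toward the point where roughly half
// the data is above and half below, i.e. an approximate median, in O(1) memory.
fn approximate_median(samples: &[f64], step: f64) -> Option<f64> {
    let mut iter = samples.iter();
    let mut est = *iter.next()?; // seed with the first observation
    for &x in iter {
        if x > est {
            est += step;
        } else if x < est {
            est -= step;
        }
    }
    Some(est)
}

fn main() {
    let data: Vec<f64> = (0..10_000).map(|i| (i % 100) as f64).collect();
    // the estimate should land close to the true median (~49.5)
    println!("{:?}", approximate_median(&data, 0.1));
}
```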
There are some newer data structures that take this to the next level such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%)
Yeah, that was one of the reasons we chose it as one of the ones to implement; it seemed like a really interesting tradeoff. We also used uddsketch[1], which provides relative error guarantees, which is pretty nifty. We thought they provided different enough tradeoffs that we wanted to implement both.
Hi, an unrelated nitpick: the relative error should be calculated by dividing the error by the true value, not by its approximation. Still, a very nice writeup!
Not an expert on this topic, but I noticed that the KLL algorithm (published in 2016), which provides theoretically optimal performance for streaming quantiles with guaranteed worst-case bounds, was not mentioned in this thread: http://courses.csail.mit.edu/6.854/20/sample-projects/B/stre... (and it's pretty fast in practice).
The UDDSketch (default) implementation will allow rolling percentiles, though we still need a bit of work on our end to support it. There isn't a way to do this with TDigest however.
Sure there is. You simply maintain N phases of digests, and every T units of time you evict a phase and recompute the summary (because T-digests are easily merged).
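Roughly, as an untested sketch (MergeableDigest here is just a stand-in trait for whatever digest implementation you use, not a real library API; a real digest would also expose quantile queries):

```rust
use std::collections::VecDeque;

// Stand-in for a mergeable sketch such as a t-digest.
trait MergeableDigest {
    fn new() -> Self;
    fn add(&mut self, value: f64);
    fn merge(&mut self, other: &Self);
}

// Keep N per-phase digests; every T units of time, drop the oldest phase,
// start a fresh one, and rebuild the summary by merging what's left.
struct WindowedDigest<D: MergeableDigest> {
    phases: VecDeque<D>,
    max_phases: usize,
}

impl<D: MergeableDigest> WindowedDigest<D> {
    fn new(max_phases: usize) -> Self {
        let mut phases = VecDeque::new();
        phases.push_back(D::new());
        WindowedDigest { phases, max_phases }
    }

    fn add(&mut self, value: f64) {
        // new data always lands in the newest phase
        self.phases.back_mut().unwrap().add(value);
    }

    // call this every T units of time
    fn rotate(&mut self) {
        self.phases.push_back(D::new());
        if self.phases.len() > self.max_phases {
            self.phases.pop_front(); // evict the oldest phase
        }
    }

    // recompute the summary over the remaining phases
    fn summary(&self) -> D {
        let mut merged = D::new();
        for phase in &self.phases {
            merged.merge(phase);
        }
        merged
    }
}
```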
I think this would be a tumbling window rather than a true "rolling" tdigest. I suppose you could decrement the buckets, but it gets a little weird, as splits can't really be unsplit. The tumbling-window one would probably work, though Tdigest is a little weird on merge: it's not completely deterministic with respect to ordering and merging (Uddsketch is), so you'd likely get something that is more than good enough, but it wouldn't be the same as if you had calculated it directly, which gets a little confusing and difficult.
I think the new ones started with Greenwald-Khanna. But I definitely agree - P^2 can be a little silly and misleading; in particular, it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.
That sounds familiar; I remember reading about Greenwald-Khanna before I found T-Digest, after I ran into the "how to find a percentile of a massive data set" problem myself.
Yes. The goal is to find (or approximate) the median without storing all the elements. Instead, it approximates the median by finding the median of randomly selected samples from the elements.
Thanks for sharing! I hadn't heard of that algorithm. We've seen a number of other ones out there and chose a couple that we knew about or that were requested by users. (And we are open to more requests if folks want other ones! Head over to https://github.com/timescale/timescaledb-toolkit and open an issue!)
Did the candidate get an offer? Genuinely curious.
I had a basic screening call fail once because the expected answer was (from my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.
Whether they got it or not probably isn't useful information. Having a good/brilliant answer probably isn't the only point of the question, this probably wasn't the only question of the interview, and this probably wasn't the only interview.
You can actually construct the heap from unsorted data in O(n) time, so constructing the heap is definitely not sorting. However, yeah, to actually use the heap to find median in O(n) time, you need to do something similar to magic-five (median of medians) algorithm.
Is the minheap-maxheap approach faster than sorting the data? The obvious approach (process each element, one by one, into the appropriate heap, and rebalance the heaps so they are of equal size) takes n log n time and linear space. You can use the same resources to just produce a sorted copy of the input, which is a much better thing to have than two heaps that center on the median.
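To be concrete, the obvious approach I mean looks roughly like this (an untested Rust sketch; each insert is O(log n), and the median is read off the heap tops):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// lower: max-heap of the smaller half; upper: min-heap of the larger half
struct RunningMedian {
    lower: BinaryHeap<i64>,
    upper: BinaryHeap<Reverse<i64>>,
}

impl RunningMedian {
    fn new() -> Self {
        RunningMedian { lower: BinaryHeap::new(), upper: BinaryHeap::new() }
    }

    fn push(&mut self, x: i64) {
        // place the element on the side it belongs to...
        match self.lower.peek() {
            Some(&top) if x > top => self.upper.push(Reverse(x)),
            _ => self.lower.push(x),
        }
        // ...then rebalance so the halves differ by at most one element
        if self.lower.len() > self.upper.len() + 1 {
            let m = self.lower.pop().unwrap();
            self.upper.push(Reverse(m));
        } else if self.upper.len() > self.lower.len() + 1 {
            let Reverse(m) = self.upper.pop().unwrap();
            self.lower.push(m);
        }
    }

    fn median(&self) -> Option<f64> {
        match (self.lower.peek(), self.upper.peek()) {
            (Some(&a), Some(&Reverse(b))) if self.lower.len() == self.upper.len() => {
                Some((a + b) as f64 / 2.0)
            }
            _ if self.lower.len() > self.upper.len() => self.lower.peek().map(|&a| a as f64),
            _ => self.upper.peek().map(|&Reverse(b)| b as f64),
        }
    }
}

fn main() {
    let mut rm = RunningMedian::new();
    for x in [5, 1, 9, 2, 7] {
        rm.push(x);
    }
    println!("{:?}", rm.median()); // Some(5.0)
}
```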
> The minheap-maxheap approach is better for streaming data, to get the median as data comes in.
I see that it's better if you need to know "what is the median of the amount of the list that I've consumed so far?"
But if what you want is the median of the whole list, which might be in a random order, the medians of random prefixes of the list don't seem especially relevant. And if you do have an indefinite amount of data coming in, so that you need a "well, this is what we've seen so far" data point, the minheap-maxheap approach doesn't seem very well suited since it requires you to remember the entirety of the data stream so far.
My first instinct is to divide the possible data values into buckets, and just count the number of datapoints that fall into each bucket. This gives you a histogram with arbitrary resolution. You won't know the median value, but you will know which bucket contains the median value, and your storage requirements depend only on the number of buckets you want to use.
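Roughly, something like this (an untested Rust sketch; you have to pick the value range and bucket count up front, and you only learn which bucket holds the median, not the exact value):

```rust
struct Histogram {
    lo: f64,
    hi: f64,
    counts: Vec<u64>,
}

impl Histogram {
    fn new(lo: f64, hi: f64, buckets: usize) -> Self {
        Histogram { lo, hi, counts: vec![0; buckets] }
    }

    fn add(&mut self, x: f64) {
        // clamp out-of-range values into the edge buckets
        let n = self.counts.len();
        let frac = ((x - self.lo) / (self.hi - self.lo)).clamp(0.0, 1.0);
        let i = ((frac * n as f64) as usize).min(n - 1);
        self.counts[i] += 1;
    }

    // index of the bucket containing the (lower) median; None if no data yet
    fn median_bucket(&self) -> Option<usize> {
        let total: u64 = self.counts.iter().sum();
        if total == 0 {
            return None;
        }
        let mut seen = 0u64;
        for (i, &c) in self.counts.iter().enumerate() {
            seen += c;
            if 2 * seen >= total {
                return Some(i);
            }
        }
        None // unreachable once total > 0
    }
}

fn main() {
    let mut h = Histogram::new(0.0, 100.0, 50); // buckets 2 units wide
    for x in 0..1000 {
        h.add((x % 100) as f64);
    }
    // prints Some(24): the bucket covering [48, 50), which contains the lower median (49)
    println!("{:?}", h.median_bucket());
}
```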
I think the simplest solution to this would be to simply hide comment/retweet/like counts. It will be possible to sort of figure this out from the engagement, but it won't be easy to figure out if a tweet is popular or wildly popular.
Didn't know that was a thing. I'm guessing they just take whiskey, heat it up enough to boil off the alcohol but not enough to boil off the water, and then they're done?
I'd guess that they burn it off. It's fairly easy, though you can't get 100% of the alcohol out this way at home. You can make a vinegar out of liquor this way by burning the alcohol off of a handle of liquor, pouring a fifth back in, and adding a vinegar mother with some live vinegar to kick it off before waiting 4-6 weeks.
Mediabiasfactcheck.com claims that hereistheevidence.com is "low quality" because it links to "low quality" sources. But Mediabiasfactcheck has links to those same sources, so by its own argument Mediabiasfactcheck is "low quality".
Please don't take the above argument too seriously, my point is that so many of the fact check orgs are riddled with logical fallacies. In this case we have guilt by association, ad hominem, argument from authority, and appeal to motive.
I think a fact-check system that did not rely on logical fallacies would be quite useful; however, I have yet to find one.
> They pay for expensive ($$$$$$) cloud anti-bot/anti-scraping, captcha you after a few requests if you run adblock, they pay for extensive browser/device (attempting to reidentify a user across multiple devices/multiple browsers) fingerprinting services and Fastly.
This sounds like a great anti-account takeover program. I imagine with all of the various compliance programs they have to deal with and the legal risk, these are prudent measures.
The parent commenter was discussing revenue, not profit. If they were aggressively expanding, I would expect profits to remain small or negative, but I'd expect revenue to grow as a result.
Just to make it more clear what bqe is saying: they are selling the ~same amount of cars as 2 years ago. Both in terms of $ and in terms of #. Growth is completely flat.
Tesla production will be around 100k+ per quarter until they open a new factory (Berlin, July 2021) or expand current ones.
If they're still production-limited, the only growth in production numbers for the next 12 months will be in their China factory and maybe a bit in Fremont (p7 of the PDF).
Tesla announced they hope to be close to 500k produced vehicles in 2020 (p10 of the PDF), so that makes 157k/quarter for the next two quarters. I don't think they'll reach 500k in 2020.
But of course the thing you have to look at is results from other automakers (hint: ugly).
Referencing your other comment, this claim really starts to fall apart.
* You're cherry-picking 2018Q3. So it's not exactly 2 years, it's actually 1.75 years (2018Q3 - 2020Q2).
* But now you're saying pre-covid too, so now it's actually 1.25 years (2018Q3 - 2019Q4).
That period in question is the time after Tesla finished ramping Model 3 production (using a tent!) at Fremont in 2018Q3, and before they finished building the factory in Shanghai in 2020Q1.
So... doesn't it seem reasonable that production gains would be a bit "lumpy"? They go up every time a new factory is finished, and they stay flat until the next one.
Yea, if they are adding product lines without an increase in revenue that seems (possibly) concerning, though obviously the current economic situation makes it hard to know how to value year over year comparisons.
It's important to know the limits of your knowledge and reasoning capabilities and defer to relevant experts, which is known as epistemic learned helplessness[1]. This is also much faster, as a lot of bullshit sounds like truth if you're unfamiliar with the field.
Better to just account for expert opinions rather than defer to them; experts typically don't have a magical ability to see through uncertainty, and it is possible to find experts who support a very wide range of opinions. This can be compounded; e.g., government officials have a fair incentive not to report bad news unless they are completely certain that things have gone wrong. So it would be a mistake to do your recession planning while deferring to the experts at the US Fed.
Another topical example is the WHO in this COVID-19 crisis. The WHO was pretty consistently reporting on what had certainly gone wrong rather than what had likely gone wrong so it was preempted by a bunch of countries closing their borders against the WHO's advice.