Asking candidates to come up with this kind of solution in an interview setting where they are under all kinds of pressure is honestly dehumanizing.
There's a lot of good insight in the article about the correct way to approach the problem, but asking anyone to come up with it on the spot is unrealistic. You have the benefit of having seen the problem before with time on your side to reflect on it. They haven't.
When I do interviews like this, I prefer to talk them through the problem together, like we were actual teammates working on a problem together. That more closely relates to life on the job, which to me is the point of interviewing someone.
Many people (especially those from big tech backgrounds) treat interviews as "the time for the candidate to prove that they are good enough to work at my company". I, like you, prefer to use the time for collaborative problem solving to try to get as much signal as possible about whether it would be fruitful for us to work together, while also trying to figure out if we would want to work together.
The "is this person good enough for me" interview allows geniuses who are assholes through. I prefer to filter for good teammates.
If this is considered dehumanizing, I'm concerned for the future of our industry. Anyone with even a first undergraduate algorithms course should immediately spit out "sort by customer and page, then stream the output".
I wouldn't expect everyone to see the optimizations -- you can sort the days individually, you can drop all but a small number of pages per customer -- immediately, but I would be disappointed if someone couldn't be hinted there by the end of a 30 minute interview.
Failing to even get the O(n log n) solution tells me that someone should never have graduated.
> If this is considered dehumanizing, I'm concerned for the future of our industry. Anyone with even a first undergraduate algorithms course should immediately spit out "sort by customer and page, then stream the output".
Well yes, but that's not what this is about. It's about social pressure, not ability.
I could program this for you right now if I wanted to; most "optimisations" seemed the "obvious ones" to me, so I guess I would have passed. I could also program it in a hurry at ludicrous speed under pressure because all of production is down or whatever.
But ... I wouldn't be able to program this in an interview setting.
Last time I was asked a bunch of even easier things (in a timed interview, no less), where I literally have some exact code on my GitHub. I just couldn't finish it. Not because I can't "get" it, but because programming while two strangers are judging your every keystroke and commenting on what you're doing is just nerve-wracking.
> Failing to even get the O(n log n) solution tells me that someone should never have graduated.
This is something I wonder about in my comment sidethread - to me, the natural solution is the O(n) one in O(n) space.
I see that the O(n log n) solution requires O(1) space to run. But it requires O(n) space to return a result! Is that actually better than O(n) time and O(n) space?
With the sorting solution, you can stream the result directly to a file _without_ holding the entire file in memory. So it can be better if you are memory constrained (e.g. if the files are bigger than what you can hold in memory), or if you want to hold the working set inside the CPU L3 cache as 'cperciva is alluding to.
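For concreteness, here is a minimal Python sketch of the sort-then-stream idea. The line format ("customer_id,page_id"), one file per day, and the "both days, at least two unique pages" rule are assumptions, not from the thread; this in-memory version only shows the shape, and for files bigger than RAM the sort step would be handed to an external sort such as sort(1).

    # Sketch of "sort by customer, then stream the output".
    # Assumed input: one file per day, each line "customer_id,page_id".
    import csv
    from itertools import groupby
    from operator import itemgetter

    def loyal_customers(day1_path, day2_path):
        records = []                          # (customer_id, page_id, day_index)
        for day, path in enumerate((day1_path, day2_path)):
            with open(path, newline="") as f:
                for customer, page in csv.reader(f):
                    records.append((customer, page, day))

        records.sort(key=itemgetter(0))       # O(n log n); swap in an external sort for huge files

        for customer, group in groupby(records, key=itemgetter(0)):
            days, pages = set(), set()
            for _, page, day in group:
                days.add(day)
                pages.add(page)
            if len(days) == 2 and len(pages) >= 2:
                yield customer                # results stream out; nothing is accumulated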
Either of those solutions is ok from a junior developer.
From a senior developer, I would expect a discussion of in-core vs. out-of-core operations, and the fact that hash tables (even if they fit into RAM) aren't really O(1) time per operation since random access to large address spaces runs into e.g. TLB misses.
Given that the depth of the page table tree is fixed at 3/4/5 depending on the architecture and kernel, I think most people would be confused if asked whether the average complexity of a hash table lookup is always really O(1).
Even senior developers that understand the TLB would likely think of it as "O(1), but with a big constant factor".
On a side note, one thing that the article doesn't mention is the possibility of using a Bloom filter. It trades an extra scan of a file for smaller hash tables (or a smaller set to sort with an external sort).
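For illustration, a rough Python sketch of how that could look; the file names, the "customer_id,page_id" line format, and the filter parameters are all made up, and the filter here is only used to prune the day-2 working set.

    # Rough sketch of using a Bloom filter to shrink the day-2 working set.
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=4):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item: str):
            # Double hashing: derive k bit positions from one SHA-256 digest.
            digest = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big") | 1
            return [(h1 + i * h2) % self.size for i in range(self.k)]

        def add(self, item: str):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item: str) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # First pass: remember (approximately) which customers showed up on day 1.
    day1_customers = BloomFilter()
    with open("day1.log") as f:               # hypothetical file name
        for line in f:
            customer, _page = line.rstrip("\n").split(",")
            day1_customers.add(customer)

    # Second pass: only customers that are probably in day 1 need exact tracking.
    candidates = {}
    with open("day2.log") as f:
        for line in f:
            customer, page = line.rstrip("\n").split(",")
            if day1_customers.might_contain(customer):
                candidates.setdefault(customer, set()).add(page)
    # A final exact pass over day 1 for just these candidates settles the
    # false positives and the "at least two unique pages" check.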
Yup, I was surprised Bloom filters weren't even mentioned in passing. I suck, but still, it's crazy to see that some people would genuinely have a problem getting the O(n) solution when they've been in the industry for more than 5-10 years.
Whether you get a cache miss or a hit -- or whether those bytes are fetched over a WAN -- is part of 'c'. Big oh of one just means lookup time is a constant and is independent of data size. The big oh doesn't change if you move from an SSD to a tape drive.
Thus, those operations are really O(1).
Big oh is, roughly, a bounding function: f is big-oh of g if there is a constant C such that, for all sufficiently large N, C times the big-oh function is at least the runtime of the algorithm. (Paraphrasing there a bit, and on a phone, apologies for not pasting the exact definition.)
In a formal sense, anything you can compute on a computer is O(1) since there's only a finite number of states. But we tend to hand-wave that away.
Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
> In a formal sense, anything you can compute on a computer is O(1)
I don't think the definition of Big-Oh supports that conclusion.
"Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity." [1]
"For example, the time (or the number of steps) it takes to complete a problem of size n might be found to be T(n) = 4n^2 − 2n + 2.... As n grows large, the n^2 term will come to dominate,... say that the algorithm has order of n^2 time complexity" [1]
> Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
Indeed, but I can choose a value "C" that is large enough to negate any of the differences between cache miss or hit.
Let's say "N" is the number of lines in one of our text files. Once we have put that data into a hashtable, the claim is then that lookup time is O(1). You counter, and state, no, the lookup time is potentially 30ms, or even 100ms (making up the numbers). For my claim that the lookup time of O(1) to be true, I can pick any value of 'C'. So I choose 1 million. I'm basically saying, lookup times will always be under 1 hour. Given our precondition that we have built this hashtable, no matter how we get the data, yes - it will be under 1 hour. If we used carrier pigeon with USB stick, we could just choose a larger value for C and still be correct.
The idea is basically, if you have something like N^2 + N^1, eventually when N is large enough, n^2 will become so dominant that you could multiple by it by a fixed value to wash away all the other terms in the expression.
Thus, while the actual lookup time can and will vary, it is still bounded. Keep in mind as well there are different functions for best case and average case performance. These notations do take into account that the per-instance lookup time can vary.
> Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
Wait, don't the inner nodes of the page table contain the _physical_ addresses of their child nodes? Wouldn't that make it at most one TLB miss per hash table lookup (assuming key and value are adjacent in the hash table bucket)?
Bounded by a constant does not mean constant lookup time.
Let's say the runtime of some function is (x1n + x2n + x3n), where x1, x2, and x3 represent the L1, L2, and L3 cache lookup times.
A function F is called big-oh of a function G if there exists a value C such that C times G(N) is at least F(N) for all sufficiently large values of N.
For the above example, let's say I choose one billion for C. Is it true that one billion times n will be greater than the sum (x1+x2+x3)n for large n? The answer is yes; thus, (x1+x2+x3)n is big-oh of n. In other words, at a certain size of n, we can pick a constant multiple of the simpler function that dominates the other terms. This is in part why asymptotic complexity is independent of hardware. Even if L3 cache were actually virtual and fetched over a network, I could still just choose a larger constant. If you can't choose such a constant, then chances are you need a faster-growing function like n-squared.
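For reference, the textbook definition being paraphrased, with the example above worked through (this is the standard statement, not the commenter's exact wording):

    f(n) = O(g(n)) \iff \exists\, C > 0,\ N_0 \ \text{such that}\ f(n) \le C\, g(n) \ \text{for all}\ n \ge N_0.

    % Applied to the example: the per-level costs x_1, x_2, x_3 are constants, so
    f(n) = (x_1 + x_2 + x_3)\, n \le C\, n \ \text{with}\ C = x_1 + x_2 + x_3, \ \text{hence}\ f(n) = O(n).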
It depends how you define "TLB miss". Read it as "paging table reads" if that's clearer.
If you have a 64-entry TLB and 4kB pages, then you can do random memory accesses (aka hash table accesses) up to 256 kB without any paging table reads; up to ~16 MB with one paging table read per operation; and up to ~1GB with two paging table reads per operation. (This, of course, is why large databases use superpages when possible; but while that pushes the limits back it doesn't eliminate them.)
In virtual machines there's further overhead since you have "virtual physical" addresses...
I'd expect talking about TLB misses as a junior engineer thing. That's stuff you learn in school.
I'd expect senior engineers to talk about how they will monitor this system, and to bring up things like customers we know we don't trust.
A senior engineer might tell you that running into TLB issues suggests that your approach is wrong and it's about time to move the solution to your data warehouse.
I have a non-traditional background (not comp sci) and while I understand complexity I wouldn’t have immediately arrived at the solution like this.
I think that rather, I would have done well because I know after doing hundreds of these both as the interviewer and the interviewee that using some sort of Map along with some vague rumblings of “space/time complexity tradeoff” is very often the solution for these types of questions. So I would have immediately gone there first by instinct.
The question asked doesn’t involve any computer science algorithm knowledge at all; it's nowhere near leetcode complexity.
Only the basics that are close to everyday programming work: write a for-loop, know what a Map/dict is, and Google for “how to read a file”. If a candidate can’t do that, they can’t really program. ChatGPT can probably code this answer.
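For concreteness, roughly the kind of 20-line solution being described, as a hedged Python sketch (the "customer_id,page_id" line format, one file per day, and the "both days, at least two unique pages" rule are assumptions):

    # Minimal sketch of the straightforward hash-map solution.
    from collections import defaultdict

    def loyal_customers(day1_path, day2_path):
        day1_pages = defaultdict(set)             # customer_id -> pages seen on day 1
        with open(day1_path) as f:
            for line in f:
                customer, page = line.rstrip("\n").split(",")
                day1_pages[customer].add(page)

        loyal = set()
        with open(day2_path) as f:
            for line in f:
                customer, page = line.rstrip("\n").split(",")
                pages = day1_pages.get(customer)
                if pages is None:
                    continue                      # never seen on day 1, can't be loyal
                if len(pages) >= 2 or page not in pages:
                    loyal.add(customer)           # both days plus at least two unique pages
        return loyal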
Agree. Our interview technique is a trimmed-down but somewhat realistic coding exercise. We do it as a pair programming exercise and we tell them up-front that this should be an interactive and positive experience; we're not trying to trick them or put them under pressure. It'll test the candidate's ability to actually do the job they'll be doing if we hire them, which is to write code in an IDE. We do it with them, so they can ask questions about libraries they're not familiar with and so on. Obviously we're not going to solve the problems in the exercise for them, but if they do get stuck, we'll help move them forward.
The coding exercise involves writing tests, design of classes, working with floating point arithmetic, working across thread barriers, some fairly standard maths, and so on.
I actually can't say we've hired a bad candidate yet, and they all report that the interview is a positive experience.
In my fairly long career I've been through about 10-15 interview processes. A few of them were quite toxic and those were the "just cram leetcode for months" type. I never want to inflict that on someone else.
While I like your approach better than his by a considerable margin... I would say his approach has merits, since he's a principal engineer at Google (maybe hiring there) and was an engineer hiring at Amazon. The problems they face would definitely involve large volumes of data needing to be worked on efficiently.
I actually think the question from the blog is a good one and I've used similar at previous roles. It may make up one part of the interview process. I think in hindsight I was more responding to the overall ethos where some interviews are conducted in an unduly abrasive manner.
> I prefer to talk them through the problem together, like we were actual teammates working on a problem together.
If a colleague came to me to discuss a question this basic, I'd want them PIPed out. This is miles from the complexity of problems that require collaboration. It's a shell pipeline that I'd write with sort/uniq in less than five minutes. I just did it to make sure my estimate wasn't wrong.
Maintainable? No. Done iteratively in less than five minutes? Yes.
If your perspective is that it's unrealistic to ask anyone to come up with an answer to this question on the spot, you should take a long, hard look at yourself and the quality of your colleagues.
As mentioned over and over, this isn't about the ability to do it. This is about making it less of a hazing ritual that generally only serves to give the interviewer a sense of importance.
Sadly, I know plenty of seniors with 15 or so years of experience who would write the brute force version and it still wouldn't work right.
Senior in a perfect world, sure, but the reality is that for decades people have been desperate for talent, so you don't have to do much to get by.
I knew a popular principal engineer at a Fortune 50 company who was loved by execs. He would check in 5 or 10 copies of the same file. The opposite of the DRY principle.
I don’t know about that. I mean, if you’re going to be in a position where this level of craftsmanship is necessary, then why shouldn’t the interviewer require that you demonstrate your ability to do so? It’s not a trick question, it isn’t unrealistic or impractical, it just spotlights a particular programming focus: performance.
> After a candidate puts forth the O(n²), I smile politely and I wait. I am really hoping the next words that come out of their mouth are “…but the complexity of this is O(n²) so can I do better?”
> Occasionally, a candidate will think they’re done at this point. 90% of the times that a candidate was done without questioning the quadratic nature of that solution, the final outcome of the loop was No Hire. So that’s another signal for me.
I would have been one of those candidates. The author said they didn't like tricky questions and wanted to get a signal on how the candidate might approach real world problems. Well, this is indeed tricky -- unless you drop a bunch of constraints in the beginning, for a real world project I would just use all the resources I can access to finish it. I am not going to go the extra mile to optimize it in all possible ways. Premature optimization can be evil. I provided the solution, it works and meets all your requirements, so I am done.
Want me to make it fast/memory efficient? You have to say it. Forgot to mention it in the first iteration? No problem, cut me a ticket and I will see if I can sneak it into my next sprint.
Describing 'throwing a hashmap' at the problem as 'premature optimisation' is a bit reductionist.
It wouldn't even occur to me to go with the naive O(n^2) solution because it has such obvious drawbacks with large inputs.
And it's an interview question... yes, you're getting a No-Hire if you just leave it there. Although I personally would prompt you if you're not interviewing for a senior position.
Yep. I'm surprised anyone hears a problem like this without immediately thinking of maps and sets. I was expecting the naive/"poor candidate" solution to just be memory-inefficient uses of those data structures. The O(n^2) solution is something that you might do when you're just starting to learn programming, but anyone ready to apply for jobs should see that it's not only extremely inefficient, but IMO requires more cognitive effort than just throwing the data into maps and letting the standard library do the heavy lifting (even if that is still not optimal). This is basically an Algo 101 "what not to do" introductory example
> Yep. I'm surprised anyone hears a problem like this without immediately thinking of maps and sets.
If you want to import a day's worth of access logs into a hash map, you need to take into account not only the time but also the space complexity of this sort of solution.
Depending on how much traffic a service handles and also on the misguided arbitrary beliefs of a reviewer, importing a day's worth of accesses into a hash map can be a gross mistake and deemed a no-hire with prejudice.
This exercise is even dumber if we take into account that importing the access logs into something like sqlite, querying the data, and discarding the database might be just as efficient (if not more) and can take a handful of lines of code to implement.
Also, personally, I'd consider it a sign of incompetence to complain about using a proper database to ingest and query data efficiently while also believing that the time complexity of a task that's expected to run once per day is a hire/no-hire deal-maker. Computational complexity matters for code running in a hot path. Beyond that, either you're playing trivial pursuit or failing to understand the basics of what you're doing.
> If you want to import a day's worth of access logs into a hash map, you need to take into account not only the time but also the space complexity of this sort of solution.
Good thing space complexity is routinely discussed in these interviews.
And the idea that computation complexity only matters for 'the hot path' is way too dismissive.
For instance, a quadratic loop over a 10 million line log file is just a catastrophe (100 trillion iterations), even for something run once a day. Such a log file would happily fit into memory (it's mere gigabytes) and a hashmap would be perfectly adequate.
> Depending on how much traffic a service handles
Yes, it depends. The interviewee is expected to ask, or at least state what assumptions they're making.
I'm sorry, I thought of sorting as my first approach, and didn't think of maps at all. But what do I know? I only taught Algorithms and Data Structures (CS3) 3 times a year for 10 years.
Exactly. At least you have to show that you know what you're doing and it's deliberate. Depending on the seniority, I expect some kind of justification in the comments, like "it's O(n^2), but since the input will be very small, it's ok".
In real life people do a lot of O(n^2) code without realizing, and usually it's just some unnecessary loop inside another loop. I want to know that you care about some things.
I have grinded leetcode and worked at FAANG, so I understand the common signals interviewers look for. But that's only because I spent a lot of time on FAANG-style interview prep. I don't know why the interviewer had to give that awkward silence -- the candidate, who is likely under abnormal pressure, may start thinking about a bunch of other things: "is there a bug in my code? did I miss some edge case?" I am one of those people who become very nervous in a live coding interview and usually can't perform as well as I do in my daily work. There was one interview where I embarrassingly forgot some basic syntax of a language that I had used every day for years.
I don't understand why this has to be a red flag. What's wrong with just saying "it works, but can you make it faster"? As an interviewer, I say this a lot and I never judge candidates on this at all.
Not even large inputs. More like medium or even the larger end of small. A quadratic-time solution can often start to be noticeably slow with just a few thousand items.
A realistic dataset for the problem they described could easily be tens of millions of records even if you're not a Google-sized site.
With an attitude like that, I think the author would be absolutely justified in not wanting to hire or work with you. This is the sort of attitude you’d expect from a disaffected minimum wage teenager, not from anyone who took pride in their work or cared about the quality of what they delivered. There are certain things that, even if they aren’t explicitly stated as requirements from the outset, should be reasonably assumed. Avoiding quadratic algorithms is one of those. That’s not what “premature optimization” means; optimization is about tweaking something after the fact but choosing an efficient algorithm is a decision you make at the very beginning.
> Want me to make it fast/memory efficient? You have to say it. Forgot to mention it in the first iteration? No problem, cut me a ticket and I will see if I can sneak it into my next sprint.
It’s probably a lot easier to just not hire people who deliver poor quality work in the first place. Then you don’t have to worry about whether or not they can go back and fix it later.
Absolutely not; in the real world you work with constraints and you can make simplifying assumptions.
"How large can the file potentially be" would be my first question. Depending on the size I might even throw sqlite, an RDBMS, or even a full text search engine at the problem. Or I might not, it depends on the actual scenario.
But if your description is to keep it simple I'm going to do that in good faith.
I've heard O(n²) described as the most dangerous asymptotic complexity, because it seems fine for small testing inputs but falls down when you throw real-world -sized data at it.
I generally agree with this, as an interviewer. You need to be explicit about what you're screening for, unless you're trying to hire a mind reader.
E.g. "this is a one-off report" and "ok, how would you do it differently if we wanted to turn this into a feature at scale?"
It's great if an engineer gets there on their own, but there are so many tangents one could go on that I won't ding someone for picking the wrong one, or asking for the most relevant one.
I would expect seniors not to need that level of guidance on the job, with the caveat that expectations should be reasonably set in design docs, tickets, various bits of context, etc.
> No great engineer should ever settle for an O(n²) algorithm, unless bound by memory or some other unmovable constraint.
What if this is a one-off to produce a business report? Would it make sense to use programmer time to create an O(n) structure in memory, or just loop through the files line by line and let the CPU take a minute or five, or thirty? What is the programming language - something that has a library for this or something very low level where we’d read the file byte by byte?
If we’re dealing with the latter, a small amount of data, and a one off report, I don’t care at all in my work whether an engineer I’m managing somehow writes it in O(n^3).
It’s interesting how quick the author is to judge - ask this question for points, don’t even think about that one, don’t mention arrays because they’re fixed size (despite implementations of dynamically allocated arrays totally existing, and the candidate might be coming from one of those), and so on. Some humility would be nice.
Although I think what they wrote is very valuable, as this is how many interviews go. And I have to at least appreciate the author’s approach for trying to start a conversation, even if he still takes a rather reductive approach to evaluating candidates.
There's places where writing bad shitty code doesn't matter but frankly I'd rather be at places and rather have colleagues and an environment where we don't write bad shitty code.
The attempts to circumstantially excuse away shitty answers go against the desire to just not be shit.
The author here seemed able and willing to talk through constraints & issues. Their first paragraphs practically begged for it, as a sign of maturity. Rather than just excuse away shitty solutions, my hope is, even if you are not a super competent can-do coder, you at least can talk through and walk through problems with people that are trustable, to arrive at reasonably competent capable answers.
I would argue that in most cases where performance isn’t a constraint, the first algorithm that comes to mind is probably the optimal choice. He even says:
> About 80% of the candidates go for the naive solution first. It’s easiest and most natural.
The “naive solution” will be easier to understand and maintain. Why make it harder if it doesn’t add value?
The nearly optimal solution is effectively just as easy to understand and maintain and frankly should come even more naturally to an engineer than the O(n^2) version.
But even more importantly, with a slightly better solution I don’t get woken up in the night once a week because some buffoon left behind a hundred of these little performance landmines that worked great when the table had ten rows on their dev box but causes an outage in prod the second we hit some critical threshold of data.
This takes all sorts of forms from people using quadratic time or quadratic memory to my personal favorite: pulling entire database tables across the network to do basic analytics.
The authors always have the same excuse, that “it worked good enough for what was needed at the time”, ignoring the simple fact that the version which wouldn’t have caused an outage would have been just as easy from day one.
Of course, using the correct data structure is the least an engineer can do. But the point is that it doesn’t matter quite often. You’re describing a situation where it matters. Which it doesn’t always.
The problem is even if this is "known" upfront things always end up changing.
Unless the code is truly throw away code and not checked into source control.
I'm maybe guilty of being too harsh with this sometimes but the amount of times half-assed code has woken me up has trained me to be very critical of assumptions like this.
It doesn't always, but I'll be blunt: if you find yourself frequently writing O(n^2) or O(m * n) algorithms where n, m aren't fundamentally small values, you are doing a massive disservice to your coworkers. There is very rarely any meaningful cost to using something better, and that better thing should come more or less instinctively for the overwhelming majority of cases by the time you have even a little experience as a software engineer. These types of algorithms should immediately jump out at you as at least an orange flag if not a red one.
If you aren't doing better because you don't care, you're inflicting the costs of your apathy on everyone else around you. If you aren't doing better because nested for loops are the most ergonomic solution for virtually every problem in your language of choice, you need to reevaluate your choices.
Everything doesn't have to be overengineered to death, but that's a far cry from being okay with regularly and intentionally choosing the actual worst solution that is all but guaranteed to blow up under any kind of growth.
I said “regularly or intentionally”, because I wanted to differentiate that that is what I have a particular problem with.
And yes, a really one-off business report is fine. As long as it runs and produces an accurate answer, that is fine. Which is why I distinguished that if you find yourself regularly writing these kinds of algorithms and not having alarm bells going off in your head, that is almost certainly a problem.
I have never in 25 years encountered someone who spent noticeably too long optimizing running time for software that wasn’t performance critical. I have had to respond to at least two to four cases per year where somebody assumed a small case would remain small forever (or just didn’t think about it at all) and caused an outage, a severely bloated AWS bill (unbounded autoscaling + quadratic), or some other completely avoidable hell.
The thing is, compilers are pretty amazing beasts. CPUs are pretty amazing beasts. "10,000" is a very small value of "N", given those two factors.
I've worked with a lot of engineers that considered anything O(n^2) to be a red flag, and half the time the actual performance profiling favored the naive method simply because the compiler optimized it better.
That means that if you actually care about performance, you've got to spend 30 minutes profiling for most real world scenarios. Yeah, O(n^2) is obviously a crazy bad idea if you ever expect to scale to ten million records, but the vast majority of software being written is handling tiny 10K files, and a very large chunk of it doesn't care at all about performance because network latency eclipses any possible performance gain.
Unless you are in a very hot path and know with absolute certainty that n will remain very low, I'd say you are doing clear premature optimization by comparing and choosing the O(n^2).
I say very small because to me, n=10_000 sounds like a number that could easily and quickly grow higher, since you are past a basic enumeration of a few choices.
And this is how we end up with situations like GTA5 taking several minutes to parse a JSON in production, because nobody did actually test it with real-world values of n (or at least didn't expect it to increase over the product lifetime)
The right lesson to learn here is "manage and maintain your software as it's used in the real world, as that will tell you where you need to invest time" not "prematurely optimize every algorithm just in case it becomes a problem".
No, the right lesson here is that quadratic algorithms are a particularly nasty middle ground of seeming to work for the small n of development and initial rollout but falling over for the medium to large n of living in production.
Luckily they’re typically easy to notice, and better approaches are often as easy as sorting your input first or caching items in a hash table.
An engineer should therefore catch these things when writing them or in code review and make sure a slightly better option is used in any situation where the size of input has a reasonable chance of growing.
The cost is nearly nothing and the benefit is not shipping code that will virtually guarantee slow response times and bloated scaling costs at best and humans being woken up at night for outages at worst.
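A small illustration of the "as easy as caching items in a hash table" point, with hypothetical record shapes (customer, page) pairs: the hash-based version is barely more code than the quadratic one.

    # Quadratic: for every day-2 record, scan the whole day-1 list.
    def customers_on_both_days_quadratic(day1_records, day2_records):
        return {c for c, _page in day2_records
                if any(c == c1 for c1, _ in day1_records)}           # O(n * m) comparisons

    # Same result: build a set once, then do O(1) membership checks.
    def customers_on_both_days_with_set(day1_records, day2_records):
        day1_customers = {c for c, _page in day1_records}            # O(m)
        return {c for c, _page in day2_records if c in day1_customers}  # O(n)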
I used a brute force-y approach that meets the requirements, saves millions in operational costs (vs hiring engineers to build and maintain the complex non-brute-force solution).
Unfortunately people don’t think about actual engineering cost of “optimal” solutions. Engineering costs are part of operational costs and need to be juxtaposed against compute.
You can get a lot more mileage out of running the largest EC2 instance for a year vs hiring a junior engineer.
Why would only the non-bruteforce solution require hiring an engineer to maintain it? Does the brute force solution spontaneously manifest and maintain itself?
It's not just hiring one engineer, it's hiring a whole engineering department because that solution involves symbolic execution plus a few other things that I can not speak about without compromising my employment. The two solutions are in entirely different leagues in terms of engineering complexity.
The bruteforce solution is simple enough that it can be maintained by junior engineers.
Add a few mid and senior engineers, and it can become significantly more efficient, without requiring the resources of the optimal solution while still being classified "brute-force".
It is yet another way the bitter lesson manifests [1].
The highest core count machine you can get on EC2 is like 45k per annum, which is peanuts compared to the cost of the team required to build the perfect solution.
The issue here is not brute-force vs optimized. It is over engineering. Why spend 10 minutes optimizing your program when you can spend two weeks over engineering it.
Had I been in your shoes, I'd accept that there are things I am not privy to, and therefore I can't have the full picture and thus can't exercise judgement. Instead, I'd try to understand what the other person is telling me - especially on HN - instead of jumping to conclusions.
A complete and efficient solution would take 2+ years plus a very senior, very specialized engineering team chasing a moonshot.
A brute-force approach that can leverage compute is far more efficient in engineering time, operational costs (eng time vs compute), and opportunity cost.
This isn't something that can be optimized in 10 minutes, and insinuating that while digging further suggests that you haven't dealt with truly complex engineering problems.
Even if you are doing some one-off analysis, having to wait several minutes can be annoying. And the non-quadratic solutions are not harder than the quadratic one.
> In theory you could write a pretty simple SQL query, and sure, Big Tech companies have giant data warehouses where you can easily do this sort of thing. But for the scope of a coding interview, you wouldn’t want to. Since this is not a distributed systems problem and the data fits in memory, why introduce the additional complexity and dependencies of a database for something that you can solve with 20 lines of simple code?
My first thought on an actual implementation was, if this is a one-off request, to import it into sqlite. No need to set up a big system, and I think it would be easier/faster than writing those 20 lines of code. Also a hell of a lot easier to iterate on minor spec tweaks like the unique pages overall vs per day clarification. And probably less likely to have off-by-one type of bugs, since the simple logic is handled by the database itself. Bonus, it does handle the case where the dataset is larger than memory!
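For illustration, a sketch of that sqlite route in Python; the file names, the CSV layout, and the two-day loyalty rule are assumptions here.

    # Sketch of the "just load it into sqlite" approach.
    import csv
    import sqlite3

    conn = sqlite3.connect(":memory:")           # or a throwaway file on disk
    conn.execute("CREATE TABLE visits (day INTEGER, customer TEXT, page TEXT)")

    for day, path in enumerate(("day1.csv", "day2.csv"), start=1):   # hypothetical files
        with open(path, newline="") as f:
            conn.executemany(
                "INSERT INTO visits VALUES (?, ?, ?)",
                ((day, customer, page) for customer, page in csv.reader(f)),
            )

    loyal = conn.execute("""
        SELECT customer
        FROM visits
        GROUP BY customer
        HAVING COUNT(DISTINCT day) = 2
           AND COUNT(DISTINCT page) >= 2
    """).fetchall()

Tweaking the spec (per-day uniqueness, three days, a different threshold) then only means editing the HAVING clause.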
Yep. I estimate if the files are over 10 million lines, I'd rather use SQLite. They'd probably still fit into memory, but that's where I'd put up with the up-front hassle of SQLite so that I don't have to reparse CSV in Python / Lua / whatever I'm writing this script in
But a point lost because you may need to go out of your way to avoid hitting row limits. “I’m using a proprietary tool that has serious issues with more than 2^20 rows” is not so awesome.
> I’ve actually expressed the problem in an ambiguous way.
So when other people do it, hard and tricky questions are bad, but when you deliberately set your candidate up for failure by withholding concrete information, that's clever and insightful. Got it.
Or more productively put: The author obviously enjoys tearing down simple questions with complex implications (one often does in the sw field) and reflects over their candidates, but seemingly lacks the self-reflection to understand what makes questions hard or tricky and why interviewers like to pick them.
Viewpoint as someone whose primary job right now is recruiting: my observation on what differentiates Fine software engineers from Great ones is how they understand the problem, engage with it, and go get what they need to solve it. Fine software engineers need to have the problem pre-digested; Great ones can take ambiguous problems, figure out what's missing, and hunt it down.
I don't see the author's withholding of all problem parameters as tricky at all, but an attempt to accurately mimic the ambiguities of the real world to see what the candidate does with it.
I do interviews for my team, and I'm very aware that framing questions the wrong way quickly leads people into the trappings of their own obsessions. Be that a need for optimising the wrong things (e.g. optimising minute but easy to optimise logic next to large, network-heavy calls), or assuming the wrong perspective on a task (e.g. getting tasked with an accidentally math-heavy problem, and assuming it's about solving the math instead of their overall approach to new requirements) or simply lacking the confidence to question obvious problems in the task they were given. Interviewing is a stressful to downright dehumanising process that is hard to get right, even with years of experience.
I don't see it. Usually the bad tricks are CS puzzle gotchas, not a designed-in, well-spottable, management underspecification, which is actually testing for an everyday-used skill in the actual job.
Except it's not an "actual job" setting. It's an interview.
Especially for more junior people it's selecting on confidence, because especially in a hiring setting they don't want to "seem stupid" by asking questions. I guess this is also a problem for more senior people. It's just a different setting than "actual job".
If you want to ask if anything could be clarified then ask.
> So when other people do it, hard and tricky questions are bad, but when you deliberately set your candidate up for failure by withholding concrete information, that's clever and insightful. Got it.
To me the worst part is the nonsense rationale to argue away using a database to store and query this data. Taken from the article:
> In theory you could write a pretty simple SQL query, and sure, Big Tech companies have giant data warehouses where you can easily do this sort of thing. But for the scope of a coding interview, you wouldn’t want to. Since this is not a distributed systems problem and the data fits in memory, why introduce the additional complexity and dependencies of a database for something that you can solve with 20 lines of simple code?
The blogger's problem statement explicitly mentions the system needs to track business metrics. Business metrics include things like clickstream metrics and the task of analyzing business metrics is exploratory in nature and requires dashboarding. Only an utter moron would look at those requirements and say "hey, this is a task for logs dumped onto a text file". There is no way in hell that this task does not involve in the very least a database.
This is yet another example of a technical recruiter evaluating candidates, and rejecting them, because they did not present the wrong solution to the wrong problem that the recruiter somehow convinced himself was the right one for some reason.
"Recruiter" usually refers to a person whose role is sourcing talent. Their responsibilities are more like identifying candidates, salesmanship toward the candidate, and minor filtering that doesn't require too much subject matter expertise (resume review, canned questions with correct/incorrect answers)
"Interviewer", by contrast, is much more often an actual practitioner whose main role is something else (it rarely makes sense to keep a stable of people around who are solid enough engineers to give interviews and not have them do actual engineering).
Ok. If that is what you meant. I thought you meant a recruiter as in a tech recruiter who finds candidates from different sources and then sets up the actual interview loop after a brief initial call that might involve a simple technical question.
Here we see the most classic interviewing error: not understanding that there's a difference between a test and what's being tested.
Will the data fit in memory? Well, of course it will, it's an interview... oh you expected me to ask you anyway?
I should obviously load both files into the hashmap, that way it works for an arbitrary amount of files instead of just two... oh, you expected me to write a solution for literally the exact problem you stated without considering its practicality? Even though before you wanted the opposite, when you asked about the algorithmic complexity?
An interview could reasonably ask for an out-of-core solution.
But you are right that optimizing specifically for two files seems wrong, especially as the data contains a timestamp, so you could simplify the problem and ignore the number of files completely.
> Now, given two N log files we want to generate a list of ‘loyal customers’ that meet the criteria of: (a) they came on ALL days, and (b) they visited at least L unique pages.
while keeping linear time complexity and
O(num records in first file * L)
memory complexity. It is not too different from the solution given in the article (just use 3 maps instead of two).
That means that multiple files doesn't need out-of-core if the maps for one file at a time fit memory.
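One hedged sketch of that generalization in Python, using a candidate map plus a per-day "seen" set rather than literally three maps; the "customer_id,page_id" line format is an assumption. Only customers from the first file can ever qualify, so memory stays within (first-file customers × L).

    # A customer must appear in every file and accumulate at least L unique pages.
    def loyal_customers(paths, L):
        candidates = {}                       # customer -> set of pages, capped at L
        with open(paths[0]) as f:
            for line in f:
                customer, page = line.rstrip("\n").split(",")
                pages = candidates.setdefault(customer, set())
                if len(pages) < L:
                    pages.add(page)

        for path in paths[1:]:
            seen_today = set()
            with open(path) as f:
                for line in f:
                    customer, page = line.rstrip("\n").split(",")
                    pages = candidates.get(customer)
                    if pages is None:
                        continue              # missed an earlier day, can never qualify
                    seen_today.add(customer)
                    if len(pages) < L:
                        pages.add(page)
            # Anyone absent today is out; memory never exceeds first-day customers * L.
            candidates = {c: p for c, p in candidates.items() if c in seen_today}

        return [c for c, pages in candidates.items() if len(pages) >= L]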
The three solutions (n^2 time 1 space, n lg n time k space and n time n space) are basically the three strategies to perform a join in a relational database: a full table scan, a merge-join and a hash-join respectively.
"explain select" is a cool source of interview questions :)
You know, since this assignment is basically a boss asking once for a single report about "loyal customers", complexity doesn't matter. Let's say the worst case scenario is that it takes 30 minutes to run. Who cares about complexity? The business value is in the answers from the data.
If you're consistently getting reliable answers and finally decide to build a system for these types of reports, clearly this guy's real world experience at Amazon's Clickstream product is going to be far more valuable than whatever anyone who is brand new to the problem can come up with, even if they choose the "correct" algorithm from the start.
Because I bet you that for most real world products that create more than a single fixed-format report, you actually want your data set up in a completely different way than what you initially thought. You'll probably learn, for example, that you want to aggregate data per week instead of per day, or perhaps you want to link this data to an internal users database, or perhaps your boss wants a notification when new data is added. Or perhaps you'll learn that loading it into a single 1GB SQLite DB solves your problem without even needing to think about any algos.
My team builds out a lot of one-off projects for customers, and a lot of them don't need to be especially performant. That has always been the case for reporting/analytical projects.
You know what does make a difference though? Amount of development effort spent on the solution. The most performant solution can very easily cost the company much more in terms of development time spent. So in that sense, it's sub-optimal
And my favorite answer to questions like this is, "Can I just use grep (or shell commands in general)?"
Grep, uniq, wc, and a few others can be treated as pipeline data transformers to answer questions like this interview question. As long as you make some smart decisions about the order of operations, you can usually get performance on par with what you might write custom code for.
And one of the appropriate questions would be, “Is this a one-off or rare request, or does this need to be productionized? And how sensitive is the response time? And if it must be fast and frequent, then why are we not using some form of indexed db”
Don't get me wrong - it's a totally fair question, frankly one I would have been happy to receive when I was interviewing for those roles.
I'm also a fan of whiteboarding coding interviews in general as a way of evaluating talent so no objections there.
There was just something about this specific question that struck me as boring, soulless - like, who cares? I think my objection might be that it too closely resembles a menial task I might actually be given - something that I hope to God the upcoming LLM advances automate away.
Yeah, the unfortunate "hard studying student who eats algorithms for breakfast" versus "boring reality" situation. I'm sure there's a fancy data structure for this. In reality, I'd make three buckets (one for each day, plus an "isLoyal" byte buffer) and update them as I scan along. O(N) time, O(N) space.
"they don’t know the size of the data upfront", okay. I spend a scan finding the highest customer number and probably make some 10MB index-assossiated buffer. If I'm fancy I find the range and use offset indices to reduce the overall size. You already said it fits in memory and I'm not a distsrubuted programmer. Space is cheap in boring reality
I guess it's one of those cool brain teasers that gets you excited to use your skills from college. Not many get to in reality. Or they prefer other domain-specific skills.
OP's solution seems a little clunky to me. He does this thing where he loads the first day into a hashmap, and then he queries it as he loops over the second day. But why? That just overcomplicates your algorithm, and the memory used by O(2) days is roughly the same as O(1) days. It's also overly specialized to solving the very precise problem that's posed in the interview; what if tomorrow Big Boss wants you to aggregate over 3 days?
A cleaner solution would be to load both days 1 and 2 into the same hashmap. Then you can iterate the map and count whatever condition you want.
I agree. I hate how these interviews always focus so much on algorithmic time at the expense of flexibility of the code. I agree with the first part about avoiding the n^2 algorithm by using hash maps. However, making your algorithm use half the memory but hardcoding the requirement that a user visited on two days is bad design, especially when it's only halving the memory used. Also not only does it hardcode the requirements, it also makes it much more complex logic wise as you need that "if pages from day 1 >= 2 or first page from day 1 != page from day 2".
My design was to create two hash maps, one for customer to a list of days and one for customer to list of pages, though after reading the article I realized my lists should really be sets. Then you can easily account for any change to the definition of a loyal customer, as all you need to do is use two O(1) lookups and then check the size of the lists. Easy, flexible, and little room for error.
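A short Python sketch of that two-map design, with the same assumed "customer_id,page_id" line format; the thresholds are parameters so the loyalty definition can change without touching the ingestion loop.

    # Two maps keyed by customer: one to the set of days seen, one to the set of pages seen.
    from collections import defaultdict

    def loyal_customers(day_files, required_days=2, required_pages=2):
        days_seen = defaultdict(set)     # customer -> {day indices}
        pages_seen = defaultdict(set)    # customer -> {page ids}

        for day, path in enumerate(day_files):
            with open(path) as f:
                for line in f:
                    customer, page = line.rstrip("\n").split(",")
                    days_seen[customer].add(day)
                    pages_seen[customer].add(page)

        return [c for c in days_seen
                if len(days_seen[c]) >= required_days
                and len(pages_seen[c]) >= required_pages]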
I've become partial to interview questions where the interviewee just has to build something like a rock paper scissors game or command line to do list app. Simple prompt, easily extendable as well. It's jarring how many people with 5+ years of experience completely fail on this kind of interview.
Any experienced engineer should have no trouble with it. There's no hiding here: the candidate just needs to deliver something working, something relatively clean, and be reasonably pleasant to pair with. No leetcode grinding necessary, though I have found that those who did well on this problem also generally got high scores from my colleagues who do ask LC questions.
I somehow ended up responsible for the coding interview part at a small startup, and I'm pleased to say that it doesn't involve any niche algorithm memorization or specific language knowledge.
It's really just a scenario with some mockup third-party API docs, where the applicant needs to write some pseudocode that checks different conditions, arranges the data, and ties together the different calls.
It might not be testing every possible skill the applicant has, but at least it's in-line with one of the tasks we actually expect them to perform regularly.
Ah yes, where have I heard that before: "oh, my question is so non-tricky that you just have to think rationally, and I am only looking for your willingness and ability to converse, and you will do fine. But most people fail it because they can't have normal conversations".
"Let us hire this person because they dint solve the problem but were a great conversationalist problem solver" - said nobody at a standardized hiring committee!
I worked with someone who was hired because they were likeable, even though they didn’t meet the technical bar. It was a really good decision. They improved the overall atmosphere of the office and made it a nicer place to work.
I think likeability is way too penalized these days (bias). Likeability is actually a useful trait in a team for cohesion and empathy as long as you have processes and culture to identify/manage coasting, sociopathy and toxic behavior.
It wouldn't surprise me if the O(n log n) sorting solution is faster than the O(n) hashing solution, because of better memory locality.
The first answer that popped into my head was a shell pipeline, "cat file1 file2 | sort -k [pattern for customer ID] | awk -f ..." where the awk part just scans the sort output and checks for both dates and two pages within each customer-ID cluster. So maybe 10 lines of awk. It didn't occur to me to use hash tables. Overall it seems like a lame problem, given how big today's machines are: 10,000 page views per second for 2 days, crunching each record into 64 bits, means you can sort everything in memory. If 10 million views per second then maybe we can talk about a hadoop cluster. But 10k a second is an awfully busy site.
I actually had a real-life problem sort of like this a while back, with around 500 million records, and it took a few hours using the Unix command line sort utility on a single machine. That approach generally beat databases solidly.
It uses whatever amount of RAM you tell it to. I think the default is 1MB, which is way too small. It uses external sorting which means it uses O(1) RAM and O(N) temporary disk space. Oversimplified: it reads fixed sized chunks from the input, sorts each chunk in RAM and writes each sorted chunk to its own temp disk file, then merges the sorted disk files. If there are a huge number of temp files, it can merge them recursively, converting groups of shorter files into single longer ones, then merging the longer ones. I'd set the chunk size to a few GB depending on the amount of ram available.
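A toy Python version of that chunk-sort-merge procedure (it merges all runs in a single pass and assumes every input line ends with a newline; the real sort(1) also handles recursive merging, locale-aware comparison, temp-file cleanup, and so on):

    # External sort: sort fixed-size chunks in RAM, spill each to a temp file,
    # then stream a k-way merge of the sorted runs to the output.
    import heapq
    import itertools
    import tempfile

    def external_sort(input_path, output_path, chunk_lines=1_000_000):
        runs = []
        with open(input_path) as f:
            while True:
                chunk = list(itertools.islice(f, chunk_lines))
                if not chunk:
                    break
                chunk.sort()                              # in-RAM sort of one chunk
                run = tempfile.TemporaryFile(mode="w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))            # keeps only one pending line per run in memory
        for run in runs:
            run.close()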
That is basically how everything worked back when 1MB was a lot of memory. The temp files were even on magtape rather than disk. Old movie clips of computer rooms full of magtape drives jumping around, were probably running a sorting procedure of some type. E.g. if you had a telephone in the 1960s, they ran something like that once a month to generate your phone bill with itemized calls. A lot of Knuth volume 3 is still about how to do that.
These days you'd do very large sorting operations (say for a web search engine indexing 1000's of TB of data) with Hadoop or MapReduce or the like. Basically you split the data across 1000s of computers, let each computer do its own sorting operation so you can use all the CPU's and RAM at the same time, and then do the final merge stage between the computers over fast local networks.
I've used the Unix sort program on inputs as large as 500GB and it works fine with a few GB of memory. It does take a while, but so what.
For fun, I fed this interview question to GPT-4 with aider. See the chat transcript linked below.
The data structures look sensible and it did most of what the interviewer wanted on the first try.
It did make the wrong initial assumption that we wanted 2 unique pages per day. When prompted with a clarification, it made a sensible fix.
When asked to optimize, it went for big hammers like parallel processing and caching. As opposed to saving memory by only storing one file in the data structure as the author discussed.
What is very funny is that, in all the big and small companies I have worked in or with, no one ever used or discussed Big-O. It only shows up in interview technical tests.
Lots of good software engineers will have the instinctive knowledge of good or costly solutions without mapping it to BigO.
Also, what is funny is that, in my opinion, BigO is used and required by people that want to look smart but are not necessarily so.
Because what we do with Big-O is really limited. Almost nowhere will the discussion go further than O(1), O(log n), O(n), O(n^2), because beyond that the maths becomes hard to understand.
But in my opinion, algorithm complexity goes way beyond that when you use it in real life.
I kind of agree, the track record of people obsessed with BigO is not good in my book. The last person that valued it would still get the smallest element in a collection by... sorting it and then selecting the first element.
Sounds like you don't understand it and are doing the usual mental gymnastics to justify your ignorance. People always say this about math. If you understand it you'll use it all the time because you'll recognize situations where you can leverage it. If you don't understand it you'll blissfully make do, unaware that you could have done things a better way. And then you'll proudly proclaim "math is pointless, I've never needed to use math after I graduated!".
Fact is you did need it, you need it all the time. You just don't realize because you can't use it.
>Poor candidates load the contents of both files into memory.
I suppose this is the step where I become a "poor candidate". I think it's important to acknowledge changing client requirements at this point. Sure, loading both files into memory is less memory efficient, but it's much easier to tweak this algorithm later if you do it. You can change it to count over 3 different days, or 2 days in a 5-day time period, or any number of other things. You can save some memory if you don't, but you'll arrive at a solution that is much less flexible.
I mean, the real solution is to load all the data into a database of course, but even given the constraints of the problem I'd still argue for loading each entire file into memory as the more general and flexible solution, for when our pretend clients inevitably change their pretend minds.
>you don’t need to actually keep every single page from Day 1 in the Map, just two, since the problem is “at least two pages” so a Set of size 2 or even an array of size 2 will use less memory than an unbounded Set.
And with this I think we've crossed over from the practical to leetcode. At this point you're asking the candidate to add a bunch of new code paths (each one should be tested) and make their solution a lot less general. Going from a pretty general algorithm that can be tweaked pretty easily to something super specific with a bunch of new sources of bugs.
>Or, if you’ve already determined that a customer is loyal, you don’t need to waste CPU cycles going thru the logic again next time you encounter that customer in Day 2.
No, please load it all into your data structures properly, even if you "waste" a bit of time. All these weird little conditionals sprinkled throughout your code when you're ingesting the data are going to be sources of problems later. They might save a bit of memory and a few cycles, but they significantly increase the complexity of the code, and make refactors or tweaks much, much harder.
If this developer started doing stuff like that in an interview with me, well it would raise some red flags.
>If you want to make the problem even more interesting, you can add a third file. I will leave that as an exercise for the reader as well!
See, our imaginary customers imaginary minds did end up changing. Bet you wish you had loaded both files into memory now.
It's a good question, and the author explains it and the logic really well.
As someone going through this style of interview at the moment (but not having interviewed at Google, Microsoft or Amazon), two things jump out at me:
- If you're going to ask this question and get it done in 1 hour, does the code really matter? I'd argue that if you can get to a good or optimal solution, 99 times out of 100 you can write the code. If I got this question and didn't know better, I'd be stressing about writing the code within an hour. Knowing that we wanted to spend most of the time discussing the algos and data structures would be really useful to me. Maybe Google/Amazon/Microsoft interviews really stress this in their preamble, I don't know.
- The big "issue" I see with this question is that it relies on the interviewer knowing exactly how to steer the conversation. I think I could get to this solution with the hints, and the author seems to imply that it's ok to need a few hints. But an interviewer that doesn't know the right hints to give (or phrases them poorly) is going to turn this question into a train-wreck. This isn't an issue for the author, they clearly know this questions backwards and forwards. But giving this question as a 'standard' question that others will deliver? I think it could easily end up being too conservative and cutting out a lot of otherwise smart developers.
In general, that's my criticism of this style of question: they all claim that they're about 'seeing how you think'. But I think expecting interviewers to be able to elicit a conversation that really shows 'how a candidate thinks' is much more on the interviewer rather than the interviewee. You're expecting people whose primary job is writing software to be really good at delivering interviews.
Instead, you're going to have candidates who most of the time will do well if they can pattern-match against problems they've seen in the past, and poorly otherwise. I can see how questions like this seem good on paper, and I'm glad this question works for the author. But it's the combination of interviewer and question that makes it effective, not just the question alone. A better title for this post might be 'My favourite way of interviewing candidates', because this post is mostly to do with the author's mental model of how to run an interview with this question.
Well, you're holding it in a log file. Either the data is important enough we'd like to keep it, so we should put it in a database or elasticsearch, or the data isn't important enough, and I'd like to get further clarity on what you're trying to achieve.
Yeah, I'm disagreeing with some approaches. For example, in real life the engineer usually knows the constraints of the problem and can add them himself. Moreover, asking clarifying questions about memory versus speed will quite often draw blanks from a less technical consumer. Some aspects of the problem seem artificial: do we need exactly two days of visits to be loyal? Exactly two unique pages? What is one going to do tomorrow, when these requirements change? And that would affect the design chosen.
I feel that the author has filtered out lots of great candidates over this problem, which might be something to pause over. On the other hand, interviewing to get a good signal is indeed a tricky business, so I can sympathize.
You can do a little better than that. Each item in your map has exactly 3 states:
- We’ve seen this customer visit one unique page with (xx) url on the first day
- We’ve seen this customer visit two unique pages - but only on the first day.
- We’ve seen the customer visit one unique page (xx) and they’ve visited on both days.
In the second state you don’t actually care what the URLs are. And I think the logic for the 3rd state is identical to the logic for the 1st state - since you only add them as a “loyal customer” by visiting them again on the second day. So I think you can get away with using an Option<Url> to store the state instead of a Set or a list. (Though I’d probably use a custom parametric enum for clarity).
It’s a great problem to model using a state machine.
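For concreteness, a minimal sketch of that state machine in Python (my own illustration, not the poster's code; it uses a sentinel instead of a parametric enum and assumes each parsed log row is a (customer_id, page_id) pair):

    MULTIPLE = object()  # sentinel: this customer saw >= 2 unique pages on day 1

    def loyal_customers(day1_rows, day2_rows):
        # Pass 1: collapse day-1 visits into either a single page id or MULTIPLE.
        day1_state = {}
        for customer, page in day1_rows:
            seen = day1_state.get(customer)
            if seen is None:
                day1_state[customer] = page          # first page for this customer
            elif seen is not MULTIPLE and seen != page:
                day1_state[customer] = MULTIPLE      # second unique page observed
        # Pass 2: a day-2 visit makes them loyal if day 1 already had >= 2 unique
        # pages, or if the day-2 page differs from the single day-1 page.
        loyal = set()
        for customer, page in day2_rows:
            seen = day1_state.get(customer)
            if seen is not None and (seen is MULTIPLE or seen != page):
                loyal.add(customer)
        return loyal

The None / single-page / MULTIPLE values correspond to the three states described above.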
IMO, this interview question is going to get you amazing developers who fail to build anything of value.
I don't like the trick of failing candidates if they don't ask a question. 90% of this style of interview want candidates to rifle through solutions. If you want to talk about requirements, be explicit about it.
I'm really amazed that this "best interview" question really just boils down to leetcode for a _Senior Staff_ level interview. I don't know about y'all, but the _Senior Staff_ and _Principal_ developers I've worked with aren't wasting their time on shit like this. They're ironing out requirements. They're working with stakeholders. They're architecting systems. They're figuring out how to deliver the value the customer wants - and they're ensuring that it's actually what the customer wants.
-----
There's a place for performance, but the fast running turd is still a turd.
> I'm really amazed that this "best interview" question really just boils down to leetcode for a _Senior Staff_ level interview.
A good interview should involve more than just a coding problem. But it should absolutely require at least one coding problem. It’s mind boggling the number of “senior” people with good resumes I’ve screened out in interviews over the years because, simple as a problem like this is, they really had no idea how to even start solving it.
I don’t know about the poster, but when I’ve done interviews - especially for senior people - there are a lot of different types of assessment I’d want to do before hiring them. I’d also want to assess their social skills somehow (eg get them to present to the team about something interesting). And ask some high level systems architecture questions, talk about their background, and more.
My username at a large org is my first name. In any situation where someone wants to link to me or mention me, typing my username brings up a list of every person at the org with my name, alphabetically. My last name inevitably sorts me down toward the bottom.
You would think that an exact literal username match would have priority, but no. Typing any prefix of my name similarly sorts everyone else before me too.
Amazon’s VP of Search used to be Udi Manber, who literally wrote the book on Search. The crappiness of their search is deliberate, it is there to serve Amazon’s business objectives, not your needs.
> Same with Microsoft's search in the start menu. Doesn't find Excel if I type "exc".
What do you get? I tried just now, and `E` gives me Excel. In fact, I typed "Esc" first on accident, and I got something different for "Es" but "Esc" gave me Excel too.
Ah, maybe the difference is that I disabled web search completely so it's limited to locally-installed stuff. If it's suggesting web results, I wonder if that's a deliberately-bad result for advertising reasons
What if a candidate notes that web hits can come from all parts of the planet and every timezone and therefore "same day" for a particular end user does NOT overlap with the "same day" of the log files and thus immediately throws into question the meaning of '(a) they came on both days'. Many users could visit the site multiple times in one of their days, but recorded as two separate days in two separate log files. This certainly adds some complexity to the question and might not even be something the interviewer considered in the first place.
Exactly. It's a trick interview question since the minute you deliver this report on 'Loyal Customers' someone will want to increase the complexity of the 'unique' page visitation constraint to include information that will never be available in the log file.
How isn't remotely relevant to being able to accomplish the task.
I only know how certain SQL engines implement queries because I was at one point tasked with increasing the speed of the queries, and went deep into debugging-level detail to understand the exact cost of each action.
But I haven't done that in 7 years, and couldn't tell you much more than the tools used to figure it out.
I cannot remotely understand the requirements people make up for our jobs that have nothing to do with doing our jobs.
I agree with most points in the article except for the database part.
> Since this is not a distributed systems problem and the data fits in memory, why introduce the additional complexity and dependencies of a database for something that you can solve with 20 lines of simple code?
Because this is a question about getting the right data, and SQL databases are extremely good at filtering, sorting, and grouping data. Besides, every page visit from every client is a unique observation, and the principle of tidy data suggests that every observation gets its own database row.
Why solve this with 20 lines of code, when you can solve it in a 4 line SQL query?
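For what it's worth, a sketch of that using Python's built-in sqlite3 (the table and column names are just illustrative assumptions, not from the article):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (day INTEGER, customer_id TEXT, page_id TEXT)")
    # ... bulk-load both log files into `visits` here, tagging each row with its day ...

    loyal = conn.execute("""
        SELECT customer_id
        FROM visits
        GROUP BY customer_id
        HAVING COUNT(DISTINCT day) = 2
           AND COUNT(DISTINCT page_id) >= 2
    """).fetchall()

The query itself is the four lines of GROUP BY / HAVING; everything else is plumbing.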
The venn diagram of people who blog about their coding interview questions and the people I want to interview with does not overlap. Nothing personal, but they always come across as "here's how I make monkeys dance for my amusement" or "here's a clever question and the more people ask me about it in the interview (despite it being perfectly clear to begin with) the more it strokes my ego and the more I am likely to interview them"
This is a great way to hire people who all meet a homogenous standard of behavior. It’s absolutely efficient from an engineering standpoint to eliminate weak links who can’t solve basic engineering problems.
It creates a team of people who are good at tests, and good in testing environments.
However, there are so many things that make an engineer great that have nothing to do with how they solve problems but who they are, how dedicated to improvement they are, etc.
But they may not be someone who:
- thinks quickly on their feet
- finds this type of situation tolerable
And they may be someone who:
- has disabilities one can’t see that would make this kind of interview difficult for them
- has personality challenges or anxiety in social situations that make interviews like this impossibly difficult
The list of reasons that “whiteboard testing interviews” don’t work well is long.
I don’t think there’s anything wrong with this approach if that’s the kind of organization you want to build. But it does tend to create homogeneity and act as a gatekeeper for “those who do not fit in.”
Some of the very best engineers I have ever hired would never make it through this interview. But they were amazing engineers who did world class work.
This article mentions "good" and "great" candidates many times.
How is the author determining which candidates are great? Do "great" candidates answer the questions the best, or is the interviewer following up 1-2 years after hire and examining the impact of the person?
Great candidates aren't those who can answer DS & algorithms questions the best, but it seems that the author thinks that way.
> For example, you don’t need to actually keep every single page from Day 1 in the Map, just two, since the problem is “at least two pages” so a Set of size 2 or even an array of size 2 will use less memory than an unbounded Set.
That seems overcomplicated. For each customer on day 1, you either have multiple pages or you have a single page. If you see them on day 2 and they had multiple pages on day 1, then they are loyal. Or if they had a different page on day 1 than day 2, they’re loyal. (Or two different pages on day 2, but this comes along for free.)
So the data structure can be:
map<customerid, (MultiplePages | pageid)>
Where MultiplePages is a choice in a sum type that doesn’t store any associated data. Or you can do:
map<customerid, optional<pageid>>
Where the none state of the optional means there are multiple pages, but this is a bit of an odd use of an optional.
You could get to something working and relatively bug free in 15 minutes with a shell script consisting of little more than cut, head, sort -u, and grep.
Sorting, merging, then just a clever unique was basically the only thing I'd consider.
If I used this question I'd add to it.
"Pretend you have to do the work on an Arduino uno, which has very little resources. Your uno can request input and produce output from the disk where these are stored at whatever offset you wish. The log files are 100GB each and sit on a desktop computer with a modern Linux on it. Each log line is 512B. You can create files if you need to through some unspecified protocol with the desktop computer. But the desktop computer must be dumb. It will only write to and read from disk. You can send it any disk system call you wish. Step 2; Now do it without sorting or something absurdly slow"
The point is to ask for actual creative solutions instead of the pattern recognition problems that most of these problem formats are.
You want something that a weekend of drilling won't change the result of
1. Find names appearing in both log files. From this create a Set of customers that appear in both. In Python it's just creating a set from the first file, creating another set from the second file, and intersecting them. O(N)
2. Concatenate both files and treat them as one. Create a Map with key: Customer and value: Set[Page]. This is basically iterating through the log: when you see a customer, add the customer_id to the map if it isn't already there, and add the page to that customer's set. O(N)
3. Filter the map for all customers with length(Set[Page]) > 1. To get the Set of all customers that visited more than one page. O(N)
4. Intersect the set of customers who visited multiple pages with the set of customers that appeared in both log files to get the final result. O(N). You can do this with the intersection operator in Python.
The irony is I'm more able to do this in Python than I am in SQL. For SQL stuff at work I just look it up; I never remember the exact syntax.
Total runtime: O(N)
Total memory: O(N)
This is just your basic hashmap stuff for fast lookup and aggregation (a rough sketch below). Usually FAANG questions are harder than this.
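For concreteness, here is that four-step recipe sketched in Python (my own illustration; it assumes each parsed log row is a (customer_id, page_id) tuple):

    def loyal_customers(day1_rows, day2_rows):
        # Step 1: customers appearing in both files (set intersection).
        both_days = {c for c, _ in day1_rows} & {c for c, _ in day2_rows}
        # Step 2: treat both files as one stream and map customer -> set of pages.
        pages_by_customer = {}
        for customer, page in list(day1_rows) + list(day2_rows):
            pages_by_customer.setdefault(customer, set()).add(page)
        # Step 3: customers who visited more than one unique page.
        multi_page = {c for c, pages in pages_by_customer.items() if len(pages) > 1}
        # Step 4: loyal customers satisfy both criteria.
        return both_days & multi_page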
If the job is log file processing, it's an entirely reasonable question.
But most jobs are not log file processing.
It's ridiculous to generalise this sort of thing:
"We are cooking soup here at Amazon Soup Kitchen. My favorite interview question is to ask candidates to bake a cake, that's the real test of any cook."
The question is not about log processing; that is just a framing device. The interviewer does not care whether you can process log files, specifically; the interviewer cares whether you understand basic data structures and know how to make reasonable tradeoffs between time and space complexity, generally.
The people who criticize these questions do so out of insecurity - they know they aren't at the level required to solve it, which is embarrassing because it's pretty simple stuff, so they try to poke holes in it and justify their ineptitude with comments like the one you're responding to.
First they say they don't like tricky questions, and then go on to admit the "spec" is ambiguous. True, candidates are permitted to ask questions, but perhaps they are trusting of the interviewer and expect the question to be actionable as is? Or just the same, the candidates don't trust the interviewer and don't ask questions because they fear that'll result in a penalty? If you come from an environment where questions aren't rewarded - and there are plenty of those - then silence is likely.
Finally, it's worth mentioning: while the question + answer might correlate well with the hiring decision, there's no mention of how well it predicts future performance. That said, there's survivorship bias at play, so using it against performance might be iffy.
Frankly, I don't want to hire people who are too timid to ask a simple clarifying question in an interview.
A big part of this job is dealing with ambiguity and communication. If you feel the requirements of your task aren't clear, then go to the person who made the task and clarify them.
What's the alternative, exactly? Staying silent and waiting? Wasting time implementing the wrong solution?
Then make it clear that it's "real world" and that they can and should ask questions about aspects of it that they need more information on, rather than just hoping they do so.
Usually all interviewers will say “please take some time to consider the problem and ask any clarifying questions if you have any”. This is standard for all Amazon interviews.
That feels very formulaic, which is I guess par for the course for behemoth corporations.
If it were me, I would try and frame it like "this is an attempt to be a bit more 'real-world' where we've received some initial direction on what the results are supposed to look like from management, but don't consider any of it set in stone and I'm happy to talk you through it as if we were working together...". IDK the exact verbiage, but something that gets people in that mindset without straight up telling them "this is meant to be ambiguous, wink wink".
You're assuming the nervous stressed candidate can read your mind? That's not going to yield the best candidates. It's going to yield the candidates who are best "tuned" for the process.
I've already explained simple and obvious scenarios where the context can impact the candidates in sub-optimal ways.
If someone says, "Solve this" and the candidate attempts to do so, but that's not what was really expected? That's trickery (and foolish). On the other hand, if the interviewer wants, "Here's a problem, let's discuss..." then *that* is what the interviewer should lead with.
So context doesn't matter? And they're to blindly trust you? Because you said so [1]? Because they've never been screwed by an interviewer before? My sense is, you don't get out enough (i.e., be the candidate).
I hope the next time you do an interview the candidate asks, "Why should I trust you?" or "When was the last time you were the interviewee?"
[1] That's not how trust works. It is, by definition, earned.
It just sounds like you're projecting your poor interview experiences onto me.
If you can't take a statement as unambiguous as "feel free to ask questions at any time" at face value, then why even bother interviewing?
And for the record, I've been the candidate in plenty of interviews. I've done the 5-hour FAANG interviews and know how shitty they are. It's why I specifically designed the coding interview process at my company to be much more candidate friendly.
I think what the author meant is that they dislike questions that require too specialized knowledge, so once you know the trick the question is easy to answer, but whether the candidate knows that particular trick is not otherwise correlated with candidate skill.
I have practically never seen a PRD that did not have some ambiguity. I realised spotting ambiguity and asking questions is an essential and invaluable skill that needs to be selected for.
One of my teams was very surprised that I rejected a PRD for being too vague and put my foot down that it would not be picked up until specific questions were answered. They were, I would not say meek, but resigned to the inevitability of product managers pushing poorly thought-out PRDs and not having any say in the matter.
I took my time training them to say no and spot ambiguities, and I like to think they have all become better developers and product managers for it. I always pick questions that have more than one obviously correct interpretation to see if the candidate notices it.
The idea that you trust your interviewer to provide a directly actionable question is strange. I would expect that in campus hirings, and for entry level fresh grads, but not to anyone with even a year of experience. At senior levels, it becomes more and more important to spot ambiguities and clarify them before they result in misunderstandings, wasted efforts, and worse. I trust a good interviewer to have a question that can provide them useful data points on candidates experience, skill, thought process and attitude.
Whether spotting ambiguities in the question has any correlation with future performance is harder to answer, but methodical people with attention to detail are preferable to the alternative.
If the candidate comes from an environment where questions are penalised, they would be a bad fit for a team that values and expects questioning. It is somewhat unfair, but either way, the interviewers are selecting for their preferred qualities.
I agree with this... you're asked to do something and the interviewer is purposefully holding back information wanting you to come out and ask about it. That feels a bit tricky to me.
I do agree, the real job would have to deal with ambiguity. But this is the interview. The interviewee has a completely different mindset going in. "Playing games" to see how they respond... that's for the likes of the NSA, CIA, etc.
"We can't find good candidates"? Nah. Your hiring process sucks. Get a mirror.
The skill under test for that part is “can solve ambiguous problems”, and what you want to see is that the candidate is able to recognize that a problem is ambiguous.
I think the hard part is recognizing that a problem is ambiguous. Telling someone that a problem is ambiguous kind of defeats the point. IMO, it’s not about reading someone’s mind, but recognizing that there are multiple interpretations to what somebody has SAID. That seems less like a mind-reading technique and more like, you know, a communication skill.
I have gotten lots of ambiguous problems during my career, it seems only fair to have them appear during an interview.
The interviewer is expecting the interviewee to violate standard interview protocol? While the interviewee is perhaps desperate for a job? And the interviewer is - in his/her mind - thumping their chest about what a great and experienced interviewer they are?
I can't see how that's a success-minded plan for anyone involved.
I am baffled that several contributors to this thread seem to find this question difficult, some even calling it "dehumanizing". This is a very easy and basic question and I wouldn't want to work with someone who couldn't solve it efficiently in a few minutes.
I agree that "throw everything in a hashmap" should be straightforward and is a good interview test. However, his further steps to "optimize" it by saying "Poor candidates load the contents of both files into memory" are terrible. Yes, that might optimize resources, but first it hardcodes the requirement that there are exactly two days, breaking the solution if the requirement changes, and second it adds a bunch of finicky, fragile code about "if there are two pages or more from day one or if the first page from day one is different from the page from day two".
Great candidates treat software like a business with changing requirements and code that is read by multiple people; poor candidates treat software like a math challenge where the only goal is to use as few resources as possible.
The problem is that once you start asking algorithm questions for top tech companies people will optimize for knowing them deeply instead of exploring other fields of CS.
This is what happens with Codeforces/ACM-ICPC. Suddenly everyone is hyper-driven to crank them out day after day under pressure, and much more interesting areas, like databases or turning a business idea into a usable app, get neglected.
Let's not fool ourselves: no one is going to solve a hard or medium LC problem under time pressure in 30 minutes unless they've seen a similar problem before, which leads to hiring worse engineers who pass interviews.
Once a metric becomes the goal it ceases to be a good metric.
I have a similar task in my Python course: given 2 text files with a similar format, one holding persons and the CDs each of them has, another holding songs and which CDs they are on, answer which songs a particular person has, and which persons have a particular song. Yeah, a join.
Nothing about O(blah) though; that's way too specialized, optimization land.
That said, coding != thinking.
Some people cannot think of a solution at all, 50% go for 3 nested loops copy-pasted 2x2 times, a few do the simplest thing with 2 maps and 2-3 funcs, and once a "master" invented a class with quite a few methods containing copy-paste inside :/ Probably one could do some analysis of who goes which way and why, but I never bothered.
It is also good ground to show the different ways to add a key-value pair to a map if it's not there, and their pros/cons (one can hook some O(blah) stuff here; I don't. There's just simpler vs. faster, with variants).
And yes, having a (subtly) unclear requirement (that should be identified and asked about) is an important part of the learning; that's 90% of cases in life.
Data structure questions heavily favor previous use. For example, in the past I've used a multi-dictionary for certain problems, so it would have been an easy reach for me; but maybe not for someone with a different coding history.
Personally I prefer something like fizzbuzz which is a pure code question, it applies to candidates of all levels and tells you if they can reason through problems.
The question looks more or less OK, except for the ambiguity part. However, perhaps after "25 years in Big Tech" one shouldn't invent a process that discriminates against "loyal" customers living in timezones unaligned with the logging server. As usual, users come last.
I think unaligned customers are discriminated in favor of: they're more likely to be browsing around the server's midnight and get marked as visiting on two days.
I don't think you can decide if it's a net positive/negative discrimination with any degree of certainty without knowing the date-stamping machine's location and the demographics of the userbase. However, there most certainly will be a non-zero amount of users discriminated against (browsing during a span around their local midnight that falls onto a single date on the servers), and that's what matters most in my opinion.
At the end of the article he elaborates the problem in a way he seems to feel is too complex to mention in the interview itself: what if there's a third day?
If there are three server days of logs and a loyal customer continues to be defined as one who visits on two different days, the timezone problem essentially goes away. More fully:
- if a loyal customer is one who visits regularly;
- and we sample several days of visits;
- then loyal customers will be detected regardless of their timezone.
If you want the concept of a "loyal customer" to match the detection threshold, the weirdness related to timezones will still exist, but if you think of the detection threshold as a tool that's good enough to detect loyal customers, then it won't.
You do not need to sort for the preprocessing-based solution; you only need a K-way partition, which, if I'm not mistaken, can be done in two linear passes. I don't remember if it can be done in place easily, though.
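A rough sketch of that partitioning idea (my own interpretation, not the poster's code; rows are assumed to be (customer_id, page_id) tuples, and in practice the buckets would be spill files on disk rather than in-memory lists):

    K = 64  # number of partitions, chosen so each bucket fits in memory

    def partition(rows, k=K):
        # One linear pass: route each row to a bucket by a hash of the customer id.
        # hash() is fine within one process; files on disk would need a stable hash.
        buckets = [[] for _ in range(k)]
        for customer, page in rows:
            buckets[hash(customer) % k].append((customer, page))
        return buckets

    def loyal_customers(day1_rows, day2_rows):
        loyal = set()
        # Matching buckets from both days contain the same customers, so each
        # bucket pair can be solved independently with the in-memory approach.
        for bucket1, bucket2 in zip(partition(day1_rows), partition(day2_rows)):
            day1_pages = {}
            for customer, page in bucket1:
                day1_pages.setdefault(customer, set()).add(page)
            for customer, page in bucket2:
                pages = day1_pages.get(customer)
                if pages and (len(pages) >= 2 or page not in pages):
                    loyal.add(customer)
        return loyal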
Love how he says he does not like tricky questions, and then the first thing he describes is a trick hidden in the question designed specifically to fool candidates.
Yet he still thinks it is not a tricky question.
But great article and I learned something from it.
But this "trick" tests for something very real: many times I've seen people around me mindlessly jump to coding a PM-written ticket without clarifying important details or even doing a basic sanity check on the feature. The end result is often a disaster for both parties, and could be avoided with a bit of thought upfront. It's not like anyone is lying or misleading here; such details get omitted all the time in real life, but you need to collect them for a proper implementation.
I don't think the solution for that is dropping perfectly capable people in the interview.
There should be a process for clarifying tickets with the team, because you don't just drop tickets on developers "out of nowhere" and expect them to ask questions, especially if the company culture is "drop the ticket on the dev and let him figure it out".
Someone has to either write the ticket in enough detail that you can hand it to a dev without them needing to ask questions, OR hold a refinement session where the team asks questions, so you get input from multiple people, because a single person is not able to understand all the details of the system.
Hot take: The optimal solution is the most obvious one for people who are more used to UNIX commands than programming languages, because the primitives in UNIX are more powerful (they give you a beautiful solution if you can make your problem fit in them; if not, the shell becomes atrocious).
Here's my reasoning for the problem: we have 2 files, one for each day, and we want to see who came both days on different pages. That's a job for join, with some sort|uniq in the middle.
- for each day, we want unique CustomerId -> PageId "mappings", but in UNIX land that's just 2 columns on the same row:
cat dayX | awk '{print $3, $2}' | sort | uniq > dayX_uniq
- now I have two lists, I join them
join day1_uniq day2_uniq
this gives me one line per (customer id, day-1 page, day-2 page) combination. Customers who came only on one day are not in the output, and a customer can appear on several lines.
- now I want to see if there are at least 2 pages and those 2 pages are different. There's no easy UNIX way to do this because it's all on a single line, so we'll use awk to build a hashmap. We don't need to build a map of all pages, we only need to see if there are at least 2 different pages
cat both_days | awk '{delete pages; for (i = 2; i <= NF; i++) {pages[$i] = 1; if (length(pages) == 2) {print $1; next}}}' | sort -u
(Note: length() on an array, and delete of a whole array, are not POSIX but gawk supports them)
Result: a list of all customers having visited at least 2 different pages across both days. Everything is dominated by the initial sort. I haven't run this, it's a rough draft.
The best answer to the interview question is not mentioned in the article. It’s something like “I’m not interested in working here because I don’t want to use my art to spy on people.”
that looks like I am a good candidate for Amazon, lol.
But on a serious note, the good solution is sort of obvious, and in general you encounter much more interesting problems on a daily basis. But then again, I am working in Clojure, where working with data structures is much easier and more straightforward than in Java.
I am sorry if that sounds condescending; in reality I question my engineering ability almost every day. There are a lot of things I know horribly less about than any engineer worth their salt.
> I’ve actually expressed the problem in an ambiguous way.
> Did I mean 2 unique pages per day or overall?
The author needs a course in logic. "They visited at least two unique pages." is not ambiguous. Visiting page A on day 1 and visiting page B on day 2 makes the sentence true.
Most people typically do not form sentences based on pure logic, and many things could reasonably be misinterpreted even if, purely logically, it's unambiguous. Language is not math.
Assuming the quoted wording is the actual question he gives, it's not ambiguous. With the given wording, a person hearing it and understanding that it's stated precisely without ambiguity shouldn't be dinged for not questioning the interviewer's ability to recognize the requirements are unambiguous. Because it's an _interview_ by a technical professional who thought about the problem beforehand and not someone who you should be worried about not understanding what they are asking. In fact, I can imagine another interviewer that would ask you the exact same thing and ding you for asking a stupid question.
Interviews are just not the same thing as real-life requirements gathering, so people's thought processes will not be the same. Even if you try to roleplay as someone who doesn't understand how to state requirements precisely, all of the normal procedures and thought processes for sussing that out are compromised by the interview setting. And your ability to assess those processes is compromised by how familiar you are with the question and how much you've deconstructed it.
It's not the time to be tricky (though the author somehow simultaneously believes it is and is not a trick question). There are so many more interesting things that could be gleaned from a person's interview performance that "is this interviewer playing fuck-fuck games with me" doesn't rate whatsoever.
Does visiting pages A and B on day 1 and visiting page A on day 2 also make the sentence true? I think that's the source of ambiguity (or maybe it's ambiguous to me only because English is not my native language).
The user has visited A and B on day 1, and A on day 2.
So the total page hits is (A, B, A). Remove duplicates and you have (A, B) which makes the sentence true.
The noun is not the issue but rather the scope of uniqueness:
>Now, given two log files (log file from day 1 and log file from day 2) we want to generate a list of ‘loyal customers’ that meet the criteria of: (a) they came on both days, and (b) they visited at least two unique pages.
It appears to me that the requirement could be interpreted as either:
"(visit on day 1) AND (visit on day 2) AND (total unique pages count > 2)"
a clearer way to put it would be "visited at least two unique pages in total"
or
"(visit at least two unique pages on day 1) AND (visit at least two unique pages on day 2)"
a clearer way to put it would be "visited at least two unique pages on each day"
> I don’t mind getting the naive solution first, but I really want to see my candidate having that aha moment that O(n²) is probably never good in any problem. And I want that aha moment to come pretty quickly and without hints. No great engineer should ever settle for an O(n²) algorithm, unless bound by memory or some other unmovable constraint.
But that is purely a cultural convention. O(n²) is great for many important problems. For example, general parsing of a natural-language sentence takes O(n³) with the standard algorithms; getting it down to O(n²) would be evidence that the candidate was a deity.
Why select for familiarity with the details and conventions of the interviewing process? How is that supposed to be helpful?
> Candidates then switch it around to have CustomerId as the Key, and PageId as the Value of the Map. But that’s not particularly great either because it overlooks the fact that you can have many pages per customer, not just one. Some candidates have the intuition that they need a Collection of pages as the Value of the Map
But this is wrong. You're completely fine having a map from CustomerId to a single PageId, because the problem statement specifies that you're collecting customers who have visited more than one page. If you process a record that says CustomerId = X, PageId = Y, and you look up CustomerId in your map, these are the possibilities:
1. map[CustomerId] has no entry. You write Y as the new entry for that CustomerId.
2. map[CustomerId] is already Y. You do nothing.
3. map[CustomerId] has an entry that is not Y. You now know that CustomerId represents one of the customers you're trying to find.
At no point did you need the map values to represent more than one page.
> The condition “customers that visited at least 2 unique pages” tends to be a little harder for candidates to get right, so if they’re stuck I throw a little hint: you have a Set of pages from Day1, and a single page from Day2… how can you determine that this is at least two unique pages?
> Poor candidates will loop through the elements in the Set to check if the page from Day2 is in there. This turns your O(n) algorithm into O(n²) again. The number of candidates who have done this is surprising.
> Better candidates will do a .contains() on the Set which is an O(1) operation on a hash set. But there is a catch with the logic.
> The intuition to get this right is this: If you are inside that If loop and the customer visited at least two pages in Day1, and they visited any page in Day2, they’re loyal, regardless of which page they visit in Day2. Otherwise, they only visited only one page in Day1, so the question is: is this a different page? If so they’re loyal, else it’s a duplicate so you don’t know and should keep going. So your If statement has an Or:
> [code sample involving the interviewer's terrible solution]
> There’s a need for attention to detail, like using “>” instead of “>=” or missing the “!” in the second statement. I saw these fairly often. I didn’t worry. Great candidates spotted them quickly as they double-checked the algorithm when they were done. Good candidates spotted them after a little bit of hinting. That gave me a good signal on debugging skills.
Why in the world is this presented as a desirable solution? You have a Set of visited pages from day 1 and a single visited page from day 2. You want to know whether the total number of visited pages is more than 1.
Add the page from day 2 to the Set [O(1)], and then count the Set [also O(1)].
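In Python terms, a sketch of that check (it mutates the day-1 set in place, which is fine here since the customer has already been seen on day 2 and the running union is what you want going forward):

    def is_loyal(day1_pages: set, day2_page) -> bool:
        # Called only for customers seen on day 1 who now appear on day 2.
        day1_pages.add(day2_page)      # O(1) amortized insert into a hash set
        return len(day1_pages) >= 2    # len() on a set is O(1)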
> What if you pre-processed the files and sorted them by CustomerId, then by PageId?
> If the files are sorted, then the problem is easier and it’s just a two-pointer algorithm that you can execute in O(n) with O(1) of memory.
> Since the second sort key is by PageId, you follow another two-pointer algorithm to determine that there are at least two unique pages. So it’s a 2-pointer algorithm within a 2-pointer algorithm. It’s kind of a fun problem! I’ll leave the actual implementation as an exercise for the viewer.
> If you want to make the problem even more interesting, you can add a third file. I will leave that as an exercise for the reader as well!
If you're preprocessing the files, why not concatenate them before (or while) sorting them? The asymptotic resource requirements are the same and you end up with one file that can be processed in O(n) time and O(1) space. (Though the result of the algorithm necessarily takes up O(n) space, so I'm not sure how much this should count as an improvement in terms of space requirements...)
This additional preprocessing step makes the generalization to three files trivial. The algorithm is identical: concatenate the files, sort the monofile, and then walk through the sorted entries.
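A rough sketch of that streaming scan in Python (my own illustration; sorted() stands in for the external sort over the concatenated, day-tagged records, and rows are assumed to be (customer_id, page_id) tuples):

    from itertools import chain, groupby
    from operator import itemgetter

    def loyal_from_sorted_stream(day1_rows, day2_rows):
        # Tag each row with its day, then sort the combined stream by customer id.
        tagged = chain(((c, p, 1) for c, p in day1_rows),
                       ((c, p, 2) for c, p in day2_rows))
        ordered = sorted(tagged, key=itemgetter(0))
        # Walk the sorted records one customer at a time, streaming out results.
        for customer, records in groupby(ordered, key=itemgetter(0)):
            days, pages = set(), set()
            for _, page, day in records:
                days.add(day)
                pages.add(page)
            if len(days) == 2 and len(pages) >= 2:
                yield customer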
Awful solutions TBH. They are quite hard to implement (the requirements are coupled and can't be composed; lots of logic-heavy details), very specific, and not extendable for requirement changes in any way. You don't want this kind of code in real life.