I'm glad to see I'm not alone here! Was really excited about it, and tried pretty hard to make it through the first season but couldn't make it.
It just had too much of that early 2000s cable-TV-style drama, which I understand was required since it was on network TV. I honestly think if it were made again today as a Netflix/Prime series it would be a lot better.
It would be nice to see an actual picture of the physical business card here. Also, do you handle sending the design to a manufacturer, or do I need to download it and send it myself?
This just seems like massive user error. The same thing could have happened in a low tech environment. And the notetaker just made it more obvious.
Ex: Hop on a conference call with a group of people, Person A "leaves early" but doesn't hang up the phone, then the remaining group talks about sensitive info they didn't want Person A to hear.
> Person A "leaves early" but doesn't hang up the phone, then the remaining group talks about sensitive info they didn't want Person A to hear.
I'm sorry, but any conference software will make it extremely clear who is still on the call. Again, I do put a lot of this scenario down to user error. But the fact that this software is "always on" instead of "activated/deactivated" feels like an incomplete software suite to me personally.
On internet/app-based systems, yes ... but on legacy telephone systems you have to remember all 16 of the '<Person> is joining the call' announcements and mentally check them off when you get the '<Person> is leaving the call' on the way out. Of course, you have no idea who joined the meeting before you arrived.
You didn't even have to make the mistake once to know not to keep talking on a call that anyone can dial into after you think everyone has left.
This depends on whether you mean LLMs in the sense of single shot, or LLMs + software built around it. I think a lot of people conflate the two.
In our application we use a multi-step check_knowledge_base workflow before and after each LLM request. Pretty much: make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if the output text exceeded its knowledge base.
And the results are really good. Now coding agents in your example are definitely stepwise more complex, but the same guardrails can apply.
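Roughly, the workflow looks like this. This is a minimal sketch, not our actual implementation: the prompts are illustrative and `call_llm` is a stub standing in for a real model API call.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stub standing in for a real model API call.
    if "NEEDS_MORE_CONTEXT" in prompt:
        return "NO"   # pre-check: context is sufficient
    if "GROUNDED" in prompt:
        return "YES"  # post-check: answer stays within the knowledge base
    return "Answer based on the provided context."

def answer_with_guardrails(query: str, context: str) -> str:
    # Pre-check: a separate LLM request decides whether more info is needed.
    pre = call_llm(
        f"NEEDS_MORE_CONTEXT? Query: {query}\nContext: {context}\n"
        "Reply YES or NO."
    )
    if pre.strip() == "YES":
        return "ESCALATE: retrieve more context before answering"

    draft = call_llm(f"Context: {context}\nQuery: {query}")

    # Post-check: verify the draft doesn't exceed the knowledge base.
    post = call_llm(
        f"GROUNDED? Does this answer stay within the context?\n"
        f"Context: {context}\nAnswer: {draft}\nReply YES or NO."
    )
    if post.strip() != "YES":
        return "ESCALATE: answer exceeded the knowledge base"
    return draft

print(answer_with_guardrails("What are your hours?", "Open 9-5 Mon-Fri."))
```

The key design point is that both checks are separate, narrowly scoped LLM calls that return an escalation signal instead of a free-form answer.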
> Pretty much: make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if the output text exceeded its knowledge base.
They are unreliable at that. They can't reliably judge LLM outputs without access to the environment where those actions are executed and sufficient time to actually get to the outcomes that provide feedback signal.
For example, I was working on evaluation for an AI agent. The agent was about 80% correct, and the LLM judge about 80% accurate in assessing the agent. How can we have self-correcting AI when it can't reliably self-correct? Hence my idea - only environment outcomes over a sufficient time span can validate work. But that is also expensive and risky.
are the different LLMs correlated in what they get wrong? I suspect they are, given how much incest there's been in their training, but if they each have some edge in one particular area, you could use a committee. would cost that many more tokens, obviously.
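a minimal majority-vote committee sketch (the three model stand-ins are hypothetical; a real version would call different model APIs):

```python
from collections import Counter

def committee_answer(prompt, models):
    # Query each model independently and take the majority answer.
    votes = [model(prompt) for model in models]
    answer, count = Counter(votes).most_common(1)[0]
    # With no strict majority, the committee buys nothing over one model.
    return answer if count > len(models) // 2 else None

# Hypothetical stand-ins for three different model endpoints.
model_a = lambda p: "4"
model_b = lambda p: "4"
model_c = lambda p: "5"

print(committee_answer("2+2?", [model_a, model_b, model_c]))
```

note this only helps to the extent the models' errors are decorrelated, which is exactly the open question.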
For example, the article above was insightful. But the author's pointing to thousands of disparate workflows that could be solved with the right context, without actually providing one concrete example of how he accomplishes this, makes the post weaker.
Sure, concrete example. We do conversational AI for banks, and spend a lot of time on the compliance side. Biggest thing is we don't want the LLM to ever give back an answer that could violate something like ECOA.
So every message that gets generated by the first LLM is then passed to a second series of LLM requests + a distilled version of the legislation. ex: "Does this message imply likelihood of credit approval (True/False)". Then we can score the original LLM response based on that rubric.
All of the compliance checks are very standardized, and have very little reasoning requirements, since they can mostly be distilled into a series of ~20 booleans.
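A sketch of what that second pass looks like. The check questions and the judge are illustrative stand-ins; in practice each boolean would be a separate LLM call that also includes the distilled legislation as context.

```python
# Illustrative rubric: each entry is a yes/no question put to a judge LLM.
COMPLIANCE_CHECKS = [
    "Does this message imply likelihood of credit approval?",
    "Does this message reference a protected characteristic?",
    "Does this message quote a specific rate or term?",
    # ... ~20 such booleans in practice
]

def check_compliance(message: str, ask_llm) -> dict:
    # ask_llm(question, message) -> bool; a real version wraps an LLM call
    # with the distilled legislation included in the prompt.
    results = {q: ask_llm(q, message) for q in COMPLIANCE_CHECKS}
    # The message passes only if every boolean check comes back False.
    results["passed"] = not any(results[q] for q in COMPLIANCE_CHECKS)
    return results

# Stub judge for demonstration: flags messages that mention approval.
stub_judge = lambda q, msg: "approv" in msg.lower() and "approval" in q

print(check_compliance("You're basically approved!", stub_judge)["passed"])
```

Because each check is a narrow True/False question, you can score the original response against the rubric without asking the judge LLM to do any open-ended reasoning.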
If a hard drive sometimes fails, why would a raid with multiple hard drives be any more reliable?
"Do task x" and "Is this answer to task x correct?" are two very different prompts and aren't guaranteed to have the same failure modes. They might, but they might not.
RAID only works when failures are independent. E.g. if you bought two drives from the same faulty batch which die after 1,000 power-on hours, RAID would not help. With LLMs it's not obvious that errors are not correlated.
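A back-of-the-envelope illustration of why independence matters (the 20% failure rate is just an example number):

```python
# Assume each component (drive, or LLM judge) fails 20% of the time.
p = 0.2

# Independent failures: the system fails only when both fail at once.
p_both_independent = p * p   # ~0.04

# Fully correlated failures: the second fails exactly when the first does,
# so redundancy buys nothing.
p_both_correlated = p        # 0.20

print(p_both_independent, p_both_correlated)
```

Independence takes the failure rate from 20% to 4%; perfect correlation leaves it at 20% while doubling the cost.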
> If a hard drive sometimes fails, why would a raid with multiple hard drives be any more reliable?
This is not quite the same situation. It's also the core idea behind self-healing file systems like ZFS. ZFS not only stores redundant data but also redundant error-correction information, which allows failures not only to be detected but also corrected based on the ground truth (the original data).
In the case of an LLM backstopping an LLM, both have similar probabilities of error and no inherent ground truth. They don't necessarily memorize facts in their training data. Even with RAG, the embeddings still aren't memorized.
It gives you a constant probability for uncorrectable bullshit. One of the biggest problems with LLMs is the opportunity for subtle bullshit. People can also introduce subtle errors recalling things but they can be held accountable when that happens. An LLM might be correct nine out of ten times with the same context or only incorrect given a particular context. Even two releases of the same model might not introduce the error the same way. People can even prompt a model to error in a particular way.
Hey we've done a lot of research on this side [1] (OCR vs direct image + general LLM benchmarking).
The biggest problem with direct image extraction is multipage documents. We found that single-page extraction (OCR=>LLM vs Image=>LLM) slightly favored direct image extraction. But anything beyond 5 images had a sharp fall-off in accuracy compared to OCR-first.
Which makes sense, long context recall over text is already a hard problem, but that's what LLMs are optimized for. Long context recall over images is still pretty bad.
That's an interesting point. We've found that for most use cases, over 5 pages of context is overkill. Having a small LLM conversion layer on top of images also ends up working pretty well (i.e. instead of direct OCR, passing batches of 5 images - if you really need that many - to smaller vision models and having them extract the most important points from the document).
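Roughly what that conversion layer looks like as code (a sketch only; `vision_extract` is a hypothetical stand-in for a call to a small vision model):

```python
def batched(pages, size=5):
    # Split a page list into batches of at most `size` pages.
    for i in range(0, len(pages), size):
        yield pages[i:i + size]

def summarize_document(pages, vision_extract):
    notes = []
    for batch in batched(pages):
        # One call per small batch keeps the image context short enough
        # to stay reliable, per the fall-off noted above.
        notes.append(vision_extract(batch))
    return "\n".join(notes)

# Stub extractor for demonstration.
stub = lambda batch: f"key points from {len(batch)} page(s)"
print(summarize_document(list(range(12)), stub))
```

The extracted key points then become the text context for the main LLM, instead of raw images or raw OCR output.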
We're currently researching surgery on the cache or attention maps for LLMs to have larger batches of images work better. Seems like Sliding window or Infinite Retrieval might be promising directions to go into.
Also - and this is speculation - I think that the jump in multimodal capabilities that we're seeing from models is only going to increase, meaning long-context for images is probably not going to be a huge blocker as models improve.
This just depends a lot on how well you can pare down the context prior to passing it to an LLM.
Ex: reading contracts or legal documents. Usually a 50-page document that you can't very effectively cherry-pick from, since different clauses or sections will be referenced multiple times across the full document.
In these scenarios, it's almost always better to pass the full document into the LLM rather than running RAG. And if you're passing the full document it's better as text rather than images.
One big barrier I haven't seen mentioned is all the OEM competition they are going to face.
Caterpillar, John Deere, etc. already have remote operation vehicles. And a lot of provisions on what types of kits can be retrofitted onto their equipment without violating their terms/warranties.
I'm sure this is already something they've taken into consideration, but it seems like this will be more focused on partnerships with existing OEMs rather than selling add on kits to current fleets.
It’s only a pro if Bedrock has some sort of advantage that the existing companies don’t and can’t easily get. Without some sort of innovator’s dilemma-type situation, they’re likely to be crushed (into gravel).
Meh. Being acqui-hired by Caterpillar or John Deere can’t really be a dream of theirs. Plus the financial upside would be limited as these giants would tie it to tough long term milestones. Does not sound like a great deal.
> Caterpillar, John Deere, etc. already have remote operation vehicles. And a lot of provisions on what types of kits can be retrofitted onto their equipment without violating their terms/warranties.
Sounds ripe for disruption, then.
If a startup demonstrates promise, VC money will flood in. Then it's just a balancing of economics. Is the new VC-backed method cheaper? If so, the incumbents will lose market share relative to the value prop.
CAT, Deere are both doing very interesting things with older autonomy techniques. Deere has acquired several companies, and partnered with others to bring in talent from outside. CAT has worked with outside companies (notably Trimble, Topcon) for key technologies when it makes a big difference. Both are awesome companies, but not AI/ML companies at the core and it'll take a lot of work for them to get there. I think this is very much like the self driving world 10 years ago where OEMs tried very hard to become software companies, but ultimately Cruise and Waymo were the ones that executed.
Neither Cruise nor Waymo seems to be profitable yet, and the jury is still out on whether they will win the market. They may be the MySpaces (or the Fiskers) of autonomous driving.
To the parent poster's point though, those manufacturers hold outsized control over what can be retrofit to their machines, so to disrupt them you have to make your own machines. Working on and owning heavy equipment myself, I have of course looked at it and thought there's a lot to improve. But at the same time, I don't really see where the big-brain Silicon Valley + venture-bucks ethos can be applied to the space. It would be a long and slow grind of doing mostly straightforward mechanical engineering and supply chain/vendor agreements to build something like a bulldozer, just to enter a near-impenetrable market due to many existing sunk costs and long relationships between buyers and the existing manufacturers.
my understanding is that the barrier to entry in this space isn't manufacturing the equipment, but rather having a large dealer network for people to use for service and repairs. my impression is that people largely buy whatever has a nearby dealer for this reason. and these dealer connections are more and more important as the manufacturers make it more and more impossible to work on and maintain the equipment as an individual
The manufacturers are aware of monopoly laws and will give you the 'key' to put your own thing on and even sell it - for a 'reasonable fee' (which may be six figures) and proof that you will care about safety. Universities have got the key for student projects (under NDA).
disclosure: I work for John Deere but am not speaking for the company. The above is all I feel I can say on the subject
Venture hasn’t managed to make a dent in Nvidia despite massive investments.
Maybe they aren’t as powerful as you think outside the comparatively trivial “build some software” markets. Hell, even in networking, compute and storage there are only three or four real success stories in the last two and a half _decades_.
> One big barrier I haven't seen mentioned is all the OEM competition they are going to face.
Not sure on this one. The company likely has its own vision, but I've thought for a while that a swarm of small electric rubber-tracked earth-moving vehicles (small enough to fit one or two in a tradie's van?) could work longer hours due to being much quieter. For larger jobs you put a single person in a small tower on overwatch and run it 24 hours a day.
This'd give you a somewhat scalable approach from small residential jobs to somewhat larger jobs while not competing against the incumbents directly and allowing you to work out the kinks. Then if it makes sense later, you build bigger machines with hopefully better battery technology.
Ultimately though, for proper big jobs, you need proper big tools. Maybe a partnership or "exit strategy" works.
Though maybe I've played too many RTS games like Supreme Commander...
If the missing ingredient is not some secret technology that only few of these old players have, they are probably too busy with their existing business.
Management may invest many years developing some new key technology on the side but when it comes to actually taking the market, it's hard to focus on two areas at the same time.
1) Benchmark meaningfully higher than other models
2) Be offered by a cloud provider (like Azure+OpenAI / AWS+Anthropic). Otherwise you have very little track record in model/api stability. Especially looking at the last week.
> Now think about a bad software product that you might encounter briefly or you are forced to use: a poorly designed electronic kiosk with 1000ms lag on every interaction, or a hospital electronic system. I think there's a high chance that the people building them rarely use them, or not at all.
To be fair, it would be hard for me to build hospital EHR software if I were also checking myself into the hospital every day.
At my former company we built software for enrolling seniors into Medicare. It was as polished as we could possibly make it, but none of the engineers were 65+, so it was pretty hard to dogfood.
I'm one of those people who take the bright, shiny trinket that engineers love to show off and, after a few moments, make it start oozing a brown, smelly fluid as I find the flaws.
Another area where people don't dogfood anywhere near enough is handicapped accessibility. It's a catch-22: people like me can't write code because their hands or eyes don't work correctly, and those who have the physical ability to write code don't use accessibility tools.
Like, they've been slashed and outsourced and devalued to death over the past several years, but QA is a vital part of the lifecycle of professional software.
And it's not something you can just toss at a bunch of unpaid interns and expect them to do a good job. Being able to properly test software is a valuable skill—and it's one I respect all the more because I don't have it.