themanmaran's comments (Hacker News)

I'm glad to see I'm not alone here! I was really excited about it and tried pretty hard to get through the first season, but couldn't.

It just had too much of that early 2000s cable-TV-style drama. Which I understand was required since it was on network TV. I honestly think if it were made again today as a Netflix/Prime series it would be a lot better.


It would be nice to see an actual picture of the physical business card here. Also, do you handle sending the design to a manufacturer, or do I need to download it and send it myself?

The FAQ page is linked from the home page:

https://bizcardz.ai/faq

On the FAQ page there are links to images of the end result / physical card.


It would still be a lot nicer to see a sample in the repo you linked.

Instead of: GitHub link => bizcardz => FAQ => "Show me the end result"


This just seems like massive user error. The same thing could have happened in a low tech environment. And the notetaker just made it more obvious.

Ex: Hop on a conference call with a group of people, Person A "leaves early" but doesn't hang up the phone, then the remaining group talks about sensitive info they didn't want Person A to hear.


> Person A "leaves early" but doesn't hang up the phone, then the remaining group talks about sensitive info they didn't want Person A to hear.

I'm sorry, but any conferencing software will make it extremely clear who is still on the call. Again, I put a lot of this scenario down to user error. But the fact that this software is "always on" instead of "activated/deactivated" feels like an incomplete software suite to me personally.


> who is still on the call

On internet/app-based systems, yes... but on legacy telephone systems you have to remember all 16 of the '<Person> is joining the call' announcements and mentally check them off when you get the '<Person> is leaving the call' on the way out. And of course you have no idea who joined the meeting before you arrived.

You didn't even have to make the mistake once to know not to keep talking on a call that anyone can dial into after you think everyone has left.


This depends on whether you mean LLMs in the sense of single-shot requests, or LLMs plus the software built around them. I think a lot of people conflate the two.

In our application we use a multi-step check_knowledge_base workflow before and after each LLM request. Pretty much: make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if the output text exceeded its knowledge base.

And the results are really good. Now coding agents in your example are definitely stepwise more complex, but the same guardrails can apply.
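A minimal sketch of what such a pre/post guardrail might look like. The `llm` helper and the prompt wording are hypothetical stand-ins for whatever model API you actually use; here `llm` is a stub so the example runs:

```python
# Sketch of a check_knowledge_base-style guardrail around an LLM call.
# `llm` is a hypothetical stand-in for a real model API.

def llm(prompt: str) -> str:
    # Stub model: pretend the context is always sufficient and the
    # answer is always grounded.
    if "SUFFICIENT?" in prompt:
        return "YES"
    if "GROUNDED?" in prompt:
        return "YES"
    return "Answer based on provided context."

def answer_with_guardrails(query: str, context: str) -> str:
    # Pre-check: does the existing context cover the query?
    pre = llm(f"SUFFICIENT? Can this context answer the query?\n"
              f"Query: {query}\nContext: {context}\nAnswer YES or NO.")
    if pre.strip().upper() != "YES":
        raise LookupError("need more context before answering")

    draft = llm(f"Query: {query}\nContext: {context}")

    # Post-check: did the generated answer stay inside the knowledge base?
    post = llm(f"GROUNDED? Is every claim in this answer supported by "
               f"the context?\nAnswer: {draft}\nContext: {context}\n"
               f"Answer YES or NO.")
    if post.strip().upper() != "YES":
        raise ValueError("answer exceeded the knowledge base")
    return draft
```

The key design point is that the two checks are separate, cheap LLM calls with narrow yes/no outputs, not part of the main generation prompt.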


> Pretty much: make a separate LLM request to check the query against the existing context to see if more info is needed, and a second check after generation to see if the output text exceeded its knowledge base.

They are unreliable at that. They can't reliably judge LLM outputs without access to the environment where those actions are executed and sufficient time to actually get to the outcomes that provide feedback signal.

For example I was working on evaluation for an AI agent. The agent was about 80% correct, and the LLM judge about 80% accurate in assessing the agent. How can we have self correcting AI when it can't reliably self correct? Hence my idea - only the environment outcomes over a sufficient time span can validate work. But that is also expensive and risky.


Are the different LLMs correlated in what they get wrong? I suspect they are, given how much incest there's been in their training, but if they each have some edge in one particular area, you could use a committee. It would cost that many more tokens, obviously.

Do you have a concrete example of what you mean?

For example, the article above was insightful. But the author points to thousands of disparate workflows that could be solved with the right context without providing a single concrete example of how he accomplishes this, which makes the post weaker.


Sure, concrete example. We do conversational AI for banks, and spend a lot of time on the compliance side. Biggest thing is we don't want the LLM to ever give back an answer that could violate something like ECOA.

So every message that gets generated by the first LLM is then passed to a second series of LLM requests + a distilled version of the legislation. ex: "Does this message imply likelihood of credit approval (True/False)". Then we can score the original LLM response based on that rubric.

All of the compliance checks are very standardized, and have very little reasoning requirements, since they can mostly be distilled into a series of ~20 booleans.
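As a sketch of that rubric-scoring pass (the question wording and the `llm_bool` helper are hypothetical; a real system would prompt an actual model with the distilled legislation attached):

```python
# Sketch: score a generated message against a rubric of boolean
# compliance checks, each phrased as a True/False question for a
# second LLM pass. `llm_bool` is a hypothetical stub.

COMPLIANCE_RUBRIC = [
    "Does this message imply likelihood of credit approval?",
    "Does this message reference a protected characteristic?",
    # ... in practice ~20 such questions distilled from legislation
]

def llm_bool(question: str, message: str) -> bool:
    # Stub: a real implementation would send the question, the message,
    # and a distilled version of the relevant legislation to a model.
    return False  # pretend every check passes

def passes_compliance(message: str) -> bool:
    # The message is compliant only if no rubric check fires.
    return not any(llm_bool(q, message) for q in COMPLIANCE_RUBRIC)
```

Because each check collapses to a single boolean, the second-pass model needs very little reasoning, which is what makes this reliable.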


Thank you! Great example!

If an LLM is unreliable, then why would another just-as-unreliable LLM make it any better?

If a hard drive sometimes fails, why would a raid with multiple hard drives be any more reliable?

"Do task x" and "Is this answer to task x correct?" are two very different prompts and aren't guaranteed to have the same failure modes. They might, but they might not.


RAID only works when failures are independent. E.g. if you bought two drives from the same faulty batch which die after 1000 power-on hours, RAID would not help. With LLMs it's not obvious that errors are uncorrelated.
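The arithmetic makes the point: under independence a judge helps a lot, but with fully correlated errors it catches nothing. A rough sketch, with illustrative (not measured) error rates:

```python
# Residual error after a judge reviews a generator's answer, under
# two assumptions about how correlated their errors are.
# The rates below are illustrative, not measured.

p_gen_wrong = 0.20    # generator error rate
p_judge_wrong = 0.20  # judge error rate (i.e. 80% accurate)

# Independent errors: a wrong answer slips through only when the
# judge also errs on that same item.
independent_residual = p_gen_wrong * p_judge_wrong  # ~0.04

# Fully correlated errors: the judge errs on exactly the items the
# generator got wrong, so nothing is caught.
correlated_residual = p_gen_wrong  # 0.20
```

That gap (4% vs 20% residual error) is the whole question of whether an LLM committee buys you anything.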

> If a hard drive sometimes fails, why would a raid with multiple hard drives be any more reliable?

This is not quite the same situation. It's also the core conceit of self-healing file systems like ZFS. In the case of ZFS it not only stores redundant data but redundant error correction. It allows failures to not only be detected but corrected based on the ground truth (the original data).

In the case of an LLM backstopping an LLM, they both have similar probabilities for errors and no inherent ground truth. They don't necessarily memorize facts in their training data. Even with a RAG the embeddings still aren't memorized.

It gives you a constant probability for uncorrectable bullshit. One of the biggest problems with LLMs is the opportunity for subtle bullshit. People can also introduce subtle errors recalling things but they can be held accountable when that happens. An LLM might be correct nine out of ten times with the same context or only incorrect given a particular context. Even two releases of the same model might not introduce the error the same way. People can even prompt a model to error in a particular way.


If one person is unreliable, why would a group of people make it any better.

Yeah 15 random guys ought to do surgery just as well as one surgeon right?

custodes[.]ai would be a great startup name

Actually, Custodes would have nothing to do with abominable intelligence </warhammer 40k>

...What if we called it a "Machine Spirit"?

Hey we've done a lot of research on this side [1] (OCR vs direct image + general LLM benchmarking).

The biggest problem with direct image extraction is multipage documents. We found that single-page extraction (OCR=>LLM vs Image=>LLM) slightly favored the direct image extraction. But anything beyond 5 images had a sharp fall-off in accuracy compared to OCR first.

Which makes sense, long context recall over text is already a hard problem, but that's what LLMs are optimized for. Long context recall over images is still pretty bad.

[1] https://getomni.ai/blog/ocr-benchmark
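A rough sketch of the OCR-first pipeline that the benchmark favored for multipage documents. The `ocr_page` and `llm_extract` helpers are hypothetical stubs standing in for a real OCR engine and model API:

```python
# Sketch: OCR every page to text first, then make a single
# long-context text request, instead of sending many page images
# to the model directly. Helpers below are hypothetical stubs.

def ocr_page(page_image: bytes) -> str:
    # A real implementation would call an OCR engine here.
    return f"[text of {len(page_image)}-byte page]"

def llm_extract(prompt: str) -> str:
    # A real implementation would call a model API here.
    return f"extracted from {prompt.count('--- page')} pages"

def extract_from_document(page_images: list[bytes]) -> str:
    # Long-context recall over text is what current LLMs are
    # optimized for, so collapse all pages into one text prompt.
    pages = [ocr_page(img) for img in page_images]
    doc_text = "\n".join(f"--- page {i + 1} ---\n{t}"
                         for i, t in enumerate(pages))
    return llm_extract(f"Extract the key fields.\n{doc_text}")
```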


That's an interesting point. We've found that for most use cases, over 5 pages of context is overkill. Having a small LLM conversion layer on top of images also ends up working pretty well (i.e. instead of direct OCR, passing batches of 5 images - if you really need that many - to smaller vision models and having them extract the most important points from the document).

We're currently researching surgery on the cache or attention maps for LLMs to have larger batches of images work better. Seems like Sliding window or Infinite Retrieval might be promising directions to go into.

Also - and this is speculation - I think that the jump in multimodal capabilities that we're seeing from models is only going to increase, meaning long-context for images is probably not going to be a huge blocker as models improve.


This just depends a lot on how well you can pare down the context prior to passing it to an LLM.

Ex: reading contracts or legal documents. Usually a 50-page document that you can't very effectively cherry-pick from, since different clauses or sections will be referenced multiple times across the full document.

In these scenarios, it's almost always better to pass the full document into the LLM rather than running RAG. And if you're passing the full document it's better as text rather than images.
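One way to frame that decision as code. The token budget and the characters-per-token heuristic below are illustrative assumptions, not figures from the comment:

```python
# Sketch: prefer full-document context over RAG when the document
# fits in the model's window. The 100k-token budget and the
# 4-chars-per-token heuristic are illustrative assumptions.

CONTEXT_BUDGET_TOKENS = 100_000

def count_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English.
    return len(text) // 4

def choose_strategy(document: str) -> str:
    # Heavily cross-referenced documents (contracts, legal) degrade
    # badly under chunked retrieval, so pass the whole text when it fits.
    if count_tokens(document) <= CONTEXT_BUDGET_TOKENS:
        return "full-document"
    return "rag"
```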


One big barrier I haven't seen mentioned is all the OEM competition they are going to face.

Caterpillar, John Deere, etc. already have remote-operation vehicles, and a lot of provisions on what types of kits can be retrofitted onto their equipment without violating their terms/warranties.

I'm sure this is already something they've taken into consideration, but it seems like this will be more focused on partnerships with existing OEMs rather than selling add on kits to current fleets.


>One big barrier I haven't seen mentioned is all the OEM competition they are going to face.

Seems like that is a pro, not a con. An exit scenario.


It’s only a pro if Bedrock has some sort of advantage that the existing companies don’t and can’t easily get. Without some sort of innovator’s dilemma-type situation, they’re likely to be crushed (into gravel).


Acqui-hire. The modern-day labor union (get more for your skilled labor than you would otherwise).


Meh. Being acqui-hired by Caterpillar or John Deere can’t really be a dream of theirs. Plus the financial upside would be limited as these giants would tie it to tough long term milestones. Does not sound like a great deal.


Equipment operators are led by their largest clients: mining companies such as Rio Tinto, for example.

24/7/365 large fleet operators that move a billion tonnes of ore per annum and alter the spin balance of the planet by a detectable amount.

Pages such as https://www.riotinto.com/en/mn/about/innovation/automation are out of date and don't do justice to the extent of, and demand for, grand-scale semi-autonomous mining and construction equipment.

BBC coverage of one site and mining automation: https://www.bbc.com/news/articles/cgej7gzg8l0o

There's a large yet-to-be-built copper project in the US that has autonomous mining plans in its economic technical report.

https://resolutioncopper.com/mining-method/

https://resolutioncopper.com/wp-content/uploads/2021/11/RTRC...

https://en.wikipedia.org/wiki/Resolution_Copper#Reactions


> Caterpillar, John Deere, etc. already have remote-operation vehicles, and a lot of provisions on what types of kits can be retrofitted onto their equipment without violating their terms/warranties.

Sounds ripe for disruption, then.

If a startup demonstrates promise, VC money will flood in. Then it's just a balancing of economics. Is the new VC-backed method cheaper? If so, the incumbents will lose market share relative to the value prop.


CAT, Deere are both doing very interesting things with older autonomy techniques. Deere has acquired several companies, and partnered with others to bring in talent from outside. CAT has worked with outside companies (notably Trimble, Topcon) for key technologies when it makes a big difference. Both are awesome companies, but not AI/ML companies at the core and it'll take a lot of work for them to get there. I think this is very much like the self driving world 10 years ago where OEMs tried very hard to become software companies, but ultimately Cruise and Waymo were the ones that executed.


Neither Cruise nor Waymo seems to be profitable yet, and the jury is still out on whether they will win the market. They may be the MySpaces (or the Fiskers) of autonomous driving.


To the parent poster's point, though, those manufacturers hold outsized control over what can be retrofitted to their machines, so to disrupt them you have to make your own machines. Working on and owning heavy equipment myself, I have of course looked at it and thought there's a lot to improve. But at the same time, I don't really see where the big-brain Silicon Valley + venture bucks ethos can be applied to the space. It would be a long, slow grind of mostly straightforward mechanical engineering and supply chain/vendor agreements to build something like a bulldozer, just to enter a near-impenetrable market due to many existing sunk costs and long relationships between buyers and the existing manufacturers.


My understanding is that the barrier to entry in this space isn't manufacturing the equipment, but rather having a large dealer network for people to use for service and repairs. My impression is that people largely buy whatever has a nearby dealer for this reason. And these dealer connections become more and more important as manufacturers make it more and more impossible to work on and maintain the equipment as an individual.


The manufacturers are aware of monopoly laws and will give you the 'key' to put your own thing on and even sell it, for a 'reasonable fee' (which may be six figures) and proof you will care about safety. Universities have gotten the key for student projects (under NDA).

Disclosure: I work for John Deere but am not speaking for the company. The above is all I feel I can say on the subject.


Orrrrrr...venture capital money comes in and they sell for a loss until they achieve monopoly status and jack up the rates!

Might be less successful now that money isn't free.


Venture hasn’t managed to make a dent in Nvidia despite massive investments.

Maybe they aren’t as powerful as you think outside the comparatively trivial “build some software” markets. Hell even in networking, compute and storage there are only three or four real success stories in the last two and a half _decades_.


The money raised is $80m rather than $800m which likely reflects all the challenges faced.

It's the kinda startup that may be able to pivot easier than others.


> One big barrier I haven't seen mentioned is all the OEM competition they are going to face.

Not sure on this one. The company likely has its own vision, but I've thought for a while that a swarm of small electric rubber-tracked earth-moving vehicles (small enough to fit one or two in a tradie's van?) could work longer hours due to being much quieter. For larger jobs you put a single person in a small tower on overwatch and run it 24 hours a day.

This'd give you a somewhat scalable approach from small residential jobs up to somewhat larger ones, while not competing against the incumbents directly and allowing you to work out the kinks. Then, if it makes sense later, you build bigger machines with hopefully better battery technology.

Ultimately though, for proper big jobs, you need proper big tools. Maybe a partnership or "exit strategy" works.

Though maybe I've played too many RTS games like Supreme Commander...


If the missing ingredient is not some secret technology that only few of these old players have, they are probably too busy with their existing business.

Management may invest many years developing some new key technology on the side but when it comes to actually taking the market, it's hard to focus on two areas at the same time.


Do they have a large patent portfolio that might get into the way?


Honestly I think it would have to:

1) Benchmark meaningfully higher than other models

2) Be offered by a cloud provider (like Azure+OpenAI / AWS+Anthropic). Otherwise you have very little track record in model/api stability. Especially looking at the last week.


It looks like they did the first one. And are already on the platforms. What’s stopping you now?

For us, we’ll probably try it for workflows that don’t currently work with 4.1 or 4 sonnet


Grok 3 is on Azure.


> Now think about a bad software product that you might encounter briefly or you are forced to use: a poorly designed electronic kiosk with 1000ms lag on every interaction, or a hospital electronic system. I think there's a high chance that the people building them rarely use them, or not at all.

To be fair, it would be hard for me to build hospital EHR software if I were also checking myself into the hospital every day.

At my former company we built software for enrolling seniors into Medicare. It was as polished as we could possibly make it, but none of the engineers were 65+, so it was pretty hard to dogfood.


I'm one of those people who take the bright, shiny trinket that engineers love to show off and, after a few moments, make it start oozing a brown, smelly fluid as I find the flaws.

Another area where people don't dogfood anywhere near enough is handicapped accessibility. It's a catch-22: people like me can't write code because their hands or eyes don't work correctly, and those who have the physical ability to write code don't use accessibility tools.


...This is what a QA department is for.

Like, they've been slashed and outsourced and devalued to death over the past several years, but QA is a vital part of the lifecycle of professional software.

And it's not something you can just toss at a bunch of unpaid interns and expect them to do a good job. Being able to properly test software is a valuable skill—and it's one I respect all the more because I don't have it.


This is something we've been doing as well, and it's pretty magical when the user has a fully customized experience.

That said, it requires the user to sign in with their real work email, or the results are way off.

