Hacker News | mhowland's comments

"They're really good at some things, terrible at others, and prone to doing something totally wrong some fraction of the time."

I agree 100% with this sentiment, but, it also is a decent description of individual humans.

This is what processes and control systems are for. These are evolving more slowly than the LLMs themselves at the moment, so we're looking to the LLM to be its own control. I don't think it will be any better than the average human is at being their own control, but that by no means makes it an unsolvable problem.


> I agree 100% with this sentiment, but, it also is a decent description of individual humans.

But you can understand individual humans and learn which are trustworthy for what. If I want a specific piece of information, I have people in my life that I know I can consult to get an answer that will most likely be correct and that person will be able to give me an accurate assessment of their certainty and they know how to accurately confirm their knowledge and they’ll let me know later if it turns out they were wrong or the information changed and

None of that is true with LLMs. I never know if I can trust the output, unless I’m already an expert on the subject. Which kind of defeats the purpose. Which isn’t to say they’re never helpful, but in my experience they waste my time more often than they save it, and at an environmental/energy cost I don’t personally find acceptable.


It defeats the purpose of the LLM as a personal expert on arbitrary topics. But the ability to do even a mediocre job with easy unstructured-data tasks at scale is incredibly valuable. Businesses like my employer pay hundreds of professionals to run business process outsourcing sites where thousands of contractors repeatedly answer questions like "does this support contact contain a complaint about X issue?" And there are months-long lead times to develop training about new types of questions, or to hire and allocate headcount for new workloads. We frequently conclude it's not worth it.
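As a sketch of the kind of task described above (the prompt wording, `build_prompt`, and the `ask_llm` hook are all made up for illustration; any completion API would slot in for the latter):

```python
# Illustrative sketch of the "mediocre but scalable" classification task above.
# `ask_llm` is a stand-in for whatever completion API you actually call;
# nothing here is a real vendor interface.
def build_prompt(contact_text: str, issue: str) -> str:
    return (
        f"Does the following support contact contain a complaint about {issue}?\n"
        f"Answer with exactly YES or NO.\n\n{contact_text}"
    )

def parse_answer(raw: str) -> bool:
    # Mediocre-but-cheap: treat anything that doesn't start with YES as a no.
    return raw.strip().upper().startswith("YES")

def classify(contact_text: str, issue: str, ask_llm) -> bool:
    return parse_answer(ask_llm(build_prompt(contact_text, issue)))
```

The point isn't that this parsing is robust (it isn't); it's that even this level of quality, run over thousands of contacts, competes with months of BPO lead time.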


Actually humans are much worse in this regard. The top performer on my team had a divorce and his productivity dropped by a factor of 3, and quality fell off a cliff.

Another example, from just yesterday: I needed to solve a complex recurrence relation. A friend of mine who is good at math (a math PhD) helped me for about 30 minutes, still without a solution and with a couple of false starts. Then he said to try ChatGPT, and we got the answer in 30s and spent about 2 minutes verifying it.
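For what it's worth, that "two minutes of verifying" step can also be done mechanically. A hedged sketch using SymPy's `rsolve`, with the Fibonacci recurrence as a stand-in since the thread doesn't say what the actual recurrence was:

```python
# Sketch: solving a linear recurrence symbolically and spot-checking the result.
# Fibonacci stands in here; the real recurrence from the anecdote isn't given.
from sympy import Function, rsolve, simplify, symbols

n = symbols('n', integer=True)
y = Function('y')

# y(n+2) = y(n+1) + y(n), with y(0) = 0, y(1) = 1
closed_form = rsolve(y(n + 2) - y(n + 1) - y(n), y(n), {y(0): 0, y(1): 1})

# The verification step: substitute an index into the closed form and
# compare against a value computed directly from the recurrence.
a, b = 0, 1
for _ in range(2, 11):
    a, b = b, a + b          # b walks up to the 10th Fibonacci number
assert simplify(closed_form.subs(n, 10)) == b
```

Checking a handful of indices plus the initial conditions like this is cheap insurance, whether the closed form came from a PhD or a chatbot.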


I call absolute bullshit on that last one. There's no way ChatGPT solves a maths problem that a maths PhD cannot solve, unless the solution is also googleable in 30s.


> unless the solution is also googleable in 30s.

Is anything googleable in 30s? It feels like finding the right combination of keywords that bypasses the personalization and poor quality content takes more than one attempt these days.


Right, AI is really just what I use to replace google searches I would have used to find highly relevant examples 10 years back. We are coming out of a 5 year search winter.


Duck-duck-goable then :)


>Actually humans are much worse in this regard. The top performer on my team had a divorce and his productivity dropped by a factor of 3, and quality fell off a cliff.

Wow. Nice of you to see a coworker go through a traumatic life event, and the best you can dredge up is to bitch about lost productivity and a decrease in selfless output of quality to someone else's benefit, while they're trying to stitch their life back together.

SMH. Goddamn.

Hope your recurrence relation was low bloody stakes. If you spent only two minutes verifying something coming out of a bullshit machine, I'd hazard you didn't do much in the way of boundary condition verification.


> I agree 100% with this sentiment, but, it also is a decent description of individual humans.

But humans can be held accountable, LLMs cannot.

If I pay a human expert to compile a report on something and they decide to randomly make up facts, that's malpractice and there could be serious consequences for them.

If I pay OpenAI to do the same thing and the model hallucinates nonsense, OpenAI can just shrug it off and say "oh, that's just a limitation of current LLMs".


>also is a decent description of individual humans

A friend of mine was moving from software development into managing devs. He told me: "They often don't do things the way or to the quality I'd like, but 10 of them just get so much more done than I could on my own." This was him coming to terms with letting go of some control, and switching to "guiding the results" rather than direct control.

The LLMs are a lot like this.


Your friend got lucky. I've seen (and worked with) people with negative productivity: they make the effort and sometimes they commit code, but it inevitably ends up broken, and it would take less of my time to write the code myself than to spend it all explaining and then fixing bugs.

The LLMs are a lot like this.


>> I agree 100% with this sentiment, but, it also is a decent description of individual humans.

Why would that be a good thing? The big thing with computers is that they are reliable in ways that humans simply can't ever be. Why is it suddenly a success to make them just as unreliable as humans?


I thought the big thing with computers is that they are much cheaper than humans.

If we are evaluating LLM suitability for tasks typically performed by humans, we should judge them by the same standards we judge humans. That means it's OK to make mistakes sometimes.


You missed quoting the next sentence, about providing a confidence metric.

Humans may be wrong a lot, but at least the vast majority will have the decency to say "I don't know", "I'm not sure", "give me some time to think", "my best guess is". In contrast, most LLMs today just spew out more hallucinations in full confidence.


Not really a thing in CA, largely unenforceable.


Thanks for pointing that out, I didn't know it. What about NDAs?


Dvele | Home Automation Engineer (Python) and Full Stack Engineers (Python/React) | San Diego, CA | Full-Time | Onsite | 100-160k + Equity | www.dvele.com

Dvele is building better homes, period. Over a century of construction experience coupled with several decades of Silicon Valley grit is tackling the single family homes space. Our goal is to help people live better lives, by constructing better, more modern, healthy homes for them.

If you’re interested, especially if you have Home Automation experience (HomeAssistant would be rad) apply at https://www.dvele.com/career/ or shoot me an email at matt@dvele.com. Help with relocation to sunny, beautiful La Jolla is available for the right candidate.


Don't listen to the Don't(s). Only you know your circumstances; if you're in a position to follow your inspiration (no kids, mortgage, etc.), awesome, do it. These circumstances (typically) evaporate with age, so take advantage of them.

Worst case, you learn a ton (probably more about yourself than anything else) and you still have a trade (software dev) to fall back on that will land you in the top 1% worldwide in terms of standard of living. Best case, you build a rad business; plenty of room in between.

That said, become enamored with the problem, not the tools (Django/Python, JS, etc.). Try to get to "ramen profitable", or at least to feedback, with the least possible effort. The hard part of side projects is rarely the dev; it's almost always the marketing.

On the marketing side, read, read, read...then experiment, experiment, experiment. Given the lack of capital you're gonna need to be creative, it can be fun, it can be frustrating...this will be the hardest part of your journey.

Good luck, have fun, learn a ton.


Thank you for your comment!

I like your mindset and this is the one I try to have. The tools I mentioned are the ones I use best. Like you said, the choice of tools does not concern me as much as validating my idea and getting to revenue. I'll keep in mind your advice about marketing.

I hope everything is well for you at Scalus!


Most of the Don'ts are going to come from folks who have been there and done that. Listen to them.

By all means, you'll learn from your mistakes. That doesn't mean they aren't mistakes. You can learn even more by building something that's viable and not being flat broke when your first iteration fails to gain traction.


No need to do it in transit. I mean iMessage could simply proxy all http/https requests post-decryption in iMessage, pre-request.

At the end of the day this privacy trade-off (Apple gets your browsing info) is probably more secure than an embedded webview that could potentially be exploited and is auto-loaded. Similar to how Chrome alerts users to malicious sites... I see the webview as a larger long-term attack vector than the privacy leakage.


The URL being disclosed to Apple was what I was getting at, which would happen with any approach that involves Apple performing the request on behalf of the user. I don't think the trade-off you're describing is necessary given that the sender could prepare the preview.


It's been spoofable (and jammable) for some time... but it's a serious PIA to do and a huge FCC no-no, as it's tough to focus.

http://news.utexas.edu/2013/07/29/ut-austin-researchers-succ...


Drone defense is very interesting....but I find the concept of jamming/electronic spoofing/RF commandeering somewhat comical in this arena. (what ApolloShield seems to be doing)

This is a physical threat (in the terrorist scenario) that needs to be dealt with kinetically.

RF spoofing doesn't work for planned flights; GPS spoofing can work, but replay attacks are wonky at best, and most will just trigger a "go home", which is easily overridden to be the target.

Skywall (http://openworksengineering.com/skywall) type stuff is interesting, but very hard to get right...know a few folks in the field working on some cool stuff.

The passive "whoops, I'm a terrible drone pilot and flew my drone over Moffett" case is fairly easy to deal with (betting we'll see some sort of drone tagging (RF)/mgmt system tied to the in-place registration very soon)... but it's not nearly as lucrative as true airspace security.

/2cents from someone not in the field but interested in it


> This is a physical threat (in the terrorist scenario) that needs to be dealt with kinetically.

While I completely agree with you, I'm curious whether the pricing has come down on powerful lasers, as I could see it being more economical to simply burn the rogue drones out of the sky. It would make re-targeting much faster, with less of a need to keep kinetic... material on hand.


So then the terrorist coats their drone in foil, and instead you are blinding anybody who looks at the drone when you fire.

No, kinetic or nets is much better.


> So then the terrorist coats their drone in foil, and instead you are blinding anybody who looks at the drone when you fire.

Many of the laser systems the military has been working with are not in the visible spectrum, and foil would not simply reflect them. Otherwise the majority of our anti-missile systems that use lasers would be thwarted by a simple foil coating.

> No, kinetic or nets is much better.

All depends on the goal. A quick burnout with a laser, which has no reload time and can hit targets near-instantly, is highly valuable in anti-missile systems; I don't see why this couldn't be applied in an anti-drone fashion as well. Kinetics will serve to obliterate, which is better as far as falling debris goes; it might be useful to have a combination of both in case you have to fend off a larger-than-expected number of drones.


3 Months.

I made https://www.myothernumber.com (online temporary sms/mms/voice numbers) about 1.5 years ago; it was making ramen money within 3 months.

Now tracking in the 5-figure/year range.

It's a very, very crowded market so primary cost is user acquisition, but then again that's exactly why I built it, to better learn (consumer facing UA) and have a platform to experiment with.

About to officially launch http://artistic.af (neural artistic transfer meets instagram + canvas printing). I expect this one to take a bit longer to scale, but it was just an excuse to learn DNNs.


Just wanted to let you know that there is a typo on the first page of artistic.af: "...a lot of historic genuius..."

Site looks great!


Awesome, much appreciated!!


And another one: ...and the inspriation of you...


For these ideas, did you do any customer interviews or did you just decide to launch and iterate?


associate.io | mobility for the hourly employee | Oakland

We're still in "stealth", funded, beta clients but early stage. Tackling the mobility problem for the non-professional employee. Founding team has "done it before" and has successful exits under their belts.

We're looking for:

iOS engineers

Python (full stack) - experience with Django (we only use it for presentation) and Twisted would be awesome

But first and foremost we're looking for smart people, who play to win.

We've also got a very cool digital nomad work policy for folks that want to travel the world after they've been with us for a few months.

Shoot us an email to learn more: helloworld@associate.io


Cliff notes: google's configuration service broke itself then fixed itself today. Engineers were alerted. Skynet is self-aware.


yeah I'm very curious as to what sort of bug deploys a bad configuration and then magically deploys the fix 30 minutes later...


I don't know the specifics of what happened here, but in my experience with automatic configuration generation one must have a way to validate the config, but that validator can have bugs (as any other software).

Then either the software loading the configuration detects the problem or the monitoring system notices something's not right, and the last working configuration is automatically applied while the non-working one is discarded.

By the looks of it, I would say their monitoring detected the problem but the reliability team needed some minutes to realise it was a configuration problem. A classic case is a network appliance that is misbehaving (e.g. a firewall or switch), but nobody knows it's because of the configuration, and it gets replaced by a fallback appliance that... oh, has the same problem (configuration).

All together, 25 minutes seems like a lot, but when you're troubleshooting and you know an important part of your infrastructure is down, time flies!
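The validate-apply-monitor-rollback loop described above can be sketched roughly like this (every function name here is a hypothetical placeholder, not anything Google has described about the actual incident):

```python
# Rough sketch of the loop described above: statically validate a generated
# config, apply it, watch the health signal, and fall back to the last known
# good config if things go wrong. All names are hypothetical placeholders.
def push_config(candidate, last_known_good, validate, apply, healthy):
    if not validate(candidate):
        # Static validation rejected it. But remember: the validator itself
        # can have bugs and let a bad config through.
        return last_known_good
    apply(candidate)
    if not healthy():
        # Monitoring caught a problem after the fact: roll back automatically.
        apply(last_known_good)
        return last_known_good
    return candidate
```

The failure mode speculated about in the thread then corresponds to `validate` passing a bad config and `healthy` (or the humans watching it) only tripping some minutes later.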


Probably a race condition that happened in the first deployment and the system ran just fine the second time.


the bug was probably neatly tucked inside a conditional that was time- and/or state-related

if state=="x": break_google else: fix_google

