soumendrak's comments

No projects. Using GitHub Sponsors for individual OSS contributors.


Are you blindly running expensive LLM evaluations on EVERY response your AI generates?

This widespread practice is costing companies thousands while delivering questionable value.

Here's why your LLM evaluation strategy might be broken:

1. Generic evals are practically USELESS
• Hallucination and toxicity scores mean nothing without context
• Your use case is unique - generic metrics rarely capture what matters

2. More evaluation ≠ better results
• Evaluating entire conversations drastically reduces judge accuracy
• Specific, targeted inputs yield more reliable scores

3. Your judges need guidance too
• Binary outputs with justification > arbitrary 1-5 scales
• Few-shot examples from YOUR domain are critical (see the sketch after this list)

4. The reliability problem is real
• Position bias: judges favor responses based on presentation order
• Verbosity bias: longer responses get better scores regardless of quality
• Self-enhancement bias: models favor their own outputs
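
To make point 3 concrete, here is a minimal sketch of a binary judge that returns a verdict plus a justification and is primed with few-shot examples from your own domain. This is an illustration, not a reference implementation: call_llm is a placeholder for whatever model client you use, and the refund-policy examples stand in for your real domain data.

    import json

    # Hypothetical few-shot examples; replace with graded samples from YOUR domain.
    FEW_SHOT = [
        {
            "question": "What is our refund window?",
            "answer": "Refunds are available within 30 days of purchase.",
            "verdict": "pass",
            "justification": "Matches the documented 30-day policy.",
        },
        {
            "question": "What is our refund window?",
            "answer": "You can return items anytime within a year.",
            "verdict": "fail",
            "justification": "Contradicts the documented 30-day policy.",
        },
    ]

    def judge(question: str, answer: str, call_llm) -> dict:
        """Ask the judge for a pass/fail verdict with a short justification."""
        examples = "\n\n".join(
            f"Question: {ex['question']}\nAnswer: {ex['answer']}\n"
            f"Verdict: {ex['verdict']}\nJustification: {ex['justification']}"
            for ex in FEW_SHOT
        )
        prompt = (
            "You are grading answers against our support policy.\n"
            'Respond with JSON only: {"verdict": "pass" | "fail", "justification": "..."}.\n\n'
            f"{examples}\n\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        # call_llm is assumed to return the model's raw text completion.
        return json.loads(call_llm(prompt))

A forced pass/fail verdict with a justification is easier to calibrate and audit than a bare 1-5 score, which is the point of item 3 above.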

Smart evaluation strategies that won't break the bank:

• Sample strategically instead of evaluating everything (rough sketch below)
• Combine automated evals with periodic human validation
• Provide context-specific examples to your judge
• Always request justification, not just scores
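
And a rough sketch of the "sample strategically" point above, assuming responses are logged as dicts with a "text" field and an optional "user_flagged" flag; the 5% base rate and the length threshold are arbitrary placeholders to tune for your traffic:

    import random

    def select_for_eval(responses, base_rate=0.05, length_threshold=2000):
        """Pick a subset of responses to send to the (expensive) LLM judge."""
        sampled = []
        for r in responses:
            # Always evaluate responses that look risky: unusually long output
            # or anything a user explicitly flagged.
            suspicious = len(r["text"]) > length_threshold or r.get("user_flagged", False)
            if suspicious or random.random() < base_rate:
                sampled.append(r)
        return sampled

This keeps judge spend roughly proportional to the base rate while still catching the responses most likely to need scrutiny.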

Remember: The best benchmark isn't some generic leaderboard - it's how well the model performs in YOUR specific application.


There are many errors in the timelines of the Indian section.


This is good for teaching programming. I have used IBM Cognos before, so I may be biased, but for complex logic, text-based programming works better than visual programming.


How does Kagi compare to You.com?


Asking the right question here.


Thanks a lot.


Cloudflare and Akamai: competition


But do we know the speed of the natural thing?


