While impressive, the deepseek models aren't really "on par" with either oAI or ...

espadrine · 2024-12-31T10:48:42 1735642122

The blind test at lmarena.ai does give it a higher Elo than GPT-4o (API), Claude, and Gemini 1.5 Pro. It seems that people do enter real-life scenarios in the arena.

orbital-decay · 2024-12-31T15:25:39 1735658739

DeepSeek v3 feels very much like Sonnet 3.5 (v1) in particular, minus the character. Performs more or less similarly, "feels" overfitted just about the same, and repeats itself in multiturn chats even worse. I hope they address it in v3.5, v4, or whatever comes next.

rahimnathwani · 2024-12-31T11:16:18 1735643778

  They are very "stubborn" models

Have you found this to be the case even when using the recommended temperature settings (ranging from 0 for math, to 1.5 for creative tasks)?

NitpickLawyer · 2024-12-31T11:39:47 1735645187

I use 0.05 for math, just did a 5k problem set, trying to fine-tune a smaller model with the outputs. It has some very interesting training, borrowed from r1 per the tech report, where it does the o1/qwq "thinking steps", but a bit shorter. It solves ~80% of the problems in 4k context, while qwq would go on for 8k-16k. It's very good at what it does.

But as soon as I need it to do something other than solve a problem - say rewrite the problem in simpler terms, or given a problem + solution provide hints, or rewrite the solution with these <tags>, etc. it kinda stops working. Often times it still goes ahead and solves the problem. That's why I'm saying it's stubborn. If a task looks like a task that it can handle very well, it's really hard to make it perform that other, similar but not quite the same task.

In a similar vein - https://github.com/cpldcpu/MisguidedAttention/tree/main/eval...

victorbjorklund · 2024-12-31T11:00:50 1735642850

I found deepseek very useful at coding with Aider. On par with claude.