Hacker News
ryeguy
20 days ago
on: Claude Fable 5
Did you read the blog post? They compare against deepswe and call it out as the worst one for false positives (solutions that failed but that the benchmark assessed as correct). It also has less language variance.
CSMastermind
20 days ago
I mean, yes, that is what you'd say if you were writing a blog post about your new benchmark.
ryeguy
19 days ago
Sure, but they at least quantified it with data. It's not like they just dropped a sentence saying the above; they showed numbers.