Thanks for the details. What's a second generation agent?
You mentioned the workflow is heavy on specs and tests. The smaller models seem to be really good at following instructions now. (Well, some of them!)
So that's probably part of why you're seeing good results. It has a very clear target.
With more open-ended instructions they seem to struggle more. I think common sense is the main thing you get with model size.
When I'm working with the big models I feel like I don't have to spell things out so much. The gap is closing, but I'm assuming there is some fundamental limit there based on the size.
Of course the ideal would be Mythos, running for free, in my house, at 1,000 tok/s ;) Someday...
i meant that i initially developed an agent harness as a set of skills integrated with opencode, and now i'm using that harness to write a new agent from scratch to replace opencode.
> probably part of why you're seeing good results
yes. i think tests and feedback loops for diagnosing errors (logs, debugging, etc.) are the most important things. in my experience deepseek-v4-flash tends to ignore instructions to use these tools and defaults to churning through code and guessing the cause of errors, and the guesses are often wrong. so i occasionally have to step in when it has been grinding fruitlessly for a while and remind it. this is probably due to context length and sparse attention: instructions put in context at the beginning of a session get forgotten.
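
a rough sketch of how that manual "stepping in" could be automated, purely as an illustration: count consecutive turns where the model did not call a diagnostic tool, and re-inject the instruction once the count hits a threshold. the tool names, the `ReminderPolicy` class, and the threshold are all hypothetical, not taken from opencode or any real harness.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tool names; a real harness would use its own.
DIAGNOSTIC_TOOLS = {"run_tests", "read_logs", "debugger"}
REMINDER = "reminder: run the tests and check the logs before editing more code."


@dataclass
class ReminderPolicy:
    """Tracks how long the agent has gone without using a diagnostic tool."""

    max_blind_turns: int = 5  # assumed threshold, tune per model
    blind_turns: int = 0

    def observe(self, tool_name: str) -> Optional[str]:
        """Record one agent turn; return a reminder message if one is due."""
        if tool_name in DIAGNOSTIC_TOOLS:
            self.blind_turns = 0
            return None
        self.blind_turns += 1
        if self.blind_turns >= self.max_blind_turns:
            # Reset so the reminder is not repeated on every later turn.
            self.blind_turns = 0
            return REMINDER
        return None


policy = ReminderPolicy(max_blind_turns=3)
turns = ["edit_file", "edit_file", "run_tests", "edit_file", "edit_file", "edit_file"]
messages = [policy.observe(t) for t in turns]
```

here `messages` is `None` for the first five turns and the reminder on the sixth, because the `run_tests` call resets the counter. appending the reminder at the end of the context, rather than relying on the copy at the start of the session, is the point of the design.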