I second this. After picking up an OEM Spark and running qwen36moe/dense, I was thoroughly impressed with what such small models can do and the (reasonable) speeds you can get. I'm back to using open-weight models via an API (I wanted more capability for the time being), but I will be getting more hardware soon (re: ds4-flash and the fable shot heard round the world).
There is a lower bar (and it gets lower over time), but IME the config you are describing is still too low.
qwen/gemma in the 27/35B range @fp8 are better than gemini-2.5 but not as good as gemini-3.1. You can run DS4-flash @fp8 on two DGX Sparks, and things keep getting better: DiffusionGemma came out recently with 4x token generation speeds.
tl;dr - the models you appear to be trying are too small or too quant'd.
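For anyone sizing this up, here is a rough back-of-the-envelope sketch of the weights-only memory math behind "too small or too quant'd". It is an approximation, not a benchmark: it ignores KV cache, activations, and runtime overhead, the 128 GB memory figure and the 0.8 headroom factor are my assumptions, and `weight_gb` / `fits` are just illustrative helper names.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB.

    params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    """
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         mem_gb: float, headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of the available memory.

    The headroom factor is an assumed allowance for KV cache and
    runtime overhead, not a measured value.
    """
    return weight_gb(params_billion, bits_per_weight) <= mem_gb * headroom


if __name__ == "__main__":
    MEM_GB = 128  # assumed memory per box; adjust for your hardware
    for params in (27, 35):
        for bits in (16, 8, 4):
            size = weight_gb(params, bits)
            print(f"{params}B @ {bits}-bit: ~{size:.1f} GB weights, "
                  f"fits={fits(params, bits, MEM_GB)}")
```

The point is that a 27/35B model at 8 bits is only a few tens of GB of weights, so dropping to a much smaller model or a much harsher quant is often not necessary on this class of hardware.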
One company, multiple models: Fireworks is the fastest at making the models available (they had GLM-5.2 before the other three we are evaluating).