> This is not scientific at all, just vibes, YMMV.
This is the problem.
I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.
> I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.
Think of it less like a static tool, and more like a human helper, where the same holds.
Well, unlike a human, I cannot expect any of these LLMs to take any ownership of the work they do. I cannot expect any given model and version (sonnet 4.6) to learn, improve and adapt over time. I cannot expect its limitations to ever go away at the model level. So it is not like a human in most ways that I actually care about.
That said, I can't wait for LLMs to stop being AI and start being just another tool. Anything cursed with the "AI" label seems to go through this mess. In the earlier AI cycles, rules engines were considered "human-ish" and got hyped up, but today we see them as just another tool available to us, and we're better off for it.
You're on the hook for their work in the way a manager is for their staff's output. The insistence on AI being a mere tool very often comes with this strange desire to be free of responsibility for its work. People seem to forget that the big advantage of these things is the range they have for obscure insight and creative solutions, both impossible with determinism.
> That said, I can't wait for LLMs to stop being AI and start being just another tool.
From a horse's perspective, the internal combustion engine is just another tool for making scary noises and powering horse trailers to take me on fun horse adventures. So ... perhaps.
One issue with that is that human helpers last longer. LLMs cycle in and out in months, and what held for Your Favorite LLM 6.7 may not hold for Your Favorite LLM 6.9.
Right, this is why I would slam the brakes on investing all of your time and effort into your workflow, because 2 months from now it may be out the window. Frontier models are also constantly being tweaked, so what worked yesterday may be off today.
ChatGPT was obedient with the grill-me technique, just wrote a plan. Yesterday it started jumping to implementation. Why?
I find that when an LLM jumps into tasks it was not told to do (or even worse, does things it was explicitly told not to), it is a good sign the context is too full, and you should do a controlled hand-off to a new instance.
To be fair: a voice, personality, and personal history sounds a lot like training data.
I don't think LLMs are people in any sense, at least as they're constructed now -- but they very much have what we would call "culture" and "personality" in suitably alien forms.
This is not the same as, e.g., feelings, experience, or humanity, or actual opinions or ideas (versus essentially "distilled vibes") and I feel that AI will more and more force us to confront that (including if new AIs are ever developed that may have the latter, as well!)
If you have a toolbox full of similar but different tools, getting to know them is a prudent thing to do, not a psychosis. There's no connection because the tool is immutable (except for adjustments you made) but you do develop a specific relation with that tool. Some people even love some of their tools at some level.
> Use of some tools like LLMs might be more inducing psychosis than others like plain compilers or hammers.
I really don't get it. Why is the fact that it outputs words so goddamn important for everybody? How does it suddenly make you so emotionally vulnerable? Does my brain work in a different way than the rest of humanity? Can't you disregard what's irrelevant? Is every programmer suddenly a Trump supporter that has no ability to recognize empty words? To recognize lies about emotions and facts?
Words are just input. Mostly garbage. Emotion-inducing words are garbage 10 times more often than any other. I could expect a romance reader to be affected, or somebody with an IQ of 70. But how is the caste of some of the most technical people ever afraid of catching psychosis just because they might read some words?
Computing is useful exactly for getting away from the messy real world of humans. I don’t need random errors in my financial transactions. I don’t want random errors when doctors are retrieving my medical history. And I don’t want random errors in my backup,… There’s plenty of non-deterministic things in my life, I don’t want my computer to follow suit.
These things passing the Turing Test makes anthropomorphizing their behavior awkward, but don’t forget it’s just an analogy to communicate an experience. If you convey a certain written voice to these models in your input, you get a somewhat consistent end effect. I think that’s all that is being communicated.
Except, where every different model and version is like a different person where you need to learn their idiosyncrasies of how they work every other month.
It's a very very bizarre way to use a computer.
Personally, I just don't. I'll use and prompt the LLMs the way that feels natural to me and move on with my life. Maybe I don't always get completely optimal results from them, but I'm also not spending half my day pleading with the computer to do a task.
If there was anything that made sense to anthropomorphise it would be a machine meant to mimic talking, thinking and answering like a human, one that even passes the Turing test.
When we built the idea that anthropomorphising is wrong, we meant when doing it for rocks or trees or thunder or deer or some such.
That's your prerogative, but be aware you'll continue to remain confused about LLMs. Anthropomorphizing them is what gives you the best high-level intuition about where and how to employ them, and where and how not to.
This is so dumb and goes against all the principles that enabled computers and smartphones to achieve wide adoption - the technology should evolve to fit the human. Not the other way around.
I'd argue the opposite. Technology in the past few decades was (is) limited and humans had to adapt to it.
We communicate with other humans using voice and three dimensional hand gestures. To use computers and early phones we had to learn to operate new input devices: keyboards and mice. Later with touchscreens we moved to two dimensional hand (finger) gestures. We're barely making voice commands work with our devices just recently.
Then, a large number of humans are figuratively tethered to their desks because the devices need power and a stable internet connection. Mobile devices break this relationship a bit but you still need to charge them and be close to some sort of access point. In any case, the devices encourage sitting in one place for hours at a time.
And this is just computers and smartphones. Humans adapted their entire lifestyles and transformed the landscape to cater to cars.
> Technology in the past few decades was (is) limited and humans had to adapt to it.
Was it? Think first about what it replaced. Lots of manual computation in bookkeeping and financial sectors. Telegrams and snail mail moved to email. Typesetting in books and magazines became easier and widely available,…
If there’s one thing that you can’t say about computers is that they’re limited.
No doubt that computers enabled a lot of automation. We can both agree with that.
The context was that technology should evolve to fit the humans [not the other way around]. And if contemporary technology didn't have limitations, it would be correct.
But it did and humans had to adapt to the computers. Humans had to develop and learn special languages so they could communicate with computers to do all those useful things you mentioned. Why? They were limited in understanding (or parsing) human languages. It took us decades before we could talk to computers in human languages. We're getting pretty close - especially in the past few years - but there's still some friction.
> Humans had to develop and learn special languages so they could communicate with computers to do all those useful things you mentioned. Why? They were limited in understanding (or parsing) human languages
You may need to revisit your computation theory courses. Computers are the embodiment of a mathematical model and thus the inputs and outputs are formalized.
Do you just hold a pen and words are written automatically? Do you just hover your hands over a piano and have the moonlight sonata played? No, you have to do precise mechanical movements because that’s how the output is realized.
There’s no such things as words, sentences, keywords, statements at the computer level. What it does is symbol manipulation. You provide it a string of symbols, the rules for the manipulation, and it will provide a string of symbols as the output.
What symbols, what rules, are completely arbitrary. We just found that {1,0} are all that we needed as the set of symbols and that Context-Free Grammar is perfect for specifying the rules.
We still need to encode everything down to binary (ascii, unicode, bcd, floating points, pixel formats, PCM,…) and use a programming language (as defined by a grammar) to get the computer to do anything. Inference is made possible by those two mechanisms. It’s not a new computation model.
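To make the "string of symbols plus rules" point concrete, here is a minimal Python sketch. The helper names (`to_bits`, `from_bits`, `rewrite`) and the toy bit-flipping rule are made up for illustration and come from no library. It encodes text down to {0,1} via UTF-8, then applies an arbitrary rewriting rule that knows nothing about words or sentences.

```python
def to_bits(text: str) -> str:
    """Encode text as a string over the alphabet {0, 1} via UTF-8."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def from_bits(bits: str) -> str:
    """Decode a string of 0s and 1s (length a multiple of 8) back to text."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


def rewrite(symbols: str, rules: dict[str, str]) -> str:
    """Apply a per-symbol substitution; the machine only ever sees symbols."""
    return "".join(rules.get(s, s) for s in symbols)


bits = to_bits("Hi")
print(bits)                                   # 0100100001101001
print(rewrite(bits, {"0": "1", "1": "0"}))    # 1011011110010110
print(from_bits(bits))                        # Hi
```

Any meaning ("this is the word Hi") lives in the encoding convention we chose, not in the manipulation step itself.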
I don't think the "languages" they mentioned meant specifically "programming languages". In HCI, computer interfaces can be referred to as languages as they come with their own affordances and symbolism that is not directly associated with real life: case in point, nowadays, basically no one saves data on diskettes, but we still use them as the "save icon".
Also, I find it funny you mentioned "there's no such thing as words [...] at the computer level". It seems you are the one in need of a computational theory refresh. Grammars are composed of words, which in turn, are composed of elements of the alphabet set. So, in fact, not only are there words, computers are, above all else, word-processing machines. There are more inaccuracies (physical computers being strictly deterministic, needing binary to accomplish inference, etc.), but let's leave it at that, unless you wish to press.
> In HCI, computer interfaces can be referred to as languages as they come with their own affordances and symbolism that is not directly associated with real life:
There's always jargon and other token words that hold no meaning in other realms of life. Even the alphabet today is mostly arbitrary glyphs.
> Grammars are composed of words, which in turn, are composed of elements of the alphabet set.
Please refer to the formal definition found in wikipedia
> There are more inaccuracies (physical computers being strictly deterministic, needing binary to accomplish inference, etc.),
I've not said anything about computers being strictly deterministic. And everything is binary at the CPU/GPU level. Even with specialized instructions, you still need to organize them into a proper algorithm and encode it and its data to binary.
> There's always jargon and other token words that hold no meaning in other realms of life. Even the alphabet today is mostly arbitrary glyphs.
Sure, but this is a discussion focused on how humans interact with computers, ergo Human-Computer Interaction, so I'm not sure what your point is. In the end, you don't interact with your computer (in the physical sense) through a 2-key keyboard.
> Please refer to the formal definition found in wikipedia <link to CFG article>
When I mentioned grammars, I was talking about formal grammars in general. Still, I muddled things a bit, since formal grammars only define the rules, whereas formal languages are, in one of their definitions, sets of strings/words over an alphabet.
Not that this means much, since the point of grammars is to define languages. As such, grammars (RG/CFG/NG/UG) stipulate the words that a language accepts. Words are important to computers (both in mathematical theory and in material reality).
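As a rough illustration of "a grammar stipulates the words a language accepts", here is a small Python sketch. The toy grammar `S -> a S b | (empty)` and the function name `words` are assumptions picked for the example, not anything from the thread. It enumerates every word the grammar derives up to a length bound, and it only terminates cleanly for grammars like this one, where each recursive production adds terminals.

```python
from collections import deque

# Hypothetical toy grammar: S -> a S b | (empty). Keys are the nonterminals.
GRAMMAR = {"S": ["aSb", ""]}


def words(grammar: dict[str, list[str]], start: str, max_len: int) -> list[str]:
    """Enumerate every derivable word with at most max_len symbols."""
    found = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal, if there is one.
        idx = next((i for i, c in enumerate(form) if c in grammar), None)
        if idx is None:
            found.add(form)  # only terminals remain, so this is a word
            continue
        for rhs in grammar[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            # Prune forms whose terminals alone already exceed the bound.
            terminals = sum(1 for c in new if c not in grammar)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found, key=lambda w: (len(w), w))


print(words(GRAMMAR, "S", 6))  # ['', 'ab', 'aabb', 'aaabbb']
```

The grammar is the finite rule set; the language is the (here infinite) set of words it generates, of which the output lists the shortest few.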
> I've not said anything about computers being strictly deterministic.
My bad, that was my misreading of "formalized".
> And everything is binary at the CPU/GPU level. Even with specialized instructions, you still need to organize them into a proper algorithm and encode it and its data to binary.
Poor phrasing on my part, but the "needing binary to accomplish inference" was supposed to be read in isolation. Still, computers do not require binary to operate. There are non-digital computers, both in history and being explored today. There are experiments on using trinary for optimizing LLM inference, for instance.
I mean, like, you can lament the state of the world all you want. It is what it is. Of course the AI labs would also like to make their models more consistent, but it's not how the technology works. They're black boxes to everybody.
Honestly, the differences between AI models always felt to me like the differences between coworkers or job candidates. They don't all share the same strengths and weaknesses - and they all have both good days and bad days.
Realising this made me take the "I" in "AI" a bit more seriously.
So, this may not be precisely what you're looking for but it may come close. I've put together a simple site for sharing ratings/opinions on models on a task-specific granularity. https://model.reviews/
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So on this site, each model gets its own page showing the list of tasks that people have rated it on, and the score out of 10 for each task. Common tasks, like coding, will likely be on most/all models, and more niche tasks may only be on a few. It is human moderated (by me only right now).
The corpus is pretty empty right now, so please spread the word if this seems like a useful idea!
Maybe this is similar to web search too. We know how to get google to return the results we want, and when we use other tools like Bing we get other behaviour.
That would be ideal, but AI is less like a tool and more like a human in this regard, and you don't have character sheets for each of your colleagues either.
Is that true though? From Cherno’s videos it sounded like it was basically the hazel engine, repurposed. So unless he rewrote hazel, purposely for robotics, it’s still not actually the case?
While "rewrote hazel" might be a bit of a stretch, we did fundamentally rewrite a lot of the core to make it specifically suitable for robotics simulation, rather than human gamers.
Why am I not surprised that it's a YC startup? Lately, being a YC startup seems to have become a negative signal for me, far too many grifters are getting funded by YC, it seems.
I haven’t used Fable/Mythos yet, but my experience with recent versions of Opus, GPT 5.5 and recent Chinese models is that prompting again isn’t guaranteed to fix the underlying issues, nor is it guaranteed to not introduce more issues. I’ve seen SOTA models make ridiculously stupid architectural decisions that they were then unable to back out of without being prompted very specifically, instead adding a patchwork of “fixes” on top.
I’m not saying that you can’t use AI to do it, because I believe that with carefully controlled workflows and context management you can, but it’s not a simple prompt away; it requires guidance and understanding, and isn’t the speed demon that raw prompting is.
> I haven’t used Fable/Mythos yet, but my experience with recent versions of Opus, GPT 5.5 and recent Chinese models is that prompting again isn’t guaranteed to fix the underlying issues, nor is it guaranteed to not introduce more issues.
That's not really the point though. That presumes models are only useful if they are one-shot models. That is false.
I mean, what if your prompt successfully changes 20 source files and makes a mess in one? How much work did it save?
And the elephant in the room is when models actually outperform whatever the prompter is able to deliver, and faster. That is somehow left out.
> That presumes models are only useful if they are one-shot models
That’s not at all what I’m saying.
I’m saying that in my experience across multiple models, the follow up prompts don’t fix prior underlying issues. They usually patch on top instead, unless you give them significant and time consuming guidance.
I want them to be more useful outside of one-shot uses, but I find that they currently miss the mark.
> I’m saying that in my experience across multiple models, the follow up prompts don’t fix prior underlying issues. They usually patch on top instead, unless you give them significant and time consuming guidance.
That's not my experience at all, and I have been using models that are far from being cutting edge. Even in the cases where a model generates utter nonsense, a couple of clarifying questions is all it takes to get it back on track.
But that might be a factor of the project being worked on, and the extent of the changes being asked for.
> You've gone too fast, too much is vague, nothing is clear.
Contrast to when Clojure was released: Rich Hickey had spent years thinking about, researching, and refining the concepts. It was easy to understand what the language is. And it shows in the design quality as even now, almost two decades later, the language has changed surprisingly little and is still really good.
I’ve been playing around with groq and GPT OSS which they run at 1000 TPS (20B) or 800 TPS (120B) and the speed feels quite magical.
I haven’t tried cerebras’ 3000 TPS yet but I did try the demo of that 15,000 TPS model whose name escapes me right now.
I’m not sure if it makes a meaningful difference for my actual work, but it sure is amazing to watch it generate a screen full of text in the blink of an eye.
I do think it’s super useful for running little validation checks like showing it a diff to ensure that the changes are on task, and being able to do those quicker really helps because it means you can do many focused checks without them getting in the way.
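For what it's worth, a check like that can be sketched in a few lines of Python. Everything here is hypothetical: `ask_model` stands in for whatever client you use to call a fast model (it just takes a prompt string and returns the reply text), and the prompt wording and YES/NO convention are my own assumptions, not any provider's API.

```python
from typing import Callable


def build_check_prompt(task: str, diff: str) -> str:
    """Assemble a narrow yes/no question about a single diff."""
    return (
        "Task description:\n"
        f"{task}\n\n"
        "Proposed diff:\n"
        f"{diff}\n\n"
        "Does this diff only make changes needed for the task? "
        "Answer YES or NO on the first line, then one sentence of reasoning."
    )


def is_on_task(task: str, diff: str, ask_model: Callable[[str], str]) -> bool:
    """Return True if the model's reply starts with YES (case-insensitive)."""
    reply = ask_model(build_check_prompt(task, diff))
    return reply.strip().upper().startswith("YES")


# Stand-in for a real model call, so the sketch runs on its own.
def fake_model(prompt: str) -> str:
    return "NO\nThe diff also touches files unrelated to the task."


print(is_on_task("Rename the config loader", "--- a/x.py\n+++ b/x.py", fake_model))
```

At high token rates each call like this returns almost instantly, which is what makes it practical to run one per diff.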
AFAIK Taalas, the company behind this demo, still only have their initially "hardwarized" model available to test in ChatJimmy, which IIRC is a rather stupid Llama 3ish 8b.
Don't get me wrong though, that demo is still incredibly impressive & makes me very much excited for the hardware-based model era (potentially) ahead.
Once you've experienced those speeds, you really start to think about the whole class of things that becomes possible; massively parallel decode paths, extensive reasoning loops, etc…