Predictable Inference Output
Recently, Thinking Machines published a blog post on Defeating Nondeterminism in LLM Inference. First, let me say from a boots-on-the-ground perspective: I understand this need. Inference engines are not 'standardized' the way SQL is. They take different inputs and have optional features. The major model families (Qwen, Llama, etc.) have different tokenizers. On different hardware, an optimization might produce a different result. The inference engine may have a cache or be batching simultaneous requests, which can also shape the result of a prompt.
I have seen a few reactions to the Thinking Machines article, and many of them point at "temperature". I would not call temperature the largest factor in non-determinism. Temperature scales the logits before sampling: a higher temperature flattens the probability distribution, so if the sampling itself is random, a wider range of tokens becomes likely. I wanted to discuss some other causes of non-determinism.
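To make that concrete, here is a minimal sketch (my own toy example, not any engine's actual code) of how temperature is typically applied: divide the logits by the temperature before softmax, and the distribution flattens as the temperature goes up.

public class TemperatureDemo {
    // Scale logits by 1/temperature, then softmax. A higher temperature
    // flattens the distribution, so random sampling spreads over more tokens.
    static double[] softmaxWithTemperature(double[] logits, double temperature) {
        double[] p = new double[logits.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < logits.length; i++) {
            p[i] = logits[i] / temperature;
            max = Math.max(max, p[i]);
        }
        double sum = 0;
        for (int i = 0; i < p.length; i++) {
            p[i] = Math.exp(p[i] - max); // subtract max for numerical stability
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        double[] logits = {4.0, 2.0, 1.0}; // made-up logits for three tokens
        System.out.println(java.util.Arrays.toString(softmaxWithTemperature(logits, 0.5)));
        System.out.println(java.util.Arrays.toString(softmaxWithTemperature(logits, 1.5)));
    }
}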
Model Strength
Often I use a very small model like TinyLlama for speed of development. Small models have not been trained on a large corpus of data and tend to produce false, "hallucinated" answers. If you ask a 70B-parameter model 'Answer only with the single letter: What is the first letter of the alphabet?' and then ask the same of a 0.3B-parameter model, the tiny model might even get lost in the wording of the question.
Seed
Interestingly, OpenAI has deprecated the seed parameter. The rough idea of the parameter: computer random numbers are pseudo-random, so if you start with the same "seed" and generate N numbers twice, you get the same sequence both times.
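As a tiny Java illustration (not tied to any inference engine), two java.util.Random instances constructed with the same seed emit identical sequences:

import java.util.Random;

public class SeedDemo {
    public static void main(String[] args) {
        // Two generators started from the same seed produce the same numbers.
        Random a = new Random(42);
        Random b = new Random(42);
        for (int i = 0; i < 5; i++) {
            System.out.printf("%d vs %d%n", a.nextInt(100), b.nextInt(100));
        }
    }
}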
Quantization
As you may know, the GPU is doing tons of floating-point math. If you quantize, you are rounding the weights to lower precision, and if you are rounding you will get less precise results.
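Here is a toy sketch of that rounding, assuming a symmetric int8 scheme with weights in [-1, 1] (the scale and range are my assumptions, not any particular quantizer's):

public class QuantizationDemo {
    public static void main(String[] args) {
        // Toy int8 quantization: map a float to one of 255 signed buckets and back.
        float original = 0.123456f;
        float scale = 1.0f / 127.0f;                      // assumes weights in [-1, 1]
        byte quantized = (byte) Math.round(original / scale);
        float dequantized = quantized * scale;
        System.out.println("original:    " + original);
        System.out.println("dequantized: " + dequantized);
        System.out.println("error:       " + Math.abs(original - dequantized));
    }
}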
Floating point stuff
We all learn that you can't compare floats for equality and that they aren't stored exactly. With hardware optimizations, floats suffer even more, because floating-point arithmetic is not associative:
(a * b) * c != (a * c) * b
Nothing else in "the code" is random; it is simply that the floating-point math is not associative, so reordering the operations changes the result.
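A quick Java demonstration (addition shows the effect most clearly, but the same applies to products and to long reductions like dot products):

public class AssociativityDemo {
    public static void main(String[] args) {
        // Mathematically equal, but each floating-point operation rounds,
        // and the rounding depends on the order of evaluation.
        double a = 0.1, b = 0.2, c = 0.3;
        System.out.println((a + b) + c); // 0.6000000000000001
        System.out.println(a + (b + c)); // 0.6

        // At float precision the effect is even easier to trigger:
        float x = 1e8f, y = -1e8f, z = 1e-3f;
        System.out.println((x + y) + z); // 0.001
        System.out.println(x + (y + z)); // 0.0 -- z is absorbed by y
    }
}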
What to do?
I have my own inference engine, deliverance, and the code I forked from was using ThreadLocalRandom. I was playing with a small model, and for all the reasons I explained above, the results were all over the place.
Even though OpenAI is deprecating "seed", I brought it back. Here is the PR. With it, results are repeatable, which surely helps testing :) Even if the answer is a "hallucination", it is the same predictable one for a given seed!
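The actual change lives in that PR; as a rough sketch of the idea (the names here are hypothetical, not the project's real API), the sampler draws from a seeded generator when a seed is supplied and only falls back to ThreadLocalRandom when it isn't:

import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.random.RandomGenerator;

public class SamplerRandom {
    // Hypothetical helper: choose the generator the sampler draws from.
    // With a seed, every run replays the same draws; without one we fall
    // back to ThreadLocalRandom and accept non-repeatable output.
    static RandomGenerator forSeed(Long seed) {
        return (seed != null) ? new Random(seed) : ThreadLocalRandom.current();
    }

    public static void main(String[] args) {
        RandomGenerator rng = forSeed(42L);
        // Toy "sampling" loop: the same seed always yields the same indices.
        for (int i = 0; i < 5; i++) {
            System.out.print(rng.nextInt(32_000) + " "); // pretend vocab size
        }
        System.out.println();
    }
}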
Now, will Java Panama optimize the matrix operations the same way on another machine? Maybe not :)
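For instance, the preferred vector width reported by the Vector API differs between CPUs (128-, 256-, or 512-bit), which changes how a vectorized dot product is chunked and accumulated; a small probe (run with --add-modules jdk.incubator.vector):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesProbe {
    public static void main(String[] args) {
        // The "preferred" species depends on the CPU, so a vectorized matmul
        // may accumulate partial sums in a different order on another machine,
        // and floating-point non-associativity does the rest.
        VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
        System.out.println("Preferred species: " + species
                + " (" + species.length() + " floats per vector)");
    }
}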

