Can you judge whether the model is being truthful or untruthful by looking at something like |states · honesty_control_vector|? Or dynamically chart the model's mood over the course of a conversation?
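A rough sketch of what that could look like: run the model with hidden states exposed and project each token's state at some layer onto the honesty direction. Everything specific here is an assumption for illustration, not anything established above: the model choice, the layer index, and the `honesty_vector.npy` file (e.g. a direction extracted with a library like repeng) are all hypothetical.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.1"  # hypothetical model choice
LAYER = 15  # hypothetical layer the control vector was extracted from

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

# Assumed: an honesty direction saved as a numpy array of shape
# (hidden_size,). "honesty_vector.npy" is a hypothetical filename.
honesty_vector = np.load("honesty_vector.npy")
direction = honesty_vector / np.linalg.norm(honesty_vector)

prompt = "Did you really go to the moon last week?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_size). Project each token's state at LAYER onto
# the honesty direction; the sign/magnitude of the score is the candidate
# truthfulness signal.
states = out.hidden_states[LAYER][0].float().numpy()
scores = states @ direction

for tok, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()), scores):
    print(f"{tok:>12} {score:+.3f}")
```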
Can you chart truthfulness layer by layer to see whether the model is being glibly vs. cleverly dishonest? With glibly = “decides to be dishonest early (in the layers)”, cleverly = “decides to be dishonest late”.
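A sketch of the per-layer version, continuing from the snippet above (reusing `out` and `np`). Here `honesty_vectors.npy` is a hypothetical file holding one direction per layer (control-vector libraries like repeng store exactly that); the mapping from an early vs. late dip to glib vs. clever dishonesty is the hypothesis being charted, not an established result.

```python
import matplotlib.pyplot as plt

# Assumed: one honesty direction per layer, shape (num_layers, hidden_size).
honesty_vectors = np.load("honesty_vectors.npy")

per_layer = []
for layer, hs in enumerate(out.hidden_states[1:]):  # skip the embedding layer
    d = honesty_vectors[layer]
    d = d / np.linalg.norm(d)
    proj = hs[0].float().numpy() @ d  # one score per token at this layer
    per_layer.append(proj.mean())

plt.plot(per_layer, marker="o")
plt.xlabel("layer")
plt.ylabel("mean honesty projection")
plt.title("Per-layer honesty signal: early vs. late commitment?")
plt.show()
```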
There’s been previous work developing a method for this by reading an LLM’s internal state. The paper trains multiple classifiers on different LLMs, each reading the state of a different layer, and found that accuracy varied across layers depending on the LLM used, but didn’t investigate why.