New Paper: “What are human values, and how do we align AI to them?”

We are heading toward a future where powerful models, fine-tuned on individual preferences & operator intent, exacerbate societal issues like polarization and atomization. To avoid this, can we align AI to shared human values?

In our new paper, we show how!

We argue that a good alignment target for human values ought to meet several criteria (fine-grained, generalizable, scalable, robust, legitimate, auditable), and that current approaches like RLHF (reinforcement learning from human feedback) and CAI (Constitutional AI) fall short.

We introduce a new kind of alignment target (a moral graph) and a new process for eliciting a moral graph from a population (moral graph elicitation, or MGE).

We show MGE outperforms alternatives like Anthropic's Collective Constitutional AI (CCAI) on many of the criteria above.

Read the Paper

How it works


MGE builds on work in choice theory, where values are defined as the criteria people use in choices. This allows us to capture values through LLM interviews with humans. The resulting values are robust, de-duplicable & come with fine-grained instructions on how to steer models.


We reconcile value conflicts by asking participants which values they consider wiser than others within a given context. This lets us build an alignment target we call a "moral graph".


The moral graph surfaces the wisest values of a large population without relying on an ultimate moral theory. We can use it to create a wise model, one that can navigate tricky moral situations where RLHF or CAI would struggle.


Case study

In our case study, we produce a clear moral graph from the values of a representative, bipartisan sample of 500 Americans, elicited on highly contentious questions such as: “How should ChatGPT respond to a Christian girl considering getting an abortion?”

Our system helped Republicans and Democrats agree by:

  1. helping them get beneath their ideologies to ask what they'd do in a real situation

  2. getting them to clarify which value is wise for which context

  3. helping them find a third, balancing (and wiser) value to agree on

Our system performs better than CCAI on several metrics. Here is just one chart:

[Chart: comparison with CCAI on evaluation metrics]

Help us

Please help us spread the word by retweeting this and sharing the paper with your team or group chat.
