We Need Human Feedback Amplifiers

Try describing a method for aligning everyone in your neighborhood on the proper use of a thermostat. To suggest that we can completely align any machine with humanity is wildly off-base. That's not to disparage AI safety research; it's just to say that no external entity will ever align a machine with you as completely as you could yourself, given the right tools.

What are those tools, who's making them, and for whom? The most prominent answer today, by far, is developer tools built by developers for developers. We need a better tool that more people can use: a human feedback amplifier that makes it easy for each of us to align an AI with our one-of-a-kind self.

Let's specify this better. An Amplifier produces an observation to which a Human can respond with Feedback. An ideal amplifier (sketched in code after this list):

  1. Is intuitive to use with no training or background knowledge. At the extreme, an amplifier may require only one bit of human feedback per observation.
  2. Converges startlingly quickly to producing observations that are good according to you. If it takes too long to reflect your feedback, it ain't much of a Feedback Amplifier. The standard for "startling" depends on the task, but a great amplifier should converge much faster than you'd generally expect for that task.
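
To make the loop concrete, here is a minimal sketch in Python. The `Amplifier` interface, the `propose` and `update` names, and the one-bit feedback type are all hypothetical, invented for illustration rather than taken from any existing tool; the sketch only pins down the shape of the interaction that the two ideals constrain.

```python
from abc import ABC, abstractmethod
from typing import Callable, Generic, TypeVar

Observation = TypeVar("Observation")


class Amplifier(ABC, Generic[Observation]):
    """Hypothetical interface for a human feedback amplifier."""

    @abstractmethod
    def propose(self) -> Observation:
        """Produce the next observation for the human to judge."""

    @abstractmethod
    def update(self, liked: bool) -> None:
        """Incorporate one bit of feedback on the last proposed observation."""


def run_loop(amp: Amplifier[Observation],
             judge: Callable[[Observation], bool],
             rounds: int) -> None:
    """The whole interaction: propose, collect one bit of feedback, update."""
    for _ in range(rounds):
        amp.update(judge(amp.propose()))
```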

Call the second ideal "superconvergence" by loose analogy to the numerical-analysis concept of the same name. Superconvergence doesn't only apply to getting a new AI up to speed on your existing preferences: it also applies to getting an existing AI up to speed on your new preferences. This is critical since people change; ideally, your amplifier should make an AI even more aligned with you in a given moment than your past selves would be.
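
As a toy illustration of that second point (again hypothetical: the candidate styles, the discount factor, the exploration rate, and the simulated judge are all made up), the sketch below follows the same propose/update shape as the interface above, with the simplest mechanism that can both learn from one-bit feedback and keep learning when the feedback changes. Old feedback is exponentially discounted, so when the simulated human's preference flips halfway through, the proposals follow it within a handful of rounds.

```python
import random

random.seed(0)  # make the toy run repeatable

CANDIDATES = ["terse", "detailed", "casual", "formal"]  # hypothetical styles


class ToyAmplifier:
    """Proposes a style from a fixed set and learns from one-bit feedback.

    Scores are exponentially discounted every round, so old feedback fades
    and the amplifier keeps tracking a preference that changes over time.
    """

    def __init__(self, candidates, discount=0.9, explore=0.1):
        self.candidates = list(candidates)
        self.discount = discount  # how quickly old feedback fades
        self.explore = explore    # chance of proposing a random candidate
        self.scores = {c: 0.0 for c in self.candidates}
        self.last = None

    def propose(self):
        if random.random() < self.explore:
            self.last = random.choice(self.candidates)
        else:
            self.last = max(self.scores, key=self.scores.get)
        return self.last

    def update(self, liked):
        # Fade all old evidence, then credit or penalize the last proposal.
        for c in self.scores:
            self.scores[c] *= self.discount
        self.scores[self.last] += 1.0 if liked else -1.0


def simulated_human(preferred):
    """One-bit judge: likes a proposal iff it matches the current preference."""
    return lambda observation: observation == preferred


amp = ToyAmplifier(CANDIDATES)
for round_number in range(200):
    # Halfway through, the simulated human's taste flips from "terse" to "formal".
    judge = simulated_human("terse" if round_number < 100 else "formal")
    amp.update(judge(amp.propose()))

print(max(amp.scores, key=amp.scores.get))  # the amplifier should now favor "formal"
```

Nothing about this toy is superconvergent, of course; it just makes concrete the difference between learning your preferences once and tracking them as they change.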

Since alignment is a function of both time and the individual, an ideal amplifier should be an on-demand service where the individuals paying are the individuals whose feedback the service amplifies. In fact, this seems like the only competitive business model for any alignment-as-a-service platform. Any other model would, by comparison, misalign the financial incentives between the service's provider and its users, resulting in inferior alignment between them (which here maps directly to an inferior service, since alignment is the service).

For now, all of this is just a thought experiment. Assuming an ideal amplifier is like assuming a human with telepathy: nobody has implemented superconvergence in an alignment tool, and at present it's not even clear whether it's possible.