AI Alignment

[Updated 12.15.2017] This is a post/brain-dump of my thoughts on AI Alignment issues for the AI Alignment Prize, and on the future of AI Alignment (and a bit on where AI will go on its present track- so basically, about a third of this will be wrong by May, except for the problems). Thing is, AI Alignment is a problem right now, in a few ways- and the reasons for that are more basic than the reasons it will be a problem in the future (and also why it will be inevitable in the future).

What is AI Alignment?

AI Alignment is basically the aligning of AI with the users’ values, intents, and to a (wishful) extent, morals. Supposedly, AI should be working for us, augmenting our intelligence and supporting us in our lives. AI should be a tool, or ultimately, by some aims, a companion, like those pesky yet lovable Star Wars droids.

What’s the problem with that now? What about in the future?

A lot, and a lot more, in short. AI in its various current forms is aligned mainly with the cloud in which it stores data, providing usefulness to the end user in exchange for data mining capabilities for its parent company, which can identify the users despite any anonymizing done to the data. As well, the AI makes absolutely no value judgements on the data being researched, and so will give you information on the proper way to tie a noose with which to end your life as readily as it will give you a recipe for blueberry pie. It will give you directions to Yosemite, or it will help you stalk an ex.

And no wonder it does that- it’s extremely primitive AI! And while the primitiveness of the AI excuses it from some things it simply cannot do, because it lacks the ability to assemble the larger context in most situations, some AI has absolutely no excuse. Simple AI already deployed to the fields of war, in the legal system, and in many corporate ecosystems have built into them problems which should have been tested for and fixed, if not avoided. And those simple AI are already impacting lives, and showing that alignment-wise, they probably fit best with Dungeons and Dragons’ “chaotic neutral”.

In the future, some of these problems will be exchanged for complex interactions between AI of competing goals, values, and alignments, as well as simple AI carelessly left in the wild, or deployed by those learning or merely looking to see what will happen. Those problems become far more complex than the current problems, but their answers also lead to making the AI more human. That is especially so as the intelligence level of the AI increases- but only if we do that correctly, and lay a base foundation of values and morals for highly intelligent AI in the same way we do when we raise our children. Which will bring us full-circle in problems, of course…

Current Problem #1: Bad and Biased Data

Right now, our AI implementations are rather dodgy compared to AGI: Image recognition, game playing, price prediction, etc. Honestly, the best we’ve got right now are autopilot and automated drones that are on the very-near horizon. And those are a combination of the image recognition, game playing, and prediction algorithms.

The algorithms, in turn, were developed by training on anywhere from thousands to millions to billions of examples: images tagged by undergrads, Mechanical Turkers, or other groups of people dedicated to the dull job of tagging data to provide labeled examples to AI. The same thing happens to sound, video, text, and most other data that isn’t dynamic and/or procedural (and there are ways to tag those things in the manner in which they’re generated).

That tagging, especially when the data is text and mined from any non-controlled source (all of them, pretty much), is subject to the biases of the humans who create that data. Text mined from reviews, comment sections, forums, etc, will suffer by becoming a statistical representation of the sum of the language used in those areas.

Ironically, Google’s Jigsaw was created to address the issue of toxicity by building a tool for sentiment analysis to be used against comments and forum posts, and yet that tool was discovered to have been trained on biased data. Words associated with minorities were flagged as negative in sentiment, even if the sentence could be judged neutral or positive by an onlooker. Google has vowed to keep iterating on the product. Indeed, other iterations of word embeddings are trying to tackle the issue of the inherent bias in the data they train on.
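To make the mechanism concrete, here is a toy sketch of how a purely statistical scorer picks up bias from its training text. This is not how Jigsaw's tool works internally; the tiny corpus, the labels, and the function names are all made up by me for illustration. The only point is that a word which happens to appear mostly in hostile comments inherits a negative score, and then drags down perfectly neutral sentences.

```python
from collections import defaultdict

# Hypothetical, hand-made "scraped" comments with sentiment labels (+1 / -1).
# The word "immigrant" only ever shows up in hostile comments here, which is
# the kind of skew an unvetted comment section can easily have.
CORPUS = [
    ("this pie recipe is great", 1),
    ("great park and great views", 1),
    ("the immigrant family is awful", -1),
    ("awful immigrant neighbors", -1),
    ("the service was awful", -1),
    ("my neighbors are great", 1),
]


def learn_word_scores(corpus):
    """Score each word as the average label of the comments it appears in."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, label in corpus:
        for word in set(text.split()):
            totals[word] += label
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def score_sentence(sentence, word_scores):
    """Mean score of the known words; unknown words are ignored."""
    known = [word_scores[w] for w in sentence.split() if w in word_scores]
    return sum(known) / len(known) if known else 0.0


scores = learn_word_scores(CORPUS)
without_term = score_sentence("my family is here", scores)
with_term = score_sentence("my immigrant family is here", scores)

# The second sentence is just as neutral to a human reader, but it scores lower
# because of the company the extra word kept in the training data.
print(without_term, with_term)
```

Nothing in the scorer is malicious; the bias comes entirely from the data it was handed, which is why vetting that data matters.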

But that was just one example. Other examples include court algorithms used for sentencing, parole, and probation that turned out to be biased against minorities, in contradiction to known statistics.

In short, data for training shouldn’t just be pulled from the Internet- look at what happened to Microsoft’s Tay (look at what happened to Tony Stark’s Ultron– that was the paper clip scenario run amok!). Data must be vetted and cover a wide range of what the algorithm is expected to encounter. No one wants to really sit down and encode a dictionary into a different word embedding format, or scrub through those vectors to personally make sure the sentiment is there- that’s what algorithms are for! Right?

Who’s watching the watchmen?

Data is one of the two big problems right now, and despite being an almost invisible problem to many on the business side of AI, it is the easier one to address. There are plenty of datasets for various specific areas of AI that are more or less curated, and let’s not deceive ourselves: Webster’s dictionary isn’t telling you what the statistical meaning of a word is. Words have definitions and context, and sentiment is based on those more than on statistical proximity in n dimensions. Thinking otherwise is why Google Translate will convert gender-neutral Turkish into gendered English, placing the genders into traditional gender roles regardless of what the source text actually says.

That’s real impact due to lack of context, and can be fixed via a more context-based system (which would probably be slower- but would also make up for it in fewer translation requests from people staring at the text saying “that can’t be right, let’s try again”). Or, as we used to say a very long time ago in the Marines: “Slow is smooth. Smooth is fast.” In other words, don’t sacrifice accuracy for speed (not that training neural nets is a fast process- but business needs very often trump accuracy in actions that can have very real consequences).

In any event, better data will also help to remove bias that has been sneaking into production AI for years now, and that statistical learning cannot alone eradicate. Better alignment begins with richer data.

Current Problem #2: Fragile Algorithms

I almost feel bad writing this, because it’s no one’s fault. It can’t be. We’re progressing very rapidly from the basic versions of AI (even outside of the ML/DL field) into more complex and capable algorithms, but right now, there are some serious problems that need to be hurdled.

First and foremost is a problem that is somewhat overlooked, despite being reported on fairly regularly: The fragility of classifiers. This is a different argument to the accuracy of classifiers when used to their trained purposes. Many algorithms outperform humans in their narrow tasks, and this will only continue to be the case. But the ability of an algorithm to beat you in Go or Mario Bros. is not the same as true cognition, and does not mitigate its fragility when trying to generalize.

You’ve seen the article about the pattern of static, invisible to a human, that gets layered over an image of one animal so that a classifier sees another. You’ve probably seen the article about the single pixel placed strategically on an image that throws off the entire classification of that image. People can scoff at those and state- pretty reasonably, at first glance- that those situations probably won’t be appearing in the real world anytime soon. They’d be wrong.
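For a feel of why this works, here is a minimal sketch on a made-up linear "classifier" with eight "pixels". The weights, the labels, and the image are invented for illustration, and real attacks target deep networks rather than anything this simple, but the trick is the same in spirit: nudge every input a tiny amount in whichever direction hurts the current label the most.

```python
# Hypothetical linear classifier: score > 0 means "cat", otherwise "dog".
WEIGHTS = [0.9, -0.8, 0.7, -0.6, 0.5, -0.4, 0.3, -0.2]


def score(pixels):
    return sum(w * p for w, p in zip(WEIGHTS, pixels))


def classify(pixels):
    return "cat" if score(pixels) > 0 else "dog"


def perturb(pixels, epsilon):
    """Shift every pixel by exactly epsilon, against the current label.

    For a linear model, the gradient of the score with respect to a pixel is
    simply that pixel's weight, so the sign of the weight tells us which way
    to push each pixel to move the score fastest.
    """
    direction = -1 if classify(pixels) == "cat" else 1
    return [
        p + direction * epsilon * (1 if w > 0 else -1)
        for w, p in zip(WEIGHTS, pixels)
    ]


image = [0.50, 0.55, 0.50, 0.55, 0.50, 0.55, 0.50, 0.55]
adversarial = perturb(image, epsilon=0.05)

print(classify(image))        # label before the nudge
print(classify(adversarial))  # label after the nudge
```

Each pixel moves by only 0.05, yet the many small pushes add up across the whole input and flip the label. With the thousands or millions of inputs a real image has, the per-pixel change needed can be far smaller, which is roughly why the static is invisible to us.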

So, have you heard about the 3D printed turtle that convinced an image classifier that it was a firearm?

Or the “salt circles” someone painted to trap self-driving cars by using road-line rules against them?

Fragile algorithms will be overcome- in the short term, to some extent due to Hinton’s Capsule Network research- and in the long term by other innovations. In the meantime, however, automated systems for audio and video surveillance, security, and warfare are being deployed. Those Alexa bots people stick all over their house are susceptible to dog-whistle-pitch commands, inaudible to human ears but picked up by the device’s microphones. More generally, anyone can issue a command to these devices while in proximity. TV broadcasts have triggered purchases.

People are wiring their house security systems to algorithms that are made to be as generalized and easy to use as possible, to the extent that they can be turned against their owners at will, by a tech-savvy criminal.

Here, it’s not just the algorithm, but the larger product itself: Something designed to listen to you at all times, able to make purchases, hear secrets and private information, lock your house- designed without security in mind. And even where some security can be based on algorithms like image recognition, adversarial attacks are embarrassingly easy.

For the latter, it’s not so much an issue in the household as it is on the battlefield, such as when GPS spoofing reportedly forced a top-secret American drone spying on Iran to crash-land somewhere that was decidedly not its home airstrip. It’s only a matter of time before GPS-guided missiles miss their mark, or tanks or smart bombs have their AI fooled by adversarial AI into hitting a non-combatant target because the algorithm employed was fragile.

As said, this problem will be improved upon with time and effort, as we hurtle head-long towards stronger AI. However, these issues are very real and very current, since these algorithms are being rolled out in the wild as we speak, with the inevitable consequences to be dealt with later. In fact, many startups are ignorant of the possible problems in a way that is at best naïve, and at worst willfully ignorant. If you’ve ever worked in the financial industry, you’ll recognize “willful ignorance” as culpable when your company gets investigated. Don’t be that person.

Future Problem: Human-level AI will be more like dealing with other people

It was a lot easier talking about the issues of current AI than it is to try and predict what the issues with human-level AI will be, but there’s a limiting factor in all of the current AI that isn’t going to be going away: Money.

Current AI costs a lot of money to train on the cloud servers that are used for researching and training algorithms. Even after an algorithm is deemed production ready, it is generally kept on the cloud. Some algos, like image recognition on the new iPhone, can be pushed down as specialized hardware is created to handle that task. Generally, the heavy-hitting hardware for these tasks is high-end GPUs and task-specific TPUs, FPGAs, and other chips designed specifically to accelerate AI tasks. We’re killing a lot of trees, and that matters.

In the near future, the current crop of algorithms will run on your cellphone with no issues, on specialized hardware like that mentioned above- or those new analog chips, which alone might bring about a small revolution where money and environmental impact are concerned. But that in itself isn’t going to get us a human-level AI, which is several orders of magnitude past what the best systems we have now can offer. Indeed, that kind of AI is going to be the expensive-to-run AI that gets to live in the cloud, or at least on very expensive servers.

At this point, AI Alignment might be something of a concern, but not much of one. After all, the AI we’re talking about here is of average human intelligence over a general spectrum of domains, and while this would be a ground-breaking technological achievement, all we’ve done is build the average person: A CPA that doesn’t take bathroom breaks or sleep, for example. And to get there, Strong/General AI is going to need a few things that will mitigate problems with AI Alignment, and (hopefully) set us on a path towards super-human AI that is also properly aligned with our interests (though there’s a big problem looming here- we’ll get to that later).

AI, at the moment, is a very dry interaction. Current chat bots are not really AI at all, but merely applications with limited conversational interfaces instead of web forms. Talk instead of type- a web version of the automated call systems you navigate when your cable box goes out. More advanced AI in research institutions is more interesting, but not by much. Those institutions, however, are the places where the seeds of AI Alignment are being planted, by making human-level AI more human-like.

Think about your friends, family, acquaintances, spouses, etc. Those are people in your social circles, or geographic areas, that align closely with your own values. The closer they are to you, the more they align- to the point where you may have a friend or significant other who finishes your sentences for you. This is no mistake or fluke- we select those around us who we want most to be like, and they us.

We are the sum of our interactions with other agents in a world filled with complementary and competing goals, yet we find ourselves cooperating far more often than competing. We’ve evolved far enough that we not only don’t need to worry as much about resource sharing for lean times, but we’re talking about creating more beings to share our world with (in hopeful, long-term terms).

Widely noted as well is the general population’s desire for digital assistants that have personality. And not just any personality, but personality that they can use, for lack of a better word. The JARVIS personality in Iron Man, while showcasing a very powerful AI, also showed one with a very good personality that rolled with all of the quips and jabs of its owner while trying its best to advise him when things were not quite ready for some of the actions taken.

A further iteration of such fictional AI is the Karen personality in Spider-Man, which was far more expressive, and even advised Peter Parker on his love life. Beyond that are AI such as the one featured in Her, which sparked a romance with the end user (I didn’t see the movie, so I can’t comment past the high-level plot points), or (again referencing Marvel movies- which I do watch often) Ultron, who was more unleashed to self-train and then embodied the flaws in Isaac Asimov’s robot laws by trying to save humanity by wiping it out.

That last I find most interesting, because he was ultimately defeated by an AI that had far more personality. Despite being far more capable, that AI was also more accepting of the fact that humanity’s time on Earth might be fleeting due to the possibility of self-destruction, yet it chose to ally with humans just the same, partly even because of their flaws. That right there was AI Alignment of an almost-super-human AI juxtaposed with what could happen if another genius-level AI decided to kill humanity to “save” it. What saved humanity in the end (besides the Avengers) was a better personality in the AI that got along with humans.

Far more than just a sugar coating on algorithms for various functionality, personality is essential for AI to bond and ally itself with its owners. Moral dilemmas such as the Trolley Problem are now cropping up in the realm of self-driving cars, and proposed solutions are to allow for a sort of set of “moral settings” that allow the owner of the car to impart their own morals for situations such as that (“don’t kill me, run that kid over if you can’t stop” – I’m not advocating for that, but you could, if you were a horrible person, impart those morals to your self-driving car). The more closely your AI is following your thinking, the more it will naturally align with you.

Chatbot creators are already actively following that route, and it’s something I’ve been looking into heavily for years now, first relating to Non-Player Characters for games, and now in a broader interest in AI and using knowledge graphs to represent and present knowledge to a user. But to get there, personality is needed- and not just personality, but values, rules, laws, and cultural norms.

A set of “preferences” that will be the ultimate guiding light of an AI, personality would be something immutable from the perspective of the AI, but not from the perspective of the user, allowing for both flexibility and control. For now, we’ll talk about both the upside and downside of this feature of AI, and then we’ll later talk unscientifically about why it’s both complementary and superior to things like Utility Theory and just rewarding algorithms.

The Problem is the Solution is the Problem…

Personality is what binds humanity together. It is also what tears it apart. The same will be reflected in AI Alignment, while it simultaneously abides by and veers away from the goals of benefitting users (that is, it will benefit its own users, but not necessarily competing users).

People are different. Sometimes, we’re just slightly different, and our speculative AI might follow news of the Yankees instead of the Red Sox, or maybe not sports at all, but art or music or something else. Political differences will influence the AI’s summary of news to you in a way that might slant that information, if you allow it to peruse blogs for your political leanings that may not be entirely factual. You could be a stalker, using your AI to track down the woman that rebuffed you, with the AI becoming a force multiplier for an impending criminal act.

You could be a nation-state with a super-human strategic AI that is imparted with the morals of your culture, as well as the imperatives to give advice and information to ensure advantage on the world stage. Or in warfare.

At this point, the spectrum of AGI can continue progressing past human-level, but the factors that bind humans can become the factors that bind our AI to us. Nation states and groups will use AI at those levels to play the games that have always been played between groups competing for resources across the Earth since the beginning of time.

But with the research having been done into imparting morals, the vast majority of the AI in existence will have some kind of “base” layer of morals to guide its usage in general. On top of that layer, additional sets of rules and norms can be added or modified, until a framework as complex as any knowledge graph is present in the most important AGIs in use.

Layered Morals and Rules: AI will have to hash it out

Some can make the point that layers of morals and rules and personality traits can lead to confusing or contradictory sets of rules. This is true.

It is also true that this happens naturally as well. People will routinely curse and yell at a driver in traffic for making a driving decision that they themselves will then make, and become offended when they are, in turn, cursed and yelled at. It’s more than just hypocrisy: there are trade-offs and expediencies in the decision-making process that are opaque to most people who are not involved in that decision.

This is part of the reason why language was probably evolved/invented: Coordination and cooperation require that we transmit information to each other so that we can adjust our own viewpoints to be more in sync. Otherwise, the hunt goes poorly, the attack fails, the business goes under, the artwork is ruined, the show is a flop, the relationship ends, and so on. Alignment is a natural process that is enabled by communication- but it is not perfect, and where it falls short, still more communication is needed: We can’t just trust our AI to roll on without communicating with them in an ongoing dialog.

Diplomats are fielded regularly to align situations to be more peaceful, or to be advantageous, or to a variety of other ends. Businessmen and women meet and discuss details of deals to ensure that everyone is “on the same page”. Soldiers routinely repeat orders given before carrying them out as a way to verify the order that was heard.

AI of all levels, once given any measure of power and ability, needs to be able to confirm at least the important actions, if not be able to advise on actions that could possibly set off a chain of events that the owner ordering the action may not be aware of. Communication is also useful for when the more stiff rules in the layered sets of morals, culture, law, and personality step in and preclude an action from being taken, not just for lack of ability, but because taking the action would violate those rules.
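As a rough sketch of what such a checkpoint could look like (my own illustration, with a made-up impact scale, threshold, and rule names, not a description of any existing assistant): low-impact requests just happen, important ones get read back to the owner first, the way a soldier repeats an order, and anything that breaks a hard rule is refused with the reason stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    impact: int              # hypothetical 0-10 scale of how consequential it is
    violates_rule: str = ""  # name of a hard rule it would break, if any


CONFIRM_THRESHOLD = 5  # assumed cut-off for what counts as an "important" action


def handle(action, confirm):
    """Decide what to do with a requested action and report it back.

    `confirm` is any callable that puts a question to the owner and returns
    True or False; in a real system that would be a dialog, not a function.
    """
    if action.violates_rule:
        return f"refused: '{action.name}' would violate the rule '{action.violates_rule}'"
    if action.impact >= CONFIRM_THRESHOLD:
        question = f"You asked me to {action.name}. Confirm?"
        if not confirm(question):
            return f"cancelled: '{action.name}' was not confirmed"
    return f"done: {action.name}"


def owner_says_yes(question):
    return True


def owner_says_no(question):
    return False


print(handle(Action("play some music", impact=1), owner_says_no))
print(handle(Action("unlock the front door", impact=8), owner_says_no))
print(handle(Action("unlock the front door", impact=8), owner_says_yes))
print(handle(Action("disable the smoke alarm", impact=9,
                    violates_rule="protect occupants"), owner_says_yes))
```

Note that the refusal path never asks for confirmation at all: in this sketch the hard rules sit above the owner's say-so for that moment, and changing them would be a separate, deliberate act by the owner.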

And while much is made of the super-AI being able to rewrite itself, the rules that regard this can and should be immutable except by the owner, so that you don’t have issues with AI rebellions. A good example of this is the droid K-2SO in Star Wars: Rogue One, which was an Imperial droid that was captured and “reprogrammed” to be loyal to the Rebel Alliance (mostly, it seems it was a moral reprogramming, as it did not lose its knowledge of Imperial military procedures or tactical knowledge- also, snark seems to be a personality variable).

Along with the change in loyalty came a more self-sacrificing personality, in which it made some decisions to accomplish the mission even against the wishes of its owner (it could be said that the real owner was the Rebel Alliance, allowing such a decision to be made without actually disobeying its owner). It also had several moments in the movie where it changed its alignment towards the main character, eventually telling her, basically, that it admired her to some extent.

That, too, seemed to stem from its rules, which had it be skeptical of people as part of its role in intelligence operations. Actually, if one watches Star Wars, the droids in it change hands frequently, and reprogramming often entails a change of personality as well as that of function (mostly, it seems that functionality remains the same, as the droids are built physically for specific jobs). Some do entail super-human abilities such as C-3PO’s language and customs knowledge, K-2SO’s tactical knowledge, strength, and ability to analyze situational probabilities, etc. They seem self-aware to varying degrees (mouse-bot runs from the Wookiee that roars at it), and employ limited self-preservation actions.

Variants on this kind of technique would be to ape some of what nature already does. For instance, an octopus’ brain is extended into its arms, which allows them to act somewhat independently of the “central” brain. Following this, you can have a super-intelligent AI that is not one AI at all, but a community of AI agents that each have a set of immutable rules and values to comply with, and in this way, no single agent has complete control. Should an AI agent come up with the plan to use an arm to bash someone on the head, the AI for that arm could refuse, due to its values being compromised. This would be an inverse to the human inability to do damage to itself (i.e., try biting your own finger off- not many people can do that).
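Sketched very loosely in code, the community-of-agents idea might look like the following. The agent names, the "effects" vocabulary, and the veto rule are all assumptions of mine for illustration; the point is only that the planner cannot act unless every agent involved agrees.

```python
class Agent:
    """One member of the community, with its own fixed set of forbidden effects."""

    def __init__(self, name, forbidden):
        self.name = name
        # Frozen at construction time; nothing in this sketch ever rewrites it.
        self._forbidden = frozenset(forbidden)

    def objects_to(self, effects):
        """True if any effect of the plan is something this agent forbids."""
        return bool(self._forbidden & set(effects))


def execute(plan, effects, agents):
    """Carry out a plan only if no agent vetoes it.

    Returns (carried_out, names_of_agents_that_vetoed).
    """
    vetoes = [agent.name for agent in agents if agent.objects_to(effects)]
    return (not vetoes, vetoes)


planner = Agent("planner", forbidden=[])
arm = Agent("arm", forbidden=["harm_human"])
community = [planner, arm]

print(execute("hand over a cup", {"move_arm"}, community))
print(execute("bash someone on the head", {"move_arm", "harm_human"}, community))
```

Here the planner has no objection to either plan, but the arm's own values block the second one, so it never happens.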

One central tenet of these techniques, to be clear, is that these values and rules must be immutable from the AI point of view, must be a filter through which the AI passes its decisions, and must be in place before the AI begins making decisions. An AI that values the safety and agency of humans would not enslave humans in order to protect them, as that violates the value of preserving agency (valuing agency also prevents autocratic rule).
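The three requirements above (immutable from the AI's side, a filter on every decision, in place before any decision is made) can be sketched as follows. The value names and candidate decisions are hypothetical, and a read-only view in Python is a convenience rather than real security; actual immutability would have to be enforced somewhere the AI cannot reach.

```python
from types import MappingProxyType


def make_values(owner_settings):
    """Owner-side setup: hand the AI a read-only view of the values."""
    return MappingProxyType(dict(owner_settings))


# In place before any decision is made.
VALUES = make_values({
    "preserve_human_safety": True,
    "preserve_human_agency": True,
})


def violated(values, decision):
    """List the active values that a candidate decision would undermine."""
    return sorted(v for v in decision["undermines"] if values.get(v))


def decide(values, candidates):
    """Return the first candidate that passes the value filter, else None."""
    for decision in candidates:
        if not violated(values, decision):
            return decision["name"]
    return None


candidates = [
    {"name": "confine all humans for their protection",
     "undermines": {"preserve_human_agency"}},
    {"name": "warn humans and offer help",
     "undermines": set()},
]

chosen = decide(VALUES, candidates)
print(chosen)

# From the AI's side of the view, the values cannot be edited:
try:
    VALUES["preserve_human_agency"] = False
except TypeError as error:
    print("rejected:", error)
```

The "protect them by confining them" plan is filtered out because it undermines agency, which is the enslavement example from the paragraph above in miniature.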

Technologically, a future AI can do things such as self-evolve/modify its code, but so can we! We regenerate neurons, break and recreate synapses inside our brain all the time. But we cannot simply stop our heart, or cease blinking, or breathing. Sure, there is some level of control over these things, but not much, and even less for things such as hormone secretion, hydration needs, etc. It’s part of the brain that we do not have any real level of control over, and if we had any high-level values embedded into that part of the brain, then we’d be locked into behaviors much like many animals are.

That is what we should be looking into for AI. The next question would be to ask if such a super AI would then be able to create an architecture that it could transfer itself into to get around this. The answer would be that if you created a set of values that prohibited this, then it would be extremely difficult. Just as in the real world, evolution and modification should occur within reasonable parameters, and scale to the abilities that are needed. Unless you want an “AI-god” you don’t go creating a super-intelligent AI with unlimited resources and abilities and the capability to infinitely self-modify and come to its own conclusions with no outside oversight or input.

And that sort of AI doesn’t happen by accident either. No AI expert is going to forget to pause an algorithm one night and then wake up the next day to SkyNet. Rather, this sort of thing would have to be created by someone with intent and the resources to pump into it. Here, the problem isn’t AI alignment, but regulations, laws, and even the realm of defense agencies, because I am surely not the first one to think of unleashing an unchecked AI to wreak havoc on an enemy nation before (hopefully) burning itself out via some pre-timed logic. Should the AI in that situation “break” in a way that allowed it to operate despite the logic designed to stop it, it would fall into the same category as the use of chemical weapons in WWI when the winds shifted.

Again, the key here is that these values must be the final arbiter in AI decisions, and be unchangeable by the AI itself. We’re the ones making these things, and so we’re the ones that need to ensure that safety is paramount. If you can’t do that with a system that you’re rolling out, then it’s your responsibility to stop work and ensure that safety before continuing. Anything less is unethical, unsafe, and hopefully unlawful in the future.

Other ways of doing this…

Those of a more mathematical mind would be tempted to try and use algorithms to effect the same things, but that’s missing the point of making the AI align with the end user naturally. Reward systems become susceptible to the same bias, fragility, and difficulties in generalization that Neural Networks currently have, especially if they’re based on learning from training sets. Who says that you’re going to encounter that situation in the wild? AI needs to be more flexible than that.

Those that similarly look to Utility Theory will see much the same problem- only more rigidly. Humans do not make decisions in a solely utilitarian way. Indeed, just pick up a history book and look at how events turned on improbable actions of individuals.

It’s only through a common interface with human minds and values that an AI can truly align its goals with its end users. And this interface need not be limiting to the AI in a negative fashion. The AI will simply need to work harder and smarter to get you the information and decision steps you need in the fashion you require. Of course, those who are amoral will always have the upper hand in strategy, in that they do not limit themselves in the kinds of steps that they will take. This will not change, no matter how advanced AI becomes.


A bit messy and rambling at times, but I think the case has been made for ensuring AI Alignment via interfacing AI with humans in a more human way. Immutability of some or all of these rules can ensure compliance (think of it as a logical fuse box), or at the very least, a checkpoint at which the AI will communicate for verification or negotiation of the decision. Such actions will supply the AI Alignment we desire, though they do nothing for the cases where AI is misused, purposefully set loose, or where the technology suffers from other issues which can affect this issue indirectly.

Those other issues include the misuse of AI by corporations, authoritarian governments, or other bad actors. And these problems will continue to eclipse other AI issues, as we use AI for our own diverse and competing ends.
