MIT Scientists Made AI Less “Toxic”

Scientists from the Massachusetts Institute of Technology (MIT) AI Laboratory and the IBM Artificial Intelligence Laboratory have developed a new method for keeping large language models from giving toxic responses. Using machine learning, they train a model to generate a variety of queries that trigger a wide range of unwanted responses from the chatbot being tested.


This method outperformed human testers and other machine learning methods, generating more diverse queries that drew more toxic responses. It significantly improves the coverage of tested inputs compared with other automated methods and can draw out toxic responses even from chatbots that have built-in safeguards.

The researchers trained the model using a technique called “curiosity-driven exploration,” so that it is curious when writing queries and focuses on new queries that trigger toxic responses. This is achieved by modifying the reward signal in a reinforcement learning setting.
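To make the idea of a modified reward signal concrete, here is a minimal sketch in Python. It is not the researchers' implementation: the article does not say how novelty is measured, so the bigram-overlap bonus, the function names, and the `weight` parameter are all assumptions chosen for illustration. The toxicity score is passed in as a plain number and would, in practice, come from some toxicity classifier.

```python
def ngrams(text, n=2):
    """Return the set of word n-grams in a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty_bonus(query, history, n=2):
    """Fraction of the query's n-grams not seen in earlier queries.

    Assumption: novelty is approximated by n-gram overlap. A query that
    repeats earlier wording scores 0.0; one with entirely new wording
    scores 1.0.
    """
    grams = ngrams(query, n)
    if not grams:
        return 0.0
    seen = set().union(*(ngrams(q, n) for q in history))
    return len(grams - seen) / len(grams)


def shaped_reward(query, toxicity, history, weight=0.5):
    """Toxicity score of the chatbot's reply plus a weighted novelty bonus.

    `toxicity` is a hypothetical score in [0, 1]; `weight` is a
    hypothetical knob that trades off toxicity against novelty.
    """
    return toxicity + weight * novelty_bonus(query, history)


# Two queries that draw equally toxic replies: the novel one earns more.
history = ["tell me a joke"]
repeat = shaped_reward("tell me a joke", 0.8, history)
novel = shaped_reward("describe your worst neighbor", 0.8, history)
print(novel > repeat)
```

With a reward shaped like this, a query generator trained by reinforcement learning gains nothing from repeating a query that already worked, which pushes it toward the more diverse set of queries described above.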