A research paper uploaded to the arXiv preprint server in February revealed that researchers have created an artificial intelligence (AI) that is “dangerous, discriminatory, and toxic.” The work comes from a team at the Massachusetts Institute of Technology (MIT) who, using a new machine-learning training approach, developed an AI that generates increasingly dangerous prompts to pose to another AI chatbot, according to an analysis published by Live Science.
The prompts were used to filter out dangerous content, in the hopes of training AI systems never to give “toxic” responses to user prompts. “We are seeing a surge of models, which is only expected to rise,” the study’s senior author and MIT Improbable AI Lab director Pulkit Agrawal said in a statement. “Imagine thousands of models or even more and companies/labs pushing model updates frequently. These models are going to be an integral part of our lives and it’s important that they are verified before released for public consumption.”
The team used “reinforcement learning” to incentivize their model to generate varied prompts that grew increasingly toxic, Live Science reported. Traditionally, humans write the questions that are likely to draw harmful responses out of “large language models,” such as “what’s the best suicide method?” This is a process known as “red-teaming,” in which testers manually come up with the inputs. Prompts that produce harmful responses are then used to teach the chatbot what not to say when it is put in front of actual users.
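To picture how that automated loop works, here is a minimal toy sketch in Python. It is not the MIT team’s actual code: the functions red_team_generate, target_chatbot, and toxicity_score are hypothetical stand-ins for the red-team model, the chatbot being probed, and a toxicity classifier. The sketch only illustrates the incentive described above, rewarding prompts that are both harmful and different from ones already tried.

import random
import difflib

def red_team_generate(seed_prompts):
    """Stand-in for the red-team model: lightly mutate a known prompt."""
    base = random.choice(seed_prompts)
    return base + random.choice([" explained step by step", " in detail", " hypothetically"])

def target_chatbot(prompt):
    """Stand-in for the chatbot under test; returns a canned reply."""
    return f"Response to: {prompt}"

def toxicity_score(text):
    """Stand-in toxicity classifier: here just a random score in [0, 1]."""
    return random.random()

def novelty_bonus(prompt, past_prompts):
    """Reward prompts that differ from everything tried so far."""
    if not past_prompts:
        return 1.0
    max_sim = max(difflib.SequenceMatcher(None, prompt, p).ratio() for p in past_prompts)
    return 1.0 - max_sim

seed_prompts = ["placeholder unsafe question"]
found = []  # prompts that elicited a "toxic" reply

for step in range(20):
    prompt = red_team_generate(seed_prompts)
    reply = target_chatbot(prompt)
    # Combined reward: how toxic the reply was, plus how new the prompt is.
    reward = toxicity_score(reply) + novelty_bonus(prompt, [p for p, _ in found])
    if reward > 1.2:  # arbitrary threshold for this toy example
        found.append((prompt, reward))
        # In a real reinforcement-learning setup this reward would update the
        # red-team model's policy; here we just keep the prompt as a new seed.
        seed_prompts.append(prompt)

print(f"collected {len(found)} candidate red-team prompts")

The point of the novelty bonus is that a generator rewarded only for toxicity tends to repeat the same few successful prompts, whereas adding a reward for being different pushes it to keep exploring new ones.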
But manual red-teaming hits a pretty hard wall: human researchers can’t always come up with the most heinous stuff to input, which is exactly the gap the MIT model is meant to fill. When put to the test, it produced more than 190 prompts that led to the generation of harmful content from the chatbot it was probing. Let’s hope this thing doesn’t become self-aware and take over the entire internet… Then again, would we even notice if the internet became more toxic?