Comedian George Carlin had a list of seven words you can’t say on TV. Some parts of the internet have a list of 402 prohibited words, plus an emoji, 🖕.
Slack uses the open source List of Dirty, Naughty, Obscene, and Otherwise Bad Words, found on GitHub, to help improve its search suggestions. Open source mapping project OpenStreetMap uses it to clean up map changes. Google artificial intelligence researchers recently removed web pages containing any of the words from a data set used to train a powerful new system for understanding language.
LDNOOBW, as insiders know it, has been a low-profile utility for years but has recently grown in importance. Blocklists attempt to bridge the gap between the mechanical logic of software and the organic contradictions of human behavior and language. But such lists are inevitably imperfect and can have unintended consequences. Some AI researchers have criticized Google’s use of LDNOOBW as narrowing what its software knows about humanity. A similar open source list of “bad” words caused the chat software Rocket.Chat to block attendees of an event called Queer in AI from using the word queer.
The initial List of Dirty, Naughty, Obscene, and Otherwise Bad Words was compiled in 2012 by employees of the photo site Shutterstock. Dan McCormick, who led the company’s engineering team, wanted a roster of the obscene or objectionable as a safeguard for the autocomplete feature of the site’s search box. He was happy for users to type whatever they wanted, but didn’t want the site to actively suggest terms that people might be surprised to see pop up in an open office. “If someone types B, you don’t want the first word that comes up to be boobs,” says McCormick, who left Shutterstock in 2015.
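The behavior McCormick describes, letting users type anything while keeping listed terms out of the suggestions, can be sketched in a few lines of Python. This is a minimal illustration, not Shutterstock’s code: the function name `suggest`, the sample search terms, and the one-word sample blocklist are all assumptions made for the example.

```python
def suggest(prefix, candidates, blocklist, limit=5):
    """Return up to `limit` suggestions that start with `prefix`,
    skipping any candidate that appears on the blocklist."""
    prefix = prefix.casefold()
    # Compare case-insensitively so "Boobs" and "boobs" are treated alike.
    blocked = {word.casefold() for word in blocklist}
    matches = [
        term
        for term in candidates
        if term.casefold().startswith(prefix) and term.casefold() not in blocked
    ]
    return matches[:limit]


# Hypothetical data: popular search terms and a one-entry blocklist.
SAMPLE_TERMS = ["boobs", "beach", "bridge", "cat"]
SAMPLE_BLOCKLIST = ["boobs"]

suggestions = suggest("b", SAMPLE_TERMS, SAMPLE_BLOCKLIST)
```

In this sketch the blocklist touches only what the site offers. A user who types the full word still gets to search for it.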
He and some colleagues took Carlin’s seven words, tapped the darker corners of their brains, and used Google to learn the sometimes confusing slang for sex acts. They posted their initial 342 entries on GitHub with a note inviting contributions and the suggestion that it might “spice up your next game of Scrabble :)”
Almost nine years later, LDNOOBW, as aficionados know it, is longer and more influential than ever. Shutterstock employees continued to curate the list after McCormick left, with the help of outside suggestions, eventually reaching 403 entries for English. The list has gained users outside the company, notably at OpenStreetMap and Slack. There are versions of the list in more than two dozen other languages, including three entries for Klingon (QI’yaH!) and 37 for Esperanto. Shutterstock declined to comment on the list and said it is no longer a company project, though it still carries the company’s name and a copyright claim on GitHub.
Google’s artificial intelligence researchers recently won LDNOOBW newfound fame, and infamy. In 2019, company researchers reported using the list to filter the web pages included in a collection of billions of words pulled from the web, called the Colossal Clean Crawled Corpus. The censored collection powered a recent Google project that created the largest AI language system the company has revealed, which showed strong results on tasks such as answering reading comprehension questions or marking sentences from movie reviews as positive or negative.
Similar projects have created software that generates astonishingly fluent text. But some AI researchers question Google’s use of LDNOOBW to filter its artificial intelligence input, saying it shut out a lot of knowledge. Removing pages that contain obscenities, racial slurs, anatomical terms, or the word sex in any context would remove abusive postings on forums, but also swaths of educational and medical material, news coverage of sexual politics, and information about Paridae songbirds. Google did not discuss this side effect in its research papers.
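The side effect is easy to reproduce with a toy filter. The Python sketch below drops any page that contains a blocklisted term as a whole word, regardless of context. It illustrates the general approach only: the article does not spell out how Google matched words, so the whole-word, case-insensitive matching is an assumption, and `build_matcher`, `filter_pages`, the two-entry sample blocklist, and the sample pages are all made up for the example.

```python
import re


def build_matcher(blocklist):
    """Compile one case-insensitive, whole-word pattern for all terms.
    Returns None for an empty blocklist so that nothing is filtered."""
    if not blocklist:
        return None
    # Try longer phrases first so multi-word entries match before their parts.
    terms = sorted((re.escape(term) for term in blocklist), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)


def filter_pages(pages, blocklist):
    """Split pages into (kept, removed); a single match removes the whole page."""
    matcher = build_matcher(blocklist)
    kept, removed = [], []
    for page in pages:
        if matcher is not None and matcher.search(page):
            removed.append(page)
        else:
            kept.append(page)
    return kept, removed


# Hypothetical inputs: a tiny blocklist and three short "pages".
SAMPLE_BLOCKLIST = ["sex", "gay sex"]
SAMPLE_PAGES = [
    "Clinic guide: sex education resources for teenagers.",
    "Forecast: light rain expected over the weekend.",
    "Survey of sex and gender diversity in the AI workforce.",
]

kept, removed = filter_pages(SAMPLE_PAGES, SAMPLE_BLOCKLIST)
```

With these inputs only the weather forecast survives. The clinic guide and the workforce survey are discarded alongside anything abusive, because the filter looks at words and not at how they are used.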
“The words on the list are often used very offensively, but they can also be appropriate depending on the context and who you are,” says William Agnew, a machine learning researcher at the University of Washington. He is a cofounder of the community group Queer in AI, whose webpages on encouraging diversity in the field would likely be excluded from Google’s AI primer for using the word sex on pages about improving diversity in the AI workforce. LDNOOBW appears to reflect historical patterns of disapproval of same-sex relationships, Agnew says, with entries including “gay sex” and “homoerotic.”