Deleting unethical data sets is not enough


The researchers’ analysis also shows that Labeled Faces in the Wild (LFW), a data set launched in 2007 and the first to use face images scraped from the Internet, has shifted in how it is used over nearly 15 years. Although it began as a resource for evaluating facial recognition models intended only for research, it is now used almost exclusively to evaluate systems meant for the real world. This is despite a warning label on the data set’s website that cautions against such use.

More recently, the data set was repurposed in a derivative called SMFRD, which added a face mask to each image to advance facial recognition during the pandemic. The authors point out that this may raise new ethical challenges. Privacy advocates, for example, have criticized such applications for encouraging surveillance, and especially for enabling governments to identify masked protesters.

“This is a very important paper because people usually don’t notice the complexity, potential harms and risks of data sets,” said Margaret Mitchell, an artificial intelligence ethics researcher and a leader in responsible data practices, who did not participate in the research.

She added that for a long time, the culture of the artificial intelligence community has assumed that data exists to be used. This paper shows how that assumption can cause problems. “It is very important to think carefully about the various values a data set encodes, as well as the values that making a data set available encodes,” she said.


The study authors offer several recommendations for the AI community going forward. First, creators should communicate the intended use of their data sets more clearly, through both licensing and detailed documentation. They should also place stricter limits on access to their data, perhaps by requiring researchers to sign terms of agreement or fill out an application, especially if they intend to construct a derivative data set.

Second, research conferences should establish norms for how data is collected, labeled, and used, and should create incentives for responsible data set creation. NeurIPS, the largest artificial intelligence research conference, already includes a list of best practices and ethical guidelines.

Mitchell suggested going further. As part of the BigScience project, a collaboration among AI researchers to develop an AI model that can parse and generate natural language under strict ethical standards, she has been exploring the idea of creating data set stewardship organizations: teams of people who not only handle the management, maintenance, and use of the data but also work with lawyers, activists, and the public to ensure that it meets legal standards, is collected only with consent, and can be deleted if someone chooses to withdraw personal information. Not all data sets would require this type of stewardship organization, but it is certainly necessary for scraped data that may contain biometric or personally identifiable information or intellectual property.

“Data set collection and monitoring is not a one-time task for one or two people,” she said. “If you do this responsibly, it breaks down into a large number of different tasks that require deep thinking, deep expertise and a variety of different people.”

In recent years, the field has increasingly come to believe that more carefully curated data sets will be key to overcoming many of the industry’s technical and ethical challenges. It is now clear that building more responsible data sets is not enough. People who work in artificial intelligence must also commit to maintaining those data sets over the long term and using them ethically.

