Scraping the Web Is a Powerful Tool. Clearview AI Abused It

The internet was designed to make information free and easy for anyone to access. But as the amount of personal information online has grown, so too have the risks. Last weekend, a nightmare scenario for many privacy advocates arrived. The New York Times revealed Clearview AI, a secretive surveillance company, was selling a facial recognition tool to law enforcement powered by “three billion images” culled from the open web. Cops have long had access to similar technology, but what makes Clearview different is where it obtained its data. The company scraped pictures from millions of public sites including Facebook, YouTube, and Venmo, according to the Times.

To use the tool, cops simply upload an image of a suspect, and Clearview spits back photos of them and links to where they were posted. The company has made it easy to instantly connect a person to their online footprint, the very capability many people have long feared someone would possess. (Clearview’s claims should be taken with a grain of salt; a BuzzFeed News investigation found its marketing materials appear to contain exaggerations and lies. The company did not immediately return a request for comment.)

Like almost any tool, scraping can be used for noble or nefarious purposes. Without it, we wouldn’t have the Internet Archive’s invaluable Wayback Machine, for instance. But it’s also how Stanford researchers a few years ago built a widely condemned “gaydar,” an algorithm they claimed could detect a person’s sexuality by looking at their face. “It’s a fundamental thing that we rely on every day, a lot of people without realizing, because it’s going on behind the scenes,” says Jamie Lee Williams, a staff attorney at the Electronic Frontier Foundation on the civil liberties team. The EFF and other digital rights groups have often argued the benefits of scraping outweigh the harms.

Automated scraping violates the policies of sites like Facebook and Twitter, the latter of which specifically prohibits scraping to build facial recognition databases. Twitter sent a letter to Clearview this week asking it to stop pilfering data from the site “for any reason,” and Facebook is also reportedly examining the matter, according to the Times. But it’s unclear whether they have any legal recourse in the current system.

To fight back against scraping, companies have often used the Computer Fraud and Abuse Act, claiming the practice amounts to accessing a computer without proper authorization. Last year, however, the Ninth Circuit Court of Appeals ruled that automated scraping doesn’t violate the CFAA. In that case, LinkedIn sued and lost against a company called hiQ, which scraped public LinkedIn profiles in bulk and combined them with other information into a database for employers. The EFF and other groups heralded the ruling as a victory, because it limited the scope of the CFAA, which they argue has frequently been abused by companies, and helped protect researchers who break terms of service agreements in the name of freedom of information.

The CFAA is one of the few options available to companies that want to stop scrapers, which is part of the problem. “It’s a 1986, pre-internet statute,” says Williams. “If that’s the best we can do to protect our privacy with these very complicated, very modern problems, then I think we’re screwed.”

Civil liberties groups and technology companies both have been calling for a federal law that would establish Americans’ right to privacy in the digital era. Clearview, and companies like it, make the matter that much more urgent. “We need a comprehensive privacy statute that covers biometric data,” says Williams.

Right now, there’s only a patchwork of state regulations that potentially provide those kinds of protections. The California Consumer Privacy Act, which went into effect this month, gives state residents the right to ask companies like Clearview to delete data they collect about them. Other regulations, like the Illinois Biometric Information Privacy Act, require corporations to obtain consent before collecting biometric data, including faces. A class action lawsuit filed earlier this week accuses Clearview of violating that law. Texas and Washington have similar regulations on the books, but don’t allow for private lawsuits; California’s law also doesn’t allow for a private right of action.

Some experts argue that empowering consumers is not enough. “We just can’t be expected to manage every use of our data online,” says Dylan Gilbert, a privacy lawyer at the civil liberties group Public Knowledge. He argues the solution instead is to make some uses of personal data illegal. For example, some cities, including San Francisco, have banned facial recognition by city agencies altogether.

Another option is to give some power to organizations, rather than only individuals. “Companies and platforms like LinkedIn or Facebook or Twitter should have the right to protect their users’ privacy downstream,” says Tiffany C. Li, a technology lawyer and visiting professor at Boston University School of Law. A federal law could allow online platforms to sue entities like Clearview on behalf of their users to protect their right to privacy. The risk, though, is that corporations will pursue litigation that mostly serves their own interests. A 2018 article in Boston University’s Journal of Science & Technology Law found that in 20 years of scraping cases based on the CFAA, “a tremendous number” concerned claims brought “by direct commercial competitors or companies in closely adjacent markets to each other.”

In the absence of legal recourse, one way companies have blocked people from scraping their sites is by using technical tools. Facebook has been particularly aggressive in this regard. It requires users to sign in to view almost anything on its site, and it uses a lengthy robots.txt file to stop Google from indexing many of its pages. That’s why if you Google your name, all of your Facebook activity likely isn’t in the search results. But not all of the social network’s efforts have been popular. Last year, the company blocked third-party transparency tools used by nonprofits and journalists, because it said it needed to prevent malicious actors from scraping its site.
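For readers curious how a robots.txt file works in practice, here is a minimal sketch using Python's standard-library parser. The rules below are hypothetical and are not Facebook's actual file; the bot names and example.com URLs are likewise made up for illustration. It shows how a well-behaved crawler checks the file before fetching a page. The file is only a request: a scraper that chooses to ignore it is not technically stopped by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (NOT any real site's robots.txt).
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /profile/
Allow: /help/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this crawler to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A compliant crawler runs a check like this before every request.
for agent, url in [
    ("Googlebot", "https://example.com/help/faq"),
    ("Googlebot", "https://example.com/profile/jane"),
    ("SomeOtherBot", "https://example.com/help/faq"),
]:
    verdict = "allowed" if is_allowed(parser, agent, url) else "blocked"
    print(f"{agent} -> {url}: {verdict}")
```

Under these assumed rules, the search crawler may read help pages but not profile pages, and every other bot is asked to stay out entirely. Enforcement depends on the crawler's cooperation, which is why sites that want stronger protection pair the file with login walls and active blocking.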

Not all companies have the resources, or priorities, to create those kinds of barriers against any would-be scrapers. Venmo, the payment app owned by PayPal, has repeatedly been criticized for making all transactions public by default. Several researchers and artists have scraped millions of payments from Venmo to demonstrate how it puts people’s privacy at risk. Clearview says it also mined the site for its database. “Scraping Venmo is a violation of our terms of service and we actively work to limit and block activity that violates these policies,” a spokesperson said in a statement. While the app, and others like it, could do more to protect users, catching malicious scraping will always be an evolving cat-and-mouse game, and regulatory action could be more effective at stopping it.

“We don’t want to limit access to information and we don’t want to ban web scraping,” Li says. “But we need to think about other ways to prevent some of the privacy harms we saw with Clearview.”