Ars Technica has a detailed breakdown of how Swartz was able to remotely access PACER’s servers. I did not know precisely how he did it.
In early September, Swartz e-mailed Malamud to discuss an alternative approach: instead of sending volunteers to libraries, they could crawl PACER directly from Malamud’s server. Malamud was skeptical. “The thumb drive corps is based on going to the library and using their access,” he noted. “Do you have some kind of magic account or something?”
Swartz asked a friend to go to a Sacramento library that was participating in the program. After the librarian logged the friend into the library’s PACER account, the friend extracted an authentication cookie set by the PACER site. Because this cookie wasn’t tied to any specific IP address, it allowed access to the library’s PACER account from anywhere on the Internet. But Swartz admitted to Malamud that he didn’t have the library’s permission to use this cookie for off-site scraping.
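The mechanism here is simple: PACER's session cookie acted as a bearer token, so any client presenting it was treated as the logged-in library account, regardless of where the request came from. A minimal sketch in Python of what replaying such a cookie looks like (the cookie name, value, and format below are hypothetical placeholders, not the actual PACER credentials):

```python
# Sketch: reusing a captured session cookie from another machine.
# Cookie name/value here are hypothetical, for illustration only.

def build_replay_headers(raw_cookie: str) -> dict:
    """Given a cookie string captured from a logged-in browser session,
    return request headers that present that same session to the server."""
    # Keep only the name=value pair; drop attributes like Path or Secure.
    name_value = raw_cookie.split(";", 1)[0].strip()
    return {
        "Cookie": name_value,
        # An ordinary browser User-Agent so the request looks routine.
        "User-Agent": "Mozilla/5.0",
    }

headers = build_replay_headers("PacerSession=abc123; Path=/; Secure")
# Because the server checks only the cookie, not the client's IP address,
# any machine sending these headers is treated as the library's session.
print(headers["Cookie"])  # PacerSession=abc123
```

Had the cookie been bound to the library's IP address, this kind of off-site reuse would have failed, which is precisely the gap Swartz's approach exploited.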
“This is not how we do things,” Malamud scolded in a September 4 e-mail. “We don’t cut corners, we belly up to the bar and get permission.”
“Fair enough,” Swartz replied. “Stephen is building a team to go to the library.”
But without telling Malamud or Schultze, Swartz pushed forward with his off-site scraping plan. Rather than using Malamud’s server, he began crawling PACER from Amazon cloud servers.
“I thought at the time he was actually in the libraries” downloading the documents that were accumulating on his server, Malamud told Ars in a phone interview. In reality, Swartz merely had to dispatch a volunteer to the library once a week to get a fresh authentication cookie. Swartz could do the rest of the work from the comfort of his apartment.