“Research Without Borders: Big Open Data” Review

Flyer for Research Without Borders: Big Open Data courtesy of Columbia CDRS' Twitter.

Big Open Data was the focus of the December 4 event, the second in the 2014-2015 academic year of Research Without Borders, Columbia University’s panel discussion series. Co-sponsored by the Center for Digital Research and Scholarship’s Scholarly Communication Program, the Data Science Institute, and the School of Continuing Education’s Information and Knowledge Strategy Program, the Research Without Borders series “focuses on pivotal issues in scholarly communication” and is free and open to the public as well as webcast live. This particular event asked the following questions of Big Open Data: “How are large amounts of data managed, made sense of, and made accessible? What are the challenges of working with large open datasets, and how are different academic disciplines making use of them?”

David Park, Columbia’s Dean of Strategic Initiatives, served as moderator of the event and kicked off the discussion by introducing the panel and the topic. Each panelist then presented their particular perspective on Big Open Data, followed by a question-and-answer session with the in-person audience as well as the virtual audience via Twitter. The three researchers chosen to delve into this topic came from three distinct disciplines and, consequently, offered three very different viewpoints. The first offered a perspective from the humanities: David Joseph Wrisley is a medievalist and Associate Professor in the Department of English and the Civilization Sequence Program at the American University of Beirut, whose current work includes Visualizing Medieval Places, a digital humanities project “exploring space, place, and time in medieval text.” The second panelist, Jonathan Stray, is a computational journalist, a career he described as meaning both “using computers to help journalism” and “doing journalism about computer science.” In addition to teaching a course on the subject at Columbia, he currently leads development of the Overview Project, an open-source tool that helps investigative journalists find stories by analyzing large numbers of documents. Alice Marwick, the third and final presenter, is a professor of Internet Studies and New Media Theory at Fordham University. Her work “investigates online identity and consumer culture through lenses of privacy, surveillance, consumption, and celebrity.”

The lone humanist, David J. Wrisley, acknowledged that his presence at Research Without Borders might seem unusual, since the event concerned data-based research. However, he did an excellent job of demonstrating the role big open data can play in his discipline by citing several projects from various areas of study under the humanities umbrella. For his presentation, he repunctuated the topic as “Big, Open Data” to emphasize its different permutations, explaining how these “different beasts” have encroached into the humanities in different ways. Wrisley did not treat big, open data simply as an issue of “so much data, so little time”; he also discussed ways in which the humanities can change data and computing, as well as the ways we think about them. Wrisley described how data-driven projects in the humanities vary in scale and granularity. “Big humanities projects” analyze large datasets to shed new light on traditional areas of research. Conversely, projects using “small big data” study objects by getting extremely close, closer than what can be seen by the human eye. Regardless of the method, Wrisley ultimately advocated for collaboration among scholars. He explained that the use of big, open data is shifting the work of humanists in scholarly communication. Previously, humanities arguments would be constructed by scholars working alone and then shared with the academic community through writing. Nowadays, these projects must often be done in teams in order to secure big funding and to protect the work from digital rot.

Jonathan Stray, journalist and computer scientist, then looked at the presence of big open data in journalism, asserting that (a) it is “not always big,” (b) it is “not always open,” and (c) it is “not always data.” He began by emphasizing that, no matter the size, data does not mean anything by itself and must be interpreted. To explain, he cited two very distinct interpretations of the exact same unemployment data. Stray then explained that to overcome the challenge of data that is “not always open,” you can file a request via the Freedom of Information Act if it is government data, or you can simply generate your own data. He used credit scores, which we have very little control over or understanding of, as an extreme example of such closed data. Stray then advocated fighting for more “algorithmic accountability,” i.e., more transparency, for this type of information, since it can affect our lives quite seriously. To illustrate big open data that is “not always data,” Stray discussed the archives of the Guatemalan Secret Police, which are currently being analyzed at the University of Texas. He explained that even if all of these documents were scanned and text-searchable, language is so contextual that searching for a string of text would not be a very effective form of research. Instead, he asserted, it is important to find a way to narrow large document sets down so that they are useful to journalists. Stray then presented his current work on the Overview Project as an example of a tool for this “bulk data mining in investigative journalism.” Stray ended his portion of the presentations by posing the hard problems of computational journalism he currently struggles with: “How best can we combine human and machine intelligence? How to sustainably fund technology transfer into journalism? What stories should we cover? Does the reality of transparency match the theory? (And do we even know what the reality is?).”

Before beginning her presentation, Alice Marwick warned that she would be taking a more dystopian approach than either Wrisley or Stray. She was not exaggerating. Her discussion of “Tracking and Mining the Mobile, Social Web” centered on how our personal data turns into big data. She began with the launch of Facebook’s new ad-serving technology, Atlas. Atlas addresses what Marwick called Facebook’s “cookie problem”: third-party cookies placed by ad networks, which track not only which sites you visit but also in what order and what you click on, work only about 60% of the time because of users’ browser settings and extensions. Facebook has circumvented this issue by using personal data collected from user profiles and aggregating it with data gathered across devices as well as offline. One of the ways this is accomplished is by matching users’ Facebook profiles with the email addresses they use on other, non-social-media websites to triangulate their information. Marwick continued by discussing some of the concerns with data brokers, whose “big promise” is to microtarget ads and sort people into segments. For instance, Facebook’s ability to target ads goes far beyond facebook.com, and even though Atlas allows users to opt out of targeted ads, it does not allow them to opt out of having their data gathered. Perhaps the most unsettling of Marwick’s examples is that these data brokers can collect this information and sell it to the government, which cannot legally collect this information itself. In a related example, the Obama campaign purchased information from data brokers during the last election to determine ideal times to pay to run campaign commercials. Furthermore, data brokers have also accidentally sold this often sensitive personal information to criminals, such as an identity theft ring in Vietnam.

Marwick finished up by discussing the need for “big data due process,” arguing that data brokers should be more transparent about how they gather this information and that people should have a right to learn how it can be used against them. “Social media is about human connection,” she said, “the price shouldn’t be comprehensive surveillance.”

As could be expected after such an unsettling final presentation, the questions in the question-and-answer session that followed mostly reflected concern about online data collection and the dystopian possibilities of big data. There was a discussion of apps such as Snapchat, which claim to be “anonymous.” However, it turns out that they do not actually have the type of encryption software necessary to be truly anonymous. One audience member brought up questions of obfuscation and faux data, which can also be used as a form of activism. At one point Stray did assert that the future is not entirely bleak, although we will never be able to opt out entirely because this technology is simply too easy and too useful. Furthermore, Marwick pointed to the strength of the new European privacy laws as a hopeful example. The panel discussion ended with Wrisley explaining the many challenges of interdisciplinary work. To illustrate the importance of crossing disciplines, he referenced a recent conversation with a friend who expressed dismay that “at the information visualization conference there were 500 solutions to no problems…and at the humanities conference there were 500 problems with no solutions.” Moderator David Park also chimed in, stating that despite these challenges, he feels more and more scholars are seeing the value in collaboration. Academic librarians can help to facilitate these valuable scholarly connections among different fields of research, in addition to maintaining collections of data and making them accessible.


