As librarians at a university, one of the most common questions that we get is: “Can I access articles from (newspaper) around (time period)?”

History as told in newspapers and magazines has value for many different researchers and communities, and without it, a great deal of scholarship, personal research, and the historical record would be lost. Luckily, the archives of many newspapers and magazines have been saved and maintained, and their historical copies are available through libraries, commercial vendors like LexisNexis, or free repositories on the Web.

Archiving modern journalism, on the other hand, is a lot more complex than saving newspaper articles of the past.

With the rise of data-informed journalism, we are seeing more and more complex “interactives”: news stories built as interactive webpages that load content from a database or other data source in the background (e.g., stories from The Upshot team at NYT). These stories let readers explore and engage with data, and understand the bigger picture behind each human story. Iconic examples of data journalism projects include “Dollars for Docs” by ProPublica and “Gun Deaths in Your District” by The Guardian, among many others.
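To see why these stories are so hard to archive, here is a minimal, entirely hypothetical sketch of the pattern: a thin page whose actual content arrives from a backend endpoint at view time, so a static snapshot of the HTML captures almost nothing.

```python
# Hypothetical sketch of the pattern behind a data-driven "interactive":
# the page itself is a thin shell, and the actual content is fetched from
# a backend endpoint at view time, which is exactly what a static HTML
# snapshot fails to capture.
from flask import Flask, jsonify, render_template_string

app = Flask(__name__)

# Stand-in for a real newsroom database of per-district figures.
DISTRICTS = {"NY-10": {"deaths": 12}, "CA-12": {"deaths": 7}}

PAGE = """
<h1>Deaths in your district</h1>
<select onchange="load(this.value)">
  <option>NY-10</option><option>CA-12</option>
</select>
<div id="out"></div>
<script>
async function load(d) {
  const r = await fetch('/api/districts/' + d);  // dynamic request
  document.getElementById('out').textContent = JSON.stringify(await r.json());
}
</script>
"""

@app.route("/")
def index():
    return render_template_string(PAGE)

@app.route("/api/districts/<name>")
def district(name):
    return jsonify(DISTRICTS.get(name, {}))

if __name__ == "__main__":
    app.run()
```

A crawler that saves only the page at `/` would miss every response from `/api/districts/...`, so the archived copy renders an empty shell.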

But because of their technological complexity, these cutting-edge projects (also called “news apps”) cannot be fully or systematically saved by libraries or web archiving tools. As such, they are being lost. You have a better shot at reading a full newspaper from 1892 than a single interactive story from a newsroom in 2005.

This startling and unacceptable fact has served as the guiding motivation for our Saving Data Journalism project.

This gets to the crux of the archiving problem: current web archiving technologies, which have been successful at capturing snapshots of static news content, fail to capture the look, feel, and functionality of a significant amount of dynamic content. While tools like Webrecorder have alleviated some aspects of this problem, they still have several limitations that need to be addressed before a full news app can be satisfactorily captured.

Current tools require clicking around a dynamic website to record its content. Now picture that for a database-driven website: you would need to click through every permutation of content and data, which is not only an enormous amount of time and effort but also highly susceptible to human error. Fully capturing a single project could require an archivist to click thousands to hundreds of thousands of links, on top of quality-assurance time, as the back-of-the-envelope sketch below illustrates.
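Here is that back-of-the-envelope arithmetic for a hypothetical news app with four independent filters; the numbers are invented, but the multiplication is the point.

```python
# Back-of-the-envelope count of the distinct views a click-through capture
# would need to visit, for a hypothetical news app with independent filters.
from math import prod

filters = {
    "state": 50,       # one view per U.S. state
    "year": 10,        # a decade of annual data
    "category": 8,     # e.g. cause-of-death groupings
    "chart_type": 3,   # map, bar chart, or table
}

views = prod(filters.values())
print(f"Distinct views to capture: {views:,}")           # 12,000

# At an optimistic 5 seconds per click-and-record:
print(f"Archivist time: ~{views * 5 / 3600:.0f} hours")  # ~17 hours
```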

So how can we solve this problem?

We set out on our IMLS-funded Saving Data Journalism project to create a prototype of a tool that would automate the capture and archiving of these news apps. And so, our research team is excited to present ReproZip-Web (RZW), an open-source prototype aimed at saving these news applications from extinction!


ReproZip-Web leverages ReproZip, a computational reproducibility tool, and pywb, a Python toolkit for recording and replaying web archives, to automatically and transparently capture and replay dynamic websites.
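For a feel of the pywb half of that pipeline, here is an illustrative sketch of its standard collection workflow, shown via subprocess; the `wb-manager` and `wayback` commands are pywb's own CLI, while the collection name and WARC filename are hypothetical placeholders.

```python
# Illustrative use of pywb's standard CLI (wb-manager / wayback); the
# collection name and WARC filename are hypothetical placeholders.
import subprocess

# Create a new pywb collection for the captured news app.
subprocess.run(["wb-manager", "init", "news-app"], check=True)

# Add a previously recorded WARC file to the collection.
subprocess.run(["wb-manager", "add", "news-app", "capture.warc.gz"], check=True)

# Serve the collection for replay at http://localhost:8080/news-app/
subprocess.run(["wayback"], check=True)
```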

The prototype needs to run on a server or computer where the news application is currently running. Once recording and tracing starts, ReproZip-Web makes a record of the dependencies and captures their source so that the news app can be rerun, including the source code of any specialized software libraries used and all input/output data.
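This tracing step builds on ReproZip's documented trace-then-pack workflow. A minimal sketch, assuming (hypothetically) a news app launched with `python app.py`:

```python
# Minimal sketch of ReproZip's documented trace-then-pack workflow;
# "python app.py" is a hypothetical stand-in for the real news app command.
import subprocess

# Trace the running app: ReproZip records the files read, libraries
# loaded, and other dependencies into a config.yml for later review.
subprocess.run(["reprozip", "trace", "python", "app.py"], check=True)

# Pack everything the trace found into a single distributable bundle.
subprocess.run(["reprozip", "pack", "news-app.rpz"], check=True)
```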

Next, RZW packs all of this into a bundle that contains everything needed to reproduce the news application in a different environment. These .rpz bundles automatically set up the dependencies for the user, so the news app replays in the same environment, with all the same dependencies, as the original application. Because the bundles are also lightweight and generalized, they are ideal for distribution and preservation.
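To give a sense of what setting up those dependencies automatically looks like in practice, here is a sketch using reprounzip's documented Docker unpacker; the bundle name and target directory are hypothetical.

```python
# Sketch of replaying an .rpz bundle with reprounzip's Docker unpacker;
# the bundle name and target directory are hypothetical.
import subprocess

# Build a Docker environment containing the bundle's recorded dependencies.
subprocess.run(["reprounzip", "docker", "setup", "news-app.rpz", "replay/"],
               check=True)

# Re-run the packed news application inside that environment.
subprocess.run(["reprounzip", "docker", "run", "replay/"], check=True)
```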

Our project is very much a prototype, and more work is needed to test, develop, and generalize it. If you’d like to try it out, the prototype is available on GitHub and the documentation is available here: https://reprozip-news-app-archiving-tool.readthedocs.io/. Feel free to open an issue on GitHub or reach out to us with any questions over email!

Our next steps for Saving Data Journalism are:

  1. Take RZW out of the prototyping phase and into “production-level” software, including usability testing with archivists and newsrooms
  2. Engage with more data journalists and newsrooms to test and evaluate our tool, and to find the best integration point for RZW in their workflows
  3. And finally, a call to arms for data journalists and newsroom archivists: please contact the project manager, Katy Boss, at katherine.boss@nyu.edu if you are interested in working with us to save data journalism!

Our team

We are a group of former journalists, librarians, software engineers, and computer science researchers who are deeply concerned about the loss of dynamic websites and data journalism! Read more about who we are and what we’ve been doing on our websites: Katherine Boss, Vicky Steeves, Fernando Chirigati, and Rémi Rampin.

Thanks

We would like to thank the Institute of Museum and Library Services (LG-87-18-0062-18) for their generous support of this project. Thanks also to Rhizome, and especially Ilya Kreymer, for their work on Webrecorder, without which this project would have faced even greater barriers.

Preferred citation:

Boss, Katherine, and Vicky Steeves. “Saving Data Journalism: Using ReproZip-Web to Capture Dynamic Websites for Future Reuse.” Software Preservation Network. https://www.softwarepreservationnetwork.org/saving-data-journalism-using-reprozip-web-to-capture-dynamic-websites-for-future-reuse/