Tag: Scraping Data

“Internet History Is Fragile”: Archiving and Preserving The Web

“Once it’s on the internet, it’s there forever.” That is the popular sentiment. But what happens when web data has been altered or deleted? How do we access the original data? Recently, there was public outcry in response to the removal of several pages from the official White House website by the Trump administration. Pages on LGBTQ rights, civil rights, and climate change were removed within moments of President Trump’s inauguration. This erasure was particularly alarming for many people because it indicated the new administration’s sentiment towards minorities and the environment. Many people also believed these pages were permanently deleted and that their data could never be accessed again. However, these web pages were in fact migrated to an archived version of the Obama administration’s website. Even though the web data was migrated, its swift removal from the White House website reminded me that valuable information can easily be removed from public access. Because users have the ability to alter and delete web data, the data itself is rather fragile and transient.


Module 2 Exercise 4.1: Scraping Data With Wget

Wget is a Tool for Downloading Internet Sources

The purpose of this Programming Historian exercise was to help me get a sense of how to use wget to download a specific set of files, and how to download internet sources by creating a mirror of an entire website. For this exercise I decided to complete the section “Step Two: Learning about the Structure of Wget – Downloading a Specific Set of Files.” I ran wget from the command line to download the papers located on the Active History website under the “features” tab. I was introduced to a series of useful commands for wget:
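
To give a sense of what downloading a specific set of files with wget can look like, here is a minimal Python sketch that only assembles and prints a wget command, without running it. The URL is a placeholder, and the wait time and rate limit are assumed values chosen for illustration, not the ones from the exercise. The helper name `build_wget_command` is hypothetical. The options themselves (`--recursive`, `--no-parent`, `--wait`, `--limit-rate`) are standard GNU wget options.

```python
import shlex


def build_wget_command(url, wait_seconds=2, rate_limit="20k"):
    """Return the argument list for a polite, recursive wget download.

    The default values are illustrative assumptions, not recommendations
    taken from the exercise.
    """
    return [
        "wget",
        "--recursive",                 # follow links and download what they point to
        "--no-parent",                 # never climb above the starting directory
        f"--wait={wait_seconds}",      # pause between requests to spare the server
        f"--limit-rate={rate_limit}",  # cap the download speed
        url,
    ]


# Placeholder URL (an assumption): swap in the directory you actually want.
command = build_wget_command("https://example.org/papers/")

# Print the command so it can be reviewed before pasting it into a terminal.
print(shlex.join(command))
```

Printing the command instead of executing it leaves room to double-check the starting URL first, since a recursive download can grow quickly. For copying a whole website rather than one directory, wget also provides a `--mirror` option.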
