How to Archive Open Source Materials

By Aric Toler 
ru.bellingcat.com
March 23, 2018

When conducting open source research, it is important to think about how to archive the materials you study. A user may delete a post on a social network after your investigation has been published, or a video with shocking footage (for example, of a war crime in Syria) may be deleted due to YouTube's censorship policies.

There are two main reasons for archiving all digital evidence used in an investigation: to preserve it in case it is deleted from the original source, and to prove to the audience that the material (if it has been deleted) actually existed in the form in which you present it. Screenshots can be easily faked, so it is critical to save your content in a way that shows you could not have altered it.

Third Party Archiving Platforms

For most content, including social media posts, news articles, and other web pages, two services usually work: Archive.today and Archive.org. These sites store copies of web pages on their own servers and make them available through a link. Both sites save a page as it was at a specific point in time, so you can see changes between different archives, for example before and after information was cut from an article. We recommend saving content on both sites to maximize the amount of content archived.

We will briefly describe how both sites work and how effective they are at archiving pages from various popular social networks. In general, Archive.today is better suited for saving social network pages, since it does this through a specially created account, while Archive.org only sees completely public pages that do not require an account.

Archive.today

Of the two main archiving sites, Archive.today is more effective when it comes to social media. However, it has not been around for nearly as long as Archive.org, and it should be considered less stable because it is much more modest in scale. Additionally, the site is blocked in various countries because extremist content is sometimes distributed through Archive.today links. Alternative links to this site (Archive.is, Archive.li, Archive.ch...) allow you to bypass censorship in some (but not all) countries, such as Russia, China and Finland.

Archive.today saves pages solely based on user requests, and not automatically, like Archive.org. To save a page on this site, simply enter the link to the page you want to save into the box in the red rectangle.

You can also archive pages with a browser bookmark, which lets you save the page you are on in one click. To do this, create a new bookmark (or favorite) and use the following as its link:

javascript:void(open('https://archive.today/?run=1&url='+encodeURIComponent(document.location)))

Now just click on the newly created bookmark to save any page you have open in your browser.

You can also drag the button on the Archive.today home page to your bookmarks bar to avoid manually creating a bookmark.
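The same submission link can be built outside the browser. The following is a minimal sketch, assuming the link format used in the bookmark above; the function name is hypothetical, and the code only builds the link without contacting Archive.today.

```python
from urllib.parse import quote

# Submission prefix copied from the bookmark link above.
ARCHIVE_TODAY_SUBMIT = "https://archive.today/?run=1&url="


def archive_today_submit_url(page_url: str) -> str:
    """Return the link that asks Archive.today to save page_url."""
    # safe="" percent-encodes reserved characters such as ":" and "/",
    # similar to what encodeURIComponent does in the bookmark.
    return ARCHIVE_TODAY_SUBMIT + quote(page_url, safe="")


if __name__ == "__main__":
    # Hypothetical example page; open the printed link in a browser.
    print(archive_today_submit_url("https://example.com/a b?x=1"))
```

Opening the printed link in a browser should have the same effect as clicking the bookmark on that page.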

To check if a link has already been saved, enter it in the field in the blue rectangle.

There are more advanced ways to find saved pages if you don't know the exact link. For example, if you want to find all archived Bellingcat articles tagged MENA (Middle East and North Africa), search for the link to that section with an asterisk at the end.

The asterisk at the end of the link works as a wildcard: the search will find all articles on Bellingcat whose links begin with “news/mena”. This includes all articles in the “MENA” section of our website.
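To make the effect of the trailing asterisk concrete, here is a small local illustration of prefix matching. It is not how Archive.today itself is implemented; the function name and the sample links are hypothetical.

```python
def matches_search(saved_link: str, query: str) -> bool:
    """Return True if saved_link would be found by the search query."""
    if query.endswith("*"):
        # A trailing asterisk means "any link that begins with this prefix".
        return saved_link.startswith(query[:-1])
    # Without an asterisk, only the exact link matches.
    return saved_link == query


if __name__ == "__main__":
    # Hypothetical saved links, for illustration only.
    saved = [
        "example.org/news/mena/first-article/",
        "example.org/news/mena/second-article/",
        "example.org/news/americas/other-article/",
    ]
    for link in saved:
        print(link, matches_search(link, "example.org/news/mena/*"))
```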

The results will show articles manually saved by users who entered the link, as well as pages with links to Archive.org's database of saved pages. In some cases, you may be able to open different versions of the same page if changes have been made to the article.

Another useful feature of Archive.today is the ability to save an entire page as an image, even if it is very long. However, this should not be used as a replacement for an archive link, as screenshots can be edited after saving.

Archive.today is relatively successful at archiving social media pages, but its performance is far from perfect. Below are saved pages from various social networks. As a rule, it is almost impossible to archive a social network page protected by some privacy settings, such as “only friends of friends can see this page” on Facebook, using third-party archivers like Archive.today or Archive.org.

In the examples below, click on the hyperlink to each social network to view the saved page on Archive.today.

Facebook

Works pretty well, except for the photos and videos embedded in posts.

Instagram

Does not work.

Twitter

Works very well, except for embedded content in tweets, particularly photos, videos and links.

VKontakte (VK)

Works very well except for embedded photos and videos.

Odnoklassniki (OK)

Works very well except for embedded photos and videos.

YouTube

Can only save metadata and text, but not the videos themselves.

Archive.org

Founded in 1996, the Internet Archive has been preserving web pages for more than 20 years and has a significant budget, which gives it a stability that cannot be expected from Archive.today. While Archive.org has many great projects, we're primarily interested in the Internet Archive Wayback Machine (web.archive.org), which allows users to archive specific pages and view pages archived by other users.

As with Archive.today, the process of searching and saving web pages is very simple. Enter the link into the search bar at the top of the page to view archived versions. To save a page using a link, enter it at the bottom right.
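For readers who prefer links to forms, the sketch below builds the addresses that correspond to these two actions. The "/save/" and "/web/*/" link forms and the availability endpoint are not described in this article; they are assumptions based on how the Wayback Machine's public links are commonly written, and the function names are hypothetical. The code only builds links and makes no network requests.

```python
from urllib.parse import urlencode

WAYBACK = "https://web.archive.org"


def wayback_save_url(page_url: str) -> str:
    """Link that asks the Wayback Machine to save page_url (assumed form)."""
    return f"{WAYBACK}/save/{page_url}"


def wayback_history_url(page_url: str) -> str:
    """Link to the list of archived versions of page_url (assumed form)."""
    return f"{WAYBACK}/web/*/{page_url}"


def wayback_availability_url(page_url: str) -> str:
    """Link to a JSON lookup of the closest saved copy (assumed endpoint)."""
    return "https://archive.org/wayback/available?" + urlencode({"url": page_url})


if __name__ == "__main__":
    page = "https://example.com/page"  # hypothetical page
    print(wayback_save_url(page))
    print(wayback_history_url(page))
    print(wayback_availability_url(page))
```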

While Archive.today only saves pages based on user requests, Archive.org uses both user requests and scripts that save pages automatically. For example, Bellingcat's home page has been captured more than 800 times since the domain was purchased in May 2014. Surely only a small share of those captures came from user requests.

When saving regular web pages and news articles, Archive.org often beats Archive.today by allowing you to click through to other archived pages. For example, using the Internet Archive Wayback Machine, you can navigate much of the Bellingcat site as if you were in 2014, since all of these pages were saved about 4 years ago. There are far fewer archived pages to be found on Archive.today.

Archive.org isn't as good at social media as Archive.today, but it still comes in handy sometimes.

Facebook

Works well with completely public pages, but, unlike Archive.today, does not have access to pages that require a Facebook account.

Instagram

Does not work.

Twitter

Works very well, except for embedded content in tweets, particularly photos, videos and links.

VKontakte (VK)

It works well with completely public pages, but, unlike Archive.today, it does not have access to pages that require a VK account.

Odnoklassniki (OK)

Works well with completely public pages, but, unlike Archive.today, does not have access to pages that require an OK account.

YouTube

Doesn't work very well on the main Wayback Machine site: it often fails to retain even the metadata and text of videos.

However, Archive.org has a separate project called YouTube Crawl, which archives YouTube videos along with their metadata. You can read more about participating in their project here. This requires more effort than a single click on web.archive.org or archive.today.

Saving photos and videos

From the previous section, you learned that neither Archive.org nor Archive.today can save photos and videos from Instagram and YouTube, and that both have problems saving photos from Facebook, VK and other sites. Creating a third-party “neutral” platform to preserve media from these sites is much more difficult. Instead, you must download the materials separately and then provide additional materials (such as screenshots with metadata, materials on mirror sites, etc.) to prove the authenticity of the screenshots and videos.
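One complementary habit, which this guide does not itself describe, is to record a checksum of each file at the moment you download it. The sketch below is only an illustration under that assumption: the function names and record fields are hypothetical, and a checksum can only help show that a file has not changed since you recorded it. It does not replace a third-party archive.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large videos do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_entry(path: str, source_url: str) -> dict:
    """Build a record tying a downloaded file to its source link."""
    return {
        "file": Path(path).name,
        "source_url": source_url,
        "sha256": sha256_of_file(path),
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping such records next to the downloaded files makes it easier to show later that the copy you publish is the one you originally saved.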

YouTube

There are many sites that allow you to download videos from YouTube, such as KeepVid, Y2Mate and others. Archiving YouTube videos is not difficult at all if you have enough space to save them on your hard drive or in the cloud. Be sure to take a screenshot of the metadata and save the page to Archive.today to preserve the title, upload date, and description, even if the video itself is not saved to the page.

Instagram

Unfortunately, archiving Instagram pages is very difficult. Often, all we can do is hope for a cross-post on another site (many dubious sites “borrow” Instagram content and host it) or manually save the images at full resolution.

To open an Instagram photo in full resolution, follow this procedure:

  1. Find the link to the photo on Instagram and delete everything after its ID. For example, for a photo with the link instagram.com/p/BfZJzBphUr1/ the ID is BfZJzBphUr1. If there is anything else after this ID (such as “taken-by=username”), remove that part.
  2. Add “media/?size=l” (lowercase L) after the trailing slash of the link. For the link instagram.com/p/BfZJzBphUr1/ the result will be instagram.com/p/BfZJzBphUr1/media/?size=l
  3. Now the Instagram photo will open in the maximum available resolution in JPG format. In the case of the post mentioned above, this will give the following result.
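The steps above can be sketched as a small helper. This is a minimal sketch with a hypothetical function name; adding "https://" to links that lack a scheme is an assumption made for convenience. It only builds the link and does not check that Instagram still serves it.

```python
from urllib.parse import urlsplit


def instagram_full_size_url(post_url: str) -> str:
    """Turn an Instagram post link into its full-size media link."""
    # Accept links written without a scheme, as in the example above.
    if "://" not in post_url:
        post_url = "https://" + post_url
    parts = urlsplit(post_url)
    # The path looks like /p/<ID>/; the query string (for example
    # "taken-by=username") is dropped because only the path is used.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "p":
        raise ValueError("not an Instagram post link: " + post_url)
    post_id = segments[1]
    # Step 2: append media/?size=l (lowercase L) after the ID.
    return f"https://{parts.netloc}/p/{post_id}/media/?size=l"


if __name__ == "__main__":
    print(instagram_full_size_url("instagram.com/p/BfZJzBphUr1/?taken-by=username"))
```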

To save Instagram videos, you can use sites similar to KeepVid, such as Gramblast and DreDown.

Facebook

Downloading high-resolution photos from Facebook is much easier than from Instagram, since this feature is built into the site's user interface. Select "Options" and then "Save" from the photo menu to download it from Facebook's servers. The image may not be the same resolution as on the camera, but it is the best you can download from Facebook itself.

Saving videos from Facebook is a little more complicated, but still relatively simple. While watching a video, right-click on it and select “show link.” Now you can copy this link and paste it into a third party site to download the video.

As with YouTube and Instagram, there are several third-party sites that allow you to download videos from Facebook's servers in case the person who uploaded the material deletes it. FBDown.net works great and there are few ads or pop-ups. By pasting the video link that you copied from the source, you can download this video in the best quality from the link in the red box below.

VKontakte (VK)

Saving photos from VK in full resolution is very simple: select “show original” in the photo menu, and the photo will open in the maximum available resolution. Even if the user deletes the photo from their page, the VK link to the full-resolution image will remain forever.

Saving videos from VK is a little more difficult than saving from YouTube, but various free (and paid) tools allow you to do this. For example, GetVideo.org allows you to download videos uploaded to VK in the original resolution. To get a link to a video, right-click on it and select “Copy video link.”

Note that you should not click “Best Quality” on GetVideo. Instead, select the largest specific resolution (e.g. 720p). Downloading files from this site is also quite slow.

Odnoklassniki (OK)

The best way to save photos from Odnoklassniki at full or near full resolution is to select "full screen" and then save the image or take a screenshot.

There are fewer sites for downloading videos from Odnoklassniki than for other social networks, but they still exist, for example Video-Download.co.

Other archiving solutions

Often, the above methods for downloading web pages or videos are not possible because they are protected by privacy settings (which limits access from sites like Archive.today) or use little-known video platforms that sites like KeepVid do not work with. All the solutions given above in this guide are free. However, some other paid or shareware services can make your life easier. We won't tell you how to spend your money, but Bellingcat researchers have successfully used the following solutions (and even developed one themselves):

Some software solutions allow you to download videos from most sites, even if they don't use YouTube or other popular platforms. Apowersoft's Video Download Capture works surprisingly well for almost all embedded videos, as well as (in some cases) live streams. This program detects that a video is playing in the browser and then (usually successfully) downloads it from the original source. However, it requires payment for full use. If you're trying to download a specific video and can't find any other solution, it might be worth taking advantage of the program's trial period. If you can't use the trial period or don't want to buy this program, ask the author of this article (@AricToler) on Twitter for help downloading a specific video.

If web pages are protected by privacy settings, it is very difficult to find a solution that can create a full-fledged third-party archive copy of the site. Simply saving pages in HTML format is extremely inconvenient because it creates many subfolders on your hard drive. An alternative option is to save the page as a PDF, either by printing it to PDF (File -> Print -> Print to PDF) or by using Adobe Create to save the page to PDF.

However, the contents of a PDF file can easily be changed after it is saved. Currently, perhaps the most credible, if not ideal, way to show the contents of a protected page is to record your screen (for a list of simple solutions for this procedure, see here) while you view the page.

Finally, if you do a lot of online research and want an automated tracking solution to help you retrace your steps, we suggest checking out Hunch.ly, developed by Bellingcat contributor and Python master Justin Seitz. When this plugin is active, it automatically saves every page you visit during investigations. If one of these pages is later deleted and you forgot to archive it, Hunch.ly is here to help.

Do you use other sites and resources to archive web pages, images and videos? Share your suggestions in the comments if you think they should be added to this guide.