File Information Tool Set – Technical Tuesday

FITS (File Information Tool Set) is an application that extracts technical metadata from digital files.  It’s extremely important in WGBH’s digital preservation process because it makes use of several other metadata information tools and combines their output into a single XML document.

FITS combines the following tools:

ADL Tool
Apache Tika
DROID
Exiftool
File Utility (windows port)
Jhove
MediaInfo
National Library of New Zealand Metadata Extractor
OIS Audio Information
OIS File Information
OIS XML Information

We use FITS to generate an XML document for every file we preserve.  FITS even generates an MD5 value for each file it processes, so if you’d normally be running an MD5 checksum on your files, you can generate a rich technical metadata document in nearly the same amount of time.
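As a rough illustration of generating one XML document per file, here is a minimal sketch. It assumes FITS is unpacked in a folder referenced by FITS_HOME and is started with its fits.sh script using the -i (input) and -o (output) options; the function name fits_for_directory and the ".fits.xml" suffix are made up for this example, so check the FITS documentation for your version before relying on it.

```shell
# Assumption: FITS lives in $FITS_HOME and provides fits.sh,
# which takes -i <input file> and -o <output file>.
FITS_HOME="${FITS_HOME:-/path/to/fits}"

# Hypothetical helper: one FITS XML report per file in a folder (not recursive).
fits_for_directory() {
    # $1 = folder of files to describe, $2 = folder for the XML reports
    mkdir -p "$2" || return 1
    for f in "$1"/*; do
        [ -f "$f" ] || continue    # skip subfolders and empty matches
        "$FITS_HOME/fits.sh" -i "$f" -o "$2/$(basename "$f").fits.xml"
    done
}

# Example call (paths are placeholders):
# fits_for_directory /folder/path/to/your/files /path/to/your/fits_reports
```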

You can get more information and download FITS on Harvard’s Institute for Quantitative Social Science project website.  The FITS Users Google Group is also a great place to ask questions, report bugs, or suggest improvements.

NameChanger – Technical Tuesday

NameChanger from MRR Software is a Mac OS X 10.3 and above application used to rename digital files.

We’ve found NameChanger extremely useful when grooming hard drives before processing them for digital preservation.  When files are delivered to the Media Library and Archives we find that sometimes a user will have used special characters when naming them.
Special characters include & : ; , ' / and can cause errors in the way we process the files.

For example, a slash / commonly used in writing a date means something different when a computer is processing a file named that way.

Screen Shot 2016-04-19 at 10.33.57 AM

In our digital preservation processes the computer would try to find a text file named “2016.txt” located in folders “4” and then “1”.  It reads the slash as a change in the folder path.
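You can see this for yourself in Terminal. The short sketch below assumes a file name of “4/1/2016.txt” (a date written with slashes) and uses the standard dirname and basename commands to show how the name is split into a folder path and a file name.

```shell
# Assumed example name: a date written with slashes, plus a .txt extension.
name="4/1/2016.txt"

# Everything before the last slash is treated as the folder path.
dirname "$name"     # prints 4/1

# Only the part after the last slash is treated as the file name.
basename "$name"    # prints 2016.txt
```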

NameChanger lets you replace characters across all the files you drag into a session window.

Screen Shot 2016-04-19 at 10.38.50 AM

As shown above, NameChanger actually reads the slash character as a colon.  You just have to tell the application to take all the occurrences of : and replace them with _ or another character that won’t mess up your processing.

It even lets you prepend or append things to your file names, which is also very handy if you are trying to clean up an entire hard drive of files.
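If you prefer the command line, the same kind of character replacement can be sketched in Terminal. This is not how NameChanger works internally; it is a minimal example with made-up helper names (clean_name and rename_clean) that swaps the special characters listed above for underscores in one folder, and skips a file rather than overwriting an existing one.

```shell
# Hypothetical helper: replace each of & : ; , ' / with an underscore.
clean_name() {
    printf '%s\n' "$1" | sed "s|[&:;,'/]|_|g"
}

# Hypothetical helper: rename the files in one folder (not recursive).
rename_clean() {
    # $1 = folder whose files should be cleaned up
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        old=$(basename "$f")
        new=$(clean_name "$old")
        [ "$old" = "$new" ] && continue      # nothing to change
        if [ -e "$1/$new" ]; then
            echo "skipping $old: $new already exists" >&2
            continue
        fi
        mv "$f" "$1/$new"
    done
}

# clean_name "4:1:2016.txt" prints 4_1_2016.txt
# Example call (path is a placeholder):
# rename_clean /folder/path/to/your/files
```

Try it on a copy of your files first, since renaming cannot be undone automatically.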

NameChanger is available for free on the MRR Software website; donations are encouraged.

Digital Format Information – Technical Tuesday

Thanks for checking out our new Technical Tuesday blog where we hope to share useful tips and techniques related to digital preservation and access.
This week’s topic is selecting a digital format for your audio/video files.

Choosing the right digital format for your media files can be tricky. You have to take into account a number of factors and use cases as well as limitations that may be imposed by your institution. Some archives accept and preserve the digital files as they are while others may decide to digitize or transcode to another format more suited for preservation.

We’ve found a good resource to learn more about digital file formats is the Library of Congress website Sustainability of Digital Formats: Planning for Library of Congress Collections.


There, you can search for different codecs and wrappers you may be thinking of using on a project. On each record page there is a good amount of information related to things you may not have been thinking about, like common adoption of a codec or any licensing or patent information. When planning for long term digital preservation it’s good to know if a codec or wrapper you are using may have limited access or support in the future because of those kinds of restrictions. Maybe your format is proprietary and could someday require a license? If you discover a format that’s open source, you may want to also archive a copy of the source code.

Being able to gather your own information and reference a source like this can also support your case if you have to explain your workflows to any other institutional management wondering why you would choose to use format “X”.

One extremely useful bit of information we’ve found on these pages is the “Production phase” entry. It describes the phase of a production in which the format is most likely to be used, for example:
“Middle state, used for storage or archiving” or “Production (initial state) and post production (middle state).”

If you don’t have much experience with different digital file formats, this kind of information can help steer you toward a solution that will fit your project goals.

MD5 Checksum – Technical Tuesday

Hello,
Thanks for checking out the WGBH Media Library and Archives’ blog for our first Technical Tuesday. We’ll be sharing some of the techniques we use in our daily digital preservation and access processes. First up, creating MD5 checksums for files.

What’s an MD5?
An MD5 checksum is a 128-bit hash value, usually written as 32 hexadecimal digits, that can be calculated from a digital file to verify its integrity. It looks like this: 9aee1a70c2055b5eaba6dcb73ffe42cc

At WGBH we generate and compare MD5 values every time we copy a file from one storage medium to another. If the MD5 value is not identical between the source and copied file, it means there was a change to the file somewhere during the transfer and the files are not identical.

We generate and store MD5 checksums for every file we preserve. When we run processes to check the integrity of our digital files, it’s important we have a base value to compare to.

Systems and software we use:
– Computer with Mac OS X 10.5 or higher
– “Terminal” application included with OS X

Generating an MD5 for a file is simple.
Open the Terminal application.
Type

$ md5 /folder/path/to/your/file/example.txt

Press “return”
That should return a value that looks similar to this:

MD5 (/folder/path/to/your/file/example.txt) = 9aee1a70c2055b5eaba6dcb73ffe42cc

That is the MD5 checksum for that example text file.
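Since we compare MD5 values every time we copy a file, that comparison can also be scripted. The sketch below is one possible way to do it, not our exact process: file_md5 and verify_copy are made-up helper names, and the md5sum fallback is there only so the example also runs on systems that don’t include the md5 command.

```shell
# Hypothetical helper: print only the MD5 value of a file.
file_md5() {
    [ -f "$1" ] || { echo "not a file: $1" >&2; return 2; }
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"                          # OS X: -q prints just the value
    else
        md5sum < "$1" | awk '{print $1}'     # fallback on other systems
    fi
}

# Hypothetical helper: compare a source file with its copy.
verify_copy() {
    # $1 = source file, $2 = copied file
    src_sum=$(file_md5 "$1") || return 2
    dst_sum=$(file_md5 "$2") || return 2
    if [ "$src_sum" = "$dst_sum" ]; then
        echo "OK: checksums match ($src_sum)"
    else
        echo "MISMATCH: $1 = $src_sum, $2 = $dst_sum"
        return 1
    fi
}

# Example call (the second path is a placeholder for your copy):
# verify_copy /folder/path/to/your/file/example.txt /path/to/your/copy/example.txt
```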

If you wanted to save that MD5 value to a separate csv report file you can do this:

$ md5 /folder/path/to/your/file/example.txt >> /path/to/your/report/file/md5_report.csv

Press “return” and you’ll find a file called “md5_report.csv” in the folder /path/to/your/report/file/. Inside it will be the MD5 output line for the original file. Because >> appends to the report, running the command again adds a new line instead of overwriting what is already there.
In the WGBH Media Library and Archives, we generate an MD5 csv report file for an entire directory of files on a hard drive using these commands:

$ cd directory
$ find "$(pwd -P)" -not -path '*/\.*' -type f -exec md5 '{}' \; >> /path/to/your/destination/folder/md5_report.csv

Once we have that, we can compare those MD5 values to another list to verify files have been copied successfully.
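One way to do that comparison is sketched below. It assumes both reports contain lines in the “MD5 (path) = value” format shown earlier, and that the source and copy sit in different folders, so the folder prefix has to be stripped before the lists can be compared. The helper names normalize_report and compare_reports are made up for this example.

```shell
# Hypothetical helper: turn "MD5 (path) = value" lines into sorted
# "value<TAB>relative path" lines.
normalize_report() {
    # $1 = report file, $2 = folder prefix to strip from each path
    awk -v root="$2" '
        /^MD5 \(.*\) = [0-9a-f]+$/ {
            hash = $NF
            path = $0
            sub(/^MD5 \(/, "", path)
            sub(/\) = [0-9a-f]+$/, "", path)
            if (index(path, root "/") == 1)
                path = substr(path, length(root) + 2)
            print hash "\t" path
        }' "$1" | sort
}

# Hypothetical helper: report any file whose MD5 differs or is missing.
compare_reports() {
    # $1 = source report, $2 = source folder, $3 = copy report, $4 = copy folder
    cmp_a=$(mktemp) || return 2
    cmp_b=$(mktemp) || return 2
    normalize_report "$1" "$2" > "$cmp_a"
    normalize_report "$3" "$4" > "$cmp_b"
    if diff "$cmp_a" "$cmp_b"; then
        echo "all checksums match"
        cmp_status=0
    else
        cmp_status=1
    fi
    rm -f "$cmp_a" "$cmp_b"
    return "$cmp_status"
}

# Example call (all paths are placeholders):
# compare_reports source_md5_report.csv /Volumes/source copy_md5_report.csv /Volumes/copy
```

Lines printed by diff point to files that are missing from one list or whose MD5 values differ.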

It’s important to note that there are other checksum algorithms besides MD5 that are more resistant to collisions, such as SHA-256.

To calculate the SHA-256 value:

$ shasum -a 256 /folder/path/to/your/file/example.txt

The value should look something like this:
53971fee91ae8530f32dad213d76aac0cc5cf9cb9771e6268b7568e791de0327

We don’t use SHA-256 yet at WGBH because the preservation software and systems are not yet making use of it.

Check back here every Tuesday for more tips!