Friday, November 18, 2011

Is This Bit Rot?

The term "bit rot" gets batted around a lot, but its definition isn't so easy to nail down. ComputerUser's dictionary describes it as "gradual decay of storage media" while the Free On-Line Dictionary of Computing states that it is "a hypothetical disease the existence of which has been deduced from the observation that unused programs or features will often stop working after sufficient time has passed, even if 'nothing has changed.'" Some online technical dictionaries do not include the term. Meanwhile, Wikipedia's bit rot entry is surprisingly short and contains multiple definitions: decay of storage media, decay of data on storage media, and degradation of software programs. (The entry is flagged as needing citations to reliable sources.)

So maybe I'm not alone in wondering what bit rot looks like in a word-processing file. Some of our archival collections, such as the Marshall Green papers and Notgemeinschaft für eine freie Universität records, contain 5.25-inch floppy disks, which often present problems. Today I'm looking at a disk containing files that open on a PC. They suffer from some malady, as indicated by this sample from a letter dated January 21, 1986:

<<<<<>>>>

Thankó verù mucè foò á lovelù luncheoî anä somå splendiä views® Wå 
imaginå yoõ no÷ iî Indiá anä wondeò iæ yoõ arå listeninç tï somå oæ thå 
samå Indianó witè whoí wå talkeä yearó ago® Thå artistó anä economistó 
werå quitå remarkable¬ buô thå politicaì scientistó useä tï talë abouô 
atomiã bombó foò Indiá witè eager¬ burninç eyeó whilå beinç verù carefuì 
noô tï kilì anù insects® (Severaì haä theiò beardó covereä iî whitå 
silk so that no insect would get caught and be stifled there.)

<<<<<>>>>>

The file name, Enid, has no file extension, so it is difficult to determine what software was used to create it. The sample above is from the rendering in MS Word with Windows character encoding, but no matter what software I open the file in, I get some gibberish. But it isn't all gibberish--the last line is completely legible. Sometimes a letter displays correctly, like the "n" in "insect," but not in other cases, like the "n" that should end "luncheoî" in the first line. Is this bit rot?

Regardless of the diagnosis, the next question is what to do. Because we can infer what many of the corrupted characters should be, we can match them to their actual counterpart, as in this sample:

Corrupted version True character
ë k
ì l
í m
î n
ï o
ð p

Then we could use the find-and-replace feature to fix all the corrupted characters. But it is more difficult to infer the correct characters for corrupted numbers. In addition, I assume that matching corrupted to true characters also varies from one file to the next--after all, this decay probably doesn't occur in the same, predictable manner and at a steady rate. Furthermore, we've got to deal with thousands of corrupted files on hundreds of disks. Lastly, if we did restore all the files, how can we ensure that the researchers studying them understand our restoration process and its implications for the authenticity and reliability of the content?

If you have answers to any of the above problems, I think Wikipedia needs you to enhance its bit rot entry.

An example of bit rot. Hoover Institution Archives

No comments:

Post a Comment