The recent past is littered with unsuccessful attempts to store information for the longer term. The problem is technology changes incredibly quickly. Some readers will remember microfiche – small sheets of film viewed through a large light box – at libraries. How many computers have a slot for 3.5 inch floppy disks any more? And forget about 5.25 inch disks. Regardless of the format you are currently storing your precious digital baby photos on, they are likely to be unreadable in 10 years time.
But while your digital baby album may be a personal priority, the issue of digital sustainability is a massive problem on a worldwide scale. Consider the necessity of maintaining crime scene evidence (increasingly stored digitally as images or instrument readings) for a statutory period of time, or the far more serious problem of long-term records of nuclear waste. On a more mundane level, government records, articles in academic online-only journals and the data that was gathered in the name of research all need to be looked after for an extended period of time.
There is a new emphasis on the reuse of data, with the perception that the initial government investment in research should not be ‘single use’. Potentially the same data can be used by other researchers for different research later down the track or to check the accuracy of analysis. The changing focus of research means that while information today is usually collected in digital format, previous data collected in analogue form needs to be digitised for preservation or analysis.
One group, based at ANU, is taking the issue of digital sustainability very seriously. The Australian Partnership for Sustainable Repositories (APSR) is working towards a future where all major public institutions have a digital repository – not only as their back-up plan, but as a way of enabling research. APSR was established in 2002 with a brief to share software tools, expertise and planning strategies; participate in international standards; maintain a technology watching brief; and support national teaching and research with technical advisory services. A partnership between the University of Sydney, University of Queensland, ANU and the National Library, APSR has developed several practical open source software tools to assist the use and running of repositories.
The data problem
A UK report released last year showed that less than 20 per cent of UK organisations surveyed have a strategy in place to deal with the loss of or degradation to their digital resources.
Of the large amount of data collected in academic contexts, much of it is unavailable after it is used for analysis. This is partly because it may be sitting in a box under a researcher’s desk or because it is in a format that only an old computer can read. Indeed, as Margaret Henty and Kevin Bradley discovered when interviewing academics for a survey on data habits, some researchers keep old computers in their offices so they can read old disks that are storing data from previous research.
But this does not have to be the case. There are data archives which look after data in a sustainable way. The Australian Social Sciences Data Archive (ASSDA), based at ANU, was set up in 1981 with a brief to collect and preserve computer-readable data relating to social, political and economic affairs and to make the data available for further analysis.
Great care was put into the Archive when it was established, explains Margaret Henty, the National Services Program Coordinator at APSR. “The use of internationally recognised standards has meant that data can be exchanged and sustained with the minimum of intervention, even though the hardware and operating systems have changed over time,” she says. “This emphasis on standards has been maintained, and the ASSDA has been an important contributor to international work in this field.”
Software - so last millennium
Software incompatibility is a serious problem for long-term data storage. Some software programs are simply no longer readable, such as WordStar or WordPerfect. Even earlier versions of current software are sometimes unreadable. It all comes down to backward compatibility – the ability of newer versions of programs to convert older files into the new format.
The 2007 versions of Microsoft Word, Excel and PowerPoint have been designed so they will not automatically read earlier versions of the same programs. That is, they are not backwardly compatible. People who are buying the new versions of the software will need to also obtain a ‘compatibility pack’ to allow the migration of these files across.
This is a problem for digital repositories. APSR is developing a program called Automated Obsolescence Notification System (AONS), which scans the repository and gives an alert to the repository manager if there are items that have become, or are about to be, obsolete. “The scanning runs overnight and is configured by the repository manager. Scans can occur daily, weekly or monthly,” explains Peter Raftos, Project Manager in Scholarly Technology Services at ANU. “We won’t know until the program is operational how often things will become obsolete. The terms ‘obsolete’ and ‘obsolescent’ depend as much on context as format.”
Considering those home albums once again, the sustainability problem goes further than having hardware able to read your storage medium. What about the format the images have been saved as? Images are commonly saved as JPEG files, but this is simply a standard method of compression of images created by the Joint Photographic Experts Group committee - hence the name. (The video equivalent MPEG was created by the Moving Picture Experts Group.) There is no guarantee these standards will remain in the future as digital image requirements change.
404 error - document not found
The internet began in the 1980s and the World Wide Web software was released with the first web server in 1991. The World Wide Web Consortium was created in September 1994. The internet was originally designed to withstand disaster by having multiple nodes so if one node was wiped out, the information would be sitting elsewhere. But the internet provides its own challenges.
The web in itself is not a way to archive anything. For example search engines regularly rewrite the past by updating their indices, overwriting web pages with new ones. The ‘Wayback Machine’ (www.archive.org) is one attempt to create a record of what has passed, by taking snapshots of the web at given points in time, and unlike web pages these are properly stored in the Internet Archive in California.
Academic publishing is increasingly moving onto the web. The issue of URL permanence gains importance as more articles include online citations. The phenomenon of URLs disappearing over time has been described as ‘link rot’. A recent study showed that over a four year period more than 37 per cent of online citations of top refereed communication journals had disappeared.
An attempt to address the problem of link rot is to use a globally unique persistent identifier. “These are in theory everlasting,” explains Scott Yeadon, a DSpace Committer based at ANU, which means he’s one of the few people in the world who can access the code behind the research-focused repository system. “As long as the identifier resolution service and object repository keeps running it’s not a problem.” There are several persistent identifier programs, with the Handle System being the most well known. Handles provide a single URL based at a separate server which points to a document. The fairly well-known Digital Object Identifier (DOI) System is a subset of the Handle System with a cost attached. “These systems are arguably better than using an arbitrary URL,” explains Yeadon. “They are external and the URL is meaningless, which avoids any problems when the meaning of the object changes.”
Another approach is by a group that started out of Stanford University called LOCKSS (Lots of Copies Keep Stuff Safe) and involves creating a distributed, self-repairing, robust, digital preservation system. LOCKSS uses a process called format migration which converts material to a newer format that the browsers do understand.
Show me the money
Organisations and governments worldwide are grappling with this problem and working towards solutions is proving costly. On an institutional scale there is the general cost of setting up repositories. Even repositories based on open source software have set-up costs, and all repositories have costs associated with their running and maintenance.
The Federal Government has recognised the need to address sustainability issues and has started investing considerable sums into the area. The Systemic Infrastructure Initiative was announced in 2001, funding several projects including APSR, Australian Research Repositories Online to the World (ARROW), the Digital Theses Project, the Meta Access Management System (MAMS) to the tune of tens of millions of dollars. Future funding for sustainability will come out of the National Collaborative Research Infrastructure Strategy (NCRIS).
But there is only so much money in the bucket and this is causing discomfort in research circles. “Researchers are concerned that funds will be taken away from their research to cover sustainability,” explains Henty.
There are no easy solutions to the problem of large scale digital sustainability, but there are things that you can do to look after your own digital information. Run regular back-ups and make sure that when you upgrade your computer you also update all of your files. Put versions of your work into your own institutional repository. And those baby photos? Print them out and put them in an album.
Editor's Note: First published in the Winter 2007 edition of the ANU Reporter. For permission to reproduce this article please contact ANU.