Cultural heritage completo LTC 7-02-2008 12:31 Pagina 250

TOMMASO GIORDANO
EUROPEAN UNIVERSITY INSTITUTE, FIESOLE

Conclusions

Surprisingly, the general attitude of librarians on preservation issues has not changed significantly in the last 15 years, despite the great change that has occurred in knowledge management and cultural communication systems. Nor has the increased awareness of the problem reduced the gap between perception and practice. On the organizational level, however, the difference between the traditional approach to collection development and the emerging model is quite radical, as is the difference between the professional cultures that they respectively imply. We are not, however, dealing solely with a “cultural” issue; it is also a structural matter of vast dimensions that undermines the business model that has until now supported libraries. A radical shift is in progress from an economic model based on the accumulation (and 'capitalization') of acquired resources to a model based on renting resources for temporary use, with no heritage and no guarantees for the future. It is not a change - it is a genetic mutation in libraries, one that challenges the foundations of modern librarianship. “The library is a growing organism,” Ranganathan's 5th law declares; the sustainability of this principle is now an open question.

CATHERINE LUPOVICI
BIBLIOTHÈQUE NATIONALE DE FRANCE, PARIS

WEB ARCHIVING: WHAT SHALL WE PRESERVE AND HOW TO MAKE IT USABLE?

Catherine Lupovici is Head of the Digital Library Department, Direction des Services et des Réseaux, Bibliothèque Nationale de France. This department is in charge of coordinating pilot projects that build on digital library services, technologies and standards. Prior to joining BnF she was for ten years Libraries Activity Manager at Jouve SA, a French electronic publishing company offering data conversion and document scanning services. Her previous experience includes national responsibilities for the cooperation, networking and automation of university libraries in France within the Ministry of Education's DBMIST (Direction des Bibliothèques, des Musées et de l'Information Scientifique et Technique), and heading the National Academy of Medicine Library, Paris.

ABSTRACT

Web content increasingly complements classical resources, and cultural and scientific institutions are looking at integrating Web material into their collection development policies. Because of the shift in scale, the number of items that the Web represents cannot be processed in the traditional way in terms of acquisition, description and access.
The International Internet Preservation Consortium (IIPC) was launched in 2003 by institutions already involved in web archiving. Its objectives are to share an understanding of the specific requirements and to develop appropriate methods, tools and standards that will allow the future interoperability of the repositories, in order to facilitate cross access to large-scale collections and their future use through smart analysis tools yet to be developed.

WHY ARCHIVING THE WEB

Web content increasingly complements the classical resources traditionally acquired by cultural and scientific institutions, and it is obvious today that the Web is the place where classical documents are becoming digital.
The typology of Web content extends from classical publications, with more and more self-publications and grey literature, to emerging types of online content that are continuously appearing, such as digital arts, e-learning and e-business, but also blogs and new public spaces dedicated to discussion and chat.
The Web also has specific technical characteristics that create new challenges for acquisition, preservation and communication. The most significant are:
› The Web represents massive dynamic content, with a great deal of interlinking that adds semantic value to the pages. The Web can therefore only be archived on a periodical basis, and the archive must include the links as part of the content itself. In addition, the mass can only be handled by automatic processes.
› The Web is divided into two parts: the surface web, or visible web, which is accessible to robots for harvesting and indexing, and the deep web, to which robots have restricted access because of passwords or technical limitations.

The Web is growing exponentially and its real size is difficult to know precisely. According to the OCLC web characterization study [1], the number of unique web sites grew from 2,636,000 in 1998 to 7,128,000 in 2000 and 8,712,000 in 2002. Of these, the public sites (sites, or significant portions of sites, accessible free of charge and without any restriction) grew from 1,457,000 in 1998 to 2,942,000 in 2000 and 3,080,000 in 2002.
Other studies focused on the indexable pages discovered by the search engines. A first study in 1997 [2] estimated the number of pages at 200 million, and a similar study conducted in January 2005 [3] estimated the number at 11.5 billion pages.
If we consider the national domain of a European country like France, a snapshot of the .fr domain resulting from a one-month broad crawl run in autumn 2004 contained 121 million files from 500,000 different hosts. The size of the snapshot is 3 TB. By comparison, BnF receives about 60,000 monographs per year through legal deposit. For a national library whose memory mission is to preserve and provide access to what is made publicly available in the country, the number of items to be considered therefore represents a tremendous change of scale when moving to the Web.

HOW NATIONAL LIBRARIES ARE ARCHIVING THE WEB

Collection policy

National libraries started web archiving in the mid 90s with different collection policy approaches.
The National Library of Canada in 1994, then the National Library of Australia in 1996, started with the deposit of digital resources only (e.g. e-journals). They processed them in the same way as classical resources in the deposit workflow. The corresponding collections were catalogued item by item.
The Royal Library of Sweden started in 1997, just like Internet Archive, with periodical automatic harvesting of the national web domain, building on the model of a domain-centric policy. The collections were of course not catalogued, but only indexed by the URLs of the files.
The Library of Congress started thematic, event-based harvesting with worldwide coverage for the presidential elections in 2000 and for 11 September 2001, building on a topic-centric policy. The collected sites were not catalogued in the classical way, but some descriptive metadata were provided at the collection level.
The national libraries already engaged in web archiving created, in partnership with Internet Archive, the International Internet Preservation Consortium [4] in July 2003, with the objectives of sharing their expertise and developing appropriate methodologies and tools for the whole processing chain, considering web archive collections at least at the domain scale.
The work done in the consortium demonstrated that complementary approaches have to be implemented to collect public web sites as well as the deep web.
› Harvesting preserves the interlinking and navigation features of the web; the context has to be recorded at harvest time.
› Deposit by the producer gives access to the deep web. Relationships with the producers facilitate negotiation for the inclusion of some preservation metadata in the deposited files.

The consortium elaborated a format that can handle a large number of harvested files as well as deposited files and can record preservation metadata compliant with the OAIS [5] (Open Archival Information System) information model. The WARC [6] format has been accepted as an ISO TC46 work item and is intended to become an ISO standard.

User access

Beyond the copyright problem, which leads to restricted public access to the results of web harvesting in some countries where legal deposit legislation has already been extended to the Web, search and navigation across huge collections of web archives is a new challenge. The minimum requirements identified by the IIPC members for access tools to collections at the domain level are:
› Indexing and search by URI and by date of harvest
› Functionality like that of full text search engines
› Navigation through the archive by URLs and over time

Those minimum requirements are already offered by IIPC members, and tools in the public domain are available for download on the consortium web site [7].
An example is the interface developed by the Nordic Web Archives project. The demo available online [8] is built with the IIPC tools applied to several snapshots of the institutional sites of IIPC partners. It applies the principles defined by the consortium members for the minimum requirements:
› Search by URI
› Full text search
› Time line with the harvest times available for the selected URL. The resolution of the time line can be set to the following values: minutes, hours, days, months, years, which correspond to the possible values of the frequency parameter used for harvesting.

In addition to those simple features, which are already available whatever the size of the collections, the IIPC members recognized the need for smarter tools enabling automatic classification and semantic organization of the collections. The next programmes of the IIPC will concentrate on such tools.

WEB ARCHIVING AT BNF

BnF started its web archiving experiments as pilots for the extension of the legal deposit legislation to online publications. Legal deposit is the legal obligation for every publisher, printer, producer, distributor or importer of documents to deposit copies of all published materials in the mandated institutions. Originally promulgated for printed books in 1537, legal deposit has been progressively extended to all types of materials of expression and creation, including new technologies as they appeared in France. After books, engravings, music scores, photographs, posters, audiovisual and multimedia documents, the time has come to archive Web sites as well.
In 2006 the legislation was extended in two directions [9]:
› The DADVSI Law (DADVSI stands for Droit d'auteur et droits voisins dans la société de l'information, "copyright and related rights in the information society"; Law 2006-961) was officially published on August 3rd, 2006. BnF had waited a long time for this law, whose Title IV (Clauses 39 to 47) officially establishes Web legal deposit. All the collections created during the pilot phase since 2001 will be made accessible to authorized visitors of BnF, in the reading rooms of the Research Library only. This restriction applies to all legal deposit collections. Access will be authorized only after publication of a specific decree, possibly in 2007.
› In June 2006 a modification of the previous legal deposit decree allowed BnF to negotiate with producers the deposit of electronic files in place of the previous classical medium (for instance paper).

The current archiving process is organised in three complementary collecting methods:
Bulk automatic harvesting of French national domain websites. BnF signed a 3-year agreement with Internet Archive (IA) in 2004 whereby both partners agreed to embark on a research project on the French national domain. Through this partnership, BnF has captured two snapshots of French domain sites, at the end of 2004 and of 2005. Each snapshot contains 118 to 140 million files, equal to a volume of 7 Terabytes. The third snapshot is in progress. The target is two snapshots per year. All the collections are indexed by URL and date of harvest, and the current one will in addition be full text searchable.
Thematic focused harvesting of a selection of sites made by subject or reference librarians. Focus crawls can be thematic or event-based. About 3,500 websites from the 2002 and 2004 French elections have been archived (23 million files). The latest focus crawl (at the end of 2005) ran on 4,500 sites (40 million files, a third of which were blogs). The current one covers the presidential and general elections and runs from October 2006 to June 2007. The election collections are indexed by URL, date of harvest and subject metadata provided by curators at selection time. BnF is also carrying out a user study of students and scholars of the Institut d'études politiques de Paris on the use of the 2002 and 2004 election collections.
Continuous crawl: the online edition of the Journal officiel de la République française (the Government's main publication) is harvested on a daily basis. The full collection of the title has been harvested since the first online issue in June 2004.

IMPACT OF LARGE SCALE DIGITAL COLLECTIONS ON PRESERVATION AND ACCESS POLICIES

The size of digital collections is currently changing drastically in our institutions. The key changes come, on one side, from the mass digitisation programmes that result from the impetus given by initiatives like Google Library or the Open Content Alliance and that aim to make more classical content available on the web, and, on the other side, from the introduction of web content into our cultural heritage through more and more web archiving initiatives.
The mass digitisation of classical analogue collections will allow digital surrogates to be used in place of the originals, not only for remote access but also locally in the library. This reduces the pressure of the classical preservation actions required for the analogue material most in demand, and allows the corresponding resources to be transferred from preservation to digitisation. For instance, at BnF the Digitisation service was attached to the Preservation department in 2004 in order to facilitate this evolution.
The possibility the law gives BnF to choose between the electronic format and the printed output makes it urgent to manage the digital preservation of large-scale digital collections. The top priority for cultural heritage institutions is moving towards setting up trusted repositories that provide risk management as well as long-term preservation and access. In addition, access applications have to improve and offer good search and navigation facilities, not only through the mediation of descriptive metadata but also through smart tools for content analysis and faceted navigation.

[1] O'Neill, Edward T.; Lavoie, Brian F.; Bennett, Rick. - Trends in the evolution of the public web 1998-2002. In D-Lib Magazine, April 2003, vol. 9, n° 4. http://www.dlib.org/dlib/april03/lavoie/04lavoie.html
[2] K. Bharat and A. Broder. - A technique for measuring the relative size and overlap of public search engines. WWW conference 1998.
[3] A. Gulli and A. Signorini. - The indexable web is more than 11.5 billion pages. WWW conference 2005. http://www.cs.uiowa.edu/~asignori/web-size/
[4] IIPC web site. http://netpreserve.org
[5] Reference Model for an Open Archival Information System. Consultative Committee for Space Data Systems. Blue book, January 2002. http://public.ccsds.org/publications/archive/650x0b1.pdf
[6] WARC, Web ARChive file format. http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[7] http://netpreserve.org/software/downloads.php
[8] WERA (Web aRchive Access). http://nwa.nb.no/wera/index.php
[9] Legal deposit: five questions about Web Archiving at BnF. http://www.bnf.fr/pages/version_anglaise/depotleg/dl-internet_quest_eng.htm