Project Document Cover Sheet

Project Information
Project Acronym: YS-IMG
Project Title: Yale-SOAS Islamic Manuscript Gallery
Start Date: 1 September 2009
End Date: 31 August 2009
Lead Institution: School of Oriental and African Studies, University of London
Project Director: John Robinson
Project Manager & contact details: Huei-Lan Liu, [email protected]
Partner Institutions: Yale University
Project Web URL: http://www.soas.ac.uk/ysimg/ and http://www.library.yale.edu/img/
Programme Name (and number): JISC-NEH Transatlantic Digitisation Collaboration Grants
Programme Manager: Alastair Dunning
Document Name
Document Title: Project Plan
Reporting Period:
Author(s) & project role: John Robinson
Date:
Filename: Project plan_ysimg
URL:
Access: ⌧ Project and JISC internal; General dissemination
Document History
Version | Date | Comments
1 | 26/10/09 |
2 | 24/11/09 |
Document title: JISC Project Plan
Last updated:
Project Acronym: YS-IMG
Version: v1.0
Contact: John Robinson
Date: 1 October 2009
JISC Project Plan
Overview of Project
1. Background
Most manuscripts accessible online today are Psalters, books of hours, bestiaries, and
other medieval manuscripts from Europe. The study of Western manuscripts is well supported
and enjoys high standards of critical scholarship. The Middle Eastern manuscript culture is
less well known, primarily because the materials are far harder for qualified scholars to examine or
study. Thus, by enhancing access, the possibilities for making a material change in our understanding
of Middle Eastern cultures of the medieval, early modern and modern times are great. Yale and SOAS
seek to join collections, resources, staff, and expertise to create a united source of reference works
related to Middle Eastern manuscripts and to develop the technical apparatus needed for connecting
reference materials to original works. The proposed selection represents a wealth of intellectual
interests that will highlight the contributions of Middle Eastern scholars, among them philosophers,
poets, physicians, and scientists. The project partners plan to provide new and enhanced access to
these important materials, as well as to develop a technical model that can be used by other libraries.
Increasing interest and attention has been given of late to the scholarly community that specializes in
Arabic and Middle Eastern Studies. In the U.S., it is clear that there is an enormous rise in enrolments
in Arabic classes. In May 2007, the Modern Language Association (MLA) published “Enrolments in
Languages Other Than English in United States Institutions of Higher Education, Fall 2006,” in which
Furman et al. presented comparative data on foreign language enrolments from 1998 to 2006. The
authors reported that “Arabic continued its impressive expansion: from 1998 to 2002, it lifted its
enrolments by 92.3%, and between 2002 and 2006 by a remarkable 126.5%.” In the UK, Ph.D.
theses reflect the increase in enrolment and attention paid to Middle Eastern studies. In 1997, 55
theses were accepted; the average per year by 2006 was 86. Birmingham, SOAS, and Oxford are the
top three institutions granting PhDs in Middle Eastern topics since 1997.
Concurrent to this visible growth, specific interest has been focused on the research needs of a
scholarly community that was established centuries ago. In 2007, responding to an initiative in the
United Kingdom regarding the importance of supporting Islamic Studies in higher education, the Joint
Information Systems Committee (JISC) issued a call for an investigation into user needs within the
field.
The University of Exeter won the bid to complete the study, and in June 2008 the project team
published its results, based on data gathered from online questionnaires, focus groups, and telephone
interviews. In addition, the study’s authors reviewed reading lists from UK institutions, doctoral theses,
and existing online gateways to Islamic Studies materials. The recommendation ranked first by the
authors was the creation of a gateway to Islamic Resources, including primary texts, fully digitised
Islamic manuscript catalogues, and reference tools such as dictionaries and Islamic websites.
A worldwide initiative has further concentrated on manuscripts in particular. Following the First Islamic
Manuscript Conference held at King's College, Cambridge, in 2005, the conference participants
encouraged the founding of a global association to coordinate the efforts of scholars and librarians
working with Islamic manuscripts. In the following year, forty-five founding members established The
Islamic Manuscript Association (TIMA), an international group pledged to protect Islamic manuscript
collections and to support those individuals working with these collections. One of TIMA's ongoing
projects focuses on facilitating access to these manuscripts.
Yale and SOAS have collaborated to form a small but selective list of resources in Arabic and
Western scripts from their collections. In addition, the project organizers have culled lists of existing
digital copies of Arabic and Middle Eastern manuscripts held in repositories of prominent libraries
around the world, for example in the digital library at the Beinecke as well as the British Library. We
intend this selection to be a pilot project that is scalable and extensible to other collections on the
partner campuses or in other libraries, such as the Digital Shikshapatri collection at Oxford and the
Genizah manuscript collections at Cambridge.
The proposed project builds on work done at Yale University Library in three digital initiatives related
to the Middle East. First, the OACIS project (http://www.library.yale.edu/oacis/) laid the foundation by
creating an electronic union list of serials published in and about the Middle East. From this
beginning, the library team expanded the usability of the bibliographical catalogue by digitising full text
articles from a selection of academic journals published in the Middle East. Second, with funding from
the National Endowment for the Humanities (NEH), this digitisation began with the Iraq ReCollection
project, which converted nine (9) Iraqi journals (104,000 pages). (http://www.library.yale.edu/digIraq/)
Third, the full text entries from this digitisation effort became the first deposited articles in a searchable
repository developed as part of the AMEEL project. Like OACIS, Project AMEEL began with funding
from the U.S. Department of Education. (http://www.library.yale.edu/ameel/) The AMEEL project also
added a regional selection of academic journals spanning Tunisia to Saudi Arabia. Development work
during the AMEEL project has produced the technical infrastructure for search, retrieval and display of
the journal articles using the open source FEDORA repository software. At present, over 125,000
pages have been deposited with a goal of finishing 240,000 pages by the fall of 2009.
While accomplishing these goals, the Yale project team has gained considerable expertise in Arabic
text digitisation. Starting in 2005 with assistance from the digital staff at the Bibliotheca
Alexandrina, the Yale team has since formulated its own digitisation workflow. Further, the team has
worked to automate as much of this workflow as possible, in order to keep labor costs low while
producing high quality scanned images and OCR output. Additionally, the Yale team held a
digitisation workshop, geared toward the needs of U.S. academic libraries, at the November 2008
annual Middle Eastern Studies Association conference.
(http://www.library.yale.edu/ameel/MESAworkshop/index.htm )
Digitisation projects at SOAS are increasing in number and scope. The Endangered Languages
Archive (ELAR), started in 2005 as part of an international network of digital endangered language
archives, permits scholars to deposit documentations and descriptions of endangered languages.
(http://elar.soas.ac.uk/) Begun in Fall 2008 with JISC funding, the Fürer-Haimendorf project plans to
digitise, research, catalogue and mount online approximately 20,000 photographs from the
Fürer-Haimendorf archive held at SOAS. (http://www.soas.ac.uk/furer-haimendorf/) Christoph von
Fürer-Haimendorf (1909-1995) amassed an important collection of his own photographs, film, and written
materials during fifty years of scholarship on tribal cultures in South Asia and the Himalayas. The
project is part of a longer-term strategy to mount online the entire Fürer-Haimendorf archive, as well
as other special collections at SOAS.
Since the library at SOAS, unlike other UK institutions such as Oxford and Cambridge, has not
previously worked on digitisation projects involving manuscripts or printed text, the Yale-SOAS partnership aims
not only to address user needs by uniting essential reference material related to Arabic and Middle
Eastern manuscripts, but also to share expertise gained at Yale and increase digitisation capacities at
SOAS, especially as the work relates to the digitisation of Arabic text and the integration of digital
resources.
2. Aims and Objectives
Yale University Library and the School of Oriental and African Studies (SOAS) seek to improve
online access to trans-Atlantic collections of digitised manuscripts, manuscript catalogues, and
dictionaries by creating a virtual archive, open and freely accessible, for researchers working in
the field of Arabic and Middle Eastern Studies.
2.1 Digitisation endeavour
• To create an integrated set of full-text digital resources supporting manuscript research from
manuscript catalogues and dictionaries by converting materials in Arabic, Persian, and
Western scripts (primarily Latin, German, Spanish, and French) and depositing these into
searchable repositories.
• To augment existing digital collections of Arabic and Persian manuscripts by scanning,
depositing, and indexing selected Yale- and SOAS-held historical manuscripts.
2.2 Integration Project
• To develop an infrastructure to integrate manuscripts with related reference resources.
• To build a suite of tools that will analyse digitised materials and construct internal
cross-references for connecting the materials in the archive.
3. Overall Approach
3.1 Preparation and processing of materials
We will use a combination of in-house work and outsourcing – sharing lessons learned from previous
projects – to complete all image capture. Varying costs will be factored into the budget to
account for this hybrid approach.
The manuscripts will be processed with appropriate supervision by related curators and
Preservation librarians at on-campus facilities to limit the exposure of the fragile documents to
outside conditions. The scanning of the dictionaries and catalogues may be outsourced,
depending on their physical condition.
Three OCR software products will be used: 1) Automatic Reader – OCR Gold from Sakhr
Software Co., in Cairo, Egypt; 2) VERUS, the OCR product from NovoDynamics in Ann Arbor,
Michigan; and 3) ABBYY FineReader, from ABBYY, an international company founded by David Yang
for the automated translation of Russian dictionaries. ABBYY FineReader is known for high-accuracy
conversion of text, especially in Western scripts. Sakhr and VERUS were developed for Arabic
text specifically. VERUS, due to its original design, can handle a mix of languages and degraded
documents better than Sakhr. On the other hand, Sakhr’s engine is based on a study of modern
newspapers from the Middle East and thus recognizes a wider range of vocabulary. By
incorporating two different Arabic OCR packages alongside FineReader in the digitisation workflow, we can
accommodate and manage varying conditions found in the selected materials. The OCR
conversion of texts with a mixture of languages may require periodic modifications to existing
workflows.
The adjustments will be managed on a timely basis and will be documented so that workflow
knowledge may be shared with other libraries. We will follow established best practices for TEI
mark-up as well as the Project AMEEL model to include Dublin Core for repository organization
and MARCXML metadata for librarian perusal. Individuals with language expertise, as well as
metadata training, will manage mark-up and quality assurance tasks. The Yale team will share
best practices and guidelines with the SOAS team. For example, we will perform quality control
checks on a statistical sample of all finished work, in accordance with American National
Standards Institute ANSI/ASQ Z1.4-2003. A random sample equal to 10% of the total batch of
files shall serve as the inspection sample for each file type. In tiered approaches employed in
previous projects, a batch failing the 10% test is rechecked using a different sampling of 5% to
determine final acceptance or rejection (and reprocessing) of the batch. Experience at SOAS
suggests that, while post-digitisation inspection is necessary, it is essential to build quality
control into the digitisation process in order to minimize the need to reprocess material. SOAS
has adopted a strategy of embedding essential metadata within the image file using IPTC tags;
this reduces the risk of orphan files (where image files cannot be found by the catalogue
database or subsequently identified). It also facilitates the parallel development of the database
(Phase 2) while digitisation proceeds (Phase 1), with the embedded metadata being
automatically ingested into the cataloguing database when ready.
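The tiered quality check described above (a 10% random sample, then a different 5% sample if the first fails) can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the acceptance rule (every sampled file must pass) is an assumption, since the plan does not state the acceptance numbers it takes from ANSI/ASQ Z1.4-2003.

```python
import math
import random


def draw_sample(batch, percent, rng, exclude=()):
    """Pick a random sample sized at `percent` of the whole batch."""
    pool = [item for item in batch if item not in exclude]
    size = min(len(pool), max(1, math.ceil(len(batch) * percent / 100)))
    return rng.sample(pool, size)


def inspect_batch(batch, passes_check, rng=None):
    """Two-tier acceptance: a 10% sample, then a fresh 5% sample on failure.

    Assumption: a sample passes only if every file in it passes.
    Returns "accepted" or "rejected" (a rejected batch is reprocessed).
    """
    rng = rng or random.Random()
    first = draw_sample(batch, 10, rng)
    if all(passes_check(f) for f in first):
        return "accepted"
    # Recheck with a different 5% sample drawn from files not yet inspected.
    second = draw_sample(batch, 5, rng, exclude=set(first))
    if all(passes_check(f) for f in second):
        return "accepted"
    return "rejected"
```

In practice `passes_check` would wrap whatever per-file inspection the team agrees on (image resolution, OCR confidence, metadata completeness); here it is left as a caller-supplied function.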
3.2 Organization of and access to materials
3.2.1 Metadata generation and Cross-reference links
All metadata created, for ingest to the Fedora archive, will be stored in XML format. We will
begin with MARC21 records extracted from the partners’ Online Public Access Catalogue
(OPAC). Within the first month of the project, the technical team will convert these records
into Dublin Core (DC) and MARCXML files at the title, volume, and author levels for
manuscript catalogues and dictionaries; the manuscript records will have DC and
MARCXML at the title, author, and accession level. The MARCXML file will be available
when viewing searched materials, so that librarians may review encoded data. The
customized Dublin Core (DC) file, added at the time of ingestion into the digital archive, will
include tags to describe the resource as well as external and redirect instructions in the
technical metadata for those materials with recognized cross-references. We have, initially,
identified three Use Cases as follows:
♦ Use Case #1: a direct link exists from a digitised manuscript catalogue to a digital copy of
a manuscript freely accessible via the Internet.
♦ Use Case #2: a “See also” link from a digitised manuscript catalogue to a digital copy of
another manuscript by the same author or, if possible, on the same subject.
♦ Use Case #3: a direct or a “See also” link from a digitised manuscript catalogue to
a digitised dictionary entry.
We will begin the proof of concept with those Arabic or Persian manuscripts already
available in Yale’s BRBL digital library. For those resources identified as existing in non-partner
digital libraries, we will seek permission and technical specifications to create and
maintain cross-reference links.
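As a rough illustration of the kind of Dublin Core file described in this section, the sketch below builds a minimal record with Python's standard library. The wrapper element name, the identifier scheme, the sample values, and the use of dc:relation to carry a cross-reference link are all assumptions made for the example; the project's customized DC profile is not specified here.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(title, creator, identifier, see_also=()):
    """Build a minimal Dublin Core record; each cross-reference becomes dc:relation."""
    root = ET.Element("record")  # hypothetical wrapper element
    fields = (("title", title), ("creator", creator), ("identifier", identifier))
    for tag, value in fields:
        ET.SubElement(root, f"{{{DC_NS}}}{tag}").text = value
    for url in see_also:
        ET.SubElement(root, f"{{{DC_NS}}}relation").text = url
    return root


# Placeholder values, not taken from the partners' catalogues.
record = build_dc_record(
    title="Example manuscript catalogue",
    creator="Example cataloguer",
    identifier="ysimg:catalogue-0001",
    see_also=["http://example.org/manuscripts/0001"],
)
xml_text = ET.tostring(record, encoding="unicode")
```

A "See also" link (Use Case #2) and a direct link (Use Case #1) could be distinguished by a qualifier on the relation element, but that design decision is left open in this sketch.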
3.2.2 Cross-collection searching: EPrints / Fedora [1]
Project AMEEL at Yale uses Fedora 2.2, an open source software product. The Fedora
framework provides for OAI compatible harvesting for resource discovery, thus permitting
other repositories to discover the newly generated metadata. EPrints, in use at SOAS and
other UK academic institutions, is also open source software for OAI compliant
repositories. Both repository approaches share commonalities, which will be explored to
resolve the connectivity needed for searching the joined collections simultaneously. While
the design of the front door to the proposed digital archive may appear the same, the
underlying architecture at each library site will correspond to the requirements of the
repository software in use. The technical team from both campuses will develop modules
of code that can be adapted to both software approaches.
[1] Note - Each partner will develop its own front end to host the digital objects. Yale will integrate the digitised
materials with its existing archives through its AMEEL (An Arabic and Middle Eastern Electronic Library)
portal site; SOAS will mount its own collections on the Digital Archives and Special Collections website. A suite of
tools will be developed to enable cross-collection searching and cross-references for connecting materials in the
archive.
3.2.3 Durable URLs and Citation creation
Persistent identifiers are essential in developing a digital archive since storage media and
formats are sure to change over the life of a digital library. Without persistent identifiers,
citation links will cease to function and cause frustration for library patrons. Some
current standards for persistent identifiers include:
• PURLs, or Persistent Uniform Resource Locators, which use an intermediate resolution
service that points to the URL of the digital object;
• DOIs, or Digital Object Identifiers, which are managed by an open membership consortium and
provide a name to identify a digital object that remains unchanged over the life of the
object. The location of the object may change, but the name associated with it will not;
• Handles: a handle server is a naming management system that creates a unique
identifier at the time an object is added to a repository.
The Yale and SOAS technical teams will collaborate on the configuration of persistent
identifiers compatible with both repository approaches.
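The idea shared by all three schemes, a stable name resolved through an intermediate service to a location that may change, can be sketched in a few lines. The class below is a toy in-memory model, not the API of any PURL, DOI, or Handle server; the identifier and URLs in the usage lines are placeholders.

```python
class IdentifierResolver:
    """Toy resolution service: stable names map to current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, name, url):
        """Assign a name once, at the time the object enters the repository."""
        if name in self._locations:
            raise ValueError(f"identifier already assigned: {name}")
        self._locations[name] = url

    def relocate(self, name, new_url):
        """Move the object; the name that citations use stays the same."""
        if name not in self._locations:
            raise KeyError(name)
        self._locations[name] = new_url

    def resolve(self, name):
        """Return the current location for a persistent name."""
        return self._locations[name]


resolver = IdentifierResolver()
resolver.register("ysimg/ms-0001", "http://example.org/old-server/ms-0001")
resolver.relocate("ysimg/ms-0001", "http://example.org/new-server/ms-0001")
current = resolver.resolve("ysimg/ms-0001")
```

A citation that records only the name keeps working after the move, which is the property the paragraph above asks of any scheme the two teams adopt.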
3.3 Storage, maintenance and protection of data
The Fedora repository will reside on a Linux server at Yale; the EPrints repository will also be
Linux-based but at SOAS. The archival TIFFs will be stored separately from the repository at
each institution, which will arrange suitable off-site backup. Both teams will follow technical
policies, including sensible naming conventions and file structure, so that content can be
integrated effectively and efficiently as the project progresses. The servers will include systems
management subscriptions to keep the server software protected and current. Information
Technology support staff at the partner locations will regularly monitor server traffic, guard
against attacks, and apply widely accepted security practices to development and maintenance
servers. SOAS uses a mix of out-sourced and in-house server solutions, working in partnership
with the University of London Computer Centre (ULCC). It is likely that ULCC will be used to
host the EPrints repository while the archive of high resolution image files will be on SOAS’s
own data storage solution.
4. Project Outputs
• Digital copies of 16,800 pages in fifteen significant manuscript catalogues and six dictionaries,
and of sixteen historical manuscripts (approx. 3,000 leaves). See Appendix A for a consolidated
list of selected materials for digitisation.
• OCR text extraction and metadata of selected manuscript catalogues and Arabic and Persian
dictionaries.
• Cross-reference links from the initial set of existing scanned manuscripts to newly digitised
catalogues and dictionaries.
• OAI configuration for the newly digitised materials to be discovered by other electronic
resources and indexed by internet search engines.
• Digitised materials to be deposited into open and freely accessible networked repositories.
• Findings regarding the new digital collection and project documentation to be published on
the project website for use by other academic libraries.
5. Project Outcomes
• Support manuscript research from manuscript catalogues and dictionaries, many of which
exist only in printed form with publication dates from the 19th century, by converting materials
in Arabic, Persian, and Western scripts (primarily Latin, German, Spanish, and French) and
depositing these into searchable repositories.
• Serve as a scalable and extensible model for other special collections and libraries rich in
manuscripts and related reference materials.
• Completion of transatlantic specialist collections by making them electronically accessible via
the internet.
• Enhanced preservation of rare and fragile materials.
• Highlight the contribution made to world knowledge by Arab philosophers, physicians, and
scientists.
6. Stakeholder Analysis
Stakeholder | Interest / stake | Importance
SOAS and Yale | Status as leaders in Islamic studies | High
SOAS and Yale Libraries | Models for other special collections and libraries to follow suit | High
Researchers around the world | Enhanced access to valuable historical manuscripts linked to robust reference materials | High
Transatlantic collections | Online access to rare resources made easy through cross-repository searching | Medium to High
7. Risk Analysis
Risk | Probability (1-5) | Severity (1-5) | Score (P x S) | Action to Prevent/Manage Risk
Staffing
Loss of staff, in particular specialist staff | 3 | 5 | 15 | Comprehensive documentation, including contractual obligations, will help to minimize the risk. In the case of the specialist staff, SOAS has recruited two people; Yale has cross-trained staff.
Project manager unable to spend sufficient time on project due to other work commitments | 3 | 4 | 12 | SOAS is a small institution, so there is no full-time back-up for individual members of staff. Project tasks will be allocated to other members of the project team, effectively spreading the load. Yale’s Project Director will monitor.
Not able to recruit suitable staff | 2 | 4 | 8 | The project team is already in place and it is intended that a number of digital photographers will be recruited from the student body at SOAS to ensure that there is a pool of competent staff to complete the project.
Organisational
Project fails to keep to schedule | 2 | 3 | 6 | Effective monitoring, regular meetings, SMART objectives, and contingency timing built into the key targets will ensure that the project keeps to schedule.
Project scope creep and/or over-run | 1 | 4 | 4 | The project has deliberately chosen a small list of important materials. Regular review of objectives will ensure that the project is restricted to its initial aims. Any areas of the collection identified as potential projects for the future will be documented and form part of the “next steps” strategy.
Lack of communication between project partners | 1 | 3 | 3 | Regular team meetings and prompt sharing/distribution of documentation; key project outcomes and documents stored in central facility available to all partners; regular reviews of progress and objectives against plan.
Cost over-run/failure to keep to budget | 2 | 3 | 6 | There will be regular reviews of financial expenditure and most of the costs are up front in terms of equipment purchase and staff costs.
Failure to maintain metadata creation and cataloguing targets | 2 | 4 | 8 | The project team has agreed to ensure both quantity and quality outputs for cataloguing and metadata.
Technical
Poor quality metadata | 2 | 4 | 8 | Metadata guidelines will be drawn up based on Dublin Core and other successful web-based projects of a similar nature. The team will follow mutually agreed-upon standards for ensuring quality control of the metadata.
Poor quality images | 2 | 4 | 8 | The digitisation assistants carrying out the work have received or will receive training on the equipment. The team will test quality at regular intervals.
Loss of digital images | 2 | 5 | 10 | Ensure back-up and off-site storage of images.
Inappropriate storage | 1 | 2 | 2 | Workflow will include storing files on a centralized resilient RAID system at both partner campuses.
Technology failures | 2 | 3 | 6 | Essential equipment will be under warranty or support. Potential down time will be included as a contingency in the work plan.
External suppliers
Failure of contractors to deliver Web system | 2 | 5 | 10 | Clear requirements and timescale to be agreed with contractor, with penalty clauses. Clear and frequent milestones to be agreed with contractor.
Legal
IP issues | 1 | 5 | 5 |
Copyright issues | 1 | 5 | 5 | Copyright permission is on file at Yale for the only title post-1923. Yale will consult with its General Counsel. A copyright audit will be carried out in consultation with SOAS’s Information Compliance Manager.
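The Score column in the risk table is simply probability multiplied by severity. The short sketch below recomputes it for a few rows taken from the table and ranks them; the variable and function names are illustrative only.

```python
# (risk, probability 1-5, severity 1-5), copied from the risk table.
RISKS = [
    ("Loss of staff, in particular specialist staff", 3, 5),
    ("Project manager unable to spend sufficient time on project", 3, 4),
    ("Loss of digital images", 2, 5),
    ("Inappropriate storage", 1, 2),
]


def risk_score(probability, severity):
    """Score = P x S, as in the table header."""
    return probability * severity


# Highest score first, so the most pressing risks are reviewed first.
ranked = sorted(RISKS, key=lambda r: risk_score(r[1], r[2]), reverse=True)
scores = [risk_score(p, s) for _, p, s in ranked]
```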
8. Standards
Name of standard or specification | Version | Notes
Metadata standards: TEI, MARCXML, Dublin Core | | We will follow established best practices for TEI mark-up as well as the Project AMEEL model to include Dublin Core for repository organization and MARCXML metadata for librarian perusal. Individuals with language expertise, as well as metadata training, will manage mark-up and quality assurance tasks. The Yale team will share best practices and guidelines with the SOAS team.
Image standards: TIFF, JPG | | Following the Project AMEEL model, the archival format will be TIFF, while the display format, i.e. from the repository to the page viewer, will be JPG to achieve faster online delivery.
Optical Character Recognition: OCR Gold, VERUS, ABBYY | | Three OCR software products will be used: 1) Automatic Reader – OCR Gold from Sakhr Software Co., in Cairo, Egypt; 2) VERUS, the OCR product from NovoDynamics in Ann Arbor, Michigan; 3) ABBYY FineReader, from ABBYY, an international company founded by David Yang for the automated translation of Russian dictionaries.
Repository: Fedora, EPrints, OAI-PMH | | Project AMEEL at Yale uses Fedora 2.2, an open source software product. The Fedora framework provides for OAI compatible harvesting for resource discovery, thus permitting other repositories to discover the newly generated metadata. EPrints, in use at SOAS and other UK academic institutions, is also open source software for OAI compliant repositories. Both repository approaches share commonalities, which will be explored to resolve the connectivity needed for searching the joined collections simultaneously. The technical team from both campuses will develop modules of code that can be adapted to both software approaches.
Standards for digitization:
media | resolution | ratio | archival format | compression | number of digital copies
manuscripts | 600ppi | 1:1 | TIFF | uncompressed or lossless compression; no LZW | 4 (small, medium, large, thumbnail)
text (bitonal) | 300ppi* | 1:1 | TIFF** | CCITT Group 4+ | 1***
text (greyscale) | 300ppi* | 1:1 | TIFF** | LZW++ | 1***
* The resolution for text is set at 300ppi based on experience with text extraction using OCR
software.
** Following the Project AMEEL model, the archival format will be TIFF, while the display format, i.e.
from the repository to the page viewer, will be JPG to achieve faster online delivery.
*** The thumbnail will be generated at the time automated scripts deposit each digital file into the
repository.
+ CCITT Group 4 is an image compression scheme named for the Comité Consultatif International
Téléphonique et Télégraphique (CCITT), a telecommunications standards body created in 1956.
++ LZW is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv,
and Terry Welch.
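To show what these resolutions imply for file sizes, the sketch below converts a page size to pixel dimensions at a 1:1 ratio and estimates the uncompressed image size. The resolutions come from the table above; the bit depths and the A4-sized page used in the usage line are assumptions for illustration, since the plan does not state them.

```python
import math

# Resolutions from the standards table; bit depths are assumed for illustration.
SPECS = {
    "manuscripts": {"ppi": 600, "bits_per_pixel": 24},    # colour depth assumed
    "text_bitonal": {"ppi": 300, "bits_per_pixel": 1},
    "text_greyscale": {"ppi": 300, "bits_per_pixel": 8},  # 8-bit grey assumed
}


def pixel_dimensions(width_in, height_in, ppi):
    """At a 1:1 capture ratio, pixels = inches x pixels per inch."""
    return round(width_in * ppi), round(height_in * ppi)


def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Raw image payload before any TIFF compression or header overhead."""
    return math.ceil(width_px * height_px * bits_per_pixel / 8)


def estimate(media, width_in, height_in):
    """Estimate the uncompressed size of one page for a media type."""
    spec = SPECS[media]
    w, h = pixel_dimensions(width_in, height_in, spec["ppi"])
    return uncompressed_bytes(w, h, spec["bits_per_pixel"])


# Hypothetical A4-sized page (8.27 x 11.69 inches) scanned as greyscale text.
a4_grey = estimate("text_greyscale", 8.27, 11.69)
```

Multiplying such a per-page estimate by the planned page counts gives a first approximation of archival storage needs before compression.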
9. Technical Development
9.1 Metadata generation and cross-reference links
We will conduct the preliminary analyses using an Open Source product called AraMorph,
which returns morphological tokens, or categorized blocks of text for lexical analysis. The
resulting index will be reviewed manually to determine rules for extracting essential keywords,
titles, place names, and subject headings.
In addition, we will rely on language experts to develop crosswalks, or interpretive tables, to
link modern spellings to the many transliteration schemas, varying over time, which are present
in the selected materials. For example, the manuscript listings for ibn Jazlah and al-Suyuti in
the Bodleian catalogue and the Rieu supplement appear as Ali B. Djazla and Alsoiuthi.
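A crosswalk of this kind can be thought of as a lookup table from historical transliterations to a preferred modern form. The sketch below uses only the two pairs given in the example above; the table layout and the function name are hypothetical, and a real crosswalk would be compiled by the language experts.

```python
# Historical spelling (lower-cased) -> modern form, from the example above.
CROSSWALK = {
    "ali b. djazla": "ibn Jazlah",
    "alsoiuthi": "al-Suyuti",
}


def normalise_name(variant):
    """Return the modern spelling for a known variant, else the input unchanged."""
    key = " ".join(variant.lower().split())  # ignore case and extra spaces
    return CROSSWALK.get(key, variant)
```

Returning unknown names unchanged keeps the lookup safe to apply across a whole index; unmatched entries can then be collected for manual review.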
Further, in order to compile a full listing of existing digital copies of targeted manuscripts, we
will employ two methods: 1) OAI harvesting of appropriately configured online databases, and
2) student workers, with language skills, to conduct Internet searches and review the OAI
harvested results. The Yale team will share the methodology and results with the SOAS
technical team in order to determine mutual processes for link creation.
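For the OAI harvesting step, a harvester issues an OAI-PMH ListRecords request and reads Dublin Core fields from the response. The sketch below builds such a request URL and extracts titles from a response held in a string; the base URL and the sample response are placeholders, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml):
    """Collect every dc:title found in an OAI-PMH response document."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter(f"{{{DC_NS}}}title")]


# Placeholder response, trimmed to the elements this sketch reads.
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><record><metadata>"
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Example manuscript</dc:title>"
    "</oai_dc:dc></metadata></record></ListRecords></OAI-PMH>"
)

url = list_records_url("http://example.org/oai")
titles = extract_titles(SAMPLE_RESPONSE)
```

The harvested titles would then go to the student reviewers described above, who confirm whether each record is a digital copy of a targeted manuscript.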
9.2 Page viewing
As part of the AMEEL project at Yale, the technical team has developed a page turner to
simulate the reading experience, with adjustable page size and navigation. The technical team
for the proposed project will work to adapt these methods, as well as other suitable Open
Source methods, as needed to deliver a page viewer of high usability for library patrons. There
are two basic outcomes from this effort: 1) to permit search word highlighting, and 2) to
manage oversize displays from manuscript folios, catalogues, and dictionaries so that the
patron can work with more than one display at a time as well as enlarge specified sections of
the works.
9.3 Repositories and cross-collection searching
Yale’s Project AMEEL has developed a full-text repository using the Fedora open source
framework; SOAS uses EPrints for its digital libraries. The two technical teams will create the
necessary tools to permit transatlantic cross-collection searching of each other’s archives,
regardless of the entry point chosen by the library patron. We will ensure that interoperability
between EPrints (SOAS) and Fedora (Yale) is a key goal during the project.
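One way to picture cross-collection searching is a thin layer that sends the same query to each repository and merges the hits. The sketch below models that with two in-memory collections; it does not use the Fedora or EPrints APIs, and the record titles, URLs, and function names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    title: str
    source: str
    url: str


def make_search(records):
    """Wrap a list of (title, url) pairs as a simple title search function."""
    def search(term):
        needle = term.lower()
        return [(t, u) for t, u in records if needle in t.lower()]
    return search


def search_all(term, sources):
    """Query every source and merge the hits, sorted by title then source."""
    hits = []
    for name, search in sources.items():
        for title, url in search(term):
            hits.append(Hit(title=title, source=name, url=url))
    return sorted(hits, key=lambda h: (h.title.lower(), h.source))


# Placeholder collections standing in for the two repositories.
YALE_RECORDS = [("Example catalogue of Arabic manuscripts", "http://example.org/yale/1")]
SOAS_RECORDS = [
    ("Example Persian dictionary", "http://example.org/soas/1"),
    ("Example catalogue of Persian manuscripts", "http://example.org/soas/2"),
]

hits = search_all(
    "catalogue",
    {"Yale (Fedora)": make_search(YALE_RECORDS), "SOAS (EPrints)": make_search(SOAS_RECORDS)},
)
```

Because each source is just a function returning title and URL pairs, either entry point can present the same merged result list, which is the behaviour this section asks for.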
10. Intellectual Property Rights
Out-of-copyright resources will be our first priority for this digitisation project, in order to avoid the
time-consuming efforts needed for seeking copyright permission. However, exceptions will also be
considered if they enhance accessibility. We will carry out copyright clearance and obtain permissions
for digitisation and online access.
The IPR of the digitised images, metadata and transcripts of the manuscripts will lie with SOAS and
Yale.
Project Resources
11. Project Partners
Yale University Library
Yale has extensive experience in digitising Arabic material and is an acknowledged expert in
applying OCR systems to Arabic script. Its role is to share best practices and knowledge in
digitisation and OCR text extraction with the SOAS team, and to select and digitise materials that are
lacking in SOAS Library.
12. Project Management
The Project will be under the supervision of the Digitisation Project Board, established to oversee the
running of digitisation projects at SOAS. The Project Team will be responsible for delivering the
project. Conference calls between project partners will be scheduled monthly, usually on the last
Friday of each month to report on progress and agree actions and deadlines for the next stage.
Project title | Name | Responsibilities
Principal Investigator and Project Director (SOAS) | John Robinson | John is the Director of Library and Information Services.
Principal Investigator (Yale) | Ann Okerson | Ann is Yale's Associate University Librarian with specific responsibility for Collections Development and International Programs.
Project Director (Yale) | Elizabeth A. S. Beaudin | Elizabeth is the Manager of International Digital Projects, Yale.
Curator (SOAS) | Narguess Farzad | Narguess is a senior fellow in Persian in the Department of Languages and Cultures of Near and Middle East.
Curator (Yale) | Simon Samoeil | Simon has been Curator of the Near East Collection at Yale Library since 1990 and is a member of the Yale Council on ME Studies.
Project Manager (SOAS) | Huei-Lan Liu | Responsible for the management, coordination and administration of the project. Huei-Lan is the Repository Support Officer.
Academic Advisor (SOAS) | Peter Colvin | Peter is a former specialist librarian for Islamic Middle East in SOAS Library.
Project Technical Consultant (SOAS) | Malcolm Raggett | Malcolm is the Head of the Centre for Digital Asia, Africa and the Middle East.
Systems Programmer (Yale) | Xinjian Guo | Guo is a senior member of the central Library IT staff.
Preservation and Collection Care Librarian (Yale) | Ian Bogus | Ian is Head of the Collections Care unit in the Preservation department.
Archivist (Yale) | Bill Landis | Bill is Head of Arrangement, Description, & Metadata Coordinator.
Digitisation Assistant (SOAS) | Alex Shipman | Responsible for digitising and scanning images and the input of basic metadata.
13. Programme Support
Facilitate cooperation with other JISC-supported Islamic digitisation projects.
14. Budget
See Appendix B
Detailed Project Planning
15. Workpackages
See Appendix C
16. Evaluation Plan
Timing | Factor to Evaluate | Questions to Address | Method(s) | Measure of Success
Oct 2009 – Jun 2010 | Digitisation of 3,000 folio images; 16,800 page images | Completing on schedule | Monitoring progress at Project team meetings | 100% of images are completed to schedule
Oct 2009 – Aug 2010 | OCR text extraction and consequent metadata creation | Lexical analysis of OCR-extracted text | Quality control checks on a statistical sample of all finished work | A random sample equal to 10% of the total batch of files being accepted
Jan 2010 – Aug 2010 | Cross-referencing | Usability testing | Identify 12 cross-reference links to be created during OCR extraction | 100% of the twelve links properly produced from automated scripts developed during the project
Jan 2010 – Aug 2010 | Cross-collection searching | Usability testing | Usability study with control groups of undergraduate and graduate students and specific types of searches | We will judge success when students can retrieve 80% or more of the search materials
Aug 2010 onwards | Usage statistics | How visitors find the content, where they are from, which content is popular | Collection of a range of usage statistics (website traffic; searches; downloads) for further analysis | Ongoing monitoring
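The quality control check on a 10% random sample could be drawn along the following lines. This is a minimal sketch, not the project's actual procedure: the function name, the file naming pattern and the fixed seed are assumptions made for illustration, and the plan does not say how the sample is to be selected or judged.

```python
import random


def draw_qc_sample(file_names, percent=10, seed=None):
    """Pick a random `percent` share of a batch for quality-control checks."""
    files = list(file_names)
    if not files:
        return []
    # Ceiling division in integers, so a small batch still gets at least one check.
    sample_size = (len(files) * percent + 99) // 100
    rng = random.Random(seed)
    return rng.sample(files, sample_size)


# Hypothetical file names standing in for the 16,800 page images.
batch = [f"page_{n:05d}.tif" for n in range(1, 16801)]
qc_sample = draw_qc_sample(batch, seed=2009)
print(len(qc_sample))  # 1680 files, i.e. 10% of 16,800
```

Passing a seed makes the selection repeatable, which may help if a reviewer needs to re-check the same files later.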
17. Quality Plan
Output | Timing | Quality criteria | QA method(s) | Evidence of compliance | Quality responsibilities | Quality tools (if applicable)
XML Mark up and metadata | Sep 2009 - Dec 2009 | TEI | Inspection of randomly selected batch files | Conformant with international metadata/encoding standards | Yale/SOAS | Dublin Core; MARCXML
Creation of scanned images | Oct 2009 - Jun 2010 | Image quality | | Images meet the required standards | Yale/SOAS | TIFF; JPG
Usability of cross-collection search | Jan 2010 - Aug 2010 | | Usability study with control groups of students | Minimum 80% of retrieval success rate | Yale/SOAS |
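The 80% retrieval target can be checked with simple arithmetic once the usability sessions record how many of the expected items each test search returned. The sketch below assumes one (items found, items expected) pair per search and pools them across all searches; the plan does not state how the rate is to be computed, so this pooling and the sample figures are illustrative assumptions.

```python
def retrieval_success(trials, target_percent=80):
    """trials: list of (items_found, items_expected) pairs, one per test search."""
    found = sum(min(f, e) for f, e in trials)
    expected = sum(e for _, e in trials)
    if expected == 0:
        raise ValueError("no expected items recorded")
    # Integer comparison avoids floating-point rounding at the 80% boundary.
    passed = found * 100 >= target_percent * expected
    return found, expected, passed


# Invented figures for three test searches.
trials = [(8, 10), (5, 5), (3, 5)]
found, expected, passed = retrieval_success(trials)
print(f"{found}/{expected} retrieved, target met: {passed}")
```

With these figures 16 of 20 items are retrieved, which is exactly 80% and therefore meets the target.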
18. Dissemination Plan
Timing | Dissemination Activity | Audience | Purpose | Key Message
Oct 2009 | Wiki | Project team members | To share project documentation and communication |
Oct 2009 | Project website | JISC, SOAS students and staff, academic community, other digitisation projects | To highlight the project, comply with JISC requirements and encourage feedback about the process | The existence of the project and that advice and feedback are welcome as part of the process
Oct 2009 | One day workshop. 27th October 2009 | SOAS Project team members | To familiarise with issues related to digitisation project |
Dec 2009 | Presentation at NACIRA (National Conference for Information Resources on Asia). December 2009 | Academic community | To promote the existence of this resource |
Jan 2010 | Presentation at MELCOM (Middle East Libraries Committee) UK Meeting. 12th or 13th January 2010 | Academic community | Highlight the contribution made to world knowledge by Arab philosophers, physicians, and scientists |
Apr 2010 | Presentation at MELCOM International Conference. Cordoba, 19th-21st April 2010 | Academic community | To promote the existence of this openly accessible online resource |
July 2010 | Presentation at TIMA (The Islamic Manuscript Association). Cambridge, 8th-10th July 2010 | Academic community | The availability of the online resource relating to Islamic studies |
July 2010 | Project launch event | Academic and state sector communities | |
19. Exit and Sustainability Plans
Project Outputs | Action for Take-up & Embedding | Action for Exit
Digitised materials | Digital images are captured according to preservation standards | Outputs checked and approved by preservation experts
Documentation | Findings regarding the new digital collection and project documentation to be published for use by other academic libraries | Ensure all procedures and technical standards are documented and made available on project Wiki or website
OCR text extraction | Incorporate different OCR software packages into the digitisation workflow | Quality control of conversion of texts and periodic modifications to existing workflows
Cross-collection searching | Digitised materials to be deposited into partner’s repositories | Infrastructure is implemented to permit cross-collection transatlantic searching

Project Outputs | Why Sustainable | Scenarios for Taking Forward | Issues to Address
Online repository | Will continue to be maintained by SOAS | Investigate further funding opportunities to enhance and expand content | Further funding opportunities
Digitised materials | Will be preserved by SOAS | To upgrade/migrate to new hardware/software formats | Ensuring built in capacity to fund upgrades
Metadata Standards | Maintained by SOAS | To be used and enhanced to support new digitisation projects | Funding and staff capacity to maintain and extend this resource
Appendixes
Appendix A. List of Selected Materials
Appendix B. Project Budget
Appendix C. Workpackages
JISC WORK PACKAGE
WORKPACKAGES
1: Project Initiation (Sep 09)
2: Integration (Sep 09)
3: Digitisation (Sep 09)
4: Dissemination (Sep 09)
5: Evaluation (Oct 09)
[Gantt chart: months 1-12 marked with X against each workpackage]
Project start date: 1 September 2009
Project completion date: 31 August 2010
Duration: 12 months
Document title: JISC Project Plan
Last updated: April 2007
Workpackage and activity | Earliest start date | Latest completion date | Outputs | Milestone | Responsibility
YEAR 1
WORKPACKAGE 1: Project Initiation
Objective: Staff recruitment, Project website
1. Wiki | Sep 2009 | Oct 2009 | Tikiwiki site for use by teams | Entry of meeting notes; subsequent monthly updates | Yale/SOAS
2. Project Web Site | Oct 2009 | Oct 2009 | Initial project promotion site | | Yale/SOAS
3. Recruitment of staff | Sep 2009 | Nov 2009 | Digitisation assistant recruited | | SOAS
WORKPACKAGE 2: Integration
Objective: Infrastructure development, analysis of digitised materials
4. Text Analysis | Sep 2009 | Dec 2009 | Mapping of extracted text for keyword and subject heading | 12/09 delivery of final mapping to tech team | Curatorial teams
5. Lexical Token Creation | Oct 2009 | Mar 2010 | Modification of AraMorph to generate extracted keywords | test dataset by 01/10; working specs by 03/10 | Yale
6. Metadata Identification: TEI, DTD creation | Sep 2009 | Dec 2009 | TEI schema for ingest | ingest of test dataset | Yale/SOAS
7. Metadata Mark-up tool | Oct 2009 | Dec 2009 | Online forms for use with TEI schema | first use by team 01/10 | Yale/SOAS
8. Page Turner | Oct 2009 | Aug 2010 | Modification to incorporate MSS images and active links | prototype demonstrated 02/10 | Yale
9. Citation Generation | Oct 2009 | Aug 2010 | Bookbag features – save query, research notes | prototype demonstrated 04/10 | Yale/SOAS
10. Cross-Collection Searching | Jan 2010 | Aug 2010 | Ground truth testing of known data deposited into repository | usability testing 02/10 | Yale/SOAS
11. 12 MSS links – manual link generation | Oct 2009 | Dec 2009 | Ground truth testing of 12 links | usability testing 02/10 | Yale/SOAS
12. MSS links – automated link generation | Jan 2010 | Aug 2010 | Ground truth testing of 12 links | usability testing 07/10 | Yale/SOAS
WORKPACKAGE 3: Digitisation
Objective: scanning, depositing, and indexing materials
13. Workshop on Text Digitisation | Oct 2009 | Oct 2009 | Participation and workbook | 10/27/09 workshop | EB
14. Scanning | Oct 2009 | Jun 2010 | 3000 folio images; 16,800 page images | | Yale/SOAS
15. OCR Processing | Oct 2009 | Aug 2010 | 16,800 page conversions | half by 02/10 | Yale/SOAS
16. Quality Control | Oct 2009 | Aug 2010 | 10% randomly selected from 16,800 page conversions | half by 02/10 | Yale/SOAS
17. Deposit Objects to Archives | Oct 2009 | Aug 2010 | 3000 folio images; 16,800 page images; metadata files for each | 33% by 02/10 | Yale/SOAS
WORKPACKAGE 4: Dissemination
Objective: to highlight the project, promote the existence of this resource
18. Compile Workflow Documentation | Oct 2009 | Aug 2010 | Publishable documentation | reviewed by team by 05/10 | Yale/SOAS
19. Press Releases; Talks at conferences | Sep 2009 | Aug 2010 | Press releases; Presentations | at beginning and completion; MELCOM 2010 | AO / EB / PC
20. Launch Event | Jul 2010 | Jul 2010 | | | PC / HL
WORKPACKAGE 5: Evaluation
Objective: functionality and usability study
21. Sustainability Planning | Jan 2010 | Aug 2010 | Mission statement; text for distribution to possible funding sources | first session 01/10; second session 04/10 | Yale/SOAS
22. Usability Study with focus group | Feb 2010 | Aug 2010 | Full text content searchable by student testers | Able to retrieve 80% or more of the search materials | Yale/SOAS
23. Functionality testing | July 2010 | Aug 2010 | Metadata harvesting; search and retrieval; page viewing; citation generation | System meets specifications | Yale/SOAS
24. Final Project Report | Aug 2010 | Aug 2010 | Report submitted to JISC | | HL
Members of Project Team:
AO=Ann Okerson (Yale)
EB=Elizabeth Beaudin (Yale)
PC=Peter Colvin (SOAS)
HL=Huei-Lan Liu (SOAS)