Technical Resources Workshop 22 September 2010
Transcription
Technical Resources Workshop, 22 September 2010, Oxford eResearch Centre

PROGRAMME
10.00 Coffee
10.30 Welcome and description of DIAMM activities (Julia Craig-McFeely)
11.15 Elizabeth Eva Leach (DIAMM): the virtual learning environment and DIAMM
11.40 Paul Vetch (Centre for Computing in the Humanities): Online delivery of DIAMM materials; how DIAMM relates to other projects
12.30 Feedback session 1
1.00 Lunch
2.00 Theodor Dumitrescu (University of Utrecht): CMME software and the CMME database
2.45 Ichiro Fujinaga (McGill University, Montreal): Gamera and its applications in shape-recognition for musicology and other subjects
3.30 Feedback session 2 (Grace de la Flor)
4.00 Tea
4.30 Summary: website analytics (Greg Skidmore); ways forward for DIAMM
5.00 Close

Questions: How can DIAMM become more useful to researchers? Where do we see DIAMM going in the future (e.g. what will we be doing in 10 years)? How can we improve our outputs? What new technologies should we be embracing? How important is it to link up with other online resources, and what sort of connections should we be making?

Present: Julia Craig-McFeely, Elizabeth Eva Leach, Segolene Tarte, Stella Holman, Esther Anstice, Greg Skidmore, Theodor Dumitrescu, Ichiro Fujinaga, Grace de la Flor, Paul Vetch, John Pybus, Richard Polfreman, Paul Kolb

Technical Resources Workshop Report
Summary of Workshop

1. Julia Craig-McFeely: DIAMM
Website: www.diamm.ac.uk

The presentation summarized the activities of DIAMM over the past 12 years, and described the activities that are funded by the current grant from the AHRC. It also discussed the database and back-end descriptive data content and how that is managed in the database framework. A detailed version of the presentation and accompanying materials were provided to participants on USB sticks.

The image content ranges very widely, from tiny fragments and heavily damaged leaves or partial leaves that have been recovered from sources as diverse as wallpaper scrapings and binding reinforcements, to complete choirbooks in excellent condition, sometimes in their original bindings. The content was acquired originally with grant support from the HRC and the AHRB (now the AHRC), but has been expanded through collaboration with other projects that have obtained funding to digitize documents and from the use of project-acquired funds earned through consultancy. The project also obtains a small number of images from donations by the owners, but this is usually only possible for small documents where the cost of digitization is small.

Quality has always been an issue and, surprisingly, it is still very difficult to obtain images of a consistent quality from suppliers such as library digitization services, where a fast commercial throughput means that it is not easy to maintain quality controls, and where sometimes a lack of visual acuity in the staff means that they cannot see artifacts on the images caused by, for example, using an unsharp mask during capture. The reasons for inclusion of a colour and size scale in each image have been demonstrated and accepted widely, but some archives continue to create ‘reference’ shots separately from their main images. This has proven ineffective where the camera operator subsequently moved the camera, but it also has implications for the long-term viability of image ‘collections’ like this, since digital images are still relatively new, and therefore we know little or nothing of colour drift over time.
The problem with having a single reference image is that if that one file becomes corrupted (either to unreadability, or without the operator being aware), all the colour and size information for the accompanying images is also therefore corrupted, rendering a whole set unreliable.

Since the start of the project, imaging technology has moved on from scanning backs to single-shot cameras, and the project now undertakes most of its imaging using a single-shot 65-megapixel camera, though the 144 Mpx scanning back is still invaluable for larger sources, and for damaged sources where the tighter pixel resolution allows more complex digital restoration to be undertaken. Using the single-shot camera the project can undertake UV, IR and multi-spectral imaging, and the scanning back can also be used for UV work. Digital restoration has aided scholars in many disciplines to retrieve material originally believed lost from damaged documents. Restoration, however, is now rarely undertaken within the project, since it is extremely time-consuming.

The database is the most important current activity of DIAMM: the majority of users of the website visit it in order to access catalogue metadata and other information about manuscripts, and many only access images rarely. Partly this is because the collection of images is extremely incomplete, and many very important sources are not available online even if the project has obtained images, which are stored in the dark archive.

The database is currently populated using FileMaker Pro, but is delivered via a SQL database to the website. There have been problems in the past in translating data from one medium to the other, largely because the ODBC connection with FileMaker (FM) was not well managed. Newer versions of the software have corrected this problem and it is now possible to export directly from one database to the other. However SQL is a far less ‘forgiving’ environment than FM, and getting the two databases to talk to each other is far from simple. The end result, however, will be an upload mechanism that can be run more-or-less at the click of a button, and will allow the online dataset to be updated as frequently as we wish, instead of the current system in which content updates are only done every few months.

The first diagram shows the base data structure that is used to deliver our current online system. It was originally a mechanism that allowed users to get to images, so it has the image as the smallest level of information available, and all images are a subset of the Source table, which is basically a list of manuscripts (each manuscript being anything from a fragment to a large, complete, bound source). Attached to the image table is a ‘SecondaryImages’ table which allows us to deliver different versions of an image (UV, IR, restored, watermark, detail etc.) as an adjunct to the main image.

Manuscripts, however, are rarely considered in isolation. There are many ways in which manuscripts were originally linked to each other or have become linked:
• By design – e.g. a set of partbooks
• By content – e.g. the same copyist working in multiple sources, or the works of a particular composer being preserved in a set of books
• By intellectual construct – e.g. because the MSS originate in the same geographical region, or because they had political connections
• By reconstruction – so a set of fragments may originally all have come from the same complete MS
And so on.
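Such groupings are handled in the database through ‘sets’ linked to sources, as described below. The following fragment is a minimal, hypothetical sketch of that relational pattern – the table and column names are invented for illustration and are not the actual DIAMM schema:

```python
import sqlite3

# Minimal, hypothetical sketch of the Source/Image/Set pattern described above;
# table and column names are illustrative, not the real DIAMM schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Source (source_id INTEGER PRIMARY KEY, shelfmark TEXT);
CREATE TABLE Image  (image_id  INTEGER PRIMARY KEY, source_id INTEGER REFERENCES Source, folio TEXT);
CREATE TABLE SecondaryImage (sec_id INTEGER PRIMARY KEY, image_id INTEGER REFERENCES Image, kind TEXT); -- UV, IR, restored, watermark, detail
CREATE TABLE MSSet  (set_id    INTEGER PRIMARY KEY, set_type TEXT); -- partbooks, copyist group, reconstruction...

-- Intersection table: a source can belong to many sets, a set has many members.
CREATE TABLE Source_MSSet (source_id INTEGER REFERENCES Source, set_id INTEGER REFERENCES MSSet);
""")

# One fragment belonging to two different kinds of set:
conn.execute("INSERT INTO Source VALUES (1, 'Fragment X')")
conn.executemany("INSERT INTO MSSet VALUES (?, ?)", [(10, "reconstruction"), (11, "same copyist")])
conn.executemany("INSERT INTO Source_MSSet VALUES (?, ?)", [(1, 10), (1, 11)])
```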
On each page there are items – in this case usually a musical composition or work. Depending on the layout this may represent a complete work, or it may represent one or more voice parts of a musical work. Sometimes there are many items on a page (which corresponds to an image) and sometimes an item is copied over several pages.

The database deals with this by allowing each MS to belong to a ‘Set’, and the type of set is also defined. Because a Source (MS) can belong to more than one set, and any set can have more than one member, these are linked by an intersection set. Items are linked directly to the Source table, since they are effectively independent of the images, but they are connected to the image set by an intersection set.

Musical works present issues that can be difficult to resolve: is a Kyrie a work in its own right, or should it only be considered an item in a larger whole called ‘a Mass’? The repertory itself answers this question, since the idea of the parts of the Ordinary of the Mass forming part of a linked cycle is a relatively new one, and these ‘movements’ were originally composed in isolation. Motets too create issues: some motets are written in sections, and any of those sections might appear in a different manuscript as a motet in its own right. Therefore items too have to be linked into sets or groups via an intersection set. Items require a composer database, and in fact also require a ‘composition’ database as a superstructure to the individual items, since an appearance of a work in one source may be substantially different from its appearance in another source. Works may also be based on another work in the database, used as a model, and there are models for texts which are used in adjusted forms as well as in their pristine form.

As anyone who has worked on a structure like this knows, bibliographies have their own problems. In DIAMM we are not simply dealing with a bibliography, but with a bibliography that is linked to a large manuscript database. Some bibliographical items therefore may refer to a manuscript, some to items in a manuscript, some to items by particular composers (who appear in many manuscripts) and some to cross-composer and cross-manuscript subjects such as genre or a particular titled work. This causes quite a complex network of intersection sets between the bibliography database and all the other tables in the database, including the tables now added to deal with composers.

Texts require their own database (giving original and standardized spellings, language tables etc.), as do purely musical items such as voicing, clefs, mensuration and so on. The database grows with each iteration of information, and expressing it creates a complex structure that also complicates the ways in which that data can be accessed and searched online, since the whole has to be presented in ways that web browsers can understand and display meaningfully. The design implications are enormous, and design is one of the most complex tasks undertaken by our technical partners.

Knowing where to stop – where to decide that certain types of information must be provided by a different resource – is quite difficult.
The temptation is for DIAMM to try to be all things to all people, whereas the real answer to this difficulty is to create a web-services model in which a number of linked databases with different specialisations connect to each other, and each deals with different types of information. By connecting to the existing information in a complementary database, duplication is avoided, and a broader information resource can be created.

Where, then, does DIAMM go in the future? How does the project become self-sustaining: not just to stand still, but to continue growing? DIAMM has always been soft-funded, but this has set the project up to be dependent. We need to create income streams, but we are determined not to charge for access to our content. Having considered many income streams – and that list is always growing as new technologies emerge: we are currently examining iPhone and iPad apps – it is clear that a major income stream, and probably our primary one for the foreseeable future, is publications. When the project started, those working in digital spheres believed that the book would very soon become obsolete, and that the digital facsimile would be the way forward. It is clear however that there is simply no substitute for holding something beautiful, particularly a beautifully-produced colour facsimile, and sales of our first two facsimiles have shown that – for now anyway – this is a viable way forward.

2. Elizabeth Eva Leach: DIAMM and Moodle
Websites: www.music.ox.ac.uk/people/staff.../e_leach.html; web.me.com/elizabethevaleach/; twitter.com/eeleach

Dr Leach has been working on the creation of an online teaching resource for learning medieval notation. Rather than develop a new system she has used Moodle software, originally developed for distance learning in Australia, but now in use by many educational institutions for interactive coursework that students must complete in their own time. The Virtual Learning Environment (VLE) was of particular interest to the AHRC in funding this period of the project, since it would open the content to a much wider public, and would also enhance teaching of medieval music in other HE institutions, since it would not be limited to internal use, but would be accessible to any user – just as the DIAMM resource as a whole is open to anyone who can get online.

Dr Leach demonstrated some of the course parts that she had already constructed, which she hoped would form the basis for similar courses by colleagues that would cover different periods and traditions of notation. She demonstrated the various types of module that could be used: simple flat text description; web pages with images and other materials; quizzes; tests in which the user is required to reach a certain score. She did point out, however, that creation of such a course is very time-consuming. If a ‘student’ is offered multiple-choice answers, in order for their learning experience to be useful she had decided that a wrong answer should not simply be marked as incorrect, but that the student should receive feedback on why they may have got that answer wrong, thus enabling them to learn from their mistakes as much as from reading the preparatory material. Doing this for each question takes a great deal of time, as do the mechanical tasks of preparing suitable sections of larger images to provide the student with samples from which to work.
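As an illustration of the per-answer feedback idea (the question, options and feedback below are invented, not taken from Dr Leach’s actual course), a multiple-choice question can be modelled as a mapping from each option to an explanation, so a self-marking system can respond to a wrong answer with a reason rather than a bare ‘incorrect’:

```python
# Hypothetical example of a self-marking question with per-answer feedback.
question = {
    "prompt": "How many semibreves does an unaltered, undotted breve contain under perfect tempus?",
    "options": {
        "A": ("three", True,  "Perfect tempus divides the breve into three semibreves."),
        "B": ("two",   False, "Two semibreves would be imperfect tempus; check the mensuration sign."),
        "C": ("four",  False, "Four relates minims to the breve under certain prolations, not tempus."),
    },
}

def mark(answer: str) -> str:
    """Return targeted feedback for the chosen option, in the spirit of a Moodle quiz."""
    text, correct, feedback = question["options"][answer]
    prefix = "Correct" if correct else "Not quite"
    return f"{prefix}: {feedback}"

print(mark("B"))
```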
There is no point in creating a system that requires input and feedback from a real person: nobody can devote that sort of time, and the requirement would limit its usefulness drastically. The course is therefore designed to be self-supporting – self-marking, and providing feedback based on scores and responses – all of which takes a great deal of forethought and planning. This is, however, precisely what Moodle is designed to do, although it requires a significant investment of time from the person creating the course at the outset. The only thing not possible (presently) in Moodle was the ability to incorporate some way for a student to transcribe the music they were reading online, and have that transcription ‘marked’ by the system. However Dr Dumitrescu’s presentation in the afternoon suggested that with some adaptation it may be possible to use his CMME software to fulfil this function, at least in part.

3. Paul Vetch (Centre for Computing in the Humanities): Online delivery of DIAMM materials; how DIAMM relates to other projects
Websites: www.kdcs.kcl.ac.uk/who/bios/paul-vetch.html; www.cch.kcl.ac.uk/research/projects; www.bpi1700.org.uk/jsp/

The online delivery of DIAMM has been undertaken for 8 years by CCH, and Paul Vetch, who has been involved in that work from the outset, was able to give an overview of some of the technical challenges that the database presents in online delivery. Specifically he discussed the need for users to be able to cut down to the piece of information that they wanted in a number of ways (e.g. by searching for a composer, a particular work, a genre, a manuscript, etc.) and how that might be achieved in design terms by using faceted browsing.

He used an online e-commerce database of knitting patterns as an example of faceted browsing in use, in which a user could whittle down a dataset of 200,000 knitting patterns to find the type of pattern that they might be interested in, limiting search results progressively by applying a series of filters. This meant that a visitor who didn’t necessarily know exactly what they wanted was able to view the dataset’s content in progressively smaller segments, rather than having to browse the entire content item by item, which is rather what users currently do with DIAMM content. This type of ‘searching’ is technically ‘browsing’, and is commonly used now on many e-commerce sites, where the user is invited to filter a set of search results, e.g. by cost or manufacturer, to save looking through a very long list. Without being particularly aware of it, users are therefore trained to filter search results or browse results, and designing an intuitive way for this activity to be applied to academic materials has been part of CCH’s work for a number of projects. One that was demonstrated was bpi1700 (British Printed Images to 1700), developed by CCH. The faceted browsing facility used to search this project’s database (http://www.bpi1700.org.uk/jsp/) allows users to choose from a number of different categories, with the intention that a user could never create a search that would yield zero results.
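A minimal sketch of the faceted-browsing idea follows – the records and facet names are hypothetical, not the bpi1700 or DIAMM implementation – showing how successive filters narrow a result set and how a remaining count can be shown next to each facet value so that no combination returns zero results:

```python
from collections import Counter

# Hypothetical toy dataset; the real DIAMM facets (source, composer, genre, date...)
# would be drawn from the database rather than hard-coded here.
records = [
    {"composer": "Dunstaple", "genre": "motet",   "century": 15},
    {"composer": "Dunstaple", "genre": "mass",    "century": 15},
    {"composer": "Machaut",   "genre": "ballade", "century": 14},
    {"composer": "Power",     "genre": "motet",   "century": 15},
]

def apply_filters(items, filters):
    """Keep only the records matching every selected facet value."""
    return [r for r in items if all(r[f] == v for f, v in filters.items())]

def facet_counts(items, facet):
    """Counts displayed beside each facet value, so users see how many results remain."""
    return Counter(r[facet] for r in items)

selection = apply_filters(records, {"century": 15})
print(facet_counts(selection, "genre"))            # Counter({'motet': 2, 'mass': 1})
print(apply_filters(selection, {"genre": "motet"}))
```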
The user is ‘directed’ by having free-text fields auto-complete using information from the database (thus preventing mistyping or the entry of non-matching data), and the richness of the results was immediately apparent, since each category had a number appended indicating how many records would match that search. Categories could be searched alone or combined, so a print producer could be fixed, then a category within his output fixed, and that could in turn be filtered by date or technique, all of which were predefined by the underlying database.

The basic development database using DIAMM materials was shown. Although this is not yet fully populated, nor constructed beyond its basic components, the workshop participants were able to see the type of browsing delivered by bpi1700 in use for DIAMM materials. The DIAMM faceted browser would look similar to this, but with the primary filters (here: producer, person shown, subject, date) being source, composer, genre, composition, date etc. The difficulty in DIAMM is the huge variety of types of content that a user might wish to examine. At some point, however, we have to make a decision on behalf of the user group to limit the searchability to certain facets.

CCH is also developing work for Thomas Schmidt-Beste’s PRoMMS project, another musicology project in a significant group of musicological outputs from CCH, but one with defined connections to DIAMM. PRoMMS will draw on the DIAMM database and its image content, as well as using DIAMM to create additional images. The project is concerned with the creation and mise-en-page of musical manuscripts in the late 15th and early 16th centuries, and ways in which manuscript production can be defined and described by researchers in an electronic medium, allowing us to better understand the process of manuscript creation and what each item in a view might mean to a reader. Part of the development work on PRoMMS will allow CCH to develop a web-services connector to the DIAMM resource, which can in turn be exploited by other projects wishing to connect to the datasets. At present the markup is planned to be done manually, but the presentation by Prof Fujinaga in the afternoon session suggested that his software had progressed to a stage where some of the identification and classification of zones on pages and openings of manuscripts could be automated.

4. Theodor Dumitrescu (University of Utrecht): CMME software and the CMME database
Project website: www.cmme.org

CMME is now well known in the musicological community thanks to presentations by Theodor Dumitrescu at major musicological conferences and a website that offers a clear history of the project, the software development, and a fascinating interactive demo version of the tool.

Essentially the software provides end users with a simple interface for making online editions of early music by allowing the user to input original note shapes and values. Much is lost in the process of ‘editing’ a piece into modern notation or a modern edition, and a great deal of that can be retained if the editing software allows the user to retain more of the information as it appears in the original source. Because it is XML-based, the content is searchable in a way that most music-processed content (using other software) is not, and simple drop-down menus allow the user to change the display to modern note shapes, modern cleffing, modern (or various other styles of) barring etc.
Individual parts can be viewed in score or part format. However, CMME is much more than simply a piece of software, as it is being used to create a large collection of new editions of early sources from a period complementary to that covered by DIAMM (and to a small extent overlapping with it). The editions involve transcribing all the sources for a work and, instead of producing a single ‘collated’ edition (which does not represent any contemporary version or performance of the work), producing a version prepared from one source that can be easily and graphically compared with all the others. For instance the ‘master’ source can be chosen, and then variants between that source and all the others, in notes and underlay, can be shown using colours or other graphic devices. The CMME editions are available online through the CMME website (http://www.cmme.org/?page=database).

The advantage of this type of edition is its fluid presentation, and also the improvement that it offers in the understanding of the original sources for students and performers of the repertories. Most performers and students work only from modern editions of these works, where an editor has often made significant decisions about the content that is taken from the original and passed on to the modern user. This significantly damages our ability to understand the work in its original context, and can be extremely misleading. We could say that all students of music from this period should understand the notation and work from the original sources, but in practice expertise in this notation is not easy to come by, and access to the sources, even with a resource like DIAMM available to the public, is slow and laborious for most users, whose expertise is generally less specific.

The MOTET database, now merged with DIAMM, includes as part of its remit the creation of new musical and text incipits for sources that do not have them, or sources for which RISM provided incipits in a standardized notation that does not represent the appearance of the original. As soon as CMME was available the MOTET team moved to creating any new incipits using this software, and will retro-convert their older work as time and funding permit.

The CMME project involves a growing database of sources and content (most of which can be seen and searched online at www.cmme.org), and for several years CMME, DIAMM and the MOTET databases, along with other datasets of a complementary nature (Oliver Huck’s Trecento database and the Base Chanson databases at Tours), have been moving towards a closer collaboration and data-sharing effort. The primary purpose of this collaboration is to ensure that time and funding are not wasted by projects duplicating each other’s work, and also to allow projects to utilise richer content than they could using only their own data. To this end the intention is that the projects involved should create a pilot data-sharing or web-services model that would both allow the ‘borrowing’ or ‘mining’ of data between different databases and also allow these datasets to be searched via one independent portal. The portal could be linked to by new datasets that would contribute complementary data to the existing databases. The project has been in the planning stages for some time, and has been named ‘REMEDIUM’. One of the first steps in the REMEDIUM collaboration was the sharing of database structures between the partner projects.
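The source-comparison approach described above can be sketched in miniature. This is not CMME’s actual data model or algorithm – the encodings below are invented and far simpler than CMME’s XML format – but it shows the idea of aligning a chosen ‘master’ source against other readings and flagging where notes or underlay differ:

```python
# Hypothetical illustration of comparing a chosen 'master' source with others.
master = {"source": "Source A", "notes": ["c'", "d'", "e'", "d'"], "underlay": ["Ky-", "ri-", "e", ""]}
others = [
    {"source": "Source B", "notes": ["c'", "d'", "f'", "d'"], "underlay": ["Ky-", "ri-", "e", ""]},
    {"source": "Source C", "notes": ["c'", "d'", "e'", "d'"], "underlay": ["Ky-", "-", "ri-", "e"]},
]

def variants(master, other):
    """List positions where another source differs from the master reading."""
    diffs = []
    for i, (m, o) in enumerate(zip(master["notes"], other["notes"])):
        if m != o:
            diffs.append((i, "note", m, o))
    for i, (m, o) in enumerate(zip(master["underlay"], other["underlay"])):
        if m != o:
            diffs.append((i, "underlay", m, o))
    return diffs

for other in others:
    print(other["source"], variants(master, other))
# A viewer could colour these positions instead of collating them into a single edition.
```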
The next step would seem to be the incorporation of a master key number in all the relevant tables of the databases, connecting them together such that a REMEDIUM portal search for a specific item would show its position in all the collaborating datasets, with direct links and e.g. thumbnail views of images, editions and other material that could be found in the individual resources. We were shown a sample web page of how this might appear, and discussion followed about other projects that were potentially involved in duplicating material that already existed in this group of databases, and which should really be sharing these resources so that their time would be better spent in dealing with the material and information that was unique to their project.

There is a difficulty in dealing with funders who continue to fund the repetition of data-gathering of this sort, instead of limiting the focus of new projects to new research making use of data already in existence and available to the public. It was felt that there was massive duplication taking place and thus a huge waste of the limited resources available for musicological research. The refusal of new projects to collaborate with existing ones, or to share new data that they were creating, should be discouraged actively in order to maximize the potential for new projects to produce new research, instead of work that duplicates information already in existence elsewhere. It is important therefore that the REMEDIUM project moves forward as quickly as possible.

5. Ichiro Fujinaga (McGill University, Montreal): Gamera and its applications in shape-recognition for musicology and other subjects
Websites: www.music.mcgill.ca/~ich/; www.aruspix.net; gamera.informatik.hsnr.de; gamera.sourceforge.net/doc/html/

Although GAMERA and its ‘children’ have been in use for many years in advancing the needs of Optical Music Recognition (OMR) for repertories that do not use modern typography, the progress of the software in undertaking new tasks demonstrates its flexibility and its ability to undertake tasks beyond the remit of its original design. Early GAMERA applications were related to the recognition of lute tablature symbols, particularly in printed sources where gaps between typographical elements meant that ‘staff lines’ (actually tablature lines) were not continuous and often not particularly precisely aligned. Because most music typography involved hand-carved wood type, the figures are rarely perfect. GAMERA therefore had to be teachable, so that damaged or irregular type was understood by the software. Having been built to manage the task of removing the staff lines so that it could then read tablature letters and rhythm flags, the software could then be applied to any irregular shape-based script, as long as it was given a glossary from which to choose. Each task undertaken teaches the software more about the items in its glossary, so recognition of pages quickly becomes faster and more accurate. GAMERA is designed for domain experts, and includes image-processing tools (filters, binarizations etc.), document analysis features that allow images to be segmented and processed in sections, and symbol segmentation and classification. Because it is extensible it can be adapted for a variety of different tasks.
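The ‘teachable glossary’ idea can be sketched as follows. Gamera itself provides trainable classifiers over features extracted from glyph images; the fragment below does not use the Gamera API, and its feature values are invented, but it shows schematically how a classifier trained on labelled samples assigns the nearest glossary label to a new, possibly damaged symbol:

```python
import math

# Invented training glossary: (feature vector, label). In practice the features
# would be measurements extracted from glyph images (aspect ratio, holes, etc.).
glossary = [
    ((0.9, 0.1), "tablature letter a"),
    ((0.4, 0.8), "rhythm flag"),
    ((0.5, 0.2), "tablature letter c"),
]

def classify(features, glossary):
    """Nearest-neighbour lookup: the label of the closest trained sample wins."""
    return min(glossary, key=lambda sample: math.dist(sample[0], features))[1]

# A slightly distorted (damaged or irregular) glyph still falls near its trained neighbour:
print(classify((0.85, 0.15), glossary))   # -> tablature letter a

# 'Teaching' is just adding confirmed samples, so later pages classify faster and better:
glossary.append(((0.45, 0.75), "rhythm flag"))
```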
One such task that was tested for DIAMM in the past was the recognition of scribes: the software was ‘shown’ a large corpus of manuscript samples with the intention of classifying the corpus by similarity to a master sample. The software then ranked the samples (very successfully) from most to least similar. Because the software was able to remove staff lines and recognize elements in the content, it was used for the Online Chopin Variorum Edition (OCVE) to create markup for barlines, so that that information could be used to create individual bar crops on a large corpus of printed music (over 7000 pages).

GAMERA was used to build GAMUT (Gamera-based Automatic Music Understanding Toolkit) in 2004, and was extended also to GEMM (Gamut of Early Music on Microfilms) in 2005. In 2006 the GAMERA family joined forces with ARUSPIX (HMM-based), a specialized application for recognizing typographic music. Unlike many of the GAMERA-based predecessors, this does not remove the staff lines; but dealing with early typographic single-impression sources (where individual notes each have their own set of staff lines, so the staff lines are not continuous) creates a different set of difficulties from reading engraved or later typeset music, where the staff lines are continuous and therefore predictable. A movie demo of Aruspix showed that the recognized version of the typographic content retained the note shapes and types of the original, and was presented in an intuitive GUI with extremely simple drag-and-drop editing tools.

Most interesting to the workshop, however, was the early pre-processing of an image, in which Aruspix cleaned up the image and pre-classified the content, defining a historiated initial, title text, text underlay and music text, and using colour overlays to show the classification. This had very obvious relevance to the newly-funded PRoMMS project, in which the mise-en-page of manuscript and printed sources is examined and will require manual markup of a large number of pages. If these pages can be largely pre-classified before the manual phase, this will speed up the work considerably, and may allow the team to extend their content exploration.

Compared with the results of one of the leading OCR packages, ABBYY FineReader Pro, on a first run at a page of chant with an extremely clear original (unlike the original of the Aruspix sample, which was quite dirty and had show-through), Aruspix has an obvious edge. Aruspix is the leading neume recognition editor, with 7000 symbols already trained and a very high accuracy rate of over 95%.

One of the difficulties encountered by the Chopin project was that most digital images are (correctly) supplied showing everything around a page, not just the internal content. So the edges of the page are showing in the picture (not always square or straight), and colour and size scales are also shown. Sometimes there is additional material. In order for GAMERA to perform its bar-line recognition, the edges of page images had to be trimmed off to avoid confusion in the result. New developments in the software include intelligent cropping and boundary detection. The software performed equally well on approximately square originals as on fragments of extremely irregular shape, with completely non-linear edges (e.g. disintegrating from mould).
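The scribe-recognition experiment mentioned above amounts to ranking samples by their distance from a master sample in some feature space. The sketch below is not the DIAMM/Gamera experiment itself; it simply shows the ranking step, with invented feature vectors standing in for measurements taken from handwriting samples:

```python
import math

# Invented feature vectors standing in for measurements of scribal habits
# (letter proportions, pen angle, abbreviation shapes, etc.).
master_features = (0.62, 0.31, 0.85)
samples = {
    "MS A, fol. 12r": (0.60, 0.33, 0.84),
    "MS B, fol. 3v":  (0.40, 0.55, 0.70),
    "MS C, fol. 50r": (0.61, 0.30, 0.86),
}

def ranked_by_similarity(master, samples):
    """Rank handwriting samples from most to least similar to the master sample."""
    return sorted(samples.items(), key=lambda kv: math.dist(kv[1], master))

for name, _ in ranked_by_similarity(master_features, samples):
    print(name)
# Expected order: MS C, then MS A, then MS B (closest first).
```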
The potential for Aruspix and the various applications of GAMERA seems limitless, particularly given the very wide variety of requirements of musicologists and new musicology initiatives exploiting the online environment. The GAMERA/Aruspix family of projects is not limited by period, notation type, the quality of the original source, whether the original is printed or manuscript, or the type of data being recognized, and this opens them to a much wider field of application in musicology and other disciplines. One of the problems in OMR for early repertories is that so much music has text underlay that is difficult to read or recognize for a user not familiar with early scripts, and this software could transform the ability of non-expert users to appreciate and interact with medieval and early modern sources.

6. Grace de la Flor (Oxford Internet Institute, University of Oxford)
Project website: www.oii.ox.ac.uk/research/?id=58

There are four project partners: OII/OeRC; the UCL Centre for Digital Humanities/Department of Information Studies; the Virtual Knowledge Studio; and Maastricht. Grace de la Flor is studying internet behaviours, and her current research involves six web resource case studies: University of Birmingham English; Electronic Enlightenment 2: letters & lives online; The Proceedings of the Old Bailey 1674-1913; DIAMM; the UCL Department of Philosophy; and Corpus Linguistics. The aim of the research is to understand how humanities scholars use both traditional and digital resources in their work. The study results will provide guidance to funders in the formulation of their research strategy.

Her project is interested in understanding more about DIAMM because it is a resource – a digital image archive – provided to scholars to assist them in their research. One of the questions she asked, therefore, was whether we achieved our purpose, and if so how well we succeeded. The researchers Grace interviewed find DIAMM an invaluable resource that is central to their substantive research practice. In addition to the slides shown in the workshop, the list below includes a last slide which is a wish list of further features in response to the question "what changes might improve DIAMM?" ... (most if not all of the suggestions will be familiar). After submitting her thesis Grace will start working on the RIN project again and writing it up; she will send us the draft of her research on DIAMM for the RIN project, probably around the end of October, for our feedback.

Early results
These are some of the areas of musicology research in which the archive is used:
• materiality of manuscripts
• codicology (manuscript physical properties)
• mise-en-page (manuscript layout)
• digital image restorations
• erasures: identification and repair
• material damage
• high-resolution digital images
• fragment comparison
• letter comparison
• historical/cultural context
• a manuscript's association with other cultural artefacts

The importance of DIAMM to individual researchers: letter comparison – an interview quote:
"The scribes of the Middle Ages worked really hard to be anonymous. If somebody started taking over the work of another scribe on page 50, the scribe would copy the handwriting of the other person. They would try not to use their own. And so you end up looking for very, very small, subtle, tell-tale sort of habits …that you can only see with really great quality photographs, and then I’m able to see how various manuscripts connect with each other."
The next researcher discusses erasures:
"I don’t feel particularly disadvantaged by the fact that I’ve worked with a digitization of it (the manuscript) because I have seen all the elements in that digital reproduction that I would see if I was consulting a manuscript first-hand ... Being able to actually detect these changes, the scribe’s erased something and rewritten something over the top of it. That has changed my scholarship in the sense that much of my work is concerned with identifying these erasures and identifying why these erasures occurred from the context of musical culture."

Here, a researcher discusses the importance of access:
"Nowadays you can probably do 80, 90% of everything through the high-res digital images and only for the remaining 10, 15, 20% you need to see the manuscript. So you can go much further (in your research) working from your own location."

Importance of DIAMM: enabling new discoveries through
• Comparison across a variety of source materials
• Digital image restoration
• Access to sources that may require considerable travel and expense to view first-hand, or that may be confined and impossible to access otherwise

Grace uncovered an interesting aspect of medieval musicology that we have not considered, and which may impact take-up and usage of the online resource significantly. Do you feel pressured to work in certain ways?
"Yeah, I mean, I think—I do feel pressure, for instance—there’s this notion in the field that you can always get more out of seeing the original than seeing a digital image of it, and I do feel pressure to work more with originals than with the digital images because of the traditions of the field, whereas frankly, there are times that that’s really important, but for the most part I do feel like I get more out of sitting, using these images on my computer... I do feel like there’s a certain pressure that that’s not what top scholars do because that’s not what top scholars did 25 years ago."

She then asked the workshop the following questions:
• How do you discover new information sources? (grad students, peers, etc.)
• What are the most important sources of information? How did you find them, and how did you decide they were good resources?
• Over the course of your career have information sources changed, or just changed in style/presentation?

Her conclusions about what changes might improve DIAMM are as follows:
• Image manipulation tools (adjust gamma)
• Save a copy of the image onto your hard drive for research
• Improved search tool (search by composer, by piece, by kind of manuscript)
• Access to more of the DIAMM images via the web interface
• A slightly more streamlined registration process
• Full transcriptions
• Updated bibliography
• Linked content between related projects and libraries
• More funding!

7. Website analytics (Greg Skidmore); ways forward for DIAMM

Greg Skidmore’s presentation demonstrated very fully the ways in which the internet and communications technology can be used to bridge geographical boundaries without the need for special facilities, since it was given from a seat in a bus travelling on the M40. As his travel had been delayed he uploaded a document to GoogleDocs which was presented to the workshop participants, and which he updated in real time as the content was being discussed. A summary of the Google data is given below; Google Analytics started taking data on the site on 6th May 2010.
Profile of visitors
• 6,684 unique visitors
• 12,206 visits (87.9 visits per day)
• 105 countries (GB 26%, US 18%, D 8%, NL 7% – EU 71%, Am 23%)
• Top cities, in order: London, Oxford, Utrecht, Vienna, Leipzig, Cambridge
• 54% were new visitors
• They spoke 64 different languages
• Average time spent on site = 4 min 47 sec
• Average pageviews/visit = 8.5
• 46% of visitors only viewed one page; however, 4,676 visits (38%) involved viewing 5 pages or more
• Most visitors only ever visit the site once (54.6%); however, 2,862 visitors visited between 9 and 200 times (23.4%)
• Roughly half the visits lasted less than 10 seconds; however, 22% of visits lasted between 1 min and 10 min
• Total number of visits which lasted longer than 1 min = 4,164 (34%)
• The vast majority of visits were made on monitors with greater than 1024-pixel horizontal resolution, using broadband connections

How visitors got to the site
• 80% of visitors came either from search engines or as direct traffic
• 95% of search engine traffic came via Google, then Yahoo!, then Bing
• 39% of search engine traffic came via a search for ‘diamm’; next was ‘medieval music’ at 2%

Examination of /index.html
• /index.html was the ‘landing page’ for 51% of visits
• 40% of these visits resulted in the user leaving the site after viewing /index.html
• 38% of users who searched for ‘diamm’ left the site after one page view
• 33% of these ‘diamm’ searchers were confronted with /index.html and left immediately

Use of site
There may be something wrong here: Google says that pages with ‘DisplayImage’ in the URL (i.e. the actual Image Viewer) have only been viewed 27 times. This may indeed be true. Pages with a URL containing ‘source.jsp’ have been viewed 23,249 times, and 3,988 separate pages containing ‘source.jsp’ in the URL have been viewed. The most popular source is the Eton Choirbook; its source description page has been viewed 224 times. Search.jsp and Results.jsp are Nos 3 and 6 on the list of most popular pages. This means that many users are indeed searching, not browsing using ArchiveList.jsp. (ArchiveList.jsp is actually the most popular page on the site.)

CONCLUSION
One of the outcomes of the workshop was a list of ways in which users could be encouraged to help the project. Because the project is well established there is a danger that users assume it will always be there, and that it has reached a stage of existence in which it will continue indefinitely without any help from its user community. This is manifestly not the case, and the following list was drawn up as a result of this discussion, to be added to the website and to the materials included on the promotional USB stick (funded by the AHRC). This version has been edited after consultation with the project team:

What can I do for DIAMM?
What DIAMM needs primarily is funding: small amounts and large. In order to continue to deliver a world-class collection FREE to any user we need to increase our online collections, and continue to develop our delivery mechanisms to keep up with demand and changes in technology. DIAMM can only exist if the community who benefit most from it recognize its worth and support it to keep it alive. Web resources require more-or-less constant maintenance to keep abreast of the changing behaviours of browsers and the ongoing development of web technologies. Unlike books they cannot simply be put on a shelf and be accessed in the same way. Maintenance is costly, and needs to be funded and justified by evidence of use and value to the community who use it.
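As a quick sanity check on the figures above (using only the numbers reported in the analytics summary), the visits-per-day and percentage values are internally consistent over the roughly 139 days between 6 May and 22 September 2010:

```python
from datetime import date

# Figures taken from the analytics summary above.
visits = 12206
period_days = (date(2010, 9, 22) - date(2010, 5, 6)).days   # 139 days

print(round(visits / period_days, 1))    # ~87.8 visits per day (report gives 87.9)
print(round(4676 / visits * 100))        # 38% of visits viewed 5 pages or more
print(round(4164 / visits * 100))        # 34% of visits lasted over 1 minute
print(round(2862 / visits * 100, 1))     # 23.4% visited between 9 and 200 times
```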
You can help us in small ways:
• Cite DIAMM in your publications if you have used the resource. It is also very important to tell DIAMM when you have used us or cited us in a publication, so that we can keep track of the ways in which the project supports research and use that information to help us obtain future funding;
• Tell people about us and encourage them in turn to use the website (www.diamm.ac.uk);
• Use DIAMM as a teaching resource, and tell us how you have used it and the ways in which it worked or failed to work for you: this allows us to shape the resource to meet your needs;
• Send updates to online data wherever you spot an inaccuracy or lacuna, and feed back information about anything that does not work as it should. Contributions from the user community are essential to DIAMM;
• Donate any amount of money to the project; donations go towards creating more images of sources that are difficult to access, or that cannot be photographed at sufficient quality by the owners. To donate directly use our online shop (tinyurl.com/DIAMMPublications) and the ‘donate’ facility;
• Always insist on the highest quality of imaging from your suppliers. Our imaging checklist is included in the DIAMM handbook and can be found on the website. If you are concerned about the quality of the images you have received, ask DIAMM to evaluate them and, if necessary, write a quality assessment report.

You can also help in larger ways:
• Make sure that when planning your project you do not waste time and resources on doing something that DIAMM has the expertise to do for you: outsource imaging and image-processing – this will usually strengthen your application, as you will use your funding to pay for the things that only you can do.
• Write realistic consultancy costs into your budget, either for having DIAMM undertake work for you, or for our experts to train your imaging technicians or help you in ordering suitable-quality images from third-party suppliers.
• Include a budget line in grant applications for DIAMM resource usage, even if you do not use DIAMM for imaging or consultancy – even small amounts will help us to maintain our web presence and keep the resource free and available so that you can use it for your research. In the bigger picture of a grant for £250,000 or more, £2,000 donated to DIAMM in order to keep the resource running and available to you during the period of your project will make a huge difference to us, and very little difference to your application budget.
• If you are buying digital images, ask your image supplier for permission to donate copies of the images to DIAMM so that they can be displayed online, or deposit copies of your images in the ‘dark’ archive so that they are stored for the future; the project may be able to negotiate permission to put them online at a later date.
• Make sure that your data structures take account of existing related datasets, and take advantage of the willingness of other projects such as DIAMM and the Remedium consortium to share their data, thus saving you the time, cost and effort of re-creating metadata that already exists in another database. Share your own data so that the wider community can benefit.
• Link to existing datasets using a web-services model, minimizing the amount of work you have to do, enhancing those other resources by contributing your data easily, and creating a richer and more connected research environment for the user community.
• Make use of technical resources that already exist, rather than attempting to build a new resource from scratch that does the same thing. DIAMM and its funders have dedicated a huge amount of expertise and time to creating an online delivery mechanism for images and their associated metadata, and you could benefit from this by delivering your images through DIAMM – contact us to find out how.

Database and Data Connectivity Workshop Report
DIAMM Database and data connectivity workshop, 1 April 2011, Oxford eResearch Centre

PROGRAMME
9.30 Julia Craig-McFeely/Elizabeth Leach – Welcome and introduction to DIAMM data content and some problems in data extent
10.15 Prof Thomas Schmidt-Beste – the Production and Reading of Music Sources 1480-1530
10.35 Leif Isaksen & Gregorio Bevilacqua – Cantum pulcriorem invenire (research project on 13th-century conductus)
11.00 Coffee
11.20 David de Roure – SALAMI (Structural Analysis of Large Amounts of Music Information)
11.40 John Milsom and Nicolas Bell (Early Music Online) – standardised titles and composer names in early music repertories
11.55 Discussion/Feedback session 1
12.30 Lunch
1.30 Prof. Henrike Lähnemann – the Medingen project
1.50 Paul Vetch – DIAMM, ProMS and web-services for musicology datasets: technical issues
2.30 Theodor Dumitrescu – Remedium (t.b.c.)
2.50 Michael Scott Cuthbert – interconnectivity between music projects
3.00 Other datasets: general discussion to cover projects of other participants in the workshop (this can be extended to 4.00)
3.30 Discussion/Feedback session 2
4.00 Tea
4.30 Summary: feedback and forward planning
5.00 (approx) Close

Participants: Nicolas Bell, Margaret Bent, Gregorio Bevilacqua, Julia Craig-McFeely, Tim Crawford, David de Roure, Helen Deeming, Ted Dumitrescu, Elliott Hall, Leif Isaksen, Karl Kugle, Henrike Laehnemann, Elizabeth E. Leach, John Milsom, Stefan Morent, David Robey, Thomas Schmidt-Beste, Michael Scott Cuthbert, Philippe Vendrix, Paul Vetch, Raffaele Vigilanti, Magnus Williamson, Martin Wynne

SUMMARY NOTES ON THE PRESENTATIONS
(The workshop was fairly informal in style, so the notes presented here do not attempt to summarise the presentations in full, but note the salient points.)

1. Julia Craig-McFeely/Elizabeth Leach – Introduction to DIAMM and the workshop aims

The DIAMM database has grown steadily both in content and in structure since it was started as a simple tool for controlling information regarding photographic activity in 2009. One danger in managing this dataset is that it can attempt to be all things to all people, resulting in both an unwieldy mass of information and an inability for its content ever to be complete:
• We try to be all things to all people
• We spend disproportionate amounts of effort creating particular esoteric data types that might never be used
• We allow our dataset to grow unchecked so that it never achieves any level of completeness
• We run the risk of misleading users by offering searchable data that is not complete
• Our database runs the risk of becoming unmanageable
• We can’t provide every bit of data for every record

At some point a line has to be drawn beyond which the project team will not attempt to provide users with information.
So I decided:
• to eliminate parts of the database that are currently unpopulated, and are (realistically) unlikely to be populated
• to consolidate the information and content that we have

The result is now more controllable, but still vastly incomplete in some areas: I eliminated extended relationships and fields of more esoteric content (e.g. liminary text). The next step therefore is to connect with datasets that supply that more specialist data and allow them to access our data; we don’t want to supply data that is simply copied into another database, because updates and changes will then not apply across the board.

Data interconnectivity is becoming ever more necessary as projects start up which duplicate work already done in other projects. Funding sources are finite, and with grant funding going down and the competition for that funding on the increase, it is essential that new projects utilize what already exists, and devote their time and expertise to performing those tasks that only they, as specialists, can do. Some PIs refuse to do this, and perhaps this should be a cause to withhold funding?
• Web services avoid mindless duplication of content
• New datasets can be richer because they access a broader spectrum of data
• Updates in one database are available to all connected databases
• User access is improved
• Quality and completeness of data is improved

It is equally important that new research initiatives inform other projects that use of their resource is an integral part of their research strategy. A number of projects name DIAMM as imaging consultants (without consulting us first) or rely either implicitly or explicitly on the continued availability of DIAMM in order to pursue their research. An important note here is that DIAMM is widely perceived as permanent and publicly funded, whereas the opposite is true: DIAMM is now self-funding and can only remain online as long as we are able to fund maintenance. Growth is dependent on goodwill, earned income and input from other projects.

The aims of the workshop are to consider the following questions.

Datasets and databases:
• How can we better use the data we have?
• How can we use our time more efficiently when creating new datasets?
• How can we optimise our research time both in creating our own, and using other, datasets?
• What are the mechanisms that can be used to share or harvest data?
• What are the implications of sharing: cost; time and effort; programming; conformity; credit (acknowledgement, academic credit); visibility; secrecy (?)

What do users think DIAMM is, and what do they think it does? We have been running for nearly 13 years:
• They expect us to be available 365 days a year without fail
• They think we are publicly funded
• They believe they have a right to use us (for free)
• They believe they have a right to complain if the website goes offline
• They plan grant-funded, research and teaching activities that depend on DIAMM
• They don’t tell us that they are relying on DIAMM
• They do not offer funding to ensure the continuation of DIAMM
• They think our data is complete
• They think our data is accurate
• They expect access to what we have, but don’t share what they have
• They do not contribute data to improve content (for the most part)

How do we conform our data? We already do this to some extent with RISM sigla and RISM/CCM abbreviations for MSS, but this is neither consistent nor comprehensive. How can we connect City, Library, Source and Compositions?
Our solution is to publish a list of key numbers for these (and other) things that we might wish to connect to (see the DIAMM source list for key numbers). Collaborators don’t have to change their own key numbers, just include the numbers from an agreed master list. This should result in consistency of titling; agreement over titles of works; consistency of composers’ names; agreement over standard spellings; etc.

2. Prof Thomas Schmidt-Beste – the Production and Reading of Music Sources 1480-1530

The PRoMS project examines a fixed body of sources as musicologists AND art historians, treating them not simply as sources of information. The funding includes money for web-services development, and in particular to allow connection to the DIAMM data about manuscripts, which PRoMS has no need to repeat. The bonus to DIAMM is that new descriptive material arising from the PRoMS research will be fed directly to DIAMM, and will appear online as quickly as it is added.

PRoMS examines things that we take for granted:
• whether the initial is the first of the text or the first of the voice designation
• If text is written in red, do you sing it differently? No? So what is the visual and practical function of what is shown on the page?
• Do images on the page follow the music, or the other way around?
• The hierarchy of parts suggested by the initials or illumination
• What does it mean if there are initials in the middle of lines – is this practical, aesthetic or something else?
• Pretty noteheads – where visual and practical aspects come together: something special happens here
• Turn instructions in all 5 parts, even though only one person can really turn the page
• Why are there custodes in music notation but not in text notation?
• Why are some continuations across an opening from right to left instead of left to right?
• Signa congruentiae are not audible, but they establish a link that is not normally shown in edited versions of the music
• Line-fillers at the ends of pages…

The web page at the moment shows very basic information: ‘about’ and a list of participants. It will show information about the project and a project blog. It will eventually show the meat of the project: i.e. a database of 300-320 MS sources and about 80 printed sources; about 25 detailed MS studies and about 5 detailed studies of printed sources. Unlike most musicology projects, PRoMS has the added value of the art historian, who looks at the page as a visual presentation, whereas musicologists are programmed to read the content instead of looking at it as an art work.

The database structure is basically a flat file at the moment, which will bring together the kind of layout and codicological information necessary to engage in the research and present it to users: text; placement; colour; density; calligraphy; interpretive; relative density.

The presentation mentioned in passing a project at the University of Rochester which attempts to work out how someone looking at a picture in the 15th century would have responded to it – trying to reconstruct, virtually, a viewer’s reaction.

3. Leif Isaksen & Gregorio Bevilacqua – Cantum pulcriorem invenire (research project on 13th-century conductus)

The project studies 12th-13th-century Latin songs – conducti surviving in both polyphonic and monophonic forms. Unusually, these compositions, with very few exceptions, were not based on pre-existing music (e.g. chants). The most recent existing catalogue is 30 years old, so anything discovered after that is missing.
The paper catalogue is extremely difficult to manage and to get information from: as we know, it is much easier and quicker to find information from searchable databases – and in this medium there is also no problem with space or with saving paper. The aim of the project is to integrate information from the catalogues with information from 12th-14th-century MSS and some handwritten notes, commentaries on the catalogues, new sources and recent work by scholars.

Information provided will be as follows:
• Sources (folios, format, notation, DIAMM link)
• Style (stanzaic, through-composed, melismatic, syllabic)
• Poetry
• Contrafacta
• Techniques
• Main existing editions and recordings

Instances of works – combinations of conductus and poem – connect to:
• References and publications
• Tag clouds

The database is constructed using FileMaker and is entitled ‘Cantum pulchriorem invenire’. They are in the process of creating a web portal.

Linked music data is a web-based approach to the way the data can integrate with itself – there is considerable in-house expertise in Southampton. It can be taken to extremes, building semantic ontologies, and the questions arise: How far do you link? What do you link to? It does help to have other data to link to. The use of URIs as global identifiers for information is crucial, and is more powerful than having standard spellings, since it allows you to keep all the variant spellings while the unique web identifier acts as the master title. Perhaps the day of the standard title has gone!

4. David de Roure – SALAMI (Structural Analysis of Large Amounts of Music Information)

SALAMI is funded internationally (UK and North America). The project team do not claim to be musicologists, but are building resources to support musicologists, and digital audio performances are included.

Techniques and methodology: words mean completely different things in music than in computing, which can cause confusion! The idea is to build something that is sustained by the community. Looking at previous attempts to do something like this, he gave as an example the Internet Archive in San Francisco, which has an extraordinary collection of live performance recordings. The metadata is rather a mess, though!

Semantic web, linked data and ontologies: a music ontology designed by Yves Raimond is going to be used by SALAMI. Annual music information retrieval meetings involve the sharing of data and the publication of results, so this community is already sharing its data, though perhaps not as efficiently as it could.

The project, as its name suggests, will involve large quantities of audio recordings, and the software development will involve data extraction and structural analysis, which can be done better because of the increase in scale and volume; the sort of features extracted will be:
• Genre
• Key
• Rhythm
• Pitch
• Onset
…in all, about 24 features.

One of the technical features to be addressed will be signal processing: how can processing be re-used and improved, and how can an efficient workflow be developed? The data content includes around 23,000 hours of recorded music, which is examined using ‘crowd-sourcing’. However in this context crowd-sourcing is actually student sourcing: students are paid to annotate 1000 pieces of music from US Billboard charts – the annotation was free-form, so a further analysis remains to be done on how music students annotate music!
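The point about URIs being more powerful than standard spellings can be illustrated with a small, hypothetical sketch: the URI acts as the single identifier to which any number of variant spellings resolve, so no one spelling has to be enforced as 'the' title. The URI below is invented; a real linked-data resource would publish such identifiers rather than hard-coding them:

```python
# Hypothetical illustration only: the URI is invented for the example.
work_uri = "http://example.org/conductus/0001"

variant_spellings = {
    "Cantum pulcriorem invenire":  work_uri,
    "Cantum pulchriorem invenire": work_uri,
    "CANTUM PULCRIOREM INVENIRE":  work_uri,
}

def resolve(title: str) -> str:
    """Map any recorded spelling to the single web identifier for the work."""
    return variant_spellings[title.strip()]

# Two projects using different spellings still point at the same resource:
assert resolve("Cantum pulcriorem invenire") == resolve("Cantum pulchriorem invenire")
```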
In order to process the vast amounts of data generated, the project has been allocated a significant portion of supercomputing time, which will enable the data to be analysed, and the results published, very quickly. (Processing in MIREX is done with a workflow engine that tries to identify genre and mood!) It could enable us to answer questions such as 'How much country music comes from my country?'
The linked data movement started in 2007 as a rebranding of the term 'semantic web': 'semantic web' says what it is, not what it does; 'linked data' says what it does. The idea makes sense, but has proven controversial: when presented at the open data conference it was not very well received! If data is shared, the community can add more 'signal' and publish more analyses. Why is this a problem? People are used to data being contained in silos, and a loss of control and massive distribution is alarming. However, there is nothing to stop a researcher selecting the parts of the data that s/he wants to work with and giving it an identity; other researchers can then point at that identity – or URI. So all the time, instead of working within your silo, you are pushing things outwards for other people to use. Data shouldn't be hoarded in one database, it should be distributed.
Questions:
EEL: is this going to assist us in sharing data from projects that want to badge their data in certain (visible) ways?
DDR: This domain is peculiar because there's lots of stuff out there which isn't freely copyable, but lots of people have bits of it. If you want to analyse a particular piece of music you might have one version, but someone else has the other version. The notion of 'same as' in music is a very complex notion, so you have 'same enough as', which is not definable and probably item-specific. Another level in music is how effective fingerprinting is – you can tie a fingerprint into other sources of the 'same' data. You can hold your phone up to your car radio and it will tell you what music is playing. The BBC is committed to publishing things as linked data. If you create a website you just have a website, but if it is linked it is really available.
PV: a philosophical and epistemological problem: when you say that recognising a signal gives you a possibility of understanding a musical work… Hugo Riemann said 'listen, then say something about the musical work'. Are these computing analyses useful?
DDR: this is done by a chain of experts and the work is there to assist, not to replace. We see this every time we apply e-research to other domains – some say that it's not possible to analyse more than one piece of music at once. Time will tell as to whether this genuinely supports musicology.
TC: this is one type of analysis based on audio recordings, mostly of popular music. The same approach could be applied to 14th-century music; the results might not be the same as a scholarly analysis, but you might get NEW bits of information, because a human being can only manage one thing at a time, and will then have a preconception that they apply to other pieces. At least with a computer there is more 'objectivity' in the comparative work.
PV: agrees – quantity is very helpful.
TD: computation produces basic structures and basic information, not an analysis in terms most musicologists understand.
PV: if the analysis is so basic, then why do it?
If we were experimenting with this on a purely oral tradition of Libyan songs, something might come up, because we have never looked at this before.
DDR: Working with audio means you can work across a much larger background of ethnicity and style than is possible with notated musics.
TC: non-western musics are really interesting in this respect because they use non-western scales, and these are being investigated by other projects.
DDR: SALAMI is a starting point – classical music is a very small part of it. Whether you will get anything interesting from a key segmentation of the data remains to be seen.

5. John Milsom – standardised titles and composer names in early music repertories
The Christ Church Catalogue: the first phase is completed; the information now needs to be updated and put online. This is a closed collection – the intention is only to describe what is in the collection donated to Christ Church by one of the masters of the college in the 18th century. When the catalogue started there was no 'online' and no attempt to present catalogues online. There are printed authorities for printed books, providing standardized titles and spellings for both titles and composers, but there are NOT similar authorities for MSS. The authorities when this started were New Grove, because everybody had it, plus any other bibliographical material that was sophisticated and up to date, such as the Viola da Gamba Society catalogues – New Grove had also used these. Standard modern editions of music could also be used as authorities. The catalogue does refer to them, but evades the issue of standard titles. DIAMM doesn't cover the repertory represented in this collection, and didn't exist at the time the cataloguing was started. Faced with this situation, what authorities do you turn to for information? Why do you believe one authority and not another? Some people feel very strongly about one source or another, particularly researchers, but librarians can have different views. So one area where an interface has to be created is between the world of musicologists and the world of librarians. RILM has been thinking about this for a long time. (It is surprising that RISM has not picked up and run with this!)

6. Nicolas Bell (Early English Music Online)
This project received funding under the JISC rapid digitization stream. The money has to be spent by the end of July. EEMO will provide images of 300 16th-century printed music publications from the BL's collection, scanned from microfilm (which means about 4x as much material as if they did new photography). The microfilms are scanned at 400 dpi from the master copies, so the films are in good condition. The online resource will be delivered by Royal Holloway's Equella, a system currently used to deal with exam papers, but being expanded for EEMO. The funding will allow the BL to update all catalogue records for these publications and introduce links to the images from the BL catalogue, from COPAC and from the RISM UK catalogue series A2. RISM A2 is MSS, but they're going to add these prints; this in itself is a radical decision. Most of the prints in the collection are anthologies, and the library has chosen a selection – a few Petruccis, Gardanos, some German, some English – which forms a representative selection of music prints from that period. The problem arises of how to upgrade the catalogues to be most useful to musicologists while also keeping in line with library standards.
This means getting people who know about 16th-century music to inventory and transcribe title pages. It also means examining published bibliographies. The researchers will use their transcriptions as the basis for making a catalogue record, but will normalise the text in order to make it more searchable. An accurate diplomatic transcription retains all capitals and abbreviations – but that makes it unsearchable: e.g. 'XX Soges' (with a macron over the o) means '20 songs', so you wouldn't find it if your search string said '20 songs'. You need to know what you're looking for before you can find it! Therefore they need to list all forms of the titles so that a search can return meaningful results. The contents list will therefore need to show variant spellings of composer names and attributions that we know from our knowledge of the repertory, but which are not manifest in the copy. The versions they use will probably be those approved by the Library of Congress. This is a normal procedure for library catalogues, but most early printed books haven't been described in such detail because they are old acquisitions. For later publications a longer record could be derived from another library, but a lot of the prints in EEMO don't exist in other libraries because they are early and sometimes unique. The Anglo-American cataloguing rules they are forced to apply to these records say that there must be no capitalisation, which is frustrating when it's clearly there. The information in its original form will be kept, though, and presented perhaps in a .pdf or Word file so that it can be used, e.g. to print out and study alongside the book. This could be the first stage in a much-expanded future – one project or more. The scanned microfilm images can gradually be replaced with new colour images. They will be available to ARUSPIX, which is investigating ways of applying music recognition to early printed music, and Ted could transcribe them in CMME.
TC: another project – by Sandra Tuppen and Richard Chesser – is examining optical character recognition for printed lute tablatures. We can do optical recognition on these fairly successfully already. Many lute anthologies are very miscellaneous: those collections contain arrangements of vocal music, a large amount of which comes from the vocal partbooks that form the bulk of the BL collection. So there is the possibility of internal concordance connection. In certain cases the xls spreadsheet mentions other copies …
NB: the collection will not include copies of the other copies of the books. The transcription, however, will come from the most perfect original – if the BL one is not very good they will go to another version to read it. If some partbooks are missing, the record will say that information has been taken from another source. The project will add bibliographical information and the pictures will be downloadable – DIAMM can have them, even though they don't meet the colour standard. However, any picture is better than none.

7. Discussion/Feedback session 1
Q. Is the idea of a discrete title for a work now dead if we're linking via a URI or a meaningless number?
A. It's probably not dead, because we still need paper publications, but we need to think in electronic-data terms of URIs. Giving a MS a single identifying number is worthwhile, but we can't even decide 'what is a work?' We are also dealing with questions about what is the master work – the one with 3 voices or the one with 4 voices?
Texts might be in one source only – voice parts might be interchangeable between pieces. Single object identifiers are needed at the level of the text or the voice part, not just at the level of the work. See: André Guerra Cotta and Tom Moore, 'Modelling Descriptive Elements and Selecting Information Exchange Formats for Musical Manuscript Sources', Fontes artis musicae 53/4 (2006).

8. Prof. Henrike Lähnemann – the Medingen project
This project reconstructs the text output of a specific convent at a specific time: the Convent of Medingen. The MSS from the convent are now scattered worldwide. What I want is to bring together digital images of the MSS – not just as pretty images, though: I want to make them interconnected and searchable. What is exciting is that none are identical, but all are connected – the texts are in Latin and were transcribed into German. The result of this copying from the MSS was to turn them into a devotional resource. This is a prototypical group of resources: a record of how the nuns think. The order provides the matrix of how they produce devotional texts.
Medingen was a centre for Hanseatic trade, with MSS and artefacts from all across this trade route. This was a very rich area – Lüneburg was the only place with salt. They financed five convents with a huge number of artefacts. In the Lutheran Reformation they became protestant in name, but nothing else changed: no new altarpieces or MSS, no 'baroquisation'. The first reform came from the Netherlands. All the MSS in this dataset come from between the reform and the Reformation.
The nuns wrote in Low German for lay-women, and this was especially picked up by Anglicans on the grand tour, who would buy the books. The MSS are small, so made ideal keepsakes. They all start from the matrix of the liturgy and constantly quote small parts of the liturgy, but also vernacular religious songs and texts of the time. All were linked into the daily life of the convent: there are rubrics about the choir singing and what you should do while they are doing that. Shorthand is used for the liturgical parts, while more complex bits are written out in full. The liturgical parts are taken over into the vernacular MSS – for lay-women in Lüneburg related to the nuns (sisters, nieces). There are many references to 'play on the organ of your heart', the 'noble harp of the soul', etc., which seem very secular. The shorthand music is very close in form to notation fully written on staves. A whole group of nuns entered as novices in 1481 and started their prayer books at the same time. They put their names as initials in the middle of the book. A group of three nuns (sisters) entered the convent and gave a book to a fourth sister, which links into the wealthy families in Lüneburg.
The challenge is how to integrate into the description and editing of these MSS (as a literary historian and a linguist) the music notation and shorthand, the Low German material with the Latin, etc., to make the network of one literacy and one devotion visible. This shows an example of the markup based on TEI. This has to be integrated into the website to make the database into a usable web resource that would be of interest to other disciplines: literature, linguistics.
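As a very rough illustration of the kind of TEI-based markup mentioned above, the sketch below parses a minimal, invented TEI P5 manuscript-description fragment and pulls out its identifying fields. The repository and shelfmark are placeholders, and the project's real encoding is of course far richer.

```python
# Sketch: reading identifying fields from a minimal TEI P5 <msDesc> fragment.
# The fragment is invented for illustration; the project's real markup is far richer.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
fragment = f"""
<msDesc xmlns="{TEI_NS}">
  <msIdentifier>
    <settlement>Hildesheim</settlement>
    <repository>Example Library (placeholder)</repository>
    <idno>Example MS 1 (placeholder shelfmark)</idno>
  </msIdentifier>
</msDesc>
"""

root = ET.fromstring(fragment)
ns = {"tei": TEI_NS}
settlement = root.findtext("tei:msIdentifier/tei:settlement", namespaces=ns)
repository = root.findtext("tei:msIdentifier/tei:repository", namespaces=ns)
shelfmark = root.findtext("tei:msIdentifier/tei:idno", namespaces=ns)
print(f"{settlement}, {repository}, {shelfmark}")
```

msDesc, msIdentifier, settlement, repository and idno are standard TEI P5 manuscript-description elements; the point is simply that once a description is encoded this way, fields like these can be extracted and cross-linked automatically by other resources.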
A pilot website is available online: http://research.ncl.ac.uk/medingen/ There will be a link to the existing catalogue data for each of the MSS, and a full bibliography for them with linked-in PDF files of all copyright-free papers; they also have permission for many later articles. You should be able to browse the MSS, to enlarge them, and to compare them by opening two windows.
The overall structure of the database will be based on HiDA4 (Hierarchischer Dokument-Administrator version 4.0, a software package developed especially for archives and libraries). Each manuscript will form a data-set accompanied by a 'readme' file describing the state of the cataloguing; this allows flexible handling of the depth of cataloguing as the project develops. New manuscripts will be continuously added as catalogue entries while the existing entries are digitized or transcribed. All data will be hosted on a common fileserver, managed by Subversion. The metadata set will follow TEI P5, the Text Encoding Initiative module on manuscript cataloguing. Each manuscript will be structured, in addition to folio numbers, by its division into chapters or paragraphs formed by rubrics, and by liturgical occasion, using and expanding the codes developed by the CURSUS project. The liturgical occasion is applicable to all of the manuscripts from Medingen except for the library books. This allows thorough cross-referencing with the liturgical manuscripts from Medingen (cf. appendix 3) and with the other numerous databases of liturgical music. The catalogue entries use the MASTER-DTD text, which will be XML-compatible, the tagging to be developed in close cooperation with the manuscript census of German manuscripts.
The images of the manuscript pages are high-resolution TIFF files, with JPG files at different scalings for internet representation. The manuscripts can be browsed and enlarged using a magnifier. For internal use, over 7,000 working images of manuscript double-spread pages are already available, but they are of variable quality and are restricted to use for transcribing and editing. The data set of the final database will consist of 600 dpi TIFF files with colour and metrical scales; for internet representation, smaller JPEG2000 files will be produced on the fly. Copyright for all manuscripts will rest with the libraries and has been negotiated with twelve of the libraries (Berlin, Göttingen, Hamburg, Hildesheim). Professional images of the manuscripts will be taken through the libraries and their contractors and should be housed, whenever possible, at the institutions. The images will be provided with a non-exclusive licence, as established e.g. by DIAMM. When the institutions involved cannot provide the resources necessary for hosting the images, they will be held on the project's server. Agreements will be made with all the libraries for royalty-free use of the manuscript images on the web. Licensing will take into account the DFG guidelines on copyright and rights-management issues. Transcriptions are entered extending the TEI P5 scheme where necessary, with special characters encoded in Unicode as developed by the TITUS project WordCruncher.
Musical notation: the main form of musical notation in many of the prayer-books consists of reduced Gothic choral notation written between the lines of text, using the ruling for the text.
Since this is a form largely confined to devotional texts from Medingen and the other Lüneburg convents, new MEI customisations will be developed and published via the TEI Music Special Interest Group. These can be based on a set of descriptors by Stefan Morent for marking up earlier medieval neumatic notation, in combination with the existing descriptors for the Gothic choral notation from which the music notation in the prayer-books is copied. Music samples of the liturgical items and hymns are given as 8-bit WAV sound files and linked to the appropriate manuscript pages and the transcripts, to allow the actual musical notation to be followed.
The bibliography will include all literature on the Lüneburg convents in general, on the Medingen manuscripts in particular, and resources on Middle Low German. All copyright-free older publications (for example the history of the Medingen convent by Lyßmann, 1772) are made available as searchable PDF files, and all open-access resources are linked in. A particular emphasis will be on regional and older specialist literature which is often not available in the major libraries; whenever possible, full-text scans will be made available in agreement with the publishers (for example publications on the convents by the local newspapers). Full documentation of the technical side of the project will be available via the website.
There will be a printed edition of one sample prayer book, but each book is different… the nuns were post-medieval: they had access to a massive wealth of information, and each nun made her own MS and each linked to her own choice of information. This has no funding yet. This is something that couldn't possibly work in print, so a website presentation adds so much more than just the sum of the single editions.

9. Paul Vetch – DIAMM, PRoMS and web services for musicology datasets: technical issues
In the context of our projects, 'web services' in reality may mean that it is difficult to develop a realistic web-services model in the way we think we want to. There are a number of difficult problems: how do you get projects designed to do difficult things to work together, and how do you make sense of them together? This is not simply a question of identifying consonant data, but of how to deal with it. DIAMM is logical to itself, but may not be logical to another organisation. An example is the 'Item' entity – consonant to the instance, but not to general usage. At what level, therefore, do you connect individual resources? Web services can connect by way of 'ligatures' – this creates high-level web services. A low-level, database-to-database connection is simpler, but what happens if one database is down? You can cache the data: you store a local version that is updated when you access your page, but if there is no connection it just shows the cached version of the data. A connection between DIAMM and PRoMS is not one-way, it is reciprocal: PRoMS needs to be able to update DIAMM content – e.g. if they discover a change to the descriptive information when they look at a MS. In terms of these two projects we are still very locked in to images – which is not all bad.
Situation 1: a project such as CFEO (Chopin's First Editions Online, www.cfeo.org.uk) starts out very similar to DIAMM–PRoMS but then diverges massively. CFEO is based on a printed catalogue.
On the other hand, if we compare CFEO with OCVE (the Online Chopin Variorum Edition, www.ocve.org.uk), these are two projects showing very similar data. OCVE has corrected data from the print catalogue. The plan is to merge them: combine them and make an online edition of the annotated catalogue, so all three resources become one glorious big resource. This is a good plan, but there is one major problem: money! CUP, the publishers of the printed catalogue, don't really want to give the printed catalogue away. Regardless of these barriers though, to support this sort of situation you need a very complex web-services model: someone who has bought the catalogue needs to be able to log in and then see the images from the online projects with the most updated catalogue information. This sounds good. Problem 1: the libraries who supplied the images didn't do so on the basis that they would be resold as a CUP online publication. A subscription view for the catalogue is good for sustainability, but not good for the libraries' permissions. These are samples of some of the things that come up and bite you on the bum – things that should work very well, but for purely practical reasons don't.
'Web-servicing' images: JPEG2000 (J2K) is an image format with exciting prospects. It doesn't make sense to store two copies of the images, but if they are served as JPEG2000 from a services server you only have one master image, and all surrogates are derived from that, on the fly (a minimal illustrative sketch appears below, after the Anglo-Saxon cluster example). How do you stop people from pirating the images and using them for themselves? We have to hide the information about keys from the rest of the world to stop copyright infringement and data theft. So what does it mean if you see an image from one project in another? Do you watermark it with the library logo, the project logo? Nobody likes those!
Searching across datasets: there are various ways of searching (not enormously interesting). The federated approach has a central place, or broker, from which you can carry out a search across a number of different resources. The most interesting example in the UK is the London museums – a federated-services-enabled search. The technical problems are all about how you display the data: what does it mean to see a certain field from a certain context in a new context? How do you show where it came from? How do you aggregate the information? If information came from a number of resources, how do you find out where it came from or who created it?
Experiment – Anglo-Saxon cluster: what it means in practice to bring together projects with a common subject domain. Four projects: ASChart (Anglo-Saxon charters), eSawyer, LangScape, PASE (Prosopography of Anglo-Saxon England). In many cases these projects are using the same texts – PASE uses many of the ASChart texts. Bringing these apparently congruent things together in a way that is not superficial looks as if it should be easy, but is frankly a nightmare. One approach is to do something very simple that just goes off to the individual resources, finds text strings, and dumps the results on your screen. A more interesting way is to try to find ways to relate data so as to make useful information and useful connections back to it, but it is very difficult to use the resource unless you know a lot about the individual projects. In the case of the Anglo-Saxon group, a list of research questions was put together. The challenge in the end isn't therefore a technical challenge – it's a research-question problem.
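The sketch promised above illustrates the 'one master, surrogates on the fly' idea for image web-servicing in the simplest possible terms, using the Pillow imaging library (which can read JPEG 2000 when built with OpenJPEG). The path and sizes are hypothetical, and this is not how DIAMM, OCVE or any of the projects discussed actually serve their images.

```python
# Sketch: deriving a delivery-sized JPEG on the fly from a single JPEG 2000 master.
# Assumes the Pillow library with JPEG 2000 support; the path and sizes are hypothetical.
from io import BytesIO
from PIL import Image

def derive_jpeg(master_path: str, max_side: int = 1024, quality: int = 85) -> bytes:
    """Return JPEG bytes scaled so that the longest side is at most max_side pixels."""
    with Image.open(master_path) as img:          # e.g. "masters/example_f1r.jp2"
        img.thumbnail((max_side, max_side))       # preserves aspect ratio, never upsizes
        buf = BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=quality)
        return buf.getvalue()

# A delivery front end would call derive_jpeg() per request and stream the result,
# so only the single archival master is ever stored.
```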
Because the challenge is a research-question problem, the site is littered with sample research questions. One of the principles is that these projects were 'finished'. They therefore didn't want to have to go back and rebuild them, but to bolt on something that would connect them up with the lightest of touches. There is no money to redevelop, so that's not an option. They wanted to put up lightweight connectors. It took 4 developers to put a connector on the project that they themselves had worked on. If you follow a query, it may send back results from 3 resources, but you can only see the data by going to one resource at a time. There is no hint in the search results that three different resources are throwing back results. Acknowledging the source is quite difficult. Most of the problems of aggregating results are in how you display the information. How do you put the stuff together in a way that solves the problems?
A final note is in relation to stained glass windows…. analysis of images that identifies similarity or classification can now be provided as a web service. Algorithms are sufficiently efficient that you can perform content-based image retrieval (CBIR) in real time. This has very exciting possibilities. How can we use web services as part of the editorial process? With content-based image retrieval we can bring up clusters of images that have already been classified to help build consistency.
DDR: this links back to something said before: 'web services' (small initial letters) consistently refers to connecting up data sets; 'Web Services' with capitals means something different in informatics. Creating connections between resources that were not designed to be connected is incredibly difficult, and the only real answer is for the resources to be built in the first place with the intention of eventually connecting. This doesn't, however, deal with the simple issues described above, of one piece of data meaning something different to the two different projects or researchers who created it.

10. Theodor Dumitrescu – ReMeDiUM: Renaissance and medieval music digital universal media platform
All of our projects have a lot of overlap, and we don't need to reinvent the wheel. What we should be looking at now is intelligent data interchange. Paul has laid out the conceptual background, so the following is about things that are directly relevant to our own projects.
Consortium: in January 2007 a group of musicology projects got together to brainstorm ways in which they might connect up their content and share data. Since then several grant applications have been written, but none submitted; this is becoming more pressing as new projects spring up. Why do we want to do something like this? If we look at the CMME website we can see a sample of a chanson MS from around 1500. Consider that a group of different projects are going to look at the source, each in a different way: we're going to do background work, but we're also going to display that information on the website. We all have a certain amount of common information, but we approach it with different emphases, and with entirely different datasets. How can we make it so that a user going to any one of these websites can automatically pull up the information about that same data from the other projects without having to go to yet another website? Thanks to semantic-web technology these problems of different naming or tagging conventions can be solved: we can use standardized ways of referring.
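As a toy illustration of what 'standardized ways of referring' behind the scenes might look like: each project keeps its own labels and keys, and only the mapping onto a shared identifier is agreed, so records can be joined across projects without anyone changing their own data. The identifiers and labels below are invented, loosely in the spirit of the RM/RS numbers described in the next paragraphs.

```python
# Toy sketch: resolving different local labels to one agreed shared identifier.
# All identifiers and labels here are invented for illustration.
from typing import Optional

# Hypothetical shared master list (one entry shown)
SHARED_SOURCES = {"RS0001": "an agreed canonical description of source X"}

# Each project keeps its own naming; only the mapping to the shared key is published.
LOCAL_TO_SHARED = {
    "project_a": {"Source X (anglicised city name)": "RS0001"},
    "project_b": {"Source X (local-language city name)": "RS0001"},
}

def shared_id(project: str, local_label: str) -> Optional[str]:
    """Translate a project's own label into the agreed shared identifier, if known."""
    return LOCAL_TO_SHARED.get(project, {}).get(local_label)

# Two differently named records resolve to the same source behind the scenes:
assert (shared_id("project_a", "Source X (anglicised city name)")
        == shared_id("project_b", "Source X (local-language city name)")
        == "RS0001")
```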
If you are a consortium member you have to use a certain set of display parameters. Sometimes, for practical or political reasons, that is not possible – e.g. using anglicised city names. What do we do instead? We get semantic-web technology to provide linking that happens behind the scenes. You have to expose very little to the public, so most of the work is done in the background. The ReMeDiUM model suggests a set of information returned by a query regarding MS 'X'. This could show an edition drawn from CMME, images of the manuscript from DIAMM, and text transcriptions from a third website – but all the data is clearly marked to show where it came from. Having decided what he wants to see, the user can then choose the resource that will show him the information he seeks.
ReMeDiUM deliverables – how might we go about implementing this?
• Core schema – a data ontology for the smallest subset of data that all our projects share
• Simple standard identifiers – RM (ReMeDiUM music) and RS (ReMeDiUM source). These are common to everyone, and provide an identifier for a piece that might have no other identifier – e.g. a composer, to differentiate it from other things
• Data harvesting/interlinking software
• Public web interface – just one way of harvesting data
Schema: what is the minimal amount of information that you might want to build into your back end? This looks big, but actually it's pretty minimal. From this, any of the collaborating websites/projects should be able to expose this kind of data to other projects in a standardized, structured way. Most projects do have this information, but maybe not all of it. This is a basic core.
The resource map is an XML file using semantic processes that are common to any kind of data. It sits in the background of any website, telling people where bits of data are and how to get to them – e.g. I have three transcriptions of this piece, this is where they are and how to get to them. A robot from Google can look at this and get some idea of what is on the website. A simple extension to this would allow us to describe simply musical things. The conceptual stuff is already there and becoming more standardized; we need to extend it a bit though. We create a ReMeDiUM standard that says 'this is the way that our websites can talk and communicate music information.' This is becoming more pressing as more individual scholars and teams start creating their own datasets.
MW: Paul makes this sound very difficult, Ted makes this sound very easy. Which is true?
PV: Conceptually it can be very simple, but from the programmer's point of view the interface is the problem. Ted is looking at it from the other end, with a few small usable interfaces in mind, and thinking about easy ways we could combine those things. The ReMeDiUM web page is just one way of envisaging simple data presentation. We don't need to present all the data in one place.
TD: We don't need to show more than a superficial linkage. In many ways there is more congruence between CMME and DIAMM. Have we thought about using FRBR, which has been used to describe works of art?

11. Michael Scott Cuthbert – interconnectivity between music projects
'Twigs growing out of a tree' – this is good advice in telling students how to organise their thoughts and data. How do you advance your main topic? How am I going to organise my data so as to solve my problem? This was a main paradigm in how to program: how to get from data to new information, moving from one end to the other.
A problem we tend to have in the humanities in general, and medieval music in particular, is that very often we organise our data in order to solve one particular problem. We need to think more about reversing the paradigm. There are twigs everywhere. Our data is a lot richer than many projects tend to be. Object-oriented programming means re-organising projects around the data and the particular things that data can do. It has certain properties – so we need to think about how to organise data into subsets.
Music21 is a toolkit similar to 'Humdrum'. It takes musicologists using Finale or Sibelius as its model. These music-processing softwares are not simple to use, but people get to grips with them pretty quickly. Our humanities projects can be either very complex or very simple, so if you want to solve problems that requires a certain amount of commitment – a 2–3 week learning curve, and this is what people will apply to learning a new piece of software that does something they particularly want to do. The Music21 project is quite 'buggy'. Even very early and very poor information is useful, so we get it out there so people engage with it and update it. They will find uses you might not have thought of for your project.
Skip to slide #11. Here's an example: take all the lyrics and assemble them into one text. The automated web browser will automatically Google the word 'exultavit', and will classify whether this is a common text or not. We have to be careful, though, not to attempt to reinvent the wheel: there are lots of sophisticated analysis packages available to analyse data. We want to do something different, but we can build on what is already out there. As an example of what the software can do: it can transpose, it can show sections, and it can label things on strong and weak beats (see the sketch below). This may help to analyse whether a composer does certain things on strong beats rather than weak beats. The machine can do this extremely quickly and accurately over a massive quantity of data, whereas a human would be so slow that it would be pointless to attempt the same thing accurately and with demonstrable data to back up the results.
Music librarians tend to be interested in flexible metadata. We can program the toolkit to follow rules about e.g. adding ficta to a melody. This could be very useful in indicating editorial interventions: if you are analysing style statistically, the system will ignore the part that is put in by someone else. You could then apply the rules and see how often the instance shown really does get corrected. Unfortunately, in the case of ficta, professional judgement is also needed, so this particular example didn't work – or perhaps we needed to come up with a more complex set of rules than we used the first time around. So the wrong result can tell us how to change our rules. Unfortunately in music, though, rules cannot be applied slavishly, so you have to do more programming to qualify each instance of a rule application – e.g. decide which notes are accented or not, then qualify the rule to take that into consideration. The linked-data approach may suggest which pieces the writer of that treatise might have known – so perhaps not the pieces in which that rule doesn't work.
Going back to some of the other discussions today, the issue with metadata is the question of authority. If you don't get it right every time, the computer isn't going to find it, because it is very literal.
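The sketch referred to above shows the kind of bulk operation being described, using the music21 toolkit itself; a Bach chorale from music21's bundled corpus stands in here for the repertory of interest, simply because that corpus ships with the library.

```python
# Sketch: transposing a piece and counting notes by metrical strength with music21.
# A Bach chorale from music21's bundled corpus stands in for the repertory of interest.
from music21 import corpus

piece = corpus.parse("bwv66.6")            # load a chorale from the built-in corpus
transposed = piece.transpose("M2")         # transpose the whole score up a major second

notes = list(transposed.flatten().notes)   # every note in the transposed score
strong = [n for n in notes if n.beatStrength >= 0.5]   # downbeats and strong beats
print(f"{len(strong)} of {len(notes)} notes fall on strong beats")
```

Exactly this kind of query, run over thousands of pieces rather than one, is what makes the statistical comparisons mentioned above feasible.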
Because the computer is so literal, you do need an absolute authority list somewhere, so I believe there is still a need for a standard title or standard composer list, and that can be supplied by something like the Library of Congress unified name catalogue.

Philippe Vendrix – some questions arising from digital projects and today's discussions
How can we teach now? Money goes to research, but not to teaching. Digital humanities causes a dichotomy: if the online courses take off, will you need a professor of palaeography? It is good to make mistakes: we put stuff online even though we know there may be mistakes.
To distinguish between the data and the research is very important. If we distinguish between what we really want to do and the data we are gathering, then we can work in a better way. If we want to know about the way music appears on a page, then we work on that – we don't ask who owned it, we just work on that particular information, and give the data to the team that wants to work on it. 15 years ago we were still at the level of the universal enterprise that wanted to cover everything for everybody. Today that is impossible – we can see the mistakes: why would we repeat them? The more a project centres on one subject, the better. Does it really matter that these projects don't connect up? We could defend a web project the same way that we defend an article in a journal: it stands alone. Rather than absorb, you quote.
Another question: we may not have to search for universal systems. We can work thinking only about our problem: we should not worry, therefore, about other people's problems! Mensural notation is what we're interested in – let's not worry about Stockhausen. CMME is a tool dedicated to one type of notation, so it's good at what it does. If we use this tool then we know that our research can be used by anyone else.
Surprising: we still want to be complete. We write bibliographies – who cares about bibliography today? If we dedicate a lot of time to transcribing RISM/CCM onto the web, then we're not really adding to scholarship. We will never succeed in being global. Is it useful to do this – will it bring something new to scholarship? DIAMM has little in the way of research outputs: it is a research resource. We are just encoding information that already exists in other ways, but we're not doing research. Because we've collected lots of data we want to make it usable to other people. Data is usually a by-product of something we wanted to know, not the end in itself. When we make a proposal we spend a lot of time just collecting data. Sometimes we embrace too large a corpus. If all scholarship is related only to corpus problems, then where will we go? Will there be a discipline of musicology in the future? Will it look too boring because it is just seen as data gathering? Computational musicology is very strongly centred in early music.
MSC: the intense projects of digital musicology have been something of a failure at solving the problems they set out to solve – publications are more about the process than the result. But the side-products of using the tools enable us to move faster. This can be amazingly fortuitous: not wanting to lose his train of thought, he needed to know if a particular motet was in San Lorenzo. So instead of crossing the room he checked La Trobe, and found a new concordance. Are we here to make research easier, or are we here to make research?
At a workshop in Dublin about digital humanities, Willard McCarty said you can produce all this stuff, but where is the research question/answer? What is the point in spending money on developing electronic resources when we should be spending money on research?
JCM: This is actually what I've been saying for a while: the point about connecting up is that less of the grant and the researcher's time is spent gathering data that is already out there, and more on the research that only the expert can do! It is crucial to use more cataloguing data. The DFG insists that every catalogue of a public institution must be open access online. It is scandalous that some catalogues are not online – e.g. CUL!
A sample project from CCH worked with mediated crowd-sourcing: content goes through an independent editorial board. This can work if you are working with relatively small materials; if not, it is less easy. They have a pilot phase available – a catalogue that would work as a bridge between different catalogues. How can we reward people who contribute, and the editors, in a way that is academically recognized? The idea now is to make sure that any contribution has a name attached. The authority is the contributor, not the database where the contribution or contributor lands. The editors get credit by publishing annually: you evaluate for a year, then you publish only once a year. DIAMM has a byline. What prevents Myke from contributing is that it's not immediate: sending something and not seeing it is demoralising, and so the contributor doesn't send the next thing. Should we have a list of contributors to DIAMM pointing to their MSS?
Another problem we have to think of is people not coming through on their commitments: they say they'll produce stuff, but then they don't. People want DIAMM to have more image content – there is no need to expand data and textual content, but we do need to start attaching good images to the data we have. E-codices (Switzerland) put out a call for suggestions of important MSS to be digitized – the person had to put in their own research as their part of getting the MSS digitized. This kind of approach, where MSS are prioritised if there is pertinent research attached, is very useful.