Technical Resources Workshop 22 September 2010
Transcription
Technical Resources Workshop, 22 September 2010, Oxford eResearch Centre

PROGRAMME
10.00 Coffee
10.30 Welcome and description of DIAMM activities (Julia Craig-McFeely)
11.15 Elizabeth Eva Leach (DIAMM): the virtual learning environment and DIAMM
11.40 Paul Vetch (Centre for Computing in the Humanities): Online delivery of DIAMM materials; how DIAMM relates to other projects
12.30 Feedback session 1
1.00 Lunch
2.00 Theodor Dumitrescu (University of Utrecht): CMME software and the CMME database
2.45 Ichiro Fujinaga (McGill University, Montreal): Gamera and its applications in shape-recognition for musicology and other subjects
3.30 Feedback session 2 (Grace de la Flor)
4.00 Tea
4.30 Summary: website analytics (Greg Skidmore); ways forward for DIAMM
5.00 Close

Questions: How can DIAMM become more useful to researchers? Where do we see DIAMM going in the future (e.g. what will we be doing in 10 years)? How can we improve our outputs? What new technologies should we be embracing? How important is it to link up with other online resources, and what sort of connections should we be making?

Present: Julia Craig-McFeely, Elizabeth Eva Leach, Segolene Tarte, Stella Holman, Esther Anstice, Greg Skidmore, Theodor Dumitrescu, Ichiro Fujinaga, Grace de la Flor, Paul Vetch, John Pybus, Richard Polfreman, Paul Kolb

Technical Resources Workshop Report
Summary of Workshop

1. Julia Craig-McFeely: DIAMM
Website: www.diamm.ac.uk

The presentation summarized the activities of DIAMM over the past 12 years, and described the activities that are funded by the current grant from the AHRC. It also discussed the database and back-end descriptive data content and how that is managed in the database framework. A detailed version of the presentation and accompanying materials were provided to participants on USB sticks.

The image content ranges very widely, from tiny fragments and heavily damaged leaves or partial leaves that have been recovered from sources as diverse as wallpaper scrapings and binding reinforcements, to complete choirbooks in excellent condition, sometimes in their original bindings. The content was acquired originally with grant support from the HRC and the AHRB (now the AHRC), but has been expanded through collaboration with other projects that have obtained funding to digitize documents and from the use of project-acquired funds earned through consultancy. The project also obtains a small number of images from donations by the owners, but this is usually only possible for small documents where the cost of digitization is small.

Quality has always been an issue and, surprisingly, it is still very difficult to obtain images of a consistent quality from suppliers such as library digitization services, where a fast commercial throughput means that it is not easy to maintain quality controls, and where sometimes a lack of visual acuity in the staff means that they cannot see artifacts on the images caused by, for example, using an unsharp mask during capture. The reasons for inclusion of a colour and size scale in each image have been demonstrated and accepted widely, but some archives continue to create ‘reference’ shots separately from their main images. This has proven ineffective where the camera operator subsequently moved the camera, but it also has implications for the long-term viability of image ‘collections’ like this, since digital images are still relatively new, and therefore we know little or nothing of colour drift over time.
The problem with having a single reference image is that if that one file becomes corrupted (either to unreadability, or without the operator being aware), all the colour and size information for the accompanying images is also therefore corrupted, rendering a whole set unreliable.

Since the start of the project, imaging technology has moved on from scanning backs to single-shot cameras, and the project now undertakes most of its imaging using a single-shot 65-megapixel camera, though the 144 Mpx scanning back is still invaluable for larger sources, and for damaged sources where the tighter pixel resolution allows more complex digital restoration to be undertaken. Using the single-shot camera the project can undertake UV, IR and multi-spectral imaging, and the scanning back can also be used for UV work. Digital restoration has aided scholars in many disciplines to retrieve material originally believed lost from damaged documents. Restoration, however, is now rarely undertaken within the project, since it is extremely time-consuming.

The database is the most important current activity of DIAMM: the majority of users of the website visit it in order to access catalogue metadata and other information about manuscripts, and many only access images rarely. Partly this is because the collection of images is extremely incomplete, and many very important sources are not available online even if the project has obtained images, which are stored in the dark archive.

The database is currently populated using FileMaker Pro, but is delivered via a SQL database to the website. There have been problems in the past in translating data from one medium to the other, largely because the ODBC connection with FileMaker (FM) was not well managed. Newer versions of the software have corrected this problem and it is now possible to export directly from one database to the other. However SQL is a far less ‘forgiving’ environment than FM, and getting the two databases to talk to each other is far from simple. The end result, however, will be an upload mechanism that can be run more-or-less at the click of a button, and will allow the online dataset to be updated as frequently as we wish, instead of the current system in which content updates are only done every few months.

The first diagram shows the base data structure that is used to deliver our current online system. It was originally a mechanism that allowed users to get to images, so it has the image as the smallest level of information available, and all images are a subset of the Source table, which is basically a list of manuscripts (each manuscript being anything from a fragment to a large, complete, bound source). Attached to the image table is a ‘SecondaryImages’ table which allows us to deliver different versions of an image (UV, IR, restored, watermark, detail etc.) as an adjunct to the main image.

Manuscripts, however, are rarely considered in isolation. There are many ways in which manuscripts were originally linked to each other or have become linked:
• By design – e.g. a set of partbooks
• By content – e.g. the same copyist working in multiple sources, or the works of a particular composer being preserved in a set of books
• By intellectual construct – e.g. because the MSS originate in the same geographical region, or because they had political connections
• By reconstruction – so a set of fragments may originally all have come from the same complete MS
And so on.
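Such groupings are handled in the database through ‘sets’ linked to sources, as described below. The following fragment is a minimal, hypothetical sketch of that relational pattern – the table and column names are invented for illustration and are not the actual DIAMM schema:

```python
import sqlite3

# Minimal, hypothetical sketch of the Source/Image/Set pattern described above;
# table and column names are illustrative, not the real DIAMM schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Source (source_id INTEGER PRIMARY KEY, shelfmark TEXT);
CREATE TABLE Image  (image_id  INTEGER PRIMARY KEY, source_id INTEGER REFERENCES Source, folio TEXT);
CREATE TABLE SecondaryImage (sec_id INTEGER PRIMARY KEY, image_id INTEGER REFERENCES Image, kind TEXT); -- UV, IR, restored, watermark, detail
CREATE TABLE MSSet  (set_id    INTEGER PRIMARY KEY, set_type TEXT); -- partbooks, copyist group, reconstruction...

-- Intersection table: a source can belong to many sets, a set has many members.
CREATE TABLE Source_MSSet (source_id INTEGER REFERENCES Source, set_id INTEGER REFERENCES MSSet);
""")

# One fragment belonging to two different kinds of set:
conn.execute("INSERT INTO Source VALUES (1, 'Fragment X')")
conn.executemany("INSERT INTO MSSet VALUES (?, ?)", [(10, "reconstruction"), (11, "same copyist")])
conn.executemany("INSERT INTO Source_MSSet VALUES (?, ?)", [(1, 10), (1, 11)])
```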
On each page there are items – in this case usually a musical composition or work. Depending on the layout this may represent a complete work, or it may represent one or more voice parts of a musical work. Sometimes there are many items on a page (which corresponds to an image) and sometimes an item is copied over several pages.

The database deals with this by allowing each MS to belong to a ‘Set’, and the type of set is also defined. Because a Source (MS) can belong to more than one set, and any set can have more than one member, these are linked by an intersection set. Items are linked directly to the Source table, since they are effectively independent of the images, but they are connected to the image set by an intersection set.

Musical works present issues that can be difficult to resolve: is a Kyrie a work in its own right, or should it only be considered an item in a larger whole called ‘a Mass’? The repertory itself answers this question, since the idea of the parts of the Ordinary of the Mass forming part of a linked cycle is a relatively new one, and these ‘movements’ were originally composed in isolation. Motets too create issues: some motets are written in sections, and any of those sections might appear in a different manuscript as a motet in its own right. Therefore items too have to be linked into sets or groups via an intersection set. Items require a composer database, and in fact also require a ‘composition’ database as a superstructure to the individual items, since an appearance of a work in one source may be substantially different from its appearance in another source. Works may also be based on another work in the database, used as a model, and there are models for texts which are used in adjusted forms as well as in their pristine form.

As anyone who has worked on a structure like this knows, bibliographies have their own problems. In DIAMM we are not simply dealing with a bibliography, but with a bibliography that is linked to a large manuscript database. Some bibliographical items therefore may refer to a manuscript, some to items in a manuscript, some to items by particular composers (who appear in many manuscripts) and some to cross-composer and cross-manuscript subjects such as genre or a particular titled work. This causes quite a complex network of intersection sets between the bibliography database and all the other tables in the database, including the tables now added to deal with composers.

Texts require their own database (giving original and standardized spellings, language tables etc.), as do purely musical items such as voicing, clefs, mensuration and so on. The database grows with each iteration of information, and expressing it creates a complex structure that also complicates the ways in which that data can be accessed and searched online, since the whole has to be presented in ways that web browsers can understand and display meaningfully. The design implications are enormous, and design is one of the most complex tasks undertaken by our technical partners.

Knowing where to stop – where to decide that certain types of information must be provided by a different resource – is quite difficult.
The temptation is for DIAMM to try to be all things to all people, whereas the real answer to this difficulty is to create a web-services model in which a number of linked databases with different specialisations connect to each other, and each deals with different types of information. By connecting to the existing information in a complementary database, duplication is avoided, and a broader information resource can be created.

Where, then, does DIAMM go in the future? How does the project become self-sustaining: not just to stand still, but to continue growing? DIAMM has always been soft-funded, but this has set the project up to be dependent. We need to create income streams, but we are determined not to charge for access to our content. Having considered many income streams – and that list is always growing as new technologies emerge: we are currently examining iPhone and iPad apps – it is clear that a major income stream, and probably our primary one for the foreseeable future, is publications. When the project started, those working in digital spheres believed that the book would very soon become obsolete, and that the digital facsimile would be the way forward. It is clear however that there is simply no substitute for holding something beautiful, particularly a beautifully-produced colour facsimile, and sales of our first two facsimiles have shown that – for now anyway – this is a viable way forward.

2. Elizabeth Eva Leach: DIAMM and Moodle
Websites: www.music.ox.ac.uk/people/staff.../e_leach.html; web.me.com/elizabethevaleach/; twitter.com/eeleach

Dr Leach has been working on the creation of an online teaching resource for learning medieval notation. Rather than develop a new system she has used Moodle software, originally developed for distance learning in Australia, but now in use by many educational institutions for interactive coursework that students must complete in their own time. The Virtual Learning Environment (VLE) was of particular interest to the AHRC in funding this period of the project, since it would open the content to a much wider public, and would also enhance teaching of medieval music in other HE institutions, since it would not be limited to internal use, but would be accessible to any user – just as the DIAMM resource as a whole is open to anyone who can get online.

Dr Leach demonstrated some of the course parts that she had already constructed, which she hoped would form the basis for similar courses by colleagues that would cover different periods and traditions of notation. She demonstrated the various types of module that could be used: simple flat text description; web pages with images and other materials; quizzes; tests in which the user is required to reach a certain score. She did point out, however, that creation of such a course is very time-consuming. If a ‘student’ is offered multiple-choice answers, in order for their learning experience to be useful she had decided that a wrong answer should not simply be marked as incorrect, but that the student should receive feedback on why they may have got that answer wrong, thus enabling them to learn from their mistakes as much as from reading the preparatory material. Doing this for each question takes a great deal of time, as do the mechanical tasks of preparing suitable sections of larger images to provide the student with samples from which to work.
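As an illustration of the per-answer feedback idea (the question, options and feedback below are invented, not taken from Dr Leach’s actual course), a multiple-choice question can be modelled as a mapping from each option to an explanation, so a self-marking system can respond to a wrong answer with a reason rather than a bare ‘incorrect’:

```python
# Hypothetical example of a self-marking question with per-answer feedback.
question = {
    "prompt": "How many semibreves does an unaltered, undotted breve contain under perfect tempus?",
    "options": {
        "A": ("three", True,  "Perfect tempus divides the breve into three semibreves."),
        "B": ("two",   False, "Two semibreves would be imperfect tempus; check the mensuration sign."),
        "C": ("four",  False, "Four relates minims to the breve under certain prolations, not tempus."),
    },
}

def mark(answer: str) -> str:
    """Return targeted feedback for the chosen option, in the spirit of a Moodle quiz."""
    text, correct, feedback = question["options"][answer]
    prefix = "Correct" if correct else "Not quite"
    return f"{prefix}: {feedback}"

print(mark("B"))
```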
There is no point in creating a system that requires input and feedback from a real person: nobody can devote that sort of time, and the requirement would limit its usefulness drastically. The course is therefore designed to be self-supporting – self-marking, and providing feedback based on scores and responses – all of which takes a great deal of forethought and planning. This is, however, precisely what Moodle is designed to do, although it requires a significant investment of time from the person creating the course at the outset. The only thing not possible (presently) in Moodle was the ability to incorporate some way for a student to transcribe the music they were reading online, and have that transcription ‘marked’ by the system. However Dr Dumitrescu’s presentation in the afternoon suggested that with some adaptation it may be possible to use his CMME software to fulfil this function, at least in part.

3. Paul Vetch (Centre for Computing in the Humanities): Online delivery of DIAMM materials; how DIAMM relates to other projects
Websites: www.kdcs.kcl.ac.uk/who/bios/paul-vetch.html; www.cch.kcl.ac.uk/research/projects; www.bpi1700.org.uk/jsp/

The online delivery of DIAMM has been undertaken for 8 years by CCH, and Paul Vetch, who has been involved in that work from the outset, was able to give an overview of some of the technical challenges that the database presents in online delivery. Specifically he discussed the need for users to be able to cut down to the piece of information that they wanted in a number of ways (e.g. by searching for a composer, a particular work, a genre, a manuscript, etc.) and how that might be achieved in design terms by using faceted browsing.

He used an online e-commerce database of knitting patterns as an example of faceted browsing in use, in which a user could whittle down a dataset of 200,000 knitting patterns to find the type of pattern that they might be interested in, limiting search results progressively by applying a series of filters. This meant that a visitor who didn’t necessarily know exactly what they wanted was able to view the dataset’s content in progressively smaller segments, rather than having to browse the entire content item by item, which is rather what users currently do with DIAMM content. This type of ‘searching’ is technically ‘browsing’, and is commonly used now on many e-commerce sites, where the user is invited to filter a set of search results, e.g. by cost or manufacturer, to save looking through a very long list. Without being particularly aware of it, users are therefore trained to filter search results or browse results, and designing an intuitive way for this activity to be applied to academic materials has been part of CCH’s work for a number of projects. One that was demonstrated was bpi1700 (British Printed Images to 1700), developed by CCH. The faceted browsing facility used to search this project’s database (http://www.bpi1700.org.uk/jsp/) allows users to choose from a number of different categories, with the intention that a user could never create a search that would yield zero results.
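A minimal sketch of the faceted-browsing idea follows – the records and facet names are hypothetical, not the bpi1700 or DIAMM implementation – showing how successive filters narrow a result set and how a remaining count can be shown next to each facet value so that no combination returns zero results:

```python
from collections import Counter

# Hypothetical toy dataset; the real DIAMM facets (source, composer, genre, date...)
# would be drawn from the database rather than hard-coded here.
records = [
    {"composer": "Dunstaple", "genre": "motet",   "century": 15},
    {"composer": "Dunstaple", "genre": "mass",    "century": 15},
    {"composer": "Machaut",   "genre": "ballade", "century": 14},
    {"composer": "Power",     "genre": "motet",   "century": 15},
]

def apply_filters(items, filters):
    """Keep only the records matching every selected facet value."""
    return [r for r in items if all(r[f] == v for f, v in filters.items())]

def facet_counts(items, facet):
    """Counts displayed beside each facet value, so users see how many results remain."""
    return Counter(r[facet] for r in items)

selection = apply_filters(records, {"century": 15})
print(facet_counts(selection, "genre"))            # Counter({'motet': 2, 'mass': 1})
print(apply_filters(selection, {"genre": "motet"}))
```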
The user is ‘directed’ by having free-text fields auto-complete using information from the database (thus preventing mistyping or the entry of non-matching data), and the richness of the results was immediately apparent, since each category had a number appended indicating how many records would match that search. Categories could be searched alone or combined, so a print producer could be fixed, then a category within his output fixed, and that could in turn be filtered by date or technique, all of which were predefined by the underlying database.

The basic development database using DIAMM materials was shown. Although this is not yet fully populated, nor constructed beyond its basic components, the workshop participants were able to see the type of browsing delivered by bpi1700 in use for DIAMM materials. The DIAMM faceted browser would look similar to this, but with the primary filters (here: producer, person shown, subject, date) being source, composer, genre, composition, date etc. The difficulty in DIAMM is the huge variety of types of content that a user might wish to examine. At some point, however, we have to make a decision on behalf of the user group to limit the searchability to certain facets.

CCH is also developing work for Thomas Schmidt-Beste’s PRoMMS project, another musicology project in a significant group of musicological outputs from CCH, but one with defined connections to DIAMM. PRoMMS will draw on the DIAMM database and its image content, as well as using DIAMM to create additional images. The project is concerned with the creation and mise-en-page of musical manuscripts in the late 15th and early 16th centuries, and ways in which manuscript production can be defined and described by researchers in an electronic medium, allowing us to better understand the process of manuscript creation and what each item in a view might mean to a reader. Part of the development work on PRoMMS will allow CCH to develop a web-services connector to the DIAMM resource, which can in turn be exploited by other projects wishing to connect to the datasets. At present the markup is planned to be done manually, but the presentation by Prof Fujinaga in the afternoon session suggested that his software had progressed to a stage where some of the identification and classification of zones on pages and openings of manuscripts could be automated.

4. Theodor Dumitrescu (University of Utrecht): CMME software and the CMME database
Project website: www.cmme.org

CMME is now well known in the musicological community thanks to presentations by Theodor Dumitrescu at major musicological conferences and a website that offers a clear history of the project, the software development, and a fascinating interactive demo version of the tool.

Essentially the software provides end users with a simple interface for making online editions of early music by allowing the user to input original note shapes and values. Much is lost in the process of ‘editing’ a piece into modern notation or a modern edition, and a great deal of that can be retained if the editing software allows the user to retain more of the information as it appears in the original source. Because it is XML-based, the content is searchable in a way that most music-processed content (using other software) is not, and simple drop-down menus allow the user to change the display to modern note shapes, modern cleffing, modern (or various other styles of) barring etc.
Individual parts can be viewed in score or part format. However, CMME is much more than simply a piece of software, as it is being used to create a large collection of new editions of early sources from a period complementary to that covered by DIAMM (and to a small extent overlapping with it). The editions involve transcribing all the sources for a work and, instead of producing a single ‘collated’ edition (which does not represent any contemporary version or performance of the work), producing a version prepared from one source that can be easily and graphically compared with all the others. For instance the ‘master’ source can be chosen, and then variants between that source and all the others, in notes and underlay, can be shown using colours or other graphic devices. The CMME editions are available online through the CMME website (http://www.cmme.org/?page=database).

The advantage of this type of edition is its fluid presentation, and also the improvement that it offers in the understanding of the original sources for students and performers of the repertories. Most performers and students work only from modern editions of these works, where an editor has often made significant decisions about the content that is taken from the original and passed on to the modern user. This significantly damages our ability to understand the work in its original context, and can be extremely misleading. We could say that all students of music from this period should understand the notation and work from the original sources, but in practice expertise in this notation is not easy to come by, and access to the sources, even with a resource like DIAMM available to the public, is slow and laborious for most users, whose expertise is generally less specific.

The MOTET database, now merged with DIAMM, includes as part of its remit the creation of new musical and text incipits for sources that do not have them, or sources for which RISM provided incipits in a standardized notation that does not represent the appearance of the original. As soon as CMME was available the MOTET team moved to creating any new incipits using this software, and will retro-convert their older work as time and funding permit.

The CMME project involves a growing database of sources and content (most of which can be seen and searched online at www.cmme.org), and for several years CMME, DIAMM and the MOTET databases, along with other datasets of a complementary nature (Oliver Huck’s Trecento database and the Base Chanson databases at Tours), have been moving towards a closer collaboration and data-sharing effort. The primary purpose of this collaboration is to ensure that time and funding are not wasted by projects duplicating each other’s work, and also to allow projects to utilise richer content than they could using only their own data. To this end the intention is that the projects involved should create a pilot data-sharing or web-services model that would both allow the ‘borrowing’ or ‘mining’ of data between different databases and also allow these datasets to be searched via one independent portal. The portal could be linked to by new datasets that would contribute complementary data to the existing databases. The project has been in the planning stages for some time, and has been named ‘REMEDIUM’. One of the first steps in the REMEDIUM collaboration was the sharing of database structures between the partner projects.
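The source-comparison approach described above can be sketched in miniature. This is not CMME’s actual data model or algorithm – the encodings below are invented and far simpler than CMME’s XML format – but it shows the idea of aligning a chosen ‘master’ source against other readings and flagging where notes or underlay differ:

```python
# Hypothetical illustration of comparing a chosen 'master' source with others.
master = {"source": "Source A", "notes": ["c'", "d'", "e'", "d'"], "underlay": ["Ky-", "ri-", "e", ""]}
others = [
    {"source": "Source B", "notes": ["c'", "d'", "f'", "d'"], "underlay": ["Ky-", "ri-", "e", ""]},
    {"source": "Source C", "notes": ["c'", "d'", "e'", "d'"], "underlay": ["Ky-", "-", "ri-", "e"]},
]

def variants(master, other):
    """List positions where another source differs from the master reading."""
    diffs = []
    for i, (m, o) in enumerate(zip(master["notes"], other["notes"])):
        if m != o:
            diffs.append((i, "note", m, o))
    for i, (m, o) in enumerate(zip(master["underlay"], other["underlay"])):
        if m != o:
            diffs.append((i, "underlay", m, o))
    return diffs

for other in others:
    print(other["source"], variants(master, other))
# A viewer could colour these positions instead of collating them into a single edition.
```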
The next step would seem to be the incorporation of a master key number in all the relevant tables of the databases, connecting them together such that a REMEDIUM portal search for a specific item would show its position in all the collaborating datasets, with direct links and e.g. thumbnail views of images, editions and other material that could be found in the individual resources. We were shown a sample web page of how this might appear, and discussion followed about other projects that were potentially involved in duplicating material that already existed in this group of databases, and which should really be sharing these resources so that their time would be better spent in dealing with the material and information that was unique to their project.

There is a difficulty in dealing with funders who continue to fund the repetition of data-gathering of this sort, instead of limiting the focus of new projects to new research making use of data already in existence and available to the public. It was felt that there was massive duplication taking place and thus a huge waste of the limited resources available for musicological research. The refusal of new projects to collaborate with existing ones, or to share new data that they were creating, should be discouraged actively in order to maximize the potential for new projects to produce new research, instead of work that duplicates information already in existence elsewhere. It is important therefore that the REMEDIUM project moves forward as quickly as possible.

5. Ichiro Fujinaga (McGill University, Montreal): Gamera and its applications in shape-recognition for musicology and other subjects
Websites: www.music.mcgill.ca/~ich/; www.aruspix.net; gamera.informatik.hsnr.de; gamera.sourceforge.net/doc/html/

Although GAMERA and its ‘children’ have been in use for many years in advancing the needs of Optical Music Recognition (OMR) for repertories that do not use modern typography, the progress of the software in undertaking new tasks demonstrates its flexibility and its ability to undertake tasks beyond the remit of its original design. Early GAMERA applications were related to the recognition of lute tablature symbols, particularly in printed sources where gaps between typographical elements meant that ‘staff lines’ (actually tablature lines) were not continuous and often not particularly precisely aligned. Because most music typography involved hand-carved wood type, the figures are rarely perfect. GAMERA therefore had to be teachable, so that damaged or irregular type was understood by the software. Having been built to manage the task of removing the staff lines so that it could then read tablature letters and rhythm flags, the software could then be applied to any irregular shape-based script, as long as it was given a glossary from which to choose. Each task undertaken teaches the software more about the items in its glossary, so recognition of pages quickly becomes faster and more accurate. GAMERA is designed for domain experts, and includes image-processing tools (filters, binarizations etc.), document analysis features that allow images to be segmented and processed in sections, and symbol segmentation and classification. Because it is extensible it can be adapted for a variety of different tasks.
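The ‘teachable glossary’ idea can be sketched as follows. Gamera itself provides trainable classifiers over features extracted from glyph images; the fragment below does not use the Gamera API, and its feature values are invented, but it shows schematically how a classifier trained on labelled samples assigns the nearest glossary label to a new, possibly damaged symbol:

```python
import math

# Invented training glossary: (feature vector, label). In practice the features
# would be measurements extracted from glyph images (aspect ratio, holes, etc.).
glossary = [
    ((0.9, 0.1), "tablature letter a"),
    ((0.4, 0.8), "rhythm flag"),
    ((0.5, 0.2), "tablature letter c"),
]

def classify(features, glossary):
    """Nearest-neighbour lookup: the label of the closest trained sample wins."""
    return min(glossary, key=lambda sample: math.dist(sample[0], features))[1]

# A slightly distorted (damaged or irregular) glyph still falls near its trained neighbour:
print(classify((0.85, 0.15), glossary))   # -> tablature letter a

# 'Teaching' is just adding confirmed samples, so later pages classify faster and better:
glossary.append(((0.45, 0.75), "rhythm flag"))
```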
One such task that was tested for DIAMM in the past was the recognition of scribes: the software was ‘shown’ a large corpus of manuscript samples with the intention of classifying the corpus by similarity to a master sample. The software then ranked the samples (very successfully) from most to least similar. Because the software was able to remove staff lines and recognize elements in the content, it was used for the Online Chopin Variorum Edition (OCVE) to create markup for barlines, so that that information could be used to create individual bar crops on a large corpus of printed music (over 7000 pages).

GAMERA was used to build GAMUT (Gamera-based Automatic Music Understanding Toolkit) in 2004, and was extended also to GEMM (Gamut of Early Music on Microfilms) in 2005. In 2006 the GAMERA family joined forces with ARUSPIX (HMM-based), a specialized application for recognizing typographic music. Unlike many of the GAMERA-based predecessors, this does not remove the staff lines; but dealing with early typographic single-impression sources (where individual notes each have their own set of staff lines, so the staff lines are not continuous) creates a different set of difficulties from reading engraved or later typeset music, where the staff lines are continuous and therefore predictable. A movie demo of Aruspix showed that the recognized version of the typographic content retained the note shapes and types of the original, and was presented in an intuitive GUI with extremely simple drag-and-drop editing tools.

Most interesting to the workshop, however, was the early pre-processing of an image, in which Aruspix cleaned up the image and pre-classified the content, defining a historiated initial, title text, text underlay and music text, and using colour overlays to show the classification. This had very obvious relevance to the newly-funded PRoMMS project, in which the mise-en-page of manuscript and printed sources is examined and will require manual markup of a large number of pages. If these pages can be largely pre-classified before the manual phase, this will speed up the work considerably, and may allow the team to extend their content exploration.

Compared with the results of one of the leading OCR packages, ABBYY FineReader Pro, on a first run at a page of chant with an extremely clear original (unlike the original of the Aruspix sample, which was quite dirty and had show-through), Aruspix has an obvious edge. Aruspix is the leading neume recognition editor, with 7000 symbols already trained and a very high accuracy rate of over 95%.

One of the difficulties encountered by the Chopin project was that most digital images are (correctly) supplied showing everything around a page, not just the internal content. So the edges of the page are showing in the picture (not always square or straight), and colour and size scales are also shown. Sometimes there is additional material. In order for GAMERA to perform its bar-line recognition, the edges of page images had to be trimmed off to avoid confusion in the result. New developments in the software include intelligent cropping and boundary detection. The software performed equally well on approximately square originals as on fragments of extremely irregular shape, with completely non-linear edges (e.g. disintegrating from mould).
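The scribe-recognition experiment mentioned above amounts to ranking samples by their distance from a master sample in some feature space. The sketch below is not the DIAMM/Gamera experiment itself; it simply shows the ranking step, with invented feature vectors standing in for measurements taken from handwriting samples:

```python
import math

# Invented feature vectors standing in for measurements of scribal habits
# (letter proportions, pen angle, abbreviation shapes, etc.).
master_features = (0.62, 0.31, 0.85)
samples = {
    "MS A, fol. 12r": (0.60, 0.33, 0.84),
    "MS B, fol. 3v":  (0.40, 0.55, 0.70),
    "MS C, fol. 50r": (0.61, 0.30, 0.86),
}

def ranked_by_similarity(master, samples):
    """Rank handwriting samples from most to least similar to the master sample."""
    return sorted(samples.items(), key=lambda kv: math.dist(kv[1], master))

for name, _ in ranked_by_similarity(master_features, samples):
    print(name)
# Expected order: MS C, then MS A, then MS B (closest first).
```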
The potential for Aruspix and the various applications of GAMERA seems limitless, particularly given the very wide variety of requirements of musicologists and new musicology initiatives exploiting the online environment. The GAMERA/Aruspix family of projects is not limited by period, notation type, the quality of the original source, whether the original is printed or manuscript, or the type of data being recognized, and this opens them to a much wider field of application in musicology and other disciplines. One of the problems in OMR for early repertories is that so much music has text underlay that is difficult to read or recognize for a user not familiar with early scripts, and this software could transform the ability of non-expert users to appreciate and interact with medieval and early modern sources.

6. Grace de la Flor (Oxford Internet Institute, University of Oxford)
Project website: www.oii.ox.ac.uk/research/?id=58

There are four project partners: OII/OeRC; the UCL Centre for Digital Humanities/Department of Information Studies; the Virtual Knowledge Studio; and Maastricht. Grace de la Flor is studying internet behaviours, and her current research involves six web resource case studies: University of Birmingham English; Electronic Enlightenment 2: letters & lives online; The Proceedings of the Old Bailey 1674-1913; DIAMM; the UCL Department of Philosophy; and Corpus Linguistics. The aim of the research is to understand how humanities scholars use both traditional and digital resources in their work. The study results will provide guidance to funders in the formulation of their research strategy.

Her project is interested in understanding more about DIAMM because it is a resource – a digital image archive – provided to scholars to assist them in their research. One of the questions she asked, therefore, was whether we achieved our purpose, and if so how well we succeeded. The researchers Grace interviewed find DIAMM an invaluable resource that is central to their substantive research practice. In addition to the slides shown in the workshop, the list below includes a last slide which is a wish list of further features in response to the question "what changes might improve DIAMM?" ... (most if not all of the suggestions will be familiar). After submitting her thesis Grace will start working on the RIN project again and writing it up; she will send us the draft of her research on DIAMM for the RIN project, probably around the end of October, for our feedback.

Early results
These are some of the areas of musicology research in which the archive is used:
• materiality of manuscripts
• codicology (manuscript physical properties)
• mise-en-page (manuscript layout)
• digital image restorations
• erasures: identification and repair
• material damage
• high-resolution digital images
• fragment comparison
• letter comparison
• historical/cultural context
• a manuscript's association with other cultural artefacts

The importance of DIAMM to individual researchers: letter comparison – an interview quote:
"The scribes of the Middle Ages worked really hard to be anonymous. If somebody started taking over the work of another scribe on page 50, the scribe would copy the handwriting of the other person. They would try not to use their own. And so you end up looking for very, very small, subtle, tell-tale sort of habits …that you can only see with really great quality photographs, and then I’m able to see how various manuscripts connect with each other."
The next researcher discusses erasures:
"I don’t feel particularly disadvantaged by the fact that I’ve worked with a digitization of it (the manuscript) because I have seen all the elements in that digital reproduction that I would see if I was consulting a manuscript first-hand ... Being able to actually detect these changes, the scribe’s erased something and rewritten something over the top of it. That has changed my scholarship in the sense that much of my work is concerned with identifying these erasures and identifying why these erasures occurred from the context of musical culture."

Here, a researcher discusses the importance of access:
"Nowadays you can probably do 80, 90% of everything through the high-res digital images and only for the remaining 10, 15, 20% you need to see the manuscript. So you can go much further (in your research) working from your own location."

Importance of DIAMM: enabling new discoveries through
• Comparison across a variety of source materials
• Digital image restoration
• Access to sources that may require considerable travel and expense to view first-hand, or that may be confined and impossible to access otherwise

Grace uncovered an interesting aspect of medieval musicology that we have not considered, and which may impact take-up and usage of the online resource significantly. Do you feel pressured to work in certain ways?
"Yeah, I mean, I think—I do feel pressure, for instance—there’s this notion in the field that you can always get more out of seeing the original than seeing a digital image of it, and I do feel pressure to work more with originals than with the digital images because of the traditions of the field, whereas frankly, there are times that that’s really important, but for the most part I do feel like I get more out of sitting, using these images on my computer... I do feel like there’s a certain pressure that that’s not what top scholars do because that’s not what top scholars did 25 years ago."

She then asked the workshop the following questions:
• How do you discover new information sources? (grad students, peers, etc.)
• What are the most important sources of information? How did you find them, and how did you decide they were good resources?
• Over the course of your career have information sources changed, or just changed in style/presentation?

Her conclusions about what changes might improve DIAMM are as follows:
• Image manipulation tools (adjust gamma)
• Save a copy of the image onto your hard drive for research
• Improved search tool (search by composer, by piece, by kind of manuscript)
• Access to more of the DIAMM images via the web interface
• A slightly more streamlined registration process
• Full transcriptions
• Updated bibliography
• Linked content between related projects and libraries
• More funding!

7. Website analytics (Greg Skidmore); ways forward for DIAMM

Greg Skidmore’s presentation demonstrated very fully the ways in which the internet and communications technology can be used to bridge geographical boundaries without the need for special facilities, since it was given from a seat in a bus travelling on the M40. As his travel had been delayed he uploaded a document to GoogleDocs which was presented to the workshop participants, and which he updated in real time as the content was being discussed. A summary of the Google data is given below; Google Analytics started taking data on the site on 6th May 2010.
Profile of visitors
• 6,684 unique visitors
• 12,206 visits (87.9 visits per day)
• 105 countries (GB 26%, US 18%, D 8%, NL 7% – EU 71%, Am 23%)
• Top cities, in order: London, Oxford, Utrecht, Vienna, Leipzig, Cambridge
• 54% were new visitors
• They spoke 64 different languages
• Average time spent on site = 4 min 47 sec
• Average pageviews/visit = 8.5
• 46% of visitors only viewed one page; however, 4,676 visits (38%) involved viewing 5 pages or more
• Most visitors only ever visit the site once (54.6%); however, 2,862 visitors visited between 9 and 200 times (23.4%)
• Roughly half the visits lasted less than 10 seconds; however, 22% of visits lasted between 1 min and 10 min
• Total number of visits which lasted longer than 1 min = 4,164 (34%)
• The vast majority of visits were made on monitors with greater than 1024-pixel horizontal resolution, using broadband connections

How visitors got to the site
• 80% of visitors came either from search engines or as direct traffic
• 95% of search engine traffic came via Google, then Yahoo!, then Bing
• 39% of search engine traffic came via a search for ‘diamm’; next was ‘medieval music’ at 2%

Examination of /index.html
• /index.html was the ‘landing page’ for 51% of visits
• 40% of these visits resulted in the user leaving the site after viewing /index.html
• 38% of users who searched for ‘diamm’ left the site after one page view
• 33% of these ‘diamm’ searchers were confronted with /index.html and left immediately

Use of site
There may be something wrong here: Google says that pages with ‘DisplayImage’ in the URL (i.e. the actual Image Viewer) have only been viewed 27 times. This may indeed be true. Pages with a URL containing ‘source.jsp’ have been viewed 23,249 times, and 3,988 separate pages containing ‘source.jsp’ in the URL have been viewed. The most popular source is the Eton Choirbook; its source description page has been viewed 224 times. Search.jsp and Results.jsp are Nos 3 and 6 on the list of most popular pages. This means that many users are indeed searching, not browsing using ArchiveList.jsp. (ArchiveList.jsp is actually the most popular page on the site.)

CONCLUSION
One of the outcomes of the workshop was a list of ways in which users could be encouraged to help the project. Because the project is well established there is a danger that users assume it will always be there, and that it has reached a stage of existence in which it will continue indefinitely without any help from its user community. This is manifestly not the case, and the following list was drawn up as a result of this discussion, to be added to the website and to the materials included on the promotional USB stick (funded by the AHRC). This version has been edited after consultation with the project team:

What can I do for DIAMM?
What DIAMM needs primarily is funding: small amounts and large. In order to continue to deliver a world-class collection FREE to any user we need to increase our online collections, and continue to develop our delivery mechanisms to keep up with demand and changes in technology. DIAMM can only exist if the community who benefit most from it recognize its worth and support it to keep it alive. Web resources require more-or-less constant maintenance to keep abreast of the changing behaviours of browsers and the ongoing development of web technologies. Unlike books they cannot simply be put on a shelf and be accessed in the same way. Maintenance is costly, and needs to be funded and justified by evidence of use and value to the community who use it.
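As a quick sanity check on the figures above (using only the numbers reported in the analytics summary), the visits-per-day and percentage values are internally consistent over the roughly 139 days between 6 May and 22 September 2010:

```python
from datetime import date

# Figures taken from the analytics summary above.
visits = 12206
period_days = (date(2010, 9, 22) - date(2010, 5, 6)).days   # 139 days

print(round(visits / period_days, 1))    # ~87.8 visits per day (report gives 87.9)
print(round(4676 / visits * 100))        # 38% of visits viewed 5 pages or more
print(round(4164 / visits * 100))        # 34% of visits lasted over 1 minute
print(round(2862 / visits * 100, 1))     # 23.4% visited between 9 and 200 times
```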
You can help us in small ways:
• Cite DIAMM in your publications if you have used the resource. It is also very important to tell DIAMM when you have used us or cited us in a publication, so that we can keep track of the ways in which the project supports research and use that information to help us obtain future funding;
• Tell people about us and encourage them in turn to use the website (www.diamm.ac.uk);
• Use DIAMM as a teaching resource, and tell us how you have used it and the ways in which it worked or failed to work for you: this allows us to shape the resource to meet your needs;
• Send updates to online data wherever you spot an inaccuracy or lacuna, and feed back information about anything that does not work as it should. Contributions from the user community are essential to DIAMM;
• Donate any amount of money to the project; donations go towards creating more images of sources that are difficult to access, or that cannot be photographed at sufficient quality by the owners. To donate directly use our online shop (tinyurl.com/DIAMMPublications) and the ‘donate’ facility;
• Always insist on the highest quality of imaging from your suppliers. Our imaging checklist is included in the DIAMM handbook and can be found on the website. If you are concerned about the quality of the images you have received, ask DIAMM to evaluate them and, if necessary, write a quality assessment report.

You can also help in larger ways:
• Make sure that when planning your project you do not waste time and resources on doing something that DIAMM has the expertise to do for you: outsource imaging and image-processing – this will usually strengthen your application, as you will use your funding to pay for the things that only you can do.
• Write realistic consultancy costs into your budget, either for having DIAMM undertake work for you, or for our experts to train your imaging technicians or help you in ordering suitable-quality images from third-party suppliers.
• Include a budget line in grant applications for DIAMM resource usage, even if you do not use DIAMM for imaging or consultancy – even small amounts will help us to maintain our web presence and keep the resource free and available so that you can use it for your research. In the bigger picture of a grant for £250,000 or more, £2,000 donated to DIAMM in order to keep the resource running and available to you during the period of your project will make a huge difference to us, and very little difference to your application budget.
• If you are buying digital images, ask your image supplier for permission to donate copies of the images to DIAMM so that they can be displayed online, or deposit copies of your images in the ‘dark’ archive so that they are stored for the future; the project may be able to negotiate permission to put them online at a later date.
• Make sure that your data structures take account of existing related datasets, and take advantage of the willingness of other projects such as DIAMM and the Remedium consortium to share their data, thus saving you the time, cost and effort of re-creating metadata that already exists in another database. Share your own data so that the wider community can benefit.
• Link to existing datasets using a web-services model, minimizing the amount of work you have to do, enhancing those other resources by contributing your data easily, and creating a richer and more connected research environment for the user community.
• Make use of technical resources that already exist, rather than attempting to build a new resource from scratch that does the same thing. DIAMM and its funders have dedicated a huge amount of expertise and time to creating an online delivery mechanism for images and their associated metadata, and you could benefit from this by delivering your images through DIAMM – contact us to find out how.

Database and Data Connectivity Workshop Report
DIAMM Database and data connectivity workshop, 1 April 2011, Oxford eResearch Centre

PROGRAMME
9.30 Julia Craig-McFeely/Elizabeth Leach – Welcome and introduction to DIAMM data content and some problems in data extent
10.15 Prof Thomas Schmidt-Beste – the Production and Reading of Music Sources 1480-1530
10.35 Leif Isaksen & Gregorio Bevilacqua – Cantum pulcriorem invenire (research project on 13th-century conductus)
11.00 Coffee
11.20 David de Roure – SALAMI (Structural Analysis of Large Amounts of Music Information)
11.40 John Milsom and Nicolas Bell (Early Music Online) – standardised titles and composer names in early music repertories
11.55 Discussion/Feedback session 1
12.30 Lunch
1.30 Prof. Henrike Lähnemann – the Medingen project
1.50 Paul Vetch – DIAMM, ProMS and web-services for musicology datasets: technical issues
2.30 Theodor Dumitrescu – Remedium (t.b.c.)
2.50 Michael Scott Cuthbert – interconnectivity between music projects
3.00 Other datasets: general discussion to cover projects of other participants in the workshop (this can be extended to 4.00)
3.30 Discussion/Feedback session 2
4.00 Tea
4.30 Summary: feedback and forward planning
5.00 (approx) Close

Participants: Nicolas Bell, Margaret Bent, Gregorio Bevilacqua, Julia Craig-McFeely, Tim Crawford, David de Roure, Helen Deeming, Ted Dumitrescu, Elliott Hall, Leif Isaksen, Karl Kugle, Henrike Laehnemann, Elizabeth E. Leach, John Milsom, Stefan Morent, David Robey, Thomas Schmidt-Beste, Michael Scott Cuthbert, Philippe Vendrix, Paul Vetch, Raffaele Vigilanti, Magnus Williamson, Martin Wynne

SUMMARY NOTES ON THE PRESENTATIONS
(The workshop was fairly informal in style, so the notes presented here do not attempt to summarise the presentations in full, but note the salient points.)

1. Julia Craig-McFeely/Elizabeth Leach – Introduction to DIAMM and the workshop aims

The DIAMM database has grown steadily both in content and in structure since it was started as a simple tool for controlling information regarding photographic activity in 2009. One danger in managing this dataset is that it can attempt to be all things to all people, resulting in both an unwieldy mass of information and an inability for its content ever to be complete:
• We try to be all things to all people
• We spend disproportionate amounts of effort creating particular esoteric data types that might never be used
• We allow our dataset to grow unchecked so that it never achieves any level of completeness
• We run the risk of misleading users by offering searchable data that is not complete
• Our database runs the risk of becoming unmanageable
• We can’t provide every bit of data for every record

At some point a line has to be drawn beyond which the project team will not attempt to provide users with information.
So I decided:
• to eliminate parts of the database that are currently unpopulated, and are (realistically) unlikely to be populated
• to consolidate the information and content that we have

The result is now more controllable, but still vastly incomplete in some areas: I eliminated extended relationships and fields of more esoteric content (e.g. liminary text). The next step therefore is to connect with datasets that supply that more specialist data and allow them to access our data; we don’t want to supply data that is simply copied into another database, because updates and changes will then not apply across the board.

Data interconnectivity is becoming ever more necessary as projects start up which duplicate work already done in other projects. Funding sources are finite, and with grant funding going down and the competition for that funding on the increase, it is essential that new projects utilize what already exists, and devote their time and expertise to performing those tasks that only they, as specialists, can do. Some PIs refuse to do this, and perhaps this should be a cause to withhold funding?
• Web services avoid mindless duplication of content
• New datasets can be richer because they access a broader spectrum of data
• Updates in one database are available to all connected databases
• User access is improved
• Quality and completeness of data is improved

It is equally important that new research initiatives inform other projects that use of their resource is an integral part of their research strategy. A number of projects name DIAMM as imaging consultants (without consulting us first) or rely either implicitly or explicitly on the continued availability of DIAMM in order to pursue their research. An important note here is that DIAMM is widely perceived as permanent and publicly funded, whereas the opposite is true: DIAMM is now self-funding and can only remain online as long as we are able to fund maintenance. Growth is dependent on goodwill, earned income and input from other projects.

The aims of the workshop are to consider the following questions.

Datasets and databases:
• How can we better use the data we have?
• How can we use our time more efficiently when creating new datasets?
• How can we optimise our research time both in creating our own, and using other, datasets?
• What are the mechanisms that can be used to share or harvest data?
• What are the implications of sharing: cost; time and effort; programming; conformity; credit (acknowledgement, academic credit); visibility; secrecy (?)

What do users think DIAMM is, and what do they think it does? We have been running for nearly 13 years:
• They expect us to be available 365 days a year without fail
• They think we are publicly funded
• They believe they have a right to use us (for free)
• They believe they have a right to complain if the website goes offline
• They plan grant-funded, research and teaching activities that depend on DIAMM
• They don’t tell us that they are relying on DIAMM
• They do not offer funding to ensure the continuation of DIAMM
• They think our data is complete
• They think our data is accurate
• They expect access to what we have, but don’t share what they have
• They do not contribute data to improve content (for the most part)

How do we conform our data? We already do this to some extent with RISM sigla and RISM/CCM abbreviations for MSS, but this is neither consistent nor comprehensive. How can we connect City, Library, Source and Compositions?
Our solution is to publish a list of key numbers for these (and other) things that we might wish to connect to (see the DIAMM source list for key numbers). Collaborators don’t have to change their own key numbers, just include the numbers from an agreed master list. This should result in consistency of titling; agreement over titles of works; consistency of composers’ names; agreement over standard spellings; etc.

2. Prof Thomas Schmidt-Beste – the Production and Reading of Music Sources 1480-1530

The PRoMS project examines a fixed body of sources as musicologists AND art historians, treating them not simply as sources of information. The funding includes money for web-services development, and in particular to allow connection to the DIAMM data about manuscripts, which PRoMS has no need to repeat. The bonus to DIAMM is that new descriptive material arising from the PRoMS research will be fed directly to DIAMM, and will appear online as quickly as it is added.

PRoMS examines things that we take for granted:
• whether the initial is the first of the text or the first of the voice designation
• If text is written in red, do you sing it differently? No? So what is the visual and practical function of what is shown on the page?
• Do images on the page follow the music, or the other way around?
• The hierarchy of parts suggested by the initials or illumination
• What does it mean if there are initials in the middle of lines – is this practical, aesthetic or something else?
• Pretty noteheads – where visual and practical aspects come together: something special happens here
• Turn instructions in all 5 parts, even though only one person can really turn the page
• Why are there custodes in music notation but not in text notation?
• Why are some continuations across an opening from right to left instead of left to right?
• Signa congruentiae are not audible, but they establish a link that is not normally shown in edited versions of the music
• Line-fillers at the ends of pages…

The web page at the moment shows very basic information: ‘about’ and a list of participants. It will show information about the project and a project blog. It will eventually show the meat of the project: i.e. a database of 300-320 MS sources and about 80 printed sources; about 25 detailed MS studies and about 5 detailed studies of printed sources. Unlike most musicology projects, PRoMS has the added value of the art historian, who looks at the page as a visual presentation, whereas musicologists are programmed to read the content instead of looking at it as an art work.

The database structure is basically a flat file at the moment, which will bring together the kind of layout and codicological information necessary to engage in the research and present it to users: text; placement; colour; density; calligraphy; interpretive; relative density.

The presentation mentioned in passing a project at the University of Rochester which attempts to work out how someone looking at a picture in the 15th century would have responded to it – trying to reconstruct, virtually, a viewer’s reaction.

3. Leif Isaksen & Gregorio Bevilacqua – Cantum pulcriorem invenire (research project on 13th-century conductus)

The project studies 12th-13th-century Latin songs – conducti surviving in both polyphonic and monophonic forms. Unusually, these compositions, with very few exceptions, were not based on pre-existing music (e.g. chants). The most recent existing catalogue is 30 years old, so anything discovered after that is missing.
The paper catalogue is extremely difficult to manage and to get information from: as we know, it is much easier and quicker to find information from searchable databases – and in this medium there is also no problem with space or with saving paper. The aim of the project is to integrate information from the catalogues with information from 12th-14th-century MSS and some handwritten notes, commentaries on the catalogues, new sources and recent work by scholars.

Information provided will be as follows:
• Sources (folios, format, notation, DIAMM link)
• Style (stanzaic, through-composed, melismatic, syllabic)
• Poetry
• Contrafacta
• Techniques
• Main existing editions and recordings

Instances of works – combinations of conductus and poem – connect to:
• References and publications
• Tag clouds

The database is constructed using FileMaker and is entitled ‘Cantum pulchriorem invenire’. They are in the process of creating a web portal.

Linked music data is a web-based approach to the way the data can integrate with itself – there is considerable in-house expertise in Southampton. It can be taken to extremes, building semantic ontologies, and the questions arise: How far do you link? What do you link to? It does help to have other data to link to. The use of URIs as global identifiers for information is crucial, and is more powerful than having standard spellings, since it allows you to keep all the variant spellings while the unique web identifier acts as the master title. Perhaps the day of the standard title has gone!

4. David de Roure – SALAMI (Structural Analysis of Large Amounts of Music Information)

SALAMI is funded internationally (UK and North America). The project team do not claim to be musicologists, but are building resources to support musicologists, and digital audio performances are included.

Techniques and methodology: words mean completely different things in music than in computing, which can cause confusion! The idea is to build something that is sustained by the community. Looking at previous attempts to do something like this, he gave as an example the Internet Archive in San Francisco, which has an extraordinary collection of live performance recordings. The metadata is rather a mess, though!

Semantic web, linked data and ontologies: a music ontology designed by Yves Raimond is going to be used by SALAMI. Annual music information retrieval meetings involve the sharing of data and the publication of results, so this community is already sharing its data, though perhaps not as efficiently as it could.

The project, as its name suggests, will involve large quantities of audio recordings, and the software development will involve data extraction and structural analysis, which can be done better because of the increase in scale and volume; the sort of features extracted will be:
• Genre
• Key
• Rhythm
• Pitch
• Onset
…in all, about 24 features.

One of the technical features to be addressed will be signal processing: how can processing be re-used and improved, and how can an efficient workflow be developed? The data content includes around 23,000 hours of recorded music, which is examined using ‘crowd-sourcing’. However in this context crowd-sourcing is actually student sourcing: students are paid to annotate 1000 pieces of music from US Billboard charts – the annotation was free-form, so a further analysis remains to be done on how music students annotate music!
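The point about URIs being more powerful than standard spellings can be illustrated with a small, hypothetical sketch: the URI acts as the single identifier to which any number of variant spellings resolve, so no one spelling has to be enforced as 'the' title. The URI below is invented; a real linked-data resource would publish such identifiers rather than hard-coding them:

```python
# Hypothetical illustration only: the URI is invented for the example.
work_uri = "http://example.org/conductus/0001"

variant_spellings = {
    "Cantum pulcriorem invenire":  work_uri,
    "Cantum pulchriorem invenire": work_uri,
    "CANTUM PULCRIOREM INVENIRE":  work_uri,
}

def resolve(title: str) -> str:
    """Map any recorded spelling to the single web identifier for the work."""
    return variant_spellings[title.strip()]

# Two projects using different spellings still point at the same resource:
assert resolve("Cantum pulcriorem invenire") == resolve("Cantum pulchriorem invenire")
```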
In order to process the vast amounts of data generated, the project has been allocated a significant portion of supercomputing time, which will enable the data to be analysed, and the results published, very quickly. (Processing in MIREX is done with a workflow engine that tries to identify genre and mood!) It could enable us to answer questions such as 'How much country music comes from my country?'
The linked data movement started in 2007 as a rebranding of the term 'semantic web': 'semantic web' says what it is, not what it does; 'linked data' says what it does. The idea makes sense, but has proven controversial: when presented at the open data conference it was not very well received! If data is shared, the community can add more 'signal' and publish more analyses. Why is this a problem? People are used to data being contained in silos, and a loss of control and massive distribution is alarming. However, there is nothing to stop a researcher selecting the parts of the data that s/he wants to work with and giving it an identity; other researchers can then point at that identity – or URI. So all the time, instead of working within your silo, you are pushing things outwards for other people to use. Data shouldn't be hoarded in one database, it should be distributed.
Questions:
EEL: is this going to assist us in sharing data from projects that want to badge their data in certain (visible) ways?
DDR: This domain is peculiar because there's lots of stuff out there which isn't freely copyable, but lots of people have bits of it. If you want to analyse a particular piece of music you might have one version, but someone else has the other version. The notion of 'same as' in music is a very complex notion, so you have 'same enough as', which is not definable and probably item-specific. Another level in music is how effective fingerprinting is – you can tie a fingerprint into other sources of the 'same' data. You can hold your phone up to your car radio and it will tell you what music is playing. The BBC is committed to publishing things as linked data. If you create a website you just have a website, but if it is linked it is really available.
PV: a philosophical and epistemological problem: when you say that recognising a signal gives you a possibility of understanding a musical work… Hugo Riemann said 'listen, then say something about the musical work'. Are these computing analyses useful?
DDR: this is done by a chain of experts and the work is there to assist, not to replace. We see this every time we apply e-research to other domains – some say that it's not possible to analyse more than one piece of music at once. Time will tell as to whether this genuinely supports musicology.
TC: this is one type of analysis based on audio recordings, mostly of popular music. The same approach could be applied to 14th-century music; the results might not be the same as a scholarly analysis, but you might get NEW bits of information, because a human being can only manage one thing at a time, and will then have a preconception that they apply to other pieces. At least with a computer there is more 'objectivity' in the comparative work.
PV: agrees – quantity is very helpful.
TD: computation produces basic structures and basic information, not an analysis in terms most musicologists understand.
PV: if the analysis is so basic, then why do it?
If we were experimenting with this on a purely oral tradition of Libyan songs, something might come up, because we have never looked at this before.
DDR: Working with audio means you can work across a much larger background of ethnicity and style than is possible with notated musics.
TC: non-western musics are really interesting in this respect because they use non-western scales, and these are being investigated by other projects.
DDR: SALAMI is a starting point – classical music is a very small part of it. Whether you will get anything interesting from a key segmentation of the data remains to be seen.

5. John Milsom – standardised titles and composer names in early music repertories
The Christ Church Catalogue: the first phase is completed; the information now needs to be updated and put online. This is a closed collection – the intention is only to describe what is in the collection donated to Christ Church by one of the masters of the college in the 18th century. When the catalogue started there was no 'online' and no attempt to present catalogues online. There are printed authorities for printed books, providing standardized titles and spellings for both titles and composers, but there are NOT similar authorities for MSS. The authorities when this started were New Grove, because everybody had it, plus any other bibliographical material that was sophisticated and up to date, such as the Viola da Gamba Society catalogues – New Grove had also used these. Standard modern editions of music could also be used as authorities. The catalogue does refer to them, but evades the issue of standard titles. DIAMM doesn't cover the repertory represented in this collection, and didn't exist at the time the cataloguing was started. Faced with this situation, what authorities do you turn to for information? Why do you believe one authority and not another? Some people feel very strongly about one source or another, particularly researchers, but librarians can have different views. So one area where an interface has to be created is between the world of musicologists and the world of librarians. RILM has been thinking about this for a long time. (It is surprising that RISM has not picked up and run with this!)

6. Nicolas Bell (Early English Music Online)
This project received funding under the JISC rapid digitization stream. The money has to be spent by the end of July. EEMO will provide images of 300 16th-century printed music publications from the BL's collection, scanned from microfilm (which means about 4x as much material as if they did new photography). The microfilms are scanned at 400 dpi from the master copies, so the films are in good condition. The online resource will be delivered by Royal Holloway's Equella, a system currently used to deal with exam papers, but being expanded for EEMO. The funding will allow the BL to update all catalogue records for these publications and introduce links to the images from the BL catalogue, from COPAC and from the RISM UK catalogue series A2. RISM A2 is MSS, but they're going to add these prints; this in itself is a radical decision. Most of the prints in the collection are anthologies, and the library has chosen a selection – a few Petruccis, Gardanos, some German, some English – which forms a representative selection of music prints from that period. The problem arises of how to upgrade the catalogues to be most useful to musicologists while also keeping in line with library standards.
This means getting people who know about 16th-century music to inventory and transcribe title pages. It also means examining published bibliographies. The researchers will use their transcriptions as the basis for making a catalogue record, but will normalise the text in order to make it more searchable. An accurate diplomatic transcription retains all capitals and abbreviations – but that makes it unsearchable: e.g. 'XX Soges' (with a macron over the o) means '20 songs', so you wouldn't find it if your search string said '20 songs'. You need to know what you're looking for before you can find it! Therefore they need to list all forms of the titles so that a search can return meaningful results. The contents list will therefore need to show variant spellings of composer names and attributions that we know from our knowledge of the repertory, but which are not manifest in the copy. The versions they use will probably be those approved by the Library of Congress. This is a normal procedure for library catalogues, but most early printed books haven't been described in such detail because they are old acquisitions. For later publications a longer record could be derived from another library, but a lot of the prints in EEMO don't exist in other libraries because they are early and sometimes unique. The Anglo-American cataloguing rules they are forced to apply to these records say that there must be no capitalisation, which is frustrating when it's clearly there. The information in its original form will be kept, though, and presented perhaps in a .pdf or Word file so that it can be used, e.g. to print out and study alongside the book. This could be the first stage in a much-expanded future – one project or more. The scanned microfilm images can gradually be replaced with new colour images. They will be available to ARUSPIX, which is investigating ways of applying music recognition to early printed music, and Ted could transcribe them in CMME.
TC: another project – by Sandra Tuppen and Richard Chesser – is examining optical character recognition for printed lute tablatures. We can do optical recognition on these fairly successfully already. Many lute anthologies are very miscellaneous: those collections contain arrangements of vocal music, a large amount of which comes from the vocal partbooks that form the bulk of the BL collection. So there is the possibility of internal concordance connection. In certain cases the xls spreadsheet mentions other copies …
NB: the collection will not include copies of the other copies of the books. The transcription, however, will come from the most perfect original – if the BL one is not very good they will go to another version to read it. If some partbooks are missing, the record will say that information has been taken from another source. The project will add bibliographical information and the pictures will be downloadable – DIAMM can have them, even though they don't meet the colour standard. However, any picture is better than none.

7. Discussion/Feedback session 1
Q. Is the idea of a discrete title for a work now dead if we're linking via a URI or a meaningless number?
A. It's probably not dead, because we still need paper publications, but we need to think in electronic-data terms of URIs. Giving a MS a single identifying number is worthwhile, but we can't even decide 'what is a work?' We are also dealing with questions about what is the master work – the one with 3 voices or the one with 4 voices?
Texts might be in one source only – voice parts might be interchangeable between pieces. Single object identifiers are needed at the level of the text or the voice part, not just at the level of the work. See: André Guerra Cotta and Tom Moore, 'Modelling Descriptive Elements and Selecting Information Exchange Formats for Musical Manuscript Sources', Fontes artis musicae 53/4 (2006).

8. Prof. Henrike Lähnemann – the Medingen project
This project reconstructs the text output of a specific convent at a specific time: the Convent of Medingen. The MSS from the convent are now scattered worldwide. What I want is to bring together digital images of the MSS – not just as pretty images, though: I want to make them interconnected and searchable. What is exciting is that none are identical, but all are connected – the texts are in Latin and were transcribed into German. The result of this copying from the MSS was to turn them into a devotional resource. This is a prototypical group of resources: a record of how the nuns think. The order provides the matrix of how they produce devotional texts.
Medingen was a centre for Hanseatic trade, with MSS and artefacts from all across this trade route. This was a very rich area – Lüneburg was the only place with salt. They financed five convents with a huge number of artefacts. In the Lutheran Reformation they became protestant in name, but nothing else changed: no new altarpieces or MSS, no 'baroquisation'. The first reform came from the Netherlands. All the MSS in this dataset come from between the reform and the Reformation.
The nuns wrote in Low German for lay-women, and this was especially picked up by Anglicans on the grand tour, who would buy the books. The MSS are small, so made ideal keepsakes. They all start from the matrix of the liturgy and constantly quote small parts of the liturgy, but also vernacular religious songs and texts of the time. All were linked into the daily life of the convent: there are rubrics about the choir singing and what you should do while they are doing that. Shorthand is used for the liturgical parts, while more complex bits are written out in full. The liturgical parts are taken over into the vernacular MSS – for lay-women in Lüneburg related to the nuns (sisters, nieces). There are many references to 'play on the organ of your heart', the 'noble harp of the soul', etc., which seem very secular. The shorthand music is very close in form to notation fully written on staves. A whole group of nuns entered as novices in 1481 and started their prayer books at the same time. They put their names as initials in the middle of the book. A group of three nuns (sisters) entered the convent and gave a book to a fourth sister, which links into the wealthy families in Lüneburg.
The challenge is how to integrate into the description and editing of these MSS (as a literary historian and a linguist) the music notation and shorthand, the Low German material with the Latin, etc., to make the network of one literacy and one devotion visible. This shows an example of the markup based on TEI. This has to be integrated into the website to make the database into a usable web resource that would be of interest to other disciplines: literature, linguistics.
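As a very rough illustration of the kind of TEI-based markup mentioned above, the sketch below parses a minimal, invented TEI P5 manuscript-description fragment and pulls out its identifying fields. The repository and shelfmark are placeholders, and the project's real encoding is of course far richer.

```python
# Sketch: reading identifying fields from a minimal TEI P5 <msDesc> fragment.
# The fragment is invented for illustration; the project's real markup is far richer.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
fragment = f"""
<msDesc xmlns="{TEI_NS}">
  <msIdentifier>
    <settlement>Hildesheim</settlement>
    <repository>Example Library (placeholder)</repository>
    <idno>Example MS 1 (placeholder shelfmark)</idno>
  </msIdentifier>
</msDesc>
"""

root = ET.fromstring(fragment)
ns = {"tei": TEI_NS}
settlement = root.findtext("tei:msIdentifier/tei:settlement", namespaces=ns)
repository = root.findtext("tei:msIdentifier/tei:repository", namespaces=ns)
shelfmark = root.findtext("tei:msIdentifier/tei:idno", namespaces=ns)
print(f"{settlement}, {repository}, {shelfmark}")
```

msDesc, msIdentifier, settlement, repository and idno are standard TEI P5 manuscript-description elements; the point is simply that once a description is encoded this way, fields like these can be extracted and cross-linked automatically by other resources.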
A pilot website is available online: http://research.ncl.ac.uk/medingen/ There will be a link to the existing catalogue data for each of the MSS, and a full bibliography for them with linked-in PDF files of all copyright-free papers; they also have permission for many later articles. You should be able to browse the MSS, to enlarge them, and to compare them by opening two windows.
The overall structure of the database will be based on HiDA4 (Hierarchischer Dokument-Administrator version 4.0, a software package developed especially for archives and libraries). Each manuscript will form a data-set accompanied by a 'readme' file describing the state of the cataloguing; this allows flexible handling of the depth of cataloguing as the project develops. New manuscripts will be continuously added as catalogue entries while the existing entries are digitized or transcribed. All data will be hosted on a common fileserver, managed by Subversion. The metadata set will follow TEI P5, the Text Encoding Initiative module on manuscript cataloguing. Each manuscript will be structured, in addition to folio numbers, by its division into chapters or paragraphs formed by rubrics, and by liturgical occasion, using and expanding the codes developed by the CURSUS project. The liturgical occasion is applicable to all of the manuscripts from Medingen except for the library books. This allows thorough cross-referencing with the liturgical manuscripts from Medingen (cf. appendix 3) and with the other numerous databases of liturgical music. The catalogue entries use the MASTER-DTD text, which will be XML-compatible, the tagging to be developed in close cooperation with the manuscript census of German manuscripts.
The images of the manuscript pages are high-resolution TIFF files, with JPG files at different scalings for internet representation. The manuscripts can be browsed and enlarged using a magnifier. For internal use, over 7,000 working images of manuscript double-spread pages are already available, but they are of variable quality and are restricted to use for transcribing and editing. The data set of the final database will consist of 600 dpi TIFF files with colour and metrical scales; for internet representation, smaller JPEG2000 files will be produced on the fly. Copyright for all manuscripts will rest with the libraries and has been negotiated with twelve of the libraries (Berlin, Göttingen, Hamburg, Hildesheim). Professional images of the manuscripts will be taken through the libraries and their contractors and should be housed, whenever possible, at the institutions. The images will be provided with a non-exclusive licence, as established e.g. by DIAMM. When the institutions involved cannot provide the resources necessary for hosting the images, they will be held on the project's server. Agreements will be made with all the libraries for royalty-free use of the manuscript images on the web. Licensing will take into account the DFG guidelines on copyright and rights-management issues. Transcriptions are entered extending the TEI P5 scheme where necessary, with special characters encoded in Unicode as developed by the TITUS project WordCruncher.
Musical notation: the main form of musical notation in many of the prayer-books consists of reduced Gothic choral notation written between the lines of text, using the ruling for the text.
Since this is a form largely confined to devotional texts from Medingen and the other Lüneburg convents, new MEI customisations will be developed and published via the TEI Music Special Interest Group. These can be based on a set of descriptors by Stefan Morent for marking up earlier medieval neumatic notation, in combination with the existing descriptors for the Gothic choral notation from which the music notation in the prayer-books is copied. Music samples of the liturgical items and hymns are given as 8-bit WAV sound files and linked to the appropriate manuscript pages and the transcripts, to allow the actual musical notation to be followed.
The bibliography will include all literature on the Lüneburg convents in general, on the Medingen manuscripts in particular, and resources on Middle Low German. All copyright-free older publications (for example the history of the Medingen convent by Lyßmann, 1772) are made available as searchable PDF files, and all open-access resources are linked in. A particular emphasis will be on regional and older specialist literature which is often not available in the major libraries; whenever possible, full-text scans will be made available in agreement with the publishers (for example publications on the convents by the local newspapers). Full documentation of the technical side of the project will be available via the website.
There will be a printed edition of one sample prayer book, but each book is different… the nuns were post-medieval: they had access to a massive wealth of information, and each nun made her own MS and each linked to her own choice of information. This has no funding yet. This is something that couldn't possibly work in print, so a website presentation adds so much more than just the sum of the single editions.

9. Paul Vetch – DIAMM, PRoMS and web services for musicology datasets: technical issues
In the context of our projects, 'web services' in reality may mean that it is difficult to develop a realistic web-services model in the way we think we want to. There are a number of difficult problems: how do you get projects designed to do difficult things to work together, and how do you make sense of them together? This is not simply a question of identifying consonant data, but of how to deal with it. DIAMM is logical to itself, but may not be logical to another organisation. An example is the 'Item' entity – consonant to the instance, but not to general usage. At what level, therefore, do you connect individual resources? Web services can connect by way of 'ligatures' – this creates high-level web services. A low-level, database-to-database connection is simpler, but what happens if one database is down? You can cache the data: you store a local version that is updated when you access your page, but if there is no connection it just shows the cached version of the data. A connection between DIAMM and PRoMS is not one-way, it is reciprocal: PRoMS needs to be able to update DIAMM content – e.g. if they discover a change to the descriptive information when they look at a MS. In terms of these two projects we are still very locked in to images – which is not all bad.
Situation 1: a project such as CFEO (Chopin's First Editions Online, www.cfeo.org.uk) starts out very similar to DIAMM–PRoMS but then diverges massively. CFEO is based on a printed catalogue.
On the other hand, if we compare CFEO with OCVE (the Online Chopin Variorum Edition, www.ocve.org.uk), these are two projects showing very similar data. OCVE has corrected data from the print catalogue. The plan is to merge them: combine them and make an online edition of the annotated catalogue, so all three resources become one glorious big resource. This is a good plan, but there is one major problem: money! CUP, the publishers of the printed catalogue, don't really want to give the printed catalogue away. Regardless of these barriers though, to support this sort of situation you need a very complex web-services model: someone who has bought the catalogue needs to be able to log in and then see the images from the online projects with the most updated catalogue information. This sounds good. Problem 1: the libraries who supplied the images didn't do so on the basis that they would be resold as a CUP online publication. A subscription view for the catalogue is good for sustainability, but not good for the libraries' permissions. These are samples of some of the things that come up and bite you on the bum – things that should work very well, but for purely practical reasons don't.
'Web-servicing' images: JPEG2000 (J2K) is an image format with exciting prospects. It doesn't make sense to store two copies of the images, but if they are served as JPEG2000 from a services server you only have one master image, and all surrogates are derived from that, on the fly (a minimal illustrative sketch appears below, after the Anglo-Saxon cluster example). How do you stop people from pirating the images and using them for themselves? We have to hide the information about keys from the rest of the world to stop copyright infringement and data theft. So what does it mean if you see an image from one project in another? Do you watermark it with the library logo, the project logo? Nobody likes those!
Searching across datasets: there are various ways of searching (not enormously interesting). The federated approach has a central place, or broker, from which you can carry out a search across a number of different resources. The most interesting example in the UK is the London museums – a federated-services-enabled search. The technical problems are all about how you display the data: what does it mean to see a certain field from a certain context in a new context? How do you show where it came from? How do you aggregate the information? If information came from a number of resources, how do you find out where it came from or who created it?
Experiment – Anglo-Saxon cluster: what it means in practice to bring together projects with a common subject domain. Four projects: ASChart (Anglo-Saxon charters), eSawyer, LangScape, PASE (Prosopography of Anglo-Saxon England). In many cases these projects are using the same texts – PASE uses many of the ASChart texts. Bringing these apparently congruent things together in a way that is not superficial looks as if it should be easy, but is frankly a nightmare. One approach is to do something very simple that just goes off to the individual resources, finds text strings, and dumps the results on your screen. A more interesting way is to try to find ways to relate data so as to make useful information and useful connections back to it, but it is very difficult to use the resource unless you know a lot about the individual projects. In the case of the Anglo-Saxon group, a list of research questions was put together. The challenge in the end isn't therefore a technical challenge – it's a research-question problem.
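The sketch promised above illustrates the 'one master, surrogates on the fly' idea for image web-servicing in the simplest possible terms, using the Pillow imaging library (which can read JPEG 2000 when built with OpenJPEG). The path and sizes are hypothetical, and this is not how DIAMM, OCVE or any of the projects discussed actually serve their images.

```python
# Sketch: deriving a delivery-sized JPEG on the fly from a single JPEG 2000 master.
# Assumes the Pillow library with JPEG 2000 support; the path and sizes are hypothetical.
from io import BytesIO
from PIL import Image

def derive_jpeg(master_path: str, max_side: int = 1024, quality: int = 85) -> bytes:
    """Return JPEG bytes scaled so that the longest side is at most max_side pixels."""
    with Image.open(master_path) as img:          # e.g. "masters/example_f1r.jp2"
        img.thumbnail((max_side, max_side))       # preserves aspect ratio, never upsizes
        buf = BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=quality)
        return buf.getvalue()

# A delivery front end would call derive_jpeg() per request and stream the result,
# so only the single archival master is ever stored.
```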
Because the challenge is a research-question problem, the site is littered with sample research questions. One of the principles is that these projects were 'finished'. They therefore didn't want to have to go back and rebuild them, but to bolt on something that would connect them up with the lightest of touches. There is no money to redevelop, so that's not an option. They wanted to put up lightweight connectors. It took 4 developers to put a connector on the project that they themselves had worked on. If you follow a query, it may send back results from 3 resources, but you can only see the data by going to one resource at a time. There is no hint in the search results that three different resources are throwing back results. Acknowledging the source is quite difficult. Most of the problems of aggregating results are in how you display the information. How do you put the stuff together in a way that solves the problems?
A final note is in relation to stained glass windows…. analysis of images that identifies similarity or classification can now be provided as a web service. Algorithms are sufficiently efficient that you can perform content-based image retrieval (CBIR) in real time. This has very exciting possibilities. How can we use web services as part of the editorial process? With content-based image retrieval we can bring up clusters of images that have already been classified to help build consistency.
DDR: this links back to something said before: 'web services' (small initial letters) consistently refers to connecting up data sets; 'Web Services' with capitals means something different in informatics. Creating connections between resources that were not designed to be connected is incredibly difficult, and the only real answer is for the resources to be built in the first place with the intention of eventually connecting. This doesn't, however, deal with the simple issues described above, of one piece of data meaning something different to the two different projects or researchers who created it.

10. Theodor Dumitrescu – ReMeDiUM: Renaissance and medieval music digital universal media platform
All of our projects have a lot of overlap, and we don't need to reinvent the wheel. What we should be looking at now is intelligent data interchange. Paul has laid out the conceptual background, so the following is about things that are directly relevant to our own projects.
Consortium: in January 2007 a group of musicology projects got together to brainstorm ways in which they might connect up their content and share data. Since then several grant applications have been written, but none submitted; this is becoming more pressing as new projects spring up. Why do we want to do something like this? If we look at the CMME website we can see a sample of a chanson MS from around 1500. Consider that a group of different projects are going to look at the source, each in a different way: we're going to do background work, but we're also going to display that information on the website. We all have a certain amount of common information, but we approach it with different emphases, and with entirely different datasets. How can we make it so that a user going to any one of these websites can automatically pull up the information about that same data from the other projects without having to go to yet another website? Thanks to semantic-web technology these problems of different naming or tagging conventions can be solved: we can use standardized ways of referring.
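As a toy illustration of what 'standardized ways of referring' behind the scenes might look like: each project keeps its own labels and keys, and only the mapping onto a shared identifier is agreed, so records can be joined across projects without anyone changing their own data. The identifiers and labels below are invented, loosely in the spirit of the RM/RS numbers described in the next paragraphs.

```python
# Toy sketch: resolving different local labels to one agreed shared identifier.
# All identifiers and labels here are invented for illustration.
from typing import Optional

# Hypothetical shared master list (one entry shown)
SHARED_SOURCES = {"RS0001": "an agreed canonical description of source X"}

# Each project keeps its own naming; only the mapping to the shared key is published.
LOCAL_TO_SHARED = {
    "project_a": {"Source X (anglicised city name)": "RS0001"},
    "project_b": {"Source X (local-language city name)": "RS0001"},
}

def shared_id(project: str, local_label: str) -> Optional[str]:
    """Translate a project's own label into the agreed shared identifier, if known."""
    return LOCAL_TO_SHARED.get(project, {}).get(local_label)

# Two differently named records resolve to the same source behind the scenes:
assert (shared_id("project_a", "Source X (anglicised city name)")
        == shared_id("project_b", "Source X (local-language city name)")
        == "RS0001")
```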
If you are a consortium member you have to use a certain set of display parameters. Sometimes, for practical or political reasons, that is not possible – e.g. using anglicised city names. What do we do instead? We get semantic-web technology to provide linking that happens behind the scenes. You have to expose very little to the public, so most of the work is done in the background. The ReMeDiUM model suggests a set of information returned by a query regarding MS 'X'. This could show an edition drawn from CMME, images of the manuscript from DIAMM, and text transcriptions from a third website – but all the data is clearly marked to show where it came from. Having decided what he wants to see, the user can then choose the resource that will show him the information he seeks.
ReMeDiUM deliverables – how might we go about implementing this?
• Core schema – a data ontology for the smallest subset of data that all our projects share
• Simple standard identifiers – RM (ReMeDiUM music) and RS (ReMeDiUM source). These are common to everyone, and provide an identifier for a piece that might have no other identifier – e.g. a composer, to differentiate it from other things
• Data harvesting/interlinking software
• Public web interface – just one way of harvesting data
Schema: what is the minimal amount of information that you might want to build into your back end? This looks big, but actually it's pretty minimal. From this, any of the collaborating websites/projects should be able to expose this kind of data to other projects in a standardized, structured way. Most projects do have this information, but maybe not all of it. This is a basic core.
The resource map is an XML file using semantic processes that are common to any kind of data. It sits in the background of any website, telling people where bits of data are and how to get to them – e.g. I have three transcriptions of this piece, this is where they are and how to get to them. A robot from Google can look at this and get some idea of what is on the website. A simple extension to this would allow us to describe simply musical things. The conceptual stuff is already there and becoming more standardized; we need to extend it a bit though. We create a ReMeDiUM standard that says 'this is the way that our websites can talk and communicate music information.' This is becoming more pressing as more individual scholars and teams start creating their own datasets.
MW: Paul makes this sound very difficult, Ted makes this sound very easy. Which is true?
PV: Conceptually it can be very simple, but from the programmer's point of view the interface is the problem. Ted is looking at it from the other end, with a few small usable interfaces in mind, and thinking about easy ways we could combine those things. The ReMeDiUM web page is just one way of envisaging simple data presentation. We don't need to present all the data in one place.
TD: We don't need to show more than a superficial linkage. In many ways there is more congruence between CMME and DIAMM. Have we thought about using FRBR, which has been used to describe works of art?

11. Michael Scott Cuthbert – interconnectivity between music projects
'Twigs growing out of a tree' – this is good advice in telling students how to organise their thoughts and data. How do you advance your main topic? How am I going to organise my data so as to solve my problem? This was a main paradigm in how to program: how to get from data to new information, moving from one end to the other.
A problem we tend to have in the humanities in general, and medieval music in particular, is that very often we organise our data in order to solve one particular problem. We need to think more about reversing the paradigm. There are twigs everywhere. Our data is a lot richer than many projects tend to be. Object-oriented programming means re-organising projects around the data and the particular things that data can do. It has certain properties – so we need to think about how to organise data into subsets.
Music21 is a toolkit similar to 'Humdrum'. It takes musicologists using Finale or Sibelius as its model. These music-processing softwares are not simple to use, but people get to grips with them pretty quickly. Our humanities projects can be either very complex or very simple, so if you want to solve problems that requires a certain amount of commitment – a 2–3 week learning curve, and this is what people will apply to learning a new piece of software that does something they particularly want to do. The Music21 project is quite 'buggy'. Even very early and very poor information is useful, so we get it out there so people engage with it and update it. They will find uses you might not have thought of for your project.
Skip to slide #11. Here's an example: take all the lyrics and assemble them into one text. The automated web browser will automatically Google the word 'exultavit', and will classify whether this is a common text or not. We have to be careful, though, not to attempt to reinvent the wheel: there are lots of sophisticated analysis packages available to analyse data. We want to do something different, but we can build on what is already out there. As an example of what the software can do: it can transpose, it can show sections, and it can label things on strong and weak beats (see the sketch below). This may help to analyse whether a composer does certain things on strong beats rather than weak beats. The machine can do this extremely quickly and accurately over a massive quantity of data, whereas a human would be so slow that it would be pointless to attempt the same thing accurately and with demonstrable data to back up the results.
Music librarians tend to be interested in flexible metadata. We can program the toolkit to follow rules about e.g. adding ficta to a melody. This could be very useful in indicating editorial interventions: if you are analysing style statistically, the system will ignore the part that is put in by someone else. You could then apply the rules and see how often the instance shown really does get corrected. Unfortunately, in the case of ficta, professional judgement is also needed, so this particular example didn't work – or perhaps we needed to come up with a more complex set of rules than we used the first time around. So the wrong result can tell us how to change our rules. Unfortunately in music, though, rules cannot be applied slavishly, so you have to do more programming to qualify each instance of a rule application – e.g. decide which notes are accented or not, then qualify the rule to take that into consideration. The linked-data approach may suggest which pieces the writer of that treatise might have known – so perhaps not the pieces in which that rule doesn't work.
Going back to some of the other discussions today, the issue with metadata is the question of authority. If you don't get it right every time, the computer isn't going to find it, because it is very literal.
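The sketch referred to above shows the kind of bulk operation being described, using the music21 toolkit itself; a Bach chorale from music21's bundled corpus stands in here for the repertory of interest, simply because that corpus ships with the library.

```python
# Sketch: transposing a piece and counting notes by metrical strength with music21.
# A Bach chorale from music21's bundled corpus stands in for the repertory of interest.
from music21 import corpus

piece = corpus.parse("bwv66.6")            # load a chorale from the built-in corpus
transposed = piece.transpose("M2")         # transpose the whole score up a major second

notes = list(transposed.flatten().notes)   # every note in the transposed score
strong = [n for n in notes if n.beatStrength >= 0.5]   # downbeats and strong beats
print(f"{len(strong)} of {len(notes)} notes fall on strong beats")
```

Exactly this kind of query, run over thousands of pieces rather than one, is what makes the statistical comparisons mentioned above feasible.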
Because the computer is so literal, you do need an absolute authority list somewhere, so I believe there is still a need for a standard title or standard composer list, and that can be supplied by something like the Library of Congress unified name catalogue.

Philippe Vendrix – some questions arising from digital projects and today's discussions
How can we teach now? Money goes to research, but not to teaching. Digital humanities causes a dichotomy: if the online courses take off, will you need a professor of palaeography? It is good to make mistakes: we put stuff online even though we know there may be mistakes.
To distinguish between the data and the research is very important. If we distinguish between what we really want to do and the data we are gathering, then we can work in a better way. If we want to know about the way music appears on a page, then we work on that – we don't ask who owned it, we just work on that particular information, and give the data to the team that wants to work on it. 15 years ago we were still at the level of the universal enterprise that wanted to cover everything for everybody. Today that is impossible – we can see the mistakes: why would we repeat them? The more a project centres on one subject, the better. Does it really matter that these projects don't connect up? We could defend a web project the same way that we defend an article in a journal: it stands alone. Rather than absorb, you quote.
Another question: we may not have to search for universal systems. We can work thinking only about our problem: we should not worry, therefore, about other people's problems! Mensural notation is what we're interested in – let's not worry about Stockhausen. CMME is a tool dedicated to one type of notation, so it's good at what it does. If we use this tool then we know that our research can be used by anyone else.
Surprising: we still want to be complete. We write bibliographies – who cares about bibliography today? If we dedicate a lot of time to transcribing RISM/CCM onto the web, then we're not really adding to scholarship. We will never succeed in being global. Is it useful to do this – will it bring something new to scholarship? DIAMM has little in the way of research outputs: it is a research resource. We are just encoding information that already exists in other ways, but we're not doing research. Because we've collected lots of data we want to make it usable to other people. Data is usually a by-product of something we wanted to know, not the end in itself. When we make a proposal we spend a lot of time just collecting data. Sometimes we embrace too large a corpus. If all scholarship is related only to corpus problems, then where will we go? Will there be a discipline of musicology in the future? Will it look too boring because it is just seen as data gathering? Computational musicology is very strongly centred in early music.
MSC: the intense projects of digital musicology have been something of a failure at solving the problems they set out to solve – publications are more about the process than the result. But the side-products of using the tools enable us to move faster. This can be amazingly fortuitous: not wanting to lose his train of thought, he needed to know if a particular motet was in San Lorenzo. So instead of crossing the room he checked La Trobe, and found a new concordance. Are we here to make research easier, or are we here to make research?
At a workshop in Dublin about digital humanities, Willard McCarty said you can produce all this stuff, but where is the research question/answer? What is the point in spending money on developing electronic resources when we should be spending money on research?
JCM: This is actually what I've been saying for a while: the point about connecting up is that less of the grant and the researcher's time is spent gathering data that is already out there, and more on the research that only the expert can do! It is crucial to use more cataloguing data. The DFG insists that every catalogue of a public institution must be open access online. It is scandalous that some catalogues are not online – e.g. CUL!
A sample project from CCH worked with mediated crowd-sourcing: content goes through an independent editorial board. This can work if you are working with relatively small materials; if not, it is less easy. They have a pilot phase available – a catalogue that would work as a bridge between different catalogues. How can we reward people who contribute, and the editors, in a way that is academically recognized? The idea now is to make sure that any contribution has a name attached. The authority is the contributor, not the database where the contribution or contributor lands. The editors get credit by publishing annually: you evaluate for a year, then you publish only once a year. DIAMM has a byline. What prevents Myke from contributing is that it's not immediate: sending something and not seeing it is demoralising, and so the contributor doesn't send the next thing. Should we have a list of contributors to DIAMM pointing to their MSS?
Another problem we have to think of is people not coming through on their commitments: they say they'll produce stuff, but then they don't. People want DIAMM to have more image content – there is no need to expand data and textual content, but we do need to start attaching good images to the data we have. E-codices (Switzerland) put out a call for suggestions of important MSS to be digitized – the person had to put in their own research as their part of getting the MSS digitized. This kind of approach, where MSS are prioritised if there is pertinent research attached, is very useful.