CSDH/SCHN Cyberinfrastructure Conversations Summary
This is a high-level summary of the outcome of a series of conversations regarding the CFI Cyberinfrastructure Initiative among Canadian digital humanists. The conversations emerged from CSDH/SCHN consultations that began in the spring of 2014. The document tries to reflect the priorities and areas of emphasis that have emerged from these discussions, and suggests several areas of focus for broad-based collaborative cyberinfrastructure that would serve the needs of many in the digital humanities research community. The diversity of work in the digital humanities makes it impossible to mention every need, but in the view of the CSDH executive, this summary covers a number of pressing needs from a range of research groups across the country. It balances the need to serve existing researchers with that of expanding access to important datasets and cyberinfrastructure for leading humanities researchers who are experimenting with advanced research computing.

This summary is not meant to be prescriptive, nor is it meant to limit the number of proposals going forward to CFI. It may be that the diversity of work and the range of needs in our interdisciplinary community make it impossible to work within the parameters of one or two applications. CSDH will not coordinate a national consortium or proposal, but it hopes to have provided a foundation for the formation of a broad-based consortium of institutions to collaborate in one or more successful proposals to the CFI Cyberinfrastructure Initiative. In the view of the Executive, this should bring the CSDH/SCHN role in the coordination of discussion related to this initiative to an end.
The cyberinfrastructure listserv will remain available as a resource to allow those participating in applications to communicate and coordinate as they wish. Our assumption is that the next stage will involve participants affiliating themselves with these and likely other areas, forming smaller working groups associated with those areas, and discussing institutional participation and leadership in preparation for the upcoming EOI stage. We thank all of those who have contributed to these discussions, to the production of the SPARC White Paper, and to the response to the draft call.

General points

Consultation within the digital humanities community has produced a wide range of ideas for what kinds of investments in cyberinfrastructure might have the most value and impact. The richness and diversity of digital humanities research is both a strength and a challenge in the context of the CFI Cyberinfrastructure Initiative.

One recurring theme during the consultation has been the need for a well-supported shared digital infrastructure that would provide access to data in conjunction with tools, platforms and training materials, serving both leading digital humanities scholars and scholars keen to incorporate digital methods into their research. Within Canada, the obstacles to getting up and running with a digital project are still too daunting for many scholars in the humanities. Many projects have created large amounts of data and sophisticated tools, but their data and tools remain inaccessible and isolated for structural reasons, especially the lack of a shared and powerful cyberinfrastructure enabling collaborative and innovative research.

There is a need for a user-friendly platform, perhaps a cloud environment that offers both infrastructure-as-a-service and software-as-a-service, that will enable a much greater number of humanities researchers to make the most of existing and emerging (i.e., grant-funded) digital datasets through shared web-based infrastructure. Just-in-time resources could be managed centrally (providing economies of scale) and coupled with the technical and domain expertise support that humanities researchers need. Compute Canada's server and account infrastructure could form the foundation for a welcoming platform customized for digital humanists. Among the benefits would be large-scale sharing of content and datasets, as well as the capacity to spin up project-specific or event-specific instances of tools to mobilize or combine particular datasets.

There is a widespread consensus that the digital humanities research community needs personnel with sufficient technical expertise and depth of experience in digital humanities projects to work with Compute Canada staff and equipment. A distributed group of data analysts (similar to those in Compute Canada and perhaps on a trajectory to become part of the CC complement of analysts) would be positioned to promote standards, adapt and adopt existing open-source software solutions, and install emergent tools on Compute Canada equipment. The result would be accessible datasets, tools, and services with web-based, user-friendly front ends that dramatically increase the capacity of Compute Canada to serve the humanities.

Potential areas of focus

The following areas of focus have emerged from our consultations as addressing the needs of a number of researchers from more than one research group. They are not an exhaustive representation of the infrastructure requirements of the community, but represent areas where there is significant need cutting across several communities of researchers. Of course, the needs could be grouped and represented differently.
Needs of the community are diverse, and listing these areas of focus together here does not suggest that they should all be part of a single application, even though there are synergies and overlap between these areas.

Big humanities data accessibility and interoperability:
Aggregating and making available large-scale humanities datasets in an environment that respects copyright and other rights restrictions by providing data management services, and improving methods of data fusion and interoperability for large collections of datasets that cannot be directly accessed by researchers but whose owners (large collections such as that held by the HathiTrust, commercial datasets made available for research purposes, and heritage institutions around the world) are willing to lend them for research purposes.
Infrastructure as a service: big data storage services
Software as a service: data conversion and ingestion tools; text mining tools such as Hadoop, Weka, MALLET; visualization tools such as Voyant
Research Rights Management tool
Ancillary work by community and partners: access policy development
Possible partners: HathiTrust, ARC, Canadiana, research libraries, OCUL, CARL, Archive.org or Canadian contributors thereto; Gale-Cengage; Érudit, CRKN; Mukurtu; heritage and memory institutions
Possible research projects: Text Mining the Novel; HistoryCrawler; Textual Communities; Editing Modernism in Canada; Hispanic Baroque Project

Multi-disciplinary collaboration with heterogeneous cultural and textual heritage data:
A number of projects in Canada and beyond either work collaboratively with heterogeneous datatypes or work with data in a context that could benefit from collaboration with others working with other datatypes. This data might include 2D photography, moving image collections, gaming, 3D point clouds and meshes, and so on.
The applications these projects might involve include digital libraries, serious games, immersive environments, data visualisation, crowdsourcing applications, and other research and mobilization activities. A strategic group of multi-disciplinary, national and international partners working with cultural and textual heritage data in these contexts could collaborate on using existing research datasets to develop appropriate tools and protocols for collaboration, visualisation, and "publication" (in the broad sense of "sharing with end users"). The infrastructure will help place Canadian researchers at the global forefront of research into cultural evolution, and provide data that permits researchers in computer science and mathematics to explore various ways to traverse graphs and to model heterogeneous multi-network data structures belonging to different semantic domains but intersecting at various points.
Possible partners: CulturePlex Lab
Related research projects: Visionary Cross Project; Hispanic Baroque Project

Linking humanities data:
Semantic web infrastructure that will allow humanities researchers to participate in the emergent semantic web, and leverage the growing knowledge graph to answer key questions about cultural and social change that can only be addressed through the “internet of things.” Provide tools to allow researchers to link their data; annotate others’ data; manage, curate, and accept or incorporate others’ annotations of their data; expose the data and annotations on the web; and manage the complex relationship between such annotations and the dynamic content to which they refer.
Infrastructure as a service: national triple store linked to a SPARQL endpoint; visualization tools; reasoners
Software as a service: tools for mobilizing common humanities data formats as linked open data to provide greater accessibility and interoperability, and to contribute scholarly knowledge to the public good via the emerging semantic web; tools for ontology building, ontology management (including dynamic ontology management) and linking; tools for harvesting annotations for data corrections and commentary
Possible partners: Canadiana, ARC, InPho, CWRC
Related research projects: Textual Communities; Linked Modernisms; Coding Character; MARGOT; Mariposa Folk Festival Archives project

Dynamic data management:
Addresses the challenge that humanities datasets are almost always incomplete, in dialogue with ongoing scholarship, and in need of updating and curation. Allowing data to evolve and tracking that evolution is key to maintaining robust datasets that support investigation of how patterns of cultural information change and affect human behavior and knowledge. Producing environments that foster adherence to standards and support community curation of humanities datasets is key to ensuring the quality and interoperability of those datasets. A step change in humanities research capacity could be produced by leveraging and making interoperable key components of existing software suites associated with major humanities data producers, aggregators, and disseminators.
Platform as a service: ability to spin up virtual machine sandboxes and configure them with particular sets of compatible tools for specific research and research training purposes; this is key to growing the number of active data managers and curators in the humanities research community
Software as a service: commonly needed applications for discovery, analysis and visualization; tools for enhancing existing data and/or preparing it for research through metadata enhancement, automated XML markup, OCR cleanup, NER and triple extraction, and annotation; and tools for managing crowd-sourcing of transcription, collation and annotation in the making of digital editions
Possible partners: PKP, Érudit, PREEO lab; CWRC
Related research projects: Textual Communities; INKE; MARGOT; Orlando