Cross-Domain Resource Sharing

Resource Access Recommendation Based on
Semantics, Provenance and Traceability
Information
Nuno Bettencourt
Universidade de Trás-os-Montes e Alto Douro
This dissertation is submitted for the degree of
Philosophiæ Doctor (PhD)
Under the supervision of:
Professor Nuno Silva and Professor João Barroso
Vila Real, June 2015
To those who encouraged and helped me to get where I am today.
Declaration
This work is presented by Nuno Bettencourt to Universidade de Trás-os-Montes e Alto Douro in fulfilment of the requirements for the degree of Philosophiæ Doctor (PhD) in Computer Science, under the supervision of Professor Nuno Silva and Professor João Barroso.
Nuno Bettencourt
Vila Real, June 2015
Acknowledgements
I would like to thank equally all who contributed to the research and development of this work. The agreements, disagreements and discussions were all part of a greater learning process.
I would also like to thank my family for their unconditional support during this period and for their understanding when I was not around.
The accomplishment of this research work was partly supported by Grupo de Investigação em Engenharia do Conhecimento e Apoio à Decisão; Instituto Superior de Engenharia do Porto; Instituto Politécnico do Porto; FEDER funds through the “Programa Operacional Factores de Competitividade - COMPETE” programme; and national funds through FCT “Fundação para a Ciência e a Tecnologia”, under the projects: EDGAR (POSI/EIA/61307/2004); COALESCE (PTDC/EIA/74417/2006); OOBIAN - Living Knowledge (QREN 12677); WS-World Search (QREN 11495) and AAL4ALL - Ambient Assisted Living for All (QREN 13852).
Resumo
The Internet has recently grown to more than three billion users¹. This represents slightly more than forty per cent of the entire world population. On some social networks, more than two hundred thousand photographs are uploaded every minute². This volume of content creation and generation on social networks makes the task of sharing resources harder for users.
Typical resource sharing on the Internet is achieved by granting users access rights to resources, commonly restricted to resources hosted on a single domain. Access policies are therefore issued to users registered on the same domain. Sharing resources with users who are not registered on the same domain has proven insecure or difficult to achieve. Referencing and accessing resources protected by access policies in other web domains (other than the one where they are hosted) is practically unsupported by current web applications.
When sharing resources across different domains, such difficulties encourage the cloning of resources, the multiplication of the user's internal and social identity, and an increased burden of managing access policies for each domain.
The goal of this work is to provide a multi-domain infrastructure that offers management processes and support for sharing resources securely.
To achieve this goal, four main research questions are formulated: RQ1: How can a distributed, decentralised and standards-based mechanism carry out the responsibilities of authentication and authorisation? RQ2: Which user actions upon resources can be captured, and how? RQ3: How can a user share a resource with others using rule-based access policy management instead of discretionary, statically defined policies? RQ4: How is it possible to automate or ease the process of managing access policies to resources from the point of view of a resource's author and his/her relationships?
This thesis proposes a distributed and decentralised architectural model that promotes multi-domain resource sharing, resource referencing and access policy management, adopting the principles of the Web and the standards/recommendations of the World Wide Web Consortium (W3C).
The architectural model is composed of six interconnected entities capable of: generating user identities and their access credentials; capturing user-generated actions and content; verifying a user's authentication; applying access restrictions over resources; and supporting users in managing access policies.
The proposal suggests adopting the Friend Of A Friend (FOAF) vocabulary to represent users and their internal and social identity. In combination with these profiles, the joint adoption of Secure Sockets Layer (SSL) and FOAF provides distributed authentication.
The architecture incorporates conceptual mechanisms to capture user actions upon resources, which are represented and stored as semantic annotations. Based on these annotations, the concept of traceability applied to Internet resources is introduced.
Access policies are decoupled from resources and from their enforcement points. Users keep full control over their resources and are given a multi-domain resource sharing experience regardless of how the resource is handled by the infrastructure, avoiding the duplication of resources across different domains.
Resource sharing should be achieved by defining semantic rules capable of specifying why a resource is being shared and with whom, rather than statically defining who has access to what.
To support users in managing access policies, an access policy recommendation service was added to the infrastructure. The recommendation mechanism features a hybrid engine that combines different filtering techniques exploiting user profiles, their social networks, resource content, and (distributed) provenance and traceability information.
A prototype demonstrating the feasibility of the infrastructure was designed and implemented to show that the architectural model can be deployed in real scenarios. To illustrate how the infrastructure can benefit legacy applications, it was also applied to an existing web application.
The prototype was evaluated in two different ways to attest to the validity of the proposal. First, a set of functional tests was carried out on the prototype to validate the proposed components. Second, the hybrid recommendation was tested using a data set that was interpreted to simulate human behaviour in the system.
Adopting a hybrid access policy recommendation mechanism made it possible to enrich those recommendations by using additional information that the architecture provides. Provenance and traceability information is used together with users' social networks and resource content to automatically propose which access policies should be added to a given resource.
While the current development model followed by the web community is set to lock users (consumers and publishers) into large web domains, this new approach breaks the existing model, giving users a greater degree of control over their resources. It provides the means and support for publishing resources privately, making websites behave once again like meshes of referenced and interlinked resources from different domains that remain compliant with the established access policies.
¹ http://www.internetlivestats.com/internet-users/
² http://aci.info/2014/07/12/the-data-explosion-in-2014-minute-by-minute-infographic/
Abstract
The Internet has recently grown to over three billion users¹. This represents slightly more than forty per cent of the whole world population. On certain social networks, more than two hundred thousand photographs are uploaded every minute². Such a rate of content generation and social network building makes the task of sharing resources more difficult for users.
Standard resource sharing on the Internet is achieved by granting users access to resources, commonly restricted to resources hosted on a single domain. Access policies are consequently issued to users registered on the same domain. Sharing resources with users who are not registered on the same domain has proven insecure or difficult to achieve. Referencing and accessing resources protected by access policies in other web domains (apart from where they are hosted) is practically unsupported by existing web applications.
In cross-domain sharing, such difficulties encourage the cloning of resources to different domains and the multiplication of users’ internal and social identities, and they increase the burden of managing access policies for each domain.
The goal of this work is to provide a seamless cross-web-domain infrastructure that offers secure, rich and supportive resource management and sharing processes.
To achieve this goal, four main research questions are formulated: RQ1: How can a distributed, decentralised and standards-based mechanism perform authentication and authorisation? RQ2: Which user-generated actions upon resources can be captured, and how? RQ3: How can a user share a resource with others based on rules instead of statically defined discretionary access control? RQ4: How can the process of managing access policies to resources be automated or eased from the perspective of a resource’s owner and his/her relationships?
This thesis proposes a distributed and decentralised architectural model that fosters cross-web-domain resource sharing, resource dereferencing and access policy management. It adopts the principles of the Web and W3C standards/recommendations.
¹ http://www.internetlivestats.com/internet-users/
² http://aci.info/2014/07/12/the-data-explosion-in-2014-minute-by-minute-infographic/
The architectural model comprises six interconnected entities capable of: providing user identity and credentials; capturing user-generated actions and content; enforcing authentication and authorisation over resources; and supporting users’ management of access policies.
The proposal suggests adopting the Friend Of A Friend (FOAF) vocabulary to represent users’ internal and social identity, complemented with Secure Sockets Layer (SSL) to provide distributed authentication.
The architecture incorporates conceptual mechanisms to capture user actions over resources, which are further represented and stored as semantic annotations. Based on
these annotations, the concept of traceability applied to Internet resources is introduced.
Access policies are decoupled from resources and enforcement points. Users maintain full control over their resources and are provided with a cross-domain sharing experience regardless of how the resource is handled by the infrastructure, which avoids duplicating resources across different domains.
Resource sharing is to be achieved by defining semantic rules that specify the rationale behind the share (why a resource is shared and with whom) rather than statically defining who has access to what.
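The contrast between rule-based sharing and a statically defined access list can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the classes `User` and `Resource`, the functions `static_acl_allows` and `friends_depicted_rule`, the example URIs and the rule itself (share a photograph with the owner's friends who are depicted in it) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    """Hypothetical user identified by a WebID-style URI (assumption)."""
    webid: str
    friends: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource with an owner and a set of depicted users."""
    uri: str
    owner: User
    depicts: frozenset = field(default_factory=frozenset)


def static_acl_allows(acl, requester, resource):
    """Static policy: access exists only if the pair was listed by hand."""
    return (requester.webid, resource.uri) in acl


def friends_depicted_rule(requester, resource):
    """Rule-based policy: the owner's friends who are depicted get access."""
    is_friend = requester.webid in resource.owner.friends
    is_depicted = requester.webid in resource.depicts
    return is_friend and is_depicted


# Illustrative data: bob is alice's friend, carol is not.
alice = User("https://a.example/alice#me", frozenset({"https://b.example/bob#me"}))
bob = User("https://b.example/bob#me")
carol = User("https://c.example/carol#me")
photo = Resource(
    "https://a.example/photos/1", alice, frozenset({bob.webid, carol.webid})
)

bob_by_rule = friends_depicted_rule(bob, photo)      # friend and depicted
carol_by_rule = friends_depicted_rule(carol, photo)  # depicted, not a friend
bob_by_acl = static_acl_allows(set(), bob, photo)    # nobody listed by hand
```

With the rule, a newly uploaded photograph that depicts a friend is shared without anyone editing an access list, whereas the static list grants nothing until each user and resource pair is added explicitly.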
To support users in managing access policies, the architecture includes a recommendation provider capable of recommending access policies to users. The proposed recommendation engine is a hybrid that combines different filtering techniques, which exploit user profiles, their social networks, resource content, and (distributed) provenance and traceability information.
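One common way to combine several filtering techniques is a weighted average of their per-candidate scores. The Python sketch below assumes that scheme purely for illustration; the abstract does not state how the thesis aggregates its techniques, and the function `hybrid_scores`, the technique names, the weights and the scores are all hypothetical.

```python
def hybrid_scores(technique_scores, weights):
    """Combine per-technique candidate scores into a weighted average.

    technique_scores: {technique_name: {candidate: score in [0, 1]}}
    weights: {technique_name: non-negative weight}
    A candidate that a technique does not score contributes 0 for it.
    """
    total_weight = sum(weights[name] for name in technique_scores)
    if total_weight == 0:
        raise ValueError("at least one technique needs a positive weight")
    combined = {}
    for name, scores in technique_scores.items():
        for candidate, score in scores.items():
            combined[candidate] = combined.get(candidate, 0.0) + weights[name] * score
    return {c: s / total_weight for c, s in combined.items()}


# Illustrative inputs: which users might be granted access to a resource.
technique_scores = {
    "social": {"bob": 1.0},                       # from the social network
    "content": {"bob": 0.5, "carol": 1.0},        # from resource content
    "provenance": {"carol": 1.0, "dave": 0.5},    # from captured actions
}
weights = {"social": 2.0, "content": 1.0, "provenance": 1.0}

scores = hybrid_scores(technique_scores, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here the social signal is weighted twice as heavily as the other two, so a candidate supported mainly by the social network can outrank one supported by content and provenance together.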
A prototype was designed and implemented to demonstrate the infrastructure’s feasibility and to show that the architectural model can be deployed in a real-world scenario. Part of the infrastructure was also applied to a legacy web application to illustrate how it could benefit legacy applications.
The prototype was evaluated in two different ways to attest to the validity of the proposal. Firstly, a set of functional tests was conducted on the prototype to verify the proposed components. Secondly, the hybrid recommendation was tested using an available data set whose information was interpreted to simulate human behaviour in the system.
Adopting a hybrid access policy recommendation engine enriched the access policy recommendations with additional information provided by the system. Captured provenance and traceability information is used together with users’ social networks and resources’ contents to automatically propose which access policies should be added to a given resource.
While the current paradigm of web architecture is set to imprison users (consumers and publishers) in big web domains, this novel approach is set to disrupt this state of affairs by empowering users with a higher degree of control over their resources. It provides the means and support for publishing resources in a private manner, thereby making websites behave (again) like meshes of dereferenced resources from different web domains, while still complying with the established access policies.
Contents
List of Tables
List of Figures
Acronyms
I Contextualisation
1 Introduction
  1.1 History
  1.2 Problem
  1.3 Challenges
    1.3.1 Weak Cross-Domain Security
    1.3.2 Poor Cross-Domain Directory Service
    1.3.3 Poor Sharing Mechanisms
  1.4 Goal
  1.5 Hypothesis
  1.6 Research Questions and Proposals
  1.7 Research Method
  1.8 Research Contributions
  1.9 Overview
2 Background Knowledge
  2.1 Authentication, Authorisation, Accountability
    2.1.1 Identification
      2.1.1.1 Social Network
      2.1.1.2 Friend Of A Friend
    2.1.2 Authentication
    2.1.3 Authorisation
      2.1.3.1 Access Control Design
      2.1.3.2 Languages & Frameworks
      2.1.3.3 Access Control Administration
      2.1.3.4 Resource State
      2.1.3.5 Building Trusted Social Network
    2.1.4 Accountability
      2.1.4.1 User Actions & Generated Content
      2.1.4.2 Resources Directory
      2.1.4.3 Provenance
      2.1.4.4 Traceability
      2.1.4.5 Annotations
    2.1.5 Reference Architecture
      2.1.5.1 IDentity Provider
      2.1.5.2 Policy Enforcement Point
      2.1.5.3 Policy Decision Point
      2.1.5.4 Policy Administration Point
      2.1.5.5 Policy Information Point
      2.1.5.6 Policy Retrieval Point
  2.2 Recommendation
    2.2.1 Location Awareness vs. Knowledge Awareness
    2.2.2 Recommendation Techniques
    2.2.3 Recommender Engines
    2.2.4 Similarity Algorithms
    2.2.5 Relevant Actions and Training Model
    2.2.6 Evaluation
      2.2.6.1 Accuracy
      2.2.6.2 Recall
      2.2.6.3 Precision
      2.2.6.4 F Measure
  2.3 Summary
II Proposal
3 Use Case Scenario
  3.1 Characterisation
    3.1.1 Social Event
    3.1.2 People Social Relationships
    3.1.3 Content Generation
    3.1.4 Contextual Information
    3.1.5 Virtual Identity
    3.1.6 Content Hosting Domains
    3.1.7 Access Policies
  3.2 Use Cases
    3.2.1 Content Sharing on a Single Domain
    3.2.2 Content Sharing on Multiple Domains
    3.2.3 Content Sharing by Using Social Networks
    3.2.4 Content Sharing by Using Resource Attributes
    3.2.5 Compound Resources by Hyperlinking Resources
    3.2.6 Content Sharing by Dynamically Grouping Users and Resources
    3.2.7 Known-Unknowns Discovery
    3.2.8 Unknown-Unknowns Discovery
  3.3 Systematisation of Problems
    3.3.1 User Multiple Identity
    3.3.2 Cross-Domain Profile Semantic Heterogeneity
    3.3.3 Cross-Domain Authentication
    3.3.4 Cross-Domain Authorisation
    3.3.5 Contextual Information Loss
    3.3.6 Expressivity Problem of the Access Policy Language
    3.3.7 Expansion Problem of the Access Policy Language
    3.3.8 Semantics of the Access Policy Language
    3.3.9 Cross-Domain Compliance of the Access Policy Language
    3.3.10 Cross-Domain Resource Hosting
    3.3.11 Multiple Renderings of Compound Resources
    3.3.12 Known-Unknown Situation
    3.3.13 Unknown-Unknown Situation
    3.3.14 Insecure Sharing
  3.4 Summary
4 Architecture
  4.1 Overview
  4.2 Research Questions vs. Features
  4.3 Summary
5 Design
  5.1 Identification Provider Point
    5.1.1 Contributions
      5.1.1.1 User Identity Creation
      5.1.1.2 User Profile Management
      5.1.1.3 Authentication Relying Party
    5.1.2 Summary
  5.2 Policy Enforcement Point
    5.2.1 Contributions
      5.2.1.1 Authentication
      5.2.1.2 Authorisation
      5.2.1.3 Provenance & Traceability Information
      5.2.1.4 Capturing User Actions
      5.2.1.5 Decentralised Resource Hosting
      5.2.1.6 Rendering Compound Resources
    5.2.2 Summary
  5.3 Policy Decision Point
    5.3.1 Contributions
    5.3.2 Summary
  5.4 Policy Information Point
    5.4.1 Contributions
      5.4.1.1 Information Storage and Retrieval
      5.4.1.2 Publishing Semantic Information
      5.4.1.3 Creating Resource Contextual Information
    5.4.2 Summary
  5.5 Policy Administration Point
    5.5.1 Contributions
      5.5.1.1 Access Policy Management
      5.5.1.2 Dynamic User Grouping
      5.5.1.3 Dynamic Resource Grouping
    5.5.2 Summary
  5.6 Policy Recommendation Point
    5.6.1 Contributions
      5.6.1.1 Hybrid Recommendation Engine
      5.6.1.2 Notifications & Feedback
      5.6.1.3 Known-Unknown and Unknown-Unknown Resources
    5.6.2 Summary
  5.7 Summary
III Evaluation
6 Prototype
  6.1 Identification Provider Point
    6.1.1 User Identity Creation
    6.1.2 Profile Management
    6.1.3 Relying Authentication Party
  6.2 Policy Enforcement Point
    6.2.1 Authentication Module
    6.2.2 Authorisation Module
    6.2.3 Upload Action Sensor Module
  6.3 Policy Decision Point
  6.4 Policy Information Point
    6.4.1 Hosting and Resources Retrieval
    6.4.2 Publishing Provenance and Traceability Information
    6.4.3 User’s Manual Addition of Contextual Information to Resources
  6.5 Policy Administration Point
  6.6 Policy Recommendation Point
  6.7 Wordpress Integration
    6.7.1 User Authentication
      6.7.1.1 Request Redirection
      6.7.1.2 Validating Authentication Response
      6.7.1.3 Wordpress User Registration
      6.7.1.4 Session Generation
    6.7.2 Resource Upload
    6.7.3 Rendering Compound Resources
  6.8 Summary
7 Experiments
  7.1 Dataset Preparation
    7.1.1 Source Datasets
    7.1.2 Domain Ontology Mapping & Integration
    7.1.3 System Ontology Mapping
    7.1.4 Summary
  7.2 Recommender System
    7.2.1 Recommendation Dataset
    7.2.2 Evaluation Dataset Generator
    7.2.3 Similarity Generator
    7.2.4 Similarity Aggregator
    7.2.5 Predictions Generator
    7.2.6 Predictions Aggregator
    7.2.7 Measurement Calculator
    7.2.8 Summary
  7.3 Evaluation & Results
    7.3.1 Baseline Configurations Analysis
    7.3.2 C1 Derived Configurations Analysis
    7.3.3 C2 Derived Configurations Analysis
    7.3.4 C4 Derived Configurations Analysis
    7.3.5 C7 Derived Configurations Analysis
    7.3.6 C8 Derived Configurations Analysis
    7.3.7 C10 Derived Configurations Analysis
    7.3.8 Aggregated Predictions Analysis
    7.3.9 Summary
  7.4 Summary
8 Conclusions & Future Work
  8.1 Overview
  8.2 Research Questions / Contributions
  8.3 Outcomes
  8.4 Limitations & Future Work
    8.4.1 Access Policy Revocation
    8.4.2 Indivisible Resources
    8.4.3 Embeddable Resources
    8.4.4 Blacklisting Users & Resources
    8.4.5 Performance impact
  8.5 Final Remarks
IV Bibliography and Appendix
Bibliography
A Dataset Preparation
  A.1 OpenRefine
  A.2 Domain Ontology
B Recommendation Evaluation Results
  B.1 Baseline Configurations Results
  B.2 C1 Derived Configurations Results
  B.3 C2 Derived Configurations Results
  B.4 C4 Derived Configurations Results
  B.5 C7 Derived Configurations Results
  B.6 C8 Derived Configurations Results
  B.7 C10 Derived Configurations Results
  B.8 Baseline Aggregated Predictions Results
  B.9 Derived Configurations Aggregated Predictions Results
List of Tables
2.1 Recommender Engines Comparison
2.2 Confusion Matrix
5.1 Filtering Methods Input Information
7.1 Analysed Datasets
7.2 Entities Mapping
7.3 Mahout-based Similarities [Weightless]
7.4 Semantic Similarities [Weightless]
7.5 Mahout-Based Similarities [Boolean]
7.6 Semantic Similarities [Boolean]
7.7 Mahout-based Similarities [Weighted]
7.8 Semantic Similarities [Weighted]
7.9 Mahout Similarities [Normalised Weight]
7.10 Semantic Similarities [Normalised Weight]
7.11 Similarities Aggregation [Union Average]
7.12 Similarities Aggregation [Intersection Average]
7.13 Predictions Without Normalised Scoring
7.14 Predictions With Normalised Scoring
7.15 Baseline Configurations Results [AT=25]
7.16 C1 Derived Configurations Results [AT=25]
7.17 C2 Derived Configurations Results [AT=25]
7.18 C4 Derived Configuration Results [AT=25]
7.19 C7 Derived Configurations Results [AT=25]
7.20 C8 Derived Configuration Results [AT=25]
7.21 C10 Derived Configurations Results [AT=25]
7.22 Baseline Configurations Aggregated Predictions Results [AT=25]
7.23 Derived Configurations Aggregated Predictions Results [AT=25]
A.1 Domain Ontology Facts & Numbers
A.2 Domain Concepts’ Source Datasets
B.1 Baseline Configurations Results
B.2 C1 Derived Configurations Results
B.3 C2 Derived Configurations Results
B.4 C4 Derived Configurations Results
B.5 C7 Derived Configurations Results
B.6 C8 Derived Configurations Results
B.7 C10 Derived Configurations Results
B.8 Baseline Configurations Aggregated Predictions Results
B.9 Derived Configurations Aggregated Predictions Results
List of Figures
1.1 Cross-Domain Sharing . . . . 6
2.1 WebID Description [134] . . . . 18
2.2 FOAF Multi Domain Social Networks [37] . . . . 20
2.3 FOAF+SSL Authentication Process [147] . . . . 22
2.4 Access Control Domain Diagram . . . . 27
2.5 Resources Accessibility [State Diagram] . . . . 30
2.6 Social Relationships using Self-Introduction . . . . 31
2.7 Social Relationships by Group-Association . . . . 32
2.8 System-Proposed Social Relationships . . . . 33
2.9 Reference Architecture [Components Diagram] . . . . 38
2.10 PEP Pattern [Sequence Diagram] (adapted from [116]) . . . . 39
2.11 Resource Location vs. User Awareness . . . . 44
3.1 Uploaded Photographs and Sharing Statistics [55] . . . . 58
3.2 Use Case Scenario: Users . . . . 59
3.3 Use Case Scenario: Users’ Social Relationships . . . . 60
3.4 Use Case Scenario: Resources & Authorship . . . . 61
3.5 Use Case Scenario: Resources Information . . . . 62
3.6 Use Case Scenario: Annotated Photographs . . . . 62
3.7 Use Case Scenario: Users Web Domain Registration . . . . 63
3.8 Compound Resource . . . . 66
3.9 Compound Resource Rendering . . . . 67
4.1 Proposed Architecture [Components Diagram] . . . . 79
5.1 IDP: Identity Creation Process [System Sequence Diagram] . . . . 87
5.2 IDP: Identity Creation Process [Sequence Diagram] . . . . 88
5.3 IDP: Authentication Relying Party Process [Sequence Diagram] . . . . 91
5.4 PEP: Authentication Process [Sequence Diagram] . . . . 95
5.5 PEP: Authorisation Process [Sequence Diagram] . . . . 97
5.6 PEP: Provenance and Traceability Entities . . . . 98
5.7 PEP: Intercepting User Actions [Sequence Diagram] . . . . 101
5.8 PEP Components Diagram . . . . 102
5.9 PEP: Resource Upload Action Process [Sequence Diagram] . . . . 103
5.10 PEP: Resource Download Action [Sequence Diagram] . . . . 105
5.11 PEP: Rendering Compound Resources Example [Screenshot] . . . . 106
5.12 PDP: Access Policy Evaluation Process [Sequence Diagram] . . . . 109
5.13 PAP: Static Access Approach . . . . 115
5.14 PAP: RBAC Approach . . . . 116
5.15 PAP: ABAC Approach . . . . 116
5.16 PAP: FOAF-based Approach . . . . 117
5.17 PRP: FOAF and Context-based Rules . . . . 120
5.18 PRP: Hybrid Recommendation Information . . . . 122
5.19 PRP: Recommendation Process . . . . 125
5.20 PRP: Recommendation Notifications & Feedback . . . . 126
6.1 Prototype [Components Diagram] . . . . 134
6.2 IDP: FOAF+SSL Profile Creation [Screenshot] . . . . 135
6.3 PIP: RDF FOAF Profile Example [Screenshot] . . . . 136
6.4 SSL Certificate Associated to FOAF Profile [Screenshot] . . . . 137
6.5 IDP: Social Network Management [Screenshot] . . . . 138
6.6 IDP: Topic Preferences Management [Screenshot] . . . . 138
6.7 SSL Certificate Selection [Screenshot] . . . . 139
6.8 PEP: Resource Upload Process [Sequence Diagram] . . . . 142
6.9 PAP: Resources Listing [Screenshot] . . . . 144
6.10 PIP: Resource’s Related Semantic Topic Management [Screenshot] . . . . 145
6.11 PAP: Resource Access Policy Definition Example [Screenshot] . . . . 146
6.12 PRP: List of Recommended Access Policies [Screenshot] . . . . 147
6.13 PRP: List of Recommended Resources Notification [Screenshot] . . . . 147
6.14 Wordpress: Authentication Options [Activity Diagram] . . . . 149
6.15 Wordpress: FOAF+SSL Authentication Process [Sequence Diagram] . . . . 150
6.16 Wordpress: Authentication Extension Login Page [Screenshot] . . . . 151
7.1 Source Datasets . . . . 157
7.2 Domain Ontology . . . . 160
7.3 Sources Dataset Mapping to Domain Ontology . . . . 161
7.4 System Ontology . . . . 163
7.5 Domain Ontology to System Ontology Mapping . . . . 164
7.6 Recommender System Overall Evaluation Process . . . . 166
7.7 Recommendation Dataset . . . . 167
7.8 System Ontology to Recommendation Dataset Mapping . . . . 168
7.9 Evaluation Dataset Generator Process . . . . 169
7.10 Similarities Generator Process . . . . 170
7.11 System Ontology Semantic Similarity Enrichment . . . . 172
7.12 Semantic Enrichment Example . . . . 172
7.13 Similarity Aggregation Process Example . . . . 173
7.14 Similarities Aggregator Process . . . . 174
7.15 Predictions Generator Process . . . . 177
7.16 Predictions Aggregator Process . . . . 178
7.17 Measurement Calculator Process . . . . 181
Acronyms
AAA Authentication, Authorisation, Accountability. xxxi, 15, 16, 38, 40–42, 53, 78,
81, 110
ABAC Attribute Based Access Control. xxxi, 26, 28, 85, 116, 119
ACL Access Control List. xxxi, 25
API Application Program Interface. xxxi, 47
ARBAC Attribute and Role-Based Access Control. xxxi, 26
CIM Common Information Model. 40, 41
D-FOAF Distributed Identity Management System. 30
DAC Discretionary Access Control. xxxi, 10, 24–26
DMTF Distributed Management Task Force. 40, 41
FOAF Friend Of A Friend. xiv, xxxi, 19, 21, 22, 31, 85, 87, 89, 92–94, 104, 112,
117–119, 121, 123, 125, 135, 137, 138, 140, 143, 148, 152, 153, 192, 197
FOAF+SSL Friend Of A Friend + Secure Socket Layer. 12, 20–23, 78, 81, 83, 87,
90, 92–94, 99, 107, 133, 135, 138–140, 148, 149, 151–153, 192
HTML Hyper Text Markup Language. xxxi, 3, 112, 196, 197
HTTP HyperText Transfer Protocol. 20, 83, 92, 139, 141, 152, 198
HTTPS HTTP Secure. 21, 22, 94, 139, 148
IBAC Identity-Based Access Control. xxxi, 26
IDP Identification Provider Point. xxxi, 78, 86, 87, 90, 92, 111, 133, 135, 138–140,
148, 149, 151, 154
IdP IDentity Provider. xxxii, 38, 40, 78, 86
IETF Internet Engineering Task Force. 38, 40, 41
KAoS KAoS Services Framework. 27
Ln RBAC N-leveled RBAC. xxxii, 26
LD Linked Data. 5, 22, 35
LOD Linked Open Data. 5, 35, 36, 96
MAC Mandatory Access Control. xxxii, 24–26
nDAC Non-Discretionary Access Control. xxxii, 24–26
OWL Web Ontology Language. 27–29
PAP Policy Administration Point. xxxii, 38, 41, 78, 80, 111, 139, 143, 153
PC Personal Computer. xxxii, 9
PDF Portable Document Format. 37, 112, 143
PDP Policy Decision Point. xxxii, 38, 40, 41, 78, 96, 107, 108, 110, 111, 119, 140,
143
PEP Policy Enforcement Point. xxxii, 38, 40, 41, 78, 90, 92, 93, 100, 108, 110, 111,
114, 140, 141, 148, 149, 154
PIP Policy Information Point. xxxii, 38, 41, 42, 78, 80, 87, 92, 100, 102, 108, 110,
111, 113, 139, 141, 143, 144, 153
PKI Public Key Infrastructure. 21
PRP Policy Retrieval Point. xxxii, 38, 42, 80
PRP Policy Recommendation Point. xxxii, 77, 78, 81, 111, 121, 124, 139, 154, 162,
163, 165
RBAC Role Based Access Control. 26, 71, 85, 115, 116, 119
RDF Resource Description Framework. 18, 19, 28, 33, 47, 87, 96, 135, 137, 192, 196
RuleML Rule Markup Language. 28
SAML Security Assertion Markup Language. xxxii, 28, 40
SSL Secure Sockets Layer. xiv, xxxiii, 20, 21, 87, 89, 94, 135, 137, 140, 192
SSO Single Sign-On. xxxiii, 20, 28
SWRL Semantic Web Rule Language. 28, 115, 117, 192, 194
TAMI Transparent Accountable Datamining Initiative. xxxiii, 27, 83
TLS Transport Layer Security. xxxiii, 21
UGC User Generated Content. 33, 34, 96, 98–100, 133, 140, 143, 160, 163
URI Uniform Resource Identifier. 4, 5, 7, 8, 16, 18, 19, 21, 22, 64, 69, 73, 74, 76, 85,
87, 89, 96, 100, 104, 106, 108, 112, 122, 124, 137, 140, 141, 149, 152, 192, 196,
197
URL Uniform Resource Locator. 158, 197
W3C World Wide Web Consortium. x, xiii, 9, 21, 34, 35, 69, 73, 81, 133
WebDAV Web Distributed Authoring and Versioning. 35
WebID Web Identity and Discovery. 19, 21, 22, 85, 87, 90, 94, 96, 100, 102, 108,
111, 140, 141, 151, 152, 192
WoT Web-Of-Trust. 22, 46, 129
WWW World Wide Web. 3–6, 8, 16, 17, 19, 20, 26, 33, 34, 41, 89, 92, 96, 133, 154
XACML eXtensible Access Control Markup Language. xxxiii, 28, 40–42
XML Extensible Markup Language. 28, 33
Part I
Contextualisation
Chapter 1
Introduction
As information systems move towards the Internet, information is stored across many web servers, and more and more people use social networks to connect with others and share resources among themselves.
When Sir Tim Berners-Lee invented the web of hypertext, better known as the World Wide Web (WWW), one of the established rules was that hyperlinks were relationship anchors between hypertext documents written in Hyper Text Markup Language (HTML) [16].
As the web evolved, it came to support more than just hypertext documents and became able to host and hyperlink any other kind of resource. Yet, in contrast to the early web’s mass of publicly accessible hyperlinked resources, the current proliferation of different access control policies means that this mesh of hyperlinks does not extend to access-protected resources. This makes access-protected resources second-class citizens of the WWW when it comes to being hyperlinked or embedded in domains other than the one where they are hosted.
In fact, most of the privately owned and shared resources are only accessible within
the hosting domain and by users enrolled on that domain that have been given access
to the resource.
While the term “to publish” is intrinsically associated with “to disseminate to the
public”1 , sharing a resource is related to publishing it privately so that only certain
users should have access to it. When a resource is uploaded to a web domain [106, 107],
the author is in fact publishing the resource under certain and particular access policy
rules that only allow access to him/her2 or to a particular set of users. This is an act
of privately sharing the resource.
1 http://www.merriam-webster.com/dictionary/publish
2 While the examples and arguments used in this work apply equally to both genders, from this point forward, only the male pronouns are used.
Resource sharing is therefore a main issue on the web, which this work intends to
facilitate by giving users the chance to choose where newly created or published resources should be physically located, independently of where they are being referenced,
accessed and by whom.
1.1 History
The WWW is continuously growing in terms of information, users and services. The
demand for information and services is increased by this growth of users, which itself
translates into a constant rise of suppliers for both information and services.
In Web 1.0 users mostly consumed data made available by companies. The contribution of each user to the growing web of data was reduced and mostly limited to
developing and maintaining their own personal web page.
When a user wished to refer to a resource provided by others, he would create a hyperlink in his own homepage to that resource, independently of the domain where it was hosted.
Hyperlinking enriches the whole concept of the Web. Users knew exactly where their resources resided, as these could be dereferenced and reached by both humans and machines via a Uniform Resource Identifier (URI) in a hyperlinked fashion.
At that time, domain servers did not provide appropriate authorisation access to resources, hence published data on a web domain was mostly available free of restrictions
for public access. Besides, there were not too many tools that enabled online collaboration. Other non-web based technologies existed like chat rooms and newsgroups, but
none of these relied on hypertext. Online communication and collaboration between
users was still quite rudimentary.
As the WWW evolved to Web 2.0, many notable applications appeared and became
popular such as web forums (e.g. phpBB, vBulletin), followed by personal blogs (e.g.
Blogger3 , Wordpress4 ) and nowadays, Facebook5 and Twitter6 among others, became
important in human behavioural modelling.
Like in real life, internet-based relationships evolved and users started publishing information and interacting with each other. Emergent groups of users gave rise to
small communities of interest all over the WWW. This trend took a professional path
upon the emergence and consolidation of social networks intended for business and
professional purposes (e.g. LinkedIn7 ).
3 https://www.blogger.com/
4 https://wordpress.org/
5 https://www.facebook.com
6 https://twitter.com/
7 https://www.linkedin.com/
Users started to engage more with web applications, sharing resources among themselves and retaining predominant control over what they shared and how.
Conversely, many stopped maintaining their own personally-developed home pages, such that the concept rapidly faded out and was traded for the pages provided by Web 2.0 applications such as Facebook, Google+, Tumblr and Microsoft Live, to name a few. Using such platforms is one of the simplest ways of publishing information such as text, photographs, multimedia files and other types of content, even though it is mostly unstructured. At most, this information is categorised only by tags or keywords.
With the advent of Web 3.0, also known as the Semantic Web, the minimum granularity of a resource is reduced to a triple in the form of <subject, predicate, object>, where each part is a URI (except the object, which can also be a primitive value), thus forming a web of fine-grained interlinked resources. Resources are semantically described by ontologies, which are themselves sets of triples.
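The triple structure described above can be illustrated with a minimal Python sketch. The `Triple` type, the `is_uri` helper and the `example.org` URIs are hypothetical and chosen only for illustration; the FOAF namespace is the real one. A real system would typically rely on an RDF library rather than plain tuples.

```python
from typing import NamedTuple, Union


class Triple(NamedTuple):
    """One <subject, predicate, object> statement."""
    subject: str                 # always a URI
    predicate: str               # always a URI
    obj: Union[str, int, float]  # a URI or a primitive (literal) value


def is_uri(value: object) -> bool:
    """Crude check used only for this illustration."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))


# The FOAF namespace is real; the example.org URIs are made up for this sketch.
FOAF = "http://xmlns.com/foaf/0.1/"
john = "http://example.org/people/john#me"
mathew = "http://example.org/people/mathew#me"

graph = [
    Triple(john, FOAF + "knows", mathew),  # object is a URI
    Triple(john, FOAF + "name", "John"),   # object is a primitive value
]

# Triples whose object is itself a URI are the ones that interlink resources.
links = [t for t in graph if is_uri(t.obj)]
```

In this sketch only the first statement contributes a link between two resources; the second merely attaches a literal value to one of them.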
Large and interlinked repositories of triples emerged and continue to grow into the so-called Linked Data (LD). When these repositories are free of access authorisation restrictions they become Linked Open Data (LOD) repositories.
The emergent Web 4.0 will exploit this linked (open) data and respective semantics
to provide advanced intelligence-based services to the users [4].
1.2 Problem
The process of engendering data and sharing resources became much simpler with the
Web 2.0 but with a major trade-off, as most of the sharing actions are only perceived
inside closed domains that act as data silos.
These web applications are hosted inside domains that are typically identified by a
name8 [89] and each domain defines a realm of administrative autonomy, authority or
control on the WWW.
Most of these internet-based services require some sort of authentication and rely
on centralised systems forcing users to have a different identifier (account) for each
organisation or website the user interacts with.
Such identifiers can rarely be carried over from one domain to another, which sets web applications further apart from each other.
Resources protected by one system’s credentials are not accessible through another system’s credentials, which hinders browsing, searching, generating and sharing information resources between users and systems. These are referred to as resource and application islands. This is more or less innocuous when each user’s social network remains in the same domain, but quite the opposite when users register and share resources on multiple different domains across the WWW.
8 http://en.wikipedia.org/wiki/Domain_name
Figure 1.1: Cross-Domain Sharing [diagram: Web Domain 1 and Web Domain 2 within the World Wide Web; resources WebPage.html, publicResource.png and protectedResource.png; users Amelia, John and Mathew; relations “is registered on”, “is author of”, “is friend of”, “has access” and “has no access”]
This situation is illustrated in figure 1.1, where a user (i.e. John) can define access restrictions for a particular resource (i.e. protectedResource.png) and share it with other users (e.g. Mathew) that are registered in the same domain (i.e. Web Domain 2). Nevertheless, users that are not registered in the same domain (e.g. Amelia) cannot access the protected resource, and resources that reside elsewhere (e.g. WebPage.html) cannot render it.
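The scenario of figure 1.1 can be sketched as a small access check. This is only an illustrative model: the `Resource` type and the `can_access` function are hypothetical, and both Amelia’s registration on Web Domain 1 and the hosting domain of publicResource.png are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    domain: str                                # domain where the resource is hosted
    public: bool = False
    allowed: set = field(default_factory=set)  # users granted access by the author


# Registrations per user (Amelia's domain is an assumption for this sketch).
registered_on = {
    "John": {"Web Domain 2"},
    "Mathew": {"Web Domain 2"},
    "Amelia": {"Web Domain 1"},
}

protected = Resource("protectedResource.png", "Web Domain 2",
                     allowed={"John", "Mathew"})
# Hosting domain of the public resource is assumed, not taken from the figure.
public = Resource("publicResource.png", "Web Domain 2", public=True)


def can_access(user: str, resource: Resource) -> bool:
    """Siloed check: a grant only counts for users registered on the hosting domain."""
    if resource.public:
        return True
    same_domain = resource.domain in registered_on.get(user, set())
    return same_domain and user in resource.allowed
```

Under this model, adding Amelia to `protected.allowed` changes nothing for her, because the grant is only meaningful inside the hosting domain; this is the limitation the figure depicts.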
Searching and retrieving meaningful information from the “open web” is a knowledge
and time-intensive task that is aggravated by the cross-systems’ access restrictions
applied to shared resources. Further, while users become more dependent on big
platforms, they also lose some of the control over their resources’ hosting location and
become limited to the rules imposed by each of those domains.
The concept of an Internet made of inter-domain hyperlinks is increasingly falling into disuse as large companies strive to keep all resources within their domain premises (e.g. Facebook9 , Google, Apple). Hyperlinks are still widely used, but most of them reference publicly obtainable content (inside or outside a particular domain). When resource access policies need to be defined, they can only be defined for resources hosted, and users registered, in the same domain.
On existing popular web applications, there are typically three different methods for users to share resources among themselves. They can opt for:
Resource Domain Hosting Resources are created and hosted inside a particular
web application and sharing only takes place inside that domain.
Resource Dereferencing This method allows a resource to be used by other resources and web domains by simply referring to it via a hyperlink. Typically, the web application automatically generates a preview/snapshot (i.e. creates a copy) of the resource that is hosted in the other web domain.
Long, Pseudo-Undecipherable URIs This method uses URIs that are automatically generated by the domain where the resource is hosted. For security reasons,
these should not be publicly shared, i.e. in a publicly accessible web page, otherwise the protected content would be publicly exposed. Further, given enough
time, any sequential code generator would be able to imitate the automatically
generated URI for that particular resource thus allowing access to any user.
Hence, anyone who obtains the resource sharing URI can actually gain access to
it, meaning that the resource actually becomes public, since no authorisation access policies are defined over it and its reachability is only limited and hardened
by the difficulty of reproducing its URI.
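The third method can be sketched as follows, assuming a hypothetical host `files.example.org` and an in-memory lookup table standing in for the hosting domain. The URI embeds a random token and the server performs no identity check, so possession of the URI is the only condition for access. The token here comes from Python’s `secrets` module; the sketch does not model the predictable, sequentially generated URIs mentioned above.

```python
import secrets

# token -> resource name; stands in for the hosting domain's storage.
_shared = {}


def create_sharing_uri(resource_name: str) -> str:
    """Return a long, hard-to-guess URI for the resource (hypothetical host)."""
    token = secrets.token_urlsafe(32)
    _shared[token] = resource_name
    return "https://files.example.org/s/" + token


def fetch(uri: str):
    """Serve the resource to whoever presents the URI; no user identity is involved."""
    token = uri.rsplit("/", 1)[-1]
    return _shared.get(token)


uri = create_sharing_uri("protectedResource.png")
```

Note that `fetch` takes no user argument at all: once the URI leaks, for instance through a publicly accessible web page, there is no policy left to evaluate.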
1.3 Challenges
By analysing previous observations about cross-domain resource sharing, several limitations were identified, corresponding to challenges in the scope of this work.
1.3.1 Weak Cross-Domain Security
Different authentication and authorisation processes adopted by each domain realm prevent cross-domain sharing of resources based on the identity and role of the user accessing the resource. Instead, a weak security mechanism based on a pseudo-indecipherable resource URI is used.
Users should be able to share resources according to the identity of the user accessing
the resource, independently of the resource’s location and the origin of the request.
9 http://www.engadget.com/2015/03/24/facebook-hosted-news-content
1.3.2 Poor Cross-Domain Directory Service
Because it is difficult and insecure to share resources in a cross-domain manner, the
user tends to duplicate each resource in several domains. Every time a resource is
copied, another URI is generated to identify that resource. In the end, the user has
not just one, but two identical resources that are identifiable by two different URIs.
One of the problems in distributed and decentralised systems is the consistency of the information. In fact, resources’ content can evolve over time and, if they are duplicated, their consistency is hard to maintain. As a consequence, not only do vulnerability risks increase, but the coherence and consistency between copied resources also decrease.
Nowadays a user can only list web resources submitted for a given web domain through
mechanisms provided by each web domain, making it impossible to list all their resources regardless of the domain where they are hosted.
Avoiding resource duplication and multiple identification is one of the existing problems in the WWW, especially when dealing with shared resources. Users should be able to upload a certain piece of information once and refer to it in other places on multiple occasions.
1.3.3 Poor Sharing Mechanisms
Current resource sharing mechanisms provide limited expressivity for the specification
of access policies because they are typically based upon static user identity and roles,
which limits the establishment of more complex rules, for example:
• “share this photo with all my friends”;
• “share this document with my boss”;
• “share this photo with everyone at yesterday’s event”;
• “stop sharing project documents with someone that left the project’s team”.
Users should be able to define resource access restrictions based on expressive yet
abstract rules that support dynamic evaluation of roles played by the requesting user,
respecting the currently available information.
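The difference between a static user list and a dynamically evaluated rule can be sketched as follows. The `friends` data, the `friends_of` rule and the function names are hypothetical and only illustrate the idea of resolving a role at request time; they are not the policy language proposed in this work.

```python
# Social network as currently known; it may change after a policy is written.
friends = {"John": {"Mathew", "Amelia"}}


def friends_of(owner):
    """Build a rule equivalent to 'share with all my friends'."""
    def rule(requester):
        # Evaluated on every request against the current social network.
        return requester in friends.get(owner, set())
    return rule


# A static policy freezes the user list at definition time...
static_policy = set(friends["John"])
# ...whereas a rule-based policy keeps only the abstract rule.
rule_policy = friends_of("John")


def static_allows(requester):
    return requester in static_policy


def rule_allows(requester):
    return rule_policy(requester)
```

If Amelia is later removed from John’s friends, `rule_allows("Amelia")` follows the change without any further action from John, while the static list keeps granting access until he edits it by hand.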
1.4 Goal
Managing user resources and their (sharing) access policies in the Web is hard, tedious,
time consuming and error-prone.
This thesis aims at providing a solution that will help each user in the process of managing (keeping track of) and sharing resources in the Web in an efficient (i.e. with minimal user and resource intervention) and effective (i.e. being able to achieve expressivity and finer granularity sharing) way. The overarching goal of this thesis is therefore to investigate:
A seamless cross-web-domain infrastructure that provides secure, rich and
supportive resource managing and sharing processes.
1.5 Hypothesis
This thesis argues that such a mechanism would:
1. adopt the principles of the Web, namely distribution, decentralisation and the
use of (World Wide Web Consortium (W3C)) standards,
2. adopt a declarative and expressive policy language, that will
3. exploit users’ (i) profiles, (ii) social networks, (iii) generated content and (iv) actions upon their resources,
4. for defining, evaluating and managing users’ access policies to resources.
1.6 Research Questions and Proposals
This hypothesis raises several research questions and establishes multiple requirements:
1. How can a distributed, decentralised and standard-based mechanism
perform authentication and authorisation?
• The fundamental principles of distribution and decentralisation are commonly disregarded nowadays concerning identification, authentication and
authorisation;
• Web principles and W3C standards are to be considered in order to formulate
a supporting web-based infrastructure.
2. What and how can user-generated actions upon resources be captured?
• User’s actions upon resources happen on a local Personal Computer (PC)
environment as well as in a web domain;
• Some of those actions are e.g. creation, reading, classification, tagging,
downloading of resources;
• User’s actions are helpful in order to characterise (i) users individually;
(ii) their social relationships and (iii) their resources;
• Currently, user actions upon resources are not registered by the user or his system; they are often registered by the resource’s hosting application for its own operation, but are not shared between different domains;
• The to-be proposed web-based infrastructure must support and promote
the registration of the user’s actions in a distributed, decentralised and
cross-domain way.
3. How can a user share a resource with others, based on rules instead
of statically-defined discretionary access control?
• Despite the validity of (statically defined) Discretionary Access Control
(DAC) mechanisms, its use is time-consuming and restricted;
• The user’s profile, social network and actions upon resources are considered
the base information for determining user’s access policies to resources;
• The envisaged infrastructure must support and promote the adoption of
information-based rule-defined access policies, which will grant or deny access to resources based on the available information.
4. How is it possible to automate or ease the process of managing access policies to resources from a resource’s owner perspective and his
relationships?
• It should ease the process of sharing resources for the resource owner as
well as it should significantly improve access to those resources to users
that could potentially have some interest in a specific resource;
• In essence, this process is a recommender process that matches
a resource to the interests of other users, respecting the owner’s concerns,
such that the resource can be either previously known or unknown to the
potentially interested user;
• The infrastructure exploits the available information in order to match the
users’ and resources’ characteristics, respecting the owner’s concerns (captured through the established access policy rules).
1.7 Research Method
This work adopts the design-science paradigm [74] as the research methodology. Design research “addresses important unsolved problems in a unique or innovative way
or solved problems in more effective or efficient ways” [74] through an iterative and
incremental process comprehending two complementary phases:
• The construction/build phase, whose output is a set of design artefacts, such as
constructs (vocabulary and symbols), models (abstractions and representations),
methods (algorithms and practices) and instantiations (prototype system);
• The evaluation phase, which provides essential feedback to the construction
phase as to the quality of the design artefacts.
Further, this paradigm suggests seven research guidelines to assist both researchers
and readers (e.g. reviewers, editors). These guidelines were followed as described next.
Design as an Artefact The research efforts resulted in several artefacts:
• An architecture model capable of enabling cross-domain sharing of resources, regardless of where those are hosted (cf. chapter 4);
• A recommender system model, in complement to the previous system architecture model, in order to ease the access policies management;
• A system prototype that implements the architecture model and recommender system.
Problem Relevance The relevance of the problem is emphasised by identifying the state-of-the-art limitations through a use case scenario (cf. chapter 3), composed of different use cases, and by systematising the existing problems in those scenarios.
Design Evaluation To demonstrate the utility, quality and effectiveness of the produced artefacts, a prototype is developed (cf. chapter 6) and controlled experiments are performed (cf. chapter 7). The obtained results were evaluated qualitatively through standard analytical metrics. The experiments are conducted
throughout the project hence providing useful feedback that was adopted in the
following iterations of the design process, improving the outcomes.
Research Contributions Based on novelty, generality and significance of the designed artefacts, a clear and systematised identification of the contributions is
firstly introduced in this chapter (cf. section 1.8) and further explained in the
proposal (cf. part II).
Research Rigour The proposed artefacts are formally described in part II. The evaluation process makes use of functional testing and of some of the community’s evaluation datasets, and compares the outcomes with state-of-the-art results.
Design as Search Process Design is seen as a process searching for effective artefacts, which requires knowledge of both the application domain and the solution
domain. In that respect, a thorough analysis of the current web paradigms
respecting resource sharing and of application/operational technology is performed.
Communication of Research As proof of the pertinence and validity of the contributions, most were presented and published in international conferences and workshops and, therefore, verified by the respective research community.
1.8 Research Contributions
This work aims at proposing solutions for the challenges presented above, by giving
users the means for selecting where their resources should remain (independently of
where being used or hyperlinked) whilst managing their access control policies in a
single realm. A contribution summary is presented next:
• Distributed and decentralised system architecture for cross-domain hosting and
sharing of resources [19];
• Capture provenance and traceability information from user actions over resources
[22];
• Cross-domain sharing based on Friend Of A Friend + Secure Socket Layer
(FOAF+SSL) authorisation [22, 23];
• Definition of access policies based on semantic rules, social networks and resources information [19, 21];
• Recommendation of access control policies to users [19];
• Decentralised and distributed resource hosting and dereferencing [23];
• Development of a prototype and its deployment on legacy applications [22, 23].
1.9 Overview
This thesis is split into four parts: Contextualisation, Proposal, Evaluation and Bibliography and Appendix.
The first Part (Contextualisation) contains the following chapters:
Chapter 1 (Introduction) Describes the context of the problem, motivations, thesis statement and contributions by depicting where cross-domain resource sharing would be of increased value.
Chapter 2 (Background Knowledge) Depicts background knowledge required for
understanding some of the concepts that underlie this work namely: user’s
unique identity; access control policies; social networks; resource state; deep
knowledge; resource hosting; resource dereferencing and recommendation processes.
The second Part (Proposal) contains the following chapters:
Chapter 3 (Use Case Scenario) Presents a use case scenario that depicts a common, everyday Internet resource-sharing action, providing several use cases and a systematisation of the currently existing problems.
Chapter 4 (Architecture) Presents a web architecture capable of handling all the
parts necessary to address the systematised problems. It also correlates research
questions and architecture features.
Chapter 5 (Design) Provides a conceptual design for the proposed features and
contributions and correlates these with the research questions, use case scenario
and the systematised problems.
The third Part (Evaluation) contains the following chapters:
Chapter 6 (Prototype) Describes how the prototype was envisaged and implemented, describing the integration with existing legacy web applications.
Chapter 7 (Experiments) Describes how an existing dataset has been interpreted
and mapped to the prototyped infrastructure in order to simulate user interaction
with the system, evaluate the proposed architecture and recommended access
policies.
Chapter 8 (Conclusions & Future Work) Concludes and discusses achieved results, presents limitations and proposes future work.
The fourth Part (Bibliography and Appendix) contains the following chapters:
Bibliography Contains a list of books, scientific papers and web hyperlinks referred to
in the elaboration of this work.
Appendix A (Dataset Preparation) Contains some details about the dataset preparation and domain ontology.
Appendix B (Recommendation Evaluation Results) Contains a set of tables with
the results of the recommendation system’s evaluation.
Chapter 2
Background Knowledge
This chapter provides contextual and foundational knowledge for the remaining chapters. Based on the research questions presented in the previous chapter, two research
fields are addressed:
Authentication, Authorisation, Accountability (AAA), particularly:
Authentication, related to research question one, in the sense that
distributed authentication is a main pillar of the proposed work and unique
identification is still an open issue today;
Authorisation, related to research questions one and three, in the sense that a different approach for specifying access
policies is needed to deal with distributed authentication and with the way users
should specify access policies;
Accountability, related to research question two, in the sense that
it addresses the need to trace users' actions and the state of the information;
Recommender Systems, related to research question four, in the sense
that recommending access policies to users would aid them in the resource-sharing process.
These research fields are described and analysed in the next sections.
2.1 Authentication, Authorisation, Accountability
AAA deals with authentication of subjects, authorisation control and accounting of
resources based on access policies set by administrators and users of a system [41, 95,
159].
Subject is the entity (user or system) seeking or requesting authorisation
to access or use a resource [153]. When referring to human interaction,
subject is also known as user.
Resource “(...) can be anything that has identity”, according to [14, 15].
In the scope of the Internet, resources are identified through a URI [17] and
can therefore be referred to by other resources or entities.
AAA provides three main features to computer-based information, known as the C-I-A
Triad1 [36, 123, 124, 164]:
Confidentiality Resources are protected from unauthorised disclosure and access is
only allowed to permitted users.
Integrity Concerns trustworthiness and origin of resources, protects them from unauthorised modification and ensures that changes that may damage them can be
undone.
Availability Requires that resources are always available to users that have been
granted access to them.
The following subsections provide background on each of
the topics covered by AAA.
2.1.1 Identification
Identification of every resource on the WWW is obtained
through the URI concept [17].
Each user is represented and described by a resource and is therefore identified through
a URI. In fact, users are a specialisation of resources. Furthermore, URIs allow
establishing associations between pairs of resources and between resources and users.
User identity plays a key role in virtual communities. To understand and evaluate a
user’s interaction over a resource it is necessary to know the identity of who is involved.
In the physical world, there is an inherent unity to the self: the body provides
a compelling and convenient definition of identity, where the rule is: one body, one
identity [46]. Throughout a person's life, the body provides a stabilising anchor even
though it may be complex and in continual change. Sartre wrote “I am my body to
the extent that I am” [135].
1 http://www.techrepublic.com/blog/it-security/the-cia-triad/
While this holds in the physical world, the WWW is different, as people can
spread and publish information without any rule that preserves identity. In
virtual communities, identity can be ambiguous because a user can have many identities, which
is not possible in the physical world. According to [28, 60], an efficient virtual identity
formation defines an internal and social identity.
User Internal Identity According to [28], internal identity in a virtual world is
solely defined by the user, as in the real world. A person is aware of who
he is through society, culture and experiences [60]. This information is provided
and maintained by the individual and is essential in order to be compared with
and distinguished from others. In the WWW, users are typically represented or
described by user profiles containing their name and an associated e-mail. Further, user profiles reflect the user's preferences in relation to various subjects at a
particular time. Each term in a user profile is a characteristic of that particular
user [125]. This includes information directly requested from the
user and information implicitly acquired during their interaction on the web [34].
User Social Identity According to [28], social identity is built on how users interact
and relate to each other, obtained externally by other members of the virtual
world through feedback or established relationships between users. The social
web first appeared as a way to connect friends online and later as a way of
sharing information and publishing content for those in each community.
2.1.1.1 Social Network
A social network service is an application provided to create and manage communities of people. These services are responsible for the registration, authentication and
support for each person’s management of their network of contacts. Social network
processes manage and provide social relationship information about and between users
and communities. A social network is a system based on the notion of community.
A community is comprised of users and relates to other communities. A user may
belong to multiple communities and have an association with users in the same or
other communities.
Users are typically registered in many social network services with different purposes,
e.g. professional, friendship, hobbies, but minimal or no interaction is allowed between
different social network services. This is due to privacy restrictions imposed by those
services: they do not allow social networks to be exported to other social
environments, nor to be used by systems outside their own domains.
Communities of interest have been very successful in answering questions about technological problems, and have gained enthusiasts in many different areas (e.g. Wikipedia,
Del.Icio.us, StackOverflow). These communities allow the exchange of experiences
and problems between individuals with similar interests, who are considered knowledge
providers with different expertise. When not dealing with technical issues, users still
tend to use this kind of community (e.g. Facebook, Google+) just for leisure or
entertainment purposes [35].
Typically, on the Web 2.0, each online community has a team of managers that define its members, knowledge repositories and access policies. Every user can create,
update, remove and use available resources that are kept in local repositories as well
as grant access policies to other users. In doing so, each user is explicitly creating
social relationships with other users in that specific social network. Accordingly, in
this silo, each user has (almost) complete control over the knowledge he creates. Nevertheless, the user is also responsible for the knowledge he has specified and the resources
he has managed (i.e. non-repudiation).
Social relationships contain one of the missing links concerning the subjectivity and
social dimension in information management and recommender systems. As such,
user profiles can be used across different web applications to ensure user identity
independently of which service the user is using.
2.1.1.2 Friend Of A Friend
The appearance of Web 2.0 started a quest for information models capable of describing users' individual properties, social networks, preferences and any other properties
related to their profiles.
Figure 2.1: WebID Description [134]
With the introduction of the semantic web in Web 3.0, where Resource Description
Framework (RDF) [115] information exploits the URI concept to form facts as triples
(i.e. <subject, predicate, object>), a vocabulary for describing user properties in
RDF was created.
Friend Of A Friend (FOAF) [29] is a project devoted to linking and describing people,
agents, their relationships, groups and other information on the WWW in a non-proprietary,
machine-readable format, e.g. RDF. A FOAF profile
describes the user's profile information and inter-user relationships
using a well-established vocabulary instead of proprietary formats.
Web Identity and Discovery (WebID) URI is an identifier that refers to a
person or agent, e.g.:
http://www.w3.org/People/Berners-Lee/Card#i
WebID Profile is a document, typically an RDF graph that describes the
person.
WebID Profile URI is the identifier to the RDF document describing the
person identified by the WebID URI, e.g.:
http://www.w3.org/People/Berners-Lee/Card
As depicted in figure 2.1, the URI that acts as the user’s WebID denotes the URI that
provides the internal and social identity of the user. This description is typically given
by a FOAF profile with the same WebID URI but without the “#i”. The WebID Profile
allows establishing the user’s internal and social identity in virtual communities. This
enables cross-domain user identification (WebID) where each user’s web identification
is a URI that denotes the user’s WebID Profile.
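The relationship between a WebID URI and its WebID Profile URI can be illustrated with a minimal sketch. It assumes, as in the example above, that the profile document is obtained by dropping the fragment identifier; the function name `webid_profile_uri` is hypothetical and not part of any WebID specification or library.

```python
from urllib.parse import urldefrag


def webid_profile_uri(webid_uri: str) -> str:
    """Return the document URI obtained by dropping the fragment of a WebID URI.

    Assumption: the WebID uses a hash URI (e.g. ending in "#i"), so the
    profile document is the same URI without the fragment.
    """
    profile_uri, _fragment = urldefrag(webid_uri)
    return profile_uri


webid = "http://www.w3.org/People/Berners-Lee/Card#i"
print(webid_profile_uri(webid))
```

A URI without a fragment is returned unchanged, so the helper can also be applied to a URI that already denotes the profile document.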
FOAF does not compete with socially oriented web sites but provides information
about users in order to be used by those websites, thus ensuring that the users can
retain control over their profile information and relationships.
FOAF profiles can be used across different web applications to ensure the user's identity in a domain-independent way. Nevertheless, while FOAF does provide a single
user identity for each user across different web domains, in the beginning it neither
supported nor was supported by any kind of authentication method that provided user
accounting. FOAF profiles comply with the FOAF vocabulary2 in describing users'
profiles.
Some of the existing communities are already complemented with the concept of FOAF
as a way of promoting social relationships between members and exporting them from
community silos.
Multiple FOAF profiles give rise to a social network. In [91–93], the authors introduce
the concept for a multi-domain social network named FOAF-Realm, where social
2 http://xmlns.com/foaf/spec/
Figure 2.2: FOAF Multi Domain Social Networks [37]
networks are stratified in several layers. Each layer is composed of different levels of
social relationship, as shown in figure 2.2.
2.1.2 Authentication
Authentication3 is the process of identifying a user, ensuring that the user
is who he claims to be (identification) by means of presenting authentic
credentials (authentication).
There are different types of authentication techniques that rely on: what a user knows
(e.g. password, PIN, passphrase, lock combination, etc.), what a user has (e.g. USB token
device, smart card, citizenship card, passport, etc.) or what a user is (e.g. fingerprint,
retina pattern, signature, etc.).
Some authentication processes work at the OSI application layer (HyperText
Transfer Protocol (HTTP)), such as OpenID-Connect4, Single Sign-On (SSO)5 and proprietary web forms with basic access authentication6, i.e. a username/password combination. Others work at the OSI session and presentation layers, e.g. FOAF+SSL7.
Secure Sockets Layer (SSL) [57] is a widely used protocol on the WWW to provide
secure communications between clients and servers based on public key exchange.
3 http://www.webopedia.com/term/a/authentication.html
4 http://openid.net/connect/
5 http://en.wikipedia.org/wiki/Single_sign-on
6 http://en.wikipedia.org/wiki/Basic_access_authentication
7 https://blogs.oracle.com/bblfish/entry/foaf_ssl_creating_a_global
Nowadays, this protocol is mostly known as Transport Layer Security (TLS) in its
latest version [45].
SSL authentication consists of authenticating users by checking the contents of their
client certificates. A typical X.509 client certificate contains detailed identification
information about a user, a public key and the organisation that issued the certificate.
Providing single sign-on mechanisms and distributed authentication over multiple websites became an everyday issue, as different websites demand user authentication and
many use different authentication methods.
Efforts on using FOAF profile information combined with self-signed certificates are
being applied in projects such as FOAF+SSL authentication, presented and described
by the authors in [148]. Its aim is to reduce multiple-user accounting across different
web domains.
FOAF+SSL [148] is an authentication method that does not rely on a username and password combination, but instead uses a self-signed certificate and is based on a RESTful architecture. It was created to provide a cross-domain protocol
intended to start a global decentralised authentication mechanism built upon the
usage of FOAF profiles, one that could easily be adopted by web applications that support
the HTTP Secure (HTTPS) protocol [129]. Each associated certificate contains a reference to the FOAF profile and the FOAF profile contains the associated certificate’s
public key.
Together with SSL client authentication, by using certificates that are validated without a trusted Certificate Authority, it allows users to self-issue certificates and decentralises authentication. The use of a URI as the primary identifier for the user avoids
the per-domain boundaries of the username token of most centralised authentication
models.
This authentication method can be used through multiple web sites and social networks, acting in a similar way to OpenID-Connect. Nevertheless, OpenID-Connect
requires an external OpenID-Connect authentication provider, which FOAF+SSL does
not require because authentication is solely achieved between client and web domain
using Public Key Infrastructure (PKI).
Each user identification is a URI representing a user on the Internet, which is in fact
an identifier for any entity of type foaf:Person from the FOAF ontology. As a result, it
is possible to link users and their profiles in either a public or protected way. In the
work proposed by [150], by allowing delegation of access authorisation from a user to
a third party, it is shown how a web server or agent can act on behalf of its users.
FOAF+SSL evolved to the WebID W3C editor’s draft8 . This specification outlines
a distributed and openly extensible universal identification mechanism by combining
8 http://www.w3.org/2005/Incubator/webid/spec/identity/
Figure 2.3: FOAF+SSL Authentication Process [147]
asymmetric cryptography and LD, making use of the FOAF vocabulary.
FOAF+SSL provides a cross-domain protocol that intends to create a global decentralised authentication mechanism. It is an open and secure authentication protocol
that allows the construction of a distributed social network, built upon the usage of
FOAF profiles and based on the Web-Of-Trust (WoT) [63], that can easily be adopted by
web applications. It not only reduces the need for multiple user accounts on web applications,
but also provides a single user identity across different web domains.
Figure 2.3 depicts a typical FOAF+SSL authentication process. The
process starts with Romeo requesting Juliet's protected resource, which is served
over a secure connection (step 2).
When Juliet's server receives the secure connection request, it follows the standard HTTPS
protocol to validate Romeo's identity (step 3). To this end, Romeo's browser
asks him for his identity, which he provides by selecting an appropriate certificate.
The corresponding certificate is sent back to Juliet’s server, which validates that
Romeo’s browser is using the private key associated with the certificate’s public key.
Since the issued certificate also contains Romeo’s WebID URI embedded in the Subject Alternative Name section of the certificate, Juliet’s server retrieves Romeo’s FOAF
profile by using the corresponding WebID profile URI (step 4).
Juliet's server examines Romeo's FOAF profile by matching the public key specified in
his profile against the one in the obtained certificate, thus ensuring that Romeo is who he
claims to be (step 5). A user is authenticated once per connection, and that connection is
reused for subsequent requests.
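The key-matching check of step 5 can be sketched as follows. This is a simplified illustration, not the FOAF+SSL reference implementation: it assumes that the TLS handshake has already proven possession of the private key, that the keys are RSA, and that a caller-supplied function (here the hypothetical `fetch_profile_keys`) dereferences the WebID and returns the public keys published in the profile. All class and function names, and the example URIs, are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass(frozen=True)
class RsaPublicKey:
    modulus: int
    exponent: int


@dataclass
class ClientCertificate:
    webid_uri: str            # taken from the Subject Alternative Name
    public_key: RsaPublicKey  # public key carried by the certificate


def authenticate(
    cert: ClientCertificate,
    fetch_profile_keys: Callable[[str], Set[RsaPublicKey]],
) -> Optional[str]:
    """Return the WebID URI if the profile lists the certificate's key, else None."""
    profile_keys = fetch_profile_keys(cert.webid_uri)  # step 4: retrieve the profile
    if cert.public_key in profile_keys:                # step 5: compare the keys
        return cert.webid_uri
    return None


# Usage with a stubbed profile lookup (hypothetical data).
romeo_key = RsaPublicKey(modulus=3233, exponent=17)
romeo_cert = ClientCertificate("https://romeo.example/profile#me", romeo_key)
print(authenticate(romeo_cert, lambda uri: {romeo_key}))
```

Passing the profile lookup as a parameter keeps the sketch free of network access; a real server would fetch and parse the RDF document instead.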
Step 6 of the process represents a simple form of resource authorisation that only
shows Juliet's protected information to her friends or to friends of her friends. In this
example, step 6 checks whether Juliet or any of her friends knows Romeo. If
so, he is given access to the resource; otherwise only authentication is achieved.
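The friend or friend-of-a-friend check of step 6 can be sketched as a two-level lookup over "knows" relationships. The dictionary below stands in for the relationships that would be read from FOAF profiles; the function name and the sample data are assumptions made for illustration.

```python
from typing import Dict, Set


def is_friend_or_friend_of_friend(
    owner: str, requester: str, knows: Dict[str, Set[str]]
) -> bool:
    """Return True if the owner, or any of the owner's friends, knows the requester."""
    direct_friends = knows.get(owner, set())
    if requester in direct_friends:
        return True
    return any(requester in knows.get(friend, set()) for friend in direct_friends)


# Hypothetical relationships: Juliet knows the Nurse, and the Nurse knows Romeo.
knows = {
    "juliet": {"nurse"},
    "nurse": {"romeo"},
}
print(is_friend_or_friend_of_friend("juliet", "romeo", knows))
```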
FOAF+SSL provides a decentralised authentication protocol, but it is not enough to
provide complex authorisation rules over resources.
2.1.3 Authorisation
Authorisation [79, 88] consists of evaluating whether a subject/user, previously authenticated through an authentication system, should or should not have access to
a resource.
Authorisation control is a set of methods and components that allow only authorised
subjects to access controlled resources. Authorisation (or access) control can be approached from two different perspectives [143]:
Permissive (least secure) Policies This occurs when access to a resource is given
by default to any user unless otherwise stated.
Prohibitive (least privilege) Policies This occurs when users must be granted
permissions to access a resource.
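The difference between the two perspectives lies in the default decision. A minimal sketch, with hypothetical function names and user identifiers, is:

```python
from typing import Set


def permissive_decision(user: str, denied_users: Set[str]) -> bool:
    """Permissive policy: access is allowed unless the user is explicitly denied."""
    return user not in denied_users


def prohibitive_decision(user: str, granted_users: Set[str]) -> bool:
    """Prohibitive policy: access is denied unless the user is explicitly granted."""
    return user in granted_users


# A user that no rule mentions is treated differently by each perspective.
print(permissive_decision("carol", denied_users={"mallory"}))
print(prohibitive_decision("carol", granted_users={"alice"}))
```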
The following types of attributes can be considered in the definition of access policies,
when describing authorisation over resources [166]:
Subject Attributes Attributes associated with a subject that define the identity
and characteristics of the subject, e.g. identifier, name, job title, role, etc. A
subject's role can also be viewed as an attribute.
Resource Attributes Attributes associated with a resource, e.g. attributes
that can be extracted from the resource's meta-information, such as title, subject or
date.
Environment Attributes Attributes that describe the operational, technical, or situational environment or context in which the information access occurs, e.g.
date, time, space and location, etc.
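As an illustration of how the three attribute types can take part in a single access decision, the sketch below models an access request and one example policy. The attribute names (`department`, `time`) and the policy itself are assumptions chosen for the example, not taken from any particular standard.

```python
from dataclasses import dataclass
from datetime import time
from typing import Any, Dict


@dataclass
class AccessRequest:
    subject: Dict[str, Any]      # subject attributes, e.g. identifier, role
    resource: Dict[str, Any]     # resource attributes, e.g. title, date
    environment: Dict[str, Any]  # environment attributes, e.g. time, location


def office_hours_policy(request: AccessRequest) -> bool:
    """Allow access when subject and resource share a department, during office hours."""
    same_department = (
        request.subject.get("department") is not None
        and request.subject.get("department") == request.resource.get("department")
    )
    now = request.environment.get("time")
    in_office_hours = now is not None and time(9, 0) <= now <= time(18, 0)
    return same_department and in_office_hours


request = AccessRequest(
    subject={"id": "alice", "department": "research"},
    resource={"title": "report", "department": "research"},
    environment={"time": time(10, 30)},
)
print(office_hours_policy(request))
```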
2.1.3.1 Access Control Design
Access control design is the set of methods, principles and criteria that
drive the specification process of access policies.
Access policies [110, 159] are a set of rules to administer, manage and
control access to resources, which can be evaluated to determine if a subject
has access to a resource.
According to [99], access control designs generally fall under one (or sometimes a
combination) of two primary designs: Mandatory Access Control (MAC) and DAC.
Yet, according to [143] a third design (Non-Discretionary Access Control (nDAC)) is
also possible.
In MAC design, a system mechanism controls access to a resource and no individual
user can alter that access. This design has the following characteristics:
• it is a unified and mandatory manner of assigning security labels to each user
and resource in the system;
• access to resources is permitted whenever there is a match between user and
resource labels;
• users do not have much freedom to determine who can access their resources
because a system mechanism controls access to resources and individuals cannot
alter that access;
• it is mostly used in:
– common military data classifications (e.g. unclassified, sensitive but unclassified, confidential, secret, top secret);
– common commercial data classifications (e.g. public, sensitive, private, confidential);
• it is occasionally called rule-based access control;
• examples:
– security clearance of users and classification of data (as confidential, secret
or top secret) are used as security labels to define the level of trust;
– access to a room at certain times of the day.
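A label-based MAC decision can be sketched as below. The sketch uses the military classification levels listed above and assumes one common interpretation of a label "match": a user may read a resource when the user's clearance is at least as high as the resource's classification. Both this ordering rule and the function name are assumptions made for illustration.

```python
# Ordered from least to most sensitive, following the military example above.
LEVELS = [
    "unclassified",
    "sensitive but unclassified",
    "confidential",
    "secret",
    "top secret",
]


def mac_can_read(user_clearance: str, resource_classification: str) -> bool:
    """Allow reading when the user's clearance dominates the resource's label."""
    return LEVELS.index(user_clearance) >= LEVELS.index(resource_classification)


print(mac_can_read("secret", "confidential"))
print(mac_can_read("confidential", "top secret"))
```

Because the labels are assigned by the system, neither the user nor the resource owner can change the outcome of this check.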
In DAC design, resource owners determine who can access their resources. This design
has the following characteristics:
• access to resources is always defined by the resource owner, which is the sole
individual who can specify who can access a resource;
• access to resources is given based on the requesting user’s identity;
• it is the most common design in commercial systems;
• it is generally less secure than a MAC design, but easier to implement and more
flexible;
• examples:
– in the Microsoft Windows operating system, the owner of a file or directory
can grant or deny access to other users or groups of users;
– in UNIX file system permissions, users determine which groups have access through
the use of Access Control Lists (ACLs).
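The owner-driven nature of DAC can be sketched with a resource that carries its own ACL, which only the owner may modify. The class and method names are hypothetical.

```python
class DacResource:
    """A resource whose access control list is managed solely by its owner."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.acl = {owner}  # the owner always has access

    def grant(self, acting_user: str, target_user: str) -> None:
        if acting_user != self.owner:
            raise PermissionError("only the owner can change the ACL")
        self.acl.add(target_user)

    def revoke(self, acting_user: str, target_user: str) -> None:
        if acting_user != self.owner:
            raise PermissionError("only the owner can change the ACL")
        if target_user != self.owner:
            self.acl.discard(target_user)

    def can_access(self, user: str) -> bool:
        # Access is decided on the requesting user's identity.
        return user in self.acl


photo = DacResource(owner="juliet")
photo.grant("juliet", "romeo")
print(photo.can_access("romeo"))
```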
In nDAC design, access is determined by a role or task assigned to the subject. This design has
the following characteristics:
• roles or tasks are defined and resources are assigned to these roles or tasks;
• access to resources is permitted when a user is part of a role or task that is
allowed access to a resource;
• it is mostly used in environments where there is a high turnover of users;
• access control is decoupled from the user’s identity because access is not tied
directly to users;
• examples:
– user is given access to resources based on the job title;
– user is given access to labs depending on their clearance level in the organisation.
Systems can implement MAC, DAC and nDAC simultaneously, where DAC or
nDAC refers to one category of access controls that subjects can transfer among each
other, and MAC refers to a second category of access controls that imposes constraints
upon the first.
Based on this, many access control models were created. The most common models
are:
Identity-Based Access Control (IBAC) [166] Primarily follows a DAC design
where the permissions to access a resource are directly associated with a subject’s
identifier, i.e. a user name or identity.
Role Based Access Control (RBAC) [53, 54] Primarily follows an nDAC design,
which restricts access to resources based on roles. According to [48] it is meant
for the task of coordinating users, resources and permissions. It has been rendered obsolete with the dawn of modern web technologies because it does not
allow complex authorisation tasks. For example, when used with a MAC design,
unique roles have to be created for all combinations of security labels, which is
undesirable.
Attribute Based Access Control (ABAC) [48, 138, 155] Grants access based
on the requesting user's attributes and is primarily a DAC mechanism, which
allows access decisions to take the user's attributes into account. According to [48],
this model was created to be applied to web services and to overcome the problems
of RBAC. In ABAC, access control decisions are made based on attributes associated with relevant entities, a natural evolution from the RBAC approach.
Nevertheless, it cannot work with web applications that rely on the RBAC approach.
Attribute and Role-Based Access Control (ARBAC) [100] This is an attribute
and role-based access control model that combines RBAC and ABAC. While
not under the ARBAC name, there are other access control meta-models, which
enhance existing models by combining RBAC and ABAC as demonstrated by
the authors in [49].
N-leveled RBAC (Ln RBAC) [38] Is a multilevel variation of the RBAC model
in which RBAC at different levels provides different control granularity.
While not adequate for modern web technologies [48], RBAC is probably the most
used access control design on WWW applications at the moment. A domain diagram
of its entities is depicted in figure 2.4.
An RBAC system has five main entities: Users, Roles, Resources, Privileges
and Actions. Privileges are assigned to Actions that can be performed by a User over
Resources that are managed by the access control system. Privileges can be assigned
directly to Users but can also be assigned to Roles. In the latter case, Users that are
assigned to those Roles inherit the Privileges.
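The inheritance of privileges through roles can be sketched as follows. For brevity a Privilege is modelled here as an (action, resource) pair, which is a simplification of the five-entity model; the class and method names are hypothetical.

```python
from collections import defaultdict


class RbacSystem:
    """Minimal RBAC sketch: privileges held directly or inherited through roles."""

    def __init__(self) -> None:
        self.user_roles = defaultdict(set)       # user -> roles
        self.role_privileges = defaultdict(set)  # role -> {(action, resource)}
        self.user_privileges = defaultdict(set)  # user -> {(action, resource)}

    def assign_role(self, user: str, role: str) -> None:
        self.user_roles[user].add(role)

    def grant_to_role(self, role: str, action: str, resource: str) -> None:
        self.role_privileges[role].add((action, resource))

    def grant_to_user(self, user: str, action: str, resource: str) -> None:
        self.user_privileges[user].add((action, resource))

    def is_allowed(self, user: str, action: str, resource: str) -> bool:
        privilege = (action, resource)
        if privilege in self.user_privileges.get(user, set()):
            return True
        # Privileges inherited from every role assigned to the user.
        return any(
            privilege in self.role_privileges.get(role, set())
            for role in self.user_roles.get(user, set())
        )


rbac = RbacSystem()
rbac.grant_to_role("editor", "update", "/reports/annual")
rbac.assign_role("alice", "editor")
print(rbac.is_allowed("alice", "update", "/reports/annual"))
```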
According to [40], it is possible to combine RBAC with ABAC by assigning roles to users
based on user attributes, and thereby assigning the privileges related to these
roles. In this case the assignment of roles to users is automatic, whereas in a conventional
RBAC system, according to the specification presented in [53], it is done manually by
an administrator.
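Automatic, attribute-driven role assignment can be sketched as a set of rules evaluated over a user's attributes. The rules, role names and attribute names below are invented for the example and are not taken from [40].

```python
from typing import Any, Callable, Dict, Set

# Each role is paired with a predicate over the user's attributes (hypothetical rules).
ROLE_RULES: Dict[str, Callable[[Dict[str, Any]], bool]] = {
    "lecturer": lambda attrs: attrs.get("job_title") == "Professor",
    "lab_member": lambda attrs: attrs.get("department") == "Computer Science",
}


def assign_roles(user_attributes: Dict[str, Any]) -> Set[str]:
    """Return every role whose rule is satisfied by the user's attributes."""
    return {role for role, rule in ROLE_RULES.items() if rule(user_attributes)}


print(sorted(assign_roles({"job_title": "Professor", "department": "Computer Science"})))
```

The resulting roles can then be fed into a conventional RBAC check, so that privileges follow from attributes without manual intervention by an administrator.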
Figure 2.4: Access Control Domain Diagram (entities: Resource and User, each with URI and Attributes; Role; Privilege; Action; relations: owns, over, has)
There are a few other models in the area of cross-domain authorisation e.g. TAAC9 ,
Transparent Accountable Datamining Initiative (TAMI)10 [158] and Priv.ly [56].
The development of TAAC seems to have ceased. Information related to it can now
only be found through Internet web archives.
TAMI uses an approach based on OpenID-Connect authentication [163]. It is browser-dependent, forcing users to install Tabulator [18] or its plug-in in the Firefox browser.
Priv.ly is a product developed by the Privly Foundation that focuses on enforcing the privacy of users' content on the web. Its implementation is based on a client-side
approach and is only accessible to users that install a browser extension (available for
the Chrome, Firefox and Opera browsers). It is similar to the
previous approach, but limited to textual information and client-browser-dependent.
2.1.3.2 Languages & Frameworks
Depending on the adopted access control design, there are several policy languages
that can be applied to the specification of access policies:
KAoS Services Framework (KAoS) [152] Is a language that has been adopted
in specifying access control policies that conform to ontologies. It is one of the
first efforts for representing access policies using semantic web languages (i.e.
Web Ontology Language (OWL)). It allows positive and negative authorisations
expressed by policies that permit users to perform actions in certain defined
contexts. Access policies are represented using context restrictions instead of
conditional rules. Its policy and platform-independent services help in the policy
specification and enforcement for semantic web services in traditional distributed
systems.
9 http://www.pipian.com/blog/2008/12/12/taac-in-action/
10 http://dig.csail.mit.edu/TAMI/
REI [81] Is a language that provides constructs based on deontic concepts. It has
been adopted in specifying access control policies that conform to first order logic
and accepts RDF schemas. It includes notions of prohibitions and obligations.
REIN [82] Is a decentralised framework for representing and reasoning over distributed policies in the Semantic Web that accepts different policy ontologies
but requires the use of Notation3 (N3)11 rules. It provides an extensible and
distributed framework for representing and reasoning over policies.
Rule Markup Language (RuleML) [26] It is a markup language that explores
existing rule systems (e.g. Horn logics) suitable for the Semantic Web in order
to provide a standard for web knowledge representation. It provides a canonical
language family for representing web rules through Extensible Markup Language
(XML) serialisation and formal semantics.
Security Assertion Markup Language (SAML) [33] It is an XML-based, open-standard data format for exchanging authentication and authorisation
data between different entities. It addresses issues related to SSO and is mostly
used for web services.
Semantic Web Rule Language (SWRL) [75] Is a semantic web rule language
that uses a combination of OWL DL and OWL Lite sublanguages of OWL with
the Unary/Binary Datalog RuleML sublanguages of RuleML. It extends the
OWL axioms to include Horn-like rules in the OWL DL and OWL Lite sublanguages. Each rule is formed by an implication between an antecedent (i.e. body)
and consequent (i.e. head). Whenever the conditions specified in the antecedent
hold, the conditions specified in the consequent must also hold.
eXtensible Access Control Markup Language (XACML) [122] It is a standard
that defines a declarative access control policy language in XML and a processing model that describes how to evaluate access requests according to defined
policy rules. Its access policies are based on the attributes of the subject, the
resource and the environment, allowing an authorisation model based on an ABAC
approach.
Ontology is a formal, explicit specification of a shared conceptualisation
[149]. It is an artefact used to acquire and represent knowledge domain
descriptions i.e. conceptualisation of the domain according to a shared
vocabulary.
Ontologies are the common domain representation paradigm in a system. Accordingly,
unless otherwise explicitly stated, domain descriptions are embodied in ontologies and
ontologies represent domain descriptions.
11 http://www.w3.org/DesignIssues/N3Logic
Some of the presented languages do not work with OWL ontologies (but provide
different kinds of functionality, such as logic programming). Others are more suitable for
access control in web services.
2.1.3.3 Access Control Administration
Access control administration is related to how authorisation logic is specified and used in the authorisation method.
Access control is achieved using a centralised, decentralised or hybrid approach. A
centralised access control administration is characterised by:
• forwarding all requests through a central authority hub;
• providing simple control administration;
• suffering from a single point of failure.
A decentralised access control administration is characterised by:
• allowing resource access to be controlled in the domain where it is hosted rather
than centrally;
• having a more difficult control administration;
• avoiding the single point of failure;
• using security domains, which act like a sphere of trust, made by a collection of
users and resources with defined access rules, where a user must be included in
the domain in order to be trusted.
A hybrid access control administration approach combines properties of both models.
2.1.3.4 Resource State
In typical sharing scenarios, each resource can be classified into one of three states, according
to its accessibility level:
Private When the resource is only accessible by the resource author;
Protected When the resource can be accessed by the author and by at least one other user;
Public When the resource is publicly accessible by any user.
Figure 2.5: Resources Accessibility [State Diagram] (states: Private, Protected, Public; transitions: Share With Others, Make Public)
According to the state diagram shown in figure 2.5, a private resource automatically
changes from the private to protected state when any other user besides the author
has been granted access to it. Any private or protected resource can become publicly
accessible.
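The state changes described above can be sketched as a small transition table. The event names are taken from the transition labels of figure 2.5; treating any event not shown in the diagram as an error is an assumption of this sketch.

```python
from enum import Enum


class State(Enum):
    PRIVATE = "private"
    PROTECTED = "protected"
    PUBLIC = "public"


# (current state, event) -> next state, as described for figure 2.5.
TRANSITIONS = {
    (State.PRIVATE, "share_with_others"): State.PROTECTED,
    (State.PRIVATE, "make_public"): State.PUBLIC,
    (State.PROTECTED, "make_public"): State.PUBLIC,
}


def next_state(state: State, event: str) -> State:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value} on {event}") from None


print(next_state(State.PRIVATE, "share_with_others"))
```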
Once a protected resource is shared with other users, there is no point in revoking
those users' access to it, unless the resource changes
and the author does not wish those changes to be shared. Furthermore, once a
resource is shared with any user, there is a real threat that the resource can become
available to the public, because once it is shared, the resource owner is no longer the
sole user with access to it.
2.1.3.5 Building Trusted Social Network
When dealing with social networks and resource sharing, it is necessary to have access
policies based on the following parameters:
Who Represents the user (e.g. real user or agent) to be given access to a resource;
When Relates to the date/time period on which the user should have access to the
resource;
Social Network Represents the network layer in which who is situated (i.e. work,
friend, romantic) used to calculate the trust between who and the resource author.
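The three parameters can be combined in a single access policy record, as in the sketch below. It simplifies the Social Network parameter to a membership test (the requesting user must belong to the named network layer of the author) instead of computing a trust value; the class, field and layer names are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Set


@dataclass
class AccessPolicy:
    who: str               # user (or agent) the policy applies to
    valid_from: datetime   # start of the period in which access is allowed
    valid_until: datetime  # end of the period in which access is allowed
    network_layer: str     # e.g. "work", "friend", "romantic"

    def permits(self, user: str, at: datetime, user_layers: Set[str]) -> bool:
        """Check the Who, When and Social Network parameters in turn."""
        return (
            user == self.who
            and self.valid_from <= at <= self.valid_until
            and self.network_layer in user_layers
        )


policy = AccessPolicy(
    who="romeo",
    valid_from=datetime(2015, 6, 1),
    valid_until=datetime(2015, 6, 30),
    network_layer="friend",
)
print(policy.permits("romeo", datetime(2015, 6, 15), {"friend"}))
```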
In [94] the authors present a Distributed Identity Management System (D-FOAF)
that shows how social networks’ information can be utilised to provide access control.
Nevertheless, to maintain a social network, it is necessary for users to maintain and
grow their relationships. Consequently it is also important to mention how those
relationships grow in the system.
Three important ways for users to connect to others on a social network were identified
and described in [21]:
Self-Introduction This represents a direct invitation to other users as shown in
figure 2.6. By using self-introduction, a user (e.g. A) can invite another user
(e.g. C) to be part of his social network. The inviting user (i.e. A) can only
invite another user (i.e. C) if one of two conditions is met: the invited
user (i.e. C) allows invitations from users whose friendship
path to him has a distance less than a defined threshold (three degrees in
this case), as used for example in the LinkedIn professional social network; or the
invited user (i.e. C) publicly allows invitations, even if no path or relation exists
between the inviting and invited users (e.g. the Facebook social network uses
this type of approach). Furthermore, in this example, if another user (e.g. D)
specifies that he only allows himself to be reached by users with a friendship
path distance less than two, more distant users (e.g. A) are not able to see him or send
him an invitation. Users that have no relationship with any user on the social
network (e.g. E) might be invited by any other user if they accept public invitations.
Notice that the distance evaluation has been simplified by only measuring the
path distance with node counting. In a real application environment, the
distance between users is usually given by a more complex function. FOAF allows
this type of behaviour.
Figure 2.6: Social Relationships using Self-Introduction (users A, B, C, D and E; relations: friendOf, reachableBy)
Group-Association This happens when a user becomes part of a group of interest,
which subsequently allows direct invitation to other users in the same group
of interest. When users are part of, or associated with, a group of mutual interest as
shown in figure 2.7, typically all users (e.g. A, Z) in the group are reachable
by each other. In this scenario, there is no user node between them that might
allow the invitation, but a group node instead. Because of these associations of
users to a group, any user (e.g. A) is able to make a direct invitation to another
user (e.g. Z). The function that allows users to send invitations and be invited
is changed from the previous example such that it now must take into account
the groups to which users belong.
Figure 2.7: Social Relationships by Group-Association (users A and Z, each memberOf the group Ontology Engineering; relation: reachableBy)
System-Proposed This type of connection is a different form of introduction that
relies on the system where users are registered, as shown in figure 2.8. To gather information about the user, the system keeps records of user actions and runs
information-retrieval processes over entered information, uploaded documents,
bookmarks and any other information that might enrich the user profile. Consequently, users will be suggested to other users in the network whenever there is
a match between both users’ preferences, topics of interest, ways of thinking or
associations to groups of interest. System-proposed relationships are established
automatically but typically only occur between users of a particular domain or
application (e.g. ResearchGate or Mendeley) where relationships to other users
are based on the user’s implicit topics of interest or performed system actions.
While it is not common practice, system-proposed relationships might impose
restrictions, as in the previous examples, where users can define the path
length threshold.
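The invitation rules for self-introduction and group-association can be sketched as a single reachability function. This is only an illustrative sketch under stated assumptions: the friendOf edges below are hypothetical (the exact edges of the figures are not reproduced here), distance is simplified to the number of friendOf hops found by breadth-first search, and the names hop_distance and can_invite do not come from the original work.

```python
from collections import deque


def hop_distance(friends, start, goal):
    """Breadth-first search; returns the number of friendOf hops or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in friends.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def can_invite(inviter, invited, friends, groups, threshold, accepts_public):
    """True if `inviter` may send an invitation to `invited`."""
    # Public invitations: no path or relation is required.
    if invited in accepts_public:
        return True
    # Group-association: sharing any group makes users reachable.
    if groups.get(inviter, set()) & groups.get(invited, set()):
        return True
    # Self-introduction: the path must be shorter than the invited user's threshold.
    dist = hop_distance(friends, inviter, invited)
    limit = threshold.get(invited)
    return dist is not None and limit is not None and dist < limit


# Assumed example data (hypothetical edges and settings).
friends = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
groups = {"A": {"Ontology Engineering"}, "Z": {"Ontology Engineering"}}
threshold = {"C": 3, "D": 2}
accepts_public = {"E"}
```

With this data, A may invite C (two hops, below C's threshold of three), E (public invitations) and Z (shared group), but not D, whose threshold of two excludes users three hops away.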
2.1.4
Accountability
Accountability is the activity of monitoring who is using a system, what
resources are being used and when a resource is used by a user.
Accountability provides tools for system administrators to monitor a system, in order to identify suspicious activities and deter improper actions. Logging operations
allows tracing events back to the originating user, but this auditing process typically has
a negative effect on system performance and, for that reason, the data collected in logs is usually reduced. Furthermore, such a process is typically implemented in applications that only
act locally.
Figure 2.8: System-Proposed Social Relationships (user A, interest topics: Semantic Web, Knowledge Management, Ontology Engineering; user Y, interest topics: Knowledge Management; relations: match, reachableBy)
2.1.4.1
User Actions & Generated Content
User action is the act of performing an operation over a resource.
Common user actions over resources are, for example, accessing, scrolling, downloading, writing, annotating, copying, updating, deleting, sharing resources, etc.
User actions may or may not generate content. For example, uploading a resource
to the WWW generates a new resource that is hosted, while scrolling through a web
page does not typically generate content.
Content creation on the web can take the form of text, documents, photographs,
videos, audio tracks, or even structured data documents like hypertext webpages,
XML, RDF statements, etc. This is related to User Generated Content (UGC). Although the term UGC is commonly used, its definition is still contested. It is also
known as Consumer-Generated Media (CGM) [30]. According to [108],
UGC is defined as “... any form of content such as blogs, wikis, (...) digital
images, video, audio files, and other forms of media that (...) [is] (...)
created by users of an online system or service (...)”.
With Web 2.0, UGC increased as users started contributing large amounts of
information [30, 118]. In [30], the authors characterise several aspects related to UGC
and specify that some web domains only deal with specific kinds of content (e.g. Flickr,
YouTube, etc.) while others allow several types of content (e.g. Facebook, MySpace, etc.).
Several types of UGC are described in [67].
There are studies that describe the growth of UGC [161], its value and the emerging
business models that make use of it. Other research, described in [102], shows how UGC
influences a social network, demonstrating that users’ consumption habits influence
their social activity in a way that users “(...) tend to befriend people similar to
themselves”.
Symptomatically, each existing web domain or application is responsible for hosting
and keeping track of the resources it holds, independently of their type. The same
applies to user actions. Nevertheless, the format used for keeping those records is commonly
proprietary to the web application.
2.1.4.2
Resources Directory
Although it is difficult to verify, it is possible that in several cases, after a few years of
online interaction across different web applications and domains, some users
are not capable of remembering where they first uploaded their
family photographs, or when they used a different website to post their kids’ videos.
Similar reasoning can be generalised to all UGC.
Until one to two decades ago, as the Internet was rather small compared to its current
size (over three billion users12 and more than one billion websites13), people relied
on search engines (e.g. Netscape, Yahoo) to retrieve their UGC because those search
engines had the “responsibility” of indexing all the WWW content, especially because
most of that content was publicly available.
Nowadays, such search engines cannot be trusted to provide fine-grained results anymore. This is because indexing is performed upon public resources only, i.e. it does not
index personal resources or actions hidden behind web applications and proprietary
data silos that require user authentication.
2.1.4.3
Provenance
According to the W3C organisation14 ,
“(...) provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that
resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions
are a form of contextual metadata and can themselves become important
records with their own provenance.”
Therefore, provenance refers to the history of information, including its origin and
details of its creation, as well as major events that occurred throughout its life cycle.
It is the information that relates what happened to a resource to the user action that
forced the change to happen.
12 http://www.internetlivestats.com/internet-users/
13 http://www.internetlivestats.com/total-number-of-websites/
14 http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
Each user action that alters a resource’s content or meta-information can be captured and kept as provenance information, acting like a breadcrumb trail of the actions performed on that resource.
Data provenance has been a much-discussed issue in several different areas. For example, the authors in [141] identified several problems related to the creation and storage
of provenance information and propose a taxonomy of data provenance characteristics.
Nevertheless, only recently the topic has begun to be addressed in the Internet, for
common user actions on the Internet (e.g. blogging, reading, uploading, downloading
resources, etc.).
In [128] the authors defined a conceptual provenance data model called
the “7W”. This model is the combination of seven different elements: what,
when, where, how, who, which and why. This model describes the provenance of a
given resource with sufficient detail.
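As an illustration only, the seven elements can be pictured as the fields of a single provenance record. The sketch below is not the data model of [128]: the class name ProvenanceRecord, the field types and the per-field comments are assumptions about how each element might be read, and the example values are invented.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record with one field per element of the "7W" model."""

    what: str       # assumed: the event that affected the resource
    when: datetime  # assumed: the time at which the event occurred
    where: str      # assumed: the location (here a URL) of the event
    how: str        # assumed: the action that led to the event
    who: str        # assumed: the user or agent responsible
    which: str      # assumed: the tool or software involved
    why: str        # assumed: the reason for the event


# Invented example values.
record = ProvenanceRecord(
    what="photo uploaded",
    when=datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc),
    where="http://example.org/photos/1",
    how="HTTP upload",
    who="http://example.org/users/alice",
    which="web browser",
    why="share with family",
)
```

Making the record immutable (frozen=True) is a design choice of the sketch: a provenance entry describes a past event and should not change once captured.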
The authors in [71] address the creation of content, discuss provenance of content in
the web and propose a suitable provenance model. There are other provenance models
like the one presented in [70] where the authors envisage a quality-aware web of data
by proposing a new vocabulary to describe provenance of Web data as metadata. In
[119] a provenance ontology was developed based on the “7W” model, specifically for
DBpedia15 .
In the work presented by [160], the authors propose an approach to automatically
acquire provenance information for manual data processing. The basic idea behind
this approach is that all documents that have been read by a user can also have
contributed to a document saved later on. Generated provenance information is based
on a Web Distributed Authoring and Versioning (WebDAV) infrastructure mixed with
Apache Subversion proprietary repository for versioning.
The authors in [1] propose a scalable and yet efficient storage model by exploiting
structures of provenance logs and separating metadata from the generating process.
The authors in [97] address some issues by increasing the value of provenance information (by enhancing the W3C PROV pingback technique) and decreasing the cost
of publishing provenance information by using minimal coupling to the Prizms Linked
Data. Prizms is a platform that was introduced by the authors, which creates datasets
about the structural provenance of a host system since it is still too difficult to publish provenance according to LD principles and discover provenance information in the
LOD.
By reusing semantic web standards, provenance annotations can be stored according
to existing provenance ontologies. These allow provenance information to be represented and published as meta-information in the form of semantic annotations [20] and
stored on any available triple/quad store or even on LOD [25]
repositories.
15 http://wiki.dbpedia.org
If the provenance of a resource cannot be determined, it is more difficult to
place any trust in that resource as it is presented on the Internet.
Current web applications are capable of relating a user action to a resource at the
moment it happens, thus internally generating provenance information. Yet, most of
them do not publicise it, keeping that information restricted to the web application.
As a result, user actions over resources tend to stay permanently within the Internet
in proprietary formats and in a non-indexed way, leaving no trace of why and how they
happened, and preventing users from having open and full access to their actions and generated
content.
2.1.4.4
Traceability
Traceability [9, 85, 146] has been used as the ability to trace the influence of one software engineering artefact on another, providing better
verification and validation of customer requirements.
Traceability has been greatly debated when applied to the field of software engineering
as the existing research demonstrates.
In [64], the authors discuss the nature of the requirements traceability problem, provide an empirical study and introduce the distinction between the pre- and post-requirements specification.
A toolkit composed of an analyser tool, a dictionary builder and an exporter is presented in [7]. This toolkit provides a rapid interactive development of scenarios, which
supports traceability to provide a better-documented set of specifications.
In [27], the authors propose a model management framework in which traceability is
used as the mechanism to follow the transformations carried out over a model through
the various refinement steps.
The authors in [11] introduce a general proposal to traceability issued in the context of
Global Model Management, by providing a clear separation of concerns between traceability in the small (model weaving) and traceability in the large (mega-modelling).
The authors in [44] consider requirements management as one of the activities responsible for system failures and for that they present an automated generation of the
requirements traceability matrix.
Despite all these efforts, the usage of traceability applied to Internet resources has
yet to be studied. In this proposal, the concept of traceability is applied to user
actions and resources (e.g. downloading a resource) that do not promote a resource’s
content or meta-information change. User actions that in fact alter resources’ content
or meta-information can already make use of provenance information for that matter.
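The distinction drawn above, between actions that alter a resource (covered by provenance) and actions that do not (covered by traceability), can be sketched as a simple classification. The two action sets are hypothetical groupings of the user actions named earlier in this chapter, and the function name record_kind is an assumption made for illustration.

```python
# Hypothetical split of user actions: those that alter a resource's content
# or meta-information versus those that do not.
ALTERING_ACTIONS = {"write", "annotate", "update", "delete", "upload"}
NON_ALTERING_ACTIONS = {"access", "scroll", "download", "copy"}


def record_kind(action):
    """Return which kind of record an action would produce."""
    if action in ALTERING_ACTIONS:
        return "provenance"
    if action in NON_ALTERING_ACTIONS:
        return "traceability"
    # Unknown actions are rejected rather than silently classified.
    raise ValueError(f"unclassified action: {action}")
```

Under this sketch, downloading a resource yields a traceability record, while updating it yields a provenance record.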
2.1.4.5
Annotations
An annotation16 is a note or meta-information (e.g. a comment, explanation, presentational markup) added by way of explanation or commentary.
It can be embedded or attached to the original resource. When attached,
annotations typically refer to a specific part of the original data.
Some tools can help annotating e-documents (e.g. e-books, Portable Document Format
(PDF) documents) and web pages (e.g. Annozilla [113]), easing the specification and
maintenance of annotations along with the documents.
More recently, with the advent of the Semantic Web, machine processable information and semantic-based services are delivering, but also claiming, increasingly more
semantically rich information. Web services and agent-based systems are two of these
paradigms. Annotations represented according to formal notations and conceptualisations of knowledge domains (e.g. ontologies) are seen as one of the possibilities to
relate semantics with online databases, services and online rendering web documents.
Annotation refers to both the process and the object resulting from the process. The
process and the object are so closely interrelated that in most cases it makes no sense
to separate one definition from the other.
According to [50], the annotation object is the content represented in a formal language
and attached to the document. This definition agrees with that of [83, 84] when the
authors state that “annotations are viewed as statements made [...] about a Web
document”. Thus, annotation is understood as an independent document, yet existing
only in respect to the content of the document(s) it refers to.
Annotation is often related to the concept of meta-information and the processes
of indexing and information retrieval. Though metadata is traditionally associated
with the categorisation and indexing of documents, it is no longer, or never has been,
fundamental for these tasks. In fact, indexing and information retrieval engines exploit
much more complex elements of the document, including the content of the document
itself. Information extraction mechanisms (e.g. [39, 61, 131, 151]) and inference of
inter-document relations based on the analysis of the user’s browsing activities (e.g.
[121]) are some of the currently more used approaches.
When the annotation semantically enriches the content in a formal, machine-readable
way, it is referred to as semantic annotation. Semantic annotations together with
ontologies are envisaged as being capable of providing more information elements. This
new perspective on semantic annotation is referred to as “deep annotation”
in [68].
The intent of using both provenance and traceability information is to be able to
provide resources with annotations that enrich those resources.
16 http://www.merriam-webster.com/dictionary/annotation
2.1.5
Reference Architecture
In [153] the authors present a conceptual architecture for AAA in the Internet. While
this Internet Engineering Task Force (IETF) memo does not specify a standard of
any kind, it lays down the nomenclature and the architecture commonly found in
modern AAA systems. Furthermore, when included as part of a multi-domain decentralised AAA system, the conceptual architecture sets the stage for defining protocol
requirements between the engaged systems.
Figure 2.9: Reference Architecture [Components Diagram] (components: Identity Provider, Policy Administration Point, Policy Enforcement Point, Policy Retrieval Point, Policy Decision Point, Policy Information Point; interactions: uses, Request Authorisation, Retrieve/Store Access Policies, Retrieve Access Policies, Require Additional Information)
Commonly accepted names for the various entities involved in the architecture are:
Policy Enforcement Point (PEP) [122, 153, 159, 162], Policy Decision Point (PDP)
[122, 153, 159, 162], Policy Information Point (PIP) [122, 153], Policy Retrieval Point
(PRP) [114, 153], Policy Administration Point (PAP) [41, 122, 145]. Other existing
architectures use the concept of an IDentity Provider (IdP) that offers features such as
creating and maintaining users’ identities. A components diagram that depicts the
entities involved in the reference architectural model is presented in figure 2.9.
In [116] the author suggests a typical operation pattern for providing resource authorisation, which complies with the sequence diagram presented in figure 2.10.
In this operation pattern, the PEP is responsible for intercepting access requests sent
from the user to perform some type of action upon a resource. The PEP, on behalf of
the user, requests authorisation for accessing the resource. This request is forwarded to
the PDP, which is the entity that has the engine for evaluating access policies. It uses
the information provided by the PEP and the specified access policies to determine if
the user should be allowed or denied access to the resource.
The PDP uses the PRP and PIP to retrieve policies and attributes referenced in the
policies. The PAP is the system entity used for managing the access policies. For that
it uses the features of PRP to retrieve existing policies and store changes to those.
Figure 2.10: PEP Pattern [Sequence Diagram] (adapted from [116])
When the PDP finishes the evaluation of access policies, it returns an answer to the
PEP stating whether access has been granted or denied to the user. If access is granted,
the resource is retrieved from the hosting server.
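The operation pattern just described can be sketched in a few lines. This is a deliberately simplified illustration and not an implementation of XACML or of any cited framework: the class names, the attribute-equality policy format and the deny-by-default rule are all assumptions introduced here.

```python
class PolicyRetrievalPoint:
    """Hypothetical policy repository access."""

    def __init__(self):
        self._policies = []  # tuples: (attribute name, required value, resource)

    def store(self, policy):
        self._policies.append(policy)

    def retrieve(self, resource):
        return [p for p in self._policies if p[2] == resource]


class PolicyInformationPoint:
    """Hypothetical source of user attributes referenced by policies."""

    def __init__(self, attributes):
        self._attributes = attributes  # user -> {attribute name: value}

    def attribute(self, user, name):
        return self._attributes.get(user, {}).get(name)


class PolicyDecisionPoint:
    def __init__(self, prp, pip):
        self.prp, self.pip = prp, pip

    def decide(self, user, resource):
        # Permit if any policy for the resource matches a user attribute;
        # otherwise deny (an assumed deny-by-default rule).
        for name, value, _ in self.prp.retrieve(resource):
            if self.pip.attribute(user, name) == value:
                return "Permit"
        return "Deny"


class PolicyEnforcementPoint:
    def __init__(self, pdp, host):
        self.pdp, self.host = pdp, host  # host: resource name -> content

    def request(self, user, resource):
        # Intercept the request, ask the PDP, then enforce its decision.
        if self.pdp.decide(user, resource) == "Permit":
            return self.host[resource]
        raise PermissionError(f"{user} denied access to {resource}")


# Invented example: members of the "family" group may access photo.jpg.
prp = PolicyRetrievalPoint()
prp.store(("group", "family", "photo.jpg"))
pip = PolicyInformationPoint({"bob": {"group": "family"}, "eve": {"group": "work"}})
pep = PolicyEnforcementPoint(PolicyDecisionPoint(prp, pip), {"photo.jpg": b"..."})
```

In this sketch a PAP would simply call the store method of the PRP; it is omitted to keep the example short.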
The next sections provide a more detailed description of each entity of the architecture, addressing their overall operation in providing authentication and
authorisation.
2.1.5.1
IDentity Provider
According to the authors of SAML17 :
IdP is “a kind of service provider that creates, maintains, and manages
identity information for principals[18 ] and provides principal authentication
to other service providers within a federation, such as with web browser
profiles”.
Within the AAA architecture, the IdP has the responsibility of generating identity
for new users, by allowing new users to create a virtual identity for themselves, thus
generating the corresponding user’s credentials.
2.1.5.2
Policy Enforcement Point
According to the joint effort made by the IETF Policy Framework Working Group
and the Distributed Management Task Force (DMTF)/Common Information Model
(CIM) [153, 159, 162],
PEP is the entity of the AAA reference architecture where policy decisions
are enforced.
It performs access control and guarantees authorised access to resources, by making
decision requests to the PDP and enforcing the authorisation decision. It is also mentioned
in other access control architectures, e.g. XACML [122].
Typical PEP responsibilities consist of:
• providing a single, yet decentralised point of access where user access to a resource is enforced [116];
• transferring the request details to a PDP for evaluation and authorisation decision [116];
• enforcing the decisions of the PDP.
17 https://www.oasis-open.org/committees/download.php/21111/saml-glossary-2.0-os.html
18 While SAML provides a distinction between end user and principal, in the scope of this work,
principal is considered as a synonym of user.
2.1.5.3
Policy Decision Point
According to the joint effort made by the IETF Policy Framework Working Group
and the DMTF/CIM [153, 159, 162],
PDP is the entity of the AAA reference architecture that evaluates the
applicable policies and renders an authorisation decision.
This entity makes policy decisions for itself or for other network entities
that request such decisions. In particular, according to [122], the PDP is
used by the PEP for evaluating policies in order to determine if a particular
user is granted access or not to a particular resource, application or service when an access request is performed. Each evaluation engine depends
on the specified policy notation and may differ according to different access policy languages and frameworks e.g. REI, XACML, etc. (cf. section
2.1.3.2). The authors in [126] suggest extending XACML by introducing
a semantic inference engine and using attributes that specific services may
require.
2.1.5.4
Policy Administration Point
According to [122, 145],
PAP is the entity of the AAA reference architecture that enables users to
build and manage access policies over resources.
A PAP has the responsibility of administering or defining access policies over existing
resources. The level of granularity, expressivity or inference of access policies differs
according to the access policy language used. A PAP must adapt to the access policy
language in order to allow it to be used to its full extent.
2.1.5.5
Policy Information Point
According to [122, 153],
PIP is the entity of the AAA reference architecture that stores the information about users and resources against which access policies are evaluated.
Typical PIP responsibilities consist of:
• storing users and resources;
• responding to additional information requests.
While these are the typical responsibilities of a PIP, it can also store resources that
can be hosted or dereferenced in a WWW scope.
2.1.5.6
Policy Retrieval Point
According to [153],
PRP is the entity of an AAA reference architecture where policies are retrieved from a policy repository.
The typical PRP responsibility consists of providing access to the policy repository. While
there is some evidence of the PRP in [114, 153], over time, the responsibility of the
PRP in the AAA framework has moved to the PIP. In fact, for the time being, it
seems that PRP usage has become deprecated. An example of this is XACML [122],
which never refers to the PRP; its responsibilities are performed by the PIP.
2.2
Recommendation
In section 1.6 and particularly in research question number four, it was perceived that
in order to ease and automate the access policy management process, a recommender
process would be advantageous.
To reduce uncertainty and help coping with information overload when trying to choose
among various alternatives, people usually rely on suggestions given by others, which
can be given directly by recommendation texts, opinions of reviewers, books, newspapers, among others [137].
Recommendation is something that has become part of everyone’s daily lives. Users
are willing to follow others’ recommendations and to give back recommendations to
the community. When deciding which product to buy, users want to be able
to read opinions from other buyers [104] and tend to follow them as they are considered
experienced users [157].
Currently, recommendation is widely used in electronic commerce and other applications [2, 98, 112, 136]. In e-commerce web applications, trust is based on the feedback
of previous online interactions between members as shown by the authors in [130, 132].
From the Internet perspective, there are other areas in which recommendation is also relevant, such as the recommendation of resources on websites (e.g. Pinterest), documents
(e.g. Slideshare, Pocket) and users (e.g. LinkedIn, Facebook, Google+).
With the Internet’s continual evolution, recommender systems have also evolved.
While initially recommendation was only used in e-commerce websites for recommending similar or most bought items to users, nowadays the process of recommendation
has improved such that the recommendation of friendship and/or relationship between users of a social network has become a quite common task on typical social web
applications.
Every recommender system is typically based on two elements:
User/Item Actions Represents user actions upon items and may include a possible
rating;
Item Similarities Represents the associations between users or between items. Some
recommender systems provide algorithms to calculate item similarities during
the recommendation process, while others even allow the usage of external precomputed item similarities during the process.
The output of a recommender system is a scored list of items that are
recommended to a list of users. The maximum number of retrieved recommendations
is specified by the value of AT. For example, an AT value of five means that only
the five highest-scored recommendations are proposed.
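The two inputs and the AT cut-off can be illustrated with a minimal item-based sketch. The scoring rule (sum of rating times similarity over the items the user has acted upon) is an assumption chosen for simplicity, not the algorithm used in this work, and the function name recommend and the example data are invented.

```python
def recommend(actions, similarity, user, at):
    """Return up to `at` (item, score) pairs for `user`, best first."""
    rated = actions.get(user, {})
    scores = {}
    for item, rating in rated.items():
        for other, sim in similarity.get(item, {}).items():
            # Only items the user has not acted upon are candidates.
            if other not in rated:
                scores[other] = scores.get(other, 0.0) + rating * sim
    # Highest score first; ties are broken by item name for determinism.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:at]


# Invented example data: user/item actions with ratings, and
# pre-computed item similarities.
actions = {"alice": {"photo1": 5, "photo2": 3}}
similarity = {
    "photo1": {"photo3": 0.5, "photo4": 0.25},
    "photo2": {"photo3": 0.5, "photo5": 1.0},
}
```

With an AT value of two, only the two highest-scored candidates (photo3 and photo5) are returned for alice, while photo4 is cut off.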
This section provides an insight about recommender systems, recommendation techniques, dataset training models, similarity algorithms and measures applied to recommendation evaluation. Before that, a systematisation of the user’s consciousness
about resources is presented, which will be helpful to perceive the importance of the
recommendation process in the scope of this work.
2.2.1
Location Awareness vs. Knowledge Awareness
A user’s consciousness about something can be characterised according to two dimensions19 : perception of reality and reality of perception.
Applying such rationale to resources’ location and users’ knowledge awareness of those
(cf. figure 2.11), a particular resource can be classified as:
Known-Knowns These are resources whose existence and location are known by
the user e.g. a photograph is taken of a person, and the person knows about its
existence and its location.
Known-Unknowns These are resources a user recognises he knows nothing about
until he finds them e.g. a person finds a photo by chance on which he appears,
knows its location but was not aware of its existence.
Unknown-Knowns These are resources that the user does not know how to find,
but knows about their existence, e.g. a photo is taken of a person, the person
knows about its existence but does not know about its location. With time and
searching investment the person might get to its location.
Unknown-Unknowns These are resources whose existence the user is not even
aware of e.g. a photo is taken of a person but the person does not know its
existence or where to locate it. These types of resources would only come up in
searches related to the user if contextual information is used.
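The four classes above can be summarised as a mapping from two yes/no questions: whether the user knows that the resource exists and whether the user knows where it is located. The function below is a minimal sketch of that mapping, following the descriptions given in the list; its name and parameters are assumptions.

```python
def classify_awareness(knows_existence, knows_location):
    """Map the two awareness flags to the four classes described above."""
    if knows_existence and knows_location:
        return "Known-Known"
    if knows_location:   # location known, existence not
        return "Known-Unknown"
    if knows_existence:  # existence known, location not
        return "Unknown-Known"
    return "Unknown-Unknown"
```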
Figure 2.11: Resource Location vs. User Awareness (axes: Location Awareness and Knowledge Awareness, each from Low to High; quadrants: Known-Knowns, Known-Unknowns, Unknown-Knowns, Unknown-Unknowns)
This classification is helpful in the sense it emphasises the fact that the same existing
information is perceived differently by users. There are different reasons for these
different perceptions, including (i) access policy restrictions and (ii) information overload. Recommender systems are conceptually fit to help users perceive resources as
(useful) known-knowns:
Access Policies Restrictions Access policies prevent users from accessing resources that
would be of interest to them. The recommender system mediates between the owner
(who has the resource and can grant access to it) and the beneficiary (who is
interested in the resource). The recommender system will:
Recommend to the owner access policies that grant the user access permissions
upon the resource;
Recommend that the beneficiary request access permissions for a certain resource that is not accessible and that is known-unknown, unknown-known
or unknown-unknown to the reader20;
Information Overload “Information overload occurs when the amount of input to
a system exceeds its processing capacity” [144]. In this context, information
overload occurs because the owner is not able to match the large number of
his protected resources with the potentially large number of interested
readers. In that sense, the recommender system:
Recommends to the owner potentially interested users that
are not able to access the resources;
Provides recommendations to the beneficiary, who is overloaded by the quantity of users
that he would have to contact to request access to known-unknown, unknown-known or unknown-unknown resources.
19 http://en.wikipedia.org/wiki/There_are_known_knowns
20 Notice that the resource is protected/private and is not accessible by the reader but the recommender system may be allowed to advertise its existence (not its content) to the reader.
2.2.2
Recommendation Techniques
Recommender systems typically exploit product information and users’ profiles, and have recently started using social network analysis. Depending on the desired application,
different recommendation methods or techniques can be applied.
Some of these techniques have evolved from existing techniques in information-retrieval
systems with the main objective of retrieving useful or relevant information for the
user [2, 3].
According to [32], different recommendation approaches may apply different recommendation techniques.
Content-Based Content-based filtering techniques [73] automatically generate descriptions of the resource’s content and compare these descriptions with users’
interests in order to verify an item’s relevancy to the search. When adapted to
the semantic web, content-based filtering can use ontologies for the representation of users and resources in order to represent the classification according to
their relevance in the field. The filtering method proposed in [139] considers
the hierarchical distance or proximity between the concepts of user profile and
concepts in the resource profile by using a hierarchical ontology.
Collaborative-Based Collaborative recommendation exploits the information resulting from the exchange of experiences among people who share common interests. Hence, it is a technique that does not require understanding or recognising the content of resources [73]. An application of this technique to the Semantic Web is demonstrated in [140]. The experimental results of ontology-based collaborative recommendation [140] show a significant improvement in recommendation accuracy compared to standard collaborative filtering.
Context-Based Context-based recommendation is based on contextual information and is useful in information retrieval [80], although the decisions taken in most information retrieval systems are based only on the query and the document collection, and information about the context of the search is often ignored [6].
When related to web search, the context is considered as a set of topics potentially related to the search term [103]. Some semantic context-dependent
recommendation approaches use ontologies [101, 165]. In [101] the authors propose an ontology-based system that contains contextual information about the
recommendation process and items. The contextual information is processed
through heuristic rules applied to vector spaces, thus allowing the system to
dynamically place a given recommendation.
Trust-Based According to [142], people rely more on recommendations from people they trust (e.g. friends) than on online recommendations generated by anonymous people with similar characteristics. In [12], the authors propose a trust-based recommender system for the semantic web, based on ontologies and using the WoT to generate recommendations. The authors in [109] propose a two-dimensional trust model that is dynamically updated based on users' feedback.
Social Recommendation Social networks can be integrated in recommender systems in order to improve their behaviour. The accuracy of the recommendation can be improved using the information obtained from the user's social networks, thereby improving the understanding of user behaviour [72]. In [51], as a proof of concept, the authors developed a prototype recommender grounded in social science. In the systems presented in [51, 117], semantic web technologies and ontologies help with the representation of context and the interpretation of social data. As can be seen, several types and sources of information can be exploited as input for the recommendation process, including the content and semantics of the resource, the users' profiles and relationships, and the users' actions upon the resources.
Hybrid Filtering A hybrid filtering approach combines the strengths of the various filtering techniques involved, with the aim of creating a system that can better meet the needs of users. In [31], the author presents seven different types of hybrid recommendation approaches. In [5], two different approaches are presented for building hybrid content and collaborative recommender systems, whose purpose is to produce relevant recommendations while overcoming the cold-start issue for new items.
Cross-Domain Recommendation Cross-domain recommender systems must have generic user and item models in order to mediate user data across different systems and application domains. Using such information, a recommender system can use the user preferences collected in one domain to generate recommendations in a different one. Based on the information available in cross-domain linked data repositories, the authors in [52] present on-going research work on a generic knowledge-based description framework, built upon semantic networks, to provide cross-domain recommendations.
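The hierarchical distance mentioned for content-based filtering can be illustrated with a minimal sketch. It is not the method of [139]: the toy ontology, the concept names, the `proximity` formula and the averaging in `score` are all assumptions chosen only to show how proximity in a concept hierarchy could be turned into a relevance score.

```python
# Hypothetical toy ontology (assumption): each concept maps to its parent concept.
PARENT = {
    "Photograph": "Image",
    "Drawing": "Image",
    "Image": "Media",
    "Video": "Media",
    "Media": None,
}


def ancestors(concept):
    """Return the path from a concept up to the root, starting with the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def hierarchical_distance(a, b):
    """Number of edges between two concepts through their lowest common ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    for steps_a, concept in enumerate(path_a):
        if concept in path_b:
            return steps_a + path_b.index(concept)
    raise ValueError("concepts do not share a root")


def proximity(a, b):
    """Map a distance to (0, 1]; identical concepts have proximity 1 (assumed formula)."""
    return 1.0 / (1.0 + hierarchical_distance(a, b))


def score(user_concepts, resource_concepts):
    """Average, over the resource concepts, of the best proximity to any user concept."""
    best = [max(proximity(u, r) for u in user_concepts) for r in resource_concepts]
    return sum(best) / len(best)
```

Under these assumptions, a user interested in "Photograph" scores a resource annotated with "Drawing" higher than one annotated with "Video", because the former shares a closer ancestor in the hierarchy.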
Table 2.1: Recommender Engines Comparison
Name             Type         Accepts Semantic Data  Learning Curve  Documentation
C-IKNOW          Application  no                     medium          good
Easyrec          Application  no                     easy            good
Gusto            Framework    yes                    hard            medium
MyMediaLite      Framework    no                     hard            good
Mahout           Framework    no                     hard            good
OpenRecommender  Application  no                     medium          poor
2.2.3 Recommender Engines
Some of the existing and available recommendation engines are described below. A comparison matrix relating their implementation type, accepted data, learning curve and documentation is presented in Table 2.1.
The learning curve and documentation ratings were obtained by experimenting with each recommender engine and examining its existing material. The learning curve is rated from easy to hard, while documentation is rated from poor to good.
C-IKNOW [76] Is a semantic recommender system that integrates social network analysis and automated reasoning. Its recommendation process is based on geodesic distances, positive matches and profile similarity. It first returns the same scores for all users, based on the search keyword and the recommended resources. A selection stage then incorporates information about the relationship between each user and a potential recommendation in order to reach a final score.
Easyrec [10] This open-source recommender engine provides recommendations based on user actions. The default user actions are buy, rate and view, but more actions can be added. Recommender analysers periodically analyse this information in order to identify patterns from which recommendations are generated. Generated recommendations can be requested and accessed through Application Programming Interface (API) calls.
Gusto [13] Is a set of APIs that uses the Semsim semantic similarity [24], which measures the similarity between objects, e.g. the similarity between two users based on several properties. Those properties are obtained through collaborative or content filtering algorithms, and each user's data can be stored in semantic models or in a Jena RDF Store.
MyMediaLite [58] Is a lightweight, open-source, multi-purpose library of recommender system algorithms. It provides collaborative and content recommendation, rating prediction and item prediction from positive-only feedback. Data can be loaded from databases and from several graph repositories such as TinkerGraph21, Neo4j22, Orientdb23, Sparksee24, Rexster25 and Sail26.
Apache Mahout [8, 120] Is a Java library that has four main uses: recommendation, clustering, classification and frequent item set mining. Mahout's scalability for processing large datasets is essentially provided by clusters of computers using Hadoop27. Mahout provides several applications that are useful for developing a recommender system, and recommendations can be obtained by using collaborative filtering techniques over data loaded from a database or from formatted files. Mahout's collaborative filtering recommender supports predictions from either user-based or item-based approaches, which use various similarity and neighbourhood functions to calculate similarities between items or between users.
OpenRecommender [42] Is an open-source recommender engine that is capable of intelligently retrieving, sorting, ranking, filtering, aggregating and displaying data choices to users. This recommender engine is based on the Apache Mahout machine-learning library and can integrate data from multiple sources. Recommendations are obtained by collaborative or content filtering techniques applied to data loaded from a database or from formatted files.
2.2.4 Similarity Algorithms
Recommender systems based on collaborative filtering techniques use similarity measures between items or between users in order to recommend items to users.
Some similarity algorithms can be used with both item-based and user-based recommendation engines, while others can only be applied to one of the types. According to [77, 78], different algorithms are suited to different recommendation aims. Nevertheless, there are some general rules: (i) all algorithms perform better with more input data than with reduced datasets and (ii) more complex algorithms are not always the best solution.
In a recommendation process, two types of similarity can be obtained:
User Similarities These find connections between users that share the same tastes and calculate the similarity between users.
21 http://tinkerpop.incubator.apache.org/docs/3.0.0.M9-incubating/
22 http://neo4j.com
23 http://orientdb.com
24 http://sparsity-technologies.com
25 https://github.com/tinkerpop/rexster/wiki
26 https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation
27 http://hadoop.apache.org/
Item Similarities These first observe which types of items a user prefers and then find similar items, calculating the similarity between items. Item similarities are less likely to change in recommender systems and tend to converge.
The author in [120] describes several algorithms that can be used for generating similarity sets between users and between items:
Pearson Correlation This metric is based on the Pearson correlation of two series of values. It is mostly used for calculating user similarity and measures the tendency of two users' action values to move together. When the tendency is high, the correlation is close to one. When little relationship exists between the two users' sets of actions, the correlation is near zero. When one user's action values tend to be high where the other's are low, there is an opposing relationship and the correlation is near minus one.
Spearman Correlation This is a variant of the Pearson Correlation in which each user's action weights are first sorted in ascending order (from lowest to highest weight) and each weight is then replaced by its rank, which starts at one for the lowest weight and is incremented by one until the last item for that user. Afterwards, the Pearson Correlation is computed over the ranks. It preserves the ordering of the users' action weights, but loses the information carried by their actual values.
Euclidean Distance This measure is based on the Euclidean distance between two users, who are considered points in a space with as many dimensions as there are items. Identical preferences mean that the two users coincide in that space, and the similarity has a value of one. This value decreases, tending towards zero, as the distance between the users increases.
Tanimoto Coefficient This similarity measure discards the weights associated with users' actions and uses the Tanimoto coefficient. It represents the ratio between (i) the size of the intersection of the two users' item sets and (ii) the size of the union of both users' item sets. When the two users' sets of actions completely overlap, the similarity has a value of one. When there is no item in common, the value is zero.
Log-Likelihood This similarity measure is similar to the Tanimoto Coefficient, although it uses a more complex formula. It disregards the weights of users' actions and attempts to determine how strongly unlikely it is that two users have no resemblance in their actions. The higher the unlikeliness, the more similar they are.
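The measures described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [120] or of any library: the function names are hypothetical, users are assumed to be given as equally long lists of action weights over the same items (or as item sets for Tanimoto), ties in the Spearman ranking are broken by position, the Euclidean similarity uses the assumed form 1 / (1 + distance), and Log-Likelihood is omitted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long series of action weights."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: an undefined correlation is reported as zero
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Replace each weight by its 1-based rank in ascending order (ties by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Pearson correlation computed over the ranks instead of the raw weights."""
    return pearson(ranks(x), ranks(y))


def euclidean_similarity(x, y):
    """One for identical users, tending towards zero as the distance grows."""
    distance = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + distance)


def tanimoto(items_a, items_b):
    """Size of the intersection divided by the size of the union; weights are ignored."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For example, `spearman([1, 2, 10], [5, 6, 7])` yields a perfect correlation because only the ordering of the weights is compared, whereas `pearson` on the same input is affected by the outlying value.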
2.2.5 Relevant Actions and Training Model
A recommender system's accuracy or effectiveness is evaluated either by comparing its recommendations to a previously collected dataset (training set) or by having people classify the resulting recommendations. In the former case, the collected dataset is divided in two:
Relevant Actions dataset is the set of actions between users and items that are considered relevant for comparing against the recommended predictions. These actions are withheld from the recommender engine during an evaluation run because, if they were included as input, the recommender system would never predict them.
Training Model dataset is the set of all the other actions between users and items, i.e. those that are not considered relevant actions. These are the actions used as input in a recommender system evaluation and, optimally, based on these, the recommender system would predict for each user the items that are in the Relevant Actions dataset.
The union of the actions of both the Training Model and the Relevant Actions datasets constitutes the initial Dataset (cf. equation 2.1).
\[ \text{Dataset} = \text{Training Model} \cup \text{Relevant Items} \tag{2.1} \]
Both datasets are obtained with the following parameters:
• a relevance threshold on user action weights, which defines which actions are included in the relevant items: only those with a weight equal to or above the threshold are considered relevant;
• a top K value for user actions, which sets how many user actions (per user) are considered relevant, by taking the actions that pass the threshold and keeping only the top K.
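The two parameters above can be illustrated with a minimal sketch of the split. The function name `split_dataset`, the `(user, item, weight)` tuple layout and the sample values are assumptions made for illustration; each (user, item) pair is assumed to appear at most once, and ties at the top K boundary are resolved by input order.

```python
from collections import defaultdict


def split_dataset(actions, threshold, top_k):
    """Split (user, item, weight) actions into Training Model and Relevant Items."""
    by_user = defaultdict(list)
    for user, item, weight in actions:
        by_user[user].append((item, weight))

    training, relevant = [], []
    for user, user_actions in by_user.items():
        # Keep only the actions whose weight reaches the relevance threshold...
        candidates = [(item, weight) for item, weight in user_actions if weight >= threshold]
        # ...and, among those, only the top K by weight for this user.
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        top_items = {item for item, _ in candidates[:top_k]}
        for item, weight in user_actions:
            target = relevant if item in top_items else training
            target.append((user, item, weight))
    return training, relevant


# Hypothetical sample data: weights are ratings from 1 to 5.
actions = [
    ("john", "a", 5), ("john", "b", 4), ("john", "c", 2),
    ("mary", "a", 3), ("mary", "d", 5),
]
training, relevant = split_dataset(actions, threshold=4, top_k=1)
```

In this sketch every action ends up in exactly one of the two lists, so their union reconstitutes the initial dataset, as stated by equation 2.1.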
2.2.6 Evaluation
According to the authors in [65], in order to choose the best algorithm and configuration for a recommender system, it is important to use different metrics when evaluating the results. This matters both from a research perspective and from a practical one, in order to decide which configuration best matches the domain and the recommendation. Several metrics can be used to support this decision. Most metrics are based on the confusion matrix, depicted in Table 2.2.
Table 2.2: Confusion Matrix
                       Positive Cases (Item is correct)  Negative Cases (Item is not correct)
Recommended Items      True-Positive (tp)                False-Positive (fp)
Non-Recommended Items  False-Negative (fn)               True-Negative (tn)
2.2.6.1 Accuracy
The typical accuracy measure is given by equation 2.2. It is the ratio between the number of true positives plus true negatives and the total number of items, whether recommended or not and whether correct or not.
\[ \text{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn} \tag{2.2} \]
This measure should not be applied in evaluations, because it gives the same weight to true positives and to true negatives. In a worst-case scenario, if the number of true negatives is very high and not a single item is recommended, the system can show an accuracy near one hundred per cent without recommending any item.
2.2.6.2 Recall
Recall is a measure that shows the percentage of correct items that are recommended by the system. Based on the confusion matrix, recall is given by equation 2.3.
\[ \text{Recall} = \frac{tp}{tp + fn} \tag{2.3} \]
Another way of representing recall is based on the Training Model and Relevant Items
dataset as shown in equation 2.4.
\[ \text{Recall} = \frac{|\{\text{Relevant Items}\} \cap \{\text{Recommended Items}\}|}{|\{\text{Relevant Items}\}|} \tag{2.4} \]
When only collaborative filtering techniques are used, recall is mostly the sole measure adopted. It can nevertheless be used with any recommendation technique.
2.2.6.3 Precision
Precision represents the percentage of recommended items that are considered correct.
Based on the confusion matrix, precision is given by equation 2.5.
\[ \text{Precision} = \frac{tp}{tp + fp} \tag{2.5} \]
Another way of representing precision is based on the Training Model and Relevant
Items dataset as shown in equation 2.6.
\[ \text{Precision} = \frac{|\{\text{Relevant Items}\} \cap \{\text{Recommended Items}\}|}{|\{\text{Recommended Items}\}|} \tag{2.6} \]
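The set-based definitions of accuracy, recall and precision can be checked with a short sketch. The function names and the catalogue of one hundred hypothetical items are assumptions made for illustration; returning zero when a denominator is empty is a convention chosen here, not something the equations prescribe.

```python
def confusion_counts(recommended, relevant, all_items):
    """Return (tp, fp, fn, tn) for one user from sets of item identifiers."""
    recommended, relevant, all_items = set(recommended), set(relevant), set(all_items)
    tp = len(recommended & relevant)              # recommended and correct
    fp = len(recommended - relevant)              # recommended but not correct
    fn = len(relevant - recommended)              # correct but not recommended
    tn = len(all_items - recommended - relevant)  # neither recommended nor correct
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical catalogue of 100 items, of which two are relevant for the user.
items = {f"r{i}" for i in range(100)}
relevant = {"r0", "r1"}

# A system that recommends nothing still reaches a high accuracy.
tp, fp, fn, tn = confusion_counts(set(), relevant, items)
print(accuracy(tp, fp, fn, tn), recall(tp, fn), precision(tp, fp))
```

With an empty recommendation list the counts are tp = 0, fp = 0, fn = 2 and tn = 98, so accuracy is 98/100 while recall is zero, which is the worst-case scenario described for the accuracy measure.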
According to the authors in [154], measuring precision for collaborative filtering techniques is quite relative, because the absence of a user's preference can have two different meanings. Let us assume U users, R resources and T topics. The preference of a user for a topic is given by $p_{ut} \in \{0, 1\}$, which denotes whether a user u has included topic t in their preferences list, as shown in equations 2.7-2.10.
\[ U = \{u_1, \ldots, u_f\} \tag{2.7} \]
\[ R = \{r_1, \ldots, r_g\} \tag{2.8} \]
\[ T = \{t_1, \ldots, t_h\} \tag{2.9} \]
When a user u does not have a topic t on their preferences list, i.e. $p_{ut} = 0$, this can mean two things (cf. equation 2.10):
• the user u has not included the topic t in their preferences list because the user u is not interested in the topic t;
• the user u does not even know about topic t.
\[ p_{ut} = \begin{cases} 1, & \text{if topic } t \text{ is in user } u\text{'s preferences} \\ 0, & \text{otherwise} \end{cases} \tag{2.10} \]
2.2.6.4 F Measure
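The F measure is a weighted harmonic mean of precision (P) and recall (R), which is a conservative average. It expresses the trade-off between precision and recall, as shown in equation 2.11, where $\alpha \in [0, 1]$ is the weight given to precision and $\beta^2 = (1 - \alpha)/\alpha$.
\[ F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \tag{2.11} \]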
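The equivalence of the two forms of equation 2.11 can be verified with a short derivation. It assumes the usual relation between the two weights, $\beta^2 = (1 - \alpha)/\alpha$, which is equivalent to $\alpha = 1/(\beta^2 + 1)$; the numeric values of P and R in the last line are hypothetical.

```latex
\begin{align*}
\alpha = \frac{1}{\beta^2 + 1}
  \quad &\Rightarrow \quad 1 - \alpha = \frac{\beta^2}{\beta^2 + 1} \\
F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}
  &= \frac{1}{\frac{1}{(\beta^2 + 1) P} + \frac{\beta^2}{(\beta^2 + 1) R}}
   = \frac{1}{\frac{R + \beta^2 P}{(\beta^2 + 1) P R}}
   = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \\
\beta = 1,\ P = 0.5,\ R = 0.25
  \quad &\Rightarrow \quad F = \frac{2 \times 0.5 \times 0.25}{0.5 + 0.25} = \frac{0.25}{0.75} \approx 0.33
\end{align*}
```

The worked value lies closer to the lower of the two inputs than their arithmetic mean (0.375) does, which is the conservative behaviour expected from a harmonic mean.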
Conveniently, the F1 measure is the balanced case, in which the same weight is given to precision and recall, as shown in equation 2.12.
\[ F_1 = \frac{2 P R}{P + R}, \quad \beta = 1 \tag{2.12} \]
2.3 Summary
This section describes some of the most important concepts used in this work.
Firstly, it details existing AAA frameworks, their components and interactions. Typical decentralised and distributed AAA architectures provide access policies’ decoupling
from resources and applications as well as independent management of access policies.
Secondly, this background knowledge provides insights about recommender systems.
It describes their different properties, inputs, algorithms and recommendation techniques. Some of the measures applied to recommender systems’ evaluation are also
described.
These two different subjects are combined in this proposal in order to ease the resource-sharing process, by recommending access policies for unattainable resources (i.e. those blocked by access policies) and by easing the users' task of sharing resources (which is hampered by information overload and the Internet's continual growth). This is described in the next part of this work.
Part II Proposal
Chapter 3 Use Case Scenario
In order to better comprehend the proposal’s effectiveness for an everyday Internet
experience, this chapter details a use case scenario depicting a typical resource-sharing
action. The common and basic tasks presented in this chapter correspond to most of
today’s volume of generated content on the Internet1 (cf. figure 3.1), thus reinforcing
its meaningfulness.
A characterisation of the several dimensions related to the use case scenario is presented in the next section. Afterwards, the Use Cases section presents several use
cases and relates the dimensions with the existing problems.
3.1 Characterisation
This use case scenario demonstrates the lack of standards and people’s difficulties while
using traditional resource-sharing mechanisms. This scenario relates the experience
of several different people that get together at a social event. People at the event are
somehow related, and most of them use the Internet and have virtual identities.
During the event, photographs, videos or audio recordings are produced. During or
after the social event, people tend to share those captured moments with each other
through some kind of cloud storage. This results in an upload action of resources to
the Internet.
Current web domains do not provide users with sufficiently expressive sharing mechanisms that allow finer-grained access to resources. It therefore becomes difficult for people to achieve the following intended use cases, especially from a cross-domain perspective and while preserving their privacy.
1 http://aci.info/2014/07/12/the-data-explosion-in-2014-minute-by-minute-infographic/, http://www.statista.com/statistics/195140/
Figure 3.1: Uploaded Photographs and Sharing Statistics [55]
3.1.1 Social Event
The use case scenario illustrates a traditional family event, specifically a birthday party. Events like these are routine in everyone's lives and can take place at various locations, for example in a house, restaurant, park or playground.
This particular event occurs in an open-air public park where multiple activities (e.g. jogging, running, cycling, soccer games) can be performed by those on the premises.
3.1.2 People Social Relationships
As in typical family events, the guests are mainly family relatives and some close friends. Since the birthday person is a child, some of the child's school friends and their parents are also attending the event. In order to provide a better understanding of the people involved, figure 3.2 illustrates an example of people in the park.
Because the event takes place in a public park, there are other people unrelated to this event (i.e. bystanders or passers-by) who are taking part in other events or practising other kinds of activities in the same park.
Figure 3.2: Use Case Scenario: Users [figure: groups Family, Family Friends and Passers-By; people John, Mary, Mathew, Joseph, Amelia, Unknown 1 and Unknown 2]
Figure 3.3 shows different relationship layers for some of the social relationships between people at the event. Several relationship layers are identified:
Family Layer This layer comprises family members, where several levels of relationship can be defined, e.g. mother, father, daughter, son, aunt, uncle, etc.
Friendship Layer This layer comprises friendship relationship levels, e.g. acquaintance, friend, best friend, best friend forever, etc.
Work Layer This layer comprises work relationship levels, e.g. co-worker, boss, employee, etc.
3.1.3 Content Generation
When performing certain actions, users generate content. For example, social event organisers and guests generate content by taking photographs or making videos of each other and, occasionally, of the place and surroundings where the event is being held. Such content is captured either by dedicated photography or video equipment, or by smartphones, tablets or other devices that users ordinarily carry with them.
When these events are held in public places, other people (i.e. people who might not be related to this particular event) can also generate content of their own, be it photographs, videos or even records of their physical performance while practising sports. As a result, passers-by might appear in some of the content generated by the event guests, and the opposite is also possible.
Figure 3.3: Use Case Scenario: Users' Social Relationships [figure: a Family Layer with "Married to" and "Son of" relations among John, Mary and Joseph, and no existing relationship for Unknown 1 and Unknown 2; a Friendship Layer with "Friend of" relations among Mary, Amelia, John and Mathew]
This use case scenario focuses mainly on these types of generated content i.e. photographs, videos and records of physical performance. An example of different content
produced during this event at the public park is represented in figure 3.4, along with
the person that generated it.
3.1.4 Contextual Information
Devices used for taking photographs or making videos are capable of providing additional information regarding the resource itself and sometimes the user. For example, smartphones, smart-watches and other devices have several types of sensors embedded in them (e.g. geospatial location, environment temperature, humidity, body temperature, blood pressure, heart rate), capable of providing information that serves as contextual information for the generated resources. While the aperture and shutter speed are properties related only to the photograph being generated, the geospatial location where a photograph is captured is related not only to that photograph, but also denotes the user's location while taking it.
Each type of resource keeps different types of meta-information. While some of this information is kept embedded in some generated resources (e.g. location and creation date in photographs and videos), other information about the environment surrounding the capture moment is lost forever for most resource types (e.g. environment temperature, atmospheric pressure, humidity, blood pressure, heart rate), and that person's generated content loses part of its context forever. This is related to the problem of Contextual Information Loss (cf. section 3.3.5). Nevertheless, part of the contextual information that is lost for some resources can sometimes be kept in, and retrieved from, other types of resources created at the same time, e.g. the records of physical activities of passers-by.
Figure 3.4: Use Case Scenario: Resources & Authorship [figure: "Author of" relations linking Mary, John, Unknown 1 and Unknown 2 to Photographs A to G, Video H and Race Log I]
Figure 3.5 shows a subset of information that is typically embedded in resources.
Let us assume that some of the photographic equipment used is capable of recognising, identifying and tagging each person appearing in a photograph, thus annotating that resource with the person's name, as shown in figure 3.6. Each photograph also denotes when there are faces in the photo that were recognised but not identified.
For the characterisation of the resources presented in this use case scenario, each
photograph is produced with the following embedded meta-information: technical
details about the photo itself; the date and time it was taken; the location where it
was taken; identification and tagging information.
Figure 3.5: Use Case Scenario: Resources Information [figure: attributes (Number of Recognised People, GPS Location, Frames Per Second, Tags of Identified People, Resolution, Creation Date, Ambient Temperature, Atmospheric Pressure, Heart Rate, GPS Waypoints) associated with the resource types Photographs, Videos and Performance Logs]
Figure 3.6: Use Case Scenario: Annotated Photographs [figure: "inPicture" relations linking John, Joseph, Mathew, Amelia, Mary, Unknown 1 and Unknown 2 to Photographs A to G]
Figure 3.7: Use Case Scenario: Users Web Domain Registration [figure: "Is registered on" relations linking Amelia, Joseph, John, Mary, Mathew, Unknown 1 and Unknown 2 to Web Domains 1 to 5]
3.1.5 Virtual Identity
Different people use different web applications, even when those web applications serve the same purpose. This is because each person has different characteristics, tastes and personal preferences regarding the plethora of existing solutions on the Internet. As depicted in figure 3.7, not all people at the social event act the same way, and they are registered on different applications/domains.
Sharing the photographs on multiple domains forces the author to know each person's identity on each of those domains. This is related to the problem of User's Multiple Identity (cf. section 3.3.1). When sharing content with other users, this poses further problems that are related to Cross-Domain Authentication and Authorisation (cf. sections 3.3.3, 3.3.4).
3.1.6 Content Hosting Domains
According to the specificity of the content being generated (i.e. photographs, videos, physical performance), some web domains are more suitable than others to host certain resource types, e.g. Flickr is more suitable for hosting photographs than Facebook. Even for the same resource type, a multitude of web domains exist that are capable of hosting it.
According to personal preferences and cultural or social influence, people tend to host their content on the domain they prefer and feel is most suitable for that particular type of content. This intensifies the problems related to Cross-Domain Authentication and Authorisation (cf. sections 3.3.3, 3.3.4).
3.1.7 Access Policies
Each resource-hosting web domain provides its own types of sharing options and processes, which are not interchangeable with those of other domains. This means that when people use two web domains, one for hosting photographs and another for videos, they will most likely have to deal with two different types of sharing mechanisms, resulting in different sharing options. This problem is referred to as Cross-Domain Compliance of the Access Policy Language (cf. section 3.3.9).
Moreover, most probably none of those domains allows a secure method for sharing hosted resources with users who are not registered on that domain. This is referred to as the problem of Insecure Sharing (cf. section 3.3.14).
3.2 Use Cases
Commonly, the resource-sharing process does not come with a “one size fits all” option. While some web domains allow grouping users and sharing resources with those groups, most of the time sharing is achieved by individually specifying which user should have access to which resource. In both cases, though, there are difficulties when applying those sharing rules to multiple domains.
The following use cases describe several situations that occur when people have to share resources with others.
3.2.1 Content Sharing on a Single Domain
Let us assume that a person uploads the event photos to a particular web domain that
only allows sharing the resources with users registered on the same domain.
Participants of the social event can be registered on multiple domains and can have different identities on each domain. This problem is referred to as User Multiple Identity (cf. section 3.3.1).
Consequently, this poses the problem of how people are able to access resources if they are not registered on the particular domain. This is referred to as Cross-Domain Authentication (cf. section 3.3.3).
For unregistered users, sharing is sometimes possible by assigning the content a very hard-to-guess URI and passing it along to those users. This is related to the problem of Insecure Sharing (cf. section 3.3.14).
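The hard-to-guess URI approach can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `key` query parameter and the `https://photos.example` address are hypothetical, and only the standard-library `secrets` module is used to generate and compare the token.

```python
import secrets


def make_share_link(base_url, resource_id, nbytes=32):
    """Return a link carrying a random, URL-safe token, together with the token itself."""
    token = secrets.token_urlsafe(nbytes)
    return f"{base_url}/{resource_id}?key={token}", token


def can_access(presented_token, stored_token):
    """Grant access to whoever presents the stored token (constant-time comparison)."""
    return secrets.compare_digest(presented_token, stored_token)


# Hypothetical domain and resource identifier.
link, token = make_share_link("https://photos.example", "photograph-a")
```

The sketch also shows why this falls under Insecure Sharing: `can_access` never checks who is presenting the token, so anyone who obtains or forwards the link gains the same access as the intended recipient.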
3.2.2 Content Sharing on Multiple Domains
Let us assume that a person has taken several photos and created multiple videos of the event. When a person takes many photographs and wishes to upload them to a particular web domain, he has to manually upload those resources to that particular, centralised domain.
When the type of the resource changes (e.g. to video) and another web domain is more suitable to host that type, the person has to upload those resources again, this time to the other domain. This problem is referred to as Cross-Domain Resource Hosting (cf. section 3.3.10).
Furthermore, because a person has part of his resources (i.e. photographs) hosted on one domain and the other part (i.e. videos) on another domain, it is even harder to maintain the same sharing policy definition. The sharing process is probably different in both domains, which falls under the problem referred to as Cross-Domain Compliance of the Access Policy Language (cf. section 3.3.9). Besides, users can have different identities on each domain, which makes the task error-prone and the resulting policies incoherent.
3.2.3 Content Sharing by Using Social Networks
Let us assume that people wish to share their family event content with all their family members. Defining access policies for those resources on a per-resource basis for each person in the family would be troublesome for the authors. If an author had twenty relatives and fifty resources to share, in a worst-case scenario this could result in a very large number of different sharing assignments, which would make manual attribution a laborious task.
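The size of the worst case mentioned above can be made concrete with a short calculation. It assumes that access is decided independently for every (relative, resource) pair, which is an interpretation of the scenario rather than a figure given in the text.

```latex
\begin{align*}
\text{individual sharing assignments} &= 20 \times 50 = 1000 \\
\text{possible sharing configurations} &= 2^{1000} \approx 1.07 \times 10^{301}
\end{align*}
```

Under that assumption, the author would have to take up to one thousand separate grant or deny decisions, and repeat part of them whenever a relative or a resource is added.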
Some web domains allow the definition of social networks. Less commonly, some even allow the definition of access policies by using the members of the social network. Nevertheless, social network relationship levels and layers are completely disregarded when defining access policies, due to the lack of expansion and expressivity of the adopted language or notation, i.e. sharing resources with first-degree relatives is not an option because the notion of first-degree relative is not known to those domains.
This happens because access policies are typically made possible by creating static groups of users and assigning those groups to specific resources. This is related to the problems of Expressivity and Expansion of the Access Policy Language (cf. sections 3.3.6, 3.3.7).
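The difference between a static group and a policy expressed over relationship levels can be sketched as follows. The triples, the relation names and the `audience` function are hypothetical; they loosely follow the relations named in figure 3.3 and do not represent the access policy language proposed in this work.

```python
# Hypothetical relationship triples (subject, relation, object).
relations = [
    ("Joseph", "son of", "John"),
    ("Joseph", "son of", "Mary"),
    ("John", "married to", "Mary"),
    ("Mary", "friend of", "Amelia"),
    ("John", "friend of", "Mathew"),
]

# Assumed set of relation types that make up the family layer.
FAMILY = {"married to", "son of", "daughter of"}


def audience(author, allowed_relations, relations):
    """People connected to the author, in either direction, by an allowed relation."""
    people = set()
    for subject, relation, obj in relations:
        if relation not in allowed_relations:
            continue
        if subject == author:
            people.add(obj)
        elif obj == author:
            people.add(subject)
    return people


family_of_john = audience("John", FAMILY, relations)
```

Because the audience is computed from the relationships each time the policy is evaluated, adding a new family relation changes who has access without editing any group by hand, which is what a static group cannot express.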
Figure 3.8: Compound Resource [figure: a Webpage containing Photograph A, Photograph B and Photograph C]
Furthermore, social network representation is typically domain-specific: it cannot be used in other domains, and its definition must be manually replicated on each of them. This is referred to as Cross-Domain Profile Semantic Heterogeneity (cf. section 3.3.2).
3.2.4 Content Sharing by Using Resource Attributes
As previously noted, several photographs have been annotated (either automatically
or manually) with the names of the people who have been identified in the photograph.
Additionally, each photograph has also been annotated with the number of people in
the picture who have been recognised but not identified.
Let us assume that authors wish to share their family-event resources only with people
who have been identified in those resources and belong to the author’s social network.
To achieve this, resource authors must check which people appear in
each resource and manually give access to the people identified in the photo.
To support this task, the system should take the annotations provided with the resource and compare them to the names of users in the author’s social network,
automatically granting access to those who match. This is related to the Expansion Problem of the Access Policy Language (cf. section 3.3.7), which does not allow
a typical access control system to be expanded to include resources’ or users’
information in the access policy definition mechanism.
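The matching step described above can be illustrated with a minimal sketch. It is not the mechanism proposed in this work; it assumes that annotations and social network members are plain name strings, and the `Photograph` class, the `grant_by_annotation` function and the example.org URIs are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Photograph:
    uri: str
    # Names of the people identified in the photograph (annotations).
    identified_people: set = field(default_factory=set)
    # Number of people recognised but not identified.
    unidentified_count: int = 0


def grant_by_annotation(photos, social_network):
    """Map each photograph URI to the identified people who also
    belong to the author's social network."""
    return {p.uri: p.identified_people & social_network for p in photos}


photos = [
    Photograph("http://example.org/photos/a", {"John", "Joseph"}, 1),
    Photograph("http://example.org/photos/b", {"Mathew", "Stranger"}),
]
network = {"John", "Joseph", "Mathew"}
grants = grant_by_annotation(photos, network)
```

In this sketch, a person annotated in a photograph but absent from the author's social network (here "Stranger") receives no access.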
3.2.5
Compound Resources by Hyperlinking Resources
Let us assume that a resource’s author wishes to use a web application that supports
creating a compound resource capable of displaying other resources (e.g. a photo album, a photo gallery, etc.). Figure 3.8 depicts a possible compound resource layout
(e.g. Webpage) that contains some of the event resources (e.g. Photograph A,
Photograph B, Photograph C).
The author wants to be able to use that application to share the compound resource
with other people at the party, while still preserving access policies over the resources
being shown and creating a new access policy over the compound resource.
Figure 3.9: Compound Resource Rendering (rendered views of the Webpage for John, Joseph and Mathew, each showing a different subset of Photographs A, B and C)
Depending on each person’s access policies for the inner resources, the compound resource may be rendered differently for each viewer. This
is related to the problem of Multiple Renderings of Compound Resources (cf. section
3.3.11).
Figure 3.9 depicts different renderings of the same compound resource
according to different users’ access policies, i.e. in this particular example, only users
that have been annotated as being in a photograph are given access to it.
3.2.6
Content Sharing by Dynamically Grouping Users and
Resources
Let us assume that a person is uploading all the photographs taken over the
last two months at several different events, attended by different and heterogeneous
groups of people. Imagine that the author wishes to share each event’s photographs
with the corresponding group of people attending that particular event.
Currently, on most popular web domains, users are only able to achieve this by statically
creating groups of users and resources.
That is, a static group of resources must be created for each event, and photographs
must be manually associated with the group of the corresponding event. Furthermore,
the people attending each event must be manually added to a static group of users
created for that event. Afterwards, each user group must be assigned to
the corresponding resource group. If a person attended more than one event, that person must be statically
specified in each of the corresponding groups.
This is related to the Expressivity Problem of the Access Policy Language (cf. section
3.3.6). If resources are of different types and hosted on different web domains, the
author must replicate that action on each of those web domains. This is referred to as the Cross-Domain Compliance of the Access Policy Language (cf. section 3.3.9).
3.2.7
Known-Unknowns Discovery
Let us suppose that an invited friend could not attend the event. From past events
of the kind, this person is fairly sure that photos and videos were
created during the event, but does not know where to start looking for them. The only
way for that person to find those resources would be to look for them in
every possible web domain, where they could still be unavailable to him because they
are restricted by access policies.
Taking into consideration some of the use cases mentioned before (cf. sections 3.2.3,
3.2.4), that person would not be given access to any resource because he is neither a
family member nor, given his absence from the event, recognised or identified in any
resource.
This is referred to as a Known-Unknown Situation (cf. section 3.3.12), where the user
knows that some kind of resources exist but cannot find or access them,
and the system cannot recommend them to the user.
3.2.8
Unknown-Unknowns Discovery
Let us assume that passers-by or bystanders have been captured in some of the resources generated by people attending the social event or, conversely, that people
passing by have captured some of the people at the social event in their own resources.
Without knowing who those passers-by are, the social event guests would never be
able to gain access to the resources captured by the passers-by, and the
opposite also applies.
This is referred to as an Unknown-Unknown Situation (cf. section 3.3.13), in which a person: (i) does not
know of the existence of a resource; (ii) has no relation to the resource or the resource
author; and (iii) can neither access it nor request access to it.
3.3
Systematisation of Problems
This section relates the hypothesis mentioned in section 1.5
to the problems encountered in the use cases presented in section 3.2.
3.3.1
User Multiple Identity
User Multiple Identity is a persistent problem on today’s Internet.
Every time a user registers on a web domain, that person is given a new virtual identity.
This raises conflicts when defining access policies in a single place for resources hosted
on multiple web domains. As a result, it becomes more difficult to uniquely identify
each person when they use different virtual identities/credentials for every
web domain.
When specifying an access policy, the resource’s author may be unsure which
identity to choose for the policy. To avoid this kind of problem, this work
assumes that every person is uniquely identified by a URI.
3.3.2
Cross-Domain Profile Semantic Heterogeneity
In a distributed and decentralised Internet, the definition of social relationships must
occur in a standard format common to all web domains.
Practically all web domains allow users to register their identification but do not allow
them to share it with other domains in a standard and automatic way. For example, when people change their address, that change is not automatically propagated
to all the domains where the person is registered, because they are
not linked together. For that change to take effect, people must change their identity
on each and every web domain where they are registered.
Each person’s social identity should remain the same independently of where that
person is registered. Moreover, a change in the social network should have immediate
impact on all web domains and not solely on the one where the change is registered.
This proposal adopts W3C standards for representing people’s identity, profiles and
social networks that comply with the distributed and decentralised principles of the
Internet.
3.3.3
Cross-Domain Authentication
In order to achieve a global and distributed Internet, users must be able to authenticate in any domain without relying on specific third-party domains. To this end, a
distributed Internet requires users to have control over their identity. Such an identity
makes it possible for them to achieve global and distributed authentication.
For the moment, the most common way of allowing cross-domain authentication is by
relying on federated single-sign-on providers, e.g. OpenID-Connect, googleID, OAuth,
Facebook Connect. Contrary to what happens with federated single-sign-on processes,
the authentication process should occur only between the web domain and the user,
independently of other authentication mediators.
This proposal is based on the adoption of a non-federated single-sign-on authentication
solution that provides distributed cross-domain authentication.
3.3.4
Cross-Domain Authorisation
To enable decentralised control over resources, users and resources must be uniquely
identified by URIs. While users already have a unique identity (cf. section 3.3.1),
web domains must be able to provide and expose unique URIs for
hosted resources.
Allowing the access decision and its enforcement to occur on different domains
makes it possible to build a decentralised cross-domain authorisation process.
The decision is enforced on the domain where the resource is hosted, but
the decision to grant or deny access can be made by any other entity, based on the requesting
user’s credentials and the resource being accessed. Given that information, a
decision entity is responsible for retrieving the access policies for the specified resource,
evaluating them and granting or denying access.
In this proposal, access to resources is given when the requesting user provides proper
credentials and the resource’s author has specified an access policy that grants resource
access to the requesting user. Unless stated as public, all resources must undergo an
authorisation process.
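The separation between decision and enforcement can be sketched as follows. This is only an illustration under simplifying assumptions: policies are reduced to explicit user-resource grants, both entities run in the same process, and the class names (`Grant`, `PolicyDecisionPoint`, `HostingDomain`) and example.org URIs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One access policy entry: this user may access this resource."""
    user_uri: str
    resource_uri: str


class PolicyDecisionPoint:
    """Decision entity; it may live on a different domain from the resource."""

    def __init__(self, grants, public_resources=()):
        self._grants = set(grants)
        self._public = set(public_resources)

    def decide(self, user_uri, resource_uri):
        # Resources stated as public do not require an explicit grant.
        if resource_uri in self._public:
            return True
        return Grant(user_uri, resource_uri) in self._grants


class HostingDomain:
    """Domain hosting the resources; it enforces decisions it does not make."""

    def __init__(self, resources, decision_point):
        self._resources = dict(resources)
        self._pdp = decision_point

    def get(self, user_uri, resource_uri):
        if not self._pdp.decide(user_uri, resource_uri):
            raise PermissionError(f"{user_uri} may not access {resource_uri}")
        return self._resources[resource_uri]


JOHN = "https://john.example.org/#me"
PHOTO_A = "https://photos.example.org/a"
FLYER = "https://photos.example.org/flyer"

pdp = PolicyDecisionPoint([Grant(JOHN, PHOTO_A)], public_resources=[FLYER])
domain = HostingDomain({PHOTO_A: b"photo a", FLYER: b"flyer"}, pdp)
```

Here `domain.get(JOHN, PHOTO_A)` returns the content, while a request from any other identity for `PHOTO_A` raises `PermissionError`.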
3.3.5
Contextual Information Loss
People can explicitly or implicitly provide information to a system. When people
create and maintain their profile, build up their identification or set up their social
network, they are explicitly stating information. Nevertheless, based on people’s actions and behaviours, people are implicitly allowing systems to capture additional and
meaningful information that can be reused later on.
Based on the use cases Content Sharing on a Single Domain and Content Sharing
on Multiple Domains (cf. sections 3.2.1, 3.2.2), when people host resources on a web
domain they are in fact performing an upload action of a resource to a specific web
domain. Until then, those resources only contained embedded information and were
not related to the person performing the upload action.
After the upload action, an author relationship between resource and person is established and implicit knowledge is created. By creating this knowledge, it is possible
to associate resources with people, something that is typically lost or not exposed in
common web domains.
Capturing people’s actions upon resources, together with their environment information/conditions, allows relationships to be established between user and resource that could not be
achieved before. These relationships are eventually used by the system to address the problems
described in the use case Unknown-Unknowns Discovery (cf. section 3.2.8).
Furthermore, contextual and meta-information about resources can enhance other resources’ meta-information. For example, relating to the use case characterisation of
Contextual Information (cf. section 3.1.4), by crossing information (i.e. geospatial location and time) from the passers-by’s records of sports activities with the photographs
and videos, it is possible to enhance other resources’ contextual information by adding
information about the environment temperature, pressure and humidity. Such information might even disclose whether the event took place on a sunny or cloudy day.
In order to capture this knowledge, action sensors are introduced in this proposal for
recording user actions and contextual information. These action sensors capture and
exploit implicit user actions and behaviours in order to provide a better and smoother
web experience.
3.3.6
Expressivity Problem of the Access Policy Language
Most web domains offer access policy management over their hosted resources that is based on,
and constrained by, an RBAC approach, where groups of users
and resources must be statically defined before they can be used to define access
policies. Referring to the use case Content Sharing by Dynamically Grouping Users and
Resources (cf. section 3.2.6), event photographs must be manually associated with the
static group of users specifically created for that event.
The lack of contextual information and knowledge expressivity in these access control
policies does not allow dynamic groups of resources and/or users to be created
based on resource attributes or user profiles. Consequently, this kind of dynamic
grouping is not possible in such environments.
In this proposal, a simple yet more expressive and semantic policy notation is used.
Resource and user groups become dynamic sets, specified by semantic rules that are
only determined when enforcing access policies. For example, resources from different
events can be grouped together by making use of their creation date and location and
people groups can be created based on the event’s guest list.
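A minimal sketch of such dynamic sets is given below. It does not use the semantic policy notation adopted in this work; plain Python predicates stand in for the rules, and the `Resource` and `Event` classes, the function names and the sample data are assumptions made for illustration only. The point is that group membership is computed when access is evaluated rather than stored in static groups.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Resource:
    uri: str
    created: date
    location: str


@dataclass(frozen=True)
class Event:
    name: str
    day: date
    location: str
    guests: frozenset


def belongs_to(resource, event):
    """Rule: a resource belongs to an event if it was created on the
    event's day and at the event's location."""
    return resource.created == event.day and resource.location == event.location


def event_resources(event, resources):
    """Dynamic resource group, computed on demand."""
    return {r.uri for r in resources if belongs_to(r, event)}


def can_access(user, resource, events):
    """Rule: a user may access a resource of any event they were invited to."""
    return any(belongs_to(resource, e) and user in e.guests for e in events)


party = Event("birthday", date(2015, 5, 2), "Vila Real", frozenset({"John", "Joseph"}))
picnic = Event("picnic", date(2015, 5, 16), "Porto", frozenset({"Mathew"}))
photo_a = Resource("http://example.org/photos/a", date(2015, 5, 2), "Vila Real")
photo_b = Resource("http://example.org/photos/b", date(2015, 5, 16), "Porto")
events = [party, picnic]
```

Adding a new photograph with matching date and location requires no change to any group definition in this sketch.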
3.3.7
Expansion Problem of the Access Policy Language
Resource attributes, annotations and any other kind of non-proprietary information
are disregarded by the web domain if it does not recognise the information.
In the use case Content Sharing by Using Resource Attributes (cf. section 3.2.4),
the web domain should automatically use the annotations provided with the photographs
and compare them to the names of existing
users in the author’s social network. To support resource sharing as described, the
web domain must consider the inclusion of resources’ meta-information.
For the time being, there are no web domains capable of hosting resources and allowing
them to be shared in this way, nor do they provide means or extensions to the default web-domain-specific
sharing procedures. Similarly to what happens in Object Oriented Programming, the
access policy notation used by these web domains should be open for extension and
closed for modification, as enunciated by the Open/Closed Principle [105].
This work proposes a single access policy management system where access policies
are defined in a single place and people are not obliged to comply with any native or
proprietary access control systems.
3.3.8
Semantics of the Access Policy Language
Web domains organise information according to internal and proprietary formats instead of using, e.g., ontologies to capture the formal semantics and representation of
information.
While some web domains provide simple inference mechanisms (e.g. being able to dynamically populate a “Family” group by taking into consideration manually established relationship levels between people), those web domains are not capable of inferring first or second degree relatives.
Not being able to address the semantics implied in family ties limits the
sharing possibilities with those in the “Family” group, because subsets of that group (e.g. first and second degree relatives) cannot be used automatically.
Consequently, access policies cannot be derived from existing knowledge.
To achieve these fine-grained semantic inference capabilities, this proposal adopts semantic
rules for the specification of access policies.
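As an illustration of the kind of inference meant here, the sketch below derives first-degree relatives from stated parent-child facts. It is an assumption-laden example, not the rule language used in this work: first-degree relatives are taken to be parents, children and siblings, and the `first_degree_relatives` function and the sample names are hypothetical.

```python
def first_degree_relatives(person, parent_of):
    """Infer parents, children and siblings of `person` from a set of
    (parent, child) facts; none of them needs to be stated explicitly
    as a 'first-degree relative'."""
    parents = {p for p, c in parent_of if c == person}
    children = {c for p, c in parent_of if p == person}
    siblings = {c for p, c in parent_of if p in parents and c != person}
    return parents | children | siblings


# Only parent-child relationships are stated explicitly.
parent_of = {("Ann", "John"), ("Ann", "Mary"), ("John", "Tim")}
john_relatives = first_degree_relatives("John", parent_of)
```

An access policy could then refer to the inferred set instead of a manually maintained “Family” group.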
3.3.9
Cross-Domain Compliance of the Access Policy Language
When all the resources from an event are hosted on a single web domain, i.e. according
to use case Content Sharing on a Single Domain (cf. section 3.2.1), granting access
to those resources is less difficult than when the resources are hosted on different web
domains, because in the latter case the author must ensure that access policies are
consistent across all the domains involved.
Referring to the use case Content Sharing on Multiple Domains (cf. section 3.2.2),
when resources from the same event are hosted on multiple different domains, access
policies must be replicated on multiple different domains by a process of copying access
policies from one domain to another and performing manual comparison and revision.
Even when sharing options are similar in multiple web domains, their specification is
not interchangeable in a way that can be exported from one domain to be used in
the other. This generates a redundancy problem for access policies in different web
domains. Not only are users faced with different access policy management systems
and different user identities, but most of these web domains also work
with proprietary access control policies. Unless access is given by one-to-one user-resource rules, permissions will most probably end up being different in those domains.
This proposal focuses on adopting central access policy management that complies
with semantic W3C standards. It eliminates the need for duplication or redundancy
of access policies for the same resource in different domains.
3.3.10
Cross-Domain Resource Hosting
Given the existing abundance of web applications, people should be able to embrace these and use
them according to their specific needs. At the moment, when people upload resource
bundles, those resources cannot automatically be hosted on different web domains
according to each specific resource type and user’s preferences.
User-generated content is characteristically hosted on the web domain where it is
generated, e.g. when a resource is uploaded to a web domain, it resides on
that web server and is given a URI within that host domain.
Typically, each existing web application is responsible for hosting and keeping
track of the resources it holds, and the rules for displaying those resources are intrinsic to each of those domains; it is usually not possible for other users to reuse or
hyperlink them on other domains.
Web domains are not prepared for fully decentralised resource hosting based on the
resource type, nor do users have a way to specify, in a cross-domain perspective, where
a particular type of resource should be hosted when it is created.
Specifying the hosting domain for each resource type solves the problem of automatically hosting resources in the domains that are most suitable for those resources, as
identified in the use case presented in section 3.2.2.
The same also applies to information – filled in web forms – that is kept in a restricted
format in that domain and for which a URI is not generated for each piece of filled-in
information, thus not being addressable.
This work proposes a mechanism capable of distributing content on appropriate web
domains, according to the specified preferences in user profiles.
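The idea of distributing a resource bundle according to profile preferences can be sketched as below. The function name, the resource type labels and the example.org host names are hypothetical, and the preference structure (a mapping from resource type to preferred hosting domain) is an assumption about how such preferences might be represented.

```python
def place_resources(bundle, preferred_hosts, default_host):
    """Group the resources of an uploaded bundle by the hosting domain
    the user prefers for each resource type."""
    placement = {}
    for name, resource_type in bundle:
        host = preferred_hosts.get(resource_type, default_host)
        placement.setdefault(host, []).append(name)
    return placement


# Hypothetical preferences taken from a user profile.
preferred_hosts = {
    "photograph": "photos.example.org",
    "video": "videos.example.org",
}
bundle = [
    ("party.jpg", "photograph"),
    ("cake.mp4", "video"),
    ("guests.txt", "text"),
]
placement = place_resources(bundle, preferred_hosts, "files.example.org")
```

Resource types without a stated preference fall back to the default host in this sketch.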
3.3.11
Multiple Renderings of Compound Resources
A compound resource is a first-class resource composed of other first-class resources.
A compound resource is addressable by a URI and access policies can be defined for
it independently of its inner resources.
Access policies for each of the inner resources must always be preserved. To that end,
different visualisations of the same compound
resource must be automatically displayed according to the grants that the viewing
user has for the resources being displayed.
Not all users are able to visualise all resources equally, hence different views of the
compound resource must be rendered according to each specific user’s access policies.
With traditional authentication, authorisation and access policy methods such as those
found on the Internet today, when compound resources are composed of other
resources that do not reside on the same web domain (cf. section 3.2.5), the
resource author faces the dilemma of either:
1. having to upload those resources again:
• thus creating a copy of those existing resources;
• having to replicate the sharing permissions.
2. reusing the existing and already uploaded resources (by hyperlinking to them):
• which avoids duplicated resources;
• but the source web domain does not allow the other web domain to access
the resources because of the existing access policies.
The presented proposal is based on the fact that each resource is uniquely addressable
and, therefore, each URI can have access policies that must always be preserved. This
promotes the second option of the dilemma, where resources
should be reusable and hyperlinked, even when protected by access policies.
Access policies are always enforced as close as possible to the resource. This means
that for accessing a compound resource, the compound resource’s hosting web domain
is only responsible for enforcing access policies for the compound resource. The inner
resources’ hosting web domains are responsible for enforcing access policies for those
resources.
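The resulting per-viewer rendering can be sketched as follows. For brevity, this illustration evaluates all policies in one function, whereas in the proposal each inner resource's hosting domain enforces its own policies; the function name, the policy structures and the grants assigned to John, Joseph and Mathew are hypothetical.

```python
def render_compound(viewer, inner_uris, compound_grants, inner_grants):
    """Return the inner resources that `viewer` may see in a compound resource.

    The compound resource's own policy is checked first; each inner
    resource is then filtered by its own, independent policy."""
    if viewer not in compound_grants:
        raise PermissionError(f"{viewer} may not access the compound resource")
    return [uri for uri in inner_uris if viewer in inner_grants.get(uri, set())]


webpage = ["photo-a", "photo-b", "photo-c"]
compound_grants = {"John", "Joseph", "Mathew"}
# Hypothetical grants for the inner resources.
inner_grants = {
    "photo-a": {"John", "Joseph"},
    "photo-b": {"John", "Mathew"},
    "photo-c": {"Joseph", "Mathew"},
}
johns_view = render_compound("John", webpage, compound_grants, inner_grants)
```

Each viewer thus obtains a different rendering of the same compound resource without any inner policy being weakened.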
3.3.12
Known-Unknown Situation
Referring back to the use case Known-Unknowns Discovery (cf. section 3.2.7), the
invited person who did not attend the event has the problem of finding resources that he knows
exist but that are not searchable or accessible on common web domains. Such known-unknown resources can be located on any domain, which would force the
user to search for them in every particular domain. Furthermore, as mentioned
in the same use case, unless the person is explicitly granted access to the resources,
that person will never be allowed to access them.
Cross-domain indexing and search engines are only capable of retrieving publicly accessible resources. Today’s web domains and recommendation engines are still not
capable of this kind of indexing and awareness of private or protected resources, firstly due to existing access policies and secondly due to privacy constraints,
and are therefore unable to solve this problem.
This proposal uses a recommendation engine that, given domain knowledge and
sufficient rules, is capable of analysing contextual information and recommending
access policies for those resources. For example, in the mentioned use case, if the
event’s guest list were known to the recommendation system, it could propose that resource
authors allow access to those resources for people on the guest list.
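The guest list example can be sketched as a simple rule. This is an illustration only and not the recommendation engine itself: the function name and data structures are hypothetical, and suggestions are merely proposed to the author rather than applied.

```python
def recommend_guest_access(guest_list, resource_authors, grants):
    """Suggest (author, resource, guest) triples for guests who do not
    yet have access to an event resource."""
    suggestions = []
    for uri in sorted(resource_authors):
        author = resource_authors[uri]
        allowed = grants.get(uri, set())
        for guest in sorted(guest_list):
            if guest != author and guest not in allowed:
                suggestions.append((author, uri, guest))
    return suggestions


guest_list = {"John", "Joseph", "Peter"}  # Peter was invited but did not attend
resource_authors = {"photo-a": "John"}    # resource URI -> author
grants = {"photo-a": {"Joseph"}}          # existing access policies
suggestions = recommend_guest_access(guest_list, resource_authors, grants)
```

In this example, the author John would be advised to grant Peter access to `photo-a`, even though Peter appears in no photograph.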
3.3.13
Unknown-Unknown Situation
Referring to use case Unknown-Unknowns Discovery (cf. section 3.2.8), people at the
event and passers-by are not related. They do not share common interests, have
similar tastes or belong together in any social network. Nevertheless, they are related
by a thin line that puts them in the same place and time.
By comparing the location and time at which photographs were taken (available from their information and annotations) with information from the physical performance record of a runner
passing by (which discloses that person’s location at a precise time), it is possible to
frame both resource and passer-by in the same window of time and space. In this case,
the unknown-unknown situation occurs because passers-by are unaware of the existence of a
resource in which they might appear.
This proposal uses knowledge and rules that can be expanded and enhanced by users
in order to facilitate this kind of reasoning and the discovery of unknown-unknown resources. As a result, if the meta-information is correctly represented according to
formal specifications that enable reasoning, the recommendation system can
recommend access policies that give passers-by access to resources in the same window of time and
space.
Furthermore, it could even prune the number of suggested
recommendations by analysing the information related to recognised but not identified
people in the photo and comparing those unidentified faces with the ones available in
the passers-by’s online profiles.
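Framing a resource and a passer-by in the same window of time and space can be sketched as below. The thresholds (50 metres, two minutes), the dictionary layout and the function names are assumptions chosen for illustration; the distance is computed with the standard haversine formula.

```python
import math
from datetime import datetime, timedelta


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def same_window(photo, track_point, max_metres=50.0, max_gap=timedelta(minutes=2)):
    """True if the photograph and the runner's track point fall within
    the same (assumed) window of time and space."""
    close_in_time = abs(photo["time"] - track_point["time"]) <= max_gap
    close_in_space = haversine_m(
        photo["lat"], photo["lon"], track_point["lat"], track_point["lon"]
    ) <= max_metres
    return close_in_time and close_in_space


photo = {"lat": 41.3006, "lon": -7.7441, "time": datetime(2015, 6, 1, 10, 0)}
nearby = {"lat": 41.3007, "lon": -7.7441, "time": datetime(2015, 6, 1, 10, 1)}
far_away = {"lat": 41.3100, "lon": -7.7441, "time": datetime(2015, 6, 1, 10, 1)}
```

A match such as `same_window(photo, nearby)` only indicates a candidate; the recommendation system would still need to propose the access policy to the resource's author.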
3.3.14
Insecure Sharing
When people host resources on a web domain and wish to share them with others,
typical web domains demand that the people who are to be given access to resources
be registered on that domain, so that they have a domain user identity for which
access policies can be specified. Nevertheless, not every person wishes to be registered in every
existing domain from which resources can be shared.
To overcome this, most existing web domains offer a mechanism for allowing
unregistered users to gain access to those resources. The sharing process is achieved
through an insecure method, which assigns the resource a random,
hard-to-guess URI that is sent to unregistered users. Some of these sharing
methods allow a password to be defined, which is sent to the user along with
the URI for the shared resource. This has the following downsides:
• given the insecure forms of existing communication (e.g. e-mail), any person who
obtains the generated URI is given access to the resource, because no authentication
is performed on the web domain. Unauthorised access is more difficult when a password is defined
for accessing that URI but, once more, that password is typically sent to the
user using the same method;
• it is difficult but not impossible to guess those URIs;
• the URI can be passed on to other users and the author loses track of who has
access to what.
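The weakness of this sharing method can be made concrete with a small sketch of a hypothetical web domain that shares resources through random URIs. The `LinkSharingDomain` class and the example.org address are assumptions; the only point illustrated is that possession of the URI is the sole condition for access.

```python
import secrets


class LinkSharingDomain:
    """Hypothetical web domain sharing resources through unguessable URIs."""

    def __init__(self, base="https://files.example.org/s/"):
        self._base = base
        self._by_token = {}

    def share(self, resource_name):
        # The random token is the only secret protecting the resource.
        token = secrets.token_urlsafe(32)
        self._by_token[token] = resource_name
        return self._base + token

    def fetch(self, uri):
        # No authentication: whoever presents the URI receives the resource.
        token = uri.rsplit("/", 1)[-1]
        return self._by_token.get(token)


domain = LinkSharingDomain()
shared_uri = domain.share("party.jpg")
```

Nothing in `fetch` identifies the requester, so the author cannot tell the intended recipient from anyone to whom the URI was forwarded.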
This proposal assumes a position where access to resources must always be achieved
by means of access policy definition that does not require such kind of insecure sharing
in order to overcome the problem of unregistered users.
Given that access policies can be defined using a single user identity (cf. section 3.3.3),
it is always possible to specify which user should access which resource.
3.4
Summary
A use case scenario that demonstrates several typical use cases of a day-to-day web
experience was presented in this chapter. A systematisation of the problems that
still exist in the current web was provided for those use cases.
The next chapter introduces a conceptual architecture that addresses the problems systematised in this chapter.
Chapter 4
Architecture
This chapter describes the architecture for a system capable of accomplishing the
envisaged goal of the work described in section 1.4 by solving the problems enunciated
and systematised in section 3.3.
4.1
Overview
The decentralised structure is capable of providing authentication, authorisation, access control management and recommendation based on resources, users, provenance
and traceability information in a distributed and decentralised system, by promoting
the usage of action sensors, metadata generators and semantic rules.
This architecture is based on the access control architecture described in section 2.1.5
but it is novel in respect to the following aspects:
• despite most of the components maintaining the same names as in typical architectures, their responsibilities and features are enhanced to address the defined
requirements;
• it adds a Policy Recommendation Point (PRP) that is responsible for the recommendation of access policies;
• it boosts these components by replacing legacy and traditional non-standard
formats and procedures with new data representations based on semantic web
standards, capable of a better and more explicit description of knowledge and information.
In particular, managed and exploited information is a cornerstone of this work, namely:
Resources Represent anything in the world that can be referred to (either physical
or virtual). The proposed architecture considers and exploits the resources’
unequivocal identity, information content, meta-information, provenance and
traceability information.
Users Are a special kind of resource representing a human or artificial agent in the
system. The proposed architecture considers and exploits users’ related resources
such as identity, profiles, preferences, social relationships, generated actions and
contents.
As described previously in section Reference Architecture, an application that provides AAA is typically composed of five components (i.e. IdP, PEP, PIP, PDP and PAP). With the IdP replaced by the Identification Provider Point
(IDP) and the inclusion of the PRP, the architecture components (cf. figure 4.1) have the following responsibilities and features.
IDP Provides new users with a new identity and appropriate credentials. This component replaces the IdP mentioned in the reference architecture, because more
features have been added and the name was not consistent with those of the other components. The following features are enhanced or added:
• generates an identity and credentials for new users;
• allows managing each user’s internal and social identity in the virtual world;
• provides an authentication relying party service that allows legacy domains that
do not provide FOAF+SSL authentication to validate users’ credentials.
PEP Enforces the user’s authentication and guarantees controlled and authorised
access to resources. The following features are enhanced or added:
• typical basic authentication methods are replaced by FOAF+SSL distributed
cross-domain authentication (cf. section 2.1.2);
• enforcement is no longer achieved by using local access policies, but instead it is
replaced by a distributed and decentralised method which does not rely on local
access policies;
• action sensors capture user-generated actions and content.
PDP Evaluates access policies in order to decide whether or not a user should be given
access to a resource. The following features are added or changed:
Figure 4.1: Proposed Architecture [Components Diagram] (a User, through an HTTP Client, and a Resource Author interact with a Web Domain comprising the Policy Enforcement Point, Policy Information Point, Policy Decision Point, Policy Administration Point, Policy Recommendation Point and Identification Provider Point)
• replaces traditional role or attribute based authorisation mechanisms with an authorisation mechanism capable of handling semantic, declarative and expressive
access policy languages;
• provides decentralised access policy evaluation that is used in a cross-domain
perspective;
• obtains, if necessary, semantic information from the PIP for evaluating a particular access policy;
• offers reasoning capabilities over more expressive access policy rules that exploit
the system’s semantics.
PIP Manages the information needed for the authentication, authorisation and recommendation processes (cf. section 2.1.5.5). Responsibilities formerly provided
by the PRP are now assigned to this component. The following features are
added or enhanced:
• management of information related to:
– resources’ content, including their type, attributes/properties and preferred
hosting domain;
– provenance and traceability information over user generated actions and
content;
• generating and publishing information according to an explicit and public semantic specification (i.e. ontology).
PAP Enables users to manage access policies over existing resources. The following
features are added or enhanced:
• access policies are specified by rules instead of directly assigning users to resources
or placing users in particular roles;
• proprietary access policies over resources imposed by closed domains are replaced
by far more flexible and expressive rules that capture the rationale behind a
particular access policy beyond current approaches;
• provides and promotes the means to create access policies not only based on user
attributes and relationships, but also on resource attributes;
• provides and promotes the means to define more complex access policies through
semantic reasoning over contextual information and meta-information.
PRP This component is a novelty in AAA systems, enabling the recommendation
of access policies to be applied to users and resources. These are some of the
envisaged responsibilities and features:
• recommend access policies by combining collaborative, social content and semantic filtering methods, allowing the recommendation of known-unknown and
unknown-unknown resources to users;
• allow customising the recommendation process, namely the weights for each
filtering method.
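The weighting of filtering methods mentioned above can be illustrated with a minimal sketch. It assumes each filtering method yields a score between 0 and 1 for a candidate resource and that the scores are combined as a weighted average; the method names, the `combined_score` and `rank` functions and the averaging formula are illustrative assumptions, not the combination actually used by the PRP.

```python
# Hypothetical sketch: combine per-method recommendation scores with
# user-customisable weights. Names and formula are assumptions.

def combined_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-method scores; a missing method counts as 0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one filtering method needs a positive weight")
    return sum(w * scores.get(name, 0.0) for name, w in weights.items()) / total


def rank(candidates: dict[str, dict[str, float]], weights: dict[str, float]) -> list[str]:
    """Order candidate resource URIs from most to least recommended."""
    return sorted(candidates,
                  key=lambda uri: combined_score(candidates[uri], weights),
                  reverse=True)
```

Raising the weight of one method (for instance the semantic one) shifts the ranking towards candidates that score well under that method, which is the kind of customisation the second bullet describes.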
4.2
Research Questions vs. Features
These components deliver functionalities that answer the identified research questions
(cf. section 1.6) as follows:
RQ1: How can a distributed, decentralised and standard-based
mechanism perform authentication and authorisation?
The authentication process adopts a FOAF+SSL-based approach that enables user
authentication in different domains (i.e. distributed). Information and responsibilities are not centralised in a single entity; instead,
multiple entities are responsible for performing the authentication between the web
domain and the user, without the interference of any third-party authentication or
certifying domain.
The access policy definition language is a W3C recommendation – a declarative and
expressive rule language – whose rules are managed and exploited by components in
a distributed and decentralised manner.
RQ2: What and how can user-generated actions upon resources
be captured?
Maintaining an association between resources and authors is required. Research on this
subject has been carried out in the context of provenance information and the concept
of resource traceability information is a novelty proposed in this thesis (cf. section
5.2.1.3).
It consists of generating meta-information for user-generated content and actions, by
capturing, creating and representing that meta-information in the form of traceability
or provenance semantic annotations (cf. sections 5.4.1.2, 5.4.1.3).
Action sensors detect and capture different user interactions with the system. Common actions like reading, uploading, scrolling, downloading, etc. are
captured, providing contextual information about a user’s performed actions (cf. section 5.2.1.4).
Contrary to typical approaches where user actions are only kept in local and proprietary repositories, this approach enables user-action information to be captured on
multiple domains and made accessible to other systems in a standard format, in a
cross-domain perspective.
Provenance and traceability annotations adhere to the semantics specified by specifically developed and adopted ontologies, providing more expressivity to such annotations and means to be used in access policies’ definition and evaluation.
RQ3: How can a user share a resource with others, based on
rules instead of statically-defined discretionary access control?
Contrary to the options offered by web domains that only allow
restrictive and proprietary access policies (e.g. Google, Facebook), this approach uses
a declarative language for describing access policies, which allows users to define
policies that cover more complex sharing scenarios.
The expressivity of the adopted access policy language allows the definition of access
policies by using rules that consider the user’s actions, preferences and context, and
the resource’s context and meta-information, providing dynamic grouping of resources
and users, instead of having statically-defined or role-based access policies for each
resource.
Access policies defined through rules apply to different resources even if hosted in
different domains, reducing policy re-definition efforts on multiple domains.
By using a declarative and expressive semantic rule language for describing access
policies, it is possible for users to capture the rationale for sharing a resource.
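To make the contrast with statically-defined access lists concrete, the following sketch models a rule as a predicate over user and resource attributes. It is written in Python purely for illustration; the thesis adopts a W3C rule language, and the `User`, `Resource`, `shares_topic` and `is_permitted` names are hypothetical.

```python
# Hypothetical sketch: rule-based access instead of per-resource user lists.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class User:
    webid: str
    interests: set[str] = field(default_factory=set)


@dataclass
class Resource:
    uri: str
    topics: set[str] = field(default_factory=set)


Rule = Callable[[User, Resource], bool]


def shares_topic(user: User, resource: Resource) -> bool:
    """Example rule: grant access when an interest matches a resource topic."""
    return bool(user.interests & resource.topics)


def is_permitted(user: User, resource: Resource, rules: list[Rule]) -> bool:
    """Access is granted when at least one rule holds for this pair."""
    return any(rule(user, resource) for rule in rules)
```

Because the rule refers to attributes rather than to a specific resource, the same rule applies to resources hosted in different domains without being redefined, and it groups users and resources dynamically as their attributes change.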
RQ4: How is it possible to automate or ease the process of managing access policies to resources from a resource’s owner perspective and his relationships?
The social relationships between individuals contain the missing link concerning the
subjectivity and social dimension in information management and recommendation
systems. Such relationships can be exploited by classifying and entrusting the information returned by a service. In fact, recommended access is accomplished by the
combination of ontologies, social network analysis and user profiles. Therefore, acting
together, knowledge engineering and social network analysis facilitate access to:
• semantically contextualised knowledge, because it considers the semantics of
both the user request and the requested resource;
• socially contextualised knowledge, because the system considers the information
resulting from the actions of other users, contributing to the creation of context
and recommendation of appropriate resources.
Only with such information can a system reason upon all published resources spread
across different domains all over the Internet, and then provide recommendations.
4.3
Summary
When users adopt a content consumer role, the system is capable of automatically and
implicitly capturing users’ actions. When users adopt a content producer role, besides
capturing the user’s actions, the system allows users to manage their resources’ access
policies in a cross-domain perspective.
While published work [163] built with TAMI uses an approach based on OpenID Connect authentication, the presented architecture uses an approach similar to TAAC,
using FOAF+SSL for authentication.
Whereas [163] is browser-dependent, forcing users to install tabulator [18] or its plug-in
on the Firefox browser, the proposed architecture realises authorisation over resources
with just a technological enhancement on each HTTP web server. This is achieved by
implementing special modules on web servers, which intercept requests.
Whilst the system provided in [163] has a nucleus that is responsible for acting as a
proxy and maintaining all the data and URIs to resources, the proposed architecture
approach makes use of distributed repositories to create associations between resources
and authors, not requiring a central proxy hub for accessing files. Also, it does not
change the final user browsing experience.
TAAC uses an approach of logging access to resources, which becomes quite useful
when analysing and providing context for resources. The architectural model proposed
in this work widens that vision by also providing means for creating traceability and
provenance information for each user action, maintaining an association between every
available resource and its author.
This chapter proposed an architecture outline, independent of each component’s internal details, enabling a seamless Internet browsing experience to the end user. It
was intentionally set apart to give a cohesive, high-level view of the architecture and
to separate the systematised problems identified in the previous chapter from the
architecture components’ internal design proposed in the next chapter.
Chapter 5
Design
This chapter describes the details of the architecture proposed in the previous chapter.
The proposed architecture is built upon a distributed user identification and authentication methodology called WebID, where each individual user is identified by a profile
that may be located anywhere, provided that it is addressable via a URI. This assumption allows users to perform a distributed authentication on any web domain
without relying on third-party web domains or federations. Because each user is individually identified by its URI, authors can share their resources by simply defining
which user’s URI should be given permission to access a resource.
For the process of sharing resources with users not registered in the same domain, the
resource’s hosting web domain cannot rely on centralised authorisation methodologies, otherwise users not registered on that domain would be prevented from having
access to those resources. Based on this assumption, this work provides a decentralised authorisation process based on FOAF profiles, replacing typical access control
approaches (e.g. RBAC, ABAC, etc.) by a semantic rule-based approach for defining
access policies.
This chapter delivers technological solutions for establishing a trust network in a cross-domain perspective. Access to a resource is denied if the user being given access is not
part of the resource author’s social network: when no relationship exists between
the user granting access to a resource and the user receiving it, access is denied
for that user. A social relationship, independently of its level and layer (cf. section
2.1.1), must exist between the resource owner and the other user to whom access to
the resource is given. While this may seem redundant, it allows resource authors to
maintain an updated list of all their relationships and builds a cross-domain trusted
resource-sharing network, avoiding problems like insecure sharing issues (cf. section
3.3.14).
A resource recommendation process has been implemented to aid users when sharing
resources. This process allows known-unknown and unknown-unknown resources to
be recommended to users. Even though access to resources is only achieved when a
social relationship exists between the resource’s author and the proposed users, this
recommendation process is not limited to the users reachable through users’ social
relationships. In fact, when a resource’s author and the proposed user are not directly
related, which means the proposed user is not part of the resource author’s social
network, the inclusion of the proposed user in the resource author’s social network is
recommended by the system (cf. section 2.1.3.5). The proposed features are therefore
capable of performing the recommendation process for users that might be interested
in a particular resource, even if at the time being that resource is unreachable to
that user, because no social connection exists between the resource’s author and that
particular user.
The following sections depict a conceptual design for the architecture capable of achieving the proposed goal.
5.1
Identification Provider Point
The concept of an Identification Provider Point (IDP) entity is introduced in this work
and extends the existing generic term IDentity Provider (IdP) and its functionalities.
This IDP, besides providing identification and credentials for new users, allows
user profile management and can act as an authentication relying party.
5.1.1
Contributions
The contributions of this component address the following responsibilities:
• to generate identities for new users, by allowing new users to create a virtual
identity, thus generating the corresponding user credentials;
• to manage users’ internal and social identities, allowing users to:
– maintain their profiles;
– maintain their social networks;
– manage their topic preferences list;
– manage the list of preferred hosting domains for each distinct resource
type.
• to work as a relying party by validating the user’s credentials that attest the user’s
identity.
The next sub-sections describe the processes addressing each of these responsibilities
in the scope of the IDP.
Figure 5.1: IDP: Identity Creation Process [System Sequence Diagram]
5.1.1.1
User Identity Creation
Authentication can only take place for users that can provide proper credentials. In
order for users to have proper credentials, this entity assists users in generating their
internal and social identities, as well as tying them to credentials that attest that
the user is who he claims to be.
The identity creation of a new user is a two-phase process, as depicted in
figure 5.1. In the first phase, the user provides personal information in order to fill his
internal identity (cf. section 2.1.1) and create his user FOAF profile. In the second
phase an SSL certificate is generated at the client browser, which will work as the
user’s credentials. Figure 5.2 depicts a sequence diagram of the whole process.
The FOAF+SSL protocol requires that at least first name, last name, a unique nickname
and e-mail address are included in the FOAF profile.
The nickname must be unique for the domain where the FOAF profile is hosted because
it is part of the WebID that is given to the user on the domain. Further, every
generated WebID for users follows the same rule inside the domain. Such information
must be entered (step 4).
The creation of the FOAF profile is initiated by the user (step 5). This submits the
request to the IDP (step 6) which validates the nickname’s uniqueness in the system
(step 7), generates the corresponding FOAF profile in RDF, according to the FOAF
ontology (step 8) and sends the FOAF profile to be stored in a PIP (steps 9-10). After
storing the FOAF profile, it responds with the generated URI that partly identifies the
user in the WWW and points to the location of the corresponding FOAF profile (step
12). This URI is embedded in the response that is sent to the client (steps 13-14).
Figure 5.2: IDP: Identity Creation Process [Sequence Diagram]
In the second phase of the process, an SSL certificate is associated with the user’s FOAF
profile. The SSL certificate generation takes place locally, i.e. in the user’s client
browser, when demanded by the user (step 15). To link the certificate to the FOAF
profile, the SSL
certificate is generated with the FOAF profile URI in the Subject Alternative Name
section (step 16). The SSL certificate is automatically installed in the user’s client
browser, which means that the user’s credentials are always kept on the user’s side.
To finalise the bi-directional association between the FOAF profile and the SSL certificate, the corresponding user’s FOAF profile is updated with the SSL certificate’s
public key (steps 18-26).
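The two phases can be sketched as follows, under stated assumptions: the WebID layout (`https://<domain>/people/<nickname>#me`), the function names and the choice of Turtle serialisation are illustrative, and RSA keys are assumed. The vocabulary terms themselves (`foaf:givenName`, `foaf:familyName`, `foaf:nick`, `foaf:mbox`, `cert:key`, `cert:modulus`, `cert:exponent`) come from the public FOAF and W3C cert vocabularies. String escaping is omitted for brevity.

```python
# Hypothetical sketch of the FOAF profile produced in phase one and
# updated with the certificate's public key in phase two.
from typing import Optional


def webid_for(domain: str, nickname: str) -> str:
    """Build a WebID from the domain and the domain-unique nickname (assumed layout)."""
    return f"https://{domain}/people/{nickname}#me"


def foaf_profile(webid: str, first: str, last: str, nick: str, email: str,
                 modulus_hex: Optional[str] = None, exponent: int = 65537) -> str:
    """Serialise a minimal profile in Turtle; the key is added once it is known."""
    lines = [
        "@prefix foaf: <http://xmlns.com/foaf/0.1/> .",
        "@prefix cert: <http://www.w3.org/ns/auth/cert#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        f"<{webid}> a foaf:Person ;",
        f'    foaf:givenName "{first}" ;',
        f'    foaf:familyName "{last}" ;',
        f'    foaf:nick "{nick}" ;',
        f"    foaf:mbox <mailto:{email}>",
    ]
    if modulus_hex is not None:  # phase two: publish the public key
        lines[-1] += " ;"
        lines += [
            "    cert:key [ a cert:RSAPublicKey ;",
            f'        cert:modulus "{modulus_hex}"^^xsd:hexBinary ;',
            f"        cert:exponent {exponent} ]",
        ]
    lines[-1] += " ."
    return "\n".join(lines) + "\n"
```

Calling `foaf_profile` without a modulus corresponds to the profile stored at the end of the first phase; calling it again with the key corresponds to the update that closes the bi-directional association.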
5.1.1.2
User Profile Management
Information about each user’s preferences, profile and social network plays an important role in this proposal. In every social network it is necessary to give users
the possibility to manage their user profile and social relationships. In a cross-domain
perspective, the social network should not reside nor be represented in a specific proprietary format but rather be stored in a distributed and decentralised way using an open standard
while bound to the user’s URI.
User preferences and profile information are used in the recommendation processes
and the continual building of the user’s social network is used to entrust the network.
All this information is exploited when defining sharing/access policies.
From a cross-domain perspective, it is important that this kind of information remains
the same independently of the domain where the user might use it. The adoption
of FOAF profiles allows the definition of that information in a single profile that is
uniquely tied to the user’s identification because its content resides inside the resource
identified by the user’s URI. Such information can be used transversely by all the
domains on which the user interacts. Given the proposed distributed and decentralised
architecture, this component allows users to manage their unique FOAF profile in a
cross-domain perspective.
The management of social relationships with other users has a direct impact on the
user’s profile. It allows the definition of several levels (i.e. acquaintance, friend, best
friend, best friend forever) and different social layers (i.e. family, love, work, religious)
of relationships.
The information presented in FOAF profiles has been extended in this proposal so
that it can also mention the user’s interest topics and preferences as well as the domains
where generated content should be kept according to the resource’s type (cf. section
5.2.1.5).
5.1.1.3
Authentication Relying Party
Since FOAF+SSL authentication is achieved in a distributed manner, it is possible to
ensure FOAF+SSL authentication for several web domains without implementing an
authentication point on each domain, by simply redirecting the authentication request
to other entities that can perform such validation.
This is especially useful for legacy applications that wish to offer FOAF+SSL authentication but are not hosted under a secure connection, or for which a PEP cannot be deployed
on the same domain.
For an IDP to act as an authentication relying party, it uses some of the authentication
features provided by the PEP, namely the authentication module. This special
authentication process is depicted in figure 5.3.
When users want to perform FOAF+SSL authentication on a domain (steps 1-2) that
does not natively provide that type of authentication, the domain must redirect the
request to an authentication relying party that supports it (steps 3-4). In this request,
the domain must specify an anchor to which the authentication response should be
redirected, depicted in the diagram by the “LoginURL” parameter.
The IDP authentication service is exposed through a secure connection that must be
established with the user. When the IDP receives the authentication request, a PEP
(that resides in the same domain as the IDP) is responsible for validating the user
credentials1 (steps 5-8).
If the user is properly authenticated by the PEP’s authentication module, the process
continues (step 9). The IDP creates a signature containing the authenticated user’s
WebID and a timestamp, and encrypts it using its private key (step 10). The response
is redirected back to the requesting service (steps 11-12), together with the provided
signature attesting successful or unsuccessful user authentication.
The domain application is then responsible for validating the received message and
resuming its login process.
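The redirect and its validation can be sketched as below. Note the substitution: the text describes a signature produced with the IDP's private key, whereas this self-contained sketch uses an HMAC over a shared secret from the Python standard library as a stand-in, which changes the trust model (the legacy domain must share a secret with the IDP). The query parameter names, the `|` separator and the five-minute validity window are assumptions, and only the successful-authentication case is shown.

```python
# Hypothetical sketch: IDP-side assertion and domain-side validation.
# HMAC with a shared secret stands in for the private-key signature.
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def _sign(webid: str, ts: str, key: bytes) -> str:
    return hmac.new(key, f"{webid}|{ts}".encode(), hashlib.sha256).hexdigest()


def build_redirect(login_url: str, webid: str, key: bytes,
                   now: Optional[float] = None) -> str:
    """IDP side: attach the WebID, a timestamp and the signature to LoginURL."""
    ts = str(int(time.time() if now is None else now))
    query = urlencode({"webid": webid, "ts": ts, "sig": _sign(webid, ts, key)})
    return f"{login_url}?{query}"


def verify_redirect(url: str, key: bytes, max_age: int = 300,
                    now: Optional[float] = None) -> Optional[str]:
    """Domain side: return the WebID if the message is authentic and fresh."""
    params = parse_qs(urlsplit(url).query)
    try:
        webid, ts, sig = params["webid"][0], params["ts"][0], params["sig"][0]
    except KeyError:
        return None
    if not ts.isdigit() or not hmac.compare_digest(sig, _sign(webid, ts, key)):
        return None
    current = time.time() if now is None else now
    return webid if current - int(ts) <= max_age else None
```

The timestamp limits how long a captured redirect can be replayed, which is one plausible reason for including it in the signed content.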
5.1.2
Summary
This component allows users to have a virtual identity and to attest that identity through credentials.
Extending the IDP to be capable of verifying or attesting user credentials is perfectly
in line with the traditional IDP responsibilities. If the existing IDP already has the responsibility of generating user credentials, these entities are certainly also responsible2
for performing validation upon the resources they create.
1 More information related to this authentication process is described in section 5.2.1.1.
2 i.e. applying the Information Expert GRASP Pattern [96]
Figure 5.3: IDP: Authentication Relying Party Process [Sequence Diagram]
Adding services to update the user’s personal information, preferences and social network in his FOAF profile addresses and solves the problems inherent to the cross-domain profile semantic heterogeneity systematised in section 3.3.2.
While the added functionalities seem minimal, they are in fact very powerful: the
more relying parties exist, the more distributed the authentication becomes, thus
minimising fault points in a WWW environment.
When an authentication request is received on the IDP, it relies on the PEP to perform
the authentication challenge. If the challenge succeeds, the IDP generates a signature, composed
of different pieces of information that prove that the authenticating user is who he
claims to be.
By allowing the user’s information to reside on his unique profile, the proposed features
solve some problems related to the user’s multiple identity systematised in section
3.3.1.
Extending the user’s FOAF profile to include information related to his interest topics,
preferences and domains where generated content should be kept solves the resource-hosting location problems mentioned in section 3.3.10.
5.2
Policy Enforcement Point
In a typical web architecture, the PEP is the first component that interacts with any
user or agent request. This component enforces local authentication and guarantees
controlled and authorised access to resources.
A PEP is responsible for acting as a broker between user requests and web server
responses. This component acts by intercepting each received HTTP request. When a
request is intercepted, the PEP uses the authentication module to authenticate the user.
If the user is not able to provide FOAF+SSL credentials, the PEP passes the request
directly to the web application without interfering in the request/response process.
For each request (e.g. when a user is uploading or downloading a resource), the PEP
is capable of capturing and generating provenance and traceability information about
the user action, which is sent for semantic enrichment, storage and publishing in PIPs.
By using the proposed features of this component it is possible to capture relationships
between users and resources, as well as gathering information about those actions.
By combining distributed authentication, decentralised authorisation, action sensors
and decentralised resource hosting, users can keep an internet repository directory of
their actions and content. This allows the system to retrieve a list of the author’s
resources in the web, and a replica of the user’s actions on the Internet. This feature
acts like a breadcrumb of all the actions each user performed, which addresses one of
the challenges enunciated in section 1.3.2.
5.2.1
Contributions
The contributions of this entity address the following responsibilities:
• providing distributed user authentication;
• providing decentralised user authorisation;
• generating provenance and traceability information;
• capturing user-generated actions;
• enabling resources’ decentralised hosting;
• rendering compound resources.
The next sub-sections describe the modules addressing each of these responsibilities:
Authentication, Authorisation, Capturing User Actions, Provenance & Traceability
Information, Decentralised Resource Hosting and Rendering Compound Resources.
5.2.1.1
Authentication
The authentication module is responsible for confirming users’ identity, by validating
their FOAF+SSL credentials, thus assuring that a user/agent is who they claim to be.
In order to strengthen security levels on web applications, the process of authenticating
a user is typically performed as soon as possible. In this architecture, the PEP is the
entity responsible for providing end-user authentication because it is the component
that is closer to end-users.
While FOAF profiles are used for identifying and describing each user and social relationships, a global and decentralised Web requires a FOAF-based authentication
method, capable of providing a single cross-domain user identity. Although not neglecting any existing proprietary authentication models, the FOAF+SSL authentication
model has been chosen for this proposal. Whilst not acting as a federated single
sign-on model, this approach can coexist with other existing authentication models
on legacy web domains while providing authentication across multiple web sites and
social networks, without relying on external domains to provide authentication, as
noted in section 2.1.2.
The proposed implementation of the PEP provides means for validating FOAF+SSL
authentication for every user (human or agent). In a distributed FOAF+SSL authentication model the user does not need to be registered on the domain where the
authentication is being enforced and no third-party authentication domains are necessary.
Figure 5.4 depicts the authentication process used. When a request is made to the
domain (steps 1-2) the domain analyses it to check if it is using the HTTPS protocol.
When it is, the request is intercepted (step 3) and a FOAF+SSL-based
authentication module is used to authenticate and validate the user.
To ensure proper authentication, this module must have access to the user’s FOAF
profile and to the corresponding client certificate. This is needed as the identity of
the requesting user is verified by using the FOAF+SSL protocol and its associated
client certificate. When accessing secure resources, the user is requested to select an
SSL certificate that provides his credentials (step 4). After choosing the appropriate certificate associated with the user’s FOAF profile, the login process is resumed
automatically (steps 5-6).
When the SSL handshake succeeds, the user’s WebID is retrieved from the client’s
certificate by analysing the Subject Alternative Name section (steps 8-9). Also, the
certificate’s public key is retrieved (steps 10-11).
From this point onwards, the user’s FOAF profile can be obtained (steps 12-13) and
by using semantic querying (e.g. SPARQL [69]) it is possible to obtain the public key
from the FOAF profile (steps 14-15).
A comparison between the certificate key and the FOAF profile key is performed to check if
the user is who he claims to be (step 16). If validation succeeds, a secure session is
established (step 18). Having in mind legacy web domains, if the requesting user is
not able to provide proper credentials for FOAF+SSL authentication, this module still
forwards the request to the web domain (step 19), but the secure session is not established. In such cases, the authentication process is fully ensured by the web domain
application server and the features proposed by this architecture are not available to
the user.
This authentication module is independent of any specific web application implementation and may be shared by more than one web application.
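The core of the check (steps 8 to 16) reduces to comparing the key presented in the certificate with the keys published in the profile the certificate points to. The sketch below models that comparison only: the TLS handshake, profile retrieval and SPARQL query are abstracted behind a hypothetical `fetch_profile_keys` callable, RSA keys are assumed, and all type and function names are illustrative.

```python
# Hypothetical sketch of the FOAF+SSL key comparison performed by the PEP.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PublicKey:
    modulus: int
    exponent: int


@dataclass(frozen=True)
class ClientCertificate:
    subject_alt_name_uri: Optional[str]  # the WebID claimed by the certificate
    public_key: PublicKey


def authenticate(cert: ClientCertificate,
                 fetch_profile_keys: Callable[[str], set[PublicKey]]) -> Optional[str]:
    """Return the WebID when the certificate key is listed in the FOAF profile."""
    webid = cert.subject_alt_name_uri
    if not webid:
        return None  # no FOAF+SSL credentials: caller forwards the request as-is
    try:
        published = fetch_profile_keys(webid)
    except Exception:
        return None  # profile unreachable or unparsable: treat as unauthenticated
    return webid if cert.public_key in published else None
```

A `None` result corresponds to the legacy path in the text: the request is still forwarded to the web domain, but no secure session is established.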
5.2.1.2
Authorisation
Authorisation is the process of controlling the access of a certain user to a resource,
considering the established access policies for that resource when a user is successfully
authenticated. In this proposal, all resources are categorised according to typical
accessibility states (cf. section 2.1.3.4).
When a resource is private or protected, which means that it is protected by access
policies, the authorisation module is responsible for allowing or denying the requesting
user access to the resource. This module does not have the power to rule
whether a user should or should not have access to a particular resource, but it does enforce
that ruling.
Figure 5.4: PEP: Authentication Process [Sequence Diagram]
This module allows any web domain to provide decentralised authorisation for hosted
resources. Users do not need to be registered in the same domain, and access policies can
be decoupled from the resource and defined elsewhere. The authorisation process is
depicted in figure 5.5.
This authorisation module intercepts the resource request (step 2) in order to check
whether the user has clearance to access the resource, or not. To this end, it obtains
the user WebID from the user’s certificate (step 4) and the URI of the resource being
requested (step 6).
Afterwards, it requests a decision from a PDP (step 8), which can be located outside
the domain. For this purpose, the authorisation module sends the ResourceID URI
and the user WebID.
After receiving the response, if access is denied, the process aborts returning a response
of unauthorised access to the user (step 10). If the user has been granted permission,
the request continues in the web application.
This module only enforces the access permission or access denial received from the
PDP request. Acting as a failsafe mechanism, when communication with the PDP is
not possible, default behaviour is that the resource is never available to the requesting
user.
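The enforcement behaviour, including the fail-safe default, can be sketched as follows. The PDP is represented by a hypothetical `ask_pdp` callable that receives the ResourceID URI and the user WebID; how the decision is transported and how the denial is rendered to the user are left out and would be assumptions.

```python
# Hypothetical sketch of PEP-side enforcement with deny-by-default.
from typing import Callable


def may_proceed(webid: str, resource_uri: str,
                ask_pdp: Callable[[str, str], bool]) -> bool:
    """Enforce the PDP decision; any failure to obtain one means denial."""
    try:
        decision = ask_pdp(resource_uri, webid)
    except Exception:
        return False  # PDP unreachable: the resource is never made available
    # Only an explicit permit lets the request continue in the web application.
    return decision is True
```

Treating anything other than an explicit permit as a denial keeps the module a pure enforcer: it never substitutes its own judgement for the PDP's.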
5.2.1.3
Provenance & Traceability Information
Most work related to data provenance explores existing information and how that
information is stored and reused. At the time of writing, semantic provenance can be represented
in RDF triples according to an extension of the Provenance vocabulary3.
Another aspect related to data provenance is that its adoption on private or protected
shared resources is reduced or non-existent. However, when provenance information
is applied to resources published in LOD repositories, this facilitates the process of
relating resources to their author, without depending on the domain where the resource
is located, thus improving the user identity and the relation between resources and
users.
Despite existing approaches for retrieving and storing provenance information [1, 70,
97, 119, 160], there are still no standards on how to capture and store such provenance
information in a cross-domain environment by using a user’s single identity.
Several questions are raised about capturing UGC on the WWW:
• where should the capturing process be executed: on the client or on the server?
3 http://purl.org/net/provenance/
Figure 5.5: PEP: Authorisation Process [Sequence Diagram]
Figure 5.6: PEP: Provenance and Traceability Entities [diagram; entities: User, Action Sensor, Action, Web Application, Resource, Provenance or Traceability Information; relation labels: from, with, performed by, triggers, performed on, generates, interest in (inferred), hosted by, over]
• how to track UGC made by automated processes in which an agent is working
on behalf of a user?
• where to conveniently store provenance annotations?
• how to store provenance information of UGC that does not actually change the
resource or its meta information?
Most of the proposed features take into account provenance and traceability information. The traceability concept is introduced in the development of this proposal.
Given the core implications of this concept in the proposed features, it is explained in
the following paragraphs.
Traceability is any information that can be traced back to an action, performed by a user or agent, whose result did not influence or alter the
content or meta-information of a resource.
In this proposal, the concept of traceability is applied to user actions upon resources
and not over software engineering processes (cf. section 2.1.4.4). Furthermore, contrary
to provenance information, traceability information only applies to user actions that
do not translate into resource meta-information modification.
Figure 5.6 depicts the relations between the main concepts and entities involved in
provenance and traceability information. While provenance information and traceability annotations are quite similar in representation and meaning, their impact on
resources differs.
When user actions have a direct or indirect impact on resources i.e. by modifying any
of their content or meta-information, the resulting information is called provenance
information and should be recorded according to the provenance ontology. On the
contrary, when those actions do not have an impact on the resource, the resulting
information should be called traceability information instead, and recorded according
to the traceability ontology, which inherits part of its definition from the provenance
ontology.
Traceability information relates users, actions and resources but does not contribute
to the alteration of resources’ content or meta-information. Traceability information is
captured by action sensors (cf. section 5.2.1.4) and represented according to semantic
annotations (cf. section 5.4.1.2).
When properly recorded and managed, traceability annotations will:
• promote and help build a better user profile and contextual information;
• act like a breadcrumb for every resource and corresponding action, which can be
exploited by other users or agents on their behalf.
There are some actions that provide both traceability and provenance information.
For example, when backing up a resource from one location to another, traceability
information is generated regarding the copy action performed by the user on the
original resource and relating it to its copy. Furthermore, the act of creating the backup
copy of the resource generates provenance information solely for the new resource,
because this action actually generates new content.
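The distinction can be expressed as a small decision function: an action yields provenance information when it changes a resource's content or meta-information, and traceability information otherwise. The sketch below also encodes the backup example, which produces one annotation of each kind. The `Kind` enumeration, the dictionary layout and the action labels are illustrative assumptions and not the ontology terms used in the thesis.

```python
# Hypothetical sketch: deciding between provenance and traceability.
from enum import Enum


class Kind(Enum):
    PROVENANCE = "provenance"
    TRACEABILITY = "traceability"


def kind_of(changes_content: bool, changes_metadata: bool) -> Kind:
    """Provenance if the action alters the resource, traceability otherwise."""
    if changes_content or changes_metadata:
        return Kind.PROVENANCE
    return Kind.TRACEABILITY


def backup_annotations(user_webid: str, original_uri: str, copy_uri: str) -> list[dict]:
    """A backup leaves the original untouched but creates new content."""
    return [
        {"kind": kind_of(False, False), "user": user_webid, "action": "copy",
         "resource": original_uri, "related": copy_uri},
        {"kind": kind_of(True, False), "user": user_webid, "action": "create",
         "resource": copy_uri},
    ]
```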
5.2.1.4
Capturing User Actions
This section proposes a conceptual approach for acquiring and publishing information
from user-generated actions and content.
The goal is to propose a system that is able to capture the association between authenticated users and their actions (i.e. reading, uploading, downloading, etc.) upon
resources, hence generating information that should afterwards be treated and stored
in appropriate repositories as provenance or traceability information. The process of
associating a user action to a resource implies the necessity of uniquely identifying the
user, the resource and the action.
By applying FOAF+SSL authentication to a web domain, as described in section
5.2.1.1, it is possible to provide automatic traceability or provenance acquisition over
user actions and UGC, since it becomes possible to identify the user behind each and
every single action and therefore associate the user to the resource in a unique way,
even across different web domains.
Once the user and resource are identified, action sensors are responsible for capturing
relevant information concerning the action taking place or content being generated.
In order to intercept each user action and retain additional information, this component provides and deploys action sensors, which are responsible for intercepting user
actions on resources. Different sensors can be provided for each user action to be
monitored, e.g. reading, writing or downloading a resource. Furthermore,
information generators are used for dissecting each resource according to its type.
The process of acquiring information about UGC starts in the action sensors, as shown
in figure 5.7, but there is interaction with other components involved in the whole
process e.g. the PIP, which is used to publish that information in convenient semantic
representation (cf. section 5.4.1.2).
As depicted in figure 5.7, when a request or response reaches the web domain server,
the request/response is intercepted by the PEP (step 3). Each sensor, according to its
specificity, transforms user actions into traceability or provenance information.
The three main pieces of information required for providing traceability or provenance
information are: the action, the user WebID and the ResourceID URI. Each action sensor,
independently of its implementation, must obtain the user WebID (step 4) and the
ResourceID URI (step 5).
According to the type of action, a provenance or traceability message is created (step
6), containing the information acquired above, plus a timestamp and any additional information specific to the action being captured. The extent of captured
information depends on the specific implementation of each action sensor, which in turn depends on the web system where the sensor is
being deployed. The message is then sent to the PIP for publishing (step 7). After
that, the request proceeds normally to the web domain server.
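Steps 4 to 6 can be illustrated with a minimal sketch of the message an action sensor might assemble. The `ActionMessage` structure, the `build_message` function and the set of actions treated as provenance are assumptions made for illustration; the proposal leaves the concrete message format to each sensor implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed: actions that change content and therefore yield provenance.
PROVENANCE_ACTIONS = {"upload", "write"}


@dataclass
class ActionMessage:
    kind: str          # "provenance" or "traceability"
    action: str
    web_id: str        # user WebID (step 4)
    resource_uri: str  # ResourceID URI (step 5)
    timestamp: str
    extra: dict = field(default_factory=dict)


def build_message(action, web_id, resource_uri, **extra):
    """Step 6: assemble the message that is later sent to the PIP (step 7)."""
    if not web_id or not resource_uri:
        raise ValueError("user WebID and ResourceID URI are both required")
    kind = "provenance" if action in PROVENANCE_ACTIONS else "traceability"
    now = datetime.now(timezone.utc).isoformat()
    return ActionMessage(kind, action, web_id, resource_uri, now, extra)
```

Any sensor-specific details are passed as keyword arguments and end up in `extra`, which mirrors the idea that the extent of captured information depends on each sensor.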
Each sensor is deployed to the same web server as the web application on the PEP
component alongside the authentication module, and any other modules as presented
in figure 5.8. This way, requests are easily intercepted.
This type of approach is characteristically used for malicious purposes, in order to
eavesdrop4 on requests, allowing network intrusion or man-in-the-middle attacks5.
Contrary to these malicious types of attacks, the intention of this module is to extract
value from the user actions.
This process works best in a RESTful architecture [148] where actions are denoted in
the request. When intercepting proprietary web applications that do not implement
such types of architecture, this module may need tweaking (cf. section
6.2.3).
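Why a RESTful architecture simplifies interception can be shown with a short sketch: when the action is denoted in the request, a sensor can derive it from the HTTP method alone. The mapping and the action names are assumptions chosen for illustration.

```python
# Assumed mapping from HTTP methods to monitored actions.
METHOD_TO_ACTION = {
    "GET": "read",
    "POST": "upload",
    "PUT": "write",
    "DELETE": "delete",
}


def action_from_request(method: str):
    """Return the action denoted by the request method, or None if unmonitored."""
    return METHOD_TO_ACTION.get(method.upper())
```

A proprietary application that tunnels every operation through a single method would need additional, application-specific inspection instead of such a table.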
4 https://www.owasp.org/index.php/Network_Eavesdropping
5 https://www.owasp.org/index.php/Man-in-the-middle_attack
Figure 5.7: PEP: Intercepting User Actions [Sequence Diagram]
Figure 5.8: PEP Components Diagram [labels: Webserver, PEP, Upload Sensor, Authentication, Authorisation, Decentralised Resource Hosting, Web Domain 1, Web Domain 2]
5.2.1.5 Decentralised Resource Hosting
While the issue of handling distributed authentication and decentralised authorisation
over each of those resources has been dealt with by the authentication and
proposed authorisation modules, the purpose of this module is to provide decentralised
resource hosting.
As evidenced in the use case described in section 3.2.2 and systematised in section
3.3.10, current web domains are still not prepared for fully decentralised resource
hosting based on the resource type, nor can users specify where a particular type
of resource should be hosted, in a cross-domain perspective.
Contrary to traditional approaches where every resource created by a user is physically
hosted in that web domain, this proposal specifies that resources can be kept according
to the author’s wishes and preferences and therefore be hosted on the domains that
the user prefers, instead of the domain where the user’s action takes place.
This module’s responsibilities (depicted in figure 5.9) consist of:
• obtaining the appropriate domain for hosting the (uploaded) resource content,
according to the resource’s type and the user’s preferences specified in the user’s
profile;
• filtering the request response by seamlessly and automatically embedding resources located in a different web domain.
The upload process is initiated by the user (step 1) and reaches the web domain (step
2). Because the web domain has an upload action sensor capable of monitoring upload
actions, it intercepts the web request (step 3) and uses the decentralised resource
hosting module to override the default location where the resource should be hosted.
By retrieving the user’s WebID from the request (step 4) and the resource type (step
5), the module can contact a PIP in order to obtain the user’s preferred hosting domains for that resource type (steps 6-7). By accessing this preferences list, the module
is responsible for hosting each resource according to the resource type-domain preferences specified in the user’s FOAF profile.
Figure 5.9: PEP: Resource Upload Action Process [Sequence Diagram]
When the resource hosting takes place (step 8), the hosting domain is responsible for
providing the resource URI for the created content (step 9). The returned URI is
injected in the request and the upload request is changed so that the content of the
resource is now replaced by a custom placeholder that specifies the hosting server (step
10). In the end, the upload request proceeds to the web application (step 11). When
the domain finishes its processing, the response is sent back to the user (steps 12-13).
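The upload steps described above can be sketched as follows. This is a minimal illustration, not the module’s implementation: the request is modelled as a plain dictionary, `host_resource` is a hypothetical callable standing in for the hosting domain, and the `[[hosted:...]]` placeholder syntax is an assumption, since the proposal does not define one.

```python
def choose_hosting_domain(preferences: dict, resource_type: str, default: str) -> str:
    """Steps 6-7: pick the user's preferred domain for this resource type."""
    return preferences.get(resource_type, default)


def rewrite_upload(request: dict, preferences: dict, host_resource, default_domain: str) -> dict:
    """Steps 8-10: host the content elsewhere and leave a placeholder behind.

    `host_resource(domain, content)` is a hypothetical callable that stores the
    content on the chosen domain and returns the URI of the created resource.
    """
    domain = choose_hosting_domain(preferences, request["resource_type"], default_domain)
    uri = host_resource(domain, request["content"])
    rewritten = dict(request)  # the original request is left untouched
    rewritten["content"] = "[[hosted:" + uri + "]]"  # assumed placeholder syntax
    rewritten["resource_uri"] = uri
    return rewritten
```

The rewritten request is what would proceed to the web application in step 11; when the user has no preference for a resource type, the sketch falls back to the domain where the action takes place.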
The inverse action also happens when users retrieve resources that have been relocated
to other domains. The process for obtaining such resources is shown in figure 5.10.
For each download request a web domain application receives (steps 1-2), the download
action sensor is responsible for intercepting the response and redirecting it through
the decentralised resource hosting module.
When the response is intercepted (step 5), the module checks its content for redirection
placeholders (step 6).
For each existing placeholder, the referred external resource is retrieved from the
external hosting server (steps 7-8) and its content is injected in the response (step 9).
The response is then sent to the user (steps 10-12). This proposal does not specify the
criteria used for specifying redirection placeholders or how the resource is injected in
the response.
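Although the placeholder format is left open, the download steps can be sketched under an assumed syntax. In the following illustration the `[[hosted:<uri>]]` pattern and the `fetch` callable are hypothetical; `fetch` stands in for the retrieval of the external resource from its hosting server.

```python
import re

# Assumed placeholder syntax; the proposal leaves the actual format open.
PLACEHOLDER = re.compile(r"\[\[hosted:(?P<uri>[^\]]+)\]\]")


def resolve_placeholders(body: str, fetch) -> str:
    """Steps 6-9: replace each placeholder with the externally hosted content.

    `fetch(uri)` is a hypothetical callable returning the resource content as text.
    """
    return PLACEHOLDER.sub(lambda match: fetch(match.group("uri")), body)
```

A response without placeholders passes through unchanged, so the module can be applied to every intercepted response.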
Decentralised resource hosting has several side effects and offers major advantages to
users:
• users are able to host resources in their preferred locations, therefore reducing
security risks related to physical access to resources;
• users’ privacy over resources increases, as users control the location where resources are hosted;
• the duplication or cloning of the same resource in different domains is eliminated,
while the resource can still be shared in more than one domain;
• redundant or even contradictory access policies over several copies of the same
resource are eliminated.
5.2.1.6 Rendering Compound Resources
Compound resources are resources obtained by
combining several other resources into one. Compound resources make use
of other resources that are addressable by a URI. There is no limitation
on whether inner resources can also be compound resources.
Figure 5.10: PEP: Resource Download Action [Sequence Diagram]
Figure 5.11: PEP: Rendering Compound Resources Example [Screenshot]
Given the fine granularity of the whole system (where each single part has a URI,
which can be associated with different access policies), a compound resource can be
presented to each user with different views/content, depending on that user’s access
permissions regarding each individual resource constituting the compound resource.
The authorisation process no longer takes place only over the compound resource URI,
but also over every inner resource that the compound resource has. Consequently, each
individual user can eventually have a very different perception of the same compound
resource depending on whether that user has access to all the inner resources, or not.
When a user requests a compound resource containing resources for which the user has
not been granted access, those resources are automatically filtered out by the client.
How those resources appear to the client when the user has no access permissions
depends entirely on the client application used and is out of the scope of this
work. For example, in a web browsing environment, different web browsers render
the unauthorised access response in different manners. Figure 5.11 shows
that when access to a resource is blocked by access policies, the image is
simply omitted.
Depending on the resource type and how it is rendered on requests, the web domain
where the compound resource is hosted must delegate the credentials of the requesting
user to the inner resources and act on behalf of that user.
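The effect of authorising every inner resource individually can be sketched with a recursive view computation. The `Resource` structure and the `has_access` callable are assumptions for illustration; `has_access` stands in for the authorisation decision taken for each resource URI, and the sketch only shows which parts remain visible, not how a client renders them.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    uri: str
    parts: list = field(default_factory=list)  # inner resources, possibly compound


def visible_view(resource, web_id, has_access):
    """Return the URIs the user may see, as a nested (uri, parts) structure.

    `has_access(web_id, uri)` is a hypothetical callable standing in for the
    authorisation decision taken for each individual resource.
    """
    if not has_access(web_id, resource.uri):
        return None  # the whole (sub)resource is omitted
    views = (visible_view(part, web_id, has_access) for part in resource.parts)
    return (resource.uri, [view for view in views if view is not None])
```

Two users requesting the same compound resource therefore obtain different views whenever their permissions over the inner resources differ.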
5.2.2 Summary
The raison d’être of FOAF+SSL authentication is to provide a user authentication
process that is achieved solely between user and web domain, without the need to rely on third-party authentication parties, and that works in a cross-domain perspective. The adoption
of this type of authentication eliminates the systematised problems of User Multiple
Identity (cf. section 3.3.1) and Cross-Domain Authentication (cf. section 3.3.3).
The Cross-Domain Authorisation problem (cf. section 3.3.4) is solved as the PDP and
access policies are decentralised, meaning that they can be decoupled from the web
domain where authorisation is enforced and the resource is hosted.
Generating and storing contextual information about who performed what action over
which resource and when it happened is the responsibility of action sensors. This
solves the problems related to Contextual Information Loss systematised in section
3.3.5. Activating traceability and provenance modules for tracing user actions over
resources on any existing web application is quite straightforward if the corresponding
action sensors are deployed on the web server where the actions occur.
Providing decentralised resource hosting is the solution for the Cross-Domain Resource
Hosting problem (cf. section 3.3.10), because it enables users to specify their hosting
preferences on their personal profile. Likewise, resources do not have to reside in
proprietary domain servers where user actions take place, thus becoming accessible
either inside or outside those domains.
The problems associated with the Multiple Renderings of Compound Resources (cf. section 3.3.11) are solved as an outcome of the usage of fine granularity access policies
over compound resources and their inner resources, as well as their local enforcement
policies. This provides multiple renderings of the same compound resource depending
on the access policies of each inner resource.
5.3 Policy Decision Point
The PDP is the component responsible for producing the decision of granting or
denying access to a resource. While being one of the simplest components in the
architecture, it plays a major role because it is responsible for evaluating access policies
when users request access to resources.
To provide authorisation over resources, the PDP requires the following fundamental
pieces of information:
• the requested resource URI, to unequivocally identify the resource;
• the requesting user WebID, to unequivocally identify the user
requesting access to the resource;
• the resource’s access policies, defined by the resource author, which provide the
rules to evaluate which users have access to a certain resource.
5.3.1 Contributions
The responsibility of this component is to provide a rule evaluation engine, capable
of evaluating whether a requesting user should have access to the requested resource
according to the resource author’s specified access policies.
The evaluation engine proposed for this component is capable of assessing semantic
rules and provides reasoning capabilities. It acts upon declarative and expressive
access policy rules that are written according to the ontologies used in the system.
The evaluation of access policy rules follows the process depicted in figure 5.12.
The evaluation engine is not responsible for holding or storing resource access policies. When
a request for deciding access permission is received by the PDP (step 1), the access
policies established by the resource’s author are retrieved from the PIP (steps 2-3).
The evaluation engine is started (step 4) and access policies are loaded (step 5).
When access policies comprise rules that need extra information in order to
be evaluated, the PDP may contact the PIP to obtain that information and load it in the
evaluation engine (steps 6-8). For example, when an access policy states that only
family members should have access to a resource, the PIP is requested for the list of
users that are family members of the resource’s author.
The access evaluation query is made to the evaluation engine (step 9), which returns
the result (step 10).
A response stating whether the access has been granted or denied is created (step 11)
and sent back to the requesting entity, i.e. the PEP (step 12).
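The evaluation process can be summarised in a short sketch. The names `decide`, `policies_for`, `facts_for` and `evaluate` are hypothetical: the first models the PDP entry point, the next two model the information retrieved from the PIP, and the last one stands in for the semantic rule engine, whose reasoning is not reproduced here. Denying access when no policy matches reflects a least privilege assumption.

```python
def decide(resource_uri, web_id, pip, evaluate):
    """Return "granted" or "denied" for one access request (steps 1-12).

    `pip` is any object offering `policies_for(resource_uri)` and
    `facts_for(policy)`; `evaluate(policy, facts, web_id, resource_uri)`
    stands in for the rule engine. All of these names are hypothetical.
    """
    policies = pip.policies_for(resource_uri)  # steps 2-3
    for policy in policies:  # steps 4-5: policies loaded into the engine
        facts = pip.facts_for(policy)  # steps 6-8: extra information
        if evaluate(policy, facts, web_id, resource_uri):  # steps 9-10
            return "granted"  # steps 11-12
    return "denied"  # least privilege: no matching rule means no access
```

Because the policies and facts are fetched through the `pip` object, the function holds no state of its own, which matches the idea that the PDP does not store access policies.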
5.3.2 Summary
While this entity relies on a PIP to retrieve access policies, its evaluation takes place
independently of where resources and access policies are located. Furthermore, this
entity does not need to be located in the same domain where resources are hosted,
thus addressing the problem of the Cross-Domain Authorisation mentioned in section
3.3.4.
Figure 5.12: PDP: Access Policy Evaluation Process [Sequence Diagram]
The evaluation engine used in the PDP must be capable of reasoning upon declarative
and expressive access policy rules, therefore capable of dealing with the systematised
issues of Expressivity and Semantics of the policy language (cf. sections 3.3.6, 3.3.8).
Because access policies are located in the PIP and not in the PDP itself, this enhances
the proposed solutions for the Cross-Domain Compliance of the access policy language
(cf. section 3.3.9).
5.4 Policy Information Point
In typical AAA systems, the PIP is the component responsible for storing and retrieving access policies.
However, in the proposed architecture, the PIP component is responsible for performing several tasks related to information storage, publication and retrieval. It acts as
a data broker for repositories that have information about users, resources, access
policies, provenance and traceability annotations. It also allows the generation of
contextual and meta-information about resources in user actions.
5.4.1 Contributions
In the proposed architecture, the PIP has the following responsibilities:
• storage and retrieval of information related to users, resources or their meta-information, as requested by other entities in the architecture/system. Access
policies are a special case of resources that are stored and retrieved by the PIP;
• publication of resource information, namely:
– making information available on registered repositories;
– allowing the PEP component to publish provenance and traceability information;
• generating semantic contextual and meta-information from resources manipulated in user actions.
The next sub-sections describe the modules addressing each of these responsibilities:
Information Storage and Retrieval, Publishing Semantic Information and Creating
Resource Contextual Information.
5.4.1.1 Information Storage and Retrieval
In order to fulfil the mentioned responsibilities, the PIP gathers and groups information
stored in more than one PIP instance in a seamless way, providing the requested
information to the requester (i.e. IDP, PIP, PEP, PDP, PAP or PRP).
In particular, it is able to:
• list all resources for a given user WebID;
• identify a resource’s author;
• retrieve a user’s social network;
• retrieve a user’s actions;
• retrieve a resource’s provenance and traceability information;
• retrieve users’ access policies;
• persist access policies on behalf of the PAP (cf. section 5.5).
Access policies are a special case of resources stored and retrieved by this component, managed and maintained by the PAP (cf. section 5.5.1.1) and used by the PDP
(cf. section 5.3.1). Because PAPs allow users to add, change or delete existing policies,
such changes can be stored again by using this module.
The process by which this component is able to formulate queries and provide query
expansion for information located in remote repositories is not dealt with within the scope
of this proposal. This part of the proposal runs as a black box that can adopt third-party solutions. Yet, this entity should be capable of obtaining information from local
or remote/distributed repositories (e.g. relational databases, websites or triple or quad
stores) or any other kind of information repository or endpoint that may be registered
in the component, provided that a common communication interface is established
between different architectures.
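The requirement of a common communication interface can be illustrated with a minimal sketch in which every registered repository exposes the same query operation. The `Repository` protocol, the `InformationPoint` class and the `(kind, subject)` query shape are assumptions; query formulation and expansion remain a black box, as stated above.

```python
from typing import Iterable, Protocol


class Repository(Protocol):
    """Assumed common interface that every registered repository exposes."""

    def query(self, kind: str, subject: str) -> Iterable[str]: ...


class InformationPoint:
    def __init__(self):
        self._repositories = []

    def register(self, repository: Repository) -> None:
        self._repositories.append(repository)

    def ask(self, kind: str, subject: str) -> list:
        """Merge answers from all repositories, dropping duplicates in order."""
        seen = {}
        for repository in self._repositories:
            for item in repository.query(kind, subject):
                seen.setdefault(item, None)
        return list(seen)
```

A relational database, a website wrapper or a triple store could each be adapted to this interface, so the requester never needs to know where a given piece of information resides.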
5.4.1.2 Publishing Semantic Information
The process for acquiring, storing and publishing meta-information should be kept the
same across all possible applications. Creating and publishing semantically structured
information are the main responsibilities of this module.
This module receives raw provenance and traceability information requests, generated
by action sensors, and transforms them into semantic annotations according to defined
ontologies.
Resources’ contextual and meta-information as well as semantic annotations from
provenance and traceability processes are published in existing and predefined repositories.
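The transformation of a raw request into semantic annotations can be sketched as the production of subject-predicate-object triples. The namespace, the class names and the property names used here are hypothetical placeholders and do not reproduce the provenance or traceability ontologies of this work; the raw message is modelled as a plain dictionary.

```python
EX = "http://example.org/trace#"  # hypothetical ontology namespace


def to_triples(message: dict, annotation_id: str) -> list:
    """Turn one raw sensor message into subject-predicate-object triples."""
    subject = EX + annotation_id
    kind = "ProvenanceRecord" if message["kind"] == "provenance" else "TraceabilityRecord"
    return [
        (subject, "rdf:type", EX + kind),
        (subject, EX + "action", message["action"]),
        (subject, EX + "performedBy", message["web_id"]),
        (subject, EX + "onResource", message["resource_uri"]),
        (subject, EX + "atTime", message["timestamp"]),
    ]
```

The resulting triples could then be written to whichever repositories are registered for publication.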
5.4.1.3 Creating Resource Contextual Information
This module is responsible for analysing and parsing resource content referred to in provenance or traceability annotations. It is triggered when raw provenance or traceability
requests are made to the publishing module (cf. section 5.4.1.2).
Several types of resources e.g. PDF, written documents, photographs and videos already possess embedded meta-information. Yet, the whole process expects explicitly
stated semantic and integrated information and for this reason, such meta-information
must be kept and mapped according to an ontology that captures the necessary semantics and can be shared across the internet.
Some of those resources have properties for author reference but few enable the association between the resource’s identifiable URI and the user’s FOAF profile, hence the
need to create new meta-information. For example, an HTML resource by itself
does not state who the resource’s author is, unless the author has embedded that
information in the resource’s content.
According to the actions performed upon resources, this component is capable of
triggering a meta-information generation process in order to enhance the resource’s
meta-information. Because different types of resources have different types of attached
or attainable meta-information, each meta-information generator deals with a specific
resource type (e.g. photograph, video, audio, HTML webpage). For example, video
files have length and frames per second properties, while music tracks have length
and beats per minute instead of frames per second, requiring different processing.
Information is semantically represented according to the most suitable ontology for
each resource type.
This module is capable of maintaining a list of the most common resource types and
therefore generates semantic annotations according to previously established templates
for each of the resources. The process by which these annotations are generated is out
of the scope of this work.
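The idea of one meta-information generator per resource type can be sketched with a small registry. The decorator, the property names and the shape of the raw input are assumptions; the video and audio properties follow the example given in the text.

```python
GENERATORS = {}


def generator(resource_type):
    """Register a meta-information generator for one resource type."""
    def register(func):
        GENERATORS[resource_type] = func
        return func
    return register


@generator("video")
def video_meta(raw):
    return {"length": raw["length"], "framesPerSecond": raw["fps"]}


@generator("audio")
def audio_meta(raw):
    return {"length": raw["length"], "beatsPerMinute": raw["bpm"]}


def generate_meta(resource_type, raw):
    """Dispatch to the matching generator; unknown types yield no metadata."""
    func = GENERATORS.get(resource_type)
    return func(raw) if func else {}
```

Supporting a new resource type then amounts to registering one more generator, without touching the dispatch logic.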
5.4.2 Summary
This component allows typical retrieval and storage of information, namely related to
users, resources, access policies and provenance and traceability information.
Having provenance and traceability information stored according to specific ontologies
aids in the information retrieval process for several queries, because independently of
where the user actions occur, these are always kept in a formal, explicit and shared
specification, i.e. an ontology.
Retrieving a user’s resource list becomes a simple task, independently of the domain
where the resource is hosted, because provenance or traceability information is provided for every resource. No specific wrappers for every web domain need to be developed
to obtain that information. As a result, queries over the resource’s traceability and
provenance information are automatically supported and the information is returned
independently of where information is hosted.
Each PIP is able to generate semantic provenance and traceability annotations according to the publishing request and store those annotations in appropriate repositories.
For any publishing request, the PIP is responsible for analysing the involved resources
in order to trigger the contextual and meta-information generation module.
Regarding the presented use case scenario, adding contextual and meta-information
to user actions not only increases the amount of contextual information, but can
also boost the discovery of known-unknown and unknown-unknown resources, thus
enhancing the resolution of problems systematised in sections 3.3.5, 3.3.12 and 3.3.13.
Due to privacy concerns, provenance and traceability annotations can be kept private
to the user, shared or made publicly available as any other resource.
5.5 Policy Administration Point
To give users a better, wider and more comprehensive control over their shared resources in a cross-domain perspective, access control shifts to a paradigm in which access policies are defined by rules. In fact, for the described use case scenario (cf. chapter
3) and to simplify users’ sharing of resources, it is important to understand why a user
should have certain rights upon a resource and use that rationale, rather than just
giving explicit access to a user over a resource.
The definition of access control policies follows a prohibitive or least privilege approach
(cf. section 2.1.3), where users must be granted permissions to access a resource.
In order to better handle resource sharing among users, there is a need to manage
each resource’s access control policy in a single (centralised) location, independently
of where the resource is physically hosted. Access policies are stored and retrieved
from PIPs.
In this proposal, access policies over resources are expressive and declarative, and are specified with the characteristics of a particular user, of a resource and of the relationship
between that user and the resource’s author in mind. Following the presented use case scenario, the sharing of resources may occur based on any combination of the following
dimensions:
User Attributes User characteristics are used to grant or deny access e.g. age, gender, interest topics, etc.;
Resource Attributes Resource attributes are used to group resources, provide topics of interest or infer contexts to grant or deny access;
Social Relationships Users’ social networks and relationships can be used to specify
conditions for granting or denying access to resources.
Based on resource attributes, user attributes, and users’ social relationships, this component allows users to define access policies over resources, by describing the implicit
rationale behind the sharing in a way that access restrictions can be semantically
expressed by the means of rules.
These access policies over resources are created by the author (or on the author’s
behalf), by means of rules that are capable of stipulating a very dynamic environment.
To comply with such dynamics, the resource’s access policies must be enforced and
evaluated when a resource is requested, in order to check if a particular user has access
to a particular resource, instead of being discretionarily/statically defined. This is the
role of PEP (cf. section 5.2.1.2).
Resources’ groups to which rules apply, and users’ groups that have access to resources
are dynamically evaluated on each resource request. This translates into access policies that support the continual evolution and growth of information (i.e. users’ and
resources’ attributes, users’ topics of interest, social network and contextual information).
Such dynamics allow the system to automatically grant access to newly added users
on the social network or newly added resources that comply with existing access policies.
To enhance a trusted sharing network, users’ social relationships are used extensively in the
definition of access policies. Resources can only be shared with users that belong to the
resource author’s social network. As a result, the stronger a relationship is between
the author and another user, the higher the probability that the author grants
access to resources for that user, and the opposite can also be inferred. Therefore,
exposing access policies publicly can be harmful for the resource’s author, as it would
expose the author’s social network, preferences and other sensitive information that may
be expressed in such access policies. To this end, every rule definition is kept private,
visible only to the system and the author, with no exceptions allowed.
5.5.1 Contributions
This component is responsible for:
• determining a user’s rights to access a protected resource;
• generating an explanation of why a user has access rights to a resource;
• preventing inconsistencies between access policies that affect the same resource
in opposite ways. For example, by analysing the list of access policies, it exposes
users that have simultaneously been granted and denied access to the same
resource by different access policies.
Figure 5.13: PAP: Static Access Approach [John isAuthor Res1; hasAccess(Res1, Mary); hasAccess(Res1, Jane)]
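The inconsistency check mentioned in the last responsibility can be sketched as a set intersection over the outcomes of all policies. The flat `(policy_id, resource, user, effect)` tuples are an assumption made for illustration; in the proposed system these outcomes would result from evaluating semantic rules.

```python
def conflicting_users(decisions):
    """Map each resource to the users that are both granted and denied access.

    `decisions` is an iterable of (policy_id, resource, user, effect) tuples,
    where effect is "grant" or "deny"; this flat shape is an assumption.
    """
    granted, denied = {}, {}
    for _policy, resource, user, effect in decisions:
        bucket = granted if effect == "grant" else denied
        bucket.setdefault(resource, set()).add(user)
    conflicts = {}
    for resource, users in granted.items():
        both = users & denied.get(resource, set())
        if both:
            conflicts[resource] = both
    return conflicts
```

The result can be presented to the author so that the conflicting policies are revised before they are enforced.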
The next sub-sections describe the modules addressing each of these responsibilities:
Access Policy Management, Dynamic User Grouping and Dynamic Resource Grouping.
5.5.1.1 Access Policy Management
The semantic representation of access policies allows greater expressivity when defining access policies, provides means for standard and compliant representation across
different domains and allows finer-grained access control.
Access policies that connect resources and users are kept in a semantically-explicit
and machine-readable format. Despite the existence of several policy languages (cf. section 2.1.3.2), none is specialised in cross-domain information privacy
and sharing.
For simplicity and demonstration purposes, the SWRL language was chosen for describing and representing access policies, as its design assumes the explicit
specification of semantics through ontologies. The semantics are then complemented
through the rules that relate and constrain the concepts and properties defined in the
ontology.
In most systems, resource access control is an explicit grant/deny relationship established between a resource and a particular user (as depicted in figure 5.13).
For years, sharing has been achieved by authors specifying static access policies that
granted access to their resources. While this approach is still commonly used (e.g.
Facebook, Google Groups), the RBAC approach also became one of the
most used approaches, as it solves part of the authorisation issues (it allows
user grouping).
Figure 5.14: PAP: RBAC Approach [John isAuthor Res1; Mary and Jane inRole Role1; hasAccess(Res1, Role1)]
Figure 5.15: PAP: ABAC Approach [John isAuthor Res1; hasAccess(Res1, age>10); Jane (Age=8); Mary (Age=25)]
In RBAC approaches, access to a resource is given to a group of users that is generically identified by
a role. When a user becomes part of a role that has been granted
access to a resource, that user is automatically given access to that resource (cf. figure
5.14).
More recent, but not yet commonly used, ABAC approaches support the specification of access
policies based on user attributes instead of roles, thus allowing dynamic user groupings. ABAC approaches increase the expressivity level of RBAC by allowing access
policies to be defined in terms of the user’s attributes (cf. figure 5.15).
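The three approaches can be contrasted with a minimal sketch that reuses the users, role and age condition shown in figures 5.13 to 5.15. The function names and the dictionary representation of users are assumptions made for illustration.

```python
def static_access(user, grants):
    """Explicit grants: the author lists every user by name."""
    return user["name"] in grants


def rbac_access(user, allowed_roles):
    """RBAC: access follows from membership in a granted role."""
    return bool(set(user["roles"]) & allowed_roles)


def abac_access(user, predicate):
    """ABAC: access follows from a condition over user attributes."""
    return predicate(user)


mary = {"name": "Mary", "roles": ["Role1"], "age": 25}
jane = {"name": "Jane", "roles": ["Role1"], "age": 8}
older_than_ten = lambda user: user["age"] > 10
```

With the condition `age > 10`, `abac_access` admits Mary and rejects Jane without either of them being named in the policy, whereas the static and role-based checks depend on the author having listed the user or assigned the role beforehand.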
Sharing resources intelligently in a cross-domain perspective requires finer granularity
levels that typical authorisation models like RBAC or ABAC lack. Such systems cannot
be expanded to infer what is a family member or a second-degree relative, because
such knowledge is expressed and obtained through the use of semantically structured
information. Contrary to these approaches, when using an access control mechanism
that allows semantic rules to be specified, access policies go beyond role- or attribute-based approaches.
As previously mentioned, it is more important to understand the rationale behind the
creation of a group (either of users or resources) than to explicitly provide individual
users or roles with access to resources.
In this work, access policies are proposed as simple rules, similar to Horn-like rules,
being able to capture the richness of the access policy in terms of rationale, instead of
discretionary access to resources.
Such rules, when described according to the semantics conveyed by an ontology, specify
the meaning and rationale for the formation of (i) groups of resources to
be shared and (ii) groups of users to share with. With this, resource authors can
stop specifying access policies over roles and assigning users to roles, because agents
understand the rationale implied in the group creation and hence automatically reason
upon it.
Person(?auth) ^ isAuthor(?auth, ?res) ^
isFriend(?auth, ?friend) -> hasAccess(?res, ?friend)
Figure 5.16: PAP: FOAF-based Approach [Me isAuthor Res1 and Res2; Me isFriend Mary and Jane; hasAccess edges between the resources and the friends]
Algorithm 5.1 PAP: Static User Grouping Definition
Resource(PhotographA), User(John) -> hasAccessTo(John, PhotographA)
Resource(PhotographA), User(Mary) -> hasAccessTo(Mary, PhotographA)
Resource(PhotographA), User(Joseph) -> hasAccessTo(Joseph, PhotographA)
In Figure 5.16 the rationale of sharing resources only with friends is captured by using
a simple SWRL rule that makes use of the user’s social network, described in the
user’s FOAF profile.
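How the rule of figure 5.16 derives access can be illustrated without a semantic reasoner. The following sketch hand-codes that single rule over facts stored as tuples; it is not a SWRL engine, and the tuple representation and the function name are assumptions.

```python
def infer_has_access(facts):
    """Derive hasAccess facts from Person, isAuthor and isFriend facts.

    Implements only the rule
    Person(?a) ^ isAuthor(?a, ?r) ^ isFriend(?a, ?f) -> hasAccess(?r, ?f).
    """
    persons = {f[1] for f in facts if f[0] == "Person"}
    authored = [(f[1], f[2]) for f in facts if f[0] == "isAuthor"]
    friends = [(f[1], f[2]) for f in facts if f[0] == "isFriend"]
    return {
        ("hasAccess", resource, friend)
        for author, resource in authored
        if author in persons
        for owner, friend in friends
        if owner == author
    }
```

Adding a new `isFriend` fact or a new `isAuthor` fact changes the derived set on the next evaluation, with no change to the rule itself.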
Further, this modification to the typical access policy management paradigm simplifies
the user’s task of specifying policies, as the rationale for those rules is automatically
expanded and applied to other users that have the same characteristics or relationships
with the resource’s author – allowing a continual growth of the system and access policy
enforcement.
5.5.1.2 Dynamic User Grouping
Referring back to the presented use cases (cf. section 3.2.6), sharing resources with
dynamic groups of users is made possible by enriching existing systems with the user’s
social network and by adopting a more expressive access policy specification, in which
the access control method is aware of dynamic user grouping and semantics.
This gives users the power to describe knowledge concepts that were not present before
(e.g. “family member”), which are semantically defined in an ontology and whose
members can be easily derived from the user’s social network.
This way, instead of explicitly giving direct access for each of the resource author’s
family members to each resource (as demonstrated in algorithm 5.1), users can expand
the access control mechanism to include their FOAF profile information and existing
social relationships. In fact, the proposed features allow the usage of a predicate (e.g.
“isFamilyMemberOf”6 ) that considers any level of family relationship between two
users, automatically giving all family members access to resources, as demonstrated
in algorithm 5.2.
Algorithm 5.2 PAP: Dynamic User Grouping Definition
Resource(Resource1), User(?user), User(?author), hasAuthor(Resource1, ?author),
isFamilyMemberOf(?author, ?user) -> hasAccessTo(?user, Resource1)
Algorithm 5.3 PAP: Static Resource Grouping Definition
Resource(Resource1), User(FamilyMember1) -> hasAccessTo(FamilyMember1, Resource1)
Resource(Resource2), User(FamilyMember1) -> hasAccessTo(FamilyMember1, Resource2)
Resource(Resource3), User(FamilyMember1) -> hasAccessTo(FamilyMember1, Resource3)
With only one rule that captures the rationale behind a specific sharing, as specified
on algorithm 5.2, the author is able to create an access policy whose result evolves
according to the author’s social network.
For example, given the use case mentioned in section 3.2.3, when the author adds
another family member to the social network graph, the new member automatically
has access to the resources accessible by the author’s family members.
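To illustrate how a single rationale-based rule follows the social network, the sketch below (Python, with hypothetical names throughout) treats `isFamilyMemberOf` as the symmetric, transitive closure of directly declared family links. This is only one possible reading of "any level of family relationship"; an ontology could define the predicate differently. Access is computed on demand, so a newly added relative is covered without writing any new rule.

```python
from collections import deque


def family_of(author, links):
    """Return everyone reachable from `author` through family links (any depth)."""
    neighbours = {}
    for a, b in links:  # each declared link is treated as symmetric
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {author}, deque([author])
    while queue:  # breadth-first traversal of the family graph
        person = queue.popleft()
        for relative in neighbours.get(person, ()):
            if relative not in seen:
                seen.add(relative)
                queue.append(relative)
    return seen - {author}


def has_access_to(user, resource, authors, links):
    """Algorithm 5.2 read as a check: is the user a family member of the resource's author?"""
    return user in family_of(authors[resource], links)


# Hypothetical individuals and links.
authors = {"Resource1": "Ana"}
links = {("Ana", "Bruno"), ("Bruno", "Carla")}
before = has_access_to("Dinis", "Resource1", authors, links)
links.add(("Carla", "Dinis"))  # the social network grows; the rule is unchanged
after = has_access_to("Dinis", "Resource1", authors, links)
```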
5.5.1.3 Dynamic Resource Grouping
The same type of dynamic grouping is applied to resources when specifying access
policies. Resource sharing rules include semantic predicates that draw on
resources’ contextual and meta-information, enabling dynamic resource grouping.
When authors wish to share a particular set of resources with family members, it is
essential to capture the rationale leading to grouping those resources, i.e. by identifying
similarities between those resources, instead of having to write rules for each one of the
shared resources (cf. algorithm 5.3).
Instead, authors can make use of resource properties to characterise all the resources in
a group, e.g. those created on a specific date, between two dates or at a specific location. This dynamic form of resource characterisation/classing/grouping allows sharing
to be achieved with a simpler and more elucidating rule specification (cf. algorithm
5.4).
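As a rough illustration of property-based grouping, the Python sketch below computes the group of shared resources from a predicate over resource meta-information (here a creation-date range) instead of enumerating the resources. The record layout, identifiers and the `shared_group` helper are hypothetical; in the proposed design this selection is expressed as a semantic rule, not as application code.

```python
from datetime import date

# Hypothetical resource meta-information; the date echoes the one used in Algorithm 5.4.
resources = [
    {"id": "Photo1", "author": "Author", "created": date(2013, 5, 20)},
    {"id": "Photo2", "author": "Author", "created": date(2013, 5, 21)},
    {"id": "Photo3", "author": "Other", "created": date(2013, 5, 20)},
]


def shared_group(resources, author, start, end):
    """Resources by `author` created between `start` and `end` (inclusive)."""
    return [
        r["id"]
        for r in resources
        if r["author"] == author and start <= r["created"] <= end
    ]


single_day = shared_group(resources, "Author", date(2013, 5, 20), date(2013, 5, 20))
two_days = shared_group(resources, "Author", date(2013, 5, 20), date(2013, 5, 21))
```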
6 This predicate can be created by the user or it can be used on the system by importing an
ontology that is capable of representing this domain.
Algorithm 5.4 PAP: Dynamic Resource Grouping Definition
Resource(?resource), User(?author), User(FamilyMember1),
hasAuthor(?resource, ?author), creationDate(?resource, "20-05-2013") -> hasAccessTo(FamilyMember1, ?resource)
5.5.2 Summary
The introduction of semantic rules-based access policies allows more than just giving
explicit permissions over a particular set of resources.
The presented approach can still deliver the same levels of sharing achieved by ABAC
and RBAC approaches, but enhances those by allowing the explicit semantic classification/grouping of users and resources based on their properties and relations.
This is possible because the PDP is capable of dynamically inferring which group of
resources belongs to a certain class of resources, and which group of users belongs to a
certain class of users, hence dynamically establishing access permissions between users
and resources.
Summarising, the introduction of ontologies for describing knowledge and the adoption
of ontology-based semantic rules for describing access policies provide information expressivity, expansion, semantics and cross-domain compliance, solving the systematised
problems described in sections 3.3.6, 3.3.7, 3.3.8 and 3.3.9.
5.6 Policy Recommendation Point
The recommendation point recommends known-unknown and unknown-unknown resources to users in a cross-domain perspective, by exploring the information gathered
by the system (namely user profiles, social network relationships, provenance and
traceability information).
Having an access policy definition like the one proposed in section 5.5, based on similarities between resources, users or domain knowledge, goes half-way towards enabling
an automatic recommendation system based on information such as FOAF profiles,
interest topics and contexts to provide the sharing of resources as depicted in figure
5.17.
Traditionally, the responsibility of sharing resources always comes down to the resource’s author, based on his restricted perception/knowledge of the whole network of
users and resources. Resource access policy recommendation is a process introduced to widen that vision: the system notifies the resource author when other
users would probably benefit from, or enjoy, having access to a particular resource.
[Figure: graph with nodes John, Jane, Crishtoph, Res1, Res2 and Res3 and edges labelled isFriend, isCoworker, isAuthorOf, hasAccess and similarContext]
Person(?author), Person(?coworker), Resource(?author_res),
Resource(?coworker_res), isCoworker(?author, ?coworker),
isAuthorOf(?author, ?author_res), isAuthorOf(?coworker, ?coworker_res),
similarContext(?author_res, ?coworker_res) -> hasAccess(?author_res, ?coworker)
Person(?author), Person(?friend), Resource(Res1), isFriend(?author, ?friend) ->
hasAccess(Res1, ?friend)
Figure 5.17: PRP: FOAF and Context-based Rules
The access policy recommendation process aids resource authors in granting or denying
access to existing resources by making use of similarity factors between resources and
social relationships, suggesting which users should be given access to each resource.
It also eases resource authors’ task of sharing resources by finding similar access policies
that could be reapplied to similar resources. It is envisaged that recommendation can
aid users in the access policy management process regarding their resources, and give
other users access to resources that would not have previously been accessible to them.
This is achieved by enriching and enhancing the access policy recommendation process
with existing users’ and resources’ meta-information, and creating a hybrid recommendation method capable of understanding not only the concepts of users and resources
but also provenance and traceability annotations gathered from user actions.
A resource context is produced by the analysis of each resource’s content and meta-information, while a relationship context is created based on the existing relationship
depth [156] between users, each user’s profile, linked resources and consequent relationships.
One of the outcomes of this proposal is the creation of semantic rules that match
similarities between contexts [59]. Therefore, for every resource or relationship, a
context is generated and multiple contexts may exist for the same resource.
5.6.1 Contributions
This component is responsible for:
• the implementation of a hybrid recommendation engine;
• guiding users through the resource-sharing process by suggesting access policies
for their resources:
– by evaluating feedback actions regarding the acceptance or rejection of recommended resource sharing;
– by preventing rejected recommendations from being recommended again;
• recommending known-unknown and unknown-unknown resources.
5.6.1.1 Hybrid Recommendation Engine
When an application responsible for ensuring access control is aware of all users’ resources and social relationships, that application is capable of recommending resources
to new users that have recently become part of any of the resource author’s social networks.
This already happens in typically closed applications (e.g. Slideshare, ResearchGate,
etc.) but is still not being used in a cross-domain perspective for all user resources.
Contrary to such closed environments, this proposal consists of performing this task
in a cross-domain perspective.
The recommendation is enhanced with semantic information for cross-domain web
applications relying on an open and distributed social network based on FOAF profiles,
provenance and traceability information.
Users’ public resources are used in the recommendation process to enable associations
between users, between resources or between users and resources. Although these
resources are already publicly accessible, they are still recommended
because other users who do not know of their existence may eventually have an interest
in them.
The proposed recommendation consists of a hybrid approach accomplished by the
combination of users’ profiles, resources’ meta-information, traceability, provenance
annotations, social network analysis and domain knowledge as depicted in figure 5.18.
The semantic filtering relates to the described and systematised problems of recommending known-unknown (cf. section 3.3.12) and unknown-unknown (cf. section
3.3.13) resources that users had little or even no knowledge about.
The recommendation service is built on top of three filtering methods that are
capable of dealing with different sets of information, as displayed in table 5.1.
[Figure: Social Information, Provenance/Traceability and Semantic Information feed the Social, Provenance/Traceability and Semantic Recommenders, whose recommendations are combined by an Aggregator into the final Recommendations]
Figure 5.18: PRP: Hybrid Recommendation Information
[Table: input information (URI, Content, Meta-Information, Topic Preferences, Context, Attributes, Social Network, Actions, Provenance, Traceability, Other Ontologies) for Resources and Users, per filtering method (Content, Collaborative, Semantic)]
Table 5.1: Filtering Methods Input Information
The following methods are therefore suggested in the PRP:
Content-based Filtering Method Recommends existing resources by comparing
resource attributes, content and meta-information to the user’s profile attributes
and topic preferences in order to verify the resource’s relevancy to the user. This
relevancy is given by the similarity between resource attributes and the user’s
topic preferences. The content-based filtering method is enriched mainly by
exploiting resources’ content, resources’ generated meta-information and users’
interest topic preferences.
Collaborative Filtering Method Recommends resources based on the following
pairs of connections: (users, users), (resources, resources) and (users, resources).
This process is content-agnostic, meaning that it only recommends resources
based on these collaboration patterns, where similarities between users linked to
resources are used to infer other new possible connections between users and resources. The collaborative filtering method mainly uses information that relates
users’ actions to resources.
Context-based Filtering Method Recommends resources that match the proposed
user’s topic preferences or semantically related topics. This filtering method
expands the capabilities of the content-based filtering method by introducing
reasoning over knowledge concepts. When the user context and resource context
match, the recommender system recommends that resource to the user. Interest topics are semantically described, providing not only hierarchical relations
between topics but also a graph of other connections between semantic information. Contexts are obtained through the usage of ontologies and semantic
rules that provide grounding to this filtering method. The filtering method is
enriched by semantic information derived from multiple domains, which include
users’ FOAF profiles, topic interests, social network graphs, resources’ meta-information, provenance and traceability annotations.
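Figure 5.18 shows an aggregator combining the outputs of the individual recommenders, but the text does not state how they are combined. The Python sketch below assumes, purely for illustration, that each filtering method returns a score in [0, 1] per (user, resource) pair and that the aggregator takes a weighted average. The weights, scores and the `aggregate` function are hypothetical.

```python
def aggregate(scores_by_method, weights):
    """Combine per-method scores into one ranking by weighted average (an assumed scheme)."""
    total_weight = sum(weights.values())
    combined = {}
    for method, scores in scores_by_method.items():
        for pair, score in scores.items():
            # A method that did not score a pair contributes zero to that pair.
            combined[pair] = combined.get(pair, 0.0) + weights[method] * score
    ranked = {pair: value / total_weight for pair, value in combined.items()}
    return sorted(ranked.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for one proposed user and two resources.
scores_by_method = {
    "content": {("Jane", "Res1"): 0.8, ("Jane", "Res2"): 0.2},
    "collaborative": {("Jane", "Res1"): 0.4},
    "context": {("Jane", "Res2"): 1.0},
}
weights = {"content": 1.0, "collaborative": 1.0, "context": 2.0}
ranking = aggregate(scores_by_method, weights)
```

With these made-up weights the context-based score dominates, so `Res2` is ranked first even though the content-based method prefers `Res1`. Other aggregation schemes (voting, cascading filters) would be equally compatible with the figure.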
While the recommendation process runs continually, it is triggered by several changes
in the system, namely:
User Generated Content, when users create, edit or change an existing resource,
resource content is analysed by specific meta-information generators that generate semantic information. The recommendation process is triggered because
changes in content might affect the result of the content-filtering (i.e. new content can be added or removed), collaborative-filtering (i.e. changing or adding a
resource increases the number of times the resource has been accessed) and
context-filtering (i.e. changing content may derive new context information)
methods.
User Generated Actions, when users perform actions over resources, they are implicitly building their profile. When their profile changes, it is necessary to trigger a recommendation process because a change in a user profile might suggest
access to other resources as it influences the collaborative and context-filtering
methods. Notice that revoking access permission might also be suggested if the
resource has evolved over time and is no longer applicable. If the resource has not changed and has been shared before, it makes no sense to
revoke access rights because the resource might have been duplicated by others
elsewhere.
Access Policy Modification, when users create, change or remove access policies,
the recommendation process is triggered because users may now have access to
resources that they did not have before, which also influences the collaborative-filtering method.
Social Network Changes, whenever a user becomes part of or is removed from
another user’s social network. In fact, this process is quite similar to the addition
of new resources because a new user is actually a special case of a new resource
that is identified by a corresponding URI. As a result, the user’s context might
change – which would trigger the recommendation process. The inclusion of
a new relationship might change a user’s context, which has an impact on the
resources the user may have access to.
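The triggers above can be summarised as a mapping from change events to the filtering methods that need to be re-run. The following Python sketch encodes that mapping where the text is explicit. The event names and the `methods_to_rerun` helper are hypothetical, and the entries for access policy modification and social network changes list only the methods the text names, which is an assumption about their full effect.

```python
# Event names are hypothetical labels for the four triggers described above.
TRIGGERS = {
    "user_generated_content": {"content", "collaborative", "context"},
    "user_generated_action": {"collaborative", "context"},
    "access_policy_modification": {"collaborative"},  # the text names at least this method
    "social_network_change": {"context"},  # assumption: only the user's context changes
}


def methods_to_rerun(events):
    """Union of the filtering methods affected by a batch of change events."""
    affected = set()
    for event in events:
        affected |= TRIGGERS[event]
    return affected


pending = methods_to_rerun(["access_policy_modification", "social_network_change"])
```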
5.6.1.2 Notifications & Feedback
When the recommendation process succeeds in recommending access to resources, the
resource author is notified with a message containing:
• the resource to be shared;
• the user with whom the resource is being shared;
• an explanation of why the resource is being recommended.
When resource sharing is recommended, the system checks with the resource’s author
if he wishes to assign the access privilege to the proposed user. If the author wants
to assign the privilege to the proposed user, the PRP takes the necessary actions to
notify the proposed user as demonstrated in figure 5.19.
When authors accept resource sharing recommendations, these are translated into
access policies over resources.
The author may receive recommendation notifications of access policies granting access
to users that may not be part of his social network. When sharing is recommended
to users outside the author’s social network, the inclusion of that user in the author’s
social network must be achieved prior to the sharing act, otherwise sharing is not
permitted. To this end, the inclusion of a new relationship is proposed (cf. section
2.1.3.5). If accepted, the author’s FOAF profile is changed accordingly.
[Figure: flowchart. Start; Find resources to recommend; Found resources? (No: Enlarge Content Filtering Spectrum); Compare Resources context against proposed user context interest; Matching Context Resources Found? (No: Do not perform Recommendation; Yes: Notify Resource Owner); Resource owner wishes to assign the privilege? (Yes/No)]
Figure 5.19: PRP: Recommendation Process
The proposed user who should be given access to the resource also receives a notification message stating:
• that a resource exists that might be suitable for the user;
• an explanation of why the resource is being recommended.
Each user receives a list of resources that were shared with him, and a request to
express whether or not each resource is relevant to him, thus providing feedback to
the recommender system as shown in figure 5.20. This feedback is captured in the
form of traceability information (cf. section 5.2.1.4) and will be used as supporting
information.
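The notification and feedback cycle can be sketched as follows. This Python fragment is an illustration under assumed names (the `RecommendationLog` class and its methods are hypothetical): an accepted recommendation is recorded as a stand-in for an access policy, while a rejected one is remembered so that it is not proposed again, as required in section 5.6.1.

```python
class RecommendationLog:
    """Tracks the author's responses to sharing recommendations (illustrative only)."""

    def __init__(self):
        self.policies = set()  # accepted (user, resource) pairs, standing in for access policies
        self.rejected = set()  # pairs that must not be recommended again

    def respond(self, user, resource, accepted):
        """Record the author's decision on one recommendation."""
        target = self.policies if accepted else self.rejected
        target.add((user, resource))

    def filter_new(self, candidates):
        """Drop candidates that were already accepted or rejected."""
        return [
            c for c in candidates
            if c not in self.policies and c not in self.rejected
        ]


log = RecommendationLog()
log.respond("Jane", "Res1", accepted=True)
log.respond("Mary", "Res1", accepted=False)
remaining = log.filter_new([("Jane", "Res1"), ("Mary", "Res1"), ("John", "Res1")])
```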
5.6.1.3 Known-Unknown and Unknown-Unknown Resources
In order for a system to be able to recommend the sharing of known-unknown and
unknown-unknown resources, it must be possible to establish associations between
resources, between users and between users and resources that cannot be
established by means of content or collaborative analysis. The semantic-filtering method
uses ontologies to map existing information and allow the inference of new knowledge
by providing associations between resources that would not have been related before.
[Figure: activity diagram involving the Resource Owner, PAP, PIP, Action Sensor and Proposed User: privilege recommendation notification, owner response (accept or deny), privilege assignment and storage in the Privilege Database, provenance capture and storage in the Provenance Database, notification of the proposed user and feedback]
Figure 5.20: PRP: Recommendation Notifications & Feedback
In order to better understand what an unknown-unknown resource is, let us refer
to the unknown-unknown discovery use case described in section 3.2.8. In this use
case, an unknown-unknown resource is any resource generated at the event in which
a passer-by could appear but to which that passer-by would never have access, because no connection exists
between those passers-by and the social event except that both co-existed in the same
place for a certain period of time.
By enriching the system with ontologies capable of performing deductive reasoning
about events, space and time (from multiple and different sources of information), the
architecture infers that the passer-by was located at the same place and time as the social
event and therefore presumes a possible association between passers-by and
the social event resources.
In use case 3.2.8, resources have been annotated as showing a person who was recognised
but not identified from the resource author’s social network. That unidentified
person has no relationship with the resource’s author or any other guest at the social
event because that person was a passer-by in a public place.
This unidentified person represents any possible user that may be interested in some
of the photographs of the place, not because of any relationships with the users but
because that person was at the same time and place where those photos were taken and
could eventually appear in one or more. It is possible to narrow down the possibilities
of people that could be passers-by at that location and time if the unidentified person
is in the same context in which the photos were taken, and as a result recommend
the resource sharing to that unidentified user, by using the following information:
• user profile;
• user contextual information:
– users’ geo-referenced position;
– users’ geo-referenced position’s time;
• resource creation time and location;
• provenance and traceability information from user actions:
– event records of their physical performance while practicing sports (cf. section 3.1.3).
When the system discovers which unidentified users were at the same time and place,
by comparing their location at a given time with the resource time and location, the
resource’s author is notified in order to share those photos with those particular users.
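A minimal Python sketch of this time-and-place matching is given below. The distance and time thresholds, the record layout, the coordinates and the function names are all assumptions made for illustration; the proposal relies on ontologies and semantic rules for this inference, not on hard-coded thresholds.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine formula, Earth radius 6371 km)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))


def co_located_users(resource, positions, max_m=100, max_dt=timedelta(minutes=10)):
    """Users with a position fix close to the resource's creation place and time."""
    return sorted({
        p["user"]
        for p in positions
        if abs(p["time"] - resource["time"]) <= max_dt
        and distance_m(p["lat"], p["lon"], resource["lat"], resource["lon"]) <= max_m
    })


# Hypothetical photo and geo-referenced position records.
photo = {"lat": 41.3000, "lon": -7.7400, "time": datetime(2013, 5, 20, 15, 0)}
positions = [
    {"user": "Runner", "lat": 41.3002, "lon": -7.7401, "time": datetime(2013, 5, 20, 15, 4)},
    {"user": "Late", "lat": 41.3002, "lon": -7.7401, "time": datetime(2013, 5, 20, 18, 0)},
    {"user": "Far", "lat": 41.4000, "lon": -7.7400, "time": datetime(2013, 5, 20, 15, 0)},
]
candidates = co_located_users(photo, positions)
```

Only the user whose position fix is both near the photo's location and within the time window is returned as a candidate for a sharing recommendation.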
This type of recommendation can only be derived if different resources’ contexts are
matched. In this situation, time and location create the context for the presented
resources. Nevertheless, this is just an example of a possible context. The conditions
for specifying contexts can be fully captured by ontologies and semantic rules, thus
being easily extended and reused by multiple recommendation systems.
5.6.2 Summary
Like any other recommendation process, the one proposed is based upon three main
parts: users, resources and associations between users and resources. Yet, provenance
and traceability annotations, users’ social awareness, list of interest topics, resources’
and users’ context are used in the recommendation process to infer users’ interest in
resources.
The access policy recommendation process aids resource authors in granting or denying
access to existing resources by making use of similarity factors between resources and
social relationships and suggesting which users should be given access to each resource.
Access policies are recommended for easing the process of sharing resources with other
users, when the recommendation process determines that another user might be interested in a particular resource.
The addition made to the recommendation engine – of a semantic-filtering method that
is capable of using contextual information mentioned in section 3.1.4 – allows recommendation of known-unknown (cf. section 3.3.12) and unknown-unknown (cf. section
3.3.13) resources.
The usage of semantic contexts enhances the recommendation process in a way that
users can now be offered unknown-unknown resources that they did not even
know existed.
On a closed and proprietary domain, with less provenance and traceability annotations,
it would not have been possible to recommend unknown-unknown resources without
proper inference and context matching.
Referring back to the use case scenario, proposing the sharing of unknown-unknown
resources is also useful to other users that might have some interest in the photos
taken at that place (independently of the time they were taken), even when those
users were not in the same place at the same time. For this reason, the resource’s
author might even be open to the idea of sharing resources that he does not explicitly
consider private because they are, for example, resources just showing landscape and
do not harm any of the event guests’ privacy.
Typical existing recommendation engines used for these types of systems have a “limitation” which only allows them to deal with users, resources and associations between
the following pairs: (users, users), (resources, resources) and (users, resources). Because
of this, for ontologies to be handed to the recommendation engine, domain experts and
preprocessing might be required, depending on the particular domain knowledge being
used and the chosen recommender system. This preprocessing of the semantics might
have to occur in order to translate ontologies’ expressivity into associations between
those mentioned pairs of information.
5.7 Summary
The lack of confidence in single resources, when globally examined, goes against the
goals of a WoT [63] and diminishes the trust layer of the semantic web stack [86]. On
the other hand, an architectural model with the features presented in this work would
help a WoT to take shape and converge in a faster and error-free manner.
In summary, the proposed system architecture provides the following features and
functionalities:
• user-generated actions are captured and kept in order to enable context creation;
• user-generated content provides additional contextual semantic annotations;
• each resource can be hosted on a different domain according to each resource’s
type (e.g. photograph, document, video, html, text) and user preference;
• for compound or composed resources, different views are created depending on
the user’s access policies regarding that composed resource content;
• the resource author is recommended with new access policies that would facilitate
sharing resources with other users;
• it allows discovering resources that users did not even know existed;
• users can be given a list of resources which match their interests or contexts,
even though specific names and content are not shown unless the author gives
them permission to access it, i.e. the resource author will know which user is
requesting access but the requesting user does not know who the author is;
• semantically enhanced recommendations, allowing the creation of contexts (e.g.
time and space) for resources and users;
• a distributed and decentralised administration of access policy rules where access
policies are decoupled from resources, enforcement point or evaluation engine;
• access policy rules are used to capture the sharing rationale instead of a specific
assignation of permissions between users and resources;
• one access policy can simultaneously give one or more users access to one or more
resources, based on user or resource attributes or semantically inferred context;
• the permissions to access a resource are not statically defined but instead dynamically evaluated when the access enforcement takes place, based on the information available at that moment.
The next chapter shows how the proposed design features are deployed in a web
environment and applied to legacy applications.
Part III Evaluation
Chapter 6 Prototype
This chapter is divided into eight sections. The first six sections describe how
a generic implementation of the architecture can be achieved on the Web. Section seven
specifies how it is possible to implement the proposal in a way that coexists with legacy
Web applications. Section eight provides a summary of what was achieved with the
prototype.
The proposed architecture can be fully deployed in different environments (even a
non-web one) as most of the components can be deployed in either a server-side or client-side approach. Yet, in order to provide a working prototype as proof of concept for
the architecture, a web environment was used based on W3C recommendations or
standards for the WWW. For this reason, some of the illustrations and texts in this
chapter use terminology that is mostly found in a web environment.
This prototype comprises several entities depicted in figure 6.1, respecting the
architecture and the proposed design. The entities are implemented as web applications or services that demonstrate how the proposed features can be deployed in a web
environment and specifically for existing web applications.
The developed prototype allows users to acquire a new identity and credentials, manage their user profile, social networks and relationships. It also allows managing
resource access policies and recommendation of resources.
In order to provide some notion of how the architecture can be used with legacy web
applications, several instances of the Wordpress (v3.0.4) web application were improved
with the features proposed in this work, providing a full-fledged FOAF+SSL platform,
capable of gathering provenance and traceability over UGC.
6.1 Identification Provider Point
The IDP developed for this prototype has the following responsibilities:
• to provide identities and generate authentication credentials for new users;
• to manage the user’s profile, interest topics, social networks and relationships;
• to provide distributed authentication to other applications by issuing a message
stating whether or not users have been successfully authenticated through their
credentials.
[Figure: component diagram showing Wordpress (Authentication, Upload Module), the IDP (Relying Authentication Party, Profile Management), the PEP (Wordpress Upload Sensor, Authentication, Authorisation), the PIP (Publish Information), the PDP (Evaluation Engine), the PRP (Access Recommendation), the PAP (Access Policies Management) and Decentralised Resource Hosting (Resource Hosting, Resource Download), connected over HTTP and HTTPS]
Figure 6.1: Prototype [Components Diagram]
Figure 6.2: IDP: FOAF+SSL Profile Creation [Screenshot]
6.1.1 User Identity Creation
In order to take advantage of what FOAF+SSL authentication has to offer, each user
must be identifiable by a FOAF profile to which an SSL certificate is associated.
In line with the aims of the IDP, a service was developed to assist users when creating a
FOAF profile and associating an SSL certificate to it. The process for generating new
identities for a user is implemented as described in section 5.1.1.1.
This process follows a two-stage approach. In the first phase, users provide personal
information in order to fill their internal identity and generate their user FOAF profile.
Such information must be entered into a web page provided by the IDP. A screenshot
representing an example layout for the FOAF profile creation is shown in figure 6.2.
The generation of the FOAF profile is initiated by the user when the “FOAF me!”
button is pressed. This submits the request to the IDP, which validates and generates
the corresponding FOAF profile in RDF format. The profile is stored in an information
point.
Figure 6.3: PIP: RDF FOAF Profile Example [Screenshot]
Figure 6.4: SSL Certificate Associated to FOAF Profile [Screenshot]
In the second phase of the process, an SSL certificate is generated on the client side,
which works as the user’s credentials. This step has been separated from the first
stage because more than one SSL certificate can be associated with a FOAF profile.
The SSL certificate generation takes place locally, i.e. in the user’s client browser,
when the user presses the “create cert” button presented in figure
6.2. The SSL certificate is generated with the FOAF profile URI in the Subject
Alternative Name. The SSL certificate is automatically installed in the user’s client
browser. A screenshot of an SSL certificate showing that association is depicted in
figure 6.4.
To attach the SSL certificate to the FOAF profile, the user’s FOAF profile is also
updated with the SSL certificate’s public key. Figure 6.3 shows an example of a
FOAF profile in RDF format, where the SSL certificate’s public key is depicted on the
Subject Alternative Name section.
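The two-way association between certificate and profile can be illustrated with plain data structures. In the Python sketch below the certificate and the profile store are simplified dictionaries with hypothetical field names, URI and key values; no real certificate parsing or RDF handling is performed. It checks that the profile URI carried in the certificate's Subject Alternative Name points to a profile that lists the certificate's public key.

```python
def certificate_matches_profile(certificate, profiles):
    """True if the profile named in the certificate lists the certificate's public key."""
    profile = profiles.get(certificate["subject_alt_name_uri"])
    if profile is None:
        return False  # the URI does not resolve to a known profile
    return certificate["public_key"] in profile["public_keys"]


# Hypothetical, simplified data; a profile may list several keys because
# more than one certificate can be associated with it.
PROFILE_URI = "https://idp.example.org/people/jane#me"
profiles = {PROFILE_URI: {"public_keys": {"key-A", "key-B"}}}
good_cert = {"subject_alt_name_uri": PROFILE_URI, "public_key": "key-B"}
bad_cert = {"subject_alt_name_uri": PROFILE_URI, "public_key": "key-Z"}
unknown_cert = {"subject_alt_name_uri": "https://idp.example.org/people/nobody#me",
                "public_key": "key-A"}
```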
Figure 6.5: IDP: Social Network Management [Screenshot]
Figure 6.6: IDP: Topic Preferences Management [Screenshot]
6.1.2 Profile Management
Just like other user profiles on the internet, FOAF profiles can be updated. To maintain their FOAF profiles, users must undergo a FOAF+SSL authentication
process, as described in section 6.2.1.
After successful authentication, FOAF profiles can be updated. To this end, a webpage was developed allowing users to manage their FOAF profile, i.e. their personal
information (internal identity) and relationships (social identity) (cf. figure 6.5).
Figure 6.6 shows an example of the topic preferences that are added to a user’s profile.
These topics are provided by semantic concepts and properties and are preloaded into
the system by a domain knowledge expert. These examples are related to the animal
kingdom ontology representation.
6.1.3 Relying Authentication Party
The relying authentication party is a service provided by the IDP. This service is
implemented as described in the proposed design (cf. section 5.1.1.3). It allows
requesting web applications to receive a signature attesting that a user has been successfully authenticated by the FOAF+SSL authentication method.
Figure 6.7: SSL Certificate Selection [Screenshot]
This type of service is useful especially when dealing with legacy applications that
are not prepared to attest the user’s FOAF+SSL credentials used in the authentication process. Such legacy applications can delegate the authentication process to this
relying authentication party, which becomes responsible for attesting the user’s authentication credentials. The relying authentication party issues a response to the legacy
application stating whether or not the user was successfully authenticated. This
process is detailed in section 6.7.1.
6.2
Policy Enforcement Point
The policy enforcement point developed as a prototype is an entity composed of:
Authentication Module Performs user authentication for any request regarding resources, hosted in the same domain, under a secure connection;
Authorisation Module Enforces access policy control over resources hosted in the
same domain under a secure connection;
Upload Action Module Provides an upload action sensor prototyped specifically
for Wordpress.
6.2.1
Authentication Module
To be given access to any private or restricted information on the IDP, PAP, PIP
or PRP, users must provide valid FOAF+SSL credentials to successfully log in to the
system.
Surveys show that companies are moving their web applications to be accessed under
secure HTTPS instead of insecure HTTP [62]. Following this trend, in this prototype,
private and protected resources are only available under a secure HTTPS connection
that is subjected to an authentication challenge and authorisation evaluation.
Thus, when accessing any of these resources, the user is required by the browser (as
shown in figure 6.7) to select an SSL certificate that provides the user credentials.
After choosing the appropriate certificate associated with the user FOAF profile, the
login process is performed without any further user interaction.
6.2.2
Authorisation Module
This module communicates with the PDP in order to evaluate the access rights of a user to
a resource. This only happens if FOAF+SSL authentication takes place successfully;
otherwise, the request is forwarded to the application. It is implemented as stated in
section 5.2.1.2.
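The branching described above can be illustrated with a minimal sketch. It is not the prototype's code: the function name `enforce`, the `pdp_decide` callback and the returned labels are hypothetical and only mirror the rule that the PDP is consulted solely for requests that passed FOAF+SSL authentication.

```python
from typing import Callable, Optional


def enforce(webid: Optional[str], resource_uri: str,
            pdp_decide: Callable[[str, str], bool]) -> str:
    """Return what the PEP does with one request: 'forward', 'grant' or 'deny'."""
    if webid is None:
        # FOAF+SSL authentication did not succeed: the PDP is not consulted
        # and the request is handed to the web application as-is.
        return "forward"
    # Authenticated request: ask the PDP (hypothetical callback) for a decision.
    return "grant" if pdp_decide(webid, resource_uri) else "deny"
```

In this sketch a missing WebID stands for an unsuccessful FOAF+SSL authentication; how the real module represents that state is not specified in the text.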
6.2.3
Upload Action Sensor Module
UGC is automatically captured as requests are intercepted by the action sensors deployed in the PEP. In this prototype, the focus is given to intercepting resource upload
actions.
To intercept resource upload actions, an upload action sensor was developed for
Apache, capable of detecting an upload request coming from a web client, based on
typical upload request characteristics of a RESTful web environment. To generate
provenance information about each action, the action sensor must know the identity
of the user performing the action as well as the resource’s identity.
Action sensors deployed on the PEP are only capable of obtaining the user’s WebID if
they intercept requests submitted under a secure connection, which is not the case of
Wordpress. Therefore, Wordpress was modified to accept FOAF+SSL authentication
by relying on an external party (cf. section 6.7.1). For this reason, the secure
authenticated session with the IDP is established over the Transport Layer only once per
Wordpress session. Thereafter, Wordpress uses session cookies on the Application
Layer so that the user remains logged in.
The Wordpress upload page is hosted under a non-secure connection. As a result, the
action sensor is incapable of obtaining user identification from the upload request,
because it does not exist in the request scope. Only Wordpress webpages that are
available under a secure connection (as the new uploaded resources) are subjected to
authentication and authorisation on the Transport Layer (cf. section 6.7.2). Besides,
the resource-identifying URI is only acquired after the resource is hosted.
For the above-mentioned reasons, the upload action sensor is not capable of retrieving
either the user’s WebID or the ResourceID URI from the request. To circumvent
both issues, the user’s WebID and the ResourceID URI are captured from the
response generated by the Wordpress upload action. This process is slightly different
from the one proposed because the upload action sensor is being used on a legacy web
application.
The resource upload process is shown in figure 6.8.
The resource upload process starts when the user chooses to upload a resource and
submits it (steps 1-6). In the developed extension for Wordpress, the uploaded resource
is hosted under another domain (steps 7-8). The user’s WebID and resource URI are
added to the resource upload response.
The specially developed Wordpress upload action sensor intercepts the upload action’s
response (step 11). After intercepting the upload response, the upload action sensor
obtains the user’s WebID (step 12) and the newly uploaded resource’s URI (step 13), and
generates provenance information about the action (step 14). It initiates a request to
a pre-configured PIP1 in order to publish that provenance information (step 15) and
then delegates the HTTP response back to the web application, which forwards it to the
client (steps 16-18).
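A minimal sketch of the sensor's behaviour on an intercepted upload response follows. It makes several assumptions that are not in the prototype description: the parameter names `X-WebID` and `X-Resource-URI`, the shape of the provenance record and the `publish` callback (standing in for the request to the pre-configured PIP) are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Callable, Mapping

# Hypothetical names of the extra parameters added to the upload response.
WEBID_PARAM = "X-WebID"
RESOURCE_PARAM = "X-Resource-URI"


def sense_upload(response_params: Mapping[str, str],
                 publish: Callable[[dict], None]) -> dict:
    """Build and publish a provenance record from an upload response."""
    webid = response_params[WEBID_PARAM]            # who performed the action
    resource_uri = response_params[RESOURCE_PARAM]  # what was uploaded
    record = {
        "action": "upload",
        "performedBy": webid,
        "over": resource_uri,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    publish(record)  # stands in for the request sent to the PIP
    return record
```

The response itself would then be passed on unchanged to the web application; that step is omitted here.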
6.3
Policy Decision Point
This entity was prototyped according to what was proposed in section 5.3.1. The main
responsibility of this entity is to evaluate access policies in order to issue a statement
on whether or not a requesting user should have access to a requested resource.
The process starts when an access decision is requested by the PEP. Access policies
are defined using a semantic rule language. For this reason, the prototyped evaluation
engine uses the Jena Semantic Web framework because it is capable of understanding
structured documents and ontologies thus providing reasoning capabilities.
In order to evaluate if a user has access to a resource, the resource author’s access
policies are requested from the PIP and loaded into the Jena framework. A SPARQL
query, similar to the one presented in algorithm 6.1, is created and executed in the
Jena framework to determine the user’s access to the resource. If there are results to
the executed SPARQL query, then the user has access to the resource, otherwise the
access is rejected. The decision is reported back to the PEP.
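The decision rule (any result means access is granted, no result means it is rejected) can be sketched in plain Python. This is only an illustration: it replaces Jena and SPARQL with a scan over an in-memory list of triples, assumes that the reasoner has already materialised the access statements, and uses hypothetical function names.

```python
HAS_ACCESS = "auth:hasAccess"  # property queried in the access query


def accessible_resources(triples, user):
    """Emulate the SELECT: every ?resource such that <user> auth:hasAccess ?resource."""
    return {obj for (subj, pred, obj) in triples
            if subj == user and pred == HAS_ACCESS}


def decide(triples, user, resource):
    """Grant access only if the query yields the requested resource."""
    return resource in accessible_resources(triples, user)
```

For example, with the single triple `("user:me", "auth:hasAccess", "res:photo1")`, `decide` grants `user:me` access to `res:photo1` and rejects any other request.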
1 The PIP that is contacted by the upload action sensor is pre-configured. In order to increase system responsiveness and avoid downtime, a URI can also be configured for an online pool of PIPs.
Figure 6.8: PEP: Resource Upload Process [Sequence Diagram]
Algorithm 6.1 SPARQL Access Query
PREFIX auth: <http://foafserver.dei.isep.ipp.pt/auth.owl#>
PREFIX user: <http://foafserver.dei.isep.ipp.pt/profiles/John.rdf#>
SELECT ?resource
WHERE { user:me auth:hasAccess ?resource }
6.4
Policy Information Point
The prototyped PIP has the following responsibilities, as described in the following
subsections:
• hosting and retrieving resources;
• publishing provenance and traceability information;
• allowing user’s manual addition of contextual information to resources.
6.4.1
Hosting and Resources Retrieval
It allows other entities to host or retrieve resources, namely FOAF profiles, access
policies and other resources originating from UGC.
This component allows UGC (i.e. uploaded resources) to be hosted and changed
under a secure connection, and to be securely retrieved. Access policies, in particular,
are requested by the PDP and PAP.
When a new user is created (cf. section 6.1.1), his generated FOAF profile contains
the user’s internal and social identity and is hosted by a PIP in order to be available
to other users and user authentication challenging components (cf. section 6.2).
6.4.2
Publishing Provenance and Traceability Information
This process, which takes place when the PIP receives a request to publish traceability
or provenance information, is responsible for:
• creating the provenance and traceability semantic annotations according to an
extension of the Provenance Vocabulary Core Ontology Specification;
• generating contextual meta-information according to the resource type (PNG
and PDF meta-information generators are added to the PIP);
• publishing provenance, traceability and contextual meta-information;
• triggering the recommendation process (cf. section 6.6), e.g. when a new user is
added to a social network or new resources are generated.
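The type-dependent generation of contextual meta-information can be pictured as a registry of generators keyed by resource type. The sketch below is an assumption about structure, not the prototype's implementation: the names `GENERATORS`, `png_meta` and `contextual_meta` are hypothetical, and only a PNG generator that reads the image dimensions from the file header is shown.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_meta(data: bytes) -> dict:
    """Read width and height from the IHDR chunk that opens every PNG file."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return {"type": "image/png", "width": width, "height": height}


# Hypothetical registry: one generator per supported resource type.
GENERATORS = {"image/png": png_meta}


def contextual_meta(media_type: str, data: bytes) -> dict:
    """Dispatch to the generator for this type; unsupported types yield nothing."""
    generator = GENERATORS.get(media_type)
    return generator(data) if generator else {}
```

A PDF generator would be registered in the same way under its own media type.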
Figure 6.9: PAP: Resources Listing [Screenshot]
6.4.3
User’s Manual Addition of Contextual Information to
Resources
This module allows users to manually and explicitly state that a resource is related to
some concept. Users can enhance resources’ meta-information by manually associating
one or more semantic interest topics to each available resource by pressing the “Manage
Interests” button that is available on the same webpage where resources are managed,
as depicted in figure 6.9.
This option displays a webpage that allows semantic topic interests to be associated
to resources (as shown in figure 6.10), which is similar to the page for associating topic
interests to the user. Meta-Information that is associated to the resource is kept in a
repository accessible by the PIP.
6.5
Policy Administration Point
This entity allows users to manage their resources’ access policies. It provides a simple
interface as depicted in figure 6.9. The visual interface allows users to perform the
maintenance of resources’ access policies spread across multiple web domains in a
cross-domain perspective.
In order to list all the resources belonging to a user and their access policies, the prototype
must contact the existing PIP to retrieve such information. User resources are listed
as shown in figure 6.9, independently of the web domain where they are hosted.
From this point, users can perform the following actions upon each resource:
• show and manage access policies;
• explicitly state that a resource is public;
• manage interest topics related to the resource.
Figure 6.10: PIP: Resource’s Related Semantic Topic Management [Screenshot]
Figure 6.11: PAP: Resource Access Policy Definition Example [Screenshot]
To make a resource public, disregarding any existing access policies related to it,
users can press the “Set Public” option, depicted in figure 6.9 on the right
of the resource.
The access policy management complies with the proposed features (cf. section 5.5.1)
and relies on semantic rules that provide access to a resource. An example of the
editing of one of those rules is shown in figure 6.11.
6.6
Policy Recommendation Point
This entity provides a prototype to some of the proposed features identified in section
5.6.1, namely:
• collaboratively recommending access policies to resources by users;
• guiding users through the resource-sharing process.
Figure 6.12: PRP: List of Recommended Access Policies [Screenshot]
Figure 6.13: PRP: List of Recommended Resources Notification [Screenshot]
The recommendation service runs continually behind the system and each user can
list the results of the recommendation process. The listing of resources’ access policies
that can be recommended to other users is shown in figure 6.12.
The proposed user is the user that will have the privilege of executing the action
over the resource. In the first example, user “SafariFoafServer” is recommended with
“Read” access over resource https://foafserver.dei.isep.ipp.pt/resources/foafserver.dei.
isep.ipp.pt/profiles/rafppopera/resource_2f2d4ec939520fd6f83154c079dced29.jpg.
From this list, users have the option to grant or deny access (cf. figure 6.12). When
resource authors grant privileges over their resources to the proposed users, proposed
users are notified about their new access right to the protected resources. In the same
entity, users can access a list of notifications as shown in figure 6.13.
6.7
Wordpress Integration
In order to ensure that it is possible to reuse the proposed architecture in existing
web domains, this section presents a prototype that enables an existing web content
management system, namely Wordpress, to comply with the proposed features.
Wordpress supports several types of authentication through available plugins. Still,
it basically relies on a proprietary domain-centralised mechanism, which requires registration or federated authentication from local users (e.g. OpenID-Connect).
FOAF+SSL, however, is not a natively supported authentication method for Wordpress.
In order to keep the number of changes to the Wordpress web application to a minimum, a Wordpress extension
was developed specifically to handle FOAF+SSL authentication.
The developed extension has the following responsibilities:
• provide support for FOAF+SSL authentication in the login webpage by
– adding a new authentication option;
– delegating FOAF+SSL authentication to an external relying party (i.e.
IDP);
• automatically register first-time FOAF users in Wordpress’s repository when
authentication is successful;
• generate an appropriate session cookie to keep the user logged in to Wordpress
after FOAF+SSL authentication takes place;
• allow resources to be relocated to other hosting domains and upload actions to
generate provenance information.
6.7.1
User Authentication
The Wordpress default authentication relies on a username/password combination
that is stored in the web application proprietary repository.
Adding a FOAF+SSL authentication method to the Wordpress web application does
not replace Wordpress’s centralised authentication as depicted in figure 6.14, such
that regular users registered on the local web application can still perform the native
username/password authentication.
The addition of FOAF+SSL authentication to the Wordpress application is grounded
on the usage of:
• an external IDP that provides distributed FOAF+SSL authentication for Wordpress;
• a PEP that is responsible for intercepting HTTPS requests;
• a Wordpress extension (as mentioned above).
Figure 6.14: Wordpress: Authentication Options [Activity Diagram: on a client login request, the authentication extension performs FOAF+SSL authentication if that option is chosen, otherwise Wordpress authentication is performed, and a response is returned]
FOAF+SSL authentication is automatically provided to any existing web application,
if that web application is hosted under a secure connection, by automatically using
the PEP deployed on the same domain.
If the legacy web application is not accessible through a secure connection, then it
must redirect FOAF+SSL authentication requests to an external relying party. After
challenging the user for credentials, it redirects back to the legacy web application
with a response of whether or not it was able to perform the authentication.
In the Wordpress case, the web application must use an external IDP to attest the
user credentials. The authentication process is delegated to the IDP relying party as
demonstrated in figure 6.15.
6.7.1.1
Request Redirection
As depicted in figure 6.15, when a user starts the login process (steps 1-3) on the
Wordpress home page, a new option for FOAF+SSL authentication appears, as shown
in figure 6.16. By pressing this option (step 4), the authentication request is redirected
to an external pre-configured IDP (steps 5-6), whose URI location can be stored in one
of Wordpress’s configuration files.
The external IDP handles the authentication process (step 7) as described in the
proposed design for an authentication relying party (cf. section 5.1.1.3). When
FOAF+SSL authentication is successful, a signature is generated, containing the user’s
WebID encrypted with the IDP’s private key, and the request is redirected back to
Wordpress (step 8) together with the generated signature.
Figure 6.15: Wordpress: FOAF+SSL Authentication Process [Sequence Diagram]
Figure 6.16: Wordpress: Authentication Extension Login Page [Screenshot]
6.7.1.2
Validating Authentication Response
When the request is received from the relying authentication party (step 9), Wordpress
validates the received signature from the IDP (step 10), by obtaining the IDP’s public
key (steps 10-11) and decrypting/validating the signature (step 12). If the signature
is successfully validated, it is possible to obtain the user’s WebID from the exchanged
message (step 13).
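The sign-then-validate exchange between the IDP and Wordpress can be illustrated with a deliberately toy example. The sketch below uses textbook RSA with a tiny, insecure key (p = 61, q = 53) and only the Python standard library; the key values, the hashing step and the function names are assumptions chosen for illustration and say nothing about the key sizes, padding or message format used by the actual prototype.

```python
import hashlib

# Toy RSA key pair (insecure, illustration only): n = 61 * 53.
N = 3233   # public modulus
E = 17     # public exponent, known to Wordpress
D = 2753   # private exponent, known only to the IDP


def _digest(webid: str) -> int:
    """Reduce the WebID to a number smaller than the toy modulus."""
    return int.from_bytes(hashlib.sha256(webid.encode()).digest(), "big") % N


def idp_sign(webid: str) -> int:
    """IDP side: transform the WebID digest with the private key."""
    return pow(_digest(webid), D, N)


def wordpress_verify(webid: str, signature: int) -> bool:
    """Wordpress side: undo the transformation with the IDP's public key."""
    return pow(signature, E, N) == _digest(webid)
```

A signature produced by `idp_sign` for a given WebID is accepted by `wordpress_verify`, while any altered signature value is rejected.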
6.7.1.3
Wordpress User Registration
Web applications that support FOAF+SSL authentication do not require existing
users to register before the first usage. Nevertheless, because the Wordpress web
domain relies on a proprietary and centralised user management system, it requires
all users to be registered in the Wordpress domain.
This specifically developed Wordpress extension allows new FOAF+SSL users to be
automatically added as a local Wordpress user when they first authenticate on the
web application using FOAF+SSL.
Wordpress makes use of the user’s WebID to check if the user whose authentication has
been challenged and successfully validated is already registered in the web application.
This is achieved by comparing the user WebID to the “user_url” field in the Wordpress
users table.
If no user record is found for the provided WebID, it means that this is the first time
the user is performing authentication in this web application/domain. This extension
allows Wordpress’s proprietary user registration process to take place automatically
with no user intervention as the minimum required information to register a Wordpress
user (i.e. WebID, nickname and e-mail address) is obtained from the authenticated
user’s FOAF profile (step 15). Thereafter, this information is used to register the user
in the Wordpress web application (steps 17-18).
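The lookup-then-register logic can be sketched as follows. The list of dictionaries stands in for the Wordpress users table, and only the “user_url” field comes from the description above; the other field names, the FOAF profile keys and the function name are hypothetical.

```python
def find_or_register(users, webid, foaf_profile):
    """Return (user, created) for the WebID, registering the user on first login."""
    for user in users:
        # The WebID is compared with the "user_url" field of each stored user.
        if user["user_url"] == webid:
            return user, False
    # First authentication in this domain: register with the minimum
    # information taken from the FOAF profile (illustrative field names).
    new_user = {
        "user_url": webid,
        "nickname": foaf_profile["nickname"],
        "user_email": foaf_profile["email"],
    }
    users.append(new_user)
    return new_user, True
```

A second call with the same WebID finds the stored record and does not register the user again.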
6.7.1.4
Session Generation
For Wordpress to run normally after a user is properly authenticated via FOAF+SSL,
it is necessary to create a session for that user. The developed extension
identifies the user and retrieves the user’s information from Wordpress’s
repository in order to create a user session cookie. This cookie ensures a
successful log in to the web application.
If a user registers and the provided WebID is retrieved, the extension obtains more information from the local internal Wordpress repository using the user’s WebID (steps
20-21). Then, a user session is created for the authenticated user, through the generation of the user’s session cookie (step 22). The last step redirects the user to the
Wordpress home page, together with the session cookie (step 23).
6.7.2
Resource Upload
Wordpress allows users to upload resources. In this prototype, the content-hosting
server for uploading requests was modified. The location where the resources reside is
overridden and the response from the upload webpage was modified.
The developed extension for Wordpress makes slight modifications to the Wordpress upload page
so that extra parameters are added to the HTTP response in order to
carry both the WebID of the user performing the action (step 9) and the newly created ResourceID URI (step 10), as previously shown in figure 6.8.
This is a necessary step for legacy applications so that provenance information can be
captured by the corresponding action sensor (cf. section 6.2.3). Firstly, this allows
the generation of provenance information regarding the action. Secondly, uploaded
resources can now be managed according to the architecture’s access policies. Finally,
resources uploaded to Wordpress can now be dereferenced not only in the Wordpress
domain, but also on other domains.
6.7.3
Rendering Compound Resources
In Wordpress, users can maintain several blogs and publish different kinds of information. This prototype was developed to allow the act of publishing photography
resources on the user’s main blog. This main blog acts as a compound resource,
because it can contain several different other resources (e.g. posts, text, photographs).
While the blog is publicly available with no restrictions to any user, using the proposed features of this work makes it possible for Wordpress users to provide different
visualisations of the same blog to different users – according to the access policies that
the user configures for each individual resource showing up in the blog.
Access policies over resources are defined in the PAP component (cf. section 6.5) and
are not Wordpress proprietary. This is possible because the uploaded photographs’
content is being hosted in a decentralised resource repository, decoupled from Wordpress’s hosting component (cf. section 6.7.2), which automatically hosts resources under secure connections, on different and independent hosting domains.
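The per-viewer rendering of a compound resource can be sketched as a filter over the resources embedded in the blog. This is an illustration under assumptions: policies are reduced to a mapping from resource URI to the set of WebIDs allowed to read it, `PUBLIC` marks unrestricted resources, and all names are hypothetical.

```python
PUBLIC = None  # marker for resources explicitly set as public


def render_blog(resource_uris, viewer, policies):
    """Return the resources of a blog that the given viewer is allowed to see."""
    visible = []
    for uri in resource_uris:
        allowed = policies.get(uri, set())  # resources without a policy stay hidden
        if allowed is PUBLIC or viewer in allowed:
            visible.append(uri)
    return visible
```

Two viewers requesting the same blog therefore receive different renderings whenever the policies of the embedded resources differ between them.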
6.8
Summary
Testing a web server component on a big and fully operational website like Facebook,
Google+ or any other of the kind was an unrealistic option. As a consequence, several
of the previously proposed features were developed and implemented on a web domain,
used in combination with the Wordpress web framework that was engineered to accept
FOAF+SSL authentication. For testing purposes, different FOAF profiles were created
to simulate different user identities. For each profile, several social relationships were
added in order to recreate a social network. Several different Wordpress domains were
created to simulate several different hosting services for each of the different resource
types.
The system’s evaluation consisted of each user maintaining a blog in one of the
several Wordpress application servers where resources could be uploaded and shared.
Each resource, according to its type (i.e. music, photo or video) was automatically
uploaded by the framework into a specific repository instance only responsible for
hosting those types of resources. The PIP resides in the same domain as the Wordpress
hosting server, but it may reside elsewhere. In either situation, when resources are
uploaded, the PIP must be able to access the resource in order to generate contextual
information, where applicable.
After setting access policies over the resources and relationships, users would only
access other users’ blog renderings according to the access policies that had been
defined for each individual resource.
Multiple renderings of the same webpage were achieved through the usage
of different access policies for each user. Uploaded resources are physically hosted
on the repository that had been set up for that purpose, according to users’
preferences and resources’ type.
Each Wordpress blog webpage only holds blog entries and does not keep any copy
of the uploaded resources, which are fully hyperlinked. While no major
performance impact was detected, one might suspect that in a true WWW experiment,
such impact might occur depending on the network performance provided by each of
the individual hosting services being used.
This prototype allows web applications to have: (i) decentralised resource hosting,
obtained by combining a PEP with an upload action sensor and a decentralised
resource hosting module; and (ii) multiple renderings of the same compound resource,
obtained by combining an IDP that provides a relying authentication party with a PEP that
enforces authorisation.
Action sensors can be reused by other web applications that run on the same domain
and several action sensors can coexist in the same domain. However, just like in the
presented prototype, each action sensor may have to be enhanced for each specific web
application.
Although the tests proved successful, hence complying with all the proposed system integration goals, experiments and evaluation were carried out in order to assess the PRP’s
contribution to the envisaged goals, as described in the next chapter.
Chapter 7
Experiments
The aim of the experiments was to prove that even with a large dataset of information,
semantic information would improve existing algorithms. For that, a larger set of
information is required.
Mahout was selected as the recommender engine. It is a framework that provides
advanced expansion features and makes use of collaborative filtering. As proposed in
section 5.6, the recommender engine should feature a hybrid mechanism that makes
use of collaborative, content and semantic filtering techniques. Natively, Mahout
does not provide content or semantic filtering techniques, as these must use domain-specific approaches and cannot be meaningfully represented in Mahout [120]. In
order to provide this support with content and semantic filtering techniques, Mahout’s
recommendation process was modified to enable the aggregation of similarities between
items and between users, together with Mahout’s similarities generation (cf. section
7.2.4).
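One simple way to picture the aggregation of several similarity values for the same pair of users (or items) is a weighted mean. The sketch below is only an assumed example: the function name, the use of a weighted mean and the weights are hypothetical, and the aggregation actually adopted is the one described in section 7.2.4.

```python
def aggregate_similarity(similarities, weights):
    """Combine similarity scores for one pair into a single weighted mean."""
    if not similarities or len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity value")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(similarities, weights)) / total_weight
```

For instance, a collaborative similarity of 1.0 with weight 3 and a semantic similarity of 0.0 with weight 1 aggregate to 0.75.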
Conducting the evaluation in a real-world setting would be time-consuming and would
face the cold-start problems typically associated with collaborative filtering techniques.
For these reasons, it was decided that the system should be evaluated according to an
existing dataset (cf. section 7.1.1).
The information provided in such dataset is interpreted according to the concepts
and properties of the system model (cf. section 7.1.3), allowing the simulation of real
human interaction with the system.
The rest of this chapter is organised in four main sections as follows. The first section
describes the interpretation of the existing dataset to be exploited in the access policies
recommendation. The second section demonstrates how the recommender system was
set up and how its typical behaviour is changed in order to make use of semantic
information. The third section describes the recommendation experiments, including
the setup configurations and achieved results. The last section summarises the chapter.
Table 7.1: Analysed Datasets
Recommendation Concepts | LastFM Dataset | MovieLens Dataset | Delicious Dataset
User | ✓ | ✓ | ✓
Item | Artist; Tag | Movie; Genre; Director; Actor; Tag; Location | Tag; Bookmark
User Actions | Listening; Tagging | Tagging; Rating | Bookmarking; Tagging
Items Meta-Information | Tag | Genre; Director; Actor; Tag; Location | Tag
User-Item Associations | (User, Artist); (User, Tag) | (User, Movie); (User, Tag) | (User, Tag); (User, Bookmark)
User-User Associations | (User, User) | - | (User, User)
Item-Item Associations | (Artist, Artist); (Artist, Tag) | (Movie, Genre); (Movie, Director); (Movie, Actor); (Movie, Tag); (Movie, Location) | (Tag, Bookmark)
User Preferences | Possibly interpreted from: (User, Tag); (User, Artist) | Possibly interpreted from: (User, Movie); (User, Tag) | Possibly interpreted from: (User, Tag); (User, Bookmark)
7.1
Dataset Preparation
Evaluating this system requires its operation by several users, during a run-in period
of time, for it to be able to collect enough information to be processed. Due to
time constraints and the technological and business context, it was not possible to carry out the
experiments in a real-world context. In order to simulate the operation of the proposed
system, it was necessary to adopt data from other real-world experiments that could
be interpreted for this goal.
7.1.1
Source Datasets
Several datasets used in the Second International Workshop on Information Heterogeneity and Fusion in Recommender Systems (Hetrec’2011)1 , shown in Table 7.1, were
analysed in order to assess their suitability for the desired evaluation.
The MovieLens dataset is excluded because it does not provide sufficient information
to relate users, thus preventing collaborative filtering.
1 http://grouplens.org/datasets/hetrec-2011/
Figure 7.1: Source Datasets [Diagram; concepts: lastfm:User, lastfm:ListenAction, lastfm:TagAction, lastfm:Tag, lastfm:MusicalArtist, musicbrainz:MusicalArtist, freebase:MusicalArtist, freebase:MusicalGenre; properties: knows, performs, isPerformedBy, over, sameAs, hasGenre, subSubGenre]
After a careful inspection of the content of the LastFM and Delicious datasets, it
was clear that both would support collaborative filtering, but content in the LastFM
dataset would provide more useful information than the one in the Delicious Dataset,
thus promoting the content and semantic filtering. For this reason, LastFM was the
chosen dataset for the experiments as it suits the evaluation needs, considering a
carefully planned interpretation and mapping to the ontology used in the system. The
LastFM dataset is further enhanced with data from the Freebase and MusicBrainz
datasets.
Figure 7.1 depicts the entities and associations from the LastFM online music dataset2 ,
Freebase dataset3 and MusicBrainz4 dataset. It is worth noticing that MusicBrainz’s
Musical Artist is used for the single purpose of data integration between the LastFM and
Freebase datasets.
A description for each concept is presented:
User [LastFM] Contains information about Users and their bidirectional friend relationships.
Musical Artist [LastFM] Contains information about LastFM’s Musical Artists5 .
It provides Musical Artist identification and a Uniform Resource Locator (URL)
that dereferences the homepage in the LastFM online system.
Tag [LastFM] Contains Tags6 that are created by Users and associated to Musical
Artists. It provides identification and value for each Tag.
Listen Action [LastFM] Represents Listening Actions7 performed by Users over
Musical Artists. It is the association between each User and the listened Musical
Artist.
Tag Action [LastFM] Contains Tag Actions8 , performed by Users over Musical
Artists. It provides information that relates User, Musical Artist and Tag.
Musical Artist [Freebase] Contains Freebase’s Musical Artists9 and related Musical Genres.
Musical Genre [Freebase] Contains Freebase’s Musical Genres10 , together with their
sub-genres.
Musical Artist [MusicBrainz] Contains MusicBrainz’s Musical Artists11 .
2 The namespace “lastfm” or “lfm” may be used to denote the LastFM dataset.
3 The namespace “freebase” or “fb” may be used to denote the Freebase dataset.
4 The namespace “musicbrainz” or “mb” may be used to denote the MusicBrainz dataset.
5 This concept may be presented as “lastfm:MA” or “lfm:MA”.
The mapping between the recommendation conceptual entities, the music domain
entities and the source datasets is represented in Table 7.2, i.e. how the musical entities
and their associations are interpreted.
7.1.2
Domain Ontology Mapping & Integration
Due to the lack of integration and explicit semantics of the source datasets, it is
necessary to derive and integrate the implicit semantics from the existing datasets
into a domain ontology12 capturing the necessary semantics, shown in figure 7.2.
The mapping stage is responsible for converting the source datasets into a domain
ontology. This mapping process is depicted in figure 7.3. The dotted lines represent
mappings from the source datasets to the domain ontology.
6 This concept may be presented as “lastfm:T” or “lfm:T”.
7 This concept may be presented as “lastfm:LA” or “lfm:LA”.
8 This concept may be presented as “lastfm:TA” or “lfm:TA”.
9 This concept may be presented as “freebase:MA” or “fb:MA”.
10 This concept may be presented as “freebase:MG” or “fb:MG”.
11 This concept may be presented as “musicbrainz:MA” or “mb:MA”.
12 The namespace “domain” may be used to denote the domain ontology, which represents a Music Domain.
Table 7.2: Entities Mapping
Concepts | Domain entities | LastFM | Freebase | MusicBrainz
Domain | Music | Music | Everything | Music
User | Person | User | - | -
Resource | Musical Artist | Musical Artist | Musical Artist | Musical Artist
Resource Identification | URI | URI | URI | URI
Resource Content | - | - | - | -
Source of Resource Semantics | Musical Artists association to Musical Genre | Tags are reconciled against Musical Genres | Musical Artist association to Musical Genre | -
 | Musical Genre Hierarchy | - | Musical Genre Hierarchy | -
 | Person association to Musical Genre | - | - | -
Provenance | Musical Artist Creation | Interpretation: First tagging action upon a Musical Artist | - | -
 | Musical Artist Tagging | Interpretation: All tagging actions upon a Musical Artist (except the first) | - | -
Traceability | Action: to listen (Musical Artist) | (user, artist) | - | -
Associations | (User, Musical Artist) | (User, Musical Artist) | - | -
 | (User, Musical Genre) | (User, Tag) | - | -
 | (Musical Artist, Musical Genre) | - | (Musical Artist, Musical Genre) | -
 | (Musical Artist, Musical Artist) | (Musical Artist, MusicBrainz Musical Artist) | (Musical Artist, MusicBrainz Musical Artist) | (LastFM Musical Artist, MusicBrainz Musical Artist)
 | (Musical Genre, Musical Genre) | - | (Musical Genre, Musical Genre) | -
[Figure 7.2: Domain Ontology. Concepts: domain:User, domain:Action, domain:MusicalArtist, domain:MusicalGenre. Properties: knows, performs, over, likes, hasMusicalGenre, hasSubGenre.]
Each lastfm:User individual gives rise to a domain:User individual. Listen and Tag actions are combined into the general domain:Action concept because the Mahout recommender system does not distinguish between different types of user actions. Each LastFM Musical Artist individual gives rise to a domain Musical Artist individual.
These data transformations are declaratively specified and executed by one of the
many existing tools for data transformation and integration (e.g. MAFRA Toolkit13
or NeOn Toolkit14).
The original lastfm:Tag individuals are interpreted as domain Musical Genre individuals, since they result from users manually tagging each Musical Artist. Yet, while these user actions complement the Musical Artists with associations to Musical Genres,
the original LastFM dataset itself does not provide information about each Musical Artist
and its related Musical Genres. In order to simulate the generation of semantic
information when UGC is captured, an enrichment process provides
the association between domain:MusicalArtist and domain:MusicalGenre.
In order to enrich the domain dataset with users’ preferences for musical genres it is
necessary to transform the information in the source dataset (i.e. Tags) into semantic
content by performing the mapping represented as mapping “a)” in figure 7.3.
Domain Musical Genre individuals are obtained by the union of any Freebase Musical
Genre:
• whose description matches a LastFM Tag’s value15. At the end of this process,
4698 of the initial 11946 tags were reconciled to their semantically equivalent domain Musical Genre;
• that is associated with a Musical Artist tagged in LastFM. Freebase’s and LastFM’s Musical
Artists are not directly associated. Nevertheless, when the MusicBrainz Musical
Artist is the same for both Freebase and LastFM, one may conclude they are
the same.
13 http://mafra-toolkit.sourceforge.net
14 http://neon-toolkit.org
15 More detailed information about the matching process is provided in Appendix A.1.
[Figure 7.3: Sources Dataset Mapping to Domain Ontology. Source concepts: lastfm:User, lastfm:ListenAction, lastfm:TagAction, lastfm:Tag, lastfm:MusicalArtist, freebase:MusicalArtist, freebase:MusicalGenre, musicbrainz:MusicalArtist. Domain concepts: domain:User, domain:Action, domain:MusicalArtist, domain:MusicalGenre. Properties shown: knows, performs, over, likes, has, sameAs, hasMusicalGenre, hasSubGenre. Mapping “a)” links the source tags and genres to domain:MusicalGenre.]
This mapping is achieved by equations 7.1-7.3.
\[ \mathit{MusicalGenre} = \mathit{ReconciledTag} \cup \mathit{FBMGenre} \tag{7.1} \]
\[ \mathit{ReconciledTag} = \{\, fb{:}MG \mid fb{:}MG.description \approx lfm{:}T.value \,\} \tag{7.2} \]
\[ \mathit{FBMGenre} = \{\, fb{:}MG \mid ( fb{:}MA.has.fb{:}MG \,\land\, fb{:}MA.sameAs.mb{:}MA = lfm{:}TA.over.lfm{:}MA ) \,\} \tag{7.3} \]
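The set construction in equations 7.1-7.3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `FreebaseGenre`, `matches`, `reconciled_tags`, `fb_artist_genres` and `domain_genres` are hypothetical, the data structures are plain dictionaries and sets rather than ontology individuals, and the matching operator of equation 7.2 is stood in for by a case-insensitive string comparison (the actual matching process is the one described in Appendix A.1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreebaseGenre:
    """Hypothetical stand-in for a fb:MG individual."""
    uri: str
    description: str


def matches(description: str, tag_value: str) -> bool:
    # Assumption: placeholder for the matching step of Appendix A.1,
    # approximated here by a case-insensitive comparison of trimmed strings.
    return description.strip().lower() == tag_value.strip().lower()


def reconciled_tags(genres, tag_values):
    # Equation 7.2: Freebase genres whose description matches some tag value.
    return {g for g in genres if any(matches(g.description, t) for t in tag_values)}


def fb_artist_genres(fb_has_genre, fb_same_as_mb, lfm_tagged_mb_ids):
    # Equation 7.3: genres of Freebase artists whose MusicBrainz identity
    # is the same as that of a LastFM artist that received a tagging action.
    return {
        genre
        for fb_artist, genres in fb_has_genre.items()
        if fb_same_as_mb.get(fb_artist) in lfm_tagged_mb_ids
        for genre in genres
    }


def domain_genres(genres, tag_values, fb_has_genre, fb_same_as_mb, lfm_tagged_mb_ids):
    # Equation 7.1: union of both sets.
    return reconciled_tags(genres, tag_values) | fb_artist_genres(
        fb_has_genre, fb_same_as_mb, lfm_tagged_mb_ids
    )
```

Keeping the two sources as separate functions mirrors the two bullets above, so either reconciliation path can be inspected on its own.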
A transitive property “hasSubGenre” is added to the domain ontology to relate genres to their sub-genres. This “hasSubGenre” relation provides the necessary information for semantic
filtering recommendation.
At the end of this mapping and integration process, Musical Artists are associated with
Musical Genres. Appendix A.2 provides more facts and numbers about the individuals
in the domain ontology.
7.1.3 System Ontology Mapping
This section explains how the concepts and properties of the domain
ontology are interpreted in the system’s ontology in order to simulate human
usage of the system.
A subset of the system ontology, shown in figure 7.4, captures the semantics of the information as handled by the proposed PRP. This ontology uses concepts from the Provenance Vocabulary Core Ontology (prv) and the PROV Ontology (provo) to semantically
represent provenance and traceability actions. It represents resources as prv:DataItem
and uses the concept provo:Activity to represent activities performed by a foaf:Person
over prv:DataItems.
One of the recommendation key elements is Person, described by the foaf:Person entity,
which in turn can have a relationship with other users, described by the foaf:knows
property.
[Figure 7.4: System Ontology. Concepts: foaf:Person, provo:Activity, prv:DataItem. Properties: foaf:knows, performedBy, over.]
The domain ontology is solely used for system evaluation and demonstration purposes.
The proposed system’s model is capable of handling any generic concepts and not
just the ones presented in this domain ontology. Above all, the PRP is capable of
recommending any data that respects the system’s ontology. Therefore, it is necessary
to semantically map the domain ontology to the system’s ontology (i.e. interpret the
domain data as recommendation data) as depicted in figure 7.5.
In order to adhere to the specific Music Domain, Musical Artists and Musical Genres
extend the system’s ontology as specialisations of prv:DataItem.
To simulate user preferences, a property named “likes” connects each foaf:Person to
a domain:MusicalGenre. The values for this property are obtained by using the rule
described in algorithm 7.1.
Algorithm 7.1 User Preferences for Musical Genre (SWRL)
User(?user), Activity(?activity), MusicalArtist(?artist), MusicalGenre(?genre),
over(?activity, ?artist), hasMusicalGenre(?artist, ?genre),
performedBy(?activity, ?user) -> likes(?user, ?genre)
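The effect of the rule in algorithm 7.1 can be illustrated outside a reasoner with a few lines of Python. This is a minimal sketch, not the way the rule is executed in the system (there it is evaluated by the reasoner over the ontology): the function name `infer_likes` and the tuple and dictionary representation of activities and genre associations are assumptions made for the example.

```python
def infer_likes(activities, artist_genres):
    """Derive (user, genre) 'likes' pairs.

    activities    -- iterable of (user, artist) pairs, one per activity
                     performed by the user over the artist
    artist_genres -- dict mapping an artist to the set of its genres
                     (the hasMusicalGenre property)
    """
    likes = set()
    for user, artist in activities:
        # Every genre of an artist the user acted upon becomes a preference.
        for genre in artist_genres.get(artist, ()):
            likes.add((user, genre))
    return likes
```

Because the result is a set, repeated activities over the same artist yield a single “likes” association, which matches the boolean nature of the inferred property.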
7.1.4 Summary
The transformations applied to the original dataset provided the information
needed to simulate human usage of the architecture.
Tagging and Listening actions over resources are combined in a single Action
concept because the Mahout recommender system cannot distinguish between different
user actions, so differentiating them would be pointless. The tag
reconciliation process associates each Tag with an equivalent Musical Genre from the Freebase
database. Reconciling tags with their semantically equivalent Musical Genre
not only allows the system to categorise musical artists by their corresponding musical
genre, but also demonstrates that semantic information can be added to resources
when UGC is performed.
[Figure 7.5: Domain Ontology to System Ontology Mapping. Elements shown: domain:User and foaf:Person (knows, foaf:knows, likes); domain:Action and provo:Activity (performs, performedBy, over); prv:DataItem, domain:MusicalArtist and domain:MusicalGenre (hasMusicalGenre, hasSubGenre).]
The rules added to the system ontology allowed users to be associated with their preferred Musical Genres, i.e. building their preference lists.
The semantic information generation process complemented the associations between musical artists and musical genres provided by the original LastFM dataset with
new knowledge provided by the Freebase and MusicBrainz databases.
The ontology resulting from this information integration process is stored in an Apache
Jena Fuseki triple store configured with a Pellet reasoner16. Some facts and figures
about the derived domain ontology can be found in Appendix A.2.
7.2 Recommender System
This section describes the evaluation of the PRP component of the proposed system
architecture. Because no human-based experiments were conducted in this phase, part
of the initial dataset is used to calculate precision and recall.
The experimental process is depicted in figure 7.6. In the following diagrams, dashed
lines represent components, messages and properties only used for evaluation purposes.
The following processes take place during an evaluation:
Recommendation Dataset Mapping Maps the system ontology to
the recommendation dataset;
Evaluation Dataset Generator Generates two different datasets from the recommendation dataset, one for training the recommender system and the other for
evaluating it;
Similarities Generator Generates similarities between users or between items from
the recommendation dataset;
Similarities Aggregator Aggregates similarities obtained from different similarity
generators;
Predictions Generator Predicts the user’s interests in available items;
Predictions Aggregator Aggregates one or more possible prediction results, filters
them according to users’ feedback and recommends items to resource authors;
Measurement Calculator Calculates recall, precision and F1 measures for the recommended items.
16 http://clarkparsia.com/pellet/
[Figure 7.6: Recommender System Overall Evaluation Process. Components: Rec Dataset, Evaluation Dataset Generator (training model, relevant items), Similarities Generator (collaborative and semantic similarities), Similarities Aggregator, Predictions Generator, Predictions Aggregator (recommendations), Measurements Calculator (precision, recall, F1).]
[Figure 7.7: Recommendation Dataset. Concepts: rec:User, rec:Item, rec:Action, rec:UserSimilarity, rec:ItemSimilarity. Properties: performedBy, overItem, firstUser, secondUser, firstItem, secondItem, weight.]
7.2.1 Recommendation Dataset
The process of generating the recommendation dataset17 consists of obtaining the
following sets of information from the system’s ontology, in order to comply with the recommendation model presented in figure 7.7. Mahout’s recommendation process recognises
users, items, user actions, similarities between users or between items, and their
weights.
Because Mahout’s recommender system does not recognise or handle ontologies, a
mapping between the system’s ontology and Mahout’s recommender model is necessary. It converts the system’s ontology data into a format that the recommendation
engine can use (cf. figure 7.8).
According to figure 7.8, it is possible to derive the following concepts.
User Derived from the foaf:Person concept. Each foaf:Person from the system’s ontology is mapped to a rec:User in the recommendation dataset.
Item Derived from the prv:DataItem concept. Each prv:DataItem from
the system’s ontology is mapped to a rec:Item in the recommendation dataset.
Actions Derived from the provo:Activity concept. Each provo:Activity from the system’s ontology is mapped to a rec:Action in the recommendation dataset. For each mapped activity, the respective relationships with the user (i.e. “performedBy” property) and
items (i.e. “over” property) are created.
User/User Similarities Derived from the foaf:knows property. Each foaf:knows
property originates a rec:UserSimilarity individual.
Item/Item Similarities Derived from the system:isSimilarTo property. This similarity set is the outcome of the semantic filtering approach further explained in
section 7.2.3.
17 The namespace “recommendation” or “rec” may be used to denote the recommendation dataset.
[Figure 7.8: System Ontology to Recommendation Dataset Mapping. foaf:Person maps to rec:User, foaf:knows to rec:UserSimilarity (firstUser, secondUser, weight), provo:Activity to rec:Action (performedBy, over, weight), prv:DataItem to rec:Item, and isSimilarTo to rec:ItemSimilarity (firstItem, secondItem, weight).]
[Figure 7.9: Evaluation Dataset Generator Process. Input: Rec Dataset (users, items, user actions). Outputs: Relevant Items and Training Model. Configuration values: Top K Preferences, Threshold.]
The weight of each user action, item similarity and user similarity is given by the
number of times it is repeated during the mapping process. The resulting dataset
represents the input data for the recommender system.
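The repetition-based weighting described above amounts to counting duplicate records. The following Python sketch shows the idea under explicit assumptions: the function names `action_weights` and `similarity_weights` are hypothetical, actions are represented as (user, item) tuples, and similarity pairs are treated as unordered so that a pair and its reverse are counted together.

```python
from collections import Counter


def action_weights(actions):
    """Weight of each (user, item) action = number of times it occurs."""
    return Counter(actions)


def similarity_weights(pairs):
    """Weight of each similarity = number of times the pair occurs.

    Assumption: similarities are symmetric, so (a, b) and (b, a) are
    counted as the same association by sorting each pair first.
    """
    return Counter(tuple(sorted(pair)) for pair in pairs)
```

For instance, the illustrative records (1,2), (1,2), (1,3), (1,5), (2,1), (3,1) yield weights of 3, 2 and 1 for the pairs (1,2), (1,3) and (1,5) respectively.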
7.2.2 Evaluation Dataset Generator
For evaluation purposes, this process generates two
complementary datasets from the recommendation dataset (cf. figure 7.9):
Relevant Items Dataset Represents all the items that are considered relevant to
each user;
Training Model Dataset Includes all the remaining items from the dataset, which
are used as the training model for the recommender system. The training model
dataset is only necessary when the system is running in evaluation mode.
The evaluation engine allows the top K user actions and a relevancy threshold to be
configured when evaluating the system. Due to disparities in the number of actions per
user, a fixed top K value and relevancy threshold would result in many users not having
any relevant items, or in other users having too many relevant items, thus reducing the
training model actions. The trade-off between both is achieved by using Mahout’s
automatic threshold calculation for each user. As a result, the threshold is calculated for
each user of the dataset as the average of all the user’s preference values plus the standard deviation
of all the user’s action weights [120].
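The per-user threshold described above (mean plus deviation of a user's action weights) can be sketched as follows. This is an illustration rather than Mahout's code: the function names are hypothetical, the population standard deviation is an assumption (Mahout's own estimator may differ), and treating items with a weight greater than or equal to the threshold as relevant is also an assumption.

```python
from statistics import mean, pstdev


def relevance_threshold(weights):
    """Mean of a user's action weights plus their standard deviation.

    Assumption: population standard deviation is used here.
    """
    return mean(weights) + pstdev(weights)


def split_user_actions(item_weights):
    """Split one user's actions into relevant items and training data.

    item_weights -- dict mapping item -> weight of the user's action on it
    Returns (relevant_items, training_actions).
    """
    threshold = relevance_threshold(list(item_weights.values()))
    # Assumption: an item is relevant when its weight reaches the threshold.
    relevant = {item for item, w in item_weights.items() if w >= threshold}
    training = {item: w for item, w in item_weights.items() if item not in relevant}
    return relevant, training
```

Because the threshold adapts to each user's own distribution of weights, users with few or uniformly weighted actions are not forced to give up most of their training data.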
[Figure 7.10: Similarities Generator Process. The Collaborative Similarities Generator (parameters: Tanimoto; LogLikelihood) takes users, items and user actions from the recommendation dataset or the training model and produces collaborative similarities. The Semantic Similarities Generator (parameter: association property) produces semantic similarities.]
7.2.3 Similarity Generator
When using collaborative filtering techniques, most recommender engines evaluate the
similarities between users and between items, i.e. in this domain, Musical Artists. This
phase occurs before applying the recommendation algorithms.
Several similarity algorithms exist and have been described in
section 2.2. The choice of similarity algorithm influences the prediction
outcome. As depicted in figure 7.10, the similarity generator process allows the creation of similarity sets either by using Mahout’s existing similarity algorithms or by obtaining them from a semantic model, using a property that relates a concept to
itself. Therefore, semantic information is used in Mahout through item
similarities obtained from the system’s ontology.
The similarity generation process is divided into two processes:
Collaborative Similarities Generator This process generates similarities between
users and between items by using Mahout’s user and item similarity algorithms.
These are solely obtained and derived from provenance and traceability information given by rec:Action individuals. Actions are obtained from users that
listened to and tagged Musical Artists, where the weight of each action is given
by the number of Actions between a User and an Item. An example is shown in
Table 7.3.
Table 7.3: Mahout-based Similarities [Weightless]
Item  Item
1     1
1     2
2     1
1     3
...   ...
Table 7.4: Semantic Similarities [Weightless]
Item  Item
1     2
1     2
1     3
1     5
2     1
3     1
...   ...
Mahout supports several similarity algorithms, of which Tanimoto Coefficient Similarity and Log-Likelihood Similarity were the native similarity generators
used in this set of experiments.
Semantic Similarities Generator This process is responsible for mapping user preferences and musical genres for the recommendation process. In order for the semantically inferred similarities between Musical Artists to be used in the recommender system, the values of rec:ItemSimilarity are used to derive the semantic
associations between different Musical Artists. An example is shown in Table
7.4.
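As a reminder of what the Tanimoto coefficient mentioned above measures, the following Python sketch computes it for two items from the sets of users that acted on each of them. It illustrates the standard definition (size of the intersection divided by size of the union); it is not Mahout's implementation, and the function name `tanimoto` is hypothetical.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient between two items.

    users_a, users_b -- collections of users that acted on each item.
    Returns a value between 0.0 (no users in common) and 1.0 (same users).
    """
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        # Assumption: two items without any users are treated as dissimilar.
        return 0.0
    return len(a & b) / len(union)
```

The coefficient ignores action weights, which is why it fits the boolean configurations, where only the existence of an action matters.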
The addition of the property “isSimilarTo” to the prv:DataItem class in the system’s
ontology, as shown in figure 7.11, allows the extraction of similarities between Musical
Artists.
Another property, “isSimilarToMA”, a sub-property of “isSimilarTo”, is added to the
domain:MusicalArtist concept. A Musical Artist is related through this property to another Musical Artist
if the second has a Musical Genre that is a sub-genre
of any of the first’s Musical Genres, using the rule shown in algorithm 7.2. An example
of this inference is shown in figure 7.12, where Musical Artist 1 becomes similar to
Musical Artist 2, as Musical Artist 2 is related to Musical Genre B, which is a sub-genre
of the Musical Genre A associated with Musical Artist 1.
These properties are only created for Mahout’s integration in the proposed architecture,
so that this process can use them to generate semantic similarities. The modified
system’s ontology is depicted in figure 7.11.
[Figure 7.11: System Ontology Semantic Similarity Enrichment. The system ontology (foaf:Person, provo:Activity, prv:DataItem, domain:MusicalArtist, domain:MusicalGenre; properties foaf:knows, likes, performedBy, overItem, hasMusicalGenre, hasSubGenre) is shown before and after adding the properties isSimilarTo (on prv:DataItem) and isSimilarToMA (on domain:MusicalArtist).]
Algorithm 7.2 Similar Musical Artists Rule (SWRL)
MusicalArtist(?artist1), MusicalArtist(?artist2), MusicalGenre(?genre1), MusicalGenre(?genre2),
hasMusicalGenre(?artist1, ?genre1), hasSubGenre(?genre1, ?genre2),
hasMusicalGenre(?artist2, ?genre2) -> isSimilarToMA(?artist1, ?artist2)
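The inference performed by algorithm 7.2, combined with the transitivity of “hasSubGenre”, can be sketched in Python as below. This is only an illustration of the rule's logic: in the system the inference is delegated to the reasoner, and the function names `sub_genre_closure` and `similar_artists`, as well as the pair and dictionary representation, are assumptions made for the example.

```python
def sub_genre_closure(has_sub_genre):
    """Transitive closure of direct (genre, sub_genre) pairs."""
    closure = set(has_sub_genre)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a -> b and b -> d imply a -> d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def similar_artists(artist_genres, has_sub_genre):
    """Pairs (artist1, artist2) such that artist2 has a genre that is a
    (possibly indirect) sub-genre of one of artist1's genres."""
    closure = sub_genre_closure(has_sub_genre)
    similar = set()
    for artist1, genres1 in artist_genres.items():
        for artist2, genres2 in artist_genres.items():
            if any((g1, g2) in closure for g1 in genres1 for g2 in genres2):
                similar.add((artist1, artist2))
    return similar
```

Note that the rule is directional: the artist holding the broader genre becomes similar to the artist holding the sub-genre, not the other way round. As written, the rule does not exclude an artist being related to itself when it holds both a genre and one of its sub-genres.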
[Figure 7.12: Semantic Enrichment Example. Musical Artist 1 hasMusicalGenre Musical Genre A; Musical Genre A hasSubGenre Musical Genre B; Musical Artist 2 hasMusicalGenre Musical Genre B; inferred: Musical Artist 1 isSimilarToMA Musical Artist 2.]
[Figure 7.13: Similarity Aggregation Process Example. Elements: Recommender Dataset, System Ontology, Similarity Generator (Mahout similarities), Load External Similarities (semantic similarities), Merge Similarities, Aggregated Similarities, Generic(User/Item)BasedRecommender.]
7.2.4 Similarity Aggregator
The recommenders used in Mahout receive as input a dataset with existing user preferences and similarities between users or between items. Similarities can either be
automatically obtained by executing a similarity generator (e.g. Tanimoto
Coefficient Similarity) or obtained from an external source.
Mahout provides these two approaches separately, but does not provide a way to
integrate both, i.e. to join similarities from an external source with the similarities
generated by Mahout’s internal similarity generators in the same recommendation
process.
More than one algorithm for generating item similarities is used in these experiments.
For example, semantic similarities are obtained from the system’s ontology through
the “isSimilarTo” property. Because Mahout uses collaborative filtering, the usage of
semantic information is limited to the injection of similarities before the predictions.
In order to overcome this limitation, a similarity aggregator was developed that merges
the similarities provided by Mahout’s similarity generator with the ones provided by
an external source, as shown in figure 7.13.
The similarity aggregator process is designed for Mahout to aggregate different similarity sets originating from different similarity generating processes, as depicted in
figure 7.14.
This process enriches the typical similarity output, used by the recommender system,
with semantic similarities.
The similarities aggregator process aggregates the similarities resulting from the above
distinct processes. It is possible, in the aggregation process, to give different weights
to each of the different generated similarities. Semantic associations per se do not
have a specific score: they either exist or not, i.e. they have a value of one or are not
present. Yet, multiple records for the same association can appear (cf. table 7.4) and
in different orders18. The similarity aggregator is composed of two inner processes:
• value normalisation for different similarity sets. In this case, different similarity
sets (cf. tables 7.3, 7.4) can be aggregated using two different approaches:
– using a boolean similarity approach, by assigning a value of one to any
similarity and deleting duplicate similarities. The resulting sets are shown
in tables 7.5 and 7.6;
– using a weighted similarity approach, by counting the number of times each
similarity is repeated. The resulting sets are shown in tables 7.7 and 7.8.
• average calculation for the different values of the same similarity. This
happens after values have been normalised (cf. tables 7.9, 7.10). The aggregator
calculates the average between the recommender-based (cf. table 7.7) and the
semantic-based (cf. table 7.8) similarities by:
– using a union average approach, by calculating the average of all
similarities, i.e. even if a similarity only appears in one of the generators.
The resulting set is shown in table 7.11;
– using an intersection average approach, by calculating the average of
common similarities only, i.e. by discarding similarities that are only present in
one of the similarity generators. The resulting set is shown in table 7.12.
In the experiments, recommendation processes can be configured to use one or more
similarity generators. Union average or intersection average is only applied when
similarity values have been normalised.
[Figure 7.14: Similarities Aggregator Process. Inputs: collaborative similarities and semantic similarities. Output: similarities. Parameters: Intersection Average; Union Average; Union.]
Table 7.5: Mahout-Based Similarities [Boolean]
Item  Item  Weight
1     1     1
1     2     1
1     3     1
...   ...   ...
Table 7.6: Semantic Similarities [Boolean]
Item  Item  Weight
1     2     1
1     3     1
1     5     1
...   ...   ...
Table 7.7: Mahout-based Similarities [Weighted]
Item  Item  Weight
1     1     12.00
1     2     10.00
1     3     18.00
...   ...   ...
Table 7.8: Semantic Similarities [Weighted]
Item  Item  Weight
1     2     3
1     3     2
1     5     1
...   ...   ...
Table 7.9: Mahout Similarities [Normalised Weight]
Item  Item  Weight  Weight [0-1]
1     1     12.00   0.25
1     2     10.00   0.00
1     3     18.00   1.00
...   ...   ...     ...
Table 7.10: Semantic Similarities [Normalised Weight]
Item  Item  Weight  Weight [0-1]
1     2     3.00    1.00
1     3     2.00    0.50
1     5     1.00    0.00
...   ...   ...     ...
18 While the order of similarity items can change, associations are symmetric, i.e. similarity pair
(1,2) is the same as (2,1).
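The normalisation and averaging steps of the similarity aggregator can be sketched in Python as follows. This is a minimal illustration, not the aggregator's actual code: the function names are hypothetical, similarities are held in dictionaries keyed by item pair, and the assumption that a similarity missing from one generator contributes zero to the union average is made for the example (it is consistent with the example weights used in this section, e.g. 0.25 averaging to 0.13 after rounding).

```python
def normalise(weights):
    """Linearly rescale similarity weights to the range [0, 1]."""
    lo, hi = min(weights.values()), max(weights.values())
    span = hi - lo
    # Assumption: when all weights are equal, every value maps to 0.0.
    return {pair: ((w - lo) / span if span else 0.0) for pair, w in weights.items()}


def union_average(a, b):
    """Average over every pair present in either set (missing value = 0)."""
    return {pair: (a.get(pair, 0.0) + b.get(pair, 0.0)) / 2 for pair in a.keys() | b.keys()}


def intersection_average(a, b):
    """Average over pairs present in both sets only."""
    return {pair: (a[pair] + b[pair]) / 2 for pair in a.keys() & b.keys()}


# Illustrative weighted similarities keyed by (item, item).
mahout = normalise({(1, 1): 12.0, (1, 2): 10.0, (1, 3): 18.0})
semantic = normalise({(1, 2): 3.0, (1, 3): 2.0, (1, 5): 1.0})
union = union_average(mahout, semantic)
intersection = intersection_average(mahout, semantic)
```

The union keeps every association at the cost of diluting those seen by a single generator, whereas the intersection keeps only associations confirmed by both.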
Table 7.11: Similarities Aggregation [Union Average]
Item  Item  Weight
1     1     0.13
1     2     0.50
1     3     0.75
1     5     0.00
...   ...   ...
Table 7.12: Similarities Aggregation [Intersection Average]
Item  Item  Weight
1     2     0.50
1     3     0.75
...   ...   ...
7.2.5 Predictions Generator
Mahout presents two categories of prediction algorithms: user-based and item-based.
In the experiments, both prediction types have been used in different configurations.
For both approaches, Mahout’s predictions consider the existing similarities and their
weights.
The process presented in figure 7.15 predicts users’ interest in items and filters the predictions
according to a value specified as top N.
The top N value represents the maximum number of predictions to be recommended
for each user. Because the number of generated predictions can be very large, the
number of considered predictions is configurable. For evaluation purposes, top N
is configured with a value of one thousand in order to make sure that all possible
predicted items from the dataset are covered. The result of this process is a list of
tuples combining user, item and score. The higher the score, the more relevant the item is
to the user.
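The top N filtering step can be illustrated with a short Python sketch. It is not the filter used in the evaluation engine: the function name `top_n` is hypothetical and predictions are assumed to be plain (user, item, score) tuples.

```python
import heapq


def top_n(predictions, n):
    """Keep, for each user, the n predictions with the highest score.

    predictions -- iterable of (user, item, score) tuples
    Returns a list of tuples grouped by user (in order of first appearance)
    and sorted by descending score within each user.
    """
    by_user = {}
    for user, item, score in predictions:
        by_user.setdefault(user, []).append((user, item, score))
    result = []
    for user_predictions in by_user.values():
        result.extend(heapq.nlargest(n, user_predictions, key=lambda p: p[2]))
    return result
```

With a large n, such as the value of one thousand used for evaluation, the filter effectively lets every prediction through while still ordering them by score.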
7.2.6 Predictions Aggregator
The predictions aggregator process is responsible for producing a recommendation list
consisting of tuples (user, item, score), based on the previously calculated predictions.
This process is divided into two sub-parts, as depicted in figure 7.16.
This evaluation system allows the combination of multiple configurations, executed in
parallel, into one set of recommendations. As there are different predictions provided
by different configurations (i.e. originating from an item-based or user-based recommender)
with different input similarities (i.e. recommendation-based similarities, collaborative
similarities and semantic similarities), the recommender engine must aggregate the
predictions generated by the system.
The predictions aggregator is an optional part of the recommendation process. It allows the recommendation process to aggregate two or
more prediction results from different configurations into one new prediction result,
where prediction scores can be normalised to a given range.
[Figure 7.15: Predictions Generator Process. Inputs: users and items from the recommendation dataset or training model, and similarities. Recommender parameters: Generic Boolean Item/User; Generic Item/User. The unfiltered predictions pass through a Filter Top N step (parameter: Top N) to produce the predictions.]
[Figure 7.16: Predictions Aggregator Process. Inputs: prediction sets 1 to n. The Predictions Aggregator (parameters: Union Average; Intersection Average) produces predictions that pass through a Filter Top AT step (parameter: Top AT) to produce the recommendations.]
Because Mahout’s prediction values are not normalised to a fixed maximum
value in each system execution, a function is used to normalise each prediction’s
relevance in the overall recommending process.
The proposed predictions aggregator, by default, normalises the aggregation to values
on a scale from zero to one by using a simple linear conversion expressed in equations
7.4-7.8.
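\[ \mathit{OldRange} = \mathit{OldMax} - \mathit{OldMin} \tag{7.4} \]
\[ \mathit{NewRange} = \mathit{NewMax} - \mathit{NewMin} \tag{7.5} \]
\[ \mathit{NewRange} = 1, \quad \begin{cases} \mathit{NewMax} = 1 \\ \mathit{NewMin} = 0 \end{cases} \tag{7.6} \]
\[ \mathit{NewValue} = \left( \frac{(\mathit{OldValue} - \mathit{OldMin}) \ast \mathit{NewRange}}{\mathit{OldRange}} \right) + \mathit{NewMin} \tag{7.7} \]
\[ \mathit{NewValue} = \left( \frac{\mathit{OldValue} - \mathit{OldMin}}{\mathit{OldRange}} \right), \quad \begin{cases} \mathit{NewRange} = 1 \\ \mathit{NewMin} = 0 \end{cases} \tag{7.8} \]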
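The linear conversion above translates directly into a small function. The sketch below is an illustration under assumptions: the name `rescale` is hypothetical, and returning the lower bound of the new range when all old values are equal (a zero old range) is a choice made here to avoid a division by zero, not something stated in the text.

```python
def rescale(old_value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly convert a score from [old_min, old_max] to [new_min, new_max]."""
    old_range = old_max - old_min  # equation 7.4
    new_range = new_max - new_min  # equation 7.5
    if old_range == 0:
        # Assumption: a degenerate old range maps every score to new_min.
        return new_min
    # General conversion; with the default range [0, 1] it reduces to
    # (old_value - old_min) / old_range.
    return (old_value - old_min) * new_range / old_range + new_min
```

For example, a score of 12 in a set whose minimum is 10 and maximum is 18 is converted to 0.25 on the default zero-to-one scale.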
The aggregation function is configurable. An example of a weighted average score between different prediction generators is shown in equation
7.9, where each prediction set can be favoured over another by giving it a different weight.
\[ score(item) = \frac{score(p_1) \ast w_1 + score(p_2) \ast w_2}{w_1 + w_2} \tag{7.9} \]
Tables 7.13 and 7.14 demonstrate how the aggregation function might affect recommendation results. In the first table, without normalising the prediction values, item
number one is the least recommended to user number one according to the weighted score, while it becomes the second
most recommended item when using normalised values.
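The effect described above can be reproduced with a short Python sketch of equation 7.9. The function names `weighted_score` and `rank` are hypothetical, and the scores are the illustrative values for user number one (three items, two prediction sets, weights 0.8 and 0.2), first raw and then normalised.

```python
def weighted_score(scores, weights):
    """Weighted average of the scores given by several prediction sets."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def rank(item_scores, weights):
    """Items ordered from most to least recommended."""
    return sorted(
        item_scores,
        key=lambda item: weighted_score(item_scores[item], weights),
        reverse=True,
    )


# item -> (score from prediction set P1, score from prediction set P2)
raw = {1: (2.0, 25.0), 2: (1.0, 35.0), 3: (2.0, 32.0)}
normalised = {1: (1.0, 0.0), 2: (0.0, 1.0), 3: (1.0, 0.7)}
weights = (0.8, 0.2)

raw_ranking = rank(raw, weights)
normalised_ranking = rank(normalised, weights)
```

With raw scores the larger scale of P2 dominates despite its lower weight, so item 1 comes last; after normalisation the weights act as intended and item 1 moves to second place.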
Results are scored and the final list is filtered by the top AT value, i.e. the maximum number
of recommendations per user, defined in the system configuration. This is
necessary because users consider only the first few recommendations [47, 87].
Table 7.13: Predictions Without Normalised Scoring
user  item  P1 Score  P2 Score  Average Score  Weighted Score (P1=0.8, P2=0.2)
1     1     2.00      25.00     13.50          6.60
1     2     1.00      35.00     18.00          7.80
1     3     2.00      32.00     17.00          8.00
Table 7.14: Predictions With Normalised Scoring
user  item  P1 Normalised Score  P2 Normalised Score  Average Score  Weighted Score (P1=0.8, P2=0.2)
1     1     1.00                 0.00                 0.50           0.80
1     2     0.00                 1.00                 0.50           0.20
1     3     1.00                 0.70                 0.85           0.94
7.2.7 Measurement Calculator
Measurement calculation, depicted in figure 7.17, computes the average
recall (cf. equation 2.4), precision (cf. equation 2.6) and F1 measure (cf. equation
2.12) for each evaluation run.
Several authors state that the evaluation of collaborative filtering techniques should mainly focus on
recall measures [5, 43, 127, 154]. According to [154], when using recall as the measure for a recommender system, a high recall with a lower
top AT is better. Nevertheless, for the presented experiments, precision and recall are both calculated. In
order to provide a weighted harmonic mean between recall and precision, F1 is also
calculated.
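For a single user, the three measures can be sketched in Python as below. This is an illustration that assumes the standard set-based definitions of precision, recall and balanced F1 (the definitions used in this work are the ones given by the equations referenced above); the function name `precision_recall_f1` is hypothetical, and returning zero for empty inputs is a choice made here.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 of one user's recommendation list.

    recommended -- items recommended to the user (already cut at top AT)
    relevant    -- items considered relevant to the user
    """
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    # Assumption: measures default to 0.0 when a denominator would be zero.
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Averaging these per-user values over all evaluated users gives the figures reported for an evaluation run.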
7.2.8 Summary
In order for Mahout’s recommendation engine to run, it must be provided with information mapped from the system’s ontology. How this is achieved was
described in this section.
The processes for generating and aggregating semantic similarities for use with a
collaborative-filtering technique were detailed. It was also described how predictions
from different recommendation executions can be aggregated, even when their
predicted scores do not follow the same range of values.
These processes are used for building the different evaluation configurations that are presented and discussed in the next section.
[Figure 7.17: Measurement Calculator Process. Inputs: relevant items and recommendations. Steps: Recall Calculus, Precision Calculus, F1 Calculus. Outputs: recall, precision, F1.]
7.3 Evaluation & Results
This evaluation suite gathers measurements of the recommendation evaluation execution under different runtime configurations. Some of the most relevant configurations
are shown in this section.
This section describes each configuration’s experiment and respective results.
Experiments are characterised according to the following dimensions:
• the recommendation dataset (cf. section 7.2.1);
• the process of generating the training model and relevant items (cf. section 7.2.2);
• the process of generating and aggregating similarities (cf. sections 7.2.3, 7.2.4);
• the process of generating and aggregating predictions (cf. sections 7.2.5, 7.2.6);
• the recommendation engine configurations (e.g. AT).
Each experiment has its own configuration of these dimensions. The experiments
were conducted for a top AT of 25, 50 and 150. In this section, comparisons between
different configurations are only presented for an AT value of 25. The remaining
results, for AT values of 50 and 150 are presented in appendix B.
Table 7.15: Baseline Configurations Results [AT=25]
Configuration ID  Similarity  User/Item-Based  Boolean/Weighted  AT  Precision  Recall  F1
C1                L           I                B                 25  0,0814     0,4969  0,1399
C2                T           I                B                 25  0,0774     0,4726  0,1331
C4                T           U                B                 25  0,0754     0,4652  0,1297
C7                L           U                W                 25  0,0471     0,3173  0,0820
C8                T           U                W                 25  0,0473     0,3180  0,0824
C10               L           U                B                 25  0,0730     0,4541  0,1258
Legend: L = Log-Likelihood, T = Tanimoto; I = item-based, U = user-based; B = boolean, W = weighted.
Each configuration evaluation consists of the calculation of average precision, recall
and F1. Each experiment’s results are compared to those of an initial baseline experiment
that is obtained by using the dataset with the simplest possible configuration.
7.3.1 Baseline Configurations Analysis
Baseline configurations were created using Mahout’s algorithms without injecting any
extra similarities in the process, as depicted in Table 7.15.
Configurations C1 and C2 are item-based, while C4, C7, C8 and C10 are user-based.
Some configurations use the Tanimoto Coefficient similarity algorithm (i.e. C2, C4
and C8) while others use the Log-Likelihood similarity algorithm.
On a first inspection of the provided data, user-based recommendations
(i.e. C7 and C8) do not produce results on par with item-based recommendations.
7.3.2 C1 Derived Configurations Analysis
All configurations derived from the C1 baseline configuration are configured with an
item-based boolean recommender that uses the Log-Likelihood similarity algorithm,
as shown in Table 7.16.
Table 7.16: C1 Derived Configurations Results [AT=25]
Configuration ID  Similarity  Prediction  Aggregation  AT  Precision  Recall   F1
C1                L           I, B        -            25  0,0814     0,4969   0,1399
C104              L, S        I, B        S, UN        25  +0,0605    +0,2874  +0,1005
C105              S           I, B        -            25  -0,0018    -0,0427  -0,0045
C109              L, S        I, B        N, UA        25  +0,0456    +0,2011  +0,0750
C110              L, S        I, B        N, IA        25  +0,0455    +0,2000  +0,0748
Legend: Similarity: L = Log-Likelihood, S = Semantic. Prediction: I = item-based, B = boolean. Aggregation: S = standard (non-normalised), N = normalised, UN = union, UA = union average, IA = intersection average. Values for the derived configurations are differences relative to the C1 baseline.
By using the baseline recommender configuration solely with semantic similarities (i.e.
C105), precision and recall values drop when compared to the baseline (i.e. C1).
Yet, aggregating both the recommender system similarities and the semantic
similarities, using an approach without averages (i.e. C104), produces much better
results: precision is about six percentage points higher, recall around twenty-nine and
F1 about ten percentage points higher than the baseline. It also outperforms the normalised average approaches
(i.e. C109 and C110)19 with union average or intersection average.
Using a normalised approach with intersection provides worse results than a non-normalised union of all results.
It is possible to conclude that an item-based boolean recommendation is better when
enriched with semantic similarities compared to the baseline.
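The aggregation strategies compared above can be sketched as follows. This is one plausible reading of "union", "union average" and "intersection average", not the implementation used in this work: the min-max normalisation, the choice of keeping the larger value in a plain union, and all names and sample values are assumptions made for illustration.

```python
def normalise(sims):
    """Min-max scale similarity values into [0, 1] (assumed normalisation)."""
    if not sims:
        return {}
    low, high = min(sims.values()), max(sims.values())
    if high == low:
        return {pair: 1.0 for pair in sims}
    return {pair: (value - low) / (high - low) for pair, value in sims.items()}


def union(a, b):
    """Every pair from either set; a shared pair keeps its larger value (assumption)."""
    return {p: max(s[p] for s in (a, b) if p in s) for p in a.keys() | b.keys()}


def union_average(a, b):
    """Every pair from either set; a shared pair gets the mean of both values."""
    merged = {}
    for pair in a.keys() | b.keys():
        values = [s[pair] for s in (a, b) if pair in s]
        merged[pair] = sum(values) / len(values)
    return merged


def intersection_average(a, b):
    """Only pairs present in both sets, each with the mean of both values."""
    return {pair: (a[pair] + b[pair]) / 2 for pair in a.keys() & b.keys()}


# Hypothetical item-item similarities from the recommender and from semantics.
cf_sims = {("r1", "r2"): 0.8, ("r1", "r3"): 0.2}
semantic_sims = {("r1", "r2"): 0.4, ("r2", "r3"): 0.6}
```

Under these assumptions the intersection keeps only the pair both sources agree on, which discards every similarity known to a single source; this is consistent with the intersection approach trailing the union in the results above.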
7.3.3
C2 Derived Configurations Analysis
All configurations derived from the C2 baseline configuration are configured with an
item-based boolean recommender that uses the Tanimoto Coefficient similarity algorithm as shown in Table 7.17.
By using the baseline recommender configuration solely with semantic similarities (i.e.
C205), precision only increases for AT=25, and decreases for the other AT values. All
other measures drop when compared to the baseline configuration, except
for the f1 measure for AT=25, which is still slightly better because precision is higher than
the baseline.
Aggregating the recommender system similarities and the semantic similarities
with a non-normalised approach (i.e. C204) produces much better results than the
baseline (i.e. C2) and the normalised average approaches (i.e. C209 and C210).
Using a normalised approach with intersection average (i.e. C210) produces worse
results than the baseline. It is possible to conclude that an item-based boolean
recommendation is better when enriched with semantic similarities, except when using
the normalised intersection average approach.
Footnote 19: An exception exists for AT=150, where configurations C109 and C110 have
minimally better results than C104. This is depicted in table B.2 on page 227.
Table 7.17: C2 Derived Configurations Results [AT=25]
Configuration ID | Similarity | User/Item-Based | Boolean/Weighted | Normalised/Std. | Aggregation | AT | Precision | Recall | F1
C2 | Tanimoto | Item | Boolean | - | - | 25 | 0,0774 | 0,4726 | 0,1331
C204 | Tanimoto + Semantic | Item | Boolean | Std. | Union | 25 | +0,0196 | +0,0650 | +0,0312
C205 | Semantic | Item | Boolean | - | - | 25 | +0,0022 | -0,0184 | +0,0023
C209 | Tanimoto + Semantic | Item | Boolean | Normalised | Union Average | 25 | +0,0070 | +0,0123 | +0,0106
C210 | Tanimoto + Semantic | Item | Boolean | Normalised | Intersection Average | 25 | -0,0296 | -0,1960 | -0,0516
Signed values are differences from the baseline row (C2).
Table 7.18: C4 Derived Configuration Results [AT=25]
Configuration ID | Similarity | User/Item-Based | Boolean/Weighted | Normalised/Std. | Aggregation | AT | Precision | Recall | F1
C4 | Tanimoto | User | Boolean | - | - | 25 | 0,0754 | 0,4652 | 0,1297
C405 | Semantic | User | Boolean | - | - | 25 | +0,0001 | +0,0072 | +0,0005
C404 | Tanimoto + Semantic | User | Boolean | Normalised | Union Average | 25 | +0,0079 | +0,0486 | +0,0137
C400 | Tanimoto + Semantic | User | Boolean | Normalised | Intersection Average | 25 | +0,0089 | +0,0519 | +0,0152
C411 | Tanimoto + Semantic | User | Boolean | Std. | Union | 25 | +0,0032 | +0,0237 | +0,0057
Signed values are differences from the baseline row (C4).
7.3.4
C4 Derived Configurations Analysis
All configurations derived from the C4 baseline configuration are configured with a
user-based boolean recommender that uses the Tanimoto Coefficient similarity algorithm as shown in Table 7.18.
By using the baseline recommender configuration solely with semantic similarities (i.e.
C405), precision, recall and f1 improve only marginally over the baseline
(i.e. C4).
Table 7.19: C7 Derived Configurations Results [AT=25]
Configuration ID | Similarity | User/Item-Based | Boolean/Weighted | Normalised/Std. | Aggregation | AT | Precision | Recall | F1
C7 | Log-Likelihood | User | Weighted | - | - | 25 | 0,0471 | 0,3173 | 0,0820
C705 | Semantic | User | Weighted | - | - | 25 | +0,0230 | +0,0112 | +0,0335
C704 | Log-Likelihood + Semantic | User | Weighted | Normalised | Union Average | 25 | +0,0010 | +0,0130 | +0,0019
C700 | Log-Likelihood + Semantic | User | Weighted | Normalised | Intersection Average | 25 | +0,0008 | +0,0104 | +0,0015
C711 | Log-Likelihood + Semantic | User | Weighted | Std. | Union | 25 | +0,0049 | +0,0254 | +0,0083
Signed values are differences from the baseline row (C7).
When aggregating both the recommender system similarities and the semantic similarities, using a normalised approach (i.e. C400 and C404), precision and recall give
marginally better results, favouring the intersection average approach (i.e. C400).
Not normalising the aggregation of similarities (i.e. C411) results in slightly better
values than the baseline, but still worse than when similarities are normalised before being averaged (i.e. C400 and C404).
7.3.5
C7 Derived Configurations Analysis
All configurations derived from the C7 baseline configuration are configured with a
user-based weighted recommender that uses the Log-Likelihood similarity algorithm
as shown in Table 7.19.
By using the baseline recommender configuration solely with semantic similarities (i.e.
C705), precision improves by around two percentage points while recall is slightly better than the
baseline (i.e. C7).
When aggregating both the recommender system similarities and the semantic similarities, using a normalised approach (i.e. C700 and C704), precision and recall are
only marginally better.
Not normalising the similarities aggregation process (i.e. C711) produces better results
than the baseline, but is still worse than using only semantic similarities. It is evident
that the weighted user-based recommendation combined with the Log-Likelihood similarity algorithm does not produce good results, even when semantic similarities are
added.
Table 7.20: C8 Derived Configuration Results [AT=25]
Configuration ID | Similarity | User/Item-Based | Boolean/Weighted | Normalised/Std. | Aggregation | AT | Precision | Recall | F1
C8 | Tanimoto | User | Weighted | - | - | 25 | 0,0473 | 0,3180 | 0,0824
C805 | Semantic | User | Weighted | - | - | 25 | +0,0228 | +0,0105 | +0,0331
C804 | Tanimoto + Semantic | User | Weighted | Normalised | Union Average | 25 | +0,0011 | +0,0154 | +0,0022
C800 | Tanimoto + Semantic | User | Weighted | Normalised | Intersection Average | 25 | +0,0014 | +0,0179 | +0,0027
C811 | Tanimoto + Semantic | User | Weighted | Std. | Union | 25 | +0,0047 | +0,0247 | +0,0079
Signed values are differences from the baseline row (C8).
7.3.6
C8 Derived Configurations Analysis
All configurations derived from the C8 baseline configuration are configured with a
user-based weighted recommender that uses the Tanimoto Coefficient similarity algorithm as shown in Table 7.20.
By using the baseline recommender configuration solely with semantic similarities (i.e.
C805), precision improves by around two percentage points while recall is slightly better than the
baseline (i.e. C8).
When aggregating both similarity sets using a normalised averaged approach (i.e.
C800 or C804), precision and recall have marginally better results. Surprisingly,
not normalising the similarities (i.e. C811) shows better results than the baseline, but
is still worse than only using semantic similarities.
It is evident that the weighted user-based recommendation combined with the Tanimoto similarity algorithm does not produce good results, even when
semantic similarities are added.
7.3.7
C10 Derived Configurations Analysis
All configurations derived from the C10 baseline configuration are configured with a
user-based boolean recommender that uses the Log-Likelihood similarity algorithm as
shown in Table 7.21.
By using the baseline recommender configuration solely with semantic similarities (i.e.
C305), measures barely differ from the baseline (i.e. C10).
Nevertheless, when mixing both the recommender system similarities and the semantic
similarities, using a normalised averaged approach (i.e. C300 or C304), precision is
improved by around one percentage point and recall by around eight. Not normalising the
results (i.e. C311) lowers these gains but still produces better results
than the baseline.
Table 7.21: C10 Derived Configurations Results [AT=25]
Configuration ID | Similarity | User/Item-Based | Boolean/Weighted | Normalised/Std. | Aggregation | AT | Precision | Recall | F1
C10 | Log-Likelihood | User | Boolean | - | - | 25 | 0,0730 | 0,4541 | 0,1258
C305 | Semantic | User | Boolean | - | - | 25 | +0,0025 | +0,0183 | +0,0044
C304 | Log-Likelihood + Semantic | User | Boolean | Normalised | Union Average | 25 | +0,0143 | +0,0767 | +0,0241
C300 | Log-Likelihood + Semantic | User | Boolean | Normalised | Intersection Average | 25 | +0,0145 | +0,0765 | +0,0244
C311 | Log-Likelihood + Semantic | User | Boolean | Std. | Union | 25 | +0,0056 | +0,0348 | +0,0096
Signed values are differences from the baseline row (C10).
7.3.8
Aggregated Predictions Analysis
Table 7.22 displays the normalised aggregations of predictions obtained by combining the different baseline configurations. This aggregation is performed using the union
average method and aims at combining item-based and user-based recommenders.
Configurations C11 and C15 are the only parallel aggregations that outperform both of the individual approaches they combine.
Combining configuration C1 with another configuration produces better results than
that other configuration run individually, but worse than using C1 individually,
except for configuration C11, where C1 and C4 predictions are aggregated.
Combining C2 and C4 produces better results than using either of the configurations
individually. All the other combinations of C2 with other configurations produce worse
results than using C2 individually.
Table 7.23 shows the results of aggregating predictions from each baseline with its derived configuration that only considers semantic similarities.
For configurations C106, C206 and C406, aggregating the predictions obtained solely from semantic similarities with those of
their baseline produces worse results than using either of the configurations individually.
Configurations C306 and C806 perform better than the baseline configuration,
but worse than the same configuration using only semantic similarities (i.e. C305
and C805).
Table 7.22: Baseline Configurations Aggregated Predictions Results [AT=25]
Configuration ID | Configuration 1 | Configuration 2 | AT | Precision | Recall | F1 | F1 Comparison (Configuration 1) | F1 Comparison (Configuration 2)
C11 | C1 | C4 | 25 | 0,0821 | 0,5021 | 0,1412 | +0,0013 | +0,0115
C12 | C1 | C7 | 25 | 0,0475 | 0,3194 | 0,0826 | -0,0573 | +0,0006
C13 | C1 | C8 | 25 | 0,0479 | 0,3218 | 0,0834 | -0,0565 | +0,0010
C14 | C1 | C10 | 25 | 0,0790 | 0,4851 | 0,1359 | -0,0040 | +0,0101
C15 | C2 | C4 | 25 | 0,0787 | 0,4808 | 0,1353 | +0,0022 | +0,0056
C16 | C2 | C7 | 25 | 0,0475 | 0,3188 | 0,0826 | -0,0505 | +0,0006
C17 | C2 | C8 | 25 | 0,0475 | 0,3188 | 0,0826 | -0,0505 | +0,0002
C18 | C2 | C10 | 25 | 0,0737 | 0,4566 | 0,1269 | -0,0062 | +0,0011
Table 7.23: Derived Configurations Aggregated Predictions Results [AT=25]
Configuration ID | Configuration 1 | Configuration 2 | AT | Precision | Recall | F1 | F1 Comparison (Configuration 1) | F1 Comparison (Configuration 2)
C106 | C1 | C105 | 25 | 0,0596 | 0,3596 | 0,1022 | -0,0377 | -0,0332
C206 | C2 | C205 | 25 | 0,0600 | 0,3628 | 0,1030 | -0,0301 | -0,0324
C306 | C10 | C305 | 25 | 0,0745 | 0,4651 | 0,1284 | +0,0026 | -0,0018
C406 | C4 | C405 | 25 | 0,0600 | 0,3628 | 0,1030 | -0,0267 | -0,0272
C706 | C7 | C705 | 25 | 0,0471 | 0,3173 | 0,0820 | 0,0000 | -0,0335
C806 | C8 | C805 | 25 | 0,0562 | 0,7540 | 0,1046 | +0,0222 | -0,0109
The configuration C706 produces the same results as the baseline (i.e. C7), but worse
than using the same configuration only with semantic similarities (i.e. C705).
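The "F1 Comparison" columns of Table 7.22 can be re-derived from the reported F1 values: each is the aggregated F1 minus the F1 of one of the two combined baselines. The short check below uses the values of Tables 7.15 and 7.22 (written with decimal points); the helper names are illustrative.

```python
# F1 of each baseline configuration (Table 7.15, AT=25).
baseline_f1 = {"C1": 0.1399, "C2": 0.1331, "C4": 0.1297,
               "C7": 0.0820, "C8": 0.0824, "C10": 0.1258}

# Aggregated configuration -> (configuration 1, configuration 2, aggregated F1).
aggregated = {
    "C11": ("C1", "C4", 0.1412), "C12": ("C1", "C7", 0.0826),
    "C13": ("C1", "C8", 0.0834), "C14": ("C1", "C10", 0.1359),
    "C15": ("C2", "C4", 0.1353), "C16": ("C2", "C7", 0.0826),
    "C17": ("C2", "C8", 0.0826), "C18": ("C2", "C10", 0.1269),
}


def f1_comparison(config_id):
    """Difference between the aggregated F1 and the F1 of each constituent."""
    first, second, f1 = aggregated[config_id]
    return (round(f1 - baseline_f1[first], 4), round(f1 - baseline_f1[second], 4))


# Aggregations that beat both of the configurations they combine.
outperform_both = [cid for cid in aggregated
                   if all(delta > 0 for delta in f1_comparison(cid))]
```

With these values only C11 and C15 end up in `outperform_both`, matching the observation made for Table 7.22.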
7.3.9
Summary
From these evaluation results, it was possible to conclude that an item-based recommender system provides better results than a user-based recommender for the existing
dataset.
Furthermore, weighted analysis of actions provided worse results than when using a
boolean approach. This may be due to existing disparities in the values of each action
in the original dataset.
Overall, the injection of similarities derived from semantic information proved to enhance the results in all the configuration scenarios. This was achieved by using only
a minimal subset of information that a semantic system can have.
Furthermore, given the high configurability of the evaluation system, it is possible
to aggregate the predictions of different recommendation configurations that use
different sets of semantic similarities. This would allow item similarities to carry
different weights in different configurations, instead of relying on a single semantic
similarity set.
7.4
Summary
This evaluation aims to simulate resource access recommendation in a cross-domain
environment, based on provenance and traceability information captured from the seamless use of
the Internet. The original dataset, generated from a completely different scenario, is
interpreted to represent the semantic information regarding resources, users, users’
actions upon resources, and provenance and traceability information.
The results demonstrate that, by introducing similarities calculated from content and
semantic information into a collaborative filtering technique (focusing on social
networking, user profiles or resource content), it is possible to improve recommendation
results.
In [133] the authors used the MovieLens dataset for measuring the system’s recommendation performance, using a mean average precision measure. Their precision values
for an AT of 50 vary from 0.0272 to 0.0687, which are on a par with those obtained
in the experiments conducted in this work with the baseline configurations for
the same AT (0.0255 to 0.0345). For an AT value of 25, precision barely drops below 0.0500, reaching a maximum of around 0.0800, which is better than the best values
(0.0699@5) shown in [133]. This suggests that precision measures produce quite small
values, yet ones that are good enough for comparing different systems in an
evaluation phase.
The usage of similarities produced from semantic content, injected into collaborative-filtering techniques, shows that average precision values higher than ten per cent are
easily achievable. Provenance and traceability information, together with enriched
semantic information, can indeed make the resource recommendation better.
Aggregating similarities from different sources in the same recommendation process
produces better results than aggregating, in parallel, the predictions of configurations that each consider only one of the similarity sets.
Performing a thorough evaluation of the system, by running each configuration for
each individual user, is time-consuming and produces a large amount of information. Only a subset of
that information is presented in this chapter. Analysing the rest of the generated information would allow a finer tuning of the system, but recommendation optimisation
was out of the scope of this work.
Chapter 8
Conclusions & Future Work
8.1
Overview
The proposed architectural model breaks completely with current practices and adoptions
of web applications. Currently, web applications seek a “(. . . ) ‘personal data gold
rush’ driven by the dominance of advertising as the primary source of revenue for
most online companies” [66]. Such behaviour only entraps consumers and publishers
in single massive web domains/silos that dangerously have “more” control over user
content than the users themselves do.
When users create content in a web environment, they do not necessarily or entirely
“own” it. When using proprietary web applications, users are only allowed to make
use of those resources to the extent that the web applications allow.
Managing resource sharing and the associated access policies is a time-consuming, tedious and
error-prone operation, especially in a cross-domain perspective.
Sharing the same resource in a cross-domain perspective, without duplicating it on
multiple different domains, crosses the boundaries imposed by typical web applications’
mandatory access policies that only allow users to share their resources in a way that
the web application desires and agrees with.
The work described in this thesis proposes a seamless cross-web-domain infrastructure
that can provide secure, rich and supportive resource-managing and sharing processes.
One of the axioms for this work is that it had to be achievable by using Internet
standards or recommendations, namely Semantic Web technologies.
8.2
Research Questions / Contributions
A divide-and-conquer approach was adopted, by breaking the problem down into
several smaller questions that focus on different and smaller issues related to
the architecture model.
RQ1: How can a distributed, decentralised and standard-based
mechanism perform authentication and authorisation?
Successful distributed user authentication is only attainable if users can provide the
correct credentials and the authentication takes place in several different domains
without relying on a centralised or federated mechanism.
Typical basic access combinations (e.g. user name and password) are not suitable for
this kind of authentication because validation credentials reside in only one domain and are not
(easily) attainable by others. Relying on a federated mechanism implies that a third
party must be involved in the authentication process (e.g. OpenID-Connect).
It was observed that the existing FOAF vocabulary was capable of not only providing
the user identity but also representing users’ internal characteristics (identity) and
social identity, in a format (i.e. RDF) that is readable and understandable by both people and machines.
Unique identity, referred to as WebID, is provided through a URI that
denotes an association to the user’s FOAF profile.
The FOAF profile, when associated with an SSL certificate (i.e. FOAF+SSL), provides
the user with enough credentials for distributed user authentication.
Yet decentralised authorisation over resources based on FOAF+SSL is a novel contribution of this thesis. Decentralised authorisation is achieved when access policies
are decoupled from resources and the policy decision point is decoupled from the
enforcement point, as proposed in the system architecture.
Solving research question number one includes the combination of the following features:
• Access policies and resources are hosted by different entities;
• Association between resources and their authors is specified through semantic
annotations;
• Access policies are associated to each resource’s author, and only resource authors can set access policies over their resources;
• SWRL is the language selected for defining access policies;
• The access policy evaluation engine is capable of performing explicit semantic
reasoning, thus exploiting the adopted and promoted rule-based access policies.
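The decoupling listed above can be illustrated with a minimal sketch. It is not the architecture's implementation: a plain Python predicate stands in for an SWRL rule, a set of triples stands in for the semantic annotations, and every class name, predicate name and URI is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class AccessPolicy:
    author: str                               # WebID of whoever wrote the policy
    resource: str                             # URI of the protected resource
    rule: Callable[[str, Set[Triple]], bool]  # stand-in for an SWRL rule


class PolicyDecisionPoint:
    """Holds policies and annotations; hosted apart from the resources."""

    def __init__(self, facts: Set[Triple], policies: List[AccessPolicy]):
        self.facts = facts
        self.policies = policies

    def permits(self, requester: str, resource: str) -> bool:
        for policy in self.policies:
            if policy.resource != resource:
                continue
            # Only the annotated author may set policies over a resource.
            if (resource, "author", policy.author) not in self.facts:
                continue
            if policy.rule(requester, self.facts):
                return True
        return False  # no applicable rule: the resource stays private


class PolicyEnforcementPoint:
    """Sits in front of the resource host and defers every decision to the PDP."""

    def __init__(self, pdp: PolicyDecisionPoint, store: Dict[str, bytes]):
        self.pdp = pdp
        self.store = store

    def fetch(self, requester: str, resource: str) -> bytes:
        if not self.pdp.permits(requester, resource):
            raise PermissionError(f"{requester} may not access {resource}")
        return self.store[resource]


ALICE = "https://alice.example.org/profile#me"
BOB = "https://bob.example.com/profile#me"
CAROL = "https://carol.example.net/profile#me"
MALLORY = "https://mallory.example.net/profile#me"
PHOTO = "https://media.example.net/photo1.jpg"

facts = {(PHOTO, "author", ALICE), (ALICE, "knows", BOB)}
policies = [
    AccessPolicy(ALICE, PHOTO, lambda who, kb: (ALICE, "knows", who) in kb),
    AccessPolicy(MALLORY, PHOTO, lambda who, kb: True),  # ignored: not the author
]
pep = PolicyEnforcementPoint(PolicyDecisionPoint(facts, policies), {PHOTO: b"jpeg-bytes"})
```

In this sketch the resource store never sees a policy, and the policy set written by a non-author has no effect because the authorship annotation does not match.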
RQ2: What and how can user-generated actions upon resources
be captured?
The information needed for relating users and resources already exists in closed environments or data silos but is typically inaccessible from other domains.
In a web environment, stipulating a new standard that could be implemented by
each web application’s internal structure was a possibility. Nevertheless, this option
was discarded, as it would only promote yet another standard on how to produce
and publish information, which would be prone to implementation-specific errors and
non-compliance with the standard.
Research in this field steered the vision towards a module that should be decoupled from
specific web applications. For this purpose, a system of action sensors is proposed in
this work. These sensors follow a predetermined message exchange that is hidden from
the web application’s developers and completely agnostic of their implementation.
Action sensors consist of singular and small entities that are placed between the user
and the action being performed, capturing information about the user’s actions and
generated content, thus providing associations of who did what upon which resource.
These sensors are deployed on web servers.
While this approach tightly couples these modules to the web server infrastructure, they are decoupled from existing web applications, which easily
outnumber web servers in terms of existing implementations. This reduces the number of
different modules that need to be maintained and that must comply with the predetermined message exchange.
In a cross-domain environment, the usage of the proposed action sensors enables the
establishment of associations between users and resources. Such information is represented as provenance and/or traceability semantic annotations.
Therefore, research question number two has been answered by a combination of the
following contributions:
• The definition of action sensors for capturing user actions and generated content;
• The definition of traceability applied to Internet resources;
• The creation of provenance and traceability semantic annotations;
• The definition of a simple ontology that is used to model and later store provenance and traceability annotations;
• The extraction of meta and contextual information about users and resources
involved in the action.
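As a rough illustration of a sensor placed between the user and the action, the sketch below wraps any WSGI application and records who did what upon which resource. It assumes that an upstream authentication layer puts the user's WebID in `REMOTE_USER`; the record layout, the class name and the in-memory sink are hypothetical and do not reproduce the message exchange defined in this work.

```python
from datetime import datetime, timezone


class ActionSensor:
    """WSGI middleware that logs each request before passing it on unchanged."""

    def __init__(self, app, sink):
        self.app = app    # the wrapped web application, unaware of the sensor
        self.sink = sink  # any object with append(); stands in for annotation storage

    def __call__(self, environ, start_response):
        self.sink.append({
            "who": environ.get("REMOTE_USER", "anonymous"),
            "action": environ.get("REQUEST_METHOD", "GET"),
            "resource": environ.get("PATH_INFO", "/"),
            "when": datetime.now(timezone.utc).isoformat(),
        })
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """A trivial application standing in for an existing web application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


log = []
sensed_app = ActionSensor(demo_app, log)
body = sensed_app(
    {"REQUEST_METHOD": "PUT", "PATH_INFO": "/photos/1.jpg",
     "REMOTE_USER": "https://alice.example.org/profile#me"},
    lambda status, headers: None,
)
```

Because the sensor only relies on the server-side request interface, the wrapped application needs no changes, which mirrors the decoupling argued for above.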
RQ3: How can a user share a resource with others, based on
rules instead of statically-defined discretionary access control?
Contrary to what happens in closed environments, where a finite list of all the users
registered in the domain can be obtained, in an infrastructure that provides distributed user authentication it is rather difficult to obtain such a list of all possible users
because their identities are scattered all over the Internet. Furthermore, access policies
over resources should not be assigned on a user identity basis, as this
does not really express the rationale behind the sharing action.
This research question led to some thinking about why resource-sharing is still used in
such a traditional, time-consuming, restricted and restrictive way. Worse, the traditional static assignment of users, or static definition of roles, or groups of users that can
access resources, does not scale in a web environment or comply with the constantly
changing user social networks.
This work proposes that access policies should always be expressed by means of rules
that capture the sharing rationale. In fact, for maintenance, readability and scalability, it is more important to specify the rationale than to produce
statically defined, semantically poor access policies, which should be seen as a
last resort.
The adoption of a semantic web language that could enable the definition of rules was
envisaged. Such rules should be human-readable and easily processed by machines
without losing their explicit semantics.
Research question number three has been answered by the combination of the following
contributions:
• Access policies over resources are defined according to a semantic rule language
(i.e. SWRL);
• Access policies allow rules to be defined based on users’ information, resources’
information or any other semantic information that the author may wish and is
attainable by the system;
• The definition of an ontology that is responsible for describing access policies
over resources;
• Access policies do not depend on the domains where they are hosted;
• Access policies can be used in a cross-domain environment;
• Access policies are decoupled from resources;
• Access policies allow the dynamic creation of user and resource groups, either
based on the user or group characteristics.
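The difference between a rule that captures the sharing rationale and a statically defined list can be sketched as follows. The rule ("anyone within two foaf:knows hops of the owner") is a made-up example written in Python rather than SWRL, and the identifiers are hypothetical.

```python
def friends_of_friends(facts, owner):
    """Users within two foaf:knows hops of `owner` (a rule-defined group)."""
    direct = {o for s, p, o in facts if s == owner and p == "foaf:knows"}
    indirect = {o for s, p, o in facts if s in direct and p == "foaf:knows"}
    return (direct | indirect) - {owner}


def may_access(facts, owner, requester):
    """Sharing rationale: 'anyone in my extended social circle'."""
    return requester in friends_of_friends(facts, owner)


facts = {("alice", "foaf:knows", "bob")}
static_acl = {"bob"}  # a statically defined list must be edited by hand

before = may_access(facts, "alice", "carol")
facts.add(("bob", "foaf:knows", "carol"))  # the social network changes
after = may_access(facts, "alice", "carol")
```

Here `before` is false and `after` is true without any policy being edited, whereas `static_acl` still lacks the new user.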
RQ4: How is it possible to automate or ease the process of managing access policies to resources from a resource’s owner perspective and his relationships?
As the amount of content and the number of users in social relationships are continually
growing on the Internet, managing resource sharing and access policies is difficult,
time-consuming and error-prone.
In order to aid users in the resource-sharing process, the adoption of an entity that
recommends access policies for users’ resources is proposed. It analyses
(i) resource content, (ii) user preferences, (iii) users’ social networks, (iv) semantic
information, (v) user feedback about recommendation actions and (vi) provenance/traceability information gathered from action sensors.
A hybrid recommendation engine capable of performing collaborative-filtering was
adopted and enhanced to use semantic information.
Research question number four has been answered by the combination of the following
contributions:
• Recommendation of access policies over resources, thus promoting the discovery
of known-unknown and unknown-unknown resources to other users that could
not even know about the existence of such resources;
• Adoption of a hybrid recommendation engine that translates user and resources’
semantic information and aggregates those with other content and collaborative
filtering techniques.
8.3
Outcomes
This work demonstrates that a user can share a one-time created resource with other
users, either on the same or other domains, without the need to duplicate the item
and access policies.
Every resource can be hosted in a different web domain than the one where the resource
is being referred.
The proposed action sensors not only relate user actions with resources, but also allow
users to keep a log of their actions and every piece of content created, published and
shared on the web, in the form of provenance or traceability annotations.
Users are no longer forced to manage authorisation over their resources in the way that each
domain dictates. Other users can even refer to a resource on different web
domains while its access control policies are preserved.
Privacy over user resources is enhanced because resources can be hosted exactly where
the user feels most comfortable, in a place the user trusts. Confidentiality is achieved because
resources are inherently private when first created, being shared only with those whom the
user chooses.
Trustworthiness over resources is given by the user’s association to the resource, supported by provenance or traceability annotations. Resources’ integrity is therefore a
feature that is partially addressed in this work.
Due to the decentralised and distributed nature of the architectural model, availability
and scalability of the system are addressed. Distributed entities can play the same role,
allowing redundancy of services and information to occur, thus minimising downtime
or unresponsive behaviour.
8.4
Limitations & Future Work
Despite, or because of, all the novel proposals, several limitations and new requirements need to be addressed in the future.
8.4.1
Access Policy Revocation
Access policy revocation is regularly pointed out when discussing privacy and sharing,
especially the problem of preventing content that has been shared with someone else from
being copied or re-shared. This relates to copyright infringement, which is much
discussed in the audio-visual area. This issue has not been dealt with in the scope of
this work, but research on it may be pursued.
Pragmatically, nothing ensures that the user did not duplicate the resource before
his access to the resource has been revoked. Therefore, it is my belief that access to
resources can be revoked at any time, but it should happen only when the resource
evolves or changes, hence denying access to the resource only from there on. In fact,
the changed resource is different, but because it shares the identity (i.e. URI) with
the previous resource, it is considered/treated as the same resource.
8.4.2
Indivisible Resources
Despite composed resources being considered in the proposal, the sharing process
focused on specifying access policies for indivisible resources, e.g. photographs, PDFs,
etc. Yet there are other formats that can be divided into subparts, thus allowing fine-grained access (e.g. HTML, RDF, etc.). In fact, all resources are divisible, depending
only on the granularity level that is to be achieved with such division, e.g. specifying the
right corner of a photograph, sentences of a document, individual fields of an HTML
form, frames of a video file, etc.
Sharing just part of a resource poses several difficulties, such as: having a common
query/navigation language; embedding results in other resources; and resolving semantic ambiguities resulting from dereferencing parts of other resources. Such issues
call for further research in this field.
8.4.3
Embeddable Resources
As mentioned, the proposed work focuses mainly on indivisible resources that are
obtainable by a URI, but not directly embeddable on HTML rendering (e.g. images,
videos, PDFs, etc.).
Let us assume a typical registration use case in a web domain, where name, email and
password are some of the commonly requested attributes. Users should
not need to write their email when they could be given the chance to reuse the
“foaf:mbox_sha1sum” property that resides in the identifying FOAF profile, simply
by hyperlinking to it. By doing so, changing the email address on the FOAF profile
would automatically be reflected in every other web domain where the user had
registered.
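For reference, the FOAF vocabulary defines foaf:mbox_sha1sum as the SHA1 digest of the mailbox URI, including its "mailto:" prefix. The sketch below shows how such a value can be derived and how a web domain could check a claimed address against the value found in a dereferenced profile; the function names and the sample address are illustrative, and fetching and parsing the profile are left out.

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """SHA1 hex digest of the 'mailto:' URI for `email`, as used by FOAF."""
    return hashlib.sha1(f"mailto:{email}".encode("utf-8")).hexdigest()


def matches_profile(claimed_email: str, profile_sha1sum: str) -> bool:
    """Check a claimed address against the value published in a FOAF profile."""
    return mbox_sha1sum(claimed_email) == profile_sha1sum.strip().lower()


# Value as it would appear in a (hypothetical) dereferenced FOAF profile.
profile_value = mbox_sha1sum("alice@example.org")
```

A domain that links to the profile instead of copying the address only needs to repeat this check when the profile changes.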
Dereferencing structured text or comments on forums or any other information introduced via a web form is more difficult to handle than indivisible resources (e.g.
photographs), as those are intrinsic to the domain where the text is introduced or the
form is filled.
Nevertheless, once the data/text is identified (by a URI) and localised (by a URL,
a form of URI), the resource’s authorisation and controlled access can be managed,
thus behaving like any other addressable resource.
Future work in this field of research would foster the adoption of finer granularity
and the change or addition of new vocabularies that could allow such dereferencing of
content.
8.4.4
Blacklisting Users & Resources
Trustworthiness plays a very important role when creating a hyperlinked web with
the proposed components. Just because a user has access to a resource, it does not
mean that he wishes to have it rendered. If a malicious script or document exists on
a referred external resource, and the user’s credentials allow access to that resource,
this could pose a security threat to the user.
The proposed architecture model could be enhanced with a user/resource blacklist, in
order to filter which resources could be loaded based on the referring resource or even
resource author.
When the user does not trust another user or a specific resource, the user or specific
resource would be blacklisted and resources’ dereferencing would be voided.
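A blacklist of this kind could be as simple as a filter applied before any referred resource is dereferenced. The sketch below is only one way such an enhancement might look; the class, its fields and the URIs are hypothetical, since the blacklist is future work and not part of the implemented system.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class Blacklist:
    users: Set[str] = field(default_factory=set)      # distrusted author WebIDs
    resources: Set[str] = field(default_factory=set)  # distrusted resource URIs

    def allows(self, resource_uri: str, author_webid: str) -> bool:
        """A resource is loaded only if neither it nor its author is blacklisted."""
        return resource_uri not in self.resources and author_webid not in self.users


def dereferenceable(links: Iterable[Tuple[str, str]], blacklist: Blacklist) -> List[str]:
    """Filter (resource URI, author WebID) pairs down to the URIs safe to load."""
    return [uri for uri, author in links if blacklist.allows(uri, author)]


blacklist = Blacklist(users={"https://mallory.example.net/profile#me"},
                      resources={"https://cdn.example.com/tracker.js"})
links = [
    ("https://media.example.net/photo1.jpg", "https://alice.example.org/profile#me"),
    ("https://files.example.net/doc.pdf", "https://mallory.example.net/profile#me"),
    ("https://cdn.example.com/tracker.js", "https://bob.example.com/profile#me"),
]
to_load = dereferenceable(links, blacklist)
```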
8.4.5
Performance impact
A fully dereferenced web like the one envisaged automatically increases the number of
HTTP requests made between clients and webservers and among webservers. Future
work in this area should prove, in terms of performance, scalability and reliability,
whether or not it is better to have all resources hosted in a single domain (without
the need for dereferencing) or hosted on different multiple domains (and dereferenced
when necessary).
Despite the fact that the number of resource requests would certainly increase, the
overall consumed bandwidth for each different web domain would probably be reduced.
Research on this field is suggested in order to evaluate such performance impacts.
8.5
Final Remarks
In a recent talk entitled “Why We Need the IndieWeb” (footnote 1), held at the 2014
edition of the Personal Democracy Forum, Tantek Çelik addressed the issue of an
independent web and how, back in 2003, people owned their domain on the Internet.
From there on, silos started to appear, and continued to emerge until the present day.
Meanwhile, other platforms decayed, ceased to exist (footnote 2) or were bought and dismantled
by bigger silo companies. The current paradigm of the web architecture is set to
imprison users (consumers and publishers) in big web domains. In fact, a recent study
and market comparison of existing cloud storage platforms, performed by the authors
in [111], clearly shows that no single cloud storage solution is capable of appropriately
and automatically hosting resources according to their different types in the most
suitable solutions.
Instead, this novel approach is set to disrupt this state of affairs by empowering users
with a higher degree of control over their resources.
However, most of the advantages provided by this proposal come with a price that
may not be affordable at the moment. There are certain technological and especially
socio/cultural/economical issues that still constitute an obstacle to a full adoption
of such a hyperlinked web of resources, which is nothing less than the original envisioned web but transposed to the more current Web 3.0. Mass social and cultural
behaviour/thinking are more difficult to overcome, especially when those have an impact on the economy, as web companies rely on advertising as their primary source of
revenue.
Footnote 1: http://tantek.com/2014/175/t1/pdf14-why-need-indieweb-video-slides
Footnote 2: https://indiewebcamp.com/site-deaths
Necessity is the mother of invention, as demonstrated by the continuous
engineering progress that has overcome great technological difficulties
over the last century. Setting technical difficulties aside, an implementation of
the proposed architectural model allows users to publish resources privately. It is my belief
that, in the future, every author will be the sole owner and manager of their resources.
Very recent work on this matter, such as “Personal Data” [66] and “Synereo” [90],
describes web solutions with features similar to those proposed in this thesis, which shows
that there are evidently several pending issues related to users’ privacy on the Internet.
Furthermore, some of the ideas proposed in this thesis are currently being taken a step
further, building solid ground for effective and progressive adoption through a
collaboration in the “FreeTrust: Voluntary Trusted Identities, Privacy and Safety in
Cyberspace”³ project.
³ http://freetrust.org
Part IV
Bibliography and Appendix
Bibliography
[1] Jemal H. Abawajy, Syed I. Jami, Zubair A. Shaikh, and Syed A. Hammad.
A framework for scalable distributed provenance storage system. Computer
Standards and Interfaces, 35:179–186, 2013. ISSN 09205489.
[2] Gediminas Adomavicius and Alexander Tuzhilin. Context-Aware Recommender Systems. In Recommender Systems Handbook, editors Francesco Ricci, Lior Rokach, Bracha
Shapira, and Paul B Kantor, chapter 7, pages 217–253. Springer US, 2011.
ISBN 9780387858197. doi: 10.1007/978-0-387-85820-3. URL: http://www.springerlink.com/index/10.1007/978-0-387-85820-3.
[3] Gediminas Adomavicius, Nikos Manouselis, and Youngok Kwon. Multi-Criteria
Recommender Systems. In Recommender Systems Handbook, editors Francesco
Ricci, Lior Rokach, Bracha Shapira, and Paul B Kantor, chapter 24, pages 769–
803. Springer, 2011. ISBN 9780387858197. doi: 10.1007/978-0-387-85820-3.
URL: http://www.springerlink.com/index/K8GH04X241722610.pdf.
[4] Sareh Aghaei, Mohammad Ali Nematbakhsh, and Hadi Khosravi Farsani. Evolution of the World Wide Web : From Web 1.0 to Web 4.0. International Journal
of Web & Semantic Technology, 3(1):1–10, January 2012. ISSN 09762280.
[5] Fabio Airoldi, Paolo Cremonesi, and Roberto Turrin. Hybrid Algorithms for
Recommending New Items In Personal TV. In Proceedings of the Second International Workshop on Future of Television held in conjunction with the Ninth European Conference on Interactive TV and Video (EuroITV), editors Lora Aroyo,
Stefan Dietze, and Lyndon Nixon, FutureTV’11, Lisbon, Portugal, June 2011.
[6] George Akrivas, Manolis Wallace, Giorgos Andreou, Giorgos Stamou, and Stefanos Kollias. Context-Sensitive Semantic Query Expansion. In Proceedings of
the IEEE International Conference on Artificial Intelligence Systems, ICAIS’02,
pages 109–114, Divnomorskoe, Russia, September 2002. IEEE Computer Society. ISBN 0-7695-1733-1. doi: 10.1109/ICAIS.2002.1048064.
[7] Ian Alexander. Towards Automatic Traceability in Industrial Practice. In Proceedings of the First International Workshop on Traceability, pages 26–31, 2002.
[8] Apache. Apache Mahout: Scalable machine learning and data mining, 2012.
URL: http://mahout.apache.org/.
[9] Hazeline Asuncion and Richard N. Taylor. Establishing the Connection Between Software Traceability and Data Provenance. Technical report, Institute for Software Research - University of California, Irvine, 2007. URL:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.75.1322.
[10] Research Studios Austria. easyrec - Open Source Recommendation Engine, 2012.
URL: http://easyrec.org.
[11] M Barbero, M D Del Fabro, and J Bézivin. Traceability and provenance issues in global model management. 3rd ECMDA Traceability Workshop, 2007. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.2284&rep=rep1&type=pdf.
[12] Punam Bedi, Harmeet Kaur, and Sudeep Marwaha. Trust Based Recommender System for the Semantic Web. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI’07, pages 2677–2682,
2007. URL: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Trust+based+Recommender+System+for+the+Semantic+Web#0.
[13] Amokrane Belloui. Gusto!, 2012. URL: https://code.google.com/p/gusto/.
[14] T Berners-Lee, R Fielding, and L Masinter. Uniform Resource Identifiers (URI):
Generic Syntax [RFC2396]. RFC 2396 (Draft Standard), 1998. URL: http://www.ietf.org/rfc/rfc2396.txt.
[15] T Berners-Lee, R Fielding, and L Masinter. Uniform Resource Identifier (URI):
Generic Syntax [RFC3986], 2005. URL: https://tools.ietf.org/html/rfc3986.
[16] Tim Berners-Lee and Mark Fischetti. Weaving the Web - The Original Design
and Ultimate Destiny of the World Wide Web. HarperBusiness, 1999. ISBN
978-0062515872.
[17] Tim Berners-Lee, Roy T Fielding, and Larry Masinter. Uniform Resource Identifier (URI): Generic Syntax [RFC 3986], 2005. URL: https://tools.ietf.org/html/rfc3986.
[18] Tim Berners-Lee, Yuhsin Chen, Lydia Chilton, Dan Connolly, Ruth Dhanaraj,
James Hollenbach, Adam Lerer, and David Sheets. Tabulator: Exploring and
Analyzing linked data on the Semantic Web. Methodology, 2006(i):6, 2006. doi: 10.1.1.97.950.
[19] Nuno Bettencourt and Nuno Silva. Recommending Access to Web Resources
based on User’s Profile and Traceability. In the Tenth IEEE International Conference on Computer and Information Technology, CIT’10, pages 1108–1113,
Bradford, UK, June 2010. IEEE. ISBN 9781424475476. doi: 10.1109/CIT.2010.202.
URL: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5578559.
[20] Nuno Bettencourt, Paulo Maio, András Pongó, Nuno Silva, and João Rocha.
A Systematization and Clarification of Semantic Web Annotation Terminology.
In Proceedings of the International Conference on Knowledge Engineering and
Decision Support, ICKEDS’06, pages 27–34, Lisbon, Portugal, May 2006.
[21] Nuno Bettencourt, Paulo Maio, Ricardo Almeida, Nuno Silva, and João Rocha.
Semantically Collaborative Knowledge Management System. In Proceedings of
the AAAI Spring Symposia in Symbiotic Relationships between Semantic Web
and Knowledge Engineering, AAAI’08 Spring Symposia, pages 13–20, Palo Alto,
CA, USA, March 2008. AAAI. URL: https://www.aaai.org/Papers/Symposia/Spring/2008/SS-08-07/SS08-07-002.pdf.
[22] Nuno Bettencourt, Rafael Peixoto, and Nuno Silva. Automatic Traceability
Acquisition Framework. In Proceedings of the Second International Conference
on Web Intelligence, Mining and Semantics, WIMS’12, Craiova, Romania, June
2012. ACM Press. ISBN 9781450309158. doi: 10.1145/2254129.2254169. URL:
http://dl.acm.org/citation.cfm?id=2254129.2254169.
[23] Nuno Bettencourt, Nuno Silva, and João Barroso. How to Publish Privately. In
Proceedings of the Second Workshop on Society, Privacy and the Semantic Web - Policy and Technology at the Thirteenth International Semantic Web Conference,
PrivOn’14, ISWC’14, Riva Del Garda, Italy, September 2014. CEUR-WS.
[24] Yolanda Blanco-Fernández, José J. Pazos-Arias, Alberto Gil-Solla, Manuel
Ramos-Cabrer, Martín López-Nores, Jorge García-Duque, Ana Fernández-Vilas,
Rebeca P. Díaz-Redondo, and Jesús Bermejo-Muñoz. An MHP framework to
provide intelligent personalized recommendations about digital TV contents.
Software - Practice & Experience, 38(9):925–960, 2008. ISSN 00380644. doi:
10.1002/spe.v38:9.
[25] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data - The Story
So Far. International Journal on Semantic Web and Information Systems,
5(3):1–22, 2009. ISSN 15526283. doi: 10.4018/jswis.2009081901. URL: http://www.citeulike.org/user/omunoz/article/5008761.
[26] H Boley, S Tabet, and G Wagner. Design Rationale of RuleML: A Markup
Language for Semantic Web Rules. In The First International Semantic Web
Working Symposium, volume 1 of SWWS’01, pages 381–401, Stanford (CA),
USA, July 2001. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.5035&rep=rep1&type=pdf.
[27] Artur Boronat, José Á. Carsí, and Isidro Ramos. Automatic Support for Traceability in a Generic Model Management Framework. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), volume 3748 LNCS, pages 316–330, 2005.
ISBN 3540300260. doi: 10.1007/11581741_23.
[28] Danah Boyd. Faceted ID/Entity; Managing representation in a digital world.
PhD thesis, Brown University, 2002. URL: http://smg.media.mit.edu/people/danah/thesis/thesis/.
[29] Dan Brickley and Libby Miller. FOAF Vocabulary Specification, 2010. URL:
http://xmlns.com/foaf/spec/.
[30] Interactive Advertising Bureau. User Generated Content, Social Media, and
Advertising - An Overview. Technical Report April, Interactive Advertising
Bureau, 2008.
[31] Robin Burke. Hybrid recommender systems: Survey and experiments. User
Modeling and User-Adapted Interaction, 12(4):331–370, 2002. ISSN 09241868.
doi: 10.1023/A:1021240730564. URL: http://www.springerlink.com/index/N881136032U8K111.pdf.
[32] Robin Burke. Hybrid Web Recommender Systems. In The Adaptive Web,
editors Peter Brusilovsky, Alfred Kobsa, and Wolfgang Nejdl, pages 377–408.
Springer-Verlag, 2007. URL: http://link.springer.com/chapter/10.1007%2F978-3-540-72079-9_12.
[33] Scott Cantor and Ij Kemp. Assertions and protocols for the oasis security assertion markup language. OASIS Standard (March . . . , (March):1–86,
2005. URL: https://svn.softwareborsen.dk/sosi-gw/tags/release-1.1.4/vendor/doc/saml-core-2.0-os.pdf.
[34] Ricardo Carreira, Jaime M Crato, Daniel Gonçalves, and Joaquim A Jorge.
Evaluating Adaptive User Profiles for News Classification. In the Proceedings of
the Ninth International Conference on Intelligent User Interfaces, IUI’04, pages
206–212, Funchal, Madeira, Portugal, 2004. ACM. ISBN 1-58113-815-6. doi:
10.1145/964442.964481. URL: http://doi.acm.org/10.1145/964442.964481.
[35] Charalampos Chelmis and Viktor K. Prasanna. Social Networking Analysis: A
State of the Art and the Effect of Semantics. In The Proceedings of the 2011
IEEE International Conference on Privacy, Security, Risk and Trust and the
2011 Third IEEE International Conference on Social Computing, PASSAT’11
/ SocialCom’11, pages 531–536, Boston, MA, USA, October 2011. Institute of
Electrical and Electronics Engineers (IEEE).
[36] Yulia Cherdantseva and Jeremy Hilton. A Reference Model of Information Assurance & Security. In Proceedings of the SecOnt Workshop in conjunction with the
Eighth International Conference on Availability, Reliability and Security (ARES),
Regensburg, Germany, September 2013. IEEE. doi: 10.1109/ARES.2013.72.
URL: http://users.cs.cf.ac.uk/Y.V.Cherdantseva/RMIAS.pdf.
[37] Hee Chul Choi, Sebastian Ryszard Kruk, Slawomir Grzonkowski, Katarzyna
Stankiewicz, Brian Davis, and John Breslin. Trust Models for Community-Aware Identity Management. In the Proceedings of the Identity, Reference and
Web Workshop held at the Fifteenth International World Wide Web Conference,
WWW’06, Edinburgh, Scotland, UK, 2006.
[38] S C Chou. L n RBAC: A multiple-levelled Role-Based Access Control model for
protecting privacy in object-oriented systems. Technology, 3(3):93–120, 2004.
ISSN 16601769.
[39] Fabio Ciravegna. (LP)2, an Adaptive Algorithm for Information Extraction from
Web-related Texts. In Proceedings of the Workshop on Adaptive Text Extraction
and Mining, in conjunction with the International Joint Conference on Artificial
Intelligence, volume 20 of IJCAI’01, pages 1–10. University of Sheffield, 2001.
doi: 10.1.1.23.7653. URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.7653.
[40] Lorenzo Cirio, I F Cruz, and Roberto Tamassia. A Role and Attribute Based
Access Control System Using Semantic Web Technologies. Symposium A Quarterly Journal In Modern Foreign Literatures, 4806:1256–1266, 2007. URL:
http://portal.acm.org/citation.cfm?id=1780518.
[41] Sean Convery. Network Authentication, Authorization, and Accounting: Part
One: Concepts, Elements and Approaches. The Internet Protocol Journal, 10
(1):2–11, 2007.
[42] Bryan Copeland. OpenRecommender, 2012.
URL: https://github.com/bcmoney/OpenRecommender/.
[43] Paolo Cremonesi, Roberto Turrin, and Fabio Airoldi. Hybrid Algorithms for
Recommending New Items. In Proceedings of the Second International Workshop
on Information Heterogeneity and Fusion in Recommender Systems, HetRec’11,
pages 33–40, Chicago, IL, USA, October 2011. ACM. ISBN 978-1-4503-1027-7.
doi: 10.1145/2039320.2039325. URL: http://doi.acm.org/10.1145/2039320.2039325.
[44] André Di Thommazo, Gabriel Malimpensa, Thiago Ribeiro De Oliveira, Guilherme Olivatto, and Sandra C P F Fabbri. Requirements traceability matrix:
Automatic generation and visualization. In Proceedings - 2012 Brazilian Symposium on Software Engineering, SBES 2012, pages 101–110, 2012.
[45] T Dierks and E Rescorla. The Transport Layer Security (TLS) Protocol Version
1.2 [RFC5246], 2008. URL: https://tools.ietf.org/html/rfc5246.
[46] JS Donath. Identity and Deception in the Virtual Community. In Communities in Cyberspace, editors Marc A. Smith and Peter Kollock, pages 29–59.
Routledge, London & New York, 1999. ISBN 0415191394. URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.6901.
[47] Tim Duhamel, Gianni Cooreman, and Pieter De Vuyst. MC DC 2009 - UNITE
Report. Technical report, IAB Europe, 2009. URL: http://www.iabeurope.eu/files/7513/6852/2734/mc-dc-2009-iab-unite-report.pdf.
[48] Hany F. EL Yamany, Miriam A M Capretz, and David S. Allison. Intelligent
security and access control framework for service-oriented architecture. Information and Software Technology, 52(2):220–236, 2010. ISSN 09505849.
[49] Christian Emig, Frank Brandt, Sebastian Abeck, Jürgen Biermann, and Heiko
Klarl. An Access Control Metamodel for Web Service-Oriented Architecture.
In Proceedings of the Second International Conference on Software Engineering
Advances, ICSEA’07, pages 57–64, 2007.
[50] Jérôme Euzenat. Eight Questions about Semantic Web Annotations. IEEE
Intelligent Systems and Their Applications, 17(2):55–62, 2002. doi: 10.1109/
MIS.2002.999221.
[51] Maryam Fazel-Zarandi, Hugh J Devlin, Yun Huang, and Noshir Contractor.
Expert Recommendation based on Social Drivers, Social Network Analysis,
and Semantic Data Representation. In Proceedings of the Second International Workshop on Information Heterogeneity and Fusion in Recommender
Systems, HetRec’11, pages 41–48. ACM, 2011. ISBN 9781450310277. doi:
10.1145/2039320.2039326. URL: http://doi.acm.org/10.1145/2039320.2039326.
[52] Ignacio Fernández-Tobías, Iván Cantador, Marius Kaminskas, and Francesco
Ricci. A generic semantic-based framework for cross-domain recommendation.
In Proceedings of the Second International Workshop on Information Heterogeneity and Fusion in Recommender Systems at the Fifth ACM Conference
on Recommender Systems, HetRec’11, pages 25–32, Chicago, IL, USA, October 2011. ACM. ISBN 9781450310277. doi: 10.1145/2039320.2039324. URL:
http://doi.acm.org/10.1145/2039320.2039324.
[53] D Ferraiolo, J Cugini, and D Richard Kuhn. Role-based access control (RBAC):
Features and motivations. In Proceedings of 11th Annual Computer Security
Application Conference, pages 241–248. IEEE Computer Society
Press, 1995. URL: http://brutus.ncsl.nist.gov/groups/SNS/rbac/documents/ferraiolo-cugini-kuhn-95.pdf.
[54] D F Ferraiolo and R Kuhn. Role-Based Access Control (RBAC). In Proceedings of the 15th
NIST-NSA National Computer Security Conference, pages 1–5, 1992.
[55] Seth Fiegerman. More Than 500 Million Photos Are Shared Every Day, 2013.
URL: http://mashable.com/2013/05/29/mary-meeker-internet-trends-2013/.
[56] The Privly Foundation. Priv.ly, 2015. URL: https://priv.ly/.
[57] A Freier, P Karlton, and P Kocher. The Secure Sockets Layer (SSL) Protocol
Version 3.0 [RFC 6101], 2011. URL: https://tools.ietf.org/html/rfc6101.
[58] Zeno Gantner, Lars Schmidt-Thieme, Steffen Rendle, and Christoph Freudenthaler. MyMediaLite: A Free Recommender System Library. October, pages
305–308, 2011. URL: http://dl.acm.org/citation.cfm?id=2043989.
[59] Stefania Ghita, Wolfgang Nejdl, and Raluca Paiu. Semantically Rich Recommendations in Social Networks for Sharing, Exchanging and Ranking Semantic Context. Social Networks, 3729:293–307, 2005. ISSN 03029743.
doi: 10.1007/11574620_23. URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.3295.
[60] Anthony Giddens. Modernity and Self-Identity: Self and Society in the Late
Modern Age. Stanford University Press, Stanford (CA), USA, 1991. ISBN
9780804719445. URL: http://www.sup.org/books/title/?id=2660.
[61] Luca Gilardoni, Christian Biasuzzi, Massimo Ferraro, Roberto Fonti, and Piercarlo Slavazza. LKMS - A Legal Knowledge Management System Exploiting
Semantic Web Technologies. In Proceedings of the Fourth International Semantic Web Conference, volume 3729 of ISWC’2005, pages 872–886, Galway, Ireland,
2005. Springer. doi: 10.1007/11574620_62. URL: http://dx.doi.org/10.1007/11574620_62.
[62] Scott Gilbertson. HTTPS is more secure, so why isn’t the Web using it?, 2011.
URL: http://arstechnica.com/business/2011/03/https-is-more-secure-so-why-isnt-the-web-using-it/.
[63] Jennifer Golbeck, Bijan Parsia, and James Hendler. Trust Networks on the
Semantic Web. Cooperative Information Agents VII, 1(1):238–249, 2003. ISSN
00992399. doi: 10.1504/IJMSO.2006.008770. URL: http://www.springerlink.com/index/766XQW7F29277DR8.pdf.
[64] Orlena C. Z. Gotel and Anthony C. W. Finkelstein. An Analysis of the Requirements Traceability Problem. In Proceedings of the First IEEE International
Conference on Requirements Engineering, volume Imperial C, pages 94–101. Department of Computing Imperial College, IEEE Computer Society Press, 1994.
URL: http://discovery.ucl.ac.uk/153850/.
[65] Asela Gunawardana and Guy Shani. A Survey of Accuracy Evaluation Metrics of Recommendation Tasks. The Journal of Machine Learning Research,
10:2935–2962, December 2009. URL: http://www.jmlr.org/papers/volume10/gunawardana09a/gunawardana09a.pdf.
[66] Hamed Haddadi, Heidi Howard, Amir Chaudhry, Jon Crowcroft, Anil Madhavapeddy, and Richard Mortier. Personal Data: Thinking Inside the Box.
January 2015. URL: http://arxiv.org/abs/1501.04737.
[67] Stephan Hagemann and Gottfried Vossen. Categorizing User-Generated Content (extended abstract). In the Proceedings of the Web Science Conference:
Society On-Line, WebSci’09, Athens, Greece, March 2009. Web Science Trust.
URL: http://journal.webscience.org/155/.
[68] Siegfried Handschuh, Raphael Volz, and Steffen Staab. Annotation for the Deep
Web. IEEE Intelligent Systems, 18(5):42–48, September 2003. ISSN 15411672. doi: 10.1109/MIS.2003.1234768. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1234768.
[69] Steve Harris and Andy Seaborne. SPARQL 1.1 Query Language, 2010. URL:
http://www.w3.org/TR/sparql11-query/.
[70] O Hartig and J Zhao. Publishing and Consuming Provenance Metadata
on the Web of Linked Data. Proceedings of the Third International
Provenance and Annotation Workshop, 6378:78–90, 2010. doi: 10.1007/978-3-642-17819-1.
URL: http://olafhartig.de/files/HartigZhao_Provenance_IPAW2010_Preprint.pdf.
[71] Olaf Hartig. Provenance Information in the Web of Data. In Proceedings of
the Linked Data on the Web Workshop in conjunction with the WWW, volume 39 of LDOW’09, pages 1–9. Ceur-Ws, 2009. doi: 10.1016/S0040-4039(98)00959-9.
URL: http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/conferences/2009-ldow-hartig.pdf.
[72] Jianming He and Wesley W Chu. A Social Network-Based Recommender System
(SNRS). In Data Mining for Social Network Data, editors Nasrullah Memon,
Jennifer Jie Xu, David L Hicks, and Hsinchun Chen, volume 12 of Annals of
Information Systems, pages 47–74. Springer US, 2010. ISBN 9781441962867.
doi: 10.1007/978-1-4419-6287-4. URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.2547.
[73] Jonathan L Herlocker, Joseph A Konstan, and John Riedl. Explaining Collaborative Filtering Recommendations. In Proceedings of the 2000 ACM Conference
on Computer Supported Cooperative Work, CSCW’00,
pages 241–250. ACM Press, 2000. ISBN 1581132220. doi: 10.1145/358916.358995.
URL: http://portal.acm.org/citation.cfm?doid=358916.358995.
[74] Alan R Hevner, Salvatore T March, Jinsoo Park, and Sudha Ram. Design
Science in Information Systems Research. MIS Quarterly, 28(1):75–105, March
2004. ISSN 02767783. doi: 10.2307/25148625. URL: http://dblp.uni-trier.de/rec/bibtex/journals/misq/HevnerMPR04.
[75] Ian Horrocks, Peter F Patel-Schneider, Harold Boley, Said Tabet, Benjamin
Grosof, and Mike Dean. SWRL: A Semantic Web Rule Language Combining OWL and RuleML. Syntax, 21(May):79, 2004. URL: http://www.w3.org/Submission/SWRL/.
[76] Yun Huang, Noshir Contractor, and York Yao. CI-KNOW: Recommendation
based on Social Networks. In Proceedings of the Ninth International Digital
Government Research Conference, editors Soon Ae Chun, Marijn Janssen, and
J Ramon Gil-Garcia, pages 27–33. Digital Government Society of North America, 2008. ISBN 9781605580999. URL: http://portal.acm.org/citation.cfm?id=1367832.1367840.
[77] Z Huang, Daniel Zeng, and Hsinchun Chen. A Comparative study of recommendation algorithms in e-commerce applications. IEEE Intelligent Systems,
22(5):1–23, 2007. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.8432&rep=rep1&type=pdf.
[78] Zan Huang, Daniel Zeng, and Hsinchun Chen. A Comparison of Collaborative-Filtering Algorithms for E-commerce. IEEE Intelligent Systems, 22(5):68–78,
2007. doi: 10.1109/MIS.2007.80.
[79] Ian Jacobs and Norman Walsh. Architecture of the World Wide Web, Volume
One, 2004. URL: http://www.w3.org/TR/webarch/.
[80] Gareth J F Jones. Challenges and Opportunities of Context-Aware Information Access. In International Workshop on Ubiquitous Data Management, UDM’05,
pages 53–60, Dublin, Ireland, April 2005. IEEE. ISBN 0769524117.
doi: 10.1109/UDM.2005.5. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1521237.
[81] Lalana Kagal. Rei: A Policy Language for the Me-Centric Project. Technical Report HPL-2002-270, HP Labs, 2002. URL: http://www.hpl.hp.com/techreports/2002/HPL-2002-270.html.
[82] Lalana Kagal and Tim Berners-Lee. Rein: Where Policies Meet Rules in the
Semantic Web. Technical report, Laboratory, Massachusetts Institute of Technology, 2005.
[83] José Kahan, Marja-Riitta Koivunen, Eric Prud’hommeaux, and Ralph R Swick.
Annotea: An Open RDF Infrastructure for Shared Web Annotations. In Proceedings of the Tenth International Conference on World Wide Web, WWW’01, pages
623–632. ACM, 2001. URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.2418.
[84] José Kahan, Marja-Riitta Koivunen, Eric Prud’hommeaux, and Ralph R Swick.
Annotea: An Open RDF Infrastructure for Shared Web Annotations. Computer
Networks, 39(5):589–608, 2002. ISSN 13891286. doi: 10.1016/S1389-1286(02)00220-7.
URL: http://linkinghub.elsevier.com/retrieve/pii/S1389128602002207.
[85] Andrew Kannenberg and Dr. Hossein Saiedian. Why Software Requirements
Traceability Remains a Challenge. The Journal of Defense Software Engineering,
(July/August):14–19, 2009.
[86] Michael Kifer, Jos De Bruijn, Harold Boley, and Dieter Fensel. A Realistic Architecture for the Semantic Web. Lecture Notes in Computer Science, 3791(November):17–29,
2005. URL: http://www.springerlink.com/index/f511460n0v3hl61n.pdf.
[87] Sara Kimberley. European Web Users Stop Searching After First
10 Results, 2009. URL: http://www.mediaweek.co.uk/article/974179/european-web-users-stop-searching-first-10-results-report-reveals.
[88] Andreas Klenk, Tobias Heide, Benoit Radier, Mikaël Salaün, and Georg
Carle. Pluggable Authorization and Distributed Enforcement with pam_xacml.
In Kommunikation in verteilten Systemen KiVS’09, pages 253–264, 2009.
URL: http://www.net.in.tum.de/fileadmin/bibtex/publications/papers/klenk_kivs2009.pdf.
[89] J Klensin. Role of the Domain Name System (DNS) [RFC 3467]. RFC 3467
(Informational), 2003. URL: http://www.ietf.org/rfc/rfc3467.txt.
[90] Dor Konforty, Yuval Adam, Daniel Estrada, and Lucius Gregory Meredith.
Synereo: The Decentralized and Distributed Social Network. Technical report,
Synereo, 2015. URL: http://www.synereo.com/whitepapers/synereo.pdf.
[91] Sebastian Ryszard Kruk. FOAF-Realm - control your friends’ access to the
resource. In Proceedings of the First Workshop on Friend of a Friend (FOAF),
Social Networking and the Semantic Web, Galway, Ireland, September 2004.
DERI.
[92] Sebastian Ryszard Kruk and Stefan Decker. Semantic Social Collaborative Filtering with FOAFRealm. In Proceedings of the Semantic Desktop Workshop at
the International Semantic Web Conference, editors Stefan Decker, Jack Park,
Dennis Quan, and Leo Sauermann, Galway, Ireland, November 2005.
[93] Sebastian Ryszard Kruk and Stefan Decker. JeromeDL and FOAFRealm - Taking
Advantage of Semantic Social Collaborative Filtering in Digital Libraries. In Proceedings of the Demo and Poster session at ECDL, volume 2005 of ECDL’2005,
pages 3–4, 2005. URL: http://vmserver14.nuigalway.ie/xmlui/handle/10379/580.
[94] Sebastian Ryszard Kruk, Slawomir Grzonkowski, Adam Gzella, Tomasz
Woroniecki, and Hee-Chul Choi. D-FOAF: Distributed Identity Management
with Access Rights Delegation. In Proceedings of the First Asian Semantic
Web Conference, ASWC’2006, pages 140–154, Beijing, PRC, September 2006.
Springer Berlin Heidelberg. ISBN 978-3-540-38329-1. doi: 10.1007/11836025_15.
[95] A. Langsford, K. Naemura, and R. Speth. OSI management and job transfer
services. In Proceedings of the IEEE, volume 71, pages 1420–1424. IEEE, December 1983. doi: 10.1109/PROC.1983.12790. URL: http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=1457058.
[96] Craig Larman. Applying UML and Patterns: An Introduction to Object-Oriented
Analysis and Design and Iterative Development. Prentice Hall, 3rd edition, 2004.
ISBN 007-6092037224.
[97] Timothy Lebo, Patrick West, and Deborah L. McGuinness. Walking into the Future with PROV Pingback: An Application to OPeNDAP using Prizms. In Proceedings of the Fifth International Provenance Annotation Workshop, IPAW’14,
2014. URL: https://github.com/timrdf/prizms/wiki.
[98] G Linden, B Smith, and J York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003. ISSN
10897801. doi: 10.1109/MIC.2003.1167344. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1167344.
[99] Håkan Lindqvist. Mandatory Access Control. PhD thesis, Umeå University,
2006.
[100] Miao Liu, He-Qing Guo, and Jin-Dian Su. An Attribute and Role Based Access
Control Model for Web Services. In Proceedings of the Fourth International
Conference on Machine Learning and Cybernetics, volume 2, pages 1302–1306,
Guangzhou, PRC, August 2005. ISBN 0-7803-9091-1. doi: 10.1109/ICMLC.2005.1527144.
[101] Antonis Loizou and Srinandan Dasmahapatra. Recommender Systems for the
Semantic Web. In Proceedings of the Recommender Systems Workshop of the
European Conference on Artificial Intelligence, volume 2 of ECAI’06, pages
269–271, Trento, Italy, August 2006. Springer-Verlag. ISBN 1852335769.
doi: 10.4304/jetwi.2.4.269-271. URL: http://ojs.academypublisher.com/index.php/jetwi/article/view/3630.
[102] Jake T. Lussier, Troy Raeder, and Nitesh V. Chawla. User Generated Content Consumption and Social Networking in Knowledge-Sharing OSNs. In Proceedings of the Third International Conference on Social Computing, Behavioral Modeling, and PRediction, volume 6007 LNCS of SBP’10, pages 228–237,
Bethesda, MD, USA, March 2010. Springer Berlin Heidelberg. ISBN 978-3-642-12078-7. doi: 10.1007/978-3-642-12079-4_29.
[103] Zakaria Maamar, Djamal Benslimane, and Nanjangud C Narendra. What Can
Context Do For Web Services? Communications of the ACM, 49(12):98–103,
December 2006. ISSN 00010782. doi: 10.1145/1183236.1183238. URL: http://portal.acm.org/citation.cfm?doid=1183236.1183238.
[104] Katherine A. MacKinnon. User Generated Content vs. Advertising: Do
Consumers Trust the Word of Others Over Advertisers? The Elon
Journal of Undergraduate Research in Communications, 3(1):14–22, 2012.
URL: https://www.elon.edu/docs/e-web/academics/communications/research/vol3no1/02MacKinnonEJSpring12.pdf.
[105] Robert C Martin. The Open-Closed Principle. In More C++ Gems, editor Robert C Martin, pages 97–112. Cambridge University Press, NY, USA,
2000. ISBN 0-521-78618-5.
[106] P V Mockapetris. Domain names - Concepts And Facilities [RFC1034]. RFC
1034 (INTERNET STANDARD), 1987. URL: http://www.ietf.org/rfc/rfc1034.txt.
[107] P V Mockapetris. Domain Names - Implementation and Specification
[RFC1035]. RFC 1035 (INTERNET STANDARD), 1987. URL: http://www.ietf.org/rfc/rfc1035.txt.
[108] Marie-Francine Moens, Juanzi Li, and Tat-Seng Chua. Mining User Generated
Content. Chapman & Hall/CRC, 2014. ISBN 978-1466557406. URL: http://dl.acm.org/citation.cfm?id=2584531.
[109] Samaneh Moghaddam, Mohsen Jamali, Martin Ester, and Jafar Habibi. FeedbackTrust: Using Feedback Effects in Trust-based Recommendation Systems. In
Proceedings of the Third ACM Conference on Recommender Systems, RecSys’09,
pages 269–272, New York, NY, USA, 2009. ACM. ISBN 9781605584355.
doi: 10.1145/1639714.1639765. URL: http://portal.acm.org/citation.cfm?id=1639765.
[110] B Moore, E Ellesson, J Strassner, and A Westerinen. Policy Core Information
Model – Version 1 Specification [RFC3060]. RFC 3060 (Proposed Standard),
2001. URL: http://www.ietf.org/rfc/rfc3060.txt.
[111] Pedro Moreira. Nuvens de Informação. Deco Pro Teste, pages 58–21, January
2015. ISSN 0873-8785.
[112] L Moreira-Matias, R Fernandes, J Gama, M Ferreira, J Mendes-Moreira, and
L Damas. An Online Recommendation System for the Taxi Stand choice Problem (Poster). In IEEE Vehicular Networking Conference, volume 1 of VNC’12,
pages 173–180, Seoul, South Korea, 2012. IEEE.
[113] Mozdev. Annozilla, 2005. URL: http://annozilla.mozdev.org.
[114] Srijith Nair. XACML Reference Architecture - Developer Blog, 2013.
URL: http://developers.axiomatics.com/blog/index/entry/xacml-reference-architecture.html.
[115] Mark H. Needleman. RDF: The resource description framework. Serials Review,
27(1):58–61, 2001. ISSN 00987913. doi: 10.1016/S0098-7913(00)00131-3. URL:
http://linkinghub.elsevier.com/retrieve/pii/S0098791300001313.
[116] Steve Nimmons. Policy Enforcement Point Pattern, 2012.
URL: http://stevenimmons.org/2012/02/policy-enforcement-point-pattern/.
[117] Salma Noor and Kirk Martinez. Using Social Data as Context for Making Recommendations: An Ontology based Approach. In Proceedings of the First Workshop on Context Information and Ontologies, CIAO’09, page #7, Herakleion,
Greece, June 2009. ACM. URL: http://eprints.ecs.soton.ac.uk/17685/.
[118] Xavier Ochoa and Erik Duval. Quantitative Analysis of User-Generated Content
on the Web. In Proceedings of the First International Workshop on Understanding Web Evolution, volume 34 of WebEvolve’08, pages 19–26, Beijing, PRC, April
2008. Web Science Trust. ISBN 978 0854328857.
[119] Fabrizio Orlandi and Alexandre Passant. Modelling provenance of DBpedia
resources using Wikipedia contributions. Web Semantics Science Services and
Agents on the World Wide Web, 9(2):149–164, 2011. ISSN 15708268.
doi: 10.1016/j.websem.2011.03.002. URL: http://linkinghub.elsevier.com/retrieve/pii/S1570826811000175.
[120] Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action.
Manning, 2011. URL: http://www.manning.com/owen/.
[121] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. World Wide Web Internet
And Web Information Systems, 54(2):1–17, 1998. URL:
http://ilpubs.stanford.edu:8090/422.
[122] Bill Parducci and Hal Lockhart. eXtensible Access Control Markup Language
(XACML) Version 3.0. Technical Report January, OASIS, 2013. URL:
https://www.oasis-open.org/committees/download.php/4412/oasis-xacml-2_0-core-spec-wd-01.pdf.
[123] Donn B Parker. Our Excessively Simplistic Information Security Model and
How to Fix It. ISSA Journal, (July):12–21, July 2010.
[124] Chad Perrin. The CIA Triad, 2008. URL:
http://www.techrepublic.com/blog/it-security/the-cia-triad/.
[125] Danny Poo, Brian Chng, and Jie-Mein Goh. A Hybrid Approach for User Profiling. In Proceedings of the 36th Annual Hawaii International Conference on System Sciences, volume 4 of HICSS’03, pages 103–111, Washington,
DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1874-5. doi:
10.1109/HICSS.2003.1174242. URL: http://dl.acm.org/citation.cfm?id=820751.821533.
[126] Torsten Priebe, Wolfgang Dobmeier, Christian Schläger, and Nora Kamprath.
Supporting Attribute-based Access Control in Authorization and Authentication
Infrastructures with Ontologies. Security, 2(1):27–38, February 2007.
[127] Sanjay Purushotham, Yan Liu, and C.-C. Jay Kuo. Collaborative Topic Regression with Social Matrix Factorization for Recommendation Systems. Proceedings
of the Twenty-Ninth International Conference on Machine Learning, July 2012.
[128] Sudha Ram and Jun Liu. Understanding the Semantics of Data Provenance to
Support Active Conceptual Modeling. Active conceptual modeling of learning,
4512(May 2006):17–29, 2007. doi: 10.1007/978-3-540-77503-4_3. URL:
http://www.springerlink.com/index/07x8046426578944.pdf.
[129] E Rescorla. HTTP Over TLS [RFC 2818], 2000. URL:
http://tools.ietf.org/html/rfc2818.
[130] Paul Resnick, Ko Kuwabara, Richard Zeckhauser, and Eric Friedman. Reputation Systems. Communications of the ACM, 43(12):45–48, 2000. ISSN 00010782.
doi: 10.1145/355112.355122.
[131] L Rodrigo, Richard Benjamins, J Contreras, D Patón, D Navarro, R Salla,
M Blázquez, P Tena, and I Martos. A Semantic Search Engine for the International Relation Sector. In Proceedings of the Fourth International Semantic Web
Conference, editors Yolanda Gil, Enrico Motta, Richard Benjamins, and Mark A
Musen, volume 3729 of Lecture Notes in Computer Science, pages 1002–1015,
Galway, Ireland, 2005. Springer Berlin / Heidelberg. ISBN
978-3-540-29754-3. doi: 10.1007/11574620_71.
[132] Sini Ruohomaa, Lea Kutvonen, and Eleni Koutrouli. Reputation Management
Survey. In Second International Conference on Availability, Reliability and Security, ARES’07, Vienna, Austria, April 2007.
[133] Alan Said, Benjamin Kille, Ernesto W De Luca, and Sahin Albayrak. Personalizing Tags: A Folksonomy-like Approach for Recommending Movies. In Proceedings of the Second International Workshop on Information Heterogeneity and
Fusion in Recommender Systems, HetRec’11, pages 53–56, Chicago, IL, USA,
October 2011. ACM. ISBN 978-1-4503-1027-7. doi: 10.1145/2039320.2039328.
[134] Andrei Sambra, Henry Story, and Tim Berners-Lee. Web Identity and Discovery,
2014. URL: http://www.w3.org/2005/Incubator/webid/spec/identity.
[135] Jean-Paul Sartre. Being and Nothingness. Washington Square Press, 1943. ISBN
0671867806. URL: https://books.google.pt/books?id=L6igUcpDEO8C.
[136] J Ben Schafer, Joseph A Konstan, and John Riedl. E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery, 5(1):115–153, 2001.
ISSN 13845810. doi: 10.1023/A:1009804230409. URL:
http://www.springerlink.com/index/r24285574675qu7v.pdf.
[137] Upendra Shardanand and Pattie Maes. Social Information Filtering: Algorithms for Automating "Word of Mouth". In Proceedings of the ACM Conference on Human Factors in Computing Systems, editors I R Katz, R Mack,
L Marks, M B Rosson, and J Nielsen, volume 1 of CHI ’95, pages 210–217.
ACM Press/Addison-Wesley Publishing Co., 1995. ISBN 0201847051. doi: 10.1145/223904.223931. URL:
http://portal.acm.org/citation.cfm?id=223904.223931.
[138] Hai-bo Shen and Fan Hong. An Attribute-Based Access Control Model for
Web Services. In Proceedings of the Seventh IEEE International Conference on
Parallel and Distributed Computing, Applications and Technologies, PDCAT’06,
pages 74–79, 2006. doi: 10.1109/PDCAT.2006.28.
[139] P Shoval, V Maidel, and B Shapira. An Ontology-Content-Based Filtering
Method. International Journal on Information Theories and Applications,
15:303–318, 2008. URL:
http://www.citeulike.org/user/icantador/article/4080611.
[140] Ahu Sieg, Bamshad Mobasher, and Robin Burke. Ontology-Based Collaborative Recommendation. In Proceedings of the Eighth Workshop on Intelligent
Techniques for Web Personalization & Recommender Systems, editors Bamshad
Mobasher, Dietmar Jannach, and Sarabjot Singh Anand, ITWP’10, Big Island
of Hawaii, USA, June 2010.
[141] Yogesh L. Simmhan, Beth Plale, and Dennis Gannon. A survey of data provenance in e-science, 2005.
[142] Rashmi R. Sinha and Kirsten Swearingen. Comparing Recommendations Made
by Online Systems and Friends. In Proceedings of the Second DELOS Network of
Excellence Workshop on Personalisation and Recommender Systems in Digital
Libraries, Dublin, Ireland, June 2001.
[143] Michael G. Solomon and Mike Chapple. Information Security Illuminated. Jones
& Bartlett Learning, 2005. ISBN 978-0763726775.
[144] Cheri Speier, Joseph S. Valacich, and Iris Vessey. The Influence of Task Interruption on Individual Decision Making: An Information Overload Perspective.
Decision Sciences, 30(2):337–360, 1999. URL:
http://doi.wiley.com/10.1111/j.1540-5915.1999.tb01613.x.
[145] Lee Stephen, William Dettelback, and Nishant Kaushik. Modernizing Access
Control with Authorization Service. Oracle - Developers and Identity Services,
(November), 2008.
[146] Perdita Stevens. Traceability in (bidirectional) model transformations, 2009. URL: http://wiki.esi.ac.uk/ProvenanceInSoftwareSystems.
[147] Henry Story. FOAF&SSL: Creating a Global Decentralised Authentication Protocol (The Sun BabelFish Blog), 2008. URL:
https://blogs.oracle.com/bblfish/entry/foaf_ssl_creating_a_global.
[148] Henry Story, Bruno Harbulot, Ian Jacobi, and Mike Jones. FOAF+SSL:
RESTful Authentication for the Social Web. Current, pages 1–12, 2009. doi:
10.1.1.154.3628. URL: http://ceur-ws.org/Vol-447/paper5.pdf.
[149] Rudi Studer, V Richard Benjamins, and Dieter Fensel. Knowledge engineering:
Principles and methods. Data & Knowledge Engineering, 25(1-2):161–197, 1998.
ISSN 0169023X. doi: 10.1016/S0169-023X(97)00056-6. URL:
http://linkinghub.elsevier.com/retrieve/pii/S0169023X97000566.
[150] Sebastian Tramp, Henry Story, Andrei Sambra, Philipp Frischmuth, Michael
Martin, and Sören Auer. Extending the WebID Protocol with Access Delegation.
In Proceedings of the Third International Workshop on Consuming Linked Data
(COLD2012), 2012.
[151] J Uhlir and M Falc. Annotating narratives using ontologies and conceptual
graphs, 2004. ISSN 15294188. URL:
http://portal.acm.org/citation.cfm?id=1019407.
[152] Andrzej Uszok, Jeffrey M. Bradshaw, Matthew Johnson, Renia Jeffers, Austin
Tate, Jeff Dalton, and Stuart Aitken. KAoS Policy Management for Semantic
Web Services. IEEE Intelligent Systems, 19(4):32–41, 2004. URL:
http://hdl.handle.net/1842/2217.
[153] John R. Vollbrecht, Pat R. Calhoun, Stephen Farrell, Leon Gommans, George M.
Gross, Betty de Bruijn, Cees T.A.M. de Laat, Matt Holdrege, and David W.
Spence. AAA Authorization Framework [RFC 2904], 2000.
[154] Chong Wang and David M Blei. Collaborative Topic Modeling for Recommending Scientific Articles. In Proceedings of the Seventeenth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, volume 7
of KDD’11, pages 448–456, San Diego, CA, USA, August 2011. ACM
Press. ISBN 9781450308137. doi: 10.1145/2020408.2020480.
[155] Lingyu Wang, Duminda Wijesekera, and Sushil Jajodia. A Logic-based Framework for Attribute based Access Control. In Proceedings of the 2004 ACM
workshop on Formal Methods in Security Engineering, FMSE ’04,
pages 45–55, New York, NY, USA, 2004. ACM Press. ISBN 1581139713.
doi: 10.1145/1029133.1029140. URL:
http://portal.acm.org/citation.cfm?doid=1029133.1029140.
[156] S Wasserman and K Faust. Social Network Analysis: Methods and Applications.
Structural analysis in the social sciences. Cambridge University Press, 1 edition, 1994. URL:
http://www.bibsonomy.org/bibtex/29b337c5a631805f247bd96e0f5ac7ed9/rincedd.
[157] Steve Wasserman. The Amazon Effect, May 2012. URL:
http://www.thenation.com/print/article/168125/amazon-effect.
[158] Daniel J. Weitzner, Harold Abelson, Tim Berners-Lee, Chris Hanson, James
Hendler, Lalana Kagal, Deborah L. McGuinness, Gerald Jay Sussman, and
K. Krasnow Waterman. Transparent Accountable Data Mining: New Strategies for Privacy Protection. Technical report, Computer Science and Artificial
Intelligence Laboratory, 2006. URL: http://dig.csail.mit.edu/TAMI/.
[159] A. Westerinen and J. Schnizlein. Terminology for Policy-Based Management
[RFC 3198], 2001. URL: http://tools.ietf.org/html/rfc3198.
[160] Andreas Wombacher and M R Huq. Towards Automatic Capturing of Manual
Data Processing Provenance. Technical report, Centre for Telematics and Information Technology, University of Twente, 2011. URL:
http://eprints.eemcs.utwente.nl/20164/01/paper.pdf.
[161] Sacha Wunsch-Vincent and Graham Vickery. Participative Web and
User-created Content: Web 2.0, Wikis and Social Networking.
Number 2006 in OECD Publications. OECD Publishing, 2007. ISBN
9789264037465. URL:
http://www.oecd-ilibrary.org/science-and-technology/participative-web-and-user-created-content_9789264037472-en.
[162] R. Yavatkar, D. Pendarakis, and R. Guerin. A Framework for Policy-based
Admission Control [RFC2753]. RFC 2753, pages 1–21, 2000. URL:
http://tools.ietf.org/html/rfc2753.
[163] Ching-man Au Yeung, Lalana Kagal, Nicholas Gibbins, and Nigel Shadbolt.
Providing Access Control to Online Photo Albums Based on Tags and Linked
Data. AAAI SSW, 2:9–14, 2009. URL: http://eprints.soton.ac.uk/267203/.
[164] Akintunde Michael Yinka. Data and Information Security. In Proceedings of
the First International Technology, Education and Environment Conference,
Omoku, Nigeria, 2011. Human Resource Management Academic Research Society (HRMARS) and African Society for Scientific Research (ASSR). URL:
http://hrmars.com/index.php/pages/detail/Proceeding2.
[165] Zhiwen Yu, Yuichi Nakamura, Seiie Jang, Shoji Kajita, and Kenji Mase.
Ontology-Based Semantic Recommendation for Context-Aware E-Learning. In
Ubiquitous Intelligence and Computing, editors Jadwiga Indulska, Jianhua Ma,
Laurence Yang, Theo Ungerer, and Jiannong Cao, volume 4611 of LNCS,
pages 898–907. Springer Berlin Heidelberg, 2007. ISBN 978-3-540-73548-9. doi:
10.1007/978-3-540-73549-6_88.
[166] Eric Yuan and Jin Tong. Attributed Based Access Control (ABAC) for Web
Services. In Proceedings of the Third IEEE International Conference on Web
Services, ICWS’05, pages 561–569, Orlando, FL, USA, July 2005. IEEE. ISBN
0769524095. doi: 10.1109/ICWS.2005.25.
Appendix A
Dataset Preparation
A.1 OpenRefine
The process of mapping tags from the original LastFM dataset to their semantic
equivalents consisted of reconciling tag values with their corresponding Musical
Genre1 in the Freebase database.
This task was carried out with the OpenRefine2 application, which is capable of reconciling3
text against chosen types of information. Such a reconciliation service provides a semi-automatic process for matching textual words to database identifiers.
The reconciliation process automatically reconciled 303 tags. It could not
provide any reconciliation for 5476 tags. The
remaining 6167 tags had multiple options for reconciliation.
The tags that had multiple possible options with very similar scores required a manually conducted post-processing phase. This phase allowed 4395 tags to be manually
associated with the most appropriate Musical Genre (according to the user’s perception).
There were initially 11946 tags in the LastFM dataset. At the end of the reconciliation
process, 5001 tags had been semantically reconciled to their corresponding Musical Genre
in the Freebase dataset. The remaining 6945 tags were left unreconciled and are considered waste, as
no appropriate correspondence can be established.
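The tag counts reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python; the constant and function names are illustrative assumptions, and only the figures themselves come from this section.

```python
# Tag counts reported in Section A.1 (names are illustrative, values are from the text).
TOTAL_TAGS = 11946
AUTO_RECONCILED = 303        # reconciled automatically by OpenRefine
NO_CANDIDATE = 5476          # no reconciliation candidate found
MULTIPLE_CANDIDATES = 6167   # several candidates, needing manual review
FINAL_RECONCILED = 5001      # tags mapped to a Musical Genre in the end


def unreconciled(total: int, reconciled: int) -> int:
    """Number of tags left without a Musical Genre after reconciliation."""
    return total - reconciled


# The three automatic outcomes partition the whole tag set.
assert AUTO_RECONCILED + NO_CANDIDATE + MULTIPLE_CANDIDATES == TOTAL_TAGS

waste = unreconciled(TOTAL_TAGS, FINAL_RECONCILED)
coverage = FINAL_RECONCILED / TOTAL_TAGS

print(waste)               # 6945
print(round(coverage, 3))  # 0.419
```

Under these figures, roughly 42% of the original tags end up with a semantic equivalent.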
A.2 Domain Ontology
Table A.1 compares the number of concepts and properties that existed in the original
dataset with those mapped to the domain ontology, and Table A.2 shows the origin of
information for each concept.
1 http://www.freebase.com/music/genre
2 http://openrefine.org/
3 https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation
Table A.1: Domain Ontology Facts & Numbers
Concepts/Properties/Actions                              LastFM Dataset  Domain Ontology
Users                                                    1892            1892
User-User                                                12717           12717
Artists                                                  17632           17632
Tags                                                     11946           11946
Musical Genres From Tags (Reconcile Process)             -               5001
Musical Artist - Musical Genre (Contextual Information)  -               36092
Musical Artist - Musical Genre Inferred from subGenres   -               1959
User - Musical Genre (User Preferences)                  -               177483
User-Artist (Traceability Information)                   92834           92834
User-Tag Assignment                                      186479          168847
User-Musical Artist (Provenance Information)             -               17632
Table A.2: Domain Concepts’ Source Datasets
                                               LastFM Dataset                                         Online
Group                 Item                     Artist  User  Tag  User/Listen  User/Tag  Users/Users  LastFM  Freebase  MusicBrainz  Reasoning
Concepts: User        Identification           -       ✓     -    ✓            ✓         ✓            -       -         -            -
Concepts: M. Artist   Identification           ✓       -     -    -            ✓         -            ✓       ✓         ✓            -
                      Name                     ✓       -     -    -            -         -            ✓       -         -            -
                      Picture URL              ✓       -     -    -            -         -            ✓       -         -            -
                      LastFM URI               ✓       -     -    -            -         -            ✓       -         -            -
                      Freebase URI             -       -     -    -            -         -            -       ✓         -            -
                      MusicBrainz URI          -       -     -    -            -         -            ✓       ✓         ✓            -
Concepts: M. Genre    Identification           -       -     -    -            -         -            -       ✓         -            -
                      Description              -       -     -    -            -         -            -       ✓         -            -
Concepts: Tag         Identification           -       -     ✓    -            -         -            -       -         -            -
                      Value                    -       -     ✓    -            -         -            -       -         -            -
Action: Tagging       Tag                      -       -     ✓    -            -         -            -       -         -            -
                      Day                      -       -     -    -            ✓         -            -       -         -            -
                      Month                    -       -     -    -            ✓         -            -       -         -            -
                      Year                     -       -     -    -            ✓         -            -       -         -            -
                      Artist                   ✓       -     -    -            ✓         -            -       -         -            -
                      User                     -       ✓     -    -            ✓         -            -       -         -            -
                      Tag                      -       -     ✓    -            ✓         -            -       -         -            -
Action: Listening     User                     -       ✓     -    ✓            -         -            -       -         -            -
                      Artist                   ✓       -     -    ✓            -         -            -       -         -            -
                      Listening Count          -       -     -    ✓            -         -            -       -         -            -
Association           User/M. Artist           ✓       ✓     -    ✓            -         -            -       -         -            -
                      User/User                -       ✓     -    -            -         ✓            -       -         -            -
                      M. Artist/M. Artist      -       -     -    -            -         -            -       -         -            ✓
                      Tag/M. Artist            ✓       -     ✓    -            ✓         -            -       -         -            -
                      Tag/User                 -       ✓     ✓    -            ✓         -            -       -         -            -
                      Tag/M. Genre             -       -     ✓    -            -         -            -       ✓         -            -
                      M. Artist/M. Genre       ✓       -     -    -            -         -            -       ✓         -            ✓
                      User/M. Genre            -       ✓     -    -            -         -            -       -         -            ✓
                      M. Genre/M. Genre        -       -     -    -            -         -            -       ✓         -            ✓
Provenance            Musical Artist Creation  ✓       ✓     -    -            ✓         -            -       -         -            -
                      Meta-Information         ✓       -     -    -            ✓         -            -       -         -            -
Traceability          Listening Actions        ✓       ✓     -    ✓            -         -            -       -         -            -
Appendix B
Recommendation Evaluation Results
B.1 Baseline Configurations Results
Table B.1 depicts baseline configurations and their evaluation results for AT values of
25, 50 and 150.
Table B.1: Baseline Configurations Results
Similarity: L = Log-Likelihood, T = Tanimoto.
Configuration ID  Similarity  User/Item-Based  Boolean/Weighted  AT   Precision  Recall  F1
C1                L           I                B                 25   0,0814     0,4969  0,1399
C1                L           I                B                 50   0,0345     0,6086  0,0654
C1                L           I                B                 150  0,0193     0,6677  0,0376
C2                T           I                B                 25   0,0774     0,4726  0,1331
C2                T           I                B                 50   0,0343     0,6060  0,0650
C2                T           I                B                 150  0,0195     0,6745  0,0380
C4                T           U                B                 25   0,0754     0,4652  0,1297
C4                T           U                B                 50   0,0327     0,5839  0,0620
C4                T           U                B                 150  0,0187     0,6509  0,0364
C7                L           U                W                 25   0,0471     0,3173  0,0820
C7                L           U                W                 50   0,0254     0,4751  0,0482
C7                L           U                W                 150  0,0167     0,5930  0,0325
C8                T           U                W                 25   0,0473     0,3180  0,0824
C8                T           U                W                 50   0,0255     0,4772  0,0484
C8                T           U                W                 150  0,0167     0,5942  0,0325
C10               L           U                B                 25   0,0730     0,4541  0,1258
C10               L           U                B                 50   0,0321     0,5749  0,0608
C10               L           U                B                 150  0,0185     0,6438  0,0359
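The F1 column of Table B.1 is consistent with the usual harmonic mean of precision and recall. The sketch below assumes that standard definition (the table itself does not restate it) and recomputes F1 for configuration C1; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) of baseline configuration C1 from Table B.1, keyed by AT.
c1 = {25: (0.0814, 0.4969), 50: (0.0345, 0.6086), 150: (0.0193, 0.6677)}

for at, (precision, recall) in c1.items():
    # Results agree with the published F1 column up to rounding of the inputs.
    print(at, round(f1_score(precision, recall), 4))
```

Because the published precision and recall are already rounded to four decimals, the recomputed F1 may differ from the table in the last digit.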
B.2 C1 Derived Configurations Results
Table B.2 presents C1 derived configurations and their evaluation results for AT values
of 25, 50 and 150.
Table B.2: C1 Derived Configurations Results
Rows for derived configurations are signed differences from the baseline (C1) row at the same AT.
Configuration ID  AT   Precision  Recall   F1
C1                25    0,0814     0,4969   0,1399
C1                50    0,0345     0,6086   0,0654
C1                150   0,0193     0,6677   0,0376
C104              25   +0,0605    +0,2874  +0,1005
C104              50   +0,0143    +0,1988  +0,0266
C104              150  +0,0054    +0,1496  +0,0103
C105              25   -0,0018    -0,0427  -0,0045
C105              50   -0,0078    -0,1516  -0,0149
C105              150  -0,0059    -0,2095  -0,0116
C109              25   +0,0456    +0,2011  +0,0750
C109              50   +0,0143    +0,1974  +0,0266
C109              150  +0,0055    +0,1530  +0,0105
C110              25   +0,0455    +0,2000  +0,0748
C110              50   +0,0143    +0,1973  +0,0266
C110              150  +0,0055    +0,1527  +0,0105
Configuration attribute codes, in printed column order:
Similarity (Log-Likelihood, Tanimoto, Semantic): L (12 rows), -, S (12 rows)
Prediction (User/Item-Based, Boolean/Weighted, Normalised/Std.): I (15 rows), B (15 rows), S (3 rows) then N (6 rows)
Aggregation (Union, Union Average, Intersection Average): UN (3 rows), -, UA (3 rows), -, IA (3 rows)
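The signed entries for derived configurations in Table B.2 read as differences from the baseline C1 row at the same AT. The sketch below works under that reading and under the assumption that F1 is the standard harmonic mean; it turns the C104 row at AT = 25 into absolute values and checks them for internal consistency. Variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Baseline C1 and the signed C104 entries, both at AT = 25 (Table B.2).
baseline = {"precision": 0.0814, "recall": 0.4969, "f1": 0.1399}
delta = {"precision": 0.0605, "recall": 0.2874, "f1": 0.1005}

# Absolute measures of C104 under the "difference from baseline" reading.
absolute = {name: baseline[name] + delta[name] for name in baseline}

# F1 recomputed from the absolute precision and recall should match the
# absolute F1 up to rounding of the published four-decimal values.
recomputed_f1 = f1_score(absolute["precision"], absolute["recall"])

print({name: round(value, 4) for name, value in absolute.items()})
print(round(recomputed_f1, 4))
```

The recomputed F1 and the F1 obtained by adding the published difference agree to within the rounding of the published values, which supports reading the signed entries as differences.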
B.3 C2 Derived Configurations Results
Table B.3 presents C2 derived configurations and their evaluation results for AT values
of 25, 50 and 150.
Table B.3: C2 Derived Configurations Results
Rows for derived configurations are signed differences from the baseline (C2) row at the same AT.
Configuration ID  AT   Precision  Recall   F1
C2                25    0,0774     0,4726   0,1331
C2                50    0,0343     0,6060   0,0650
C2                150   0,0195     0,6745   0,0380
C204              25   +0,0196    +0,0650  +0,0312
C204              50   +0,0021    -0,0071  +0,0036
C204              150  -0,0008    -0,0615  -0,0018
C205              25   +0,0022    -0,0184  +0,0023
C205              50   -0,0076    -0,1490  -0,0145
C205              150  -0,0061    -0,2163  -0,0120
C209              25   +0,0070    +0,0123  +0,0106
C209              50   +0,0048    +0,0594  +0,0089
C209              150  +0,0025    +0,0638  +0,0047
C210              25   -0,0296    -0,1960  -0,0516
C210              50   -0,0120    -0,2159  -0,0229
C210              150  -0,0059    -0,2006  -0,0116
Configuration attribute codes, in printed column order:
Similarity (Log-Likelihood, Tanimoto, Semantic): -, T (12 rows), S (12 rows)
Prediction (User/Item-Based, Boolean/Weighted, Normalised/Std.): I (15 rows), B (15 rows), S (3 rows) then N (6 rows)
Aggregation (Union, Union Average, Intersection Average): UN (3 rows), -, UA (3 rows), -, IA (3 rows)
B.4 C4 Derived Configurations Results
Table B.4 presents C4 derived configurations and their evaluation results for AT values
of 25, 50 and 150.
Table B.4: C4 Derived Configurations Results
Rows for derived configurations are signed differences from the baseline (C4) row at the same AT.
Configuration ID  AT   Precision  Recall   F1
C4                25    0,0754     0,4652   0,1297
C4                50    0,0327     0,5839   0,0620
C4                150   0,0187     0,6509   0,0364
C405              25   +0,0001    +0,0072  +0,0005
C405              50   -0,0025    -0,0499  -0,0048
C405              150  -0,0003    -0,0972  -0,0008
C404              25   +0,0079    +0,0486  +0,0137
C404              50   +0,0015    +0,0247  +0,0027
C404              150  +0,0005    +0,0152  +0,0008
C400              25   +0,0089    +0,0519  +0,0152
C400              50   +0,0018    +0,0282  +0,0033
C400              150  +0,0005    +0,0163  +0,0009
C411              25   +0,0032    +0,0237  +0,0057
C411              50   +0,0001    +0,0036  +0,0001
C411              150  +0,0002    +0,0090  +0,0004
Configuration attribute codes, in printed column order:
Similarity (Log-Likelihood, Tanimoto, Semantic): -, T (12 rows), S (12 rows)
Prediction (User/Item-Based, Boolean/Weighted, Normalised/Std.): U (15 rows), B (15 rows), N (6 rows) then S (3 rows)
Aggregation (Union, Union Average, Intersection Average): UN (3 rows), UA (3 rows), -, IA (3 rows), -
B.5 C7 Derived Configurations Results
Table B.5 presents C7 derived configurations and their evaluation results for AT values
of 25, 50 and 150.
Table B.5: C7 Derived Configurations Results
Rows for derived configurations are signed differences from the baseline (C7) row at the same AT.
Configuration ID  AT   Precision  Recall   F1
C7                25    0,0471     0,3173   0,0820
C7                50    0,0254     0,4751   0,0482
C7                150   0,0167     0,5930   0,0325
C705              25   +0,0230    +0,0112  +0,0335
C705              50   +0,0293    -0,0896  +0,0476
C705              150  +0,0348    -0,1875  +0,0590
C704              25   +0,0010    +0,0130  +0,0019
C704              50   +0,0013    +0,0239  +0,0025
C704              150  +0,0011    +0,0352  +0,0021
C700              25   +0,0008    +0,0104  +0,0015
C700              50   +0,0015    +0,0263  +0,0029
C700              150  +0,0012    +0,0386  +0,0023
C711              25   +0,0049    +0,0254  +0,0083
C711              50   +0,0058    -0,0126  +0,0102
C711              150  +0,0089    -0,0846  +0,0162
Configuration attribute codes, in printed column order:
Similarity (Log-Likelihood, Tanimoto, Semantic): L (12 rows), -, S (12 rows)
Prediction (User/Item-Based, Boolean/Weighted, Normalised/Std.): U (15 rows), W (15 rows), N (6 rows) then S (3 rows)
Aggregation (Union, Union Average, Intersection Average): UN (3 rows), UA (3 rows), -, IA (3 rows), -
B.6 C8 Derived Configurations Results
Table B.6 presents C8 derived configurations and their evaluation results for AT values
of 25, 50 and 150.
Table B.6: C8 Derived Configurations Results
Rows for derived configurations are signed differences from the baseline (C8) row at the same AT.
Configuration ID  AT   Precision  Recall   F1
C8                25    0,0473     0,3180   0,0824
C8                50    0,0255     0,4772   0,0484
C8                150   0,0167     0,5942   0,0325
C805              25   +0,0228    +0,0105  +0,0331
C805              50   +0,0292    -0,0917  +0,0474
C805              150  +0,0348    -0,1887  +0,0590
C804              25   +0,0011    +0,0154  +0,0022
C804              50   +0,0008    +0,0173  +0,0016
C804              150  +0,0009    +0,0273  +0,0017
C800              25   +0,0014    +0,0179  +0,0027
C800              50   +0,0011    +0,0211  +0,0020
C800              150  +0,0009    +0,0290  +0,0018
C811              25   +0,0047    +0,0247  +0,0079
C811              50   +0,0057    -0,0147  +0,0100
C811              150  +0,0089    -0,0858  +0,0162
Configuration attribute codes, in printed column order:
Similarity (Log-Likelihood, Tanimoto, Semantic): -, T (12 rows), S (12 rows)
Prediction (User/Item-Based, Boolean/Weighted, Normalised/Std.): U (15 rows), W (15 rows), N (6 rows) then S (3 rows)
Aggregation (Union, Union Average, Intersection Average): UN (3 rows), UA (3 rows), -, IA (3 rows), -
B.7 C10 Derived Configurations Results
Table B.7 presents C10 derived configurations and their evaluation results for AT
values of 25, 50 and 150.
Table B.7: C10 Derived Configurations Results
Rows for derived configurations are signed differences from the baseline (C10) row at the same AT.
Configuration ID  AT   Precision  Recall   F1
C10               25    0,0730     0,4541   0,1258
C10               50    0,0321     0,5749   0,0608
C10               150   0,0185     0,6438   0,0359
C305              25   +0,0025    +0,0183  +0,0044
C305              50   -0,0019    -0,0409  -0,0036
C305              150  -0,0001    -0,0901  -0,0003
C304              25   +0,0143    +0,0767  +0,0241
C304              50   +0,0042    +0,0612  +0,0079
C304              150  +0,0018    +0,0534  +0,0036
C300              25   +0,0145    +0,0765  +0,0244
C300              50   +0,0042    +0,0596  +0,0078
C300              150  +0,0017    +0,0504  +0,0034
C311              25   +0,0056    +0,0348  +0,0096
C311              50   +0,0007    +0,0126  +0,0013
C311              150  +0,0004    +0,0161  +0,0009
Configuration attribute codes, in printed column order:
Similarity (Log-Likelihood, Tanimoto, Semantic): L (12 rows), -, S (12 rows)
Prediction (User/Item-Based, Boolean/Weighted, Normalised/Std.): U (15 rows), B (15 rows), N (6 rows) then S (3 rows)
Aggregation (Union, Union Average, Intersection Average): UN (3 rows), UA (3 rows), -, IA (3 rows), -
B.8 Baseline Aggregated Predictions Results
Table B.8 presents aggregated predictions from baseline configurations and their evaluation results for AT values of 25, 50 and 150.
Table B.8: Baseline Configurations Aggregated Predictions Results
The F1 Comparison columns give the F1 of the aggregated configuration minus the F1 of Configuration 1 and of Configuration 2, respectively.
                                                                                      F1 Comparison
Configuration ID  Configuration 1  Configuration 2  AT   Precision  Recall  F1      Configuration 1  Configuration 2
C1                -                -                25   0,0814     0,4969  0,1399  -                -
C2                -                -                25   0,0774     0,4726  0,1331  -                -
C4                -                -                25   0,0754     0,4652  0,1297  -                -
C7                -                -                25   0,0471     0,3173  0,0820  -                -
C8                -                -                25   0,0473     0,3180  0,0824  -                -
C10               -                -                25   0,0730     0,4541  0,1258  -                -
C11               C1               C4               25   0,0821     0,5021  0,1412  +0,0013          +0,0115
C12               C1               C7               25   0,0475     0,3194  0,0826  -0,0573          +0,0006
C13               C1               C8               25   0,0479     0,3218  0,0834  -0,0565          +0,0010
C14               C1               C10              25   0,0790     0,4851  0,1359  -0,0040          +0,0101
C15               C2               C4               25   0,0787     0,4808  0,1353  +0,0022          +0,0056
C16               C2               C7               25   0,0475     0,3188  0,0826  -0,0505          +0,0006
C17               C2               C8               25   0,0475     0,3188  0,0826  -0,0505          +0,0002
C18               C2               C10              25   0,0737     0,4566  0,1269  -0,0062          +0,0011
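The F1 Comparison entries of Table B.8 match the F1 of an aggregated configuration minus the F1 of each of its two source configurations. A minimal sketch of that subtraction for C11 (built from C1 and C4) follows; the dictionary and function names are illustrative assumptions, and the values are the AT = 25 F1 scores from the table.

```python
# F1 at AT = 25 for an aggregated configuration and its two sources (Table B.8).
F1_AT_25 = {"C1": 0.1399, "C4": 0.1297, "C11": 0.1412}


def f1_difference(aggregated: str, component: str, scores: dict = F1_AT_25) -> float:
    """F1 of the aggregated configuration minus F1 of one source configuration."""
    return round(scores[aggregated] - scores[component], 4)


print(f1_difference("C11", "C1"))  # 0.0013
print(f1_difference("C11", "C4"))  # 0.0115
```

A positive value means the aggregated predictions scored a higher F1 than that source configuration on its own.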
B.9 Derived Configurations Aggregated Predictions Results
Table B.9 presents aggregated predictions from derived configurations and their evaluation results for AT values of 25, 50 and 150.
Table B.9: Derived Configurations Aggregated Predictions Results
The F1 Comparison columns give the F1 of the aggregated configuration minus the F1 of Configuration 1 and of Configuration 2, respectively.
                                                                                      F1 Comparison
Configuration ID  Configuration 1  Configuration 2  AT   Precision  Recall  F1      Configuration 1  Configuration 2
C106              C1               C105             25   0,0596     0,3596  0,1022  -0,0377          -0,0332
C106              C1               C105             50   0,0240     0,4212  0,0455  -0,0199          -0,0050
C106              C1               C105             150  0,0129     0,4456  0,0251  -0,0125          -0,0009
C206              C2               C205             25   0,0600     0,3628  0,1030  -0,0301          -0,0324
C206              C2               C205             50   0,0249     0,4415  0,0471  -0,0179          -0,0034
C206              C2               C205             150  0,0135     0,4738  0,0262  -0,0118          +0,0002
C306              C10              C305             25   0,0745     0,4651  0,1284  +0,0026          -0,0018
C306              C10              C305             50   0,0327     0,5855  0,0619  +0,0011          +0,0047
C306              C10              C305             150  0,0188     0,6554  0,0366  +0,0007          +0,0010
C406              C4               C405             25   0,0600     0,3628  0,1030  -0,0267          -0,0272
C406              C4               C405             50   0,0249     0,4415  0,0471  -0,0149          -0,0101
C406              C4               C405             150  0,0135     0,4738  0,0262  -0,0102          -0,0094
C706              C7               C705             25   0,0471     0,3173  0,0820   0,0000          -0,0335
C706              C7               C705             50   0,0254     0,4751  0,0482   0,0000          -0,0476
C706              C7               C705             150  0,0167     0,5930  0,0325   0,0000          -0,0590
C806              C8               C805             25   0,0562     0,7540  0,1046  +0,0222          -0,0109
C806              C8               C805             50   0,0274     1,0207  0,0533  +0,0049          -0,0425
C806              C8               C805             150  0,0171     1,2171  0,0338  +0,0013          -0,0577