EmoPhoto: Identification of Emotions in Photos
Soraia Vanessa Meneses Alarcão Castelo
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisor: Prof. Manuel João Caneira Monteiro da Fonseca
Examination Committee
Chairperson: Prof. José Carlos Martins Delgado
Supervisor: Prof. Manuel João Caneira Monteiro da Fonseca
Member of the Committee: Prof. Daniel Jorge Viegas Gonçalves
October 2014
Abstract
Nowadays, with the developments in digital photography and the increasing ease of acquiring cameras, taking pictures is a common task. Thus, the number of images in each person's private collection, as well as on the Internet, keeps growing.
Every time we use our collection of images, for example, to search for an image of a specific event, the images we receive will always be the same. However, our emotional state is not always the same: sometimes we are happy, and other times sad. Depending on the emotions perceived from an image, we are more receptive to some images than to others. In the worst case, we will feel worse, which, given the importance of emotions in our daily life, will significantly degrade our performance in cognitive tasks such as attention or problem-solving.
Although it seems interesting to take advantage of the emotions that an image transmits, currently
there is no way of knowing which emotions are associated with a given image. In order to identify the
emotional content present in an image, as well as the category of those emotions (Negative, Positive
or Neutral), we describe in this document two approaches: one using Valence and Arousal information,
and the other one using the content of the image.
The two developed recognizers achieved recognition rates of 89.20% and 68.68%, for the categories
of emotions, and 80.13% for the emotions. Finally, we also describe a new dataset of images annotated
with emotions, obtained from sessions with users.
Keywords: emotion recognition, emotions in images, fuzzy logic, content-based image retrieval, emotion-based image retrieval
Resumo
Actualmente, com os desenvolvimentos na área da fotografia digital e a crescente facilidade de aquisição
de câmaras fotográficas, tirar fotos tornou-se uma tarefa comum. Consequentemente, o número de imagens nas colecções particulares de cada pessoa, bem como das imagens disponíveis na Internet,
aumentou.
Sempre que procuramos uma imagem de um determinado evento na nossa colecção particular, as
imagens apresentadas serão sempre as mesmas. No entanto, o nosso estado emocional não permanece igual: por vezes estamos felizes, e outras vezes tristes. Dependendo das emoções percepcionadas a partir de uma imagem, estamos mais receptivos a algumas imagens do que outras. No
pior caso, vamos sentir-nos pior, o que, dada a importância das emoções no nosso quotidiano, poderá
conduzir a uma deterioração no desempenho de tarefas a nível cognitivo, como atenção ou resolução
de problemas.
Embora pareça interessante aproveitar as emoções transmitidas pelas imagens, actualmente não
existe nenhuma forma de saber quais as emoções que estão associadas a uma determinada imagem.
A fim de identificar os conteúdos emocionais presentes numa imagem, assim como a categoria desses
conteúdos (negativa, positiva ou neutra), descrevemos neste documento duas abordagens: uma recorrendo aos níveis de Valência e Excitação da imagem, e uma outra utilizando o conteúdo da mesma.
Os dois reconhecedores desenvolvidos alcançaram taxas de reconhecimento de 89.20% e 68.68%,
para as categorias de emoções, e 80.13% para as emoções. Por fim, criámos um novo conjunto de
dados de imagens anotadas com emoções, obtidas a partir de sessões com utilizadores.
Palavras-chave: reconhecimento de emoções, emoções em imagens, lógica difusa, recuperação de
imagens baseada em conteúdo, recuperação de imagens baseada em emoções
Acknowledgments
I would like to thank my supervisor, Prof. Manuel João da Fonseca, not only for being an inspiration
in the fields of Human-Computer Interaction and Multimedia Information Retrieval, but also for being a
supportive and committed supervisor, who has always believed in my work, provided valuable feedback
and motivated me all the way through this journey. To Prof. Rui Santos Cruz, thank you for also encouraging me to go even further in my academic choices and for being always available to sort out any issue
that I came across.
To my family, in general, thank you for forgiving my “absence” these past years due to my academic
life. To my brother Pedro Castelo, my sister Alexandra Castelo, and my sister-in-law Laura Pereira,
thank you for all your support and care, and for making sure that I would withstand these past years. To
my mother Carmo Meneses Alarcão, thank you for always believing in my success, and for being with
grandma when I was not able to. To Jorge Cabrita, thank you for being the “father” that I never had.
To my second “mommy” Luísa Bravo da Mata, for encouraging me and always cheering me on each
decision I took, thank you! My deepest, fondest and heartfelt thank you goes to my grandma Alcinda
Meneses Alarcão, for all the sacrifices she made throughout her life in order to get me where I am now.
Without her, I would not be who I am, and would not have gotten this far as I did.
In the last couple of years, I was fortunate to find true and amazing friends. Every time I was
happy they were there to smile and celebrate with me. However, all the times I needed their support,
they were also there: listening, helping, and most of the time, telling “our” silly jokes to cheer me up!
Therefore, each one of you is the family that I chose: Ana Sousa, Andreia Ferrão, Bernardo Santos,
Joana Condeço, João Murtinheira, Inês Bexiga, Inês Castelo, Inês Fernandes, Margarida Alberto, Maria
João Aires, Miguel Coelho, Ricardo Carvalho, Rui Fabiano, and last, but not least, my “sister” Vânia
Mendonça.
A special thank you to my favorite “grammar-police” staff: Bernardo, João, Miguel, and Vânia, for your patience and availability to proofread each part of this work countless times. Thanks to all of you, this final version became much more complete and typo-free. I appreciate everything that you have taught me; my English skills have improved so much! Catarina Moreira, João Vieira, and João Simões Pedro, thank you for your precious assistance with your knowledge of Machine Learning and Statistics.
Thank you to all the amazing people that I had the pleasure to work with these past years, whether
in class projects or other academic projects (especially NEIIST): David Duarte, David Silva, Fábio Alves,
Luis Carvalho, Mauro Brito, Ricardo Laranjeiro, Rita Tomé, among many others.
I would also like to thank everyone who agreed to participate in the user sessions I performed in the context of this thesis.
To each and every one of you - Thank you.
To my grandma Alcinda
Contents

Abstract
Resumo
Acknowledgments
List of Figures
List of Tables
List of Acronyms
1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Solution
  1.4 Contributions and Results
  1.5 Document Outline
2 Context and Related Work
  2.1 Emotions
  2.2 Emotions in Images
    2.2.1 Content-Based Image Retrieval
    2.2.2 Facial-Expression Recognition
    2.2.3 Relationship between features and emotional content of an image
  2.3 Applications
    2.3.1 Emotion-Based Image Retrieval
    2.3.2 Recommendation Systems
  2.4 Datasets
  2.5 Summary
3 Fuzzy Logic Emotion Recognizer
  3.1 The Recognizer
  3.2 Experimental Results
  3.3 Discussion
  3.4 Summary
4 Content-Based Emotion Recognizer
  4.1 List of features used
  4.2 Classifier
    4.2.1 One feature type combinations
    4.2.2 Two feature type combinations
    4.2.3 Three feature type combinations
    4.2.4 Four feature type combinations
    4.2.5 Overall best features combinations
  4.3 Discussion
  4.4 Summary
5 Dataset
  5.1 Image Selection
  5.2 Description of the Experience
  5.3 Pilot Tests
  5.4 Results
  5.5 Discussion
  5.6 Summary
6 Evaluation
  6.1 Results
    6.1.1 Fuzzy Logic Emotion Recognizer
    6.1.2 Content-Based Emotion Recognizer
  6.2 Discussion
  6.3 Summary
7 Conclusions and Future Work
  7.1 Summary of the Dissertation
  7.2 Final Conclusions and Contributions
  7.3 Future Work
Bibliography
Appendix A
Appendix B
List of Figures

2.1 Universal basic emotions
2.2 Circumplex model of affect, which maps the universal emotions in the Valence-Arousal plane
2.3 Wheel of Emotions
3.1 Circumplex Model of Affect with basic emotions. Adapted from [75]
3.2 Distribution of the images in terms of Valence and Arousal
3.3 Polar Coordinate System for the distribution of the images
3.4 Sigmoidal membership function
3.5 Trapezoidal membership function
3.6 2-D Membership Function
3.7 Membership Functions for Negative category
3.8 2-D Membership Function
3.9 Membership Functions for Neutral category
3.10 2-D Membership Function
3.11 Membership Functions for Positive category
3.12 Membership Functions for Anger, Disgust and Sadness
3.13 Membership Functions for Disgust
3.14 Membership Functions for Disgust and Fear
3.15 Membership Functions for Disgust and Sadness
3.16 Membership Functions for Fear
3.17 Membership Functions for Happiness
3.18 Membership Functions for Sadness
3.19 Membership Functions of Angle for all classes of emotions
3.20 Membership Functions of Radius for all classes of emotions
4.1 Average recognition considering all features
4.2 Results for Color - one feature
4.3 Results for Color - two features
4.4 Results for Color - three features
4.5 Time to build models
5.1 EmoPhoto Questionnaire
5.2 Emotional state of the users in the beginning of the test
5.3 Classification of the Negative images of our dataset (from users)
5.4 Classification of the Neutral and Positive images of our dataset (from users)
6.1 Classification of the Negative and Positive images of our dataset (from users)
B1 EmoPhoto Questionnaire
B2 1. Age
B3 2. Gender
B4 3. Education Level
B5 4. Have you ever participated in a study using any Brain-Computer Interface Device?
B6 7. How do you feel?
B7 8. Please classify your emotional state regarding the following cases: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise
List of Tables

2.1 Comparison between International Affective Picture System (IAPS), Geneva Affective PicturE Database (GAPED) and Mikels datasets
3.1 Confusion Matrix for the classes of emotions in the IAPS dataset
3.2 Confusion Matrix for the categories in the Mikels dataset
3.3 Confusion Matrix for the categories in the GAPED dataset
4.1 List of best features for each category type
4.2 Overall best features
5.1 Confusion Matrix for the categories between Mikels and our dataset
5.2 Confusion Matrix for the categories between GAPED and our dataset
6.1 Confusion Matrix for the categories using our dataset
6.2 Confusion Matrix for the categories using our dataset
A1 Simple and Meta classifiers results for each feature
A2 Vote classifiers results for each feature
A3 Results for Color using one feature
A4 Results for combination of two Color features
A5 Results for combination of three Color features
A6 Results for combination of four Color features
A7 Results for combination of five Color features
A8 Results for combination of six Color features
A9 Results for combination of seven Color features
A10 Results for combination of all Color features
A11 List of candidate features for Color
A12 Results for Composition feature
A13 Results for combination of Shape features
A14 Results for combination of Texture features
A15 Results for combination of Joint features
A16 Results for combination of Color and Composition features
A17 Results for combination of Color and Shape features
A18 Results for combination of Color and Texture features
A19 Results for combination of Color and Joint features
A20 Results for combination of Composition and Shape features
A21 Results for combination of Composition and Texture features
A22 Results for combination of Composition and Joint features
A23 Results for combination of Shape and Texture features
A24 Results for combination of Shape and Joint features
A25 Results for combination of Texture and Joint features
A26 Results for combination of Color, Composition and Shape features
A27 Results for combination of Color, Composition and Texture features
A28 Results for combination of Color, Composition and Joint features
A29 Results for combination of Color, Shape and Texture features
A30 Results for combination of Color, Shape and Joint features
A31 Results for combination of Color, Texture and Joint features
A32 Results for combination of Color, Composition, Texture and Shape features
A33 Results for combination of Color, Composition, Texture and Joint features
A34 Results for combination of Color, Texture, Joint and Shape features
A35 Results for combination of Color, Texture, Joint and Composition features
A36 Confusion Matrices for each combination
A37 Confusion Matrices for each combination using GAPED dataset with Negative and Positive categories
A38 Confusion Matrices for each combination using GAPED dataset with Negative, Neutral and Positive categories
A39 Confusion Matrices for each combination using Mikels and GAPED dataset
List of Acronyms

AAM     Active Appearance Models
ACC     AutoColorCorrelogram
ADF     Anger, Disgust and Fear
ADS     Anger, Disgust and Sadness
AF      Anger and Fear
AM      Affective Metadata
ANN     Artificial Neural Network
AS      Anger and Sadness
AU      Action Unit
Bag     Bagging
BCI     Brain-Computer Interfaces
CBIR    Content-based Image Retrieval
CBER    Content-based Emotion Recognizer
CBR     Content-based Recommender
CCV     Color Coherence Vectors
CEDD    Color and Edge Directivity Descriptor
CF      Collaborative-Filtering
CH      Color Histogram
CM      Color Moments
CMA     Circumplex Model of Affect
D       Disgust
DF      Disgust and Fear
DOM     Degree of Membership
DOF     Depth of Field
DS      Disgust and Sadness
EBIR    Emotion-based Image Retrieval
EEG     Electroencephalography
EH      Edge Histogram
F       Fear
FCP     Facial Characteristic Point
FCTH    Fuzzy Color and Texture Histogram
FE      Feature Extraction
FLER    Fuzzy Logic Emotion Recognizer
FER     Facial Expression Recognition
FS      Fear and Sadness
G       Gabor
GAPED   Geneva Affective PicturE Database
GLCM    Gray-Level Co-occurrence Matrix
GM      Generic Metadata
GMM     Gaussian Mixture Models
GPS     Global Positioning System
Ha      Happiness
H       Haralick
HSV     Hue, Saturation and Value
IAPS    International Affective Picture System
IBk     K-nearest neighbours
IGA     Interactive Genetic Algorithm
J48     C4.5 Decision Tree (algorithm from Weka)
JCD     Joint Composite Descriptor
KDEF    Karolinska Directed Emotional Faces
LB      LogitBoost
Log     Logistic
MHCPH   Modified Human Colour Perception Histogram
MIP     Mood-Induction Procedures
MLP     Multi-Layer Perceptron
NB      Naive Bayes
NDC     Number of Different Colors
OH      Opponent Histogram
PCA     Principal Component Analysis
PFCH    Perceptual Fuzzy Color Histogram
PFCHS   Perceptual Fuzzy Color Histogram with 3x3 Segmentation
POFA    Pictures of Facial Affect
PVR     Personal Video Recorders
RCS     Reference Color Similarity
RecSys  Recommendation Systems
RF      Random Forest
RSS     RandomSubSpace
RT      Rule of Thirds
S       Sadness
SAM     Self-Assessment Manikin
SM      Similarity Measurement
SMO     John Platt’s sequential minimal optimization algorithm for training a support vector classifier
SPCA    Shift-invariant Principal Component Analysis
Su      Surprise
SVM     Support Vector Machine
T       Tamura
V1      Vote 1
V2      Vote 2
V3      Vote 3
V4      Vote 4
V5      Vote 5
V6      Vote 6
VAD     Valence, Arousal and Dominance
VOD     Video-On-Demand systems
1 Introduction
In this chapter we present our motivation, the goals we intend to achieve, as well as the solution developed to identify emotions, in particular the Fuzzy Logic Emotion Recognizer (FLER) and the Content-based Emotion Recognizer (CBER). We also enumerate the main contributions and results of our work,
as well as the document outline.
1.1 Motivation
Images are an increasingly important class of data, especially as computers become more usable, with greater memory and communication capacities [42]. Nowadays, with the developments in digital photography and the increasing ease of acquiring cameras and smartphones, taking pictures (and storing them) is a common task. Thus, the number of images in each person's private collection keeps growing; as for the images available on the Internet, their number is not just big, it is huge.
With this massive growth in the amount of visual information available, the need to store and retrieve images in an efficient manner arises, increasing the importance of Content-based Image Retrieval (CBIR) systems [78]. However, these systems do not take into account high-level features such as the human emotions associated with the images or the emotional state of the users. To overcome this, a new technique, Emotion-based Image Retrieval (EBIR), was proposed to extend CBIR systems with human emotions in addition to common features [81, 87]. Currently, emotion or mood information is already used as search terms within multimedia databases, retrieval systems or even multimedia players.
We can interact with and explore image collections in many ways. One possibility is through their content, such as Colors, Shapes, Texture and Lines, or through associated information such as tags, dates or Global Positioning System (GPS) information. Every time we search for something, for example images from a specific day or event, the order in which the images are presented can be different, but the images will always be the same. However, our emotional state is not always the same: sometimes we are happy, and other times sad or depressed. Therefore, we are more receptive to some images than to others, depending on the emotions perceived from the image.
In the image domain, emotions describe the personal affectedness based on spontaneous perception [19], and can be conveyed, for example, through the colors of the image or the facial expressions of the people present in it. In the worst case, these results will make us feel even worse, which, given the importance of emotions in our daily life, will significantly degrade our performance during cognitive tasks such as attention, creativity, memory, decision-making, judgment, learning or problem-solving.
Although it seems interesting to take advantage of the emotions that an image transmits, for example,
by using them to explore a collection of images, currently there is no way of knowing which emotions
are associated with a given image. In order to identify the emotional content present in an image, i.e.,
the emotions that would be triggered when viewing the image, as well as the corresponding category of
those emotions (Negative, Positive or Neutral), we will follow two approaches: one using Valence and
Arousal information, and the other one using the content of the image.
1.2 Goals
This work aims to identify the emotional content present in an image with respect to the corresponding category of emotions, i.e., whether an image transmits Negative, Positive or Neutral feelings to the viewer. We also want to give an insight into which emotions would be triggered when viewing an image.
To that end, we plan to take advantage of the Valence and Arousal values associated with some datasets of images and, when there is no V-A information, to use the content of the images to derive the emotion or category of emotions that they convey to the viewer.
To achieve this, we need to focus on three different sub-goals: i) develop an emotion recognizer based on the Valence and Arousal information associated with images; ii) develop an emotion recognizer based on the visual content of the images, such as Colors, Shape or Texture; and iii) collect information from users about the dominant emotions transmitted by a set of images, by performing an experiment with them.
1.3 Solution
The solution developed in the context of this work consists of two recognizers that are able to identify the emotional content of an image, using different inputs and producing different levels of output.
The first recognizer, the Fuzzy Logic Emotion Recognizer (FLER), uses the normalized Valence and Arousal values of an image to automatically classify the classes of emotions conveyed by the image: Anger, Disgust and Sadness (ADS), Disgust (D), Disgust and Fear (DF), Disgust and Sadness (DS), Fear (F), Happiness (Ha), and Sadness (S), as well as the categories: Positive, Neutral and Negative. To describe each class of emotions, as well as the categories, we used a Fuzzy Logic approach, in which each set is characterized by a membership function that assigns to each object a Degree of Membership lying between zero and one [90]. In the case of the emotions, we used the Product of Sigmoidals membership function for the Angle, which correlates the Valence and Arousal values, and the Trapezoidal membership function for the Radius, which helps reduce the confusion between emotions of images with similar Angles. For the categories, we used the Trapezoidal membership function both for the Angle and for the Radius. Finally, for each class of emotions and each category, we used a two-dimensional membership function that results from the composition of the two one-dimensional membership functions mentioned above.
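As a minimal sketch of how such a composition can be implemented (assuming Valence and Arousal normalized to [-1, 1], and using illustrative placeholder parameters rather than the values tuned for the FLER), the Angle and Radius memberships can be combined as follows:

```python
import numpy as np

def sigmf(x, a, c):
    """Sigmoidal membership function: rises (a > 0) or falls (a < 0) around c."""
    return 1.0 / (1.0 + np.exp(-a * (x - c)))

def psigmf(x, a1, c1, a2, c2):
    """Product of two sigmoidals: a smooth bump over an angular interval."""
    return sigmf(x, a1, c1) * sigmf(x, a2, c2)

def trapmf(x, a, b, c, d):
    """Trapezoidal membership function with feet a, d and shoulders b, c."""
    return np.clip(np.minimum((x - a) / (b - a), (d - x) / (d - c)), 0.0, 1.0)

def degree_of_membership(valence, arousal):
    """Compose Angle and Radius memberships of one (V, A) point into a 2-D degree.

    Valence and Arousal are assumed to be normalized to [-1, 1]; the Angle and
    Radius are the polar coordinates of the point in the V-A plane. The class
    definitions below are hypothetical placeholders for illustration only.
    """
    angle = np.degrees(np.arctan2(arousal, valence)) % 360.0
    radius = np.hypot(valence, arousal)

    # Product-of-sigmoidals on the Angle, trapezoid on the Radius; the 2-D
    # membership is the composition (product) of the two 1-D functions.
    classes = {
        "Happiness": psigmf(angle, 0.3, 10, -0.3, 80) * trapmf(radius, 0.2, 0.4, 1.0, 1.2),
        "Fear":      psigmf(angle, 0.3, 100, -0.3, 150) * trapmf(radius, 0.3, 0.5, 1.0, 1.2),
        "Sadness":   psigmf(angle, 0.3, 160, -0.3, 210) * trapmf(radius, 0.2, 0.4, 1.0, 1.2),
    }
    return max(classes.items(), key=lambda kv: kv[1])

print(degree_of_membership(0.7, 0.5))   # e.g. ('Happiness', <degree close to 1>)
```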
The second recognizer, the Content-based Emotion Recognizer (CBER), uses the visual content of the image to automatically classify it as Negative or Positive. To select the best descriptors, we performed a large number of tests using different combinations of Color, Texture, Shape, Composition and Joint descriptors/features. We started by analyzing a set of classifiers, in order to understand which one best learns the relationship between features and the given category of emotions. After that, and based on the relationships found, we proposed six different combinations of classifiers using the Vote classifier as a base. For each of the proposed combinations of classifiers, we tested several feature combinations. In the end, the best solution is composed of a Vote classifier, containing John Platt's sequential minimal optimization algorithm for training a support vector classifier (SMO), Naive Bayes (NB), LogitBoost (LB), Random Forest (RF), and RandomSubSpace (RSS), and a combination of features that includes the Color Histogram (CH), Color Moments (CM), Number of Different Colors (NDC), and Reference Color Similarity (RCS).
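For illustration only, a rough analogue of this ensemble can be sketched with scikit-learn's soft-voting classifier; the recognizer itself was built with Weka's Vote meta-classifier, so the learners below are approximate stand-ins, and the feature matrix is a random placeholder for the concatenated CH, CM, NDC and RCS descriptors:

```python
import numpy as np
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# X: one row per image (placeholder for the concatenated CH, CM, NDC and RCS
# descriptors); y: category labels (0 = Negative, 1 = Positive), also random here.
rng = np.random.default_rng(0)
X = rng.random((200, 64))
y = rng.integers(0, 2, 200)

# Rough analogues of the Weka learners used here:
# SMO ~ linear SVC, NB ~ GaussianNB, LogitBoost ~ GradientBoosting,
# RandomSubSpace ~ Bagging over random feature subsets.
vote = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", probability=True)),
        ("nb", GaussianNB()),
        ("boost", GradientBoostingClassifier()),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("subspace", BaggingClassifier(max_features=0.5, n_estimators=10)),
    ],
    voting="soft",
)
vote.fit(X, y)
print(vote.predict(X[:5]))
```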
1.4 Contributions and Results
With the completion of this thesis, we achieved three main contributions:
• A Fuzzy recognizer with a classification rate of 100% for the categories and 91.56% for the emotions on the Mikels dataset [66]; with the Geneva Affective PicturE Database (GAPED) we achieved an average classification rate of 95.59% for the categories. Using our dataset, we achieved a success rate of 68.70% for the emotions. In the case of the categories, we achieved 100% for the Negative category, 85% for the Positive and 28% for the Neutral.
• A recognizer based on the content of the images, which has a recognition rate of 87.18% for the Negative category and 57.69% for the Positive, using a dataset of images selected from both the International Affective Picture System (IAPS) and GAPED datasets. Using our dataset, we achieved a recognition rate of 76.54% for the Negative category and 52.38% for the Positive.
• A new dataset of 169 images from IAPS, Mikels and GAPED, annotated with the dominant categories and emotions, according to what people felt while viewing each image.
1.5 Document Outline
In chapter 2, we describe the importance of emotions, as well as how they can be represented. Along with it, we detail previous work on the recognition of emotions in images, on how to identify the emotional state of a user, and on some research areas where these two topics are combined: Emotion-based Image Retrieval (EBIR) and Recommendation Systems (RecSys). We also describe the relationship between emotions and the different visual characteristics of an image. Finally, we present the datasets that we used in our work: International Affective Picture System (IAPS), Geneva Affective PicturE Database (GAPED) and Mikels.
In chapter 3, we describe the Fuzzy Logic Emotion Recognizer (FLER) and the corresponding experimental results achieved, while in chapter 4, both the Content-based Emotion Recognizer (CBER) and
the experimental results obtained are described. We also present an analysis of the different possible
combinations between the different types of features used: Color, Texture, Composition, and Shape.
In chapter 5, we present a new dataset that is annotated with information collected through experiments with users. In chapter 6, we present the evaluation of FLER and CBER using our new annotated
dataset.
Finally, a summary of the dissertation, the conclusions and future work are presented in chapter 7.
2 Context and Related Work
Within this chapter we present a review and summary of the related work in the fields of emotions, the recognition of emotions in images using Content-based Image Retrieval (CBIR) and Facial Expression Recognition (FER), the relationship between emotions and the visual characteristics mentioned in CBIR, and finally Emotion-based Image Retrieval (EBIR) and Recommendation Systems (RecSys). Although this seems to be a lot of fields, some of them, such as emotions and RecSys, are described here only to give some context, while CBIR, FER and EBIR are our main focus. We also present and describe the datasets used in our work.
2.1 Emotions
“An emotion is a complex psychological state that involves three distinct components: a
subjective experience, a physiological response, and a behavioral or expressive response.”
[35]
Emotions have been described as discrete and consistent responses to external or internal events
with particular significance for the organism. They are brief in duration and correspond to a coordinated
set of responses, which may include verbal, behavioral, physiological and neural mechanisms. In affective neuroscience, emotions can be differentiated from similar constructs like feelings, moods and affects. Feelings can be viewed as a subjective representation of emotions. Moods are diffuse affective
states that generally last for much longer durations than emotions and are also usually less intense than
emotions. Finally, affect is an encompassing term, used to describe the topics of emotion, feelings, and
moods together [23].
Emotions play an essential role in human cognition and in the daily life of human beings, being critical for rational decision-making, perception, human interaction, and human intelligence [69]. The importance of (and need for) automatic emotion recognition has grown with the increasing role of human-computer interface applications. Nowadays, new forms of human-centric and human-driven interaction with digital media have the potential of revolutionizing entertainment, learning, and many other areas of life. Emotion recognition can be done from text, speech, facial expressions or gestures [54].
Currently, given the importance of emotions and emotion-related variables in gaming behavior, people seek and are eager to pay for games that elicit strong emotional experiences. This can be achieved using bio-signals in a biofeedback system, which can be implicit or explicit. Implicit biofeedback is similar to affective feedback, i.e., the users are not aware that their physiological states are being sensed, because the intention is to capture their normal affective reactions; the system modulates its behavior according to the registered bio-signals. Explicit biofeedback originated in the field of medicine, with the intent of making subjects more aware of their bodily processes by displaying the information in an easy and clear way. This means the user has direct and conscious control over the application. If, under implicit feedback, the user starts to learn how the system works and uses that knowledge to obtain control over it, the system becomes explicit. It is a popular trend to use various game mechanics in other areas, such as education, simulation, exercising, group work and design. For this reason, there is the belief that work on biofeedback interaction will find applications in a broad range of domains [46].
Previous studies have suggested that men and women process emotional stimuli differently. In [53], the authors verified whether there was any consistency in the regions of activation in men and women when processing stimuli portraying happy or sad emotions presented in the form of facial expressions, scenes and words. During emotion recognition of all forms of stimuli studied, the collected imaging data revealed that the right insula and left thalamus were consistently activated in men, but not in women. The findings suggest that men rely on the recall of past emotional experiences to evaluate current emotional experiences, whereas women seem to engage the emotional system more readily. This finding is consistent with the common belief that women are more emotional than men, which suggests possible gender-related neural responses to emotional stimuli. This difference may be relevant to the evaluation of the emotional reaction of a user to a given picture.
Figure 2.1: Universal basic emotions from Grimace 1
There are two different perspectives on emotion representation. The first one (categorical) indicates that basic emotions have evolved through natural selection. Plutchik [71] proposed eight basic emotions: Anger, Fear, Sadness, Disgust, Surprise, Curiosity, Acceptance, and Joy. All other emotions can be formed from these basic ones; for example, disappointment is composed of Surprise and Sadness. Ekman, following a Darwinian tradition, based his work on the relationship between facial expressions and emotions derived from a number of universal basic emotions: Anger, Disgust, Fear, Happiness, Sadness, and Surprise (see Figure 2.1). Later he expanded the basic emotions by adding: Amusement, Contempt, Contentment, Embarrassment, Excitement, Guilt, Pride in achievement, Relief, Satisfaction, sensory Pleasure, and Shame. In the second perspective (dimensional), which is based on cognition, the emotions (also called affective labels [30]) are mapped into the Valence, Arousal and Dominance (VAD) dimensions. Valence goes from very Positive feelings to very Negative ones, Arousal (also called activation) goes from states like sleepy to excited, and finally, Dominance corresponds to the strength of the emotion [2, 20, 49, 54]. The most common model is the two-dimensional one, which uses only Valence and Arousal (see Figure 2.2).
1 http://www.grimace-project.net/
Figure 2.2: Circumplex model of affect, which maps the universal emotions in the Valence-Arousal plane.
In [76], some correlations between basic emotions were described. One of the most important results was that when Happiness rises, all other emotions decline; another was that Fear correlates positively with Sadness and with Anger. These correlations are well-known phenomena in the field of psychology.
Many studies in psychology involve manipulating Valence and/or Arousal via emotional stimuli. This
technique of inducing emotion in human participants is referred to as affective priming. Several methods
have been introduced for priming participants with Positive or Negative affect. Common methods include
images, text (stories), videos, sounds, word-association tasks, and combinations thereof. Such methods
are commonly referred to as Mood-Induction Procedures (MIP). In general, Positive emotions tend to
lead to better cognitive performance, and Negative emotions (with some exceptions) lead to decreased
performance [32].
Affective computing is a rising topic within human-computer interaction that tries to satisfy other user needs besides the need to be as productive as possible. As the user is an affective human being, many needs are related to emotions and interaction [5]. Research has already been done on recognizing emotions from faces and voice. Humans can recognize emotions from these signals with 70-98% accuracy, and computers are already quite successful, especially at classifying facial expressions (80-90%). With the rising interest in Brain-Computer Interfaces (BCI), users' Electroencephalography (EEG) signals have been analyzed as well [5].
In [2], the authors use information about the affective/mental states of users to adapt interfaces or add functionalities. In [32], the authors describe a crowdsourced experiment in which affective priming is used to influence low-level visual judgment performance. They present results suggesting that affective priming significantly influences visual judgments, and that Positive priming increases performance. Additionally, individual personality differences can influence performance with visualizations. In addition to stable personality traits, research in psychology has found that temporary changes in affect (emotion) can also significantly impact performance during cognitive tasks such as memory, attention, learning, judgment, creativity, decision-making, and problem-solving.
The category-based models can be used for tagging purposes, especially with a list of different adjectives for the same mood, which allows the generalization of the subjective perceptions of multiple users and provides a dictionary for search and retrieval applications [19]. However, the dimensional model is preferable in emotion recognition experiments because it can locate discrete emotions in its space, even when no particular label can be used to define a certain feeling [54].
For our work, we will use the six universal basic emotions with the addition of a seventh: Neutral. To map the emotions into a two-dimensional model (since it is better suited to the purpose of our work), an adaptation of the circumplex model of affect introduced in [72] will be used (see Figure 2.2) [83].
2.2 Emotions in Images
In order to extract emotions from an image, we need to understand how its contents affect the way emotions are perceived by users. For example, different Colors give us different feelings: bright Colors help to create a Positive and friendly mood, whereas dark Colors create the opposite. On the other hand, lines such as diagonal ones indicate activity, and horizontal ones express calmness.
In human interaction, if we communicate with someone who appears to be sad, we tend to sympathize with that person and feel sad too. However, if the person is happy we tend to become happier. The same effect is observed when we see sad or happy expressions in pictures, i.e., they also affect our emotional state.
2.2.1 Content-Based Image Retrieval
Content-based Image Retrieval (CBIR) is a well-known technique that uses visual contents of an image
to search images in large databases, according to users’ interests [42]. A wide range of possible applications for CBIR has been identified, such as crime prevention, architectural and engineering design,
fashion and interior design, journalism and advertising, medical diagnosis, geographical information and
remote sensing systems and education. Also, a large number of commercial and academic retrieval
systems have been developed by universities, companies, government organizations and hospitals [78].
The initial CBIR systems can be divided into two categories, according to the type of query: textual or pictorial. In the first case, images are represented by text information such as keywords or tags, which can be very effective if appropriate text descriptions are given to the images in the database. However, since the annotations are made manually, they are subjective and context-sensitive, and can be wrong, incomplete or nonexistent. Also, this method can be expensive and time-consuming [42]. In the second case, an example image is given as the query. In order to obtain similar images, different low-level features such as colors, edges, shapes and textures can be automatically extracted.
Typically, the system is composed of Feature Extraction (FE) and Similarity Measurement (SM). In FE, a set of features, such as those indicated previously, is generated to accurately represent the content of each image in the database. This set of features is called the image signature or feature vector, and is usually stored in a feature database. In SM, the distance between the query image and each image in the database is computed, using the corresponding signatures, so that the closest images are retrieved. The most used distances to calculate similarity are the Minkowski-form distance, Quadratic-form distance, Mahalanobis distance, Kullback-Leibler divergence, Jeffrey divergence and the Euclidean distance [42, 78]. User interfaces in image retrieval systems consist of two parts: the query formulation part and the result presentation part. Recent retrieval systems have incorporated users' relevance feedback to modify the retrieval process in order to generate perceptually and semantically more meaningful retrieval results.
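A minimal sketch of the FE/SM pipeline described above, assuming the signatures have already been extracted and using the Euclidean distance (any of the distances listed could be substituted):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two feature vectors (image signatures)."""
    return float(np.linalg.norm(a - b))

def retrieve(query_signature, feature_db, k=5):
    """Return the ids of the k images whose signatures are closest to the query.

    feature_db is assumed to be a dict mapping image id -> signature vector,
    i.e. the feature database produced by the Feature Extraction step.
    """
    ranked = sorted(feature_db.items(),
                    key=lambda item: euclidean(query_signature, item[1]))
    return [image_id for image_id, _ in ranked[:k]]

# Toy example with random 32-dimensional signatures.
rng = np.random.default_rng(1)
db = {f"img_{i:03d}": rng.random(32) for i in range(100)}
print(retrieve(rng.random(32), db, k=3))
```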
An important thing we need to keep in mind is that human perception of image similarity is subjective, semantic, and task-dependent. Besides that, each type of visual feature usually captures only one aspect of an image property, and it is usually hard for the user to specify clearly how different aspects are combined. Moreover, there is no single “best” feature that gives accurate results in any general setting, which means that, usually, a combination of features is needed to provide adequate retrieval results [78].
Nowadays, researchers are merging fields such as computer vision, machine learning and image processing, which provides an opportunity to find solutions to different issues such as the semantic gap and dimensionality reduction [42]. The semantic gap is the difference between high-level concepts, such as emotions, events, objects or activities conveyed by an image, and the limited descriptive power of low-level visual features.
Having explained the technical and theoretical characteristics of CBIR, it is important to describe the visual features that this technique uses:
Color:
This is the most extensively used visual content for image retrieval, since it is the basic constituent of images. It is relatively robust to background complication and independent of orientation and image size. In some works, like [78], grayscale is also considered a color. Usually, in these systems, the Color histogram is the most used feature representation: it describes the colors present in an image as well as their quantities. It is obtained by quantizing the image colors into discrete levels and then counting the number of times each discrete level occurs in the image. Color histograms are insensitive to small perturbations in camera position and are computationally efficient.
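A minimal sketch of such a quantized histogram, assuming an RGB image and an arbitrary choice of four levels per channel:

```python
import numpy as np

def color_histogram(image, bins_per_channel=4):
    """Quantized RGB color histogram.

    image: H x W x 3 uint8 array. Each channel is quantized into
    `bins_per_channel` discrete levels, the occurrences of each (R, G, B)
    bin combination are counted, and the result is normalized so the
    histogram is independent of image size.
    """
    levels = (image.astype(np.uint32) * bins_per_channel) // 256   # 0..bins-1
    codes = (levels[..., 0] * bins_per_channel + levels[..., 1]) * bins_per_channel + levels[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()

# Toy example: a random 64x64 RGB image gives a 64-bin (4x4x4) histogram.
rng = np.random.default_rng(2)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(color_histogram(img).shape)   # (64,)
```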
When a database contains a large number of images, histogram comparison saturates the discrimination. To solve this problem, the joint histogram technique [27] was introduced, in which additional information is incorporated without affecting the robustness of Color histograms.
It is possible for two different images to have the same Color histogram, because a single Color histogram extracted from an image lacks spatial information about the colors in the image. In [78], the authors expose different possible solutions. In the first one, a CBIR system takes into account the spatial information of the Colors by using multiple histograms. In the second one, the spatial features area (zero-order) and position (first-order) moments are used for retrieval. Finally, in the last one, a different way of incorporating spatial information into the Color histogram, Color Coherence Vectors (CCV), was proposed. Using the Color correlogram, it is possible to characterize not only the color distributions of pixels, but also the spatial correlation of pairs of colors.
The Modified Human Colour Perception Histogram (MHCPH) [77] is based on the human visual perception of Color. The gray and color weights are distributed smoothly to neighboring bins with respect to the pixel information. The amount of weight distributed to the neighboring bins is estimated using the NBS distance, which makes it possible to extract the background color information effectively along with the foreground information.
Shape:
It corresponds to an important criterion for matching objects based on their physical structure and profile. Shape is a well-defined concept and there is considerable evidence that natural objects are primarily recognized by their shape [78]. These features can represent the spatial information that is not captured by Texture or Color, and contain all the geometrical information of an object in the image. This information does not change even if the location or orientation of the object changes.
The simplest Shape features are the perimeter, area, eccentricity and symmetry [42], but, usually, two main types of Shape features are used: global features and local features. Aspect ratio, circularity and moment invariants are examples of global features, while sets of consecutive boundary segments correspond to local features.
Shape representations can be divided into two classes: boundary-based and region-based. In
the first case, only the outer boundary of the shape is used. The most common approaches are
the rectilinear shapes, polygonal approximation, finite element models, and Fourier-based shape
descriptors. In the second one, the entire Shape region is used to compute statistical moments.
A good Shape representation feature for an object should be invariant to translation, rotation and
scaling [78].
Texture:
It is defined as all that is left after Color and local Shape have been considered. It is used to look for visual patterns in images, with properties of homogeneity that do not result from the presence of a single color, and for how they are spatially defined. It also contains information about the structural arrangement of surfaces and their relationship to the surrounding environment. Texture similarity can be used to distinguish between areas of images with similar Color, such as sky and sea. Texture representation can be classified into three categories: statistical, structural and spectral.
In the statistical approach, the Texture is characterized using the statistical properties of the gray
levels in the pixels of the image. Usually, there is a periodic occurrence of certain gray levels.
Some of the methods used are co-occurrence matrix, Tamura features, Shift-invariant Principal
Component Analysis (SPCA), Wold decomposition and multi-resolution filtering techniques such
as Gabor and Wavelet transform. Both the Tamura features and Wold decomposition are designed
according to physiological studies on the human perception of Texture and described in terms of
perceptual properties.
The structural methods describe the Texture as a composition of texels (texture elements) that are
arranged regularly on a surface according to some specific arrangement rules. Some methods,
such as morphological operator and adjacency graphs, describe the Texture by the identification
of structural primitives and their corresponding placement rules. If they are applied to regular
textures, they tend to be very effective [78].
In the spectral method, the Texture description is done by applying a Fourier transform to the image and then grouping the transformed data so that it yields a set of measurements.
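To illustrate the statistical approach, the sketch below builds a gray-level co-occurrence matrix by hand and derives two common statistics (contrast and energy) from it; the quantization level and the pixel offset are arbitrary choices for the example:

```python
import numpy as np

def cooccurrence_matrix(gray, levels=8, offset=(0, 1)):
    """Count how often gray level i occurs next to gray level j.

    gray: 2-D uint8 image; it is quantized to `levels` gray levels and pairs
    of pixels separated by `offset` (here: horizontal neighbours) are counted.
    """
    q = (gray.astype(np.uint32) * levels) // 256
    dy, dx = offset
    a = q[: q.shape[0] - dy, : q.shape[1] - dx]
    b = q[dy:, dx:]
    matrix = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(matrix, (a.ravel(), b.ravel()), 1)
    return matrix / matrix.sum()

def texture_stats(gray):
    """Contrast and energy, two common statistics of the co-occurrence matrix."""
    p = cooccurrence_matrix(gray)
    i, j = np.indices(p.shape)
    contrast = float(np.sum(((i - j) ** 2) * p))
    energy = float(np.sum(p ** 2))
    return {"contrast": contrast, "energy": energy}

rng = np.random.default_rng(3)
print(texture_stats(rng.integers(0, 256, size=(64, 64), dtype=np.uint8)))
```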
These descriptions allowed us to understand how Color, Shape and Texture are characterized, as well as how they usually appear in an image. They also allowed us to identify the visual descriptors and the different approaches that are most commonly used to capture each of the visual features used in CBIR systems.
2.2.2 Facial-Expression Recognition
The human face is one of the major “objects” in our daily lives: it provides information about the gender, attractiveness and age of a person, but also helps to identify the emotion that person is feeling, which has an important role in human communication.
Underneath our skin, a large number of facial muscles allow us to produce different configurations. These muscles can be summarized as Action Units (AUs) [21] and are used to define the facial expressions of an emotion. Facial expressions are typically classified as Joy, Surprise, Anger, Sadness, Disgust and Fear [15, 21].
Recent research in cognitive science and neuroscience has shown that humans mostly use Shape for the perception and recognition of facial expressions of emotion. Furthermore, humans are very good at recognizing only a few facial expressions of emotion. The most well recognized emotions are Happiness and Surprise, and the worst are Fear and Disgust. Learning why our visual system easily recognizes some expressions and not others should help define the form and dimensions of a computational model of facial expressions of emotion [63].
To describe how humans perceive and classify facial expressions of an emotion, there are two types of models: continuous and categorical. In the first one, each facial expression of an emotion is defined as a feature vector in a face space, given by some characteristics that are common to all the emotions. In the second one, there are C classifiers, each one associated with a specific emotion category. The continuous model explains how expressions of emotion can be seen at different intensities, whereas the categorical one explains, among other findings, why the images in a morphing sequence between two emotions, like happiness and surprise, are perceived as either happy or surprised, but not something in between. Also, several psychophysical experiments suggest that the perception of emotions by humans is categorical [22].
Models of the perception and classification of the six facial expressions of emotion have been developed, in which sample feature vectors or regions of the feature space are used to represent each of the emotion labels. However, only one emotion can be detected from a single image, despite the fact that humans can perceive more than one emotion in a single image, even if they have no prior experience with it.
Initially, researchers created several feature- and shape-based algorithms for the recognition of objects and faces [40, 55, 61], in which geometric and Shape features and edges were extracted from an image and used to create a model of the face. Then, this model was fitted to the image and, in the case of a good fit, used to determine the class and position of the face.
In [63], an independent computational (face) space for a small number of emotion labels was presented. In this approach, only faces of those few facial expressions of emotion need to be sampled. This approach corresponds to a categorical model; however, the authors define each of these face spaces as a continuous feature space. Essentially, the observed intensity in this continuous representation is used to define the weight of the contribution of each basic category toward the final classification, allowing the representation and recognition of a very large number of emotion categories without the need to have a categorical space for each one, or having to use many samples of each expression as in the continuous model. With this approach, a new model was introduced; it consists of C distinct continuous spaces, in which multiple emotion categories can be recognized by linearly combining these C face spaces. The most important aspect of this model is that it is possible to define new categories as linear combinations of a small set of categories. The proposed model thus bridges the gap between the categorical and continuous ones and resolves most of the debate facing each of the models.
The authors explained that the face spaces should include configural and shape features, because the configural features can be obtained from an appropriate representation of shape; however, expressions such as Fear and Disgust seem to be mostly based on Shape features, making the recognition process less accurate and more susceptible to image manipulation. Each of the six categories of emotion used is represented in a shape space given by classical statistical shape analysis. The face and the shape of the major facial components, i.e., the brows, eyes, nose, mouth and jaw line, are automatically detected. Then, the shape is sampled with d equally spaced landmark points and the mean of all the points is computed.
To provide invariance to translation and scale, the 2d-dimensional shape feature vector is given by
the x and y coordinates of the d shape landmarks subtracted by the mean and divided by its norm. In the
case of the 3D rotation invariance, it can be achieved with the inclusion of a kernel. The authors used
the algorithm defined by [29] to obtain the dimensions of each emotion category, because it minimizes
the Bayes classification error.
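To make this representation concrete, the following sketch (a minimal illustration in Python, not the authors' implementation) builds the 2d-dimensional shape feature vector with the translation and scale normalization described above, assuming the d landmark points are available as (x, y) pairs.

import numpy as np

def normalize_shape(landmarks):
    # landmarks: (d, 2) array with the x and y coordinates of the d
    # equally spaced landmark points sampled from the detected shape.
    pts = np.asarray(landmarks, dtype=float)
    centered = pts - pts.mean(axis=0)   # subtract the mean: translation invariance
    vec = centered.flatten()            # 2d-dimensional shape feature vector
    return vec / np.linalg.norm(vec)    # divide by the norm: scale invariance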
Since two categories can be connected by a more general one, the authors use the already defined
shape space to find the two most discriminant dimensions separating each of the six categories previously listed. Then, in order to test the model, they trained a linear Support Vector Machine (SVM) and
achieved the following results: Happiness is correctly classified 99% of the times, Surprise and Disgust
95%, Sadness 90%, Fear 92% and Anger 94%. They also mentioned that adding new dimensions in
the feature space and using nonlinear classifiers makes it possible to achieve perfect classification.
In the last two decades, a new approach was studied: appearance-based, in which the faces are
represented by their pixel-intensity maps or the response to some filters (e.g. Gabors). The main
advantage of the appearance-based model is that there is no need to predefine a feature/shape model as in the previous approaches, since the face model is given by the training images. Also, it provides
good results for near-frontal images of good quality, but it is sensitive to image manipulation like scale,
illumination changes or poses.
In [16], two methods are presented, one for static pictures and the other for video, for automatic facial
expression recognition using the shape information of the face, extracted using Active Appearance Models (AAM), a computer vision algorithm for matching a statistical model of object shape
and appearance to a new image. The main difference between these methods is the type of selected
features. The system uses a face detection algorithm based on [68], a Facial Characteristic Point (FCP)
extraction method based on AAM, and the classification of the emotions is made using SVM. The
dataset used for training the facial expression recognizer was the Cohn-Kanade database containing a
set of video sequences of different subjects on multiple scenarios. Each of these sequences contains a
subject expressing an emotion from the Neutral state to the apex of that emotion, and only the first and
last frames are used. The AAM was built using 19 shape models, 24 texture models and 22 appearance
models, resulting in a shape vector of 58 face shape points. The model handles a certain degree of
scaling, translation, rotation and asymmetry (using parameters for both sides of the face). The effect
of illumination changes is minimized by scaling the texture data of the face samples during the training
of the AAM. To increase the performance of the model fitting, the authors decided to use samples with
occluded faces as well. A SVM classifier and 2-fold Cross Validation were used to present the results,
and in the case of the video sequences the results were better than for static images in all emotions.
The approximate results are: Fear 85% for image and 88% for video, Surprise 84%-89%, Sadness
83%-86%, Anger 76%-86%, Disgust 80%-82% and Happiness 73%-80%.
In [62], a new approach for facial expression classification, also based on AAM, is presented. In
order to be able to work in real-time, the authors used AAM on edge images instead of gray ones, a
two-stage hierarchical AAM tracker and a very efficient implementation. With the use of edge images,
it is possible to overcome one of the problems in AAM: different illumination conditions. In this new
approach, a 2-dimensional shape model S with 58 points, placed in regions of the face which usually have a lot of texture information, was used together with an appearance model that transforms the input image into a linear space of Eigenfaces. The combination of these two models leads to a model instance, with appearance parameters and shape parameters p. The developed system is composed of four subsystems: Face Detection, Coarse AAM, Detailed AAM and Facial Expression Classifier. The first one
identifies faces in real-time using [86] face detector. The position and size of the detected faces are used
to initialize the Coarse AAM and new shape components are added to describe: the scaling of the shape,
an approximation of the in-plane rotation and the translation on the x-axis and y-axis. This step allows
to do a coarse estimation of the input image. The Detailed AAM is initialized after the error associated
with the previous step drops below a given threshold, and is used to estimate the details of the face that
are necessary for mimic recognition. Finally, for the classification of the facial expression, an AAM-classifier set, a Multi-Layer Perceptron (MLP) based classifier and an SVM based classifier were used. The emotions used were the six typical facial expressions and a new one: Neutral. The FEEDTUM mimic database was used, which consists of 18 different persons (9 males and 9 females), each showing the six basic emotions and Neutral in a short video sequence. Using the SVM classifier, with 20 appearance parameters and p = 10 shape parameters, the average detection rate was 92%.
2.2.3 Relationship between features and emotional content of an image
Color is the result of interpretation in the brain of the perception of light in the human eye [18]. It is also
the basic constituent and the first discriminated characteristic of images for the extraction of emotions. In recent years, many works in psychology have hypothesized about the relationship
between Colors and emotions [25, 87]. This research has shown that Color is a good predictor for emotions in terms of saturation, brightness, and warmth [38], and that the relationship between Colors and
human emotions has a strong influence on how we perceive our environment. The same happens for our
perception of images, i.e, all of us are in some way emotionally affected when looking at a photograph
or an image [81].
In photography and color psychology, color tones and saturation play important roles. Saturation
indicates chromatic purity, i.e., corresponds to the intensity of a pixel color. The purer the primary colors,
red (sunset, flowers), green (trees, grass), and blue (sky), the more striking the scenery is to viewers [39].
Brightness corresponds to a subjective perception of the luminance in the pixel's color [18]. Too much exposure leads to brighter shots, which often yield lower-quality pictures, while shots that are too dark are usually not appealing. However, an over-exposed or under-exposed photograph may, under certain scenarios, yield very original and beautiful shots [17]. Also, in
photographs, the pure colors tend to be more appealing than dull or impure ones [17]. Regarding color
temperature, warm colors tend to be associated with excitement and danger, while images dominated
by cool colors tend to create cool, calming, and gloomy moods [?, 65]. Images of happiness tend to
be brighter, more saturated and have more colors than images of Sadness [18].
Concerning the relationship between colors and emotions, usually red is considered to be vibrant
and exciting and is assumed to communicate happiness, dynamism, and power. Yellow is the most
clear, cheerful, radiant and youthful Color. Orange is the most dynamic Color and resembles glory. The
blue color is deep and may suggest gentleness, fairness, faithfulness, and virtue. Green should elicit
calmness and relaxation. Purple sometimes communicates Fear, while brown is associated with relaxing
scenes. A sense of quietness and calmness can be conveyed by the use of complementary colors, while
a sense of uneasiness can be evoked by the absence of contrasting hues and the presence of a single
dominant color region. This effect may also be amplified by the presence of dark yellow and purple
colors [25, 30].
Basic emotions seem to be fundamentally universal, and their external manifestation seems to be
independent of culture and personal experience. In what regards the brightness, there are distinct
groups of emotions: Happiness, Fear and Surprise combined with very light colors, Disgust and Sadness
with colors of intermediate lightness, and Anger with rather dark colors (usually black and red). The
colors relative to Sadness and Fear are very desaturated, while Happiness, Surprise and Anger are
associated with highly chromatic colors [13]. In the wheel of Emotions (See Figure 2.3), proposed by
Plutchik [71], it is possible to identify the different emotions and their corresponding colors. In the case of
the basic emotions, we have the following associations: Anger corresponds to red, Disgust to the purple,
Fear to the dark green, Sadness to the dark blue, Surprise to the light blue and, finally, Happiness to the
yellow.
Figure 2.3: Wheel of Emotions
Since perception of emotion in color is influenced by biological, individual and cultural factors [18],
mapping low-level color features to emotions is a complex task in which theories about the use of colors, cognitive models, and cultural and anthropological backgrounds must be considered [?]. Given
that colors can be used in different ways, we need effective methods to measure their occurrence in
an image. Color Moments [17, 18, 60, 78] are measures that characterize color distribution in an image. Different histograms such as Color Histogram [74, 78], Fuzzy Histogram (for Dominant Colors) [4],
Wang Histogram [?] and Emotion-Histogram [81, 87] give the representation of the colors in an image.
Color Correlogram [78] allows combining the advantages of histograms with spatial and color information. Color Layout Descriptor [67] also captures the spatial distribution of color in an image. Number
of Colors [18] can be used to differentiate Positive from Negative images, since the first ones usually
have more colors. Scalable Color Descriptor [19, 67] allows analyzing the brightness/darkness, saturation/pastel/pallid and the color tone/hue. Itten Contrasts [60] captures information about the contrasts of
brightness, saturation, hue and complements.
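As a simplified illustration of how such color measures can be computed (not the exact descriptors cited above), the sketch below extracts the first three Color Moments per RGB channel.

import numpy as np

def color_moments(image_rgb):
    # First three Color Moments (mean, standard deviation, skewness) per
    # channel, summarizing the color distribution of the image.
    pixels = np.asarray(image_rgb, dtype=float).reshape(-1, 3)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    skew = ((pixels - mean) ** 3).mean(axis=0) / np.maximum(std ** 3, 1e-8)
    return np.concatenate([mean, std, skew])   # 9-dimensional descriptor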
Harmonious composition is essential in a work of art and useful to analyze an image’s character [?].
In terms of Composition, images with a simplistic composition and a well-focused center of interest are
sometimes more pleasing than images with many different objects [17, 39]. Nature scenes, such as
forests or waterscapes, are strongly preferred when compared to urban scenes for population groups
from different areas of the world [39].
In terms of Composition, there are common and not-so-common rules. The most popular and widely
known is the Rule of Thirds, that can be considered as a sloppy approximation to the ‘golden ratio’
(about 0.618) [17, 39]. It states that the most important part of an image is not the center of the image
but instead at the one third and two third lines (both horizontal and vertical), and their four intersections.
Therefore, viewers' eyes concentrate more naturally on these areas than on either the center or the borders
of the image, meaning that it is often beneficial to place objects of interest in these areas. This implies
that a large part of the main object often lies on the periphery or inside of the inner rectangle [17].
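A minimal sketch of the Rule of Thirds, computing the four intersections of the one-third lines ("power points") for a given image size; placing the subject near one of them tends to strengthen the composition. This is illustrative only, not part of the works cited above.

def rule_of_thirds_points(width, height):
    # The four intersections of the one-third lines (horizontal and vertical).
    xs = (width / 3.0, 2.0 * width / 3.0)
    ys = (height / 3.0, 2.0 * height / 3.0)
    return [(x, y) for x in xs for y in ys]

# rule_of_thirds_points(1920, 1080)
# -> [(640.0, 360.0), (640.0, 720.0), (1280.0, 360.0), (1280.0, 720.0)]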
The size of an image has a good chance of affecting the photo aesthetics. Although most images are
scaled, their initial size must be agreeable to the content of the photograph. In the case of the aspect
ratio of an image, it is well-known that some aspect ratios such as 4:3 and 16:9 (which approximate the
‘golden ratio’) are chosen as standards for television screens or movies, for reasons related to viewing
pleasure [17]. A less common rule in nature photography is to use diagonal lines (such as a railway, a
line of trees, a river, or a trail) or converging lines for the main objects of interest to draw the attention of
the human eyes [39].
Professional photographers often reduce the Depth of Field (DOF) for shooting single objects by
using larger aperture settings, macro lenses, or telephoto lenses. On the photo, areas in the DOF are
noticeably sharper [17]. Another Composition rule is to frame the photo so that there are interesting
objects in both the close-up foreground and the far-away background. According to Gestalt psychology,
that produced influential ideas such as the concept of goodness of configuration, we do not see isolated
visual elements but instead patterns and configurations, which are formed according to the processes of
perceptual organization in the nervous system. This is due to the “law of Prägnanz”, which enhances
properties such as closure, regularity, simplicity or symmetry, leading us to prefer the “good” structures
[39].
Shape is a fairly well-defined concept, and there is considerable evidence that natural objects are
primarily recognized by their shape [78]. Growing evidence indicates that the underlying geometry of a
visual image is an effective mechanism for conveying the affective meaning of a scene or object, even
for very simple context-free geometric shapes. Objects containing non-representational images of sharp
angles are less well liked. Abstract angular geometric patterns tend to be perceived as threatening, and
circles and curvilinear forms are usually perceived as pleasant [51].
According to the fields of visual arts and psychology, shapes and their characteristics, such as
angularity, complexity, roundness and simplicity, have been suggested to affect the emotional responses
of human beings. Complexity and roundness of shapes appear to be fundamental to the understanding
of emotions. In the case of complexity, humans visually prefer simplicity. Although the perception of
simplicity is partially subjective to individual experiences, it can also be highly affected by parsimony
and orderliness. Parsimony refers to the minimalistic structures that are used in a given representation,
whereas orderliness refers to the simplest way of organizing these structures. For the case of roundness,
it indicates that geometric properties convey emotions like Anger or Happiness [56, 87].
Usually, perceptual Shape features are extracted through angles, line segments, continuous lines
and curves. The number of angles, as well as the number of different angles, can be used to describe
complexity. Line segments refer to short straight lines used to capture the structure of an image. Continuous lines are generated by connecting intersecting line segments having the same orientations with
a small margin of error. Line segments and Continuous lines are used to describe and interpret complexity of an image. Curves are a subset of continuous lines that are used to measure the roundness of
an image [56].
Regarding the lines, their directions can express different feelings. Strong vertical elements usually
indicate high tensional states while horizontal ones are much more peaceful. Oblique lines could be
associated with dynamism [12, 25, 87]. Lines with many different directions present chaos, confusion or
action. The longer, thicker and more dominant the line, the stronger the induced psychological effect [?].
In the field of computer vision, Texture is defined as all that is left after Color and local Shape have
been considered or it is defined by such terms as structure and randomness. Textures are also important for emotional analysis of an image [25, 60], and their use can change the way other features are
perceived; for example, in the case of the emotion unpleasantness, the addition of texture changes the
perception of the image’s colors [57].
From an aesthetics point of view, specific patterns such as flowers make people feel warm, while the
abstract patterns make people feel cool. Thin and sparse patterns such as dots and small flowers make
people feel soft. In contrast, the thick and dense patterns such as plaid make people feel hard [88]. In
some situations, a great deal of detail gives a sense of reality to a scene, and less detail implies more
smoothing moods [12].
Artists and professional photographers, in specific situations and in order to achieve a desired expression, create pictures which are sharp, or where the main object is sharp with a blurred background.
Purposefully blurred images were frequently present in the category of art photography images which
expressed Fear [60]. Graininess or smoothness in a photograph can be interpreted in different ways.
If as a whole it is smooth, the picture can be out-of-focus, in which case it is in general not pleasing to
the eye. If as a whole it is grainy, one possibility is that the picture was taken with a grainy film or under
high ISO settings. Graininess can also indicate the presence/absence and nature of Texture within the
image [17].
The following Texture features, Tamura [60, 74, 78], Gabor Transform [25, 39, 78, 87, 88], Wavelet-based [60] and Gray-Level Co-occurrence Matrix (GLCM) [60], are intended to capture the granularity and repetitive patterns of surfaces in an image. With these features we can measure the roughness or crinkliness, the coarseness (which characterizes the grain size of an image), the contrast, directionality, line-likeness and regularity of a surface [74].
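As a simplified illustration of this family of features, the sketch below builds a small Gray-Level Co-occurrence Matrix by hand and derives its contrast statistic; real systems typically use richer offsets, more gray levels and additional GLCM statistics.

import numpy as np

def glcm_contrast(gray, levels=8, dx=1, dy=0):
    # Quantize a grayscale image to a few levels, accumulate co-occurrences
    # for a (dx, dy) offset and return the contrast statistic of the GLCM.
    # Assumes a non-empty image with a positive maximum intensity.
    img = np.asarray(gray, dtype=float)
    q = np.clip((img / img.max() * levels).astype(int), 0, levels - 1)
    glcm = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            glcm[q[y, x], q[y + dy, x + dx]] += 1
    glcm /= glcm.sum()
    i, j = np.indices(glcm.shape)
    return float(np.sum(glcm * (i - j) ** 2))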
2.3 Applications
All of us are in some way emotionally affected when looking at an image, which means we often relate
some of our emotional response to the context, or to particular objects in the scene. Usually CBIR
systems or Recommendation Systems (RecSys) do not take into account the emotions that the images convey. However, new efforts have recently been made to address this, which are explained below.
2.3.1 Emotion-Based Image Retrieval
The low-level information used in CBIR systems does not sufficiently capture the semantic information
that the user has in mind [89].
In marketing and advertising research, attention has been given to the way in which media content
can trigger the particular emotions and impulse buying behavior of the viewer/listener since emotions
are quite important in brand perception and purchase intents [76]. Nowadays, many posters and movie
previews use emotions that are specifically designed to attract potential customers. Emotion-based Image Retrieval (EBIR) can be used to identify tense, relaxed, sad, or joyful parts of a movie, or to
characterize the prevailing emotion of a movie, which could be a great enhancement to personalizing the
recommendation processes in future Video-On-Demand systems (VOD) or Personal Video Recorders
(PVR) [30].
As an analogy to the semantic gap in CBIR systems, extracting the affective content information
from audiovisual signals requires bridging the affective gap. Affective gap can be defined as the lack
of coincidence between the measurable signal properties, commonly referred to as features, and the
expected affective state in which the user is brought by perceiving the signal [30].
In [89], the authors present the first studies that were made in this new area. One of them, based on the Color theory of Itten, mapped expressive and perceptual features into emotions. Their method segmented the image into homogeneous regions, extracted features such as color, hue, luminance, saturation, position, size and warmth from each region, and used its contrasting and harmonious relationships with other regions to capture emotions. However, this method was only designed for art painting retrieval. In another study, the authors designed a psychology space that captures human emotion and mapped it onto physical features extracted from images. A similar approach, based on wavelet coefficients, retrieved emotionally gloomy images through a feedback mechanism called Interactive Genetic Algorithm (IGA), but this method has the limitation of only differentiating two categories: gloomy or not. Finally, in the last one, the authors proposed an emotional model to define a relationship between physical values of color image patterns and emotions, using color, gray and texture information from an image as input to the model. The model then returned the degree of strength with respect to each emotion. It, however, has a problem with generalization, due to the narrow scope of experiments on only five images, and could not be applied to image retrieval directly.
In [87], the authors explored the strong relationship between colors and human emotions and proposed
an emotional semantic query model based on image color semantic description. Image semantics has
several levels: abstract semantics that contributes to the interpretation of the senses, semantic templates
(categories) related to the accumulation of semantic knowledge, semantic indicators corresponding to
image elements that are characteristic for certain semantic categories, and finally, the low-level image
features. The proposed model contains three stages. In the first one, the images were segmented using
color clustering in L*a*b* space, because the definitions and measurements of this color space are suited
for vision perception psychology. In the second one, semantic terms using fuzzy clustering algorithm
were generated, and used to describe both the image region and the whole image. After that, in the last
one, an image query scheme through image color semantic description, that allows the user to query
images using emotional semantic words, was presented. This system is general and able to satisfy
queries for which it had not been explicitly designed. Also, the presented results demonstrate that the
features successfully capture the semantics of the basic emotions.
In [89], a new EBIR method was proposed, using query emotional descriptors called query color
code and query gray code. These descriptors were designed on the basis of human evaluation of
13 emotion pairs (like-dislike, beautiful-ugly, natural-unnatural, dynamic-static, warm-cold, gay-sober,
cheerful-dismal, unstable-stable, light-dark, strong-weak, gaudy-plain, hard-soft, heavy-light) when 30
random patterns with different color, intensity, and dot sizes are presented. For the emotion image
retrieval, when a user performs a query emotion, the associated query color code and query gray code
are obtained, and codes that capture color, intensity, and dot size are extracted from each database
image. After that, a matching process between the two color codes and between the two gray codes is
performed to retrieve images with a sensation of the query emotion. The major limitation of this method
was the use of the emotion pairs, since they do not cover all emotions that a human can feel, and it is
difficult to map them onto the six basic emotions frequently used.
In 2008, the authors of [12] said: “On the contrary, there are very few papers on automatic photo
emotion detection if any.”, and proposed an emotion-based music player that combines the emotions
evoked by auditory stimulus with visual content (photos). The emotion detection from photos was made
using their own database, with manually annotated emotions, and a Bayesian classifier. To combine the music and photos, besides the high-level emotions, low-level features such as harmony and temporal visual coherence were used. The combination is formulated as an optimization problem, solved by a greedy algorithm. The
photos for the database used were chosen based on two criteria: images related to daily life without
specific semantic meaning and photos without human faces, because they usually dominate the moods
of photos. The emotion taxonomy used was based on Hevner's work, and consists of eight emotions:
sublime, sad, touching, easy, light, happy, exciting and grand. Since each photo was labeled by many
users and they could perceive different emotions from it, the aggregated annotation of an image is considered as a distribution vector over the eight emotion classes. A set of visual features that effectively reflect
emotions was used: color, textureness and line. Using a Bayesian framework, the obtained accuracy
was 43%, but the misclassified photos are often classified as nearby emotions.
The first retrieval system that indexes and searches images using human emotion was presented
in [43] 2 . In this system, the 10 Kobayashi emotional keywords were used for image tagging: romantic,
clear, natural, casual, elegant, chic, dynamic, classic, dandy and modern. In the case of the images
collected from the web, an indexer extracts physical features such as color, texture and pattern, and
transposes them into human emotions. The system allows the users to search through a query interface
based on emotional keywords and example images. The authors used 389 textile images from different
domains such as interior (images such as curtain, carpet and wallpaper), fashion (images of clothes)
and artificial (product designs).
In [19] the authors, in order to extract emotions (aggressive, euphoric, calm and melancholic) from
images, developed three new features: Color Histogram, Haar Wavelet and Color Temperature Histogram. The Color Histogram is calculated in the Hue, Saturation and Value (HSV) Color space with an
individual quantization of each channel (similar to the MPEG-7 Scalable Color Descriptor), and it covers
the properties of brightness/darkness, saturation/pastel/pallid and the color tone/hue. The Haar Wavelet
describes the mean and variance of the energy of each band, and allows to describe the structure or
horizontal/vertical frequencies. The Color Temperature Histogram is based on a first k-means clustering
of all image pixels in the LUV 3 Color space, and it describes the warm/cool impact of images. Using these features, the authors achieved the following recognition rates: 44% using Gaussian Mixture Models (GMM) and 53.5% using an SVM. The authors also stated that these results seem to be worse when
compared with other approaches, which can be explained by the heterogeneity of their reference set.
However, the heterogeneity is needed to cover the different interpretations of mood of various subjects.
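For illustration, a sketch of a color histogram computed in HSV space with an individual quantization per channel, in the spirit of the descriptor described above; the bin counts and quantization used in [19] may differ, so this is only an assumption-laden approximation.

import colorsys
import numpy as np

def hsv_histogram(image_rgb, bins=(8, 4, 4)):
    # Normalized 3-D histogram in HSV space with an individual number of
    # bins per channel; the bin counts here are illustrative placeholders.
    pixels = np.asarray(image_rgb, dtype=float).reshape(-1, 3) / 255.0
    hist = np.zeros(bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        idx = tuple(min(int(c * n), n - 1) for c, n in zip((h, s, v), bins))
        hist[idx] += 1
    return (hist / hist.sum()).flatten()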
In [67], the authors proposed a new EBIR system that uses an Artificial Neural Network (ANN) for
labeling images with emotional keywords based on visual features only. An advantage of such an approach is the ease of adjusting it to any kind of pictures and emotional preferences. The system consists of a
database of images, neural network, searching engine and interface to communicate with a user. For
all the images in the database, the authors extracted the following visual feature descriptors: Edge
Histogram, Scalable Color Descriptor, Color Layout Descriptor, Color and Edge Directivity Descriptor
(CEDD) and Fuzzy Color and Texture Histogram (FCTH). They used a supervised trained neural network
for the recognition of the emotional content of images. The experiments showed that the average retrieval rate depends on many factors: the database, the query image, the number of similar images in the database
and the training set of the neural network. The authors also suggest some improvements to increase
accuracy of the results: a module for face detection and face expression analysis, and one to analyze
existing textual descriptions of images and other meta-data.
In [60], the authors investigate and develop new methods, based on theoretical and empirical concepts from psychology and art theory, to extract and combine low-level features that represent the emotional content of an image, and use them for image emotion classification. The features represent color,
texture, composition and content (faces and skin). For Color, they implement the following features:
brightness and saturation statistics, Colorfulness, Color names, hue statistics, Itten contrasts and Wang
Wei-ning specialized histogram. In the case of the Texture, they implement the Tamura, wavelet Textures
and GLCM features. Finally, for the Composition features, they used the level of detail of the image, the
DOF, the rule of thirds, and the dynamics given by the lines. The authors performed several experiments
and compared the results with similar works. They also stated that their feature sets outperform the state of the art, specifically using the International Affective Picture System (IAPS) for five
of the eight categories used, which means that the best feature set is dependent on both the category
and dataset.
2 http://conceptir.konkuk.ac.kr
3 http://en.wikipedia.org/wiki/CIELUV
2.3.2 Recommendation Systems
Recommendation systems are used to help users find a small but relevant subset of multimedia items
based on their preferences. The most common implementations of recommendation systems are the
TiVo 4 system and the Netflix 5 system [83].
These systems can be divided into two types: the Collaborative-Filtering (CF) and the Content-based
Recommender (CBR). The first one is based on collecting and analyzing a large amount of information on users' behaviors, activities or preferences, and predicting what they will like based on
their similarity to other users. In the second one, the items are annotated with metadata and the system
estimates the relevance level of an observed item based on the inclination of the user toward the item’s
metadata values.
Traditionally, the recommendation systems relied on data-centric descriptors for content and user
modeling. However, recently, there has been an increasing number of attempts to use emotions in
different ways to improve the quality of recommendation systems [83, 84].
In [84] a new metadata field containing emotional parameters was used to increase the precision
rate of the CBR systems: Affective Metadata (AM). The main assumption here is that the emotional
parameters contain information that accounts for more variance than the Generic Metadata (GM) typically
used. Furthermore, the users differ in the target emotive state while they are seeking and choosing
multimedia content to view. These assumptions lead to a hypothesis: these individual differences can
be exploited to achieve better recommendations.
The authors propose a novel affective modeling approach using the first two statistical moments
of the users' emotive responses in VAD space. They performed a user-interaction session, and then
compared the performance of the recommendations systems with both the AM and GM. The results
achieved showed that the usage of the proposed affective features in a CBR system for images brings a
significant improvement over generic features, and also indicated that the SVM algorithm is the best candidate for the calculation of the items' rating estimates. These results indicate that the formulated hypothesis
is true.
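A minimal sketch of this affective modeling idea, computing the first two statistical moments of the users' VAD responses for one item; the exact feature layout used in [84] may differ.

import numpy as np

def affective_metadata(vad_responses):
    # vad_responses: (n_users, 3) array with one Valence-Arousal-Dominance
    # triple per user for the same item; returns the item's affective profile
    # as the per-dimension mean and standard deviation (6 values).
    vad = np.asarray(vad_responses, dtype=float)
    return np.concatenate([vad.mean(axis=0), vad.std(axis=0)])

# Example with three hypothetical user reactions to one image:
# affective_metadata([[6.5, 4.0, 5.0], [7.0, 3.5, 5.5], [6.0, 4.5, 5.2]])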
One of the most well known problems of these systems is usually referred to as the matrix-sparsity
problem. In theory, with the increase of the number of ratings per user, the model would be trained on
a larger training set, which allows achieving better accuracy for the recommended items. However, since
the number of ratings per user is relatively low, the user models are not as good as they could be if the
users had rated more items. However, if we replace the need for explicit feedback from the user with an
implicit one, such as recording the emotional reaction of the user to a given item, and then use it as a
way of rating that item, we can try to reduce this issue. This idea allows us to compute the proposed AM
on the fly, as new information arrives. The inclusion of these methods would lead us to a standalone
recommender system that can be used in real applications.
In [83], a unifying framework using emotions in three different stages: entry, consumption and exit,
of the model is presented, since it is important that the recommendation system application detects the
emotions and makes good use of that information (as already explained earlier).
When the user starts to use the system, he is in a given affective state: the entry stage, which is
caused by some previous activities that are unknown to the system. When the recommendation system
suggests a collection of items, the user's mood influences the choice that he will make, because the decision
making process of the user (as explained in Section 2.1) is strongly influenced by his emotive state. For
example, if a user is happy or sad, he might want to consume a different type of content according to the
way he is feeling. In order to adapt the list of recommended items to the user's entry mood, the system
4 http://www.tivo.com/
5 https://signup.netflix.com/MediaCenter/HowNetflixWorks
must be able to detect the mood and to use it in the content filtering algorithm as contextual information.
In the consumption stage, the user receives affective responses induced by the content that he is
viewing. These responses can be single values (for example when watching an image) or a vector of
emotions that change over the time (for example when watching a movie). In [84], these emotional
responses were used for generating implicit affective tags for the content.
The exit stage is when the user finishes the content consumption. In this stage, the exit mood will
influence the user's next actions, which will be taken into account in the entry mood if the user continues
to use the recommendation system.
2.4 Datasets
Several possibilities have been explored so far in order to induce emotional reactions, relying on different
contexts and various degrees of participant involvement. The most used method of emotion induction
is through the presentation of emotionally salient material like pictures, audio or video, without explicitly
asking for a personal contribution from the participant. If stimuli are relevant enough, an appraisal is
automatically executed and will trigger reactions in other measurable components of emotion such as
physiological responses, expressivity, action tendencies, and subjective feeling. Although this kind of
induction can target different perceptual modalities, the use of the visual channel remains the most
common to convey emotional stimulation.
In the different areas of research based on visual stimulation, such as EBIR systems or psychological
studies, reliable databases are important for the success of emotion induction. Regarding this, in 1997,
the IAPS [50] database was introduced. However, the extensive use of the same stimuli lowers the
impact of the images since it increases the knowledge that participants have of the images. Another
problem seems to be the limited number of pictures for specific themes in the IAPS database. This
especially affects studies centered on a specific emotional theme and designs that require a lot of trials of the same kind (e.g., EEG recordings). In order to increase the availability of visual emotion stimuli,
in 2011, a new database called Geneva Affective PicturE Database (GAPED) [14], was created.
It is important to remember that contrary to the IAPS database, the goal of the GAPED is not to be
able to compare research performed by using the same database, but to provide researchers with some
additional pre-rated emotional pictures. Even though research has shown that the IAPS is useful in the
study of discrete emotions, the categorical structure of the IAPS has not been characterized thoroughly.
In 2005, Mikels [66] collected descriptive emotional category data on subsets of the IAPS in an effort to
identify images that elicit discrete emotions. In the following paragraphs we provide some detail about
these three datasets.
IAPS
The IAPS database contains about 1182 images, and provides a set of normative emotional stimuli for
experimental investigations of emotion and attention. The goal is to develop a large set of standardized,
emotionally-evocative, internationally accessible, color photographs that includes contents across a wide
range of semantic categories [50]. The authors rely on a relatively simple dimensional view, which
assumes emotions can be defined by a coincidence of values on a number of VAD dimensions. Each
picture of the database is plotted in terms of the mean Valence and Arousal rating. These ratings were
made by male, female and children subjects using the Self-Assessment Manikin (SAM) questionnaire for pleasure, Arousal and dominance, over a period of 10 years.
GAPED
To increase the availability of visual emotion stimuli, a new database called GAPED was created. The
database contains 730 pictures, 121 representing Positive emotions using human and animal babies as
well as natural sceneries, 89 for the Neutral emotions, mainly using inanimate objects, and 520 for the
Negative emotions. In the case of the Negative pictures, they are divided into four categories: spiders,
snakes, human rights violation and animal mistreatment. The pictures were rated according to Valence,
Arousal, and the congruence of the represented scene with internal (moral) and external (legal) norms.
These ratings were made by 60 subjects, where each subject rated 182 images. Given the size of the
database, participants were divided into five groups, each one rated a subset of the database, which
means that only 39 images were rated by all participants.
Since Positive emotions are often neglected in the study of emotions, the GAPED has also followed
this orientation, with attention being put on developing large Negative categories and a unique Positive
category. Consequently, the database is asymmetric, with many more Negative than Positive pictures,
and with contents more specific in the Negative pictures.
Mikels
This new dataset is composed of 330 images from IAPS: 133 Negative and 187 Positive, and was
annotated with Positive and Negative emotions [79] [80]. The Positive emotions are Amusement, Awe,
Contentment and Excitement, while the Negative are Anger, Disgust, Fear and Sadness. These data
reveal multiple emotional categories for the images and indicate that this image set has great potential
in the investigation of discrete emotions.
The emotional category ratings were made by 30 males and 30 females, in two studies, using a
subset of Negative images and a subset of Positive images, with a constrained set of categorical labels.
For the Negative images, the study resulted in four categories: Disgust (31), Fear (12), Sadness (42),
and blended (48), i.e, more than one emotion present in the image. In the case of the Positive images,
the study resulted in six categories: Amusement (10), Awe (7), Contentment (15), Excitement (10),
Blended (71), and Undifferentiated (74), i.e., with all the emotions present in the image.
As we can see in Table 2.1, only GAPED and Mikels provide information about the category of an emotion, i.e., Negative, Neutral or Positive. Mikels also discriminates the emotions elicited by the images
regarding Anger, Disgust, Fear, Sadness, Amusement, Awe, Contentment and Excitement. IAPS does
not provide any information about the emotional content of the images that compose the dataset, only
V-A information.
              IAPS    GAPED   Mikels
# Total       1182    730     330
# Negative    N/A     520     133
# Neutral     N/A     89      N/A
# Positive    N/A     121     187
Emotions      No      No      Yes

Table 2.1: Comparison between IAPS, GAPED and Mikels datasets
Besides the IAPS and GAPED databases, in which each image was annotated with their Valence
and Arousal ratings, there are other databases (typically related to facial expressions) that were labeled
with the corresponding emotions, such as NimStim Face Stimulus Set6 , Pictures of Facial Affect (POFA)7
or Karolinska Directed Emotional Faces (KDEF)8 .
6 http://www.macbrain.org/resources.htm
7 http://www.paulekman.com/product/pictures-of-facial-affect-pofa/
8 http://www.emotionlab.se/resources/kdef
2.5 Summary
Emotion in human cognition is essential and plays an important role in the daily life of human beings,
namely in rational decision-making, perception, human interaction, and in human intelligence. Regarding
the emotion representation, there are two different perspectives: categorical and dimensional. Usually,
the dimensional model is preferable because it could be used to locate discrete emotions in space, even
when no particular label could be used to define a certain feeling.
In order to extract emotions from an image, we need to understand how their contents affect the way
emotions are perceived by users. This content can be facial expressions of the faces present in the
images, color, shape or texture information.
To describe how humans perceive and classify facial expressions of an emotion, there are two types
of models: the continuous and categorical. The continuous model explains how expressions of emotion
can be seen at different intensities, whereas the categorical explains, among other findings, why the
images in a morphing sequence between two emotions, like Happiness and Surprise, are perceived
as either happy or surprised but not something in between. Models of the perception and classification of the six facial expressions of emotion have been developed. Initially, they used feature- and shape-based algorithms, but, in the last two decades, appearance-based models (AAM) have been used. In
both cases the recognition rates are already very good, varying from 80% to 90%.
CBIR is a technique that uses visual contents of images to search images in large databases, using
a set of features, such as Color, Shape or Texture. Color is the most extensively used visual content
for image retrieval since it is the basic constituent of images. Shape corresponds to an important criterion for matching objects based on their physical structure and profile. Texture is defined as all that
is left after Color and local Shape have been considered; it also contains information about the structural
arrangement of surfaces and their relationship to the surrounding environment. Each type of visual feature usually captures only one aspect of image property, which means that, usually, a combination of
features is needed to provide adequate retrieval results.
However, the low-level information used in CBIR systems does not sufficiently capture the semantic
information that the user has in mind. In order to solve this, the EBIR systems could be used. These
systems are a subcategory of the CBIR that, besides the common features, also use emotions as a
feature. Most of the research in the area is focused on assigning image mood on the basis of eyes
and lips arrangement, but colors, textures, composition and objects are also used to characterize the
emotional content of an image, i.e., some expressive and perceptual features are extracted and then
mapped into emotions. In the last five years, some EBIR systems have been developed that were
able to achieve recognition rates of 44% and 53.5% using classification methods such as GMM or SVM.
Besides the extraction of emotions from an image, there has been an increasing number of attempts
to use emotions in different ways such as the increase of the quality of recommendation systems. These
systems help users find a small and relevant subset of multimedia items based on their preferences.
Finally, the best-known problem of these systems, the matrix-sparsity problem, can be mitigated using implicit feedback, such as recording the emotional reaction of the user to a given item and using it as a way of rating that item.
As we can see, a lot of work has been done on identifying the relationship between emotions and the different visual characteristics of an image, on recognizing faces in images and analyzing the emotions that they transmit, and even on the new EBIR technique used to retrieve images based on emotional features.
However, there is no system for identifying the emotional content present in an image.
3 Fuzzy Logic Emotion Recognizer
As we have seen in Section 2.4, there are two types of datasets: those that have the images annotated
with V-A values, and the ones with images annotated with the emotions they convey. However, there is
no dataset with both characteristics or a model that, given the V-A values, can classify the emotions they
represent.
Therefore, we propose a recognizer that classifies an image with the universal emotions present in it and the corresponding category (Negative, Neutral and Positive), based on its V-A ratings, using Fuzzy
Logic. This recognizer will allow us to increase the number of images annotated with their emotions
without the need of manual classification, reducing both the subjectivity of the classification and the
extensive use of the same stimuli. This is particularly important because, if we use these images to
perform manual classification, the impact of them in future studies will be lower, since it increases the
knowledge that participants have about the images.
3.1 The Recognizer
In order to map V-A ratings into emotion labels, we used the Circumplex Model of Affect (CMA) [75] [72]
which states that all affective states arise from cognitive interpretations of core neural sensations that
are the product of two independent neurophysiological systems: Valence and Arousal. It is important to
mention that there are a lot of variations of this model with no consensus among them. In our case, we
used the model in Figure 3.1 to be able to recognize the following six emotions (defined according to
Oxford Dictionaries1 ):
Anger: A strong feeling of annoyance, displeasure, or hostility.
Disgust: A feeling of revulsion or profound disapproval aroused by something unpleasant or offensive.
Fear: An unpleasant emotion caused by the belief that someone or something is dangerous, likely to
cause pain, or a threat.
Happiness: The state of being happy, i.e., feeling or showing pleasure or contentment.
Sadness: The condition or quality of being sad, i.e., feeling or showing sorrow or unhappiness.
1 http://www.oxforddictionaries.com/us/definition/american_english/
Surprise: An unexpected or astonishing event, fact, or thing.
Figure 3.1: Circumplex Model of Affect with basic emotions. Adapted from [75]
To build our dataset for training and testing our recognizer, we used the Mikels dataset [79] [80] [66]. For
our purposes, we have made two assumptions: 1) we assume that Amusement, Awe, Contentment and
Excitement correspond to the basic emotion Happiness, and 2) besides each isolated emotion, we also
consider classes of emotions that often occur together.
According to the assumptions made, our initial dataset is composed of 1 image of Anger, Disgust
and Fear (ADF), 6 images of Anger, Disgust and Sadness (ADS), 1 image of Anger and Fear (AF), 1
image of Anger and Sadness (AS), 31 images of Disgust (D), 25 images of Disgust and Fear (DF), 11
images of Disgust and Sadness (DS), 12 images of Fear (F), 3 images of Fear and Sadness (FS), 114
images of Happiness (Ha), and finally, 43 images of Sadness (S). Given that we removed the classes of
emotions with fewer samples (less than 5), the resulting dataset includes: ADS, D, DF, DS, F, Ha and
S.
For each image in the dataset, we started by normalizing the V-A values (ranging between −0.5 and 0.5). Then, we divided the Cartesian Space, using these values, in order to define each class of emotions; as we can see in Figure 3.2, there was a huge confusion among the different classes.
In order to reduce the existing confusion, and considering the Circumplex Model of Affect (See Figure 3.1), we used the Polar Coordinate System (See Figure 3.3) to represent each image in terms of
Angle (see Equation 3.1) and Radius (see Equation 3.2), each computed using the V-A. Angle was used
to identify the class of emotion each image belongs to, while Radius was used to help reduce emotion
confusion between images with similar angles.
Angle(Valence, Arousal) = arctan(Arousal / Valence) ∈ [0°, 360°]          (3.1)

Radius(Valence, Arousal) = √(Valence² + Arousal²) ∈ [0, √2/2]          (3.2)
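A small sketch of this mapping, using atan2 to resolve the quadrant so that the Angle falls in [0°, 360°); this is an illustration, not the exact implementation used in this work.

import math

def polar_from_va(valence, arousal):
    # valence and arousal are the normalized ratings, each in [-0.5, 0.5];
    # atan2 resolves the quadrant so the Angle lies in [0, 360) degrees and
    # the Radius lies in [0, sqrt(2)/2].
    angle = math.degrees(math.atan2(arousal, valence)) % 360.0
    radius = math.hypot(valence, arousal)
    return angle, radius

# polar_from_va(0.4, 0.2) -> (about 26.6 degrees, about 0.45)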
Figure 3.2: Distribution of the images in terms of Valence and Arousal
Figure 3.3: Polar Coordinate System for the distribution of the images
Even with the use of Angle and Radius to describe each image, there is still confusion among the different classes of emotions, so instead of using rigid intervals we decided to use Fuzzy Set Theory to
describe each class of emotions, as well as the categories. A fuzzy set corresponds to a class of objects
with a continuum Degree of Membership (DOM), where each set is characterized by a membership
function, usually denoted as µA (x), which assigns to each object a DOM with a range between zero and
one [90]. Any type of continuous probability distribution function can be used as a membership function.
In our work we used the Product of Sigmoidal membership function and the Trapezoidal membership
function, which we shortly describe in the following paragraphs.
Product of Sigmoidal membership function:
A sigmoidal function (See Figure 3.4) depends on two parameters: a and c (see Equation 3.3).
The first one controls the slope, while the second is the center of the function. Depending on the
sign of the parameter a, the function is inherently open to the right or to the left.
Figure 3.4: Sigmoidal membership function
sigmf(x : a, c) = 1 / (1 + e^(−a(x − c)))          (3.3)
The final equation for this membership function is given by:
psigmf(x : a1, c1, a2, c2) = [1 / (1 + e^(−a1(x − c1)))] × [1 / (1 + e^(−a2(x − c2)))]          (3.4)
Trapezoidal membership function:
The trapezoidal curve (See Figure 3.5) depends on four scalar parameters a, b, c, and d (see
Equation 3.5). The parameters a and d locate the “feet” of the trapezoid and the parameters b and
c locate the “shoulders”.
Figure 3.5: Trapezoidal membership function
trapmf(x : a, b, c, d) = max(min((x − a)/(b − a), 1, (d − x)/(d − c)), 0)          (3.5)
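The three membership functions can be sketched directly from Equations 3.3 to 3.5 (assuming b > a and d > c for the trapezoid); this is a minimal illustration of the functions, not the code used in this work.

import numpy as np

def sigmf(x, a, c):
    # Sigmoidal membership function (Equation 3.3).
    return 1.0 / (1.0 + np.exp(-a * (np.asarray(x, dtype=float) - c)))

def psigmf(x, a1, c1, a2, c2):
    # Product of two sigmoids (Equation 3.4), open on both sides.
    return sigmf(x, a1, c1) * sigmf(x, a2, c2)

def trapmf(x, a, b, c, d):
    # Trapezoidal membership function (Equation 3.5); assumes a < b <= c < d.
    x = np.asarray(x, dtype=float)
    rising = (x - a) / (b - a)
    falling = (d - x) / (d - c)
    return np.maximum(np.minimum(np.minimum(rising, 1.0), falling), 0.0)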
Regarding the computation of the parameters for the membership functions, we started by using mean and stddev measures. In the case of the classes of emotions, for the Angle membership function, both measures were used for the slope parameters, i.e., a1 and a2. The parameters c1 and c2 correspond, respectively, to the lowest and highest value of the Angle for that subset of images. For the Radius membership function, b is the minimum and c the maximum value of the Radius for that subset of images, while a = b − ε1 and d = c + ε2, with ε1 = ε2 = 0.01 (empirical value). In the case of the categories' parameters, and since we used trapezoidal memberships for the angles and the radiuses, the b and c parameters correspond to the lowest and highest value of the Angle/Radius for that subset of images; in the case of the parameters a and d, the only difference is the ε values, which vary according to each category. For all the classes of emotions, we removed the outliers, i.e., images with Angle or Radius values that were distant from those of the majority of the images for the corresponding class.
Although fuzzy sets are commonly defined using only one dimension, they can be complemented
with the use of cylindrical extensions. Given this, we used a two-dimensional membership function that
is the result of the composition of the two one-dimensional membership functions mentioned above.
For each category (see Figures 3.7 to 3.11) we used the Trapezoidal membership function, both for
Angle and Radius (see Equation 3.6). In the case of the classes of emotions (see Figures 3.12 to 3.18)
we used the Product of Sigmoidal membership function for the Angle and the Trapezoidal membership
function for the Radius (see Equation 3.7).
category(Angle, Radius : a1, c1, a2, c2, a, b, c, d) = trapmf(Angle : a, b, c, d) × trapmf(Radius : a, b, c, d)          (3.6)

emotions(Angle, Radius : a1, c1, a2, c2, a, b, c, d) = psigmf(Angle : a1, c1, a2, c2) × trapmf(Radius : a, b, c, d)          (3.7)
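A sketch of how the two one-dimensional memberships compose into the two-dimensional degree of membership of Equation 3.7, and of how the dominant classes could be selected; psigmf and trapmf are the functions sketched above, and the parameter values below are placeholders, not the ones fitted in this work.

EMOTION_PARAMS = {
    # label: ((a1, c1, a2, c2) for the Angle, (a, b, c, d) for the Radius)
    # Placeholder values for illustration only.
    "Ha": ((0.5, 300.0, -0.5, 360.0), (0.04, 0.05, 0.50, 0.51)),
    "S":  ((0.5, 200.0, -0.5, 260.0), (0.07, 0.08, 0.30, 0.31)),
}

def emotion_dom(angle, radius, angle_params, radius_params):
    # Two-dimensional degree of membership for a class of emotions (Eq. 3.7).
    return psigmf(angle, *angle_params) * trapmf(radius, *radius_params)

def dominant_classes(angle, radius, n=2):
    # Annotate an image with its n dominant classes of emotions.
    doms = {label: float(emotion_dom(angle, radius, ap, rp))
            for label, (ap, rp) in EMOTION_PARAMS.items()}
    return sorted(doms.items(), key=lambda kv: kv[1], reverse=True)[:n]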
Figure 3.7: Membership Functions for Negative category
Figure 3.9: Membership Functions for Neutral category
Figure 3.11: Membership Functions for Positive category
Figure 3.12: Membership Functions for Anger, Disgust and Sadness
Figure 3.13: Membership Functions for Disgust
Figure 3.14: Membership Functions for Disgust and Fear
Figure 3.15: Membership Functions for Disgust and Sadness
Figure 3.16: Membership Functions for Fear
Figure 3.17: Membership Functions for Happiness
Figure 3.18: Membership Functions for Sadness
Each image was annotated with the degree of membership for each possible category and class of emotions; the two dominant categories and the two dominant classes of emotions were also associated with the image.
In Figure 3.19 there is a global view of the membership functions of all the classes of emotions for
Angle, where it is possible to see the existing confusion between the classes of emotions. There is clearly a differentiation between the Positive emotion Happiness ([0°, 95°] ∪ [300°, 360°]) and the Negative emotions ([120°, 280°]). However, there is a lot of confusion among the Negative emotions, the main confusions being between DF and F, between D, DF, ADS and DS, and finally between ADS, DS and S. With the exception of DF, which overlaps with almost all other Negative emotions, even with the ones without any obvious connection (for example S), the remaining are logical, and expectable,
overlaps of emotions.
Figure 3.19: Membership Functions of Angle for all classes of emotions
In Figure 3.20 there is a global view of the confusion between emotions regarding Radius. In this
case, and contrary to what happened in the case of Angle, there is no clear differentiation between
Negative and Positive emotions. As we can see, there are no emotions in the proximity of the extremes
(0 and 70); in fact, the emotions lie between 8 and 55. Almost all emotions are completely inside
the D interval ([10, 55]), which is the emotion with the biggest range of values for Radius. In some cases,
such as F, which is completely inside DF, or S, which almost overlaps completely DF, the Radius will
not be particularly helpful. However, and considering the results for the Angle, in the case of confusions
between ADS, DS and S, the use of Radius will be useful, for example in the interval of [10, 18] the
emotion will undoubtedly be S. So, the combination of the two attributes (Angle and Radius) allows us to
better distinguish the emotions.
Figure 3.20: Membership Functions of Radius for all classes of emotions
3.2 Experimental Results
In order to build the training dataset, we analyzed the dataset for each class of emotions and categories.
For both cases, we concluded that the distribution of the images is not symmetric, which is more evident in
the classes of emotions.
For the proper evaluation of our model for the classes of emotions, and taking into account that
“Clinicians and researchers have long noted the difficulty that people have in assessing, discerning,
and describing their own emotions. This difficulty suggests that individuals do not experience, or recognize, emotions as isolated, discrete entities, but that they rather recognize emotions as ambiguous and
overlapping experiences.” as stated in [72], we consider that a result is correct if the expected class of
emotion is present (totally or partially) in the result label; if it is not present, we consider the class of
emotion with the biggest DOM as a confusion. For example, if the expected class of emotion was D, we
considered correct results ADS, DS, D, DF or any combination of one of those with a second class of
emotion.
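This evaluation rule can be expressed compactly as a containment test over the class labels used above (a sketch under the assumption that classes are named with the abbreviations used in the text).

def is_correct(expected, predicted_label):
    # A result counts as correct if the expected class of emotion is totally
    # or partially present in the returned label, e.g. an expected "D" accepts
    # "ADS", "DS", "D" or "DF".
    return expected in predicted_label

# is_correct("D", "ADS") -> True    is_correct("F", "ADS") -> False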
As we can see in Table 3.1, the best results were achieved for D, F and Ha. In the case of Ha, this
result is due to the clear distinction between the Angle values for Ha and the remaining emotions; while
in the case of D, we believe it is due to the big interval, both for Angle and Radius; finally, in the case of
F, we believe it is because its Angle interval only overlaps with DF. DF, however, shows the worst result, but this is expectable given that both its Angle and Radius intervals are overlapping
with the majority of the emotions.
(%)                     ADS      D       DF      DS       F       S       Ha
Correctly classified    83.33    100     76      90.91    100     90.70   100

Table 3.1: Correct classification rates (diagonal of the confusion matrix) for the classes of emotions in the IAPS (Mikels) dataset
For evaluating the model with respect to the categories, we followed an approach similar to the one described above. If the expected category is one of the returned categories, we consider the result correct; otherwise we consider the one with the biggest DOM as a confusion. If two categories have the same DOM we select the "worst" one, i.e., between Neutral and Positive, for example, we choose Neutral as the confusion result. Table 3.2 presents the results achieved for the IAPS dataset (corresponding to our training dataset), while Table 3.3 shows the results for the GAPED dataset. These results were achieved after adjusting the Radius parameters for each category (in order to eliminate some non-classified results).
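The following short sketch illustrates the rule just described for selecting the confusion category; the map-based representation of the fuzzy output and the explicit "worst-first" ranking are assumptions made for illustration.

    import java.util.Map;

    // Sketch: pick the confusion category with the largest DOM, breaking ties pessimistically.
    public final class CategoryConfusion {

        // Lower rank = "worse" category (Negative is worse than Neutral, which is worse than Positive).
        private static int rank(String category) {
            switch (category) {
                case "Negative": return 0;
                case "Neutral":  return 1;
                default:         return 2; // Positive
            }
        }

        static String confusion(Map<String, Double> dom) {
            String best = null;
            for (Map.Entry<String, Double> e : dom.entrySet()) {
                if (best == null
                        || e.getValue() > dom.get(best)
                        || (e.getValue().equals(dom.get(best)) && rank(e.getKey()) < rank(best))) {
                    best = e.getKey();
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // Tie between Neutral and Positive: Neutral is chosen, as in the example in the text.
            System.out.println(confusion(Map.of("Neutral", 0.6, "Positive", 0.6, "Negative", 0.1)));
        }
    }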
(%)        Negative    Positive
Negative   100
Positive               100
Table 3.2: Confusion Matrix for the categories in the Mikels dataset
(%)        Negative    Neutral    Positive
Negative   87.89
Neutral    7.69        98.88
Positive   4.42        1.12       100
Table 3.3: Confusion Matrix for the categories in the GAPED dataset
As expected, the results achieved when using the same set for training and testing were 100% for the Negative and Positive categories. When using GAPED as a testing set and adding the Neutral category (which did not exist in the training dataset), the results for the Negative category became worse; however, this is mainly due to the existing confusion between the Negative and Neutral categories in the dataset itself [14]. The Positive category maintained an accuracy of 100%, while the Neutral category obtained almost 99%.
3.3 Discussion
In this work, we developed a model to automatically classify the emotions and categories conveyed by
an image in terms of their normalized Valence and Arousal ratings. With this model we were able to
successfully annotate our training set with the dominant categories with classification rates of 100% and
the dominant classes of emotions with an average classification rate of 91.56%.
Although we intend to recognize the six basic emotions, we do not have any data for the emotions Anger and Surprise. In the case of Anger, this is because it is difficult to elicit this emotion through the images used, as explained in [66]; in the case of Surprise, it is because the work we followed did not consider this emotion.
In general, the results achieved are very good; however, in the case of emotions, it is important to mention the existing confusion between some of them, mainly Disgust and Sadness (and the classes composed of at least one of these emotions). This can be explained by the neuroanatomical findings in [48], where the authors report that, for film-induced emotion, some regions such as the prefrontal cortex and the thalamus are common to these emotions, which are also both associated with the activation of anterior and posterior temporal structures of the brain.
We also annotated the GAPED dataset and the remaining pictures of the IAPS dataset. For GAPED we obtained a non-classification rate of 23.4% and an average classification rate of 95.59% for the categories (we do not have any information about the emotions). In the case of IAPS we obtained a non-classification rate of 4.86%; however, we cannot compute classification rates because, apart from the images we used as the training set (Mikels), we did not have any information about the categories or emotions. The non-classified results are explained by the lack of images covering the whole space of the CMA in both datasets and, in the particular case of GAPED, also by the use of a slightly different CMA. The confusion between the Negative and Neutral categories (in the GAPED dataset) already existed, as explained in [14], while the confusion between the Negative and Positive categories can be explained by the use of different models of the CMA.
3.4 Summary
We developed a recognizer to classify an image with the universal emotions present in it and the corresponding category (Negative, Neutral and Positive), based on their V-A ratings using Fuzzy Logic. For
each image in the dataset, we started by normalizing the V-A values, and computed the Angle and the
Radius for each image in order to help reduce emotion confusion between images with similar angles.
To describe each class of emotions, as well as the categories, we used the Product of Sigmoidal
membership function and the Trapezoidal membership function. For the categories we used Trapezoidal
membership function, both for Angle and Radius, while for the classes of emotions, we used the Product
of Sigmoidal membership function for the Angle and the Trapezoidal membership function for the Radius;
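For reference, the two membership function families can be written as follows. These are the standard definitions (as used, for example, in MATLAB's Fuzzy Logic Toolbox); the parameter values fitted to each class and category are not reproduced here.

    % Product of two sigmoidal functions (psigmf), used for the Angle of each class of emotions
    \mathrm{psig}(x;\, a_1, c_1, a_2, c_2) =
        \frac{1}{1 + e^{-a_1 (x - c_1)}} \cdot \frac{1}{1 + e^{-a_2 (x - c_2)}}

    % Trapezoidal membership function (trapmf), used for the Radius and for the categories
    \mathrm{trap}(x;\, a, b, c, d) =
        \max\!\left( \min\!\left( \frac{x - a}{b - a},\; 1,\; \frac{d - x}{d - c} \right),\, 0 \right)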
As expected, the results achieved when using the same set for training and test were 100% for
Negative and Positive categories. In the case of the dominant classes of emotions we achieved an
average classification rate of 91.56%. With the use of GAPED as a testing set we achieved an average
recognition rate of 96% for categories. For GAPED, we achieved a non-classification rate of 23.4%,
while in the case of the IAPS, we achieved a non-classification rate of 4.86%.
4
Content-Based Emotion Recognizer
Emotion, also called mood or feeling, can be seen as the emotional content of an image itself or the impression it makes on a human. When talking about emotions, it is important to mention their inherent subjectivity, since different emotions can arise in a subject looking at the same picture, depending on his or her current emotional state [67]. However, the expected affective response can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [30].
There is general agreement that humans can perceive all levels of image features, from the primitive/syntactic to the highly semantic ones [74], and also that artists have long explored the formal elements of art, such as lines, space, mass, light or color, to express emotions [18]. Given this, we assume that emotional content can be characterized by the image's color, texture and shape. Additionally, given that certain features in photographic images are believed by many to please humans more than others, we also consider the aesthetics of an image, which, in the world of art and photography, refers to the principles of the nature and appreciation of beauty [17, 39].
In order to acquire as much information as possible about an image, we use different features regarding Color, Texture, Shape and Composition, among others. However, a compromise between the amount of information collected and the processing time has to be found. Since most descriptors only model a particular property of the images, a combination of features is often required to obtain the best results. As stated in [30, 74], low-level image features can be easily extracted using computer vision methods, but they are no match for the information a human observer perceives.
After identifying the features that can be used to describe an image in terms of its emotional content, we train different classifiers in order to identify the best features for describing an image according to its category of emotions. Our goal is therefore to identify the combination of visual features that matches human perception as closely as possible regarding the Positive or Negative content of an image.
Regarding feature extraction, we selected the descriptors most used in the literature and easiest to compute, resulting in the following descriptor vector features (for simplicity, we will refer to them simply as "features" from now on): AutoColorCorrelogram (ACC) [36], Color Histogram (CH) [78], Color Moments (CM) [17], Number of Different Colors (NDC) [18], Opponent Histogram (OH) [85], Perceptual Fuzzy Color Histogram (PFCH) [3, 4], Perceptual Fuzzy Color Histogram with 3x3 Segmentation (PFCHS) [3], Reference Color Similarity (RCS) [45], Gabor (G) [64], Haralick (H) [31], Tamura (T) [82], Edge Histogram (EH) [8], Rule of Thirds (RT) [17], Color Edge Directivity Descriptor (CEDD) [9], Fuzzy Color and Texture Histogram (FCTH) [11] and Joint Composite Descriptor (JCD) [10]. The majority of the features were extracted using jFeatureLib [26] and LIRE [58, 59], while PFCH, PFCHS, RT and NDC were implemented by us.
For the classification, we used Weka 3.7.11, a data mining software [28]. This software allows us
to use three different groups of classifiers: simple, meta and combination. For the simple classifiers we
use Naive Bayes (NB) [37], Logistic (Log) [52], John Platt’s sequential minimal optimization algorithm
for training a support vector classifier (SMO) [33, 41, 70], C4.5 Decision Tree (algorithm from Weka)
(J48) [73], Random Forest (RF) [7], and K-nearest neighbours (IBk) [1]. In the case of meta classifiers,
i.e., classifiers based on other classifiers, we used LogitBoost (LB) [24], RandomSubSpace (RSS) [34],
and Bagging (Bag) [6]. For the combination of classifiers we used Vote with the Average combination
rule [44, 47]. Although one of the good practices of machine learning is to use normalized data, in our
tests we did not find any difference in the results, so we kept the features unnormalized. The tests and
results are described in subsection 4.2.
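The sketch below illustrates this Weka-based setup: it loads a feature file, builds a Vote classifier that averages the outputs of three of the classifiers mentioned above, and evaluates it with 5-fold cross-validation. The ARFF file name and the particular subset of classifiers are illustrative assumptions, and all parameters are left at their Weka defaults.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.meta.LogitBoost;
    import weka.classifiers.meta.Vote;
    import weka.core.Instances;
    import weka.core.SelectedTag;
    import weka.core.converters.ConverterUtils.DataSource;

    public class VoteExample {
        public static void main(String[] args) throws Exception {
            // One row per image: feature vector plus the Positive/Negative class (hypothetical file).
            Instances data = DataSource.read("features.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Vote classifier combining SMO, Naive Bayes and LogitBoost with the Average rule.
            Vote vote = new Vote();
            vote.setClassifiers(new Classifier[] { new SMO(), new NaiveBayes(), new LogitBoost() });
            vote.setCombinationRule(new SelectedTag(Vote.AVERAGE_RULE, Vote.TAGS_RULES));

            // 5-fold cross-validation, as used in our experiments.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(vote, data, 5, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }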
4.1 List of features used
In this section we briefly describe the features extracted from the images.
AutoColorCorrelogram [36]
Given that a color histogram only captures the color distribution in an image and does not include any spatial correlation information, the highlight of this feature is the inclusion of the spatial
correlation of colors with the color information.
Color Histogram [78]
This feature is a representation of the distribution of colors in an image, i.e., it represents the
number of pixels that have colors in each of a fixed list of color ranges (quantization in bins). In our
work we use a HSB Color histogram.
Color Moments [17]
This feature computes the basic color statistical moments of an image like mean, standard deviation, skewness and kurtosis.
Number of Different Colors [18]
This feature counts the number of different colors, using RGB space, that compose an image.
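As an illustration, a minimal sketch of such a count over the raw RGB values is shown below; whether the original implementation quantizes the colors first is not stated, so no quantization is applied here.

    import java.awt.image.BufferedImage;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: count the distinct RGB colors of an image.
    public final class NumberOfDifferentColors {
        public static int count(BufferedImage image) {
            Set<Integer> colors = new HashSet<>();
            for (int y = 0; y < image.getHeight(); y++) {
                for (int x = 0; x < image.getWidth(); x++) {
                    colors.add(image.getRGB(x, y) & 0x00FFFFFF); // ignore the alpha channel
                }
            }
            return colors.size();
        }
    }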
Opponent Histogram [85]
This feature is a combination of three 1D histograms based on the channels of the opponent color space: O1, O2 and O3. The color information is represented by O1 and O2, while the intensity information is represented by channel O3.
Perceptual Fuzzy Color Histogram [4]
In this feature, for each pixel of the image, the degree of membership for its Hue is evaluated
and assigned to the correspondent bin of the fuzzy histogram. Therefore, after processing the
whole image, each of the 12 bins of the fuzzy histogram will have the sum of the DOMs (degree of
membership) for the corresponding Hues.
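A minimal sketch of this idea is shown below. The 12 bins follow the description above, but the triangular membership functions centred every 30 degrees are an assumption made here for illustration; the original feature derives its fuzzy hue regions from the way users perceive color.

    import java.awt.Color;
    import java.awt.image.BufferedImage;

    // Sketch: accumulate each pixel's hue membership into a 12-bin fuzzy histogram (not normalized).
    public final class PerceptualFuzzyColorHistogram {

        private static final int BINS = 12;

        public static double[] extract(BufferedImage image) {
            double[] histogram = new double[BINS];
            float[] hsb = new float[3];
            for (int y = 0; y < image.getHeight(); y++) {
                for (int x = 0; x < image.getWidth(); x++) {
                    int rgb = image.getRGB(x, y);
                    Color.RGBtoHSB((rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF, hsb);
                    double hue = hsb[0] * 360.0;
                    for (int bin = 0; bin < BINS; bin++) {
                        histogram[bin] += membership(hue, bin * 30.0); // sum of DOMs per bin
                    }
                }
            }
            return histogram;
        }

        // Assumed triangular membership centred at 'center' with a 30-degree half-width (hue is circular).
        private static double membership(double hue, double center) {
            double d = Math.abs(hue - center);
            d = Math.min(d, 360.0 - d);
            return Math.max(0.0, 1.0 - d / 30.0);
        }
    }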
Perceptual Fuzzy Color Histogram with 3x3 Segmentation [3]
This feature divides an image into 9 equal parts, and performs PFCH in each one. The result is
the combination of the nine fuzzy histograms.
Reference Color Similarity [45]
This feature is not a histogram, since the reference colors used are processed independently; any subset of dimensions gives the same result as computing just those colors, making this feature space very favorable for feature bagging and other projections.
Gabor [64]
This feature represents and discriminates Texture information, using frequency and orientation
representations of Gabor filters since they are similar to those of the human visual system.
Haralick [31]
This feature is based on statistics and summarizes the relative frequency distribution, which describes how often one gray tone appears in a specified spatial relationship to another gray tone in the image.
Tamura [82]
This feature implements three of the six Tamura features (coarseness, contrast, directionality, line-likeness, regularity and roughness).
Edge Histogram [8]
This feature captures the spatial distribution of undirected edges within an image. The image is divided into 16 equal-sized, non-overlapping blocks and, for each block, a 5-bin histogram counts edges in the following categories: vertical, horizontal, 45°, 135° and non-directional.
Rule of Thirds [17]
This feature computes the color moments for the inner rectangle of an image divided into 9 equal
parts.
Color Edge Directivity Descriptor [9]
This feature incorporates Color and Texture information in a histogram. In order to extract the Color
information, it uses a Fuzzy-Linking histogram. Texture information is captured using 5 digital filters
that were proposed in the MPEG-7 Edge Histogram Descriptor.
Fuzzy Color and Texture Histogram [11]
This feature combines, in one histogram, Color and Texture information using 3 fuzzy systems.
Joint Composite Descriptor [10]
This feature corresponds to a joint descriptor that joins CEDD and FCTH information in one histogram.
4.2 Classifier
For testing and training, we used Mikels dataset [66, 79, 80], with 113 Positive images and 123 Negative
images. The Positive images are the ones with the Happiness label, while the Negative ones correspond
to ADS (6), D (31), DF (20), DS (11), F (12) and S (43) labels. We separated the data into a training and
test set using K-fold Cross Validation with K = 5 [60].
We started by analyzing a set of classifiers in order to understand which one best learned the relation between the features and the given category of emotion (see Table A1). For the simple classifiers (NB, Log, SMO, J48, RF and IBk) we used their default configurations, while for the meta classifiers (LB, RSS and Bag) we used RF as the base classifier. For these preliminary tests we used all the features, but without any combination between them, and we did not consider the time required to build the model.
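A minimal sketch of this meta-classifier configuration with the Weka API is shown below; all other parameters are left at their defaults.

    import weka.classifiers.meta.Bagging;
    import weka.classifiers.meta.LogitBoost;
    import weka.classifiers.meta.RandomSubSpace;
    import weka.classifiers.trees.RandomForest;

    // Sketch: LB, RSS and Bag, each using Random Forest as the base classifier.
    public final class MetaClassifiers {
        public static void main(String[] args) {
            LogitBoost lb = new LogitBoost();
            lb.setClassifier(new RandomForest());

            RandomSubSpace rss = new RandomSubSpace();
            rss.setClassifier(new RandomForest());

            Bagging bag = new Bagging();
            bag.setClassifier(new RandomForest());
            // These configured classifiers are then evaluated exactly like the simple ones.
        }
    }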
With these classifiers, we achieved average recognition rates between 52.75% and 56.62%. However, after observing the results, we were not able to choose a single classifier. For example, for the ACC feature the best result was achieved using Bag, while for PFCH or PFCHS the best result was achieved using the NB classifier. Based on these relations (for each feature), we studied the following combinations of classifiers (using the Vote classifier):
Vote 1 (V1) Vote(SMO+NB+LB+Log+Bag)
Vote 2 (V2) Vote(SMO+NB+LB+RF+RSS)
Vote 3 (V3) Similar to V2, but with default configurations for the LB and RSS classifiers
Vote 4 (V4) Vote(SMO+NB+LB)
Vote 5 (V5) Vote(SMO+NB+Log)
Vote 6 (V6) Vote(SMO+NB)
As we can see in Figure 4.1, the global results for the Vote classifiers are better than the ones achieved using the simple or meta classifiers. Regarding the features, and considering the average recognition rate across all the classifiers, the most promising features are PFCH with 64.27%, CH with 64.13%, RT with 63.06%, PFCHS and JCD with 61.16%, CEDD with 60.59%, and FCTH with 60.10%.
Figure 4.1: Average recognition considering all features
Although we were able to improve the recognition rates by combining classifiers, the average recognition rates remain similar among them, from 56.57% to 57.76%. Additionally, on average, each classifier performs best for 3 of the 16 features, which means that no classifier was clearly better for the majority of the features.
4.2.1 One feature type combinations
Considering the preliminary results, we performed more tests using different combinations of features within each feature type (Color, Composition, Texture, Shape and Joint) using the six Vote classifiers (see Table A2). Given the number of possible combinations, especially for Color features, we considered as candidates for the best features those with a recognition rate greater than the average over all the features (for each classifier).
Color
In Table A3 we can see the results for Color using only one feature, while Tables A4 to A10 show the results for combinations of two to all of the Color features. The grey cells in the tables correspond to values greater than the average for that classifier. In the following paragraphs we discuss the results for the various combinations of features.
One feature:
Even though we had earlier identified only the Color features PFCH, PFCHS and CH as the most promising, we also added OH and RCS to the list of Color candidates. As expected from the literature, and as we can see in Figure 4.2, the best results were the ones corresponding to some type of histogram, in particular the commonly used Color Histogram, as well as PFCH and PFCHS, which take into account the way users perceive color.
Figure 4.2: Results for Color - one feature
Two features:
When we considered combinations of two Color features, the average recognition rates increased by almost 2%, considering all possible combinations. As we can see in Figure 4.3, the best results were achieved using combinations that include CH, PFCHS or PFCH.
Generally, for CH, the use of NDC, OH, PFCH and RCS improved its recognition rates, while CM and PFCHS reduced them. The features OH and NDC, when combined with the other features, also improved their average recognition rate. Finally, PFCH improves slightly when combined with RCS. As expected, and in general, the combination of the best individual features gave us the best results.
Figure 4.3: Results for Color - two features
Three features:
In this case, the average rate for two and three features is almost the same. Considering the best combinations of two features, CH+RCS (68.22%), OH+PFCHS (67.37%), PFCHS+RCS (66.95%), CH+OH (66.95%), and CH+NDC (66.95%), we can see in Figure 4.4 that, in general, the results were better. For example, the best combination of two features achieved 68.22%, while CH+OH+RCS has a recognition rate of 69.50%. Moreover, the second best combination of two features achieved worse results than all the best combinations of three features.
Figure 4.4: Results for Color - three features
For the other combinations with a recognition rate of at least roughly 65%, we observe that the majority of them achieved better results than those obtained with only two features. There are exceptions, however: CH+OH+PFCH has a rate of 65.68%, while CH+OH has a better rate of 66.95%. These observations allow us to conclude, for now, and contrary to our expectations, that adding more information does not, in general, linearly improve the classification accuracy.
Four features:
Contrary to what happened in the previously analyzed tests, the average rate decreased from three to four features. The same also happened if we consider only the average for the best features. Although the differences appear minimal, they are consistent with our previous observation. The best results were the ones including the combination CH+CM+NDC, and the best overall combination was CH+CM+NDC+RCS, with a recognition rate of 68.64% using V2.
Five features:
Comparing these tests with the ones above, we noticed a marginal decrease in the average recognition rate over all combinations. Looking only at the best features, CH+CM+NDC+OH+RCS (67.80%) and CH+CM+NDC+OH+PFCHS (66.95%), we found the contrary, since there was a small increase in the recognition rate when compared with CH+CM+NDC+OH (66.10%).
For all the combinations based on CM+NDC+OH+PFCH or NDC+OH+PFCH+PFCHS, in general, there were no noticeable differences in the average recognition rates when compared with the results for five features. However, when we combined them with the RCS feature we observed an increase in the corresponding recognition rates.
Six features:
CH+CM+NDC+OH+PFCH+RCS (67.80%) and CH+CM+NDC+OH+PFCH+PFCHS (67.38%) were the best combinations. In both cases, when compared with CH+CM+NDC+OH+PFCH (65.25%), there was a significant improvement in the recognition rates achieved.
In some cases, such as CM+NDC+OH+PFCH+PFCHS or CH+CM+NDC+OH+PFCH, combining them with the RCS feature improved the corresponding average recognition rates.
Seven features:
In this group we achieved a poor average recognition rate. However, this was expected, since the majority of the combinations include the ACC feature, which had achieved the worst results in all the tests performed. The best combination was CH+CM+NDC+OH+PFCH+PFCHS+RCS, which achieved an average recognition rate of 66.95% for V3.
All features:
Across all the tests, this was one of the worst results. However, as in the previous test, this was expected, since it includes the worst feature (ACC). In fact, as we can see in Table A9, the same combination without the ACC feature has an average rate of 64.69%.
Table A11 contains the final list of Color candidate features. Given the size of this list, we started by reducing the number of classifiers to analyze. We first looked at the recognition rates of all the combinations for each classifier. V4 was the best classifier with an average rate of 63.96%, followed by V1, V2, V3, V5, and V6, with similar recognition rates of 63.83%, 63.71%, 63.62%, 62.70%, and 62.19%. Given these results, we kept analyzing only the values for the best classifiers, i.e., V4, V1 and V2, and decided to keep only the Color combinations with an average recognition rate of at least 66%, reducing the list to only 7 combinations of features (see Table 4.1).
For the remaining classifiers, we analyzed the time they took to learn and build the model. Figure 4.5 shows the time that each Vote classifier takes to build the model for the different numbers of features used in the tests. As we can see, the most unstable classifiers were V1, V3 and V5. Given this, from now on we will only consider the V2 and V4 classifiers.
Figure 4.5: Time to build models
Composition
Given that we only consider one feature of this type (RT), we cannot perform an extended analysis.
However, and considering that this feature corresponds to the Color moments of a segmented part of
an image, i.e., it captures Color information for the inner rectangle of an image, we can perform some
comparisons against the Color results (see Table A12).
Across all the classifiers, this feature achieved an average recognition rate of 63.35%. If we compare
it with the average recognition rate for Color, there is a difference of almost 3% in the recognition rates,
but it is important to mention that RT only has a dimension of 4, which means it is extremely quick to
extract from an image, while the average dimension of Color is 343. Given both the recognition rate
and the dimension of this feature, we considered it as a promising feature, not only for combination with
other features, but also to use as a single feature.
Shape
Similarly to Composition, for this type we only considered one feature (EH), which achieved the worst results of all the tests performed so far: an average rate of 44.71% (see Table A13). However, we selected it for further tests, in order to see whether, in combination with other types such as Color, it helps to discriminate the emotional category of an image.
Texture
For this group, H was the best feature (56.78%) (see Table A14). Among the combinations of two Texture features, the best one was H+T, with a small decrease compared to H. When we combine all the Texture features, the rate decreases slightly (55.08%). For further tests, we selected the two best features: H and H+T.
Joint
The best features were JCD and CEDD, with recognition rates of 63.56% and 62.71%, respectively (see Table A15). For the combinations of two features, the majority achieved worse results than the individual ones, with the exception of FCTH+JCD, which had the same rate as CEDD. For the combination of all the features we achieved an average recognition rate of 61.44%. The selected features were: JCD (61.16%), CEDD, and FCTH+JCD.
In Table 4.1 we can see the final list of features to use in the following tests. Regarding the distribution
of the type of features, we have 50.00% for Color features, 21.44% for Joint, 14.28% for Texture, and the
remaining 14.28% equally divided between Composition and Shape features.
At this point, and given these results, we expect that the combination of features of different types
increases the recognition rates, and allows us to better discriminate the emotional category of a given
image. The new tests were done using combinations of two and three different types of features.
Color         CH+RCS, CH+NDC+RCS, CH+OH+RCS, CH+PFCH+RCS, CH+PFCHS+RCS, CH+CM+NDC+RCS, CH+CM+NDC+OH+PFCH+RCS
Composition   RT
Shape         EH
Texture       H, H+T
Joint         CEDD, JCD, FCTH+JCD
Table 4.1: List of best features for each category type
4.2.2 Two feature type combinations
In the case of the tests performed using combinations of two types of features, the results can be seen in
Table A16 for Color and Composition, Table A17 for Color and Shape, Table A18 for Color and Texture,
Table A19 for Color and Joint, Table A20 for Composition and Shape, Table A21 for Composition and
Texture, Table A22 for Composition and Joint, Table A23 for Shape and Texture, Table A24 for Shape
and Joint, and Table A25 for Texture and Joint.
Using the combination of the best features for Color and Composition, almost all the combinations performed worse than the original Color feature (i.e., without the Composition feature); the only exception was OH+PFCHS+RCS+RT, which increased the corresponding recognition rate. In the case of Color and Shape, with the addition of Color information to the Shape features, all the combinations achieved better results. For Color and Texture, some of the Color combinations were improved by the Texture feature H, namely CH+PFCH+RCS and CH+PFCHS+RCS; in fact, CH+PFCH+RCS+H is one of the best features. Regarding the two Texture features used, H and H+T, the first one achieved, on average, better results when combined with the different Color features. In the tests using Color and Joint, we were combining two of the best feature types; however, none of the combinations performed better than the original Color feature, which means that the use of CEDD, JCD, and FCTH+JCD did not add any useful information to that already captured by color.
The combination of Composition and Shape is slightly better than the Shape feature EH alone, but considerably worse (by more than 13%) than the Composition feature RT. On average, the results achieved by combining Composition and Texture features were worse than the average recognition rate of the two types separately. In the case of the Texture feature H+T, it is slightly better when combined with RT, with a similar dimension, which means that, in this case, it is better to use the combined feature. For Composition and Joint, all of the combinations achieved worse results than the isolated features, so in this case it is preferable to use the Composition feature alone.
Regarding Shape and Texture, although the tested combinations achieved a better average recognition rate than Shape alone, it is still better to use one of the Texture features (H or H+T), since their recognition rates remain higher. For Shape and Joint, the combinations were better than Shape, but considerably worse than the results achieved for Joint. In the case of Texture and Joint, all of the combinations improved on the Texture features.
4.2.3 Three feature type combinations
For the tests using combinations of three types of features, the results are in Table A26 for Color, Composition and Shape, Table A27 for Color, Composition and Texture, Table A28 for Color, Composition
and Joint, Table A29 for Color, Shape and Texture, Table A30 for Color, Shape and Joint, and Table A31
for Color, Texture and Joint.
For Color, Composition and Shape, all the combinations achieved worse results with the addition of the Shape feature. For Color, Composition and Texture, with the addition of Texture information to the Color and Composition combinations, some of the new combinations achieved better results, such as OH+PFCHS+RCS+RT+H or OH+PFCHS+RCS+RT+H+T. For Color, Composition and Joint, all the results were worse. In the case of Color, Shape and Texture we achieved better recognition rates, especially with the use of the H Texture feature. For Color, Shape and Joint, we achieved some better results with the use of the FCTH+JCD feature. For Color, Texture and Joint, all the results were worse.
In general, the results with the addition of more information tend to decrease, even though we
were able to improve some of our previous results. Considering the results achieved until now, for
the next tests we will only use the best three feature type combinations: OH+PFCHS+RCS+RT+H,
OH+PFCHS+RCS+RT+H+T, CH+RCS+H+FCTH+JCD, and CH+PFCH+RCS+H+T+CEDD.
4.2.4 Four feature type combinations
For these tests, the results can be seen in Table A32 for Color, Composition, Texture and Shape, Table A33 for Color, Composition, Texture and Joint, Table A34 for Color, Texture, Joint and Shape, and in
Table A35 for Color, Texture, Joint and Composition.
For all the combinations, the achieved results were considerably worse than the original ones: the average recognition rate of the initial combinations was 66.53%, while the new one decreased to 62.83%. Given these results, we will not perform tests using combinations of all the feature types.
4.2.5 Overall best features combinations
In Table 4.2 we can see the best features across all the tests, and the recognition rates achieved.
(%)                                          V2       V4
Color
  CH + OH + RCS                              68.64    66.53
  CH + CM + NDC + RCS                        68.64    66.10
Color and Composition
  CH + OH + RCS + RT                         67.37    64.83
  CH + CM + NDC + RCS + RT                   66.95    65.25
Color and Texture
  CH + RCS + H                               67.80    64.83
  CH + PFCH + RCS + H                        68.22    66.10
  CH + CM + NDC + RCS + H                    68.22    64.41
  CH + PFCH + RCS + H + T                    66.95    65.68
Color and Joint
  CH + RCS + CEDD                            66.95    65.23
  CH + NDC + RCS + CEDD                      67.80    65.25
  CH + OH + RCS + CEDD                       68.64    64.83
Color, Composition and Texture
  OH + PFCHS + RCS + RT + H                  65.25    68.22
  OH + PFCHS + RCS + RT + H + T              66.95    66.56
Color, Texture and Joint
  CH + RCS + H + FCTH + JCD                  68.22    63.98
  CH + PFCH + RCS + H + T + CEDD             66.95    66.10
Table 4.2: Overall best features
We consider the following combinations the best ones: CH+CM+NDC+RCS, CH+OH+RCS, CH+CM+NDC+RCS+H, CH+OH+RCS+CEDD, CH+PFCH+RCS+H, OH+PFCHS+RCS+RT+H+T, CH+RCS+H+FCTH+JCD, and OH+PFCHS+RCS+RT+H, all with recognition rates above 68.00% for V2 or above 66.50% for V4. Almost all of these combinations are composed mainly of Color features, which was expected, since color is the primary constituent of images and usually the most important characteristic influencing the way people perceive them. In some cases, the use of Texture or Joint features was useful to reduce the number of Color features needed to capture the emotional information of the images.
In Table A36 we can see the respective confusion matrices for each of the best features. For the Positive category the best combination was OH+PFCHS+RCS+RT+H (58.41%) using classifier V4, while for the Negative the best were CH+CM+NDC+RCS and CH+OH+RCS, both using classifier V2 with a recognition rate of 82.65%.
In order to confirm whether our selected combinations really discriminate an image in terms of its emotional content, we also trained the two classifiers V2 and V4 using a new dataset with images from GAPED. For the first tests, we used 121 Negative images (31 from Animal, 30 from Human, 30 from Snake, and 30 from Spider, chosen randomly) and 121 Positive images. Although we had only considered the Positive and Negative categories in the tests performed so far, due to the use of the Mikels dataset, we also performed tests using the Neutral category (89 images from the GAPED dataset). The results of the first tests, i.e., the ones using only the Positive and Negative categories, can be seen in Table A37 (confusion matrices), while Table A38 shows the results that also use the Neutral category.
For the tests using the Positive and Negative categories, the best combinations were CH+OH+RCS+CEDD (70.25%) for the Positive category and CH+PFCH+RCS+H (82.11%) for the Negative, in both cases using classifier V4. In the tests that also used the Neutral category, the results were considerably worse; however, since we did not train the model considering the Neutral category, this was somewhat expected. The biggest confusion was between the Negative and Neutral categories, although this is a known problem of the GAPED dataset. The best combination for the Positive category was OH+PFCHS+RCS+RT+H+T (58.78%) using classifier V2; for the Neutral category it was CH+RCS+H+FCTH+JCD (65.17%) using classifier V4; finally, for the Negative category there were two best combinations (both using classifier V2): OH+PFCHS+RCS+RT+H (65.29%) and OH+PFCHS+RCS+RT+H+T (65.29%).
Given the results achieved so far, and in order to select the best classifier and combination of features for our final recognizer, we created a new dataset of 468 images selected from both the Mikels and GAPED datasets. From each one we selected 121 Negative and 113 Positive images, giving us a total of 242 Negative and 226 Positive images. We divided the dataset using 2/3 for training (312 images) and the remaining 1/3 for testing (156 images).
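A minimal sketch of such a split using the Weka API is shown below; the ARFF file name and the random seed are illustrative assumptions, and whether the original split was randomized or stratified is not stated.

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Sketch: 2/3 training, 1/3 testing split of the 468-image dataset.
    public final class TrainTestSplit {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mikels_gaped_468.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(42)); // shuffle before splitting

            int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
            System.out.println("train: " + train.numInstances() + ", test: " + test.numInstances());
        }
    }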
As we can see in Table A39, the best combination for the Negative category was OH+PFCHS+RCS+RT+H+T using classifier V4 (88.50%), while for the Positive category it was CH+RCS+H+FCTH+JCD using classifier V2 (61.54%). Considering both categories, the combination we chose as the best overall was CH+CM+NDC+RCS, which, using classifier V2, achieved an average recognition rate of 72.44% (Negative: 87.18% and Positive: 57.69%).
4.3 Discussion
Regarding the tests performed using only combinations of Color features, the ACC feature always achieved the worst results. However, when we incorporated more Color information, the results always increased, from 56.43% (using only ACC) to 59.25% (using all features). Globally, the best features seem to be CH, CM, RCS and OH: in each group of tests they are generally the ones with the best results, and when combined with the Joint features, which appear to capture less information, the recognition rates increased. Regarding PFCH, PFCHS and NDC, these features do not always improve the results.
When comparing the number of features used in each combination, we observed that, with up to four features, the increase in the average recognition rate is consistent: more features gave us better results. In the other cases, however, it seems that adding more features sometimes only confuses the information. In fact, as we can see in Table 4.1, only one of the selected combinations has more than four features.
Overall, and as expected from our previous study of the literature, the best results were achieved using Color features (and combinations of Color features). All the other types, except Shape, also achieved relatively good recognition rates, especially if we consider the subjectivity inherent in the way humans interpret the emotional content of an image. In general, the results tend to decrease with the addition of more information, even though we were able to improve some of our previous results.
Given that we performed all the tests using a small number of observations, and that in the majority of the tests the number of features used for each image is considerably bigger than the number of observations, we considered the possibility of overfitting. Overfitting is a phenomenon that occurs when a statistical model describes noise instead of the underlying relationship, i.e., it memorizes information instead of learning it. It usually occurs when a model is excessively complex, for example when it has too many parameters relative to the number of observations. Although we used cross-validation in all the tests we performed, which helps to reduce overfitting, we decided to verify whether our final classifier suffers from it.
We started by testing our classifier using only the training set; in case of overfitting, the expected accuracy would be around 100%, yet it is only in the order of 70%. Besides that, if our model were overfitting, the classifier should perform considerably worse on images not used for training; however, the recognition rates were similar to the ones obtained with the training set. Given this, we believe that our classifier does not suffer from overfitting. Additionally, we also tried to reduce the number of features used by performing Principal Component Analysis (PCA); however, the results indicated that all the features used are important.
4.4 Summary
We developed a recognizer that classifies an image with the corresponding emotion category, Positive or Negative, based on the content of the image, such as Color, Texture or Shape. Using a set of 156 images for testing that were not used for training, we achieved an average recognition rate of 72.44% (Negative: 87.18% and Positive: 57.69%). The recognizer uses a Vote classifier based on SMO, NB, LB, RF, and RSS, and is composed of the CH, CM, NDC, and RCS features.
5
Dataset
In order to provide a new dataset annotated with the emotional content of each image, we performed a study with different subjects. For this purpose we developed a Java application, EmoPhotoQuest, which uses the Swing toolkit to show the images to the users and collect the ratings for each of the displayed images.
5.1 Image Selection
Concerning the creation of the dataset, we started by selecting the images, using the results of the recognizer developed in Chapter 3, from the following datasets: IAPS, GAPED and Mikels. From the first one we selected 86 images: 9 each of A (Anger), ADS, D, DF, DS, F, Ha, N (Neutral) and S, and 5 of Surprise (Su). From GAPED we selected 76 images: 8 each of A, ADS, D, DF, DS, F, Ha, N and S, and 4 of Su. Finally, from Mikels we selected 7 images: 1 each for ADS, D, DF, DS, F, Ha and S. For each class of emotions, we selected the images with the biggest DOM possible.
The dataset contains multiple images of animals, such as snakes, spiders, dogs, sharks, horses, cats and tigers, among others. The remaining images include children, war scenarios, mutilation, poverty, disease and death situations. It also includes images of surgical procedures, as well as images of natural catastrophes, car accidents or fire.
For the experiment, we divided the dataset into 4 subsets: DS0 to DS3. The first one contains 57 images: 20 from our subset of IAPS, 20 from our subset of GAPED, and all the images from our subset of Mikels. This subset was rated by all the participants. Dataset DS1 contains 40 images, while DS2 and DS3 contain 36 images each.
5.2 Description of the Experiment
We started by explaining the purpose of the study and how it would be conducted. To ensure the willingness of the subject regarding Negative images, we began by showing three images as examples of what could be expected. After that, the subject could decide whether or not to continue the study. If the subject decided to continue, s/he filled in the user questionnaire (see Figure 7.3) with his/her personal information (age, gender, etc.), as well as the classification of his/her current emotional state (categories and emotions).
In the initial screen of EmoPhotoQuest (see Figure 5.1a), it is possible to select the language (Portuguese or English), as well as to read a summary of the most important aspects of the study. There are 7 different blocks with nearly 14 images each. Images were presented in a random order, i.e., each user saw a different sequence of images. Each image was shown to the user for 5 seconds (see Figure 5.1b). After looking at the image, the user had to evaluate his/her current emotional state and rate the image according to each of the universal emotions using a 5-point Likert scale (see Figure 5.1c). When the user fills in all the requested information for an image, the Next button appears and s/he can move on to the next image. Although in other studies users usually have a limited time to respond, we decided not to impose one; this way, users could spend as much time as needed, without feeling pressured or stressed. To let users relax and avoid fatigue, we provided a 30-second interval between blocks during which only a black screen was displayed (see Figure 5.1d).
(a) EmoPhoto: Start screen
(b) EmoPhoto: Image visualization screen
(c) Rate screen
(d) Pause screen
Figure 5.1: EmoPhoto Questionnaire
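The timing mechanism can be implemented with a Swing Timer; the sketch below is only an illustration of how the 5-second display could be driven and is not the actual EmoPhotoQuest code.

    import javax.swing.JFrame;
    import javax.swing.JLabel;
    import javax.swing.SwingUtilities;
    import javax.swing.Timer;

    // Sketch: show a "screen" for 5 seconds, then switch to the rating screen (placeholder label swap).
    public final class TimedDisplay {
        public static void main(String[] args) {
            SwingUtilities.invokeLater(() -> {
                JFrame frame = new JFrame("EmoPhotoQuest sketch");
                JLabel content = new JLabel("Showing image...", JLabel.CENTER);
                frame.add(content);
                frame.setSize(640, 480);
                frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
                frame.setVisible(true);

                Timer timer = new Timer(5000, e -> content.setText("Rate the image"));
                timer.setRepeats(false); // fire once, after 5 seconds
                timer.start();
            });
        }
    }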
5.3 Pilot Tests
To verify that our procedure had no errors and that it was completely clear to the subjects, we performed two preliminary tests with different subjects. The first was a 27-year-old man, who performed the test in Portuguese and took 35 minutes to complete it. The second subject was an 18-year-old woman, who also preferred to take the test in Portuguese and took 42 minutes to conclude it. Apart from one duplicated image, none of the subjects had any doubts or detected any errors in our application for collecting their emotional information.
An interesting aspect of the performed tests was the sensitivity to the Negative images. The first
subject considered the majority of the images very violent, while the second one considered them almost
Neutral, and in some cases she enjoyed the Negative content. These preliminary results demonstrated
how subjective the emotional content of an image can be.
5.4 Results
We conducted 60 tests: 26 with females and 34 with males, with 70% of the participants belonging to the 18-29 age group (see Figure B2) and almost 60% holding a BSc degree (see Figure B4). Only 3 of the users had participated in a study using a Brain-Computer Interface device (see Figure B5), and none of them had participated in a study using the IAPS or GAPED databases; in fact, the overwhelming majority had no knowledge of these databases. Regarding their current emotional state (in terms of categories), 31 participants classified it as Neutral, 25 as Positive, and only 4 as Negative (see Figure 5.2a). Considering the emotional state according to each of the following emotions: Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise, we can see in Figure 5.2b that the majority of the participants were feeling moderately Happy or moderately Neutral, both with a median of 3, at the beginning of the tests.
Given the number of participants in our tests, each image of DS0 was rated by 60 participants, while
each image of DS1, DS2 and DS3 was rated by 20 participants.
(a) Categories
(b) Emotions
Figure 5.2: Emotional state of the users in the beginning of the test
During each session, participants were encouraged to share their opinions and comments about the experiment.
More than 40% of the participants indicated some type of difficulty in understanding the content of
some of the images, leading to confusion about their feelings. The majority identified the lack of context
as the main reason for this.
For example, some users did not understand whether an animal in front of a car would be hit by it or not; in this case there is confusion between feeling Negative if the animal is hit, and Neutral/Positive otherwise. In some images of animals with people around them, it is not clear whether the people are helping the injured animal or whether they are the ones that caused the injury. As in the previous example, if the people are helping, the users tend to feel Positive; otherwise they feel Negative and irritated/angry. Another example is the case of animals lying on the ground: it is not clear to the user whether the animal is dead or just sleeping. This doubt also influences the way the user feels: Negative in case of death, Neutral/Positive otherwise.
Besides these concrete examples, one of these users explained that if he sees an image of a hideous act committed out of religious fanaticism, he feels disgusted and angry, but if it is due to necessity (poverty, to get food, etc.) he feels only Sadness. Another user reported that he feels disgusted not by what the image expresses by itself, but because of the situation in which the image was taken: violence against women or poverty. Some users also mentioned that some images had bad quality (pixelated) or appeared to be fake/manipulated with programs such as Photoshop, which meant they did not feel affected by these images.
Some of the users (2) indicated that there are too many emotions to rate. However, other users (8) suggested that there should be an option such as Confusion, Strangeness, Anxiety or Disturbed, because they consider that some images do not correspond to any of the available emotions. Moreover, another user considered that Happiness is not enough to discriminate the Positive feelings of some images, such as cuteness. In the case of Surprise, some users (5) claimed that it is subjective, difficult to understand and difficult to elicit from an image. There seem to be some exceptions to this, such as a shark moving as if it were attacking a person, or images with unexpected content like a lamp or stairs. However, two users considered Surprise one of the most common emotions at the beginning of the test, but one that tends to disappear during the test. In the case of Anger, two users explicitly told us that none of the images was able to trigger that emotion. In the case of the Neutral emotion, and given the existence of the Neutral category, four users did not understand the use of this emotion, suggesting that a rating of "3" in all the other emotions would be equivalent to "feeling Neutral"; one user suggested using the term indifference/apathy instead of Neutral.
Regarding the personal taste of the users, some of them do not appreciate spiders (4), snakes (3) or aquatic animals (1), but some consider images with these animals beautiful because of their colors. The opposite is also true: some users appreciate snakes (4) or spiders (3). Two users hate needles, one user hates hunting and another is afraid of heights, i.e., he reported feeling Fear from an image in which he thinks he should have felt happy. On the contrary, two users identified a specific image that should be considered Negative, but since they enjoy its content (fire and surgical instruments), they felt Positive and happy. Some users (3) declared that they are not sensitive to some images, for example the ones with children smiling: they said that they should feel happy, but they feel Neutral. Finally, one user noticed that an image of a couple in which the woman is pregnant would usually be Neutral to him, but as his sister is pregnant, he felt happy because it reminded him of her.
One of the users was particularly happy at the beginning of the test and reported to us that he did not feel affected by the images; however, after viewing various Negative images, he said that his emotional state was getting worse. In fact, more users (4) stated that sequential Negative images (for example, three in a row) affect the emotional state more negatively than alternating ones (for example, one Negative, one Neutral, one Negative). The same happens for a Positive image: the user feels Positive, but he is also influenced by the Negative images, so he does not feel as happy as he "should". However, two users explained that, given the extensive number of sequential Negative images, they tended to rate a Positive image with a higher value. Finally, some users mentioned that the emotional content of the last image also interferes with the way they were feeling at that moment.
Regarding the impact of the images, two users indicated that if the situation were real, for example if they were near a snake, they would feel much more affected than by only seeing an image of it. Another user mentioned that if the person (or people) appearing in the image were family or friends, the impact on his emotional state would be considerably bigger. Two users also reported feeling Fear not because of what the image transmits, but because they imagined themselves in that situation. Concerning the Negative images, one user mentioned that they should be "more shocking". Four users would have preferred larger images, ideally fullscreen and in high definition, while two other users suggested that the use of videos instead of images would have a bigger impact on their emotional state. Finally, one user suggested the use of 3D with a device such as the Oculus Rift.
Concerning the design of the test, six users considered it very long, i.e., with too many images, and two other users suggested that there should have been more Positive images. A large number of users also reported that the test had too many images of snakes (18) or spiders (7). With so many images of snakes/spiders, some users (6) reported that they got used to them and stopped feeling afraid or disgusted. To avoid the use of many images with the same animals (snakes and spiders), users (3) suggested the use of salamanders, grasshoppers, scorpions or maggots. In the case of the pause screen, seven users considered it very long, and one of them did not even understand the need for a pause between blocks of images. At least one user appreciated the pause screen and suggested the use of a timer to indicate the time left for resting. Finally, some users (6) explained that it was complicated to analyse what they were feeling, given that it was very subjective and also difficult to rate from 1 to 5; two of them gave the example that they would only give a rating of 5 in extreme cases, such as if they started crying or laughing out loud. Besides this, three of them also mentioned that the first images of each sequence could have biased ratings, because people were still adapting to the rating scheme.
These comments, as well as the reported inconsistencies, come from a minority of participants (10%). The remaining participants did the experiment as intended, and their responses were aligned with the emotions that the images were supposed to transmit.
5.5 Discussion
Regarding the images classified as Negative by our users (Figure 5.3), almost all of them had at least 50% of negative ratings; however, 20 to 30% of the images also had a significant number of neutral ratings. Besides that, only 27% did not have any positive vote.
Regarding the images classified as Neutral or Positive (see Figure 5.4), for the first case (images
from 1033 to Sp139) almost 39% had a considerable number of negative ratings, and only 12% did
not have any Positive vote. In the case of the Positive images (from 1340 to P124) almost 50% had at
least one negative rating, while 10% were rated by all the participants as positive. As in the case of the
Negative images, we can see a lot of neutral ratings for each positive image.
We compared the results achieved, concerning categories, between our dataset and the GAPED/Mikels datasets for each of the images in our dataset, in order to obtain the agreement between them. In the case of the images from the Mikels dataset (see Table 5.1), the agreement was 100% for the Positive image, while in the case of the Negative images there is confusion between the Negative and Neutral categories.
(%)        Negative    Positive
Negative   66.67
Neutral    33.33
Positive               100
Table 5.1: Confusion Matrix for the categories between Mikels and our dataset
Figure 5.3: Classification of the Negative images of our dataset (from users)
Figure 5.4: Classification of the Neutral and Positive images of our dataset (from users)
For the GAPED dataset (see Table 5.2) we analyzed 76 images (33 Negative, 9 Positive, and 34 Neutral). For the Neutral and Positive categories the agreement achieved was 100% for each, while for the Negative category, similarly to what happens with the Mikels dataset, there is confusion between the Negative and Neutral categories.
(%)        Negative    Neutral    Positive
Negative   55
Neutral    43.33       100
Positive   1.67                   100
Table 5.2: Confusion Matrix for the categories between GAPED and our dataset
5.6 Summary
In this chapter we described the experiment performed to annotate a new dataset of images with the emotional content of each image. Besides that, we also collected important information about what users think of the experiment and what influences the way they feel while viewing an image. Given this, we consider the following aspects the most important: the way a person interprets an image, specifically the context in which the image is inserted, the person's current emotional state, and the person's previous personal experiences.
From the results achieved we conclude that there was no clear agreement between the users, with
this fact being more evident in the Negative and Neutral categories, while the Positive category was
the most consensual. We also compared the agreement, for each image, between our dataset and
Mikels/GAPED datasets. In both cases, there was an overall good agreement, with the worst results
achieved in the Negative category, where the images considered as Negative in the Mikels or GAPED
were mainly considered as Neutral by our users.
6
Evaluation
In this chapter we present the evaluation, using the new dataset, of the two recognizers: Content-based
Emotion Recognizer (CBER) and Fuzzy Logic Emotion Recognizer (FLER).
6.1 Results
Each image of the new dataset was classified by the two recognizers: FLER and CBER. Concerning the
categories, each image was annotated with the dominant category using CBER and FLER; in the later
each image was annotated with up to two dominant categories. In the case of the emotions, only FLER
was used to annotate the image with the most dominant emotions (up to three).
Besides the classifications made by our recognizers, each image had already the classification made
by the participants of our study (see Chapter 5).
6.1.1 Fuzzy Logic Emotion Recognizer
In the following paragraphs we will describe the evaluation performed concerning the categories (Negative, Neutral, and Positive), as well as the emotions (ADS, D, DF, DS, Ha, N, and S).
Categories
Table 6.1 shows the results achieved when evaluating FLER on our dataset, considering the categories. From our dataset we used 21 Positive, 67 Neutral and 81 Negative images. For the Negative category, the recognition rate achieved was 100%, while the Positive category achieved almost 86%. For the Neutral category, the result was considerably worse (only 28%).
(%)        Negative    Neutral    Positive
Negative   100         61.19      4.76
Neutral                28.36      9.52
Positive               10.45      85.71
Table 6.1: Confusion Matrix for the categories using our dataset
When we compare these results with the ones achieved using only the GAPED dataset (see Section 3.2), the Negative and Positive categories achieved good results, with an increase for the Negative category (from 87.89% to 100%) and a decrease for the Positive (from 100% to 85.71%). It is also clear that the Neutral category achieved a poor result, decreasing from almost 99% to 28%. However, this result can be explained by the lack of agreement between the results from our users and the previous classification of the images from the GAPED dataset (see Section 5.3), as well as by the existing confusion between the Negative and Neutral categories in GAPED.
Emotions
Concerning the classification in terms of the emotions that an image conveys, we considered that a given emotion is present in the image if the median of the values assigned by the users to that emotion was at least 2.0. Considering this, of the 169 images that compose our dataset, almost 23% did not have any emotion associated. None of the non-annotated images belongs to the Positive category, and almost 60% correspond to the Neutral category.
Considering only the 131 images with associated emotions, there were no images with the emotions Anger or Surprise. For the remaining Negative emotions, we had 18 images of Sadness, 8 of Fear, and 5 images associated with Disgust. In the case of Happiness there were 17 images, while for Neutral we had 36. For two emotions in the same image, we had the following combinations: DS (8), AS (7), HaN (4), DF (3), FSu (2), NS (2), AD (1), AF (1), DSu (1), and FS (1). Considering combinations of three emotions in the same image, we had: ADS (7), AFS (3), ADF (2), DFN (1), DFSu (1), FHaN (1), and HaNSu (1). Finally, there was only one image with four associated emotions: ADSSu.
To check whether the emotions identified by our recognizer were correct, we considered a result correct if at least one of the emotions annotated in our dataset was present in the emotions identified by the recognizer. For example, if an image has the emotions ADS in the dataset, all the following outputs of the recognizer are considered correct: A, D, S, AD, AS, or DS. Given this, our FLER achieved a success rate of 68.70%. Considering the subset of images annotated with negative emotions, we had a success rate of 88.41%, while for the images with the positive emotion it was 82.35%. For the Neutral images it was only 38.89%, and in this case we observed a lot of confusion between N and S, DF, or F.
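As an illustration of this matching rule, the sketch below (not the original evaluation code; the names and the way the rating threshold is applied are assumptions based on the description above) builds the annotated emotion set from the users' ratings and counts a prediction as correct when it shares at least one emotion with that set.

```python
# Minimal sketch of the correctness criterion described above (illustrative only).
from statistics import median

def annotated_emotions(ratings_per_emotion, threshold=2.0):
    """ratings_per_emotion: dict emotion -> list of user ratings for one image.
    An emotion is kept when the median rating reaches the threshold used in the text."""
    return {e for e, ratings in ratings_per_emotion.items()
            if ratings and median(ratings) >= threshold}

def is_correct(predicted, annotated):
    """True when at least one predicted emotion is among the annotated ones."""
    return bool(set(predicted) & set(annotated))

# Example: an image annotated with ADS is matched by any prediction containing A, D or S.
assert is_correct({"A", "D"}, {"A", "D", "S"})
assert not is_correct({"Ha"}, {"A", "D", "S"})
```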
6.1.2
Content-Based Emotion Recognizer
For this evaluation, and given that CBER only classifies an image in terms of being Negative or Positive,
we did not consider the Neutral images of our dataset; therefore we used 21 Positive and 81 Negative
images. In Table 6.2 we can see the achieved results.
(%)        Negative   Positive
Negative   76.54      23.46
Positive   47.62      52.38
Table 6.2: Confusion Matrix for the categories using our dataset
If we compare these results with the ones obtained in Section 4.2.5, in both cases there was a decrease in the recognition rates: from 87.18% to 76.54% for the Negative category, and from 57.69% to 52.38% for the Positive. This decrease can be justified by the use of a single category for each image, given that, even among our users, in many cases and for different reasons (see Section 5.4), there is no consensus about which feeling each image transmits (see Figure 6.1).
If we consider the negative images (from 1304 to Sp146 in Figure 6.1), almost all the images had at least 50% of negative ratings, but there are also a lot of neutral ratings (on average from 20% to 30%) for these images. Besides that, only 27% of the negative images did not have any positive vote. For the positive images (from 1340 to P124), almost 50% had at least one negative rating, while 10% were rated by all the participants as positive. As in the negative images, we can see a lot of neutral ratings for each positive image.
Figure 6.1: Classification of the Negative and Positive images of our dataset (from users)
6.2 Discussion
Although there is a lot of work done in understanding the content of an image (see Section 2.3.1), the majority of this work did not specifically focus on the emotions or categories that an image conveys. In some cases it was possible to identify whether a picture is gloomy or not, to associate the visual content of an image with adjectives such as sublime, sad, touching, aggressive, romantic, elegant, chic, or calm, or with pairs of emotions such as like-dislike or gaudy-plain. Besides that, in general, the images used were not generic, since they corresponded to paintings or to textures related to clothing and decoration.
The work most similar to the one we developed for CBER [?] managed to sort pictures into categories (Positive/Negative) with an accuracy of 55%. For the basic emotions (Happiness, Sadness, Anger, Disgust, and Fear), it obtained an accuracy of 52%. Concerning the work we developed for FLER, as far as we know, there is no similar work.
6.3 Summary
In this chapter, we performed an additional evaluation of our recognizers, using the new dataset. For each image, we compared the classification of each recognizer to the one obtained in the experiment described in Chapter 5.
In the case of CBER, using our dataset, we achieved the following recognition rates: 76.54% for the Negative category and 53.28% for the Positive. In the case of FLER, we achieved a success rate of 68.70%, using our dataset, for emotions. In the case of categories, we achieved 100% for the Negative category, 88% for the Positive, and 28% for the Neutral.
We also briefly compared our work with the works detailed in Chapter 2.
7 Conclusions and Future Work
In this chapter, we present a summary of the dissertation, our final conclusions, and the contributions of our work. We also present new issues that might be addressed in the future.
7.1 Summary of the Dissertation
In this work, we proposed two solutions to identify the emotional content conveyed by an image, one
using the Valence and Arousal values, and another using the content of the image, such as colors,
texture or shape. We also provide a new dataset of images annotated with emotions, obtained from
experiments with users.
In Chapter 2, we described the importance of emotions, as well as how they can be represented.
Emotion in human cognition is essential and plays an important role in the daily life of human beings,
namely in rational decision-making, perception, human interaction, and in human intelligence. Regarding
the emotion representation, there are two different perspectives: categorical and dimensional. Usually, the dimensional model is preferable because it can be used to locate discrete emotions in space, even when no particular label can be used to define a certain feeling.
Along with it, we detailed previous work on the recognition of emotions from images, and how image contents, such as faces, Color, Shape, or Texture information, affect the way emotions are perceived by users. To describe how humans perceive and classify facial expressions of an emotion, there are two types of models: the continuous and the categorical. The continuous model explains how expressions of emotion can be seen at different intensities, whereas the categorical explains, among other findings, why the images in a morphing sequence between two emotions, like Happiness and Surprise, are perceived as either happy or surprised but not something in between. Models of the perception and classification of the six facial expressions of emotion have also been developed. Initially, they used feature- and Shape-based algorithms, but, in the last two decades, appearance-based models (AAM) have been used.
We also described the relationship between emotions and the different visual characteristics of an
image, namely Color, Shape, Texture, and Composition. Color is the most extensively used visual
content for image retrieval since it is the basic constituent of images. Shape corresponds to an important
criterion for matching objects based on their physical structure and profile. Texture is defined as all that is left after color and local shape have been considered; it also contains information about the structural arrangement of surfaces and their relationship to the surrounding environment. Composition is based on common (and not-so-common) rules. The most popular and widely known is the Rule of Thirds, which can be considered a rough approximation to the golden ratio (about 0.618) [41, 42]. It states that the most important parts of an image should lie not at its center but along the one-third and two-thirds lines (both horizontal and vertical) and at their four intersections.
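As a small illustration of the Rule of Thirds just described (a sketch written for this document, not code from the thesis), the function below computes the one-third and two-thirds lines of a frame and their four intersections; composition features typically measure how close salient regions fall to these points.

```python
def thirds_grid(width, height, ratio=1 / 3):
    """Return the vertical lines, horizontal lines and their four intersections.
    ratio = 1/3 gives the Rule of Thirds; ratio = 1 - 0.618 would place the lines
    at the golden sections instead."""
    xs = (width * ratio, width * (1 - ratio))
    ys = (height * ratio, height * (1 - ratio))
    intersections = [(x, y) for x in xs for y in ys]
    return xs, ys, intersections

# Example: the four "power points" of a 600x400 image,
# roughly (200, 133), (200, 267), (400, 133), (400, 267).
_, _, points = thirds_grid(600, 400)
print(points)
```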
We also presented CBIR: a technique that uses visual contents of images to search images in large
databases, using a set of features, such as Color, Shape or Texture. However, the low-level information
used in CBIR systems does not sufficiently capture the semantic information that the user has in mind. To address this, EBIR systems can be used. These systems are a subcategory of CBIR that, besides the common features, also use emotions as a feature. Most of the research in the area focuses on assigning an image mood on the basis of the arrangement of eyes and lips, but colors, textures, composition, and objects are also used to characterize the emotional content of an image, i.e., some expressive and perceptual features are extracted and then mapped into emotions. Besides the extraction of emotions from an image, there has been an increasing number of attempts to use emotions in different ways, such as improving the quality of recommendation systems. These systems help users find a small and relevant subset of multimedia items based on their preferences. Finally, the best-known problem of these systems, the matrix-sparsity problem, can be mitigated using implicit feedback, such as recording the emotional reaction of the user to a given item and using it as a rating for that item.
Finally, we present the datasets that we used in our work: IAPS, GAPED and Mikels. The IAPS
database provides a set of normative emotional stimuli for experimental investigations of emotion and
attention. The goal is to develop a large set of standardized, emotionally-evocative, internationally accessible, color photographs that includes contents across a wide range of semantic categories [59]. To
increase the availability of visual emotion stimuli, a new database called GAPED was created. Even
though research has shown that the IAPS is useful in the study of discrete emotions, the categorical
structure of the IAPS has not been characterized thoroughly. In 2005, Mikels collected descriptive emotional category data on subsets of the IAPS in an effort to identify images that elicit discrete emotions.
Besides the IAPS and GAPED databases, in which each image was annotated with their Valence
and Arousal ratings, there are other databases (typically related to facial expressions) that were labeled
with the corresponding emotions, such as NimStim Face Stimulus Set, Pictures of Facial Affect (POFA)
or Karolinska Directed Emotional Faces (KDEF).
In Chapter 3, we presented a recognizer that uses Fuzzy Logic to classify an image, based on its V-A ratings, with the universal emotions present in it and the corresponding category (Negative, Neutral, and Positive). For each image in the dataset, we started by normalizing the V-A values and then computed the Angle and the Radius, in order to help reduce emotion confusion between images with similar angles.
To describe each class of emotions, as well as the categories, we used the Product of Sigmoidal membership function and the Trapezoidal membership function. For the categories, we used the Trapezoidal membership function for both Angle and Radius, while for the classes of emotions we used the Product of Sigmoidal membership function for the Angle and the Trapezoidal membership function for the Radius.
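The sketch below illustrates, under assumptions, the two ingredients just summarized: converting normalized Valence-Arousal values into an Angle and a Radius, and the Trapezoidal and Product-of-Sigmoidal membership functions. The parameter values are illustrative only and are not the ones fitted in this work.

```python
import math

def angle_radius(valence, arousal):
    """Polar coordinates of a (valence, arousal) point (values assumed normalized)."""
    angle = math.degrees(math.atan2(arousal, valence)) % 360.0
    radius = math.hypot(valence, arousal)
    return angle, radius

def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 inside [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def prod_sigmoid(x, a1, c1, a2, c2):
    """Product of two sigmoid curves, giving a smooth, bell-like membership function."""
    s1 = 1.0 / (1.0 + math.exp(-a1 * (x - c1)))
    s2 = 1.0 / (1.0 + math.exp(-a2 * (x - c2)))
    return s1 * s2

# Degree of membership of one image in a hypothetical class of emotions
# (illustrative parameters, not the fitted Fuzzy Sets of the thesis).
angle, radius = angle_radius(valence=0.6, arousal=0.4)
dom = prod_sigmoid(angle, 0.3, 20.0, -0.3, 70.0) * trapezoidal(radius, 0.1, 0.3, 1.0, 1.2)
```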
We also presented the results of the experimental evaluation we performed. When using the same set for training and testing, we achieved a recognition rate of 100% for the Negative and Positive categories. In the case of the dominant classes of emotions, we achieved an average classification rate of 91.56%. Using GAPED as a testing set, we achieved an average recognition rate of 96% for categories. For GAPED, we obtained a non-classification rate of 23.4%, while for IAPS the non-classification rate was 4.86%.
In Chapter 4, we described a recognizer that classifies an image with the corresponding emotion category, Positive or Negative, based on the content of the image, such as Color, Texture, or Shape. We also presented the several studies we made concerning combinations of different visual features, in order to select the best one for our recognizer. The recognizer uses a Vote classifier based on SMO, NB, LB, RF, and RSS, and is composed of the CH, CM, NDC, and RCS features.
Finally, we presented the experimental results, in which, using a testing set of 156 images that were not used for training, we achieved an average recognition rate of 72.44% (Negative: 87.18% and Positive: 57.69%).
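As a rough sketch of this kind of ensemble (the thesis itself used WEKA's Vote meta-classifier; the scikit-learn stand-ins below are assumptions, as is the random data used to exercise them), a majority vote can be built over feature vectors that concatenate the CH, CM, NDC, and RCS descriptors as follows.

```python
# Illustrative voting ensemble: SVC stands in for SMO, GaussianNB for NB,
# GradientBoosting for LogitBoost (LB), and Bagging over random feature
# subsets for the Random SubSpace method (RSS).
import numpy as np
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

vote = VotingClassifier(
    estimators=[
        ("smo", SVC(kernel="poly", degree=1)),            # SMO-trained SVM stand-in
        ("nb", GaussianNB()),                             # Naive Bayes
        ("lb", GradientBoostingClassifier()),             # LogitBoost stand-in
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("rss", BaggingClassifier(DecisionTreeClassifier(), max_features=0.5)),
    ],
    voting="hard",  # majority vote over the predicted category (Negative/Positive)
)

# X: one row per image with the concatenated CH+CM+NDC+RCS features (hypothetical data);
# y: 0 = Negative, 1 = Positive.
X, y = np.random.rand(20, 64), np.random.randint(0, 2, 20)
vote.fit(X, y)
print(vote.predict(X[:3]))
```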
In Chapter 5, we described the experiment performed to annotate a new dataset of images with the emotional content of each image, as well as the information collected about what users thought of the experiment and what influenced the way they felt while viewing an image. Next, we discussed the aspects that we considered most important in the way a person interprets an image: the context in which the image is inserted, the current emotional state of the person, and the person's previous personal experiences.
We also presented the comparison of the agreement, for each image, between our dataset and the Mikels/GAPED datasets. In the case of the images from the Mikels dataset, the agreement was 100% for all four Negative images, as well as for the only Positive image. The remaining two images were classified as Neutral by the participants, although their original classification was Negative. For the GAPED dataset, we achieved 100% agreement for the Negative category, and almost 90% for the Positive. In the case of the Neutral category, only about 24% of the images were considered Neutral in both datasets, with the majority, almost 77%, being considered Negative.
In Chapter 6, we performed an additional evaluation of our recognizers, using the new dataset. For each image, we compared the classification of each recognizer to the one obtained in the experiment described in Chapter 5. We also briefly compared our work with the works detailed in Chapter 2.
In the case of CBER, using our dataset, we achieved the following recognition rates: 76.54% for
the Negative category and 53.28% for the Positive. In the case of FLER, we achieved a success rate
of 68.70%, using our dataset, for emotions. In the case of categories, we achieved 100% for the Negative category, 88% for the Positive, and 28% for the Neutral.
7.2 Final Conclusions and Contributions
Although there is a lot of work regarding the retrieval of images based on their content, most of this work
did not take into account the emotions that an image conveys. Therefore, our work focused on retrieving
the emotions related to a given image, by providing two recognizers: one using the Valence and Arousal
information from the image, and the other using the visual content of the image. This way, we increased
the number of images annotated with their emotions without the need of manual classification, reducing
both the subjectivity of the classification and the extensive use of the same stimuli. In short, the main
contributions of this work were:
• A Fuzzy recognizer that achieved a recognition rate of 100% for the categories of emotion and 91.56% for emotions, using the Mikels dataset [66]; for GAPED, the recognizer achieved an average classification rate of 95.59% for the categories of emotion; finally, using our dataset, it achieved a success rate of 68.70% for emotions and, in the case of categories, 100% for the Negative category, 88% for the Positive, and 28% for the Neutral.
• A recognizer based on the content of the images, which obtained a recognition rate of 87.18% for the Negative category and 57.69% for the Positive, using a dataset of images selected both from the IAPS and the GAPED datasets. Using our dataset, this recognizer achieved a recognition rate of 76.54% for the Negative category and 53.28% for the Positive.
• A new dataset of 169 images from IAPS, Mikels, and GAPED annotated with the dominant categories and emotions, taking into account what users felt while viewing each image.
7.3 Future Work
From the experimental evaluation of the developed recognizers, detailed in Sections 3.2 and 4.2 and in Chapter 6, we can establish new guidelines for the work to be done in the future.
Concerning FLER, we used 6 images for ADS, 11 for DS, 12 for F, 24 for DF, 31 for D, 43 for S, and, finally, 114 for Ha for the creation of the Fuzzy Sets. As we can see, the distribution of the images across the classes of emotion is not balanced, and in the majority of the cases there is a small number of images for each class. Given this, we consider it important to use more annotated images to adjust the Fuzzy Sets for each class of emotions, and consequently the Fuzzy Sets for each of the categories.
Considering the results obtained throughout this work for the categories, the next possible step is to merge the two recognizers into one. If a particular image provided as input to the “new” recognizer has information about its Valence and Arousal values, a weighting scheme should be used between the DOM values (assigned by FLER) and the estimated probability (assigned by CBER) in order to classify the image. Otherwise, only the CBER classification should be used.
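A minimal sketch of such a weighting scheme is shown below; the weight value, the score semantics, and the function name are assumptions used only to make the idea concrete.

```python
# Illustrative fusion of the two recognizers: combine FLER's degree of membership
# (DOM) with CBER's estimated probability when V-A information is available,
# otherwise fall back to CBER alone.
def combined_category(fler_dom=None, cber_prob=None, w=0.5):
    """fler_dom / cber_prob: dicts mapping category -> score in [0, 1] (fler_dom may be None)."""
    if fler_dom is None:                      # no Valence/Arousal information available
        scores = cber_prob
    else:
        categories = set(fler_dom) | set(cber_prob)
        scores = {c: w * fler_dom.get(c, 0.0) + (1 - w) * cber_prob.get(c, 0.0)
                  for c in categories}
    return max(scores, key=scores.get)

print(combined_category({"Negative": 0.8, "Neutral": 0.3},
                        {"Negative": 0.55, "Positive": 0.45}))  # -> "Negative"
```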
Furthermore, we suggest complementing the new dataset with data collected using a BCI device (e.g., Emotiv). This way, each image will have both the emotion felt and the emotion reported by the users. Another possibility is to use the automatic identification of the category of emotions from content to organize or sort the results of an image search, or even to filter the images that will be displayed to the user. Besides that, for example in therapy sessions, it may be helpful to use the emotional information from the images, together with the emotional state of the user, to improve that state.
Bibliography
[1] D Aha and D Kibler. Instance-based learning algorithms. Machine Learning, 6:37–66, 1991.
[2] O AlZoubi, RA Calvo, and RH Stevens. Classification of eeg for affect recognition: an adaptive approach. AI 2009: Advances in Artificial Intelligence Lecture Notes in Computer Science, 5866:52–
61, 2009.
[3] JC Amante. Colorido : Identificação da Cor Dominante de Fotografias. PhD thesis, 2011.
[4] JC Amante and MJ Fonseca. Fuzzy Color Space Segmentation to Identify the Same Dominant
Colors as Users. DMS, 2012.
[5] Danny Oude Bos. EEG-based Emotion Recognition: The Influence of Visual and Auditory Stimuli.
Capita Selecta Paper, 2007.
[6] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[7] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[8] Shih-Fu Chang, T Sikora, and A Purl. Overview of the MPEG-7 standard. Circuits and Systems for
Video Technology, IEEE Transactions on, 11(6):688–695, June 2001.
[9] SA Chatzichristofis and YS Boutalis. CEDD: color and edge directivity descriptor. A compact descriptor for image indexing and retrieval. Computer Vision Systems, pages 312–322, 2008.
[10] Savvas A Chatzichristofis, Y S Boutalis, and Mathias Lux. Selection of the proper compact composite descriptor for improving content based image retrieval. In B Zagar, editor, Signal Processing,
Pattern Recognition and Applications (SPPRA 2009), page 0, Calgary, Canada, February 2009.
ACTA Press.
[11] Savvas A Chatzichristofis and Yiannis S Boutalis. FCTH: Fuzzy Color and Texture Histogram - A
Low Level Feature for Accurate Image Retrieval. In Proceedings of the 2008 Ninth International
Workshop on Image Analysis for Multimedia Interactive Services, WIAMIS ’08, pages 191–196,
Washington, DC, USA, 2008. IEEE Computer Society.
[12] Chin-han Chen, MF Weng, SK Jeng, and YY Chuang. Emotion-based music visualization using
photos. Advances in Multimedia Modeling, 4903:358–368, 2008.
[13] O da Pos and Paul Green-Armytage. Facial expressions, colours and basic emotions. Journal of
the International Colour Association, 1:1–20, 2007.
[14] Elise S Dan-Glauser and Klaus R Scherer. The Geneva affective picture database (GAPED): a new
730-picture database focusing on valence and normative significance. Behavior research methods,
43(2):468–77, June 2011.
[15] Charles Darwin. The Expression of the Emotions in Man and Animals. 1872.
[16] Drago Datcu and L Rothkrantz.
[17] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. Studying Aesthetics in Photographic
Images Using a Computational Approach. In Proceedings of the 9th European Conference on
Computer Vision - Volume Part III, ECCV’06, pages 288–301, Berlin, Heidelberg, 2006. SpringerVerlag.
[18] CM de Melo and Jonathan Gratch. Evolving expression of emotions through color in virtual humans using genetic algorithms. Proceedings of the 1st International Conference on Computational
Creativity ({ICCC-X)}, 2010.
[19] Peter Dunker, Stefanie Nowak, André Begau, and Cornelia Lanz. Content-based Mood Classification for Photos and Music: A Generic Multi-modal Classification Framework and Evaluation
Approach. In Proceedings of the 1st ACM International Conference on Multimedia Information
Retrieval, MIR ’08, pages 97–104, New York, NY, USA, 2008. ACM.
[20] Paul Ekman. Basic emotions, chapter 3, pages 45–60. John Wiley & Sons Ltd, New York, 1999.
[21] Paul Ekman and Wallace Friesen. Pictures of Facial Affect. Consulting Psychologists Press, Palo
Alto, CA, 1976.
[22] Paul Ekman and Erika L. Rosenberg. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (Facs) (Series in Affective Science).
Oxford University Press, 2005.
[23] Elaine Fox. Emotion Science Cognitive and Neuroscientific Approaches to Understanding Human
Emotions, September 2008.
[24] J Friedman, T Hastie, and R Tibshirani. Additive Logistic Regression: a Statistical View of Boosting.
Technical report, Stanford University, 1998.
[25] Syntyche Gbèhounou, François Lecellier, Christine Fernandez-maloigne, and U M R Cnrs. Extraction of emotional impact in colour images. 6th European Conference on Colour in Graphics,
Imaging and Vision, 2012.
[26] Franz Graf. JFeatureLib, 2012.
[27] Greg Pass and Ramin Zabih. Comparing Images Using Joint Histograms. 1999.
[28] Mark Hall, Eibe Frank, and Geoffrey Holmes. The WEKA Data Mining Software: An Update. ACM
SIGKDD, 11(1):10–18, 2009.
[29] Onur C Hamsici and Aleix M Martı́nez. Bayes Optimality in Linear Discriminant Analysis. IEEE
Trans. Pattern Anal. Mach. Intell., 30(4):647–657, 2008.
[30] Alan Hanjalic. Extracting Moods from Pictures and Sounds. IEEE SIGNAL PROCESSING MAGAZINE, (March 2006):90–100, 2006.
[31] R Haralick, K Shanmugam, and I Dinstein. Texture Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics, 3(6), 1973.
[32] Lane Harrison, Drew Skau, and Steven Franconeri. Influencing visual judgment through affective
priming. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages
2949–2958, 2013.
[33] Trevor Hastie and Robert Tibshirani. Classification by Pairwise Coupling. In Michael I Jordan,
Michael J Kearns, and Sara A Solla, editors, Advances in Neural Information Processing Systems,
volume 10. MIT Press, 1998.
[34] Tin Kam Ho. The Random Subspace Method for Constructing Decision Forests. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
[35] D.H. Hockenbury and S.E. Hockenbury. Discovering psychology. New York: Worth Publishers,
2007.
[36] Jing Huang, S Ravi Kumar, Mandar Mitra, Wei-Jing Zhu, and Ramin Zabih. Image Indexing Using
Color Correlograms. 1997 IEEE Conference on Computer Vision and Pattern Recognition, 0:762,
1997.
[37] George H John and Pat Langley. Estimating Continuous Distributions in Bayesian Classifiers. In
Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345, San Mateo, 1995.
Morgan Kaufmann.
[38] Evi Joosten, GV Lankveld, and Pieter Spronck. Colors and emotions in video games. 11th International Conference on Entertainment Computing, 2010.
[39] Dhiraj Joshi, Ritendra Datta, Elena Fedorovskaya, Quang-tuan Luong, James Z Wang, Li Jia, and
Jiebo Luo. Aesthetics and Emotions in Images [A computational perspective ]. IEEE Signal Processing Magazine, (SEPTEMBER 2011):94–115, 2011.
[40] Takeo Kanade. Picture Processing System by Computer Complex and Recognition of Human
Faces. 1973.
[41] S S Keerthi, S K Shevade, C Bhattacharyya, and K R K Murthy. Improvements to Platt’s SMO
Algorithm for SVM Classifier Design. Neural Computation, 13(3):637–649, 2001.
[42] A Khokher and R Talwar. Content-based image retrieval: state of the art and challenges. (IJAEST)
INTERNATIONAL JOURNAL OF ADVANCED ENGINEERING SCIENCES AND TECHNOLOGIES,
9(2):207–211, 2011.
[43] Youngrae Kim, So-jung Kim, and Eun Yi Kim. EBIR: Emotion-based image retrieval. 2009 Digest
of Technical Papers International Conference on Consumer Electronics, pages 1–2, January 2009.
[44] J Kittler, M Hatef, Robert P W Duin, and J Matas. On combining classifiers. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.
[45] Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. Evaluation of Multiple Clustering Solutions.
In MultiClust@ECML/PKDD, pages 55–66, 2011.
[46] Kai Kuikkaniemi, Toni Laitinen, Marko Turpeinen, Timo Saari, Ilkka Kosunen, and Niklas Ravaja.
The influence of implicit and explicit biofeedback in first-person shooter games. CHI’10 Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, pages 859–868, 2010.
[47] Ludmila I Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley and
Sons, Inc., 2004.
[48] R D Lane, E M Reiman, G L Ahern, G E Schwartz, and R J Davidson. Neuroanatomical correlates
of happiness, sadness, and disgust. The American journal of psychiatry, 154(7):926–33, July 1997.
[49] P J Lang. The emotion probe: Studies of motivation and attention. American psychologist, 50:372,
1995.
[50] P.J. Lang, M.M. Bradley, and B.N. Cuthbert. International affective picture system (IAPS): Affective
ratings of pictures and instruction manual. NIMH Center for the Study of Emotion and Attention,
1997.
[51] Christine L. Larson, Joel Aronoff, and Elizabeth L. Steuer. Simple geometric shapes are implicitly
associated with affective value. Motivation and Emotion, 36(3):404–413, October 2011.
[52] S le Cessie and J C van Houwelingen. Ridge Estimators in Logistic Regression. Applied Statistics,
41(1):191–201, 1992.
[53] T M C Lee, H-L Liu, C C H Chan, S-Y Fang, and J-H Gao. Neural activities associated with emotion
recognition observed in men and women. Molecular psychiatry, 10(5):450–5, May 2005.
[54] Yisi Liu, Olga Sourina, and MK Nguyen. Real-time EEG-based emotion recognition and its applications. Transactions on computational science XII, 2011.
[55] David G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial
Intelligence, 31:355–395, 1987.
[56] Xin Lu, Poonam Suryanarayan, Reginald B Adams, Jia Li, Michelle G Newman, and James Z Wang.
On Shape and the Computability of Emotions. Proceedings of the ACM Multimedia Conference,
2012.
[57] Marcel P. Lucassen, Theo Gevers, and Arjan Gijsenij. Texture affects color emotion. Color Research
& Application, 36(6):426–436, December 2011.
[58] Mathias Lux and Savvas A Chatzichristofis. Lire: Lucene Image Retrieval: An Extensible Java
CBIR Library. In Proceedings of the 16th ACM International Conference on Multimedia, MM ’08,
pages 1085–1088, New York, NY, USA, 2008. ACM.
[59] Mathias Lux and Oge Marques. Visual Information Retrieval Using Java and LIRE. Synthesis
Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers, 2013.
[60] Jana Machajdik and Allan Hanbury. Affective image classification using features inspired by psychology and art theory. Proceedings of the international conference on Multimedia - MM ’10,
page 83, 2010.
[61] D Marr. Early processing of visual information. Philosophical Transactions of the Royal Society of
London, B275:483–524, 1976.
[62] Christian Martin, Uwe Werner, and HM Gross. A real-time facial expression recognition system based on active appearance models using gray images and edge images. IEEE, 216487(216487):1–6, 2008.
[63] Aleix Martinez and Shichuan Du. A Model of the Perception of Facial Expressions of Emotion by
Humans: Research Overview and Perspectives. Journal of Machine Learning Research : JMLR,
13:1589–1608, May 2012.
[64] S Marčelja. Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am.,
70
[65] Celso De Melo and Ana Paiva. Expression of emotions in virtual humans using lights, shadows,
composition and filters. Affective Computing and Intelligent Interaction, pages 549–560, 2007.
[66] Joseph a Mikels, Barbara L Fredrickson, Gregory R Larkin, Casey M Lindberg, Sam J Maglio,
and Patricia a Reuter-Lorenz. Emotional category data on images from the International Affective
Picture System. Behavior research methods, 37(4):626–30, November 2005.
[67] Katarzyna Agnieszka Olkiewicz and Urszula Markowska-kaczmar. Emotion-based image retrieval
- An artificial neural network approach. Proceedings of the International Multiconference on Computer Science and Information Technology, pages 89–96, 2010.
[68] Paul Viola and Michael Jones. Robust Real-time Object Detection. International Journal of Computer Vision, 2001.
[69] W.R. Picard. Affective Computing, 1995.
[70] J Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In
B Schoelkopf, C Burges, and A Smola, editors, Advances in Kernel Methods - Support Vector
Learning. MIT Press, 1998.
[71] R. Plutchik. The nature of Emotions. Am. Sci., 89(4):344–350, 2001.
[72] Jonathan Posner, James a Russell, and Bradley S Peterson. The circumplex model of affect:
an integrative approach to affective neuroscience, cognitive development, and psychopathology.
Development and psychopathology, 17(3):715–34, January 2005.
[73] Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo,
CA, 1993.
[74] Thomas Rorissa, Abebe; Clough, Paul; Deselaers. Exploring the Relationship Between Feature.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY,
59(5):770–784, 2008.
[75] J A Russell. A circumplex model of affect. Journal of personality and social psychology, 39(6):1161–
1178, 1980.
[76] Stefanie Schmidt and WG Stock. Collective indexing of emotions in images. A study in emotional
information retrieval. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE
AND TECHNOLOGY, 60(February):863–876, 2009.
[77] SG Shaila and A Vadivel. Content-Based Image Retrieval Using Modified Human Colour Perception
Histogram. ITCS, SIP, JSE-2012, CS & IT, pages 229–237, 2012.
[78] DS Shete and MS Chavan. Content Based Image Retrieval: Review. International Journal of
Emerging Technology and Advanced Enginnering, 2(9):85–90, 2012.
[79] A. Smith. A new set of norms. Behavior Research Methods, Instruments, and Computers, (3x(x),
xxx-xxx), 2004.
[80] A. Smith. Smith2004norms.txt. Retrieved October 2, 2004 from the Psychonomic Society Web Archive: http://www.psychonomic.org/ARCHIEVE/, 2004.
[81] Martin Solli. Color Emotions in Large Scale Content Based Image Indexing. PhD thesis, 2011.
[82] H Tamura, S Mori, and T Yamawaki. Texture features corresponding to visual perception. IEEE
Transactions on Systems, Man and Cybernetics, 8(6), 1978.
[83] M Tkalčič, A Kosir, and J Tasic. Affective recommender systems: the role of emotions in recommender systems. Decisions@RecSys, 2011.
[84] Marko Tkalčič, Urban Burnik, and Andrej Košir. Using affective parameters in a content-based
recommender system for images. User Modeling and User-Adapted Interaction, 20(4):279–311,
September 2010.
[85] Koen E a van de Sande, Theo Gevers, and Cees G M Snoek. Evaluating color descriptors for
object and scene recognition. IEEE transactions on pattern analysis and machine intelligence,
32(9):1582–96, September 2010.
[86] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features.
In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE
Computer Society Conference on, volume 1, pages I—-511. IEEE, 2001.
[87] WN Wang and YL Yu. Image emotional semantic query based on color semantic description. Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, (August):18–21, 2005.
[88] X Wang, Jia Jia, Yongxin Wang, and Lianhong Cai. Modeling the Relationship Between Texture Semantics and Textile Images. Research Journal of Applied Sciences, Engineering and Technology,
3(9):977–985, 2011.
[89] HW Yoo. Visual-based emotional descriptor and feedback mechanism for image retrieval. Journal
of information science and engineering, 1227:1205–1227, 2006.
[90] L. A. Zadeh. Fuzzy Sets*. Information and Control, 8:338–353, 1965.
Appendix A
%          NB      RF      Log     J48     SMO     Ibk     Bag     RSS     LB
ACC        56.78   56.78   50.00   57.20   55.93   56.36   60.17   58.90   58.47
CH         61.44   56.36   61.44   56.36   65.25   58.05   59.75   60.59   61.86
CM         53.81   52.12   56.78   48.73   52.97   47.03   51.27   52.54   55.51
NDC        50.00   51.27   50.85   52.12   52.12   51.69   47.03   51.69   52.54
OH         58.05   56.78   53.81   53.39   58.90   49.58   41.27   52.97   54.66
PCFH       65.68   57.63   61.86   55.93   63.98   55.80   56.78   60.59   57.63
PCFHS      66.53   54.66   50.85   56.36   58.90   54.66   60.59   58.90   57.20
RCS        57.20   58.05   61.02   57.63   58.05   54.24   62.29   60.17   61.02
RT         56.78   61.02   56.78   60.17   61.86   61.44   59.32   58.90   61.02
EH         46.18   52.54   47.03   55.51   44.92   49.58   48.73   47.46   46.19
Gabor      47.88   45.76   36.86   50.00   50.85   45.34   43.22   43.64   43.22
Haralick   49.58   53.81   56.78   48.31   55.51   46.19   50.42   54.24   50.42
Tamura     51.69   50.00   55.93   47.03   52.97   52.54   47.88   46.19   47.88
CEDD       62.71   61.44   55.08   55.08   58.90   54.66   62.29   60.59   58.05
FCTH       51.69   53.81   51.27   52.12   55.93   52.97   55.93   55.08   58.05
JCD        62.29   59.75   52.54   59.32   58.90   53.81   58.47   59.32   59.32
Average    56.14   55.11   53.68   54.08   56.62   52.75   54.09   55.11   55.19
Table A1: Simple and Meta classifiers results for each feature
%
V1
55.93
65.68
52.54
52.97
57.63
64.41
59.75
61.02
64.41
44.07
44.07
57.20
56.36
60.59
60.17
58.90
57.23
ACC
CH
CM
NDC
OH
PCFH
PCFHS
RCS
RT
EH
Gabor
Haralick
Tamura
CEDD
FCTH
JCD
Average
V2
56.78
63.56
51.27
52.54
58.05
61.86
63.14
60.17
64.83
44.92
44.07
56.78
53.81
62.71
59.75
63.56
57.36
V3
59.32
63.98
51.69
51.69
59.75
64.83
62.71
60.17
63.98
44.92
45.76
55.93
52.54
61.02
62.71
62.29
57.71
V4
56.78
63.14
52.97
52.12
59.75
64.83
62.29
59.75
61.86
44.92
48.73
56.36
53.81
61.86
61.44
63.56
57.76
V5
53.39
63.56
52.97
52.12
58.47
65.68
60.17
60.59
61.44
44.92
46.19
54.66
52.97
58.47
59.75
59.75
56.57
V6
56.36
64.83
52.97
52.12
58.90
63.98
58.90
58.05
61.86
44.92
51.27
55.51
52.97
58.90
56.78
58.90
56.70
V4
56.78
63.14
52.97
V5
53.39
63.56
52.97
V6
56.36
64.83
52.97
Table A2: Vote classifiers results for each feature
%
V1
55.93
65.68
52.54
ACC
CH
CM
V2
56.78
63.56
51.27
V3
59.32
63.98
51.69
52.97
57.63
64.41
59.75
61.02
58.74
NDC
OH
PFCH
PFCHS
RCS
Average
52.54
58.05
61.86
63.14
60.17
58.42
51.69
59.75
64.83
62.71
60.17
59.27
52.12
59.75
64.83
62.29
59.75
58.95
52.12
58.47
65.68
60.17
60.59
58.37
52.12
58.90
63.98
58.90
58.05
58.26
V4
58.90
55.08
56.78
57.20
54.24
56.36
56.36
61.86
63.56
63.98
63.98
64.83
64.83
53.81
59.32
62.71
63.56
58.90
59.32
65.68
62.71
59.75
62.71
64.83
62.29
62.29
65.68
66.95
61.02
V5
54.66
52.97
53.81
53.81
55.51
51.69
53.39
62.71
63.98
65.25
63.56
60.59
66.10
52.54
60.17
62.29
61.02
59.32
58.47
66.10
60.59
59.32
62.71
64.83
61.44
59.32
63.98
63.14
59.76
V6
58.05
56.36
56.36
57.63
57.20
56.78
57.20
64.41
65.25
66.52
65.25
58.90
67.80
52.54
59.32
61.86
58.90
56.78
58.90
64.41
59.75
58.47
61.44
61.44
61.02
60.59
65.68
61.44
60.37
V4
59.32
59.32
56.36
57.20
56.36
60.17
55.08
54.66
55.51
56.36
55.93
57.63
54.66
56.36
57.20
55.93
V5
54.24
54.24
55.08
53.39
54.66
53.39
50.85
51.27
54.24
52.12
52.12
54.24
55.08
52.54
56.78
58.47
V6
57.63
58.05
57.20
57.63
58.90
58.05
56.78
57.20
57.63
58.05
57.20
57.63
57.63
57.20
57.63
58.05
Table A3: Results for Color using one feature
%
V1
56.36
55.51
54.66
56.36
55.93
53.39
55.08
65.68
66.95
66.95
64.83
63.98
65.25
48.73
58.05
62.29
61.02
61.02
57.20
64.83
61.44
63.14
62.29
67.37
62.29
61.02
63.98
64.41
60.71
ACC+CH
ACC+CM
ACC+NDC
ACC+OH
ACC+PFCH
ACC+PFCHS
ACC+RCS
CH+CM
CH+NDC
CH+OH
CH+PFCH
CH+PFCHS
CH+RCS
CM+NDC
CM+OH
CM+PFCH
CM+PFCHS
CM+RCS
NDC+OH
NDC+PFCH
NDC+PFCHS
NDC+RCS
OH+PFCH
OH+PFCHS
OH+RCS
PFCH+PFCHS
PFCH+RCS
PFCHS+RCS
Average
V2
59.32
55.93
58.05
61.02
57.20
56.36
59.32
65.25
64.83
64.83
63.98
64.41
68.22
51.27
58.47
60.17
63.14
61.02
58.32
63.56
62.29
60.59
62.29
62.72
60.59
63.14
63.56
64.83
61.24
V3
56.36
57.78
58.47
57.20
56.78
56.78
58.05
63.56
62.29
64.41
64.41
63.14
64.83
49.58
58.47
61.86
63.14
60.59
59.75
65.25
63.56
60.59
62.71
64.83
61.86
63.14
65.25
65.68
61.08
Table A4: Results for combination of two Color features
%
V1
57.20
56.36
54.66
56.36
57.63
57.20
54.66
52.54
53.81
54.66
55.08
56.78
56.78
52.96
58.90
59.32
ACC+CH+CM
ACC+CH+NDC
ACC+CH+OH
ACC+CH+PFCH
ACC+CH+PFCHS
ACC+CH+RCS
ACC+CM+NDC
ACC+CM+OH
ACC+CM+PFCH
ACC+CM+PFCHS
ACC+CM+RCS
ACC+NDC+OH
ACC+NDC+PFCH
ACC+NDC+PFCHS
ACC+NDC+RCS
ACC+OH+PFCH
V2
59.75
58.47
57.20
57.63
58.05
59.32
55.93
56.36
56.78
58.05
58.48
56.78
57.20
56.36
60.17
57.63
V3
58.47
58.47
55.51
55.93
56.36
60.59
57.63
55.51
56.78
57.20
56.78
57.78
55.93
56.78
58.90
58.05
53.39
57.20
57.20
55.51
54.66
64.41
66.10
64.83
63.56
63.98
66.52
64.41
63.98
66.95
65.68
64.83
69.50
61.86
66.52
64.83
58.90
63.56
61.86
61.44
61.02
63.56
60.59
61.86
62.29
62.71
62.29
64.83
60.59
61.44
63.14
63.98
63.98
61.44
64.83
61.86
60.66
ACC+OH+PFCHS
ACC+OH+RCS
ACC+PFCH+PFCHS
ACC+PFCH+RCS
ACC+PFCHS+RCS
CH+CM+NDC
CH+CM+OH
CH+CM+PFCH
CH+CM+PFCHS
CH+CM+RCS
CH+NDC+OH
CH+NDC+PFCH
CH+NDC+PFCHS
CH+NDC+RCS
CH+OH+PFCH
CH+OH+PFCHS
CH+OH+RCS
CH+PFCH+PFCHS
CH+PFCH+RCS
CH+PFCHS+RCS
CM+NDC+OH
CM+NDC+PFCH
CM+NDC+PFCHS
CM+NDC+RCS
CM+OH+PFCH
CM+OH+PFCHS
CM+OH+RCS
CM+PFCH+PFCHS
CM+PFCH+RCS
CM+PFCHS+RCS
NDC+OH+PFCH
NDC+OH+PFCHS
NDC+OH+RCS
NDC+PFCH+PFCHS
NDC+PFCH+RCS
NDC+PFCHS+RCS
OH+PFCH+PFCHS
OH+PFCH+RCS
OH+PFCHS+RCS
PFCH+PFCHS+RCS
Average
58.90
60.17
58.90
60.59
58.90
65.25
66.52
63.98
63.98
67.80
65.25
63.14
63.56
67.37
64.41
65.68
68.64
63.98
66.52
64.83
58.90
63.14
60.17
60.59
61.44
63.14
60.59
63.14
62.29
61.44
61.02
64.83
58.48
64.41
62.29
65.68
63.56
63.14
67.37
66.10
61.68
57.63
58.90
57.20
59.32
57.20
63.56
62.29
63.14
63.98
63.56
62.71
63.14
63.98
65.69
61.86
64.83
65.25
62.29
66.52
65.25
59.75
63.98
63.56
59.32
60.59
63.98
61.86
63.98
63.98
61.44
62.29
64.83
61.86
61.86
64.41
64.83
62.29
65.25
66.95
66.52
61.22
56.36
57.20
57.63
58.48
57.20
62.72
62.70
64.41
64.83
65.25
63.56
64.41
65.68
65.68
63.14
66.10
66.53
63.98
66.52
64.41
59.32
63.56
63.56
59.32
60.59
63.56
62.29
62.71
63.56
63.98
63.14
64.41
61.85
61.86
64.83
67.37
62.29
64.83
66.10
66.52
61.26
53.39
56.36
53.39
53.39
53.82
63.56
63.25
63.14
61.02
65.25
63.68
63.98
62.29
66.10
63.56
63.14
69.07
58.05
64.83
61.86
60.17
63.14
61.02
59.32
61.86
62.71
60.59
60.59
62.71
60.59
62.71
63.98
61.02
60.59
65.25
62.29
62.71
62.29
64.83
60.17
59.36
55.93
57.63
56.36
56.78
57.63
65.25
64.83
64.41
58.90
66.95
66.25
66.10
58.90
67.80
64.41
59.32
67.80
61.44
65.68
61.44
59.32
63.14
58.90
55.51
61.02
58.90
60.17
60.17
62.71
58.05
61.86
61.02
60.59
60.59
64.83
62.72
58.05
63.98
60.59
60.59
60.34
V4
58.90
57.20
58.90
56.36
59.75
55.08
55.51
56.78
56.78
55.08
57.63
57.20
56.78
59.32
58.47
V5
53.81
55.08
52.97
55.93
53.81
50.85
53.81
50.85
54.66
57.20
54.24
57.20
53.81
55.93
54.66
V6
57.63
58.48
57.20
59.75
57.20
56.78
57.63
58.05
58.05
58.05
57.63
58.05
56.36
57.20
57.63
Table A5: Results for combination of three Color features
%
ACC+CH+CM+NDC
ACC+CH+CM+OH
ACC+CH+CM+PFCH
ACC+CH+CM+PFCHS
ACC+CH+CM+RCS
ACC+CM+NDC+OH
ACC+CM+NDC+PFCH
ACC+CM+NDC+PFCHS
ACC+CM+NDC+RCS
ACC+NDC+OH+PFCH
ACC+NDC+OH+PFCHS
ACC+NDC+OH+RCS
ACC+OH+PFCH+PFCHS
ACC+OH+PFCH+RCS
ACC+PFCH+PFCHS+RCS
V1
56.36
56.78
55.51
58.05
56.78
54.66
56.36
56.78
56.36
58.05
56.78
55.93
55.93
55.93
59.32
V2
58.05
59.32
58.05
57.63
59.32
55.93
58.05
56.78
60.17
58.05
56.78
59.75
58.90
60.17
59.32
V3
59.75
59.32
59.32
59.75
59.32
57.63
56.78
54.24
60.17
57.20
56.78
58.47
57.63
59.32
57.63
66.10
63.98
63.98
66.10
62.29
63.56
61.44
62.29
63.98
66.10
61.02
65.25
59.84
CH+CM+NDC+OH
CH+CM+NDC+PFCH
CH+CM+NDC+PFCHS
CH+CM+NDC+RCS
CM+NDC+OH+PFCH
CM+NDC+OH+PFCHS
CM+NDC+OH+RCS
NDC+OH+PFCH+PFCHS
NDC+OH+PFCH+RCS
NDC+OH+PCFHS+RCS
NDC+PFCH+PFCHS+RCS
OH+PFCH+PFCHS+RCS
Average
64.41
64.83
63.14
68.64
61.44
62.71
60.59
63.98
63.14
65.78
63.14
61.02
60.71
60.59
65.25
62.29
65.25
62.71
62.71
62.71
61.44
63.98
65.68
65.25
63.98
60.56
62.71
64.83
65.68
66.10
62.29
63.56
61.86
62.29
64.83
66.10
65.68
63.56
60.34
65.25
63.56
61.44
65.68
63.56
63.56
60.59
61.44
62.71
65.68
58.90
62.29
58.13
64.83
64.41
57.20
67.80
62.29
59.75
59.75
58.05
62.71
60.59
60.17
60.17
59.39
V4
56.36
59.32
56.26
59.75
56.78
55.93
56.78
56.35
59.32
58.05
64.41
66.95
63.98
62.71
63.14
63.98
60.00
V5
55.93
53.81
55.51
55.08
56.36
54.66
54.66
54.24
57.63
54.66
63.56
62.71
64.41
59.75
62.28
63.14
58.02
V6
58.48
58.05
59.76
57.63
58.05
57.63
58.05
57.63
57.20
57.20
64.41
58.90
65.25
58.47
61.44
60.17
59.27
V4
58.05
56.78
58.90
57.20
59.74
59.75
56.78
60.17
60.17
55.93
59.32
59.32
58.05
67.38
64.83
65.25
62.29
59.99
V5
56.36
55.51
55.93
54.66
55.93
55.51
56.34
57.63
55.93
53.81
57.35
54.66
53.81
61.86
63.14
63.14
61.44
57.24
V6
58.47
58.90
58.05
59.32
57.20
58.90
58.90
57.20
58.05
58.05
58.05
58.90
57.20
61.02
64.41
61.86
59.75
59.07
Table A6: Results for combination of four Color features
%
ACC+CH+CM+NDC+OH
ACC+CH+CM+NDC+PFCH
ACC+CH+CM+NDC+PFCHS
ACC+CH+CM+NDC+RCS
ACC+CM+NDC+OH+PFCH
ACC+CM+NDC+OH+PFCHS
ACC+CM+NDC+OH+RCS
ACC+NDC+OH+PFCH+PFCHS
ACC+NDC+OH+PFCH+RCS
ACC+OH+PFCH+PFCHS+RCS
CH+CM+NDC+OH+PFCH
CH+CM+NDC+OH+PFCHS
CH+CM+NDC+OH+RCS
CM+NDC+OH+PFCH+PFCHS
CM+NDC+OH+PFCH+RCS
NDC+OH+PFCH+PFCHS+RCS
Average
V1
57.63
57.20
58.90
58.90
57.63
56.78
56.36
55.51
58.47
57.20
65.25
65.25
65.25
61.86
62.71
65.68
60.04
V2
58.90
59.75
58.05
61.44
58.05
58.05
60.17
57.63
59.32
58.90
62.71
64.41
67.80
61.44
61.44
63.14
60.70
V3
57.63
59.75
59.75
60.59
57.2
58.9
58.47
58.47
59.75
59.75
63.56
64.83
62.71
61.02
62.71
64.41
60.59
Table A7: Results for combination of five Color features
%
ACC+CH+CM+NDC+OH+PFCH
ACC+CH+CM+NDC+OH+PFCHS
ACC+CH+CM+NDC+OH+RCS
ACC+CH+CM+NDC+PFCH+PFCHS
ACC+CH+CM+NDC+PFCH+RCS
ACC+CH+CM+NDC+PFCHS+RCS
ACC+CH+CM+OH+PFCH+PFCHS
ACC+CH+CM+OH+PFCH+RCS
ACC+CH+CM+OH+PFCHS+RCS
ACC+CM+NDC+OH+PFCH+PFCHS
ACC+CM+NDC+OH+PFCH+RCS
ACC+CM+NDC+OH+PFCHS+RCS
ACC+NDC+OH+PFCH+PFCHS+RCS
CH+CM+NDC+OH+PFCH+PFCHS
CH+CM+NDC+OH+PFCH+RCS
CH+CM+NDC+OH+PFCHS+RCS
CM+NDC+OH+PFCH+PFCHS+RCS
Average
V1
56.78
57.63
58.47
59.75
58.05
58.90
61.44
58.47
59.32
56.78
57.20
57.63
57.20
63.56
65.68
63.98
63.56
59.67
V2
61.44
60.59
58.47
60.17
60.59
59.32
59.75
58.90
60.59
58.90
58.90
60.59
59.75
63.14
67.80
64.83
64.41
61.07
V3
59.75
58.90
58.47
60.59
60.17
60.17
57.20
59.75
59.75
58.47
58.05
60.59
58.47
66.10
65.25
64.83
63.56
60.59
Table A8: Results for combination of six Color features
%
ACC+CH+CM+NDC+OH+PFCH+PFCHS
ACC+CH+CM+NDC+OH+PFCH+RCS
ACC+CH+CM+NDC+OH+PFCHS+RCS
ACC+CH+CM+NDC+PFCH+PFCHS+RCS
ACC+CH+CM+OH+PFCH+PFCHS+RCS
ACC+CH+NDC+OH+PFCH+PFCHS+RCS
ACC+CM+NDC+OH+PFCH+PFCHS+RCS
CH+CM+NDC+OH+PFCH+PFCHS+RCS
Average
V1
59.75
58.47
58.48
58.90
62.29
59.32
57.63
65.68
60.07
V2
60.17
61.86
61.86
60.59
61.44
61.44
60.59
63.98
61.49
V3
60.59
58.90
62.71
61.02
59.32
60.17
58.9
66.95
61.07
V4
57.20
60.17
59.75
59.32
58.90
58.90
58.90
65.68
59.85
V5
54.66
56.36
56.36
55.08
58.05
55.51
54.66
62.71
56.67
V6
59.32
57.63
58.90
60.17
60.59
58.90
59.32
63.14
59.75
V4
58.90
V5
56.36
V6
60.59
V4
63.14
59.75
64.83
62.29
59.75
61.86
63.56
63.98
63.98
64.83
64.83
62.71
63.56
65.68
62.71
62.71
64.83
62.29
62.29
65.68
66.95
62.72
62.70
64.41
64.83
65.25
63.56
64.41
65.68
65.68
63.14
66.1
66.53
63.98
66.52
64.41
63.56
63.56
63.56
62.71
V5
63.56
58.47
65.68
60.17
60.59
62.71
63.98
65.25
63.56
60.59
66.10
62.29
61.02
66.10
60.59
62.71
64.83
61.44
59.32
63.98
63.14
63.56
63.25
63.14
61.02
65.25
63.68
63.98
62.29
66.1
63.56
63.14
69.07
58.05
64.83
61.86
63.14
61.02
62.71
60.59
V6
64.83
58.90
63.98
58.90
58.05
64.41
65.25
66.52
65.25
58.90
67.80
61.86
58.90
64.41
59.75
61.44
61.44
61.02
60.59
65.68
61.44
65.25
64.83
64.41
58.90
66.95
66.25
66.10
58.90
67.8
64.41
59.32
67.80
61.44
65.68
61.44
63.14
58.90
58.90
60.17
Table A9: Results for combination of seven Color features
%
V1
60.59
ALL
V2
59.75
V3
59.32
Table A10: Results for combination of all Color features
%
V1
65.68
57.63
64.41
59.75
61.02
65.68
66.95
66.95
64.83
63.98
65.25
62.29
61.02
64.83
61.44
62.29
67.37
62.29
61.02
63.98
64.41
64.41
66.10
64.83
63.56
63.98
66.52
64.41
63.98
66.95
65.68
64.83
69.50
61.86
66.52
64.83
63.56
61.86
63.56
61.86
CH
OH
PFCH
PFCHS
RCS
CH+CM
CH+NDC
CH+OH
CH+PFCH
CH+PFCHS
CH+RCS
CM+PFCH
CM+PFCHS
NDC+PFCH
NDC+PFCHS
OH+PFCH
OH+PFCHS
OH+RCS
PFCH+PFCHS
PFCH+RCS
PFCHS+RCS
CH+CM+NDC
CH+CM+OH
CH+CM+PFCH
CH+CM+PFCHS
CH+CM+RCS
CH+NDC+OH
CH+NDC+PFCH
CH+NDC+PFCHS
CH+NDC+RCS
CH+OH+PFCH
CH+OH+PFCHS
CH+OH+RCS
CH+PFCH+PFCHS
CH+PFCH+RCS
CH+PFCHS+RCS
CM+NDC+PFCH
CM+NDC+PFCHS
CM+OH+PFCHS
CM+PFCH+PFCHS
V2
63.56
58.05
61.86
63.14
60.17
65.25
64.83
64.83
63.98
64.41
68.22
60.17
63.14
63.56
62.29
62.29
62.72
60.59
63.14
63.56
64.83
65.25
66.52
63.98
63.98
67.80
65.25
63.14
63.56
67.37
64.41
65.68
68.64
63.98
66.52
64.83
63.14
60.17
63.14
63.14
V3
63.98
59.75
64.83
62.71
60.17
63.56
62.29
64.41
64.41
63.14
64.83
61.86
63.14
65.25
63.56
62.71
64.83
61.86
63.14
65.25
65.68
63.56
62.29
63.14
63.98
63.56
62.71
63.14
63.98
65.69
61.86
64.83
65.25
62.29
66.52
65.25
63.98
63.56
63.98
63.98
62.29
62.71
62.29
64.83
60.59
61.44
63.14
63.98
63.98
61.44
64.83
61.86
66.10
63.98
63.98
66.10
62.29
63.56
61.44
62.29
63.98
65.25
65.25
65.25
65.25
61.86
62.71
65.68
63.56
65.68
63.98
63.56
65.68
60.59
63.83
CM+PFCH+RCS
CM+PFCHS+RCS
NDC+OH+PFCH
NDC+OH+PFCHS
NDC+OH+RCS
NDC+PFCH+PFCHS
NDC+PFCH+RCS
NDC+PFCHS+RCS
OH+PFCH+PFCHS
OH+PFCH+RCS
OH+PFCHS+RCS
PFCH+PFCHS+RCS
CH+CM+NDC+OH
CH+CM+NDC+PFCH
CH+CM+NDC+PFCHS
CH+CM+NDC+RCS
CM+NDC+OH+PFCH
CM+NDC+OH+PFCHS
CM+NDC+OH+RCS
NDC+OH+PFCH+PFCHS
NDC+OH+PFCH+RCS
OH+PFCH+PFCHS+RCS
CH+CM+NDC+OH+PFCH
CH+CM+NDC+OH+PFCHS
CH+CM+NDC+OH+RCS
CM+NDC+OH+PFCH+PFCHS
CM+NDC+OH+PFCH+RCS
NDC+OH+PFCH+PFCHS+RCS
CH+CM+NDC+OH+PFCH+PFCHS
CH+CM+NDC+OH+PFCH+RCS
CH+CM+NDC+OH+PFCHS+RCS
CM+NDC+OH+PFCH+PFCHS+RCS
CH+CM+NDC+OH+PFCH+PFCHS+RCS
ALL
Average
62.29
61.44
61.02
64.83
58.48
64.41
62.29
65.68
63.56
63.14
67.37
66.10
64.41
64.83
63.14
68.64
61.44
62.71
60.59
63.98
63.14
61.02
62.71
64.41
67.8
61.44
61.44
63.14
63.14
67.8
64.83
64.41
63.98
59.75
63.71
63.98
61.44
62.29
64.83
61.86
61.86
64.41
64.83
62.29
65.25
66.95
66.52
60.59
65.25
62.29
65.25
62.71
62.71
62.71
61.44
63.98
63.98
63.56
64.83
62.71
61.02
62.71
64.41
66.1
65.25
64.83
63.56
66.95
59.32
63.62
63.56
63.98
63.14
64.41
61.85
61.86
64.83
67.37
62.29
64.83
66.10
66.52
62.71
64.83
65.68
66.10
62.29
63.56
61.86
62.29
64.83
63.56
64.41
66.95
63.98
62.71
63.14
63.98
67.38
64.83
65.25
62.29
65.68
58.90
63.97
62.71
60.59
62.71
63.98
61.02
60.59
65.25
62.29
62.71
62.29
64.83
60.17
65.25
63.56
61.44
65.68
63.56
63.56
60.59
61.44
62.71
62.29
63.56
62.71
64.41
59.75
62.28
63.14
61.86
63.14
63.14
61.44
62.71
56.36
62.70
Table A11: List of candidate features for Color
%          V1      V2      V3      V4      V5      V6
RT         64.41   64.83   63.98   61.86   61.44   61.86
Table A12: Results for Composition feature
%          V1      V2      V3      V4      V5      V6
EH         44.07   44.92   44.92   44.49   44.92   44.92
Table A13: Results for combination of Shape features
%          V1      V2      V3      V4      V5      V6
G          44.07   44.07   45.76   48.73   46.19   51.27
H          57.20   56.78   55.93   56.36   54.66   55.51
T          56.36   53.81   52.54   53.81   52.97   52.97
G+H        50.85   48.31   52.54   51.27   52.12   53.81
G+T        48.73   47.88   49.58   52.12   51.69   52.12
H+T        55.93   55.93   55.93   56.36   57.20   57.20
G+H+T      56.36   55.08   55.08   54.24   56.36   58.05
Average    52.79   51.69   52.48   53.27   53.03   54.42
Table A14: Results for combination of Texture features
62.71
58.05
61.86
61.02
60.59
60.59
64.83
62.72
58.05
63.98
60.59
60.59
64.83
64.41
57.20
67.80
62.29
59.75
59.75
58.05
62.71
60.17
64.41
58.9
65.25
58.47
61.44
60.17
61.02
64.41
61.86
59.75
63.14
60.59
62.19
%                 V1      V2      V3      V4      V5      V6
CEDD              60.59   62.71   61.02   61.86   58.47   58.90
FCTH              60.17   59.75   62.71   61.44   59.75   56.78
JCD               58.90   63.56   62.29   63.56   59.75   58.90
CEDD+FCTH         59.32   60.59   62.29   61.44   57.20   55.08
CEDD+JCD          58.05   63.56   61.02   60.59   56.78   55.08
FCTH+JCD          61.44   62.71   62.29   61.02   58.90   57.20
CEDD+FCTH+JCD     62.29   61.44   63.56   61.02   58.05   55.08
Average           60.11   62.05   62.17   61.56   58.41   56.72
Table A15: Results for combination of Joint features
%                                  V2      V4
CH+RCS+RT                          65.21   64.41
CH+NDC+RCS+RT                      65.25   65.25
CH+OH+RCS+RT                       67.37   64.84
CH+PFCH+RCS+RT                     67.80   63.14
OH+PFCHS+RCS+RT                    64.41   67.38
CH+CM+NDC+RCS+RT                   66.95   65.25
CH+CM+NDC+OH+PFCH+RCS+RT           64.41   63.56
Average                            65.92   64.83
Table A16: Results for combination of Color and Composition features
%                                  V2      V4
CH+RCS+EH                          66.10   64.41
CH+NDC+RCS+EH                      64.41   63.56
CH+OH+RCS+EH                       63.56   63.56
CH+PFCH+RCS+EH                     60.17   60.17
OH+PFCHS+RCS+EH                    61.44   62.71
CH+CM+NDC+RCS+EH                   61.86   62.29
CH+CM+NDC+OH+PFCH+RCS+EH           62.72   59.75
Average                            62.89   62.35
Table A17: Results for combination of Color and Shape features
%                                  V2      V4
CH+RCS+H                           67.80   64.83
CH+NDC+RCS+H                       66.95   64.40
CH+OH+RCS+H                        66.10   63.14
CH+PFCH+RCS+H                      68.22   66.10
OH+PFCHS+RCS+H                     63.56   66.56
CH+CM+NDC+RCS+H                    68.22   64.41
CH+CM+NDC+OH+PFCH+RCS+H            64.83   62.29
CH+RCS+H+T                         67.80   63.56
CH+NDC+RCS+H+T                     67.80   63.56
CH+OH+RCS+H+T                      63.98   61.86
CH+PFCH+RCS+H+T                    66.95   65.68
OH+PFCHS+RCS+H+T                   62.29   62.71
CH+CM+NDC+RCS+H+T                  67.37   63.14
CH+CM+NDC+OH+PFCH+RCS+H+T          63.98   66.10
Average                            66.13   64.17
Table A18: Results for combination of Color and Texture features
%                                      V2      V4
CH+RCS+CEDD                            66.95   65.23
CH+NDC+RCS+CEDD                        67.80   65.25
CH+OH+RCS+CEDD                         68.64   64.83
CH+PFCH+RCS+CEDD                       65.68   63.98
OH+PFCHS+RCS+CEDD                      63.98   61.02
CH+CM+NDC+RCS+CEDD                     67.37   63.98
CH+CM+NDC+OH+PFCH+RCS+CEDD             66.10   62.71
CH+RCS+JCD                             65.25   65.25
CH+NDC+RCS+JCD                         65.25   64.83
CH+OH+RCS+JCD                          65.25   63.56
CH+PFCH+RCS+JCD                        62.29   63.56
OH+PFCHS+RCS+JCD                       64.41   61.12
CH+CM+NDC+RCS+JCD                      66.95   64.83
CH+CM+NDC+OH+PFCH+RCS+JCD              63.56   59.75
CH+RCS+FCTH+JCD                        66.53   63.14
CH+NDC+RCS+FCTH+JCD                    65.23   64.83
CH+OH+RCS+FCTH+JCD                     65.25   63.56
CH+PFCH+RCS+FCTH+JCD                   62.29   63.56
OH+PFCHS+RCS+FCTH+JCD                  64.41   61.02
CH+CM+NDC+RCS+FCTH+JCD                 66.95   64.83
CH+CM+NDC+OH+PFCH+RCS+FCTH+JCD         64.41   64.41
Average                                65.45   63.58
Table A19: Results for combination of Color and Joint features
%          V2      V4
RT+EH      47.88   52.12
Table A20: Results for combination of Composition and Shape features
%          V2      V4
RT+H       55.51   57.63
RT+H+T     55.93   58.90
Average    55.72   58.27
Table A21: Results for combination of Composition and Texture features
%              V2      V4
RT+CEDD        61.44   61.44
RT+JCD         62.29   60.59
RT+FCTH+JCD    62.71   59.32
Average        62.15   60.45
Table A22: Results for combination of Composition and Joint features
%              V2      V4
EH + H         50.85   51.65
EH + H + T     50.42   49.15
Average        50.64   50.40
Table A23: Results for combination of Shape and Texture features
%                  V2      V4
EH + CEDD          56.78   52.97
EH + JCD           56.36   59.75
EH + FCTH + JCD    57.63   57.63
Average            56.92   56.78
Table A24: Results for combination of Shape and Joint features
%                V2      V4
H+CEDD           61.02   58.57
H+JCD            59.32   56.78
H+FCTH+JCD       58.47   60.17
H+T+CEDD         60.59   59.32
H+T+JCD          61.44   61.02
H+T+FCTH+JCD     61.44   61.02
Average          60.38   59.48
Table A25: Results for combination of Texture and Joint features
%                                    V2      V4
CH+RCS+RT+EH                         62.29   61.02
CH+NDC+RCS+RT+EH                     63.98   61.02
CH+OH+RCS+RT+EH                      61.44   62.71
CH+PFCH+RCS+RT+EH                    63.98   61.02
OH+PFCHS+RCS+RT+EH                   62.29   63.98
CH+CM+NDC+RCS+RT+EH                  62.29   61.86
CH+CM+NDC+OH+PFCH+RCS+RT+EH          61.86   59.32
Average                              62.59   61.56
Table A26: Results for combination of Color, Composition and Shape features
%                                    V2      V4
CH+RCS+RT+H                          66.92   63.98
CH+NDC+RCS+RT+H                      66.10   65.25
CH+OH+RCS+RT+H                       64.41   63.98
CH+PFCH+RCS+RT+H                     66.53   64.41
OH+PFCHS+RCS+RT+H                    65.25   68.22
CH+CM+NDC+RCS+RT+H                   67.37   64.41
CH+CM+NDC+OH+PFCH+RCS+RT+H           64.83   63.14
CH+RCS+RT+H+T                        66.95   63.98
CH+NDC+RCS+RT+H+T                    66.52   64.41
CH+OH+RCS+RT+H+T                     64.41   62.71
CH+PFCH+RCS+RT+H+T                   66.95   63.98
OH+PFCHS+RCS+RT+H+T                  66.95   66.56
CH+CM+NDC+RCS+RT+H+T                 66.10   64.83
CH+CM+NDC+OH+PFCH+RCS+RT+H+T         66.10   65.25
Average                              66.10   64.65
Table A27: Results for combination of Color, Composition and Texture features
%                                        V2      V4
CH+RCS+RT+CEDD                           63.98   61.86
CH+NDC+RCS+RT+CEDD                       67.37   62.71
CH+OH+RCS+RT+CEDD                        68.22   63.14
CH+PFCH+RCS+RT+CEDD                      63.14   60.59
OH+PFCHS+RCS+RT+CEDD                     61.44   61.86
CH+CM+NDC+RCS+RT+CEDD                    65.68   61.86
CH+CM+NDC+OH+PFCH+RCS+RT+CEDD            66.52   61.44
CH+RCS+RT+JCD                            65.25   64.83
CH+NDC+RCS+RT+JCD                        65.68   63.98
CH+OH+RCS+RT+JCD                         66.10   64.41
CH+PFCH+RCS+RT+JCD                       66.10   63.98
OH+PFCHS+RCS+RT+JCD                      63.56   61.44
CH+CM+NDC+RCS+RT+JCD                     65.68   63.14
CH+CM+NDC+OH+PFCH+RCS+RT+JCD             65.25   63.98
CH+RCS+RT+FCTH+JCD                       65.25   64.83
CH+NDC+RCS+RT+FCTH+JCD                   65.68   63.98
CH+OH+RCS+RT+FCTH+JCD                    66.10   64.41
CH+PFCH+RCS+RT+FCTH+JCD                  66.10   63.98
OH+PFCHS+RCS+RT+FCTH+JCD                 61.86   60.17
CH+CM+NDC+RCS+RT+FCTH+JCD                65.68   63.14
CH+CM+NDC+OH+PFCH+RCS+RT+FCTH+JCD        64.83   63.56
Average                                  65.21   63.01
Table A28: Results for combination of Color, Composition and Joint features
%                                    V2      V4
CH+RCS+EH+H                          64.41   65.68
CH+NDC+RCS+EH+H                      65.25   65.68
CH+OH+RCS+EH+H                       62.71   63.98
CH+PFCH+RCS+EH+H                     63.14   63.56
OH+PFCHS+RCS+EH+H                    61.44   62.71
CH+CM+NDC+RCS+EH+H                   64.83   63.56
CH+CM+NDC+OH+PFCH+RCS+EH+H           62.71   61.86
CH+RCS+EH+H+T                        62.29   63.29
CH+NDC+RCS+EH+H+T                    61.01   63.98
CH+OH+RCS+EH+H+T                     63.98   65.68
CH+PFCH+RCS+EH+H+T                   63.98   61.86
OH+PFCHS+RCS+EH+H+T                  63.14   62.71
CH+CM+NDC+RCS+EH+H+T                 63.14   63.98
CH+CM+NDC+OH+PFCH+RCS+EH+H+T         63.98   62.29
Average                              63.29   63.63
Table A29: Results for combination of Color, Shape and Texture features
%                                        V2      V4
CH+RCS+EH+CEDD                           62.71   62.71
CH+NDC+RCS+EH+CEDD                       63.56   62.29
CH+OH+RCS+EH+CEDD                        63.14   64.41
CH+PFCH+RCS+EH+CEDD                      62.71   59.75
OH+PFCHS+RCS+EH+CEDD                     60.59   59.75
CH+CM+NDC+RCS+EH+CEDD                    63.14   62.71
CH+CM+NDC+OH+PFCH+RCS+EH+CEDD            62.29   59.32
CH+RCS+EH+JCD                            62.29   64.41
CH+NDC+RCS+EH+JCD                        61.44   63.98
CH+OH+RCS+EH+JCD                         63.14   61.86
CH+PFCH+RCS+EH+JCD                       62.29   61.86
OH+PFCHS+RCS+EH+JCD                      63.14   60.59
CH+CM+NDC+RCS+EH+JCD                     63.98   64.41
CH+CM+NDC+OH+PFCH+RCS+EH+JCD             61.44   61.02
CH+RCS+EH+FCTH+JCD                       62.29   64.41
CH+NDC+RCS+EH+FCTH+JCD                   65.25   64.83
CH+OH+RCS+EH+FCTH+JCD                    62.71   63.14
CH+PFCH+RCS+EH+FCTH+JCD                  62.71   63.56
OH+PFCHS+RCS+EH                          61.44   62.71
CH+CM+NDC+RCS+EH+FCTH+JCD                63.56   62.29
CH+CM+NDC+OH+PFCH+RCS+EH+FCTH+JCD        62.29   60.59
Average                                  62.67   62.41
Table A30: Results for combination of Color, Shape and Joint features
%
CH+RCS+H+CEDD
CH+NDC+RCS+H+CEDD
CH+OH+RCS+H+CEDD
CH+PFCH+RCS+H+CEDD
OH+PFCHS+RCS+H+CEDD
CH+CM+NDC+RCS+H+CEDD
CH+CM+NDC+OH+PFCH+RCS+H+CEDD
CH+RCS+H+T+CEDD
CH+NDC+RCS+H+T+CEDD
CH+OH+RCS+H+T+CEDD
CH+PFCH+RCS+H+T+CEDD
OH+PFCHS+RCS+H+T+CEDD
CH+CM+NDC+RCS+H+T+CEDD
CH+CM+NDC+OH+PFCH+RCS+H+T+CEDD
CH+RCS+H+JCD
CH+NDC+RCS+H+JCD
CH+OH+RCS+H+JCD
CH+PFCH+RCS+H+JCD
OH+PFCHS+RCS+H+JCD
CH+CM+NDC+RCS+H+JCD
CH+CM+NDC+OH+PFCH+RCS+H+JCD
CH+RCS+H+T+JCD
CH+NDC+RCS+H+T+JCD
CH+OH+RCS+H+T+JCD
CH+PFCH+RCS+H+T+JCD
OH+PFCHS+RCS+H+T+JCD
CH+CM+NDC+RCS+H+T+JCD
CH+CM+NDC+OH+PFCH+RCS+H+T+JCD
CH+RCS+H+FCTH+JCD
CH+NDC+RCS+H+FCTH+JCD
CH+OH+RCS+H+FCTH+JCD
CH+PFCH+RCS+H+FCTH+JCD
OH+PFCHS+RCS+H+FCTH+JCD
CH+CM+NDC+RCS+H+FCTH+JCD
CH+CM+NDC+OH+PFCH+RCS+H+FCTH+JCD
CH+RCS+H+T+FCTH+JCD
CH+NDC+RCS+H+T+FCTH+JCD
CH+OH+RCS+H+T+FCTH+JCD
CH+PFCH+RCS+H+T+FCTH+JCD
OH+PFCHS+RCS+H+T+FCTH+JCD
CH+CM+NDC+RCS+H+T+FCTH+JCD
CH+CM+NDC+OH+PFCH+RCS+H+T+FCTH+JCD
Average
V2
66.52
64.83
66.53
65.68
63.56
65.25
62.71
66.95
66.53
66.10
66.95
62.71
64.83
63.14
65.25
62.29
64.83
63.98
59.75
64.41
64.83
66.52
65.68
65.68
63.56
58.48
63.98
64.41
68.22
65.68
64.83
64.41
60.59
64.41
65.25
65.68
65.25
65.68
65.25
62.71
63.25
66.52
64.61
V4
63.14
62.71
63.14
61.02
62.29
61.86
62.29
64.41
64.41
64.41
66.10
62.71
64.83
65.68
63.14
61.86
63.56
62.71
60.59
61.86
64.41
61.44
60.59
62.71
65.25
59.75
61.44
63.56
63.98
65.98
63.56
63.68
59.75
64.41
63.56
62.29
61.86
63.98
64.41
60.17
62.29
63.98
62.99
Table A31: Results for combination of Color, Texture and Joint features
%                                     V2      V4
OH + PFCHS + RCS + RT + H + EH        61.86   62.71
OH + PFCHS + RCS + RT + H + T + EH    61.86   63.56
Average                               61.86   63.14
Table A32: Results for combination of Color, Composition, Texture and Shape features
%                                            V2      V4
OH + PFCHS + RCS + RT + H + T + CEDD         63.56   62.71
OH + PFCHS + RCS + RT + H + T + JCD          63.56   61.86
OH + PFCHS + RCS + RT + H + T + FCTH + JCD   62.71   60.17
OH + PFCHS + RCS + RT + H + CEDD             63.56   63.56
OH + PFCHS + RCS + RT + H + JCD              61.44   62.29
OH + PFCHS + RCS + RT + H + FCTH + JCD       64.83   61.02
Average                                      63.28   61.94
Table A33: Results for combination of Color, Composition, Texture and Joint features
%                                      V2      V4
CH + RCS + H + FCTH + JCD + EH         63.56   65.25
CH + PFCH + RCS + H + T + CEDD + EH    61.86   61.86
Average                                62.71   63.56
Table A34: Results for combination of Color, Texture, Joint and Shape features
%                                      V2      V4
CH + RCS + H + FCTH + JCD + RT         65.25   63.98
CH + PFCH + RCS + H + T + CEDD + RT    63.14   61.86
Average                                64.20   62.92
Table A35: Results for combination of Color, Texture, Joint and Composition features
CH+CM+NDC+RCS
%
Negative
Positive
V2
Negative
82.11
46.02
Positive
17.89
53.98
%
Negative
Positive
V4
Negative
75.61
44.25
Positive
24.39
55.75
V4
Negative
73.17
45.13
Positive
26.83
54.87
V4
Negative
74.80
46.02
Positive
25.20
53.98
V4
Negative
77.24
45.13
Positive
22.76
54.87
V4
Negative
72.36
40.71
Positive
27.64
59.29
CH+CM+NDC+RCS+H
%
Negative
Positive
V2
Negative
79.67
44.25
Positive
20.33
53.98
%
Negative
Positive
CH+OH+RCS+CEDD
%
Negative
Positive
V2
Negative
79.67
43.36
Positive
20.33
56.64
%
Negative
Positive
CH+OH+RCS
%
Negative
Positive
V2
Negative
82.11
46.02
Positive
17.89
53.98
%
Negative
Positive
CH+PFCH+RCS+H
%
Negative
Positive
V2
Negative
80.49
45.13
Positive
19.51
54.87
%
Negative
Positive
CH+RCS+H+FCTH+JCD
%
Negative
Positive
V2
Negative
79.67
44.25
Positive
20.33
55.75
%
Negative
Positive
V4
Negative
75.61
48.67
Positive
24.39
51.33
V4
Negative
77.24
41.59
Positive
22.76
58.41
V4
Negative
78.05
46.02
Positive
21.95
53.98
OH+PFCHS+RCS+RT+H
%
Negative
Positive
V2
Negative
75.61
46.02
Positive
24.39
53.98
%
Negative
Positive
OH+PFCHS+RCS+RT+H+T
%
Negative
Positive
V2
Negative
78.86
46.02
Positive
21.14
53.98
%
Negative
Positive
Table A36: Confusion Matrices for each combination
CH+CM+NDC+RCS
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       78.51     34.71        Negative       81.82     33.06
  Positive       21.49     65.29        Positive       18.18     66.94

CH+CM+NDC+RCS+H
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       77.69     34.71        Negative       80.17     34.71
  Positive       22.31     65.29        Positive       19.83     65.29

CH+OH+RCS+CEDD
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       75.21     32.23        Negative       71.07     29.75
  Positive       24.79     67.77        Positive       28.93     70.25

CH+OH+RCS
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       77.69     32.23        Negative       76.86     34.71
  Positive       22.31     67.77        Positive       23.14     65.29

CH+PFCH+RCS+H
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       80.17     38.02        Negative       82.64     35.54
  Positive       19.83     61.98        Positive       17.36     64.46

CH+RCS+H+FCTH+JCD
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       78.51     35.54        Negative       76.86     32.23
  Positive       21.49     64.46        Positive       23.14     67.77

OH+PFCHS+RCS+RT+H
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       75.21     34.71        Negative       76.86     33.88
  Positive       24.79     65.29        Positive       23.14     66.12

OH+PFCHS+RCS+RT+H+T
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       76.03     32.23        Negative       77.69     33.88
  Positive       23.97     67.77        Positive       22.31     66.12
Table A37: Confusion Matrices for each combination using the GAPED dataset with Negative and Positive categories
CH+CM+NDC+RCS
  V2 (%)      Negative  Neutral  Positive        V4 (%)      Negative  Neutral  Positive
  Negative       51.24    33.71     23.97        Negative       37.19    24.72     22.31
  Neutral        28.93    49.44     22.31        Neutral        44.63    62.92     29.75
  Positive       19.83    16.85     53.72        Positive       18.18    12.36     47.93

CH+CM+NDC+RCS+H
  V2 (%)      Negative  Neutral  Positive        V4 (%)      Negative  Neutral  Positive
  Negative       61.16    23.60     27.27        Negative       51.24    25.84     25.62
  Neutral        18.18    57.30     21.49        Neutral        31.41    62.92     27.27
  Positive       20.66    19.19     51.24        Positive       17.36    11.24     47.11

CH+OH+RCS+CEDD
  V2 (%)      Negative  Neutral  Positive        V4 (%)      Negative  Neutral  Positive
  Negative       55.37    22.47     23.14        Negative       58.68    24.72     21.49
  Neutral        21.49    59.55     22.31        Neutral        22.31    64.04     26.45
  Positive       23.14    17.98     54.55        Positive       19.01    11.24     52.07

CH+OH+RCS
  V2 (%)      Negative  Neutral  Positive        V4 (%)      Negative  Neutral  Positive
  Negative       37.19    30.34     23.97        Negative       37.19    30.34     23.97
  Neutral        47.11    56.18     29.75        Neutral        47.11    56.18     29.75
  Positive       15.70    13.48     46.28        Positive       15.70    13.48     46.28

CH+PFCH+RCS+H
  V2 (%)      Negative  Neutral  Positive        V4 (%)      Negative  Neutral  Positive
  Negative       64.46    26.97     28.10        Negative       54.55    24.72     26.45
  Neutral        14.05    51.69     18.18        Neutral        28.93    64.04     23.14
  Positive       21.49    21.35     53.72        Positive       16.53    11.24     50.41

CH+RCS+H+FCTH+JCD
  V2 (%)      Negative  Neutral  Positive        V4 (%)      Negative  Neutral  Positive
  Negative       62.81    24.72     23.97        Negative       54.55    25.84     23.14
  Neutral        16.53    59.55     21.49        Neutral        25.62    65.17     23.14
  Positive       20.66    15.73     54.55        Positive       19.83     8.99     53.72

OH+PFCHS+RCS+RT+H
  V2 (%)      Negative  Neutral  Positive        V4 (%)      Negative  Neutral  Positive
  Negative       65.29    29.21     23.14        Negative       59.50    30.34     22.31
  Neutral        14.05    46.07     19.01        Neutral        24.79    52.81     23.97
  Positive       20.66    24.72     57.85        Positive       15.70    16.85     53.72

OH+PFCHS+RCS+RT+H+T
  V2 (%)      Negative  Neutral  Positive        V4 (%)      Negative  Neutral  Positive
  Negative       65.29    32.58     23.14        Negative       61.16    28.09     21.49
  Neutral        12.40    42.70     18.18        Neutral        21.49    55.06     23.14
  Positive       22.31    24.72     58.78        Positive       17.36    16.85     55.37

Table A38: Confusion Matrices for each combination using the GAPED dataset with Negative, Neutral and Positive categories
CH+CM+NDC+RCS
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       87.18     42.31        Negative       82.05     44.87
  Positive       12.82     57.69        Positive       17.95     55.13

CH+CM+NDC+RCS+H
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       79.49     39.74        Negative       82.05     42.31
  Positive       20.51     60.26        Positive       17.95     57.69

CH+OH+RCS+CEDD
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       80.77     41.03        Negative       83.33     39.74
  Positive       19.23     58.97        Positive       16.67     60.26

CH+OH+RCS
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       78.21     43.59        Negative       75.64     46.15
  Positive       21.79     56.41        Positive       24.36     53.85

CH+PFCH+RCS+H
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       80.77     39.74        Negative       79.49     43.59
  Positive       19.23     60.26        Positive       20.51     56.41

CH+RCS+H+FCTH+JCD
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       79.49     38.47        Negative       85.90     44.87
  Positive       20.51     61.54        Positive       14.10     55.13

OH+PFCHS+RCS+RT+H
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       82.05     44.87        Negative       87.18     47.45
  Positive       17.95     55.13        Positive       12.82     52.56

OH+PFCHS+RCS+RT+H+T
  V2 (%)      Negative  Positive        V4 (%)      Negative  Positive
  Negative       85.90     43.59        Negative       88.50     46.15
  Positive       14.10     56.41        Positive       11.54     53.85
Table A39: Confusion Matrices for each combination using the Mikels and GAPED datasets
Appendix B
Questionnaire
Figure B1: EmoPhoto Questionnaire
Results of the Questionnaire
Figure B2: 1. Age
Figure B3: 2. Gender
Figure B4: 3. Education Level
Figure B5: 4. Have you ever participated in a study using any Brain-Computer Interface Device?
Figure B6: 7. How do you feel?
Figure B7: 8. Please classify your emotional state regarding the following cases: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise