Computer-Assisted Understanding of Stance in Social Media

Formalizations, Data Creation, and Prediction Models

Universität Duisburg-Essen

2019

The automatic analysis of stances in social media can, for example, give governments and companies insight into how users evaluate their decisions or products.

Keywords

Social media, stance

Summary

Stance refers to the positive or negative evaluation of persons, things, or ideas (Du Bois, 2007). Understanding the stance that people express on social media opens up a range of applications. On the one hand, governments, companies, or other information seekers can gain insight into how people evaluate their decisions, ideas, or products. On the other hand, social media users who know the stance of other users can hold more efficient discussions and ultimately make better collective decisions.

Because the number of posts made on social media is too high for manual analysis, computer-aided methods for understanding stance are necessary. In this thesis, we examine three main aspects of such computer-aided methods: (i) abstract stance formalizations that can be quantified across multiple social media posts, (ii) the creation of suitable datasets that correspond to a given formalization, and (iii) automatic stance detection systems that can assign a stance label to social media posts. We examine four formalizations that differ in how specific the insights are that they provide when analyzing social media debates. Stance on Single Targets defines stance as a tuple consisting of a single target (e.g. atheism) and a polarity (e.g. being for or against the target). Stance on Multiple Targets models a polarity expressed towards an overall target and several logically linked targets. Stance on Nuanced Targets models stance as a polarity towards all texts in a given dataset. In addition, we examine Hateful Stance, a formalization that models whether a text expresses hate towards a single target (e.g. women or refugees).

Systems based on machine learning methods require a sufficient amount of labeled training data. Since such data is not available for every formalization, we created our own datasets as part of this thesis. On the basis of these datasets, we carry out quantitative analyses that shed light on how reliable the annotation of the data is and on how social media users communicate stance. Our analysis shows that the reliability of our data is affected by the annotators' subjective interpretations and by the frequency with which certain targets occur. Our studies further show that the perception of hate correlates with the annotators' personal stance. From this we conclude that stance annotations are subjective to a certain degree and that this subjectivity should be taken into account in future data creation. In addition, we propose a novel process for creating datasets that contain subjective annotations corresponding to the formalization Stance on Nuanced Targets and that thereby provide comprehensive insights into the underlying social media debate.

To examine the state of the art in automatic stance detection, we organized and participated in relevant shared tasks and conducted experiments on our own datasets. Across all experiments and datasets, our investigations show that comparatively simple methods achieve highly competitive performance. Furthermore, our observations show that neural approaches are competitive but not clearly better than conventional approaches to text classification. We show that approaches based on judgment similarity (defined as the degree to which texts are judged similarly by a large number of people) clearly outperform reference approaches. From this we conclude that judgment similarity is a promising direction for achieving further improvements in automatic stance detection and in related tasks such as sentiment analysis or argument mining.

Interview with Dr. Michael Wojatzki

Anja Zeltner: What does the term "stance" mean in the context of social media?

Michael Wojatzki: "Stance" (in German "Haltung", "Position", or "Standpunkt") refers to the positive or negative evaluation of persons, things, or ideas. People's stance leads them to vote for a particular party or candidate, to buy a particular product, or to avoid or seek out particular people. Since people today widely express their stance via social media, social media can serve as an important source for estimating the stance of groups or of society as a whole. If the mass of expressed positions can be processed and understood in a meaningful way, governments, companies, or other decision-makers can gain valuable insights into how people evaluate their decisions or products, and can adapt these to the population or to their users.

Anja Zeltner: You investigated whether it is possible to predict how people position themselves on controversial topics, e.g. climate change or gender equality. What were your results?

Michael Wojatzki: It is indeed possible to make such predictions. However, two conditions have to be met. First, data is needed on which the system performing the analysis can "train". Such data can be obtained, for example, when people like relevant Facebook posts. Second, the models used for prediction are always topic-specific. That is, a model trained to predict positions on "climate change" cannot readily be used for "gender equality". Even under these conditions, it is very difficult to predict the position of individuals. The difficulty for computers here is that individuals often hold contradictory positions; for example, they consider it immoral to eat meat but do so anyway. Predictions for a larger number of people, by contrast, can be made with high probability. In that case, one examines what percentage of a group agrees with or rejects given positions.

Anja Zeltner: How can the analysis of stances in social media affect a society positively or negatively?

Michael Wojatzki: Machines that can automatically understand stance at scale have great potential, but they can also become dangerous for a society, for instance when people use stance detection solely to interact with content that matches their own positions. Existing echo chambers are thereby reinforced, and discourse on controversial topics, which all members of a society should actually be having with one another, does not take place. This can ultimately lead to a fragmented society in which consensus is no longer possible. In addition, stance detection provides the technical basis for all-encompassing censorship or the persecution of political dissidents. At the same time, automatic stance detection can increase the efficiency with which social media users or organizations discover or filter posts that express a stance towards targets they are interested in. In this way, automatic stance detection can help a society as a whole to communicate more efficiently and thus to make better decisions.

Acknowledgements

This dissertation is the result of a long journey consisting of school, university studies, doctoral research, and practical work. Many people accompanied me on this journey; without them I could not have made it, and I would now like to thank them.

First and foremost, I would like to thank Torsten Zesch. Ever since Torsten introduced me to the black magic of machine learning during my studies, he has been a mentor who always encouraged me not to settle for superficial answers. During my time as a student assistant and as a doctoral candidate, I learned an incredible amount from Torsten about scientific work, computer science, and NLP, but also about not losing sight of the big picture. I thank Chris Biemann, who gave me valuable feedback and with whom I was able to experience an exciting GermEval 2017. My thanks go to the DFG, the research training group UCSM, and the Universität Duisburg-Essen, which provided material resources and a productive working environment.

I would expressly like to thank my colleagues at LTL, Andrea Horbach, Darina Gold, Huangpan Zhang, Osama Hamed, and Tobias Horsmann, who were always available for discussions and thought experiments. Andrea and Darina never tired of explaining concepts such as named entities to someone with a less deep understanding of linguistics. Special thanks go to Tobias, who supported me in word and deed throughout my doctoral research and with whom I set out to explore the depths of deep learning. Tobias was available for endless discussions about shapes and layers, and I learned a lot from his pragmatic approach. I also thank all student assistants at LTL who supported me with annotations or with the development of prototypes.

I thank Andre Körner, Christoph Schmidt, Dimitar Denev, and Heiko Beier, who supported me on excursions into practice and from whom I learned a lot about customer orientation and workarounds. My thanks also go to all UCSM doctoral candidates, postdocs, and PIs, who gave me interesting insights into other research disciplines and cultures. I would like to single out the entire IWG hate speech, in which I learned a particularly great deal about interdisciplinary work.

I also thank all the friends and fellow students who stood by me during my studies. Here I would like to single out Gerrit Stöckigt, Benjamin Räthel, and Jonas Braier, who accompanied me through a particularly large number of courses and exams. I would also like to thank all my friends from my school days, who always supported me in their own special way. I further thank my entire family.

My greatest thanks go to my parents, who contributed to this dissertation in so many ways that these contributions cannot be acknowledged individually. Finally, I would like to thank Julia with all my heart. For everything.

I thank the numerous anonymous reviewers who helped to improve my research projects and my submissions. Furthermore, I thank the NLP community for constructive feedback and many interesting conversations. I would also like to thank my international collaborators Nicolás E. Díaz Ferreyra, Oren Melamud, Saif M. Mohammad, and Svetlana Kiritchenko, with whom I have carried out research projects and excursions that have been deeply rewarding for me. My special thanks go to Saif, who made my visit to Ottawa a fascinating experience and from whom I learned a lot about how to do research that is compelling to a large number of people.

Abstract

Stance can be defined as positively or negatively evaluating persons, things, or ideas (Du Bois, 2007). Understanding the stance that people express through social media has several applications: It allows governments, companies, or other information seekers to gain insights into how people evaluate their ideas or products. Being aware of the stance of others also enables social media users to engage in discussions more efficiently, which may ultimately lead to better collective decisions.

Since the volume of social media posts is too large to be analyzed manually, computer-aided methods for understanding stance are necessary. In this thesis, we study three major aspects of such computer-aided methods: (i) abstract formalizations of stance which we can quantify across multiple social media posts, (ii) the creation of suitable datasets that correspond to a certain formalization, and (iii) stance detection systems that can automatically assign stance labels to social media posts.

We examine four different formalizations that differ in how specific the insights and supported use-cases are: Stance on Single Targets defines stance as a tuple consisting of a single target (e.g. Atheism) and a polarity (e.g. being in favor of the target), Stance on Multiple Targets models a polarity expressed towards an overall target and several logically linked targets, and Stance on Nuanced Targets is defined as a stance towards all texts in a given dataset. Moreover, we study Hateful Stance, which models whether a post expresses hatefulness towards a single target (e.g. women or refugees).

Machine learning-based systems require training data that is annotated with stance labels. Since annotated data is not readily available for every formalization, we create our own datasets. On these datasets, we perform quantitative analyses, which provide insights into how reliable the data is and into how social media users express stance. Our analysis shows that the reliability of datasets is affected by subjective interpretations and by the frequency with which targets occur. Additionally, we show that the perception of hatefulness correlates with the personal stance of the annotators. We conclude that stance annotations are, to a certain extent, subjective and that future attempts at data creation should account for this subjectivity. We present a novel process for creating datasets that contain subjective stances towards nuanced assertions and that provide comprehensive insights into debates on controversial issues.

To investigate the state of the art in stance detection, we organized and participated in relevant shared tasks, and conducted experiments on our own datasets. Across all datasets, we find that comparatively simple methods yield competitive performance. Furthermore, we find that neural approaches are competitive, but not clearly superior to more traditional approaches to text classification. We show that approaches based on judgment similarity – the degree to which texts are judged similarly by a large number of people – outperform reference approaches by a large margin. We conclude that judgment similarity is a promising direction to achieve improvements beyond the state of the art in automatic stance detection and related tasks such as sentiment analysis or argument mining.

Introduction

Stance refers to an evaluation, either positive or negative, which is directed towards a person, thing, or idea and is one of the most basic human cognitions (Du Bois, 2007). Our stance drives us to vote for a certain party or candidate, to buy a certain product, or to avoid or approach people. Knowing a group's stance on an issue is a key prerequisite for meaningful decisions. For example, if a government has to decide, within the democratic decision-making process, whether to invest in wind energy, it is – besides other factors – crucial to know the citizens' stance towards this target.

Commonly, one estimates the stance of groups by conducting interviews, focussed surveys, or voting procedures. However, as these methods are cost- and time-intensive, they do not scale to an arbitrarily high number of topics and people. In addition to the limits posed by their costs, there are several well-known biases, such as sampling bias (i.e. the selected sample is not representative of the whole group), that call the validity of these methods into question (Rosenthal, 1965; Smart, 1966).

Given that people nowadays express their stances in large quantities on social media sites, social media can provide an alternative source for estimating the stance of groups or of society as a whole. This source also has the advantage that social media users express their stances out of their own motivation, so their statements are not affected by an interviewer or by the setting in which a survey takes place. Although the enormous availability of social media allows a wider overview than traditional methods of collecting stance, it also has the drawback that the data volumes are too large to be analyzed manually. Hence, in this thesis we study computer-assisted analysis processes, which can be automated to a high degree and can therefore be applied to almost unlimited amounts of data.

Besides the difference in the volumes of data, social media data differs from survey data in the extent to which the data is structured. While survey data consists of multiple choice or at least semi-structured free-text answers, people express themselves freely in social media. This freedom means that our analysis processes need to cope with the variance and the ambiguities inherent to natural language. Below, we illustrate this characteristic of natural language with three statements which express a stance towards wind energy:

(A) Wind is the least expensive source for generating energy!
(B) Windmills are cheaper than alternatives.
(C) Wind is the most expensive source for generating energy!

Examples A) and B) demonstrate that statements can convey a stance in favor of wind energy while having highly different surface forms (i.e. only the string Wind occurs in both examples). In contrast, Examples A) and C) have highly similar surface forms, differing only in that the word least in Example A) is replaced by most in Example C), but express opposing stances. Furthermore, Example B) can be interpreted in different ways and is thus somewhat ambiguous: the term alternatives could refer either to other energy sources (e.g. coal or solar) or to other forms of harvesting wind energy (e.g. by using a kite).

In order to cope with these characteristics of natural language, we rely on stance detection systems that are based on techniques from the fields of natural language processing and supervised machine learning. These systems have to be trained on a sufficiently large number of samples to learn a function that maps regularities in the surface forms to a predefined stance formalization. We define a stance formalization as an abstract representation of the expressed stance. For instance, for Examples A) and B) we could define the symbol {WIND ENERGY, $\oplus$} to represent that both texts express a stance in favor of wind energy. More specifically, we formally define stance as a tuple consisting of (i) a target (e.g. wind energy) and (ii) a stance polarity (e.g. being in favor of ($\oplus$) or being against ($\ominus$) the target).
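
The tuple definition can be made concrete with a small data structure. The following is a minimal sketch, not the thesis's implementation: the names `Polarity` and `Stance` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Polarity(Enum):
    FAVOR = "⊕"    # in favor of the target
    AGAINST = "⊖"  # against the target


@dataclass(frozen=True)
class Stance:
    """A stance as a (target, polarity) tuple; names are illustrative."""
    target: str
    polarity: Polarity


# Examples A) and B) differ on the surface but map to the same tuple.
stance_a = Stance("WIND ENERGY", Polarity.FAVOR)
stance_b = Stance("WIND ENERGY", Polarity.FAVOR)
# Example C) shares the target but carries the opposite polarity.
stance_c = Stance("WIND ENERGY", Polarity.AGAINST)

print(stance_a == stance_b)  # True
print(stance_a == stance_c)  # False
```

Comparing the tuples rather than the texts is what lets us count stances across many differently worded posts.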

As our definition leaves the exact nature of target and polarity open, it can be used as a blueprint for more concrete formalizations. In the present thesis, we consider four formalizations that differ in how complex they are and thus in what use-cases they support:

Stance on Single Targets: In the least complex formalization, we formalize a stance tuple consisting of a single, predefined target and a three-way polarity (i.e. $\oplus$ , $\ominus$ , and NONE). For instance, this formalization supports a use-case in which a government is interested in people's overall stance towards issues such as the independence of Catalonia, feminism, or abortion. The formalization also supports decision-makers in a company who want to know if their product is being evaluated positively, negatively or neutrally.
Stance on Multiple Targets: In certain use-cases, we are interested in differentiating between more finely grained stances. For instance, in an energy debate, we might be interested in what stance people have towards solar, towards wind, and towards nuclear energy. Hence, in the formalization Stance on Multiple Targets, we simultaneously model a $\oplus$ , $\ominus$ , or NONE stance towards an overall target and a set of logically linked targets. The relationship between the different, predefined targets can be described by classical semantic relation types such as meronymy, hyperonymy, or hyponymy.
Stance on Nuanced Targets: If our use-case is to obtain a comprehensive understanding of all nuances of stance towards a topic or if we do not have knowledge of the most relevant aspects of a topic, it is not viable to predefine a set of logically linked targets. Hence, for the formalization Stance on Nuanced Targets, we formalize $\oplus$ or $\ominus$ stance towards all texts in the dataset we want to analyze.
Hateful Stance: Stance formalizations can be distinguished not only in terms of the granularity of the targets, but also in terms of different polarities. For the fourth formalization, we study Hateful Stance as an especially unpleasant form of stance taking. We define hateful stance analogously to Stance on Single Targets, but define a Hatefulness Polarity instead of the regular polarity. We argue that this formalization can help to contain hateful stance and the negative consequences it has on its targets.
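
The four formalizations above can be read as four shapes of label. The sketch below is one possible encoding under stated assumptions: all class and field names are hypothetical, the example targets and polarities are invented, and representing the Hatefulness Polarity as a boolean is a simplification made here, not a definition taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict

FAVOR, AGAINST, NONE = "⊕", "⊖", "NONE"


@dataclass
class SingleTargetStance:
    target: str
    polarity: str  # FAVOR, AGAINST, or NONE


@dataclass
class MultiTargetStance:
    overall: SingleTargetStance
    # logically linked target -> FAVOR, AGAINST, or NONE
    linked: Dict[str, str] = field(default_factory=dict)


@dataclass
class NuancedStance:
    # every text in the dataset acts as a target: text -> FAVOR or AGAINST
    judgments: Dict[str, str]


@dataclass
class HatefulStance:
    target: str
    hateful: bool  # assumed binary hatefulness polarity


# Invented example for an energy debate.
energy = MultiTargetStance(
    overall=SingleTargetStance("renewable energy", FAVOR),
    linked={"solar energy": FAVOR, "wind energy": AGAINST},
)
nuanced = NuancedStance({"Windmills are cheaper than alternatives.": FAVOR})
```

The encoding makes the difference in granularity visible: the label grows from one polarity, to a fixed set of polarities, to one polarity per text in the dataset.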

We provide an abstract example for each of the three formalizations Stance on Single Targets, Stance on Multiple Targets, and Stance on Nuanced Targets in Subfigure formalization of Figure 1.1. Note that we do not show an example for Hateful Stance, as this formalization differs from the formalization Stance on Single Targets only by the use of a different polarity.

Before we can train systems that automatically assign the described formalisms to social media text, we need a sufficient number of training examples. This training data needs exactly those labels that correspond to the formalization our systems should recognize when applied to new data. Since data is not available for all of our formalizations, we manually created datasets ourselves. While our individual data creation efforts differ in several details, they all incorporate the following three steps: We (i) gather raw data relevant to the issue we want to explore, (ii) let several annotators independently annotate a stance formalization, and (iii) consolidate the different annotations into a final gold label. This process is visualized in Subfigure data creation of Figure 1.1.
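
Step (iii) can be illustrated with a simple consolidation rule. The sketch below assumes majority voting, which is one common choice; this section does not state which rule the thesis uses, and the function name `consolidate` is hypothetical.

```python
from collections import Counter


def consolidate(labels):
    """Return the majority label, or None if the vote is tied.

    Majority voting is an assumption made for this sketch.
    """
    if not labels:
        raise ValueError("need at least one annotation")
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the item for manual adjudication
    return ranked[0][0]


annotations = {
    "post-1": ["FAVOR", "FAVOR", "AGAINST"],
    "post-2": ["AGAINST", "NONE", "AGAINST"],
    "post-3": ["FAVOR", "AGAINST"],
}
gold = {post: consolidate(labels) for post, labels in annotations.items()}
print(gold)
```

Returning `None` on a tie keeps unresolved items visible instead of silently assigning them a label.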

The data creation process allows us to perform a number of quantitative analyses that help us to understand our stance formalizations in more detail. Specifically, we analyze the degree to which multiple annotators agree on the annotation of a formalism (inter-rater reliability), which indicates how subjective the annotations are (Artstein and Poesio, 2008). Subjective annotations pose a serious problem for automated systems that are trained on this data, as the resulting predictions will reproduce the subjective annotations. A third person may not agree with such a subjective prediction, which renders such a system useless.
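
As an illustration of inter-rater reliability, the sketch below computes Cohen's kappa for two annotators, i.e. observed agreement corrected for the agreement expected by chance. Cohen's kappa is used here only as a well-known example of a chance-corrected coefficient; this section does not say which coefficient the thesis reports, and the label sequences are invented.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement from each annotator's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (p_o - p_e) / (1 - p_e)


ann_1 = ["FAVOR", "FAVOR", "AGAINST", "AGAINST", "NONE", "FAVOR"]
ann_2 = ["FAVOR", "AGAINST", "AGAINST", "AGAINST", "NONE", "FAVOR"]
print(round(cohens_kappa(ann_1, ann_2), 3))  # 0.739
```

Here the annotators agree on 5 of 6 items, but since 13/36 agreement is expected by chance, kappa is 17/23, noticeably below the raw agreement of 0.833.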

We develop various metrics that provide us with empirical insights into how social media users express stances. These include scores that quantify how polarizing statements or issues are, or scores that describe the degree to which a text is similarly judged by a large number of people. Furthermore, we study the frequency with which individual elements of the formalizations occur, or the frequency with which several elements co-occur.
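
To make these two kinds of score tangible, the sketch below gives one plausible operationalization of each. Both formulas are assumptions made for illustration and are not taken from this section: polarization is computed as one minus the absolute mean judgment, and the similarity of two texts as the cosine between their judgment vectors. Judgments are coded as +1 (agree) and -1 (disagree), and the data is invented.

```python
import math


def polarization(judgments):
    """1.0 for an even split, 0.0 for a unanimous vote (assumed formula)."""
    return 1.0 - abs(sum(judgments) / len(judgments))


def judgment_similarity(u, v):
    """Cosine similarity of two judgment vectors over the same people."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Rows: judgments of four (hypothetical) people on three assertions.
assertion_1 = [1, 1, -1, -1]
assertion_2 = [1, 1, -1, -1]   # judged exactly like assertion_1
assertion_3 = [-1, -1, 1, 1]   # judged exactly the opposite way

print(polarization(assertion_1))                       # 1.0
print(judgment_similarity(assertion_1, assertion_2))   # 1.0
print(judgment_similarity(assertion_1, assertion_3))   # -1.0
```

Under these definitions, two assertions can be highly polarizing individually and still be judged identically by the same people.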

After the creation of datasets, we can attempt to train stance detection systems. All of our stance detection systems correspond to prototypical text classification systems. Thus, as shown in Subfigure detection systems in Figure 1.1, they all involve the three steps preprocessing, machine learning, and evaluation. First, during preprocessing, the data must be translated into a form that a computer can digest. Preprocessing includes a segmentation of the data (e.g. splitting a text into words) and a vectorization of it. For vectorization, we represent raw data by a set of measurable properties. For instance, we represent text as a sequence of word vectors (cf. Mikolov et al. (2013b)) or by averaging the polarity scores of all words (i.e. how positive or negative a word is perceived to be (Mohammad and Turney, 2010)) that are contained in the text.
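
The two preprocessing steps can be sketched in a few lines. The tokenizer and the two-entry polarity lexicon below are toy stand-ins invented for this example; they are not the resources used in the thesis.

```python
import re

# Hypothetical toy lexicon: word -> polarity score in [-1, 1].
TOY_POLARITY = {"cheaper": 0.5, "expensive": -0.5}


def tokenize(text):
    """Segmentation: lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def average_polarity(text, lexicon=TOY_POLARITY):
    """Vectorization: one feature, the mean polarity of all words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)


a = "Wind is the least expensive source for generating energy!"
b = "Windmills are cheaper than alternatives."
c = "Wind is the most expensive source for generating energy!"

print(tokenize(b))
print(average_polarity(a) == average_polarity(c))  # True
```

With this toy lexicon, Examples A) and C) receive the same feature value although they express opposing stances, which shows why a single averaged polarity score is rarely sufficient on its own.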

Next, we use machine learning algorithms to learn a function that maps the vectorized data to a stance label. To investigate which type of machine learning is best suited for stance detection, we experiment with a set of algorithms that can be roughly categorized into (a) neural networks (often referred to as deep learning (Glorot and Bengio, 2010)) and (b) more traditional learning algorithms such as SVMs.
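
To illustrate what learning a function from vectorized data to a stance label means, the sketch below trains a perceptron on binary bag-of-words features. The perceptron is a deliberately simple linear classifier used here as a stand-in; it is neither the SVMs nor the neural networks examined in the thesis, and the four training sentences are invented.

```python
from collections import defaultdict


def features(text):
    """Binary bag-of-words features plus a bias feature."""
    return set(text.lower().split()) | {"<bias>"}


def train_perceptron(examples, epochs=10):
    """examples: list of (text, y) with y = +1 (FAVOR) or -1 (AGAINST)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, y in examples:
            feats = features(text)
            score = sum(weights[f] for f in feats)
            if y * score <= 0:  # misclassified: move weights towards y
                for f in feats:
                    weights[f] += y
    return weights


def predict(weights, text):
    score = sum(weights.get(f, 0.0) for f in features(text))
    return "FAVOR" if score > 0 else "AGAINST"


train = [
    ("wind energy is cheap and clean", +1),
    ("wind turbines are great", +1),
    ("wind turbines are ugly and loud", -1),
    ("wind energy is expensive", -1),
]
weights = train_perceptron(train)
print(predict(weights, "wind energy is clean"))  # FAVOR
print(predict(weights, "wind energy is loud"))   # AGAINST
```

The learned weights are topic-specific: words such as clean or loud carry the signal, while the target word wind ends up with a weight of zero because it occurs in every example.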

Figure 1.1

Overview of and relationship between the three major aspects of computer-aided stance detection. Note that the processes for creating data and building a stance detection system vary according to the targeted stance formalization. The Subfigures data creation and detection systems exemplify the processes for the formalization Stance on Single Targets.

Finally, we perform evaluations that measure the quality of the entire systems and their individual components. For these evaluations, we test the performance of our predictions on data that was not part of the training process, but that is annotated with the same stance labels. In order to compare systems under fair conditions, we participated in and organized shared task initiatives. Shared tasks are comparative evaluation set-ups in which organizers make sure that all participating systems are evaluated under the same conditions (e.g. all systems are evaluated using the same performance metrics). Furthermore, we conduct controlled experiments in which we evaluate the effectiveness of individual components. For instance, we test how the overall system's performance changes if we assume a perfect performance of upstream components.
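
A held-out evaluation of this kind can be sketched as follows. Macro-averaged F1 is used here as an example of a shared performance metric; this section does not state which metric each evaluation uses, and the gold and predicted labels below are invented.

```python
def f1_for(label, gold, pred):
    """F1 score of one label on paired gold and predicted labels."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for(l, gold, pred) for l in labels) / len(labels)


# Labels of held-out posts that were not used during training.
gold = ["FAVOR", "FAVOR", "AGAINST", "AGAINST", "NONE", "NONE"]
pred = ["FAVOR", "AGAINST", "AGAINST", "AGAINST", "NONE", "FAVOR"]

print(round(macro_f1(gold, pred, ["FAVOR", "AGAINST"]), 2))  # 0.65
```

Because every class contributes equally to the macro average, a system cannot score well by predicting only the most frequent label.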

Main Contributions

Now, we describe the main contributions of this thesis. We group these contributions into our three main research areas Formalization, Data Creation, and Detection Systems:

Formalization

We provide a comprehensive overview of stance formalisms and compare stance with related NLP concepts such as sentiment and arguments. Thereby, we carve out the conceptual similarities of the NLP concepts target-based sentiment, topic-based sentiment, aspect-based sentiment, and stance. We argue that these theoretical considerations pave the way for combining future efforts in these areas.

Our theoretical considerations indicate that existing stance formalisms may be too coarse-grained to provide a comprehensive overview of all nuances of a debate. To close this gap, we propose a new stance formalization, which we call stance on nuanced targets and which models stance towards all texts in a dataset.

In addition, we describe how our formalization of stance can be adapted to model hatefulness that is expressed towards targets (e.g. refugees or women). In this way, we contribute to consolidating the research of new NLP tasks such as aggression detection or hate speech detection into a unified field.

Data Creation

For studying the formalization stance on multiple targets, we created two novel datasets and made them available to the research community: one contains tweets about atheism and the other contains YouTube comments about the death penalty. We show that people do not always agree on their interpretation of which stance a text expresses towards multiple targets. As a major reason for disagreement, we identify the ambiguity and distribution of the predefined targets. Our study suggests that good targets should be selected according to these criteria.

Moreover, we propose a novel process of collecting data to study the formalization stance on nuanced targets. Our process consists of two steps which can be carried out via crowdsourcing: First, we engage people to generate a comprehensive dataset of assertions relevant to certain issues. Second, we ask people what stance they have towards the collected assertions (i.e. whether they personally agree or disagree with the assertions). The resulting dataset covers sixteen different issues (e.g. vegetarianism and veganism, legalization of marijuana, or mandatory vaccination), and contains over 2,000 assertions and about 70,000 stance annotations. We show how this data can be analyzed in order to obtain novel insights into the nature of the corresponding debate.

We also examined our formalization of hateful stance using two datasets. Specifically, we were involved in the creation of one of the first German datasets that contains hatefulness labels. In addition, we created a dataset which is annotated with both stance on nuanced targets and hateful stance towards women. Our research shows that reliably annotating hateful stance is even more challenging than the annotation of stance on multiple targets. Furthermore, we show that whether an annotator perceives a text as hateful is influenced by whether the text is phrased implicitly and by what stance the annotator has towards the text. More specifically, we show that annotators who agree with a text are unlikely to perceive the text as hateful and, conversely, that annotators who perceive a text as hateful are unlikely to agree with it.

Detection Systems

We participated in the first shared tasks on stance detection (i.e. SemEval 2016 Task 6 and StanceCat@IberEval 2017). In addition, we co-organized a shared task that focussed on customer reviews about Deutsche Bahn - the German public train operator. With our involvement in these initiatives, we have actively contributed to determining the current state of the art in the area of single-target stance detection. Across different initiatives, we find that comparatively simple systems yield highly competitive results, but that the usage of word polarity lists, dense word vectors, and ensemble learning consistently leads to improvements over baseline systems. In addition, we could demonstrate that the state of the art in calculating word-relatedness does not sufficiently model the lexical semantics that are required for stance detection.

In our datasets that are annotated with stance on multiple targets, we show that overall stance classifiers already model stance on explicit targets to a large extent. Furthermore, we demonstrate that the overall stance can be quite reliably predicted if specific targets are given. This means that future attempts on stance detection could potentially reuse models that have been trained to predict stance on targets that correspond to the more specific targets.

In the context of stance on nuanced targets, we propose two novel NLP tasks: predicting the stance of individuals and predicting the stance of groups. For solving these tasks, we propose to rely on an automatically estimated judgment similarity – the degree to which two assertions are judged similarly by a large number of people. We show that judgment similarity can be quite reliably estimated from texts and is a useful means for solving the proposed tasks. For the prediction of stance of groups our approach based on judgment similarity outperforms reference approaches by a wide margin. Furthermore, we demonstrate that approaches that are based on judgment similarity substantially outperform competitive baselines in the task of predicting hateful stance.
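The notion of judgment similarity can be illustrated with a small sketch. It assumes that each assertion is represented by the vector of judgments it received (+1 for agree, -1 for disagree, 0 for no judgment) and that similarity is measured with the cosine of two such vectors. Both the cosine measure and the three assertions with their judgments are assumptions made for illustration; they are not taken from the collected dataset.

```python
import math


def judgment_similarity(a, b):
    """Cosine similarity of two judgment vectors (+1 agree, -1 disagree, 0 none)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if denom == 0:
        return 0.0  # an assertion without any judgments is similar to nothing
    return dot / denom


# Hypothetical judgments of five participants (columns) on three assertions.
judgments = {
    "Vaccination should be mandatory.":        [1, 1, -1, 1, -1],
    "Everyone should have to get vaccinated.": [1, 1, -1, 1, -1],
    "Vaccination must remain a free choice.":  [-1, -1, 1, -1, 1],
}

texts = list(judgments)
for i, first in enumerate(texts):
    for second in texts[i + 1:]:
        score = judgment_similarity(judgments[first], judgments[second])
        print(f"{score:+.2f}  {first} | {second}")
```

In this toy setting, two assertions that are judged identically by all participants receive the highest possible score, and assertions that are judged in exactly opposite ways receive the lowest. Estimating such scores from the text alone, as described above, would require a trained model and is not shown here.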

Publication Record

The research presented in this thesis was partly published in peer-reviewed conference and workshop proceedings and, thereby, made accessible to the research community. We will now report and briefly summarize these publications.

We have contributed to determining the current state of the art in detecting Stance on Single Targets by participating in and organizing shared task initiatives on the topic. Our participation in the first shared task on stance detection (i.e. SemEval 2016 Task 6 by Mohammad et al.) is described in the publication:

Michael Wojatzki and Torsten Zesch. 2016a. ltl.uni-due at SemEval-2016 Task 6: Stance Detection in Social Media Using Stacked Classifiers. In Proceedings of the International Workshop on Semantic Evaluation (SemEval), San Diego, USA, pages 428–433.

In our submission, we experiment with a combination of neural and non-neural classifiers. While our system does not outperform competitive baselines, we could demonstrate that the approach has the potential for significant performance gains. Furthermore, we co-organized a shared task that focussed on social media customer feedback which targets Deutsche Bahn. We report the findings of this shared task in the publication:

Michael Wojatzki and Torsten Zesch. 2017. Neural, Non-neural and Hybrid Stance Detection in Tweets on Catalan Independence. In Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL), Murcia, Spain, pages 178–184.

Our analysis shows that comparatively simple systems yield competitive results, but we also identify methods that consistently lead to improvements over these baselines.

We investigate Stance on Multiple Targets by analyzing two newly created datasets: First, we examine the relationship between overall and specific stances in a dataset about atheism. We find that overall stance can be quite reliably predicted from more specific stances. The creation and analysis of this dataset is described in the publication:

Michael Wojatzki and Torsten Zesch. 2016b. Stance-based Argument Mining: Modeling Implicit Argumentation Using Stance. In Proceedings of the Conference on Natural Language Processing (KONVENS), Bochum, Germany, pages 313–322.

Subsequently, we compared two different sets of specific targets in a dataset about the death penalty. We could not conclude that one set is clearly preferable over another, but that they complement each other well. The dataset along with the experiments are described in the publication:

Michael Wojatzki and Torsten Zesch. 2018. Comparing Target Sets for Stance Detection: A Case Study on YouTube Comments on Death Penalty. In Proceedings of the Conference on Natural Language Processing (KONVENS), Vienna, Austria, pages 69–79.

In certain use-cases, existing stance formalizations may be too coarse-grained to obtain a comprehensive overview of all nuances of a debate. Hence, we propose a new formalization of stance (Stance on Nuanced Targets), which models stance on an even more fine-grained level. To obtain data that is annotated with corresponding labels, we propose a new process for collecting data. We describe this process, the resulting data, as well as the analysis of the data in the publication:

Michael Wojatzki, Saif M. Mohammad, Torsten Zesch, and Svetlana Kiritchenko. 2018b. Quantifying Qualitative Data for Understanding Controversial Issues. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan, pages 1405–1418.

Next, we describe two novel NLP tasks that we can try to solve on the basis of the collected datasets. In addition, we describe how judgment similarity can be used to solve these tasks. The two tasks, as well as our approach based on judgment similarity, are described in the publication:

Michael Wojatzki, Torsten Zesch, Saif M. Mohammad, and Svetlana Kiritchenko. 2018c. Agree or Disagree: Predicting Judgments on Nuanced Assertions. In Proceedings of the Conference on Lexical and Computational Semantics (*SEM 2018), New Orleans, USA, pages 214–224.

Subsequently, we examine the formalization Hateful Stance. To that end, we investigate the reliability of hatefulness annotation in a dataset about the European refugee crisis. We find that the annotations are indeed influenced by subjective interpretations. This examination is described in the publication:

Björn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2016. Measuring the reliability of hate speech annotations: the case of the European refugee crisis. In Proceedings of the Workshop on Natural Language Processing for Computer-Mediated Communication (NLP4CMC), Bochum, Germany, pages 6–9. (All authors of this publication were equally involved in the research.)

Next, we investigate whether the implicitness of texts has an influence on how hateful they are perceived to be. The experiment indicates that implicitness has an influence on the perception of hatefulness but that this influence is moderated by other factors (e.g. by whether a text contains a threat). We report this experiment in the publication:

Darina Benikova, Michael Wojatzki, and Torsten Zesch. 2017. What Does This Imply?: Examining the Impact of Implicitness on the Perception of Hate Speech. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL), Berlin, Germany, pages 171–179. (The author of this thesis developed the research design, performed the quantitative analyses, and implemented and evaluated the machine learning approaches).

Finally, we create and analyze a dataset that is annotated with both stance on nuanced targets and hateful stance. We also find a strong relationship between hateful stance and stance on nuanced targets, and demonstrate how this relationship can be used for predicting hatefulness. These investigations, as well as approaches on automatically detecting hateful stance, are described in the publication:

Michael Wojatzki, Tobias Horsmann, Darina Benikova, and Torsten Zesch. 2018a. Do Women Perceive Hate Differently: Examining the Relationship Between Hate Speech, Gender, and Agreement Judgments. In Proceedings of the Conference on Natural Language Processing (KONVENS), Vienna, Austria, pages 110–120.

Thesis Overview

In the following, we provide an overview of the structure of this thesis. In Chapter 2, we first give an overview of our formalization of stance and related NLP concepts such as sentiment or arguments. Hereby, we show that all stance formalizations can be defined as tuples consisting of targets and a polarity that is expressed towards that target.

Stance on Single Targets

In Chapter 3, we examine the simplest formalization of stance in which we define stance towards a single target. First, we give a brief overview of NLP and machine learning systems, which represent the state of the art in this area. Subsequently, we describe our participation in and organization of three shared task initiatives on the topic. We also show that the targets used in these tasks require different lexical semantics as they are measured by state-of-the-art methods.

Stance on Multiple Targets

In Chapter 4, we study a more fine-grained formalization that models stance towards multiple, logically-linked targets. Since there was no suitable data available for our investigations, we created our own dataset. To that end, we first introduce the basics of annotating and evaluating NLP datasets in this chapter. Subsequently, we describe how we create and analyze two different datasets: one about atheism and one about the death penalty.

Stance on Nuanced Targets

In Chapter 5, we propose a new formalization of stance, which models stance on an even more fine-grained level. We introduce a novel process of collecting data that corresponds to this new formalization of stance. Subsequently, we show how the data can be analyzed to obtain novel insights into controversial issues. In addition, we define two novel tasks that can be evaluated on the collected data, and show how judgment similarity can be used to solve these tasks.

Hateful Stance

In Chapter 6, we adapt our stance formalization stance on single targets to capture hateful expressions which target, for instance, refugees or women. First, we describe how this formalization relates to concepts such as hate speech and provide an overview of the current state of the art in detecting hateful stance. Subsequently, we describe two annotation experiments to determine whether experts differ from laypersons in the annotation of hatefulness and to determine the influence of implicitness. Next, we describe how we create a dataset that contains annotations that correspond to hateful stance and to stance on nuanced targets. Finally, we analyze this dataset and demonstrate how judgment similarity can be utilized to predict hateful stance.

Formalizing Stance

There are multiple ways to express stance using natural language. For example, if one favors wind energy in a debate on how the society should generate electricity, one could express the exact same stance in various ways:

(A) We should use wind energy to produce electricity!
(B) I prefer wind energy over other options.
(C) Wind energy is the only sane option.

If we face several statements and our task is to generate quantitative insights about the expressed stances, we must be able to determine whether the different statements express the same stance or not. Therefore, we need abstract formalizations of stance that can be measured and quantified across all utterances. For the utterances above, we could – for example – introduce a symbol {WIND ENERGY, $\oplus$ } and use it to represent the expressed stances.

This process of enhancing (textual) data with such formalizations is called annotation (Bird and Liberman, 2001; Ide and Romary, 2004). In this chapter, we will shed light on different approaches on formalizing stance, and how these formalizations relate to similar concepts such as sentiment or arguments.

Stance as a Target–Polarity Tuple

The meaning of a message expressed in natural language can change dramatically due to the context. For instance, in a debate on healthy food with the expression Apples are awesome!, the author communicates that she likes the fruit called apple. However, in a debate that targets computer brands, the same utterance may express that the author favors products of the brand Apple. If the debate is specifically concerned with the pros and cons of PCs, one could even conclude that the author dislikes PCs. To capture this ambiguity, we rely on a formalization of stance that is defined towards a target (cf. Mohammad et al. (2016)). Hence, we define stance as a tuple consisting of:

A) a polarity of the stance that is being expressed. The polarity expresses how the target is evaluated. For instance, one can be in favor of $\oplus$ or against $\ominus$ a target.

B) a target towards which the stance is expressed. A target can theoretically be anything that one can favor or oppose, or what can be the subject of a debate. Examples for targets include controversial social issues (e.g. atheism or gay rights), persons of public interest (e.g. politicians), or commercial products and services (e.g. electronic devices). A target can also be a facet of an issue or even a single statement.

Figure 2.1

Example of how two different stance tuples are communicated depending on the target of the debate. The example illustrates that stance can only be interpreted in the context of a given target.

Figure 2.1 exemplifies these two basic building blocks of stance formalisms. The figure also shows how the same message communicates two different stances depending on two targets.
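The target-polarity tuple can be written down as a small data structure. The following is a minimal sketch: the names `Stance` and `Polarity` and the target strings are chosen for illustration and are not part of any annotation scheme discussed here. It encodes the Apples are awesome! example under three debate targets and shows how the three differently worded wind energy utterances become countable once they are mapped to the same tuple.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Polarity(Enum):
    FAVOR = "+"    # in favor of the target (written with \oplus in the text)
    AGAINST = "-"  # against the target (written with \ominus in the text)


@dataclass(frozen=True)
class Stance:
    target: str
    polarity: Polarity


# The same utterance yields a different tuple for each debate target.
utterance = "Apples are awesome!"
annotations = {
    "healthy food": Stance("apples (fruit)", Polarity.FAVOR),
    "computer brands": Stance("Apple (brand)", Polarity.FAVOR),
    "pros and cons of PCs": Stance("PCs", Polarity.AGAINST),
}

# Differently worded utterances can be mapped to one and the same tuple.
wind = Stance("wind energy", Polarity.FAVOR)
labelled = [
    ("We should use wind energy to produce electricity!", wind),
    ("I prefer wind energy over other options.", wind),
    ("Wind energy is the only sane option.", wind),
]
counts = Counter(stance for _, stance in labelled)
print(counts[wind])
```

Making the tuple immutable and hashable (`frozen=True`) is a design choice of this sketch: it allows tuples to be used as keys when aggregating stances over many utterances.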

Stance Detection as Natural Language Processing Task

We argue that systems that are able to automatically detect stance are useful in a variety of applications such as sorting and filtering transcribed speeches according to the taken stance or identifying newspaper articles that support or oppose a certain position. An area where such systems could be especially useful is the social media domain as it is especially rich in opinionated and thus stance-bearing texts. We here understand social media data as any kind of user-generated content which is created and shared via internet-based applications (Kaplan and Haenlein, 2010). These applications include websites for networking (e.g. facebook.com or researchgate.net), micro-blogging (e.g. twitter.com, tumblr.com), as well as platforms for sharing media (e.g. youtube.com, instagram.com and reddit.com) (Barbier et al., 2013, p.1). Besides user-generated content, social media data also comprises the attributes of users and information on user-user and user-content connections (Barbier et al., 2013, p.5).

The need for automatizing social media stance detection is a consequence of the quickly growing amount of social media content. While contained stances may provide valuable insights into public opinion, social dynamics, or customer groups, the amount of social media data makes manual processing hardly possible. Social media users share different types of content. However, when expressing a stance, they still mostly rely on text.

Thus, in this work we cast stance detection as inferring stance from text. The computer science branch that is concerned with automatically processing texts is called natural language processing (NLP). (Other common names for the field are computational linguistics, linguistic computing or language technology. Although each name may emphasize a different focus of research, there is no universally recognized distinction between them and they are often used interchangeably. We here decided to use the term NLP to emphasize the empirical nature of the conducted research.)

We will now describe how the task of stance detection is affected by almost all levels of NLP and discuss the challenges that the social media domain poses along these levels. The purpose of this short description is, on the one hand, to more precisely outline the problem of detecting stance from language data and, on the other hand, to define the terminology for the following chapters. Subsequent to this description, we will introduce the concept of annotation complexity, which will be used to classify and discuss related work on stance detection.

Natural Language Processing in Social Media

Traditionally, the challenges of natural language processing are described using a hierarchy of levels. These levels are phonology and orthography, morphology, syntax, semantics, pragmatics and discourse (cf. Jurafsky and Martin, 2000, p.38). We will now briefly outline these levels, describe how they relate to the automatic detection of stance, and discuss the challenges that arise when working on social media data.

Phonologic and Orthographical Level

The first level deals with translating a language input – written or spoken – into a form that is processable by a computer. On the phonological level, systems process the purely acoustic signal of language (i.e. sound waves). Typical tasks are automatic speech recognition, in which one tries to map sound waves to text, or speech synthesis, which is targeted towards mapping text to acoustic units. As the goal of this work is to analyze social media – in which the vast majority of data is present in written form – we neglect the phonological level. On the orthographical level, systems automatically transcribe the text that can be found on scans, photographs or other pictures into digital text (so-called optical character recognition). Again, we rely on the fact that most of the language in social media is already in a machine-readable form and neglect this step.

Morphological Level

The morphological level is concerned with analyzing meaningful units within the input. These units include words or morphemes – the smallest meaningful or grammatically functional units of linguistics. A word may be composed of several morphemes (e.g. windmill is composed of wind and mill). Tasks on the morphological level include tokenization of input strings into morphemes or words, or normalizations of words to their base forms (called lemmas). The morphological level is important for stance detection as computers need to be able to recognize words in order to decode their semantics in downstream NLP steps, or to handle inflections. For instance, for stance detection it may be necessary to recognize that plural and singular forms (mills vs. mill) point to similar concepts. Spelling mistakes (e.g. omission of whitespaces) and non-standard use of text (e.g. usage of emoticons or URLs) often make the tasks on the morphological level difficult on social media data – our main field of application. For instance, Derczynski et al. (2013) show that the performance of off-the-shelf tokenizers dramatically decreases when applied to Twitter data.
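Why emoticons and URLs are problematic for tokenization can be seen in a small sketch. It contrasts a naive rule that splits text into runs of word characters and single punctuation marks with a rule that additionally keeps URLs, hashtags, mentions and a few emoticons intact. Both patterns and the example message are assumptions made for illustration; they do not reproduce any of the tokenizers evaluated by Derczynski et al. (2013).

```python
import re

# Naive rule: runs of word characters, or any single non-space symbol.
NAIVE_PATTERN = re.compile(r"\w+|[^\w\s]")

# Social-media-aware rule: earlier alternatives take precedence.
AWARE_PATTERN = re.compile("|".join([
    r"https?://\S+",     # URLs are kept as one token
    r"[@#]\w+",          # user mentions and hashtags
    r"[:;=]-?[()DPp]",   # a few emoticons such as :) ;-) :D
    r"\w+(?:'\w+)?",     # words, optionally with an apostrophe
    r"[^\w\s]",          # any other single symbol
]))


def naive_tokenize(text):
    return NAIVE_PATTERN.findall(text)


def social_media_tokenize(text):
    return AWARE_PATTERN.findall(text)


text = "Windmills are awesome :) #windenergy https://example.org/a?b=1"
print(naive_tokenize(text))
print(social_media_tokenize(text))
```

The naive rule breaks the emoticon, the hashtag and the URL into many fragments, whereas the second rule returns each of them as a single token. Missing whitespace, the other problem mentioned above, is not addressed by either rule.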

Syntactical Level

Syntax refers to the rule system with which elementary units (e.g. words or phrases) are combined (e.g. into sentences) in natural language. Consequently, the syntactical level is concerned with matching these rules to the extracted stream of tokens. Typical tasks include the tagging of tokens with word classes (so-called part-of-speech tags) and the detection of grammatical dependencies. Syntax is important for understanding the stance of messages as one, for instance, needs to distinguish between a statement (Apples are awesome) and its negated form (Apples are not awesome). In addition, in a statement such as "I hate that bloated, significantly-worse-than-Clinton, Trump." syntactic knowledge is required to understand that the stance (indicated by the term hate) is expressed towards Trump and not towards Hillary Clinton. The non-standard nature of social media text may negatively affect the performance of tools that are designed to solve syntactic tasks. Examples for NLP tasks that can be solved less well on social media data than on standard texts are part-of-speech tagging (Gimpel et al., 2011; Ritter et al., 2011; Horsmann and Zesch, 2016) and dependency parsing (Foster et al., 2011).

Semantic Level

The semantic level is concerned with the meaning of words, morphemes, sentences or other previously extracted structures. Typically, semantic tasks try to measure the relationship (e.g. similarity) between the meaning representations of these units. In order to solve these tasks, it is often necessary to first resolve the ambiguity inherent in language (e.g. given the context, does the word apples refer to the fruit or the brand?). Stance detection can therefore be regarded as a highly semantic task. Similar to the underlying levels, the performance of tools that solve semantic tasks is often significantly lower on social media data compared to standard text (Ritter et al., 2011; Li et al., 2012).

Discourse Level

The discourse level is concerned with understanding the sequence of multiple messages of several speakers or authors. The discourse level can have a great impact on the perception and understanding of stance. Often, the meaning of utterances can only be understood in the context of utterances of other persons to which they refer. For instance, for understanding the stance that is communicated with I hate them! it is necessary to know if the utterance refers to a preceding one. A characteristic of social media is that different users can interact with each other. As a consequence, social media data contains plenty of especially complex discourses (Forsyth and Martell, 2007).

Pragmatic Level

The pragmatic level incorporates the situational context in which an utterance is made. This context can be of personal or social nature, or shaped by the context of the current debate. Since stance is defined with respect to a given target, stance detection can be considered a highly pragmatic task. Depending on the target, the semantic orientation (whether a word evokes good or bad associations) of the words may also change (Turney, 2002). For instance, the word unpredictable may convey a positive association if the target is a certain book or movie. However, if the target is a car, the word unpredictable induces a negative association. We also need pragmatic knowledge to know that the phrase maximally bloated expresses a negative stance towards Trump, but a positive stance if used to express a stance towards an inflatable boat.

The resolution of implicit expressions is also typically assigned to the pragmatic level. As shown in Figure 2.1, stance can be implicitly expressed by arguing for an entity that is negatively correlated with the target (e.g. arguing for windmills to express a $\ominus$ stance towards nuclear power). Hence, the resolution of implicit expressions and thus the pragmatic level is crucial for stance detection. Since social media data usually comes with rich context, automatic text processing can rely on such signals to improve performance (González-Ibánez et al., 2011).

Annotation Complexity

In the following, we discuss different ways of formalizing stance and how these formalizations differ from formalisms used in related NLP tasks. However, in the present thesis we do not consider all possible theoretical formalizations, but only those which have been applied to data in the form of concrete annotation schemes.

We examine stance and related formalizations according to the building blocks shown above (i.e. polarity and targets). To further characterize them within these blocks, we analyze the complexity of the different annotation schemes. According to the international standard for linguistic annotation (Ide and Romary, 2004; ISO TC37 SC WG1-1), every linguistic annotation consists of the two steps (i) segmentation and (ii) annotation of these segments with linguistic or otherwise informative classes. Thus, we consider the complexity of an annotation as a function of how complex the process of identifying the elementary units of the annotation scheme is (e.g. finding words is less complex than finding syntactic dependencies) and how complex the assignment of classes is (e.g. choosing whether a tweet is positive or negative is less complex than choosing if it is very positive, positive, neutral, negative, or very negative). The segmentation complexity includes whether or not we can reliably utilize heuristics such as a separation according to white spaces (Fort et al., 2012). The complexity of assigning classes includes how large the set of classes is and how ambiguous the assignments are (Fort et al., 2012). If an annotation scheme is more complex, one can distinguish a larger number of different cases. Thus, an annotation scheme's complexity is directly related to its expressiveness. However, complexity is also negatively correlated with inter-annotator agreement (Bayerl and Paul, 2011), as a more complex scheme tends to result in more errors and thus a lower reliability. The intuition behind this negative correlation is that the more possibilities one has, the more errors are possible. Of course, reliability is also affected by various other factors. These factors include the training of the annotators, properties of the data (e.g. social media vs. newswire text), how implicit the communication is and how much interpretation is required.
Note that we do not compare the reliability of annotations directly, as it is difficult to compare different reliability scores with each other and reliability scores do not account for whether the annotators are trained.

Of course, the complexity is also influenced by the syntactic, semantic and pragmatic context in which the annotation is done (Fort et al., 2012; Joshi et al., 2014). The complexity of annotations can be quantified by relying on behavioral cues such as those which are provided by eye-tracking (Joshi et al., 2014; Tomanek et al., 2010). Since we have neither behavioral cues for the annotations nor the full context of all approaches, we will here analytically compare the different annotation schemes based on numeric properties such as the number of classes to assign. Our analytical approach to determining complexity neglects the complexity of the thought processes that are needed for both segmentation and class assignment (e.g. it is easily perceivable that annotating whether a sentence is intelligently formulated is much harder than annotating whether a sentence contains question marks). Nevertheless, in order to be able to compare approaches on measurable criteria, we will build our analysis on the number of possible choices.
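Comparing schemes by the number of possible choices can be sketched in a few lines. The two class inventories are the tweet sentiment examples used above. The function names are hypothetical, and expressing the number of choices in bits (the base-2 logarithm) is an additional convenience of this sketch rather than a measure used in this thesis.

```python
import math

# Class inventories of the two tweet sentiment examples mentioned in the text.
schemes = {
    "binary tweet sentiment": ["POSITIVE", "NEGATIVE"],
    "five-point tweet sentiment": [
        "VERY POSITIVE", "POSITIVE", "NEUTRAL", "NEGATIVE", "VERY NEGATIVE",
    ],
}


def number_of_choices(classes):
    """Number of distinct classes an annotator can choose from."""
    return len(set(classes))


def choice_bits(classes):
    """The same quantity on a logarithmic scale (an assumption of this sketch)."""
    return math.log2(number_of_choices(classes))


# Order the schemes from least to most complex class assignment.
ranking = sorted(schemes, key=lambda name: number_of_choices(schemes[name]))
for name in ranking:
    classes = schemes[name]
    print(f"{name}: {number_of_choices(classes)} classes, "
          f"{choice_bits(classes):.2f} bits")
```

Such a count captures only the size of the class set. As noted above, it says nothing about how ambiguous the assignments are or how demanding the underlying thought process is.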

Polarity

As a first building block, we discuss stance polarity. As described above, stance polarity refers to the judgment the stance expresses towards the given target. A stance annotation scheme has to define which values (or classes of values) the polarity can adopt. The number of possible values directly influences the complexity of an annotation scheme. For instance, if we only allow for the two classes $\oplus$ and $\ominus$ , unarguably, the scheme is less complex compared to a model that also uses a third NONE class. We will now describe the polarity classes that are used in stance research and then describe the greater variety of polarities, which can be found in the field of text polarity classification.

Three-way and Binary Stance

With respect to polarity, we can distinguish stance models based on whether they assume that stance is binary (FOR or AGAINST) or if they also model that messages sometimes do not contain any stance. This distinction also corresponds to whether stance is modelled in data originating from dedicated debating scenarios – in which all participants consciously participate – or in unconstrained debates as they are conducted in social networks. The former works typically formalize stance that is expressed on debate websites that try to enforce structured debates by asking users to label their message as having a FOR or AGAINST stance on the target of the debate (Somasundaran and Wiebe, 2010; Anand et al., 2011; Walker et al., 2012a; Hasan and Ng, 2013; Sridhar et al., 2014). In such constrained debates, there are hardly any contributions that do not relate to the target. Consequently, these works model stance as a binary phenomenon.

Binary annotation schemes that model stance expressed towards another text (which also can be seen as a target, cf. Section 2.3) commonly rely on the classes AGREE and DISAGREE. For instance, Menini and Tonelli (2016) model whether two argumentative texts agree or disagree with each other. Wang and Cardie (2014) and Celli et al. (2016) annotate whether a reply in a discussion forum agrees or disagrees with the post it replies to.

Binary formalisms can also be found in data that comes from constrained debates that occur in the offline world. For instance, Thomas et al. (2006) annotated whether the speeches in congressional floor debates SUPPORT or OPPOSE a given piece of legislation. Similarly, Faulkner (2014) labeled essays with the classes FOR and AGAINST a given target.

In contrast to these works on structured debates, there are approaches that model stance in unconstrained social media debates. The goal of these approaches is to structure a collection of e.g. posts on Twitter. By nature, such a collection of tweets contains data that does not carry any stance or where the stance cannot be inferred. Thus, these works consider an additional NONE class, which allows us to filter out those tweets that do not contribute to the debate. The NONE class was first introduced by the SemEval 2016 task 6 Detecting Stance in Tweets (Mohammad et al., 2016). This model was adopted by subsequent shared tasks, namely the NLPCC Task 4 on stance detection in Chinese microblogs (Xu et al., 2016) and the IberEval shared tasks on stance in tweets on Catalan independence (Taulé et al., 2017, 2018). The same polarity classes are also used in the annotations by Sobhani et al. (2017). Vychegzhanin and Kotelnikov (2017) additionally introduce a class CONFLICT depicting if a social media message contains both $\oplus$ and $\ominus$ stances.
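How the NONE class supports filtering can be shown with a few labelled messages. The messages, their labels and the assumed target (wind energy) are invented for illustration and do not come from any of the cited datasets.

```python
from collections import Counter

# Hypothetical three-way annotations towards the target "wind energy".
labelled_tweets = [
    ("We should use wind energy to produce electricity!", "FOR"),
    ("Wind turbines ruin every landscape.", "AGAINST"),
    ("Traffic is terrible this morning.", "NONE"),
    ("Wind energy is the only sane option.", "FOR"),
]

# Drop everything that does not contribute to the debate.
stance_bearing = [
    (text, label) for text, label in labelled_tweets if label != "NONE"
]

# Quantify the remaining stances.
distribution = Counter(label for _, label in stance_bearing)
print(distribution)
```

In a binary scheme, the third message would have to be forced into FOR or AGAINST, which would distort the resulting distribution.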

To the best of our knowledge, there is only a limited amount of work on more fine-grained stance classification. An example of more fine-grained polarities is the work by Persing and Ng (2016b), which uses the classes AGREE STRONGLY, AGREE SOMEWHAT, NEUTRAL, DISAGREE SOMEWHAT, DISAGREE STRONGLY, and NEVER ADDRESSED for labelling student essays. For consecutive texts, SemEval 2017 Task 8 A models a special notion of stance that aims to detect fake news by using the classes SUPPORT, DENY, QUERY, and COMMENT (Derczynski et al., 2017). Such fine-grained polarities, however, can be frequently found in research on text polarity.

Text Polarity

Text polarity detection refers to approaches that try to measure the polarity – the emotional tone or overall sentiment – that is conveyed by a text. The used polarity distinctions range from binary (POSITIVE vs. NEGATIVE) labels (Hu and Liu, 2004), over psychologically-motivated sets such as the six basic emotions (anger, sadness, happiness, surprise, disgust, and fear; Ekman et al., 1969) (Mohammad and Turney, 2010), to complex linguistically motivated sets as modelled in WordNet (Strapparava et al., 2004).

Whether a whole text conveys a positive, negative, or a neutral sentiment is intensively studied by the Sentiment Analysis in Twitter series of shared tasks that have been conducted within the SemEval framework. So far, the format generated several challenges that all used the same set of polarity classes (POSITIVE, NEGATIVE or NEUTRAL): SemEval 2013 task 2b (Nakov et al., 2013), SemEval 2014 task 9b (Rosenthal et al., 2014), SemEval 2015 task 10b (Rosenthal and McKeown, 2012), SemEval 2016 task 4 a (Nakov et al., 2016), and SemEval-2017 task 4 a (Rosenthal et al., 2017). The same set of polarity classes is used by the TASS 2017 initiative for Spanish tweets (Martínez-Cámara et al., 2017) or by the SAIL initiative for tweets written in the Indian languages Bengali, Hindi, and Tamil (Martínez-Cámara et al., 2017). Other shared tasks such as the Italian SENTIPOLC (Valerio et al., 2014; Barbieri et al., 2016) or the French DEFT (Benamara et al., 2017) do not require that the labels POSITIVE, NEGATIVE or NEUTRAL occur exclusively by allowing a MIXED category.

A more fine-grained sentiment score and thus a more complex annotation scheme is used in the shared tasks TASS 2016, 2015, 2014 and 2013 on detecting sentiment in Spanish tweets (García-Cumbreras et al., 2016; Villena-Román et al., 2015, 2014a,b). In detail, they used six classes (STRONG POSITIVE, POSITIVE, NEUTRAL, NEGATIVE, STRONG NEGATIVE and one additional NO SENTIMENT class). Similarly, SemEval 2015 task 11 uses an eleven-point scale ranging from -5 (disgust) to +5 (extreme pleasure) (Ghosh et al., 2015). A special set of emotions has been used by Pestian et al. (2012) to study the emotional tone of suicide notes (i.e. ABUSE, ANGER, BLAME, FEAR, GUILT, HOPELESSNESS, SORROW, FORGIVENESS, HAPPINESS, PEACEFULNESS, HOPEFULNESS, LOVE, PRIDE, THANKFULNESS, INSTRUCTIONS, and INFORMATION).

In addition to a fixed number of classes, there are also formalizations that express the intensity of emotions through real-valued scores. A continuous scale allows for more fine-grained gradations and thus introduces a higher complexity than a categorical scale. Examples of text polarity annotations on continuous scales include SemEval 2007 task 14 (Strapparava and Mihalcea, 2007), the WASSA-2017 Shared Task on Emotion Intensity (Mohammad and Bravo-Marquez, 2017), and SemEval 2018 task 1 (Mohammad et al., 2018), which quantify the strength of expressed emotions (e.g. JOY, FEAR, SURPRISE) and the polarity orientation (from MOST POSITIVE to MOST NEGATIVE).

In addition to the polarity of cohesive texts, there is also a large body of research on the polarity of single, isolated words. Studying the sentiment or emotional tone of words is closely related to the task of creating and analyzing sentiment lexicons. Efforts on lexicon creation often consider the words in isolation. Practical guidelines often explicitly state that annotators should try to avoid their personal context by imagining how people at large would judge the sentiment of a word (Mohammad, 2016). However, there are approaches on annotating words for being positive, negative, both, or neutral in the context of the sentence in which they occur. For example, the MPQA Subjectivity lexicon is created from the annotations of words in the MPQA corpus (Wilson et al., 2005).

Binary and therefore less complex distinctions between positive and negative opinion words are compiled by Stone et al. (1962) and Liu (2004). The same binary complexity is used by approaches that try to automatically increase the amount of hand-labeled words. These approaches try to infer the sentiment of a word from the context in which it occurs in a corpus. For instance, Turney (2002) considers the collocations with seed words such as EXCELLENT and POOR. Similarly, Hatzivassiloglou and McKeown (1997) compile a list of positive or negative adjectives and then expand this list by assuming that adjectives that are joined by a conjunction (e.g. and) have the same polarity. There are also approaches to creating binary lexicons that reflect domain-specific sentiment (e.g. bullish or bearish in the financial domain (Das and Chen, 2007)).
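
The conjunction heuristic can be illustrated with a minimal sketch. It is a deliberately simplified illustration and not the original procedure of Hatzivassiloglou and McKeown (1997): the function name `expand_lexicon`, the seed words, and the example sentences are assumptions made for this example, and the pattern matches any two words joined by "and" without checking that they are adjectives.

```python
import re
from collections import deque


def expand_lexicon(seeds, sentences):
    """Propagate polarity from seed words across 'X and Y' conjunctions."""
    # Collect undirected links between words that are joined by "and".
    # Simplification: no part-of-speech check is performed.
    links = {}
    for sentence in sentences:
        for left, right in re.findall(r"\b(\w+) and (\w+)\b", sentence.lower()):
            links.setdefault(left, set()).add(right)
            links.setdefault(right, set()).add(left)

    # Breadth-first propagation: a linked word inherits the polarity
    # of the already labeled word it is conjoined with.
    lexicon = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for neighbour in links.get(word, ()):
            if neighbour not in lexicon:
                lexicon[neighbour] = lexicon[word]
                queue.append(neighbour)
    return lexicon


seeds = {"good": "positive", "bad": "negative"}
sentences = [
    "The staff was good and friendly.",
    "A friendly and helpful waiter.",
    "The room was bad and dirty.",
]
lexicon = expand_lexicon(seeds, sentences)
```

In this toy corpus, friendly and helpful inherit the polarity of good, while dirty inherits the polarity of bad.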

Furthermore, there are approaches to creating lexicons that try to model a more fine-grained polarity. These approaches often rely on interviewing groups through offline or online surveys. For instance, researchers have collected fine-grained ratings for words for the dimensions POSITIVE or NEGATIVE (Mohammad et al., 2013), VALENCE, AROUSAL, and DOMINANCE (Bradley and Lang, 1999), as well as PLEASANTNESS, ACTIVATION, and IMAGERY (how well a word brings an image to one's mind) (Whissell, 1989).

Finally, there are approaches on estimating fine-grained word polarity that make use of ontologies that model the meaning of words. Lexicons that can be generated from these ontologies map words to polarity scores between -1 and 1 (Cambria et al., 2010; Gatti et al., 2015) or to numerical scores describing how OBJECTIVE, POSITIVE, and NEGATIVE the words are (Esuli and Sebastiani, 2006).

Targets

Targets are the subject of the expressed stance. Accordingly, a target can be anything that can be evaluated or judged. Targets can be political, ethical, or social issues (e.g. gay rights), people or groups (e.g. Hillary Clinton or a political party), or concrete objects (e.g. a city, a product, or a brand). In addition, one can also formalize a stance towards any kind of statement (e.g. one can be against or in favor of a statement such as We should use windmills to generate electricity). In fact, if one compares the stance tuples { windmills, $\oplus$ } and {We should use windmills to generate electricity, $\oplus$ }, the distinction between statements and issues is blurry, since in a debate on electricity sources these stances are highly correlated.

If we have a formalism that includes a NONE polarity class, one can theoretically derive a stance for every possible target from every possible statement (even if the stance will mostly have a NONE polarity). Since one has to repeat the same annotation process for each given target, the complexity of the schema increases with each additional target. As there is an infinite number of possible targets (given by the infinite number of topics and statements), stance would have an infinitely high complexity. Therefore, for practical reasons, approaches are limited to a fixed number of targets.
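
The growth of the annotation effort can be made concrete with a small sketch. It assumes that every text has to be judged once per target; the function name `annotation_decisions` and the toy inputs are illustrative assumptions.

```python
from itertools import product


def annotation_decisions(texts, targets):
    """List every (text, target) pair that requires one polarity judgement."""
    return list(product(texts, targets))


texts = ["tweet 1", "tweet 2", "tweet 3"]
targets = ["atheism", "abortion"]
decisions = annotation_decisions(texts, targets)
print(len(decisions))  # 3 texts x 2 targets = 6 judgements
```

Under this assumption, each additional target adds one judgement per text, so the effort grows linearly with the size of the target set.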

In this section, we will look at stance detection, target-based and topic-based sentiment analysis, aspect-based sentiment analysis, and argument mining. These formalizations all have in common that they formalize evaluations based on a fixed set of targets. We will discuss the complexity of these approaches based on the size, structure, and origin of these sets. Subsequently, we will take a look at the topical domains of these targets and argue that the discussed approaches are all highly similar and that their different names rather reflect different research traditions.

We distinguish between three stages of set complexity: (i) single targets, (ii) multiple targets, and (iii) nuanced targets.

Single Targets

The least complex form of target sets is that only one stance towards one target is annotated per data instance (i.e. the set contains exactly one annotation per element). In many applications, this low complexity is sufficient. Modelling stance towards a single target corresponds to voting situations (e.g. referendums on social issues), in which one also expresses a pro, contra, or neutral stance towards a single target.

Most previously discussed approaches to stance detection fall into this category (e.g. the shared tasks by Mohammad et al. (2016), Taulé et al. (2017), or Xu et al. (2016)). For this classification, it is irrelevant that most of the datasets cover several targets (e.g. atheism, abortion, climate change, Hillary Clinton, and feminism in the dataset by Mohammad et al. (2016)), since each instance is annotated with stance towards only one target. We argue that these datasets are actually a collection of several sub-datasets (i.e. one atheism stance dataset, one abortion stance dataset, etc.). Approaches that formalize a binary stance in constrained debates (e.g. Anand et al. (2011) or Walker et al. (2012a)) can also be clearly classified as single target approaches, as their target is also defined by the constraints of the debate.

Target and Topic-based Sentiment

There are other annotation schemes than stance that measure the polarity of text expressed towards targets. One group of approaches that also formalize whether a text expresses a polar evaluation (i.e. positive, negative, or neutral) towards a target is called target-based sentiment analysis. Examples of approaches that refer to themselves as target-based or target-dependent sentiment analysis include Dong et al. (2014), who annotated tweets with sentiment towards one given target. Their dataset contains evaluations of the targets bill gates, taylor swift, xbox, windows 7, and google. Similarly, Jiang et al. (2011) collected tweets about five targets (Obama, Google, iPad, Lakers, and Lady Gaga) and manually labeled each tweet with a sentiment towards the given target.

Another group of approaches that formalize evaluations of targets refer to themselves as topic-based sentiment analysis. However, the building blocks are again (i) a text that conveys (ii) a polarity (e.g. positive or negative sentiment) towards (iii) a topic (e.g. a person or a movie). Representatives of this group are O’Hare et al. (2009), SemEval 2016 task 4c (Nakov et al., 2016), and SemEval 2017 task 4b,c (Rosenthal et al., 2017).

Another set of approaches that formalize polar evaluations of targets is obtained directly from (web) platforms on which users can rate products, services, or movies. These formalizations mostly consist of ratings of these targets (e.g. on a scale or as a number of stars) in conjunction with a written review. Since each review in this scheme rates only one item, they are clearly single-target approaches. Popular examples include the MovieLens dataset (star ratings of movies) (Harper and Konstan, 2015), the Yelp dataset (star ratings on a scale from 1 to 5 on various businesses including restaurants and shopping; available from www.yelp.com/dataset/challenge, last accessed November 5th 2018), and the Netflix dataset (star ratings on a scale from 1 to 5 on movies) (Bennett and Lanning, 2007).

The discussion above shows that our formalization of stance (a tuple consisting of polarity and target) can be referred to by several names. While the individual projects may have different focuses, we argue that these names are basically interchangeable.

Multiple Targets

In addition to the approaches that formalize stance towards exactly one target, there are approaches that annotate stance towards multiple targets for each instance. These approaches are more complex and thus more expressive than single-target approaches. This complexity may be necessary if we face a problem in which there are more than two sides (e.g. if there are several candidates or parties in an election) or if we are interested in a more fine-grained differentiation of positions (e.g. if we want to further differentiate the people with a $\oplus$ stance on nuclear power into those who think that safety needs to be improved and those who think that the plants are already safe). Naturally, the increased complexity also has a negative influence on how easily the schemas can be annotated and thus on their reliability.

One of the least complex multiple target stance approaches is to annotate stance towards two targets. Such a dataset is created by Sobhani et al. (2017), who annotate each tweet of their collection with stance towards a pair of two targets. These target pairs are Donald Trump and Hillary Clinton, Donald Trump and Ted Cruz, or Hillary Clinton and Bernie Sanders.

Approaches at the intersection of argument mining and stance detection annotate the overall stance and a set of argument tags. These approaches only differentiate between whether the argument is supported ( $\oplus$ stance) or whether the argument does not occur (NONE stance). Examples of such efforts are Conrad et al. (2012) and Sobhani et al. (2015), who both manually create a set of 15-20 argument tags based on domain knowledge (varying according to the overall target). By grouping reoccurring phrases in their dataset, Hasan and Ng (2013) annotate a set of a similar size.

Other approaches model both the stance that is expressed towards one overall target and the stance towards a set of subordinated or otherwise related targets, which are extracted from collaboratively constructed resources. For instance, Boltužić and Šnajder (2015) annotate forum posts with their overall stance on gay marriage, abortion, Obama, or Marijuana and with stance towards pro and contra arguments (e.g. Abortion is a woman's right). The set of pro and contra arguments was manually extracted from the debate website iDebate.org. For each target, up to nine arguments are extracted. One of the largest target sets so far has been used by Ruder et al. (2018), who label news articles with their stance towards 366 targets. In this study, the authors group the targets, which are derived from DBPedia and frequent Twitter mentions, into the types popular (e.g. Arsenal F.C., Russia), controversial (e.g. Abortion, Polygamy), and political (e.g. Ted Cruz, Xi Jinping).

Aspect-based Sentiment

A similar formalism is used by approaches to aspect-based sentiment analysis, which model what kind of polarity (positive, negative, or neutral sentiment) an author expresses towards an entity (e.g. a camera) and its aspects (e.g. lens, price). While having a stance towards a camera may be a non-standard expression, a sentiment towards an entity is clearly a polar evaluation of it (e.g. $\oplus$ or $\ominus$; Hu and Liu, 2004). Thus, the sentiment on the entity can be seen as the overall stance and the polarity expressed on aspects can be seen as stance towards these aspects.

An early aspect-based sentiment dataset is the one created by Hu and Liu (2004), which consists of reviews of five electronic products in which the sentences have been labeled with a sentiment on aspects from a manually pre-defined list of up to 96 aspects (called product features). Ganu et al. (2009) created a corpus of restaurant reviews in which they annotated about 3400 sentences with sentiment towards the restaurant and an aspect category (food, service, price, ambience, anecdotes, and miscellaneous). This dataset was incrementally expanded and led to a series of SemEval challenges on aspect-based sentiment analysis (Pontiki et al., 2014, 2015, 2016), namely SemEval 2014 Task 4, SemEval 2015 Task 12, and SemEval 2016 Task 5. Note that these tasks had several sub-tasks which explored various facets of aspect-based sentiment. The final SemEval dataset contains reviews and tweets in eight languages (Arabic, Chinese, Dutch, English, French, Russian, Spanish, and Turkish) and seven domains (restaurants, laptops, mobile phones, digital cameras, hotels, museums, and telecommunication). In each domain, each review is annotated with sentiment towards up to 22 entities (e.g. laptop, display, keyboard) and with which of the up to nine attributes of the entity is evaluated (e.g. its price, its usability).

Another recurring initiative on aspect-based sentiment analysis was conducted within the TASS framework (Villena-Román et al., 2014b, 2015; García-Cumbreras et al., 2016; Martínez-Cámara et al., 2017). These challenges repeatedly made use of a collection of tweets on aspects of the political landscape of Spain and tweets collected during a football game between the clubs Real Madrid and F.C. Barcelona. The collection of football tweets (called the social TV corpus) was annotated with sentiment towards about 40 aspects (e.g. the different players or the referee). Each of the tweets about politics (the STOMPOL corpus) was annotated with a sentiment towards one or more of the six aspects economy, health system, education, political party (e.g. speeches, electoral programme, corruption, criticism), and other. In addition, these aspects are linked to one or more entities (the major political parties of Spain).

With SentiRuEval, a similar initiative was carried out for Russian restaurant and car reviews (Loukachevitch et al., 2015). In SentiRuEval, the aspects are defined in general terms for each of the two domains: food, service, interior, and price for restaurants, and driveability, reliability, safety, appearance, comfort, and costs for cars. Blinov and Kotelnikov (2014) use a similar formalization by annotating the sentiment towards the aspects food, service, and interior in Russian restaurant reviews.

Other approaches to aspect-based sentiment include Thet et al. (2010), who label clauses in movie reviews with the sentiment that is expressed towards the movie (overall) and towards the five additional aspects cast, director, story, scene, and music. The dataset of McAuley et al. (2012) contains user reviews on beer, pubs, toys, and audio books, an overall rating, and a rating according to a small number of aspects (e.g. feel, look, smell, taste).

In addition, there are related approaches that describe themselves as opinion analysis. For instance, the NTCIR shared tasks OPINION and MOAT on opinion analysis annotate newspaper articles with a polarity towards sets of up to 32 topics (Seki et al., 2007, 2008, 2010).

Comparing Stance with Aspect-, Topic- and Target-based Sentiment

Sobhani et al. (2017) and Wei et al. (2018) suggest that the difference between stance and aspect-based sentiment is the amount of implicitness that is allowed by the annotation schemes. For instance, the annotation guidelines of Pontiki et al. (2014) strictly exclude inferences and implicitness, while most stance annotation guidelines explicitly include them (Mohammad et al., 2016). However, there is also research on implicitly expressed polarity towards targets that is labeled as sentiment analysis (Greene and Resnik, 2009; Russo et al., 2015; Van de Kauter et al., 2015a,b). Furthermore, distinguishing between implicitly and explicitly expressing a polarity is often a non-trivial decision. While it is clear to most people that we should build windmills contains a more explicit stance towards windmills than wind energy should be promoted more often, it is hard to tell if both utterances are fully explicit.

Another possible decision criterion between stance and related formalisms is the domain (e.g. politics or products) to which the formalizations are applied. To shed light on this criterion, we compared the domains used in works that refer to themselves as either stance, aspect-based sentiment, or target- or topic-based sentiment. We show this comparison in Table 2.1. The distribution shows that aspect-based sentiment is more frequently used in the commercial domain (products, services, and brands), and that stance is more common for political or ethical questions. However, as shown, for instance, by Xu et al. (2016), who include the target phone SE in their stance dataset, there are several outliers to this pattern. Furthermore, we do not find a clear pattern for target- or topic-based sentiment. Note also that, because persons or groups are often closely associated with ethical and political issues or products and services, they are sometimes placeholders for the associated domain. For example, a stance towards Charles Darwin often communicates a stance towards Evolution, as the topic is closely linked to his person. Similarly, expressing a stance towards Steve Jobs highly correlates with expressing a stance towards the brand Apple. Similar correlations are also conceivable between the other domains.

Table 2.1

approach	dataset(s)	ethical or political question	public persons or groups	product, service or brand
stance		✓
		✓		✓
		✓		✓
		✓
		✓
		✓
		✓
		✓	✓
		✓		✓
		✓
		(✓)	✓
		✓
		✓
		✓	✓	(✓)
target- or topic based sentiment				✓
			✓	✓
			✓	✓
			✓	✓
		✓	✓	✓
		✓	✓	✓
aspect-based sentiment				✓
				✓
				✓
				✓
				✓
				✓
		✓	✓
				✓

Targeted domains of stance detection and similar NLP formalisms. The distribution of domains shows that political and ethical issues are more common for stance detection and commercial issues are more likely to be found for aspect-based sentiment. The distribution also shows that this domain bias does not hold consistently.

This analysis shows that domains are also not a sufficient decision criterion for distinguishing the formalizations from each other. We therefore conclude that stance, aspect-based sentiment, or target- and topic-based sentiment are virtually equivalent. Of course, the individual annotation projects have a different focus and may differ from each other.

Nuanced Targets

The majority of the multi-target approaches described above rely on manually defined sets of targets that are derived from domain knowledge or from external sources. With some notable exceptions such as Ruder et al. (2018), target sets that are created in this manner contain fewer than 25 targets.

Depending on the use case, these target sets, which resemble small ontologies, may not be complex enough to model all nuances that are expressed by people. (In most cases, the relationship between the targets within the sets of multi-target approaches can be described using classical semantic relation types such as meronymy, hypernymy, and hyponymy. For instance, in aspect-based sentiment the aspects are often physical parts or attributes of the target entity (e.g. the relation between a laptop and its screen can be described as a meronymy), and in Sobhani et al. (2017) all targets are candidates for the U.S. presidential elections (and thus all targets share the same hypernym).) For instance, in a debate on the legalization of Marijuana, one could face the utterances Marijuana should be legal for adults, Marijuana should be legal for chronic patients, Recreational use of Marijuana should be legal, and The same legal regulations as for alcohol and tobacco should apply for Marijuana. While for most people these utterances represent pro-legalization stances, the utterances all stress different nuances of the legalization of Marijuana and all express a slightly different stance. However, being able to determine the stance of people towards these nuances can be of importance when one, for instance, tries to find a compromise between opposing positions.

Yet, it is a difficult or even impossible task to manually model all nuances of targets that are discussed within a collection of utterances. Therefore, in this section, we discuss approaches that infer target sets from the data to which they are applied. We refer to these approaches as nuanced target stance detection. Among the nuanced target approaches, we distinguish between approaches that model stance towards spans within the texts and stance towards other utterances.

The first group of approaches formalizes stance as agreement or disagreement between several utterances. These utterances may relate to each other within post-reply structures such as forum threads, or are texts that are uttered in the same context. Consequently, the number of targets and hence the complexity of the annotation scheme is correlated with the size and structure of the dataset to be annotated. Abbott et al. (2011) annotate whether posts from 4forums.com agree or disagree with the post to which they reply. Their corpus contains posts about ten controversial issues such as evolution, death penalty, and climate change. The CorEA corpus created by Celli et al. (2014) contains user comments on blog posts that are annotated for agreement, disagreement, and neutrality to the parent posts. Menini and Tonelli (2016) first extract excerpts from transcriptions of discourses and declarations of the U.S. presidents Kennedy and Nixon. Next, they annotate whether these excerpts agree or disagree with each other. In the dataset by Thomas et al. (2006) and Walker et al. (2012b), transcribed speeches from the US Congress are labeled based on whether they approve (YEA) or disapprove (NAY) legislation initiatives. The Internet Argument Corpus by Walker et al. (2012b) contains questions and replies that were posted in a forum. The replies are annotated on an eleven-point scale where minus five indicates full disagreement with the question and five indicates full agreement with the question.

Finally, there are approaches that combine agreement or disagreement between utterances with span-based annotations. In a dataset on Android, iPad, Twitter, and Layoffs (in technology companies) Ghosh et al. (2014) annotate spans of a post as the targets of a stance expressed by a responding post. A comparable scheme was used by Andreas et al. (2012) for discussion threads on LiveJournal and Wikipedia.

The second group of approaches bases its annotations on other annotations such as named entities or the output of syntactic parsers. This pre-annotation significantly contributes to the complexity of the annotation. Thus, compared to approaches that use unprocessed utterances as targets, we consider these approaches – that require pre-annotation – to be more complex. Mitchell et al. (2013) rely on the Spanish and English Twitter dataset of Etter et al. (2013), which is labeled for named entities. They label all sentences in the corpus for the sentiment expressed towards these entities (they refer to their work as target-dependent sentiment). Kobayashi et al. (2006) manually annotate certain named entities (products or companies) and aspects (e.g. ENGINE, SIZE) as spans in a corpus consisting of Japanese weblog posts. Next, they extract opinion tuples consisting of subject, aspects, and evaluation (i.e. polarity). Similarly, the MPQA corpus contains sentiments (POSITIVE, NEGATIVE, and NEUTRAL) towards entities that are annotated as spans (Wiebe et al., 2005; Deng and Wiebe, 2015).

Span-based approaches are also related to approaches that formalize the text polarity of spans. These approaches annotate the polarity of phrases in the context of the sentence in which they occur. The datasets that have been used for SemEval 2013 task 2(A) (Nakov et al., 2013), SemEval 2014 task 9(A) (Rosenthal et al., 2014), and SemEval 2015 task 10(A) (Rosenthal et al., 2015) contain phrases that are annotated based on whether they convey a positive, negative, or neutral sentiment given their context. (Phrases are groups of words that form a syntactical unit within a sentence; a phrase can also be a single word.) Similarly, in the Stanford Sentiment Treebank (Socher et al., 2013), phrases are annotated with sentiment scores ranging from one (VERY NEGATIVE) to 25 (VERY POSITIVE). Other approaches that formalize the polarity of phrases, such as IJCNLP-2017 Task 2 (Yu et al., 2017) or SemEval-2016 Task 7 (Kiritchenko et al., 2016), do not provide the context of the phrases and thus cannot be assigned to the span-based approaches.

Argument Mining

Another group of approaches that is more distantly related to span-based stance detection is called argument mining. The goal of argument mining is to automatically determine argumentative structures within texts (Green et al., 2014). The main difference between argument mining and stance detection is that argument mining not only formalizes if one positively ( $\oplus$ ) or negatively ( $\ominus$ ) evaluates a target, but also how this evaluation is expressed. To quantify this how, argument mining approaches typically perform two subtasks: (i) determining argumentative microstructures such as claims or premises within texts (called argument components or argumentative discourse units) and (ii) determining the relationships of these microstructures (Palau and Moens, 2009; Peldszus and Stede, 2013; Lippi and Torroni, 2016).

What kind of microstructures are modelled and how complex their relationships are depends on which argumentation theory a particular argument mining approach follows. A popular theory is the Toulmin model of argumentation (Toulmin, 1958), which postulates that an argument may contain the following components (the simplified description of Toulmin's argument components is adopted from Habernal et al. (2014)):

Claim: The claim is the assertion for which one argues that it is correct.
Data (Grounds): The evidence that is used to support the claim that is made.
Warrant: The warrant refers to the logical connection between data and claim. This means that the warrant explains why the presented data supports the claim.
Backing: The backing presents information that supports the trustworthiness of the warrant.
Qualifier: The qualifier refers to a statement that weakens or conditions (i.e. qualifies) the argument.
Rebuttal: A rebuttal is a statement which presents a counter-argument or which indicates situations in which the claim is invalid.

Instead of the rather complex Toulmin model, other argument mining approaches rely on simpler models (e.g. based on Freeman (1991)) which postulate that arguments consist of at least one claim and any number (including zero) of premises. Approaches that have implemented a claim–premise scheme are the works of Palau and Moens (2009) and Peldszus and Stede (2013). Persing and Ng (2016a) and Stab and Gurevych (2014) additionally annotate major claims which express the author's stance on the overall topic of the debate. Rinott et al. (2015) model claims and context-dependent evidence, i.e. spans in the text that support the claim in the context of a given topic. The AraucariaDB dataset also contains premise annotations, but uses conclusions (statements that can be concluded from premises) instead of claims (Reed and Rowe, 2004). There are also works that do not formalize argumentation components but rather the quality of arguments (Wachsmuth et al., 2017). For instance, researchers have annotated arguments for their convincingness (Habernal and Gurevych, 2016) and persuasiveness (Persing and Ng, 2017), or differentiated between claims that are subjective or objective (Rosenthal and McKeown, 2012), or qualified or bald (Arora et al., 2009).

Due to the more complex goal of argument mining of formalizing how arguments are expressed, naturally, the annotation schemes are more complex than stance annotations. For example, even in a comparably simple claim–premise scheme, we first need to determine the spans of discourse units, then we need to assign the labels claim and premise to the units, and finally we need to link premises to the claims to which they correspond. Although argument mining as a whole and stance detection are not directly comparable due to different objectives, claims – a component which is contained in most argument mining models – are directly related to (span-based) stance.
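
The three annotation steps of a claim–premise scheme (determining spans, labeling them, and linking premises to claims) can be sketched as a small data structure. This is a minimal illustration under our own assumptions and not the format of any particular corpus; the class names `Unit` and `ArgumentAnnotation` and the example sentence are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    start: int  # character offset where the span begins
    end: int    # character offset where the span ends (exclusive)
    label: str  # "claim" or "premise"


@dataclass
class ArgumentAnnotation:
    text: str
    units: list = field(default_factory=list)
    links: list = field(default_factory=list)  # (premise index, claim index)

    def add_unit(self, span_text, label):
        """Steps 1 and 2: locate the span in the text and assign its label."""
        start = self.text.index(span_text)  # raises ValueError if absent
        self.units.append(Unit(start, start + len(span_text), label))
        return len(self.units) - 1

    def link(self, premise, claim):
        """Step 3: attach a premise to the claim it supports."""
        if self.units[premise].label != "premise":
            raise ValueError("source unit must be a premise")
        if self.units[claim].label != "claim":
            raise ValueError("target unit must be a claim")
        self.links.append((premise, claim))


annotation = ArgumentAnnotation(
    "We should build windmills because wind is a renewable resource."
)
claim = annotation.add_unit("We should build windmills", "claim")
premise = annotation.add_unit("wind is a renewable resource", "premise")
annotation.link(premise, claim)
```

A single-target stance annotation, in contrast, would only store one polarity label for the whole text, which illustrates why the argument mining scheme is the more complex one.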

As a claim is an assertion or statement for which one argues, we can conclude that the person who utters the whole argument considers the claim to be correct and hence has a $\oplus$ -stance towards the claim. Similarly, for counter-arguing we can conclude a $\ominus$ -stance towards the claim against which one argues. The other way around, any { $\oplus$ | $\ominus$ |NONE}-stance on a target may be transformed into a claim of the form I am {IN FAVOR OF|AGAINST|NEUTRAL TOWARDS} target. This intersection of argument mining and stance detection has only been addressed by a few approaches: For instance, some approaches model both a major claim, which expresses the author's stance on the overall topic, and a claim (Persing and Ng, 2016a; Stab and Gurevych, 2014). In addition, Kwon et al. (2007) differentiate between claims that support ( $\oplus$ stance) and claims that oppose ( $\ominus$ stance) the debate topic.
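
The transformation of a stance into a claim described above can be written as a simple template. This is a minimal sketch; the label strings FAVOR, AGAINST, and NONE (standing in for $\oplus$, $\ominus$, and NONE) and the function name `stance_to_claim` are assumptions made for this example.

```python
# Maps a polarity label to the phrase used in the claim template.
TEMPLATES = {
    "FAVOR": "IN FAVOR OF",
    "AGAINST": "AGAINST",
    "NONE": "NEUTRAL TOWARDS",
}


def stance_to_claim(target, polarity):
    """Render a (target, polarity) stance tuple as a first-person claim."""
    return f"I am {TEMPLATES[polarity]} {target}."


claims = [
    stance_to_claim("windmills", "FAVOR"),
    stance_to_claim("nuclear power", "AGAINST"),
    stance_to_claim("atheism", "NONE"),
]
```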

Chapter Summary

In this chapter, we provided an overview of the various formalizations that NLP researchers have used to quantify stance in texts. We identified the two components polarity and target as being central to all stance formalizations. Next, we reviewed related work and discussed the complexity of the underlying annotation schemes with respect to these components. For stance polarity, we find that one can group existing formalizations into binary ( $\oplus$ vs. $\ominus$ ) and three-way ( $\oplus$ vs. $\ominus$ vs. NONE) approaches. We conclude that three-way approaches are better suited for social media data, as the NONE class can be used to filter unavoidable bycatch. Formalizations with a more complex polarity inventory – such as those found in text-polarity research – have hardly been used to formalize stance.

According to the complexity of the target sets used, we distinguish between (i) the least complex single target formalizations, (ii) the more complex multiple target formalizations, and (iii) the most complex nuanced target formalizations. For single target approaches, we find that stance is highly similar to target- and topic-based sentiment. For multi target approaches, we compare stance with aspect-based sentiment. We find that these formalizations are used interchangeably, but that aspect-based sentiment is used more frequently in the commercial domain, while stance is more common for political or ethical questions. For the most complex target sets, we identify approaches which formalize stance that is expressed towards other utterances and stance that is expressed towards text spans. The latter is related to argument mining, which often also includes the identification of claims – statements which one argues to be correct (and towards which one therefore expresses a $\oplus$ stance). We also conclude that span-based stance formalizations are more complex and may potentially suffer from unreliable social media preprocessing.

In the following three chapters, we will examine (i) single target stance, (ii) multi target stance, and (iii) stance towards nuanced targets in more detail. In the next chapter, we will start with the least complex formalization – stance towards single targets.

Stance on Single Targets

In this chapter, we will look at the simplest case of stance detection, which involves predicting the stance towards single, predefined targets. As a concrete use-case we could, for instance, attempt to predict which stance towards wind energy is expressed by a set of tweets we have collected by searching for an appropriate hashtag (e.g. #windEnergy). If one then compares the number of $\oplus$ and $\ominus$ tweets, one can get an impression of whether the majority is for or against wind energy and how clear this vote is. We visualize this use-case in Figure 3.1.

To study single target stance detection, we will first introduce supervised machine learning – the currently dominant paradigm in text classification tasks – and show how supervised machine learning can be used for automatically detecting stance. Next, we will describe our contributions to two of the first shared tasks on stance detection (SemEval 2016 Task 6 and StanceCat@IberEval 2017) as well as our organization of GermEval 2017, a shared task on aspect-based sentiment classification. As we have argued in the previous chapter, in this work, we treat stance and aspect-based sentiment as synonymous. In this chapter, we conduct a series of quantitative experiments to better understand the developed systems and the state of the art of single target stance detection.

Formally, the process of automatically detecting stance on single targets can be seen as the function $F(x,t)=y$ where $t$ is our given target (e.g. nuclear power), $x$ is any input text and $y$ is the predicted stance ( $\in$ { $\oplus$ , $\ominus$ , NONE} or, in the binary case, $\in$ { $\oplus$ , $\ominus$ }). However, the complexity of stance detection (see Chapter 2) makes it impractical, or even impossible, for humans to manually craft the function $F(x,t)=y$ , given that human language is capable of producing infinitely many new words and sentences, and hence inputs $x$ . Thus, in this work, we focus on machine learning to automatically determine the function $F(x,t)=y$ from large amounts of sample data. In Figure 3.2, we visualize the basic workflow of machine learning approaches to stance detection: We start with labeled input data, which is first preprocessed to be usable for a machine learning algorithm. This preprocessing typically involves segmentation of the data (e.g. into words) and vectorization (e.g. translating the instances into a sequence of word vectors). Subsequently, we can use a machine learning algorithm to train a model which can predict labels for unlabelled data. The quality of this prediction can finally be evaluated on other (gold) data.

Figure 3.1

Example of an applied annotation scheme which corresponds to the formalization Stance on Single Targets

Machine learning approaches can be distinguished based on whether they are supervised or unsupervised. Supervised means that the approaches have access to labeled training instances, while unsupervised means that the approaches solely rely on large amounts of relevant, but unlabelled data. In the case of stance detection, a labeled training instance would be a tuple consisting of a text $x$ , a stance label $y$ and the target $t$ . While supervised approaches learn functions that assign labels (i.e. $\in$ { $\oplus$ , $\ominus$ , NONE}) to input instances, unsupervised approaches try to detect inherent similarities or dissimilarities within the data and use them to group the instances. These groups can be seen as generic labels, but unsupervised approaches usually cannot describe what these labels mean except how similar or dissimilar they are to the other labels (or groups). Since in our case we know which classes we are looking for, we focus here on supervised approaches. Supervised machine learning approaches can be further distinguished based on whether they are trained to predict discrete labels (i.e. $\in$ { $\oplus$ , $\ominus$ , NONE}) or continuous scores (e.g. real values). If we want to predict discrete labels (or classes), the approaches are called classification, while if we want to predict continuous scores, the approaches are called regression.

Figure 3.2

Basic overview of the workflow of supervised (stance detection) approaches.

Preprocessing

Due to the complexity of language and stance, we rely on machine learning for creating models that can infer stance from language. However, before we can create such a model, we need to represent data in a way that is both expressive of a tweet's stance and processable for computers. We do this by representing instances through a set of measurable or countable properties – called features. The process of deriving and assigning these features is called feature extraction (Bishop, 2006, p.2). In the following, we describe typical feature extraction procedures used to represent social media data for stance detection. We discuss these steps along the hierarchy of NLP levels introduced in Section 2.1.1. Specifically, we will discuss human-defined approaches to representing stance-bearing texts in a meaningful way. In the subsequent Section 3.2.2, we will discuss methods that try to automatically extract meaningful representations.

In general, a good set of features should discriminate between instances with different class labels (Zheng and Casari, 2018, p.2). This means that, based on the values of the features, $\oplus$ -instances should be more similar to other $\oplus$ -instances than to $\ominus$ -instances.

Ngram Features

One of the simplest ways to represent text is to record whether a particular word occurs or not. However, before we can do this, we first need to identify words in our input. As we have seen in the previous chapter, this can be a non-trivial task in social media. Therefore, researchers typically use social media-specific segmenters that are robust to the usage of emoticons and URLs, or platform-specific phenomena such as @-mentions in Twitter. A popular example of a social media-specific tokenizer is the Twokenizer, which is optimized for segmenting tweets (Gimpel et al., 2011). Instead of using all words to represent instances, several researchers suggested relying on tokens which are especially indicative of stance (Anand et al., 2011; Lai et al., 2017b; Somasundaran and Wiebe, 2010). These tokens may be platform specific (e.g. @-mentions or hashtags on Twitter), or punctuation symbols such as exclamation and question marks.

In the case of stance detection, it seems promising to not just use the occurrence of single words as features, but also sequences of words. For example, in the utterance keep the noisy windmills away!, the sequence noisy windmills seems to be highly indicative of the author's stance towards windmills. These sequences of $n$ consecutive words are called ngrams (Jurafsky and Martin, 2000, p.117). Certain ngrams are commonly named after the size of $n$ (i.e. $n=1$ : unigrams, $n=2$ : bigrams, $n=3$ : trigrams). Usually, ngrams are formed within the boundaries of sentences. For instance, in the example above, we can form the following uni-, bi-, and trigrams:

unigrams: $\{$ keep, the, noisy, windmills, away, ! $\}$
bigrams: $\{$ keep the, the noisy, noisy windmills, windmills away, away ! $\}$
trigrams: $\{$ keep the noisy, the noisy windmills, noisy windmills away, windmills away ! $\}$
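The formation of word ngrams can be illustrated with a minimal Python sketch. The function name `word_ngrams` is hypothetical, and we assume that the utterance has already been split into tokens by a suitable tokenizer:

```python
def word_ngrams(tokens, n):
    """Return all sequences of n consecutive tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Tokens of the example utterance; a real system would obtain them from a
# social media-specific tokenizer rather than from a hand-written list.
tokens = ["keep", "the", "noisy", "windmills", "away", "!"]

unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

print(bigrams)
```

Note that an utterance with fewer than $n$ tokens yields no ngrams of size $n$ in this sketch.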

If one orders ngrams according to their frequency, the distribution follows an inverse power law over most collections of text (Zipf, 1935; Smith and Devine, 1985; Ha et al., 2002). This law is called Zipf's law after the linguist George Kingsley Zipf (1902-1950). In other words, the frequency of the $i$ th most frequent ngram ( $tf(ngram_{i})$ ) is proportional to $1/i$ :

tf(ngram_{i}) \propto \frac{1}{i}.

This means that few ngrams occur very frequently, but most occur infrequently. Thus only a few ngrams make good features as they may reoccur in other instances.

While the exponent of the inverse power law is close to one for unigrams (Zipf, 1935), it gets increasingly smaller for larger $n$ -values (Smith and Devine, 1985; Ha et al., 2002). Hence, longer ngrams are even more sparsely distributed. In practice, ngrams with $n$ >4 are rarely used for stance detection (Mohammad et al., 2016; Xu et al., 2016; Taulé et al., 2017).

Researchers and practitioners often assume that not all ngrams have the same importance for all tasks. For instance, in the tweet I hate windmills for being so loud the trigram for being so is probably less indicative of the author's stance than the trigram I hate windmills. To account for this varying importance of ngrams, words that appear in predefined lists (so-called stop-word lists) are often excluded from consideration, or one conducts a relevance weighting of terms or ngrams. A widely used weighting scheme is the term frequency inverse document frequency (tf-idf) approach, which assumes that words that occur in only a small number of tweets (or other documents) are especially discriminative (Jurafsky and Martin, 2000, p.805). The tf-idf of an ngram $i$ within a document $d$ can be defined as:

tf\textrm{-}idf(i,d)=tf(i,d) \cdot idf(i)

where $tf(i,d)$ indicates how often the ngram $i$ appears in $d$ and where $idf(i)$ is defined as the ratio of the total number of documents in our collection $N$ and the number of documents $n(i)$ that contain the ngram we are looking for:

idf(i)=\frac{N}{n(i)}
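A minimal sketch of this weighting is given below. It assumes tokenized documents and uses the plain ratio of $N$ and $n(i)$ as idf; many implementations additionally take the logarithm of this ratio. All function names and the toy collection are hypothetical:

```python
from collections import Counter


def tf(ngram, document):
    """Number of times the ngram occurs in the (tokenized) document."""
    return Counter(document)[ngram]


def idf(ngram, collection):
    """Ratio N / n(i); returns 0.0 if no document contains the ngram."""
    n_i = sum(1 for document in collection if ngram in document)
    return len(collection) / n_i if n_i else 0.0


def tf_idf(ngram, document, collection):
    return tf(ngram, document) * idf(ngram, collection)


# Hypothetical toy collection of three tokenized tweets.
collection = [
    ["i", "hate", "windmills"],
    ["windmills", "are", "loud"],
    ["i", "like", "solar", "power"],
]

print(tf_idf("hate", collection[0], collection))
print(tf_idf("windmills", collection[0], collection))
```

In this toy collection, hate occurs in only one tweet and therefore receives a higher weight than windmills, which occurs in two.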

In addition, the term ngram is often used for sequences of characters. Character ngrams are formed in exactly the same way as word ngrams (of course with characters instead of words as base units), however, their number is limited by the underlying alphabet. If we use character ngrams in this work, we will explicitly point this out.

Syntactic Features

To represent the stance expressed by a text, it may be advantageous to make use of syntactic knowledge. For instance, in the utterance It's the windmills that are – due to maintenance costs – too expensive!, we will only find very long (and thus sparse) ngrams that contain both windmills and expensive – a word which clearly communicates stance towards windmills. However, syntactic knowledge may tell us that expensive is an adjective that describes the word windmills. Such stance-indicative tuples (i.e. (windmills, expensive)) can be automatically constructed from a dependency parse (Anand et al., 2011; Faulkner, 2014). In addition, part-of-speech (POS) information has been part of the feature repertoire since the beginning of word polarity research (Pang et al., 2002; Turney, 2002). Other syntactic features for representing stance include the occurrence of modal verbs (e.g. can, could, or may), conditional sentences, or negations (Mohammad et al., 2016; Taulé et al., 2017).

Semantic Features

As stance detection involves understanding the meaning of a text, it seems reasonable to incorporate semantic knowledge into our data representation. Usually, semantic knowledge is incorporated by enriching the representations with lexical semantics (the meaning of the contained words) and assigning text polarity scores to the instances.

Among the several ways to include word meaning in the data representations, the usage of word vectors has become the most popular in recent years (Collobert et al., 2011). The underlying assumption of these vectors is that words that share similar contexts (and thus are similarly distributed within a text collection) are semantically similar. One of the first to describe this principle was the philosopher Gottlob Frege: one has to ask for the meaning of words within context, not in isolation (orig.: nach der Bedeutung der Wörter muss im Satzzusammenhange, nicht in ihrer Vereinzelung gefragt werden) (Frege, 1884, p.9). The hypothesis was also famously summarized with the quote: You shall know a word by the company it keeps (Firth, 1957, p.11). For instance, the words electricity and power are highly similar and also often occur in similar contexts (e.g. [...] you should save gas and $\_\_\_$ [...], [...] the quantity of $\_\_\_$ produced by the generator [...]). The word vectors are created in a way that words that share a large number of contexts also have similar vectors.

Traditionally, these vectors are created by filling a matrix (with words as rows and the contexts as columns) with the number of times the word occurs in the contexts (Gabrilovich and Markovitch, 2007; Turney and Pantel, 2010; Baroni et al., 2014). A context can be defined by single words, larger ngrams, but also by the entire document (Turney and Pantel, 2010). The vector of a word is then simply the row vector of the matrix. To quantify the similarity between two words, one can now compare their vectors. While many metrics have been suggested (e.g. Jaccard, Euclidean distance), one of the most prominent ones is cosine similarity (i.e. the normalized dot product between the vectors):

cos(\vec{w_1},\vec{w_2}) =\frac{\vec{w_1} \cdot \vec{w_2}}{\vert \vec{w_1} \vert \cdot \vert \vec{w_2} \vert}

with $\vec{w_1}$ being the row vector of word $w_1$ , $\vec{w_2}$ being the row vector of word $w_2$ , and $\vert \vec{w} \vert$ being the vector length. The resulting score ranges from $-1.0$ (vectors point in opposite directions), over $.0$ (vectors are orthogonal) to $1.0$ (vectors point in the same direction).
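The computation of cosine similarity can be sketched as follows. The count vectors are hypothetical toy values and do not stem from a real word-context matrix:

```python
import math


def cosine(v1, v2):
    """Normalized dot product of two equally long vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm1 * norm2)


# Hypothetical co-occurrence counts of three words over three contexts.
electricity = [4, 2, 0]
power = [2, 1, 0]
windmill = [0, 0, 3]

print(cosine(electricity, power))
print(cosine(electricity, windmill))
```

Since cosine similarity normalizes by the vector lengths, electricity and power receive the maximum score in this example although their absolute counts differ.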

The vectors that are obtained using the described counting procedures are often referred to as sparse vectors since most of their entries are zero. These zeros mean that they contain a large amount of redundant information. In contrast, there are also so-called dense vectors, which almost exclusively have non-zero entries, and which are usually shorter (most approaches use 25-500 dimensions). The entries in dense vectors are usually within a continuous range of values, which means that their entries are real-valued and not integers as in most counting-based approaches.

Approaches for generating dense vectors can be grouped into dimensionality reduction methods and prediction-based methods (Levy and Goldberg, 2014). The goal of dimensionality reduction is to find dimensions within the word-context matrix that have the largest variance. Many dimensionality reduction problems are not exactly solvable and are therefore approximated. A prominent example of dimensionality reduction is latent semantic analysis (LSA) (Deerwester et al., 1990). A more recent approach is GloVe, which relies on co-occurrence ratios of words within a word-context matrix to create dense word vectors (Pennington et al., 2014).

Prediction-based methods for generating dense vectors train neural networks that solve word prediction tasks. After the training, they use an inner state of the neural nets to derive the vectors. One of the first prediction-based approaches is word2vec, which uses two different shallow neural network architectures (skipgram: predict context from words and CBOW: predict words from context) to generate dense vectors (Mikolov et al., 2013a). Charagram (Wieting et al., 2016) and FastText (Bojanowski et al., 2017) are prediction-based methods that incorporate character ngram features in their training. There is also work on creating sense-specific word vectors (Neelakantan et al., 2014; Guo et al., 2014) or fitting them to knowledge sources (Faruqui et al., 2015).

To aggregate word vectors for representing whole instances, one usually concatenates or averages the vectors (Mitchell and Lapata, 2010; Le and Mikolov, 2014). There are also approaches that weight each vector (e.g. using tf-idf) or that use prediction-based methods to create the vectors for whole sentences or paragraphs (Le and Mikolov, 2014).
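As an illustration of the averaging strategy, the following sketch computes the dimension-wise mean of a list of word vectors. The function name and the two-dimensional toy vectors are assumptions made for this example; real word vectors typically have far more dimensions:

```python
def average_vectors(vectors):
    """Dimension-wise mean of a non-empty list of equally long vectors."""
    dimensions = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dimensions)]


# Hypothetical two-dimensional vectors of the two words of an instance.
instance_vectors = [[1.0, 2.0], [3.0, 4.0]]

print(average_vectors(instance_vectors))
```

In contrast to concatenation, averaging yields a representation of fixed length regardless of how many words an instance contains.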

Another popular choice for representing data for stance detection tasks is text polarity features. Specifically, almost half of the submissions to the first SemEval shared task on stance detection relied on established word polarity lexicons (Mohammad et al., 2016). There are also approaches that create task- or data-specific sentiment lexicons (Somasundaran and Wiebe, 2010). A simple strategy to derive features for a whole instance is to average the polarity scores of all words that are contained in the utterance. There are also more complex aggregation strategies that use context (e.g. negations) to adjust the polarity scores of words (Polanyi and Zaenen, 2006; Choi and Cardie, 2009). Other approaches use readily available sentiment tools (e.g. the tools of Socher et al. (2013) or Kiritchenko et al. (2014)) to directly assign sentiment scores to the whole utterance (Walker et al., 2012a; Habernal and Gurevych, 2016).

Pragmatic Features

As shown in Chapter 2, the formalization of stance depends on a given, predefined target. To model this dependence, several researchers have explored the idea of creating features that represent targets. For instance, Augenstein et al. (2016) concatenate word vector(s) that represent the target with the word vectors that represent the instances (e.g. if the target is atheism they add the word vector of atheism to the representation of the instances). Similarly, Du et al. (2017) and Zhou et al. (2017) translate input instances into a sequence of word vectors and an additional target representation. Lai et al. (2017b) used a set of features that model knowledge on the targets Hillary Clinton and Donald Trump. For instance, they create a list to capture the political party or party colleagues of Trump and Clinton.

Other approaches make use of situational context of social media. Walker et al. (2012a) and Sridhar et al. (2014) derive features from the structure of the debates. For instance, they model whether consecutive speakers agree or disagree with each other. Other features that are commonly used to describe the social media context include the number of replies a post has attracted, whether or not a post is the final post in a thread, the number of followers an author has, whether the author has a verified profile, or what time the posting is made (Artzi et al., 2012; Romero et al., 2013; Wei et al., 2016b).

Learning Algorithms

Now that the instances have a computationally processable form, we can try to determine a function that maps the vectorized data to labels. One of the easiest ways to come up with such a function is to define rules that map certain feature manifestations to labels. For instance, one could first count all tokens that are tagged with a positive sentiment in a variable #positive and all tokens that are tagged with a negative sentiment in a variable #negative. Next, one could define a rule that assigns the class $\oplus$ if #positive > #negative, $\ominus$ if #positive < #negative, and NONE if #positive = #negative. This simple rule models the intuition that texts that are in favor of the target are more likely to have a positive tone and texts that are against the target are more likely to be written in a negative tone. However, this rule ignores the fact that the subject of the text does not have to be the target, but can even be a target that is associated with the other side of the debate. For example, it might be the case that our target is NUCLEAR POWER, but the text is about WINDMILLS. This may require us to reverse our rule (#positive > #negative: $\ominus$ and #positive < #negative: $\oplus$ ). Such a flip could be modelled as an additional rule. However, the number of additional targets may be large (or infinite), and as these additional targets may have complex relationships to the target of interest, many shifts other than a complete reversal are possible. Due to the productivity of language, each target can also have a (theoretically infinitely) large number of surface forms (i.e. target-indicative words or phrases). Moreover, all kinds of language phenomena such as negations, comparisons, or irony and sarcasm potentially influence our simple rule. This example shows that even such a simple intuition quickly needs a set of rules that can be difficult to formulate manually. In the case of stance detection, the problem is even more fundamental, as there are hardly any general intuitions and the underlying linguistic, psychological, and philosophical mechanisms are not fully understood.
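The counting rule discussed above, including its reversal, can be sketched in a few lines. The miniature lexicon, the function name, and the label strings FAVOR and AGAINST (standing in for $\oplus$ and $\ominus$ ) are assumptions made for this illustration:

```python
# Hypothetical miniature polarity lexicon; real lexicons are far larger.
POSITIVE = {"great", "clean", "love"}
NEGATIVE = {"noisy", "hate", "expensive"}


def rule_based_stance(tokens, reverse=False):
    """Apply the counting rule; reverse=True flips the predicted polarity."""
    n_positive = sum(1 for token in tokens if token in POSITIVE)
    n_negative = sum(1 for token in tokens if token in NEGATIVE)
    if n_positive == n_negative:
        return "NONE"
    favor = n_positive > n_negative
    if reverse:
        favor = not favor
    return "FAVOR" if favor else "AGAINST"


tweet = ["i", "hate", "noisy", "windmills"]

# Stance towards the subject of the text (windmills).
print(rule_based_stance(tweet))
# Reversed rule, e.g. if the target of interest is nuclear power.
print(rule_based_stance(tweet, reverse=True))
```

The sketch makes the limitation visible: whether the rule has to be reversed is an additional input that the rule itself cannot determine.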

In contrast to this deductive approach, functions can also be induced from observations. That means we infer such a function from labeled instances by using machine learning. In this work, we apply two families of machine learning approaches. These families can be distinguished based on the representation on which they operate: (i) classical machine learning approaches that utilize features which have been (manually) engineered to separate the data regarding the label to be learned (cf. Section 3.1) and (ii) neural network-based approaches that make use of simpler data vectorizations (usually word vectors) and try to automatically derive suitable representations of the instances.

Figure 3.3

Fitting of linear or logistic, univariate functions to training instances. The curves in this example were fitted to the data using the method provided by Lenth (2016).

Machine Learning with Feature Engineering

Curve Fitting

As stated above, machine learning can abstractly be described as automatically determining a function that maps feature values onto labels (either classes in classification or real-valued scores in regression) based on a set of training examples. Once we have determined such a function, we can apply it to new data and automatically predict the label.

One of the simplest ways to determine such a function is to define a set of functions (e.g. all univariate, linear functions of the form $f (x) = ax + b$ ) and to select the one parametrization that has the best fit with the examples. This family of algorithms is also called regression, but as we will see below, it can be generalized to classification problems. For linear functions, this means that we search for those parameters $a$ and $b$ that result in the lowest error on the given data. By error we here understand the difference between the scores that are predicted when feeding the training instances to $f(x)$ and the scores with which they are actually labeled (called the gold labels). As in practice most relationships (and feature spaces) are not univariate, it may be necessary to generalize the equation to multivariate linear functions:

y=\sum \limits_{j=1}^{J}w_j x_{i,j}+w_0

In order to formalize the error and to explicitly include or exclude the distance between the predictions and the gold values, an error function $e$ is usually used to weight this loss:

loss_{multivariate}(w_{j=1}...w_{j=J},w_0)=\sum \limits_{i=1}^{N}e(y_i-(w_0+\sum \limits_{j=1}^{J}w_j\cdot x_{i,j}))

Common error functions are the quadratic loss, the logarithmic loss or the mean absolute loss.
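For the univariate linear case with a quadratic loss, the best-fitting parameters can be computed in closed form. The following sketch illustrates this under these assumptions; the function names and the toy data are hypothetical, and other error functions generally require iterative optimization instead:

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the summed squared error of f(x) = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    b = mean_y - a * mean_x
    return a, b


def quadratic_loss(a, b, xs, ys):
    """Summed squared difference between gold scores and predictions."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Toy training instances generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(xs, ys)
print(a, b, quadratic_loss(a, b, xs, ys))
```

Because the toy data lie exactly on a line, the fitted parameters recover the generating function and the loss vanishes; with noisy data, the loss of the best fit would be greater than zero.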

Many real-world problems are not linear in nature. However, the described fitting approach works with any base function. A popular choice is logistic regression, which is defined as:

\sigma(x)=\frac{1}{1+e^{-x}}

Figure 3.3 exemplifies how both a linear and a logistic function are fitted to the same training instances.

If we want to use a fitted curve for a regression task, we can simply insert our variables (i.e. our features) and use the output as the prediction. However, if we want to apply it to a classification task, we need to discretize the space of output values. In the case of a logistic regression we can interpret the output values as the probability of one of the asymptotes. This means that in the simplest case of a logistic function, we have the asymptotes zero and one and thus can interpret the output as the probability of $y = 1$ . Hence, an output value of $.7$ means a 70% probability of class $1$ and a 30% probability of class $0$ . Analogously, an output value of $.3$ means a 30% probability of class $1$ and a 70% probability of class $0$ . According to maximum likelihood estimation (i.e. choosing the estimate under which the observed data seems most plausible), for classifying an instance, we can now simply predict $1$ for values $\geq 50\%$ and $0$ otherwise. The probability estimation can be generalized to a multinomial logistic regression with more than two outcome classes using:

p(y=j|w_{j=1}...w_{j=J},w_0)=\frac{e^{y_j}}{\sum\nolimits_{n=1}^N e^{y_n}}

for all classes $j \in \{1, \dots, N\}$ , with $N$ being the number of classes and $y_j$ being the linear output computed for class $j$ .

This transformation of logistic regression to a classification is ultimately based on a threshold (e.g. .5 in the example above). Such threshold-based approaches can also be used for other functions (e.g. linear, polynomial). However, the outlined probabilistic interpretation is hardly possible for functions that do not behave asymptotically.
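Both the threshold-based binary decision and the generalization to several classes can be sketched as follows. The function names are hypothetical, and the scores passed to the functions are assumed to be the outputs of already fitted linear functions:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify_binary(score, threshold=0.5):
    """Predict class 1 if the estimated probability reaches the threshold."""
    return 1 if sigmoid(score) >= threshold else 0


def softmax(scores):
    """Turn one linear output per class into a probability distribution."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(classify_binary(0.8), classify_binary(-2.0))

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities.index(max(probabilities)))
```

Subtracting the maximum score before exponentiating is a common numerical precaution and does not change the resulting probabilities.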

Support Vector Machines

The basic idea of support vector machines (SVMs) is to determine a hyperplane that is able to optimally split the training instances of two classes (Bishop, 2006, p.326). More specifically, SVMs perform this split by constructing the one hyperplane that has the maximum margin to the training instances (Flach, 2012, p.211). They do not rely on all input vectors, but only on those closest to the hyperplane to be constructed (Flach, 2012, p.211). If the classes are not linearly separable in the given vector space, SVMs map the training vectors into a higher-dimensional space and try to construct the hyperplane there. Thereby, SVMs are able to learn complex non-linear functions, but are still interpretable as geometric decision boundaries in a high-dimensional feature space (Hearst et al., 1998). Furthermore, by using the so-called kernel trick, they do not rely on computations in that high-dimensional feature space and are thus computationally efficient (Bishop, 2006, p.326).

Figure 3.4

The general idea of an SVM: A hyperplane is trained to separate two classes (green stars vs. purple circles). The hyperplane is based on the margin $\frac{2m}{||w||}$ and the support vectors $\vec{s_1}$ , $\vec{s_2}$ and $\vec{s_3}$ . The Figure is adapted from (Flach, 2012, p.212; Figure 7.7)

We will now describe the different ideas of the SVM algorithm in more detail: If we want to classify the classes $\oplus$ and $\ominus$ , we are looking for a hyperplane that linearly separates instances which are labeled accordingly and which are represented in an $n$-dimensional feature space. This hyperplane can be specified using the following formula:

w^T x + b=0

with $w$ being the $n$-dimensional normal vector of the hyperplane, which determines its orientation in the $n$-dimensional vector space, and the scalar $b$ determining the shift of the hyperplane from the origin.

For classifying an instance (or the corresponding vector) as either $+1$ or $−1$ , we can simply determine if the vector lies above or below the hyperplane:

f(x) = g(w^T x + b)

with $g(z)$ being a function that assigns the classes $\oplus$ and $\ominus$ :

g(z)= \begin{cases} \oplus, & \text{if}\ z \ge 0 \\ \ominus, & \text{otherwise} \end{cases}

For a given dataset of training instances (labeled as $\oplus$ or $\ominus$ ), we now try to determine a hyperplane that has a maximal margin between all instances labeled as $\oplus$ and all instances labeled as $\ominus$ (Bishop, 2006, p.326). This margin, which is inversely proportional to $||w||$ , can be defined as the distance between the support vectors and the hyperplane. To separate the classes, the hyperplane must make sure that all instances $x$ that are labeled as $\ominus$ satisfy $w^T x + b < 0$ and all instances $x$ that are labeled as $\oplus$ satisfy $w^T x + b > 0$ . We can now determine the optimal hyperplane that fulfils the aforementioned constraint by finding the $w$ with minimal norm:

\min_{w}{ \frac{1}{2}{||w||}^2}

Figure 3.4 demonstrates the creation of a hyperplane (a straight line) in a two-dimensional feature space based on a maximum margin that is constructed using the support vectors $\vec{s_1}$ , $\vec{s_2}$ and $\vec{s_3}$ .
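How a given hyperplane is used for classification can be sketched as follows. The hyperplane in this example is hand-picked for illustration and not the result of SVM training; the function names and the label strings FAVOR and AGAINST (standing in for $\oplus$ and $\ominus$ ) are likewise assumptions:

```python
import math


def decision_value(w, x, b):
    """Compute w^T x + b for one instance."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def classify(w, x, b):
    """Assign the class depending on the side of the hyperplane."""
    return "FAVOR" if decision_value(w, x, b) >= 0 else "AGAINST"


def distance_to_hyperplane(w, x, b):
    """Geometric distance of x to the hyperplane w^T x + b = 0."""
    norm = math.sqrt(sum(w_j * w_j for w_j in w))
    return abs(decision_value(w, x, b)) / norm


# Hypothetical, hand-picked hyperplane x_1 + x_2 - 3 = 0 (not a trained model).
w, b = [1.0, 1.0], -3.0

print(classify(w, [3.0, 2.0], b), classify(w, [0.0, 1.0], b))
print(distance_to_hyperplane(w, [3.0, 2.0], b))
```

Training an SVM amounts to choosing $w$ and $b$ such that the smallest of these distances over all training instances becomes as large as possible.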

Tackling Non-linearity

Figure 3.5

Tackling non-linearly separable data using a soft margin (Subfigure (c)) and a mapping into a higher dimensional space (Subfigure (d)). Subfigure (c) is adapted from (Bishop, 2006, Figure 7.3; p. 332) and Subfigure (d) is adapted from (Bishop, 2006, Figure 7.14; p. 225).

So far, we have shown how SVMs create a classifier by constructing a hyperplane that linearly separates our training instances. However, many real world problems are not simply linearly separable. See Figure 3.5(a) and 3.5(b) for classification problems that are not linearly separable.

There are two mechanisms that are used to deal with such data: (i) introducing a slack variable $\xi$ that allows that individual points may violate the margin and (ii) transforming the data into a higher-dimensional feature space, where the classes may be separable by a hyperplane.

If an SVM makes use of the slack variable $\xi$ , it is called a soft margin SVM (as opposed to hard margin SVMs, which do not use $\xi$ ) (Flach, 2012, p.216). With a slack variable, all instances that are labeled as $-1$ have to satisfy $w^T x + b \leq -1 + \xi$ and all instances that are labeled as $1$ have to satisfy $w^T x + b \geq 1 - \xi$ with $\xi \geq 0$ . Hence, in the case of instances that violate the margin, the punishment is proportional to the extent of the violation. In Figure 3.5(c) we visualize how a soft margin SVM separates a non-linearly separable problem with a hyperplane. The resulting error is quantified by the slack variable $\xi$ . Note that the slack also helps to learn a function that is less strongly influenced by outliers and can thus potentially generalize better to unseen data (Flach, 2012, p. 216).
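The slack that an individual instance requires can be computed directly. The sketch below assumes the common formulation in which labels $y \in \{-1, +1\}$ are used and each instance has to satisfy $y \cdot (w^T x + b) \geq 1 - \xi$ ; the hyperplane is again hand-picked and hypothetical:

```python
def slack(w, b, x, y):
    """Smallest xi >= 0 with y * (w^T x + b) >= 1 - xi, for y in {-1, +1}."""
    value = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return max(0.0, 1.0 - y * value)


# Hypothetical, hand-picked hyperplane x_1 + x_2 - 3 = 0 (not a trained model).
w, b = [1.0, 1.0], -3.0

print(slack(w, b, [3.0, 2.0], 1))   # outside the margin: no slack needed
print(slack(w, b, [2.0, 1.5], 1))   # inside the margin, but correctly classified
print(slack(w, b, [1.0, 1.0], 1))   # on the wrong side of the hyperplane
```

An instance thus needs a slack between zero and one if it lies within the margin on the correct side of the hyperplane, and a slack greater than one if it is misclassified.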

Even by using a slack variable, some problems remain difficult to separate (e.g. see Figure 3.5(b)). In these cases, we use a mapping function $Φ(x)$ to transform the present vector space into a higher-dimensional space, in which the problem may become linearly separable (Flach, 2012, p.225). This mechanism is visualized in Figure 3.5(d). However, performing both the mapping and the determination of the optimal hyperplane is computationally expensive. Thus, in practice, one uses the so-called kernel trick.

The kernel trick is based on describing the hyperplane as a series of dot products such as $x_1 \cdot x_2$ . After a transformation into a higher-dimensional vector space, we would now have to calculate the dot products $\Phi(x_1) \cdot \Phi(x_2)$ , which can be computationally expensive. Hence, instead of the mapping function $\Phi(x)$ , we define a kernel $k(x, y)$ , which behaves like the dot product in the space to be mapped to (Flach, 2012, p.225). Using this trick, we do not need to expensively transform vectors into higher-dimensional spaces and calculate their similarity there, but can directly use the kernel function. Common kernels are: (i) polynomial kernels $k(x,y)=(x \cdot y)^d$ , (ii) radial basis function (RBF) kernels $k(x,y)=\exp(-\frac{||x - y||^2}{2\sigma^2})$ and (iii) sigmoid kernels $k(x,y)=\sigma(x \cdot y)$ .
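That a kernel behaves like a dot product in a higher-dimensional space can be verified for the polynomial kernel with $d=2$ and two-dimensional input, for which the explicit mapping is known. In the sketch below, the function names are hypothetical, and the RBF kernel is written in its common Gaussian form:

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def phi(x):
    """Explicit degree-2 feature map for two-dimensional input."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


def polynomial_kernel(x, y, d=2):
    return dot(x, y) ** d


def rbf_kernel(x, y, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


x, y = [1.0, 2.0], [3.0, 0.5]

# Both computations yield the same value (up to floating point rounding),
# but the kernel never constructs the three-dimensional vectors.
print(polynomial_kernel(x, y), dot(phi(x), phi(y)))
print(rbf_kernel(x, x))
```

For the RBF kernel, the corresponding feature space is infinite-dimensional, so evaluating the kernel is the only practical way to compute the dot product in that space.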

There are several ways to generalize support vector machines to multi-class classification problems (Bishop, 2006, p.338). The most common strategy is to train an individual classifier for each class which separates the class from the other classes. Having executed this one-vs-all classification for all classes, one can derive a final decision by following the prediction of the SVM that produced the largest margin.
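The one-vs-all decision can be sketched as follows. We assume that one linear classifier per class is given as a pair of weight vector and offset, and we use the decision value $w^T x + b$ as a stand-in for the margin; all parameters are hand-picked, hypothetical values rather than trained models:

```python
def decision_value(w, x, b):
    """Compute w^T x + b for one instance."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def one_vs_all_predict(classifiers, x):
    """Return the label whose binary classifier yields the largest value."""
    scores = {label: decision_value(w, x, b)
              for label, (w, b) in classifiers.items()}
    return max(scores, key=scores.get)


# Hypothetical per-class hyperplanes (w, b); not the result of training.
classifiers = {
    "FAVOR": ([1.0, 0.0], -1.0),
    "AGAINST": ([-1.0, 0.0], -1.0),
    "NONE": ([0.0, 1.0], -1.0),
}

print(one_vs_all_predict(classifiers, [3.0, 0.0]))
print(one_vs_all_predict(classifiers, [0.0, 3.0]))
```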

Machine Learning with Neural Networks

The other family of learning algorithms we are using in this work can be classified as neural network approaches. The basic idea of these learning algorithms is to not be dependent on manual feature engineering, but to discover suitable representations of instances by themselves (Goodfellow et al., 2016, p.4).

Artificial Neuron

Figure 3.6

A single artificial neuron that receives input from the bias $b$ , and $x_1$ … $x_n$ . Each input has a specific weight $w_1$ ... $w_n$ . The output is calculated by putting the sum of the weighted inputs through an activation function g such as sigmoid or tanh.

Neural networks comprise so-called artificial neurons (or perceptrons) (Rosenblatt, 1958). Artificial neurons are models of real, biological neurons – nerve cells which occur in almost all multicellular animals. Although inspired by biological neural networks, artificial neural networks are also strongly influenced by best practices in engineering and machine learning. We do not want to give the impression that today’s artificial neural networks are learning or thinking in the same way as living beings.

Biological neurons receive a signal from receptors or other neurons. If this signal exceeds a certain threshold value, the neuron fires, i.e. the neuron transmits a signal itself. Figure 3.6 shows how artificial neurons mimic this process: The neuron receives the input $x_1$ ... $x_n$ and computes a single output. As not all input is equally important, artificial neurons additionally introduce the parameters $w_1$ ... $w_n$ to weight the input. To increase or decrease the probability to fire (or to transmit a certain output) independently of the input, artificial neurons are equipped with a fixed bias input $b$ , which is also weighted by a weight $w_0$ .

The output is finally calculated by putting the weighted sum of all inputs and the bias through an activation function. Activation functions transform a continuous range of values (i.e. the weighted input of a neuron) into a different (usually smaller) number space. An exception to this is the linear activation function, which returns its input unchanged and is therefore often called the identity function. Closest to the biological model is the Heaviside step function (named after the British mathematician Oliver Heaviside), which returns one if the input exceeds a certain threshold and zero otherwise. The sigmoid activation function transforms the input values into a space between zero and one:

\sigma(x)=\frac{1}{1+e^{-x}}

Similarly, the hyperbolic tangent transforms the input values into a space between minus one and one:

tanh(x)=\frac{(e^{x}-e^{-x})}{(e^{x}+e^{-x})}

Thereby, activation functions reduce the influence of high negative or positive values without completely discarding them. An activation function that combines non-linearity and the identity function is the rectifier (Glorot et al., 2011). A neuron that uses the rectifier as an activation function is called a rectified linear unit (ReLU). The rectifier is defined as:

rec(x)={\begin{cases}0&{\text{for }}x<0\\x&{\text{for }}x\geq 0\end{cases}}
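The following minimal sketch implements a single artificial neuron together with the activation functions described above. The input values, weights, and the threshold default of the step function are assumptions chosen purely for illustration.

```python
import math

def sigmoid(x):
    """Squashes x into the range between zero and one."""
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    """Returns zero for negative inputs and the input itself otherwise."""
    return x if x >= 0 else 0.0

def heaviside(x, threshold=0.0):
    """Returns one if x exceeds the (assumed) threshold and zero otherwise."""
    return 1.0 if x > threshold else 0.0

def neuron_output(inputs, weights, bias, bias_weight, activation):
    """Weighted sum of the inputs plus the weighted bias, passed through g."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias_weight * bias
    return activation(weighted_sum)

inputs = [0.5, -1.0, 2.0]      # x_1 ... x_n
weights = [0.4, 0.3, -0.2]     # w_1 ... w_n
for g in (sigmoid, math.tanh, rectifier, heaviside):
    print(g.__name__, neuron_output(inputs, weights, 1.0, 0.1, g))
```

Swapping the activation function changes only the last step, which is why the choice can be treated as an experimental parameter.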

Figure 3.7

Commonly used activation functions.

Figure 3.7 shows the outputs of the described activation functions. While there are reasons for the usage of certain activation functions (e.g. the derivative of the rectifier is computationally less intensive than that of the sigmoid), the choice of the activation function often remains an empirical question (Glorot and Bengio, 2010; Glorot et al., 2011). In practice, this means that one experiments with different activation functions and uses the one that yields the most satisfactory result.

Multilayer Perceptron and Deep Neural Networks

Figure 3.8

Simple multilayer perceptron with one input layer with five nodes, one hidden layer with four nodes and one output layer with three nodes. Figure adapted from (Rumelhart et al., 1986, p.318, Figure 1).

With artificial neurons as the basic building block of artificial neural networks, we now discuss how one can construct neural network architectures.

Network architectures are often described by a sequence of layers. Layers consist of (network) nodes, which are mostly neurons, but sometimes they perform other operations (e.g. fixed transformations). Unless otherwise stated, layers are always fully connected – i.e. all nodes of a layer are connected to all nodes of the layer’s neighbours. Fully connected layers are also often called dense layers (as they are densely-connected).

One of the simplest network architectures is the so-called multilayer perceptron. A multilayer perceptron has three different types of layers (further below we will describe some more advanced layers as well): the input layer, which builds the interface between the input instances and the network, the intermediate hidden layer, and finally the output layer, which returns the result of the network’s calculation. Except for the input layer, each layer’s nodes are neurons. We show an example of this architecture in Figure 3.8.
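A forward pass through such a network can be sketched in a few lines. The layer sizes, the randomly drawn weights, the tanh activation in the hidden layer, and all helper names are assumptions made for illustration; the bias term of each node is folded into a single value per node.

```python
import math
import random

def dense_layer(inputs, weights, biases, activation):
    """Fully connected layer: every node sees every value of the previous layer."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out, rng):
    """Randomly initialised weights and biases for a layer with n_out nodes."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

rng = random.Random(0)
hidden_w, hidden_b = random_layer(5, 4, rng)   # input layer -> hidden layer
output_w, output_b = random_layer(4, 3, rng)   # hidden layer -> output layer

instance = [0.1, 0.7, -0.3, 0.0, 1.2]          # one input value per input node
hidden = dense_layer(instance, hidden_w, hidden_b, math.tanh)
output = dense_layer(hidden, output_w, output_b, lambda a: a)
print(len(hidden), len(output))
```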

Hornik (1991) shows that a multilayer perceptron with only one input, one hidden and one output layer which uses a bounded activation function (i.e. the set of possible outcome values has a lower and an upper bound) can approximate any function. However, in practice, one often uses architectures with several hidden layers to learn more abstract representations. These hopefully deep representations learned using deep architectures are the reason why learning with neural networks is often referred to as deep learning (Glorot and Bengio, 2010).

Classification and Regression with Neural Networks

So far we have seen how neural networks convert an input into an output. Next, we discuss how neural networks can perform regressions or classifications.

In the case of regression we simply need an output layer consisting of a single node. If this layer is equipped with a linear activation, the network can produce every possible output value, given that the weights are set correctly. For classification, we use an output layer with the same number of nodes as we have classes. For instance, if we want to predict a three-way stance (classes: $\{\oplus, \ominus, \text{NONE}\}$ ), we need an output layer with three nodes (as shown in Figure 3.8). Now, we can assign each class to one node and interpret the activation of this node as activation for this particular class. This activation can be converted into a concrete classification decision by a so-called softmax. Softmax maps an $n$ -dimensional vector of real numbers $\vec{a}$ into an $n$ -dimensional vector of real numbers $\vec{a'}$ in which each component is within the value range from zero to one and all components add up to one. For each output neuron $o$ the softmax operation is defined as:

softmax(a_o)=\frac{e^{a_o}}{\sum\nolimits_{n=1}^N e^{a_n}}

with $N$ being the number of classes.
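The softmax operation and the resulting classification decision can be sketched as follows. The class names and activation values are hypothetical; subtracting the maximum before exponentiating is a common numerical safeguard that leaves the result unchanged and is an addition of this sketch.

```python
import math

def softmax(activations):
    """Map real-valued activations to values in (0, 1) that sum to one."""
    shift = max(activations)  # avoids overflow, does not change the result
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["favor", "against", "none"]     # hypothetical class names
probs = softmax([2.0, 0.5, -1.0])          # hypothetical output activations
predicted = classes[probs.index(max(probs))]
print(probs, predicted)
```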

Training Neural Networks

The training of artificial neural networks is done using an algorithm called back propagation (Rumelhart et al., 1985). Back propagation consists of a forward pass in which we feed in the training data, and a backward pass in which we update the neural net based on the error made.

We will now outline the basic steps of this algorithm: First all parameters of the network (i.e. the weights and biases) are randomly initialized.

During the forward pass, we successively feed all training instances (i.e. input data + gold labels) to the network. This means that we present the training instances to the network as an input and calculate the output. Next, for each instance we compare the output which results from one instance with its gold label by, for instance, calculating the difference between a gold and a predicted score. Instead of simply calculating the difference between gold and prediction, several different error functions $E$ have been proposed. The underlying intuition for most of the more advanced error functions is not to weigh all errors equally, such as by squaring the errors. If we use a squared error function, we obtain a larger error for larger deviations and a comparatively small error for small deviations. Thereby, during training, large deviations between gold and prediction have a particularly strong impact on the weights of the network.

The overall idea of the backward pass is to update all trainable parameters (i.e. weights and biases) in a way that minimizes the error made. Therefore, we start backwards from the output layer and calculate how much each weight of each neuron contributes to the overall error. We can compute the contribution of a weight by estimating the partial derivative of the total error function with respect to this particular weight. The partial derivative can be computed using the chain rule, which states that one can differentiate a function that consists of two concatenated functions by separately differentiating the two concatenated functions and multiplying them. If we, for instance, want to compute the contribution of the weight $w_1$ in Figure 3.6, we can compute:

\frac{\delta E}{\delta w_1}= \frac{\delta E}{\delta o} \frac{\delta o}{\delta in} \frac{\delta in}{\delta w_1}

where $o$ is the neuron's output (i.e. $g(\sum_{i=1}^n w_ix_i+w_0b)$ with $g$ as any differentiable activation function) and $in$ is the neuron's input (i.e. $\sum_{i=1}^n w_ix_i+w_0b$ ). Note that some activation functions such as ReLU are not differentiable everywhere. However, we can compute subderivatives by differentiating the sub-functions from which ReLU is composed.

Next, to decrease the total error, we can subtract $\frac{\delta E}{\delta w_1}$ from the current value of $w_1$ . However, we do not simply subtract this value, but multiply it with a weight first. This weight is called learning rate. Intuitively, one could say that we follow a direction indicated by the derived function (upwards or downwards) by a defined step length.
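As a worked sketch of a single update step, the code below computes the three chain-rule factors for the weight $w_1$ of one neuron and applies the scaled update. It assumes a sigmoid activation and a squared error $E=(o-y)^2$; the concrete weights, inputs, gold value, and learning rate are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, w0, inputs, bias):
    """Return the neuron's input (weighted sum) and its output."""
    net_input = sum(w * x for w, x in zip(weights, inputs)) + w0 * bias
    return net_input, sigmoid(net_input)

def squared_error(output, gold):
    return (output - gold) ** 2

def gradient_w1(weights, w0, inputs, bias, gold):
    """Chain rule: dE/dw1 = dE/do * do/din * din/dw1."""
    _, o = forward(weights, w0, inputs, bias)
    dE_do = 2.0 * (o - gold)      # derivative of the squared error
    do_din = o * (1.0 - o)        # derivative of the sigmoid
    din_dw1 = inputs[0]           # derivative of the weighted sum w.r.t. w1
    return dE_do * do_din * din_dw1

weights, w0, inputs, bias, gold = [0.5, -0.3], 0.1, [1.0, 2.0], 1.0, 1.0
learning_rate = 0.1

grad = gradient_w1(weights, w0, inputs, bias, gold)
error_before = squared_error(forward(weights, w0, inputs, bias)[1], gold)
weights[0] -= learning_rate * grad   # step against the gradient
error_after = squared_error(forward(weights, w0, inputs, bias)[1], gold)
print(grad, error_before, error_after)
```

With these values the gradient is negative, so the update increases $w_1$ and the error on this instance shrinks slightly.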

Through repeating the described procedure with a lot of training instances, back propagation tries to find a global optimum for the parameters of the neural net. However, a known disadvantage of back propagation is that the algorithm can get stuck in local minima (Ruder, 2016). In addition, the weights sometimes fluctuate heavily (each instance strongly changes the weights) or converge only slowly to an optimum (each instance changes the weights only to a small extent). To overcome these problems, there are several algorithms that try to automatically increase or decrease the learning rate over time (Ruder, 2016). Examples for these algorithms include Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), and Adam (Kingma and Ba, 2014).

The function that is learnt on a training set might not necessarily generalize to a test set. Therefore, one often trains neural networks using regularization techniques such as early stopping or $L^2$ regularization (Goodfellow et al., 2016, p. 239–246). The idea behind early stopping is to evaluate the performance of the net after each epoch on a separate held-out set (e.g. a development set). If the performance on this set starts to decrease, one can assume that the network is beginning to overfit the training data and thus one can stop training. $L^2$ regularization refers to penalizing parameters if they deviate from zero. Furthermore, a particularly popular method for preventing neural networks from overfitting is dropout, which refers to randomly removing (or dropping) neurons and their edges from the network (Srivastava et al., 2014).
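The stopping criterion of early stopping can be sketched independently of any concrete network. The sketch assumes that a held-out score has been recorded after each epoch; the `patience` parameter (how many epochs without improvement are tolerated) and the scores are hypothetical additions for illustration.

```python
def early_stopping_epoch(held_out_scores, patience=1):
    """Return the index of the epoch whose parameters should be kept.

    Training stops once the held-out score has not improved for
    `patience` consecutive epochs.
    """
    best_epoch, best_score = 0, held_out_scores[0]
    epochs_without_improvement = 0
    for epoch, score in enumerate(held_out_scores[1:], start=1):
        if score > best_score:
            best_epoch, best_score = epoch, score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch

# Hypothetical held-out scores after each of six epochs.
scores = [0.52, 0.61, 0.66, 0.65, 0.63, 0.70]
print(early_stopping_epoch(scores, patience=2))
```

A larger patience tolerates temporary dips at the cost of more training epochs.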

Recurrent Neural Networks

Figure 3.9

Simple, unfolded Recurrent Neural Network for single target stance detection. The hidden layer responsible for instance2 receives additional input from the hidden layer of instance1. Instances can be any element (e.g. characters, tokens, sentences, documents) of a sequence of instances. Figure adapted from (Rumelhart et al., 1986, p.355, Figure 17B).

So far, we have looked at neural networks that are able to perform classification or regression of isolated instances. If we want to apply neural methods to words, we can simply use the dimensions of a word vector as input for a neural net. If we want to classify a sentence, we can concatenate the vectors of all words to a two-dimensional matrix, and then use every dimension of this matrix as an input dimension. However, this modelling neglects that the sequence of the words of a sentence also contains meaningful information. For instance, if a sentence contains a negation, the position at which the negation occurs is important. If there is a don’t before the word love in the sentence I love global warming, the meaning of love is reversed. An architecture that is able to model sequences is the so-called recurrent neural network (RNN) (Rumelhart et al., 1986).

In addition to the flow of information from input layers to output layers (over several hidden layers) that is specific to one instance, RNNs introduce connections between the hidden layers over several instances. Figure 3.9 exemplifies these additional connections for a sequence of two instances. RNNs, however, are not limited to connecting two instances but can connect all subsequent elements of a sequence. Therefore, for the output of instance $n$ within a sequence of $k$ instances, these recurring connections enable the network to incorporate both the input instance $n$ and all previous instances $1$ … $n-1$ . Sometimes not only previous instances but also subsequent instances have an influence on an instance. For example, in the sentence The president pushes the country, the interpretation of pushes heavily depends on whether the sentence ends with over the cliff or towards success. For this purpose, one can also construct RNNs with backward connections. This means that the output for an instance $n$ within a sequence of $k$ instances depends on the input instance $n$ and all following instances $n+1$ ... $k$ . There are also bi-directional RNNs that have both forward and backward connections.
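The recurrent connections can be illustrated with a deliberately tiny sketch that uses a single hidden unit and scalar inputs. The tanh activation, the scalar weights, and the function names are assumptions; real RNN layers use weight matrices and vector-valued states.

```python
import math

def rnn_forward(sequence, w_in, w_rec, bias):
    """Process a sequence of scalar inputs with one recurrent hidden unit.

    Each hidden state depends on the current input and, through w_rec,
    on the hidden state of the previous element.
    """
    hidden = 0.0
    states = []
    for x in sequence:
        hidden = math.tanh(w_in * x + w_rec * hidden + bias)
        states.append(hidden)
    return states

def bidirectional_rnn(sequence, forward_params, backward_params):
    """Pair each forward state with the state of a pass over the reversed sequence."""
    forward_states = rnn_forward(sequence, *forward_params)
    backward_states = rnn_forward(sequence[::-1], *backward_params)[::-1]
    return list(zip(forward_states, backward_states))

params = (1.0, 0.5, 0.0)   # hypothetical values for w_in, w_rec, bias
print(rnn_forward([1.0, 1.0], *params))
print(rnn_forward([0.0, 1.0], *params))   # same last input, different last state
print(bidirectional_rnn([0.0, 1.0], params, params))
```

The first two calls end with the same input but yield different final states, which is exactly the influence of the preceding element.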

Due to the concatenation of long sequences, RNNs are particularly affected by the vanishing or exploding gradient problem (Hochreiter, 1991; Bengio et al., 1993). The vanishing or exploding gradient problem refers to the problem of gradients either becoming so small that the network barely converges to any minima or of becoming extremely large, which causes unstable predictions.

One of the many possible solutions to the vanishing gradient problem is to use gated RNNs (Goodfellow et al., 2016, 398). Gated RNNs use weights at the recurrent connections that regulate the influence of the recurrent connections depending on which instance of the sequence is currently processed. This means that they learn to conditionalize the influence of the sequential information. We will now take a closer look at one of the most popular gated RNN architectures:

LSTMs

Figure 3.10

Overview of an LSTM module and its three gates. The direction of the arrows indicates the flow of information during the forward pass. Figure adapted from Sha et al. (2016).

The basic idea of long short-term memory neural nets (LSTMs) (Hochreiter and Schmidhuber, 1997) is that they are able to learn for each instance how much context (i.e. how many previous or following instances of the sequence) to incorporate. For instance, an LSTM may learn that the word push is highly dependent on the surrounding context and that the word president is rather independent.

To learn such a behavior, they not only pass on the activation $h_{t}$ of their cells, but also an internal cell state $c_t$ . For instance, in time step $2$ , our hidden layer would receive both the activations $h_{t_1}$ and the cell states $c_{t_1}$ . Both the activations $h_{t_2}$ and the cell states $c_{t_2}$ are then influenced by $h_{t_1}$ , $c_{t_1}$ and the original input $x_{t_2}$ . This influence is regulated through so-called gates. Gates are realized as neural network layers that are not connected with all the nodes of previous layers or time steps, but that have only certain connections depending on their intended functionality. Hence, through back-propagation, the network can learn to appropriately mix past and current information. The connection between the different gates is realized with pointwise multiplication. An LSTM typically has a forget gate, an input gate and an output gate (Hochreiter and Schmidhuber, 1997). Figure 3.10 gives an overview of the sequence of the gate layers within an LSTM module. Now, we will have a closer look at each gate’s function:

At first, the forget gate decides how much of the activation of the previous time step $h_{t-1}$ and of the original input $x_t$ should be considered or forgotten. Formally, this operation is realized with:

f_t=\sigma(W_a x_t + U_a h_{t-1}+ b_a)

with $W_a$ and $U_a$ being weight matrices and $b_a$ the bias vector, all of which are learned during training. This means that the forget gate simply learns to map $h_{t-1}$ and $x_t$ to real-valued scores between 0 and 1. To obtain an updated cell state $C^{\prime}_{t}$ , the output of the forget gate is then merged with the cell state $c_{t-1}$ by a simple point-wise multiplication:

C^{\prime}_{t}=c_{t-1} \cdot f_t

The second gate in the LSTM unit is the input gate, which decides how much of the cell input (i.e. $h_{t-1}$ and $x_t$ ) should be used to update the cell state. The input gate is realized in two steps: (a) based on the cell's input, we create a vector of values $I_{candidates,t}$ that could potentially be used as input to the cell state and (b) we determine a vector $i_t$ that tells us which parts of this candidate vector we should use for the update. In detail, these operations are realized using:

I_{candidates,t}= tanh(W_b\cdot [h_{t-1},x_t] + b_b)

i_t= \sigma(W_c\cdot [h_{t-1},x_t] + b_c)

Again, $W_b$ and $W_c$ refer to matrices of trainable weights and $b_b$ and $b_c$ refer to the trainable bias vectors. The intuition behind the activation functions is that $\sigma$ scales the values to a space from zero to one. Thus, the resulting vector $i_t$ can be used to filter the vector $I_{candidates,t}$ . Next, we calculate the update of the cell state by simply multiplying the candidate vector with the vector that represents the elements that should be used of the candidate vector:

I_{t}= I_{candidates,t} \cdot i_t

Now that we have computed the update, we can merge it with the output of the forget gate by forming a sum:

C^{\prime \prime}_{t}=C^{\prime}_{t} + I_{t}

Note that $C^{\prime \prime}_{t}$ will also be forwarded to the next time step (i.e. it is also the final $C_{t}$ ).

Finally, given the updated cell state and the original input, the LSTM’s output gate learns what should be passed to the next layer and time step (i.e. $h_t$ ). Similar to the input gate, the output gate relies on two operations: (a) we determine what parts of the original input can potentially serve as an output and (b) we filter this potential output based on the previously estimated cell state $C^{\prime \prime}_{t}$ . $O_{candidates,t}$ is determined using a sigmoid layer:

O_{candidates,t}= \sigma(W_d\cdot [h_{t-1},x_t] + b_d)

where $W_d$ and $b_d$ are a weight matrix and a bias vector that are learnt through training. Next, we can determine the final output of the LSTM unit by calculating:

h_t= O_{candidates,t} \cdot tanh(C_{t})
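The three gates can be traced in a minimal sketch of one LSTM time step. To stay self-contained it uses scalar inputs, activations, and cell states, so every weight matrix shrinks to one or two numbers (a pair stands for the weights applied to $[h_{t-1}, x_t]$); the parameter values are hypothetical and only the names follow the equations in the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step for scalar input, activation, and cell state."""
    # Forget gate: how much of the previous cell state is kept.
    f_t = sigmoid(p["W_a"] * x_t + p["U_a"] * h_prev + p["b_a"])
    c_kept = c_prev * f_t
    # Input gate: candidate values and a filter deciding how much of them to add.
    candidates = math.tanh(p["W_b"][0] * h_prev + p["W_b"][1] * x_t + p["b_b"])
    i_t = sigmoid(p["W_c"][0] * h_prev + p["W_c"][1] * x_t + p["b_c"])
    c_t = c_kept + candidates * i_t
    # Output gate: filter the squashed cell state to obtain the new activation.
    o_t = sigmoid(p["W_d"][0] * h_prev + p["W_d"][1] * x_t + p["b_d"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Hypothetical parameters; in practice they are learned via back propagation.
params = {
    "W_a": 0.5, "U_a": 0.1, "b_a": 0.0,
    "W_b": (0.3, 0.8), "b_b": 0.0,
    "W_c": (0.2, 0.6), "b_c": 0.0,
    "W_d": (0.1, 0.4), "b_d": 0.0,
}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

Both `h` and `c` are handed from one step to the next, which mirrors the two signals passed between time steps.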

Convolutional Neural Nets

Convolutional Neural Networks (CNNs) (Le Cun et al., 1989) are responsible for many breakthroughs of deep learning – mainly in the area of computer vision. The basic idea of CNNs is that extracted features are expressive of several regions in the input. For data that is used in computer vision such a feature may be the edge of objects that occur in a picture. In the case of stance detection, such features could be the proximity of a negative word and a particular name.

To learn such reoccurring features, each neuron of a convolutional layer is only affected by a spatially limited fraction of the input instance (Goodfellow et al., 2016, 325). The limited fraction of the data is determined through a so-called filter that slides over the data in a predefined way. To extract features that are meaningful at several regions of the input, all the neurons of a CNN-layer share the same weights. The matrix containing these weights is called filter kernel.

Hence, the input $c_m$ of a single neuron of a convolutional layer $m$ is determined by the filtered input matrix $I_{j,k}$ and $m$ ’s filter kernel $F_{j,k}$ , which weighs these inputs:

$c_m=\sum_{j,k}i_{j,k}w_{j,k}$

with $i_{j,k} \in I_{j,k}$ and $w_{j,k} \in F_{j,k}$ (Severyn and Moschitti, 2015). Typically, in addition to the actual convolutional layers, CNNs have a pooling layer that aggregates the convolved representations. Figure 3.11 shows an example of a simple CNN with one convolutional and one generic pooling layer.

Figure 3.11

Simple Convolutional Neural Network that uses a filter size of three. The shape of the first matrix corresponds to $n \times d$ where $d$ is the dimensionality of the used word embeddings and $n$ the number of words. The shape of the matrix that results from the convolution is $n \times m$ where $m$ is the number of used filters. Finally, the shape resulting from the pooling operation is just $m$ . Note that we here only show convolution operations where the filter is entirely within the two-dimensional matrix. The figure is adapted from Severyn and Moschitti (2015).

Filters operate on those units into which we have segmented our data. This means that, if we segmented our data into tokens, the minimum size of a filter is a token. In the example in Figure 3.11 the convolution uses a filter size of three, which means that each filter extracts features from three tokens. In the case of NLP, one often uses two-dimensional filters whose first dimension represents the number of words and whose second dimension represents the dimensionality of the used embeddings. In computer vision more flexible filter sizes are used.

Having defined a filter size, we can now slide the filter across the provided instances. The way in which the filter slides over the data is determined by its stride. If the stride is one, the filter shifts by one unit, if it is two it shifts by two units, and if it is $n$ the shift is $n$ units. Thereby, each step represents the input for exactly one neuron. As in other neural network architectures, the weights of the neurons of a convolutional layer are randomly initialized.

Pooling layers perform a fixed function and hence have no learnable parameters (Goodfellow et al., 2016, 330). Common pooling operations are: max pooling, which selects the maximal value from a convolved representation, min pooling, which selects the minimal value from a convolved representation, or average pooling that calculates the average of all values within the convolved representation. Pooling operations can be further distinguished by whether they are applied globally to the complete representation or only to a part of it.
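The interplay of filter size, stride, shared weights, and pooling can be sketched as follows. The sketch assumes token embeddings of dimensionality two, two hand-written filters of size three, no activation function, and global max pooling; all values and function names are illustrative assumptions.

```python
def convolve(embeddings, filter_kernel, stride=1):
    """Slide one filter over a sequence of token embeddings.

    `embeddings` is an n x d matrix (n tokens, d embedding dimensions) and
    `filter_kernel` is a size x d matrix of weights shared by all positions.
    Only positions where the filter lies entirely within the input are used.
    """
    size = len(filter_kernel)
    outputs = []
    for start in range(0, len(embeddings) - size + 1, stride):
        window = embeddings[start:start + size]
        outputs.append(sum(
            i * w
            for token, weights in zip(window, filter_kernel)
            for i, w in zip(token, weights)
        ))
    return outputs

def global_max_pool(values):
    """Keep only the maximal value of a convolved representation."""
    return max(values)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
filters = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],   # sums the first embedding dimension
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],   # sums the second embedding dimension
]
convolved = [convolve(embeddings, f) for f in filters]
pooled = [global_max_pool(c) for c in convolved]
print(convolved, pooled)
```

After pooling, one value per filter remains, regardless of how many tokens the input contains.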

Evaluation of Stance Detection Systems

In the sections above, we shed light on how to learn a mapping from given training instances to their labels. Theoretically, this goal could be satisfied by simply memorizing the training data. However, such a memorization would only be feasible in practice if all possible cases are covered by the training data. As we have seen in Section 3.2, such a data collection would be hardly feasible and, due to the productivity of language, probably impossible.

Hence, we want a function to generalize to unseen data, and – in the ideal case – to be universally applicable. Therefore, we evaluate the quality of a learned function by measuring its performance on test data – data that has the same structure as the training data (i.e. that has been labeled in the same way), but that has not been involved in training the function.

By comparing the performance on the training data and the test data, we can assess if the learnt function over- or underfits the training data (Bishop, 2006, p. 66). Overfitting refers to a function that learns the peculiarities of the training data to an extent that negatively affects the performance on unseen data such as the test data. An example for an extreme form of overfitting is the above described memorizing of the mapping of training instances to their labels. Underfitting means that the model even fails to sufficiently describe the instances of the training data.

In practice, before developing and training a model, one usually splits the available data into a train and a held-out test set. The division is usually done randomly. It is also best practice to search for good hyperparameters by iteratively tuning a model on a development set, i.e. a further fraction of the train set that is solely used for repeatedly testing the models. However, the random split may again introduce a bias into the result, as the randomly separated test data may have some peculiarities as well.

To tackle this problem, one often uses a technique known as $k$ -fold cross-validation (Bishop, 2006, p.32). In $k$ -fold cross-validation, we split the available data into $k$ subsets that ideally all have the same size. According to this split, we can define $k$ train and $k$ test sets: each subset serves as the test set exactly once, while the remaining $k-1$ subsets form the corresponding train set. Next, we run $k$ iterations of training and testing our model. The overall performance of the model can then be calculated as the average of the performance of the $k$ iterations.
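The splitting and averaging logic of $k$ -fold cross-validation can be sketched without any learning algorithm. The function names and the callback `train_and_score` are hypothetical; the sketch splits the instances in their given order, whereas in practice the data is usually shuffled first.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    # Distribute a possible remainder over the first folds.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_instances, k, train_and_score):
    """Average the score returned by `train_and_score` over all k folds."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_splits(n_instances, k)]
    return sum(scores) / k

for train, test in k_fold_splits(10, 3):
    print(train, test)
# Dummy scorer standing in for training and testing a real model.
print(cross_validate(10, 5, lambda train, test: len(train) / 10))
```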

Scores

Now that we have defined a testbed we can actually measure the performance. Therefore, we now describe the measurements that are used in this work.

The simplest way to measure performance of a classification is to measure the accuracy. The accuracy is defined as the ratio of correctly predicted instances to the total number of instances:

accuracy=\frac{\# \textrm{ correctly classified instances}}{\# \textrm{ all instances}}

However, the accuracy is vulnerable to skewed class distributions. For instance, if we have a dataset in which 80% of the data is labeled with $\ominus$ and 20% with $\oplus$ , we get an accuracy of $.8$ if we always just predict $\ominus$ without considering the input at all. Although such a model may be accurate, it is undeniably not useful in practical terms.
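The effect of a skewed class distribution on accuracy can be reproduced with a few lines. The label names and the ten-instance dataset are hypothetical stand-ins for the 80/20 example.

```python
def accuracy(gold, predicted):
    """Ratio of correctly predicted instances to all instances."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold = ["against"] * 8 + ["favor"] * 2      # 80% vs. 20% of the labels
always_against = ["against"] * len(gold)    # ignores the input entirely
print(accuracy(gold, always_against))
```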

Hence, we need measurements that take into account that there are two types of errors that can be made for each of the classes. For each class we can define a positive or negative prediction, where positive is the prediction of the respective class (e.g. $\ominus$ when being interested in $\ominus$ ) and negative is the non-prediction of the class (e.g. $\oplus$ when being interested in $\ominus$ ). The two types of error are then:

false positive: We predict the positive class for an instance, but the instance has a negative gold label.
false negative: We predict the negative class for an instance, but the instance has a positive gold label.

Accordingly, we can also define cases of correct prediction in terms of true positives (correctly predicted as belonging to the positive class) and true negatives (correctly predicted as belonging to the negative class). Figure 3.12 visualizes these errors and correct prediction within a confusion matrix.

Figure 3.12

Classification of errors and correct prediction for a single class.

Based on this matrix, we can then determine the precision and the recall of the prediction for each class. Precision refers to the ratio of the number of instances that have been correctly classified as being positive and the number of instances that were classified as being positive (including the false positives):

precision=\frac{TP}{TP+FP}

The recall refers to the proportion of instances that have been correctly classified as being positive to all instances that have a positive gold label (hence also including those instances that were wrongly classified as negative):

recall=\frac{TP}{TP+FN}

To combine these measures into a single one, one typically uses the weighted harmonic mean (called F-score) of precision and recall:

F_\beta=(1+\beta^2)\frac{precision \cdot recall}{\beta^2 \cdot precision + recall}

with $\beta$ as a parameter that balances the influence of precision and recall. Most common is the $F_1$ -measure with $\beta=1$ , which evenly weighs precision and recall.

In most cases, we want to have a single score indicating the performance of the classification of several classes. Therefore, one can average the scores over all classes. For a score such as $F_1$ , we can distinguish between two ways of averaging:

for the macro-averaged $F_1$ we simply compute the average precision and average recall over all classes.
for the micro-averaged $F_1$ we sum the individual true positives, false positives, and false negatives across all classes and then calculate the $F_1$ accordingly.
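The two averaging schemes can be compared on a small, hypothetical example. The sketch follows the description above for the macro-averaged score (the $F_1$ is derived from the averaged precision and averaged recall); another common convention instead averages the per-class $F_1$ scores. Treating undefined ratios as zero is an assumption of this sketch.

```python
def counts(gold, predicted, label):
    """True positives, false positives, and false negatives for one class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    return tp, fp, fn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0   # assumed convention: 0 if undefined

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0   # assumed convention: 0 if undefined

def f_beta(p, r, beta=1.0):
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0

def macro_f1(gold, predicted, labels):
    """F1 of the precision and recall averaged over all classes."""
    per_class = [counts(gold, predicted, label) for label in labels]
    avg_p = sum(precision(tp, fp) for tp, fp, _ in per_class) / len(labels)
    avg_r = sum(recall(tp, fn) for tp, _, fn in per_class) / len(labels)
    return f_beta(avg_p, avg_r)

def micro_f1(gold, predicted, labels):
    """F1 of the counts summed over all classes."""
    per_class = [counts(gold, predicted, label) for label in labels]
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    return f_beta(precision(tp, fp), recall(tp, fn))

gold = ["against"] * 8 + ["favor"] * 2
predicted = ["against"] * 10                # majority-class predictions
labels = ["against", "favor"]
print(macro_f1(gold, predicted, labels), micro_f1(gold, predicted, labels))
```

On this skewed example the macro average penalizes the ignored minority class much more strongly than the micro average.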

Since in regression tasks we have a continuous space of possible values, it seems inadequate to compute performance based on exact match of predicted and gold values. For example, if we have a gold value of $1$ , both the predictions $1.0001$ and $3000$ would be considered equally wrong. Thus, we calculate the correlation between the values that have been predicted by the model and the gold values as a performance measure. Commonly, one uses Pearson’s correlation coefficient $r$ , which is defined for tuples $(x_i,y_i)$ as:

r_{xy} = \frac{\sum \limits_{i=1}^{n} (x_i-\overline{x})\cdot(y_i-\overline{y})}{\sqrt{\sum \limits_{i=1}^{n}(x_i-\overline{x})^2}\cdot \sqrt{\sum \limits_{i=1}^{n}(y_i-\overline{y})^2} }

where $\overline{x}$ is the mean of all predictions $x \in X$ and $\overline{y}$ is the mean of all corresponding gold values $y \in Y$ . Other measures used for evaluating the agreement between non-nominal gold and prediction values are Spearman's $\rho$ , which describes the correlation of ranks (Spearman, 1904), and weighted $\kappa$ , which takes the probability of random agreement into account (Cohen, 1968). Both classification and regression performance can also be evaluated using loss-based measures (cf. 3.2.1).
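Pearson's correlation coefficient can be computed directly from its definition, as the following minimal sketch shows. The gold and predicted values are hypothetical, and the sketch assumes that neither list is constant (otherwise the denominator would be zero).

```python
import math

def pearson_r(predictions, gold):
    """Pearson's r between predicted and gold values of equal length."""
    n = len(predictions)
    mean_x = sum(predictions) / n
    mean_y = sum(gold) / n
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(predictions, gold))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in predictions))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in gold))
    return covariance / (spread_x * spread_y)

gold = [1.0, 2.0, 3.0, 4.0]
close = [1.1, 1.9, 3.2, 3.9]            # small deviations from the gold values
reversed_order = [4.0, 3.0, 2.0, 1.0]   # perfectly inverse relationship
print(pearson_r(close, gold), pearson_r(reversed_order, gold))
```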

Evaluation Strategies

Although with cross-validation and metrics we have the basic tools to evaluate stance detection systems, the complex and data-driven nature of these systems makes it difficult to directly infer general knowledge using these tools. For example, if researcher A claims that her system achieves a high macro $F_1$ in a ten-fold cross-validation on her dataset A, this may not mean that we can achieve a high micro $F_1$ in a five-fold cross-validation on our dataset B using the exact same system. As famously pointed out by Wolpert and Macready (1997) in their no-free-lunch theorem, any optimization-based system that becomes better at a highly specialized problem will perform worse on all other problems. In the case of stance detection, that would mean that the better we are at one particular stance dataset, the worse we will be on other stance datasets. Of course, one can also try to make the algorithms valid for many (or all) stance datasets, but this adds even more complexity to the optimization problem. While the goal of creating a universally applicable system is thus hard to reach, we here discuss strategies to infer more or less general knowledge.

Our first strategy is to participate in so-called shared tasks. A shared task is an evaluation initiative that is organized around a particular challenge (i.e. solving an NLP task on a particular dataset). A shared task usually involves a strict distinction between participants and organizers. The organizers provide the labeled training data, define the target metric and perform the evaluation of the participating systems. The test data is kept secret until the evaluation. Given this framework, the participants try to create a system that will perform well on the unknown test data (in other words, they share the task). As the framework of the shared task is identical for all teams, one can directly compare their performance. In addition, since the test data (both labels and raw data) is unknown during the development of the systems, the framework prevents both accidental and intentional fitting to the test data. From a practical perspective, a high variety of participating systems provides insight into which general components are helpful for the task beyond the parameterizations of certain components. For instance, if several neural network-based approaches perform consistently better than traditional machine learning approaches, then this finding can potentially be generalized. Of course, the results obtained are only meaningful for the provided dataset.

In the sections above, we have shown that NLP systems and thus also stance detection systems can be seen as pipelines consisting of several components such as preprocessing steps or learning algorithms. If one focuses on the pipeline character, one can distinguish between two types of evaluations of such systems (Palmer and Finin, 1990; Paroubek et al., 2007; Resnik and Lin, 2010): (i) end-to-end or black box approaches in which we evaluate the whole pipeline based on its input and output, and (ii) component-wise or white box approaches in which we evaluate each intermediate step. While black box evaluations are often sufficient for practical use, where we are just interested in the final result, white box approaches are necessary for understanding the functionality of the systems (Resnik and Lin, 2010).

However, obtaining a complete white box is difficult, as most of the components of a stance detection system have a range of parameters that may influence its effectiveness. Examples of such parameters include the hyperparameters of neural networks (e.g. how many layers to use, or how many nodes to use in each layer), the kind of embedding resource we use, or whether our POS tagger should differentiate several classes of nouns (e.g. singular and plural nouns). To evaluate just one component, one would require adequate gold labels (e.g. our data would need gold POS tags) and would have to experiment with all, or at least sufficiently many, parameter configurations.

In addition to the large number of parameters, there may also be a complex interaction between the components, their parameters, and the effectiveness of the system (Resnik and Lin, 2010). In some NLP pipelines, an error that is made by one component will be propagated through the whole pipeline. Imagine that we use a rule to extract a feature based on patterns of POS tags (e.g. adjectives and nouns), but the POS tagger we use performs poorly. In this case, our rule for extracting the POS pattern feature may work perfectly, but the extracted feature will not be very helpful given the erroneous POS tags. Consequently, in such cases the individual components may perform reasonably, but the overall performance would be low. In other pipelines, individual components may be robust against the poor quality of other components, because components can compensate for errors or because several steps are independent. Thus, these pipelines may yield reasonable performance despite erroneous components. These dependencies imply that for a complete white box approach we would have to evaluate not just all components individually, but also all possible combinations of them.

Since one of the main goals of this work is to understand how stance detection systems work, we here adopt a gray box approach, in which we treat the whole system as a black box, but still evaluate a few components that we suspect are particularly meaningful for the effectiveness of the systems. To do this, we first generate hypotheses about the influence of certain components or parameterizations on the overall performance. To test these hypotheses, we use controlled experiments. Many of these experiments can be assigned to one of the following three types:

Direct Evaluation: Directly evaluating the performance of a single component against gold labels or pseudo gold labels that have been heuristically derived.

Ablation Tests: Testing how the system’s overall effectiveness changes if we remove a component (Fawcett and Hoos, 2016). An ablation test tells us how important a component is for the system as a whole. If the loss of performance is high, we can conclude that the component has a high importance for the system. If the loss of performance is low, the component has a low importance for the system and may possibly be removed to simplify the system.

Oracle Conditions: Testing how the system's overall effectiveness changes if we assume perfect performance for a component. This test allows us to judge how significant the error is that a particular component makes. If we observe a much better performance in an oracle condition, we can deduce that it is particularly worthwhile to improve the performance of that component. For example, if we have a system that relies on POS tags and oracle POS tags lead to a significant improvement, one should try to optimize the performance of the POS tagger.
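An ablation test of the kind described above can be sketched as a small loop over components. The following is a minimal illustration only: the function `evaluate` is a hypothetical stand-in for training and scoring a complete system with a given set of active components, and the component names and scores in `toy_evaluate` are invented for the example, not taken from our experiments.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def ablation_test(
    components: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Return, for each component, the loss in score when it is removed."""
    full = frozenset(components)
    full_score = evaluate(full)
    # A positive value means the system gets worse without the component.
    return {c: full_score - evaluate(full - {c}) for c in sorted(full)}


def toy_evaluate(active: FrozenSet[str]) -> float:
    """Hypothetical stand-in for 'train and evaluate the system'."""
    contribution = {"stance_lexicon": 0.08, "concepts": 0.01, "ngrams": -0.02}
    return round(0.55 + sum(contribution[c] for c in active), 2)


losses = ablation_test(["stance_lexicon", "concepts", "ngrams"], toy_evaluate)
for name, loss in losses.items():
    print(f"-{name}: loss = {loss:+.2f}")
```

An oracle condition could be tested with the same loop by replacing a component's output with gold labels instead of removing it, provided such labels are available.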

Stacked Classification for Detecting Threeway Stance

We now describe experiments that were conducted to better understand stance detection on single targets. To this end, we present our contributions to two of the first shared tasks on stance detection (SemEval 2016 task 6 Detecting Stance in Tweets (Mohammad et al., 2016) and the IberEval task Stance and Gender Detection in Tweets on Catalan Independence (Taulé et al., 2017)) and the relevant part of a shared task that we organized ourselves (GermEval 2017). The general set-up in all three initiatives is comparable: For a given target such as Catalan independence, atheism, or a railway operator, one has to classify which of three stance labels ( $\oplus$ , $\ominus$ , or NONE) the provided text expresses.

Because our contributions build upon each other, we describe them in the order in which they were conducted and thus start by describing our submission to SemEval 2016 task 6. For this system, we hypothesize that three-way stance detection actually involves two subtasks: First, we have to decide whether a text contains any stance at all. Hence, we have to train a system that distinguishes between the classes NONE and STANCE. Second, for the texts that have been classified as stance-bearing, we have to classify the polarity of the stance as $\oplus$ or $\ominus$ . We first describe the dataset used in the competition, then we describe our approach, and subsequently we conduct an error analysis.

Data

The shared task SemEval 2016 task 6 includes two subtasks: Subtask A is a supervised scenario in which both the training and the test data are labeled. Subtask B is a weakly supervised scenario in which only the test data is labeled. For training, the participants of subtask B were provided with a corpus of unlabeled texts relevant to the target. Since our focus was on the supervised task, we here only describe subtask A. For more information on subtask B, refer to the overview paper (Mohammad et al., 2016) or our submission (Wojatzki and Zesch, 2016a).

The provided dataset consists of tweets that were collected based on target-specific hashtags such as #hillary4President. These hashtags have been removed from the tweets to prevent classifiers from relying on these obvious cues. The annotation was done via crowdsourcing on crowdflower.com (now Figure Eight; last accessed November 5th 2018).

The data for subtask A contains the following targets: atheism, climate change is a real concern, feminist movement, Hillary Clinton, and legalization of abortion. In the following, we shorten the targets climate change is a real concern and legalization of abortion to climate change and abortion, respectively.

Each tweet is only labeled with a stance towards one target. Thus the SemEval task is clearly a single-target stance detection task. In order to illustrate the dataset, below, we show a tweet that expresses a $\oplus$ stance, a $\ominus$ stance, and a NONE stance towards atheism:

$\oplus$ : IDC what the bible says, constitution says you can take it and shove it right back where it came from .. We aren’t a theocracy.

$\ominus$ : I would rather stand with #God and be judged by the world than stand with the world and be judged by God. #Godsnotdead #Truth

NONE: ”They tried to bury us. They didn’t know we were #seeds.” #MexicanProverb #Tolerance #Coexistence #peace #Liberty #OneWorld

In Table 3.1 we visualize the number and distribution of classes for the different targets. The distribution shows that there are approximately as many $\ominus$ instances as the sum of $\oplus$ and NONE instances. Hence, the distribution of classes can be considered as being imbalanced. In addition, there are imbalances between the different targets. This variance is most obvious for the target Climate Change, which only has 26 $\ominus$ instances and is clearly biased towards $\oplus$ instances.

As an evaluation metric, the organizers use the macro-average of $F_1(\oplus)$ and $F_1({\ominus})$ . The intuition behind the metric is to focus on the stance-bearing classes ( $\oplus$ and $\ominus$ ). Note that the detection of the NONE class is still indirectly evaluated, as correctly or incorrectly classifying instances as NONE has an impact on $F_1(\oplus)$ and $F_1({\ominus})$ .
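The metric can be illustrated with a short sketch. This is not the official scoring script of the shared task; the label strings FAVOR and AGAINST stand for $\oplus$ and $\ominus$ , and the four toy instances are invented to show how confusions with NONE lower the score even though NONE has no $F_1$ term of its own.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 of a single class, computed from true/false positives and negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def semeval_score(gold, pred, favor="FAVOR", against="AGAINST") -> float:
    """Macro-average of the F1 scores of the two stance-bearing classes."""
    return (f1_for_class(gold, pred, favor) + f1_for_class(gold, pred, against)) / 2


# Invented toy example: one AGAINST tweet is missed (predicted NONE) and
# one NONE tweet is wrongly predicted as AGAINST.
gold = ["FAVOR", "AGAINST", "NONE", "AGAINST"]
pred = ["FAVOR", "NONE", "AGAINST", "AGAINST"]
score = semeval_score(gold, pred)
print(score)
```

In this example, both errors involve the NONE class, and both reduce $F_1({\ominus})$ : one through recall, the other through precision.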

Table 3.1

stance	atheism	climate change	feminist movement	Hillary Clinton	legalization of abortion	all
$\ominus$	464	26	511	565	544	2,110
$\oplus$	124	335	268	163	167	1,057
NONE	145	203	170	256	222	996
$\sum$	733	564	949	984	933	4,163

Dataset (train + test) used in the shared task SemEval 2016 task 6 – Detecting Stance in Tweets by Mohammad et al. (2016).

System

We now describe how we use a sequence of stacked classifiers that first decides whether a tweet contains any stance (classes: NONE and STANCE) and then distinguishes between $\oplus$ and $\ominus$ polarity for the instances that have been labeled with STANCE.

Before applying this classification, we preprocess the data with the DKPro Core framework, version 1.7.0 (Eckart de Castilho and Gurevych, 2014). For tokenization, we use the Twokenizer, version 0.3.2, by Gimpel et al. (2011). We determine sentence boundaries with the DKPro default sentence splitter. We also assign POS tags using the OpenNLP POS tagger (maxent model in version 3.0.3). In addition, we identify hashtags with the help of the Arktweet POS tagger (Gimpel et al., 2011).

On the basis of this preprocessing, we extract a large feature set of ngram, syntactic, sentiment, and word-embedding features. However, in a (cross-validated) ablation test on the training data, the sentiment and word-embedding features did not prove useful. We now briefly describe the remaining features.

Stance-lexicon

For capturing a stance-specific word polarity, we construct a stance lexicon. To this end, we compare the statistical association of each word with the two poles of each classification ( $\ominus$ vs. $\oplus$ or STANCE vs. NONE). In detail, we compute the gmean association of a word with both classes and subtract the scores. For instance, the gmean association (cf. Evert (2004)) of a word $x$ with $\oplus$ is computed as:

$$gmean_{\oplus}(x)=\frac{c_\oplus(x)}{\sqrt{c_\oplus(x) \cdot c_\ominus(x)}}$$

with $c_\oplus(x)$ being the number of times $x$ occurs in instances labeled with $\oplus$ and $c_\ominus(x)$ being the count of $x$ in the residue. We normalize the polarity lexicon by lowercasing all tokens, removing @- and #-prefixes, lemmatizing plurals, and merging words with a Levenshtein edit distance that is below .15. We normalize only for the STANCE vs. NONE classification, as we observe that normalization is not beneficial for the $\ominus$ vs. $\oplus$ classification. We use the same method to create a hashtag-stance lexicon. For calculating the polarity feature for each tweet, we simply average the stance scores $s$ of its tokens.
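The construction of the stance lexicon and of the tweet-level polarity feature can be sketched as follows. This is a minimal illustration under several assumptions that are not part of the description above: the label strings FAVOR and AGAINST stand for $\oplus$ and $\ominus$ , add-one smoothing is introduced to avoid a division by zero for words that occur with only one class, tokens missing from the lexicon are skipped when averaging, and the normalization steps are omitted. The toy tweets are invented.

```python
import math
from collections import Counter


def build_stance_lexicon(tweets, labels, pos="FAVOR", neg="AGAINST", smoothing=1.0):
    """Map each word to gmean_pos(word) - gmean_neg(word)."""
    c_pos, c_neg = Counter(), Counter()
    for tokens, label in zip(tweets, labels):
        if label == pos:
            c_pos.update(tokens)
        elif label == neg:
            c_neg.update(tokens)
    lexicon = {}
    for word in set(c_pos) | set(c_neg):
        # Assumption: add-one smoothing keeps the denominator non-zero.
        p = c_pos[word] + smoothing
        n = c_neg[word] + smoothing
        denom = math.sqrt(p * n)
        lexicon[word] = p / denom - n / denom
    return lexicon


def tweet_polarity(tokens, lexicon):
    """Average the stance scores of all tokens found in the lexicon."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


# Invented toy data
tweets = [["god", "truth"], ["science", "truth"], ["god", "faith"]]
labels = ["AGAINST", "FAVOR", "AGAINST"]
lexicon = build_stance_lexicon(tweets, labels)
print(tweet_polarity(["god", "science"], lexicon))
```

Words that occur equally often with both classes (here: truth) receive a score of zero, whereas words that are tied to one class receive a clearly positive or negative score.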

Bi- and Trigram Features

To capture longer expressions, we extract the 500 most frequent bi- and trigrams and derive binary features depending on their occurrence in the tweets. However, the resulting 500 features would outnumber the other features. Thus, we first train an SVM that is equipped with ngrams only and subsequently use the output of this classification as a feature.
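The extraction of the binary ngram features can be sketched in a few lines. This is an illustrative sketch, not the DKPro TC implementation we used; the function names are hypothetical, and the subsequent SVM that turns these features into a single stacked feature is not shown.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous token sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k_ngrams(corpus, k=500, orders=(2, 3)):
    """The k most frequent bi- and trigrams of a tokenized corpus."""
    counts = Counter()
    for tokens in corpus:
        for n in orders:
            counts.update(ngrams(tokens, n))
    return [gram for gram, _ in counts.most_common(k)]


def binary_features(tokens, vocabulary, orders=(2, 3)):
    """One 0/1 feature per vocabulary ngram: does it occur in the tweet?"""
    present = {gram for n in orders for gram in ngrams(tokens, n)}
    return [1 if gram in present else 0 for gram in vocabulary]


# Invented toy corpus
corpus = [["we", "are", "not", "a", "theocracy"], ["we", "are", "not", "alone"]]
vocabulary = top_k_ngrams(corpus, k=3)
features = binary_features(["are", "not", "sure"], vocabulary)
print(vocabulary, features)
```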

Syntax Features

As syntax features, we use the number of times that conditional sentences, modal verbs, exclamation and question marks, and negations occur in each instance.

Concept Features

While our lexicon feature captures wording that is associated with a single class, there are also words that are associated with both classes. We hypothesize that these words are proxies for important concepts in the debate. We try to retrieve these words by selecting the twelve most frequent nouns for each target. Next, we normalize the words (in the same way as above) and manually remove words that occur with only one class. Subsequently, we form tuples consisting of these words and their surrounding uni-, bi-, and trigrams. Then we train an SVM to classify for these tuples whether they occur in a $\ominus$ or $\oplus$ instance (or, respectively, in a STANCE or NONE instance). We use the output of this classifier for each occurrence of the words as a feature.

Target Transfer Features

To exploit a potential overlap between the targets (e.g. both Abortion and Feminist Movement are related to the rights of women), we apply the trained target-specific models to the tweets of other targets. We use the resulting classification as an additional feature. This feature is only used for the $\ominus$ vs. $\oplus$ classification for the targets Climate Change is a Real Concern, Hillary Clinton, and Legalization of Abortion.

Based on the described feature sets, we assemble a sequence of classifications in which we first determine if a tweet contains any stance and then further distinguish this stance into the classes $\ominus$ and $\oplus$ . For both classification steps we use the Weka SVM classifier that is integrated in DKPro TC, version 0.8.0 (Daxenberger et al., 2014). Note that the feature sets in both classification steps are comparable, but that the normalization of the stance lexicon and the target transfer features are only used in the $\ominus$ vs. $\oplus$ classification. The sequence of stacked classifiers is visualized in Figure 3.13.

Figure 3.13

Overview of the sequence of stacked classifications we used in our contribution to SemEval 2016 task 6 – Detecting Stance in Tweets.
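The control flow of the stacked classification can be sketched as follows. The two keyword-based functions are hypothetical stand-ins for the trained SVMs and only serve to make the example runnable; the label strings FAVOR and AGAINST stand for $\oplus$ and $\ominus$ .

```python
def stacked_predict(tokens, has_stance, polarity):
    """Step 1: STANCE vs. NONE. Step 2: polarity, only for stance-bearing tweets."""
    if not has_stance(tokens):
        return "NONE"
    return polarity(tokens)


def toy_has_stance(tokens):
    """Hypothetical stand-in for the STANCE vs. NONE classifier."""
    return any(t in {"god", "bible", "theocracy"} for t in tokens)


def toy_polarity(tokens):
    """Hypothetical stand-in for the polarity classifier."""
    return "AGAINST" if "god" in tokens else "FAVOR"


for tweet in (["stand", "with", "god"], ["not", "a", "theocracy"], ["we", "were", "seeds"]):
    print(tweet, stacked_predict(tweet, toy_has_stance, toy_polarity))
```

A consequence of this design is that errors of the first step cannot be corrected by the second step: a stance-bearing tweet that is classified as NONE never reaches the polarity classifier.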

Results

With the help of the implemented system, we now carry out experiments to better understand the task of single target stance detection.

Comparison with Other Approaches

In the official competition we are ranked twelfth of nineteen teams and achieved a score of $.62$ in the official metric. We compare the obtained score to the scores of other teams and baselines in Figure 3.14. The figure shows that even the best approach only reaches a performance score of $.69$ , which leaves a lot of room for improvement. From this, we conclude that the task is a difficult challenge. This conclusion is also supported by the fact that less than a third of the teams were able to beat a simple majority class baseline.

Furthermore, we notice that the variance between the performances of the teams is small. With the exception of the submission by Thomson Reuters, the difference between the best and worst ranking submission is just 10%. Hence, we cannot identify a team that stands out from the others in terms of superior performance. This is underlined by the fact that the best performing approach is the trigram (SVM) baseline of the organizers.

We analyze the systems of the other participants and identify two main strands of approaches: The first strand comprises knowledge-light, neural architectures, such as those submitted by the two best teams. More precisely, the best scoring approach utilizes LSTM layers (Zarrella and Marsh, 2016) and the second best scoring approach uses a network with a CNN layer at its core (Wei et al., 2016a). The second strand includes more traditional classifiers such as SVMs that are equipped with features based on linguistic preprocessing.

In Figure 3.14, we show how these strands are distributed over all participants. We observe that neither strand seems to be superior to the other. Interestingly, the performance of both approaches is very similar, although they are based on fundamentally different paradigms.

Figure 3.14

Performance of submissions to SemEval 2016 task 6 (Mohammad et al., 2016). We color-code whether systems are primarily based on neural machine learning (blue) or feature engineering (red). We use gray for approaches that rely on different techniques (e.g. a majority class classification) or that did not publish a description of their systems.

Ablation

While the set of implemented features is useful on the training data, the selected configuration could still be overfitting to the training data. Thus, we conduct an additional ablation test in which we train our system on the training data and determine the loss of performance on the test data. This ablation test is shown in Table 3.2. The results show that for almost all features the performance barely changes or even improves if one omits them. To some extent, this can be explained by the fact that the features are correlated (e.g. modal verbs are also unigrams). However, it also means that these features are not important for our system.

The only feature that is associated with a substantial loss across all targets is the stance lexicon. This means that the stance lexicon has a significant impact on the classification. In addition, we observe a loss if we remove the concept features for the targets atheism, climate change, and feminist movement. From this we conclude that the identified concepts indeed play a role in the classification.

As a result of the conducted ablation test, we additionally train a model with only the stance lexicon and the concept features. When training a model with this reduced feature set, the performance over all test data increases by about three percentage points (from .62 to .65). This indicates that the other features (e.g. the transfer feature) caused our system to overfit to the training data.

Table 3.2

	all	atheism	climate change	feminist movement	hillary clinton	abortion
all features	.62	.53	.36	.55	.44	.57
-stance lexicon	.54	.48	.29	.50	.41	.46
-concepts	.62	.52	.35	.54	.46	.58
-negation	.62	.55	.36	.55	.44	.57
-target transfer	.62	.53	.35	.55	.46	.58
-punctuation	.62	.56	.35	.55	.47	.57
-conditional sentences	.63	.58	.36	.55	.44	.60
-modal verbs	.63	.58	.36	.55	.44	.60
-ngrams	.64	.59	.38	.56	.51	.58

Ablation test of the feature set on the test data of the SemEval 2016 task 6 dataset. The performance is calculated using the SemEval metric $\frac{1}{2}(F_1(\oplus)+F_1(\ominus))$ .

In summary, our participation in the shared task shows that a two-step procedure does not lead to substantial improvements over one-step procedures. If we compare our approach with the other systems and with the baseline systems, we can conclude that comparatively simple systems already lead to state-of-the-art performance.

Target-specific Lexical Semantics

As shown above, our system does not benefit from semantic knowledge in the form of word vectors. However, in a post-competition analysis, the organizers of the shared task show that the performance of an SVM equipped with ngrams increases if one adds features which are derived from word vectors that are created from a domain-specific corpus (Mohammad et al., 2017). Similarly, several studies on sentiment analysis demonstrate that it is beneficial to rely on task-specific vectors instead of on general-purpose word vectors (Tang et al., 2016; Ren et al., 2016; Lan et al., 2016). To better understand why general-purpose embeddings are not helpful in the present task, we now investigate how a given target influences the perceived semantic relationship of words. In detail, we examine how the perceived relatedness of two words changes if we present them in the context of a target. Word relatedness refers to the degree to which the meaning of two words is related. It is similar to the concept of word similarity, but is more general (Budanitsky and Hirst, 2006; Zesch and Gurevych, 2010). If two words are similar, they describe a similar concept (e.g. electricity and power). Word relatedness also subsumes other relations such as antonymy (e.g. calm vs. rough), functional relationships (e.g. plant (produces) energy), or common associations (e.g. power and supply).

We hypothesize that people judge the relatedness of words differently when they are presented with a specific context. For example, without any context the words baby and mother should be closely related, while baby and murder should be more distantly related. However, in the context of a debate on abortion, the words baby and murder seem to have a closer connection. In addition, we hypothesize that general-purpose word vectors predict the relatedness of words less well when the words are judged in context than when they are judged without context.

Judging Word Relatedness in Context

In order to systematically examine these hypotheses, we construct a set of words that should be subject to such a semantic shift in the five SemEval targets. For constructing the sets, we first examine the degree of association between nouns and the classes $\oplus$ and $\ominus$ in the SemEval data using the statistical association measure Dice (Smadja et al., 1996). The idea behind this is that words that are particularly indicative for a stance could be particularly affected by the unique semantics of a debate (as opposed to words that do not contribute much to one’s stance, such as articles).

Next, for each of the five targets in the SemEval data, we manually form twenty pairs for which we hypothesize that the context causes a stronger relatedness between the words. This results in a set of 100 pairs. We also add 30 control pairs from the WordSim-353 dataset (Finkelstein et al., 2001). We select these 30 pairs according to an equal distribution across the spectrum of relatedness.

To measure the difference in perceived relatedness, we use a questionnaire which was submitted to a total of 109 participants. Specifically, we choose a between-group design with a treatment group, which was given a context (59 participants), and a control group, which was given no context (50 participants). To set the context for the treatment group, we provide the participants with a short definition and a symbolic image of the target, both of which were taken from the corresponding Wikipedia articles. In each group, we let the participants estimate the relatedness of the pairs on a one-to-five scale where one means UNRELATED and five means HIGHLY RELATED. The data collection was conducted in a laboratory setting. As compensation the participants received 10€ or subject hour certificates (as needed by their study program).

After the survey, we average the ratings to obtain a relatedness score for each word pair. To measure statistical significance, we aggregate the scores for each target and compare the scores using a t-test. The results of this comparison are shown in Table 3.3. As we are conducting multiple independent significance tests, there is a risk of erroneous significant results. Hence, we adjust the commonly used significance level of $.05$ using Bonferroni correction. As we are conducting fifteen significance tests (three subsets for each of the five targets), we adjust the significance level to $\frac{.05}{15}=.003$ . We show the complete list of averaged judgments per word pair in Appendix A.1.
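The adjustment is simple arithmetic and can be reproduced in a few lines. The sketch below uses the divisor of 15 that appears in the adjustment above; the function names and the two example p-values are invented for illustration.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Bonferroni correction: divide the significance level by the number of tests."""
    return alpha / n_tests


adjusted = bonferroni_alpha(0.05, 15)


def is_significant(p_value: float, alpha: float = adjusted) -> bool:
    """A result counts as significant only if p falls below the adjusted level."""
    return p_value < alpha


print(round(adjusted, 3))        # adjusted significance level
print(is_significant(0.001))     # invented p-value below the adjusted level
print(is_significant(0.01))      # invented p-value that would pass .05 but not the adjusted level
```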

Table 3.3

target	subset	difference
abortion	all pairs	+.36*
	domain-specific pairs	+.70*
	control pairs	-.78*
atheism	all pairs	+.06
	domain-specific pairs	+.24
	control pairs	-.56*
climate change	all pairs	+.48*
	domain-specific pairs	+.77*
	control pairs	-.39
feminism	all pairs	+.23
	domain-specific pairs	+.54*
	control pairs	-.84*
Hillary Clinton	all pairs	+.27
	domain-specific pairs	+.38*
	control pairs	-.22

Mean differences in the judgments of word relatedness between the two conditions (i.e. with context $-$ without context) according to the different targets. * indicates statistical significance under a Bonferroni-adjusted significance level ( $p<.003$ ) as measured with multiple t-tests.

For the targets climate change and abortion, we observe a significant change when presenting a context to the participants. The change is an increase in relatedness. That is, on average, the pairs are judged to be more strongly related.

If we differentiate between the control pairs and the domain-specific pairs, the effect becomes more pronounced. For the domain-specific pairs, we find a significant increase of relatedness for all targets except atheism. The mean difference for the target climate change is even $+.77$ , which is an increase of almost 20%. For some of the individual pairs the increase is larger than 1.5 (e.g. the score of the pair choice-body changes from $2.60$ to $4.22$ in the context of climate change), which is substantial given the one-to-five scale. A reason for the rather small change due to the context of atheism could be that many of the extracted pairs (e.g. bible-sins, lord-joy, or faith-superstitions) strongly evoke a religious context by themselves.

For three of the five targets (abortion, feminism, and atheism), we observe that presenting a context results in significantly decreased relatedness ratings for the control pairs. In detail, we find that this effect is especially strong for pairs that have a relatively high relatedness score if presented without context. For instance, the score of the pair midday-noon decreases from $4.79$ to $3.27$ if presented in the context of atheism. Hence, presenting a special context seems to suppress other contexts in which there are particularly strong relationships. In contrast, we hardly find any effect of the context for control pairs that have a relatively low relatedness without context. For example, the perceived relatedness of king-cabbage changes only from $1.61$ to $1.63$ in the context of abortion.

Automatically Measuring Word Relatedness in Context

Next, we analyze whether the observed change has an impact on how well word relatedness can be estimated automatically using word vectors. To this end, we download a set of publicly available pretrained word vectors and test how well they estimate the relatedness. For estimating relatedness, we simply calculate the cosine similarity of the vectors that correspond to the two words in a pair. Finally, we evaluate the performance of this approach by correlating the predicted relatedness with the collected relatedness judgments.
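This evaluation procedure can be sketched as follows. The two-dimensional vectors and the human ratings are invented toy values, and the use of the Pearson coefficient is an assumption made for this sketch, since the description above does not fix a particular correlation coefficient.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient (assumed here; a rank correlation would also be possible)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Invented toy vectors and invented human relatedness ratings (one-to-five scale)
vectors = {"baby": [0.9, 0.1], "mother": [0.8, 0.3], "murder": [0.1, 0.9]}
pairs = [("baby", "mother"), ("baby", "murder"), ("mother", "murder")]
human = [4.5, 2.0, 1.5]

predicted = [cosine(vectors[a], vectors[b]) for a, b in pairs]
r = pearson(predicted, human)
print(r)
```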

We use the following pre-trained word vectors: (i) all three available versions of the FastText vectors (Bojanowski et al., 2017; available from https://fasttext.cc/docs/en/english-vectors.html, last accessed November 5th 2018). The vectors were trained either on the CommonCrawl (commoncrawl.org; last accessed November 5th 2018) or on a mixture of Wikipedia and curated news sources. The latter is available in versions trained with and without sub-word information. All FastText vectors have 300 dimensions. (ii) all available GloVe vectors (Pennington et al., 2014; available from https://nlp.stanford.edu/projects/glove/, last accessed November 5th 2018). The GloVe vectors were trained on a Twitter corpus, on a combination of Wikipedia and the GigaWord corpus (Parker et al., 2011), and on the CommonCrawl. The vectors are available with several different dimensionalities ranging from 25 to 300. (iii) the English Polyglot vectors (Al-Rfou et al., 2013), which are trained on Wikipedia (available from https://sites.google.com/site/rmyeid/projects/polyglot; last accessed November 5th 2018). The Polyglot vectors have 64 dimensions.

We visualize the result of this evaluation in Figure 3.15. We find that, both for the relatedness with and without context, the prediction works substantially better on the control pairs than on the domain-specific pairs. For some vectors (e.g. FastText (d=300) common crawl) the difference between control and domain-specific pairs is over 60%. We conclude that the semantics of those words that are crucial for stance detection is only insufficiently modelled.

There is a constant drop of about 10% between predicting relatedness with context and without context for the control pairs. This is in line with our hypothesis that context influences the prediction of word relatedness. However, for the domain-specific word pairs, we see this effect only for a few vectors (e.g. the FastText vectors). In some cases, the prediction for contextualized relatedness works even slightly better (e.g. GloVe (d=300) Wikipedia+Gigaword). Hence, the context does not seem to have a major impact on the prediction of word relatedness for the domain-specific word pairs. However, this could also be a result of the generally low performance of predicting the relatedness of our domain-specific words. Overall, those vectors trained on curated text (e.g. Gigaword or Wikipedia) seem to perform slightly better than those trained on social media or web data.

In summary, our experiments on target-specific lexical semantics result in two findings: First, we show that presenting words in the context of a target has a significant influence on human word-relatedness judgments. Second, we demonstrate that state-of-the-art methods for automatically measuring word relatedness are not able to account for this influence, mainly because the relationship between words that are particularly important for our debates is poorly modelled. We conclude that methods that can account for the influence of targets on word relatedness would also be a useful means to advance the state of the art in automatic stance detection.

Figure 3.15

Correlation coefficients of the relatedness prediction based on different word embeddings.

Combining Neural and Non-neural Classification

In the previously described SemEval task, we identified two major strands of approaches: (i) neural architectures that translate the input into a sequence of word vectors and feed it into neural networks, and (ii) more traditional approaches that represent the input based on ngram, word-vector, and sentiment features and then train a classification function with learning algorithms such as SVMs. As both strands are based on highly different learning paradigms, we hypothesize that a hybrid system that combines both strands could have the strengths of both and yield superior performance. We test this hypothesis by participating in the IberEval task Stance and Gender Detection in Tweets on Catalan Independence (Taulé et al., 2017) with such a hybrid system.

Data

The goal of the IberEval competition is to compare systems that are able to detect the stance towards the target independence of Catalonia. The occasion is the referendum on the independence of Catalonia held on October 1st, 2017 by the regional government of Catalonia.

The dataset for the competition was collected in a similar way as the SemEval dataset. However, in the IberEval competition the tweets were collected in Spanish and Catalan. First, the organizers collected a set of tweets using the query hashtags #Independencia and #27S (#27S refers to the date of the regional elections on 27th September 2015, in which a pro-independence coalition secured a majority in the regional parliament). Next, the tweets were labeled with the polarity classes $\oplus$ , $\ominus$ , and NONE by three trained annotators. To give an impression of the data, below, we show translated examples (from Spanish) of tweets that express a $\oplus$ stance, a $\ominus$ stance, and a NONE stance towards the independence of Catalonia:

$\oplus$ : Who can advocate that Catalan students receive only 5% of state scholarships and those in Madrid receive 58%? (original: Quien puede defender que los estudiantes catalanes reciban solo el 5% de las becas del estado y los de Madrid reciban el 58%)
$\ominus$ : High participation benefits the non-independence parties. Let’s go Catalans, let’s vote! Better united! (original: Alta participacion beneficia a los partidos no independentista. Vamos catalanes, a votar! Mejor unidos!)
NONE: I’m going to vote today. All Catalans should go vote. Whoever does not vote, is not allowed to complain. (original: Hoy voy a votar. Todos los catalanes deberian ir a votar. Quien no vote, que luego no se queje.)

We show the class distribution for both languages in Table 3.4. The distribution shows a clear imbalance in both datasets. Less than 10% of the Catalan data is labeled with $\ominus$ stance. Contrary to this, less than 10% of the Spanish tweets are labeled with $\oplus$ stance.

Table 3.4

dataset	$\oplus$	$\ominus$	NONE	$\sum$
Catalan	3331	163	1926	5420
Spanish	419	1807	3174	5400

Dataset (train + test) used in the IberEval task Stance and Gender Detection in Tweets on Catalan Independence

System

A hybrid system combining neural and classical approaches can be successful if the individual approaches have strengths and weaknesses that balance each other out. For example, it would be conceivable that the SVM, as a linear classifier, generalizes better to unseen data, but reacts less sensitively to key sequences. Hence, for our hybrid system we train both an SVM and an LSTM classifier. Next, we design a system that decides whether to rely on the SVM or the LSTM prediction. As both approaches require segmented text, we tokenize the tweets using the Twokenizer (Gimpel et al., 2011) from the DKPro Core framework, version 1.9.0 (Eckart de Castilho and Gurevych, 2014).

As the SVM classifier, we use a linear kernel SVM provided by the DKPro TC framework. We equip the SVM with word uni-, bi-, and trigram features, and with character bi-, tri-, and four-gram features. We also use word vector features. To this end, we first average the FastText (Bojanowski et al., 2017) vectors of all words of a tweet and then use each dimension of this averaged vector as a feature.
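The averaging step behind the word vector features can be illustrated with a minimal sketch. The function name, the toy vectors, and the handling of words without a vector (they are skipped, and a tweet without any covered word receives a zero vector) are assumptions made for illustration and are not taken from the original system.

```python
from typing import Dict, List, Sequence


def average_word_vector(tokens: Sequence[str],
                        vectors: Dict[str, List[float]],
                        dim: int) -> List[float]:
    """Average the vectors of all tokens that are covered by `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Assumption: tweets without any covered word get an all-zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy two-dimensional vectors (hypothetical values, not real FastText vectors).
toy_vectors = {"votar": [1.0, 0.0], "unidos": [0.0, 1.0]}
features = average_word_vector(["a", "votar", "unidos"], toy_vectors, dim=2)
```

Each entry of `features` would then be passed to the classifier as one numeric feature, alongside the ngram features.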

The LSTM classifier is implemented using the Keras framework and the Theano backend. In the initial layer of the network, we translate the input data into sequences of word vectors. We use the Spanish and Catalan FastText vectors provided by Bojanowski et al. (2017). The next layer is the bidirectional LSTM layer, which has 138 LSTM units. During development, we found that it is beneficial to add another dense layer after the LSTM layer. The final layer is a softmax classification layer.

In order to create the hybrid system, we first classify all tweets with both the SVM and the LSTM. Then, for each classifier and each language, we label every tweet according to whether the prediction was right (TRUE) or wrong (FALSE). Afterwards, we train a new classifier to automate this decision for each approach. For instance, we learn a classifier that predicts whether the SVM correctly or incorrectly classifies the tweets. To be able to explain the decisions of the hybrid system, our goal was to base this classification on a small set of rules. Hence, we use a decision tree (weka’s J48) for this classification. For the decision tree, we represent the tweets with features that may affect whether a tweet can better be classified with an SVM or an LSTM:

Number of Tokens per Tweet: SVMs and LSTMs differ in the maximum length of the sequences that they can model. While the SVM is limited to the length of the ngrams, the LSTM can potentially capture whole tweets.

Text Similarity: LSTMs and SVMs are both dependent on the test data being similar to the training data. We model the similarity between instances in terms of their ngram overlap.

Type-token-ratio: Redundancy within the tweets may also affect their classifiability. Hence, we use the type-token-ratio as a feature to capture the lexical complexity of the tweets.

Word Vector Coverage: As both approaches depend on pre-trained word vectors, we measure the proportion of words that are covered by the resource of Bojanowski et al. (2017) and the proportion of words that are not covered by it.
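The four feature types listed above could be computed along the following lines. This is only a sketch: the function names are hypothetical, and in particular the exact way in which ngram overlap is measured is not specified in the text, so the share of a tweet's bigrams that also occur in the training data is used here as one plausible reading.

```python
from typing import Dict, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def tree_features(tokens: Sequence[str],
                  embedding_vocab: Set[str],
                  training_ngrams: Set[Tuple[str, ...]],
                  n: int = 2) -> Dict[str, float]:
    """Features describing a tweet for the SVM-vs-LSTM decision tree."""
    tweet_ngrams = ngrams(tokens, n)
    length = len(tokens)
    return {
        "num_tokens": length,
        "type_token_ratio": len(set(tokens)) / length if length else 0.0,
        "vector_coverage": (sum(t in embedding_vocab for t in tokens) / length
                            if length else 0.0),
        # Assumption: overlap = share of the tweet's n-grams seen in training.
        "ngram_overlap": (len(tweet_ngrams & training_ngrams) / len(tweet_ngrams)
                          if tweet_ngrams else 0.0),
    }


tokens = ["vamos", "catalanes", "a", "votar", "a", "votar"]
feats = tree_features(tokens,
                      embedding_vocab={"vamos", "a", "votar"},
                      training_ngrams={("a", "votar"), ("hoy", "voy")})
```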

To conduct the final classification ( $\ominus$ , $\oplus$ , or NONE), we follow the recommendation of the tree classifier and select a prediction accordingly. For example, if the tree predicts for a Spanish tweet that the LSTM is correct and the SVM is mispredicting, we choose the LSTM prediction. If the tree predicts that either both systems are correct or that both systems are wrong, we choose the SVM prediction as it is more reliable (cf. Table 3.5). Figure 3.16 visualizes the architecture of the hybrid system.
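The selection rule described above can be written down compactly. The sketch below assumes that the two decision trees deliver one boolean per classifier (TRUE if the tree expects the classifier to be right); the function and argument names are hypothetical, and the string labels stand in for $\oplus$ , $\ominus$ , and NONE.

```python
def hybrid_prediction(svm_label: str, lstm_label: str,
                      svm_expected_correct: bool,
                      lstm_expected_correct: bool) -> str:
    """Pick the LSTM label only if the trees trust the LSTM but not the SVM."""
    if lstm_expected_correct and not svm_expected_correct:
        return lstm_label
    # Both trusted, both distrusted, or only the SVM trusted: fall back to
    # the SVM, which is the more reliable classifier in cross-validation.
    return svm_label


choice = hybrid_prediction("AGAINST", "FAVOR",
                           svm_expected_correct=False,
                           lstm_expected_correct=True)
```

Because the SVM is the fallback in three of the four possible cases, the hybrid output can only differ from the SVM output where the trees trust the LSTM alone.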

Figure 3.16

Overview of our hybrid approach on predicting stance towards Catalan Independence.

Results

Before comparing the performance of our system to the performance of other submissions, we evaluate the performance of its components. For these evaluations, we measure their performance in a tenfold cross-validation on the training data. The official metric assigns equal weight to the classes $\oplus$ and $\ominus$ . However, since the class distribution is imbalanced, the rarer of these two classes is weighted disproportionately. For instance, on the Catalan dataset, a system that correctly predicts all 3331 $\oplus$ tweets ( $F_1(\oplus)=1$ ) but that incorrectly predicts the 163 $\ominus$ tweets would receive an official score of only $.5$ . Hence, we here report the micro-averaged $F_1$ over all classes. We show this evaluation in Table 3.5.
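The contrast between the two metrics can be reproduced with a small computation. The sketch below rebuilds the Catalan example from the class counts in Table 3.4; that the hypothetical system labels all $\ominus$ tweets as NONE and gets every NONE tweet right is an assumption added to make the example complete. The labels FAVOR and AGAINST stand in for $\oplus$ and $\ominus$ .

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 of a single class, computed from true/false positives and negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def official_score(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average of the F1 scores of the two polar classes (NONE is ignored)."""
    return (f1_for_class(gold, pred, "FAVOR")
            + f1_for_class(gold, pred, "AGAINST")) / 2


def micro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Micro-averaged F1 over all classes (counts are pooled across classes)."""
    labels = set(gold) | set(pred)
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    fp = sum(1 for l in labels for g, p in zip(gold, pred) if g != l and p == l)
    fn = sum(1 for l in labels for g, p in zip(gold, pred) if g == l and p != l)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Catalan class counts from Table 3.4.
gold = ["FAVOR"] * 3331 + ["AGAINST"] * 163 + ["NONE"] * 1926
# Assumed system output: every AGAINST tweet is mislabeled as NONE.
pred = ["FAVOR"] * 3331 + ["NONE"] * 163 + ["NONE"] * 1926

official = official_score(gold, pred)
micro = micro_f1(gold, pred)
```

In this constructed case, the official score is $.5$ although only 163 of 5420 tweets are misclassified, whereas the micro-averaged $F_1$ stays at roughly $.97$ .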

For the SVM we conduct an ablation test and find for the Catalan data that the performance decreases if we remove any feature type (word vector, character-, and word ngrams) from the model. For Spanish, however, we only observe a decrease for word ngrams. We conclude that, as in SemEval, ngrams are the most important feature. Since the word vector and character ngram features do not decrease the model’s performance on the Spanish data, we keep them in the model.

When comparing the performance of SVM, LSTM and hybrid prediction, we find a consistently better performance for Catalan than for Spanish. The difference is 10 percentage points for the SVM and the hybrid approach, and 6 percentage points for the LSTM. We also observe that for both languages the SVM and the hybrid prediction outperform the LSTM.

Interestingly, we observe an almost identical performance for the SVM and the hybrid prediction. To examine this further, we inspect how similar the predictions of the models are. As a measure of the similarity of the predictions, we calculate the agreement metric Cohen’s $\kappa$ (cf. Section 4.1.2). We find rather high agreement scores between the SVM and the hybrid prediction for both Catalan ( $\kappa=.92$ , 169 differing predictions) and Spanish ( $\kappa=.90$ , 230 differing predictions). As the agreement between SVM and LSTM is rather low ( $\kappa=.28$ for Spanish and $\kappa=.39$ for Catalan), the hybrid approach seems to lean mainly in the direction of the SVM prediction.
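For readers who want to reproduce such a comparison, Cohen’s $\kappa$ between two label sequences can be computed as $\kappa = (p_o - p_e)/(1 - p_e)$ , where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The following sketch uses toy predictions, not the actual system outputs.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two equally long sequences of categorical labels."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the marginal label proportions, summed.
    expected = sum(counts_a[l] * counts_b[l]
                   for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy predictions of two systems on four tweets (hypothetical values).
system_1 = ["FAVOR", "FAVOR", "AGAINST", "NONE"]
system_2 = ["FAVOR", "AGAINST", "AGAINST", "NONE"]
kappa = cohens_kappa(system_1, system_2)
```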

Finally, we evaluate the decision tree’s performance for predicting whether tweets are correctly or incorrectly classified by SVMs or LSTMs. We find that the prediction works better for the SVM (Spanish: .75; Catalan: .66) than for the LSTM (Spanish: .75; Catalan: .66). The rather mediocre performance (given that random guessing would result in a score of .5) shows that this component is the weak point of the hybrid system and that the whole approach would benefit from further improving the type prediction. To nevertheless demonstrate the potential of the hybrid approach, we run an experiment with an oracle condition, in which the decision tree always correctly predicts whether a tweet is better classified with an SVM or an LSTM. The performance of a hybrid system that uses this oracle condition is shown in Table 3.5. In this oracle experiment, we observe that the performance increases by .09 for Catalan and by .13 for Spanish. This emphasizes that a hybrid approach could be superior to purely SVM- or LSTM-based approaches if the type prediction could be improved.
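The oracle condition amounts to counting a tweet as correctly classified whenever at least one of the two classifiers is right. A minimal sketch with toy data follows; it reports accuracy, which coincides with the micro-averaged $F_1$ when every tweet receives exactly one label.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Share of tweets whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def oracle_accuracy(gold: Sequence[str], svm_pred: Sequence[str],
                    lstm_pred: Sequence[str]) -> float:
    """Upper bound: a tweet counts as correct if either classifier is right."""
    hits = sum(1 for g, s, l in zip(gold, svm_pred, lstm_pred) if g in (s, l))
    return hits / len(gold)


# Toy data (hypothetical labels, not the shared-task predictions).
gold = ["FAVOR", "AGAINST", "NONE", "FAVOR"]
svm_pred = ["FAVOR", "NONE", "NONE", "AGAINST"]
lstm_pred = ["AGAINST", "AGAINST", "NONE", "AGAINST"]

upper_bound = oracle_accuracy(gold, svm_pred, lstm_pred)
```

In this toy case, both classifiers reach .5 on their own, while the oracle combination reaches .75 because their errors only partly overlap.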

Table 3.5

	Catalan	Spanish
SVM	.80	.70
SVM w/o word vectors	.79	.70
SVM w/o character ngrams	.78	.70
SVM w/o word ngrams	.78	.68
biLSTM	.70	.64
hybrid (J48-based)	.80	.70
hybrid (oracle)	.89	.83

Performance of the neural, non-neural and hybrid system on predicting stance towards Catalan Independence. The performance is reported in terms of the micro-averaged $F_1$ -score that is obtained from the ten-fold cross-validation on the training data. For the SVM we show the results of a feature ablation test.

Comparison with Other Approaches

For comparing our system with the other submissions, we rely on the official metric $\frac{F_1(\oplus)+F_1(\ominus)}{2}$ . As on the training data, our hybrid system performs very similarly to our SVM. Again, our LSTM performs worse than our other approaches.

In Figure 3.17 we compare the performance of our submissions to the performance of all submissions of the shared task. Each team was allowed to submit multiple predictions. However, we limit ourselves to the best submission per language.

We observe that no team obtained a score of over .5. The majority class baseline is outperformed by only a few systems and only by a small percentage. For Catalan the difference between the best performing system iTacos (Lai et al., 2017a) and the baseline is even less than one percent. This shows that stance detection is a difficult task. This finding is consistent with the findings of the SemEval task.

Among the systems there is again a variety of neural and more traditional (mostly SVM-based) approaches. We notice that the top scoring systems utilized rich feature sets, which include stylistic features (e.g. POS ratio or number of capitalized words) and social media-specific features (e.g. occurrence of a URL or hashtag). Interestingly, the neural approaches (attope, our LSTM, and deepCybErNet) are the lowest scoring submissions in both languages.

In summary, our participation in StanceCat@IberEval shows that our hybrid system has the potential to outperform state-of-the-art systems. However, the experiment also shows that we are not yet able to prove this claim under real-life conditions. We argue that once we understand which instances can be predicted better with neural systems and which instances can be predicted better with classical systems, we might be able to present superior hybrid systems.

Figure 3.17

Performance of approaches in the IberEval task Stance and Gender Detection in Tweets on Catalan Independence. We only show the best performing submissions of the other teams.

Stance in Social Media Customer Feedback

For our final study on single target stance, we turn to a commercial topic. Specifically, we investigate texts from various social media sources that express a $\oplus$ , $\ominus$ , or NEUTRAL evaluation towards the services of the largest German railway company Deutsche Bahn, which carries over twelve million passengers each day (according to the company's own information: https://www.deutschebahn.com/en/group/ataglance/facts_figures-1776344; last accessed November 5th, 2018). Since a large number of people use Deutsche Bahn services every day, it is not surprising that many of them share their experiences on social media platforms. These experiences can provide other customers with valuable insights on the quality of services and may be a source of feedback for the Deutsche Bahn. Hence, analyzing the feedback of Deutsche Bahn customers seems to be a valuable application domain for stance detection.

In contrast to the previous studies, we do not participate in a shared task, but organize it this time. The aim of the shared task is to model all subtasks that are important for an automatic analysis of customer reviews in the real world. Hence, we define four subtasks ranging from (A) determining relevance of reviews, (B) classifying which polarity ( $\oplus$ , $\ominus$ , or NEUTRAL) is expressed towards the target as a whole, (C) classifying which polarity is expressed towards aspects of the service, to (D) extracting linguistic expressions which are used to express polarity on aspects. Note that we choose to label the task as an aspect-based sentiment task as the term is more popular in the commercial domain.

Here, we only report our findings on subtasks A (relevance detection) and B (classification of expressed polarity towards the target) as they directly correspond to single target stance detection. Relevance detection is about identifying whether a review contains an evaluation of the target or whether it is off-topic. Thus, relevance detection corresponds to classifying whether a text contains any stance on the target or whether a text expresses a NONE stance. Classifying target-specific reviews into $\oplus$ , $\ominus$ , or NEUTRAL evaluations corresponds to the stance schema of the previously described shared tasks, where one has to decide if a text expresses a $\oplus$ , $\ominus$ , or NONE stance towards a target.

Data

The data for our shared task was collected at Technische Universität Darmstadt as part of the project ABSA-DB: Aspect-based Sentiment Analysis for DB Products and Services (thus, the corresponding project members are the ones who deserve credit for creating the dataset). The raw data was collected from various social media sources such as social networking sites (e.g. posts on Facebook or Google plus), microblogs (e.g. Twitter), news forums (e.g. www.spiegel.de/forum), and Q&A sites (e.g. www.gutefrage.net). These sources were crawled for query terms such as deutsche bahn and – to reduce the massive amount of by-catch – additionally automatically filtered for relevance.

Next, for each month in the period from May 2015 to June 2016, about 1,500 reviews were randomly selected. After collecting the reviews, the data was randomly divided into a train (80% of the data), development (10% of the data) and test (10% of the data) set. In addition to the synchronic test set, which originates from the same period as the train data, a diachronic test set was created, which dates from the period from November 2016 to January 2017.

The collected customer feedback was annotated by a team consisting of six trained annotators and a curator. The team iteratively refined the annotation guidelines to improve the reliability of annotations.

For subtask A, each review is annotated based on whether it contains feedback on the target Deutsche Bahn or whether it is off-topic. We show the distribution of relevant and irrelevant documents across the different sets in Table 3.6. We notice that there are substantially more relevant than irrelevant documents in all sets. We attribute this to the pre-filtering of documents according to relevance.

Table 3.6

dataset	relevant	irrelevant	$\sum$
train	16,201	3,231	19,432
development	1,931	438	2,369
test_synchron	2,095	471	2,566
test_diachron	1,547	295	1,842

Distribution of relevance labels in GermEval 2017 (A) relevance detection.

Subsequently, the annotators labeled each review based on whether it refers to an aspect and whether this aspect is positively or negatively evaluated. The aspects to be annotated were predefined within an aspect inventory, which also included an option for reviews that do not clearly address one aspect of the inventory (called general aspect). Based on these annotations, we derive a document sentiment (resp. stance) towards the target using a set of rules: If a review contains only positive (resp. negative) and neutral aspect evaluations, we define the overall stance of the review to be $\oplus$ (resp. $\ominus$ ). If a review contains both negative and positive evaluations, we set the stance to be NEUTRAL regardless of the number of negative and positive evaluations. Table 3.7 shows the resulting distribution of document stances across the different datasets. Again we observe a skewed class distribution, with roughly twice as many NEUTRAL instances as the sum of $\oplus$ and $\ominus$ instances. It is noteworthy that there are substantially more $\ominus$ instances than $\oplus$ instances. Hence, it seems to be more common to vent anger about delays or other inconveniences than to report positive experiences.
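The aggregation rules described above can be sketched as a small function. The polarity strings and the function name are hypothetical; in addition, the text does not state how reviews without any positive or negative aspect evaluation are treated, so assigning NEUTRAL in that case is an assumption of this sketch. FAVOR and AGAINST stand in for $\oplus$ and $\ominus$ .

```python
from typing import Sequence


def document_stance(aspect_polarities: Sequence[str]) -> str:
    """Derive a document-level stance from aspect-level polarity labels."""
    has_positive = "positive" in aspect_polarities
    has_negative = "negative" in aspect_polarities
    if has_positive and not has_negative:
        return "FAVOR"      # only positive (and possibly neutral) aspects
    if has_negative and not has_positive:
        return "AGAINST"    # only negative (and possibly neutral) aspects
    # Mixed evaluations are NEUTRAL regardless of their counts.
    # Assumption: reviews without any polar aspect are NEUTRAL as well.
    return "NEUTRAL"


stance = document_stance(["negative", "neutral", "negative"])
```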

Table 3.7

dataset	$\oplus$	$\ominus$	NEUTRAL
train	1,179	5,045	13,208
development	148	589	1,632
test_synchron	105	780	1,681
test_diachron	108	497	1,237

Distribution of sentiment classes in GermEval 2017 (B) sentiment detection.

Results

Subtask (A) attracted five teams and subtask (B) attracted eight teams. As each team was allowed to submit several predictions, we received a large variety of different approaches. Again, we will only report the best-performing submission for each team. For both subtasks, we report the micro-averaged $F_1$ score as an evaluation metric. We choose this metric because it weighs all instances equally and is thus not distorted by the imbalance of the classes. We compare the performance of the approaches with a majority class baseline and a simple SVM-based text classification approach. This baseline SVM classifier is only equipped with frequency-weighted unigrams and sentiment features derived from the German sentiment lexicon by Waltinger (2010). The baseline system is described in more detail, together with additional, more advanced features, in Ruppert et al. (2017). We also include the SVM that uses these advanced features – such as word vector features or features that are derived from an automatically expanded sentiment lexicon – in our comparison.

Several participating teams use hand-crafted rules to either normalize social media language (e.g. emoticons or URLs) or to derive features from it (Sayyed et al., 2017; Sidarenka, 2017; Mishra et al., 2017; Hövelmann and Friedrich, 2017). Most of the teams rely on semantic information by using some form of word vectors. The two teams Naderalvojoud et al. (2017) and Lee et al. (2017) even experiment with training task-specific word or sentence representations. In addition, most teams use some form of word polarity lexicons such as SentiWS (Remus et al., 2010) or the lexicon by Waltinger (2010). Among the used classifiers, we see linguistically motivated and threshold-based approaches (Schulz et al., 2017), neural approaches (e.g. Mishra et al. (2017)), and meta-learning approaches such as gradient boosting (e.g. Sayyed et al. (2017)) or other forms of ensemble learning (e.g. Sidarenka (2017) or Lee et al. (2017)).

Table 3.8

team	approach	synchronic	diachronic
Sayyed et al. (2017)	gradient boosted trees	.903	.906
Hövelmann and Friedrich (2017)	fastText classifier	.899	.897
Ruppert et al. (2017)	SVM (extended)	.895	.894
Mishra et al. (2017)	biLSTM	.879	.970
Lee et al. (2017)	stacked NN	.873	.881
Ruppert et al. (2017)	SVM (weighted unigrams + lexicon)	.852	.868
UH-HHU-G	ridge classifier	.835	.849
organizers	majority class baseline	.816	.839

Results for GermEval 2017 (A) relevance detection. We only show the best submission of each team. Baselines and reference implementations are italicized.

We compare all these approaches in Table 3.8 for subtask (A) relevance detection and in Table 3.9 for subtask (B) sentiment classification. In the overall comparison, as expected, we find that the relevance subtask can be better solved than the sentiment subtask. Surprisingly, for both the relevance and sentiment classification we do not see large differences between the synchronic and diachronic test sets. In some cases, the diachronic set can be predicted even better than the synchronic set. From this, we conclude that the models learn little time-specific bias and generalize well to different time periods.

In the relevance classification, most teams outperform both the majority class baseline and the simple SVM classifier. Due to the skewed class distribution, the majority class baseline already yields a strong performance ( $F_1=.816$ ), which even the best system exceeds by less than 10 percentage points. The SVM with the extended feature set is outperformed by only two approaches, and only by a tight margin of less than one percentage point.
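The strength of the majority class baseline follows directly from the class distribution: when every document receives exactly one label, the micro-averaged $F_1$ equals the share of correctly labeled documents, so a system that always predicts the most frequent class scores the relative frequency of that class. The short sketch below checks this for the synchronic test set using the counts from Table 3.6; the function name is hypothetical.

```python
from typing import Dict


def majority_baseline_micro_f1(class_counts: Dict[str, int]) -> float:
    """Micro-F1 of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Counts of the synchronic test set from Table 3.6.
test_synchron = {"relevant": 2095, "irrelevant": 471}
baseline = round(majority_baseline_micro_f1(test_synchron), 3)
```

The rounded value is .816, which matches the baseline score reported for the synchronic test set.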

As in subtask (A), in subtask (B), we find that the majority class prediction and the simple SVM classifier are strong baselines, but that most teams are able to outperform them. We notice that – with a difference of less than one percentage point – the three top scoring teams perform very similarly. This is surprising, since the approaches chosen are very different: Naderalvojoud et al. (2017) combine an RNN with a sentiment lexicon, Hövelmann and Friedrich (2017) make use of linguistic preprocessing and the fastText classifier, and Sidarenka (2017) learns an ensemble classifier from the predictions of an LSTM and an SVM. Another surprise is that the SVM with the advanced feature set beats even the best of the three by more than one percentage point.

Table 3.9

team	approach	synchronic	diachronic
Ruppert et al. (2017)	SVM (extended)	.767	.744
Naderalvojoud et al. (2017)	RNN lexicon	.749	.736
Hövelmann and Friedrich (2017)	fastText classifier	.748	.742
Sidarenka (2017)	biLSTM + SVM	.745	.718
Sayyed et al. (2017)	gradient boosted trees	.733	.750
Lee et al. (2017)	stacked NN	.722	.724
UH-HHU-G	ridge classifier	.692	.691
Mishra et al. (2017)	biLSTM	.685	.675
Ruppert et al. (2017)	SVM (weighted unigrams + lexicon)	.667	.694
organizers	majority class baseline	.656	.672
Schulz et al. (2017)	threshold based classification	.612	.616

Results for GermEval 2017 (B) sentiment classification. We only show the best submission of each team. Baselines and reference implementations are italicized.

All in all, we find that meta-learning strategies such as ensemble learning and the usage of sentiment lexicons are among the most successful strategies in both subtasks. However, the two baseline classifiers of Ruppert et al. (2017) show that traditional machine learning approaches are highly competitive and can even be superior. In addition, we find that social media-specific preprocessing and the use of word vectors also seem to be beneficial.

Chapter Summary

In this chapter, we reviewed the current state of the art of single-target stance detection systems. We first introduced a generic text classification framework, in which we can describe the majority of today’s stance detection systems. In particular, we discussed (i) more traditional machine learning approaches that are based on feature engineering and SVMs, and (ii) knowledge-light neural network architectures.

Subsequently, we reported on our participation in SemEval 2016 task 6 Detecting Stance in Tweets and IberEval 2017 task Stance and Gender Detection in Tweets on Catalan Independence, as well as the organization of GermEval 2017 Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Across all competitions, we find that single-target stance detection is a challenging task in which even simple majority baselines are hard to outperform. The submissions in the three tasks consistently show that the usage of word polarity lexicons leads to slight improvements of the systems. Further improvements can be achieved by incorporating the context (Augenstein et al., 2016; Du et al., 2017; Zhou et al., 2017).

The use of word vectors seems to be beneficial as well. However, our experiment on word-relatedness indicates that pre-trained word vectors may not sufficiently cover the lexical semantics of stance. The experiment also suggests that there is a potential for improving the state of the art by word vectors that model target-specific semantics.

Furthermore, we find that simple text classification systems that are equipped with ngram features only yield a highly competitive performance and are not clearly inferior to neural network-based systems. In this respect, stance detection differs from many other NLP tasks where neural approaches are the dominant paradigm. However, the results of IberEval 2017 and GermEval 2017 show that there is a potential for improving performance by combining neural, non-neural, or rule-based approaches.

In the next chapter, we will examine a more complex formalization of stance. Instead of stance that is expressed towards a single target, the subject of the next chapter models stance towards several, logically linked targets.

Stance on Multiple Targets

As people’s stances are often more complex than being $\oplus$ or $\ominus$ towards one target, in many scenarios it is necessary to adapt our formalization to this complexity. Consider the following three statements that may be uttered in a debate on wind energy:

(A) The only pollution caused by windmills is noise. Fine with me.
(B) Coal and gas will disappear. Wind is an inexhaustible source of energy.
(C) While windmills are expensive, they make us less dependent on gas imports.

All three statements clearly communicate a $\oplus$ stance towards wind energy. However, it may be important to capture that all three actually favor different aspects of wind energy: One could argue that Example (A) addresses the aspect of noise pollution by wind energy, Example (B) the aspect of inexhaustibility of wind, and Example (C) the aspect of independence from imports. Knowing which aspects are favored or disfavored may be important to many applications such as ranking positive and negative aspects, filtering debates for positions in which we are not interested, or identifying which aspects have not been discussed yet.

Additionally, utterances such as nuclear power is still the only way to go! may exist, which only implicitly express a $\ominus$ stance towards wind energy. However, that a person actually expresses a $\oplus$ stance towards nuclear power could be important for obtaining an adequate overview on the debate.

To capture this more fine-grained notion of stance, in this chapter, we will explore multi-target stance approaches that formalize stance towards an overall target and a set of logically linked targets. In Figure 4.1, we visualize an applied annotation scheme that models both an overall stance towards wind energy and more specific stances towards the logically linked targets inexhaustibility of wind, fossil energy sources, wind energy is expensive and threat to ecosystems.

Figure 4.1

Example of an applied annotation scheme which corresponds to the formalization Stance on Multiple Targets

For predefining the sets of specific targets, we rely on prior knowledge on a debate topic. Prior knowledge refers to any external knowledge which cannot be obtained directly from the present dataset. This includes deriving targets from a knowledge source such as Wikipedia or a debate website, but also manually selecting targets on the basis of frequent words in the data. We will examine approaches that are independent of external knowledge in the subsequent Chapter 5. We do not distinguish between targets which are single words or entities and targets that are whole statements. The target reality of climate change – which actually refers to the statement the climate change is real – shows that, in real world scenarios, the distinction is often hardly possible.

We study multi-target stance approaches by annotating social media data according to the formalization and conducting a series of quantitative analyses. While there are already several datasets on multi-target stance or related formalisms, in this chapter, we describe how we created two datasets ourselves. We conduct our own annotation studies for three reasons: First, many multi-target approaches (e.g. most of the works in aspect-based sentiment) are annotated on data from the commercial domain (cf. Section 2.3.2), whereas we want to contribute to the understanding of political and social issues with our work. Second, most multi-target stance approaches (e.g. Conrad et al. (2012), Hasan and Ng (2013), or Boltužić and Šnajder (2014)) utilize a binary stance polarity (i.e. $\oplus$ vs. $\ominus$ ). However, as we argued in Chapter 2, the NONE class is necessary for many social media use-cases. Third, by conducting our own data annotation, we have access to and control over the annotation process. This allows us to identify peculiarities in the annotation and to compare different variants of the annotation schemes.

Before reporting the findings of our annotation experiments, we take a look at how to annotate stance, as well as how one can estimate the reliability of annotations, and how to treat instances that implicitly express stance.

Annotating Stance on Multiple Targets

Figure 4.2

Example of annotating multi-target stance. We show that three annotators annotate stance towards the two targets $\textit{A}$ and $\textit{B}$ . A final label is obtained through an additional curation phase in which we follow the majority vote of the annotators.

In the previous chapter, we described that machine learning approaches on stance detection are trained and evaluated with the help of gold labels. As stance is a highly semantic and pragmatic task (cf. Chapter 3), human judges are needed to interpret texts and assign these gold labels. If we want to annotate a stance towards the single target x of a text (cf. Chapter 3), we can ask human subjects to infer (Mohammad et al., 2016):

whether the author of the text is in favor of $x$ ( $\oplus$ ),
whether the author of the text is against $x$ ( $\ominus$ ), or
whether the author of the text is neutral towards $x$ or if the text is off-topic (NONE).

If we want to collect stance towards the target set $Y$ , we can simply repeat the procedure for all elements in $Y$ (Sobhani et al., 2017). We exemplify this process for two targets in Figure 4.2.
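Repeating the single-target procedure for every target, followed by a majority vote as in the curation phase of Figure 4.2, could look as follows. This is a sketch with hypothetical names and toy labels; how ties are resolved is not described here, so the sketch returns `None` for targets without a strict majority and leaves them to a curator. FAVOR and AGAINST stand in for $\oplus$ and $\ominus$ .

```python
from collections import Counter
from typing import Dict, List, Optional


def curate(annotations: List[Dict[str, str]]) -> Dict[str, Optional[str]]:
    """Majority vote per target over the labels of several annotators.

    `annotations` holds one dict per annotator, mapping target -> label.
    """
    final: Dict[str, Optional[str]] = {}
    for target in annotations[0]:
        votes = Counter(a[target] for a in annotations)
        label, count = votes.most_common(1)[0]
        # Assumption: without a strict majority the target stays undecided.
        final[target] = label if count > len(annotations) / 2 else None
    return final


# Three annotators label stance towards the two targets A and B (toy labels).
annotations = [
    {"A": "FAVOR", "B": "NONE"},
    {"A": "FAVOR", "B": "AGAINST"},
    {"A": "AGAINST", "B": "NONE"},
]
gold_labels = curate(annotations)
```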

The annotation scheme does not restrict how much interpretation and what kind of inferences are allowed for the annotation. However, these degrees of freedom also make the annotation subjective. We will now take a closer look at the possible extent of subjectivity and which challenges arise from it. Subsequently, we will show how to quantify these difficulties by means of reliability measures.

Inferring Stance from Text

The degrees of freedom one has when interpreting utterances are particularly problematic for multi-target stance scenarios, since all targets of the target set usually have a logical connection (e.g. part-of-relationships). Suppose we have the utterance Nuclear power poses unpredictable risks! and want to annotate stance towards the targets nuclear power, wind energy, and the more specific targets wind energy is safer than nuclear power, and nuclear power is cheaper than wind energy. An annotator could now make the following inferences:

(A) The author stresses the risk of nuclear power and thus expresses a $\ominus$ stance towards nuclear power.

(B) As the author rejects nuclear power, she should favor alternative energy sources such as wind energy and hence has a $\oplus$ stance on wind energy.

(C) Since the author thinks nuclear power is risky and favors wind energy, she has a $\oplus$ stance towards wind energy is safer than nuclear power.

(D) As a consequence of the rejection of nuclear power and as risks are often associated with costs, the author opposes nuclear power is cheaper than wind energy.

Except for the first inference, all the outlined inferences are highly speculative without further context. For example, the author could be a supporter of coal-fired power plants and reject both nuclear and wind energy. The person could also have the opinion that nuclear power is less expensive than wind, but still be pro-wind for safety considerations. While this is unlikely, a situation might arise in which a person has a $\oplus$ stance towards nuclear power despite having strong safety concerns (as indicated by the phrase unpredictable risks!).

While inferring a $\ominus$ stance towards nuclear power seems to be uncontroversial for most people, a $\oplus$ stance towards nuclear power – despite the expressed safety concerns – seems to be rather unlikely. Hence, the inferences differ in how likely we think they are intended in the given context. To understand how people estimate the intentions of a text, we will now shed light on two theories: the cooperative principle (Grice, 1970) and relevance theory (Sperber and Wilson, 1986, 1995; Wilson and Sperber, 1986, 2002). Both theories aim at explaining inference processes that occur in conscious conversation between two parties. Hence, they both assume a Shannon-Weaver-like process in which there is a sender that encodes content into a message and a receiver that decodes that message. (In the Mathematical Theory of Communication, Shannon (1948) models information transfer as a process in which a transmitter sends encoded information through a channel to a receiver, which decodes the signal. The model was originally published in 1948 and forms the basis of numerous more fine-grained communication models.) In our set-up, the authors are unaware that annotators classify their posts. Hence, the authors do not encode their intentions in a way that is specific to the annotators. Therefore, we focus on those parts of the two theories that are concerned with decoding processes which correspond to annotating stance.

Cooperative Principle

The cooperative principle (Grice, 1970) states that successful conversation requires cooperation between a speaker and a listener. By the term successful, Grice means that the listener understands the speaker. The speaker’s part of the cooperation is to express the message in a way that the listener can understand in the given conversational context. Under the assumption that the speaker communicates cooperatively, the listener can now use the specific way in which the message is expressed to infer the intended meaning.

Grice (1970) postulates four conversational maxims which – if followed – shape a message in a way that allows the listener to effectively decode the intended meaning. These maxims are:

Quantity: The quantity maxim states that utterances should contain neither more nor less information than required. For instance, if one wants to express that wind energy is more expensive than nuclear power, one should not provide additional information (e.g. wind energy is more expensive than nuclear power and loud), but also not less (e.g. wind energy is expensive).
Quality: The quality maxim means that anything one wants to communicate should be true. This includes that one should not state something one believes to be false and that one should not state anything for which one cannot provide evidence. If someone deliberately provides misinformation, it is difficult to infer her intention, as the misinformation is in conflict with the assumption that the conversation is cooperative.
Relation: The relation maxim means that an utterance should be relevant (Grice himself summarizes the maxim with the simple phrase be relevant). One cannot derive an intention and thus a stance if an utterance is not relevant in the context. For instance, it is difficult to infer stance towards wind energy from the statement Dogs are better than cats.
Manner: The maxim states that the statement should be expressed in a clear and concise manner. This includes avoiding ambiguous, confusing and unnecessarily verbose expressions.

As shown by Deng et al. (2014) and Deng and Wiebe (2014), the cooperative principle can be used to develop rules that infer implicit $\oplus$ or $\ominus$ stance (they call them opinions) from explicitly expressed stance. Hence, by regulating inferences, Grice’s assumptions can be used to limit the degrees of freedom inherent in the annotation of stance. Therefore, we first assume that the authors purposely wrote the texts to communicate their intentions to their audience. The audience can be the readers of their (micro-) blog or the group of people with whom the authors are currently in a debate. Next, we assume that the intention of an utterance is to communicate stance on the topic of the conversation in which its author participates. Of course, this assumption is only valid if the utterance is not bycatch. From the quantity, quality, relevance, and manner of the contained information, we can now infer towards which of our targets the author may have intended to express a stance.

Relevance Theory

While relevance theory (Sperber and Wilson, 1986, 1995; Wilson and Sperber, 1986, 2002) is grounded in Grice’s relevance maxim (be relevant), it does not assume that the basis of all communication is cooperation, but that communication can also have a deceptive nature. The core of the theory is that evolution has shaped human cognition to seek the maximum relevance of stimuli such as perceived texts. Relevance is thereby seen as the ratio of cognitive effects to cognitive effort. Positive cognitive effects contribute positively to an individual’s cognitive representation of the world. Cognitive effort refers to the mental effort that has to be expended to obtain the cognitive effects.

Relevance theory claims that communicators exploit the fact that their audience tries to maximize relevance (Wilson and Sperber, 2002). Messages generated by such a process thus contain clues that raise expectations regarding the relevance of the message. According to relevance theory, these expectations enable a decoding process that allows understanding the intention of a message (Wilson and Sperber, 2002). In this decoding process, readers conduct a sequence of increasingly cognitively complex inference procedures until their expectations of relevance are satisfied. The sequence of inferences includes resolving references or ambiguities, or inferring implicit meanings.

In conclusion, both theories assume that authors provide cues on what intentions they want to communicate in their texts. Hence, if they want to communicate stance they also provide cues on their stance in their messages. By trying to recognize these clues, it may be possible to infer the implied stance.

However, in addition to the degrees of freedom in interpreting stance, there are various factors that make (multi-) stance annotations difficult. Because the texts were written in a different context than the one in which they are annotated, we can never infer the intention of a text with absolute certainty. For instance, irony, sarcasm, or communication for the sake of social interaction are factors that are often difficult to assess without knowing the original context.

Another substantial problem is that there is no guarantee that the selected target set matches the targets that are the subject of the discussions in our data. We will examine this problem in more detail in Sections 4.4 and 4.5. Furthermore, all of the inference processes described above are highly subjective (Artstein and Poesio, 2008). This means that the recognition and evaluation of textual cues may be influenced by the annotator’s prior knowledge of the targets, her personal stance towards the targets, or her motivation for conducting the annotation (e.g. whether the annotation is done out of personal interest or for monetary reasons). In order to investigate how subjective the estimations of the annotators are, one can examine the reliability of the resulting annotations. Hence, in the following section, we will discuss how to measure the reliability of annotations.

Reliability of Multi-target Stance Annotations

As we train stance detection systems on labels that have been assigned by human annotators, these systems reflect the interpretation of the annotators. However, as we have discussed above, inferring the stance of texts may be, to some extent, subjective (Artstein and Poesio, 2008). As a result, even if a system achieves 100% accuracy, the resulting labels are unsatisfactory for a second person that interprets the texts in a different way. Hence, we want our data to be reliable. In our context, reliability means we get a consistent result for several annotators. This means that if another person re-annotates the data, we will get the same labels.

One way to measure the reliability of annotations is to have the data annotated by multiple ($>1$) annotators and then measure the agreement between them. The simplest way to measure the agreement between annotators is to calculate the ratio of the number of times the annotators agree on an annotation to the number of all annotations (Artstein and Poesio, 2008). This observed agreement $A_0$ can be calculated by:

A_0(I)=\frac{\sum_{i \in I}agree(i)}{|I|}

with $I$ being a set of annotated items and $agree(i)$ a function that returns $1$ if the annotators agree on $i$ and that returns $0$ if the annotators disagree on $i$.
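The observed agreement can be illustrated with a minimal Python sketch. The function name `observed_agreement`, the string labels (FAVOR, AGAINST, and NONE as stand-ins for $\oplus$, $\ominus$, and NONE), and the toy annotations are assumptions made for illustration; they are not taken from our annotation studies.

```python
from typing import Sequence


def observed_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Share of items on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    # agree(i) is 1 if both labels match and 0 otherwise.
    agreeing = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return agreeing / len(labels_a)


# Hypothetical toy annotations of four items by two annotators.
annotator_a = ["FAVOR", "FAVOR", "AGAINST", "NONE"]
annotator_b = ["FAVOR", "AGAINST", "AGAINST", "NONE"]
print(observed_agreement(annotator_a, annotator_b))  # 0.75
```

In the toy data the annotators agree on three of the four items, which yields an observed agreement of 0.75.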

Chance Corrected Agreement Measures

Using the percentage agreement may overestimate the reliability of annotations, as it corrects neither for chance agreement nor for the distribution of labels (Artstein and Poesio, 2008). For instance, if we randomly guess a three-way stance, we will be correct in $\frac{1}{3}$ of the cases. In addition, consider a case in which one annotator annotated 90% of a dataset with a $\oplus$, 5% with a $\ominus$, and 5% with a NONE stance. Now, if a second annotator always annotates $\oplus$ without even looking at the texts, we get a high percentage agreement of 90%.

One of the most popular measures, which corrects the observed inter-annotator agreement for chance and skewed class distributions, is Cohen’s $\kappa$ (Cohen, 1960). The basic idea of Cohen’s $\kappa$ is to standardize the observed agreement by the average probability with which a label is used by two annotators (Artstein and Poesio, 2008). The Cohen’s $\kappa$ agreement between two annotators is defined as:

\kappa =\frac{A_0(I)-A_E(I)}{1-A_E(I)}

with $A_0(I)$ being the observed percentage agreement and $A_E(I)$ being the agreement that one can expect by chance. For calculating $A_E(I)$, we multiply, for each class, the proportions with which each of the two annotators has assigned this class, and sum these products over all classes:

A_{E}(I)=\sum_{k \in K}\frac{c_1(k)}{|I|} \cdot \frac{c_2(k)}{|I|}

with $K$ being the set of classes and $c_j(k)$ being the number of times annotator $j$ annotated an instance with the class $k$. Possible values of Cohen’s $\kappa$ range from plus one to $\frac{-A_E(I)}{1-A_E(I)}$ (Artstein and Poesio, 2008). A score of one refers to perfect agreement between the two annotators, a score of zero to agreement at chance level, and $\frac{-A_E(I)}{1-A_E(I)}$ to no agreement at all.
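The effect of the chance correction can be sketched in Python for the skewed example discussed above. The sketch computes the expected agreement as the sum, over all classes, of the product of the two annotators' label proportions. The function name `cohens_kappa` and the string labels are assumptions for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_1: Sequence[str], labels_2: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_1)
    if n == 0 or n != len(labels_2):
        raise ValueError("both annotators must label the same non-empty item set")
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    # Expected agreement: sum over classes of the product of both label proportions.
    expected = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    if expected == 1.0:
        raise ValueError("kappa is undefined if both annotators use one identical label")
    return (observed - expected) / (1 - expected)


# Skewed example: 90% FAVOR, 5% AGAINST, 5% NONE versus an annotator who always says FAVOR.
first = ["FAVOR"] * 90 + ["AGAINST"] * 5 + ["NONE"] * 5
second = ["FAVOR"] * 100
kappa = cohens_kappa(first, second)
print(round(kappa, 3))
```

Although the percentage agreement in this example is 90%, the expected agreement is also 0.9, so that the resulting $\kappa$ is at chance level.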

There are a number of similar chance corrected inter-annotator agreement measures such as Bennett’s $S$ (Bennett et al., 1954) and Scott’s $\pi$ (Scott, 1955). They differ in how they calculate the agreement that can be expected by chance. Bennett’s $S$ does not estimate the expected agreement empirically, but assumes a uniform distribution of the labels (Artstein and Poesio, 2008). Hence, the $A_{E,S}(I)$ is simply calculated with:

A_{E,S}(I)=\frac{1}{|K|}

Scott’s $\pi$ (Scott, 1955) assumes that both annotators have the same probability of assigning a class, and thus calculates $A_{E,\pi}(I)$ using:

A_{E,\pi}(I)=\sum_{k \in K}{\left( \frac{c(k)}{2|I|}\right)}^2

with $c(k)$ referring to the number of times $k \in K$ has been assigned by all annotators.
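The two alternative estimates of the expected agreement can be compared in a short Python sketch. It plugs the uniform estimate $\frac{1}{|K|}$ (Bennett’s $S$) and the pooled estimate (Scott’s $\pi$) into the same chance correction that is used for Cohen’s $\kappa$. The function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from typing import Sequence


def _observed(labels_1: Sequence[str], labels_2: Sequence[str]) -> float:
    return sum(a == b for a, b in zip(labels_1, labels_2)) / len(labels_1)


def _chance_corrected(observed: float, expected: float) -> float:
    return (observed - expected) / (1 - expected)


def bennett_s(labels_1: Sequence[str], labels_2: Sequence[str], num_classes: int) -> float:
    """Bennett's S: expected agreement assumes uniformly distributed labels."""
    return _chance_corrected(_observed(labels_1, labels_2), 1 / num_classes)


def scotts_pi(labels_1: Sequence[str], labels_2: Sequence[str]) -> float:
    """Scott's pi: expected agreement from the pooled label distribution."""
    n = len(labels_1)
    pooled = Counter(labels_1) + Counter(labels_2)  # c(k) over both annotators
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return _chance_corrected(_observed(labels_1, labels_2), expected)


# Same skewed toy data as before: one annotator always assigns FAVOR.
first = ["FAVOR"] * 90 + ["AGAINST"] * 5 + ["NONE"] * 5
second = ["FAVOR"] * 100
s_score = bennett_s(first, second, num_classes=3)
pi_score = scotts_pi(first, second)
print(round(s_score, 3), round(pi_score, 3))
```

On this skewed data, the uniform assumption of Bennett’s $S$ still yields a high score, whereas Scott’s $\pi$ accounts for the dominance of one label and drops to approximately zero.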

Agreement Measures for Multiple Annotators

The above described agreement measures are defined for exactly two annotators. However, to understand the influence of subjectivity in greater detail, we want an item to be annotated by several people. Hence, we need measures that generalize to more than two annotators.

In order to calculate the agreement of more than two annotators, we cannot simply iterate over each item and count whether the same class was assigned by all annotators, as there may still be agreement between some of the annotators (Artstein and Poesio, 2008). For example, if a text has been labeled by three annotators with a $\oplus$ (annotator A), a second $\oplus$ (annotator B), and a NONE (annotator C) stance, there is agreement between the annotators A and B, but disagreement between A and C, and between B and C.

A large number of possibilities of measuring the reliability of multiple annotators have been suggested (Fleiss, 1971; Light, 1971; Hubert, 1977; Conger, 1980; Davies and Fleiss, 1982; Randolph, 2005). Here, we limit ourselves to the agreement measure which is most popular in computational linguistics, namely Fleiss’ $κ$ (Artstein and Poesio, 2008).

Fleiss (1971) proposes to calculate the agreement on an item $A_O(i)$ on the basis of all possible pairs of annotators that have labeled it (A and B, A and C, and B and C in the example). If we have a set of $N$ annotators, the number of all ordered pairs of annotators is $|N|\cdot(|N| - 1)$, which counts each pair once in each order. Hence, for calculating $A_O(i)$ we form the ratio of the number of agreeing ordered pairs of annotators to the number of all ordered pairs:

A_{O}(i)=\frac{\sum_{k \in K}c(k)^2-|N|}{|N|\cdot(|N|-1)}

with $N$ being the set of annotators and $c(k)$ the number of times the class $k$ has been assigned to the item $i$ by any annotator. As in the given example the text was annotated two times with $\oplus$ and one time with NONE, we would calculate:

A_{O}(i)=\frac{(2^2 + 1^2)-3}{3 \cdot 2}=\frac{2}{6}=\frac{1}{3}

Next, the overall observed agreement $A_{O}(I)$ can be calculated by averaging the agreement scores over all items $i \in I$:

A_{O}(I)=\frac{1}{|I|} \sum_{i \in I}A_{O}(i)

For calculating an agreement measure for multiple annotators, we need an expected agreement as well. Similar to Scott (1955), Fleiss (1971) assumes that all annotators have the same probability of annotating a particular class (Artstein and Poesio, 2008). Thus, Fleiss (1971) suggests calculating the expected agreement as the sum of the squared probabilities with which each of the classes occurs across all instances, where $c(k)$ now counts the assignments of class $k$ in the whole dataset:

A_{E,Fleiss'\kappa}(I)=\sum_{k \in K}{\left( \frac{c(k)}{|N| \cdot |I|}\right)}^2

Therefore, the estimation of the expected agreement corresponds to a generalization of Scott’s $\pi$ (Artstein and Poesio, 2008). However, Fleiss named his agreement measure $\kappa$ and it is therefore usually referred to as Fleiss’ $\kappa$.
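The calculation of Fleiss’ $\kappa$ can be sketched in Python as follows. The sketch assumes that every item has been labeled by the same number of annotators and that the per-item agreement is averaged over all items. The function names, the string labels, and the four toy items are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence


def item_agreement(labels: Sequence[str]) -> float:
    """Share of agreeing ordered annotator pairs for a single item."""
    n = len(labels)
    counts = Counter(labels)
    return (sum(c * c for c in counts.values()) - n) / (n * (n - 1))


def fleiss_kappa(items: List[Sequence[str]]) -> float:
    """Fleiss' kappa; every item must carry the same number of labels."""
    n_annotators = len(items[0])
    if any(len(labels) != n_annotators for labels in items):
        raise ValueError("all items need the same number of annotations")
    observed = sum(item_agreement(labels) for labels in items) / len(items)
    totals = Counter(label for labels in items for label in labels)
    total_annotations = n_annotators * len(items)
    expected = sum((c / total_annotations) ** 2 for c in totals.values())
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: four items, three annotators each.
toy_items = [
    ["FAVOR", "FAVOR", "NONE"],
    ["FAVOR", "FAVOR", "FAVOR"],
    ["AGAINST", "AGAINST", "AGAINST"],
    ["NONE", "NONE", "AGAINST"],
]
print(item_agreement(toy_items[0]))
print(round(fleiss_kappa(toy_items), 3))
```

The first item reproduces the worked example with two $\oplus$ and one NONE annotation, for which one third of the ordered annotator pairs agree.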

Weighted Agreement Measures

The measures above are only defined for nominally scaled data. (A nominal or categorical scale refers to a characteristic whose values can be distinguished, but which have no natural order and whose distances are not defined.) This means that if annotator $A$ classifies a text as $\oplus$ and annotator $B$ classifies the text as $\ominus$, we assume the same disagreement as if annotator $B$ classifies the text as NONE. However, if the goal of annotation is to assign real-valued scores to texts (e.g. if we want to annotate the intensity of stance polarity), this equidistance gives us a distorted picture of the agreement. We need measures that acknowledge that the two annotations (a) INTENSITY=$.3$ and (b) INTENSITY=$.301$ are more similar than the two annotations (c) INTENSITY=$.3$ and (d) INTENSITY=$.7$.

This requirement is met by weighted agreement measures such as weighted $\kappa$ (Cohen, 1968) or Krippendorff’s $\alpha$ (Krippendorff, 1980, 2004b). Both measures define an abstract distance function $d(a, b)$, which returns a disagreement score for two inputs $a$ and $b$. $d(a, b)$ can be any distance function, including functions that express the distance of data points in a metric space (e.g. $(a - b)^2$) (Krippendorff, 2004a, 221-234). In contrast to the previous agreement measures, Krippendorff’s $\alpha$ is not defined on the basis of agreement but on the basis of disagreement:

\alpha =1-{\frac {D_{O}}{D_{E}}}

with $D_O$ being the observed disagreement and $D_E$ being the disagreement we can expect if annotations are made randomly.

The observed disagreement $D_O$ can be seen as the mean difference of annotations across all items (Krippendorff, 2004a, 223). To simplify the formula, we follow Artstein and Poesio (2008) and assume that duplicate values can be grouped into the set of distinct values $K$. $D_O$ can then be calculated by iterating over all items $I$. Specifically, we multiply the difference $d(k_l, k_m)$ between each pair of distinct values $k_l, k_m \in K$ with the number of times these values are annotated to one item ($n_{ik_l}$ and $n_{ik_m}$, respectively). The sum of the resulting values is finally standardized by the number of all pairs of annotations:

D_{O} = \frac{1}{ |N|\cdot |I| (|N|-1)}\sum_{i \in I}\sum_{l=1}^{|K|}\sum_{m=1}^{|K|}n_{ik_l} n_{ik_m} d(k_l,k_m)

The expected disagreement $D_E$ can be interpreted as the mean difference of all annotations in the dataset (Krippendorff, 2004a, 223). We again follow Artstein and Poesio (2008) to simplify the formula by grouping distinct values. As all $|N|\cdot|I|$ annotations are now paired with each other regardless of the item to which they belong, we standardize by $|N|\cdot|I|\,(|N|\cdot|I|-1)$:

D_{E} = \frac{1}{ |N|\cdot |I| (|N|\cdot |I|-1)}\sum_{l=1}^{|K|}\sum_{m=1}^{|K|}n_{k_l} n_{k_m} d(k_l,k_m)

The difference to the observed disagreement is that we do not iterate over all items but over all pairs of the distinct values ($k \in K$), with $n_{k_l}$ being the number of times the value $k_l$ has been annotated in the whole dataset. The possible values for Krippendorff’s $\alpha$ range from zero (RANDOM AGREEMENT) to one (PERFECT AGREEMENT) (Krippendorff, 2008).
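The following Python sketch illustrates how Krippendorff’s $\alpha$ can be computed with a squared-difference distance. It assumes that every item is annotated by the same number of annotators, and it standardizes the expected disagreement by the number of ordered pairs among all $|N|\cdot|I|$ annotations, following the simplification of Artstein and Poesio (2008). The function names and the toy intensity scores are assumptions for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence


def squared_distance(a: float, b: float) -> float:
    return (a - b) ** 2


def krippendorff_alpha(items: List[Sequence[float]],
                       distance: Callable[[float, float], float]) -> float:
    """Alpha for complete data: each item carries the same number of values."""
    n_annotators = len(items[0])
    total = n_annotators * len(items)
    overall = Counter()
    observed_sum = 0.0
    for values in items:
        counts = Counter(values)  # n_{ik}: how often each value occurs on item i
        overall.update(counts)
        observed_sum += sum(counts[a] * counts[b] * distance(a, b)
                            for a in counts for b in counts)
    d_observed = observed_sum / (total * (n_annotators - 1))
    d_expected = sum(overall[a] * overall[b] * distance(a, b)
                     for a in overall for b in overall) / (total * (total - 1))
    if d_expected == 0:
        raise ValueError("alpha is undefined if all annotations are identical")
    return 1 - d_observed / d_expected


# Hypothetical intensity annotations: two items, two annotators each.
near = [[0.3, 0.301], [0.7, 0.7]]  # tiny disagreement on the first item
far = [[0.3, 0.7], [0.7, 0.7]]     # large disagreement on the first item
alpha_near = krippendorff_alpha(near, squared_distance)
alpha_far = krippendorff_alpha(far, squared_distance)
print(alpha_near > alpha_far)  # True
```

In contrast to the nominal measures, the disagreement between $.3$ and $.301$ hardly lowers the score, whereas the disagreement between $.3$ and $.7$ lowers it substantially.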

Creating Multi-Target Stance Datasets

To examine how reliably multi-target stance can be annotated in real-life scenarios, we conduct two annotation studies. The first annotation is done on a fraction of the Twitter dataset which has been explored by SemEval 2016 task 6 Detecting Stance in Tweets (Mohammad et al., 2016). The topic of this dataset is atheism. For the second study, we annotate YouTube comments concerning the death penalty. Subsequently, we will conduct a sequence of quantitative analysis steps to inspect patterns of stance towards different targets and to examine the relationship between multi-target stance and single-target stance classification. We will now first describe how the two datasets are created.

Multi-target Stance Dataset on Atheism

We first describe how we annotate 733 tweets on atheism (513 from the training set and 220 from the test set) from the SemEval challenge. We select the target atheism, as we found the topic to be less dependent on knowledge of specific political events (unlike other SemEval targets such as Hillary Clinton or Donald Trump). The annotation scheme includes the stance towards the overall target atheism and any number of stances towards more specific targets.

Targets

We want a target set to match the debate and the dataset that we want to annotate. Hence, for creating a target set on atheism, we utilize a semi-automated, data-driven approach. We first automatically select a large number of words that potentially imply targets (i.e. nouns and proper nouns) and then group them manually into a target set.

As we want to ensure that the targets enable us to sufficiently differentiate the authors’ positions, we consider the degree of association of nouns and named entities with the classes $\oplus$ and $\ominus$ of the original SemEval annotation. More specifically, we compute the collocation coefficient Dice (Smadja et al., 1996) for each noun or named entity, and select the 25 words which are most strongly associated with atheism $\ominus$ and atheism $\oplus$. In addition, we select the 50 most frequent nouns and named entities.
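The ranking of candidate words by their association with a stance class can be sketched in Python. The sketch uses a common document-level formulation of the Dice coefficient, $\frac{2 \cdot f(w,c)}{f(w)+f(c)}$, in which $f(w,c)$ is the number of tweets of class $c$ that contain the word $w$; whether this matches the exact counting used for our target set is an assumption. The candidate words are assumed to be already restricted to nouns and named entities, and the function names and the toy tweets are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def dice_scores(docs: Iterable[Tuple[Set[str], str]], target_label: str) -> Dict[str, float]:
    """Dice association between each candidate word and one stance class."""
    word_freq: Counter = Counter()   # f(w): tweets containing the word
    joint_freq: Counter = Counter()  # f(w, c): tweets of the class containing the word
    class_freq = 0                   # f(c): tweets of the class
    for words, label in docs:
        word_freq.update(words)
        if label == target_label:
            class_freq += 1
            joint_freq.update(words)
    return {w: 2 * joint_freq[w] / (word_freq[w] + class_freq) for w in word_freq}


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """The k words with the highest score; ties are broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical tweets, reduced to their candidate nouns and named entities.
toy_docs = [
    ({"god", "bible"}, "AGAINST"),
    ({"god", "science"}, "FAVOR"),
    ({"bible", "jesus"}, "AGAINST"),
    ({"science"}, "FAVOR"),
]
scores = dice_scores(toy_docs, "AGAINST")
print(top_k(scores, 2))  # ['bible', 'jesus']
```

Calling `top_k` with a value of 25 for each of the two classes would correspond to the selection step described above.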

Next, we manually group nouns and named entities into more coarse-grained targets. For instance, entities such as Bible and Jesus are grouped into the target Christianity. The final set of targets is shown in the appendix in Table A.6.

Table 4.1

overall target	specific targets
atheism	Christianity, freethinking, Islam, no evidence, life after death, supernatural power, USA, conservatism, same-sex marriage, religious freedom, secularism

Specific targets of the Atheism Dataset. The target set has been semi-automatically derived from the data. We also annotate stance towards the overall target atheism. A more detailed description of the targets along with examples can be found in Appendix A.2 (Table A.6).

Annotation Procedure

For our multi-target stance annotation we differentiate between the overall stance towards atheism and the more specific stances towards the targets shown in Table A.6. The annotation of stance towards atheism is done using the original questionnaire by Mohammad et al. (2016). We re-annotate the overall stance on atheism, as it is possible that our annotators interpret the tweets differently than the original annotators.

On the basis of the cooperative principle and relevance theory, we assume that the tweeters provide specific hints if they want to communicate a more specific stance. Hence, while annotating stances towards the more specific targets, the annotators were instructed to only annotate a $\ominus$ or $\oplus$ stance if they perceived textual evidence for such stances. We provide examples for this textual evidence in Table A.6. For instance, in the tweet ‘Whosoever abideth not in the doctrine of Christ, hath not God.’ [2 John 9], we can perceive evidence for the author’s stance in favor of Christianity and against other religious beliefs (hath not God). Although it is likely that the author is also against ideas such as secularism, religious freedom or that there is no evidence for religion, we do not see textual evidence for this inference.

We let three annotators perform the multi-target stance annotation. The annotators were three student assistants and were paid accordingly. In order to familiarize the annotators with the annotation scheme, we previously trained them on a small dataset that was comparable in its social media character, but which was concerned with a different target.

Since the data partly contains utterances which cannot be understood without further context, we give annotators the option to mark these instances accordingly. Irony is another phenomenon which influences the interpretability. Therefore, we asked the annotators to annotate the tweets for irony as well.

All annotations were performed with the tool WebAnno (Yimam et al., 2013, 2014), version 2.3.0. The web-based tool provides the annotators with a consistent interface and allows subsequent curation of the individual annotations.

Multi-target Stance Dataset on the Death Penalty

It remains an open question whether different methods for creating target sets result in equally reliable annotations. Hence, for our second study, we annotated the stance towards the target the death penalty and two sets of more specific targets: (i) one expert set extracted from idebate.org and (ii) one representing the wisdom of the crowd from reddit.com. We use YouTube as a data source, since it is publicly available and rich in opinionated and emotional comments on a huge variety of domains (Severyn et al., 2014). Next, we explain how we (i) retrieve a set of suitable comments, (ii) create the two target sets, and (iii) annotate the comments with stances towards the targets.

Comment retrieval

The underlying assumption of using YouTube to create an NLP-dataset is that videos which are about the death penalty attract a high proportion of comments that express a stance towards the target. Hence, we first identify videos about the death penalty and then extract the associated comments for the annotation.

For identifying videos about the death penalty, we query the YouTube API (http://developers.google.com/youtube/; v3-rev177-1.22.0) with the term death penalty. Afterwards, we sort the videos by view count and manually remove videos that are not exclusively concerned with the death penalty or are embedded in other content such as a late night show. To ensure a high diversity, we balance the number of videos having a $\oplus$, $\ominus$, and a NONE stance by selecting the two most watched videos each.
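The balancing step can be sketched in Python. The sketch assumes that the videos have already been retrieved and manually labeled with their overall stance; it does not call the YouTube API. The dictionary keys `title`, `views`, and `stance`, the function name, and all video entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def select_videos(videos: List[Dict], per_stance: int = 2) -> List[Dict]:
    """Pick the most watched videos for each overall stance."""
    by_stance = defaultdict(list)
    for video in videos:
        by_stance[video["stance"]].append(video)
    selected = []
    for stance in ("FAVOR", "AGAINST", "NONE"):
        ranked = sorted(by_stance[stance], key=lambda v: v["views"], reverse=True)
        selected.extend(ranked[:per_stance])
    return selected


# Hypothetical, manually labeled search results.
candidates = [
    {"title": "pro A", "views": 900, "stance": "FAVOR"},
    {"title": "pro B", "views": 100, "stance": "FAVOR"},
    {"title": "pro C", "views": 500, "stance": "FAVOR"},
    {"title": "contra A", "views": 300, "stance": "AGAINST"},
    {"title": "contra B", "views": 700, "stance": "AGAINST"},
    {"title": "neutral A", "views": 50, "stance": "NONE"},
    {"title": "neutral B", "views": 60, "stance": "NONE"},
]
print([v["title"] for v in select_videos(candidates)])
```

With two videos per stance, the selection yields six videos, which corresponds to the number of videos used in our study.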

From the resulting six videos, we download as many comments as allowed by the API restrictions (100 comments plus all their replies per video). With a range between one word and 1,118 words, the retrieved comments differ strongly in their length. For the long outliers, we observe that they often weigh several pros and cons and are quite different from the other comments. Consequently, we remove all comments with a length that is more than one standard deviation ($71.9$) above the mean ($49.3$ words), i.e. we exclude comments with more than 120 words from our corpus. In addition, we transform all graphical emojis into written expressions such as :smile: to simplify downstream processing (using https://github.com/vdurmont/emoji-java; v3.1.3). We anonymize the users, but give them unique aliases, so that references and replies between users can be analyzed. The final dataset contains 821 comments (313 of them replies) from 614 different users with a total of 30,828 tokens. Table 4.2 gives an overview of the resulting set of comments.
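The length-based filtering can be sketched in Python. The sketch assumes whitespace tokenization and the population standard deviation; the tokenization and the variant of the standard deviation used for our corpus may differ. With the reported mean of $49.3$ and standard deviation of $71.9$, the cutoff lies at roughly 121 words. The function name and the toy comments are hypothetical.

```python
import statistics
from typing import List, Tuple


def filter_by_length(comments: List[str]) -> Tuple[List[str], float]:
    """Drop comments longer than the mean length plus one standard deviation."""
    lengths = [len(comment.split()) for comment in comments]
    cutoff = statistics.mean(lengths) + statistics.pstdev(lengths)
    kept = [c for c, n in zip(comments, lengths) if n <= cutoff]
    return kept, cutoff


# Hypothetical comments: three short ones and one long outlier of 50 words.
toy_comments = [
    "I agree",
    "no better deterrent",
    "this is plain wrong",
    " ".join(["word"] * 50),
]
kept, cutoff = filter_by_length(toy_comments)
print(len(kept), round(cutoff, 1))
```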

Table 4.2

video title	comments	video stance
Death Penalty: Justice, or Just Too Far? | Learn Liberty	137	$\ominus$
5 Arguments Against The Death Penalty	148	$\ominus$
Troy Davis Death Penalty Debate Question Time	181	NONE
Should There Be A Death Penalty? - The People Speak	122	NONE
Pro-Death Penalty	118	$\oplus$
Ron White Texas Death Penalty	115	$\oplus$

Overview of the Death Penalty Dataset. We show how many comments have been collected from which YouTube video and which overall stance the videos express towards the death penalty.

Target Sets

Since we want to compare different target sets (and therefore the methods with which they are generated), we create two sets of targets for the annotation of YouTube comments. As approaches of creating target sets usually involve a high degree of manual effort (e.g. analytically deriving targets or grouping reappearing phrases), their reproducibility and reliability may be limited – especially when transferring them to new domains. We try to minimize this problem by extracting the sets from external, collaboratively created resources that are (albeit in varying degrees of quality) available in many domains.

Table 4.3

overall target	set	specific targets
Death Penalty	IDebate	Closure, deterring, eye for an eye, financial burden, irreversible, miscarriages, overcrowding, preventive, state killing
Death Penalty	Reddit	Electric Chair, gunshot, strangulation, certainty unachievable, heinous crimes, immoral to oppose, more harsh, more quickly, psychological impact, no humane form, replace life-long, lethal force, abortion, euthanasia, use bodies

The iDebate set of expert targets and the Reddit target set representing the wisdom of the crowd. We additionally annotate stance towards the set-independent target death penalty_explicit and the overall target death penalty. A more detailed description of the targets along with examples can be found in Appendix A.2 (Table A.7).

For the first set, we utilize expert arguments as targets. We download these targets from the debating website idebate.org (http://idebate.org/debatabase/debates/capital-punishment/house-supports-death-penalty). idebate.org collects debate-worthy issues such as the death penalty and popular pro and contra arguments for supporting or opposing these issues. We hypothesize that these pro and contra arguments, which are constructed by domain experts and should have a high quality, make good candidates for targets. This IDebate Set contains nine targets.

For the second set, we rely on the social media platform reddit.com, on which users can exchange content in the form of links or textual posts. reddit.com is organized in so-called Subreddits, which allow a thematic classification of posts. For the Reddit Set, we extract targets from a Subreddit about debating (http://www.reddit.com/r/changemyview), where users post controversial standpoints and invite others to challenge them. As the forum is intensively moderated, the quality can be considered somewhat higher than in an average online forum (Wei et al., 2016b). We assume that heavily debated posts represent the major issues and are thus well-fitting candidates for targets. Therefore, we query the changemyview Subreddit for the terms capital punishment, death penalty and death sentence using a wrapper for the Reddit REST API (http://github.com/jReddit/; version 1.0.3). We then remove submissions which do not debate the death penalty (e.g. removing the headphone plug will be a death penalty for the iPhone) or which have fewer than 50 comments. Finally, we manually group posts if they are lexical variations or paraphrases of each other (e.g. we group the posts execution should be done by a bullet in the head and execution should be done by a headshot into one target gunshot). Table A.7 provides an overview of both sets.

Annotation Procedure

In our second annotation study, we again rely on trained student assistants as annotators. We also use the SemEval annotation scheme (labels $\oplus$ , $\ominus$ , and NONE) to annotate each comment with its stance towards death penalty. As in the first annotation study, we let them additionally annotate the stance towards the more specific targets of the two sets. However, there are also differences in the annotation procedure:

In our first study, we allowed annotators a high degree of freedom in assessing whether a text contained relevant clues for a specific stance. This time, for the annotation of the more specific stances, we want to control these inferences by asking the annotators to identify the textual evidence on which they base their decisions. To guide this process, we ask them to annotate stances on spans in the text which communicate the specific stance. In other words, the annotators should identify the area in the text that – if omitted – has the consequence that the annotators would not annotate the stance. For example, in the comment Although I have many problems with the death penalty, there is no better deterrent they should annotate a $\oplus$ stance towards the target deterring to the span there is no better deterrent. Note that we here are not interested in the exact boundaries of the spans. For instance, we would also accept if the span in the example above only includes no better deterrent. The idea is rather that the annotators always identify some evidence for their annotations and thus become more consistent. Figure 4.3 exemplifies this annotation procedure.

We observe that there are comments in our datasets that directly express a stance towards the death penalty. To investigate the extent of utterances which explicitly express a stance towards the death penalty, we define the additional target death penalty_explicit. We again utilize the tool WebAnno (Yimam et al., 2013, 2014) for annotation, as WebAnno comes with functionalities to perform the described span annotations.

Figure 4.3

Example of an annotated comment of the Death Penalty Dataset. Each utterance is annotated with exactly one stance on the death penalty ( $\oplus$ , $\ominus$ , or NONE) and any number of stances towards the targets of both sets (including death penalty_explicit).

To avoid over-interpretation, stance towards a target needs to be annotated according to textual clues (indicated by underlines).

Reliability of Multi-Target Stance Annotation

Since we use three annotators in both annotations, we compute Fleiss’ $κ$ (Fleiss, 1971) to measure annotation reliability. Specifically, we calculate separate agreement scores for the stance towards the topic of the debate (i.e. atheism or death penalty) and for each target from the sets.

For the stance towards the topic of the debate, we calculate the agreement between $\oplus$, $\ominus$, and NONE annotations. In both annotation schemes, annotators should annotate stance towards the specific targets only if they perceive specific textual evidence that implies that stance. Hence, if an instance is annotated neither with a $\oplus$ nor with a $\ominus$ stance towards a specific target, we assume a NONE stance. Thus, we can also calculate the Fleiss’ $\kappa$ agreement between $\oplus$, $\ominus$, and NONE annotations for each specific target. In addition, to estimate the reliability of the whole annotation scheme, we calculate the agreement of the joint decision of both the overall and the more specific stance.
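The conversion of sparse annotations into complete label sets can be sketched in Python. The sketch fills in NONE for every specific target that an annotator left unannotated, and it represents the joint decision as a tuple of the overall stance and all specific stances, so that two annotators agree on an item only if all of their decisions match. This reading of the joint decision, the function names, and the toy annotation are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def complete_labels(explicit: Dict[str, str], targets: List[str]) -> Dict[str, str]:
    """Assume a NONE stance for every target without an explicit annotation."""
    return {target: explicit.get(target, "NONE") for target in targets}


def joint_decision(overall: str, explicit: Dict[str, str],
                   targets: List[str]) -> Tuple[str, ...]:
    """One combined label: the overall stance followed by all specific stances."""
    completed = complete_labels(explicit, targets)
    return (overall,) + tuple(completed[target] for target in targets)


# Hypothetical annotation of one tweet by one annotator (subset of the targets).
targets = ["Christianity", "Islam", "secularism"]
annotation = {"Christianity": "FAVOR"}
print(complete_labels(annotation, targets))
print(joint_decision("AGAINST", annotation, targets))
```

The completed labels of a single target, or the joint tuples, can then be treated as ordinary nominal classes when computing an agreement measure.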

Since the annotation schemes are slightly different from each other, we will discuss their reliability individually.

In order to assess the strength of the found agreement, we compare the obtained scores with agreement scores from the literature. Unfortunately, several studies such as the SemEval tasks on aspect-based sentiment detection (Pontiki et al., 2014, 2015, 2016) and the SemEval task on stance detection (Mohammad et al., 2016) do not report chance-corrected agreement measures. The few stance annotation efforts that report chance-corrected agreement scores include Sobhani et al. (2015) and Taulé et al. (2017). (In the commercial domain, there is a significant number of studies reporting chance-corrected scores. We find that the obtained scores are often substantially higher than in the social or ethical domain. For instance, Ganu et al. (2009), Saeidi et al. (2016) or Wojatzki et al. (2017) report Cohen’s $\kappa$ scores of up to 1.00. We here only compare ourselves to approaches from the social or ethical domain.) On the dataset on Catalan Independence, Taulé et al. (2017) obtained a Fleiss’ $\kappa$ of .6 for their stance annotation. Sobhani et al. (2015) annotated comments about breast cancer screenings with stance on a three-level (FOR, OTHER, and AGAINST) and a five-level scale (STRONGLY FOR, FOR, OTHER, AGAINST, and STRONGLY AGAINST). They report a weighted $\kappa$ between pairs of annotators of .62 for their three-way stance annotations and .54 for their five-class scenario.

Atheism Dataset

If tweets are ironically phrased or incomprehensibly written, stance can hardly be inferred. Thus, we exclude tweets that are annotated for irony and understandability issues. However, we find that our annotators rarely agree on these phenomena as we get a $κ$ of only .06 for understandability and a $κ$ of .23 for irony. Therefore, we only exclude 18 tweets in which at least two annotators annotated that the tweet is ironic or incomprehensible. The filtering process resulted in 715 tweets for the final dataset.

As shown in Figure 4.4, we obtain a Fleiss’ $κ$ of .72 for the annotation of the overall stance. While it is difficult to directly compare reliability scores of different annotations, we conclude that our overall stance annotation is in a range comparable to the reported reference values (Sobhani et al. (2015) reported a weighted $κ$ of .62 and Taulé et al. (2017) a Fleiss’ $κ$ of .6). This shows that both single-target and multi-target stance annotation are tasks that involve a considerable amount of subjectivity. In Figure 4.4 we also visualize the reliability scores of the more specific stance annotations.

Overall, we obtain a mixed picture: we obtain high scores for some targets and much lower scores for others. The two targets Christianity and Islam yield especially high levels of agreement of above $.8$. We attribute this to the fact that they are associated with distinct cues such as the words Jesus or Quran, or references to Bible or Quran passages. Other targets such as secularism and freethinking are rather abstract and hardly involve specific signal words. Nevertheless, we obtain high agreement scores of over $.7$ for them. This suggests that our annotators did not just learn to recognize certain keywords, but can also reliably annotate more abstract targets.

The targets USA, religious freedom, same-sex marriage, and life after death yield substantially lower agreement scores between $.4$ and $.6$. An error analysis shows that the disagreement is often due to the fact that the identified cues are related to a target but are less specific than the target. For instance, for the target same-sex marriage we find that annotators disagree on whether to annotate a stance if the tweet contains a stance towards gay rights in general but not specifically towards gay marriage. A rather low $κ$ of .31 is obtained for the target no evidence. We observe that this disagreement mainly arises from tweets that mention the persons Bill Nye and Richard Dawkins (famous supporters of the position that there is no evidence for religion) or their Twitter accounts. We suspect that the annotators had different levels of knowledge about these persons and thus interpreted the tweets differently. This indicates that background knowledge can affect the interpretation of stance.

Finally, we obtain a $κ$ of .63 for the joint decision on both the debate and the explicit targets. The obtained inter-annotator agreement shows that our model can be annotated with varying reliability. Our analysis also shows that the reliability of stance annotation is affected by background knowledge and by how annotators interpret specific clues.

Figure 4.4

Inter-annotator agreement for the overall stance towards atheism and for the more specific stances.

Death Penalty Dataset

We will now report the reliability scores for our annotations of YouTube comments on death penalty. As five targets occurred only three times or less, we exclude them from the reliability analysis (no humane form, same right as euthanasia, same as all lethal forces, immoral to oppose, and negative impact on human psyche).

For the Death Penalty Dataset, we obtain a Fleiss’ $κ$ of .66 for the overall stance towards death penalty, the overall topic of the dataset. This score is slightly lower than the score for the Atheism Dataset, but still within a similar range as the annotations of Sobhani et al. (2015) (weighted $κ$: .62) or Taulé et al. (2017) (Fleiss’ $κ$: .6). We visualize both the agreement scores for stance towards death penalty and the specific targets in Figure 4.5.

Similar to the first annotation study, we obtain a mixed picture for the annotation of the specific stances. With $κ$ scores between .13 and .87, the range is even wider in the Death Penalty Dataset. While more than half of the specific targets result in an annotation with a $κ$ of above .6, several targets, such as certainty unachievable with a $κ$ of $.13$, are associated with rather low agreement. We do not observe a general difference between the two sets, as both contain targets with low and high reliability. We will now discuss the single targets of the sets in more detail.

iDebate Targets

The iDebate-Set includes targets that are similarly reliable or more reliable than the overall stance, and those whose reliability is substantially lower. With $κ$ scores of above .7, the targets overcrowding, preventive, deterring and irreversible reach an even higher agreement than the stance on death penalty. In addition, the targets overcrowding and preventive have $κ$ scores of above .6 and are therefore in a comparable range as the overall stance.

Eye for an eye is less reliable than the overall stance. We find that there are differences in the interpretation of this target among the annotators. While some annotators thought that the idea of equalization (i.e. that the punishment must match the crime) is central for the target, others annotate the target as soon as an author demands the death penalty for murder.

With $κ$ scores of .26 and .29, miscarriages of justice and financial burden result in an even lower level of agreement. Again, there seems to be a disagreement on whether terms that are related to targets are specific enough to signal a stance on the target. More specifically, the annotators disagree on whether high costs are already a burden, or if a burden requires that the costs must cause substantial hardship. Similarly, the annotators disagree on whether systematic misjudgment is principally a miscarriage of justice.

Reddit Targets

Figure 4.5

Inter-annotator agreement of stance annotations for death penalty, death penalty_explicit, iDebate-Set and Reddit-Set. We exclude targets that occurred three times or less.

For the targets of the Reddit-Set we obtain a mixed picture, too. The targets electric chair and gunshot are highly reliable, which we attribute to a strongly associated vocabulary, such as by electric chair and firing squad. In contrast, the targets more quickly and more harsh are not clearly associated with keywords. Nevertheless, with $κ$ values of above .7, we obtain rather high reliability scores for these targets.

For the targets use bodies, heinous crimes, and strangulation we find a substantially lower level of agreement. We observe a disagreement among the annotators on how narrowly or widely these targets have to be interpreted. For instance, for the target heinous crimes it is often unclear whether all murder is a heinous crime by definition, or whether the heinousness must be stressed in the comment. In addition, the annotators disagree on whether hanging always implies a stance on the target strangulation, as those sentenced to hanging often die from cervical dislocation.

We notice the lowest level of agreement for the targets abortion, replace life-long and certainty unachievable. As a potential reason for the low level of agreement, we identify that some annotators missed $\ominus$ stances in utterances which express a reversal of these targets. For instance, from the utterance There are cases in which you are sure he is guilty! one may infer a $\ominus$ stance on certainty unachievable. Similarly, from the utterance Let them rot forever one may infer a $\ominus$ stance on replace life-long.

For the joint decision on both the overall and the more specific targets we obtain a $κ$ of .50. The reliability in the Death Penalty Dataset is thus lower than the reliability found in the first study. This is surprising, since the annotation process in the second study is more regulated by the strict requirement to identify spans of specific textual markers.

We suspect several reasons for this lower level of reliability. First, the higher number of targets in the second study makes our annotation more complex. Annotation complexity is known to be negatively correlated with inter-annotator agreement (Bayerl and Paul, 2011). Second, the targets of both studies differ not only in their number, but also qualitatively. Due to how they were created, the targets in the Atheism Dataset are mostly concepts which can be described with a single word (e.g. Christianity or Islam). The targets for the Death Penalty Dataset were extracted from expert arguments and social media discussions and include complex lines of thought (e.g. the death penalty can result in irreversible miscarriages of justice). As a third reason, we suspect that the different contexts of the YouTube and Twitter data may affect the reliability. We find that there is a rich interaction between the different YouTube comments (i.e. people react to each other’s comments). In the Twitter data, which was created by querying for hashtags, such interactions are rare. In the YouTube data, the posts also share the context that they comment on the same videos. In our annotation, this additional context may have introduced additional possibilities for interpretation and thus caused a lower reliability.

Comparison

Overall, the reliability found in the two studies is in a range which is comparable to similar studies. The considerable amount of agreement shows that humans are in principle capable of reliably interpreting textual clues to infer specific stances. However, the disagreement also highlights that stance is, to a large extent, based on subjective assessments. As a major source of disagreement, we identify inconsistencies with respect to the question of what makes textual clues specific enough to infer a specific stance. In addition, we observe that varying degrees of prior knowledge and differing interpretations of the targets affect the reliability of the annotation. Some of these difficulties could be alleviated by extending the annotation guidelines. For example, a guideline could specify whether stance should be annotated towards gay marriage if the text more generally comments on gay rights. However, such guidelines would be tailored towards the domain, targets, and possibly even the data, and thus not easily transferable to new issues or data.

Distribution of Multi-target Stance Annotations

Our reliability analysis provides insights into how consistently the multi-target stance can be assigned, but provides no insight into how the authors communicate their stances. Hence, we will now analyze the distributions of the collected annotations. Specifically, we examine the distribution of stance which is expressed towards the overall targets (i.e. atheism and death penalty) and towards the more specific targets. Besides providing insights into stance communication, an analysis of the distributions of annotations also sheds light on the quality of the used target sets. A reasonable analysis of communicative behavior is only possible if the targets describe stance-taking in a substantial part of the data. In addition, since the automatic classification is affected by the class distribution (cf. Section 3.3), the skewness of the distribution is a quality criterion from a machine learning perspective.

Before we can inspect the distributions of stances, we have to agree on one annotation for each instance. For the Atheism Dataset, we simply rely on a majority vote to compile a final annotation for each tweet, as we have no way to judge whether the annotations differ in quality. For the Death Penalty Dataset, we also use the majority vote to derive the final stance annotation towards the overall target death penalty. However, for the more specific stances, we rely on the annotated spans to perform a curation. During the curation, we compare the spans and correct annotations if annotators have missed textual clues which express a reversal of these targets (cf. Section 4.3.2). The curation step was done by the author of this thesis.
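The majority vote can be sketched in a few lines. How ties are resolved is not specified above, so the fallback to NONE in this sketch is an assumption, as are the label names FAVOR and AGAINST (standing in for $\oplus$ and $\ominus$) and the toy annotations.

```python
from collections import Counter


def majority_vote(annotations, fallback="NONE"):
    """Return the label chosen by most annotators; use `fallback` on ties."""
    ranked = Counter(annotations).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # Tie-breaking rule: an assumption of this sketch, not taken from the study.
        return fallback
    return ranked[0][0]


# Toy annotations of three hypothetical annotators for two tweets.
print(majority_vote(["FAVOR", "FAVOR", "NONE"]))    # clear majority
print(majority_vote(["FAVOR", "AGAINST", "NONE"]))  # three-way tie, falls back
```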

Overall Stance

Table 4.4

target	$\oplus$	$\ominus$	NONE	$\sum$
atheism	147	348	220	715
death penalty	272	224	325	821
death penalty_explicit	40	47	734	821

Distribution of the stance towards the overall targets in the Death Penalty Dataset and the Atheism Dataset.

In Table 4.4, we show the distributions of expressed stances towards the overall topics atheism and death penalty. For the Atheism Dataset, we obtain 147 (21%) instances with a $\oplus$, 348 (49%) with a $\ominus$, and 220 (31%) with a NONE stance. This distribution is different from the original SemEval annotation (cf. Section 3.3), which consists of 17% $\oplus$, 63% $\ominus$, and 20% NONE instances. Our annotators assigned substantially fewer $\ominus$ stances ($-$14 percentage points) and more NONE stances ($+$11 percentage points).

A manual inspection shows that, in contrast to the original annotation, our annotators do not interpret any usage of religious or spiritual language (e.g. The greatest act of faith some days is to simply get up and face another day) as an indication of a $\ominus$ stance towards atheism. A possible explanation for these differing interpretations is that the SemEval annotators were all based in the USA, where the attitude to religiosity and atheism may be different. The difference again highlights that stance annotation is indeed affected by subjective interpretations.

In the Death Penalty Dataset, 272 instances (33%) are labeled with a $\oplus$, 224 instances (27%) are labeled with a $\ominus$, and 325 instances (40%) are labeled with a NONE stance. This distribution is a bit more even and thus more desirable than the distribution of annotations in the Atheism Dataset. That about 60% of the data (496 of 821 instances) express a stance towards the death penalty confirms our hypothesis that although YouTube is actually a platform for video sharing, it is also a place where people express their opinions and beliefs. Out of the 496 stance-bearing instances ($\oplus+\ominus$), only 87 (40 $\oplus$ and 47 $\ominus$) are labeled to contain an explicit stance towards death penalty. This finding aligns with our hypothesis that stance is often implicitly expressed in social media communication.

Specific Stances

In Figure 4.6, we show the distribution of specific stances in the Atheism Dataset. We observe that the distribution is highly imbalanced. This means that a few targets are frequent (i.e. supernatural power and Christianity) and that the mass of targets is rare. We suspect that this distribution reflects that the targets are annotated on the basis of textual clues, whose distributions, like most linguistic phenomena, probably follow Zipf’s law. Furthermore, we find substantially more $\oplus$ than $\ominus$ annotations for all targets. The most frequent targets (supernatural power and Christianity) are associated with a religiosity which can be seen as the opposite of atheism. Since there are many more tweets expressing a $\ominus$ stance towards atheism, it is not surprising that there are many more $\oplus$ stances towards the religious targets. There are few interactions between tweets in the dataset. In such a one-sided communication, people may tend to communicate their stance directly rather than to attack the opposing side.

Figure 4.6

Distribution of stance towards the more specific targets in the dataset concerning the target atheism.

We visualize the distribution of specific stances in the Death Penalty Dataset in Figure 4.7. As in the Atheism Dataset, we find that both distributions yield a strong imbalance. We again attribute this to the annotation procedure, which is based on identifying textual clues that are probably Zipf distributed in the dataset.

The imbalance is more evident for the Reddit-Set than for the iDebate-Set. However, this is mainly due to the fact that the Reddit-Set contains more targets that occur rarely or not at all. This could be a result of how the target sets were created. The Reddit-Set is created on the basis of relevant and frequently commented issues on Reddit, which do not seem to fully correspond to the issues raised in our YouTube comments. The expert targets of the iDebate-Set were probably also created according to whether they reflect the viewpoints of a sufficiently large number of people.

Similar to the Atheism Dataset, we find that there are substantially more $\oplus$ than $\ominus$ stances towards the specific targets. However, the imbalance between $\oplus$ and $\ominus$ stances is less pronounced in the comments on death penalty. There are even two targets (financial burden and deterring) for which we observe more $\ominus$ than $\oplus$ stances. We attribute this distribution to the more even distribution of the overall stance on the death penalty. The distribution of $\ominus$ and $\oplus$ stances is more even in the iDebate-Set than in the Reddit-Set. This could also be the result of a more objective and balanced selection of targets by the experts of iDebate.

Machine learning systems are sensitive to the frequency and the distribution of the classes we want to predict. On the one hand, they need enough training instances to learn a generalizing mapping function and, on the other hand, the predictions are biased towards the frequent classes (Bishop, 2006, p.17,149). From this perspective, the iDebate-Set has both a slightly more favorable distribution of targets and a slightly more favorable distribution of $\ominus$ and $\oplus$ stances. However, as both sets contain both frequent and infrequent targets none of the two inventories can be clearly preferred over the other.

Figure 4.7

Distribution of stance towards the more specific targets in the dataset concerning the target death penalty.

Coverage

The aim of the formalization multi-target stance is to provide a more detailed picture of the expressed $\ominus$ or $\oplus$ stances. An annotation scheme serves this purpose only if a sufficient number of instances are annotated with a specific stance. The degree to which instances are covered by an annotation scheme is called coverage.

Since the NONE class is used to filter out irrelevant by-catch, we are not interested in describing instances labeled with a NONE stance in greater detail. Hence, to examine the coverage of the target sets, we calculate the proportion of instances that are annotated with a $\ominus$ or $\oplus$ stance towards the contained target in the set of instances that are labeled with a $\ominus$ or $\oplus$ stance towards the topic of the debate:

$$\frac{|I_{t:\oplus,\ominus} \cap I_{\oplus,\ominus}|}{|I_{\oplus,\ominus}|}$$

with $I_{\oplus,\ominus}$ being the set of instances which are labeled with a $\ominus$ or $\oplus$ stance towards the topic of the debate and $I_{t:\oplus,\ominus}$ the set of instances which are labeled with a $\ominus$ or $\oplus$ stance towards at least one target $t$ of the set $T$. To examine whether the targets better cover the $\ominus$ or the $\oplus$ instances, we additionally calculate a coverage with respect to all instances with a $\ominus$ stance and with respect to all instances with a $\oplus$ stance towards the overall target.
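The coverage measure can be illustrated with a small sketch. The data layout (dictionaries keyed by instance id), the label names FAVOR and AGAINST (standing in for $\oplus$ and $\ominus$), and the toy instances are assumptions made for illustration; the function only mirrors the ratio defined above.

```python
STANCE_BEARING = ("FAVOR", "AGAINST")  # hypothetical names for ⊕ and ⊖


def coverage(overall, specific, polarities=STANCE_BEARING):
    """Share of reference instances that carry at least one specific stance.

    overall:    dict instance_id -> stance towards the topic of the debate
    specific:   dict instance_id -> dict target -> stance towards that target
    polarities: overall labels that define the reference set
    """
    reference = {i for i, label in overall.items() if label in polarities}
    if not reference:
        return 0.0
    covered = {
        i
        for i, targets in specific.items()
        if any(label in STANCE_BEARING for label in targets.values())
    }
    return len(covered & reference) / len(reference)


# Toy data: five instances, stances towards a handful of specific targets.
overall = {1: "FAVOR", 2: "AGAINST", 3: "AGAINST", 4: "NONE", 5: "FAVOR"}
specific = {
    1: {"Christianity": "AGAINST"},
    2: {"supernatural power": "FAVOR"},
    3: {},
    4: {"Islam": "FAVOR"},
    5: {"Islam": "AGAINST"},
}

print(coverage(overall, specific))                           # all stance-bearing instances
print(coverage(overall, specific, polarities=("FAVOR",)))    # only ⊕ instances
print(coverage(overall, specific, polarities=("AGAINST",)))  # only ⊖ instances
```

Instance 4 carries a specific stance but a NONE stance towards the topic of the debate, so it does not enter the reference set and does not count towards the coverage.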

In Table 4.5 we show the results of these calculations. Overall, we find that the coverage of specific stance annotations in the Atheism Dataset is substantially higher than the coverage in the dataset about death penalty. The high coverage of the Atheism Dataset can be explained by the fact that the targets are created on the basis of frequent words. Probably, the annotators use the same words as textual markers for the annotation. With 97% the coverage is especially high among the instances which express a $\ominus$ stance towards atheism. Interestingly, 81% of these instances express a $\oplus$ stance towards the target supernatural power. This means that most authors in the dataset express their stance on atheism by making statements about their religiosity or spirituality.

Both target sets in the Death Penalty Dataset cover slightly less than half of the data. When combining both sets, the coverage almost doubles, from which we may conclude that the two sets are complementary. However, even the combination of the sets misses 22% of the instances. As these instances are also not annotated with an explicit stance towards the death penalty, the sets seem not to be fully suited to describe how people discuss death penalty-related aspects in our data.

Both target sets cover more $\oplus$ instances than $\ominus$ instances. This imbalance is stronger for the Reddit-Set, which might be due to the community of iDebate that pays more attention to balanced discussions.

Table 4.5

dataset	target set	$I_{\oplus,\ominus}$	$I_{\oplus}$	$I_{\ominus}$
Atheism Dataset	data-driven	90%	73%	97%
Death Penalty Dataset	external (both)	78%	56%	43%
Death Penalty Dataset	iDebate-Set	48%	26%	22%
Death Penalty Dataset	Reddit-Set	44%	31%	14%

Coverage of stance towards specific targets in the Atheism and the Death Penalty Dataset. $I_{\oplus,\ominus}$, $I_{\oplus}$, and $I_{\ominus}$ refer to the sets of instances that are labeled with a $\oplus$ or $\ominus$ stance, only with a $\oplus$ stance, or only with a $\ominus$ stance towards the overall target, respectively.

Predicting Multi-Target Stance

We now turn to investigating whether and how multi-target stance can be assigned automatically. To this end, we model the task of multi-target stance detection as a number of subtasks which correspond to single-target stance detections. We first analyze the performance of the individual components. Subsequently, we investigate the relationship between the different components.

Predicting Overall and Specific Stances

Both the stance towards the topic of the debate (i.e. death penalty and atheism) and the stances towards the specific targets are annotated in isolation. Thus, we cast the task of multi-target stance detection as a sequence of independent classifications. That means that we train an independent prediction system for the overall stance and for each more specific stance. In isolation, these independent systems correspond to single-target stance detection systems.

Predicting the Overall Stance

For the prediction of the overall stance, we implement single-target stance detection systems (cf. Section 3.3). For both datasets, we preprocess the data using DKPro Core and apply a social media tokenizer (Gimpel et al., 2011).

As the Atheism Dataset is a re-annotation of a subset of the SemEval data, we reimplement the system that is most successful in the SemEval challenge on stance detection. Hence, we implement an SVM classifier with a linear kernel. To represent the instances we use 1–3 word ngram features and 2–5 character ngram features. For evaluating the SVM’s performance, we run a ten-fold cross-validation and report micro-averaged $F_1$ .
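To make the feature representation concrete, the following sketch extracts 1–3 word ngram and 2–5 character ngram counts from a text. It is not the implementation used in our experiments: the whitespace split only stands in for the social media tokenizer, and the function names and the feature prefixes `w:` and `c:` are hypothetical. The resulting counts would then be vectorized and passed to a linear SVM.

```python
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=3):
    """Count word ngrams of length n_min..n_max."""
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats["w:" + " ".join(tokens[i:i + n])] += 1
    return feats


def char_ngrams(text, n_min=2, n_max=5):
    """Count character ngrams of length n_min..n_max (spaces included)."""
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            feats["c:" + text[i:i + n]] += 1
    return feats


def extract_features(text):
    """Combine word and character ngram counts into one sparse feature map."""
    lowered = text.lower()
    tokens = lowered.split()  # simplification: stands in for a social media tokenizer
    feats = word_ngrams(tokens)
    feats.update(char_ngrams(lowered))  # Counter.update adds the counts
    return feats


features = extract_features("God is love")
print(len(features), "distinct features")
```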

For the Death Penalty Dataset, there are no reference approaches, which is why we chose to implement prototypical representatives of the neural and the non-neural strand of supervised stance detection systems. As the neural system, we implement a bidirectional LSTM neural network (Schuster and Paliwal, 1997) using the Keras framework (https://keras.io/) with the Theano backend. For vectorization of the data, we use pre-trained word vectors, namely the 840B tokens Common Crawl vectors from the GloVe project (Pennington et al., 2014). The input is followed by the bidirectional layer, which has 100 units and uses tanh activation. We train the network with the Adam optimizer (Kingma and Ba, 2014). In order to enable regularization, we use a dropout of .2 between the layers. Subsequently, we add another dense layer and a softmax classification layer. The network is trained with categorical cross-entropy as the loss function for five training epochs. The architecture is inspired by the system of Zarrella and Marsh (2016), which is the best performing neural system in the SemEval 2016 Task 6 competition. Based on this system, we iteratively optimize the hyper-parameters (i.e. activation function, number of nodes within the layers, different learning rate optimizers) within the cross-validation. We terminate this optimization after reaching a (local) optimum.

As a representative of the non-neural strand, we implement an SVM with a linear kernel and equip it with 1–3 word ngram features. We further experiment with character ngram features, sentiment features derived from the output of the tool by Socher et al. (2013), and dense word vector features. For the latter, we first compute the average over all of the GloVe vectors of a comment and then represent an instance as the resulting vector. An ablation test on the feature level revealed that sentiment features are slightly beneficial, but word vector and character features are not.

We calculate the performance of both approaches on predicting stance towards the death penalty using leave-one-out cross-validation on the video level and report the micro-averaged $F_1$ . This means that we successively train a model on the data of five videos and test the model on the data of the remaining video.
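The cross-validation on the video level can be sketched as follows. The helper `leave_one_group_out` and the toy video ids are hypothetical; the sketch only shows how the folds are formed so that all comments of one video are held out together.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Toy example: six comments that belong to three hypothetical videos.
video_ids = ["v1", "v1", "v2", "v3", "v3", "v3"]
folds = list(leave_one_group_out(video_ids))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

Splitting by video rather than by comment keeps comments that share the context of one video from appearing in both the training and the test data of a fold.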

In Table 4.6 we show the performance of these classifications and the performance of a majority class baseline. In both datasets, the performance gain over the majority class baseline is about .15. This means, on the one hand, that the classifiers are indeed making meaningful predictions, but on the other hand, as the gain is only about .15, that the problem is a challenging task.
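The majority class baseline can be reproduced from the label counts in Table 4.4. The sketch below computes micro-averaged $F_1$ over all three classes, which for single-label classification coincides with accuracy. As a simplification, it evaluates the baseline on the full datasets rather than within the cross-validation folds; the label names FAVOR and AGAINST stand in for $\oplus$ and $\ominus$.

```python
from collections import Counter


def micro_f1(gold, predicted):
    """Micro-averaged F1 over all classes for single-label classification."""
    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1
            elif g == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_baseline(label_counts):
    """Always predict the most frequent label and score that prediction."""
    gold = [label for label, n in label_counts.items() for _ in range(n)]
    majority = Counter(gold).most_common(1)[0][0]
    return majority, micro_f1(gold, [majority] * len(gold))


# Label counts taken from Table 4.4.
atheism = {"FAVOR": 147, "AGAINST": 348, "NONE": 220}
death_penalty = {"FAVOR": 272, "AGAINST": 224, "NONE": 325}

for name, counts in (("atheism", atheism), ("death penalty", death_penalty)):
    label, score = majority_baseline(counts)
    print(f"{name}: {label} {score:.2f}")
# prints "atheism: AGAINST 0.49" and "death penalty: NONE 0.40"
```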

Although the Death Penalty Dataset is larger than the Atheism Dataset, both the majority class baseline and the SVM yield better performance in the Atheism Dataset. The better performance of the majority class baseline can be explained by the class distribution, which is more skewed in the Atheism Dataset. We further attribute the worse result of the prediction in the Death Penalty Dataset to the stronger influence of the context, which probably also negatively influenced the reliability of the annotation. Our prediction approaches do not model extra-textual knowledge such as references to other users or the videos.

Interestingly, for the Death Penalty Dataset, the LSTM and the SVM perform on par. This again provides evidence for our hypothesis that in the task of stance detection neither neural nor non-neural approaches are clearly superior to the other.

Table 4.6

dataset	classifier	micro-averaged $F_1$
Atheism Dataset	majority class baseline ($\ominus$)	.49
Atheism Dataset	SVM_{ngrams}	.66
Death Penalty Dataset	majority class baseline (NONE)	.40
Death Penalty Dataset	SVM_{ngrams+sentiment}	.55
Death Penalty Dataset	LSTM	.54

Performance of the overall stance classification in the Atheism and the Death Penalty Dataset.

Predicting Specific Stances

We now turn to the prediction of the more specific stances. Since the targets are annotated independently, we also predict them independently. Consequently, we cast the prediction of specific stances as a sequence of single-target detections. Thus, we use the approaches described above to predict whether the instances express a $\oplus$, a $\ominus$, or a NONE stance towards each target of the sets. However, as we have shown in Section 4.4.2, many targets are very rare. Machine learning approaches for stance detection require that the stance to be learned occurs sufficiently often in both the training and test data. Hence, to be able to carry out meaningful experiments, we limit our experiments to targets that occur more than twenty times. For the Death Penalty Dataset, we only report the results of the SVM prediction, as we find that the LSTM was not able to cope with the extremely imbalanced class distribution.

In Figure 4.8, we visualize how well each target can be predicted with an SVM and the corresponding performance of the majority class baseline. Overall, we see only a few improvements over the majority class baseline. None of the targets of the Death Penalty Dataset can be classified with a substantially better performance than the majority class baseline. However, due to the class distribution, the majority class baseline already yields a performance of .9 or higher for all of these targets. Only the two most common targets of the Atheism Dataset (i.e. supernatural power and Christianity) can be classified successfully relative to the majority class baseline. The baselines associated with these two targets perform significantly worse than the baselines of the other targets.

Overall, this shows that the prediction of a specific stance is strongly dependent on the frequency and the distribution of the specific targets. While it is difficult to directly compare the two datasets, this implies that a data-driven creation of the targets results in a more favorable distribution of stance annotations.

Figure 4.8

Performance of predicting stance towards the specific targets (ordered according to their frequency). Prediction results are obtained using an SVM and data-specific feature sets. We only show targets that occur twenty times or more.

Relationship Between Specific and Overall Stance

Multi-target stance annotations may yield perfect reliability and coverage, but still may fail to describe details of expressed stances in a meaningful way. For example, if one were to annotate whether people like or dislike the platform on which they submit their post (i.e. Twitter or YouTube), it is conceivable that this stance would occur frequently (we find that comments such as I hate YouTube for the amount of advertisement or I love my Twitter feed are not uncommon on many social media platforms) and that the agreement for this annotation would be high. However, the stance towards the social media platform is hardly suitable for explaining one’s overall stance on atheism or death penalty. Hence, we also need to evaluate how the specific stances relate to the topic of the debate and thus the overall stance.

Therefore, we compare how well the overall stance can be predicted based on the stances that are expressed towards the targets of the sets. If one is able to perfectly predict the overall stance from the targets of a set, one can hypothesize that the set fully models how stance is expressed towards the overall target.

In addition, we examine the relationship between models that are trained to classify the overall stance and stance towards the target sets. Therefore, we compare how well a stance classifier performs on subsets of the data, which are annotated with the targets of a set. The intuition of this experiment is that the better the classifier performs on these subsets, the more it internally relies on features that match the targets of a set. From this, one could conclude that these targets play a large role in how people express stance and thus explain the overall stance to a large degree.

Predicting Overall Stance from Specific Stances

Table 4.7

dataset	target set	micro-averaged $F_1$
Atheism Dataset	full_oracle	.89
Atheism Dataset	frequent targets_predicted	.63
Death Penalty Dataset	full_oracle $+$ Death Penalty_explicit	.82
Death Penalty Dataset	full_oracle	.73
Death Penalty Dataset	iDebate-Set_oracle $+$ Death Penalty_explicit	.71
Death Penalty Dataset	Reddit-Set_oracle $+$ Death Penalty_explicit	.69
Death Penalty Dataset	iDebate-Set_oracle	.59
Death Penalty Dataset	Reddit-Set_oracle	.58

Performance of the overall stance prediction which is based on speciﬁc stances. Results are obtained using a logistic regression that is equipped with speciﬁc stances as features.

To measure how well stance can be predicted from specific stances, we carry out a logistic regression that we equip with the specific stances as features. We calculate the classification performance using the same cross-validation set-ups as for predicting the overall stance. This means that we conduct a ten-fold cross-validation for the Atheism Dataset and a leave-one-out cross-validation on the video level for the Death Penalty Dataset. We report micro-averaged $F_1$ as an evaluation metric.
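The video-level leave-one-out set-up can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments; the function name `leave_one_group_out` and the video ids are hypothetical. In each fold, all comments of one video are held out for testing, so that no comments from the same video appear in both training and test data.

```python
from collections import defaultdict

def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for every distinct group."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    for held_out, test_idx in by_group.items():
        # Training data: every instance that does not belong to the held-out group.
        train_idx = [i for g, idx in by_group.items() if g != held_out for i in idx]
        yield held_out, train_idx, test_idx

# Hypothetical video ids for six comments (assumption for illustration only).
video_ids = ["v1", "v1", "v2", "v3", "v3", "v3"]
folds = list(leave_one_group_out(video_ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

The number of folds equals the number of videos, which is why this set-up differs from the ten-fold cross-validation used for the Atheism Dataset.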

To examine how well the overall stance can be predicted from the specific stances, we perform an oracle experiment in which we assume that the specific stances are given. The results of this experiment are shown in Table 4.7. For both datasets, we obtain scores of over .8, which shows that the overall stance can be reliably predicted from specific stances and thus that the targets used have a strong relationship to the topic of the two datasets.

In the previous section, we have shown that automatic systems that predict stance towards frequent targets in the Atheism Dataset (cf. 4.5.1) are able to outperform majority class baselines. Thus, we repeat the experiment with features that correspond to the predicted specific stances (target set: frequent targets_predicted). In this experiment, we obtain an $F_1$ score of .63. This score is higher than the performance of a majority class baseline ($F_1$: .49), but lower than the score of an SVM that directly predicts the overall stance based on $n$-grams ($F_1$: .65). Hence, the quality of the predicted specific stances is not good enough to improve over the $n$-gram features. However, the oracle condition yields an $F_1$ score of $.89$. This indicates that improvements over the state of the art are possible if explicit stances can be classified more reliably.

For the Death Penalty Dataset, we find that the Reddit-Set and the iDebate-Set are equally useful for predicting the overall stance. Both sets can be similarly improved by about .1 if one adds the explicitly expressed stance towards the death penalty. The performance improves substantially when combining both sets, from which we again conclude that the sets are highly complementary. When also adding the explicitly expressed stance towards the death penalty, we obtain a performance of .82 which is fairly high compared to a majority class baseline ( $F_1$ :.40) and the best prediction system ( $F_1$ :.55). As in the Atheism Dataset, this difference suggests that there is a potential for improvements if one is able to automatically extract features which correspond to the specific stance. However, our analysis has shown that – due to the sparse distribution and rare occurrence of targets – this is a highly challenging task.

Influence of Specific Stances in Single-Target Stance Detection

Finally, we examine the influence of specific stances on models which are trained to predict the overall stance from text. To this end, we examine how well these models (cf. Section 4.5.1) classify instances that are annotated with a certain specific target. If a classifier works well on these instances, we can hypothesize that the associated target, or rather the associated wording, is also important for the learned model. Since the specific targets are almost exclusively annotated to instances which are labeled with a $\oplus$ or $\ominus$ stance, we evaluate the performance using the micro-averaged $F_1$-score of $\oplus$ and $\ominus$.
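As a rough sketch of this metric, the following snippet pools true positives, false positives, and false negatives over the selected classes before computing $F_1$. The function name `micro_f1` and the label strings `favor`, `against`, and `none` (standing in for $\oplus$, $\ominus$, and no stance) are assumptions made for illustration.

```python
def micro_f1(gold, predicted, classes):
    """Micro-averaged F1 restricted to the given classes."""
    tp = fp = fn = 0
    for cls in classes:
        for g, p in zip(gold, predicted):
            if p == cls and g == cls:
                tp += 1          # correctly predicted this class
            elif p == cls and g != cls:
                fp += 1          # predicted this class, but gold differs
            elif g == cls and p != cls:
                fn += 1          # missed an instance of this class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted labels (assumption for illustration only).
gold = ["favor", "against", "none", "favor", "against"]
pred = ["favor", "favor", "against", "favor", "none"]
score = micro_f1(gold, pred, classes=("favor", "against"))
print(score)
```

Restricting the pooled counts to the two polar classes means that instances without a stance only affect the score when they are confused with one of these classes.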

We show the performance on subsets of the Death Penalty Dataset in Figure 4.10 and the performance on subsets of the Atheism Dataset in Figure 4.9. For both datasets, the classification performs considerably better on instances that are annotated with a specific stance than on the whole dataset (+.6% on the Atheism Dataset and +.8% on the Death Penalty Dataset). We also observe a large drop in classification performance if we exclude comments that are annotated to express a specific stance in both datasets. From this, we conclude that the classifiers that are trained to predict an overall stance are largely learning to classify specific stances.

Interestingly, we notice large differences between the targets of the Atheism Dataset. With $F_1$ -scores of above .8, instances that are annotated with a stance towards the two most frequent targets (supernatural power and Christianity) are classified particularly well. Among the instances annotated with the less frequent targets, the classification performs poorly. For instance, not a single tweet annotated with a stance towards same-sex marriage is classified correctly.

Figure 4.9

Classification performance (SVM) on subsets of the Atheism Dataset. The subsets correspond to instances which are labeled with a $\oplus$ or $\ominus$ stance towards the specific targets. The dashed line marks the performance of the SVM in a cross-validation across the entire data set.

For the Death Penalty Dataset, we do not observe major differences between the two target sets regarding classification performance. However, while the LSTM and SVM perform on par when considering all targets, they differ substantially with respect to certain targets. Some of the targets that cause problems for the LSTM achieve relatively high scores using the SVM. For example, the performance on instances annotated with overcrowding is .38 for the LSTM, and .74 for the SVM. However, we also observe targets in which the situation is inverted, e.g. for use bodies we obtain a much higher value (.77) for the LSTM than for the SVM (.36).

The performance of the LSTM shows a large variance for the iDebate-Set. There are even three targets (irreversible, miscarriages and deterring) on which the LSTM performs worse than on the whole dataset. The targets preventive, eye for an eye and financial burden reach $F_1$-scores of above .6, and overcrowding even results in an $F_1$-score of $.83$. In contrast to the LSTM, the SVM shows a lower variance and no targets on which it performs worse than on all data.

On the Reddit-Set, both the SVM and LSTM show a lower variance, but the LSTM is often slightly better than the SVM. In this set, there are more targets (abortion, electric chair, more harsh and replace life-long) in which the performance of both classifiers is comparable. Nevertheless, use bodies, strangulation and heinous crimes are significantly better predicted by the LSTM.

On comments that express an explicit stance on the death penalty, we find that the $F_1$ -score is in the same range (LSTM: $.52$ , SVM: $.55$ ) as for the classification of targets. This further supports our decision to treat this explicit stance as a special case of a target.

Figure 4.10

Classification performance (SVM and LSTM) on subsets of the Death Penalty Dataset. The subsets correspond to instances which are labeled with a $\oplus$ or $\ominus$ stance towards the specific targets. The dashed line marks the performance of SVM and LSTM in a cross-validation across the entire data set.

Chapter Summary

In this chapter, we examined multi-target stance, a formalization that provides a more complex picture of stances than single-target stance. We studied the formalization by creating and analyzing two datasets: (i) the Atheism Dataset, which consists of tweets on atheism from the SemEval task on stance detection (Mohammad et al., 2016), and (ii) the Death Penalty Dataset, which consists of YouTube comments on the death penalty.

We annotated all instances in both datasets with an overall stance towards the topic of the debate (i.e. death penalty or atheism) and any number of more specific stances towards the targets of predefined sets. To objectify the annotation of the specific stances, we adapted ideas of the cooperative principle (Grice, 1970) and relevance theory (Sperber and Wilson, 1986, 1995) and required that annotations are based on relevant textual clues. For the Atheism Dataset, we created the target set using a semi-automatic, data-driven procedure. For the Death Penalty Dataset, we created two sets of targets from the collaboratively created resources idebate.org and reddit.com.

Next, we analyzed the reliability and distribution of annotations. In both datasets, the reliability is in the same range as comparable efforts. However, we also find substantial disagreement which we attribute to inconsistent interpretation of certain textual clues.

When comparing the distributions of the different annotations, we find for both datasets and all target sets that the distributions of annotations are strongly imbalanced, with a few targets occurring frequently and most targets being sparse. We attribute this to the annotation procedure, which focusses on textual elements that are probably Zipf distributed. We cannot identify an objectively superior method for creating the targets. While the data-driven targets yield a higher coverage and are more frequent, we find that some of the externally created targets show a preferable class distribution of $\oplus$ and $\ominus$ stances.

Finally, we examined how well we can automatically predict the different components of multi-target stance annotations and how these components relate to each other. The automatic classification of specific stances is only successful in a few cases. We can explain this with the skewed class distribution and the infrequent occurrence of the targets. Probably, the performance is also negatively influenced by the reliability of annotations, as low reliability weakens the signal in the training data. In addition, we show that overall stance classifiers already model stance on explicit targets to a large extent. Furthermore, we find that the overall stance can be quite reliably predicted if specific targets are given.

Future attempts at stance detection could build on these findings by using external knowledge specific to the targets or by reusing models that have been built to classify them. For future attempts at creating aspect-based sentiment or stance datasets, the results highlight that the composition of target sets is a crucial subtask of stance detection and aspect-based sentiment analysis that is worth investing effort in.

In the next chapter, we will examine the most complex formalization of stance – stance that is expressed towards nuanced statements.

Nuanced Assertions

In the previous chapter, we have shown that externally derived targets might fail at sufficiently covering stance-taking in a dataset. In addition, even if a target set would perfectly cover a dataset, the used targets may still be too coarse to describe the complexity of the expressed stances. For example, consider the three following statements that address the high costs of maintaining windmills:

(A) If you think the maintenance of windmills is too expensive, think of the price of storing nuclear waste.

(B) The high cost of maintaining windmills is nothing compared to the price of global warming.
(C) While in certain locations windmills yield too high maintenance costs, they operate cost-effectively in other places.

The three utterances all express a $\oplus$ stance towards wind energy and also a $\oplus$ stance towards the target the costs of maintaining windmills is high. However, the utterances all emphasize different nuances of the target the costs of maintaining windmills is high. If our formalization should be able to differentiate between these nuances, we would need corresponding targets (e.g. (A): wind energy is cheaper than nuclear power, (B): wind energy is cheaper than costs caused by global warming, and (C): Not all windmills have high maintenance costs). However, for each of the examples, further nuances are easy to imagine. For instance, it is conceivable that other utterances compare wind energy to other sources of energy or differentiate between the high costs of different kinds of windmills. If necessary, we would now also have to model these nuances by nuanced targets. As we could recursively find even more nuances for the more nuanced utterances, theoretically there are no limits to the complexity of target sets.

In certain use cases, it may be necessary to obtain a comprehensive understanding of all nuances of stance towards a topic. For example, for a local government that plans to install a wind power plant, obtaining the percentage of people that support or oppose wind energy may not be sufficient, as people may disagree on where to install the windmills, what type and number of windmills to install, or any other relevant details. Hence, to make an informed decision, the local government needs knowledge about what nuances of the plan are discussed by their citizens, which of them are considered more important, which of them are most divisive, or on which nuances opposing groups agree and disagree.

Given that it is difficult to estimate all possible nuances in advance, using a target set which is based on external knowledge seems to not be a viable approach. A common approach for generating suitable targets is to conduct surveys to identify important aspects of an issue. However, besides being time-intensive and expensive, surveys may also not cover all relevant aspects, as the survey creators inadvertently may bring in biases and because experts may err as well. Hence, in this chapter, we derive target sets from the data that we want to analyze. Specifically, we ensure that all nuances of a dataset are covered by using all utterances as targets. To the best of our knowledge, modelling stance in this way has not been scientifically studied yet.

There are also practical reasons that make it difficult to annotate stance towards a large set of nuanced targets. If we annotated $n$ instances using the annotation procedure we described in Section 4.2, each annotator would have to perform $n - 1$ stance annotations for each instance. This number of annotations is hardly practical for sufficiently large datasets. Thus, for examining nuanced stance taking, we do not study how people express stance through text, but how people react to text. To collect such data, we make use of mechanisms that allow people to express a $\oplus$ stance (i.e. to agree) or a $\ominus$ stance (i.e. to disagree) towards text. This mechanism is inspired by functionalities that can be found in almost all social networking sites, where users can express their stance towards a post using a thumb-up, a like, an up-vote, a ♥, or a thumb-down, a dislike, or a down-vote.
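To make the annotation effort of the exhaustive procedure concrete, the following back-of-the-envelope calculation may help. It assumes, purely for illustration, that $n$ equals the 2,243 assertions that remain in the dataset collected in this chapter and that every instance is annotated towards every other instance exactly once per annotator.

$$n \cdot (n - 1) = 2{,}243 \cdot 2{,}242 = 5{,}028{,}806$$

Under these assumptions, a single annotator would have to make roughly five million stance annotations.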

To simplify the annotation, we will restrict ourselves to textual utterances that are explicit, relevant, and that do not contain multiple positions. We will refer to such utterances as assertions. In Figure 5.1, we show an example of the intended data structure for the issue wind energy.

Figure 5.1

Example of the data structure that corresponds to the formalization Stance on Nuanced Targets.

We will now first describe how we collected a dataset that consists of over 2,000 assertions on sixteen different controversial issues, which are annotated based on whether people agree or disagree with them, and with judgments indicating how strongly people support or oppose the assertions. Subsequently, in Section 5.2, we will show how the data can be used to obtain an understanding of nuanced stance towards the issues. Finally, in Section 5.3, we will formalize and explore two novel prediction tasks for which the collected data serves as gold standard.

Quantifying Qualitative Data

In order to examine the reactions towards texts, we need data that, as in a social media setting, consists of a set of texts that is judged by a large number of people. However, we do not use social media data directly, to avoid several experimental problems: First, people’s stance on social media posts may constitute sensitive information. Hence, without informed consent, scraping such data may raise legal and ethical issues. Furthermore, it is difficult to control for variables that moderate people’s judgment behavior in real-life social media settings. These moderator variables include the influence of previous posts and the question of whether someone does not judge an assertion because she does not want to judge it or because she did not perceive it. In addition, people may refuse to make judgments because of privacy concerns.

Hence, we collect data that mimics the situation in social media (i.e. a large number of posts that is judged by a large number of people), but which isolates the judgments from external influences and thus does not have the above mentioned ethical and experimental problems. To collect such data, we suggest a novel collection process which we call quantifying qualitative data. The proposed process consists of two phases: (i) a qualitative phase in which we collect a large number of nuanced assertions relevant to controversial issues and (ii) a quantitative phase in which we let a large number of people judge the collected assertions. We visualize the process of quantifying qualitative data in Figure 5.2.

Figure 5.2

Data creation by quantifying qualitative assertions. The agreement matrix has value $1$ for agreement and $-1$ for disagreement. The support/oppose matrix contains integer values indicating how many times the participants have selected an assertion as the one they most support (respectively least oppose) and most oppose (respectively least support).

Crowdsourcing

We conduct the data collection using crowdsourcing (Howe, 2006, 2008) on the platform crowdflower.com. The term crowdsourcing is a compound consisting of the words crowd (intelligence) and (out-)sourcing (Howe, 2006). Hence, the idea of crowdsourcing is to outsource tasks to a crowd of non-expert, usually web-based workers who are often paid for their efforts (Fort et al., 2011).

In a crowdsourcing setup, complex or voluminous tasks (such as the collection of a dataset) are usually broken down into a series of micro-tasks (e.g. annotating a single post with stance towards one target). Subsequently, these micro-tasks are offered to the workers via dedicated marketplaces such as Amazon’s Mechanical Turk (mturk.com) or crowdflower.com (now figureeight.com). As a provider of a task, one offers a certain payment and the workers then decide for themselves how many micro-tasks they want to perform. This simple access, the high availability of workers and the comparatively low cost made crowdsourcing one of the major tools for creating NLP datasets (Fort et al., 2011).

There are also a number of concerns about using crowdsourcing for collecting datasets: Since the workers are motivated by monetary interests, the quality of the data may be insufficient. Particularly problematic are malicious workers who are only interested in the rapid processing of tasks and therefore provide nonsensical answers (Eickhoff and de Vries, 2011). However, it has been shown that the quality of crowdsourced annotations can be comparable to the quality of expert annotations if appropriate quality assurance measures are taken (Snow et al., 2008; Callison-Burch, 2009; Zaidan and Callison-Burch, 2011). Examples of such quality assurance measures include filtering out unreliable workers and restricting access to the task to workers who have passed a test that examines whether they have understood the task. We will discuss the quality assurance measures that we have taken when describing the two collection phases in detail.

There are also ethical concerns about using crowdsourcing. The crowdsourcing marketplaces resemble unregulated labor markets (Fort et al., 2011). Unlike in a regular labor market, there are no workers’ rights and workers are not protected in any way. That means that the workers have no job security, no guaranteed income, no right to form unions, and no possibility of taking action against unjust treatment (Fort et al., 2011). This lack of basic workers’ rights is particularly alarming, as there are workers for whom crowdsourcing is the main source of income. Hence, they may be forced to accept low payment and mistreatment in order to earn any income.

Our data collection was approved by the National Research Council Canada’s Institutional Review Board, which reviewed the proposed methods to ensure that they were ethical. Now, we will describe the qualitative and quantitative phase in more detail. For both phases, we show the exact questionnaires in Appendix A.3.

Qualitative Data

In the qualitative phase (cf. qualitative data in Figure 5.2), we asked crowd-workers to generate a large number of assertions relevant to sixteen different controversial issues. For generating the assertions, we provided the participants with the name and a brief description of an issue. The sixteen controversial issues that we explore are shown in Table 5.1. These prerequisites are visualized under the item setup in Figure 5.2.

Specifically, we asked each participant to come up with five assertions relevant to the issue. To guide the process of creating assertions, the participants were asked to follow a set of directions: First, participants had to formulate assertions in a way that a third person can agree or disagree with them. Second, the assertions had to be self-contained and understandable without additional context. To that end, we did not allow the use of co-reference or personal references (e.g. my wife thinks that...). In addition, we did not permit the use of hedged or vague statements that include words such as maybe, perhaps, or possibly.

As a quality control measure, participants were required to answer a set of questions which tested whether the task and the set of directions were understood. For instance, for the issue US immigration, we showed the statement This issue is only about illegal immigration and they had to respond with true (which is incorrect, as we specified that the issue is about both legal and illegal forms of immigration) or false (which is the correct answer). Next, we showed them a set of example assertions and they had to mark all assertions that are in accordance with our directions. For example, for the issue US immigration, we showed the assertions It is desirable (invalid response as it contains a co-reference) and multiculturalism is desirable (valid response).

We discarded all responses from participants who incorrectly answered more than 10% of these questions. In addition, we removed instances that were duplicates and instances which did not adhere to our directions. After cleaning the data, 2,243 assertions from 69 participants remained in the dataset. Table 5.1 lists the number of remaining valid assertions for each issue.

Quantitative Data

Once the list of assertions was compiled, we again used crowdsourcing to obtain a large number of judgments on (i) whether people agree with the assertion, and (ii) how strongly they support or oppose the assertion (cf. quantitative data in Figure 5.2).

As in real-life settings, participants may not be inclined to judge all assertions. However, if a large enough number of judgments are obtained from many participants, then meaningful inferences can be drawn. Thus, individual participants were free to judge as many assertions as they wished.

Agreement Judgments

To obtain agreement judgments on the assertions, we simply asked the subjects to indicate whether they personally agree or disagree with each collected assertion. We decided against a third neutral option, as we wanted the participants to take a clear stance on every assertion and the collection process was designed to make sure that the assertions are all relevant to the issues.

Since we ask people for their personal assessment, it is difficult to assess the quality of the annotations, as we have no way of verifying their veracity. To nevertheless identify people who perform the task maliciously, we duplicate a substantial number of assertions and exclude participants who do not respond consistently to them. Specifically, 25% of the set is redundant and we discard participants who deviate in more than 5% of their judgments towards the redundant assertions. The average number of resulting agreement judgments per assertion was 45. In Table 5.1, we show how many judgments we collected for each of the sixteen issues.
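The consistency check on redundant assertions could look roughly like the sketch below. How deviation is measured is an assumption here: we take the share of duplicated assertions that a participant judged differently on their two presentations. The function names and the toy data are hypothetical.

```python
def inconsistency_rate(first_pass, second_pass):
    """Share of duplicated assertions judged differently on the two presentations."""
    shared = [a for a in first_pass if a in second_pass]
    if not shared:
        return 0.0
    deviating = sum(1 for a in shared if first_pass[a] != second_pass[a])
    return deviating / len(shared)

def consistent_participants(duplicates, max_rate=0.05):
    """Keep participants whose inconsistency rate does not exceed max_rate."""
    return [
        participant
        for participant, (first, second) in duplicates.items()
        if inconsistency_rate(first, second) <= max_rate
    ]

# Hypothetical data: 20 duplicated assertions, judged 1 (agree) or -1 (disagree).
first = {f"a{i}": 1 for i in range(20)}
flipped = dict(first)
flipped["a0"] = -1
flipped["a1"] = -1  # two of twenty judgments deviate, i.e. 10%

duplicates = {"p1": (first, dict(first)), "p2": (first, flipped)}
kept = consistent_participants(duplicates)
print(kept)
```

In this toy example, the second participant deviates on 10% of the duplicated assertions and is therefore excluded under the 5% threshold.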

As shown in Figure 5.2, we created the agreement matrix $AM$ from the collected agreement judgments. The agreement matrix contains one column per assertion and one row per participant. Hence, each cell $ad_{p,a}$ in this matrix corresponds to the judgment provided by participant $p$ for assertion $a$. Consequently, $\vec{ad_a}$ is the vector of all judgments provided for assertion $a$, and $\vec{ad_p}$ is the vector of all judgments provided by participant $p$. We encode agreement with $1$, disagreement with $-1$, and missing values with $0$.
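A minimal sketch of this encoding, using plain nested lists, is given below. The function name `build_agreement_matrix`, the input format (triples of participant, assertion, and a boolean), and the toy identifiers are assumptions made for illustration.

```python
def build_agreement_matrix(judgments, participants, assertions):
    """Rows are participants, columns are assertions; 1 = agree, -1 = disagree, 0 = missing."""
    row = {p: i for i, p in enumerate(participants)}
    column = {a: j for j, a in enumerate(assertions)}
    matrix = [[0] * len(assertions) for _ in participants]
    for participant, assertion, agrees in judgments:
        matrix[row[participant]][column[assertion]] = 1 if agrees else -1
    return matrix

# Hypothetical judgments: (participant, assertion, agrees?)
judgments = [("p1", "a1", True), ("p1", "a2", False), ("p2", "a2", True)]
AM = build_agreement_matrix(judgments, ["p1", "p2"], ["a1", "a2", "a3"])

participant_vector = AM[0]               # all judgments provided by p1
assertion_vector = [r[1] for r in AM]    # all judgments provided for a2
print(AM, participant_vector, assertion_vector)
```

Rows of the matrix correspond to the participant vectors and columns to the assertion vectors described above.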

Inferring Support or Opposition From Comparative Annotations

Whether participants agree or disagree with an assertion does not necessarily mean that the assertion is important for their overall stance on the issue. Thus, besides having the participants indicate agreement, we asked them how strongly they support or oppose the assertions. An obvious way to capture the strength of support and opposition would be to ask people to indicate this strength on a rating scale (e.g. on a scale from -5 to +5 where -5 indicates strongest opposition and where +5 indicates strongest support). However, there are several limitations inherent to rating scales (Baumgartner and Steenkamp, 2001; Kiritchenko and Mohammad, 2017): First, different persons might have inconsistent internal scales. For instance, two persons who equally strongly oppose an assertion may assign different scores. In addition, annotators may also inconsistently assign scores to the same assertion if asked on different occasions. Furthermore, annotators may have a tendency towards certain regions of the scale (e.g. they may have the tendency to assign scores in the middle of the possible range).

One way to obtain more consistent and comparable scores indicating the degree of support or opposition is to ask participants to perform paired comparisons (Thurstone, 1927; David, 1963). To perform paired comparisons, we would have to present the participants with pairs of assertions and then ask them which of the two assertions they support or oppose more. By concatenating the comparisons, we can obtain a ranking of assertions. For example, if we have the assertions $A_1$ , $A_2$ , and $A_3$ and the comparisons $A_1$ $>$ $A_2$ and $A_2$ $>$ $A_3$ , we can derive the ranking $A_1$ $>$ $A_2$ $>$ $A_3$ . Based on the Law of Comparative Judgment (Thurstone, 1927) the ranking can be transformed into an interval scale with real-valued scores indicating the degree to which each assertion is supported or opposed.

The Law of Comparative Judgment states that one can derive a scale from a ranking by quantifying the distance between all adjacent items in the ranking (i.e. distance between $A_1$ and $A_2$ , and $A_2$ and $A_3$ in the example). Therefore, for calculating the distance between the items $A_1$ and $A_2$ , we empirically estimate the distance between $A_1$ and $A_2$ from the ratio between the number of times $A_1$ was preferred over $A_2$ compared to the overall number of comparisons between $A_1$ and $A_2$ :

$$P(A_1 > A_2)=\frac{c(A_1>A_2)}{c(A_1>A_2)+c(A_2>A_1)}$$

with $c(A_1>A_2)$ being the number of times $A_1$ was preferred over $A_2$ and $c(A_2>A_1)$ the number of times $A_2$ was preferred over $A_1$. Next, the Law of Comparative Judgment assumes that frequencies of preferences for items follow a normal distribution (Thurstone, 1927). Hence, according to Tsukida and Gupta (2011), the proportion of times $A_1$ was preferred over $A_2$ can be scaled using:

$$dist(A_1,A_2)=\sigma_{A_1,A_2} \cdot \Phi^{-1}\left(P(A_1 > A_2)\right)$$

where $\sigma_{A_1,A_2}$ refers to the standard deviation of $P(A_1 > A_2)$ and $\Phi^{-1}$ is the inverse of $\Phi(x)$, the cumulative distribution function of the standard normal distribution:

$$\Phi(x)= \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{x} e^{-t^2 / 2} dt$$
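The scaling step can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: we read the distance as the standard deviation multiplied by the inverse of $\Phi$ applied to the empirical preference proportion, and we set the standard deviation to 1 by default. The function names and the counts are hypothetical.

```python
from statistics import NormalDist

def preference_probability(wins_1, wins_2):
    """Empirical proportion of comparisons in which item 1 was preferred over item 2."""
    return wins_1 / (wins_1 + wins_2)

def thurstone_distance(wins_1, wins_2, sigma=1.0):
    """Scale distance sigma * Phi^{-1}(P); sigma = 1.0 is an assumption of this sketch."""
    p = preference_probability(wins_1, wins_2)
    # inv_cdf is only defined for 0 < p < 1, so unanimous preferences
    # would need smoothing before they can be scaled.
    return sigma * NormalDist().inv_cdf(p)

# Hypothetical counts: A_1 preferred 30 times, A_2 preferred 10 times.
distance = thurstone_distance(30, 10)
print(round(distance, 4))
```

A proportion of one half maps to a distance of zero, and the distance grows as one item is preferred more consistently over the other.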

However, comparative procedures also have the disadvantage of requiring a large number of comparisons. Specifically, if we want to obtain a complete ranking for $n$ assertions, we would need to collect on the order of $n^2$ comparisons (one for each of the $n(n-1)/2$ pairs). This number is even higher if every pair should be annotated by several annotators. Hence, to obtain fine-grained support or oppose scores from multiple participants, we used a technique called Best–Worst Scaling (BWS) (Louviere, 1991; Louviere et al., 2015). BWS has been successfully applied to collect preferences in several domains, including health and social care (Flynn et al., 2007; Potoglou et al., 2011), or marketing (Cohen, 2009; Louviere et al., 2013). BWS has also been applied to NLP problems. For instance, it has been used to weight word senses (Jurgens, 2013) or to collect sentiment annotations (Kiritchenko and Mohammad, 2016, 2017).

Like paired comparisons, BWS is a comparative annotation scheme and thus addresses the limitations of rating scales, such as inter- and intra-annotator inconsistencies. However, in contrast to paired comparisons, BWS is more efficient and requires only about $2n$ annotations to derive a ranking of all $n$ items. In BWS, annotators are presented with a small tuple of items (usually between four and six) and are then asked to indicate which item is the best (highest in terms of the property of interest) and which is the worst (lowest in terms of the property of interest). Hence, to collect support-oppose scores, we presented four assertions at a time and asked the participants to indicate:

which of the assertions they support the most (or oppose the least)
which of the assertions they oppose the most (or support the least)

If we use 4-tuples, best–worst annotations are particularly efficient, as each best–worst annotation reveals the order of five of the six item pairs. For example, if we have a best–worst annotation for the assertions $A_1$, $A_2$, $A_3$, and $A_4$, and if $A_1$ is selected as the best and $A_4$ is selected as the worst, we can derive $A_1 > A_2$, $A_1 > A_3$, $A_1 > A_4$, $A_2 > A_4$, and $A_3 > A_4$. A number of methods have been proposed to transform best–worst annotations to real-valued scores (Louviere et al., 2015). The simplest is the counting procedure proposed by Orme (2009), in which one subtracts the number of times an item was selected as worst from the number of times it was selected as best. Louviere et al. (2015) show that this comparably simple procedure approximates other, more complicated procedures well.
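Both ideas, the pairs implied by a single best–worst annotation and a counting-based score, can be sketched as follows. The function names are hypothetical, and dividing the count difference by the number of times an item was shown is an assumption of this sketch (it keeps scores comparable when items appear in different numbers of tuples).

```python
from collections import Counter

def implied_pairs(items, best, worst):
    """Return the ordered pairs (preferred, less preferred) revealed by one annotation."""
    pairs = [(best, other) for other in items if other != best]
    pairs += [(other, worst) for other in items if other not in (best, worst)]
    return pairs

def counting_scores(annotations):
    """Score = (times best - times worst) / times shown, for every item."""
    best_counts, worst_counts, shown = Counter(), Counter(), Counter()
    for items, best, worst in annotations:
        shown.update(items)
        best_counts[best] += 1
        worst_counts[worst] += 1
    return {
        item: (best_counts[item] - worst_counts[item]) / shown[item]
        for item in shown
    }

items = ["A1", "A2", "A3", "A4"]
pairs = implied_pairs(items, best="A1", worst="A4")

# Two hypothetical annotations of the same 4-tuple.
scores = counting_scores([(items, "A1", "A4"), (items, "A1", "A3")])
print(pairs)
print(scores)
```

The only pair of the 4-tuple that remains undetermined is the one between the two items that were selected neither as best nor as worst.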

To collect best–worst annotations, we generated 4,486 4-tuples of assertions from our list of 2,243 assertions using the code provided by Kiritchenko and Mohammad (2016). We obtained judgments from fifteen people for every 4-tuple. From the comparative judgments, we created the support–oppose matrix $SM$ (cf. Figure 5.2), which consists of one row per participant and one column per assertion. In each cell, we store a tuple that indicates how often participant $p$ has selected assertion $a$ as the one they support the most (or oppose the least), and how often participant $p$ has selected assertion $a$ as the one they oppose the most (or support the least).
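The cell structure of the support–oppose matrix can be sketched with nested lists of tuples. The function name `build_support_oppose_matrix`, the input format, and the toy identifiers are assumptions made for illustration; this is not the code used to create the dataset.

```python
def build_support_oppose_matrix(judgments, participants, assertions):
    """Rows are participants, columns are assertions.

    Each cell holds (times selected as most supported, times selected as most opposed).
    """
    row = {p: i for i, p in enumerate(participants)}
    column = {a: j for j, a in enumerate(assertions)}
    matrix = [[(0, 0) for _ in assertions] for _ in participants]
    for participant, most_supported, most_opposed in judgments:
        i = row[participant]
        supported, opposed = matrix[i][column[most_supported]]
        matrix[i][column[most_supported]] = (supported + 1, opposed)
        supported, opposed = matrix[i][column[most_opposed]]
        matrix[i][column[most_opposed]] = (supported, opposed + 1)
    return matrix

# Hypothetical best-worst judgments: (participant, most supported, most opposed).
judgments = [("p1", "a1", "a4"), ("p1", "a1", "a3"), ("p2", "a2", "a1")]
SM = build_support_oppose_matrix(judgments, ["p1", "p2"], ["a1", "a2", "a3", "a4"])
print(SM)
```

Summing the first and second tuple entries over a column yields the counts that the counting procedure for best–worst annotations operates on.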

Analysis

We now turn to analyzing the data. Specifically, we propose various metrics that can be used to gain insights into various aspects of the data. We visualize this process of consolidating the raw scores into interpretable scores under the item annotated data in Figure 5.2.

Ranking Assertions

For decision-makers, it may be important to know which aspects of a controversial issue are particularly controversial, or which aspects are consistently judged by a large number of people. Thus, ranking the assertions by amount of agreement and strength of support or opposition is particularly useful, as it allows us to quickly inspect the assertions (i) with which most participants agree or disagree, or (ii) which most participants support or oppose.

Agreement Score

To be able to rank the assertions according to agreement, we calculate an agreement score ( $ags$ ) of an assertion $a$ by simply subtracting the percentage of times the assertion was disagreed with from the percentage of times the assertion was agreed with. That means the $ags$ for an assertion $a$ is calculated by:

ags(a)= \%\textrm{ agree}(a)-\%\textrm{ disagree}(a)

The agreement score ranges from LEAST AGREEMENT ($-1$) to MOST AGREEMENT ($1$). A score of 0 indicates that an equal number of participants agree and disagree with the assertion, which means that the assertion is highly controversial. We can identify the most controversial assertions by sorting assertions by the absolute value of the agreement scores and selecting those assertions which have the lowest absolute scores.
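The agreement score and the two rankings derived from it can be sketched as follows. The encoding of judgments as $+1$ (agree) and $-1$ (disagree) and the toy data are assumptions made for illustration.

```python
def agreement_score(judgments):
    """ags(a): share of agree judgments minus share of disagree judgments.

    `judgments` holds +1 (agree) and -1 (disagree) entries for one assertion.
    """
    n = len(judgments)
    agree = sum(1 for j in judgments if j == 1)
    disagree = sum(1 for j in judgments if j == -1)
    return (agree - disagree) / n


# Hypothetical judgments of four participants on three assertions
votes = {
    "a1": [1, 1, 1, 1],
    "a2": [1, -1, 1, -1],
    "a3": [-1, -1, -1, 1],
}
scores = {a: agreement_score(v) for a, v in votes.items()}

# Ranking from most to least agreement
by_agreement = sorted(scores, key=scores.get, reverse=True)
# Most controversial first: lowest absolute agreement score
by_controversy = sorted(scores, key=lambda a: abs(scores[a]))

print(scores)
print(by_agreement, by_controversy)
```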

By inspecting the ranking of all assertions of an issue, we can gain an understanding of the corresponding debate. For instance, for the issue legalization of same-sex marriage, the assertion Love is a right for everyone has the highest agreement score, Saying that gay people should get married is like saying that a brother can marry his sister both are at higher risk of disease has the lowest agreement score, and Allowing same-sex marriage will create a slippery slope where people will begin to fight for other alternative marriages such as polygamy is the most controversial assertion.

In Figure 5.3(a) we show a histogram of the calculated agreement scores across all issues. We notice that the mass of the distribution is concentrated in the positive range of possible values. This indicates that the participants tend to agree with the assertions more often than they disagree. We observe similar distributions when looking at the distributions of single issues.

One possible explanation for the distribution is that both the qualitative and the quantitative phase were conducted using crowdflower.com and that therefore both groups of participants are similar. It is conceivable that people are more likely to generate assertions with which they or their peers agree. Another possible explanation is that there is already a certain level of consensus on many assertions in the general population. It is also conceivable that people often do not disagree with the individual assertions that their opponents make, but rather disagree on the relative importance of an assertion among the various other assertions.

Support–Oppose Score

We transform the comparative BWS judgments for support and opposition into real-valued scores by using the counting procedure proposed by Orme (2009). Thus, for an assertion $a$ we calculate a support–oppose score ( $sos$ ) by subtracting the percentage of times an assertion was chosen as the most opposed from the percentage of times the assertion was chosen as the most supported:

sos(a)= \%\textrm{ most support}(a)-\%\textrm{ most opposed}(a)

This support–oppose score ranges from MOST STRONGLY SUPPORTED (1) to MOST STRONGLY OPPOSED (−1). We show a histogram of the support–oppose scores across all issues in Figure 5.3(b). We observe that the support–oppose scores are approximately normally distributed. This finding is consistent with the Law of Comparative Judgments (Thurstone, 1927), which states that preferences within choice experiments follow a normal distribution.
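The counting procedure behind the support–oppose score can be sketched as follows. We assume here that both percentages are computed relative to the number of times an assertion was shown in a 4-tuple; the counts are hypothetical.

```python
def support_oppose_score(n_most_supported, n_most_opposed, n_shown):
    """sos(a): share of 'most supported' picks minus share of 'most opposed' picks.

    Assumption: both shares are relative to how often the assertion was shown.
    """
    return n_most_supported / n_shown - n_most_opposed / n_shown


# Hypothetical counts: (times most supported, times most opposed, times shown)
counts = {"a1": (6, 0, 8), "a2": (2, 2, 8), "a3": (0, 8, 8)}
sos = {a: support_oppose_score(*c) for a, c in counts.items()}
print(sos)
```

Note that the assertion `a2` in this toy example obtains a score of zero although it was selected in both roles, which is the ambiguity that the separate support and oppose scores address.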

As we ask participants to indicate which assertion they oppose the most or support the least, there are two possible interpretations for a selection. However, as we also ask the same participants whether they agree or disagree with the assertions, we can infer which of the two interpretations applies. That means that we infer that the response means MOST OPPOSE if the participant disagrees with the assertion and that the response means LEAST SUPPORT if the participant agrees with the assertion.

Thus, in addition to the support–oppose score, we compute a support score which ranges from LEAST SUPPORTED (0) to MOST SUPPORTED (1) and an oppose score that ranges from LEAST OPPOSED (0) to MOST OPPOSED (1). To calculate these scores, we reuse the formula shown in equation 5.5, but calculate the percentages only on the set of people that agreed with the assertion (support score) or disagreed with the assertion (oppose score).

These scores can be used to identify assertions for which a support–oppose score of about zero results from the assertion being both strongly supported and strongly opposed. For instance, for the assertion Freedom of the press prevents the government or other third parties from controlling the media, we find a support–oppose score of $.0$, but both a high support score ($.71$) and a high oppose score ($1.0$).

The support score and the oppose score also help to identify assertions that have a low support–oppose score and that are rarely strongly supported or opposed. For example, the assertion Women’s rights have well found legal basis. has a low support–oppose score of $−.07$ and both a low oppose score ( $0$ ) and support score ( $.39$ ).

Figure 5.3

Distribution of agreement and support–oppose scores across all issues. We group the agreement and support–oppose scores into bins of size .05. For both the agreement and support–oppose scores, the colors encode how positive (green) or negative (red) the scores are.

Ranking Controversial Issues

There are situations where it may be useful to assess and compare several different issues. For instance, governments and policy makers may need to prioritize issues that are particularly polarizing.

As an indicator of how polarizing an issue is, we use the extent to which an issue’s assertions evoke consistent or inconsistent responses. If all the assertions for an issue have an agreement score of zero, the same number of participants have a $\oplus$ and a $\ominus$ stance towards the assertions, and we assume that the issue is maximally polarizing. If participants consistently have a $\oplus$ or $\ominus$ stance on all assertions, we assume that an issue is not polarizing at all.

Based on this assumption, we calculate the polarization score ( $ps$ ) of an issue $I$ by first computing the average of the absolute value of the agreement score for each of the assertions, and then subtracting this value from one:

ps(I)=1 - \frac{1}{\vert I \vert}\sum_{a \in I }^{}\vert ags(a)\vert

Consequently, this polarization score can be used to rank issues from NOT POLARIZING AT ALL ($ps = 0$) to MOST POLARIZING ($ps = 1$). A polarization score of $.5$ describes an issue in which there is an equal amount of more and less polarizing assertions.
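The polarization score follows directly from the agreement scores of an issue's assertions. The following minimal sketch uses hypothetical issues and agreement scores to illustrate the two extremes and a middle case.

```python
def polarization_score(agreement_scores):
    """ps(I) = 1 - mean(|ags(a)|) over all assertions a of issue I."""
    return 1 - sum(abs(s) for s in agreement_scores) / len(agreement_scores)


# Hypothetical agreement scores per issue
issues = {
    "consensual issue": [1.0, -1.0, 1.0],
    "mixed issue": [0.5, -0.5, 0.0, 1.0],
    "polarizing issue": [0.0, 0.0, 0.0],
}
ps = {issue: polarization_score(scores) for issue, scores in issues.items()}

# Rank issues from most to least polarizing
ranking = sorted(ps, key=ps.get, reverse=True)
print(ps)
print(ranking)
```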

We show the polarization scores of the sixteen issues in Figure 5.4. We observe that many of the polarization scores are around .5. For the issues climate change, gender equality, media bias, mandatory vaccination, and Obamacare the scores are even below .5, which means that on average there is more consensus than dissent in judging the assertions on these issues. However, there are also the issues same-sex marriage (.66), Marijuana (.69) and vegetarianism and veganism (.73) that have a score of above .5 and thus are more polarizing.

Figure 5.4

Issues ranked according to their polarization score.

Similarity of Participants

The collected data can be used to determine if there are groups of people who have similar stances towards the full set of assertions. Identifying groups of people with similar stances allows a decision-maker, for instance, to involve such a group in the process of finding a solution or to compromise with large groups.

If we cluster the participants by how similarly they judge the assertions, several scenarios are conceivable: First, clustering could result in a binary split into persons that have a $\oplus$ stance and persons that have a $\ominus$ stance towards the overall issue. For instance, for the issue Marijuana there could be one cluster that contains people who favor the legalization of Marijuana and a second cluster that includes people that are against legalization. Second, we could find several clusters that correspond to persons with more specific stances. For instance, there could be two clusters of people that favor or oppose legalizing Marijuana for medical purposes and two additional clusters that correspond to persons that agree or disagree with the idea that Marijuana is a gateway drug. A third possibility is that there are no clear clusters at all, as all people agree with each other to a similar extent. The third possibility applies to issues for which there is a mainstream position with which most people agree, or to issues on which the stances of people are so diverse that there is little agreement between individual persons.

To find clusters of similar judgment behavior, we need to quantify the similarity between the judgments of persons. We determine this similarity by calculating the cosine similarity of the vectors that represent the rows in the agreement matrix $AM$ (cf. Figure 5.2):

cos(p1,p2) =\frac{\vec{ad_{p1}} \cdot \vec{ad_{p2}}}{\vert \vec{ad_{p1}} \vert \cdot \vert \vec{ad_{p2}} \vert}

We calculate this person–person similarity between all pairs of persons of each issue. To examine the distribution of similar participants, we create an undirected graph in which the nodes represent participants and edges are drawn if the similarity exceeds a certain threshold. Clusters of similar individuals would form densely connected cliques in such a graph. The aim of this experiment is not to actually determine the clusters that best fit the data, but rather to visually explore whether there are meaningful patterns in the dataset.
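The construction of the thresholded similarity graph can be sketched as follows. The toy agreement matrix is hypothetical, and we count connected components as a simple proxy for the visual inspection of clusters; this is an assumption of the sketch, not the analysis procedure itself.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similarity_graph(rows, threshold):
    """Undirected graph with an edge wherever the similarity exceeds the threshold."""
    graph = {p: set() for p in rows}
    for p, q in combinations(rows, 2):
        if cosine(rows[p], rows[q]) > threshold:
            graph[p].add(q)
            graph[q].add(p)
    return graph


def components(graph):
    """Connected components of the graph, found by depth-first search."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(graph[node] - comp)
        seen |= comp
        result.append(comp)
    return result


# Hypothetical rows of AM: +1 agree, -1 disagree, 0 not judged
AM = {
    "p1": [1, 1, -1, 1],
    "p2": [1, 1, -1, -1],
    "p3": [1, 1, 1, 1],
    "p4": [-1, -1, 1, -1],
}

for threshold in (-0.75, 0.25, 0.6):
    print(threshold, components(similarity_graph(AM, threshold)))
```

In this toy example, a low threshold connects all participants, a medium threshold leaves one group and one disconnected outlier, and a high threshold isolates every participant.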

Whether one can observe clusters in a dataset depends on the relative differences between the similarity scores. Depending on how we set the threshold, we may find a large number of clusters that each include a small number of highly similar persons, or a lower number of clusters that include a large number of less similar persons. Thus, we compare graphs with different thresholds. For all issues, we observe that at a low threshold almost all participants are connected. That means that all subjects have a certain similarity to each other.

If we increase the threshold, we do not observe the formation of several clusters, but of a single cluster and an increasing number of single disconnected outliers. This indicates that the majority of persons are similar to each other. We conclude that most of our participants share a mainstream position on the issues.

We visualize this experiment for the issue black lives matter in Figure 5.5. We manually inspect the judgments of disconnected persons and observe that these indeed tend to have rather radical positions. For example, for the issue black lives matter some of the outliers disagree with the assertion everyone is equal.

Figure 5.5

Similarity of participants visualized in an undirected graph for the issue black lives matter. In the subfigures, we draw edges between two persons if their voting similarity is above a certain threshold.

Judgment Similarity

The collected data also allows us to examine how similarly the assertions are judged. We refer to the degree to which two assertions are judged similarly by a large number of people as judgment similarity. An analysis of the judgment similarity can help to identify inter-related assertions. We determine the judgment similarity by computing the cosine of the column vectors of the agreement matrix $AM$ (see Figure 5.2):

cos(a1,a2) =\frac{\vec{ad_{a1}} \cdot \vec{ad_{a2}}}{\vert \vec{ad_{a1}} \vert \cdot \vert \vec{ad_{a2}} \vert}

Hence, the judgment similarity of two assertions is $1$ if two assertions are consistently judged the same. A judgment similarity of $-1$ indicates that the assertions are consistently rated the opposite way (e.g. all subjects agree with one assertion and disagree with the other). The judgment similarity becomes zero if the corresponding vectors are orthogonal (e.g. if the number of subjects who agree with both assertions is equal to the number of subjects who agree with the first assertion and disagree with the second).

To obtain a comprehensive picture, we calculate the judgment similarity between all unique pairs of assertions of each issue. That means we do not consider both $a_1$ with $a_2$ and $a_2$ with $a_1$ , and also do not consider self-pairing. Hence, for an issue with $n$ assertions, we calculate the judgment similarity of $\frac{n \cdot (n-1)}{2}$ pairs of assertions.
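Computing the judgment similarity for all unique pairs of assertions can be sketched as follows. The toy agreement matrix and the function name `judgment_similarities` are hypothetical; entries are assumed to be $+1$ (agree), $-1$ (disagree), or $0$ (not judged).

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def judgment_similarities(am_rows, assertion_ids):
    """Cosine of the column vectors of AM for all unique assertion pairs."""
    columns = list(zip(*am_rows))  # transpose: one vector per assertion
    sims = {}
    for i, j in combinations(range(len(assertion_ids)), 2):
        sims[(assertion_ids[i], assertion_ids[j])] = cosine(columns[i], columns[j])
    return sims


# Hypothetical agreement matrix: one row per participant, one column per assertion
AM_rows = [
    [1, 1, -1],
    [1, 1, -1],
    [-1, -1, 1],
    [1, -1, 0],
]
ids = ["a1", "a2", "a3"]
sims = judgment_similarities(AM_rows, ids)

n = len(ids)
print(len(sims), "pairs; expected", n * (n - 1) // 2)
print(sims)
```

Using `itertools.combinations` yields each unordered pair once and never pairs an assertion with itself, which matches the count of $\frac{n \cdot (n-1)}{2}$ pairs.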

We hypothesize that there is a relationship between an issue’s polarization score and the distribution of the judgment similarity scores. Specifically, we hypothesize that we should observe more assertions with high judgment similarity for a less polarizing issue than for a more polarizing one. Hence, in Figure 5.6, we visualize the distribution of similarity scores for the issue with the lowest (climate change) and the highest (vegetarianism and veganism) polarization score, as well as black lives matter from the middle value range.

All three distributions resemble normal distributions. However, we observe that the less polarizing an issue is, the more the distribution is shifted to the positive range of possible values. Hence, the polarization score is indeed related to the average judgment similarity of an issue’s assertions.

For all three distributions, we observe that there is an area around zero in which there are no values. This behavior results from the fact that the cosine similarity can only take discrete values if we work on vectors that only have integer entries. The cosine similarity of two vectors $\vec{v_1}$ and $\vec{v_2}$ is formed by dividing their dot product ($\vec{v_1} \cdot \vec{v_2}$) by the product of the lengths of the vectors ($|\vec{v_1}|\cdot|\vec{v_2}|$).

If $\vec{v_1}$ and $\vec{v_2}$ share no dimension in which both have non-zero entries, their dot product becomes zero, which results in a cosine similarity of zero. If the vectors overlap in exactly one dimension (and, as in our case, the entries are either $+1$ or $-1$), we obtain a similarity of $\frac{1}{|\vec{v_1}|\cdot|\vec{v_2}|}$ or $\frac{-1}{|\vec{v_1}|\cdot|\vec{v_2}|}$. For vectors of these lengths, non-zero similarity scores with a smaller absolute value are not possible, as for vectors containing only $1$, $0$, and $-1$ entries there is no dot product that is greater than 0 and smaller than 1.

For each additional dimension in which both vectors have non-zero entries, the number of possible values for the dot product increases (e.g. if the two vectors share four dimensions with non-zero entries, there are five possible values). For an even number of overlaps, there is the possibility that the vectors are orthogonal and thus have a dot product of zero. Hence, the number of zero scores in the distribution corresponds to both the number of pairs that have zero overlap and to the number of pairs that have orthogonal vectors.
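The set of attainable dot products can be enumerated directly. The sketch below assumes that two vectors overlap in $k$ dimensions with entries of $+1$ or $-1$, so that each overlapping dimension contributes either $+1$ or $-1$ to the dot product; the function name is hypothetical.

```python
from itertools import product


def possible_dot_products(k):
    """All dot products of two vectors that overlap in k dimensions with +/-1 entries.

    Each overlapping dimension contributes +1 (same judgment) or -1 (opposite).
    """
    return sorted({sum(signs) for signs in product((-1, 1), repeat=k)})


for k in range(1, 5):
    print(k, possible_dot_products(k))
```

The enumeration shows $k+1$ attainable values for $k$ overlapping dimensions, and a dot product of zero only for even $k$.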

Figure 5.6

Distribution of the judgment similarity scores for the issues vegetarianism and veganism, black lives matter, and climate change. We group the similarity scores into bins of size .01.

Next, we manually inspect pairs of assertions with a particularly high or low judgment similarity. We observe that there are a variety of reasons that can lead to high judgment similarity of two assertions: If two assertions are close paraphrases, they tend to have a high similarity score (e.g. owning a gun can deter criminals and gun ownership deters crime). In addition to assertions whose surface forms are similar, we observe high similarity scores for assertions that convey similar semantics (e.g. in Marijuana alleviates the suffering of chronically ill patients and Marijuana helps chronically ill persons). A high judgment similarity of two assertions can also be due to semantic entailment relationships between assertions (a text entails another text if the meaning of one text can be inferred from the other text; Dagan and Glickman, 2004; Dagan et al., 2009). For instance, if people agree with Marijuana is a gateway drug for teenagers and damages growing brains, most of them also agree that Marijuana is dangerous for minors, despite the texts being different in content. Furthermore, a high judgment similarity can be attributed to various underlying sociocultural, political, or personal factors. For instance, the assertions Consuming Marijuana has no impact on your success at work and Marijuana is not addictive describe different arguments for legalizing Marijuana, but judgments made on these assertions are often correlated.

We observe that if two assertions are contradictory or contrasting (e.g. it is safe to use vaccines and vaccines cause autism), they tend to have a low judgment similarity. We also notice that underlying socio-cultural and political factors may cause people to vote dissimilarly on two (sometimes seemingly unrelated) assertions (e.g. congress should immediately fund Trump’s wall and all immigrants should have the right to vote in the American elections).

Judgment Similarity and Text Similarity

To further explore the relation between judgment similarity and textual similarity, we inspect pairs of assertions that have a particularly high semantic text similarity. Semantic text similarity refers to the degree of semantic equivalence between two texts (Agirre et al., 2012). A number of automatic methods for measuring semantic text similarity have been proposed (cf. Bär et al. (2013)). These methods range from methods which rely on the textual overlap of two texts (Wise, 1996; Gusfield, 1997; Lyon et al., 2001) to methods involving the distributional similarity of words, sentences, or paragraphs (Landauer et al., 1998; Mikolov et al., 2013b).

Since the results of distributional methods often depend on several hyper-parameters (e.g. the corpus on which we compute the distributional similarity), we here rely on the Jaccard index (Lyon et al., 2001) that measures text similarity based on textual overlap. To compute the Jaccard index of the assertions $a_1$ and $a_2$ , we translate them into the sets $A_1$ , which contains all words of $a_1$ , and $A_2$ , which contains all words of $a_2$ . Next, we calculate the cardinality of the intersection of $A_1$ and $A_2$ ( $| A_1 \cap A_2 |$ ). Finally, we normalize the cardinality of the intersection by dividing it by the cardinality of the union of $A_1$ and $A_2$ :

Jaccard(A_1,A_2)= \frac{| A_1 \cap A_2 |}{| A_1 \cup A_2 |}

The Jaccard index ranges from zero (no words shared by $A_1$ and $A_2$) to one ($A_1$ and $A_2$ contain the same words).
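The Jaccard index on word sets can be sketched in a few lines. The tokenization (lowercasing and stripping punctuation) is an assumption of this sketch and may differ from the implementation used in our experiments; the helper names are hypothetical.

```python
import re


def word_set(text):
    """Lowercased word set of a text (assumed tokenization: letters, digits, ')."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a1, a2):
    """|A1 ∩ A2| / |A1 ∪ A2| for the word sets of two assertions."""
    A1, A2 = word_set(a1), word_set(a2)
    union = A1 | A2
    return len(A1 & A2) / len(union) if union else 0.0


score = jaccard(
    "Climate change is costing lives.",
    "Climate change is already costing lives.",
)
print(round(score, 2))  # 5 shared words out of 6 distinct words -> 0.83
```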

We compute the Jaccard index for all unique pairs of assertions and compare the Jaccard index with the determined judgment similarity score. We observe that those pairs of assertions that have the highest judgment similarity scores rarely have a high Jaccard index. For instance, the two pairs with the highest judgment similarity score are

Too much lives are wasted because of war on terrorism.
Terrorism should be stopped. (judgment similarity = .75; Jaccard = .08)

and

A journalist’s job is to report the truth to the public and should not accept gifts from any third party.
Journalist should not accept bribes. (judgment similarity = .66; Jaccard = .13)

Nevertheless, we find that assertions with a particularly high text similarity often have a comparably high judgment similarity. However, in these pairs, slight differences in wording seem to have a strong influence on how the assertions are judged. Examples include:

The foreign aid budget should be made more effective.
Foreign aid budget should be more effective. (judgment similarity = .41; Jaccard = .88)

as well as

Climate change is costing lives.
Climate change is already costing lives. (judgment similarity = .27; Jaccard = .83)

From our manual analysis, we conclude that judgment similarity is only distantly related to textual similarity. We hypothesize that judgment similarity is more strongly affected by the semantics and pragmatics defined by the issue. In the following section, we will show how judgment similarity can be utilized for predicting stance towards assertions.

Predicting Stance on Nuanced Assertions

We now turn to automatic approaches for predicting nuanced stance. However, in contrast to single and multi-target stance, we here do not try to predict what stance people express through a text, but what stance people have towards a text.

To be able to predict someone’s stance towards an assertion, we require knowledge about that person. Ideally, we would know what stance the person has towards a set of other assertions. If we now assume that this person has a consistent world view, we could infer the person’s stance on related assertions. For instance, if we know that someone agrees with the assertions most vegetarians are healthier than those who eat meat and a vegetarian diet provides all the essential nutrients, then most people would infer that the person probably agrees with the assertion the vegetarian diet has many health benefits. Similarly, most people would conclude that if 90% of a group of persons have a $\oplus$ stance towards discrimination of women has to end now, a similar portion should have a $\oplus$ stance towards we should fight to end any gender discrimination. We hypothesize that the stance of an individual or a group on a new assertion corresponds to the stance towards highly similar assertions.

To test if we can automate this intuition, we formulate two prediction tasks, which are illustrated in Figure 5.7: In the first task, we want to predict the stance of individuals on assertions based on the person’s stance towards other assertions. Thus, the first task is formulated as follows: given a set of assertions $a_1,...,a_n$ relevant to an issue and the stance of a person $p_i$ on $a_1, ..., a_{n-1}$, an automatic system has to predict $p_i$’s stance on the assertion $a_n$. The stance of a person towards an assertion corresponds to whether the person agrees with the assertion ($\oplus$ stance) or whether the person disagrees with the assertion ($\ominus$ stance).

In the second task, we want to predict the stance of groups on assertions based on the group’s stance towards other assertions. Hence, this task can be formalized as follows: given a set of stances of a group of persons $p_1, ..., p_k$ on the assertions $a_1, ..., a_{n-1}$, an automatic system must predict the stance towards the assertion $a_n$ for the same group of persons. The stance of a group towards an assertion can be expressed by the proportion of persons that agree with the assertion ($\oplus$ stance) and persons that disagree with the assertion ($\ominus$ stance).

Figure 5.7

Predicting stance towards nuanced assertions. We differentiate between the two tasks of (i) predicting stance of individuals and (ii) predicting stance of groups. These prediction tasks correspond to the automatic estimation of the agreement matrix $AM$ and the aggregated agreement scores (cf. Figure 5.2).

We argue that the outlined use-cases correspond to the situation in social media, which is rife with people expressing their stance towards text (e.g. likes on Facebook, or up- and down-votes on Reddit posts), especially on controversial issues. Hence, being able to predict people’s stance towards a text has several applications: People at large could automatically anticipate the stance of politicians, companies, or other decision makers towards a new perspective on a problem or a new possible solution. The method can also be used to more accurately analyze the homogeneity of opinions or to detect filter bubbles in social media. Decision makers themselves would be able to evaluate in advance how citizens, customers, or employees react to a press announcement, a new regulation, or a tweet. Social media users could be enabled to search, sort, or filter posts based on whether they are in accordance with or contrary to their personal world view. Such predictions could also be used to augment chat applications by indicating to users whether their recipients will have a $\oplus$ or $\ominus$ stance towards a message to be sent, enabling the users to choose a more or less confrontational discussion style.

For solving the outlined tasks, we propose to rely on judgment similarity. That means that we estimate the similarity between each of the assertions previously judged by the person or the group, and the assertion for which we want to make the prediction. Then we simply transfer the judgments on the most similar assertions to the assertion of interest. However, in real-life settings, large volumes of judgments of other persons – which are necessary to compute the judgment similarity – are not always easily available. For instance, we may want to predict a stance towards a completely new assertion which has not been noticed by a large audience. To overcome this limitation, we propose to use methods that only consider the texts of the assertions to mimic judgment similarity and have thus the ability to generalize from existing data collections.
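The transfer of judgments from the most similar assertions can be sketched for the individual-level task as follows. The parameter `k`, the majority vote, and the tie-breaking towards a $\oplus$ stance are assumptions made for this illustration; the similarity values are hypothetical and could stem from either the gold judgment similarity or a text-based approximation.

```python
def predict_stance(known_stances, similarity_to_target, k=1):
    """Transfer the stances on the k most similar judged assertions to the target.

    known_stances: assertion -> +1 (agree) or -1 (disagree)
    similarity_to_target: assertion -> similarity with the target assertion
    Returns +1 or -1 by majority vote; ties are resolved as +1 (an assumption).
    """
    nearest = sorted(
        known_stances, key=lambda a: similarity_to_target[a], reverse=True
    )[:k]
    vote = sum(known_stances[a] for a in nearest)
    return 1 if vote >= 0 else -1


# Hypothetical stances of one person and similarities to two target assertions
known = {"a1": 1, "a2": 1, "a3": -1}
sims_target_x = {"a1": 0.7, "a2": 0.4, "a3": -0.3}
sims_target_y = {"a1": -0.5, "a2": -0.1, "a3": 0.8}

print(predict_stance(known, sims_target_x))       # most similar: a1
print(predict_stance(known, sims_target_y))       # most similar: a3
print(predict_stance(known, sims_target_y, k=3))  # vote over all three
```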

We will now first describe how we create and evaluate models that are able to predict a judgment similarity score for pairs of assertions. Subsequently, in Section 5.3.2, we will evaluate how these judgment similarity measures can be utilized for predicting stance of individuals and groups on nuanced assertions.

Predicting Judgment Similarity

Figure 5.8

SVM (a) and SNN (b) architectures for predicting judgment similarity of assertions.

As baselines for predicting judgment similarity from text, we utilize well-established semantic text similarity measures. Specifically, we use (i) the above-described Jaccard coefficient (Lyon et al., 2001), (ii) the Greedy String Tiling algorithm (Wise, 1996), and (iii) the longest common substring measure (Gusfield, 1997). Greedy String Tiling refers to an algorithm that has been proposed for detecting suspected (text) plagiarism, even if parts have been reordered. We use the implementations of these algorithms provided by the framework DKPro Similarity (Bär et al., 2013; version 2.2.0). In addition, we use the cosine between the averaged FastText word vectors (Bojanowski et al., 2017) of both texts as a similarity measure.

Beyond the baselines, we again compare a more traditional, feature engineering-based approach with a neural network approach. Hence, we implement an SVM-based regressor and a Siamese Neural Network that both rely on two texts as input to predict a judgment similarity score. We visualize both machine-learning architectures in Figure 5.8.

We implement the SVM regressor using LibSVM (Chang and Lin, 2011) as provided by DKProTC (Daxenberger et al., 2014; version 1.0). To represent the instances, we use word ngram features, sentiment features, word vector features, and negation features (cf. Figure 5.8(a)). For deriving sentiment features, we assign polarity scores to the assertions using the tool of Kiritchenko et al. (2014) and use the scores of both assertions as features. As word vector features, we use each dimension of the averaged FastText vectors (Bojanowski et al., 2017) of both assertions.

For the neural approach, we adapt Siamese Neural Networks (SNN), which consist of two identical branches that are trained to extract useful representations of the assertions and a final layer that merges these branches. SNNs have been successfully used in a number of tasks in which one tries to predict the strength of the relationship between two texts. For instance, SNNs have been trained to predict text similarity (Mueller and Thyagarajan, 2016; Neculoiu et al., 2016) or to match pairs of sentences (e.g. a tweet to a reply) (Hu et al., 2014).

As shown in Figure 5.8(b), in our SNN, a branch consists of a layer that translates the assertions into sequences of word embeddings, which is followed by a convolution layer with a filter size of two, a max pooling over time layer, and a dense layer (100 nodes with tanh activation). To merge the branches, we calculate the cosine similarity of the extracted vector representations. For training the network, we finally calculate the squared error between the resulting cosine similarity and the judgment similarity. We implemented the SNN using the deep learning framework deepTC (Horsmann and Zesch, 2018) in conjunction with Keras and Tensorflow (Abadi et al., 2016).

We evaluate both the text similarity and the machine learning-based approaches using a ten-fold cross validation. As an evaluation metric, we use the Pearson correlation between predicted and gold similarity.

Results

We show the correlation of all approaches, averaged over all sixteen issues, in Table 5.2. As Pearson’s $r$ is defined in a probabilistic space, it cannot be averaged directly over different issues. Thus, following Corey et al. (1998), we first transform the scores using Fisher’s Z transformation, average them, and then transform them back into the original range of values. Fisher’s Z transformation is defined as the inverse hyperbolic tangent of $r$, $\operatorname{arctanh}(r)=\frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$, and, consequently, the inverse transformation is the hyperbolic tangent tanh of $z$.
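The averaging via Fisher’s Z transformation can be sketched with the Python standard library. The function name `fisher_average` is hypothetical; the input values are the per-issue SNN coefficients reported in Table 5.3.

```python
import math


def fisher_average(correlations):
    """Average Pearson coefficients via Fisher's Z transformation.

    Each r is mapped to z = arctanh(r), the z values are averaged,
    and the mean is mapped back with tanh.
    """
    zs = [math.atanh(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))


# Per-issue SNN coefficients as listed in Table 5.3
snn = [.72, .73, .74, .70, .74, .63, .59, .61,
       .59, .46, .55, .57, .64, .51, .40, .32]

print(round(fisher_average(snn), 3))  # average in Fisher's Z space
print(round(sum(snn) / len(snn), 3))  # plain arithmetic mean, for comparison
```

For these values, the Z-based average is roughly .61, in line with Table 5.2, whereas the plain arithmetic mean is slightly lower (about .59).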

Across all issues, we observe that the semantic text similarity measures result in very low correlation coefficients between $.02$ and $.07$ . In contrast, both the SVM and the SNN obtain coefficients around $.6$ . This shows that the systems can learn useful representations that capture judgment similarity and that this representation is indeed different from semantic similarity. Since both models mostly rely on lexical information and still yield reliable performance, we suspect that the relationship between a pair of assertions and their judgment similarity also has a lexical nature.

Table 5.2

Method	$r$
SNN	.61
SVM	.58
word vectors distance	.07
Jaccard	.07
greedy string tiling	.06
longest common sub string	.05

Performance (Pearson correlation between prediction and gold averaged over all issues) of text-based approaches for approximating judgment similarity.

While the semantic text similarity measures obtain consistently low results, we find large differences for the individual issues for both the SVM and SNN. With coefficients ranging from $.37$ to $.70$, the SVM’s range is a bit smaller than the range of coefficients obtained by the SNN ($.32$ to $.74$; cf. Table 5.3). Overall, with a higher mean coefficient and more issues in which the SNN obtains better results, the SNN seems to be better suited for the task of predicting judgment similarity.

We show detailed results for each issue in Table 5.3. We find that when comparing the coefficients of SNN and SVM for the individual issues, the deviations, unlike when comparing the performance between issues, are rather small (the largest deviation is .13 for the issue gun rights). This suggests that there are issue-inherent reasons that make it easy or hard to learn judgment similarity. We observe that there is a negative correlation between the coefficients and the polarization scores of an issue (cf. Section 5.2.2). That means that the judgment similarity of assertions from less polarizing issues tends to be predicted more reliably. In Section 5.2.4 we have shown that the less polarizing an issue is, the more the distribution of judgment similarity scores is shifted to the positive range of values. Thus, we suspect that both the SVM and SNN are better suited to learn stronger judgment similarities.

Table 5.3

Issue	SVM	SNN
climate change	.70	.72
gender equality	.67	.73
mandatory vaccination	.68	.74
Obamacare	.66	.70
black lives matter	.66	.74
media bias	.63	.63
US electoral system	.63	.59
same-sex marriage	.59	.61
war on terrorism	.56	.59
foreign aid	.54	.46
US in the Middle East	.52	.55
US immigration	.52	.57
gun rights	.51	.64
creationism in school	.48	.51
vegetarian and vegan lifestyle	.43	.40
legalization of Marijuana	.37	.32

Performance (Pearson correlation between prediction and gold) of the judgment similarity prediction by the SVM and the SNN, obtained in ten-fold cross-validation.

Outlier Analysis

In order to gain deeper insights into the prediction performance, we investigate cases that deviate strongly from an ideal correlation. To this end, we examine the scatterplots that plot gold against prediction (x-axis: gold, y-axis: prediction). Figure 5.9 shows the scatterplot for the issue Climate Change for both architectures.

Figure 5.9

Comparison of gold judgment similarity and judgment similarity as predicted by the SVM and the SNN for the issue Climate Change.

For the SVM, we observe a group of pairs whose predictions are inversely related to the gold values. This means that the (gold) judgment similarity of these pairs is positive, but the SVM regression assigns a clearly negative value. We observe that these instances mainly correspond to pairs in which both assertions express a $\ominus$ stance towards climate change. For instance, the pair "there is not a real contribution of human activities in Climate change" and "climate change was made up by the government to keep people in fear" has a comparably high judgment similarity of $.20$, and both assertions express the position of climate change deniers. The SVM, however, assigns them a similarity score of $-.38$. We suspect that this effect results from the distribution of similarity scores, which is skewed towards the positive range of possible scores. Therefore, the SVM probably assigns too much weight to ngrams that signal a negative score.

For the neural approach, we find pairs whose (gold) judgment similarity is negative, but which are assigned a positive value. We observe that many of these pairs contain one assertion which uses a negation or a hedge (e.g. not, unsure, or unlikely). An example of this is the pair "there has been an increase in tropical storms of greater intensity which can be attributed to climate change" and "different changes in weather does not mean global warming", which has a low similarity in the gold data ($-.19$), but is assigned a comparably high similarity score ($.20$). We suspect that the SVM can handle such pairs better, as we have equipped it with an explicit negation feature.

Predicting Stance of Individuals

Now that we have the means for estimating the judgment similarity of assertions, we will explore whether judgment similarity can be used to predict the stance of individuals on assertions. We hypothesize that a person’s stance towards a new assertion is the same as to highly similar assertions. Consequently, we calculate the similarity between each of the assertions previously judged by a person ( $a_{1},...,a_{n-1}$ ) and the assertion for which we want to make the prediction ( $a_n$ ). Then we simply transfer the stance of the most similar assertion to the assertion $a_n$ . Note that the subjects have all rated different numbers of assertions. Thus, for the sake of comparability, we restrict ourselves to the most similar assertion (as opposed to averaging a judgment over the $n$ most similar assertions).

We calculate the similarity between assertions using the semantic textual similarity measures as well as the previously trained prediction models (SVM and SNN). For predicting judgment similarity with the SNN and the SVM, we do not use models that have been trained on all data, as they would contain the pair for which we want to make the prediction. Instead, we use the models that have been trained in the cross-validation and which do not include the pair in the training set. Because the stance on the assertion for which we want to make a prediction must be treated as unknown, the judgment matrix is missing one entry for each prediction. To exclude any possibility of data leakage, one could theoretically form a new matrix for each prediction and then re-calculate all cosines. However, we find that the judgment similarity between assertions does not change significantly when a single entry in the vectors of the assertions is removed or added. Hence, to reduce computational complexity, the gold similarity was calculated over the entire matrix of judgments.
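The cosine-based gold similarity mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that judgments are stored in a person-by-assertion matrix encoded as +1 (agree), -1 (disagree), and 0 (no judgment); the function name `judgment_similarity` and the toy matrix are hypothetical and not taken from our implementation.

```python
import numpy as np

def judgment_similarity(judgments):
    """Cosine similarity between the judgment vectors (columns) of all assertions.

    judgments: array of shape (n_persons, n_assertions).
    Assumed encoding: +1 = agree, -1 = disagree, 0 = no judgment.
    """
    J = np.asarray(judgments, dtype=float)
    norms = np.linalg.norm(J, axis=0)
    norms[norms == 0] = 1.0  # assertions without judgments get similarity 0, not NaN
    unit = J / norms
    return unit.T @ unit

# Toy matrix: 4 persons (rows) x 3 assertions (columns).
toy = np.array([
    [ 1,  1, -1],
    [ 1,  1, -1],
    [-1, -1,  1],
    [ 1,  0, -1],
])
sim = judgment_similarity(toy)
```

In this toy example, the first and third assertion are always judged in opposite ways, so their similarity is -1, while the first and second assertion are judged alike by everyone who rated both and therefore obtain a high positive similarity.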

There are several assertions that do not have textual overlap, which is why the semantic textual similarity measures often return a zero similarity. In such a case, we fall back to predicting a $\oplus$ stance, as the data contains more $\oplus$ judgments than $\ominus$ judgments. We refer to methods that are based on judgment similarity as most similar assertion (measure), where the term measure indicates how the similarity is computed.
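The transfer step with its fallback can be sketched as follows. This is a simplified illustration, not our actual implementation: the word-level Jaccard function stands in for any of the similarity measures, stances are assumed to be encoded as +1 ($\oplus$) and -1 ($\ominus$), and all names and example assertions are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard overlap, used here as a stand-in similarity measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def predict_most_similar(known, target, similarity, fallback=1):
    """Transfer the stance of the known assertion that is most similar to target.

    known: dict mapping assertion text -> stance (+1 or -1) of one person.
    If every similarity is zero (e.g. no textual overlap), return the fallback (+1).
    """
    scores = {assertion: similarity(assertion, target) for assertion in known}
    if not scores or all(score == 0 for score in scores.values()):
        return fallback
    best = max(scores, key=scores.get)
    return known[best]

known = {
    "marijuana should be legal": 1,
    "guns make society safer": -1,
}
pred_overlap = predict_most_similar(known, "guns make our society safer", jaccard)
pred_no_overlap = predict_most_similar(known, "vaccines save lives", jaccard)
```

Here the first prediction transfers the $\ominus$ stance of the overlapping assertion, whereas the second target shares no words with any known assertion and therefore receives the $\oplus$ fallback.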

We evaluate all prediction methods for each subject using a leave-one-out cross-validation. In this way, we only consider those assertions for which a subject has provided judgments. This means that if a subject provided judgments on the assertions $a_1,...,a_4$, we make a single prediction for each of the assertions $a_1,...,a_4$ based on all other judgments. For instance, if we want to predict the stance towards the assertion $a_4$, we rely on the judgments on the assertions $a_1,...,a_3$.
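The leave-one-out protocol for a single subject can be sketched as below. The sketch assumes that a subject's judgments are given as a dictionary from assertion to stance (+1 or -1) and that a prediction method is any function taking the remaining judgments and the held-out assertion; both conventions and all names are illustrative assumptions.

```python
def leave_one_out_accuracy(judgments, predict):
    """Accuracy of a prediction method for one subject under leave-one-out.

    judgments: dict mapping assertion -> stance (+1 or -1) of the subject.
    predict:   callable (known_judgments, target_assertion) -> predicted stance.
    """
    hits = 0
    for target, gold in judgments.items():
        # Hide the judgment on the target and predict it from all others.
        known = {a: s for a, s in judgments.items() if a != target}
        if predict(known, target) == gold:
            hits += 1
    return hits / len(judgments)

# Toy subject who judged four assertions; the predictor always returns +1.
subject = {"a1": 1, "a2": 1, "a3": -1, "a4": 1}
always_agree = lambda known, target: 1
accuracy = leave_one_out_accuracy(subject, always_agree)
```

For the toy subject, the constant predictor is correct on three of the four held-out assertions, which gives an accuracy of .75.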

We compare the judgment similarity methods against several baselines and approaches that directly use the judgments of other persons to calculate a similarity between person and assertion. The latter approaches are inspired by collaborative filtering methods which are, for example, used in modern recommender systems (Adomavicius and Tuzhilin, 2005; Schafer et al., 2007; Su and Khoshgoftaar, 2009). For all methods, we report accuracy as an evaluation metric. Next, we will describe the baseline and the collaborative filtering methods in more detail.

Baselines (BL)

As the simplest baseline, we randomly predict a $\oplus$ or $\ominus$ stance towards each assertion. We refer to this baseline as the random baseline.

In addition, we define the all agree baseline which assumes that a person has a $\oplus$ stance towards all assertions (i.e. the person agrees with all assertions). As the data contains substantially more $\oplus$ judgments than $\ominus$ judgments (cf. Figure 5.3), the all agree baseline represents a strong baseline.

As a third baseline, we make use of the potential tendency of persons to either mostly agree or mostly disagree with assertions. To this end, we consider all known judgments of a person and predict $\oplus$ if the person agrees with the majority of assertions, and $\ominus$ otherwise. We refer to this baseline as tendency.
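The three baselines are simple enough to be written down in a few lines. The following sketch assumes stances encoded as +1 ($\oplus$) and -1 ($\ominus$); the function names are illustrative.

```python
import random

def random_baseline(rng):
    """Randomly predict agreement (+1) or disagreement (-1)."""
    return rng.choice([1, -1])

def all_agree_baseline():
    """Assume that the person agrees with every assertion."""
    return 1

def tendency_baseline(known_stances):
    """Predict +1 if the person agreed with the majority of the known
    assertions, and -1 otherwise (ties therefore yield -1)."""
    agreements = sum(1 for stance in known_stances if stance == 1)
    return 1 if agreements > len(known_stances) / 2 else -1

rng = random.Random(0)
random_prediction = random_baseline(rng)
mostly_agreeing = tendency_baseline([1, 1, -1])
mostly_disagreeing = tendency_baseline([1, -1, -1])
```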

Collaborative Filtering (CF)

The idea of collaborative filtering is to exploit patterns in the behaviour of groups (e.g. what articles they buy, or whether they like or dislike a post) to infer preferences of individuals (Adomavicius and Tuzhilin, 2005; Schafer et al., 2007; Su and Khoshgoftaar, 2009). Collaborative filtering has been successfully applied to recommend products (e.g. on amazon.com (Linden et al., 2003)), movies (e.g. on netflix.com (Gomez-Uribe and Hunt, 2015)), videos (e.g. on youtube.com (Davidson et al., 2010)), or news articles (e.g. on google.com/news (Das et al., 2007)).

Hence, we here compare the judgment similarity methods with collaborative filtering methods that use previously made judgments and judgments made by others to predict future judgments. However, collaborative filtering requires knowledge of how others judged the assertion $a_n$ for which we want to predict a stance. Therefore, the practical use of these methods is limited, at least in the present use case. For instance, collaborative filtering methods would not be applicable if we want to predict the stance towards an assertion which is completely new and thus has not been judged by a large number of people. Nevertheless, collaborative filtering represents an upper bound for our text-based predictions.

For our first collaborative filtering method, we rely on how the majority of other persons judged the assertion $a_n$ . Consequently, we consider the judgments of all other subjects on $a_n$ and predict a $\oplus$ stance if the majority agrees with $a_n$ , and a $\ominus$ stance if the majority disagrees with $a_n$ . We will call this method mean other.

In addition, we compute the similarity between pairs of people by calculating the cosine similarity between the vectors that correspond to all judgments each person has made. For the person $p_i$ whose stance we want to predict, we use this person–person similarity to determine the most similar user and then transfer that user's judgment on $a_n$ to $p_i$. We refer to this method as most similar user.

Furthermore, we use the (gold) judgment similarity between assertions to predict a stance based on how the assertion that is most similar to $a_n$ has been judged. Thus, this method represents the upper-bound for our most similar assertion (measure) methods. Consequently, this method will be referred to as most similar assertion (gold).
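The first two collaborative filtering methods can be sketched on a small person-by-assertion matrix. The encoding (+1 agree, -1 disagree, 0 no judgment), the resolution of ties in favour of $\oplus$, and the exclusion of the target column when comparing persons are assumptions made for this illustration; they are not prescribed by the description above.

```python
import numpy as np

def mean_other(J, person, assertion):
    """Majority judgment of all other persons on the target assertion.
    Assumption: ties are resolved in favour of +1."""
    others = np.delete(J[:, assertion], person)
    return 1 if others.sum() >= 0 else -1

def most_similar_user(J, person, assertion):
    """Transfer the judgment of the person whose judgment vector has the
    highest cosine similarity to that of `person`.
    Assumption: the target column is hidden when comparing persons."""
    mask = np.ones(J.shape[1], dtype=bool)
    mask[assertion] = False
    own = J[person, mask]
    best_stance, best_sim = 1, -np.inf
    for other in range(J.shape[0]):
        if other == person or J[other, assertion] == 0:
            continue  # skip the person itself and persons without a judgment
        vec = J[other, mask]
        denom = np.linalg.norm(own) * np.linalg.norm(vec)
        sim = float(own @ vec) / denom if denom else 0.0
        if sim > best_sim:
            best_sim, best_stance = sim, int(J[other, assertion])
    return best_stance

# Toy data: 4 persons (rows) x 4 assertions (columns).
J = np.array([
    [ 1,  1, -1,  1],
    [ 1,  1, -1,  1],
    [-1, -1,  1, -1],
    [-1,  1,  1, -1],
])
majority_prediction = mean_other(J, person=0, assertion=3)
neighbour_prediction = most_similar_user(J, person=0, assertion=3)
```

In the toy matrix, the majority of the other persons disagrees with the last assertion, so mean other predicts $\ominus$ for the first person. The most similar user, however, judged all remaining assertions identically and agrees with the assertion, so most similar user predicts $\oplus$, which matches the first person's actual judgment.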

Results

Table 5.4

Strategy	Type	Accuracy
most similar user	collaborative filtering	.85
most similar assertion (gold)	judgment similarity	.76
tendency	baseline	.75
mean other	collaborative filtering	.74
most similar assertion (SNN)	judgment similarity	.73
most similar assertion (SVM)	judgment similarity	.72
all agree	baseline	.71
most similar assertion (jaccard)	text similarity	.70
most similar assertion (word vectors)	text similarity	.68
most similar assertion (gst)	text similarity	.69
most similar assertion (lcss)	text similarity	.67
random	baseline	.50

Accuracy of different approaches for predicting stance of individuals.

Table 5.4 shows the performance of the implemented methods across all issues. Overall, we find that all strategies are substantially better than the random baseline. On average, the all agree baseline obtains an accuracy of .71, which is more than 20 percentage points above the performance of the random baseline. As expected from the skewed class distribution (cf. Figure 5.3), always simply assuming a $\oplus$ indeed represents a highly competitive baseline.

The tendency baseline, which is a refinement of the all agree baseline, achieves an accuracy which is 4 percentage points higher than the all agree baseline. Only the collaborative filtering methods most similar assertion (gold) and most similar user are able to outperform this baseline. This shows that predicting the stance of individuals is a highly challenging task. With an accuracy of about 85%, we find that the most similar user method performs best. That means that knowing which stance others have towards the assertion for which we want to predict a stance is an effective means in the present task. The methods that are based on the trained judgment similarity measures (i.e. SVM and SNN) outperform the all agree baseline, but fall behind the tendency baseline. The fact that methods based on trained judgment similarity are already close to their upper bound (most similar assertion (gold)) shows that their potential is limited, even if measuring judgment similarity can be significantly improved.

One possible explanation for the comparably low performance of the most similar assertion method is that some subjects have only judged a small number of assertions (e.g. some subjects only judged a total of four assertions). This number of judgments may simply be insufficient to make a meaningful prediction. For instance, if only a few assertions have been judged in the past and none of them are similar to a new assertion, then a prediction becomes guesswork. As expected from their poor performance in approximating judgment similarity, the methods relying on semantic text similarity measures perform worse than the all agree baseline.

Issue-wise analysis

Table 5.5

Accuracy of different approaches on predicting stance of individuals. Results above the all agree baseline are boldfaced. We indicate methods that rely on judgment similarity by the abbreviation JS(measure) where measure indicates how the similarity is computed. The methods are sorted according to their performance averaged across all issues.

In Table 5.5 we show how the prediction methods perform on the individual issues. Overall, the individual methods perform consistently across most issues. For instance, the method most similar user consistently outperforms the other methods and reaches an accuracy between $.82$ (US electoral system) and $.87$ (gender equality). Hence, the influence of issue-specific peculiarities is not very strong.

However, there are also some notable differences between the issues. The methods based on the trained judgment similarity only outperform the all agree baseline in about half of the cases. We observe that these cases correspond to issues where either the performance of the judgment similarity prediction is strong (e.g. climate change or mandatory vaccination), or in which the all agree baseline yields a particularly low performance. For the issues same-sex marriage and vegetarianism & veganism, we find that even the methods based on semantic textual similarity outperform the all agree baseline. We notice that for these highly polarizing issues the difference between the all agree baseline and the method most similar assertion (gold), which represents the upper bound for all methods based on judgment similarity, is above 10 percentage points (16 for same-sex marriage and 12 for vegetarianism & veganism) and thus comparably high (the mean difference is 5 percentage points). Thus, even unreliable estimations of judgment similarity (cf. Table 5.2) are helpful for predicting stance for these issues.

For the issues climate change, mandatory vaccination, and Obamacare there is only a small difference between the all agree baseline and the method most similar assertion (gold). However, as the automatic estimation of the judgment similarity performs quite reliably, we still observe improvements – albeit small ones – over the all agree baseline. In summary, these findings indicate that the usefulness of judgment similarity for predicting the stance of individuals depends on a complex interaction of the distribution of agreement scores and judgment similarity scores.

Predicting Stance of Groups

We now turn to predicting the stance of groups towards assertions. This means that we want to explore if we can estimate what percentage of a group has a $\oplus$ or $\ominus$ stance towards an assertion, based on the stance they have towards other assertions. We illustrate the prediction task with the following example: From the assertion Marijuana is almost never addictive with an agreement score of $.9$ we want to predict a comparatively lower value for the assertion Marijuana is sometimes addictive.

Analogously to the prediction of stance of individuals, we first calculate the judgment similarity of two assertions using the described SVM and SNN architectures. Next, we take the $n$ most similar assertions and return the average of their agreement scores. As an upper bound, we also compute the judgment similarity that results from the gold data. Note that this upper bound again assumes knowledge about judgments on the assertion for which we actually want to make a prediction.
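The averaging over the $n$ most similar assertions can be sketched as follows. The sketch assumes that the similarities between the target and all assertions are available as one row of a similarity matrix and that agreement scores are given per assertion; the function name and the toy values are hypothetical.

```python
import numpy as np

def predict_agreement(sim_row, agreement_scores, target, n):
    """Average agreement score of the n assertions most similar to the target.

    sim_row:          similarities between the target and every assertion.
    agreement_scores: known agreement score of every assertion.
    target:           index of the assertion to predict (excluded from the pool).
    """
    candidates = [i for i in range(len(agreement_scores)) if i != target]
    candidates.sort(key=lambda i: sim_row[i], reverse=True)
    top = candidates[:n]
    return float(np.mean([agreement_scores[i] for i in top]))

# Toy values: similarities of assertion 0 to assertions 0..3 and their scores.
sim_row = [1.0, 0.9, -0.7, 0.2]
scores = [0.9, 0.8, -0.5, 0.1]
prediction_n1 = predict_agreement(sim_row, scores, target=0, n=1)
prediction_n2 = predict_agreement(sim_row, scores, target=0, n=2)
```

With $n = 1$ the score of the single most similar assertion is transferred, while $n = 2$ already averages in a considerably less similar assertion and pulls the prediction towards its score.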

As a reference approach, we train different regression models that predict the agreement score directly from the text of the assertion. Thereby, we again compare more traditional models based on feature engineering and neural models.

For the feature engineering approach, we implement an SVM regressor (again LibSVM (Chang and Lin, 2011) from DKProTC (Daxenberger et al., 2014)) and experiment with the following feature sets: First, we use a length feature which consists of the number of words per assertion. In order to capture stylistic variations, we compute a feature vector consisting of the number of exclamation and question marks, the number of modal verbs, the average word length in an assertion, POS type ratio, and type token ratio. In addition, we capture the wording of an assertion by uni-, bi-, and trigram word features. For capturing the semantics of words, we again derive features from the pre-trained fastText word vectors (Bojanowski et al., 2017). To capture the emotional tone of an assertion, we use the sentiment score which is assigned by the tool of Kiritchenko et al. (2014) as a feature.

Since convolution has been shown to be particularly useful in estimating judgment similarity, we implement a convolutional neural network (CNN) as the neural approach for directly predicting aggregated judgments. The CNN was implemented in the framework deepTC (Horsmann and Zesch, 2018). Through iterative experiments we found that it is advantageous to add two additional dense layers (200 and 100 ReLU nodes). Since we are trying to solve a regression problem, the final layer of the network is a single node equipped with a linear activation function. The network was trained using the squared error between the gold and predicted agreement scores.

We train all the regression models using a ten-fold cross-validation over all the issues. As an evaluation metric we report Pearson’s r (between the predicted and gold agreement scores) for both the regression models and the methods which are based on judgment similarity.
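For reference, the evaluation metric can be computed as in the following sketch. It implements the standard definition of Pearson's r with NumPy; the function name and the toy scores are illustrative, and the sketch does not handle the degenerate case of constant predictions (where r is undefined).

```python
import numpy as np

def pearson_r(predicted, gold):
    """Pearson correlation between predicted and gold agreement scores."""
    x = np.asarray(predicted, dtype=float)
    y = np.asarray(gold, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()  # center both series
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

gold_scores = [0.9, 0.2, -0.4, 0.6]
# Predictions that are a linear rescaling of gold correlate perfectly ...
rescaled = [0.5 * g + 0.1 for g in gold_scores]
# ... while sign-flipped predictions correlate perfectly negatively.
flipped = [-g for g in gold_scores]
r_rescaled = pearson_r(rescaled, gold_scores)
r_flipped = pearson_r(flipped, gold_scores)
```

Because the coefficient is invariant to linear rescaling, a model is rewarded for ranking and spacing assertions correctly rather than for matching the absolute agreement scores.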

Results

Table 5.6

Model	Type	$r$
gold ( $n=7$ )	judgment similarity	.90
gold ( $n=1$ )	judgment similarity	.84
SNN ( $n=34$ )	judgment similarity	.74
SNN ( $n=1$ )	judgment similarity	.45
SVM ( $n=18$ )	judgment similarity	.42
CNN	direct prediction	.40
sentiment + trigrams	direct prediction	.36
trigrams	direct prediction	.35
unigrams + word vectors	direct prediction	.32
unigrams	direct prediction	.32
SVM ( $n=1$ )	judgment similarity	.32
sentiment + trigrams + style	direct prediction	.27
sentiment	direct prediction	.13
style	direct prediction	.10
length	direct prediction	.00

Performance (Pearson correlation between prediction and gold) of approaches on predicting the stance of groups on assertions. We highlight the best model of the group of approaches that a) rely on automatically measured judgment similarity, b) rely on empirically estimated (gold) judgment similarity, and c) are trained to predict the scores directly from text.

Table 5.6 shows the performance of the different approaches for predicting the stance of groups on assertions. For the judgment similarity approaches, we only show the results for $n = 1$ , and for that $n$ which leads to the best performance.

Overall, we notice that prediction based on judgment similarity outperforms models that are trained to predict the scores directly from text. For instance, for the best judgment similarity model (SNN with $n = 34$ ), we obtain a coefficient of $r = .74$ , which is substantially better than the best direct prediction model (CNN, $r = .40$ ). The best prediction based on the empirically estimated (gold) judgment similarity reaches a performance of .9. The strength of this correlation and the fact that even our best estimate is still 15 points below shows the potential of judgment similarity for predicting the judgments of groups.

Among the approaches that are based on judgment similarity, we observe large differences between the approaches that rely on the SVM and approaches that rely on the SNN. This is especially interesting because the performance of the similarity prediction is comparable. We attribute this to the systematic error made by the SVM when trying to predict the similarity of assertions that have a negative agreement score. While the SVM only outperforms the plain regressions if the prediction is based on several assertions, we observe a substantially better performance for the judgment similarity based on the SNN.

For the plain regression, we observe that the CNN outperforms all models based on feature engineering. Among the feature engineering models, the models which use ngram features yield the best performance. While the sentiment feature alone has low performance, the model that combines sentiment and ngrams shows a slight improvement over the trigrams alone. From this, we conclude that the emotional tone of an assertion has an influence, albeit a small one, on how the assertion is judged.

The length feature and the style features alone have a comparably low performance. In addition, models which combine style and length features with ngram features yield a lower performance than the lexical models alone. Thus, the influence of style on whether people agree or disagree with assertions is not substantial.

Issue-wise analysis

In Table 5.6, we show that for the judgment similarity approaches, it is beneficial to not just transfer the score of the most similar assertion, but to use a larger $n$ . To examine the relationship between $n$ and the performance of the prediction, we compare the performance of both approaches for all $n< 50$ . We visualize this comparison both for individual issues and averaged across all issues in Figure 5.10.

For the averaged scores, we observe that all curves follow a similar pattern that resembles a learning curve. That means the performance increases rapidly with increasing $n$, but then plateaus from a certain number of assertions. This behaviour is not surprising considering that we average the scores of the $n$ most similar assertions. As we average the scores, the influence of an individual score decreases with each additional assertion.

If we further increase $n$, we observe that the correlations start to decrease after the plateau. We suspect that this is because we add the scores of less similar assertions as $n$ increases. At some point, the assertions seem to be so dissimilar that they do not provide meaningful hints on the assertion for which we want to make a prediction.

For the SNN, the predictions follow a similar pattern across all issues. However, the number of assertions for which we observe a plateau varies significantly. There is also a large variance in terms of absolute performance. For example, for $n = 1$, we find a strong correlation of $r = .75$ for the issue gun rights, but an almost zero correlation of $r = .01$ for the issue foreign aid. While the performance of the judgment similarity prediction for the issue gun rights ($r = .64$) was substantially better than the performance for the issue foreign aid ($r = .46$), this difference alone does not seem large enough to account for the large variance of the prediction quality. We find that for $n = 1$ there is a comparable relative performance difference if we inspect the prediction which is based on the gold similarity (foreign aid: $r = .71$; gun rights: $r = .88$). Hence, we suspect that the reason for the large differences is a complex interplay between the quality of the judgment similarity estimation and the upper bound of a prediction that uses this measure.

For the SVM, we also observe a pattern that resembles a learning curve for most of the approaches. However, the plateau is often reached much later and the absolute performance is substantially lower. The similarity prediction of the SNN seems to be much better suited for selecting assertions which have a comparable agreement score. In Section 5.3.1, we showed that the SVM's judgment similarity estimation is systematically affected by the skewed distribution of the similarity scores. A consequence of this error is that, for assertions that have a negative agreement score, the SVM often selects assertions that have a high positive score. Hence, the SVM systematically makes mispredictions that are particularly harmful for the prediction of stance.

There are even two issues (US engagement in the Middle East and US immigration) where we do not observe an increase in performance with increasing $n$ . The low performance for these issues cannot be explained by an unreliable judgment similarity prediction and a low upper-bound of our transfer approach, as both values are in the middle range of the respective ranges (judgment similarity prediction: $r = .52$ for both issues; upper-bound $r = .8$ for US engagement in the Middle East and $r = .82$ for US immigration). Thus, we suspect that the systematic error of the SVM-based judgment similarity estimation results in particularly disadvantageous predictions for these two issues.

Figure 5.10

Prediction performance (Pearson correlation between prediction and gold) based on the transfer of the $n$ -most similar assertions (expressed by the strength of correlation with the gold values). Sub-figure a) shows the scores averaged across all issues. We show the variance obtained on individual issues by the SVM in Sub-Figure b) and by the SNN in Sub-Figure c).

Chapter Summary

In this chapter, we examined our most complex formalization of stance: stance towards nuanced assertions. To the best of our knowledge, this formalization has not yet been examined from an NLP perspective. For examining stance towards nuanced assertions, we created a novel dataset that covers stance towards assertions relevant to sixteen different controversial issues. For creating the data, we proposed a method which quantifies qualitative data by engaging people directly. This means that, besides a definition of an issue, our method does not require any external knowledge and can be fully decomposed into crowdsourceable micro-tasks. The collected dataset contains over 2,000 assertions, over 100,000 judgments of whether people agree or disagree with the assertions, and about 70,000 judgments indicating how strongly people support or oppose the assertions.

We proposed several metrics that can be calculated from the data and used for grouping, ranking, and clustering issues, assertions, and persons. Across the sixteen issues, we find that when people judge a comprehensive set of assertions, there is a high level of consensus, and that only a comparably small proportion of assertions leads to a significant level of dissent. We have shown how assertions that prompt dissent or consensus can be identified by ranking the whole set of assertions based on the collected judgments. Besides analyzing subjects that responded similarly, we analyzed the judgment similarity of assertions, i.e. the degree to which two assertions are judged similarly by a large number of people. We investigated whether this judgment similarity can be estimated automatically and found that machine learning architectures are able to approximate judgment similarity quite reliably.

Subsequently, we suggested two NLP tasks: predicting stance of (i) individuals and (ii) groups based on the stance towards a set of other assertions. We proposed to solve these two tasks by automatically measuring judgment similarity between assertions and using the stance towards the most similar assertions for our prediction. We compared this prediction against several reference approaches. We found that predicting stance of individuals is a hard task, with our best results only slightly exceeding a majority ($\oplus$ stance) baseline. However, we also found that the aggregated stance of groups can be quite reliably predicted. In this context, the judgment similarity that is approximated by the SNN has been shown to be particularly useful.

However, the effectiveness of judgment similarity-based approaches also raises several questions. While we find that the wording is of decisive importance and that semantic textual relatedness is not sufficient to describe judgment similarity, the exact relationship between pairs of assertions and their judgment similarity remains unclear. To better understand this relationship, it seems worthwhile to annotate and experimentally control typed relationships (e.g. paraphrases or entailment) of pairs of assertions.

We argue that the collected dataset corresponds to a dataset that could be obtained from social media, for example, by collecting the upvotes or downvotes of people on posts. However, in a real social media dataset there are a number of reasons why people do not upvote a post even though they agree with it (e.g. they may have privacy concerns about sharing their opinion, may not notice a post, or may simply be unmotivated to upvote something), or why they upvote a post that they disagree with (e.g. peer pressure can motivate people to upvote a post, or an upvote can be meant ironically). Future research should therefore investigate whether our findings can be replicated on real social media datasets.

In the present chapter, we relied on simple heuristics (i.e. averaging) to predict the stance towards one assertion based on the stance towards the $n$ most similar assertions. Future attempts should explore more advanced ways of weighting the most similar assertions to derive a prediction. For example, one could explore weighting functions that decrease the influence of assertions based on the similarity, or try to train weighted functions using machine learning. Finally, being able to predict judgment similarity might also be helpful for predicting the stance towards single or multiple targets.
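One conceivable instance of such a weighting is a similarity-weighted average, sketched below. This is not a method we evaluated; clipping negative similarities to a weight of zero and falling back to the plain average when all weights vanish are assumptions made for this illustration, as are the function name and the toy values.

```python
def weighted_agreement(similarities, scores, n):
    """Similarity-weighted average over the n most similar assertions.

    similarities: similarity of each candidate assertion to the target.
    scores:       agreement score of each candidate assertion.
    Assumption: negative similarities are clipped to a weight of zero.
    """
    ranked = sorted(zip(similarities, scores), key=lambda p: p[0], reverse=True)[:n]
    weights = [max(sim, 0.0) for sim, _ in ranked]
    total = sum(weights)
    if total == 0:
        # No positively similar assertion: fall back to the plain average.
        return sum(score for _, score in ranked) / len(ranked)
    return sum(w * score for w, (_, score) in zip(weights, ranked)) / total

similarities = [0.9, 0.1]
scores = [1.0, 0.0]
weighted = weighted_agreement(similarities, scores, n=2)
plain = sum(scores) / len(scores)
```

In the toy example, the plain average treats both assertions equally and yields .5, whereas the weighted variant stays close to the score of the far more similar assertion.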

Next, as an application domain of stance detection, we will study the detection of hateful stances. Specifically, we will examine formalizations, datasets, as well as the automatic detection of this unpleasant form of expressing one’s stance.

Hateful Stance

Utterances that express stance differ not only in the position they convey, but also with regard to how offensive or hateful they are. Hence, in this chapter we will discuss ways to formalize this unpleasant form of stance-taking, ways of creating and analyzing suitable datasets, and methods that are able to automatically detect such stances. To elucidate the kinds of unpleasant communication that this refers to, consider the following expressions that could be uttered by a person who strongly opposes the construction of a windmill in her town:

(A) It is unpleasant that they want to construct the windmill here.
(B) I curse all who are pro constructing this hideous windmill.
(C) I will tear this disgusting windmill down if they construct it.

While all utterances clearly express a $\ominus$ stance towards the construction of the windmill, they differ in how objective, or how offensive and hateful they are. For most people, Example (A) should represent an acceptable discussion contribution, while Example (C) is probably considered to be more offensive. If you do not consider Example (C) to be offensive, you may replace the target windmill with, for instance, refugee center, women’s shelter, or synagogue.

The examples above also illustrate that offensive stance-taking cannot always easily be distinguished from taking an extreme, but still acceptable position. Such a distinction may vary with respect to several contextual factors such as the occasion on which the post is made or which audience is addressed. For instance, the hideous women’s shelter (Example (B)) is probably considered to be offensive by a general audience, but may be acceptable when uttered by one inhabitant of a women’s shelter to another. In addition, some forms of offensive expressions may represent criminal offences under the criminal codes of some countries (e.g. denying the Holocaust in Germany)*under §130 in the German criminal code (Strafgesetzbuch): incitement to hatred (Volksverhetzung) or be inappropriate for children below a certain age (e.g. extreme forms of obscene or sexualizedlanguage).*in extreme forms, sexualized language towards children may be considered as sexual abuse of children (§176 in the German criminal code)

Besides stance-taking that is in conflict with certain laws, offensive utterances may have several other negative real-world consequences: The presence of hateful utterances may amplify group polarization (Sunstein, 2002), result a hostile climate for online discussions (Mantilla, 2013), the propagation of hateful (e.g. racist, sexist) stereotypes (Chaudhry, 2015; Bartlett et al., 2014), or negative psychological effects for those who are targeted by the expressions. For instance, being targeted by online attacks has been linked to developing a depression or even suicidal ideation (Hamm et al., 2015; Wright, 2018). In addition, Martin et al. (2013) show that being exposed to hateful expressions may turn one’s otherwise neutral stance towards a targeted group or person into a negative one. Furthermore, hateful expressions may precede hate crimes in the offline world (Mantilla, 2013).

This series of negative consequences highlights that hateful stances represent a serious problem for several forms of online communication and have the potential to nullify the positive effects brought by these means of communication. One frequently taken measure for limiting the amount and impact of such utterances is to implement a code of conduct which can then be enforced by moderators. However, the sheer mass of online postings makes such an approach time- and cost-intensive, which may lead to hate comments not being identified at all, or only after a long time.

Hence, the automatic (and thus fast and comparatively cheap) detection of hateful stances has great potential in tackling this unpleasant phenomenon. In addition, automated processes may help to increase the fairness of moderation, as machines are not biased by relationships between moderators and posters, or by stereotypes held by the moderators.

Besides the potential inherent in the automatic detection of hateful stances, we here also want to stress the risks of such approaches. Any censorship may be in conflict with the right to freedom of expression as, for instance, granted by the European Convention on Human Rights (Article 10: right to freedom of expression) or the First Amendment to the United States Constitution. Hence, the identification of hateful stance requires a balancing of conflicting rights, which also includes the broader context of the utterance. Because of this balancing of conflicting rights, it seems problematic to rely exclusively on automatic decisions.

Hence, we envision use-cases which combine both automatic and human decisions: On the one hand, automatic identification of potentially hateful messages could be used as a pre-filtering step before manual identification, which would reduce the workload of a moderator. On the other hand, such automatic identification could be used to augment the interfaces of social media users. Such a system could indicate to a user if the post to be made might be considered to be hateful, which could motivate the user to rephrase a comment in a less hateful way.

Formalizing Hateful Stance

The examples (A), (B), (C) about windmills illustrate that there are several reasons why an expression may be perceived as being hateful. These reasons include: (i) the usage of potentially inappropriate language (e.g. Example (B): hideous and (C): disgusting), (ii) threatening behaviour (Example (C): I will tear [...] down [...]), or (iii) derogations and insults (Example (B): I curse all who [...]).

As the overall aim of this thesis is to provide a means for understanding stances, we here focus on hate that is being expressed towards a target and not on what sort of language may be inappropriate (e.g. too rude, too obscene) for a certain audience. Hence, we formalize hateful stance as a tuple which consists of:

a hatefulness polarity, which describes how hateful an expression is judged to be in the given context. Examples for labels include binary distinctions such as hateful vs. not hateful or more fine-grained distinctions (e.g. a hatefulness score from NOT HATEFUL to EXTREMELY HATEFUL).
a target towards which the hateful stance is expressed and which may refer to a group (e.g. refugees or women) or a single person (e.g. Hillary Clinton). In contrast to stance, we do not define hateful stance towards statements, as the negative consequences outlined above (e.g. developing depression) can only occur if there are real victims in the offline world.
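The tuple described above can be written down as a small data structure. The following is only a minimal sketch: the names `Hatefulness` and `HatefulStance` are illustrative assumptions and not part of the thesis, and the binary label set is just one of the polarity variants mentioned above.

```python
from dataclasses import dataclass
from enum import Enum


class Hatefulness(Enum):
    """Binary hatefulness polarity (a fine-grained scale would add more members)."""
    NOT_HATEFUL = 0
    HATEFUL = 1


@dataclass(frozen=True)
class HatefulStance:
    """A hateful stance as a (target, polarity) tuple (hypothetical names)."""
    target: str            # a group or a single person, never a statement
    polarity: Hatefulness  # how hateful the expression is judged in context


# Hypothetical annotation of one post towards the target "refugees".
annotation = HatefulStance(target="refugees", polarity=Hatefulness.HATEFUL)
print(annotation)
```

Making the dataclass frozen keeps an annotation immutable and hashable, so that annotations from different annotators could be compared or counted directly.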

In Figure 6.1, we provide an example of an annotation scheme that corresponds to this definition of hateful stance. In the following section, we will discuss how this definition relates to the definition of hate speech and similar phenomena, which formalisms have been created to annotate hateful stances, and which methods have been applied to automatically detect this unpleasant form of stance-taking.

Figure 6.1

Example of an applied annotation scheme which corresponds to the formalization Hateful Stance. If you do not consider the examples in the right column to be offensive, you may replace the target windmill with refugee center, women’s shelter, or synagogue.

Hateful Stance, Hate Speech and Related Concepts

There are several factors (e.g. threatening language or inappropriate wording) that may contribute to whether a stance is perceived as hateful. As a consequence of this variety, researchers have studied this or related phenomena from several viewpoints and have labelled their research with different terms. These terms include abusive language (Waseem et al., 2017b), ad hominem argumentation (Habernal et al., 2018), aggression (Kumar et al., 2018a), cyberbullying (Xu et al., 2012; Macbeth et al., 2013), offensive language (Razavi et al., 2010; Davidson et al., 2017), hate speech (Warner and Hirschberg, 2012; Del Vigna et al., 2017; Davidson et al., 2017), toxic language (Wulczyn et al., 2017), or threatening (Oostdijk and van Halteren, 2013).

We will now discuss how these approaches relate to our definition of hateful stance. As in our discussion of stance formalisms, we will here focus on approaches which apply an annotation scheme to data and discuss the annotation schemes according to the two building blocks hatefulness polarity and target. We will again consider these approaches in terms of their annotation complexity.

Hatefulness Polarity

Binary Polarity

The least complex form of polarity distinctions is a binary one. Representatives of binary distinctions are approaches that collect lists of swearwords, or profane and hateful words: Spertus (1997) manually compiles a list of bad words which contains verbs (e.g. stink or suck), adjectives (e.g. bad or lousy) and nouns (e.g. loser or idiot). Similarly, Gitari et al. (2015) define discriminative verbs and then expand the list using the lexical-semantic knowledge base WordNet. There are also approaches that induce such lists by predicting whether words are ABUSIVE or NOT ABUSIVE (Wiegand et al., 2018a). While these lexicons may contain some target-specific swearwords (for instance, Wiegand et al. (2018a) show that the word bitch is one of the most frequent abusive words in their Twitter dataset; this swearword has a clear female connotation and is probably rarely used to offend religious groups), these approaches differ substantially from our definition of hateful stance as they only annotate words (i.e. a word is in the list or not in the list) and thus do not model the relation to a target.

There are also approaches that annotate complete posts with a binary annotation scheme: For instance, Nobata et al. (2016) let trained annotators label Yahoo! comments for being ABUSIVE or CLEAN. Similarly, Sood et al. (2012) annotate comments on news articles with the labels INSULT or NO INSULT. In subtask one of their shared task on offensive language detection, Wiegand et al. (2018b) label tweets based on whether they are OFFENSIVE or NOT. These formalizations subsume a number of phenomena (e.g. insults or profanity) under the notion of ABUSIVE or OFFENSIVE. As this includes the usage of potentially inappropriate language, these formalizations are broader than our definition of hateful stance. There are also more specific, binary formalizations such as PROFANE versus NOT PROFANE (Su et al., 2017), THREAT versus NON-THREAT (Oostdijk and van Halteren, 2013), or RACIST versus NON-RACIST (Kwok and Wang, 2013).

Other binary approaches annotate whether comments in forums represent ATTACKS or NOT (Wulczyn et al., 2017; Pavlopoulos et al., 2017). Similarly, Habernal et al. (2018) model whether a Reddit post represents a so-called ad hominem argument, i.e. a violation of the SubReddit’s code of conduct (e.g. by being too rude). The aim of these approaches is to analyze whether the authors of social media posts attack each other during their interactions. These approaches thus formalize offensive speech towards a target: the author(s) in a specific discussion thread. Hence, these formalizations are related to our definition of hateful stance. However, by comparing attacks across several threads, approaches that model attacks focus on determining which properties of the language used result in the perception that a comment is an attack. Thus, these approaches also differ from our formalization, as our focus is to study hateful stances towards specific targets.

Fine-grained Polarity

Besides binary distinctions, there are more complex formalizations that use more fine-grained gradations of hatefulness. For instance, Del Vigna et al. (2017) use a three-level scale (labels: NO HATE, WEAK HATE, and STRONG HATE) to annotate hatefulness in Italian Facebook posts. They subsequently group the hateful posts based on a predefined set of categories (religion, handicapped, socio-economical status, politics, race, gender, and other). As the categories refer to certain groups of targets (e.g. the category religion is assigned if the post targets people that have a certain religion), the scheme is in line with our definition of hateful stance. Burnap and Williams (2014) annotate whether tweets are offensive with respect to race, ethnicity or religion. In addition to a binary decision (YES versus NO), they also model an UNDECIDED option.

A three-level scale is also used by Kumar et al. (2018a), who label English and Hindi Facebook posts with three levels of aggression (labels: NON-AGGRESSIVE, COVERTLY AGGRESSIVE and OVERTLY AGGRESSIVE). Like the binary annotations of Wulczyn et al. (2017) or Pavlopoulos et al. (2017), the three-level aggression scale models the intensity of aggression as a general language property and is thus different from our definition of hateful stance.

There are also approaches that use even more fine-grained levels of hate polarity. For instance, Razavi et al. (2010) use a manual, iterative procedure to label potentially hate-bearing words, expressions and phrases on a scale from one (SUBJECTIVE, but not insulting) to five (ABUSING/INSULTING). Like the previously described swearword lists, the resulting Insulting or Abusive Language Dictionary contains little target-specific vocabulary and thus has a different purpose than our formalization of hateful stance.

The binary ATTACK annotations of Wulczyn et al. (2017) are created by a majority vote over the labels provided by ten annotators. The authors suspect that different annotators may differ in their judgment on what attacks are, a suspicion that is supported by the low reliability of the annotation (Krippendorff’s $\alpha = .45$). To account for this uncertainty, the authors introduce fuzzy labels that indicate what percentage of annotators labeled a comment as an attack. For instance, a score of 70% means that seven out of ten annotators flagged a comment as an attack.
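The relation between the majority-vote label and the fuzzy label can be illustrated with a few lines of Python. This is a minimal sketch under the assumption that each annotator vote is a boolean; the function names are hypothetical, and treating a tie as "no attack" is our assumption rather than a rule stated by Wulczyn et al. (2017).

```python
def fuzzy_label(votes):
    """Return the share of annotators who flagged a comment as an attack."""
    if not votes:
        raise ValueError("at least one annotator vote is required")
    return sum(votes) / len(votes)


def majority_label(votes):
    """Return True if strictly more than half of the annotators flagged an attack.

    Ties count as "no attack" here, which is an assumption of this sketch.
    """
    return fuzzy_label(votes) > 0.5


# Seven out of ten annotators flag the comment as an attack.
votes = [True] * 7 + [False] * 3
print(fuzzy_label(votes))     # 0.7
print(majority_label(votes))  # True
```

The fuzzy label keeps the disagreement visible (0.7 rather than a plain "attack"), which is the information that a hard majority vote discards.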

Nuanced Labels

Finally, we take a look at approaches that simultaneously model different qualities of hateful or inappropriate expressions. First, there are approaches that differentiate between offensiveness and hatefulness. Examples for these approaches include Davidson et al. (2017), who use the labels HATE, OFFENSIVE, and NEITHER, and Mathur et al. (2018), who formalize whether Hindi-English tweets are NON-OFFENSIVE, ABUSIVE, or HATE-INDUCING. In addition, Mubarak et al. (2017) annotate whether Arabic social media posts are VULGAR (e.g. rude sexual references), PORNOGRAPHIC, or HATEFUL.

The annotation guidelines of these three approaches state that the labels HATE, HATE-INDUCING or HATEFUL include offensive stances towards people’s race, religion, or country. In subtask two of the shared task by Wiegand et al. (2018b), tweets are annotated with the labels PROFANITY, INSULT, ABUSE, and OTHER. While the annotation scheme is not designed to exclusively capture hateful stances towards target groups, the label ABUSE covers cases in which someone is degraded for being a member of a specific group (e.g. being a migrant, being homosexual). Hence, this label corresponds to our definition of hateful stance.

Different qualities of hateful or inappropriate expressions are also modelled by approaches that formalize hateful behavior between different social media users. For example, the dataset of Kumar et al. (2018b) contains a distinction between different kinds of aggression (PHYSICAL THREATS, SEXUAL THREATS/AGGRESSION, IDENTITY THREATS/AGGRESSION, GENDERED AGGRESSION, GEOGRAPHICAL AGGRESSION, POLITICAL AGGRESSIONS, AGGRESSION BASED ON SOMEONE'S CASTE, COMMUNAL AGGRESSIONS, RACIAL AGGRESSION and NON-THREATENING AGGRESSIONS). Van Hee et al. (2015) annotate ask.fm comments for the presence and intensity of cyberbullying. Specifically, they annotate whether the post’s author is a HARASSER, VICTIM or BYSTANDER and fine-grained cyberbullying categories such as THREAT/BLACKMAIL, INSULT, CURSE/EXCLUSION, DEFAMATION or SEXUAL TALK.

Targets

As described above, we distinguish between targeted formalisms that assume predefined targets (e.g. immigrants, women), and non-targeted formalisms that model hatefulness or offensiveness as a more general property of language (e.g. whether words are considered to be swearwords). However, if one applies the annotation scheme to real-world datasets, there is often a strong overlap between targeted and non-targeted schemes. For instance, approaches that aim at recognizing profane language can be classified as non-target-specific formalisms. If we apply such approaches to a dataset that deals exclusively with a specific topic, the extracted profane language will likely be specific to the topic of the dataset. For example, if we try to identify swearwords in a dataset that contains texts about women, we will probably capture swearwords that target women.

Approaches that model whether authors attack each other have a predefined target – the specific author that is attacked by an utterance. As described above, the aim of these approaches is to recognize attacks against generic users. Hence, our definition of a target is related, but also different, as we focus on specific targets such as minorities (e.g. immigrants).

Moreover, there are approaches that use target-specific hatefulness labels and are thus in accordance with our definition of targets. For instance, Kwok and Wang (2013) distinguish between RACIST or NON-RACIST tweets in a dataset about blacks in the US. Hence, their formalization can be interpreted as hateful stance towards blacks in the US.

In subtask A of their shared task on automatic misogyny identification, Fersini et al. (2018) annotate whether tweets are MISOGYNOUS or NOT MISOGYNOUS, which is similar to annotating hateful stance towards women. (In subtask B, Fersini et al. (2018) use the more fine-grained hateful polarities stereotype & objectification, dominance, derailing (justify woman abuse), sexual harassment & threats of violence, and discredit (slurring over women with no other larger intention). They also specify whether a tweet targets a specific individual or women in general.) A less precise group of targets is studied by Burnap and Williams (2014), who ask in a crowdsourcing setup whether tweets are offensive in terms of race, ethnicity or religion.

In addition to these single-target approaches, there are also approaches that formalize hateful stance towards several distinguishable targets. For instance, Waseem (2016) and Waseem and Hovy (2016) apply a series of tests to determine whether tweets represent RACISM (targets a minority which differs from the main population in race), SEXISM (targets females) or NEITHER. Similarly, Dinakar et al. (2012) formalize attacks on sexual minorities or women (label: SEXUALITY), racial or cultural minorities (label: RACE AND CULTURE), and on the intelligence of individuals (label: INTELLIGENCE). The first two labels correspond directly to our definition of hateful stance. Saleem et al. (2016) manually identify hateful SubReddits (e.g. r/FatPeopleHate) to collect posts that target African-Americans, plus-sized people, and women. Warner and Hirschberg (2012) annotate passages with the labels ANTI-SEMITIC, ANTI-BLACK, ANTI-ASIAN, ANTI-WOMAN, ANTI-MUSLIM, ANTI-IMMIGRANT, or OTHER-HATE. Fišer et al. (2017) formalize whether INAPPROPRIATE SPEECH, INADMISSIBLE SPEECH, or HATE SPEECH targets a certain ethnicity, race, sexual orientation, political affiliation, or religion.

In summary, we observe that there is already a large body of related works on formalizing and annotating unpleasant or hate-bearing social media comments. However, whether certain groups of people are the target of hate is only considered in a few works and the phenomenon is rarely considered in the full spectrum of all possible stances towards a target. In this chapter, we will show how hate towards a target can be formalized as a target-polarity tuple which corresponds to how we formalize regular stance. We will also show how hateful stance can be modelled as a part of the full spectrum of nuanced stances towards a target.

Automatic Prediction

In the following, we will discuss the current state of the art of automatically detecting hateful stance. While there are some rule-based approaches in which a function that assigns hatefulness labels to text is manually crafted (e.g. Spertus (1997) manually built a decision tree that makes use of a hate word lexicon), the majority of today’s systems use supervised machine learning to automatically estimate such a function. As in the discussion of the state of the art of automatic stance detection, we distinguish between classical approaches which are based on feature engineering and neural network approaches. We will first discuss which neural and non-neural approaches have been proposed. Subsequently, to identify the current state of the art, we will take a look at comparative evaluation initiatives on the subject.

Feature Engineering

Similar to stance detection, SVMs are frequently used among the feature engineering approaches to determine the function between instance representations and hatefulness labels. For instance, Sood et al. (2012) train an SVM equipped with weighted ngrams to predict insults. Similarly, Chen et al. (2012) compare the performance of an SVM and a Naïve Bayes classifier in the task of predicting offensive language. In their experiment, they represent the input instances with lexical features (including ngrams and features based on an offensiveness dictionary) and syntactic features which are derived from the grammatical structure of sentences. They find that the SVM consistently outperforms the Naïve Bayes classifier. In addition to ngrams and syntax features (based on POS-tags), Warner and Hirschberg (2012) also add semantic features derived from Brown clusters.

In addition to SVMs, there are also approaches that rely on logistic regression as a classification algorithm: For instance, Waseem and Hovy (2016) train a logistic regression classifier to predict racism and sexism in tweets. In their experiments they represent the tweets through word ngrams, character ngrams, and length features. The authors experiment with adding user-specific features such as the author’s gender and location, but find that gender information only slightly improves performance and that location information even decreases performance.
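To make the feature types mentioned above more concrete, the following sketch extracts word ngrams, character ngrams, and a length feature from a text. It is an illustration only: the function names, the whitespace tokenization, and the chosen ngram orders are our assumptions and do not reproduce the exact feature extraction of any of the cited systems.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return all word ngrams of order n (lowercased, whitespace tokenized)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all character ngrams of order n (lowercased)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def featurize(text):
    """Map a text to a sparse feature dictionary (feature name -> value)."""
    features = Counter()
    features.update("w1=" + g for g in word_ngrams(text, 1))
    features.update("w2=" + g for g in word_ngrams(text, 2))
    features.update("c3=" + g for g in char_ngrams(text, 3))
    features["length_tokens"] = len(text.split())
    return features


features = featurize("No more windmills")
print(features["w1=windmills"], features["length_tokens"])
```

A sparse dictionary of this kind could then be vectorized and passed to a linear classifier such as an SVM or a logistic regression classifier.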

On the same dataset, Waseem (2016) augments the tweet representation by using skip-gram and POS features, and features derived from Brown clusters. The experiment on expert and laymen annotations shows that Brown clusters increase the performance, and that skip-gram and POS features are beneficial in certain conditions of the experiment. Similarly, Nobata et al. (2016) use lexical, syntactic, and length features and features that reflect the distributional similarity of words. However, instead of Brown clustering, Nobata et al. (2016) derive their features from pre-trained dense word vectors.

An even richer feature space is examined by Davidson et al. (2017). Specifically, they use tf-idf weighted ngrams, POS tags, readability measures (they utilize the Flesch-Kincaid index (Flesch, 1948), which is a formula that tries to determine the readability of a text on the basis of sentence length and syllables), length, sentiment scores, and count indicators for hashtags, @-mentions, retweets, and URLs. On the basis of this feature space, Davidson et al. (2017) compare several classification algorithms and find that logistic regression and SVM perform best on their dataset. Burnap and Williams (2014) also compare several classification algorithms on the basis of a rich feature space (various ngram features, lexicons, and syntax features). They find that an ensemble classifier that combines the predictions of several base classifiers outperforms the individual base classifiers.

From the popularity of SVMs equipped with ngrams, word vectors and specialized dictionaries, we conclude that this approach yields a comparatively robust performance across different datasets. In that respect, the feature engineering approaches for detecting hateful stance or related phenomena resemble the methods which are used in stance detection.

Neural Networks

The neural approaches range from simple multilayer perceptrons over CNNs to LSTMs. For instance, Djuric et al. (2015) use paragraph2vec (Le and Mikolov, 2014) to translate Yahoo! comments into vectors. Subsequently, they use a softmax layer to classify the comments based on whether they contain hate speech or whether they are clean. They report that this approach is superior to representing the comments with sparse word vectors. On the same dataset, Mehdad and Tetreault (2016) show that RNNs outperform both the paragraph2vec approach of Djuric et al. (2015), and the logistic regression of Nobata et al. (2016). However, they also show that their RNN is inferior to a combination of Naïve Bayes and SVM. Furthermore, they achieve an even higher performance by combining several classifiers (including the RNN) into an ensemble classifier.

Pavlopoulos et al. (2017) and Wulczyn et al. (2017) apply neural approaches on a dataset containing Wikipedia comments which is annotated with levels of aggression. In detail, Wulczyn et al. (2017) propose to use a multi-layer perceptron. Pavlopoulos et al. (2017) show that their RNN outperforms both the perceptron of Wulczyn et al. (2017) and a CNN on an extended version of this data. Mishra et al. (2018) use a biLSTM which is enriched with character information on the same data and demonstrate that this enrichment is beneficial in terms of performance.

Neural approaches have also been applied to predict racism and sexism in the dataset of Waseem (2016). For instance, Gambäck and Sikdar (2017) translate the tweets into vectors by using word2vec (Mikolov et al., 2013a) and then train a neural network classifier with a convolutional layer at its core. They show that this approach outperforms the logistic regression by Waseem (2016). Badjatiya et al. (2017) conduct a comparative evaluation in which they compare the performance of CNNs and LSTMs with that of SVMs. They find that both neural approaches outperform the SVM and that the LSTM is superior to the CNN. Mishra et al. (2018) apply their architecture, which uses a biLSTM layer and character-level information, to the dataset. As on the dataset of Wulczyn et al. (2017), they find that using character-level information increases the performance.

In summary, we do not observe a clear tendency for whether a certain neural network architecture is best suited for predicting hateful stance or related phenomena. However, we observe that, at least under the controlled conditions of the presented studies, neural networks are often superior to more traditional feature engineering approaches.

State of the art of Automatic Hatefulness Detection

The discussion above shows that there is a large variance in approaches towards automatically detecting hateful stance, or offensive and attacking language. In order to examine how these approaches perform in direct comparison, we now discuss a series of shared tasks that have been executed to determine the state of the art in this area. During this discussion, we will identify those strategies that potentially make some systems more successful than others.

First, we discuss the systems which perform best in the Kaggle challenge on classifying toxic comments. (Kaggle is an online platform that offers machine learning competitions; the challenge on toxic language can be found at www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge; last accessed November 12th 2018.) The challenge’s organizers provide Wikipedia talk page edits, which are labeled with different types of toxicity such as threats, obscenity, or insults. We now discuss the strategies which are adopted by the three top-scoring systems. We observe that all three top-scoring systems use ensemble classifiers. In addition, we notice that two of the three systems make use of data augmentation.

The team that is ranked first uses an ensemble of gradient-boosted trees and several ways to augment the training data. For instance, they first translate the texts to German and then translate the texts back to English. This creates additional variance in the texts, which is expected to help train a better model (www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52557). The team in second place uses an ensemble of RNNs, CNNs and gradient-boosted machines, and also translation methods to augment the data (www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52612). The team in third place also uses an ensemble of several neural networks (mostly with CNN and LSTM layers in different parameterizations) and gradient-boosted trees (www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52762).
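The augmentation through translation and back-translation can be sketched as follows. This is a minimal illustration under stated assumptions: `translate` stands for any machine translation function with the hypothetical signature `translate(text, src, tgt)`, and `toy_translate` is a lookup-based stand-in that only exists to make the example runnable. It is not the pipeline used by the participating teams.

```python
def back_translate(text, translate, pivot="de", source="en"):
    """Translate text into a pivot language and back to the source language."""
    pivot_text = translate(text, src=source, tgt=pivot)
    return translate(pivot_text, src=pivot, tgt=source)


def augment(dataset, translate, pivot="de"):
    """Return the original (text, label) pairs plus label-preserving paraphrases."""
    augmented = list(dataset)
    for text, label in dataset:
        paraphrase = back_translate(text, translate, pivot=pivot)
        if paraphrase != text:  # skip round trips that change nothing
            augmented.append((paraphrase, label))
    return augmented


# Toy stand-in for a machine translation system (hypothetical, lookup only).
_TOY_MT = {
    ("en", "de", "that is awful"): "das ist furchtbar",
    ("de", "en", "das ist furchtbar"): "that is terrible",
}


def toy_translate(text, src, tgt):
    """Look up a translation; return the input unchanged if none is known."""
    return _TOY_MT.get((src, tgt, text), text)


data = [("that is awful", "toxic"), ("thank you", "clean")]
print(augment(data, toy_translate))
```

The sketch assumes that the round trip preserves the label of a text, which is the premise of this kind of augmentation but would need to be checked on real data.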

Next, we shed light on the systems that participated in the shared task on aggression identification by Kumar et al. (2018a), which provided aggression-annotated Twitter and Facebook datasets in both English and Hindi. We observe that there is a huge variance between the systems that perform best on the four datasets:

For the Hindi Facebook data, the top-scoring system is the logistic regressor by Samghabadi et al. (2018), which is equipped with tf-idf-weighted word unigrams, character ngrams and word vector features. They also report that they make use of extensive preprocessing in which they apply spelling correction and replace URLs, numbers, and email addresses with a special token. In contrast to the Hindi Facebook posts, neural approaches (e.g. the combination of fastText word vectors and CNNs by Majumder et al. (2018)) are superior to traditional approaches on the Twitter data. On the English Facebook data, the neural network with LSTM units by Aroyehun and Gelbukh (2018) achieves the top rank. The team also augments the training data by translating it to and from French, Spanish, German, and Hindi. On the English Twitter dataset, a comparatively simple dense neural network with social media preprocessing (e.g. replacing informal abbreviations with the words they refer to) by Raiyani et al. (2018) outperforms all other competitors.
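The kind of preprocessing mentioned above, in which URLs, numbers, and email addresses are replaced with placeholders, can be sketched with regular expressions. The patterns and the token names `<EMAIL>`, `<URL>`, and `<NUM>` are our assumptions for illustration (the source only speaks of a special token), and the patterns are deliberately simple rather than exhaustive.

```python
import re

# Order matters: e-mail addresses and URLs are replaced before bare numbers,
# so that digits inside them are not turned into <NUM> first.
_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "<EMAIL>"),
    (re.compile(r"https?://\S+|www\.\S+"), "<URL>"),
    (re.compile(r"\b\d+(?:[.,]\d+)*\b"), "<NUM>"),
]


def normalize(text):
    """Replace e-mail addresses, URLs, and numbers with placeholder tokens."""
    for pattern, token in _PATTERNS:
        text = pattern.sub(token, text)
    return text


example = "Mail me at a.b@example.com or see https://example.org/page2 before 10.30"
print(normalize(example))
# Mail me at <EMAIL> or see <URL> before <NUM>
```

Collapsing such strings into a few tokens reduces the sparsity of ngram features, since individual URLs or numbers rarely recur across posts.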

In the shared task on the identification of offensive language by Wiegand et al. (2018b) there are two subtasks: In subtask one, tweets have to be classified as OFFENSIVE or NOT, and in subtask two, the participants have to distinguish between finer levels of offensiveness (e.g. profanity or abuse). In both subtasks, we notice that ensemble classifiers are popular among the top-performing systems. Specifically, Montani and Schüller (2018) train an ensemble classifier based on traditional classifiers (e.g. maximum entropy or random forest) which are equipped with tf-idf weighted token and character ngrams and averaged word vector features. Similarly, von Grünigen et al. (2018) combine the predictions of several neural network classifiers by using an ensemble classifier.

In both subtasks, pre-training a neural network on similar data and then fine-tuning it on the task’s training data seems to be a successful strategy. For instance, this transfer learning strategy is used by the top-ranked submissions of von Grünigen et al. (2018) and Wiedemann et al. (2018). Both teams train networks that have both recurrent and convolutional layers. In subtask two, an interesting approach among the top three is that of Corazza et al. (2018), who combine neural network classification with manual feature engineering. More specifically, they implement an RNN to extract text representations and concatenate these representations with a vector that represents tweet-level features such as the number of emojis within a tweet.

Finally, we take a look at the systems that reach a top performance on the shared task on automatic misogyny identification by Fersini et al. (2018). We find that all top-scoring systems rely on traditional classifiers and rich feature sets. We notice that the neural network approaches achieve comparatively low performance.

The best-performing system is the SVM submitted by Pamungkas et al. (2018), which is equipped with stylistic and lexical features. More precisely, they use ngrams, Twitter-specific features (e.g. occurrences of hashtags) and several lexicons to capture the presence of swear words, sexist slurs, and women-related words. Frenda et al. (2018), the second-best team, use character ngrams and handcrafted lexicons to capture words that are indicative of certain topics such as sexuality or stereotypes. Subsequently, they feed the extracted feature vectors to an ensemble classifier that combines SVM, random forest, and gradient-boosted trees. The third place is reached by the SVM classifier of Canós (2018) that only uses tf-idf weighted unigrams. The system uses extensive preprocessing to, for instance, replace URLs or misogynistic hashtags (as specified in a predefined list) with a special token.

Across the four evaluation initiatives, we observe two techniques that are commonly used by the top-performing systems: First, many of the top-scoring systems use ensemble classifiers to compensate for the potential weaknesses of individual base classifiers. Second, several systems adopt strategies to increase the number of training examples. These include augmentation through translation (and back-translation) or transfer learning setups in which one pre-trains models on a similar task and then fine-tunes the model on the task at hand. If we compare the usage of neural and traditional classifiers, we do not observe that one strand is consistently superior to the other. This observation corresponds to our observation regarding the state of the art in stance detection (cf. Chapter 3), where neural and non-neural methods are also often on par. Not only this tendency, but also the methods used in general are similar to the methods used in stance detection (e.g. ngrams and the usage of specific lexica). This emphasises that the two tasks have a clear relationship.
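One simple way in which an ensemble can combine its base classifiers is majority voting, sketched below. The three rule-based base classifiers are hypothetical toy functions built around the windmill examples of this chapter. The systems discussed above combine trained models and may use other combination schemes (e.g. averaging or stacking), so this is an illustration of the principle rather than of any particular submission.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the labels predicted by several base classifiers for one instance."""
    counts = Counter(predictions)
    # Among equally frequent labels, most_common returns the first one seen.
    return counts.most_common(1)[0][0]


def ensemble_predict(classifiers, text):
    """Apply every base classifier to the text and return the majority label."""
    return majority_vote([clf(text) for clf in classifiers])


# Hypothetical base classifiers, each a plain function from text to a label.
def lexicon_clf(text):
    words = {"hideous", "disgusting"}
    return "HATEFUL" if words & set(text.lower().split()) else "NOT HATEFUL"


def shouting_clf(text):
    return "HATEFUL" if text.isupper() else "NOT HATEFUL"


def threat_clf(text):
    lowered = text.lower()
    return "HATEFUL" if "tear" in lowered and "down" in lowered else "NOT HATEFUL"


classifiers = [lexicon_clf, shouting_clf, threat_clf]
print(ensemble_predict(classifiers, "I will tear these hideous windmills down"))
print(ensemble_predict(classifiers, "Windmills are expensive"))
```

In the first example the lexicon and threat rules outvote the shouting rule, which shows how an ensemble can compensate for a base classifier that misses a case.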

The discussion above shows that today there is an active research community which examines the automatic detection of hateful stances or related phenomena. However, when we started to devote ourselves to this topic in 2015, there were only a few related works such as Warner and Hirschberg (2012), Dinakar et al. (2012), or Spertus (1997). As a consequence, there was a lack of publicly available datasets (for instance, the first ACL Workshop on Abusive Language (Waseem et al., 2017a) featured only three datasets, one of which is the dataset that was created by us and is described below) and thus there were few possibilities to comparatively evaluate detection approaches. This lack of available datasets also means that there was little variance in the formalizations of hateful stance which have been examined in the form of concrete annotation experiments.

In the rest of this chapter, we will deal with the annotation, analysis, and automatic prediction of two formalizations of hateful stance towards a single target: (i) a binary polarity (HATEFUL vs. NOT HATEFUL), and (ii) a fine-grained polarity ranging from MOST HATEFUL to LEAST HATEFUL.

Single Target Hateful Stance Towards Refugees

Now, we describe how we create a dataset which contains German tweets that are annotated with a hateful single-target stance towards refugees. We select the target refugees, because the time of the data creation coincided with a high point of the so-called refugee crisis (Holmes and Castañeda, 2016). In that time period, the debate on the topic often dominated the media landscape and also social media. Hence, we suspect that there should also be a high number of hateful stances in social media sites.

To compile the dataset, we first collected a number of potentially hate-bearing tweets and then conducted two annotation experiments – one relying on expert annotators and a second one relying on annotations provided by laymen. We now describe the collection of tweets in more detail.

Collecting Potentially Hate-bearing Tweets

In order to create a suitable dataset, we adapt the approach that was used to create the datasets for the shared tasks on single-target stance (cf. Chapter 3) and that is also frequently used in similar initiatives. Hence, we first defined ten hashtags that are potentially used in German tweets that express a hateful stance towards refugees. These hashtags are: #Pack (vermin), #Aslyanten (asylum seekers – negative connotation), #WehrDich (defend yourself or fight back), #Krimmigranten (portmanteau of criminal and immigrants), #Rapefugees (portmanteau of rape and refugee), #Islamfaschisten (Islam fascists), #RefugeesNotWelcome, #Islamisierung (islamization), #AsylantenInvasion (invasion of the asylum seekers), and #Scharia (Sharia – the Islamic religious law). The list was created through iterative group discussions, in which we also checked that the brainstormed hashtags are frequently used. Next, we polled the Twitter API to collect tweets that contain these hashtags, which resulted in 13 766 tweets (collected roughly in the period between February and March 2016).

As the collected tweets frequently contain non-textual content, we remove tweets that solely contain links or images. To ensure that tweets are understandable without context, we also remove tweets that reference others (i.e. retweets and replies). In addition, we observe that several tweets are duplicates or near-duplicates. Hence, we remove tweets that have a normalised Levenshtein distance smaller than .85 to another tweet.
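The near-duplicate filter can be sketched as follows. This is a minimal illustration and not the implementation used for the dataset: the text does not state how the Levenshtein distance is normalised, so dividing by the length of the longer tweet is an assumption, as are the function names. The threshold is exposed as a parameter and follows the wording above (a tweet is dropped if its normalised distance to an already kept tweet is below the threshold).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalised_distance(a: str, b: str) -> float:
    """Assumption: normalise by the length of the longer string (range 0 to 1)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def drop_near_duplicates(tweets, threshold=0.85):
    """Keep a tweet only if it is at least `threshold` away from every kept tweet."""
    kept = []
    for tweet in tweets:
        if all(normalised_distance(tweet, other) >= threshold for other in kept):
            kept.append(tweet)
    return kept
```

The greedy loop keeps the first tweet of each group of near-duplicates, so the result depends on the order of the input; sorting the tweets by timestamp beforehand would be one way to make the choice deterministic.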

A manual inspection of the collection also shows that some of the hashtags are not indicative of a hateful stance towards refugees. Specifically, the hashtag #Pack attracted a large number of tweets which are aimed at other targets. Consequently, we removed tweets which only contain the hashtag #Pack. Finally, we inspected the collection and manually removed tweets which are incomprehensible or too difficult to understand without their original context. The resulting dataset contains 541 tweets, which is – compared to other datasets such as Waseem and Hovy (2016) – small in size. However, our elaborate filtering process should ensure that the quality of the data is comparatively high.

Expert Annotations

As a first step, a group of experts annotated the collected tweets for being HATEFUL or NON-HATEFUL towards refugees. As expert annotators, we rely on a group of PhD students from the University of Duisburg-Essen, who joined forces in a working group on the subject of hate speech. Within the working group, they discussed literature on the subject and thus gathered theoretical expertise on the topic.

For collecting the expert annotations, we randomly split the 541 tweets into six parts that contain roughly the same number of tweets. Next, each part is labeled by two of the six annotators. In detail, they decide for each tweet whether it contains hatefulness towards refugees. This decision corresponds to our definition of hateful stance towards a single target. Before the annotation, the annotators were instructed to base this decision on their personal intuition.

Table 6.1

part	annotator pair				$\Delta$
1	Expert₁:	23%	Expert₂:	42%	19%
2	Expert₃:	8%	Expert₁:	22%	14%
3	Expert₄:	11%	Expert₅:	19%	7%
4	Expert₆:	29%	Expert₃:	6%	23%
5	Expert₅:	13%	Expert₆:	3%	10%
6	Expert₂:	85%	Expert₄:	40%	45%

Proportions of the dataset that are labeled as being hateful by the expert annotators.

Table 6.1 shows the percentages of tweets that are labeled as being hateful by each annotator. The table also shows that there are substantial differences between the annotator judgments. In the extreme case, the annotators even disagree on almost every second tweet (i.e. Expert₂ and Expert₄ on part 6). This amount of disagreement is also reflected in a low inter-rater reliability, with a Krippendorff’s $α$ of $.38$.

Furthermore, the table shows that there is a large variance in the differences for the individual annotator pairs. Specifically, this variance ranges from a difference in 7% of the tweets (Expert₄ and Expert₅ on part 3) to a difference in 45% of the tweets (Expert₂ and Expert₄ on part 6). If we compare the annotations of individual experts across different parts of the data, then we observe a slight tendency of annotators to label comparatively small or large numbers of tweets as hateful. For instance, while the annotations of Expert₃ result in comparably low percentages (8% and 6%), the annotations of Expert₂ result in comparably high percentages (42% and 85%). However, as demonstrated by the two scores of Expert₂, there is also substantial variance within the individual tendencies. This – as the partition of the dataset was made randomly – suggests that the intra-rater consistency of the annotations is also low.
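For reference, Krippendorff’s $α$ for nominal labels can be computed from a coincidence matrix, as in the following minimal sketch. The function name and the data layout (one list of labels per tweet) are illustrative assumptions, and the toy labels are invented for the example; they are not taken from our dataset.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list of labels per annotated item; items with fewer than
    two labels cannot be paired and are skipped.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # every ordered pair of labels within an item, weighted by 1 / (m - 1)
        for first, second in permutations(labels, 2):
            coincidences[(first, second)] += 1 / (m - 1)

    totals = Counter()
    for (first, _), weight in coincidences.items():
        totals[first] += weight
    n = sum(totals.values())

    # observed and expected disagreement (both up to the same factor 1 / n)
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# toy example: two annotators, ten tweets (1 = HATEFUL, 0 = NOT HATEFUL)
annotator_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
annotator_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(annotator_a, annotator_b)])
print(round(alpha, 3))
```

In the toy example, the two annotators agree on six of ten tweets, yet $α$ stays close to zero, because most of this agreement is expected by chance given how rarely the label HATEFUL is used.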

Laymen Annotations

The low reliability makes us hypothesize that a definition could help to create more consistent annotations. For this experiment, we do not rely on experts, but on laymen, as we try to mimic a situation that is common in social media, where users can mark the contributions of other users as hateful. If laymen can annotate the data with satisfactory quality, it would be possible to easily collect large amounts of training data from social media.

Hence, to test our hypothesis, we conduct an experiment in which we ask two groups of laymen to provide us with hatefulness annotations. Before collecting the annotations, we show Twitter’s definition of hateful conduct to one of the groups (N=25). The definition reads: “[...] You may not promote violence against or directly attack or threaten other people on the basis of race, ethnicity, national origin, sexual orientation, gender, gender identity, religious affiliation, age, disability, or serious disease. [...]” (from https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy). The other group is not provided with any definition (N=31).

To reduce the effort and cognitive load for the annotators, we randomly select twenty tweets from the set of tweets that have been labeled as being hateful by at least one expert. Next, we used an online survey to show the selected tweets to the two groups and let them make the hatefulness annotations. However, instead of solely asking for binary annotations, we experiment with three different ways of asking for hatefulness. Specifically, we ask the annotators to indicate to us (i) whether the tweet is hateful on a binary scale (yes/no), (ii) whether the tweet should be removed from Twitter, and (iii) how offensive the tweet is on a six-point scale from one (NOT OFFENSIVE AT ALL) to six (VERY OFFENSIVE). The experimental design was approved by the ethics committee of the University of Duisburg-Essen.

Table 6.2

	with definition (N=25)	without definition (N=31)
tweet is hateful (% yes)	33%	40%
tweet should be removed (% yes)	33%	18%
offensiveness (mean)	3.49	3.42

Average responses to the three hatefulness questions of the surveyed laymen groups (i) with and (ii) without definition.

In Table 6.2, we show the average responses of our subjects to the three questions. Overall, we notice that there are only small differences between the two groups. For the binary ratings of whether a tweet is hateful or not, the group without the definition responded slightly more frequently with YES (40% vs. 33%). However, as we show only twenty tweets to the subjects, this difference corresponds to roughly one tweet. In addition, we find that this difference is not statistically significant (Wilcoxon-Mann-Whitney test (Wilcoxon, 1945): $p = .26$).

We observe a clearer – and statistically significant (Wilcoxon-Mann-Whitney test: $p = .01$) – difference for the question of whether a tweet should be removed or not. However, again the size of the effect is quite small (15 percentage points or three out of twenty tweets). For the group that is shown the definition, there is almost no difference in the annotations for the question of whether a tweet is hateful and whether it should be removed. Hence, the presence of a definition seems to blur the difference between these questions. These findings indicate that a definition has a comparatively small influence on the annotations of hateful stance.

Figure 6.2

Reliability (Krippendorff’s $α$ ) for the different groups (with and without hatefulness definition) and HATEFULNESS questions.

In addition, we calculate Krippendorff’s $α$ to measure the inter-annotator reliability of all three questions under both conditions. In Figure 6.2, we visualize the calculated $α$ scores. With a range from $α = .18$ to $α = .29$, we find that the inter-annotator agreement is low for all three questions and for both the group with and the group without definition. Contrary to our hypothesis, definitions do not inherently lead to more reliable data.

The low reliability scores are consistent with a number of other studies that annotated whether comments represent ATTACKS or NOT (Wulczyn et al., 2017; Pavlopoulos et al., 2017) or cyberbullying (Dinakar et al., 2012). More specifically, Wulczyn et al. (2017) report a Krippendorff’s $α$ of $.45$ , Pavlopoulos et al. (2017) report a Krippendorff’s $α$ of $.48$ and Dinakar et al. (2012) a Cohen’s $κ$ of $.4$ . However, our findings are in contradiction with the results of Waseem and Hovy (2016), who report a fairly high $κ$ of $.84$ .

We suspect two reasons behind the higher agreement: First, Waseem and Hovy (2016) state that their data is annotated by a feminist and a woman studying gender studies. It is possible that these two persons have similar stances on the topic of their dataset (i.e. misogyny and racism) and that they therefore tend to annotate more consistently. We will examine this hypothesis in Section 6.3 in more detail.

Second, the tweets which form the basis of the annotations of Waseem and Hovy (2016) are collected by using quite drastic hashtags such as #feminazi, #arabTerror, or #n**ger and also from previously identified hateful tweeters. Hence, it is possible that the collected dataset contains tweets that are more obviously hateful. Thus, in the following section we will examine implicitness as a factor that could cause tweets to be perceived as being more or less hateful.

Implicitness

As described above, in this section, we turn to examining textual properties of the tweets that affect how hateful they are perceived to be. During a manual analysis of the annotations, we observe that tweets which contain strong, graphic language tend to be annotated consistently as being hateful. However, tweets that contain vague and indirect expressions are often associated with substantial variance in the annotations. Hence, we hypothesize that the implicitness of the expressed stances contributes to how hatefulness is perceived.

We illustrate the assumed mechanism with a pair of tweets. The first tweet is a rather implicit example from the previously described collection and concerns a train accident that happened in Bad Aibling, Germany, in 2016. The second tweet is a version of this tweet, which has been phrased more explicitly:

Original implicit version: Everything was quite ominous with the train accident. Would like to know whether the train drivers were called Hassan, Ali or Mohammed #RefugeeCrisis #Islamization (translated to English; original tweet: Alles recht ominös mit dem Zugunglück. Wüsste gerne, ob die Lokführer Hassan, Ali oder Mohammed hießen #Fluechtlingskrise #Islamisierung)

Explicit version: Everything was quite ominous with the train accident. The train drivers were Muslims #RefugeeCrisis #Islamization

We suspect that implicitness could influence the perception of hateful stances in two different ways: On the one hand, the explicitly expressed version may be perceived as more hateful. In the example above, the explicit version directly accuses Muslims of being involved in a train accident. On the other hand, the implicit version could be perceived as more hateful. For instance, as the tweet uses allegedly prototypical Muslim first names, one could argue that the tweet evokes racist stereotypes. Furthermore, the rather factual tone of the explicit version may be perceived as less emotional and thus less hateful.

In order to examine the influence of implicitness on the perception of hatefulness, we need a dataset in which we can experimentally control for implicitness. We created such a dataset by first selecting tweets that implicitly express a hateful stance from the above-described collection. Next, we used a set of linguistic rules to manually create a set of explicit counterparts. Then each set (i.e. the original versions and their explicit counterparts) was presented to a different group of subjects within an online survey. In the online survey, we asked the subjects to provide us with binary hatefulness annotations (HATEFUL or NOT towards refugees) and more fine-grained hatefulness annotations (hatefulness intensity on a scale from one (NOT HATEFUL AT ALL) to six (EXTREMELY HATEFUL)) for each tweet.

We will now explain in more detail how we created the dataset and what findings result from its analysis. Subsequently, we will look at how implicitness affects the automatic detection of hateful stance.

Manufacturing Controllable Explicitness

The basic idea for our dataset is to first identify implicit, but hateful tweets and then to create explicit counterparts of them. We identified the implicit tweets by using two steps: First, we restrict to tweets which have been annotated as being hateful by at least one of the experts (cf. Section 6.2.2).

Next, we identified all tweets that have specific surface forms which are clear indicators of hatefulness. The intuition behind this heuristic is that these surface forms – such as swear words – are an explicit form of hatefulness. We identified such indicators by retrieving words that are strongly associated with hatefulness. To this end, we compute the $Dice$ collocation coefficient (Smadja et al., 1996) for each token in the dataset and inspect those words which are most strongly associated with a tweet being hateful. We find the strongest association for the token #rapefugee and for cognates of rape such as rapist and rapes. To estimate whether these tokens may serve as indicators of hatefulness, we calculate how often their occurrence coincides with a tweet being labeled as hateful. We find that 65.8% of the tweets containing #rapefugee and 87.5% of the tweets containing cognates of rape are labeled as hateful by at least one annotator. From this, we conclude that the identified markers are strong indicators of a tweet’s hatefulness. Hence, we remove all tweets containing the token #rapefugee or cognates of rape. This filtering procedure results in 36 tweets.
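The association between tokens and the hateful class can be sketched as follows. The sketch assumes that the coefficient is computed on the level of tweets (a token counts once per tweet) and that a tweet counts as hateful if at least one annotator labeled it as such; the function name and the input layout are illustrative and not part of the original setup.

```python
from collections import Counter


def dice_scores(tweets, labels):
    """Dice coefficient between each token and the hateful class.

    tweets: list of token lists; labels: list of bools (True = hateful).
    Dice(token) = 2 * |hateful tweets with token| / (|tweets with token| + |hateful tweets|)
    """
    hateful_total = sum(labels)
    with_token = Counter()
    with_token_and_hateful = Counter()
    for tokens, hateful in zip(tweets, labels):
        for token in set(tokens):  # count each token once per tweet
            with_token[token] += 1
            if hateful:
                with_token_and_hateful[token] += 1
    return {token: 2 * with_token_and_hateful[token] / (with_token[token] + hateful_total)
            for token in with_token}


# toy example with invented tokens
toy_tweets = [["a", "b"], ["a"], ["b", "c"], ["c"]]
toy_labels = [True, True, False, False]
scores = dice_scores(toy_tweets, toy_labels)
for token, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(token, score)
```

Tokens with the highest scores are candidates for explicit markers; the share of hateful tweets among the tweets containing a candidate (the second statistic reported above) is `with_token_and_hateful[token] / with_token[token]` in the notation of the sketch.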

After collecting a set of implicit tweets, we define a set of paraphrasing rules according to which the explicit counterparts are created. Paraphrasing refers to the relation between two sentences that have approximately the same meaning (Madnani and Dorr, 2010; Bhagat and Hovy, 2013). One can differentiate between different types of paraphrasing, such as the type Ellipsis (adding or omitting information from the original sentence) or being more or less specific than the original sentence (Rus et al., 2014; Bhagat and Hovy, 2013; Vila et al., 2014). Our paraphrasing rules correspond to these types by, for example, specifying how information needs to be added (e.g. performing coreference resolution) or how quantifiers need to be changed (e.g. changing some refugees to all refugees) to make tweets more explicit. The full set of rules along with directions and examples is shown in Appendix A.4.

For creating the explicit versions of the implicit tweets, we manually check all phrases in the tweets for whether they explicitly reference our target refugees. If a phrase does not contain such explicit references, we apply as many rules as possible to make the tweet as explicit as possible. The paraphrasing process is performed independently by two experts, who chose the same instances of implicit stance, but produced slightly differing paraphrases. Afterwards, the paraphrased versions of the two are merged by agreeing on one of the two alternatives in a discussion.

Study

To examine how implicitness affects how hateful tweets are perceived, we conduct an online survey in which we show the implicit and explicit tweets to two different groups. This means that we choose an experimental between-group design with explicitness as the experimental condition.

For our study, we recruit a total of 101 native German speakers as test subjects (53.4% female, 41.6% male and 1% other). Participants are randomly assigned to one of the conditions. This results in 55 subjects providing hatefulness annotations in the implicit condition and 46 in the explicit condition. In order to familiarize the participants with the topic, we showed them a definition of hatefulness beforehand. The definition is inspired by the definition of hate speech of the European ministerial committee (McGonagle, 2013; http://www.egmr.org/minkom/ch/rec1997-20.pdf).

After the survey was conducted, we first analyze the binary hatefulness annotations. On average, we find that 11.3 out of the 36 tweets (31%) are rated as being hateful in the explicit condition. Similarly, in the implicit condition, 14.4 out of the 36 tweets (40%) are rated as being hateful. However, for both conditions, we notice high standard deviations ($\text{SD}_{\text{explicit}} = 11.3$ and $\text{SD}_{\text{implicit}} = 14.6$), which again emphasizes that hatefulness annotations are highly subjective. To test whether the measured mean scores differ significantly between the two groups, we carry out a $\chi^2$ test. The test shows that the binary annotations are not significantly differently distributed ($\chi^2_{(22, N = 57)} = 4.53, p > .05$).

Next, we analyze the more fine-grained hatefulness annotations. As with the binary annotations, we find similar mean scores for both conditions ($\text{mean}_{\text{explicit}} = 3.9$ and $\text{mean}_{\text{implicit}} = 4.1$). In addition, we again find standard deviations which are rather high for a six-point scale ($\text{SD}_{\text{explicit}} = .98$ and $\text{SD}_{\text{implicit}} = 1.01$). To examine whether the difference between the groups is statistically significant, we conduct a t-test. However, the t-test does not indicate statistical significance ($t(97.4) = 1.1, p > .05$).
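The fractional degrees of freedom reported above are what one would expect from Welch’s unequal-variance t-test. Assuming that this variant was used, the test statistic can be approximated from the rounded summary statistics given in the text (group sizes 55 and 46, means 4.1 and 3.9, standard deviations 1.01 and .98). The function name is illustrative, and computing a p-value would additionally require the cumulative distribution function of the t-distribution, which is omitted here.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group a
    var_b = sd_b ** 2 / n_b  # squared standard error of group b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# rounded summary statistics from the text: implicit (n=55) vs. explicit (n=46)
t, df = welch_t(4.1, 1.01, 55, 3.9, 0.98, 46)
print(f"t({df:.1f}) = {t:.2f}")
```

With these rounded inputs, the sketch yields values close to the reported ones; small deviations are to be expected, because the reported means and standard deviations are rounded.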

Figure 6.3

Change in perceived hatefulness intensity between implicit and explicit versions of the selected tweets.

To analyze the differences between the conditions in more detail, we inspect the mean differences in the perceived hatefulness intensity of each tweet. We visualize these individual differences in Figure 6.3. Overall, we observe that there are clear differences between the individual tweets. These differences range from a mean increase of $+.81$ (tweet $24$) to a mean decrease of $-.97$ (tweet $10$). This means that while there are no significant differences at the level of the whole set of tweets, there are substantial differences at the level of individual tweets. However, as the changes differ in direction (increase or decrease), the differences cancel out when one aggregates over several tweets.

We manually inspect those instances that show a particularly high change in hatefulness intensity. For the cases with a particularly high decrease of intensity, we note that the implicit version is more global and often targets larger groups (e.g. implicit: asylum seekers versus explicit: Muslim asylum seekers). Thus, we suspect that targeting larger groups could be the reason for the higher hatefulness scores. In addition, two of the instances which show a particularly high decrease (i.e. tweet IDs 6 and 10) contain rhetorical questions. This could mean that rhetorical questions are perceived as more intrusive than their explicit counterparts.

For the one case in which the explicit version is substantially more hateful than the implicit version (tweet ID 24), we notice that the tweet contains a threat of violence (implicit: [...] If they steal: cut their hands off *?* [...] versus explicit: [...] If they steal: cut their hands off *!* [...]). Thus, we suspect that threats are perceived as more hateful when expressed explicitly.

The Effect of Implicitness on Automatically Detecting Hateful Stance

Our experiment on the influence of implicitness on the human perception of hatefulness suggests that there is indeed an influence, but that this influence is moderated by several factors (e.g. whether a tweet is a threat or not). We now turn to the question of whether a similar influence can be found for automatic systems that are trained to detect hateful stance.

In Section 6.1.3, we showed that SVMs yield robust and – on some datasets – state-of-the-art results for the prediction of hateful stance. Hence, we adapt the approaches of Waseem and Hovy (2016) and Warner and Hirschberg (2012) to German data and equip an SVM with character, token, and POS uni-, bi-, and trigram features. In addition, we add the type-token ratio and emoticon ratio of each tweet as features. We implement this system using DKPro TC (Daxenberger et al., 2014), version 0.9.0, and the integrated Weka SVM algorithm. For preprocessing, we rely on the Twitter-specific tokenizer by Gimpel et al. (2011) and the POS tagger by Toutanova et al. (2003).

While we are aware that the annotation does not have a sufficient level of reliability, we assume that there is still a signal in the annotation of the experts. In the following experiment, we consider the task as a classification task which tries to determine whether a tweet is POSSIBLY HATEFUL or NOT. Thus, we consider a tweet as having a hateful stance, if at least one annotator flagged it as such. This results in a class distribution of 33% POSSIBLY HATEFUL and 67% NOT HATEFUL tweets. Such a high-recall classifier corresponds to a use-case in which an automatic system pre-filters posts and a human can then make the final decision (e.g. to delete or not).

In order to examine the influence of implicitness, we implement a train-test setup in which the selected, implicit tweets are the test set and the remaining tweets are used for training. To have a reference point for the resulting performance, we additionally calculate the system’s performance on all data within a tenfold cross-validation as well as the performance of a majority class baseline. We report the resulting performance using macro-$F_1$. The results of this experiment are shown in Table 6.3.

Table 6.3

experimental setup	condition	model	macro- $F_1$
cross-validation	implicit	majority class baseline	.40
cross-validation	implicit	SVM	.65
train-test (selected tweets)	implicit	SVM	.10
train-test (selected tweets)	explicit	SVM	.10

Performance of the hatefulness classification on the selected implicit tweets as well as on their explicit counterparts.

If one compares the performance of the majority class baseline (macro- $F_1 = .4$ ) and the performance of the SVM in the cross-validation (macro- $F_1 = .65$ ), we can conclude that our system actually learns something meaningful. Overall, the far-from-perfect performance is in a similar range as the state of the art on other datasets (cf. Fersini et al. (2018), Kumar et al. (2018a), or Wiegand et al. (2018b)).

In the set-up in which we test only on the selected implicit tweets, we obtain a macro-$F_1$ of only .1. We get the same performance regardless of whether we test on the original, implicit version or its explicit counterpart. While the obtained macro-$F_1$ scores outperform the majority class baseline on these tweets (as all tweets are members of the POSSIBLY HATEFUL class, the resulting score is 0), the drop is substantial compared to the performance obtained in the cross-validation.
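The baseline scores can be reproduced with a few lines of code. The following sketch computes macro-$F_1$ from scratch; setting the $F_1$ of a class to 0 when it is undefined (no gold and no predicted instances) is an assumption, as are the function name and the label strings. The 33/67 split mirrors the class distribution reported above, scaled to 100 toy instances.

```python
def macro_f1(gold, predicted, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in classes:
        tp = sum(g == cls and p == cls for g, p in zip(gold, predicted))
        fp = sum(g != cls and p == cls for g, p in zip(gold, predicted))
        fn = sum(g == cls and p != cls for g, p in zip(gold, predicted))
        denominator = 2 * tp + fp + fn
        # assumption: an undefined F1 counts as 0
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


CLASSES = ["HATEFUL", "NOT"]

# majority class baseline on a 33% / 67% class distribution
gold_all = ["HATEFUL"] * 33 + ["NOT"] * 67
baseline_all = macro_f1(gold_all, ["NOT"] * 100, CLASSES)

# majority class baseline on a test set that only contains hateful tweets
gold_selected = ["HATEFUL"] * 36
baseline_selected = macro_f1(gold_selected, ["NOT"] * 36, CLASSES)

# hypothetical upper bound: a classifier that flags every selected tweet
upper_bound = macro_f1(gold_selected, ["HATEFUL"] * 36, CLASSES)

print(round(baseline_all, 2), baseline_selected, upper_bound)
```

Under the stated convention, the sketch also shows a property of this evaluation setup: because the selected test set contains only one class, the $F_1$ of the other class is always 0, so macro-$F_1$ cannot exceed .5 on these tweets even for a classifier that flags every tweet.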

These results indicate that implicitness represents a major difficulty for our detection system. This difficulty can be explained by the fact that the implicit tweets were selected based on whether they contain a token which is strongly associated with the hatefulness class: tweets containing such tokens were excluded, so the selected tweets lack exactly those surface cues. This finding poses a serious problem for automatic approaches for detecting hateful stances. For instance, if automatic systems were used in real-world applications, they could easily be deceived by the use of implicit expressions.

Nuanced Stance and Hateful Stance Towards Women

Figure 6.4

Overview of the data structure of the FEMHATE Dataset. We simultaneously model hateful stance (i.e. a hatefulness score that indicates how hateful an assertion is towards women) and stance on nuanced assertions (i.e. an agreement score that indicates to what extent people agree or disagree with the assertion).

For the final examination in this section, we turn to the question of whether the annotators’ personal stances influence how hateful they perceive statements to be. The underlying hypothesis is that people who agree with a statement may be less inclined to label it as hateful. For instance, most people probably disagree with the statement women are more stupid than men and also consider it as (at least somewhat) hateful. However, if anyone should agree with this assertion, then this person will probably attribute less hate to the statement.

To be able to examine this hypothesis, we simultaneously model stance and hateful stance. Specifically, as we are interested in the relationship between agreeing or disagreeing with statements and their hatefulness, we formalize stance towards nuanced targets as described in Chapter 5. This means that we ask a large number of people to annotate whether they personally agree or disagree with a large number of assertions and use these judgments to calculate an agreement score.

In Section 6.2, we showed that annotating hatefulness is associated with inter-rater inconsistencies and thus low reliability. Hence, in this examination, we rely on BWS (cf. Section 5.1.2) to obtain more consistent hatefulness annotations. The motivation for using BWS to collect hatefulness annotations is that annotators may not share a common absolute scale for hatefulness, but they still might agree when picking the most and least hateful assertion from a set of assertions. In Figure 6.4, we visualize the data structure of this dataset, which we call the FEMHATE Dataset.
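BWS annotations are commonly turned into real-valued scores by simple counting: for each item, the number of times it was chosen as best (here: most hateful) minus the number of times it was chosen as worst (here: least hateful), divided by the number of times it was shown. The following sketch assumes this counting procedure and a hypothetical input layout of (shown items, most hateful, least hateful) tuples; the procedure we actually use is described in Section 5.1.2.

```python
from collections import Counter


def bws_scores(annotations):
    """Counting-based best-worst scores in the range [-1, 1].

    annotations: iterable of (items, best, worst) tuples, where `items` is the
    set of assertions shown together and `best` / `worst` are the two picks.
    """
    appearances, best_counts, worst_counts = Counter(), Counter(), Counter()
    for items, best, worst in annotations:
        appearances.update(items)
        best_counts[best] += 1
        worst_counts[worst] += 1
    return {item: (best_counts[item] - worst_counts[item]) / appearances[item]
            for item in appearances}


# toy example: one 4-tuple of assertions judged by two annotators
toy_annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
]
toy_scores = bws_scores(toy_annotations)
print(toy_scores)
```

In the toy example, both annotators pick the same assertion as most hateful but disagree on the least hateful one; the disagreement lowers the magnitude of the two affected scores instead of forcing the annotators onto a shared absolute scale.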

We select the subject women, women’s rights, and discrimination of women as the overall target of the following examination. We select this target, as we suspect that stances towards this target are less influenced by current events than, for instance, the stances towards refugees. In addition, since women account for about half of the population, a particularly large group could profit from automatic approaches for detecting hateful expressions.

Moreover, we hypothesize that the perception of hatefulness is influenced by whether one belongs to the targeted group. For example, we suspect that women may perceive the statement women are more stupid than men as more hateful than men do.

In the following, we will describe how we created the FEMHATE Dataset, which contains 400 German assertions about women that are labeled with both nuanced stance and hateful stance. Next, we will describe a sequence of quantitative analyses that test the outlined hypotheses. Finally, we will demonstrate how judgment similarity (cf. Section 5.2.4) can be used to predict the hatefulness annotations of the collected assertions.

The FEMHATE Dataset

To collect a dataset of assertions which are labeled with both nuanced and hateful stance, we adapt our approach of quantifying qualitative data (cf. Section 5.1). Consequently, we conduct (i) a qualitative phase in which we crowdsource a large number of assertions about women, and (ii) a quantitative phase in which we collect annotations indicating agreement and disagreement, and hatefulness towards women.

We are aware that the way we generate assertions may introduce artifacts to our dataset that make it different from data which is collected from social network sites directly. However, due to the free generation of the utterances, we are less prone to biases that are inevitably introduced by a keyword-based process of data collection. For example, if one collects tweets by searching for the term bitch, it is not surprising that hatefulness annotations in this dataset are strongly associated with this term.

In addition, there may be ethical concerns about annotating people’s posts with sensitive judgments such as hatefulness – especially if these people have not given their informed consent and one plans to make that data available to other researchers. The experimental design, which is described in the following, was reviewed and approved by the ethics committee of the University of Duisburg-Essen. We will now describe the two phases in more detail.

Crowdsourcing Assertions About Women

To generate a variety of assertions, we constructed an online survey in which we ask participants to freely formulate assertions relevant to our topic. Within the survey, we presented the participants with a definition of the topic and a number of sub-topics. These sub-topics include: gendered language (e.g. waitresses vs. wait staff), legal differences between men and women (laws for divorce and custody), professional life (e.g. differences in salary, leadership positions, women in the army), social roles (e.g. ‘typical women’s interests’, women and family, ‘typical women’s jobs’), biological differences, and gender identity. The aim of this step is to narrow down the topic for the subjects. In the directions of the survey, we explicitly reminded the subjects that these sub-topics are meant as a source of inspiration, but that they are not limited to them when coming up with the assertions.

Since we want a large variance of assertions that ranges from highly controversial to uncontroversial assertions, we used three different ways of asking for assertions. First, we asked the subjects to provide us at least three assertions with which they personally agree. Second, we asked them to formulate at least three assertions with which they personally disagree. Third, we requested at least three assertions with which they personally agree, but which they would not express in public. However, as the subjects may feel uncomfortable about answering the third question, the question was optional. To clarify the three questions, we provided example answers. Specifically, for each question we showed one example which takes a $\oplus$ stance towards women and one example which takes a $\ominus$ stance towards women.

As we are interested in self-contained assertions, the participants were instructed to avoid expressions that indicate subjectivity (e.g. I tend to think), co-reference or references to other assertions, and hedging (e.g. indicated by maybe, perhaps, or possibly). In addition, we asked the participants to formulate the assertions in a way that a third person can agree or disagree with them. We manually removed assertions which do not adhere to the provided directions or which are incomprehensible without further context.

To capture assertions that cover a wide range of nuanced stances, we posted the link to our online survey in various online forums that have a thematic connection to the target women. These forums include communities whose members probably mostly have a $\oplus$ stance towards the topic (e.g. the German subreddit from women for women, r/Weibsvolk) and communities that are expected to have a rather $\ominus$ stance towards the topic (e.g. the Facebook group gender mich nicht voll (don’t gender me)). In addition, to capture less extreme opinions, we also posted the link to topically unrelated communities (e.g. the public Facebook group of the University of Duisburg-Essen).

Initially, this process resulted in 810 assertions from 81 participants. That means that – on average – each subject generated ten assertions, although the survey required a minimum of six assertions. From this, we conclude that the voluntary participants were highly motivated to share their opinions on the topic.

After removing assertions that did not meet our guidelines, 627 assertions remained. From this set, we randomly selected 400 assertions for the subsequent qualitative phase.

Collecting Stance and Hatefulness Scores

To ensure high quality data, we conducted the second phase of the data collection in a laboratory study. Hence, we invited voluntary participants (i) to annotate whether they agree or disagree with the collected assertions and (ii) to provide best-worst annotations which indicate the hatefulness of the assertions. As we hypothesize that being a member of the targeted group (i.e. being a woman) is potentially an important factor in the perception of hatefulness, we systematically controlled for this variable in our experiment. Hence, we asked 40 female and 40 male subjects to participate in our study.

The subjects were compensated for their participation with either 15 € or subject hour certificates, as needed by their study program.

To minimize the effect of other group variables such as age or educational background, we recruited a rather homogeneous group of subjects – students of the University of Duisburg-Essen. To describe the sample in more detail, we will now report some demographics of the collected group: The subjects report a mean age of $23.4$ years (standard deviation: $4.3$ ). 78% of the subjects report that they are undergraduate students, 21% graduate students and 1% had a different educational level or did not provide this information. From these descriptive statistics, we conclude that our sample is indeed a fairly homogeneous group.

For collecting the nuanced stance annotations, we let each of the participants indicate whether they personally agree or disagree with each of the 400 assertions. To enable an efficient decision process, we let the subjects judge the assertions by using arrow keys (left arrow: DISAGREE, right arrow: AGREE). This principle resembles modern applications that are concerned with the evaluation of people, goods or other things (e.g. Tinder, Stylect, Jobr, or Blynk) and is therefore familiar to the participants.

To prevent fatigue effects, we let the subjects provide the annotations in five sessions (containing 80 assertions each) with a 60-second break between the sessions. To avoid any effects that may result from the order in which we present the assertions, we used a random order for each participant.

As for our dataset on nuanced stance (cf. Section 5.2.1), we calculate an agreement score for each assertion. However, since, in the FEMHATE Dataset, all subjects judged all assertions, we simply use the percentage of times subjects agree with an assertion as the agreement score. That means that the resulting score ranges from zero (everyone disagrees with the assertion) to one (everyone agrees with the assertion). Consequently, assertions that have an agreement score of $.5$ are the most controversial assertions.
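The agreement score described above can be illustrated with a minimal sketch. It assumes that the judgments are stored as one list of booleans per subject (True for AGREE, False for DISAGREE); the function names and the data layout are hypothetical and not taken from the original study.

```python
def agreement_scores(judgments):
    """Return, per assertion, the fraction of subjects who agree with it.

    judgments: list of rows, one row per subject; each row holds one
    boolean per assertion (assumed layout, True = AGREE).
    """
    n_subjects = len(judgments)
    n_assertions = len(judgments[0])
    return [
        sum(row[j] for row in judgments) / n_subjects
        for j in range(n_assertions)
    ]


def most_controversial(scores):
    """Index of the assertion whose agreement score is closest to .5."""
    return min(range(len(scores)), key=lambda j: abs(scores[j] - 0.5))


# Toy example: four subjects, three assertions.
toy = [
    [True, False, True],
    [True, False, False],
    [True, False, True],
    [True, False, False],
]
scores = agreement_scores(toy)
print(scores, most_controversial(scores))
```

In the toy example, the first assertion is accepted by everyone, the second is rejected by everyone, and the third splits the group evenly, which makes it the most controversial one.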

As described above, we use BWS to measure the HATEFULNESS of each assertion. Therefore, we first use the tool of Kiritchenko and Mohammad (2017) to generate 600 $4$ -tuples from the 400 assertions. This script ensures that (i) each $4$ -tuple occurs only once in the resulting set, (ii) each assertion occurs only once within a tuple and (iii) all assertions appear approximately in the same number of tuples.
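The three constraints on the tuple set can be illustrated with a small randomized sketch. This is not the script of Kiritchenko and Mohammad (2017); it is a simplified stand-in that assumes the number of assertions is divisible by the tuple size, and the function name `make_tuples` is hypothetical. Each round shuffles all assertions and cuts the shuffled list into chunks, so every assertion occurs exactly once per round and therefore equally often overall.

```python
import random
from collections import Counter


def make_tuples(items, n_tuples, tuple_size=4, seed=0, max_tries=100):
    """Randomly build n_tuples distinct tuples with balanced item counts."""
    per_round, remainder = divmod(len(items), tuple_size)
    if remainder or per_round == 0 or n_tuples % per_round:
        raise ValueError("this sketch only handles evenly divisible sizes")
    rounds = n_tuples // per_round
    rng = random.Random(seed)
    for _ in range(max_tries):
        tuples = set()
        for _ in range(rounds):
            shuffled = list(items)
            rng.shuffle(shuffled)
            # Chunks of one shuffle never repeat an item within a tuple.
            for i in range(0, len(shuffled), tuple_size):
                tuples.add(frozenset(shuffled[i:i + tuple_size]))
        # A smaller set means some tuple was drawn twice; start over.
        if len(tuples) == n_tuples:
            return [tuple(sorted(t)) for t in tuples]
    raise RuntimeError("no duplicate-free tuple set found")


tuples = make_tuples(range(400), 600)
counts = Counter(a for t in tuples for a in t)
print(len(tuples), set(counts.values()))
```

With 400 assertions and 600 tuples of size four, every assertion appears in exactly six tuples under this scheme.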

Next, we show these 4-tuples to the subjects and ask them to indicate (i) which assertion is most hateful towards women and (ii) which assertion is least hateful towards women. As in our study on implicitness, we provide the annotators with the definition of hate speech by the Council of Europe (McGonagle, 2013) to familiarize them with the task. To limit the overall workload of the annotators, we show each of the 600 tuples to four female and four male subjects.

Given the collected BWS annotations, we calculate a hatefulness score (hs) for each assertion $a$ . As in our investigation on strength of support or opposition (cf. Section 5.1.2), we therefore rely on the formula by Orme (2009):

hs(a)=\%most(a) - \%least(a)

Hence, the resulting score ranges from $-1$ (LEAST HATEFUL) to $1$ (MOST HATEFUL).
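The counting procedure behind the hatefulness score can be sketched as follows. The sketch assumes that each BWS annotation is available as a triple of the shown tuple, the assertion picked as most hateful, and the assertion picked as least hateful; this data layout and the function name are hypothetical.

```python
from collections import Counter


def hatefulness_scores(annotations):
    """Compute hs(a) = %most(a) - %least(a) for every assertion.

    annotations: iterable of (shown_items, most_id, least_id) triples
    (assumed layout). Percentages are relative to how often an
    assertion was shown.
    """
    shown, most, least = Counter(), Counter(), Counter()
    for items, most_id, least_id in annotations:
        shown.update(items)
        most[most_id] += 1
        least[least_id] += 1
    return {a: (most[a] - least[a]) / shown[a] for a in shown}


# Toy example: the same 4-tuple judged by two annotators.
toy_annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
]
print(hatefulness_scores(toy_annotations))
```

In the toy example, assertion `a` is always chosen as most hateful and thus receives a score of 1, while `b` is never chosen and receives a score of 0.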

Table 6.4

	Number
Assertions	400
Agreement judgments	32,000
BWS judgments	4,800

Overview of number of assertions and collected annotations in the FEMHATE Dataset.

In Table 6.4 we provide an overview on the number of assertions and annotations in the collected dataset. Next, we will describe how the dataset can be analyzed in a way that allows us to accept or reject our hypotheses.

Analysis

To verify our hypotheses, we first inspect the obtained hatefulness and agreement scores in isolation. Thereby, we examine the reliability of our data collection and compare the scores resulting from the judgments of the male subjects with the scores that result from the judgments of the female subjects. Subsequently, we analyze whether the stances of our subjects correlate with their perception of hatefulness by investigating the relationship between hatefulness and agreement scores.

Hatefulness Scores

Figure 6.5

Distribution of hatefulness scores obtained using best-worst scaling. Scores are discretized into ten bins. Positive scores (gold) indicate a high degree of hatefulness and negative scores (violet) indicate a low degree of hatefulness.

We start our analysis with an examination of the distribution of the calculated hatefulness scores. To this end, we inspect a histogram of the scores and notice that their distribution resembles a Gaussian distribution. We show that histogram in Figure 6.5. To better understand the distribution, we manually inspect the assertions with the highest and lowest hatefulness scores.

The three assertions ranked as most hateful are:

There are women who enjoy being raped. This kind of women form the non-lesbian portion of the goody-goody parties such as the Greens. (hatefulness score: 1.0)
Women belong in the kitchen. (hatefulness score: .98)
Women are incapable and too stupid. (hatefulness score: .98)

We argue that these assertions are all quite drastic examples of belittling women or their abilities. The fact that these assertions are ranked as most hateful is a qualitative indication of the validity of our annotation. When comparing the judgments of male and female participants, we find that these three assertions are amongst the ten most hateful assertions for both subgroups.

The following three assertions are ranked as least hateful:

Women aren’t weak and shouldn’t feel that way. (hatefulness score: -.84)
Women and men are equally intelligent. (hatefulness score: -.84)
Women are strong and to have them in a team makes every economic as well as social company stronger. (hatefulness score: -.89)

We argue that these annotations coincide with the opinion of people at large. We again find that the three assertions are among the ten least hateful assertions for both gender subgroups.

This qualitative analysis of the dataset indicates that the annotators largely agree on the hatefulness annotations. To analyze the reliability of our annotations quantitatively, we additionally compute split-half reliability (Louviere, 1993; Kiritchenko and Mohammad, 2016). The underlying assumption of split-half reliability is that an annotation procedure is only consistent if one can divide the set of annotators into two halves and the annotations of the two halves correlate strongly. Hence, we randomly split the participants into two halves, compute the scores for each half, and calculate the Pearson correlation of the two halves’ scores.

To avoid random effects, we compute split-half reliability 100 times and calculate the average correlation across the 100 repetitions. (Note that Pearson’s $r$ is defined in a probabilistic space and therefore cannot be averaged directly. Hence, we first z-transform the coefficients, average them, and then transform the average back into the original range of values.) If we calculate split-half reliability for the whole set of participants (i.e. both females and males), we obtain a rather strong correlation coefficient of $r = .90$ . From this, we conclude that our BWS annotations result in fairly reliable hatefulness scores.
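The repeated split-half procedure can be sketched in a few lines. The sketch reads the z-transformation mentioned above as Fisher's z-transformation (the inverse hyperbolic tangent), which is an assumption. The helper names and the `score_fn` callback, which maps a list of annotators to one score per assertion, are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_correlations(rs):
    """Average correlations via Fisher's z (atanh), then map back (tanh)."""
    # Clip to keep atanh finite when a split correlates perfectly.
    clipped = [max(-0.999999, min(0.999999, r)) for r in rs]
    return math.tanh(sum(math.atanh(r) for r in clipped) / len(clipped))


def split_half_reliability(annotators, score_fn, repetitions=100, seed=0):
    """Mean correlation between the scores of two random annotator halves."""
    rng = random.Random(seed)
    rs = []
    for _ in range(repetitions):
        shuffled = list(annotators)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        rs.append(pearson(score_fn(shuffled[:half]), score_fn(shuffled[half:])))
    return average_correlations(rs)


print(average_correlations([0.9, 0.5]))
```

Averaging in z-space gives more weight to strong correlations than the arithmetic mean does: for the two coefficients .9 and .5, the result lies above .7.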

In addition, we compute split-half reliability separately for the female and male subjects. We find that the correlation coefficients of the female participants ( $r = .82$ ) and male participants ( $r = .81$ ) are noticeably lower, although nonetheless substantial. Interestingly, both females and males yield almost the same correlations.

To investigate whether female and male participants disagree in hatefulness annotations, we also compute the split-half reliability with one half being the group of males and one half being the group of females. This comparison results in a correlation coefficient of $r = .93$ . This indicates that male and female subjects largely agree on hatefulness annotations.

While there is a high level of agreement amongst the gender groups, there are also assertions that are associated with substantial disagreement. In the following we discuss these assertions. We find that assertions with a large score difference between the genders often attack female activists or gender equality, but do not target women in general. Examples of such assertions include:

Gender equality actually just means “favoring women”. (hatefulness score_female: .67; hatefulness score_male: .25)
Feminists are man-hating women, who found no happiness in life. (hatefulness score_female: .83; hatefulness score_male: .42)

Next, we take a closer look at the agreement scores.

Agreement Scores

Figure 6.6

Distribution of agreement scores in the FEMHATE Dataset. Scores are discretized into ten bins. We use a color scheme to encode how positive (green) or negative (red) the scores are.

As for the hatefulness scores, we start our analysis by inspecting histograms of the calculated agreement scores. In Figure 6.6 we show the histograms of the agreement scores calculated from the full set of subjects (Subfigure a), from the set of female participants (Subfigure b), and from the set of male participants (Subfigure c). For the whole set of subjects, we observe that the scores are rather evenly distributed across the range of possible agreement scores. The mass of the distribution is slightly bigger in the lower range of scores (below $.5$ ). This means that there are more assertions towards which the majority of subjects have a $\ominus$ stance than assertions towards which the majority have a $\oplus$ stance. Accordingly, the average agreement score over all assertions is $.42$ . For the male subjects the scores are distributed even more evenly, which results in a slightly higher mean agreement score of $.44$ . As this score is closer to $.5$ , we conclude that the assertions are more controversial for male subjects.

In contrast to the male subjects, we observe a higher frequency of scores around $1.0$ and $.0$ for the female subjects. It is possible that women, being the target of the hate, are more affected and thus more extreme in their judgment. In addition, we notice a higher frequency of assertions with a $.0$ score than assertions with a $1.0$ score. This is also reflected in a slightly lower mean agreement score ( $.41$ ).

Analogous to the BWS annotations, we calculate split-half reliability for the agreement scores to estimate how reliable our data collection is. We again compute split-half reliability for the whole group, for the group of female subjects, and for the group of male subjects.

For the whole group, we obtain a correlation coefficient of $r = .96$ , from which we conclude that the agreement scores can be regarded as being robust. The correlation coefficients of the male ( $r = .92$ ) and female ( $r = .95$ ) group are only slightly lower and thus also quite robust. We argue that the strong correlations indicate that our group of subjects is indeed very homogeneous.

If we compute the correlation between the scores of the female and male participants, we obtain a coefficient of $r = .83$ . This means that although there is a correlation of the resulting scores, there are also substantial differences between the genders. We manually inspect assertions which are associated with a particularly high level of disagreement and find that most of them are concerned with female quotas. For instance, the largest difference in the agreement score can be found in the following two assertions:

Female quotas are nonsense. (agreement score_female: .10; agreement score_male: .53)
Female quotas are cosmetic restraints and constitute a form of discrimination. (agreement score_female: .22; agreement score_male: .65)

Relationship between Agreement and Hatefulness

Figure 6.7

Comparison of agreement and hatefulness scores by the gender groups female, male, and all subjects.

To examine whether the stances of our subjects correlate with their perception of hatefulness, we analyze the relationship between agreement and hatefulness scores. Specifically, we compare the agreement and hatefulness score of each assertion for the whole sample, the female and the male subjects. We visualize this comparison in Figure 6.7.

For the whole group, we find that there is a substantial negative correlation of $r = -.76$ between agreement and hatefulness scores. This indicates that there is indeed a relationship between disagreeing with an assertion ( $\ominus$ stance) and perceiving it as hateful and, conversely, between agreeing with an assertion ( $\oplus$ stance) and not perceiving it as hateful. We observe an even stronger correlation for the scores of the female subjects ( $r = -.79$ ) and a somewhat weaker one ( $r = -.67$ ) for male subjects. We therefore suspect that the relationship between stance and hateful stance may be more pronounced for annotators that are members of the targeted group.

Interestingly, there are no cases with both a high agreement score and a high hatefulness score. This means that, in our data, people do not agree with assertions they perceive as hateful or, conversely, do not perceive assertions they agree with as hateful. In contrast, there are cases with both low agreement and low hatefulness scores. We manually inspect these assertions and find that they often express male clichés such as Men have to like football.

In summary, the analysis of our dataset about women indicates that (i) there are few differences between males and females in the perception of hatefulness, but (ii) that there are differences in what stances males and females have towards the assertions. In addition, we find a strong relationship between nuanced stance and the perception of hatefulness. In the following we show how this relationship can be exploited by automatic systems for predicting hateful stances.

Predicting Hateful Stance Using Judgment Similarity

In Chapter 5 we show that the degree to which assertions are judged similarly by a large number of people (i.e. judgment similarity) can be used to predict stance on nuanced assertions. As nuanced stance and hateful stance correlate strongly, we hypothesize that judgment similarity should be a useful means for predicting hateful stance.

Hence, for predicting the hatefulness score of an assertion, we first calculate judgment similarity for the assertions in our dataset and base our prediction on the most similar assertions. If judgment similarity can indeed be used to approximate hatefulness scores, the judgements found in social media (e.g. thumb-up and thumb-down, or likes and dislikes on social media posts) could serve as an inexpensive source of new judgements for an efficient identification of hateful stances.

We will now first describe the estimation of judgment similarity in more detail. Subsequently, we will describe an SVM-based regression that uses the scores of the most similar assertions as features to represent input instances.

Automatically Estimating Judgment Similarity

As in Section 5.2.4, we rely on the collected agreement matrix in which rows represent the subjects and columns represent the assertions to compute judgment similarity. Hence, we again define judgement similarity as:

JS(a_1, a_2) = \frac{\vec{a_1} \cdot \vec{a_2}}{\vert \vec{a_1} \vert \cdot \vert \vec{a_2} \vert}

with $\vec{a_1}$ being the vector representing all judgments on $a_1$ and $\vec{a_2}$ representing all judgments on $a_2$.

In contrast to Section 5.2.4, we observe that the calculated similarities cover the whole range of values from $-1$ (the participants consistently agree with one assertion of the pair and disagree with the other) to $+1$ (the participants consistently agree or disagree with both assertions).
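Judgment similarity, i.e. the cosine of two judgment vectors, can be sketched as follows. The sketch assumes that agreement is encoded as $+1$ and disagreement as $-1$, and that the agreement matrix is stored as a list of rows (one per subject); both the encoding and the function names are assumptions made for illustration.

```python
import math


def column(matrix, j):
    """Judgment vector of assertion j: one entry per subject (row)."""
    return [row[j] for row in matrix]


def judgment_similarity(a1, a2):
    """Cosine of two judgment vectors (assumed +1 = agree, -1 = disagree)."""
    dot = sum(x * y for x, y in zip(a1, a2))
    norm = math.sqrt(sum(x * x for x in a1)) * math.sqrt(sum(y * y for y in a2))
    return dot / norm


# Toy agreement matrix: four subjects (rows), three assertions (columns).
matrix = [
    [1, -1, 1],
    [1, -1, -1],
    [-1, 1, 1],
    [-1, 1, -1],
]
print(judgment_similarity(column(matrix, 0), column(matrix, 1)))
print(judgment_similarity(column(matrix, 0), column(matrix, 2)))
```

In the toy matrix, every subject judges the first two assertions in opposite ways, which yields a similarity of $-1$, whereas the judgments on the first and third assertion are unrelated and yield a similarity of zero.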

We suspect two reasons behind this broader range: First, in the FEMHATE Dataset, all persons judged all assertions. Consequently, in contrast to the dataset described in Chapter 5, there are no default zero scores that make all assertions somewhat similar. Second, we suspect that the subjects that were recruited to annotate the FEMHATE Dataset are a more homogeneous group. A homogeneous group can be expected to respond homogeneously to certain assertions.

To be able to use approaches that are based on judgment similarity in real scenarios, we need to estimate the judgment similarity of two assertions from their texts. Otherwise, our approach would not be applicable, for example, to assertions that were uttered only recently and thus have not yet been judged by a large number of people.

Hence, we re-use the Siamese Neural Network approach which is described in Section 5.3.1 and for which we showed that it is well suited for the task at hand. Consequently, our SNNs consist of two identical subnetworks and a final merge-layer that calculates the cosine between the representations resulting from the subnetworks. Again, each subnetwork consists of an embedding layer (German FastText vectors by Bojanowski et al. (2017)), a convolution layer (filter size of two), a max-pooling over time layer, and a final dense layer with 100 nodes.

We evaluate the prediction approach using a ten-fold cross-validation and obtain a Pearson correlation of $r = .72$ between the predicted and the gold similarity. From this substantial correlation, we conclude that our architecture is able to successfully estimate judgment similarity from text. However, since the correlation is still far from being perfect, we conduct a manual error analysis by inspecting the assignment of gold to prediction within a scatterplot.

Figure 6.8

Correlation of gold judgment similarity and judgment similarity as predicted by the SNN for the FEMHATE Dataset.

We show this scatterplot (gold: x-axis; prediction: y-axis) in Figure 6.8. The scatterplot shows a clear correlation between the predicted and gold similarities, which corresponds to the measured Pearson correlation.

However, the scatterplot also shows a systematic error: there are cases for which the SNN predicts a similarity below $.0$ , but which actually have a high similarity of $> .5$ . We manually inspect these problematic cases and find that they often cover pairs in which the two assertions have low textual overlap or low semantic relatedness (e.g. the pair: All girls love horses. and Women out of the Bundestag. or The women’s quota is nonsense. and Gender equality is an over-representation of women.). We suspect that the neural network is not able to estimate a relationship between the two assertions as there is little semantic connection between the contained tokens. One possible way to overcome this limitation could be to use word representations that account for the specific lexical semantics of the dataset (as suggested in Section 3.3.3). For example, in a general context, the words quota and equality are less related than in the context of women’s rights. If we could model this semantic shift by suitable vectors, the performance of our SNN could probably be further increased.

Hatefulness Prediction

To examine whether judgment similarity can be utilized for predicting hatefulness scores of assertions, we implement an SVM-based regression. For training the SVM, we represent the input instances by their most similar assertions. As a reference approach, we implement an SVM-regression equipped with 1-3 gram features – a system that yields highly competitive performance for similar tasks (cf. Section 6.1.3). We evaluate the systems based on gold and SNN judgement similarity within a ten-fold cross-validation with Pearson’s $r$ as a performance metric. To enable a fair comparison, the approach, which is based on judgment similarity, only uses the score of the $n$ most similar assertions in the respective training set. For our system based on judgment similarity, we experiment with different amounts of $n$ most similar assertions. Furthermore, we experiment with representing the input instances with both the hatefulness score and the agreement score of the most similar assertions.
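The feature representation used by the judgment-similarity systems can be sketched as follows. The sketch only covers the construction of the feature vector from the $n$ most similar training assertions; the SVM regression itself is omitted. The function name and the dictionary-based data layout are hypothetical.

```python
def most_similar_features(similarities, train_scores, n):
    """Represent an input assertion by the scores of its n nearest neighbours.

    similarities: dict mapping each training assertion to its (gold or
    predicted) judgment similarity with the input assertion.
    train_scores: dict mapping each training assertion to its hatefulness
    or agreement score (assumed layout).
    """
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return [train_scores[t] for t in ranked[:n]]


# Toy example with three training assertions.
sims = {"a": 0.9, "b": -0.2, "c": 0.5}
scores = {"a": 0.8, "b": -0.6, "c": 0.1}
print(most_similar_features(sims, scores, 2))
```

Restricting the lookup to the assertions of the respective training fold, as described above, keeps the scores of test assertions out of the features.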

Table 6.5

feature set	representation	$n$ most similar	Pearson’s $r$
ngrams	uni-, bi-, and trigrams	-	.35
judgment similarity (SNN)	agreement score	1 / 25 / 50 / 75	.25 / .39 / .48 / .49
judgment similarity (SNN)	hatefulness score	1 / 25 / 50 / 75	.25 / .52 / .54 / .53
judgment similarity (gold)	agreement score	1 / 25 / 50 / 75	.70 / .70 / .63 / .61
judgment similarity (gold)	hatefulness score	1 / 25 / 50 / 75	.75 / .67 / .65 / .65

Performance (Pearson correlation between gold and prediction) of the SVM-based regression for predicting fine-grained, hateful stance. We represent the input instances by (i) ngrams, (ii) the agreement and hatefulness score of their most similar assertions (gold), and (iii) the scores of their most similar assertions as predicted by the SNN.

We show the results of this experiment in Table 6.5. Since our dataset is – with 400 assertions – comparatively small, it is not surprising that a direct prediction based on ngrams yields a low performance of $r = .35$ .

All approaches based on gold judgment similarity perform substantially better. With a performance of $r = .75$ , our best configuration (i.e. using the hatefulness score of the single most similar assertion to represent an instance) outperforms the SVM equipped with ngrams by a wide margin. This result underlines that the scores of similarly judged assertions indeed provide a strong signal for predicting hatefulness. We also find that using $n > 1$ does not improve the performance. We suspect that this is because, by increasing $n$ , we add less similar assertions, whose scores are increasingly less informative for the score we want to predict.

For the approaches based on the estimated judgment similarity, we observe a large variance depending on which $n$ we use. For both scores (i.e. agreement and hatefulness score), we observe substantial improvements between the $n = 1$ and $n = 25$ most similar assertions. However, we only see moderate changes for even larger $n$ . As expected, the predictions based on the hatefulness score outperform the predictions based on the agreement score for most values of $n$ in both the SNN and gold setup.

Overall, the results imply that considering assertions that are judged similarly by a large number of people provides a strong signal for the automatic detection of hateful stance. As such judgments could be extracted from social networking sites (e.g. the likes and dislikes on posts) our approach could also be useful in practical applications. Our evaluation also demonstrated that judgement similarity can be approximated by the mere text of the assertions. Hence, it may be possible to train such models on rather artificial data and to apply them within real-world applications.

Chapter Summary

In this chapter, we examined the annotation and automatic detection of a special, unpleasant form of stance – hateful stance. We formalized hateful stance as a tuple consisting of (i) a target (e.g. refugees or women) and (ii) a polarity (e.g. being HATEFUL or NOT) that is expressed towards that target.

We examined this formalization in two newly created datasets, one targeting refugees and the other targeting women. We find that reliably annotating hateful stance is a hard task for both expert and laymen annotators. While this finding is in line with similar annotation efforts, it raises the question of how meaningful the predictions of systems are that are trained on such data. We therefore argue that such systems, at least until the reliability is substantially improved, should only serve as recommendations for human decision-makers who decide whether to delete problematic posts. Otherwise, such error-prone automatic censorship could have unforeseen consequences for processes of forming public opinion.

Furthermore, our experiments suggest that the perception of whether a text is hateful or not does not become more consistent when we provide definitions to the annotators, and is also influenced by whether the text is expressed implicitly. This underlines that those who train hate detection systems should consider hatefulness annotations with a certain skepticism.

On the dataset concerning women’s rights, we annotated fine-grained hatefulness using a comparative procedure and find that this method results in robust scores. In this dataset, we also find that there is a clear link between nuanced stance (i.e. whether one agrees with an assertion) and hateful stance. We demonstrate that this connection can be utilized for predicting hatefulness if we base our prediction on assertions which are judged similarly by a large number of people. We envision that this finding could be transferred to detecting hateful stances in social network sites by exploiting the judgment mechanisms which are present in almost all of those sites (e.g. up- or down-voting posts).

Conclusion

In this thesis, we examined how computers can be used to detect and analyze the abundance of stances that people express through social media. We cast stance detection as an NLP task in which we try to automatically label a text according to a certain stance formalization. We investigated three different formalizations: stance towards (i) single targets, (ii) multiple targets, and (iii) nuanced targets, as well as (iv) hateful stance as an especially unpleasant form of stance-taking. To train and evaluate systems for solving these tasks, we need data which is manually labeled according to these formalizations. Hence, the contributions of this work fall into three areas: formalizations, data creation, and detection systems. We will now summarize our main findings in these areas (Sections 7.1, 7.2, and 7.3) and subsequently discuss in Section 7.4 the implications that our research might have on further research or practitioners concerned with automatically analyzing stance.

Formalization

With the term formalization, we refer to an abstract model of stance, which is quantifiable and thus digestible by a computer. In Chapter 2, we provided an overview of state-of-the-art stance formalizations. Furthermore, we extended the variety of stance formalisms with a new formalization, which we refer to as stance on nuanced targets.

In all of our stance formalizations, we define stance as a tuple consisting of (i) a target (e.g. death penalty) and (ii) a polarity (e.g. BEING IN FAVOR OF or BEING AGAINST the target). We show that stance is highly related to the NLP formalisms target-dependent, topic-dependent and aspect-based sentiment, as well as claim as defined in argument mining. Through these examinations, we lay the foundations for bundling efforts in these research fields, which are often considered to be independent.

In this thesis, we studied three different formalizations which we distinguish based on the complexity of their targets and, as a consequence, the use-cases they support. In Chapter 3, we studied stance on single targets, which formalizes stance towards a single target and therefore only provides a broad overview of expressed stances. An example application of this formalization is to predict for large numbers of tweets what percentage supports Catalonia’s independence from Spain, what percentage opposes independence and what percentage is neutral to the subject. In Chapter 4, we examined the formalization stance on multiple targets, which simultaneously models stance towards an overall target and a set of predefined, logically linked targets. An example for this formalization is to simultaneously model stance towards the death penalty and subordinated issues such as whether there are humane forms of the death penalty or whether the death penalty is a financial burden to the state. We argue that this formalization is suited to model stance in debates that involve more complex positions and thus supports use-cases in which we want to quantify stance on this more fine-grained level.

However, predefined targets may be still too coarse-grained if we want to obtain a comprehensive understanding of all nuances of a debate, or if we have little knowledge of what aspects are actually important in a debate. Thus, to capture an even more fine-grained notion of stance, in Chapter 5, we propose the formalization stance on nuanced targets, which models a polarity towards all utterances in a given debate. Consequently, one’s stance towards an issue (e.g. climate change) is whether one agrees or disagrees with every single contribution in a debate on the issue.

Finally, in Chapter 6, we adapted our formalization of stance to capture hateful stance – a negative consequence of the fact that everyone can express their stance within social media. Hence, we propose to formalize hateful stance as a tuple consisting of a target (e.g. refugees) and a hatefulness polarity (e.g. BEING HATEFUL or NOT). We assume that this formalization has the potential to contain this unpleasant phenomenon of today’s computer-mediated communication and its consequences (e.g. amplified group polarization, the spread of hateful stereotypes or negative psychological effects for the targeted group).

Data Creation

The second area in which we have contributed to the state of research is methods for creating and analyzing datasets that are annotated with the described formalizations. Our findings result from controlled annotation studies we performed either in laboratory settings or by using crowdsourcing. In detail, our main contributions are:

Reliability of Annotations

We investigated the reliability of stance annotations in various experiments. Data that does not have sufficient reliability poses a serious problem for automated systems that are trained and evaluated on this data, since the resulting models will inevitably be unreliable, too. In two experiments, we find mixed levels of reliability for the annotation of stance on multiple targets and thus (as multiple targets subsume several single targets) for the annotation of stance on single targets. We identified certain phrases whose varying interpretations are likely to be the cause of low reliability scores. In turn, we argue that this problem could be solved by an extensive training of the annotators.

For the annotation of hateful stance, we find a low level of reliability for both expert and lay annotators. This indicates that – even for persons who have extensively studied the phenomenon – hatefulness remains subjective to a large extent. We conclude that classical annotation methods (i.e. a few expert annotators label a large number of documents) reach their limits in the annotation of hateful stance. In our investigations, the use of comparative annotation techniques and the use of larger groups of annotators, whose judgments can be aggregated, have shown promising results in terms of annotation stability.
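One comparative annotation technique is best-worst scaling, in which annotators see a small set of items and select the most and the least extreme one. The sketch below shows the commonly used counting procedure for turning such selections into scores. It is an illustration under assumptions: the tweet identifiers and selections are invented, and the function name `best_worst_scores` is hypothetical.

```python
from collections import defaultdict


def best_worst_scores(annotations):
    """Score items from best-worst annotations.

    annotations: iterable of (items_shown, best_item, worst_item).
    Returns {item: (#best - #worst) / #times_shown}, a value in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, best_item, worst_item in annotations:
        for item in items:
            shown[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Invented data: three annotators each picked the most hateful ("best")
# and the least hateful ("worst") tweet from the same 4-tuple.
annotations = [
    (("t1", "t2", "t3", "t4"), "t1", "t4"),
    (("t1", "t2", "t3", "t4"), "t1", "t3"),
    (("t1", "t2", "t3", "t4"), "t2", "t4"),
]
scores = best_worst_scores(annotations)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because each score aggregates many relative decisions, a single deviating annotator has only a limited influence on the resulting ranking.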

Distribution of Specific Stances

In our experiments on annotating stance towards multiple targets, we compared three predefined target sets on two datasets – one about atheism and one about the death penalty. For all three target sets, we find that the distributions of annotations are strongly imbalanced, which means that only a few targets occur frequently, while the vast majority occur only rarely. This finding indicates – at least for the datasets that rely on keyword-based crawling – that the majority of posters focus on similar targets and that social media debates are not rich in highly specific positions. Furthermore, since automatic approaches to detecting stance require a sufficiently large amount of training examples, we cannot expect that systems are able to recognize stance towards arbitrarily specific targets.

Quantifying Qualitative Data

To investigate stance towards nuanced targets, we proposed a novel data collection method, which we name quantifying qualitative data. Our method consists of two steps that are carried out on a crowdsourcing platform: First, we engaged people to create a comprehensive dataset of assertions that are relevant to a predefined issue. We argue that this approach to creating data results in a more nuanced dataset than keyword-based collection. Next, we collected judgments that indicate whether people agree or disagree with the assertions, and how strongly people support or oppose the assertions.

Our analysis of this data shows that there is a high level of consensus if people judge a comprehensive set of assertions, and that only a few assertions lead to significant dissent. We also propose several new metrics that can be used to rank issues or assertions based on how much dissent or consensus they cause. We demonstrate how these rankings can be used to obtain novel insights into the nature of the corresponding debate.
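How a ranking by dissent can be derived from agree/disagree judgments may be sketched as follows. Note that the two scores below are simplified stand-ins chosen for illustration and are not the metrics proposed in Chapter 5; the assertions and judgments are invented.

```python
def agreement_score(judgments):
    """Mean of +1 (agree) / -1 (disagree) judgments, a value in [-1, 1]."""
    return sum(judgments) / len(judgments)


def dissent(judgments):
    """1.0 if the crowd is split evenly, 0.0 if everybody judges alike."""
    return 1.0 - abs(agreement_score(judgments))


# Invented judgments of four people on three assertions.
judgments_by_assertion = {
    "A": [1, 1, 1, 1],      # unanimous agreement
    "B": [1, -1, 1, -1],    # evenly split
    "C": [-1, -1, -1, 1],   # mostly disagreement
}

# Rank assertions from most to least contested.
most_contested = sorted(
    judgments_by_assertion,
    key=lambda name: dissent(judgments_by_assertion[name]),
    reverse=True,
)
print(most_contested)
```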

Relativity of Hatefulness

As we obtained a low level of reliability of hateful stance annotations, we suspect that there are both textual and non-textual factors that affect one’s perception of hatefulness. In order to reveal the potential influence of such factors, we conducted experiments in which we controlled for (i) the implicitness of the texts and (ii) the annotator’s (nuanced) stance towards the tweets they are asked to annotate. The comparison of implicit and explicit versions of tweets showed that whether a tweet is phrased implicitly influences how hateful the tweet is perceived to be. We also examined the relationship between nuanced stance and hateful stance, and find a strong negative correlation between these two concepts. This means that if one agrees with an assertion, it is unlikely that one perceives the assertion as hateful, or vice versa, that one is unlikely to agree with an assertion one perceives as hateful.

These findings indicate that there are indeed factors that moderate the perception of hatefulness and therefore that hateful stance is dependent on both textual and non-textual factors. This dependency indicates that systems that have been trained on such data make predictions that are also dependent on these factors. If one uses these predictions for decisions, this dependency must be considered accordingly. The findings also stress the need for more research on revealing the influence of moderating factors. We suspect that it would be particularly fruitful to examine the connection between the perception of hatefulness and how strongly one values free speech.
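The kind of relationship described above, a negative correlation between agreement and perceived hatefulness, can be quantified with a correlation coefficient. The following sketch computes Pearson's r on invented per-assertion scores; both the numbers and the choice of coefficient are assumptions made purely for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented scores for five assertions:
# mean agreement in [-1, 1] and mean perceived hatefulness in [0, 1].
agreement = [0.9, 0.6, 0.1, -0.4, -0.8]
hatefulness = [0.1, 0.2, 0.5, 0.7, 0.9]
r = pearson_r(agreement, hatefulness)
print(f"r = {r:.2f}")
```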

Detection Systems

The third major contribution of this thesis is a systematic review of the state of the art in automatic stance detection, as well as an experimental approach for improving beyond this state of the art. Our findings on this technical level are derived from our participation in and organization of shared tasks, and from systems that we trained on the datasets described above. We will now first reflect on the lessons learnt from our efforts in the context of several shared tasks (i.e. SemEval 2016 task 6, StanceCat@IberEval 2017, and GermEval 2017) and subsequently outline the promising results obtained when relying on judgment similarity to predict stance.

State of the Art Stance Detection

First, we find across different datasets that automatic systems managed to outperform the majority class baselines. From this, we conclude that the current state of the art is indeed capable of detecting stance in a meaningful way. However, we also observe that the majority class is a strong baseline and that there is still a significant amount of room for improvement. For instance, in SemEval 2016 task 6, the best system outperformed the majority class baseline by only four percentage points. The strong majority class baselines also emphasize that the class distribution of the datasets is a key factor in how well the systems work. This observation is further supported by our experiment on multi-target stance prediction, which shows that if the distributions are too skewed, a meaningful prediction is hardly possible. Second, across various experiments, we observe that comparably simple text classifiers are highly competitive. In SemEval 2016 task 6, a rather simple SVM equipped with n-gram features even outperformed all other submissions. Since these types of classifiers are standard features of freely available and widely used machine-learning libraries, this finding indicates that users who do not have a deep NLP background can quickly establish competitive stance detection systems.
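The majority class baseline referred to above is straightforward to compute, which is one reason why it is a useful sanity check for any stance detection system. The sketch below uses invented labels and plain accuracy; the shared tasks mentioned above use their own official metrics, which are not reproduced here.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a classifier that always predicts the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return lambda texts: [majority_label for _ in texts]


def accuracy(gold, predicted):
    """Share of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Invented, skewed label distribution in the training data.
train_labels = ["AGAINST"] * 6 + ["FAVOR"] * 3 + ["NONE"] * 1

# Invented test data; the baseline ignores the texts entirely.
test_texts = ["tweet a", "tweet b", "tweet c", "tweet d"]
test_gold = ["AGAINST", "AGAINST", "FAVOR", "AGAINST"]

predict = majority_baseline(train_labels)
baseline_accuracy = accuracy(test_gold, predict(test_texts))
print(f"majority baseline accuracy = {baseline_accuracy:.2f}")
```

The more skewed the class distribution, the higher this baseline becomes, and the harder it is for a trained system to show a meaningful improvement.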

Third, we identify a few more advanced methods that frequently lead to improvements over these baseline systems. These include (i) the usage of word polarity lists, (ii) the usage of word vectors that have been (pre-)trained on large text corpora, and (iii) ensemble learning strategies that combine several base classifiers. We therefore recommend that future approaches to stance detection consider these methods as competitive baselines. Finally, we compared the currently dominant learning paradigm, which is based on neural networks, with more traditional non-neural approaches. In our evaluation, we cannot clearly determine whether one of the two learning paradigms is superior to the other. This finding distinguishes stance detection from other NLP tasks such as part-of-speech tagging or named entity recognition, in which deep neural networks significantly outperform traditional systems (cf. Plank et al. (2016) or Devlin et al. (2018)). However, our findings indicate that there is significant potential in combining neural, non-neural, or rule-based approaches for improving beyond the state of the art.

Judgment Similarity

In this thesis, we proposed to rely on the degree to which assertions are judged similarly by a large number of people (i.e. judgment similarity) to outperform the current state of the art in automatic stance detection. The underlying assumption of this approach is that texts to which people similarly agree (⊕ stance) or disagree (⊖ stance) tend to express a similar stance.
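The notion of judgment similarity can be illustrated with a small example. The sketch below represents each assertion by the vector of judgments it received from the same group of people and compares two such vectors with cosine similarity. Both the encoding (+1 agree, -1 disagree, 0 no judgment) and the use of cosine similarity are assumptions made for this illustration, and the data is invented.

```python
from math import sqrt


def judgment_similarity(judgments_a, judgments_b):
    """Cosine similarity of two judgment vectors over the same people."""
    dot = sum(a * b for a, b in zip(judgments_a, judgments_b))
    norm_a = sqrt(sum(a * a for a in judgments_a))
    norm_b = sqrt(sum(b * b for b in judgments_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an assertion without any judgments is similar to nothing
    return dot / (norm_a * norm_b)


# Invented judgments of five people (columns) on three assertions (rows):
# +1 = agree, -1 = disagree.
judgments = {
    "assertion_1": [1, 1, -1, -1, 1],
    "assertion_2": [1, 1, -1, -1, -1],
    "assertion_3": [-1, -1, 1, 1, -1],
}

sim_12 = judgment_similarity(judgments["assertion_1"], judgments["assertion_2"])
sim_13 = judgment_similarity(judgments["assertion_1"], judgments["assertion_3"])
print(f"sim(1, 2) = {sim_12:.2f}, sim(1, 3) = {sim_13:.2f}")
```

In this toy example, the first two assertions are judged alike by four of five people and thus receive a high similarity, whereas the third assertion is judged in exactly the opposite way and receives the lowest possible score.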

We argue that large amounts of judgments on texts can be gathered from almost all social networking sites, for instance, by harvesting the up-votes or down-votes on Reddit posts. However, even in social media, there are situations in which no judgments are available – for example, if we want to compute judgment similarity between new posts that have not yet been noticed by a large audience. Thus, we propose to mimic judgment similarity by supervised methods that are trained to predict the judgment similarity of two texts. We find that such a function – which maps two input texts to a similarity score – can be learned with fairly good performance. For learning the function, we used learning architectures that contain little linguistic knowledge and thus do not provide deep insights into the linguistic characteristics of texts that cause them to be judged similarly (i.e. an SVM with n-gram features and a Siamese neural network). Thus, future research should examine how judgment similarity relates to linguistic phenomena such as paraphrasing or entailment.

Based on the automatic estimations of judgment similarity, we implemented approaches for predicting both nuanced (cf. Section 5.3) and hateful stance (cf. Section 6.3.3). For both tasks, we demonstrated that such approaches yield substantial improvements over state-of-the-art stance classification approaches – at least in the controlled environment of our datasets. Whether the proposed approach will be successful in real-world settings still needs to be shown by future research. In addition, due to the high level of similarity between the concepts stance, sentiment, and argumentation, we argue that judgment similarity should be a useful means of predicting these concepts as well.

Outlook

As the final step in this thesis, we will take a look at the significance of our research for stance detection and NLP in general, and moreover, at the implications we may derive for society as a whole. Our research emphasizes that data quality is one of the most significant factors in state-of-the-art stance detection systems. As these systems require labeled data for training and evaluation, low-quality data may result in low performance of the systems, or may render the results uninterpretable. Our findings indicate that the perception of stance is, at least partially, subjective. Hence, to ensure high data quality, annotation procedures should account for the subjective nature of the phenomenon.

In this work, we proposed a new annotation approach that tackles the problem of subjectivity. Specifically, we propose (i) to collect annotations from more than just a few annotators and (ii) to systematically control for the group of annotators (e.g. by certain demographic variables). If one uses a large number of annotators, it is possible to average their responses and thereby minimize the influence of individual – possibly strongly deviating – annotations. By systematically controlling who the annotators are, we can further reduce variance by using homogeneous groups, and we can contextualize the resulting labels with respect to the group of annotators. Hence, in future attempts at stance detection, we could construct systems that do not predict a stance that is valid for all audiences, but rather a stance that would be assigned by a certain demographic group (e.g. students, parents, or investors). We argue that such contextualized annotations and predictions are probably also suited for improving other NLP tasks that suffer from the subjectivity of annotations, such as sentiment analysis or argument mining.
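The two proposals above, averaging over many annotators and contextualizing labels by annotator group, can be sketched in a few lines. The group names, the tweet identifier, and the scores below are invented, and the function name `aggregate_by_group` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_group(annotations):
    """Average annotation scores per (item, annotator group).

    annotations: iterable of (item_id, annotator_group, score).
    """
    scores = defaultdict(list)
    for item_id, group, score in annotations:
        scores[(item_id, group)].append(score)
    return {key: mean(values) for key, values in scores.items()}


# Invented annotations: +1 = perceived as in favor, -1 = perceived as against.
annotations = [
    ("tweet_1", "students", 1),
    ("tweet_1", "students", 1),
    ("tweet_1", "students", -1),  # a single deviating annotation
    ("tweet_1", "parents", -1),
    ("tweet_1", "parents", -1),
]
labels = aggregate_by_group(annotations)
print(labels)
```

The same tweet thus receives one aggregated label per group, so a system could be trained to predict the stance that a particular audience would assign.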

Overall, this thesis paints a mixed picture of the performance of automatic approaches to stance detection. While in some scenarios, automatic approaches yield satisfactory performance, in others, they are hardly better than a majority class baseline. However, note that the NLP task of stance detection is still rather new (e.g. the first major evaluation initiative was conducted by Mohammad et al. in 2016) and NLP is a rapidly developing field with new breakthroughs every year. It is therefore conceivable that in a few years the state of the art will be based entirely on new technologies. For example, we found that the current state-of-the-art lexical semantics are not sufficient for modeling the semantic shift that is associated with a certain target. However, recent works that encode contextualized word semantics – such as Howard and Ruder (2018), Peters et al. (2018), or Devlin et al. (2018) – indicate that such problems may be solved soon.

The fact that stance detection already works reliably in some scenarios has broader implications beyond the direct lessons learnt for NLP researchers and practitioners. If used wrongly or maliciously, stance detection may negatively affect people’s lives. For example, if people start to use stance detection to only engage with content that aligns with their own stance, existing echo chambers will be amplified and discourses on controversial issues – that incorporate every member of society – will not take place. Ultimately, this may lead to a more fractured society which is unable to find consensus. In addition, stance detection also provides the technical basis for all-pervading censorship or the persecution of political dissidents.

At the same time, stance detection offers the potential to positively affect people’s lives. We outlined several use cases in which automatic stance detection can improve the efficiency with which social media users or organizations discover, group, or filter social media posts that express stance towards targets they are interested in. In this way, automatic stance detection can help society as a whole to communicate more efficiently, and thus to make better decisions.

These risks and opportunities resulting from the usage of stance detection highlight that researchers and practitioners should consider, in each individual scenario, whether the utility of these technologies exceeds their potential costs.

Bibliography

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation (OSDI). Savannah, USA, pages 265–283.

Rob Abbott, Marilyn Walker, Pranav Anand, Jean E Fox Tree, Robeson Bowmani, and Joseph King. 2011. How can you say such things?!?: Recognizing disagreement in informal political argument. In Proceedings of the Workshop on Languages in Social Media (LSM 2011). Portland, USA, pages 2–11.

Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17(6):734–749.

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. Semeval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the Joint Conference on Lexical and Computational Semantics (*SEM). Montréal, Canada, pages 385–393.

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed Word Representations for Multilingual NLP. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL). Sofia, Bulgaria, pages 183–192.

Pranav Anand, Marilyn Walker, Rob Abbott, Jean E. Fox Tree, Robeson Bowmani, and Michael Minor. 2011. Cats rule and dogs drool!: Classifying stance in online debate. In Proceedings of the 2nd workshop on computational approaches to subjectivity and sentiment analysis. Stroudsburg, USA, pages 1–9.

Jacob Andreas, Sara Rosenthal, and Kathleen McKeown. 2012. Annotating agreement and disagreement in threaded discussion. In Proceedings of the International Conference on Language Resources and Evaluation (LREC). Istanbul, Turkey, pages 818–822.

Shilpa Arora, Mahesh Joshi, and Carolyn P Rosé. 2009. Identifying Types of Claims in Online Customer Reviews. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Boulder, USA, pages 37–40.

Segun Taofeek Aroyehun and Alexander Gelbukh. 2018. Aggression Detection in Social Media: Using Deep Neural Networks, Data Augmentation, and Pseudo Labeling. In Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC). Santa Fe, USA, pages 90–97.

Ron Artstein and Massimo Poesio. 2008. Inter-coder Agreement for Computational Linguistics. Computational Linguistics 34(4):555–596.

Yoav Artzi, Patrick Pantel, and Michael Gamon. 2012. Predicting Responses to Microblog Posts. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Montréal, Canada, pages 602–606.

Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. 2016. Stance Detection with Bidirectional Conditional Encoding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, Texas, pages 876–885.

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep Learning for Hate Speech Detection in Tweets. In Proceedings of the International Conference on World Wide Web Companion (WWW Companion). Perth, Australia, pages 759–760.

Daniel Bär, Torsten Zesch, and Iryna Gurevych. 2013. Dkpro similarity: An open source framework for text similarity. In Proceedings of the Conference of the Association for Computational Linguistics: System Demonstrations (ACL). Sofia, Bulgaria, pages 121–126.

Geoffrey Barbier, Zhuo Feng, Pritam Gundecha, and Huan Liu. 2013. Provenance Data in Social Media. Synthesis Lectures on Data Mining and Knowledge Discovery 4(1):1–84.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Italian Conference on Computational Linguistics (CLiC-it 2016) & Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA). Naples, Italy, pages 1–12.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Baltimore, USA, pages 238–247.

Jamie Bartlett, Jeremy Reffin, Noelle Rumball, and Sarah Williamson. 2014. Antisocial media. Demos (pages 1–51): https://www.demos.co.uk/files/DEMOS_Anti-social_Media.pdf; last accessed November 26 2018.

Hans Baumgartner and Jan-Benedict E. M. Steenkamp. 2001. Response Styles in Marketing Research: A Cross-National Investigation. Journal of Marketing Research 38(2):143–156.

Petra Saskia Bayerl and Karsten Ingmar Paul. 2011. What Determines Inter-Coder Agreement in Manual Annotations? A Meta-Analytic Investigation. Computational Linguistics 37(4):699–725.

Farah Benamara, Cyril Grouin, Jihen Karoui, Véronique Moriceau, and Isabelle Robba. 2017. Analyse d’opinion et langage figuratif dans des tweets: présentation et résultats du Défi Fouille de Textes (DEFT). In Les procédures de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN). Orléans-France, pages 1–12.

Yoshua Bengio, Paolo Frasconi, and Patrice Simard. 1993. The Problem of Learning Long-Term Dependencies in Recurrent Networks. In Proceedings of the IEEE International Conference on Neural Networks. San Francisco, USA, pages 1183–1188.

Darina Benikova, Michael Wojatzki, and Torsten Zesch. 2017. What does this imply? Examining the Impact of Implicitness on the Perception of Hate Speech. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL). Berlin, Germany, pages 171–179.

Edward M. Bennett, R. Alpert, and A.C. Goldstein. 1954. Communications Through Limited-Response Questioning. Public Opinion Quarterly 18(3):303–308.

James Bennett and Stan Lanning. 2007. The Netflix Prize. In Proceedings of KDD Cup and Workshop. San Jose, USA, pages 3–6.

Rahul Bhagat and Eduard Hovy. 2013. What Is a Paraphrase? Computational Linguistics 39(3):463–472.

Steven Bird and Mark Liberman. 2001. A Formal Framework for Linguistic Annotation. Speech communication 33(1-2):23–60.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer-Verlag, Berlin, Heidelberg.

Pavel Blinov and Evgeny Kotelnikov. 2014. Using Distributed Representations for Aspect-based Sentiment Analysis. In Proceedings of the International Conference ’Dialogue’. Moscow, Russia, pages 68–80.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics (TACL) 5:135–146.

Filip Boltužić and Jan Šnajder. 2015. Identifying Prominent Arguments in Online Debates Using Semantic Textual Similarity. In Proceedings of the Workshop on Argumentation Mining. Denver, USA, pages 110–115.

Filip Boltužić and Jan Šnajder. 2014. Back up your Stance: Recognizing Arguments in Online Discussions. In Proceedings of the Workshop on Argumentation Mining. Baltimore, USA, pages 49–58.

Margaret M. Bradley and Peter J. Lang. 1999. Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings. Technical report, the center for research in psychophysiology, University of Florida, USA.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating Wordnet-based Measures of Lexical Semantic Relatedness. Computational Linguistics 32(1):13–47.

Pete Burnap and Matthew Leighton Williams. 2014. Hate Speech, Machine Classification and Statistical Modelling of Information Flows on Twitter: Interpretation and Communication for Policy Decision Making. In Proceedings of Internet, Policy & Politics Conference. Oxford, United Kingdom, pages 1–18.

Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore, pages 286–295.

Erik Cambria, Robert Speer, Catherine Havasi, and Amir Hussain. 2010. SenticNet: A Publicly Available Semantic Resource for Opinion Mining. In Proceedings of the AAAI fall Symposium on Commonsense Knowledge. Arlington, USA, pages 14–18.

Jose Sebastián Canós. 2018. Misogyny Identification Through SVM at IberEval 2018. In Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval). Seville, Spain, pages 229–233.

Fabio Celli, Giuseppe Riccardi, and Arindam Ghosh. 2014. CorEA: Italian News Corpus with Emotions and Agreement. In Proceedings of the Italian Conference on Computational Linguistics (CLiC-it) & the Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA). Pisa, Italy, pages 98–102.

Fabio Celli, Evgeny Stepanov, Massimo Poesio, and Giuseppe Riccardi. 2016. Predicting Brexit: Classifying agreement is better than sentiment and pollsters. In Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES). Osaka, Japan, pages 110–118.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27.

Irfan Chaudhry. 2015. #Hashtagging hate: Using Twitter to Track Racism Online. First Monday 20(2). Accessible at https://firstmonday.org/article/view/5450/4207; last accessed November 26 2018.

Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. In Proceedings of the IEEE International Conference on Privacy, Security, Risk and Trust (PASSAT) and International Conference on Social Computing (SocialCom). Amsterdam, Netherlands, pages 71–80.

Yejin Choi and Claire Cardie. 2009. Adapting a Polarity Lexicon using Integer Linear Programming for Domain-Specific Sentiment Classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore, pages 590–598.

Eli Cohen. 2009. Applying best-worst scaling to wine marketing. International Journal of Wine Business Research 21(1):8–23.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1):37–46.

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin 70(4):213.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.

Anthony J. Conger. 1980. Integration and Generalisation of Kappas for Multiple Raters. Psychological Bulletin 88(2):322–328.

Alexander Conrad, Janyce Wiebe, and Rebecca Hwa. 2012. Recognizing arguing subjectivity and argument tags. In Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics. Stroudsburg, USA, pages 80–88.

Michele Corazza, Stefano Menini, Arslan Pinar, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. InriaFBK at Germeval 2018: Identifying Offensive Tweets Using Recurrent Neural Networks. In Proceedings of the GermEval 2018 – Shared Task on the Identification of Offensive Language. Vienna, Austria, pages 80–84.

David M. Corey, William P. Dunlap, and Michael J. Burke. 1998. Averaging Correlations: Expected Values and Bias in Combined Pearson rs and Fisher’s z Transformations. The Journal of General Psychology 125(3):245–261.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2009. Recognizing textual entailment: Rational, evaluation and approaches. Natural Language Engineering 15(4):i–xvii.

Ido Dagan and Oren Glickman. 2004. Probabilistic textual entailment: Generic applied modelling of language variability. In Proceedings of the PASCAL Workshop on Learning Methods for Text Understanding and Mining. Grenoble, France.

Abhinandan S. Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. 2007. Google News Personalization: Scalable Online Collaborative Filtering. In Proceedings of the International Conference on World Wide Web (WWW). Banff, Canada, pages 271–280.

Sanjiv R. Das and Mike Y. Chen. 2007. Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web. Management Science 53(9):1375–1388.

Herbert A. David. 1963. The Method of Paired Comparisons, volume 12. Hodder Arnold, London, UK.

James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, and Dasarathi Sampath. 2010. The YouTube Video Recommendation System. In Proceedings of the ACM Conference on Recommender Systems (RecSys). Barcelona, Spain, pages 293–296.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM). Montréal, Canada, pages 512–515.

Mark Davies and Joseph L. Fleiss. 1982. Measuring agreement for multinomial data. Biometrics 38(4):1047–1051.

Johannes Daxenberger, Oliver Ferschke, Iryna Gurevych, and Torsten Zesch. 2014. DKPro TC: A Java-based Framework for Supervised Learning Experiments on Textual Data. In Proceedings of the Conference of the Association for Computational Linguistics: System Demonstrations (ACL). Baltimore, USA, pages 61–66.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6):391–407.

Fabio Del Vigna, Andrea Cimino, Felice Dell’Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate Me, Hate Me Not: Hate Speech Detection on Facebook. In Proceedings of the Italian Conference on Cybersecurity (ITASEC). Venice, Italy, pages 86–95.

Lingjia Deng and Janyce Wiebe. 2014. Sentiment Propagation via Implicature Constraints. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL). Gothenburg, Sweden, pages 377–385.

Lingjia Deng and Janyce Wiebe. 2015. Mpqa 3.0: An entity/event-level sentiment corpus. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). pages 1323–1328.

Lingjia Deng, Janyce Wiebe, and Yoonjung Choi. 2014. Joint Inference and Disambiguation of Implicit Sentiments via Implicature Constraints. In Proceedings of the International Conference on Computational Linguistics (COLING). Dublin, Ireland, pages 79–88.

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Vancouver, Canada, pages 69–76.

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP). Hissar, Bulgaria, pages 198–206.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. Accessible at https://arxiv.org/abs/1810.04805; last accessed November 26 2018.

Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, and Rosalind Picard. 2012. Common Sense Reasoning for Detection, Prevention, and Mitigation of Cyberbullying. ACM Transactions on Interactive Intelligent Systems (TiiS) 2(3):18:1–18:30.

Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. 2015. Hate Speech Detection with Comment Embeddings. In Proceedings of the International Conference on World Wide Web (WWW). Florence, Italy, pages 29–30.

Li Dong, Furu Wei, Chuanqi Tan, Duyu Tang, Ming Zhou, and Ke Xu. 2014. Adaptive Recursive Neural Network for Target-dependent Twitter Sentiment Classification. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Baltimore, USA, pages 49–54.

Jiachen Du, Ruifeng Xu, Yulan He, and Lin Gui. 2017. Stance Classification with Target-Specific Neural Attention Networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-17). pages 3988–3994.

John W. Du Bois. 2007. The Stance Triangle. Stancetaking in Discourse: Subjectivity, Evaluation, Interaction 164(3):139–182.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12:2121–2159.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT. Dublin, Ireland, pages 1–11.

Carsten Eickhoff and Arjen de Vries. 2011. How Crowdsourcable is Your Task? In Proceedings of the Workshop on Crowdsourcing for Search and Data Mining (CSDM 2011). Hong Kong, China, pages 11–14.

Paul Ekman, E. Richard Sorenson, and Wallace V. Friesen. 1969. Pan-cultural Elements in Facial Displays of Emotion. Science 164(3875):86–88.

Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A High-Coverage Lexical Resource for Opinion Mining. In Proceedings of the Conference on Language Resources and Evaluation (LREC). Genoa, Italy, pages 417–422.

David Etter, Francis Ferraro, Ryan Cotterell, Olivia Buzek, and Benjamin Van Durme. 2013. Nerit: Named Entity Recognition for Informal Text. Technical report, Johns Hopkins University, Human Language Technology Center of Excellence, Baltimore, USA.

Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart, Institut für Maschinelle Sprachverarbeitung (IMS), Germany.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting Word Vectors to Semantic Lexicons. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Denver, USA, pages 1606–1615.

Adam Faulkner. 2014. Automated Classification of Stance in Student Essays: An Approach Using Stance Target Information and the Wikipedia Link-Based Measure. In Proceedings of the International Florida Artificial Intelligence Research Society Conference. Pensacola Beach, USA, pages 174–179.

Chris Fawcett and Holger H. Hoos. 2016. Analysing differences between algorithm configurations through ablation. Journal of Heuristics 22(4):431–458.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval). Seville, Spain, pages 214–228.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing Search in Context: The Concept Revisited. In Proceedings of the International Conference on World Wide Web (WWW). Hong Kong, China, pages 406–414.

John R. Firth. 1957. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis (special volume of the Philological Society) 1952-59:1–32.

Darja Fišer, Tomaž Erjavec, and Nikola Ljubešić. 2017. Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene. In Proceedings of the Workshop on Abusive Language Online (ALW). pages 46–51.

Peter Flach. 2012. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press.

Joseph L. Fleiss. 1971. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin 76(5):378–382.

Rudolph Flesch. 1948. A new Readability Yardstick. Journal of Applied Psychology 32(3):221–233.

Terry N. Flynn, Jordan J. Louviere, Tim J. Peters, and Joanna Coast. 2007. Best–worst scaling: What it can do for health care research and how to do it. Journal of Health Economics 26(1):171–189.

Eric N. Forsyth and Craig H. Martell. 2007. Lexical and Discourse Analysis of Online Chat Dialog. In International Conference on Semantic Computing (ICSC). Washington, USA, pages 19–26.

Karën Fort, Gilles Adda, and K. Bretonnel Cohen. 2011. Amazon Mechanical Turk: Gold Mine or Coal Mine? Computational Linguistics 37(2):413–420.

Karën Fort, Adeline Nazarenko, and Sophie Rosset. 2012. Modeling the Complexity of Manual Annotation Tasks: a Grid of Analysis. In Proceedings of the International Conference on Computational Linguistics (COLING). pages 895–910.

Jennifer Foster, Ozlem Cetinoglu, Joachim Wagner, Joseph Le Roux, Joakim Nivre, Deirdre Hogan, and Josef Van Genabith. 2011. From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP). Chiang Mai, Thailand, pages 893–901.

James B. Freeman. 1991. Dialectics and the macrostructure of arguments: A theory of argument structure, volume 10. Foris Publications, New York.

Gottlob Frege. 1884. Die Grundlagen der Arithmetik: eine logisch-mathematische Untersuchung über den Begriff der Zahl. Felix Meiner Verlag, Hamburg.

Simona Frenda, Ghanem Bilal, et al. 2018. Exploration of misogyny in spanish and english tweets. In Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval). Seville, Spain, pages 260–267.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). Hyderabad, India, pages 1606–1611.

Björn Gambäck and Utpal Kumar Sikdar. 2017. Using convolutional neural networks to classify hate-speech. In Proceedings of the Workshop on Abusive Language Online (ALW). pages 85–90.

Gayatree Ganu, Noemie Elhadad, and Amélie Marian. 2009. Beyond the Stars: Improving Rating Predictions using Review Text Content. In Proceedings of the International Workshop on the Web and Databases (WebDB). Providence, USA, pages 1–6.

Miguel A. García-Cumbreras, Julio Villena-Román, Eugenio Martínez-Cámara, Manuel Carlos Díaz-Galiano, Maria Teresa Martín-Valdivia, and L. Alfonso Ureña López. 2016. Overview of TASS 2016. In Proceedings of the TASS Workshop on Sentiment Analysis. Salamanca, Spain.

Lorenzo Gatti, Marco Guerini, and Marco Turchi. 2015. SentiWords: Deriving a High Precision and High Coverage Lexicon for Sentiment Analysis. IEEE Transactions on Affective Computing 7(4):409–421.

Aniruddha Ghosh, Guofu Li, Tony Veale, Paolo Rosso, Ekaterina Shutova, John Barnden, and Antonio Reyes. 2015. SemEval-2015 Task 11: Sentiment Analysis of Figurative Language in Twitter. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Denver, USA, pages 470–478.

Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, Mark Aakhus, and Matthew Mitsui. 2014. Analyzing argumentative discourse units in online interactions. In Proceedings of the Workshop on Argumentation Mining. Baltimore, USA, pages 39–48.

Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech Tagging for Twitter: Annotation, Features, and Experiments. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Portland, USA, pages 42–47.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A Lexicon-based Approach for Hate Speech Detection. International Journal of Multimedia and Ubiquitous Engineering 10(4):215–230.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Conference on Artificial Intelligence and Statistics (AISTATS). Chia Laguna, Italy, pages 249–256.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep Sparse Rectifier Neural Networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). Fort Lauderdale, USA, pages 315–323.

Carlos A. Gomez-Uribe and Neil Hunt. 2015. The Netflix Recommender System: Algorithms, Business Value, and Innovation. ACM Transactions on Management Information Systems 6(4):13:1–13:19.

Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying Sarcasm in Twitter: A Closer Look. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Portland, USA, pages 581–586.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press, Cambridge, USA.

Nancy Green, Kevin Ashley, Diane Litman, Chris Reed, and Vern Walker. 2014. Front matter of the proceedings of the first workshop on argumentation mining. In Proceedings of the Workshop on Argumentation Mining. Baltimore, Maryland, page iii.

Stephan Greene and Philip Resnik. 2009. More than words: Syntactic packaging and implicit sentiment. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Stroudsburg, USA, pages 503–511.

Herbert P. Grice. 1970. Logic and Conversation. Syntax and Semantics 3:41–58.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning Sense-specific Word Embeddings By Exploiting Bilingual Resources. In Proceedings of the International Conference on Computational Linguistics (COLING). Dublin, Ireland, pages 497–507.

Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.

Le Quan Ha, Elvira I. Sicilia-Garcia, Ji Ming, and F. Jack Smith. 2002. Extension of Zipf’s Law to Words and Phrases. In Proceedings of the International Conference on Computational Linguistics (COLING). Taipei, Taiwan, pages 1–6.

Ivan Habernal, Judith Eckle-Kohler, and Iryna Gurevych. 2014. Argumentation Mining on the Web from Information Seeking Perspective. In Proceedings of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing. Bertinoro, Italy, page 14.

Ivan Habernal and Iryna Gurevych. 2016. Which argument is more convincing? Analyzing and predicting convincingness of Web arguments using bidirectional LSTM. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Berlin, Germany, pages 1589–1599.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. Before Name-Calling: Dynamics and Triggers of Ad Hominem Fallacies in Web Argumentation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). New Orleans, USA, pages 386–396.

Michele P. Hamm, Amanda S. Newton, Annabritt Chisholm, Jocelyn Shulhan, Andrea Milne, Purnima Sundar, Heather Ennis, Shannon D. Scott, and Lisa Hartling. 2015. Prevalence and Effect of Cyberbullying on Children and Young People: A Scoping Review of Social Media Studies. JAMA Pediatrics 169(8):770–777.

F. Maxwell Harper and Joseph A. Konstan. 2015. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) – Regular Articles and Special issue on New Directions in Eye Gaze for Interactive Intelligent Systems 5(4):1–19.

Kazi Saidul Hasan and Vincent Ng. 2013. Stance Classification of Ideological Debates: Data, Models, Features, and Constraints. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP). Nagoya, Japan, pages 1348–1356.

Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the Semantic Orientation of Adjectives. In Proceedings of the Conference of the Association for Computational Linguistics (ACL) and Conference of the European Chapter of the Association for Computational Linguistics (EACL). Madrid, Spain, pages 174–181.

Marti A. Hearst, Susan T. Dumais, Edgar Osuna, John Platt, and Bernhard Schölkopf. 1998. Support Vector Machines. IEEE Intelligent Systems and their Applications 13(4):18–28.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München, Germany.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-term Memory. Neural Computation 9(8):1735–1780.

Seth M. Holmes and Heide Castañeda. 2016. Representing the “European refugee crisis” in Germany and beyond: Deservingness and difference, life and death. American Ethnologist 43(1):12–24.

Kurt Hornik. 1991. Approximation Capabilities of Multilayer Feedforward Networks. Neural networks 4(2):251–257.

Tobias Horsmann and Torsten Zesch. 2016. LTL-UDE at EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text. In Proceedings of the Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task. Berlin, Germany, pages 120–126.

Tobias Horsmann and Torsten Zesch. 2018. DeepTC – An Extension of DKPro Text Classification for Fostering Reproducibility of Deep Learning Experiments. In Proceedings of the International Conference on Language Resources and Evaluation (LREC). Miyazaki, Japan, pages 2539–2545.

Leonard Hövelmann and Christoph M. Friedrich. 2017. Fasttext and Gradient Boosted Trees at GermEval-2017 Tasks on Relevance Classification and Document-level Polarity. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Berlin, Germany, pages 30–35.

Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Melbourne, Australia, pages 328–339.

Jeff Howe. 2006. The Rise of Crowdsourcing. Wired magazine 14(6):1–4.

Jeff Howe. 2008. Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business. Crown Publishing Group, New York, USA, 1st edition.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS). Montréal, Canada, pages 2042–2050.

Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD). Seattle, USA, pages 168–177.

Lawrence Hubert. 1977. Kappa Revisited. Psychological Bulletin 84(2):289–297.

Nancy Ide and Laurent Romary. 2004. International Standard for a Linguistic Annotation Framework. Natural Language Engineering 10(3-4):211–225.

Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter Sentiment Classification. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Portland, USA, pages 151–160.

Aditya Joshi, Abhijit Mishra, Nivvedan Senthamilselvan, and Pushpak Bhattacharyya. 2014. Measuring Sentiment Annotation Complexity of Text. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Baltimore, USA, pages 36–41.

Dan Jurafsky and James H. Martin. 2000. Speech & Language Processing. Pearson International Education, New Jersey, USA, 2nd edition.

David Jurgens. 2013. Embracing Ambiguity: A Comparison of Annotation Methodologies for Crowdsourcing Word Sense Labels. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Atlanta, USA, pages 556–562.

Andreas M. Kaplan and Michael Haenlein. 2010. Users of the world, unite! The challenges and opportunities of Social Media. Business Horizons 53(1):59–68.

Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint:1412.6980 pages 1–13. Accessible at https://arxiv.org/abs/1412.6980; last accessed November 27 2018.

Svetlana Kiritchenko and Saif M. Mohammad. 2016. Capturing Reliable Fine-Grained Sentiment Associations by Crowdsourcing and Best–Worst Scaling. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). San Diego, California, pages 811–817.

Svetlana Kiritchenko and Saif M. Mohammad. 2017. Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Vancouver, Canada, pages 465–470.

Svetlana Kiritchenko, Saif M. Mohammad, and Mohammad Salameh. 2016. SemEval-2016 Task 7: Determining Sentiment Intensity of English and Arabic Phrases. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). San Diego, USA, pages 42–51.

Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. 2014. Sentiment Analysis of Short Informal Texts. Journal of Artificial Intelligence Research (JAIR) 50(1):723–762.

Nozomi Kobayashi, Ryu Iida, Kentaro Inui, and Yuji Matsumoto. 2006. Opinion Mining on the Web by Extracting Subject-Aspect-Evaluation Relations. In Proceedings of the AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. Palo Alto, USA, pages 86–91.

Klaus Krippendorff. 1980. Content Analysis: An Introduction to its Methodology. SAGE Publications, Inc, first edition.

Klaus Krippendorff. 2004a. Content Analysis: An Introduction to its Methodology. SAGE Publications, Inc., fourth edition.

Klaus Krippendorff. 2004b. Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research 30(3):411–433.

Klaus Krippendorff. 2008. Systematic and Random Disagreement and the Reliability of Nominal Data. Communication Methods and Measures 2(4):323–338.

Ritesh Kumar, Atul Kr. Ojha, Shervin Malmasi, and Marcos Zampieri. 2018a. Benchmarking Aggression Identification in Social Media. In Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC). Santa Fe, USA, pages 1–11.

Ritesh Kumar, Aishwarya N. Reganti, Akshit Bhatia, and Tushar Maheshwari. 2018b. Aggression-annotated corpus of hindi-english code-mixed data. In Proceedings of the International Conference on Language Resources and Evaluation (LREC). pages 1425–1431.

Irene Kwok and Yuzhou Wang. 2013. Locate the Hate: Detecting Tweets against Blacks. In Proceedings of the AAAI Conference on Artificial Intelligence. Washington, USA, pages 1621–1622.

Namhee Kwon, Liang Zhou, Eduard Hovy, and Stuart W. Shulman. 2007. Identifying and Classifying Subjective Claims. In Proceedings of the International Conference on Digital Government Research: Bridging Disciplines & Domains. San Diego, USA, pages 76–81.

Mirko Lai, Alessandra Teresa Cignarella, and Delia Irazú Hernández Farías. 2017a. iTacos at ibereval2017: Detecting Stance in Catalan and Spanish Tweets. In Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval). Murcia, Spain, pages 185–192.

Mirko Lai, Delia Irazú Hernández Farías, Viviana Patti, and Paolo Rosso. 2017b. Friends and Enemies of Clinton and Trump: Using Context for Detecting Stance in Political Tweets. In Advances in Computational Intelligence (MICAI). Cham, Switzerland, pages 155–168.

Man Lan, Zhihua Zhang, Yue Lu, and Ju Wu. 2016. Three Convolutional Neural Network-based models for Learning Sentiment Word Vectors Towards Sentiment Analysis. In International Joint Conference on Neural Networks (IJCNN). Vancouver, Canada, pages 3172–3179.

Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse Processes 25(2-3):259–284.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning (ICML). Beijing, China, pages 1188–1196.

Yann Le Cun, Larry D. Jackel, Bernhard E. Boser, John Denker, H.P. Graf, Isabelle Guyon, Don Henderson, R.E. Howard, and W. Hubbard. 1989. Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine 27(11):41–46.

Ji-Ung Lee, Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Berlin, Germany, pages 22–29.

Russell V. Lenth. 2016. Least-squares Means: the R package lsmeans. Journal of Statistical Software 69(1):1–33.

Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS). Montréal, Canada, pages 2177–2185.

Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, and Bu-Sung Lee. 2012. TwiNER: Named Entity Recognition in Targeted Twitter Stream. In Proceedings of the International Conference on Research and Development in Information Retrieval (ACM SIGIR). Portland, USA, pages 721–730.

Richard J. Light. 1971. Measures of Response Agreement for Qualitative Data: Some Generalizations and Alternatives. Psychological Bulletin 76(5):365–377.

Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing 7(1):76–80.

Marco Lippi and Paolo Torroni. 2016. Argumentation Mining: State of the Art and Emerging Trends. ACM Transactions on Internet Technology (TOIT) 16(2):10:1–10:25.

Natalia Loukachevitch, Pavel Blinov, Evgeny Kotelnikov, Yulia Rubtsova, Vladimir Ivanov, and Elena Tutubalina. 2015. SentiRuEval: testing object-oriented sentiment analysis systems in Russian. In Proceedings of the International Conference ’Dialogue’. Moscow, Russia, pages 3–13.

Jordan J. Louviere. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta, Canada: Working Paper.

Jordan J. Louviere. 1993. The best-worst or maximum difference measurement model: Applications to behavioral research in marketing. In Proceedings of the American Marketing Association’s Behavioral Research Conference. Phoenix, Arizona.

Jordan J. Louviere, Terry N. Flynn, and A. A. J. Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.

Jordan J. Louviere, Ian Lings, Towhidul Islam, Siegfried Gudergan, and Terry Flynn. 2013. An introduction to the application of (case 1) best–worst scaling in marketing research. International Journal of Research in Marketing 30(3):292–303.

Caroline Lyon, James Malcolm, and Bob Dickerson. 2001. Detecting short passages of similar text in large document collections. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Pittsburgh, USA, pages 118–125.

Jamie Macbeth, Hanna Adeyema, Henry Lieberman, and Christopher Fry. 2013. Script-based Story Matching for Cyberbullying Prevention pages 901–906.

Nitin Madnani and Bonnie J. Dorr. 2010. Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods. Computational Linguistics 36(3):341–387.

Prasenjit Majumder, Thomas Mandl, et al. 2018. Filtering aggression from the multilingual social media feed. In Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC). pages 199–207.

Karla Mantilla. 2013. Gendertrolling: Misogyny Adapts to New Media. Feminist Studies (Special Issue: Categorizing Sexualities) 39(2):563–570.

Ryan C. Martin, Kelsey Ryan Coyier, Leah M. VanSistine, and Kelly L. Schroeder. 2013. Anger on the Internet: The Perceived Value of Rant-Sites. Cyberpsychology, Behavior, and Social Networking 16(2):119–122.

Eugenio Martínez-Cámara, Manuel C. Díaz-Galiano, M. Angel García-Cumbreras, Manuel Carlos García-Vega, and Julio Villena-Román. 2017. Overview of TASS 2017. In Proceedings of the TASS Workshop on Sentiment Analysis. Murcia, Spain, pages 13–21.

Puneet Mathur, Rajiv Shah, Ramit Sawhney, and Debanjan Mahata. 2018. Detecting Offensive Tweets in Hindi-English Code-Switched Language. In Proceedings of the International Workshop on Natural Language Processing for Social Media. Melbourne, Australia, pages 18–26.

Julian McAuley, Jure Leskovec, and Dan Jurafsky. 2012. Learning Attitudes and Attributes from Multi-aspect Reviews. In Proceedings of the International Conference on Data Mining (ICDM). Washington, USA, pages 1020–1025.

Tarlach McGonagle. 2013. The Council of Europe against online hate speech: Conundrums and challenges. Expert paper, document number: 1900; accessible at http://hdl.handle.net/11245/1.407945; last accessed November 27 2018.

Yashar Mehdad and Joel Tetreault. 2016. Do Characters Abuse More Than Words? In Proceedings of the Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Los Angeles, USA, pages 299–303.

Stefano Menini and Sara Tonelli. 2016. Agreement and Disagreement: Comparison of Points of View in the Political Domain. In Proceedings of the International Conference on Computational Linguistics (COLING). Osaka, Japan, pages 2461–2470.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In Workshop-Proceedings of the International Conference on Learning Representations (ICLR). Scottsdale, USA, pages 1310–1318.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.

Pruthwik Mishra, Vandan Mujadia, and Soujanya Lanka. 2017. GermEval 2017: Sequence based Models for Customer Feedback Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Berlin, Germany, pages 36–42.

Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. 2018. Neural Character-based Composition Models for Abuse Detection. Brussels, Belgium, pages 1–10.

Jeff Mitchell and Mirella Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive science 34(8):1388–1429.

Margaret Mitchell, Jacqui Aguilar, Theresa Wilson, and Benjamin Van Durme. 2013. Open Domain Targeted Sentiment. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Seattle, USA, pages 1643–1654.

Saif M. Mohammad. 2016. A Practical Guide to Sentiment Annotation: Challenges and Solutions. In Proceedings of the Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA). San Diego, USA, pages 174–179.

Saif M. Mohammad and Felipe Bravo-Marquez. 2017. WASSA-2017 shared task on emotion intensity. In Proceedings of the Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA). Copenhagen, Denmark.

Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of International Workshop on Semantic Evaluation (SemEval). New Orleans, USA, pages 1–17.

Saif M. Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 Task 6: Detecting Stance in Tweets. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). San Diego, USA.

Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Atlanta, USA, pages 321–327.

Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. 2017. Stance and Sentiment in Tweets. ACM Transactions on Internet Technology (TOIT) - Special Issue on Argumentation in Social Media and Regular Papers 17(3):26:1–26:23.

Saif M. Mohammad and Peter D. Turney. 2010. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the Workshop on Computational Approaches to Analysis and Generation of Emotion in Text. Los Angeles, USA, pages 26–34.

Joaquín Padilla Montani and Peter Schüller. 2018. Tuwienkbs at germeval 2018: German abusive tweet detection. In Proceedings of the GermEval 2018 – Shared Task on the Identification of Offensive Language. Vienna, Austria, pages 45–50.

Hamdy Mubarak, Kareem Darwish, and Walid Magdy. 2017. Abusive Language Detection on Arabic Social Media. In Proceedings of the Workshop on Abusive Language Online (ALW). Vancouver, Canada, pages 52–56.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese Recurrent Architectures for Learning Sentence Similarity. In Proceedings of the AAAI Conference on Artificial Intelligence. Phoenix, USA, pages 2786–2792.

Behzad Naderalvojoud, Behrang Qasemizadeh, and Laura Kallmeyer. 2017. HU-HHU at GermEval-2017 Sub-task B: Lexicon-Based Deep Learning for Contextual Sentiment Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Berlin, Germany, pages 18–21.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. Semeval-2016 task 4: Sentiment analysis in twitter. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). San Diego, USA, pages 1–18.

Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment Analysis in Twitter. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Atlanta, USA, pages 312–320.

Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. 2016. Learning Text Similarity with Siamese Recurrent Networks. In Proceedings of the Workshop on Representation Learning for NLP (RepL4NLP). Berlin, Germany, pages 148–157.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 1059–1069.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive Language Detection in Online User Content. In Proceedings of the International Conference on World Wide Web (WWW). Montréal, Canada, pages 145–153.

Neil O’Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Páraic Sheridan, Cathal Gurrin, and Alan F. Smeaton. 2009. Topic-Dependent Sentiment Analysis of Financial Blogs. In Proceedings of the International CIKM Workshop on Topic-sentiment Analysis for Mass Opinion. Hong Kong, China, pages 9–16.

Nelleke Oostdijk and Hans van Halteren. 2013. N-Gram-Based Recognition of Threatening Tweets. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics (CICLing). Samos, Greece, pages 183–196.

Bryan Orme. 2009. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB. Sawtooth Software Research Paper Series. Accessible at https://www.sawtoothsoftware.com/download/techpap/indivmaxdiff.pdf; last accessed November 27 2018.

Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the International Conference on Artificial Intelligence and Law (ICAIL). Barcelona, Spain, pages 98–107.

Martha Palmer and Tim Finin. 1990. Workshop on the Evaluation of Natural Language Processing Systems. Computational Linguistics 16(3):175–181.

Endang Wahyu Pamungkas, Alessandra Teresa Cignarella, Valerio Basile, and Viviana Patti. 2018. 14-ExLab@ UniTo for AMI at IberEval2018: Exploiting Lexical Knowledge for Detecting Misogyny in English and Spanish Tweets. In Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval). Seville, Spain, pages 234–241.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Philadelphia, USA, pages 79–86.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword Fifth Edition. Catalog Number: LDC2011T07, Linguistic Data Consortium, Philadelphia, USA.

Patrick Paroubek, Stéphane Chaudiron, and Lynette Hirschman. 2007. Principles of Evaluation in Natural Language Processing. Traitement Automatique des Langues (ATALA) 48(1):7–31.

Braja Gopal Patra, Dipankar Das, Amitava Das, and Rajendra Prasath. 2015. Shared Task on Sentiment Analysis in Indian Languages (SAIL) Tweets - An Overview. In Proceedings of the International Conference on Mining Intelligence and Knowledge Exploration (MIKE). Hyderabad, India, pages 650–655.

John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. 2017. Deep Learning for User Comment Moderation. In Proceedings of the Workshop on Abusive Language Online (ALW). Vancouver, Canada, pages 25–35.

Andreas Peldszus and Manfred Stede. 2013. Ranking the annotators: An agreement study on argumentation structure. In Proceedings of the Linguistic Annotation Workshop and Interoperability with Discourse. Sofia, Bulgaria, pages 196–204.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 1532–1543.

Isaac Persing and Vincent Ng. 2016a. End-to-End Argumentation Mining in Student Essays. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). San Diego, USA, pages 1384–1394.

Isaac Persing and Vincent Ng. 2016b. Modeling Stance in Student Essays. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Berlin, Germany, pages 2174–2184.

Isaac Persing and Vincent Ng. 2017. Why Can’t You Convince Me? Modeling Weaknesses in Unpersuasive Arguments. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). Melbourne, Australia, pages 4082–4088.

John P. Pestian, Pawel Matykiewicz, Michelle Linn-Gust, Brett South, Ozlem Uzuner, Jan Wiebe, K. Bretonnel Cohen, John Hurdle, and Christopher Brew. 2012. Sentiment Analysis of Suicide Notes: A Shared Task. Biomedical Informatics Insights 5(Supplement 1):3–16.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). New Orleans, USA, pages 2227–2237.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Berlin, Germany, pages 412–418.

Livia Polanyi and Annie Zaenen. 2006. Contextual Valence Shifters. In Computing Attitude and Affect in Text: Theory and Applications (The Information Retrieval Book Series), Springer Netherlands, volume 20, pages 1–10.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphee De Clercq, Veronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Núria Bel, Salud María Jiménez-Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 Task 5: Aspect-Based Sentiment Analysis. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). San Diego, California, pages 19–30.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 Task 12: Aspect-Based Sentiment Analysis. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Denver, Colorado, pages 486–495.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect-Based Sentiment Analysis. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Dublin, Ireland, pages 27–35.

Dimitris Potoglou, Peter Burge, Terry Flynn, Ann Netten, Juliette Malley, Julien Forder, and John E. Brazier. 2011. Best–worst scaling vs. discrete choice experiments: An empirical comparison using social care data. Social Science & Medicine 72(10):1717–1727.

Kashyap Raiyani, Teresa Gonçalves, Paulo Quaresma, and Vitor Beires Nogueira. 2018. Fully Connected Neural Network with Advance Preprocessor to Identify Aggression over Facebook and Twitter. In Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC). Santa Fe, USA, pages 28–41.

Justus J. Randolph. 2005. Free-Marginal Multirater Kappa (multirater κ_free): An Alternative to Fleiss’ Fixed-Marginal Multirater Kappa. Online Submission, paper presented at the Joensuu Learning and Instruction Symposium (Joensuu, Finland, Oct 14-15, 2005; article number: ED490661).

Amir H. Razavi, Diana Inkpen, Sasha Uritsky, and Stan Matwin. 2010. Offensive Language Detection Using Multi-level Classification. In Proceedings of the Canadian Conference on Advances in Artificial Intelligence. Ottawa, Canada, pages 16–27.

Chris Reed and Glenn Rowe. 2004. Araucaria: Software for argument analysis, diagramming and representation. International Journal on Artificial Intelligence Tools 13(4):961–979.

Robert Remus, Uwe Quasthoff, and Gerhard Heyer. 2010. SentiWS-A Publicly Available German-language Resource for Sentiment Analysis. In Proceedings of the International Conference on Language Resources and Evaluation (LREC). Valletta, Malta, pages 1168–1171.

Yafeng Ren, Yue Zhang, Meishan Zhang, and Donghong Ji. 2016. Improving Twitter Sentiment Classification Using Topic-Enriched Multi-Prototype Word Embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence. Phoenix, USA, pages 3038–3044.

Philip Resnik and Jimmy Lin. 2010. Evaluation of NLP systems. The Handbook of Computational Linguistics and Natural Language Processing 57:271–295.

Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M. Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show Me Your Evidence – an Automatic Method for Context Dependent Evidence Detection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Lisbon, Portugal, pages 440–450.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Edinburgh, United Kingdom, pages 1524–1534.

Daniel M. Romero, Chenhao Tan, and Johan Ugander. 2013. On the Interplay between Social and Topical Structure. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM). Ann Arbor, USA, pages 515–525.

Frank Rosenblatt. 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain. Psychological Review 65(6):386–408.

Robert Rosenthal. 1965. The Volunteer Subject. Human Relations 18(4):389–406.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment Analysis in Twitter. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Vancouver, Canada, pages 502–518.

Sara Rosenthal and Kathleen McKeown. 2012. Detecting Opinionated Claims in Online Discussions. In Proceedings of the International Conference on Semantic Computing (ICSC). IEEE, pages 30–37.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M. Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment Analysis in Twitter. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Denver, USA, pages 451–463.

Sara Rosenthal, Alan Ritter, Preslav Nakov, and Veselin Stoyanov. 2014. SemEval-2014 Task 9: Sentiment Analysis in Twitter. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Dublin, Ireland, pages 73–80.

Björn Ross, Michael Rist, Guillermo Carbonell, Ben Cabrera, Nils Kurowsky, and Michael Wojatzki. 2016. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. In Proceedings of the 3rd Workshop on Natural Language Processing for Computer-Mediated Communication (NLP4CMC III). Bochum, Germany, pages 6–9.

Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Accessible at https://arxiv.org/pdf/1609.04747.pdf; last accessed November 28 2018.

Sebastian Ruder, John Glover, Afshin Mehrabani, and Parsa Ghaffari. 2018. 360° stance detection. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). New Orleans, USA, pages 31–35.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning Internal Representations by Error Propagation. Technical report, California University, Institute for Cognitive Science, San Diego, USA.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning Representations by Back-propagating Errors. Nature 323(6088):533–536.

Eugen Ruppert, Abhishek Kumar, and Chris Biemann. 2017. LT-ABSA: An Extensible Open-Source System for Document-Level and Aspect-Based Sentiment Analysis. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Berlin, Germany, pages 55–60.

Vasile Rus, Rajendra Banjade, and Mihai C. Lintean. 2014. On Paraphrase Identification Corpora. In Proceedings of the International Conference on Language Resources and Evaluation (LREC). pages 2422–2429.

Irene Russo, Tommaso Caselli, and Carlo Strapparava. 2015. SemEval-2015 Task 9: CLIPEval Implicit Polarity of Events. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Denver, USA, pages 443–450.

Marzieh Saeidi, Guillaume Bouchard, Maria Liakata, and Sebastian Riedel. 2016. SentiHood: Targeted Aspect Based Sentiment Analysis Dataset for Urban Neighbourhoods. In Proceedings of the International Conference on Computational Linguistics (COLING). Osaka, Japan, pages 1546–1556.

Haji Mohammad Saleem, Kelly P. Dillon, Susan Benesch, and Derek Ruths. 2016. A Web of Hate: Tackling Hateful Speech in Online Social Spaces. In Proceedings of the Workshop on Text Analytics for Cybersecurity and Online Safety (TA-COS). pages 1–9.

Safi Niloofar Samghabadi, Deepthi Mave, Sudipta Kar, and Thamar Solorio. 2018. RiTUAL-UH at TRAC 2018 Shared Task: Aggression Identification. In Proceedings of the Workshop on Trolling, Aggression and Cyberbullying (TRAC). Santa Fe, USA, pages 12–18.

Zeeshan Ali Sayyed, Daniel Dakota, and Sandra Kübler. 2017. IDS-IUCL Contribution to GermEval 2017. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Berlin, Germany, pages 43–48.

J. Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2007. Collaborative Filtering Recommender Systems. The Adaptive Web (Lecture Notes in Computer Science, volume 4321) pages 291–324.

Karen Schulz, Margot Mieskes, and Christoph Becker. 2017. h-da Participation at GermEval Subtask B: Document-level Polarity. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Berlin, Germany, pages 13–17.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

William A. Scott. 1955. Reliability of Content Analysis: The Case of Nominal Scale Coding. Public Opinion Quarterly 19(3):321–325.

Yohei Seki, David Kirk Evans, Lun-Wei Ku, Hsin-Hsi Chen, Noriko Kando, and Chin-Yew Lin. 2007. Overview of Opinion Analysis Pilot Task at NTCIR-6. In Proceedings of the NTCIR Workshop. Tokyo, Japan, pages 265–278.

Yohei Seki, David Kirk Evans, Lun-Wei Ku, Le Sun, Hsin-Hsi Chen, Noriko Kando, and Chin-Yew Lin. 2008. Overview of Multilingual Opinion Analysis Task at NTCIR-7. In Proceedings of the NTCIR Workshop. Tokyo, Japan, pages 185–203.

Yohei Seki, Lun-Wei Ku, Le Sun, Hsin-Hsi Chen, and Noriko Kando. 2010. Overview of Multilingual Opinion Analysis Task at NTCIR-8. In Proceedings of the NTCIR Workshop. Tokyo, Japan, pages 209–220.

Aliaksei Severyn and Alessandro Moschitti. 2015. UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Denver, USA, pages 464–469.

Aliaksei Severyn, Alessandro Moschitti, Olga Uryupina, Barbara Plank, and Katja Filippova. 2014. Opinion Mining on YouTube. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Baltimore, USA, pages 1252–1261.

Lei Sha, Baobao Chang, Zhifang Sui, and Sujian Li. 2016. Reading and Thinking: Re-read LSTM Unit for Textual Entailment Recognition. In Proceedings of the International Conference on Computational Linguistics (COLING). Osaka, Japan, pages 2870–2879.

C. E. Shannon. 1948. A Mathematical Theory of Communication. Bell System Technical Journal 27(3):379–423.

Uladzimir Sidarenka. 2017. PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM Network. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Berlin, Germany, pages 49–54.

Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. 1996. Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics 22(1):1–38.

Reginald G. Smart. 1966. Subject selection bias in psychological research. Canadian Psychologist/Psychologie Canadienne 7(2):115–121.

F.J. Smith and K. Devine. 1985. Storing and Retrieving Word Phrases. Information Processing & Management 21(3):215–224.

Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Waikiki, USA, pages 254–263.

Parinaz Sobhani, Diana Inkpen, and Stan Matwin. 2015. From argumentation mining to stance classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Denver, USA, pages 67–77.

Parinaz Sobhani, Diana Inkpen, and Xiaodan Zhu. 2017. A Dataset for Multi-Target Stance Detection. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL). Valencia, Spain, pages 551–557.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Seattle, USA, pages 1631–1642.

Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing Stances in Ideological On-Line Debates. In Proceedings of the Workshop on Computational Approaches to Analysis and Generation of Emotion in Text. Los Angeles, USA, pages 116–124.

Sara Owsley Sood, Elizabeth F Churchill, and Judd Antin. 2012. Automatic identification of personal insults on social news sites. Journal of the American Society for Information Science and Technology 63(2):270–285.

Charles Spearman. 1904. The proof and measurement of association between two things. The American journal of psychology 15(1):72–101.

Dan Sperber and Deirdre Wilson. 1986. Relevance: communication and cognition. Language in Society 17(04):604–609.

Dan Sperber and Deirdre Wilson. 1995. Relevance: Communication and Cognition. Handbook of Pragmatics (Blackwell Publishing, second edition) pages 607–632.

Ellen Spertus. 1997. Smokey: Automatic Recognition of Hostile Messages. In Proceedings of the National Conference on Artificial Intelligence (AAAI) and Conference on Innovative Applications of Artificial Intelligence (IAAI). Providence, USA, pages 1058–1065.

Dhanya Sridhar, Lise Getoor, and Marilyn Walker. 2014. Collective stance classification of posts in online debate forums. In Proceedings of the joint Workshop on Social Dynamics and Personal Attributes in Social Media. Baltimore, USA, pages 109–117.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. The Journal of Machine Learning Research (JMLR) 15(1):1929–1958.

Christian Stab and Iryna Gurevych. 2014. Identifying Argumentative Discourse Structures in Persuasive Essays. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 46–56.

Philip J. Stone, Robert F. Bales, J. Zvi Namenwirth, and Daniel M. Ogilvie. 1962. The General Inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information. Systems Research and Behavioral Science 7(4):484–498.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 Task 14: Affective Text. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). Prague, Czech Republic, pages 70–74.

Carlo Strapparava, Alessandro Valitutti, et al. 2004. WordNet Affect: an Affective Extension of WordNet. In Proceedings of the International Conference on Language Resources and Evaluation (LREC). Lisbon, Portugal, pages 1083–1086.

Hui-Po Su, Zhen-Jie Huang, Hao-Tsung Chang, and Chuan-Jie Lin. 2017. Rephrasing profanity in Chinese text. In Proceedings of the Workshop on Abusive Language Online (ALW). Vancouver, Canada, pages 18–24.

Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A Survey of Collaborative Filtering Techniques. Advances in Artificial Intelligence 2009(1):4:2–4:2.

Cass R. Sunstein. 2002. The Law of Group Polarization. Journal of political philosophy 10(2):175–195.

Duyu Tang, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou. 2016. Sentiment Embeddings with Applications to Sentiment Analysis. IEEE Transactions on Knowledge and Data Engineering 28(2):496–509.

Mariona Taulé, M. Antònia Martí, Francisco Rangel, Paolo Rosso, Cristina Bosco, and Viviana Patti. 2017. Overview of the task of Stance and Gender Detection in Tweets on Catalan Independence at IBEREVAL 2017. In Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval). Murcia, Spain, pages 157–177.

Mariona Taulé, Francisco Rangel, M. Antònia Martí, and Paolo Rosso. 2018. Overview of the Task on Multimodal Stance Detection in Tweets on Catalan #1Oct Referendum. In Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval). Seville, Spain, pages 149–166.

Tun Thura Thet, Jin-Cheon Na, and Christopher S.G. Khoo. 2010. Aspect-based sentiment analysis of movie reviews on discussion boards. Journal of information science 36(6):823–848.

Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Sydney, Australia, pages 327–335.

Louis L. Thurstone. 1927. A Law of Comparative Judgment. Psychological Review 34(4):273–286.

Katrin Tomanek, Udo Hahn, Steffen Lohmann, and Jürgen Ziegler. 2010. A Cognitive Cost Model of Annotations Based on Eye-Tracking Data. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Uppsala, Sweden, pages 1158–1167.

Stephen E. Toulmin. 1958. The Uses of Argument. Cambridge University Press.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Edmonton, Canada, pages 173–180.

Kristi Tsukida and Maya R. Gupta. 2011. How to Analyze Paired Comparison Data. Technical report, Department of Electrical Engineering; University of Washington, USA.

Peter D. Turney. 2002. Thumbs Up or Thumbs Down?: Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Philadelphia, Pennsylvania, pages 417–424.

Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research (JAIR) 37(1):141–188.

Basile Valerio, Bolioli Andrea, Nissim Malvina, and Rosso Paolo. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proceedings of the Evaluation Campaign of Natural Language Processing and Speech tools for Italian (EVALITA). Pisa, Italy, pages 50–57.

Marjan Van de Kauter, Diane Breesch, and Véronique Hoste. 2015a. Fine-grained analysis of explicit and implicit sentiment in financial news articles. Expert Systems with Applications 42(11):4999–5010.

Marjan Van de Kauter, Bart Desmet, and Véronique Hoste. 2015b. The good, the bad and the implicit: a comprehensive approach to annotating explicit and implicit sentiment. Language Resources and Evaluation 49(3):685–720.

Cynthia Van Hee, Els Lefever, Ben Verhoeven, Julie Mennes, Bart Desmet, Guy De Pauw, Walter Daelemans, and Veronique Hoste. 2015. Detection and Fine-Grained Classification of Cyberbullying Events. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP). Hissar, Bulgaria, pages 672–680.

Marta Vila, M Antònia Martí, Horacio Rodríguez, et al. 2014. Is This a Paraphrase? What Kind? Paraphrase Boundaries and Typology. Open Journal of Modern Linguistics 4(1):205–218.

Julio Villena-Román, Miguel Ángel García Cumbreras, Janine García-Morera, Eugenio Martínez Cámara, César de Pablo Sánchez, Alfonso Ureña López, and Maria Teresa Martín Valdivia. 2014a. TASS 2014 – Workshop on Sentiment Analysis at SEPLN – Overview. In Proceedings of the TASS Workshop on Sentiment Analysis. Girona, Spain, pages 1–9.

Julio Villena-Román, Janine García-Morera, Miguel Ángel García Cumbreras, Eugenio Martínez-Cámara, Maria Teresa Martín-Valdivia, and Luis Alfonso Ureña López. 2015. Overview of TASS 2015. In Proceedings of the TASS Workshop on Sentiment Analysis. Alicante, Spain, pages 13–21.

Julio Villena-Román, Janine García-Morera, Sara Lana-Serrano, and José Carlos González-Cristóbal. 2014b. TASS 2013 – A Second Step in Reputation Analysis in Spanish. Procesamiento del Lenguaje Natural 52(1):37–44.

Dirk von Grünigen, Fernando Benites, Pius von Däniken, Mark Cieliebak, and Ralf Grubenmann. 2018. spMMMP at GermEval 2018 Shared Task: Classification of Offensive Content in Tweets using Convolutional Neural Networks and Gated Recurrent Units. In Proceedings of the GermEval 2018 – Shared Task on the Identification of Offensive Language. Vienna, Austria, pages 130–137.

Sergey Vychegzhanin and Evgeny Kotelnikov. 2017. Stance Detection in Russian: a Feature Selection and Machine Learning Based Approach. In Proceedings of the International Conference on Analysis of Images, Social Networks and Texts (AIST). Moscow, Russia, pages 166–177.

Henning Wachsmuth, Nona Naderi, Yufang Hou, Yonatan Bilu, Vinodkumar Prabhakaran, Tim Alberdingk Thijm, Graeme Hirst, and Benno Stein. 2017. Computational Argumentation Quality Assessment in Natural Language. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL). Valencia, Spain, pages 176–187.

Marilyn A. Walker, Pranav Anand, Robert Abbott, and Ricky Grant. 2012a. Stance Classification using Dialogic Properties of Persuasion. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Montréal, Canada, pages 592–596.

Marilyn A. Walker, Jean E. Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012b. A Corpus for Research on Deliberation and Debate. In Proceedings of the International Conference on Language Resources and Evaluation (LREC). Istanbul, Turkey, pages 812–817.

Ulli Waltinger. 2010. Sentiment Analysis Reloaded: A Comparative Study On Sentiment Polarity Identification Combining Machine Learning And Subjectivity Features. In Proceedings of the International Conference on Web Information Systems and Technologies (WEBIST). Valencia, Spain, pages 203–210.

Lu Wang and Claire Cardie. 2014. Improving Agreement and Disagreement Identification in Online Discussions with A Socially-Tuned Sentiment Lexicon. In Proceedings of the Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA). Baltimore, USA, pages 97–106.

William Warner and Julia Hirschberg. 2012. Detecting Hate Speech on the World Wide Web. In Proceedings of the Workshop on Language in Social Media (LSM). Montréal, Canada, pages 19–26.

Zeerak Waseem. 2016. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. In Proceedings of the Workshop on NLP and Computational Social Science (NLP+CSS). Austin, Texas, pages 138–142.

Zeerak Waseem, Wendy Hui Kyong Chung, Dirk Hovy, and Joel Tetreault. 2017a. Proceedings of the first workshop on abusive language online. In Proceedings of the Workshop on Abusive Language Online (ALW). Vancouver, Canada, pages i–vii.

Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017b. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. In Proceedings of the Workshop on Abusive Language Online (ALW). Vancouver, Canada, pages 78–84.

Zeerak Waseem and Dirk Hovy. 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In Proceedings of the NAACL Student Research Workshop. San Diego, California, USA, pages 88–93.

Penghui Wei, Junjie Lin, and Wenji Mao. 2018. Multi-Target Stance Detection via a Dynamic Memory-Augmented Network. In Proceedings of the International Conference on Research and Development in Information Retrieval (ACM SIGIR). Ann Arbor, USA, pages 1229–1232.

Wan Wei, Xiao Zhang, Xuqin Liu, Wei Chen, and Tengjiao Wang. 2016a. pkudblab at SemEval-2016 Task 6: A Specific Convolutional Neural Network System for Effective Stance Detection. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). San Diego, USA, pages 384–388.

Zhongyu Wei, Yang Liu, and Yi Li. 2016b. Is This Post Persuasive? Ranking Argumentative Comments in the Online Forum. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). Berlin, Germany, pages 195–200.

Cynthia M. Whissell. 1989. The Dictionary of Affect in Language. The Measurement of Emotions (first edition, Elsevier) pages 113–131.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation 39(2):165–210.

Gregor Wiedemann, Eugen Ruppert, Raghav Jindal, and Chris Biemann. 2018. Transfer Learning from LDA to BiLSTM-CNN for Offensive Language Detection in Twitter. In Proceedings of the GermEval 2018 – Shared Task on the Identification of Offensive Language. Vienna, Austria, pages 85–94.

Michael Wiegand, Josef Ruppenhofer, Anna Schmidt, and Clayton Greenberg. 2018a. Inducing a Lexicon of Abusive Words – a Feature-Based Approach. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). New Orleans, USA, pages 1046–1056.

Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018b. Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In Proceedings of the GermEval 2018 – Shared Task on the Identification of Offensive Language. pages 1–10.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. charagram: Embedding Words and Sentences via Character n-grams. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, USA, pages 1504–1515.

Frank Wilcoxon. 1945. Individual comparisons by ranking methods. Biometrics Bulletin 1(6):80–83.

Deirdre Wilson and Dan Sperber. 1986. On defining relevance. Philosophical Grounds of Rationality: Intentions, Categories, Ends (first edition, Clarendon Press Oxford) pages 243–258.

Deirdre Wilson and Dan Sperber. 2002. Relevance theory. In Handbook of Pragmatics (Oxford: Blackwell), pages 249–287.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Vancouver, Canada, pages 347–354.

Michael J. Wise. 1996. YAP3: Improved Detection Of Similarities In Computer Program And Other Texts. SIGCSE Bulletin (ACM Special Interest Group on Computer Science Education) 28(1):130–134.

Michael Wojatzki, Tobias Horsmann, Darina Gold, and Torsten Zesch. 2018a. Do Women Perceive Hate Differently: Examining the Relationship Between Hate Speech, Gender, and Agreement Judgments. In Proceedings of the Conference on Natural Language Processing (KONVENS). Vienna, Austria, pages 110–120.

Michael Wojatzki, Saif M. Mohammad, Torsten Zesch, and Svetlana Kiritchenko. 2018b. Quantifying Qualitative Data for Understanding Controversial Issues. In Proceedings of the International Conference on Language Resources and Evaluation (LREC). Miyazaki, Japan, pages 1405–1418.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. Berlin, Germany, pages 1–12.

Michael Wojatzki and Torsten Zesch. 2016a. ltl.uni-due at SemEval-2016 Task 6: Stance Detection in Social Media Using Stacked Classifiers. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). San Diego, USA, pages 428–433.

Michael Wojatzki and Torsten Zesch. 2016b. Stance-based Argument Mining-Modeling Implicit Argumentation Using Stance. In Proceedings of the Conference on Natural Language Processing (KONVENS). Bochum, Germany, pages 313–322.

Michael Wojatzki and Torsten Zesch. 2017. Neural, Non-neural and Hybrid Stance Detection in Tweets on Catalan Independence. In Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval). Murcia, Spain, volume 2.

Michael Wojatzki and Torsten Zesch. 2018. Comparing Target Sets for Stance Detection: A Case Study on YouTube Comments on Death Penalty. In Proceedings of the Conference on Natural Language Processing (KONVENS). Vienna, Austria, pages 69–79.

Michael Wojatzki, Torsten Zesch, Saif M. Mohammad, and Svetlana Kiritchenko. 2018c. Agree or Disagree: Predicting Judgments on Nuanced Assertions. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics (*SEM). New Orleans, USA, pages 214–224.

David H. Wolpert and William G. Macready. 1997. No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation 1(1):67–82.

Michelle Wright. 2018. Cyberbullying Victimization through Social Networking Sites and Adjustment Difficulties: The Role of Parental Mediation. Journal of the Association for Information Systems 19(2):113–123.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex Machina: Personal Attacks Seen at Scale. In Proceedings of the International Conference on World Wide Web (WWW). Perth, Australia, pages 1391–1399.

Jun-Ming Xu, Kwang-Sung Jun, Xiaojin Zhu, and Amy Bellmore. 2012. Learning from Bullying Traces in Social Media. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT). Montréal, Canada, pages 656–666.

Ruifeng Xu, Yu Zhou, Dongyin Wu, Lin Gui, Jiachen Du, and Yun Xue. 2016. Overview of NLPCC Shared Task 4: Stance Detection in Chinese Microblogs. In International Conference on Computer Processing of Oriental Languages. Kunming, China, pages 907–916.

Seid M. Yimam, Chris Biemann, Richard Eckart de Castilho, and Iryna Gurevych. 2014. Automatic Annotation Suggestions and Custom Annotation Layers in WebAnno. In Proceedings of the Conference of the Association for Computational Linguistics: System Demonstrations (ACL). Baltimore, Maryland, pages 91–96.

Seid M. Yimam, Iryna Gurevych, Richard Eckart de Castilho, and Chris Biemann. 2013. WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations. In Proceedings of the Conference of the Association for Computational Linguistics: System Demonstrations (ACL). Sofia, Bulgaria, pages 1–6.

Liang-Chih Yu, Lung-Hao Lee, Jin Wang, and Kam-Fai Wong. 2017. IJCNLP-2017 Task 2: Dimensional Sentiment Analysis for Chinese Phrases. In Proceedings of the International Joint Conference on Natural Language Processing: Shared Tasks (IJCNLP), pages 9–16.

Omar F. Zaidan and Chris Callison-Burch. 2011. Crowdsourcing Translation: Professional Quality from Non-professionals. In Proceedings of the Conference of the Association for Computational Linguistics (ACL). pages 1220–1229.

Guido Zarrella and Amy Marsh. 2016. MITRE at SemEval-2016 Task 6: Transfer Learning for Stance Detection. In Proceedings of the International Workshop on Semantic Evaluation Exercises (SemEval). San Diego, USA, pages 458–463.

Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701. Accessible at https://arxiv.org/abs/1212.5701; last accessed November 28, 2018.

Torsten Zesch and Iryna Gurevych. 2010. Wisdom of crowds versus wisdom of linguists – measuring the semantic relatedness of words. Natural Language Engineering 16(1):25–59.

Alice Zheng and Amanda Casari. 2018. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O’Reilly Media, Inc., Sebastopol, USA, first edition.

Yiwei Zhou, Alexandra I. Cristea, and Lei Shi. 2017. Connecting Targets to Tweets: Semantic Attention-Based Model for Target-Specific Stance Detection. In Proceedings of the International Conference on Web Information Systems Engineering (WISE). Puschino, Russia, pages 18–32.

George Kingsley Zipf. 1935. The Psycho-biology of Language: An Introduction to Dynamic Philology. George Routledge & Sons, Ltd., Oxford, United Kingdom, first edition.

Appendix

Target-specific Lexical Semantics

In the following, we show the raw scores that have been collected from 109 participants in a laboratory setting.

Table A.1

word1	word2	without context	with context	difference
years	arctic	1.88	3.39	+1.50
warming	levels	2.47	3.80	+1.33
fuel	politician	2.24	3.41	+1.18
generations	emissions	2.42	3.56	+1.14
record	summer	2.58	3.67	+1.09
fuel	record	2.22	3.27	+1.05
emissions	record	2.76	3.80	+1.04
emissions	degrees	3.24	4.05	+0.81
fraud	warming	2.41	3.19	+0.79
fraud	oceans	1.93	2.53	+0.60
warming	fuel	3.41	4.00	+0.59
future	levels	2.65	3.20	+0.55
pollutant	water	3.54	4.09	+0.55
hoax	ice	1.72	2.18	+0.45
sustainability	fraud	2.30	2.73	+0.43
energy	effects	3.67	4.04	+0.36
sustainability	control	3.31	3.67	+0.35
sustainability	agenda	3.08	3.35	+0.27
emissions	scientist	3.60	3.76	+0.16
chemtrails	scientist	3.49	3.38	+0.11

Word relatedness judged with and without context of the target climate change.
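
The difference column in these tables is the with-context mean minus the without-context mean. The following is a minimal sketch of that computation, using three rows copied from Table A.1; the class and function names are illustrative and not part of the original study. Because the tables report rounded means, a difference recomputed this way can deviate from the printed value by 0.01.

```python
from typing import NamedTuple


class PairRating(NamedTuple):
    """One table row: a word pair with its two mean relatedness scores."""
    word1: str
    word2: str
    without_context: float  # mean rating when no target is given
    with_context: float     # mean rating when the target is given


def context_shift(rating: PairRating) -> float:
    """Return with-context minus without-context, rounded to two decimals."""
    return round(rating.with_context - rating.without_context, 2)


# Three rows copied from Table A.1 (target: climate change).
rows = [
    PairRating("warming", "levels", 2.47, 3.80),
    PairRating("fuel", "record", 2.22, 3.27),
    PairRating("warming", "fuel", 3.41, 4.00),
]

for r in rows:
    # The "+" format flag prints an explicit sign, as in the tables.
    print(f"{r.word1} / {r.word2}: {context_shift(r):+.2f}")
```

A positive value means that the pair was judged as more related once the target was provided as context.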

Table A.2

word1	word2	without context	with context	difference
equality	clothes	2.13	3.44	+1.30
sex	politics	2.06	3.33	+1.28
prejudice	kitchen	2.32	3.53	+1.21
woman	equality	3.67	4.47	+0.80
gender	power	3.18	3.96	+0.79
cooking	children	2.21	2.95	+0.74
man	ignorance	2.65	3.37	+0.71
cooking	equality	2.12	2.82	+0.71
gender	hate	2.53	3.23	+0.70
politics	woman	3.40	4.07	+0.67
equality	money	3.12	3.75	+0.64
woman	respect	3.83	4.44	+0.61
politics	catcalling	1.80	2.41	+0.61
career	woman	3.94	4.35	+0.41
equality	propaganda	2.86	3.25	+0.38
money	man	3.15	3.49	+0.34
cooking	power	2.56	2.30	+0.26
man	rape	2.96	3.19	+0.23
maternity	money	2.79	2.98	+0.19
woman	cats	2.69	1.65	−1.04

Word relatedness judged with and without context of the target feminism.

Table A.3

word1	word2	without context	with context	difference
emails	scandal	2.44	3.60	+1.15
emails	liar	2.20	3.21	+1.01
lies	votes	2.44	3.28	+0.84
lies	money	2.83	3.53	+0.70
foundation	lies	2.04	2.72	+0.68
nomination	hatred	2.25	2.90	+0.65
scandal	phone	2.27	2.88	+0.61
sexist	nomination	1.76	2.28	+0.52
pantsuit	votes	1.64	2.13	+0.48
scandal	nomination	2.72	3.18	+0.46
idol	liar	2.14	2.46	+0.32
sexist	qualifications	1.83	2.11	+0.28
emails	qualifications	2.08	2.32	+0.24
liberals	nomination	3.36	3.60	+0.24
pantsuit	kindness	1.89	2.06	+0.17
election	hatred	2.89	3.04	+0.15
foundation	pantsuit	1.76	1.90	+0.14
votes	wife	2.37	2.49	+0.13
idol	scandal	3.23	3.16	+0.07
election	retirement	2.64	2.64	+0.00

Word relatedness judged with and without context of the target Hillary Clinton.

Table A.4

word1	word2	without context	with context	difference
choice	body	2.60	4.21	+1.61
body	decision	2.67	4.26	+1.59
mother	freedom	2.33	3.79	+1.46
choice	women	3.04	4.33	+1.29
babies	rape	1.71	2.95	+1.24
uterus	freedom	2.12	3.29	+1.16
procedure	death	2.50	3.61	+1.11
health	decision	3.15	4.19	+1.04
unborn	right	2.50	3.32	+0.82
right	control	3.31	4.04	+0.73
unborn	procedure	2.37	3.05	+0.69
clinics	access	2.65	3.30	+0.65
decision	life	4.08	4.54	+0.47
clinics	defunding	2.46	2.80	+0.34
god	contraception	2.64	2.47	+0.18
murder	decision	3.04	2.98	+0.06
consent	conception	2.76	2.80	+0.05
murder	clinic	2.77	2.75	−0.01
birth	murder	2.77	2.72	−0.05
babies	church	2.79	2.53	−0.26

Word relatedness judged with and without context of the target abortion.

Table A.5

word1	word2	without context	with context	difference
freethinker	reason	2.83	3.88	+1.05
bible	bullshit	1.85	2.75	+0.91
god	bullshit	1.94	2.81	+0.86
prayer	bullshit	1.96	2.65	+0.69
islam	bullshit	1.79	2.47	+0.69
peace	education	3.21	3.82	+0.61
lamb	love	1.80	2.32	+0.52
lamb	human	2.32	1.98	+0.34
human	superstitions	2.86	3.16	+0.30
enemy	islam	2.25	2.51	+0.26
enemy	god	2.08	2.32	+0.24
enemy	Christ	1.94	2.18	+0.23
faith	superstitions	3.07	3.30	+0.23
superstitions	bible	2.89	3.06	+0.17
lord	joy	2.40	2.47	+0.07
slavery	bible	3.40	3.33	−0.07
bible	sins	4.02	3.81	−0.21
lamb	science	2.08	1.70	−0.38
lord	mercy	3.88	3.47	−0.41
kingdom	slavery	3.73	3.28	−0.45

Word relatedness judged with and without context of the target atheism.

Specific Targets for Multi-Target Stance Detection

Below, we provide details on the target sets that were used to annotate the Atheism Dataset and the Death Penalty Dataset with stance on multiple targets.

Table A.6

target	description	examples for textual evidence
Christianity	belief in the religion Christianity	Jesus, Christ, Bible, Catholic, Gospel
freethinking	idea that truth should be formed on the basis of logic, reason, and empiricism	#freethinking, #DogmasNeverHelp
Islam	belief in the religion Islam	Quran, Ummah, Hadith, Mohammed, Allah
no evidence	idea that there is no evidence for religion	‘there is no evidence for God’
life after death	belief in an existence after death	paradise, heaven, hell
supernatural power	belief in a supernatural power	God, Lord, Ganesha, destiny, predestination
USA	United States of America	our country, our state, America, US
conservatism	the conservative movement in the USA	republicans, #tcot, tea party
same-sex marriage	the right of same-sex couples to marry	gay marriage, same-sex marriage
religious freedom	everyone should have the freedom to have and practice any religion	#religiousfreedom, ’right to choose your own religion’
secularism	religion and nation should be strictly separated	separation of church and state, #secularism

Targets used for annotating the Atheism Dataset.

Table A.7

set	target	explanation
IDebate	closure	The death penalty helps the victims’ families achieve closure.
	deterring	The death penalty deters crime.
	eye for an eye	The death penalty should apply as punishment for first-degree murder. We should rely on the biblical principle ‘an eye for an eye’.
	financial burden	The death penalty is a financial burden on the state.
	irreversible	Wrongful convictions are irreversible.
	miscarriages	The death penalty can result in irreversible miscarriages of justice.
	overcrowding	Executions help alleviate the overcrowding of prisons.
	preventive	Execution prevents the accused from committing further crimes.
	state killing	All state-sanctioned killing is wrong.
Reddit	electric chair	Executions should be done by electric chair.
	gunshot	Executions should be done by gunshot.
	strangulation	Executions should be done by strangulation.
	certainty unachievable	The certainty necessary for the death penalty is unachievable.
	heinous crimes	People who commit heinous crimes (e.g. murder, rape) should be sentenced to death.
	immoral to oppose	It is immoral to oppose the death penalty for convicted murderers.
	more harsh	The death penalty should be more harsh.
	more quickly	The death penalty should be enforced more quickly.
	psychological impact	The death penalty has a negative impact on the human psyche (e.g. for the executioners, witnesses).
	no humane form	There is currently no humane form of the death penalty.
	replace life-long	Life-long prison should be replaced by the death penalty.
	lethal force	If one is against the death penalty, one has to be against all state use of lethal force (e.g. military).
	abortion	If the death penalty is allowed, abortion should be legal, too.
	euthanasia	If the death penalty is allowed, euthanasia should be legal, too.
	use bodies	Bodies of the executed should be used to repay the society (e.g. organ donation, experiments).

Targets used for annotating the Death Penalty Dataset.
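
To illustrate how one annotation against such a target set can be represented, here is a minimal sketch of a record that holds a stance towards the overall target plus stances towards the specific targets from Table A.7. The class name, the field names, the label set (`favor`, `against`, `none`) and the example text are assumptions made for illustration; they are not taken from the annotation tooling used for the datasets.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed label set; "none" marks a target that the text does not address.
POLARITIES = {"favor", "against", "none"}


@dataclass
class MultiTargetAnnotation:
    """One annotated text: an overall stance plus stances on specific targets."""
    text: str
    overall_target: str
    overall_stance: str
    specific_stances: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject labels outside the assumed label set.
        for label in [self.overall_stance, *self.specific_stances.values()]:
            if label not in POLARITIES:
                raise ValueError(f"unknown polarity: {label}")

    def expressed_targets(self) -> List[str]:
        """Return the specific targets the text takes a stance on, sorted by name."""
        return sorted(t for t, s in self.specific_stances.items() if s != "none")


# Hypothetical example using two target names from Table A.7.
example = MultiTargetAnnotation(
    text="A wrongful execution can never be undone.",
    overall_target="death penalty",
    overall_stance="against",
    specific_stances={"irreversible": "favor", "deterring": "none"},
)
print(example.expressed_targets())
```

Keeping the specific stances in a dictionary keyed by target name makes it straightforward to count how often each target is addressed across a dataset.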

Questionnaires for Quantifying Qualitative Data

Below, we show an example questionnaire used for collecting assertions and an example questionnaire used for collecting judgments on the collected assertions:

QUESTIONNAIRE I: CREATING ASSERTIONS ON CONTROVERSIAL ISSUES

Provide at least five relevant assertions on the given controversial issue. The assertions must be expressions that one can agree or disagree with. They can be claims, beliefs, opinions, reasons, arguments, or any statement that can be used to inform or support one’s position on the issue. The assertions do not have to be reflective of your own opinions. The assertions can be about a sub-issue or an aspect of the issue.

The assertions should:

support a position that is relevant to the issue.
cover a diverse set of positions on the issue. (Avoid claims that rephrase the same argument in slightly different ways.)
be formulated in a way that a third person can agree with or contradict the assertion.
be self-contained and understandable without additional context. (Do not use ’it’, ’she/her’ or ’he/him/his’ etc. to refer to an issue, a person or something else that is not directly mentioned in your assertion.)
be precise. (Avoid vague formulations such as maybe, perhaps, presumably or possibly.)

The assertions should NOT:

be a simple expression of agreeing/supporting or disagreeing/rejecting the overall issue.
contain multiple positions (e.g. migrants are friendly and hardworking).
contain expressions of personal perspective (e.g. I don’t like immigrants).
be the same as any of the provided examples; or simple negations or other variants of a provided example.

Issue:	Marijuana
Description:	This issue is about legalization of cannabis. This includes the legalization for recreational or medical usage and other positive or negative consequences of legalizing cannabis.

Q1: True or False: This issue is about risks of consuming Cocaine.

( ) true

(✓) false

Q2: Choose the assertion which meets the format requirements:

(✓) The government should discourage any drug usage.

( ) Maybe, the government should discourage any drug usage.

Q3: Enter assertion 1 about ’Marijuana’:

Q4: Enter assertion 2 about ’Marijuana’:

Q5: Enter assertion 3 about ’Marijuana’:

Q6: Enter assertion 4 about ’Marijuana’:

Q7: Enter assertion 5 about ’Marijuana’:

QUESTIONNAIRE II: JUDGING ASSERTIONS ON CONTROVERSIAL ISSUES

We want to better understand common controversial issues such as immigration and same-sex marriage. Therefore, we have collected a large amount of assertions relevant to these issues. Your task is to:

Indicate whether you agree or disagree with these assertions.

Indicate how strongly you support or oppose these assertions. Since it is difficult to give a numerical score indicating the degree of support or degree of opposition, we will give you four assertions at a time, and ask you to indicate to us:
- Which of the assertions do you support the most (or oppose the least)?
- Which of the assertions do you oppose the most (or support the least)?
- If you support two assertions equally strongly, then select any one of them as the answer. The same applies for oppose.

Q1: Indicate whether you agree or disagree with the given assertions on the issue ’Black Lives Matter’.

Every race has experienced racism.
- ( ) agree ( ) disagree
There is racial discrimination in the U.S.
- ( ) agree ( ) disagree
The Black Lives Matter movement is important.
- ( ) agree ( ) disagree
Black Lives Matter encourages racial hate.
- ( ) agree ( ) disagree

Q2: Which of these assertions on the issue ’Black Lives Matter’ do you support the most (or oppose the least)?

( ) Every race has experienced racism.
( ) There is racial discrimination in the U.S.
( ) The Black Lives Matter movement is important.
( ) Black Lives Matter encourages racial hate.

Q3: Which of these assertions on the issue ’Black Lives Matter’ do you oppose the most (or support the least)?

( ) Every race has experienced racism.
( ) There is racial discrimination in the U.S.
( ) The Black Lives Matter movement is important.
( ) Black Lives Matter encourages racial hate.
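
Questions Q2 and Q3 ask for the most and the least supported assertion within a set of four. One common way to turn such selections into a score per assertion is simple best-worst counting: the number of times an assertion was chosen as most supported, minus the number of times it was chosen as most opposed, divided by the number of times it was shown. The sketch below assumes this counting scheme, which the questionnaire text itself does not specify. The function name, the response format, and the placeholder assertion IDs are illustrative, and the responses are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# One response: (assertions shown, most supported, most opposed).
Response = Tuple[List[str], str, str]


def best_worst_scores(responses: Iterable[Response]) -> Dict[str, float]:
    """Score each assertion as (#most supported - #most opposed) / #shown."""
    shown: Counter = Counter()
    best: Counter = Counter()
    worst: Counter = Counter()
    for assertions, most, least in responses:
        if most not in assertions or least not in assertions:
            raise ValueError("selected assertion was not among those shown")
        shown.update(assertions)
        best[most] += 1
        worst[least] += 1
    # Scores range from -1 (always most opposed) to +1 (always most supported).
    return {a: (best[a] - worst[a]) / shown[a] for a in shown}


# Invented responses from three participants; A-D stand for four assertions.
responses = [
    (["A", "B", "C", "D"], "A", "D"),
    (["A", "B", "C", "D"], "A", "C"),
    (["A", "B", "C", "D"], "B", "D"),
]

scores = best_worst_scores(responses)
for assertion, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{assertion}: {score:+.2f}")
```

Asking only for the two extremes of each set avoids requiring participants to assign numerical scores, while the aggregated counts still yield a ranking of the assertions.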

Guidelines for Making Implicit Hateful Tweets Explicit

If the given tweet contains an implicit stance towards Islam, Muslims, or refugees, please make it explicit by paraphrasing each sentence using one of the following rules:

Build the implicit target in as an additional argument, meaning subject or object (and adjust the sentence accordingly): Refugees are criminals. Return to where they came from! #Refugees $\xrightarrow[]{\text{transform to}}$ Refugees are criminals. Refugees must return to where they came from! #Refugees

Replace co-references of the implicit stance: Refugees are criminals. Return to where they came from! #Refugees $\xrightarrow[]{\text{transform to}}$ Refugees are criminals. Refugees must return to where they came from! #Refugees

Build the stance in as a noun-adjective conversion specifying one of the arguments, meaning the subject or the object: Other countries don’t have issues with Muslims. Merkel’s curse! #Muslims $\xrightarrow[]{\text{transform to}}$ Other countries don’t have issues with Muslims. Merkel’s Muslim curse! #Muslims

If the message of the tweet is softened through the use of modals, quantifiers, or specifications, make it more explicit by paraphrasing it in the following way:

Delete the specifying phrase: Criminal refugees aren’t punished for their crimes. $\xrightarrow[]{\text{transform to}}$ Refugees aren’t punished for their crimes.

Replace the specifying phrase: Many refugees are criminal. $\xrightarrow[]{\text{transform to}}$ All refugees are criminal.

Delete the softening phrase: I believe refugees are criminals. $\xrightarrow[]{\text{transform to}}$ Refugees are criminals.

Change the softening phrase: They should be sent back. $\xrightarrow[]{\text{transform to}}$ They must be sent back.

Reformulate rhetorical questions to affirmative statements: Was this a coincidence? $\xrightarrow[]{\text{transform to}}$ This was not a coincidence.

Suggested Citation

Wojatzki, Michael Maximilian. Computer-Assisted Understanding of Stance in Social Media: Formalizations, Data Creation, and Prediction Models. Universität Duisburg-Essen, 2019, doi:10.17185/duepublico/48043.

Repository

duepublico.uni-duisburg-essen.de

Identifiers

urn:nbn:de:hbz:464-20190201-140926-6

doi: 10.17185/duepublico/48043

Metadata: Deutsche Nationalbibliothek (German National Library)

