GSoC 2017: Final report

Google Summer of Code 2017 with Red Hen Lab

Contents

  1. Organisation: Red Hen Lab
  2. Acknowledgements
  3. Abstract
  4. Pipeline
  5. Dependencies
  6. Usage
  7. Sample outputs
  8. Dataset
  9. Approaches attempted
  10. Future work
  11. Repository: News bias detector

Organisation: Red Hen Lab

In their own words:

The Distributed Little Red Hen Lab is a global laboratory and consortium for research into multimodal communication. Its main goal is theory. Its second goal is the development of new computational, statistical, and technical tools to assist research on multimodal communication.

Among the medley of fascinating explorations in multimodal communication, Red Hen specializes in communicative learning, analyzed in terms of communicative effects and communicative intent.

Acknowledgements

I want to express my gratitude to Prof. Francis Steen, Prof. Mark Turner, and the entire Red Hen consortium of researchers for giving me a chance to explore and contribute to some of the projects under this banner.
I also thank Shruti Gullapuram, who helped me through my Google Summer of Code 2017 period as my mentor along with Prof. Steen.
Big thanks to Google for another successful edition of GSoC, and a shout-out to fellow GSoC participants for being ever-ready to help each other.
My deepest gratitude also goes to my supportive family, friends, and peers.

Abstract

Project: Neural Network Models to Study Framing and Echo Chambers in News

With every passing day, we can see that the representation of our world conjured up by the mainstream media is far removed from the facts. The media depiction of an event may disregard the facts and rely on other means to achieve the desired communicative effects in the minds of the audience, depending on the communicative intent of the media house.

Interpretive frames can be used to manipulate facts as convenient. In current political debates, opposing camps are often confident in their own interpretations of facts, in their dismissal of certain facts as irrelevant, or in their emphasis of other facts as crucial.

Media bias is no secret; it is common knowledge, for instance, that Fox News leans conservative while MSNBC leans liberal. Language plays an important role in framing and changing the debate.

The aim is to consider controversial topics and compare the contrasting viewpoints in different news articles about the same piece of news. A person reading text produced by biased media houses should be able to correctly identify the slant. The challenge is to see whether this detection can be automated, that is, whether a machine can discern that a particular news story has a bias just by analysing the way it is written. The task is an experiment in automatically detecting interpretive frames, where identical facts may be painted in a different light by different sources, depending on the intended communicative effect.

Pipeline

SVO detector

A Python-based subject-verb-object (SVO) detector, built on spaCy, generates a list of detected SVO triplets for each sentence when fed a block of text.

spaCy also provides part-of-speech details as a parse tree, along with the various noun phrases of each input sentence. Therefore, a list of noun phrases is added as additional sentence metadata. This is later useful for identifying more complex concepts from the base noun that forms part of the SVO triplet.

This SVO parser works by loading a language model via the spacy.load() function, the standard entry point into spaCy, which constructs a language processing pipeline. The built-in factory spacy.en.English works right off the bat once the model data has been downloaded to its path. Once loaded, the parser can process unicode strings.

The svo-senti.py code splits a text block into sentences and feeds them one at a time to the parser.
The getSVOs() function returns the required information using the implementation in svo.py, performing multiple passes through each sentence to identify subject, SV, SVO, and even SVAO (with adjectives) triplets, with additional analysis of negations and conjunctions. The parser's parse.noun_chunks returns all noun phrases in the sentence.
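For illustration, a minimal dependency-based SVO pass might look like the sketch below. This is a simplification: the actual svo.py handles negation, conjunctions, and passive constructions far more carefully, and naive_svos is a hypothetical helper name.

import spacy

nlp = spacy.load('en')  # the English model installed under Dependencies

def naive_svos(sentence):
    # Return rough (subject, verb, object) triplets for one sentence.
    doc = nlp(sentence)
    svos = []
    for token in doc:
        if token.pos_ == 'VERB':
            subjects = [t for t in token.lefts if t.dep_ in ('nsubj', 'nsubjpass')]
            objects = [t for t in token.rights if t.dep_ in ('dobj', 'attr')]
            for subj in subjects:
                for obj in objects:
                    svos.append((subj.lower_, token.lower_, obj.lower_))
    return svos

sentence = u"Trump defends son in Paris."
print(naive_svos(sentence))                                 # [(u'trump', u'defends', u'son')]
print([chunk.text for chunk in nlp(sentence).noun_chunks])  # the noun phrases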

TODO #1: Resolve issues with sentences that are in passive voice.
TODO #2: Integrate pronoun coreference resolution before SVO parsing.

Sentiment detector

A Python-based sentiment analyser using NLTK and the VADER sentiment tool returns sentiment scores (compound, negative, neutral, positive). These are the final fields of the sentence-level metadata, added to the data.csv file as separate columns. The sentiment score is powerful information, as it generally reflects the sentiment of the subject towards the object, or that of the speaker towards the topics in the sentence.

The svo-senti.py code splits a text block into sentences and feeds them one at a time to the analyser.
The getSentiments() function uses polarity_scores() from NLTK's SentimentIntensityAnalyzer (nltk.sentiment.vader) to return sentiment scores as floats: a compound score between -1 and 1, plus negative, neutral, and positive scores between 0 and 1.
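For illustration, a minimal sketch of this step; the analyzer class below is the same one getSentiments() wraps (run nltk.download('vader_lexicon') once first if the lexicon is missing):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores(u"The President's son has shown astonishingly poor judgement.")
print(scores)
# e.g. {'compound': -0.4767, 'neg': 0.307, 'neu': 0.693, 'pos': 0.0}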

Bias classifier

For training, the text data is transformed into word embeddings, obtained from the pre-trained GloVe embeddings. The input for training the model is a mix of these word embeddings and the sentiment scores.

Word embeddings have the useful property of mapping words to a vector space representing features of the words. Similar concepts tend to have closely grouped vectors, hence semantic relationships are captured. Moreover, interesting relationships can be seen between vectors, such as king − queen ≈ man − woman.
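As a quick, optional sanity check of this property, gensim (not a project dependency) can query GloVe vectors converted to word2vec text format; the file path here is a hypothetical placeholder:

from gensim.models import KeyedVectors

# GloVe vectors previously converted to word2vec text format (hypothetical path)
vectors = KeyedVectors.load_word2vec_format('glove.6B.100d.w2v.txt')
# king - man + woman should land near queen
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))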

TODO #3: Port to using pre-trained vectors trained on part of Google News dataset for better political vocabulary.

For initial tests, the input data for this model had to be built sentence by sentence from news articles; until the training data grows large enough, the model's prediction performance remains quite bad.

The classifier is built using Python and Keras with the TensorFlow backend. Initially, the pre-trained embeddings are used to prepare an embedding matrix over the word indexes of all tokens found in the input dataset vocabulary, obtained with the Keras Tokenizer class. Every unique token is thus mapped to its corresponding embedding, or to a default vector if it does not exist in the pre-trained set.
The sentences are converted to sequences of word indexes, so the text samples become 2D integer tensors. Next, the category labels are converted to simple integers representing the category index. The input data is then split into training and validation sets using NumPy.
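A sketch of this preprocessing, closely following the standard Keras pre-trained embeddings recipe, is shown below. The GloVe file path is an assumption, and texts (list of sentences) and label_indices (list of category indexes) are assumed to have been read from data.csv:

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

MAX_WORDS, MAX_LEN, EMB_DIM = 20000, 1000, 100

# Index the pre-trained GloVe vectors.
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Map every unique token to an integer index and pad to equal length.
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts)
data = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)
labels = to_categorical(np.asarray(label_indices))

# Build the embedding matrix; rows stay zero for out-of-vocabulary tokens.
num_words = min(MAX_WORDS, len(tokenizer.word_index)) + 1
embedding_matrix = np.zeros((num_words, EMB_DIM))
for word, i in tokenizer.word_index.items():
    if i < num_words and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

# Shuffle and split into training and validation sets with NumPy.
indices = np.random.permutation(len(data))
split = int(0.8 * len(data))
x_train, y_train = data[indices[:split]], labels[indices[:split]]
x_val, y_val = data[indices[split:]], labels[indices[split:]]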

The given classifier model is a one-dimensional convolutional neural network with global max pooling. An embedding layer loads the word vectors. It is fed sequences of integers, that is, a 2D input of shape (samples, indices). These input sequences are padded so that they all have the same length within a batch of input data. This layer maps the integer inputs to the vectors found at the corresponding index in the embedding matrix.
On top of it, a 1D convolutional network is built, with a dense output layer corresponding to bias values between 0 and 1 (in 0.1 increments). model.compile() and model.fit() are used to train and evaluate the classifier.
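Continuing the preprocessing sketch above, the model itself might look roughly as follows. The layer sizes are illustrative rather than the project's tuned values, and the 11 output units match the label tensor shape shown under Sample outputs (bias 0.0 to 1.0 in 0.1 steps):

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential()
model.add(Embedding(num_words, EMB_DIM,
                    weights=[embedding_matrix],
                    input_length=MAX_LEN,
                    trainable=False))           # keep the GloVe vectors frozen
model.add(Conv1D(128, 5, activation='relu'))    # 1D convolution over word windows
model.add(GlobalMaxPooling1D())
model.add(Dense(128, activation='relu'))
model.add(Dense(11, activation='softmax'))      # one unit per 0.1 bias bucket

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=16, epochs=10)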

The reason for using a CNN is, quite simply, that it works. Convolutional layers are powerful at extracting higher-level features in images, and work in most 2D problems; the embedding vector sequences likewise give a nice 2D picture of each sentence. Training a CNN is also over 50% faster than training LSTMs on such problems.

After a rough test on only 300 sample input sentences manually tagged with bias scores, and just 10 epochs, the network is able to predict new scores with about 45% accuracy.

Dependencies

Install nltk, spaCy, and vaderSentiment in order for svo-senti.py to work:

sudo pip install -U nltk
sudo pip install -U spacy
sudo python -m spacy download en
sudo pip install vaderSentiment

Install NumPy, TensorFlow, and Keras in order for classify.py to work:
First follow the instructions from the TensorFlow installation guide to get TensorFlow on your machine, with CPU or GPU support as preferred.

pip install numpy
pip install keras

Usage

In terminal command line, use as follows:

python svo-senti.py "[block of text]"
python classify.py

svo-senti.py preprocesses and populates a data.csv file containing sentences to classify, one per row, which can then be used by classify.py.

Sample outputs

$ python svo-senti.py "Trump defends son in Paris. The President's son has 
shown astonishingly poor judgement."

Trump defends son in Paris.
SVOs:  [(u'trump', u'defends', u'son')]
Noun phrases:  Trump,  son,  Paris, 
Polarity scores:  compound: 0.0  neg: 0.0  neu: 1.0  pos: 0.0 

The President's son has shown astonishingly poor judgement.
SVOs:  [(u'son', u'shown', u'judgement')]
Noun phrases:  The President's son,  astonishingly poor judgement, 
Polarity scores:  compound: -0.4767  neg: 0.307  neu: 0.693  pos: 0.0 
$ python classify.py

Using TensorFlow backend.
Indexing word vectors.
Found 400000 word vectors.
Processing text dataset
Found 8755 unique tokens.
('Shape of data tensor:', (300, 1000))
('Shape of label tensor:', (300, 11))
Preparing embedding matrix.
Training model.
Train on 240 samples, validate on 60 samples
Epoch 1/10
240/240 [==============================] - 4s - loss: 2.3818 - acc: 0.3208 - val_loss: 2.0752 - val_acc: 0.5000
...
Epoch 10/10
240/240 [==============================] - 4s - loss: 1.3479 - acc: 0.4750 - val_loss: 1.3426 - val_acc: 0.5000

Dataset

A sample of the data as stored in the data.csv file is shown here:

Bias: 0.2
Sentence: We should make clear that we share Turkey's concerns about the status of Crimean Turks and other ethnic minorities.
SVOs: [(u'we', u'share', u'concerns')]
Noun phrases: We, we, Turkey's concerns, the status, Crimean Turks, other ethnic minorities
Polarity scores: compound: 0.5859  neg: 0  neu: 0  pos: 0.1716

Bias: 0.4
Sentence: While expectations for the Kerry tenure at State may have started low, he has raised them considerably.
SVOs: [(u'expectations', u'started', u'low'), (u'he', u'raised', u'them')]
Noun phrases: expectations, the Kerry tenure, State, he, them
Polarity scores: compound: -0.2732  neg: -0.0316912  neu: 0.102544  pos: 0

Bias: 0.5
Sentence: Putin clearly decided that the benefits of an armed intervention in Ukraine outweighed whatever costs the United States would be willing to impose.
SVOs: [(u'benefits', u'outweighed', u'costs')]
Noun phrases: Putin, the benefits, an armed intervention, Ukraine, the United States
Polarity scores: compound: 0.7096  neg: 0.05322  neu: 0.0486  pos: 0.178848

Approaches attempted

Phrase-matching bias detector

The initial simple pipeline implementation merely checks for certain phrases in the input text to flag sensationalism or partisanship. These terms were scraped from various sources for this initial rough experiment. The sources for partisan language include this article from The Atlantic on “Why Democrats and Republicans Literally Speak Different Languages”, and the paper “What Drives Media Slant?” by Gentzkow and Shapiro. Some terms that could constitute sensational language were found in this document of Homeland Security watch-phrases.
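A minimal sketch of this check is given below; the word lists are tiny hypothetical stand-ins for the scraped vocabularies:

SENSATIONAL = {'disaster', 'carnage', 'chaos'}
PARTISAN = {'obamacare', 'affordable care act', 'death taxes'}

def detect_bias(text):
    lowered = text.lower()
    sensational = [w for w in SENSATIONAL if w in lowered]
    partisan = [p for p in PARTISAN if p in lowered]
    print('Sensationalism issues:', len(sensational))
    print('Partisanship issues:', len(partisan))
    if sensational:
        print('* Sensationalist vocabulary is used')
        print('  Mentioned:', ', '.join(sensational))
    if partisan:
        print('* Partisan vocabulary is used')
        print('  Mentioned:', ', '.join(partisan))

detect_bias('Grenfell disaster victims murdered by political decisions')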

In terminal command line, use as follows:

python detect_bias.py "[block of text]"

For example,

$ python detect_bias.py "Grenfell disaster victims murdered by political decisions"

NEWS BIAS DETECTOR

Sensationalism issues: 1
Partisanship issues: 0

* Sensationalist vocabulary is used

Mentioned: disaster


$ python detect_bias.py "If Republican efforts to repeal and replace Obamacare 
are successful, one of the biggest winners would be the wealthy. The Senate's 
bill -- released this week -- differs in key ways from the House-passed version. 
But proposals eliminate the taxes imposed on high-income Americans to help 
pay for an expansion of health benefits under the Affordable Care Act. The 
legislation also would let people contribute more to certain tax-advantaged accounts."

NEWS BIAS DETECTOR

Sensationalism issues: 0
Partisanship issues: 2

* Partisan vocabulary is used

Mentioned: Obamacare, Affordable Care Act

Topic detection triplets

Topic detection in Red Hen’s NewsScape collection generates output characterized by a set of words indicating agent, location, and act, using this format:

20160701013037.536|20160701013049.882|TID_01 ID=0088 who=randy_paige miguel_almaguer vaulty_glen suzie_suh corey_burdick huma_hannah paul_jaffe jeff_vaughn clayton_kershaw |where=california new_york_times_square valley_glen_area b valley palmdale victorville anaheim orange_county los_angeles florida new_york u_s |what=honda airbag owner bag model car airbags recall driver nhtsa air takata suzie civic fix replace vehicle rental accura fire immediately malfunction drive warning injure test defective lazar woodman acura crv tl dealer rupture reporter deadly thought explore ban massive traffic government repair implement accident
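A small sketch of reading one such line into its who/where/what fields, assuming the pipe-separated format above (the format is inferred from this single sample, and the topic ID token is ignored here for brevity):

def parse_topic_line(line):
    # Split on the pipe separators; the first two fields are timestamps.
    parts = [p.strip() for p in line.split('|')]
    start_time, end_time = parts[0], parts[1]
    fields, current = {}, None
    for part in parts[2:]:
        for chunk in part.split():
            if '=' in chunk:
                key, _, first = chunk.partition('=')
                current = fields.setdefault(key, [])
                current.append(first)
            elif current is not None:
                current.append(chunk)  # continuation of the current word list
    return start_time, end_time, fields  # fields['who'], fields['where'], fields['what']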

This dissertation on “Joint Image-Text Topic Detection and Tracking for Analyzing Social and Political News Events” by Weixin Li elaborates on the topic detection pipeline. The main data source is the UCLA Library Broadcast NewsScape.

After pre-processing steps such as story segmentation, the proposed method detects topics using a cluster sampling method, Swendsen-Wang Cuts (SWC), based on the proposed Multimodal Topic And-Or Graph (MT-AOG), which jointly models texts and images and organizes news topic components in a hierarchical structure. The text part of each topic is represented by an AND-node, whose three components (OR-nodes describing a set of possible words for each component) encode the knowledge of “who”, “where”, and “what”. These three components describe the people involved, the related places, and what happened, respectively. The words themselves are represented by TERMINAL-nodes in the last layer. Contextual relations are also represented, using information about word co-occurrence and the ratios of entity counts across components.

Hence, there are clusters of words indicating who, where, and what, and these are referred to as triplets. Such story segmentation gives some level of information on the particular slant of a story. For instance, say in coverage of an election, we may identify how often a particular candidate is mentioned or how much time is allocated to each of the candidates.

Weixin Li’s study of the 2016 election news shows that quantitative aspects of bias could be detected with the proposed methods: for example, relative numbers of candidate mentions and topic trajectories could be mapped.

RNN for sentence level political ideology

The model is based on the “Political Ideology Detection Using Recursive Neural Networks” paper.

The recursive neural network (RNN) structure takes into account the hierarchical nature of language, building up the identified ideological bias from combinations of words and phrases rather than in a mere bag-of-words manner.

The initial experiments are reported on the Convote dataset of Congressional debates, which is annotated at the author level for partisanship; party labels are propagated from each speaker down to all of their individual sentences and then mapped from party label to ideology label. The authors also filter only the biased sentences from the Ideological Books Corpus (IBC) dataset, which is annotated for ideological bias at the sentence and phrase level. Each document in the IBC has been manually labelled with coarse-grained ideologies (right, left, and centre) as well as fine-grained ideologies (such as religious-right and libertarian-right) by political science experts.

Other interesting ideas

One project showing different word associations based on political stance, called Partisan Thesaurus, could be useful at a later stage for more nuanced study.

Another paper on “Linguistic Models for Analyzing and Detecting Biased Language” uses an interesting method to determine bias, by looking at Wikipedia page edit history data where humans have tried to correct biased article phrasing.

This study on the abortion debate, and the loaded terminology used by various media houses, really highlights the effects of bias. Neutral terms would be “anti-abortion” and “abortion rights”. Loaded language would be “pro-choice” and “pro-life”.

Another, more general article reveals further examples of vocabulary that might distinguish between liberal and conservative viewpoints. Liberals use “comprehensive health reform,” “estate taxes,” “undocumented workers,” and “tax breaks for the wealthy”. Conservatives use “Washington takeover of health care,” “death taxes,” “illegal aliens,” and “tax reform” to refer to the same concepts.

Future work

SVO and aspect-based sentiment

Weixin Li also attempts to determine the sentiment of candidate-related tweets, using the VADER sentiment analysis tools. Focusing on tweets has advantages over longer-form news stories, where the target of the sentiment in a sentence is often ambiguous. Her focus is on analysing trends and sentiments associated with candidate mentions alone.

For framing and bias detection in more general terms, tweets are not sufficient. News articles must be analysed for contrasting views on controversial issues, and the aforementioned problem of ambiguous sentiment targets must be tackled. One potential approach would make use of the SVO detection step described in this report. Annotating a sentence with subject-verb-object data clearly shows who is discussing what. There are then a few options for whom the sentiment scores are relevant to: the writer of the sentence, the person/entity that is the subject, or the person/entity that is the object.

For instance, the first sentence is a neutral fact while the second is the writer’s negative opinion about a fact.

Trump defends son in Paris.
Polarity scores:  compound: 0.0  neg: 0.0  neu: 1.0  pos: 0.0 

The President's son has shown astonishingly poor judgement.
Polarity scores:  compound: -0.4767  neg: 0.307  neu: 0.693  pos: 0.0 

SVAO information associates additional descriptive words with the object, covering option three. Entity-level or aspect-based sentiment methods, combined with this, will indicate option two or three. If no such cues exist, then option one is most likely.
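A hypothetical heuristic capturing this decision might look as follows; the function name, inputs, and rules are illustrative only:

def sentiment_target(svao_adjectives, aspect_level_hits):
    # Option three: descriptive words (SVAO) tie the sentiment to the object.
    if svao_adjectives:
        return 'object'
    # Options two/three: entity-level or aspect-based cues point at an entity.
    if aspect_level_hits:
        return 'subject or object'
    # Option one: with no such cues, attribute the sentiment to the writer.
    return 'writer'

print(sentiment_target([u'astonishingly', u'poor'], []))  # 'object'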

Noun phrases

It is unreasonable to pass the entire sentence parse tree to a classifier. However, SVOs alone result in a lot of information loss.

An SVO detector will usually pinpoint the individual words that it regards as subject, verb, and object, in the base form of each word. All modifiers, adjectives, and so on are discarded. Here, noun phrases come into the picture and can preserve useful information: the SVO terms can be mapped back to more complex phrases to ensure that the meaning is not lost.

For example,

The President's son has shown astonishingly poor judgement.
SVOs:  [(u'son', u'shown', u'judgement')]
Noun phrases:  The President's son,  astonishingly poor judgement
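A small sketch of this mapping, matching an SVO head word back to the noun phrase that contains it (expand_term is a hypothetical helper name):

def expand_term(term, noun_phrases):
    # Return the first noun phrase containing the SVO head word.
    for phrase in noun_phrases:
        if term in phrase.lower().split():
            return phrase
    return term

noun_phrases = [u"The President's son", u"astonishingly poor judgement"]
print(expand_term(u'son', noun_phrases))        # The President's son
print(expand_term(u'judgement', noun_phrases))  # astonishingly poor judgement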

Pronoun coreference resolution

For correct interpretation of a text, or even to estimate the relative importance of various mentioned subjects, pronouns and other referring expressions must be connected to the right entities. Algorithms intended to resolve coreferences often look for the nearest preceding compatible entity.

Without it (that is, coreference resolution), topic segmentation and parsing of news articles would not generate the appropriate associated sentiments. A human could easily identify what “it” in the previous sentence meant, but for an automatic classifier, such references must be replaced by the actual terms.

Moreover, further complications arise where the same entity is referred to in many ways. Imagine articles that refer to “Donald Trump Jr.”, “Trump’s son”, “President’s eldest son”, etc.
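A toy illustration of the nearest-preceding-entity heuristic described above; real resolvers also check gender, number, and other compatibility constraints, and the output depends on the loaded model:

import spacy

nlp = spacy.load('en')

def naive_coref(text):
    # Replace each pronoun with the nearest preceding PERSON entity.
    doc, entities, out = nlp(text), [], []
    for token in doc:
        if token.pos_ == 'PRON' and entities:
            out.append(entities[-1])
        else:
            if token.ent_type_ == 'PERSON':
                entities.append(token.text)
            out.append(token.text)
    return ' '.join(out)

print(naive_coref(u"Trump defends son. He shows poor judgement."))
# e.g. "Trump defends son . Trump shows poor judgement ."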


Repository: News bias detector

All code for the work done in this project can be found in my GitHub repository.