GSoC 2017: The preprocessing

Google Summer of Code 2017 with Red Hen Lab

In their own words:

The Distributed Little Red Hen Lab is a global laboratory and consortium for research into multimodal communication. Its main goal is theory. Its second goal is the development of new computational, statistical, and technical tools to assist research on multimodal communication.

Among the medley of fascinating explorations in multimodal communication, Red Hen specializes in communicative learning, analyzed in terms of communicative effects and communicative intent.

The preprocessing

From the vast NewsScape dataset available, I assembled a set of news stories and transcripts on various topics from Fox News, CNN, and MSNBC. The initial crude attempts to use this text to train a neural network to generate its own text in a certain style. From here, I will later attempt to seed the generated text with either some start words or an image.

Preprocessing involves putting together the relevant text body from separate, individual files that are tagged with information about the media house. Doing this, I obtain my raw input file with ?, ?, ? lines of text for Fox News, CNN, and MSNBC respectively.

This raw input includes an ugly timestamps with every line, which I get rid of with a script and some regex magic.

I also found an old dataset with CNN and Fox transcripts that could prove useful to pad the training vocabulary. Stowed that away, just in case.