tagucci/cnn-dailymail

Code to obtain raw texts of the CNN / Daily Mail dataset (non-anonymized) for summarization (python3)

This code produces the non-anonymized version of the CNN / Daily Mail summarization dataset, as used in the ACL 2017 paper Get To The Point: Summarization with Pointer-Generator Networks.

These are modified versions of the original scripts that produce raw texts instead of TensorFlow binaries.

Instructions

1. Download data

Download and unzip the stories directories from here for both CNN and Daily Mail.

2. Download Stanford CoreNLP

We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following command to your bash_profile:

export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

replacing /path/to/ with the path to where you saved the stanford-corenlp-full-2016-10-31 directory. You can check if it's working by running

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see something like:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
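
For reference, this is roughly how the processing script in step 3 drives the tokenizer in bulk: it writes a mapping file of input/output paths and lets PTBTokenizer process the whole list in a single JVM invocation. A minimal sketch, assuming CLASSPATH is set as above; the directory names and function name are illustrative, not the script's exact API.

# Sketch: tokenize a directory of .story files with Stanford PTBTokenizer.
import os
import subprocess

def tokenize_stories(stories_dir, tokenized_dir):
    os.makedirs(tokenized_dir, exist_ok=True)
    # PTBTokenizer's -ioFileList mode reads "input <tab> output" path pairs.
    with open("mapping.txt", "w") as f:
        for name in os.listdir(stories_dir):
            f.write("%s \t %s\n" % (os.path.join(stories_dir, name),
                                    os.path.join(tokenized_dir, name)))
    command = ["java", "edu.stanford.nlp.process.PTBTokenizer",
               "-ioFileList", "-preserveLines", "mapping.txt"]
    subprocess.check_call(command)
    os.remove("mapping.txt")

tokenize_stories("cnn/stories", "cnn_stories_tokenized")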

3. Process into .txt and vocab files

Run

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

replacing /path/to/cnn/stories with the path to where you saved the cnn/stories directory that you downloaded; similarly for dailymail/stories.

This script will do several things:

  • First, cnn_stories_tokenized and dm_stories_tokenized will be created temporarily and filled with tokenized versions of cnn/stories and dailymail/stories. If you set rm_tokenized_dir = True, these tokenized directories will be removed after processing. This may take some time. Note: you may see several Untokenizable: warnings from the Stanford Tokenizer; these seem to be related to Unicode characters in the data, and so far it seems OK to ignore them.

  • For each of the URL lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to text files.

  • All text files are saved in the newly created finished_files directory as {train/val/test}_{article/abstract}.txt. In these files, each story's article or abstract is written on a single line (see the first sketch after this list). This lets you quickly try open-source packages like OpenNMT to train models.

  • The original data sizes are train: 287,226, val: 13,368 and test: 11,490, as described in the paper. However, the train set produced here has 287,112 pairs, because 114 of the original training articles are empty.

  • In addition, train/val/test directories are created inside finished_files. In each one, articles and abstracts are placed in separate article/abstract subdirectories, again written line by line. This layout can be convenient for extractive summarization methods.

  • Additionally, a vocab file is created from the training data and is also placed in finished_files (see the second sketch below).
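
The line-by-line text files come from splitting each tokenized story at its @highlight markers: in the raw .story files the article text comes first and each reference summary sentence follows an @highlight line. A minimal sketch of this split, with an illustrative function name rather than the script's exact API:

def get_art_abs(story_file):
    # Read a tokenized story, lowercase it, and split it into
    # (article, abstract) at the @highlight markers.
    with open(story_file) as f:
        lines = [line.strip().lower() for line in f if line.strip()]
    article, highlights = [], []
    next_is_highlight = False
    for line in lines:
        if line == "@highlight":
            next_is_highlight = True
        elif next_is_highlight:
            highlights.append(line)
            next_is_highlight = False
        else:
            article.append(line)
    # Joined onto a single line each, matching the finished .txt format.
    return " ".join(article), " ".join(highlights)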
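
The vocab file is essentially a word-frequency count over the training data, written one "word count" pair per line. A minimal sketch, assuming the finished_files layout above; the 200,000-word cap is an assumption, not necessarily the script's setting:

from collections import Counter

counter = Counter()
for path in ["finished_files/train_article.txt",
             "finished_files/train_abstract.txt"]:
    with open(path) as f:
        for line in f:
            counter.update(line.split())

with open("finished_files/vocab", "w") as out:
    # Most frequent words first; 200,000 is an assumed cap.
    for word, count in counter.most_common(200000):
        out.write("%s %s\n" % (word, count))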

Extra (Lead-3 baseline result)

I evaluated the lead-3 baseline as reported in See's paper. Instead of pyrouge, which the author used, I use pythonrouge to compute ROUGE scores. While I obtained the same ROUGE scores for the pointer-generator / pointer-generator+coverage models using the test output downloaded from the author's pointer-generator repository, the lead-3 baseline result is slightly different.

If you want to evaluate the lead-3 baseline, set eval_rouge = True in line 19 of make_datafiles.py. You also need to install the pythonrouge package:

# install pythonrouge
pip install git+https://github.com/tagucci/pythonrouge.git
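
Once pythonrouge is installed, the lead-3 baseline itself is just the first three sentences of each article scored against the reference abstract. A minimal sketch following pythonrouge's documented usage; the sentences and parameter choices below are placeholders, not the exact settings behind the table that follows:

from pythonrouge.pythonrouge import Pythonrouge

article_sents = ["first sentence of the article .",
                 "second sentence .",
                 "third sentence .",
                 "fourth sentence ."]
abstract_sents = ["reference summary sentence ."]

lead3 = article_sents[:3]  # the entire baseline: take the first 3 sentences

rouge = Pythonrouge(summary_file_exist=False,
                    summary=[lead3],               # one system summary
                    reference=[[abstract_sents]],  # one reference set
                    n_gram=2, ROUGE_SU4=False, ROUGE_L=True,
                    stemming=True, stopwords=False,
                    word_level=True, length_limit=False)
print(rouge.calc_score())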

ROUGE Scores

                                           ROUGE-1   ROUGE-2   ROUGE-L
lead-3 baseline (Nallapati et al., 2017)    39.2      15.7      35.5
lead-3 baseline (See et al., 2017)          40.34     17.70     36.57
lead-3 baseline (this repository)           40.24     17.70     36.45
