
Remaining tasks for collecting individual contributions data:

  1. Download those 100 web pages that make up the big contributions table. (Save these to GitHub please!) We'll need to automate this later for the party data (10x bigger), but for now I think it's simplest to just bite the bullet and hit "next" 100 times. (A rough sketch of what that automation might look like follows this list.)

  2. Run the contributions.rb script to scrape these pages and output the contributions CSV file (save this to GitHub too; a rough idea of the scraping step is sketched after this list).

  3. Run the contributors.rb script to read the contributions CSV file, download the 20,000 contributor web pages, scrape them into a contributors CSV file, and save it to GitHub. (Maybe do this at night or on a weekend? 20,000 requests at 1 request/second is about 5.5 hours, which shouldn't be hard for them to handle, but who knows. See the throttled download sketch after this list.)

  4. Join the two CSV files into one totally un-normalized table (in yet another CSV file), possibly with the Unix 'join' command if it plays nicely with CSV quoting. This is a hack so we can start playing with this data early. (A small Ruby join sketch follows this list.)
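
For when we automate step 1: a minimal download loop might look like the sketch below. The URL pattern is a placeholder (the real query string needs to be copied from the browser while paging through the table), and the one-second sleep keeps us at the same polite rate we plan for step 3.

```ruby
require 'net/http'
require 'uri'
require 'fileutils'

# Placeholder URL pattern; copy the real query string from the browser's
# address bar while paging through the contributions table.
BASE_URL = 'http://example.gc.ca/contributions?page=%d'

FileUtils.mkdir_p('pages')
(1..100).each do |page|
  html = Net::HTTP.get(URI.parse(format(BASE_URL, page)))
  File.open(format('pages/contributions_%03d.html', page), 'w') { |f| f << html }
  sleep 1  # stay polite: roughly one request per second
end
```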
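
For step 2, contributions.rb already exists, so this is just a rough idea of the shape of the scraping: walk the saved pages, pull the table rows out with Nokogiri, and append them to one CSV. The CSS selectors and column names here are guesses, not what the script actually does.

```ruby
require 'csv'
require 'nokogiri'

# The column layout and the 'table tr' selector are guesses; match them to
# the real markup before trusting the output.
CSV.open('contributions.csv', 'w') do |csv|
  csv << %w[contributor_name amount date contributor_url]
  Dir.glob('pages/contributions_*.html').sort.each do |path|
    Nokogiri::HTML(File.read(path)).css('table tr').each do |row|
      cells = row.css('td')
      next if cells.size < 3  # header rows use <th>, not <td>
      link = row.at_css('a')
      csv << [cells[0].text.strip, cells[1].text.strip,
              cells[2].text.strip, link && link['href']]
    end
  end
end
```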
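
For step 3, the part worth getting right is the throttle and resumability, since a 5.5-hour job will probably get interrupted at least once. A sketch, assuming the contributions CSV has a contributor_url column (that column name is an assumption):

```ruby
require 'csv'
require 'net/http'
require 'uri'
require 'fileutils'

# 'contributor_url' is an assumed column name. 20,000 pages at one request
# per second works out to about 5.5 hours.
urls = CSV.read('contributions.csv', headers: true)
          .map { |row| row['contributor_url'] }
          .compact.uniq

FileUtils.mkdir_p('contributors')
urls.each_with_index do |url, i|
  out = format('contributors/%05d.html', i)
  next if File.exist?(out)  # skip pages we already have, so the job is resumable
  File.open(out, 'w') { |f| f << Net::HTTP.get(URI.parse(url)) }
  sleep 1
end
```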
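
For step 4, one caveat with Unix join(1) is that it wants both inputs sorted on the key field and will mis-split any field containing a quoted comma, so a tiny Ruby join may be the safer hack. A sketch, assuming both files share a contributor_url column (again, an assumed key):

```ruby
require 'csv'

# Index the contributors file by the shared key, then emit one wide,
# un-normalized row per contribution.
contributor_headers = []
contributors = {}
CSV.foreach('contributors.csv', headers: true) do |row|
  contributor_headers = row.headers
  contributors[row['contributor_url']] = row.fields
end

contributions = CSV.read('contributions.csv', headers: true)
CSV.open('joined.csv', 'w') do |out|
  out << contributions.headers + contributor_headers
  contributions.each do |row|
    # Pad with blanks when a contributor page was missing or failed to scrape.
    extra = contributors[row['contributor_url']] || Array.new(contributor_headers.size)
    out << row.fields + extra
  end
end
```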
