I am interested in whether an online article link can be predicted to be clickbait from the structure and wording of its title.
This interests me because sifting through the deluge of information in the modern world is rapidly becoming a required skill, and judging how truthfully something is phrased matters both for artificial intelligence and for making the most of a person's limited time.
I would like to know:
- Whether there is a particular pattern used to mask clickbait. This would imply that we as humans have a predisposition to certain communication patterns.
- Whether article sources can be assigned a coefficient, implying that we can rank article sources by "trustworthiness".
- Which cognitive distortions/logical fallacies are most exploited in the creation of clickbait.
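As a starting point for the first question, structural properties of a title could be turned into features. The sketch below is only an illustration: the function name `title_features` and the particular features (leading number, punctuation, second-person wording, all-caps words) are assumptions, not a list taken from any dataset or prior study.

```python
import re

# Hypothetical word list; which pronouns matter is an assumption to be tested.
SECOND_PERSON = {"you", "your", "you'll", "you're"}


def title_features(title: str) -> dict:
    """Extract a few structural features from an article title.

    The feature set is illustrative only; the real set would be chosen
    after inspecting the control groups.
    """
    # Alphabetic tokens (apostrophes kept so contractions stay whole).
    words = re.findall(r"[A-Za-z']+", title)
    return {
        "word_count": len(words),
        "starts_with_number": bool(re.match(r"\s*\d+", title)),
        "has_question_mark": "?" in title,
        "has_exclamation": "!" in title,
        "second_person": any(w.lower() in SECOND_PERSON for w in words),
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
    }


if __name__ == "__main__":
    print(title_features("10 Things You Won't BELIEVE!"))
```

Features like these could then be compared across groups of sources to see whether any of them separate clickbait from non-clickbait titles.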
To do this, we need three groups of data sources:
- Truth control: This is a group of article sources that are considered to be telling the truth
- Liar control: This is a group of article sources that are considered to be misrepresenting the truth
- Variable: This is the group of article sources that we will be comparing against the prior two to determine their "trustworthiness"
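One possible way to turn the three groups into a per-source "trustworthiness" coefficient is to place a variable source's average title score on a scale between the two controls. This is a minimal sketch under assumptions: it presumes some per-title clickbait score already exists (how to compute it is not decided here), and the function name `trust_coefficient` and the linear scaling are hypothetical choices.

```python
from statistics import mean


def trust_coefficient(variable_scores, truth_scores, liar_scores):
    """Place a source between the liar control (0.0) and truth control (1.0).

    Each argument is a list of per-title scores from some scoring method
    (assumed to exist). The result is clamped to the range [0.0, 1.0].
    """
    truth_mean = mean(truth_scores)
    liar_mean = mean(liar_scores)
    if truth_mean == liar_mean:
        # The score cannot tell the controls apart, so no ranking is possible.
        raise ValueError("control groups are indistinguishable on this score")
    position = (mean(variable_scores) - liar_mean) / (truth_mean - liar_mean)
    return max(0.0, min(1.0, position))


if __name__ == "__main__":
    # Made-up scores purely to show the call shape, not real data.
    print(trust_coefficient([0.5], [0.0, 0.25], [0.75, 1.0]))
```

A source scoring like the truth control would come out near 1.0 and one scoring like the liar control near 0.0. Whether a single linear scale is adequate is itself something the analysis would need to check.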
For the truth control group I intend to use articles from:
- The New York Times Article Search and Most Popular Listings
- The US Government's Open Data portal Climate Reports
- Wikipedia Page Traffic Statistics from the last 3 months
- Reddit's Not The Onion Subreddit
For the variable group, I will compare and analyze:
For the liar control group I intend to use articles from:
- The Onion Politics Section
- Reddit's Fake News Subreddit
- The Daily Currant
- The National Report
Further sources can be taken from this list on Fake News Watch.
I am also considering comparing Kickstarter pitches to their article content, against their success: http://webrobots.io/kickstarter-datasets/