# Capstone Project Notes

## Examples

<https://github.com/davidmasse/freelancer-rates>

<https://github.com/slieb74/NBA-Shot-Analysis>

<https://github.com/cpease00/etf_forecasting>

<https://github.com/NaokoSuga/gentrification_yelp>

<https://github.com/mrethana/news_bias_final>

<https://github.com/paulinaczheng/twitter_flu_tracking>

## Project

{% embed url="http://www.opensources.co/" %}
Lists of online sources
{% endembed %}

{% embed url="https://github.com/several27/FakeNewsCorpus" %}
Cleaned dataset (9.1 GB)
{% endembed %}

{% embed url="https://github.com/aws/aws-cli" %}

{% embed url="https://scrapy.org/" %}

{% embed url="https://stackoverflow.com/questions/45828616/streaming-large-training-and-test-files-into-tensorflows-dnnclassifier" %}

{% embed url="https://geoip2.readthedocs.io/en/latest/" %}

```python
import socket

import geoip2.database

# Resolve the domain to an IP address, then look up its country in the
# local GeoLite2 database.
ip = socket.gethostbyname('nike.com')
with geoip2.database.Reader('GeoLite2-Country_20190305/GeoLite2-Country.mmdb') as reader:
    response = reader.country(ip)
    print(response.country.iso_code)  # e.g. 'US'
```

{% embed url="https://makenewscredibleagain.github.io/#works" %}

{% embed url="https://realpython.com/python-keras-text-classification/" %}

{% embed url="https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568" %}

{% embed url="https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/" %}

{% embed url="https://monkeylearn.com/text-classification-naive-bayes/" %}

{% embed url="https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn" %}

### Workflow

1. **Data Collection**

   Collect news articles from a set of credible and non-credible websites. Get training labels from [OpenSources](http://www.opensources.co/), a professionally curated database.
2. **Sampling**

   Sample from the corpus in such a way that the training set contains an even number of unique articles from both credible and non-credible sources for each day of data collection.
3. **Classifier**

   Build an ensemble classifier that considers the predictions of two separate models:\
   a) "Content-only" model (Multinomial Naive Bayes)\
   b) "Context-only" model (Adaptive Boosting)
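The balanced sampling in step 2 could be sketched with pandas; the column names (`article_id`, `date`, `label`) are assumptions about the corpus schema, not the project's actual code:

```python
import pandas as pd

def balanced_daily_sample(df, n_per_class, seed=0):
    """Draw up to n_per_class unique articles per (day, label) pair.

    Assumes df has 'article_id', 'date', and 'label' columns
    (label: 1 = credible source, 0 = non-credible).
    """
    # Drop duplicate articles so each one is counted once.
    df = df.drop_duplicates('article_id')
    # Sample the same number of rows from each class on each day.
    parts = [group.sample(min(len(group), n_per_class), random_state=seed)
             for _, group in df.groupby(['date', 'label'])]
    return pd.concat(parts, ignore_index=True)
```

Capping at `min(len(group), n_per_class)` keeps the function from failing on days where one class has fewer articles than requested, at the cost of a slightly imbalanced day.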
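The two-model ensemble in step 3 could look like the following scikit-learn sketch. The toy articles, the metadata features (domain age, ad count), and the soft-voting combination are illustrative assumptions, not the project's actual implementation:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (stand-ins for the real corpus).
texts = [
    "officials confirm budget report released today",
    "shocking secret they do not want you to know",
    "city council approves new transit plan",
    "miracle cure that doctors hate revealed",
]
labels = np.array([1, 0, 1, 0])  # 1 = credible source, 0 = non-credible

# Hypothetical per-article context features, e.g. [domain_age_years, ad_count].
meta = np.array([[20, 2], [1, 15], [15, 3], [2, 12]])

# a) "Content-only" model: TF-IDF features into Multinomial Naive Bayes.
vectorizer = TfidfVectorizer()
content_model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# b) "Context-only" model: AdaBoost on the metadata features.
context_model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(meta, labels)

def predict(new_texts, new_meta):
    """Soft-vote ensemble: average both models' class probabilities."""
    proba = (content_model.predict_proba(vectorizer.transform(new_texts))
             + context_model.predict_proba(new_meta)) / 2
    return (proba[:, 1] >= 0.5).astype(int)
```

Averaging predicted probabilities (a soft vote) is one simple way to combine the two models; weighting the average, or stacking a meta-classifier on top, are common alternatives.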
