Blog Post 3 - Fake News Classifier

In this blog, we’ll build a machine learning model with Tensorflow that helps us to classify fake news.

First, we’ll import some of the needed packages our dataset to feed the model.

import pandas as pd
import tensorflow as tf
import numpy as np

from matplotlib import pyplot as plt

Both the training and testing datasets are accessed from Kaggle:

#train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"
df = pd.read_csv('fake_news_train.csv')

test_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true"
test_df = pd.read_csv(test_url)

df

	Unnamed: 0	title	text	fake
0	17366	Merkel: Strong result for Austria's FPO 'big c...	German Chancellor Angela Merkel said on Monday...	0
1	5634	Trump says Pence will lead voter fraud panel	WEST PALM BEACH, Fla.President Donald Trump sa...	0
2	17487	JUST IN: SUSPECTED LEAKER and “Close Confidant...	On December 5, 2017, Circa s Sara Carter warne...	1
3	12217	Thyssenkrupp has offered help to Argentina ove...	Germany s Thyssenkrupp, has offered assistance...	0
4	5535	Trump say appeals court decision on travel ban...	President Donald Trump on Thursday called the ...	0
...	...	...	...	...
22444	10709	ALARMING: NSA Refuses to Release Clinton-Lynch...	If Clinton and Lynch just talked about grandki...	1
22445	8731	Can Pence's vow not to sling mud survive a Tru...	() - In 1990, during a close and bitter congre...	0
22446	4733	Watch Trump Campaign Try To Spin Their Way Ou...	A new ad by the Hillary Clinton SuperPac Prior...	1
22447	3993	Trump celebrates first 100 days as president, ...	HARRISBURG, Pa.U.S. President Donald Trump hit...	0
22448	12896	TRUMP SUPPORTERS REACT TO DEBATE: “Clinton New...	MELBOURNE, FL is a town with a population of 7...	1

22449 rows × 4 columns

Apart from the indices column, the dataframe contains three more informative columns: title column, the titles of the news; text column, the content of that news; the fake column, label indicating whether the news is 0(real), or 1(fake).

Make Dataset

Before we actually start the training process, transforming the dataset into decent forms for training are also very important. Here, we’ll create a make_dataset function that’ll do two things:

remove stop words from the article text and title columns(stop words refers to those most common words in a language, which could be filtered out for training. Some common stop words are a, the, as, at, by, to, etc)
construct a tensorflow dataset with inputs (title, text) and output fake column labels. Dataset format of enables the input to be smoothly handled by tensorflow (tensorflow dataset is a specific class containing many useful functions to keep you organized)

To remove stop words, we’ll first import some extra packages and functions.

the stop words function gives you a list of common stop words of specified language;
the word_tokenize function breaks a string into single words and punctuations of a 1d array form;
the TreebankWordDetokenizer function detokenizes a 1d array of words, combining it back into one individual string.

import nltk
#nltk.download('stopwords') #only need to be downloaded once
#nltk.download('punkt')  #only need to be downloaded once as well
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
stop = stopwords.words('english')

from nltk.tokenize.treebank import TreebankWordDetokenizer

def make_dataset(df):
    df_ = df
    
    #remove stopwords for the dataframe
    df_['title'] = df_['title'].apply(word_tokenize)# split a string into separate tokens
    df_['title'] = df['title'].apply(lambda x: [w for w in x if not w in stop] )#remove stopwords
    df_['text'] = df_['text'].apply(word_tokenize)
    df_['text'] = df_['text'].apply(lambda x: [w for w in x if not w in stop] )
    df_['text'] = df_['text'].apply(lambda x: TreebankWordDetokenizer().detokenize(x))#combine the tokens back to a complete sentence
    df_['title'] = df_['title'].apply(lambda x: TreebankWordDetokenizer().detokenize(x))

    #create tensorflow dataset with inputs (title, text) and output fake column labels and return it
    return tf.data.Dataset.from_tensor_slices(({"title": df_['title'], "text": df_['text']}, df_['fake']))

We then apply the function to our training and testing datasets:

dataset = make_dataset(df)

test_dataset = make_dataset(test_df)

Train Validation Split, Create Batches

Now we are ready to construct some train & validation split inside our training set. Along with that, we’ll also set batches for training set, as training with the original dataset volume might be too inefficient for demonstration purposes.

#shuffle the complete training set
dataset = dataset.shuffle(buffer_size = len(dataset))

#specify training and validation size: 0.7 & 0.2, respectively
train_size = int(0.7*len(dataset))
val_size   = int(0.2*len(dataset))

#pick out first 0.7 as training set, with a batch of 100
train = dataset.take(train_size).batch(100)
#pick out the next 0.2 as validation set, with a batch of 100
val   = dataset.skip(train_size).take(val_size).batch(100)
#test  = dataset.skip(train_size + val_size).batch(100) #last 0.1 of data

#check for the length of the two sets
len(train), len(val)#, len(test)

(158, 45)

test_ds = test_dataset.batch(100)

Create Models

We are now ready to make our models. While our data contains informative columns of both news title and news text, we would like to know whether titles only, text only, or both, would be more effective for detection of fake news. Therefore, we’ll be creating three models, with similar strategies, and comparing their accuracies.

Vectorization layer

No matter which combination we’ll be using, we all need to first vectorize the input strings. Vectorization refers to processing of the strings into computer readable format, which is, numbers. There are multiple ways of vectorization of strings, such as the term-document matrix, etc. Here, we’ll be using the TextVectorization function provided by tensorflow. This function would represent each of the words by its rank of frequency in the whole dataset. First, we’ll need to define a standardization function to supply to the TextVectorization function.

Here, the standardization function would convert all words to lowercases and remove all punctuations.

def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation),'')
    return no_punctuation 

Then, with the TextVectorization function, we’ll make a vectorize_layer that transforms each words in the training set, after standardization treatment, into tensors with the words’ corresponding frequency rank.

from tensorflow.keras import losses
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import re
import string

# only the first 2000 distinct words in the whole set will be tracked
max_tokens = 2000

# only the first 20 words of each headline will be considered
sequence_length = 20

vectorize_layer = TextVectorization(
    standardize=standardization,
    max_tokens=max_tokens, # only consider this many words
    output_mode='int',
    output_sequence_length=sequence_length) 

After creating the layers, we would still need to adapt the layer to our set. By adapting it, we will ensure that the vectorization layer have the words’ corresponding frequency ranks fixed. Thus whenever we are training or testing with that layer, each distinct word would have that one and only ranking.

vectorize_layer.adapt(train.map(lambda x, y: x["title"]))
vectorize_layer.adapt(train.map(lambda x, y: x["text"]))

Then, we’ll specify two input layers for title and text.

from tensorflow import keras
from tensorflow.keras import layers

title_input = keras.Input(
    shape = (1,), 
    name = "title",
    dtype = "string"
)

text_input = keras.Input(
    shape = (1,), 
    name = "text",
    dtype = "string"
)

Model trains with titles only

Since we have two kinds of inputs, title and text, we’ll be using the keras functional API, instead of sequential API. The functional API can create layers that have more flexible sequences, therefore suitable for building a model with multiple inputs. First, we’ll build a model that learns the news titles only.

Our first layer would be the vectorize_layer we just built. In each of the following layers, we supply the input as the result obtained from the preceding layer in (). With Dropout, the input units in that layer will be set to 0 with a frequency of rate at each step during training time, which helps prevent overfitting. The Embedding layer creates a vocabulary space, giving each word its own unique spot in that space. Furthermore, the spots of the words are set in a way that the distance and direction between words would indicate relatedness and relationships. We’ll give this layer a name “embedding_tl” to ease the process of examining it later.

x = vectorize_layer(title_input)
x = layers.Embedding(max_tokens, output_dim = 5, name = "embedding_tl")(x)
x = layers.Dropout(0.2)(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(32, activation='relu')(x)
output = layers.Dense(2, name = "fake")(x)

Now, we are finally ready to build the model. Only specifying the initial input, tensorflow will be able to extract all that information we’ve provided of the layers.

model_tl = keras.Model(
    inputs = title_input,
    outputs = output
)

model_tl.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
title (InputLayer)           [(None, 1)]               0         
_________________________________________________________________
text_vectorization (TextVect (None, 20)                0         
_________________________________________________________________
embedding_tl (Embedding)     (None, 20, 5)             10000     
_________________________________________________________________
dropout (Dropout)            (None, 20, 5)             0         
_________________________________________________________________
global_average_pooling1d (Gl (None, 5)                 0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 5)                 0         
_________________________________________________________________
dense (Dense)                (None, 32)                192       
_________________________________________________________________
fake (Dense)                 (None, 2)                 66        
=================================================================
Total params: 10,258
Trainable params: 10,258
Non-trainable params: 0
_________________________________________________________________

As always, we’ll compile the model and fit it to the training set.

model_tl.compile(optimizer = "adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy']
)

history = model_tl.fit(train, 
                    validation_data=val,
                    epochs = 50, 
                    verbose = False)

And we’ll plot out the accuracy curve.

plt.plot(history.history["accuracy"], label = "training")
plt.plot(history.history["val_accuracy"], label = "validation")
plt.gca().set(xlabel = "epoch", ylabel = "accuracy")
plt.legend()

model_tl.evaluate(test_ds)

225/225 [==============================] - 0s 1ms/step - loss: 0.0840 - accuracy: 0.9752

[0.08402662724256516, 0.9751881957054138]

Evaluating the model with only titles with out test set, we could see that the accuracy is actually pretty high: 0.97! Let’s see if the other two models could do better.S

Model trains with text only

Training with text would have a similar process: specify each layers, make the model, compile it, fit it, evaluate and observe.

x_ = vectorize_layer(text_input)
x_ = layers.Embedding(max_tokens, output_dim = 5, name = "embedding_tx")(x_)
x_ = layers.Dropout(0.2)(x_)
x_ = layers.GlobalAveragePooling1D()(x_)
x_ = layers.Dropout(0.2)(x_)
x_ = layers.Dense(32, activation='relu')(x_)
output = layers.Dense(2, name = "fake")(x_)

model_tx= keras.Model(
    inputs = text_input,
    outputs = output
)

model_tx.compile(optimizer = "adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy']
)

model_tx.fit(train, 
                    validation_data=val,
                    epochs = 50, 
                    verbose = False)

model_tx.evaluate(test_ds)

225/225 [==============================] - 1s 6ms/step - loss: 0.1612 - accuracy: 0.9463

[0.16124901175498962, 0.9462782144546509]

The accuracy for the model trained with the text only has a lower accuracy. Somehow reasonable, as the text is longer, and a fake news might also possess some sentences that sounds true.

Model trained with both text and title

both = layers.concatenate([x, x_], axis = 1)
output_both = layers.Dense(2, name = "fake")(both)

model_both= keras.Model(
    inputs = [title_input, text_input],
    outputs = output_both
)

model_both.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
title (InputLayer)              [(None, 1)]          0                                            
__________________________________________________________________________________________________
text (InputLayer)               [(None, 1)]          0                                            
__________________________________________________________________________________________________
text_vectorization (TextVectori (None, 20)           0           title[0][0]                      
                                                                 text[0][0]                       
__________________________________________________________________________________________________
embedding_tl (Embedding)        (None, 20, 5)        10000       text_vectorization[0][0]         
__________________________________________________________________________________________________
embedding_tx (Embedding)        (None, 20, 5)        10000       text_vectorization[1][0]         
__________________________________________________________________________________________________
dropout (Dropout)               (None, 20, 5)        0           embedding_tl[0][0]               
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 20, 5)        0           embedding_tx[0][0]               
__________________________________________________________________________________________________
global_average_pooling1d (Globa (None, 5)            0           dropout[0][0]                    
__________________________________________________________________________________________________
global_average_pooling1d_1 (Glo (None, 5)            0           dropout_2[0][0]                  
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 5)            0           global_average_pooling1d[0][0]   
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 5)            0           global_average_pooling1d_1[0][0] 
__________________________________________________________________________________________________
dense (Dense)                   (None, 32)           192         dropout_1[0][0]                  
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 32)           192         dropout_3[0][0]                  
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 64)           0           dense[0][0]                      
                                                                 dense_1[0][0]                    
__________________________________________________________________________________________________
fake (Dense)                    (None, 2)            130         concatenate[0][0]                
==================================================================================================
Total params: 20,514
Trainable params: 20,514
Non-trainable params: 0
__________________________________________________________________________________________________

model_both.compile(optimizer = "adam",
              loss = losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy']
)

model_both.fit(train, 
                    validation_data=val,
                    epochs = 50, 
                    verbose = False)

<tensorflow.python.keras.callbacks.History at 0x7f84c388e910>

model_both.evaluate(test_ds)

225/225 [==============================] - 1s 6ms/step - loss: 0.0371 - accuracy: 0.9913

[0.037129346281290054, 0.991313636302948]

Here, we could observe that training with both text and title yields an accuracy of 0.9913!

Embedding Analysis

In the next stage, we’ll extract the previous Embedding layer, do some visualization of it to demonstrate how the words are learned by the model. We’ll plot out each word’s spot in the whole 2d vocabulary space. Since our embedding layer has an output of 5 dimensions, we’ll first need to conduct principle component analysis to transform it into 2d representations. Sci-kit learn has that function created for us.

First, we’ll get the vocabularies and weights corresponding to the words in the two embedding layers of title/text out:

weights_tl = model_both.get_layer('embedding_tl').get_weights()[0]
weights_tx = model_both.get_layer('embedding_tx').get_weights()[0]# get the weights from the embedding layer
vocab = vectorize_layer.get_vocabulary()  

vocab = vectorize_layer.get_vocabulary()  

weights_tl

array([[ 0.10268341,  0.09916885, -0.11284392,  0.12111397, -0.10344727],
       [-0.10890676, -0.13384679,  0.10487777, -0.12676527,  0.13442   ],
       [ 0.05327572,  0.05760193, -0.11840718, -0.00806848, -0.11700375],
       ...,
       [ 0.02861674,  0.02574723,  0.04686182,  0.09227268, -0.03490727],
       [ 0.15639482,  0.09743544, -0.06591856,  0.02370312, -0.01798774],
       [ 0.00066107, -0.01369754, -0.06120993,  0.03191474, -0.09007049]],
      dtype=float32)

Next, we perform PCA to the weights to convert them into 2d arrays:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
weights_tl = pca.fit_transform(weights_tl)
weights_tx = pca.fit_transform(weights_tx)

len(vocab)

Then, we’ll create a data frame containing the embedding of words in titles.

embedding_tl = pd.DataFrame({
    'word' : vocab, 
    'x0'   : weights_tl[:,0],
    'x1'   : weights_tl[:,1]
})
embedding_tl

	word	x0	x1
0		-0.195497	0.003443
1	[UNK]	0.317627	0.025609
2	said	-0.105505	-0.010339
3	trump	0.100054	-0.045148
4	the	1.743519	-0.003649
...	...	...	...
1995	tonight	0.910266	0.059968
1996	repeated	0.170522	0.024102
1997	projects	-0.014011	-0.052086
1998	outcome	-0.117215	0.046006
1999	launch	-0.029681	-0.028275

2000 rows × 3 columns

We’ll do the same thing with the embedding of words in the news text.

embedding_tx = pd.DataFrame({
    'word' : vocab, 
    'x0'   : weights_tx[:,0],
    'x1'   : weights_tx[:,1]
})

Now, we are finally ready to plot the words and their embeddings out！We’ll be using plotly express function!

import plotly.express as px 
import plotly.io as pio
pio.templates.default = "plotly_white"
import plotly.express as px 

fig = px.scatter( embedding_tl, 
                 x = "x0", 
                 y = "x1", 
                 size = list(np.ones(len(embedding_tl))),
                 size_max = 2,
                 hover_name = "word")

fig.show()

Here, the graph demonstrates each words’ spot in the vocabulary space.

from plotly.io import write_html

write_html(fig, "emb_tx.html")

Written on May 18, 2021