## The IMDB (Internet Movie Database) Dataset

This notebook works through an example of a **binary classification** task. The goal is to classify movie reviews as *positive* or *negative*.

In [None]:
import random
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from tensorflow.keras.datasets import imdb

Load the dataset. The movie reviews contain over 88,500 unique words in all, but we will keep only the 10,000 most frequently occurring words, so that the vectors stay a manageable size:

In [None]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
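
To make the effect of `num_words` concrete, here is a minimal sketch on a made-up review. It assumes the default Keras behavior, in which any index at or above `num_words` is replaced by index 2, which the dataset reserves for unknown words. The helper `truncate_vocabulary` and the sample indices are hypothetical and are not part of Keras.

```python
def truncate_vocabulary(review, num_words, oov_index=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [n if n < num_words else oov_index for n in review]

# hypothetical review encoded with an unrestricted vocabulary
full_review = [1, 14, 22, 16040, 43, 530, 88584, 65]
truncated = truncate_vocabulary(full_review, num_words=10000)
print(truncated)   # [1, 14, 22, 2, 43, 530, 2, 65]
```

This is why every index in the loaded reviews is smaller than 10,000.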

In [None]:
train_data.shape

In [None]:
train_data.dtype

In [None]:
np.array([[1,2], [3,4]])

In [None]:
np.array([[1,2], [3]])   # ragged rows: raises ValueError in NumPy 1.24 and later (older versions only warn)

In [None]:
np.array([[1,2],[3]], dtype=object)

In [None]:
type(train_data[0])

In [None]:
len(train_data[0])

In [None]:
len(train_data[1])

In [None]:
len(train_data[2])

In [None]:
train_data[0]

In [None]:
min(train_data[0]), max(train_data[0])

In [None]:
train_labels

In [None]:
test_data.shape

How long are the shortest and longest reviews in the test set?

In [None]:
lengths = [len(review) for review in test_data]

In [None]:
min(lengths), max(lengths)

In [None]:
np.argmin(lengths), np.argmax(lengths)

In [None]:
len(test_data[2104])

In [None]:
test_data[2104]

In [None]:
test_labels

In [None]:
test_labels[2104]

### The Word Index

The word index is a dictionary of all words appearing in the reviews. Each word is mapped to a rank number that indicates the word's relative frequency of occurrence in the reviews, with rank 1 assigned to the most frequent word.

In [None]:
word_index = imdb.get_word_index()

In [None]:
type(word_index)

In [None]:
len(word_index)

In [None]:
word_index['great']

In [None]:
word_index['funny']

In [None]:
word_index['bad']

In [None]:
word_index['the']

In [None]:
word_index

In [None]:
word_index.items()

We will build a reversed version of the word index that maps rank numbers to words.

In [None]:
[(value, key) for (key, value) in word_index.items()]

In [None]:
sorted([7,4,5,3,7,9,8,2,3])

In [None]:
ranks = sorted([(value, key) for (key, value) in word_index.items()])

In [None]:
ranks

In [None]:
ranks[9950:10050]

In [None]:
ranks[-50:]

In [None]:
lookup = dict(ranks)
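
The same reversed mapping can also be built directly with a dictionary comprehension, skipping the sorted list. A minimal sketch on a tiny made-up word index (the name `toy_index` and its contents are hypothetical, chosen only for illustration):

```python
# hypothetical miniature word index: word -> rank
toy_index = {'the': 1, 'and': 2, 'a': 3, 'of': 4}

# swap keys and values: rank -> word
toy_lookup = {rank: word for (word, rank) in toy_index.items()}

print(toy_lookup[1])   # the
print(toy_lookup[4])   # of
```

Sorting is still useful for browsing the words in rank order, but the dictionary itself does not need it.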

In [None]:
lookup

In [None]:
lookup[50]

In [None]:
test_data[2104]

In [None]:
[lookup[n] for n in test_data[2104]]

In [None]:
train_data[0]

In [None]:
first_words = train_data[0][:6]

In [None]:
first_words

In [None]:
[lookup[n] for n in first_words]

The following special index numbers are reserved:
* 0 = "padding"
* 1 = "start of sequence"
* 2 = "unknown word"
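
Because indices 0 to 2 are reserved, the codes stored in a review are shifted: a word with rank r appears as r + 3 (this assumes the default `index_from=3` setting of `imdb.load_data`). A small sketch with a made-up lookup table, where `toy_lookup` and `toy_review` are hypothetical:

```python
# hypothetical rank -> word table
toy_lookup = {1: 'the', 2: 'and', 3: 'a', 4: 'of'}

# 1 = start of sequence, 2 = unknown word, every other code is rank + 3
toy_review = [1, 6, 2, 4, 7]

# drop the reserved codes, then undo the offset before looking up each word
words = [toy_lookup[n - 3] for n in toy_review if n > 2]
print(words)   # ['a', 'the', 'of']
```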

In [None]:
first_words

In [None]:
[n for n in first_words if n > 2]

In [None]:
[n-3 for n in first_words if n > 2]

In [None]:
[lookup[n-3] for n in first_words if n > 2]

In [None]:
def decode(review):
    # skip the reserved codes 0-2, then undo the offset of 3 to recover word ranks
    word_codes = [n-3 for n in review if n > 2]
    words = [lookup[c] for c in word_codes]
    return ' '.join(words)

In [None]:
decode(train_data[0])

In [None]:
train_labels[0]

In [None]:
decode(test_data[2104])

### Vectorizing the Data

In [None]:
vector = np.zeros(10)

In [None]:
vector

In [None]:
vector[3] = 1

In [None]:
vector

In [None]:
vector[[1,3,5,7]] = 1   # note the inner []'s

In [None]:
vector

In [None]:
def encode_as_vector(sequence, num_words=10000):
    vector = np.zeros(num_words)
    vector[sequence] = 1   # set every position listed in sequence to 1
    return vector

In [None]:
encode_as_vector([4,1,5], 10)

In [None]:
encode_as_vector([2,3,2,4,2], 5)

In [None]:
encode_as_vector(train_data[0], 10000)

In [None]:
for x in ['apple', 'banana', 'cherry', 'lemon', 'lime']:
    print(x)

In [None]:
# a simple example of enumerate
for i, x in enumerate(['apple', 'banana', 'cherry', 'lemon', 'lime']):
    print(i, x)

In [None]:
list(enumerate(['apple', 'banana', 'cherry', 'lemon', 'lime']))

In [None]:
def vectorize_sequences(sequences, num_words=10000):
    number_of_sequences = len(sequences)
    data = np.zeros((number_of_sequences, num_words))  # note the extra ()'s
    for i, sequence in enumerate(sequences):
        # pass num_words through so that non-default sizes work
        data[i] = encode_as_vector(sequence, num_words)  # OR: data[i, sequence] = 1
    return data
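
The comment in `vectorize_sequences` mentions a shorter alternative, `data[i, sequence] = 1`. The sketch below shows that variant on toy input; the function name `multi_hot` is made up for this illustration. Note that a repeated index still produces a single 1, so this encoding records which words occur, not how often.

```python
import numpy as np

def multi_hot(sequences, num_words):
    """Encode each sequence of indices as a row of 0s and 1s."""
    data = np.zeros((len(sequences), num_words))
    for i, sequence in enumerate(sequences):
        data[i, sequence] = 1   # fancy indexing sets all listed columns at once
    return data

toy = multi_hot([[4, 1, 5], [2, 3, 2, 4, 2]], num_words=6)
print(toy)
# [[0. 1. 0. 0. 1. 1.]
#  [0. 0. 1. 1. 1. 0.]]
```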

In [None]:
train_vectors = vectorize_sequences(train_data)
test_vectors = vectorize_sequences(test_data)

In [None]:
train_vectors[0]

In [None]:
len(train_vectors[0])

In [None]:
train_vectors.shape

In [None]:
test_vectors.shape

In [None]:
train_labels

To create the target values, we need to convert the review labels (0=negative review, 1=positive review) from integers to floats, so that they correspond to probabilities:

In [None]:
train_targets = train_labels.astype('float32')
test_targets = test_labels.astype('float32')

In [None]:
train_targets

### The Neural Network

<img src="http://science.slc.edu/jmarshall/bioai/images/imdb_network.png" width="55%">

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_network():
    network = Sequential()
    network.add(Dense(16, activation='relu', name='hidden1', input_shape=(10000,)))
    network.add(Dense(16, activation='relu', name='hidden2'))
    network.add(Dense(1, activation='sigmoid', name='output'))
    
    network.compile(loss='binary_crossentropy',
                    optimizer='rmsprop',
                    metrics=['accuracy'])
    return network

In [None]:
network = build_network()

In [None]:
network.summary()
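
The parameter counts reported by `network.summary()` can be checked by hand: a `Dense` layer with n inputs and m units has n × m weights plus m biases. A short sketch of that arithmetic for the three layers defined in `build_network` (the helper `dense_params` is hypothetical):

```python
def dense_params(num_inputs, num_units):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_units + num_units

hidden1 = dense_params(10000, 16)   # 160016
hidden2 = dense_params(16, 16)      # 272
output = dense_params(16, 1)        # 17
total = hidden1 + hidden2 + output
print(total)   # 160305
```

Almost all of the parameters sit in the first layer, because of the 10,000-dimensional input.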

In [None]:
history = network.fit(train_vectors, train_targets, epochs=20, batch_size=512)

In [None]:
def plot_history(history):
    loss_values = history.history['loss']
    accuracy_values = history.history['accuracy']
    epoch_nums = range(1, len(loss_values)+1)
    plt.figure(figsize=(12,4)) # width, height in inches
    plt.subplot(1, 2, 1)
    plt.plot(epoch_nums, loss_values, 'r')
    plt.title("Training loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.subplot(1, 2, 2)
    plt.plot(epoch_nums, accuracy_values, 'b')
    plt.title("Training accuracy")
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.ylim(0, 1)
    plt.show()

In [None]:
plot_history(history)

In [None]:
network.evaluate(train_vectors, train_targets)

In [None]:
network.evaluate(test_vectors, test_targets)

### Using a Validation Set

In [None]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

In [None]:
train_vectors = vectorize_sequences(train_data)
test_vectors = vectorize_sequences(test_data)

train_targets = train_labels.astype('float32')
test_targets = test_labels.astype('float32')

We will use the first 10,000 training samples as a **validation set** for monitoring learning progress, and the remaining 15,000 samples to actually train the network:

In [None]:
# validation set
val_vectors = train_vectors[:10000]
val_targets = train_targets[:10000]

# training set
train_vectors_remaining = train_vectors[10000:]
train_targets_remaining = train_targets[10000:]

In [None]:
network = build_network()

In [None]:
history = network.fit(train_vectors_remaining, train_targets_remaining,
                      epochs=20, batch_size=512,
                      validation_data=(val_vectors, val_targets))

In [None]:
history_dict = history.history

In [None]:
history_dict.keys()

In [None]:
history_dict['accuracy']

In [None]:
# generalized version that plots validation data if available

def plot_history(history):
    loss_values = history.history['loss']
    accuracy_values = history.history['accuracy']
    validation = 'val_loss' in history.history
    if validation:
        val_loss_values = history.history['val_loss']
        val_accuracy_values = history.history['val_accuracy']
    epoch_nums = range(1, len(loss_values)+1)
    plt.figure(figsize=(12,4)) # width, height in inches
    plt.subplot(1, 2, 1)
    if validation:
        plt.plot(epoch_nums, loss_values, 'r', label="Training loss")
        plt.plot(epoch_nums, val_loss_values, 'r--', label="Validation loss")
        plt.title("Training/validation loss")
        plt.legend()
    else:
        plt.plot(epoch_nums, loss_values, 'r', label="Training loss")
        plt.title("Training loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.subplot(1, 2, 2)
    if validation:
        plt.plot(epoch_nums, accuracy_values, 'b', label='Training accuracy')
        plt.plot(epoch_nums, val_accuracy_values, 'b--', label='Validation accuracy')
        plt.title("Training/validation accuracy")
        plt.legend()
    else:
        plt.plot(epoch_nums, accuracy_values, 'b', label='Training accuracy')
        plt.title("Training accuracy")
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.ylim(0, 1)
    plt.show()

In [None]:
plot_history(history)

We ended up **overtraining** (overfitting) our network! After about the third epoch, the loss on the validation data starts to increase, and the validation accuracy starts to degrade. This tells us that we should stop training after about 3 epochs. So now we will train a new network on the full training data, for just 3 epochs:

In [None]:
network = build_network()

In [None]:
history = network.fit(train_vectors, train_targets, epochs=3, batch_size=512)
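
The choice of 3 epochs was read off the validation plot. The epoch with the lowest validation loss can also be located programmatically. A minimal sketch, using made-up loss values in place of `history.history['val_loss']` (the numbers below are illustrative, not results from this notebook):

```python
import numpy as np

# hypothetical validation losses, one per epoch
val_loss_values = [0.41, 0.32, 0.28, 0.30, 0.33, 0.37]

# argmin is 0-based, while epochs are numbered from 1
best_epoch = int(np.argmin(val_loss_values)) + 1
print(best_epoch)   # 3
```

With a real run, the list would come from the `History` object returned by the `fit` call that used `validation_data`.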

In [None]:
network.evaluate(train_vectors, train_targets)

In [None]:
network.evaluate(test_vectors, test_targets)

In [None]:
outputs = network.predict(test_vectors)

In [None]:
outputs

In [None]:
decode(test_data[1])

In [None]:
outputs[1]

In [None]:
test_labels[1]

In [None]:
decode(test_data[-1])

In [None]:
outputs[-1]

In [None]:
test_labels[-1]

In [None]:
# an output of exactly 0.5 is treated as a negative prediction
wrong = [n for n in range(len(test_data))
         if (outputs[n][0] > 0.5 and test_labels[n] == 0)
         or (outputs[n][0] <= 0.5 and test_labels[n] == 1)]

In [None]:
len(wrong)

In [None]:
print("accuracy:", 1 - len(wrong)/len(test_data))
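
The same accuracy can be computed without an explicit loop by thresholding the outputs with NumPy. A sketch on made-up values, assuming that an output above 0.5 counts as a positive prediction, which I believe matches the default threshold of the Keras `accuracy` metric for sigmoid outputs (`toy_outputs` and `toy_labels` are hypothetical):

```python
import numpy as np

# hypothetical network outputs, shaped (samples, 1) like the result of network.predict
toy_outputs = np.array([[0.91], [0.12], [0.67], [0.48], [0.05]])
toy_labels = np.array([1, 0, 0, 1, 0])

# flatten to 1-D, then threshold at 0.5 to get 0/1 predictions
predictions = (toy_outputs.ravel() > 0.5).astype(int)

# fraction of predictions that match the labels
accuracy = np.mean(predictions == toy_labels)
print(accuracy)   # 0.6
```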

In [None]:
wrong_set = set(wrong)   # membership tests on a set are much faster than on a list
correct = [n for n in range(len(test_data)) if n not in wrong_set]

In [None]:
len(correct)

In [None]:
correct[:10]

In [None]:
outputs[:10]

In [None]:
test_labels[:10]

In [None]:
[(n, outputs[n][0], test_labels[n]) for n in correct[:10]]

In [None]:
decode(test_data[1])

In [None]:
decode(test_data[7])