Assignment 4
CS 4341, summer 2024
100 points total
Due: XXXXXXXXXX:59
Delivery: Submit via Canvas
For this assignment, you will:
(30 pts) Go through the process of building the Seq2Seq model for text summarization.
Download the Dataset
Fill in the path to the dataset
Run the code from start to finish
(20 pts) Question 1: Explain the tokenized sequences.
(10 pts) Question 2: Explain the necessity of multiple inputs and specific dimensions in the model.
(10 pts) Question 3: Analyze the parameter counts in the model's summary.
(10 pts) Question 4: Understand the purpose of and the difference between the decoder sequences.
(20 pts) Question 5: Discuss and suggest improvements for the generated titles.
Tutorial
To build a model that generates news titles from the content of articles, you will follow a tutorial that implements text summarization with a Sequence-to-Sequence (Seq2Seq) model, an architecture well suited to this kind of task. The step-by-step guide below walks you through the process:
Step 1: Download and Explore the Dataset
Student Task: Download the Dataset:
Go to the NYT News Dataset XXXXXXXXXX on Kaggle
(https://www.kaggle.com/datasets/brendanmiles/nyt-news-dataset XXXXXXXXXX).
Download the dataset and extract it as NYT_Dataset.csv to your working directory.
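Once the file is downloaded, it is worth sanity-checking it before continuing. The short snippet below is a minimal sketch, assuming the file is saved as NYT_Dataset.csv in your working directory; it loads the CSV and previews the two columns used later in the tutorial:
import pandas as pd

# Load the dataset and preview the columns used in this assignment
df = pd.read_csv('NYT_Dataset.csv')
print(df.shape)                              # number of rows and columns
print(df[['abstract', 'headline']].head())   # the two columns used below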
Step 2: Data Preprocessing
1. Install Required Libraries:
2. Import Libraries:
!pip install pandas numpy tensorflow keras
3. Load and Preprocess Data:
Student Task: Fill in the path to the dataset:
4. Tokenize the text:
Question 1 (20 pts):
(1) What are the shapes of content_seq and title_seq?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import (Embedding, LSTM, Dense, TimeDistributed,
                          RepeatVector, Bidirectional, Dropout, Input, Concatenate)
# Load the dataset
data = pd.read_csv('Your path here to load the dataset')
# Select relevant columns
data = data[['abstract', 'headline']]
# Drop rows with missing values
data.dropna(inplace=True)
# Rename columns for ease of use
data.columns = ['content', 'title']
data['title'] = data['title'].apply(lambda x: 'starttoken ' + x + ' endtoken')
# Tokenize and pad the sequences
max_len_content = 100 # max length for content
max_len_title = 20 # max length for title
# Vocabulary sizes
vocab_size_content = 30000
vocab_size_title = 10000
# Tokenizer for content
tokenizer_content = Tokenizer(num_words=vocab_size_content)
tokenizer_content.fit_on_texts(data['content'])
content_seq = tokenizer_content.texts_to_sequences(data['content'])
content_seq = pad_sequences(content_seq, maxlen=max_len_content, padding='post')
# Tokenizer for title
tokenizer_title = Tokenizer(num_words=vocab_size_title)
tokenizer_title.fit_on_texts(data['title'])
title_seq = tokenizer_title.texts_to_sequences(data['title'])
title_seq = pad_sequences(title_seq, maxlen=max_len_title, padding='post')
(2) What is the first row of title_seq?
(3) Explain why it is necessary to tokenize the text.
(4) Explain why there are multiple zeros at the end of the title sequences.
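To answer Question 1, you can inspect the arrays produced by the tokenization code directly. A small inspection snippet using the variables defined above:
# Shapes of the padded sequence arrays
print("content_seq shape:", content_seq.shape)
print("title_seq shape:  ", title_seq.shape)
# First tokenized (and padded) title
print("title_seq[0]:", title_seq[0])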
Step 3: Building the Seq2Seq Model
1. Define the Encoder:
2. Define the Decoder:
Question 2 (10 pts):
(1) Explain why two Inputs (encoder_inputs and decoder_inputs) are needed.
(2) Explain why the dimension of decoder_lstm is latent_dim*2 (instead of latent_dim).
3. Dense Layer for Output:
latent_dim = 256

# Encoder: embed the content sequence and pass it through three stacked
# bidirectional LSTM layers
encoder_inputs = Input(shape=(max_len_content,))
encoder_embedding = Embedding(vocab_size_content, latent_dim, trainable=True)(encoder_inputs)
encoder_lstm1 = Bidirectional(LSTM(latent_dim, return_state=True, return_sequences=True))
encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder_lstm1(encoder_embedding)
encoder_lstm2 = Bidirectional(LSTM(latent_dim, return_state=True, return_sequences=True))
encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder_lstm2(encoder_outputs)
encoder_lstm3 = Bidirectional(LSTM(latent_dim, return_state=True, return_sequences=True))
encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder_lstm3(encoder_outputs)

# Concatenate the forward and backward states to initialize the decoder
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# Decoder: embed the (shifted) title sequence and run a single LSTM
decoder_inputs = Input(shape=(max_len_title - 1,))
decoder_embedding = Embedding(vocab_size_title, latent_dim, trainable=True)(decoder_inputs)
decoder_lstm = LSTM(latent_dim * 2, return_sequences=True, return_state=True,
                    dropout=0.4, recurrent_dropout=0.2)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=[state_h, state_c])

# Dense layer projecting every decoder time step onto the title vocabulary
dense = TimeDistributed(Dense(vocab_size_title, activation='softmax'))
output = dense(decoder_outputs)
4. Define and Compile the Model:
Question 3 (10 pts):
(1) From the output of model.summary(), explain why the encoder Embedding layer and the TimeDistributed Dense layer have parameter counts of 7,680,000 and 5,130,000 respectively. (Express each count in a form such as a+b, a*b, a^2*b, or a*b+c.)
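Beyond reading model.summary(), you can also list the per-layer parameter counts programmatically. The snippet below is an optional aid; it assumes the model has already been defined and compiled as in the code that follows.
# Print every layer's name and its parameter count
for layer in model.layers:
    print(f"{layer.name:30s} {layer.count_params():>12,}")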
Step 4: Training the Model
1. Prepare the Data for Training:
2. Train the Model:
Question 4 (10 pts):
(1) Explain why there are two decoder sequences during training: decoder_input_seq and decoder_output_seq.
(2) What is the difference between their values?
Step 5: Generating Titles
1. Define the Inference Model:
# Define the model
model = Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
# Split the data
X_train, X_val, y_train, y_val = train_test_split(content_seq, title_seq,
test_size=0.2, random_state=42)
# Create the decoder input and output sequences for training
decoder_input_seq = y_train[:, :-1]
decoder_output_seq = y_train[:, 1:]
# For validation data
decoder_input_val_seq = y_val[:, :-1]
decoder_output_val_seq = y_val[:, 1:]
history = model.fit(
[X_train, decoder_input_seq],
np.expand_dims(decoder_output_seq, -1),
epochs=10,
batch_size=128,
validation_data=([X_val, decoder_input_val_seq],
np.expand_dims(decoder_output_val_seq, -1))
)
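When working on Question 4, it can help to look at the two decoder sequences side by side. A minimal inspection snippet using the variables defined above:
# Compare the first training example of the two decoder sequences
print("decoder_input_seq[0]: ", decoder_input_seq[0])
print("decoder_output_seq[0]:", decoder_output_seq[0])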
# Encoder model for inference
encoder_model = Model(encoder_inputs, [encoder_outputs, state_h, state_c])
2. Function to Generate Titles:
3. Generate Titles for New Content:
# Decoder model for inference
decoder_state_input_h = Input(shape=(latent_dim * 2,))
decoder_state_input_c = Input(shape=(latent_dim * 2,))
decoder_hidden_state_input = Input(shape=(max_len_content, latent_dim * 2))
decoder_output2, state_h, state_c = decoder_lstm(
    decoder_embedding, initial_state=[decoder_state_input_h, decoder_state_input_c]
)
decoder_outputs = dense(decoder_output2)
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input, decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs] + [state_h, state_c]
)
def decode_sequence(input_seq):
    # Encode the input
    enc_out, enc_h, enc_c = encoder_model.predict(input_seq)
    # Start with a target sequence of length 1 containing only the start token
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = tokenizer_title.word_index['starttoken']
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + [enc_out, enc_h, enc_c])
        # Pick the most likely token at the last time step
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = tokenizer_title.index_word.get(sampled_token_index, '')
        decoded_sentence += ' ' + sampled_token
        # Exit condition: end token, padding index, or maximum title length reached
        if (sampled_token == 'endtoken' or sampled_token_index == 0
                or len(decoded_sentence.split()) >= max_len_title):
            stop_condition = True
        # Update the target sequence (of length 1)
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        # Update internal states
        enc_h, enc_c = h, c
    return decoded_sentence
Question 5 (Open question) (20 pts):
(1) Discuss the results of the generated titles compared to the original titles.
(2) What are some possible ways to improve the generated results?
for i in range(10):  # generate titles for the first 10 articles
    input_seq = content_seq[i:i+1]
    decoded_sentence = decode_sequence(input_seq)
    print(f"Content: {data['content'].iloc[i]}\n")
    print(f"Original Title: {data['title'].iloc[i]}\n")
    print(f"Generated Title: {decoded_sentence}\n")