# **Lab 4: Topic Modeling with Latent Semantic Analysis (LSA)**

In [None]:
# Install the necessary libraries
%pip install nltk scikit-learn pandas


### **Part 1: Loading and Preprocessing the BBC News Dataset**

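In this section, we'll load the dataset, preprocess the text (tokenizing, lowercasing, and removing stopwords), and build a term-document matrix using TF-IDF. Note that scikit-learn stores this matrix with one row per document and one column per term.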
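
To make the TF-IDF step concrete before using the library, here is a minimal pure-Python sketch on a toy corpus. The documents, the stopword set `toy_stop_words`, and the helper names are all hypothetical and not part of the lab. The weighting uses a smoothed IDF and unit-length rows, which is believed to match the default behaviour of `TfidfVectorizer`, but treat it as an illustration rather than a reference implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
docs = [
    "The match ended in a draw",
    "The market fell as shares dropped",
    "Shares in the club rose after the match",
]
toy_stop_words = {"the", "in", "a", "as", "after"}

def preprocess(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in toy_stop_words]

tokenized = [preprocess(d) for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

# Rows are documents, columns follow the order of `vocab`.
matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for row in matrix:
    print([round(x, 2) for x in row])
```

Terms that occur in two of the three toy documents ("match", "shares") receive a lower IDF than terms that occur in only one, which is the property LSA relies on to down-weight uninformative words.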


In [None]:
import pandas as pd

# TODO: Load the BBC news dataset
data = ...

# Check the first few rows
data.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # a set makes the membership test below fast

# A simple function to preprocess text
def preprocess_text(text):
    return ' '.join([word.lower() for word in text.split() if word.lower() not in stop_words])

# TODO: Apply the preprocessing function to the 'content' column
data['processed_content'] = ...

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_df=0.05, max_features=1000, ngram_range=(1, 2))

# TODO: Create the term-document matrix
term_doc_matrix = ...

# Get the terms (feature names) from the vectorizer
terms = ...

# Display the shape of the term-document matrix
print(f"Term-Document Matrix Shape: {term_doc_matrix.shape}")



### **Part 2: Applying SVD to the Term-Document Matrix**

In this section, we will apply **Singular Value Decomposition (SVD)** to project the term-document matrix onto a small number of latent dimensions, which we then interpret as topics.
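
As a rough illustration of what truncation does, the sketch below runs a full SVD with NumPy on a small hypothetical matrix and keeps only the top `k` singular values. The matrix `X` and the names `doc_topic` and `topic_term` are assumptions made for this example. `TruncatedSVD` in scikit-learn produces analogous outputs (`fit_transform` for the document side, `components_` for the term side), but it uses a different solver, so values and signs may not match exactly.

```python
import numpy as np

# Hypothetical 4-document x 5-term weight matrix, for illustration only.
X = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.8, 0.0],
    [0.1, 0.0, 0.8, 0.9, 0.1],
])

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent topics to keep
doc_topic = U[:, :k] * s[:k]   # each document as a point in k-dimensional topic space
topic_term = Vt[:k, :]         # each row weights the original terms for one topic

# Rank-k approximation of X and its Frobenius-norm error.
X_k = doc_topic @ topic_term
error = np.linalg.norm(X - X_k)

print(doc_topic.shape, topic_term.shape)
print(np.round(s, 3))
print(round(float(error), 3))
```

The approximation error equals the root of the summed squares of the discarded singular values, so inspecting `s` is one way to judge how many components are worth keeping.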


In [None]:
from sklearn.decomposition import TruncatedSVD

# TODO: Define the number of components
num_components = ...

svd_model = TruncatedSVD(n_components=num_components)

# TODO: Fit the SVD model
svd_matrix = svd_model.fit_transform(...)

# Show the resulting latent space (topic space)
print(f"Latent Topic Matrix Shape: {svd_matrix.shape}")

In [None]:
import numpy as np

# Get the top terms for each topic

num_top_words = ... # TODO: Adjust this until you can easily identify the topics

for i, topic in enumerate(svd_model.components_):
    top_term_indices = ... # TODO: Get the indices of the top terms
    top_terms = [terms[j] for j in top_term_indices]  # j, so the topic index i is not reused
    print(f"Topic {i+1}: {', '.join(top_terms)}")
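
Ranking terms by weight can be sketched on made-up numbers. The vocabulary `toy_terms`, the weights in `toy_components`, and the helper `top_terms` below are hypothetical and stand in for the real `terms` and `svd_model.components_`. The sketch ranks by signed weight, which is one common choice; LSA weights can be negative, so ranking by absolute value is another option.

```python
import numpy as np

# Hypothetical vocabulary and topic-term weights, for illustration only.
toy_terms = np.array(["goal", "market", "league", "shares", "coach", "profit"])
toy_components = np.array([
    [0.70, 0.05, 0.55, 0.02, 0.45, 0.01],  # weights concentrated on sport terms
    [0.03, 0.65, 0.01, 0.60, 0.02, 0.46],  # weights concentrated on business terms
])

def top_terms(weights, terms, n):
    """Return the n terms with the largest weights, highest first."""
    # argsort sorts ascending, so reverse it and keep the first n indices.
    idx = np.argsort(weights)[::-1][:n]
    return [str(terms[i]) for i in idx]

for topic_number, weights in enumerate(toy_components, start=1):
    print(f"Topic {topic_number}: {', '.join(top_terms(weights, toy_terms, 3))}")
```

On this toy input the first topic lists goal, league, coach and the second lists market, shares, profit, which is the kind of grouping that makes a topic easy to label.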


### **Part 3: Labeling the Topics**

TODO: Using the terms extracted from each topic, try to assign labels that best describe what each topic is about.

- **Topic 1**: ...
- **Topic 2**: ...
- **Topic 3**: ...


### **Summary & Takeaways**

In this lab, you have:
1. Preprocessed the BBC News dataset and created a term-document matrix using TF-IDF.
2. Applied SVD to reduce the term-document matrix into a lower-dimensional space, revealing hidden topics.
3. Examined the most significant terms in each topic and interpreted their meaning.
4. Labeled the topics based on their top terms.

You now have a better understanding of how **LSA** can reveal hidden topics in a collection of text documents and how similar documents can group together based on their content.