Sarcasm Detection AI Model (97% Accuracy) Trained With Reddit Comments - Cleaning and Saving The Data
Steven Mathew
Posted on July 7, 2024
Now we will clean the data and save it so that it is ready for training and testing in the next part.
import re

def clean_comment(text):
    text = re.sub(r'http\S+', '', text)  # Remove any web URLs in the text
    text = re.sub(r'/u/\w+', '', text)  # Remove mentions of Reddit users (like /u/username)
    text = re.sub(r'\br/\w+', '', text)  # Remove mentions of subreddits (like r/subreddit); \b avoids matching inside words such as "either/or"
    text = re.sub(r'\n', ' ', text)  # Replace new line characters with spaces
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # Remove any characters that are not letters, numbers, or spaces
    return text.lower()  # Convert the cleaned text to lowercase
This function takes in a piece of text (text) and cleans it up by removing web URLs, mentions of Reddit users and subreddits, new line characters, and any characters that are not letters, numbers, or spaces. Finally, it converts the cleaned text to lowercase.
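To make the effect concrete, here is a minimal sketch that runs the same cleaning steps on a made-up comment. The sample text, the user name and the URL are invented for illustration, and the subreddit pattern is written with a leading word boundary (`\b`), which is an assumption on top of the steps described above. Removed tokens leave extra spaces behind, so the sketch splits the result into words rather than relying on exact spacing.

```python
import re

def clean_comment(text):
    text = re.sub(r'http\S+', '', text)         # URLs
    text = re.sub(r'/u/\w+', '', text)          # user mentions
    text = re.sub(r'\br/\w+', '', text)         # subreddit mentions (word boundary is an assumption)
    text = re.sub(r'\n', ' ', text)             # new lines become spaces
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # keep only letters, digits and whitespace
    return text.lower()

# Made-up sample comment, not taken from the dataset
sample = "Wow, GREAT job /u/some_user!!!\nSee r/funny or https://example.com/x"

cleaned = clean_comment(sample)
words = cleaned.split()  # split() ignores the leftover repeated spaces

print(words)  # ['wow', 'great', 'job', 'see', 'or']
```

If the leftover double spaces matter for your tokenizer, `' '.join(cleaned.split())` collapses them into single spaces.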
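import pandas as pd

# Load data from a CSV file into a DataFrame
df = pd.read_csv('reddit_comments.csv')
# Drop rows with a missing comment first, because re.sub raises a TypeError on NaN values
df = df.dropna(subset=['comment']).copy()
# Apply the cleaning function to each comment and create a new column for cleaned comments
df['cleaned_comment'] = df['comment'].apply(clean_comment)
Here, we load data from a CSV file (reddit_comments.csv) into a table-like structure called a DataFrame. We first drop any rows whose 'comment' value is missing, because clean_comment expects a string and re.sub raises a TypeError when it receives NaN. Then, for each remaining comment in the 'comment' column, we use the clean_comment function we defined earlier to clean up the text. The cleaned versions of the comments are stored in a new column named 'cleaned_comment'.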
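As a small illustration of this step, the sketch below uses a made-up in-memory DataFrame instead of reddit_comments.csv, and a simplified stand-in for clean_comment so that it runs on its own. Both are assumptions for the example. One row has a missing comment, which pandas cannot pass to re.sub as a string, so the sketch drops it before cleaning.

```python
import re
import pandas as pd

def clean_comment(text):
    # Simplified stand-in for the cleaning function defined earlier
    return re.sub(r'[^A-Za-z0-9\s]', '', text).lower()

# Made-up rows for illustration; the None mimics a missing comment
df = pd.DataFrame({'comment': ['Nice one!!', None, 'Totally FINE...']})

# Without this step, apply() would hand the missing value to re.sub and fail
df = df.dropna(subset=['comment']).copy()
df['cleaned_comment'] = df['comment'].apply(clean_comment)

print(df['cleaned_comment'].tolist())  # ['nice one', 'totally fine']
```

The `.copy()` is there so that adding a column to the filtered frame does not trigger a SettingWithCopyWarning in some pandas versions.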
# Manually assign labels to the comments
labels = [0, 1] * (len(df) // 2) # Create a list of labels alternating between 0 and 1
if len(labels) < len(df):
    labels.append(0) # Add one more label to match the number of comments
df['label'] = labels # Assign the labels to a new column named 'label' in the DataFrame
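In this part, we assign a label to each comment to indicate whether it's sarcastic or not. For demonstration purposes, we simply alternate between 0 (non-sarcastic) and 1 (sarcastic), adding one extra 0 when the number of comments is odd so that every comment gets a label. Keep in mind that these are placeholder labels: they depend only on row position, not on the content of the comment, so real training needs human-annotated or otherwise meaningful labels. The labels are stored in a new column named 'label' in the DataFrame.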
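The alternating pattern can also be written as "row position modulo 2". The sketch below compares that one-liner with the list-based construction for a few lengths, including odd ones. The function names `alternating_labels` and `labels_as_in_post` are hypothetical helpers introduced only for this comparison.

```python
def alternating_labels(n):
    """Placeholder labels 0, 1, 0, 1, ... for n rows (not real annotations)."""
    return [i % 2 for i in range(n)]

def labels_as_in_post(n):
    """The list-based construction: repeat [0, 1], then pad with 0 if n is odd."""
    labels = [0, 1] * (n // 2)
    if len(labels) < n:
        labels.append(0)
    return labels

# Both constructions agree for empty, odd and even lengths
for n in range(8):
    assert alternating_labels(n) == labels_as_in_post(n)

print(alternating_labels(5))  # [0, 1, 0, 1, 0]
```

With a DataFrame, the same idea would be `df['label'] = [i % 2 for i in range(len(df))]`.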
# Remove rows where the cleaned comment is empty or NaN (missing)
df = df.dropna(subset=['cleaned_comment']) # Remove rows where 'cleaned_comment' is NaN
df = df[df['cleaned_comment'].str.strip() != ''] # Remove rows where 'cleaned_comment' is empty or only whitespace
# Save the cleaned and labeled data to a new CSV file
df.to_csv('labeled_reddit_comments.csv', index=False) # Save DataFrame to CSV without including the index
Finally, we remove any rows where the cleaned comment is missing (NaN), empty, or made up only of whitespace. A comment can end up empty after cleaning if it contained nothing but a URL, a mention, or punctuation.
After cleaning and filtering, we save the cleaned and labeled data (including the 'cleaned_comment' and 'label' columns) to a new CSV file named labeled_reddit_comments.csv.
Note:
The index=False parameter ensures that the CSV file does not include an extra column for row numbers.
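To check both the filtering and the effect of index=False, here is a minimal round-trip sketch. It uses made-up rows and a temporary directory rather than the real dataset, so the data and the path are assumptions for illustration. Filtering before saving matters because pandas reads an empty CSV field back as NaN by default.

```python
import os
import tempfile
import pandas as pd

# Made-up rows: two of the cleaned comments are empty or whitespace-only
df = pd.DataFrame({
    'comment': ['Nice one!!', '!!!', 'https://example.com'],
    'cleaned_comment': ['nice one', '', ' '],
    'label': [0, 1, 0],
})

df = df.dropna(subset=['cleaned_comment'])
df = df[df['cleaned_comment'].str.strip() != '']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'labeled_reddit_comments.csv')
    df.to_csv(path, index=False)   # no extra column for row numbers
    reloaded = pd.read_csv(path)   # read the file back to inspect it

print(list(reloaded.columns))  # ['comment', 'cleaned_comment', 'label']
print(len(reloaded))           # 1
```

Without index=False, the reloaded frame would contain an additional unnamed column holding the old row numbers.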
Read Part 3 - Sarcasm Detection From Reddit Comments: Training & Testing
GITHUB: https://github.com/stevie1mat/Sarcasm-Detection-With-Reddit-Comments