In this blog post, we’ll explore how to use the Naive Bayes algorithm to classify emails as either spam or ham (non-spam). We’ll walk through a Python implementation using the MultinomialNB classifier from the scikit-learn library. This method is particularly effective for text classification problems.
Step-by-Step Implementation
Importing Libraries: We start by importing the necessary libraries:
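A minimal sketch of the imports used throughout this walkthrough: pandas for the dataset, and scikit-learn for the train/test split, the count vectorizer, and the MultinomialNB classifier.

```python
# Core data handling plus the scikit-learn pieces used in the steps below
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
```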
Loading the Dataset: We load the dataset containing emails and their labels (spam or ham):
The original dataset was obtained from https://github.com/NStugard/Intro-to-Machine-Learning/blob/main/spam.csv
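A sketch of the loading step might look like the following; the file name spam.csv and the column names (Category and Message) are assumptions based on the linked dataset, so adjust them if your copy differs.

```python
# Load the labelled messages; file and column names are assumed to match
# the linked spam.csv (columns "Category" and "Message")
df = pd.read_csv("spam.csv")
print(df.head())
```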
Data Preparation: We check for numeric columns and create a new column spam, where spam emails are labeled as 1 and ham emails as 0:
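One way to sketch this step, assuming the label column from the previous snippet is called Category:

```python
# Inspect the column types to see which columns are numeric
print(df.dtypes)

# Create a numeric target column: spam -> 1, ham -> 0
df["spam"] = df["Category"].apply(lambda label: 1 if label == "spam" else 0)
```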
Splitting the Data: We split the data into training and testing sets:
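A possible split, assuming the message text lives in a Message column; the test size and random seed below are arbitrary choices for illustration.

```python
# Hold out 25% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df["Message"], df["spam"], test_size=0.25, random_state=42
)
```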
Vectorizing the Text Data: We convert the text data into a matrix of token counts:
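A sketch using CountVectorizer, continuing from the split above; it is fit on the training messages only, so the test set can later be transformed with the same vocabulary.

```python
# Learn the vocabulary from the training messages and build the count matrix
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
```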
Training the Model: We train the Naive Bayes model using the training data:
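In a sketch like this, training reduces to a single fit call on the vectorized training data:

```python
# Fit the multinomial Naive Bayes model on the token counts
model = MultinomialNB()
model.fit(X_train_counts, y_train)
```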
Testing the Model: We test the model with user input and predefined examples:
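A sketch of the testing step, continuing from the snippets above; the two example phrases here are placeholders of my own, not the ones from the original post.

```python
# Predefined examples (illustrative phrases, not from the original dataset)
examples = [
    "Congratulations! You have won a free prize, claim it now!",
    "Hi, are we still meeting for lunch tomorrow?",
]
print(model.predict(vectorizer.transform(examples)))  # 1 = spam, 0 = ham

# Classify a phrase typed by the user
user_text = input("Enter a message to classify: ")
prediction = model.predict(vectorizer.transform([user_text]))[0]
print("spam" if prediction == 1 else "ham")
```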
Evaluating the Model: Finally, we evaluate the model’s accuracy on the test set:
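Evaluation reuses the vectorizer fitted on the training data, then scores the model on the held-out messages:

```python
# Transform the test messages with the already-fitted vectorizer and measure accuracy
X_test_counts = vectorizer.transform(X_test)
accuracy = model.score(X_test_counts, y_test)
print(f"Accuracy on the test set: {accuracy:.4f}")
```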
Conclusion
Using the Naive Bayes algorithm, we can effectively classify emails as spam or ham. This method leverages the simplicity and efficiency of the Naive Bayes classifier, making it a great choice for text classification tasks.
You can try out this version right here from this page using the widget below. Click on Run to run the code, then enter your own phrases to see how it detects spam or ham! It’ll ask for two inputs and will run predictions on each input independently. The predictions should be exactly the same if the inputs are the same. Feel free to experiment with it, the code, and the dataset to see how well it performs on your own data. Happy coding!