In this blog post, we’ll explore how to use the Naive Bayes algorithm to classify emails as either spam or ham (non-spam). We’ll walk through a Python implementation using the MultinomialNB classifier from the scikit-learn library. This method is particularly effective for text classification problems.
Step-by-Step Implementation
Importing Libraries: We start by importing the necessary libraries:
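A minimal sketch of the imports used throughout this walkthrough: pandas for the dataset, and scikit-learn for the train/test split, the count vectorizer, and the MultinomialNB classifier.

```python
# Core data handling plus the scikit-learn pieces used in the steps below
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
```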
Loading the Dataset: We load the dataset containing emails and their labels (spam or ham):
The original dataset was obtained from https://github.com/NStugard/Intro-to-Machine-Learning/blob/main/spam.csv
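A sketch of the loading step might look like the following; the file name spam.csv and the column names (Category and Message) are assumptions based on the linked dataset, so adjust them if your copy differs.

```python
# Load the labelled messages; file and column names are assumed to match
# the linked spam.csv (columns "Category" and "Message")
df = pd.read_csv("spam.csv")
print(df.head())
```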
Data Preparation: We check for numeric columns and create a new column spam, where spam emails are labeled as 1 and ham emails as 0:
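One way to sketch this step, assuming the label column from the previous snippet is called Category:

```python
# Inspect the column types to see which columns are numeric
print(df.dtypes)

# Create a numeric target column: spam -> 1, ham -> 0
df["spam"] = df["Category"].apply(lambda label: 1 if label == "spam" else 0)
```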
Splitting the Data: We split the data into training and testing sets:
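A possible split, assuming the message text lives in a Message column; the test size and random seed below are arbitrary choices for illustration.

```python
# Hold out 25% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df["Message"], df["spam"], test_size=0.25, random_state=42
)
```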
Vectorizing the Text Data: We convert the text data into a matrix of token counts:
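A sketch using CountVectorizer, continuing from the split above; it is fit on the training messages only, so the test set can later be transformed with the same vocabulary.

```python
# Learn the vocabulary from the training messages and build the count matrix
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
```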
Training the Model: We train the Naive Bayes model using the training data:
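In a sketch like this, training reduces to a single fit call on the vectorized training data:

```python
# Fit the multinomial Naive Bayes model on the token counts
model = MultinomialNB()
model.fit(X_train_counts, y_train)
```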
Testing the Model: We test the model with user input and predefined examples:
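A sketch of the testing step, continuing from the snippets above; the two example phrases here are placeholders of my own, not the ones from the original post.

```python
# Predefined examples (illustrative phrases, not from the original dataset)
examples = [
    "Congratulations! You have won a free prize, claim it now!",
    "Hi, are we still meeting for lunch tomorrow?",
]
print(model.predict(vectorizer.transform(examples)))  # 1 = spam, 0 = ham

# Classify a phrase typed by the user
user_text = input("Enter a message to classify: ")
prediction = model.predict(vectorizer.transform([user_text]))[0]
print("spam" if prediction == 1 else "ham")
```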
Evaluating the Model: Finally, we evaluate the model’s accuracy on the test set:
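Evaluation reuses the vectorizer fitted on the training data, then scores the model on the held-out messages:

```python
# Transform the test messages with the already-fitted vectorizer and measure accuracy
X_test_counts = vectorizer.transform(X_test)
accuracy = model.score(X_test_counts, y_test)
print(f"Accuracy on the test set: {accuracy:.4f}")
```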
Conclusion
Using the Naive Bayes algorithm, we can effectively classify emails as spam or ham. This method leverages the simplicity and efficiency of the Naive Bayes classifier, making it a great choice for text classification tasks.
You can try out this version right here from this page using the widget below. Click on Run to run the code, then enter your own phrases to see how it detects spam or ham! It’ll ask for two inputs and will run predictions on each input independently. The predictions should be exactly the same if the inputs are the same. Feel free to experiment with it, the code, and the dataset to see how well it performs on your own data. Happy coding!