
Extracting phone numbers from any document/file

Imagine you have documents of various types: email, Word documents, PDF, text, HTML, Excel, etc. You want to extract phone numbers, and only phone numbers, from those documents to build a contact list or a database. In this post, I show a quick and effective way to do just that, regardless of the source document type.

The first thought might be to use a custom library for each type: one for reading/parsing Word, another for HTML, another for Excel, and so on. However, the simplest way is to convert each source file into a plain text file, since we only need to parse text (digits, mostly) and no information about formatting, images, or controls. Each of the source file types has a native application that can export or save as text.

Once we have the text version of the source file, we can parse the entire file and extract only the phone numbers from it. Not just digits of any kind, but formats where there’s an area code followed by more digits. The area code can be enclosed in parentheses (brackets) or not at all! The groups could be separated by dashes, or the number could even start with a long-distance code, such as 1 in the USA. So, some of the possible phone number formats we’ll assume for this example (the code can be modified to virtually any format!) are:

xxx-xxx-xxxx, (xxx)xxx.xxxx, (xxx)xxx-xxxx, (xxx)xxxxxxx, x-xxx-xxxxxxx, and any combination of the parentheses, dashes, spaces, dots, and so on.

Let’s take an example: you have an email from which you want just the phone numbers. The email is first extracted into a text file. Now the text file looks like this:

Date: 3:20 PM 8/21/2020ValleyML Machine Learning and Boot Camp -2020We are excited to announce the availability of live Machine Learning/ Deep Learning Boot Camp in collaboration with IEEE and ACM by ValleyML from July 14th-Sept 10th for professionals in Greater Seattle Area. Enroll and Learn at ValleyML Live Learning Platform.This course further helps advance your career by augmenting your skills for your current position Sr. Business Analyst in Launch Consulting Group. You can add IEEE PDH Certificate to your LinkedIn profile. Machine Learning and Deep Learning are becoming very crucial in Information Technology and Services.Hiring Referrals: We also partnered with Triplebyte to help our Boot Camp attendees as well as all members of ValleyML. Top engineering roles come to you 450+ top tech companies hire for their best engineering teams from Triplebyte. Teams reach out to you so you will never miss those hot opportunities! If you feel that you are already well prepared, you can directly sign up now at https://triplebyte.com/a/RhUglgw/valleyml .IEEE Certificates guarantee that education program meets IEEE standards, and offers some of the most relevant content that engineers need to stay ahead.IEEE Certificates will help technical professionals:• Gain a competitive advantage• Update their knowledge and skills• Build professional credibility• Earn the CEUs/PDHs they need to keep their licenses currentA not-for-profit organization, IEEE is the world’s largest technical professional organization dedicated to advancing technology for the benefit of humanity.Build a solid foundation of Machine Learning / Deep Learning principles and apply the techniques to real-world problems. Get IEEE PDH Certificate. Virtual Live Boot Camp from July 14th-Sept 10th. Please see Boot Camp Description.Learn fundamentals of machine learning to the latest advances of deep learning technologies and their applications. In collaboration with IEEE and ACM by ValleyML. 8 workshops.
If you have questions about this course, feel free to reach out to: 555-555-0000. For technical issues with attending the webinar, contact: (555)555.1111
For academic discounts, contact: (555)555-2222 or (800)5552222 or 1-888-5552222Coupon ValleyML25 for 25% by July 2nd (UTC).Enroll and Learn at ValleyML Live Learning Platform

 

As you can see, this single email lists 5 different phone numbers in 5 different formats! Of course, as humans we can easily spot them in a single document and copy/paste them into a separate file. But what if there are hundreds or thousands of documents of different types that you need to do this for? How fun will that be? Not much! That’s where the following solution comes into play.

There are different ways of solving this, including writing a custom application in the language of your choice, importing into Excel and applying formulas to extract line by line, using VBA, etc. Of all the solutions I’ve experimented with, my preferred one is Python.

The Solution: Design

Open the file using Python.

Read each line and parse it using a regular expression.

Save each match into a list as a separate element.

(For more information on regular expressions, search for regex in Python.)
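The three design steps above can be sketched as one small function. This is only a minimal illustration: the function name extract_phone_numbers is hypothetical, UTF-8 encoding is an assumption, and the pattern is the one used later in this post.

```python
import re

# Pattern taken from this post; compiled once so it can be reused per line.
PHONE_PATTERN = re.compile(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]')

def extract_phone_numbers(path):
    """Return every phone-like match found in the text file at `path`.

    Hypothetical helper: the name and the UTF-8 assumption are not from the post.
    """
    matches = []
    # Step 1: open the file
    with open(path, 'r', encoding='utf-8') as f:
        # Step 2: read each line and parse it with the regular expression
        for line in f:
            # Step 3: save each match as a separate list element
            matches.extend(PHONE_PATTERN.findall(line))
    return matches
```

Reading line by line keeps memory use low for large files, at the cost of missing a number that is split across two lines.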

The Solution: Code

To leverage the power of regular expressions, we need to import the re library:

import re

Then open the text file in read mode and load it into memory. The file (content shown above) is a text file in the Data subdirectory of my current folder.

source = "Data/sampleemail.txt"
# Open in read mode; the with block closes the file automatically
with open(source, 'r') as f:
    content = f.readlines()  # list of lines

Then we need to join the list of lines into a single string and use it with the regular expression to extract the results as a list:

res = re.findall(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]', "".join(content))
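To see what each part of that pattern does, here is the same expression rewritten with re.VERBOSE so every piece carries a comment. The variable names PHONE_VERBOSE and samples are just for illustration; the sample strings are the numbers from the email above.

```python
import re

# Same pattern as in the post, split across lines with re.VERBOSE.
# Whitespace inside a character class is still significant in verbose mode,
# so the space in the third piece continues to match a literal space.
PHONE_VERBOSE = re.compile(r"""
    [\+\(]?            # optional leading plus sign or opening parenthesis
    [1-9]              # first digit, 1 through 9
    [0-9 .\-\(\)]{8,}  # 8 or more digits, spaces, dots, dashes, or parentheses
    [0-9]              # the match must end on a digit
""", re.VERBOSE)

samples = ["555-555-0000", "(555)555.1111", "(555)555-2222",
           "(800)5552222", "1-888-5552222"]

# Each sample number is matched in full
all_match = all(PHONE_VERBOSE.fullmatch(s) for s in samples)

# A date such as the one in the email header is too short between separators
date_hits = PHONE_VERBOSE.findall("Date: 3:20 PM 8/21/2020")
```

Because the last piece requires a digit, trailing punctuation such as the period after 555-555-0000 is left out of the match.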

Now res has everything we need! Each phone number extracted in any variation is loaded as an element in the list. To verify, we can print it on screen:

print("Found:", res)

The output in this case is:

Found: ['555-555-0000', '(555)555.1111', '(555)555-2222', '(800)5552222', '1-888-5552222']
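The pattern is deliberately loose, so in other documents it could also pick up long runs of digits that are not phone numbers. One possible safeguard, sketched here under the assumption that only US-style numbers are wanted (10 digits, or 11 with a leading 1), is to count the digits in each match. The helper name and the last sample string are hypothetical.

```python
import re

def looks_like_us_number(candidate):
    """Hypothetical filter: keep matches with 10 digits, or 11 starting with 1."""
    digits = re.sub(r'\D', '', candidate)  # drop everything except digits
    return len(digits) == 10 or (len(digits) == 11 and digits.startswith('1'))

# The last entry is a made-up false positive that the loose pattern would accept
candidates = ['555-555-0000', '(800)5552222', '1-888-5552222', '2020 2021 2022']
filtered = [c for c in candidates if looks_like_us_number(c)]
```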

Wonderful, isn’t it? Now we can do whatever we want with it: save it to a new text file, to a CSV, to Excel, etc. I believe this is an extremely powerful yet efficient way to conquer this challenge. If you need to read a myriad of files, put them in a directory, load all the file names from that directory into a list using os.listdir(), then iterate through them, applying the above code to each.
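As a rough sketch of that last idea, the following walks a directory with os.listdir(), applies the same pattern to every .txt file, and writes the results to a CSV. The function names, the .txt filter, the CSV column names, and the UTF-8 encoding are all assumptions made for illustration.

```python
import csv
import os
import re

PHONE_PATTERN = re.compile(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]')

def collect_phone_numbers(directory):
    """Return (filename, phone_number) pairs for every .txt file in `directory`."""
    rows = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and anything that is not a text export
        if not (os.path.isfile(path) and name.endswith('.txt')):
            continue
        with open(path, 'r', encoding='utf-8', errors='replace') as f:
            text = f.read()
        rows.extend((name, number) for number in PHONE_PATTERN.findall(text))
    return rows

def save_to_csv(rows, csv_path):
    """Write the (filename, phone_number) pairs to a CSV with a header row."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['file', 'phone'])
        writer.writerows(rows)
```

With the folder layout used in this post, calling collect_phone_numbers("Data") and passing the result to save_to_csv with an output path of your choice would produce one CSV row per phone number found.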


This post is not meant to be a step-by-step, detailed tutorial; instead, it presents key concepts and approaches to problem-solving. Basic to intermediate technical/coding knowledge is assumed.
If you would like more info or the source file to learn/use, or need more basic assistance, you may contact me at tanman1129 at hotmail dot com. To support this voluntary effort, you can donate via PayPal from the button on my Home page, or become a Patron via my Patreon site. A supporter/patron gets direct answers on any of my blogs, including code/data/source files whenever possible.
