Occasionally, it becomes necessary to search a text document for special characters such as superscripts, subscripts, symbols, emojis, or other Unicode characters. This matters when working with data files that should not contain such characters unless they are explicitly required and managed. Most editors, including Word, lack a ‘Find’ feature that reveals all Unicode characters in a file; they require you to search for a specific, known character. I needed a way to detect all such characters without knowing in advance whether any were present in the document. In this post, I am sharing my Python code that offers exactly this functionality.
This script opens the file with UTF-8 encoding, which can handle Unicode characters. It then uses a regular expression to find any character that is not a basic ASCII character (i.e., any character with a code value greater than 127). The findall() function returns a list of all matches. The script reads a given file line by line and prints the line numbers where a Unicode character was found. Below is the code:
```python
# Author: Tony Rahman.
import re

def find_unicode_characters(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        # Matches any character that is not a basic ASCII character
        pattern = re.compile(r'[^\x00-\x7F]+')
        line_number = 1
        for line in file:
            matches = pattern.findall(line)
            if matches:
                print(f'Line {line_number}: {matches}')
            line_number += 1

filename = input("Enter the file name or full path to the text file: ")
filename = filename.strip()
find_unicode_characters(filename)
```
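If you also want to know *which* characters were found, not just where, a small extension could report each non-ASCII character together with its code point and official Unicode name using the standard unicodedata module. This is a sketch of my own (the helper name describe_unicode_characters is hypothetical, not part of the original script):

```python
import unicodedata

def describe_unicode_characters(text):
    """Return (char, code point, name) for every non-ASCII character in text."""
    results = []
    for ch in text:
        if ord(ch) > 127:
            # The second argument is a fallback for unnamed code points
            name = unicodedata.name(ch, 'UNKNOWN')
            results.append((ch, f'U+{ord(ch):04X}', name))
    return results

print(describe_unicode_characters('x² naïve'))
# → [('²', 'U+00B2', 'SUPERSCRIPT TWO'), ('ï', 'U+00EF', 'LATIN SMALL LETTER I WITH DIAERESIS')]
```

You could call a helper like this on each line inside the loop above to make the printed report more descriptive.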
Here’s a sample session output:
Interested in creating programmable, cool electronic gadgets? Give my newest book on Arduino a try: Hello Arduino!