Find Superscripts, Subscripts, and Unicode in a text file (Python)

Occasionally, you need to search a text document for special characters such as superscripts, subscripts, symbols, emojis, or any other non-ASCII Unicode characters. This matters when working with data files that should not contain such characters unless they are explicitly required and handled. Most editors, including Word, lack a ‘Find’ feature that reveals every such character in a file; you have to search for a specific character you already know is there. I need to detect all of them without knowing in advance which ones the document contains. In this post, I am sharing Python code that does exactly that.

This script opens the file with UTF-8 encoding so that Unicode characters are decoded correctly; if the file is not valid UTF-8, Python raises a UnicodeDecodeError while reading it. The script reads the file line by line and uses a regular expression to find every run of characters outside the basic ASCII range (i.e., characters with a code point greater than 127). For each line, findall() returns a list of all matches, and whenever that list is not empty the script prints the line number together with the matches. Below is the code:

# Author: Tony Rahman.
import re

def find_unicode_characters(filename):
    # Matches any run of characters that are not basic ASCII (code point > 127)
    pattern = re.compile(r'[^\x00-\x7F]+')
    with open(filename, 'r', encoding='utf-8') as file:
        # Count lines starting at 1 so the report matches what an editor shows
        for line_number, line in enumerate(file, start=1):
            matches = pattern.findall(line)
            if matches:
                print(f'Line {line_number}: {matches}')


filename = input("Enter the file name or full path to the text file: ")
filename = filename.strip()
find_unicode_characters(filename)
###
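
A list of matches shows that a non-ASCII character is present, but not always what it is (a superscript two and a regular two can look alike in some fonts). One possible extension, sketched below, looks up each character's code point and official Unicode name with the standard `unicodedata` module. The function name `describe_non_ascii` and the `'UNKNOWN'` fallback are my own illustrative choices, not part of the script above.

```python
import re
import unicodedata

# One non-ASCII character per match (no '+'), so each character is named separately.
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    results = []
    for ch in NON_ASCII.findall(text):
        # Without a default, unicodedata.name() raises ValueError for unnamed characters.
        name = unicodedata.name(ch, 'UNKNOWN')
        results.append((ch, f'U+{ord(ch):04X}', name))
    return results


if __name__ == '__main__':
    # Hypothetical sample string containing one superscript and one subscript.
    for ch, code_point, name in describe_non_ascii('E = mc² and H₂O'):
        print(f'{code_point} {name}')
    # U+00B2 SUPERSCRIPT TWO
    # U+2082 SUBSCRIPT TWO
```

Because the Unicode names contain the words SUPERSCRIPT and SUBSCRIPT, the returned tuples can also be filtered by name if only those characters are of interest.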

Here’s a sample session output:
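
As a stand-in illustration of the report format (the sample lines below are made up and are not the original session), here is a small sketch that applies the same pattern to a list of in-memory lines instead of a file. The helper name `scan_lines` is hypothetical; it mirrors the loop inside `find_unicode_characters` but returns the hits so they can be inspected.

```python
import re

# Same pattern as in the script: runs of one or more non-ASCII characters.
pattern = re.compile(r'[^\x00-\x7F]+')


def scan_lines(lines):
    """Return (line_number, matches) for every line containing non-ASCII characters."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        matches = pattern.findall(line)
        if matches:
            hits.append((line_number, matches))
    return hits


# Made-up sample text, standing in for the contents of a file.
sample = [
    'Area is 25 m² in total\n',
    'plain ASCII line\n',
    'Water is H₂O, priced at 3 €\n',
    'Arrows: →← done\n',
]

for line_number, matches in scan_lines(sample):
    print(f'Line {line_number}: {matches}')
# Line 1: ['²']
# Line 3: ['₂', '€']
# Line 4: ['→←']
```

Note that the two adjacent arrows on line 4 are reported as a single match, because the `+` in the pattern groups consecutive non-ASCII characters into one run.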

