Occasionally, it becomes necessary to search a text document for special characters such as superscripts, subscripts, symbols, emojis, or other Unicode characters. This matters when working with data files that should not contain such characters unless they are explicitly required and managed. Most editors, including Word, lack a ‘Find’ feature that reveals all Unicode characters in a file; they require you to search for a specific, known character. I needed a way to detect all such characters without knowing in advance whether any were present in the document. In this post, I am sharing my Python code that offers exactly this functionality.
This script opens the file with UTF-8 encoding, which can handle Unicode characters. It then uses a regular expression to find any character that is not a basic ASCII character (i.e., any character with a code value greater than 127). The findall() function returns a list of all matches. The script reads a given file line by line and prints the line numbers where a Unicode character was found. Below is the code:
```python
# Author: Tony Rahman.
import re

def find_unicode_characters(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        # Matches any character that is not a basic ASCII character
        pattern = re.compile(r'[^\x00-\x7F]+')
        line_number = 1
        for line in file:
            matches = pattern.findall(line)
            if matches:
                print(f'Line {line_number}: {matches}')
            line_number += 1

filename = input("Enter the file name or full path to the text file: ")
filename = filename.strip()
find_unicode_characters(filename)
```
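If you also want to know *which* characters were found, not just where, a small extension could report each non-ASCII character together with its code point and official Unicode name using the standard unicodedata module. This is a sketch of my own (the helper name describe_unicode_characters is hypothetical, not part of the original script):

```python
import unicodedata

def describe_unicode_characters(text):
    """Return (char, code point, name) for every non-ASCII character in text."""
    results = []
    for ch in text:
        if ord(ch) > 127:
            # The second argument is a fallback for unnamed code points
            name = unicodedata.name(ch, 'UNKNOWN')
            results.append((ch, f'U+{ord(ch):04X}', name))
    return results

print(describe_unicode_characters('x² naïve'))
# → [('²', 'U+00B2', 'SUPERSCRIPT TWO'), ('ï', 'U+00EF', 'LATIN SMALL LETTER I WITH DIAERESIS')]
```

You could call a helper like this on each line inside the loop above to make the printed report more descriptive.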
Here’s a sample session output:
Interested in creating programmable, cool electronic gadgets? Give my newest book on Arduino a try: Hello Arduino!