Find Superscripts, Subscripts, and Unicode in a text file (Python)
Occasionally, you need to search a text document for special characters such as superscripts, subscripts, symbols, emojis, or any other non-ASCII Unicode characters. This matters when working with data files that should not contain such characters unless they are explicitly required and handled. Most editors, including Word, lack a ‘Find’ feature that reveals every such character in a file; you have to search for a specific character you already know is there. I need to detect all of them without knowing in advance which ones the document contains. In this post, I am sharing my Python code that does exactly that.
The script opens the file with UTF-8 encoding so that Unicode characters are decoded correctly. It reads the file line by line and uses a regular expression to find any character that is not a basic ASCII character (i.e., any character with a code point greater than 127). The findall() function returns a list of all matches, and the script prints the line numbers where a non-ASCII character was found. Below is the code:
# Author: Tony Rahman.
import re

pattern = re.compile(r'[^\x00-\x7F]+')  # Matches any run of characters outside the basic ASCII range
filename = input("Enter the file name or full path to the text file: ")
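The listing above stops after the filename prompt. As a rough sketch of the remaining steps described earlier (open the file as UTF-8, scan it line by line, report the line numbers with non-ASCII characters), something like the following could work. The helper names `find_non_ascii` and `scan_file`, and the output format, are assumptions made for illustration and are not the author's original code.

```python
import re

# Same pattern as in the post: one or more consecutive non-ASCII characters.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')


def find_non_ascii(lines):
    """Return (line_number, matches) pairs for lines containing non-ASCII text.

    `lines` is any iterable of strings, such as an open text file.
    The name and return shape are hypothetical, chosen for this sketch.
    """
    hits = []
    for line_number, line in enumerate(lines, start=1):
        matches = NON_ASCII.findall(line)  # list of non-ASCII runs on this line
        if matches:
            hits.append((line_number, matches))
    return hits


def scan_file(path):
    """Open `path` as UTF-8 and print each offending line number with its matches."""
    with open(path, encoding="utf-8") as handle:
        for line_number, matches in find_non_ascii(handle):
            print(f"Line {line_number}: {' '.join(matches)}")


if __name__ == "__main__":
    # Small in-memory demonstration instead of prompting for a file.
    sample = ["plain ascii", "x² + y₂", "café ☕ done"]
    for line_number, matches in find_non_ascii(sample):
        print(line_number, matches)
```

Keeping the matching logic in a function that accepts any iterable of lines makes it easy to try on a short list of strings before pointing `scan_file` at a real file. Note that a file that is not valid UTF-8 will raise a `UnicodeDecodeError` when read, which this sketch does not handle.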