Data cleansing challenge: non-ASCII characters

Non-ASCII characters can pose challenges in data cleansing for several reasons:

Encoding Issues: Different systems and databases may use different character encodings (like UTF-8, ISO 8859-1, etc.). If the encoding isn’t handled correctly during data import/export, non-ASCII characters might not be represented correctly, leading to data corruption.
Inconsistency: Non-ASCII characters can introduce inconsistencies in your data. For example, the word “résumé” could also be written as “resume” or “resumé”. This can complicate text processing tasks like searching, sorting, or matching strings.
Software Compatibility: Some software, especially older versions, may not support non-ASCII characters, which can lead to errors or data loss.
Increased Complexity: Handling non-ASCII characters can make data processing, cleansing, and analysis more complex as you need to take into account various encodings and special characters.

Therefore, it’s a good practice to standardize or normalize text data to ASCII when possible, or ensure correct handling of non-ASCII characters. This helps to maintain data integrity and simplifies subsequent data processing tasks.
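
As a rough illustration of both options, the sketch below first shows what happens when UTF-8 bytes are read back with the wrong encoding, and then folds accented letters to plain ASCII using Python's standard unicodedata module. The helper name to_ascii is my own, chosen for illustration. Dropping accents this way is lossy, so it only suits cases where the accents carry no meaning.

```python
import unicodedata

# "résumé" written with explicit escapes so the code points are unambiguous
word = "r\u00e9sum\u00e9"
raw = word.encode("utf-8")

# Decoding UTF-8 bytes as ISO 8859-1 (Latin-1) corrupts each accented letter
print(raw.decode("latin-1"))  # rÃ©sumÃ©
print(raw.decode("utf-8"))    # résumé


def to_ascii(text):
    """Hypothetical helper: decompose characters, then drop anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(word))            # resume
print(to_ascii("resum\u00e9"))   # resume
```

After folding, “résumé”, “resumé” and “resume” all compare equal, which is what makes searching and matching simpler.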

Superscripts, subscripts, and other “special” characters often look like ASCII characters, but they are not.
For example, text copied from web sites often contains curly double quotes (“ and ”), which should be the straight double quote (") when used in code or data processing.
Likewise, what looks like a dash in ‘⁻’ is actually Unicode code point 8315 (U+207B, superscript minus), not the hyphen-minus from a standard keyboard. There are many more.
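
When in doubt about what a character really is, Python can report its code point and official Unicode name. This is a minimal sketch. The helper name describe is hypothetical, while ord and unicodedata.name are standard Python.

```python
import unicodedata


def describe(ch):
    """Return the decimal code point, U+ notation, and official Unicode name."""
    return ord(ch), f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")


# Superscript minus, keyboard hyphen, curly quotes, straight quote
for ch in ["\u207b", "-", "\u201c", "\u201d", '"']:
    print(describe(ch))
# (8315, 'U+207B', 'SUPERSCRIPT MINUS')
# (45, 'U+002D', 'HYPHEN-MINUS')
# (8220, 'U+201C', 'LEFT DOUBLE QUOTATION MARK')
# (8221, 'U+201D', 'RIGHT DOUBLE QUOTATION MARK')
# (34, 'U+0022', 'QUOTATION MARK')
```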

Consider this set of strings:

H₂O – Water
Na⁺ – Sodium Ion
Cl⁻ – Chloride Ion
CO₂ – Carbon Dioxide
H₃O⁺ – Hydronium Ion
OH⁻ – Hydroxide Ion
C₆H₁₂O₆ – Glucose
H₂SO₄ – Sulfuric Acid
NO₃⁻ – Nitrate Ion
Ca²⁺ – Calcium Ion
PO₄³⁻ – Phosphate Ion
NH₄⁺ – Ammonium Ion
HCO₃⁻ – Bicarbonate Ion
C₂H₅OH – Ethanol
Mg²⁺ – Magnesium Ion
SO₄²⁻ – Sulfate Ion
H₃PO₄ – Phosphoric Acid
HNO₃ – Nitric Acid
CH₃COOH – Acetic Acid
C₃H₇NO₂ – Alanine

Each row contains a chemical formula with superscript and subscript characters, followed by the common name of the compound.
To cleanse this data, we need to either replace the non-ASCII characters with ASCII equivalents or remove them altogether, depending on the scenario.
For example, “H₂O” could be replaced with “H2O”, “Na⁺” with “Na+”, and so on.
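
For superscripts and subscripts specifically, one possible shortcut (an alternative to a hand-written mapping, not the method used later in this post) is Unicode compatibility normalization, NFKC. The sketch below assumes that approach. The names LEFTOVERS and flatten_formula are hypothetical.

```python
import unicodedata

# NFKC turns the superscript minus into U+2212 (MINUS SIGN), not ASCII "-",
# and leaves the en dash (U+2013) untouched, so both are mapped afterwards.
LEFTOVERS = str.maketrans({"\u2212": "-", "\u2013": "-"})


def flatten_formula(text):
    """Hypothetical helper: compatibility-normalize, then fix remaining dashes."""
    return unicodedata.normalize("NFKC", text).translate(LEFTOVERS)


for formula in ["H₂O", "Na⁺", "Cl⁻", "C₆H₁₂O₆"]:
    print(flatten_formula(formula))
# H2O
# Na+
# Cl-
# C6H12O6
```

Note that any flattening of this kind loses the distinction between subscripts and superscripts: “PO₄³⁻” becomes “PO43-”.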

Doing this visually and manually is not only error-prone but also very tedious for a large data set. One of my solutions is to use Python to take care of that task. Below, I’ll share my method and code to accomplish this.

Example 1: To identify the “special”, non-ASCII characters in a file:

import re

def find_unicode_characters(filename):
    # Matches any run of characters outside the basic ASCII range
    pattern = re.compile(r'[^\x00-\x7F]+')
    with open(filename, 'r', encoding='utf-8') as file:
        for line_number, line in enumerate(file, start=1):
            matches = pattern.findall(line)
            if matches:
                print(f'Line {line_number}: {matches}')


filename = 'your_file.txt'  # or .csv etc. Replace string with your actual file name.
find_unicode_characters(filename)

An example output after I specified a file name with possible special characters looks like this:
Line 57: ['’', '’']
Line 100: ['’']
Line 102: ['’']

It shows the line number and the non-ASCII characters found on that line. The function does not replace them; it only identifies them, which is useful for a quick check or for small files. Let’s look at the next example.
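
Before moving on: for larger files, a per-line listing can get long, and a tally of each distinct character may be easier to scan when deciding which replacements are needed. Here is a minimal sketch of that idea. The function name tally_non_ascii is hypothetical, and it takes any iterable of strings, such as an open file.

```python
from collections import Counter


def tally_non_ascii(lines):
    """Count each distinct non-ASCII character across an iterable of strings."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if ord(ch) > 127)
    return counts


sample = ["H₂O – Water", "Na⁺ – Sodium Ion", "CO₂ – Carbon Dioxide"]
for ch, n in tally_non_ascii(sample).most_common():
    print(f"U+{ord(ch):04X} {ch!r}: {n}")
# U+2013 '–': 3
# U+2082 '₂': 2
# U+207A '⁺': 1
```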

Example 2: Find and replace the “special”, non-ASCII characters in a file and write cleansed data into a new file.

To actually replace those characters with ASCII equivalents, I add a second function that performs the replacements in memory and then writes the cleansed data to a new file.

def replace_unicode_characters(filename):
    # Map each non-ASCII character to its ASCII stand-in
    replacements = {
        '₀': '0',
        '₁': '1',  # needed for formulas such as C₆H₁₂O₆
        '₂': '2',
        '₃': '3',
        '₄': '4',
        '₅': '5',
        '₆': '6',
        '₇': '7',
        '₈': '8',
        '₉': '9',
        '⁺': '+',
        '⁻': '-',
        '⁰': '0',
        '¹': '1',
        '²': '2',
        '³': '3',
        '⁴': '4',
        '⁵': '5',
        '⁶': '6',
        '⁷': '7',
        '⁸': '8',
        '⁹': '9',
        '°': ' deg',
        '“': '"',
        '”': '"',
        '‘': "'",
        '’': "'",
        '–': '-',  # en dash, used between formula and name in the sample rows
    }

    with open(filename, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Create a new filename with the suffix "_Stripped.txt"
    new_filename = filename + "_Stripped.txt"

    with open(new_filename, 'w', encoding='utf-8') as file:
        for line in lines:
            for old, new in replacements.items():
                line = line.replace(old, new)
            file.write(line)

    print(f"Modified file saved as {new_filename}")

In the above function, I define a dictionary of replacements that maps each non-ASCII character to its ASCII equivalent. The function reads the file, applies every replacement to each line, appends “_Stripped.txt” to the original filename that was passed to it, and writes the cleansed data to that new file. After writing, it prints the new file name (including the path, if one was given).

To call it, we can first call the previous function to display what’s found, then call this function to replace what’s found and save the cleansed content in a new file.
The calls from the main code would look like this:

filename = 'your_file.txt'  # or .csv etc. Replace string with your actual file name.

# Display the non-ASCII characters (if any) found in the file
find_unicode_characters(filename)
replace_unicode_characters(filename)
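
As a variation, a replacement dictionary like the one above can also drive Python's built-in str.translate, which applies all replacements in a single pass over the text instead of one replace call per entry. This is only a sketch with a shortened dictionary. The names TABLE and cleanse_text are hypothetical.

```python
# A shortened version of the replacement dictionary, for illustration only
replacements = {
    '₂': '2', '₃': '3', '⁺': '+', '⁻': '-',
    '°': ' deg', '“': '"', '”': '"', '’': "'",
}

# str.maketrans accepts single-character keys and string values of any length
TABLE = str.maketrans(replacements)


def cleanse_text(text):
    """Hypothetical helper: apply every replacement in one pass over the string."""
    return text.translate(TABLE)


print(cleanse_text("H₃O⁺ at 25°C"))  # H3O+ at 25 degC
```

The one restriction is that every key must be a single character, which holds for all the entries used in this post.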

I hope this was helpful. Thanks for reading!
