
Data cleansing challenge: non-ASCII characters

Non-ASCII characters can pose challenges in data cleansing for several reasons:

  1. Encoding Issues: Different systems and databases may use different character encodings (like UTF-8, ISO 8859-1, etc.). If the encoding isn’t handled correctly during data import/export, non-ASCII characters might not be represented correctly, leading to data corruption.
  2. Inconsistency: Non-ASCII characters can introduce inconsistencies in your data. For example, the word “résumé” could also be written as “resume” or “resumé”. This can complicate text processing tasks like searching, sorting, or matching strings.
  3. Software Compatibility: Some software, especially older versions, may not support non-ASCII characters, which can lead to errors or data loss.
  4. Increased Complexity: Handling non-ASCII characters can make data processing, cleansing, and analysis more complex as you need to take into account various encodings and special characters.
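
To make the first point concrete, here is a minimal sketch of what an encoding mismatch does to text. The sample string and variable names are illustrative assumptions, not taken from a real data set.

```python
# Hypothetical demo string; any text with accented characters would do.
original = "r\u00e9sum\u00e9"  # "résumé"

# Simulate an export that writes the text as UTF-8 ...
raw_bytes = original.encode("utf-8")

# ... and an import that wrongly assumes ISO 8859-1 (Latin-1).
garbled = raw_bytes.decode("iso-8859-1")
print(garbled)  # rÃ©sumÃ©

# Reversing the wrong step recovers the text, as long as nothing was lost.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

Each accented letter turns into two unrelated characters, which is the typical symptom of UTF-8 data being read with a single-byte encoding.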

Therefore, it’s a good practice to standardize or normalize text data to ASCII when possible, or ensure correct handling of non-ASCII characters. This helps to maintain data integrity and simplifies subsequent data processing tasks.
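
As a rough illustration of normalizing to ASCII, the sketch below folds the “résumé” variants from point 2 into one spelling using Python's standard unicodedata module. The helper name fold_to_ascii is a hypothetical one chosen for this example.

```python
import unicodedata

def fold_to_ascii(text):
    """Decompose accented letters, then drop whatever has no ASCII form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

variants = ["r\u00e9sum\u00e9", "resum\u00e9", "resume"]
print({fold_to_ascii(v) for v in variants})  # {'resume'}
```

Be aware that characters with no decomposition (for example ø or ß) are silently dropped by this approach, so it suits matching and searching better than display.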

Superscripts, subscripts, and other “special” characters often look like ASCII characters, but they are not.
For example, text copied from web sites often includes curly double quotes such as “ or ”, which really should be the straight quote " when used in code or data processing.
What looks like a dash, ‘⁻’, is actually the Unicode superscript minus (code point 8315, or U+207B) and not the dash from a standard keyboard! And there are many more.
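
If you are unsure what a character really is, Python can tell you. This small sketch prints the code point and official Unicode name of a few look-alikes. The describe helper is a hypothetical name, and the characters are written as escapes so it is obvious which ones are meant.

```python
import unicodedata

def describe(ch):
    """Return the decimal code point, U+ notation, and official name."""
    return ord(ch), f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")

# Superscript minus, keyboard dash, curly quotes, straight quote
for ch in ["\u207b", "-", "\u201c", "\u201d", '"']:
    print(describe(ch))

# First two lines of output:
# (8315, 'U+207B', 'SUPERSCRIPT MINUS')
# (45, 'U+002D', 'HYPHEN-MINUS')
```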

Consider this set of strings:

  1. H₂O – Water
  2. Na⁺ – Sodium Ion
  3. Cl⁻ – Chloride Ion
  4. CO₂ – Carbon Dioxide
  5. H₃O⁺ – Hydronium Ion
  6. OH⁻ – Hydroxide Ion
  7. C₆H₁₂O₆ – Glucose
  8. H₂SO₄ – Sulfuric Acid
  9. NO₃⁻ – Nitrate Ion
  10. Ca²⁺ – Calcium Ion
  11. PO₄³⁻ – Phosphate Ion
  12. NH₄⁺ – Ammonium Ion
  13. HCO₃⁻ – Bicarbonate Ion
  14. C₂H₅OH – Ethanol
  15. Mg²⁺ – Magnesium Ion
  16. SO₄²⁻ – Sulfate Ion
  17. H₃PO₄ – Phosphoric Acid
  18. HNO₃ – Nitric Acid
  19. CH₃COOH – Acetic Acid
  20. C₃H₇NO₂ – Alanine

Each row contains a chemical formula with superscript and subscript characters, followed by the common name of the compound.
To cleanse this data, we need to either replace the non-ASCII characters with ASCII equivalents or remove them altogether, depending on the scenario.
For example, “H₂O” could be replaced with “H2O”, “Na⁺” with “Na+”, and so on.
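
One possible shortcut for this kind of replacement (an alternative sketch, not the method used in the rest of this post) is Unicode compatibility normalization from the standard unicodedata module. The function name compat_to_ascii is an assumption for illustration.

```python
import unicodedata

def compat_to_ascii(text):
    """Fold superscripts and subscripts via NFKC, then fix the minus sign."""
    folded = unicodedata.normalize("NFKC", text)
    # NFKC maps SUPERSCRIPT MINUS (U+207B) to MINUS SIGN (U+2212), which is
    # still not ASCII, so it is swapped for the keyboard hyphen-minus here.
    return folded.replace("\u2212", "-")

# H₂O, Na⁺, Cl⁻ written as escapes
for formula in ["H\u2082O", "Na\u207a", "Cl\u207b"]:
    print(compat_to_ascii(formula))
# H2O
# Na+
# Cl-
```

Normalization gives you less control than an explicit mapping: it will not touch curly quotes or en dashes, and it may change characters you did not intend to change.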

Doing this visually and manually is not only error-prone, it’s also very tedious for a large data set. One of my solutions is to use Python to take care of that task. Below, I’ll share my method and code to accomplish this.

Example 1: To identify the “special”, non-ASCII characters in a file:

import re

def find_unicode_characters(filename):
    """Print each line number together with the non-ASCII runs found on it."""
    # Matches one or more consecutive characters outside the basic ASCII range
    pattern = re.compile(r'[^\x00-\x7F]+')
    with open(filename, 'r', encoding='utf-8') as file:
        # enumerate() keeps the 1-based line count for us
        for line_number, line in enumerate(file, start=1):
            matches = pattern.findall(line)
            if matches:
                print(f'Line {line_number}: {matches}')


filename = 'your_file.txt'  # or .csv etc. Replace string with your actual file name.
find_unicode_characters(filename)

Here is example output from running it on a file that contained special characters:
Line 57: ['’', '’']
Line 100: ['’']
Line 102: ['’']

It shows the line number and the characters that are non-ASCII. The above function does not replace them; it only identifies them, which is useful for a quick check or for small files. Let’s look at the next example.
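
For larger files, a per-line listing can get long. A possible variation, sketched here with a hypothetical helper named count_non_ascii, is to tally how often each non-ASCII character occurs so you know which replacements are needed.

```python
from collections import Counter
import unicodedata

def count_non_ascii(lines):
    """Tally every non-ASCII character across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if ord(ch) > 127)
    return counts

# Two rows from the list above, written with escapes (subscript two, en dash)
sample = ["H\u2082O \u2013 Water", "CO\u2082 \u2013 Carbon Dioxide"]
for ch, n in count_non_ascii(sample).most_common():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}: {n}")
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.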

Example 2: Find and replace the “special”, non-ASCII characters in a file and write cleansed data into a new file.

To actually replace those characters with ASCII equivalents, I added another function that does the replacements in memory and then writes the cleansed data to a new file.

def replace_unicode_characters(filename):
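    # Map each non-ASCII character to its ASCII replacement
    replacements = {
        '₀': '0',
        '₁': '1',  # needed for formulas such as C₆H₁₂O₆
        '₂': '2',
        '₃': '3',
        '₄': '4',
        '₅': '5',
        '₆': '6',
        '₇': '7',
        '₈': '8',
        '₉': '9',
        '⁺': '+',
        '⁻': '-',
        '⁰': '0',
        '¹': '1',
        '²': '2',
        '³': '3',
        '⁴': '4',
        '⁵': '5',
        '⁶': '6',
        '⁷': '7',
        '⁸': '8',
        '⁹': '9',
        '°': ' deg',
        '“': '"',
        '”': '"',
        '‘': "'",
        '’': "'",
        '–': '-',  # en dash, used as the separator in the sample rows
    }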

    with open(filename, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Create a new filename with the suffix "_Stripped.txt"
    new_filename = filename + "_Stripped.txt"

    with open(new_filename, 'w', encoding='utf-8') as file:
        for line in lines:
            for old, new in replacements.items():
                line = line.replace(old, new)
            file.write(line)
    
    print(f"Modified file saved as {new_filename}")

The function above defines a dictionary that maps each non-ASCII character to its ASCII replacement. It reads the file, appends “_Stripped.txt” to the original filename that was passed to it (so 'your_file.txt' becomes 'your_file.txt_Stripped.txt'), and writes the new, cleansed file to disk. After writing, it prints the cleansed file name (and path).
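
The nested replace loop is easy to follow, but the same idea can also be expressed with Python's built-in str.maketrans and str.translate, which apply all replacements in one pass. This is only a sketch: the table is deliberately shortened, and the names REPLACEMENTS and cleanse are hypothetical.

```python
# Hypothetical, shortened table; extend it with the full dictionary as needed.
REPLACEMENTS = str.maketrans({
    "\u2082": "2",     # subscript two
    "\u207a": "+",     # superscript plus
    "\u207b": "-",     # superscript minus
    "\u00b0": " deg",  # degree sign (one character may map to several)
    "\u201c": '"',     # left curly double quote
    "\u201d": '"',     # right curly double quote
})

def cleanse(text):
    """Apply every replacement in a single pass over the string."""
    return text.translate(REPLACEMENTS)

print(cleanse("H\u2082O boils at 100\u00b0C"))  # H2O boils at 100 degC
```

Working on strings rather than files also makes the replacement logic easier to test on its own.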

To use it, we can first call the previous function to display what it finds, then call this function to replace those characters and save the cleansed content in a new file.
The calls from the main code would look like this:

filename = 'your_file.txt' # or .csv etc. Replace string with your actual file name.
# Display the non-ASCII characters (if any) found in the file
find_unicode_characters(filename)
replace_unicode_characters(filename)
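
After cleansing, it can be worth confirming that nothing was missed, since any character not listed in the dictionary passes through unchanged. The sketch below is one way to check. The helper name non_ascii_lines is hypothetical, str.isascii() requires Python 3.7 or later, and the temporary file only stands in for a real "_Stripped.txt" output.

```python
import os
import tempfile

def non_ascii_lines(path):
    """Return the line numbers in `path` that still contain non-ASCII text."""
    with open(path, "r", encoding="utf-8") as file:
        return [n for n, line in enumerate(file, start=1) if not line.isascii()]

# Demo on a throwaway file: line 2 still holds a superscript plus (U+207A).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("H2O - Water\nNa\u207a - Sodium Ion\n")

leftovers = non_ascii_lines(tmp.name)
os.remove(tmp.name)  # clean up the demo file
print(leftovers)  # [2]
```

An empty list means the file is pure ASCII. Otherwise, the listed lines show where the replacement dictionary needs more entries.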

I hope this was helpful. Thanks for reading!

