In this blog I share techniques on how to convert a dataset containing words into actual numbers (integer or float) save the output with numbers to an external CSV/XLSX data file.
If you’re wondering why would one ever need this? Here’s a scenario: A list of movies and their corresponding revenues were read into a microphone and converted into text (using speech-to-text engine or dictation) instead of typing. So a revenue of 33048353 was dictated as “thirty three million forty eight thousand three hundred fifty three”. This information was read into Excel or a text file editor along with each movie name, and its revenue separated by a comma. So, we end up having a dataset that looks like this:
Our task is to convert the second column revenue words into numbers so we can do quantitative analysis on revenues. Notice, how the last row even contains a decimal point (“…fifty three point nine five”)!
There’s a neat Python library called word2number that can convert these words into actual numbers. It takes a string of words and will return a numeric output. We can use that. So, next thing we need to do is read the dataset into memory, figure out how many rows there are, and then for each row, extract the RevenueWords column’s content, pass it through word2number method to get a numeric output, store it in an array, and continue doing this for all the rows, and save the array into a new data file (output) along with the associated movie names.
The output file we want should look like as a CSV file:
Notice how the last row even converted the words into decimal point with following digits accurately.
So how did we achieve this?
The code below shows my complete solution. It’s short and sweet without getting fancy with any comprehension or shorthands. I simply read the argument passed by user which is the input file name. So, if the application is called wordconv.py, and our original dataset is in a file called movierevwords.csv we would run it from the Shell as: wordconv.py movierevwords.csv
The first thing I check is if the input argument is supplied, if not, show the proper syntax and quit. Otherwise, the code continues and loads the csv file using pandas library into a dataframe. I verify the column names of the dataframe and from that I see that the column with words I need to convert are stored in column titled ‘ RevenueWords’…so I load all its rows and find the number of rows…which happens to be 9 rows plus the header row.
Then I loop through that many rows of that column and get each string from that column from each row, and pass it onto our helper function ReturnNumber()…which takes the string and returns its corresponding number as an integer or float. As each number is returned by our function, I save that value into an array ra and keep appending each element to it until all rows are traversed.
Finally, we create and add a new column title ‘RevenueConverted’ in our dataframe, copy the converted numbers from ra into that column, while retaining the rest of the dataframe intact from the input file and save the new dataframe into a separate file as defined by OUTPUTFILE. Quite simple once the logic and syntax are figured out….and only a few lines of code.
I hope this was useful. Now imagine what else you can do with this!
Interested in creating programmable, cool electronic gadgets? Give my newest book on Arduino a try: Hello Arduino!▟