Sunday, June 15, 2025
Coding STEM

The Dictionary Challenge-Part 1

OK, today I’m taking on a self-imposed challenge. Here it is:

  1. Get an English dictionary.
  2. Find how many words start with vowels.
  3. Find how many words start with “odd” letters, which I’ll define as Z, Q, and J.
  4. Find the longest word in that dictionary. (Say we’re going to make a word-guessing game.)
  5. All of this must be FREE (no $) and automated in code (no manual counting).

Are you in? Ok, the first step is to figure out where to get a legit dictionary!

Turns out Project Gutenberg has the entire Webster dictionary as an ebook. Unfortunately, it’s not in a structured format such as CSV, JSON, XML, XLS/XLSX, or a database…it’s a raw text dump!

A sample download of the file is a large text file that looks like this:

So, our job is to make sense of it and parse it so that we keep ONLY valid words that are 3 letters or longer. Also, no hyphenated words and no multi-word entries (just single words with no spaces, commas, hyphens, etc.).

First, we have to find a pattern that can be automated. We see that the words are in all uppercase. For this mission, we don’t need definitions, tenses, or other information, just the valid words. We also notice that the valid words (in uppercase) are on a line of their own. That’s a good start!
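That pattern can be written as a small test on a single line of text. This is only a minimal sketch, not the exact code used for this post: the function name `looks_like_headword` and the sample lines are made up for illustration, and the check assumes a headword line contains nothing but the word itself.

```python
def looks_like_headword(line: str) -> bool:
    """Return True if the line is a single all-uppercase word of 3+ letters."""
    word = line.strip()  # drop the trailing newline and any surrounding spaces
    # isalpha() rejects spaces, hyphens, commas, and digits;
    # isupper() rejects anything containing lowercase letters.
    return len(word) >= 3 and word.isalpha() and word.isupper()


# Hypothetical sample lines, not copied from the real file
samples = ["ABACUS\n", "Defn: A counting frame.\n", "A-TIPTOE\n", "AB\n", "AD HOC\n"]
print([s.strip() for s in samples if looks_like_headword(s)])
```

Only the first sample passes: the others fail for being mixed case, hyphenated, too short, or more than one word.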

So, let’s create our own data file without the fluff, containing only the valid English words (per the criteria above) that we can work with!

I’ll use Python, only because I’ve been tinkering with it recently, and because it has pretty good libraries for dealing with text and metrics that I’d otherwise have to write from scratch in C/C++.

I downloaded the raw dictionary file from http://www.gutenberg.net and saved it as Dictionary-Websters-ProjectGutenberg.txt in my personal, designated folder C:\Users\Tony\Documents\Programming\SampleDatasets\

The Code:

So, I want to read that file and save only the upper-cased lines into a new file that I can work with. The output will be Dictionary-Websters-Curated.txt and will only contain words that are 3 letters or longer. So, here’s the code in Python…

Explanation:

We read the source file (raw text), and we create a new file object, curated_file, to which we write only the words that meet the criteria explained above. We read the text file line by line and check whether each line is upper case and at least 3 letters long (allowing for the invisible newline character at the end of each line, which counts toward the line’s length).
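The steps above can be sketched roughly as follows. Treat this as one possible version under stated assumptions rather than the definitive listing: the function name `curate_dictionary`, the `min_letters` parameter, and the choice of UTF-8 with `errors="replace"` are mine, and instead of adjusting the length for the newline it simply strips the newline before measuring. It also applies the single-word rule (no spaces, hyphens, or commas) via `isalpha()`.

```python
from pathlib import Path


def curate_dictionary(source_path, curated_path, min_letters=3):
    """Copy qualifying headword lines to curated_path; return how many were kept."""
    kept = 0
    # errors="replace" is an assumption so stray bytes in the raw dump don't stop the run
    with open(source_path, encoding="utf-8", errors="replace") as source_file, \
         open(curated_path, "w", encoding="utf-8") as curated_file:
        for line in source_file:
            word = line.strip()  # remove the newline before measuring the length
            if len(word) >= min_letters and word.isalpha() and word.isupper():
                curated_file.write(word + "\n")
                kept += 1
    return kept


if __name__ == "__main__":
    folder = Path(r"C:\Users\Tony\Documents\Programming\SampleDatasets")
    source = folder / "Dictionary-Websters-ProjectGutenberg.txt"
    if source.exists():  # only run when the downloaded file is actually there
        count = curate_dictionary(source, folder / "Dictionary-Websters-Curated.txt")
        print(count, "words kept")
```

Note that this sketch does not remove duplicates: if the same headword appears on more than one line of the source, it is written more than once.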

In the next post, we’ll see what this output file looks like and whether it’s good enough to reach our goals. So far, we’ve achieved objectives #1 and #5.

 
