Ok, today I’m inflicting a self-imposed challenge…here it is:
- Get an English dictionary.
- Then find how many words start with vowels.
- Then find how many words start with “odd” letters, which I’ll define as Z, Q, and J.
- Then find the longest word in that dictionary. (Say, we’re going to make a word guessing game)
- All this must be FREE (no $). And automated in code (no manual counting).
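As a rough sketch of what the three counting goals boil down to once a clean word list exists: this assumes the words are already uppercase strings in a Python list. The tiny sample list and the function name summarize_words are placeholders of mine, not the real dictionary or the final code.

```python
VOWELS = ("A", "E", "I", "O", "U")
ODD_LETTERS = ("Z", "Q", "J")  # the "odd" letters chosen for this challenge

def summarize_words(words):
    """Return (vowel_count, odd_count, longest_word) for a list of uppercase words."""
    # str.startswith accepts a tuple and matches any of its entries
    vowel_count = sum(1 for w in words if w.startswith(VOWELS))
    odd_count = sum(1 for w in words if w.startswith(ODD_LETTERS))
    # max() with key=len returns the first longest word; "" if the list is empty
    longest_word = max(words, key=len, default="")
    return vowel_count, odd_count, longest_word

# Hypothetical sample, just to exercise the function
sample_words = ["APPLE", "BAG", "JUMP", "ORANGE", "QUIZ", "ZEBRA"]
print(summarize_words(sample_words))  # (2, 3, 'ORANGE')
```

The hard part, of course, is getting that clean list in the first place.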
Are you in? Ok, the first step is to figure out where to get a legit dictionary!
Turns out Project Gutenberg has the entire Webster dictionary as an ebook. Unfortunately, it’s not in any structured format such as CSV, JSON, XLS/XLSX, XML, or a database…it’s a raw text dump!
The download is one large text file; a sample of it looks like this:
The Project Gutenberg EBook of Webster's Unabridged Dictionary, by Various
http://www.gutenberg.org
This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.net
*******
My note: All words in dictionary are in upper-case on its own line. To avoid 1-letter word, check length of string >1.
Title: Webster's Unabridged Dictionary
Author: Various
Release Date: August 22, 2009 [EBook #29765]
Language: English
Produced by Graham Lawrence
*** start dictionary ***
A
A (named a in the English, and most commonly ä in other languages).
Defn: The first letter of the English and of many other alphabets. The capital A of the alphabets of Middle and Western Europe, as also the small letter (a), besides the forms in Italic, black letter, etc., are all descended from the old Latin A, which was borrowed from the Greek Alpha, of the same form; and this was made from the first letter (Aleph, and itself from the Egyptian origin. The Aleph was a consonant letter, with a guttural breath sound that was not an element of Greek articulation; and the Greeks took it to represent their vowel Alpha with the ä sound, the Phoenician alphabet having no vowel symbols. This letter, in English, is used for several different vowel sounds. See Guide to pronunciation, §§ 43-74. The regular long a, as in fate, etc., is a comparatively modern sound, and has taken the place of what, till about the early part of the 17th century, was a sound of the quality of ä (as in far).
2. (Mus.)
Defn: The name of the sixth tone in the model major scale (that in C), or the first tone of the minor scale, which is named after it the scale in A minor. The second string of the violin is tuned to the A in the treble staff.
-- A sharp (A#) is the name of a musical tone intermediate between A and B.
-- A flat (A) is the name of a tone intermediate between A and G.
A per se Etym: (L. per se by itself), one preëminent; a nonesuch. [Obs.]
O fair Creseide, the flower and A per se Of Troy and Greece. Chaucer.
A
A (# emph. #).
1. Etym: [Shortened form of an. AS. an one. See One.]
Defn: An adjective, commonly called the indefinite article, and signifying one or any, but less emphatically.
Defn: "At a birth"; "In a word"; "At a blow". Shak.
...
BAG
Bag, v. t. [imp. & p. p. Bagged(p. pr. & vb. n. Bagging]
1. To put into a bag; as, to bag hops.
2. To seize, capture, or entrap; as, to bag an army; to bag game.
3. To furnish or load with a bag or with a well filled bag.
A bee bagged with his honeyed venom. Dryden.
BAG
Bag, v. i.
....
So, our job is to make sense of it and parse it so that we ONLY keep valid words that are 3 letters or longer. Also, no hyphenated words and no multi-word entries (just single words with no spaces, commas, hyphens, etc.).
First, we have to find a pattern that can be automated. We see that the words are in all uppercase. For this mission, we don’t need definitions, tenses, or other information, just the valid words. We also notice that the valid words (in uppercase) are each on a line of their own. That’s a good start!
So, let’s create our own data file without the fluff, containing only the valid English words (with the above criteria) that we can work with!
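Put together, the criteria and the uppercase-on-its-own-line pattern could be tested per line with something like the sketch below. The helper name is_valid_word and the min_length parameter are my own placeholders, and "WELL-BEING" is a made-up test line, not one taken from the file. One assumption to flag: isalpha() also accepts accented letters, so this is a bit looser than a strict A to Z check.

```python
def is_valid_word(line, min_length=3):
    """Return True if the line is one all-uppercase word of at least min_length letters."""
    word = line.strip()  # drop the trailing newline and any stray spaces
    # isalpha() rejects spaces, hyphens, commas, and digits in one go;
    # isupper() rejects definition lines such as "Bag, v. t."
    return len(word) >= min_length and word.isalpha() and word.isupper()

# A few lines as they might come out of the raw file (newline included)
for candidate in ["BAG\n", "A\n", "Bag, v. t.\n", "WELL-BEING\n"]:
    print(repr(candidate), is_valid_word(candidate))
```

Only the first candidate passes: "A" is too short, the definition line is not uppercase, and the hyphenated entry fails isalpha().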
I’ll use Python only because I’ve been tinkering with it recently, and also because it has pretty good libraries for dealing with text and metrics that I’d otherwise have to write from scratch in C/C++.
I downloaded the raw dictionary file from http://www.gutenberg.net and saved it as Dictionary-Websters-ProjectGutenberg.txt in my personal, designated folder C:\Users\Tony\Documents\Programming\SampleDatasets\
So, I want to read that file and save only the upper-cased lines into a new file that I can work with. The output will be Dictionary-Websters-Curated.txt and will only contain words that are 3 letters or longer. So, here’s the code in Python…
    source_file = "C:\\Users\\Tony\\Documents\\programming\\SampleDatasets\\Dictionary-Websters-ProjectGutenberg.txt"
    curated_file = "C:\\Users\\Tony\\Documents\\programming\\SampleDatasets\\Dictionary-Websters-Curated.txt"

    my_full_list = []  # this list could hold only the chosen words. MEMORY INTENSIVE!

    f = open(source_file)
    line = f.readline()
    with open(curated_file, 'w') as cf:
        while line:  # readline() returns '' at end of file, which ends the loop
            if line.isupper():  # keep only all-uppercase lines (the headwords)
                if len(line) > 3:  # the line still ends with '\n', so this means 3+ letters
                    cf.write(line)
            line = f.readline()
    f.close()
We open the source file (raw text) for reading and create a new file, curated_file, to write to. We read the text line by line and keep a line only if it’s all upper case and its length is greater than 3, which works out to 3 letters or more once you account for the line-return character at the end of each line (an invisible character that readline() keeps). Note that this check alone doesn’t yet weed out hyphenated or multi-word entries, since isupper() ignores hyphens and spaces.
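To see why the length test uses 3 and not 2, here is a tiny illustration of how the trailing newline inflates the count. The function name passes_length_check is just a label of mine for the same comparison used in the code above.

```python
def passes_length_check(line):
    """Same comparison as in the post: the newline counts toward len()."""
    return len(line) > 3

three_letters = "BAG\n"  # how readline() hands back a 3-letter headword
two_letters = "AB\n"     # a 2-letter line, newline included

print(len(three_letters), passes_length_check(three_letters))  # 4 True
print(len(two_letters), passes_length_check(two_letters))      # 3 False
print(len(three_letters.rstrip("\n")))                         # 3
```

An alternative would be to strip the newline first and compare against the real letter count, which makes the intent easier to read at the cost of one extra call per line.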
In the next post, we’ll see what this output file looks like and whether it’s good enough to reach our goals. So far, we’ve achieved objectives #1 and #5.