Word War III – Dev Diary – 02: Word Smithing

So I need to find some words, more importantly I need a list of words that gradually gets more and more difficult to guess.

So what makes a word hard to guess? Doing some research on the web turns up a couple of interesting posts such as this one and this one.

It turns out that short words are harder to guess, especially ones that have “non-obvious” letters. Thus words such as “jazz”, “jug”, “by” and “gym” are much harder to guess then words such as “deployments”, “historical” or “compartmentalised”.

The list of words I have decided to use scores each word based on the relative frequencies of letters in the English language.

So “jazz” scores: j(0.153) + a(8.167) + z(0.074) = 8.394.

Whereas “deployments” scores: d(4.253) + e(12.702) + p(1.929) + l(4.025) + o(7.507) + y(1.974) + m(2.406) + n(6.749) + t(9.056) + s(6.327) = 56.928.

Lower scores indicate words that are harder to guess when you have a limited number of turns.

Letter frequency for the English language – source Wikipedia

As the blog post points out this algorithm is not perfect “cup” scores 7.469 but “gym” scores 6.396. The reality is that if “_y_” is guessed ( players often work through the vowels and then onto “Y”) there isn’t a lot of options but “gym”. However if you guess “_u_” there is quite a lot of options besides “cup”. So actually “cup” should be a harder word to guess then “gym”.

However for my purposes this list will do very nicely. Unfortunately it’s 173528 words long which is a wee bit bigger then I need :) Also my design only allows me to display words with a length of 10 characters or less so I need to cull words longer then 10 characters as well.

Enter the pig

Apache Pig is something that I have used in the past when working with “Big Data” datasets. The word list isn’t exactly Big Data but Pig will happily suit my word list mangling needs.

First off I only want words that are between 2 and 10 characters. Next I only want a sample of the 173528 words. Lastly I want the list sorted from easiest to hardest. Here is my script:

allWords = LOAD 'allscores.txt' USING PigStorage() AS (word:chararray, junk, score:float);
lessThen10 = FILTER allWords BY SIZE(word) <= 10 AND SIZE(word) > 1;
shortList = SAMPLE lessThen10 0.02;

groupAll = GROUP shortList ALL;
wordsWithMaxScore = FOREACH groupAll GENERATE FLATTEN(shortList), MAX(shortList.score) AS maxScore;

wordsWithRatio = FOREACH wordsWithMaxScore GENERATE shortList::word AS word,shortList::score AS score,
                 maxScore, (shortList::score/maxScore) AS ratio;

ordered = ORDER wordsWithRatio BY ratio DESC;
justWords = FOREACH ordered GENERATE word;

STORE ordered INTO 'wordsAndScores' USING PigStorage();
STORE justWords INTO 'words' USING PigStorage();

This produces two files, one with the words, their scores and a calculated ratio (0 to 10) based on the words score vs the maximum score in the sample list of words (wordsAndScores):

tendrilous 66.33   66.33   1.0
breathings  65.555  66.33   0.988316
rediscount  65.087  66.33   0.9812603
inoculated  64.965  66.33   0.979421
atrophies   64.735  66.33   0.9759535
stewarding  64.582  66.33   0.9736469
tailenders  64.232  66.33   0.96837026
nonethical  64.048  66.33   0.9655962
authorized  63.564  66.33   0.9582994
destroying  63.537  66.33   0.9578923

The second file (words) contains just the words:

tendrilous
breathings
rediscount
inoculated
atrophies
stewarding
tailenders
nonethical
authorized
destroying

Each file contains 2477 words sampled out of the original list of 173k words.

I was initial going to use the ratio to group blocks of similar difficulty words together but have abandoned this idea. The plan is now to simply step through the list in random increments which will have the effect of gradually increasing the difficulty of the words during a game but not result into too many words repeating between games.

One problem with the list is that it contains words that are not in common usage for example “tendrilous” (adjective for “tendril”, a specialized threadlike leaf or stem that attaches climbing plants to a support by twining or adhering) and “kohlrabies” (plural of “kohlrabi”, A cabbage of a variety with an edible turnip like swollen stem). So I will need to manually groom the list at some stage :)

Loading the word list

Loading the word list in Unity is pretty straight forward. Firstly I copy the file into my project’s resource folder and name it words.txt, next I use the following code to load the words into a string list:

List words;

void LoadWordList()
{
    words = new List ();
    StringReader reader = null;

    TextAsset words = (TextAsset)Resources.Load("words", typeof(TextAsset));

    reader = new StringReader (words.text);
    if (reader == null)
    {
        Debug.Log("words.txt not found or not readable");
    } else
    {
        string txt;

        // Read each line from the file
        while ((txt = reader.ReadLine()) != null)
        {
            words.Add(txt);
        }
    }

    Debug.Log("Loaded " + words.Count + " words");
}

With the word list in place, I think I will next focus on getting the game’s screen flows sorted. Onwards and upwards!

_Update: _Click here for the next post in this series