Word War III – Dev Diary – 02: Word Smithing
So I need to find some words. More importantly, I need a list of words that gradually gets more and more difficult to guess.
So what makes a word hard to guess? Doing some research on the web turns up a couple of interesting posts such as this one and this one.
It turns out that short words are harder to guess, especially ones that have “non-obvious” letters. Thus words such as “jazz”, “jug”, “by” and “gym” are much harder to guess than words such as “deployments”, “historical” or “compartmentalised”.
The list of words I have decided to use scores each word by summing the relative frequencies (as percentages) of its letters in the English language. Each distinct letter is counted once, however many times it appears in the word.
So “jazz” scores: j(0.153) + a(8.167) + z(0.074) = 8.394.
Whereas “deployments” scores: d(4.253) + e(12.702) + p(1.929) + l(4.025) + o(7.507) + y(1.974) + m(2.406) + n(6.749) + t(9.056) + s(6.327) = 56.928.
Lower scores indicate words that are harder to guess when you have a limited number of turns.
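For illustration, the scoring rule can be sketched in a few lines of C#. This is not the code that produced the list used here; `WordScorer` and `Score` are made-up names, and the frequency table is the commonly cited one for English text, which agrees with the values quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordScorer
{
    // Assumed table: commonly cited relative letter frequencies in English, in percent.
    private static readonly Dictionary<char, double> LetterFrequency = new Dictionary<char, double>
    {
        {'a', 8.167}, {'b', 1.492}, {'c', 2.782}, {'d', 4.253}, {'e', 12.702},
        {'f', 2.228}, {'g', 2.015}, {'h', 6.094}, {'i', 6.966}, {'j', 0.153},
        {'k', 0.772}, {'l', 4.025}, {'m', 2.406}, {'n', 6.749}, {'o', 7.507},
        {'p', 1.929}, {'q', 0.095}, {'r', 5.987}, {'s', 6.327}, {'t', 9.056},
        {'u', 2.758}, {'v', 0.978}, {'w', 2.360}, {'x', 0.150}, {'y', 1.974},
        {'z', 0.074}
    };

    // Sums the frequency of each distinct letter; repeated letters count once.
    // Characters outside a-z are ignored.
    public static double Score(string word)
    {
        return word.ToLowerInvariant()
                   .Distinct()
                   .Where(c => LetterFrequency.ContainsKey(c))
                   .Sum(c => LetterFrequency[c]);
    }
}
```

Calling `WordScorer.Score("jazz")` gives roughly 8.394 and `WordScorer.Score("deployments")` roughly 56.928, matching the sums above up to floating-point rounding.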
As the blog post points out, this algorithm is not perfect: “cup” scores 7.469 but “gym” scores 6.396, so the scoring rates “gym” as the harder word. The reality is that if “_y_” is revealed (players often work through the vowels and then move on to “Y”) there aren’t many options other than “gym”. However, if “_u_” is revealed there are quite a lot of options besides “cup”. So “cup” should actually be a harder word to guess than “gym”.
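The “_y_” versus “_u_” point can be made concrete by counting how many dictionary words still fit a partly revealed pattern. The sketch below is only an illustration: `PatternMatcher` is a hypothetical helper, it assumes lowercase words, and the tiny word list in the usage note is made up rather than taken from the real 173528-word list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PatternMatcher
{
    // Returns the words that fit a hangman-style pattern, where '_' marks
    // a letter that has not been revealed yet.
    public static List<string> Candidates(string pattern, IEnumerable<string> dictionary)
    {
        // A letter that is already revealed cannot also hide behind a blank,
        // because guessing it would have revealed every occurrence.
        var revealed = new HashSet<char>(pattern.Where(c => c != '_'));

        return dictionary
            .Where(word => word.Length == pattern.Length)
            .Where(word => word
                .Zip(pattern, (letter, slot) =>
                    slot == '_' ? !revealed.Contains(letter) : letter == slot)
                .All(fits => fits))
            .ToList();
    }
}
```

With the made-up list `gym, cup, cut, bug, hum, pun, sun, why`, the pattern `_y_` leaves only “gym”, while `_u_` still leaves six candidates, which is why “cup” is harder to pin down in practice.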
However, for my purposes this list will do very nicely. Unfortunately it’s 173528 words long, which is a wee bit bigger than I need :) Also, my design only allows me to display words with a length of 10 characters or less, so I need to cull words longer than 10 characters as well.
Enter the pig
Apache Pig is something that I have used in the past when working with “Big Data” datasets. The word list isn’t exactly Big Data but Pig will happily suit my word list mangling needs.
First off I only want words that are between 2 and 10 characters. Next I only want a sample of the 173528 words. Lastly I want the list sorted from easiest to hardest. Here is my script:
allWords = LOAD 'allscores.txt' USING PigStorage() AS (word:chararray, junk, score:float);
lessThen10 = FILTER allWords BY SIZE(word) <= 10 AND SIZE(word) > 1;
shortList = SAMPLE lessThen10 0.02;
groupAll = GROUP shortList ALL;
wordsWithMaxScore = FOREACH groupAll GENERATE FLATTEN(shortList), MAX(shortList.score) AS maxScore;
wordsWithRatio = FOREACH wordsWithMaxScore GENERATE shortList::word AS word, shortList::score AS score, maxScore, (shortList::score/maxScore) AS ratio;
ordered = ORDER wordsWithRatio BY ratio DESC;
justWords = FOREACH ordered GENERATE word;
STORE ordered INTO 'wordsAndScores' USING PigStorage();
STORE justWords INTO 'words' USING PigStorage();
This produces two files. The first (wordsAndScores) contains each word, its score, the maximum score in the sampled list, and a calculated ratio (0 to 1) of the word’s score to that maximum:
tendrilous   66.33    66.33   1.0
breathings   65.555   66.33   0.988316
rediscount   65.087   66.33   0.9812603
inoculated   64.965   66.33   0.979421
atrophies    64.735   66.33   0.9759535
stewarding   64.582   66.33   0.9736469
tailenders   64.232   66.33   0.96837026
nonethical   64.048   66.33   0.9655962
authorized   63.564   66.33   0.9582994
destroying   63.537   66.33   0.9578923
The second file (words) contains just the words:
tendrilous
breathings
rediscount
inoculated
atrophies
stewarding
tailenders
nonethical
authorized
destroying
Each file contains 2477 words sampled out of the original list of 173k words.
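For readers who don’t have Pig to hand, the same filter, sample, ratio and sort steps can be approximated with LINQ. This is a rough sketch and not a replacement for the script above: `WordListBuilder` and `ScoredWord` are made-up names, and keeping each row with probability `sampleRate` is an assumption about how closely this mimics Pig’s SAMPLE.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ScoredWord
{
    public string Word;
    public double Score;
    public double Ratio;
}

public static class WordListBuilder
{
    // scored: (word, score) pairs, e.g. parsed from allscores.txt.
    // sampleRate: fraction of rows to keep, e.g. 0.02.
    public static List<ScoredWord> Build(
        IEnumerable<KeyValuePair<string, double>> scored, double sampleRate, Random rng)
    {
        var sample = scored
            .Where(p => p.Key.Length > 1 && p.Key.Length <= 10) // 2 to 10 characters
            .Where(p => rng.NextDouble() < sampleRate)          // random sample
            .ToList();

        if (sample.Count == 0)
        {
            return new List<ScoredWord>();
        }

        double maxScore = sample.Max(p => p.Value);

        return sample
            .Select(p => new ScoredWord { Word = p.Key, Score = p.Value, Ratio = p.Value / maxScore })
            .OrderByDescending(w => w.Ratio) // easiest (highest score) first
            .ToList();
    }
}
```

Because the sample is random, the number of words kept will vary from run to run around 2% of the filtered list.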
I was initially going to use the ratio to group blocks of similarly difficult words together but have abandoned this idea. The plan is now to simply step through the list in random increments, which will gradually increase the difficulty of the words during a game without too many words repeating between games.
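A minimal sketch of the random-increment idea, assuming the list is already sorted from easiest to hardest. `WordStepper` and its `maxStep` parameter are hypothetical; the size of the increments hasn’t been decided yet, so treat the numbers as placeholders.

```csharp
using System;
using System.Collections.Generic;

public sealed class WordStepper
{
    private readonly IList<string> words;
    private readonly Random rng;
    private readonly int maxStep;
    private int index = -1; // nothing picked yet

    public WordStepper(IList<string> words, int maxStep, Random rng)
    {
        if (maxStep < 1)
        {
            throw new ArgumentOutOfRangeException("maxStep", "maxStep must be at least 1");
        }
        this.words = words;
        this.maxStep = maxStep;
        this.rng = rng;
    }

    // Moves forward by a random amount between 1 and maxStep (inclusive)
    // and returns the word found there, or null once the list runs out.
    public string Next()
    {
        index += rng.Next(1, maxStep + 1); // upper bound of Random.Next is exclusive
        return index < words.Count ? words[index] : null;
    }
}
```

Since the index only ever moves forward, each word is harder (by score) than the one before it, and different random steps give each game a different subset of the list. A larger `maxStep` means fewer words per game and a steeper difficulty curve.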
One problem with the list is that it contains words that are not in common usage, for example “tendrilous” (the adjective for “tendril”, a specialized threadlike leaf or stem that attaches climbing plants to a support by twining or adhering) and “kohlrabies” (the plural of “kohlrabi”, a variety of cabbage with an edible, turnip-like swollen stem). So I will need to manually groom the list at some stage :)
Loading the word list
Loading the word list in Unity is pretty straightforward. First I copy the file into my project’s Resources folder and name it words.txt, then I use the following code to load the words into a string list:
using System.Collections.Generic;
using System.IO;
using UnityEngine;

// The field and method below live inside a MonoBehaviour class
List<string> words;

void LoadWordList()
{
    words = new List<string>();

    // Resources.Load takes the asset name without the .txt extension
    TextAsset wordAsset = (TextAsset)Resources.Load("words", typeof(TextAsset));
    if (wordAsset == null)
    {
        Debug.Log("words.txt not found or not readable");
        return;
    }

    StringReader reader = new StringReader(wordAsset.text);
    string txt;
    // Read each line from the file
    while ((txt = reader.ReadLine()) != null)
    {
        words.Add(txt);
    }
    Debug.Log("Loaded " + words.Count + " words");
}
With the word list in place, I think I will next focus on getting the game’s screen flows sorted. Onwards and upwards!