Document Distance Optimization
February 13, 2021
One of the first assignments in MIT’s Data Structures and Algorithms course on OpenCourseWare is analyzing the evolution of a document distance program across eight iterations. You can download the .py files here, in the Lecture Notes ZIP for Lecture 2.
I’ll be using these variables, and looking at the code function by function.
| Quantity | Variable |
| --- | --- |
| Number of Words | n |
| Number of Lines | L |
| Number of Characters | c |
| Number of Unique Words | u |
| Average Words per Line | w |
read_file()
The only change made to this function is a switch in v8 from readlines(), which returns a list with every line in the file represented as a separate string, to read(), which returns the entire file as a single string.
I can’t find any solid information on the performance benefit here. Both calls have to read all c characters either way, but readlines() also has to check every character against ‘\n’ and build a separate string object (plus a list slot) for each of the L lines, while read() just dumps everything into one string, treating ‘\n’ the same as every other char.
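As a rough sketch of the difference (the function names here are mine, not the course’s), the two versions boil down to something like this:

```python
# Pre-v8: readlines() builds one string object per line, plus the list that holds them.
def read_file_as_lines(filename):
    with open(filename) as f:
        return f.readlines()  # list of L strings

# v8: read() returns the whole document as a single string of c characters.
def read_file_as_string(filename):
    with open(filename) as f:
        return f.read()
```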
get_words_from_line_list()
In v1, new words are added to the word list using concatenation. For every line, the function performs the operation
```python
word_list = word_list + words_in_line
```
This creates a new list and copies the contents of both lists into it, for a complexity of O(len(list1) + len(list2)). Over L iterations, this becomes (w + 2w + 3w + … + Lw), which simplifies to Lw(L + 1)/2, or n(L + 1)/2, i.e. O(nL) overall.
In v2, this is replaced with the extend() method, which appends the contents of words_in_line to word_list in place: no new list is created, each word is added only once, and the total work over the whole document is O(n).
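A minimal sketch of the two approaches side by side (line_list and the split() call are stand-ins here, not the course’s code):

```python
line_list = ["the quick brown fox", "jumps over the lazy dog"]  # stand-in input

# v1: repeated concatenation. Every line builds a brand-new list and re-copies
# everything accumulated so far, giving O(nL) work overall.
word_list = []
for line in line_list:
    words_in_line = line.split()  # stand-in for get_words_from_string()
    word_list = word_list + words_in_line

# v2: extend() appends in place, so each word is added exactly once: O(n) overall.
word_list = []
for line in line_list:
    words_in_line = line.split()  # stand-in for get_words_from_string()
    word_list.extend(words_in_line)
```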
In v8, this function still exists, but its contents have changed to what had previously been get_words_from_string(). Lines are no longer used in this version of the program.
get_words_from_string()
The original version of this function creates two lists. character_list accumulates alphanumeric characters until a non-alphanumeric character is reached. The function then checks to make sure the list isn’t empty (which would happen if two non-alphanumeric characters were encountered consecutively), then joins the accumulated characters into a string, which it makes lowercase and adds to the word_list.
In v5, this process is replaced by translate() and split(). Both calls still have to look at every character, but they do that work inside the interpreter’s built-in C implementations rather than in a Python-level loop with per-character method calls and list appends, which is almost certainly where the time savings comes from.
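In modern Python, the v5 idea looks roughly like this (the course code predates Python 3, so this is a paraphrase rather than the original):

```python
import string

# One translation table, built once: punctuation becomes a space,
# uppercase letters become lowercase.
TRANSLATION_TABLE = str.maketrans(
    string.punctuation + string.ascii_uppercase,
    " " * len(string.punctuation) + string.ascii_lowercase,
)

def get_words_from_string(text):
    # translate() normalizes the whole string in one C-level pass,
    # then split() breaks it on whitespace.
    return text.translate(TRANSLATION_TABLE).split()
```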
This function is called L times in versions 1-7, but in v8, it’s called only once (per document, that is).
count_frequency()
Versions 1-3 use nested for loops to iterate through every word in the document and compare it to every unique word collected up to that point. That means n iterations on the outer loop, and (0 + 1 + 2 + … + u) iterations on the inner loop in total. Considering a worst case where every word in the document is unique (u = n), this works out to O(n²) time.
Version 4 collects its unique words in a dictionary, or hash table, instead of a list. Checking whether a given word exists in a hash table takes O(1) time on average, reducing this whole function to O(n) complexity.
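A sketch of the two approaches (close in spirit to the course code, but not verbatim):

```python
# Versions 1-3: for every word, scan a growing list of [word, count] pairs.
def count_frequency_list(word_list):
    freq = []
    for word in word_list:
        for entry in freq:
            if entry[0] == word:
                entry[1] += 1
                break
        else:
            freq.append([word, 1])  # only reached after scanning the whole list
    return freq

# Version 4: a dictionary lookup replaces the inner scan, O(1) on average per word.
def count_frequency_dict(word_list):
    freq = {}
    for word in word_list:
        freq[word] = freq.get(word, 0) + 1
    return freq
```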
sorting()
Versions 1-5 use insertion sort to alphabetize the unique word lists, which is O(u²). v6 switches to merge sort, which is O(u log u).
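For reference, a compact merge sort over a word list might look like this; it’s a generic sketch, not the course’s implementation:

```python
def merge_sort(items):
    # Recursively split until trivially sorted, then merge the sorted halves.
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge step: repeatedly take the smaller front element.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```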
Version 7 ditches sorting completely. Sorting was only useful for speeding up the inner product function when the program was still storing unique words in lists. Now that it’s using dictionaries instead, the inner product can use hashing to achieve similar performance.
inner_product()
Initially, this function is set up to work for unsorted lists. Every unique word in Document 1 is checked against every unique word in Document 2, meaning quadratic complexity, O(u²). In v3, since we know that we are dealing with sorted lists, we can walk both lists with a two-pointer pass (much like the merge step of a merge sort), which has linear complexity. Once dictionaries are introduced and sorting is removed, we switch to hashing: iterate over one dictionary and look each word up in the other. Since each lookup is O(1) on average, this is still linear in the number of unique words, so nothing is lost by dropping the sort.
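A sketch of the dictionary-based version, assuming count_frequency now returns a dict mapping each word to its count (the names here are illustrative, not the course’s exact signatures):

```python
import math

def inner_product(d1, d2):
    # Walk the smaller dictionary and look each word up in the other;
    # each lookup is O(1) on average, so the whole pass is linear.
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    return sum(count * d2.get(word, 0) for word, count in d1.items())

def vector_angle(d1, d2):
    # Document distance is the angle between the two word-frequency vectors.
    numerator = inner_product(d1, d2)
    denominator = math.sqrt(inner_product(d1, d1) * inner_product(d2, d2))
    return math.acos(numerator / denominator)
```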
return
That’s it! The other functions are unchanged throughout all eight versions. When this course was filmed, running docdist1.py on two test TXT files took 228.1 seconds. The same two files run through docdist8.py took only 0.2 seconds. I love optimization!