Z Function


Dreams and Hopes

You are a new computer science graduate, besides a big student loan balance, you also have the big dream of joining a famous tech company and become a software engineer.

You crafted your LinkedIn profile: you connected with your uncle who runs a catering service; you debated on if you should list Spanish as one of the languages you speak (you’ve taken two semesters of Spanish in high school); you uploaded your resume, which managed to be a page long after you included the two summer jobs at Arby’s and McDonald’s.

You sent out applications. You waited, anxiously.

A new email appeared in your inbox:

Hi,

This is Ashley from Goomazon’s Talent Acquisition Team. I came across your resume and think that you might be a good fit for a role here.

The email that just came

Your heart almost stopped beating, and you read on. The email seems unusually long. You read about Goomazon’s culture, their latest projects, and a new bright future despite the “current situations”. Everything sounds good, and you just want to work there!

At the end of the email, a lone paragraph stands out from the rest: it’s in a different font, and the font-size is ridiculously large. You read carefully:

To get everything started, please take some time to complete our online coding challenge. The coding challenge will take about 1.5 hours to complete, so please set aside some uninterrupted time to make sure you can answer the questions at your best.

That same email that just came

The Coding Challenge

“Hey, it’s just like HackerRank!” You exclaimed as the link takes you to a familiar IDE window inside your browser.

The comfort of familiarity eases your nervousness a little. You scrolled through the T&C, you agreed to the privacy policy without reading it.

You started the coding challenge.

“Ha!” You read the first question. It asks you to find the k-th largest number in an unsorted array. “That’s so easy.”

def find_k(arr, k):
# complete your code here
    return sorted(arr)[k - 1]

After thanking the guy named Tim for his awesome sorting algorithm, you clicked on Run All Test. A bold move nonetheless, but you knew that your code is correct.

The tests passed, every single one of them. An adrenaline rush hits you, your already shaky hands from caffeine abuse is now shaking even more.

Submitting your answers, you move on to the second question with some newly found confidence.

Suffix and Prefix?

You are baffled, to say the least.

After overcoming the initial shock of seeing an entire wall of question descriptions, you’ve got some good idea about what the question wants: you have a string, they want you to find sum of the length of common prefixes between each suffix of the string and the original string.

That means, suppose you have a string “ABCD”, you’ll have suffixes “ABCD”, “BCD”, “CD” and “D”. Because none of the suffix share any prefix with the original string “ABCD”, the length of common prefixes is 0.

You have an intuition. You can just take each sub-string and compare it from start to end with the original string. And you started coding:

def common_prefix(string):
    length = 0
    for i in range(len(string)):
        sub_string = string[i:]
        for j, c in enumerate(sub_string):
            if string[i] != c:
                break
            else:
                length += 1
    return length

Satisfied with what you just wrote done, you trace the code in your mind.

“Oh no it’s O(n^2).” Your heart sinks as you come to the saddening realization.

The Big-O

You see, it’s not difficult to see why an O(n^2) algorithm isn’t ideal: if the input size increases by a factor of 10, the execution time will increase by a factor of 100.

It grows very fast.

Let’s go back a little bit and figure out why the algorithm you just wrote is indeed O(n^2). First we have a for loop :

for i in range(len(string))

Which itself is n operations, if we let n be the length of the string. Then inside this for loop, we have another for loop to check if the suffix have a common prefix with the original string:

for j, c in enumerate(sub_string)

The number of operations here equals the length of sub-string varies. But as the length of the sub-string varies through out the outer loop, we should look at the average length of this sub-string.

Given the shortest sub-string has length 0 and the longest sub-string has the same length as the string, which is n, we can get the total length of all the sub-strings by (0 + n) * (n + 1) / 2. The average length of a sub-string is (0 + n) * (n + 1) / 2 / (n + 1), which in turn is conveniently n / 2. This means that on average, our inner for loop has a time complexity of O(n / 2).

The proof of correctness of the above calculations is left as an exercise for the reader.

Armed with the knowledge of the complexity of both the outer and inner for loops, we can derive the overall complexity of the algorithm by multiplying them together. (Intuitively you can think about “doing n/2 tasks n times.”) This eventually give us an overall complexity of O(1/2 n^2), which can be further simplified to O(n^2).

“But wait”, you objected, pointing out that the break statement inside your inner loop should keep run time at a reasonable level. After all, you are stopping calculations when it’s apparent that it’s not going anywhere. It gotta help, right?

It is not hard to come up with an example where your algorithm will defy the performance boost provided by the break and make your life truly miserable:

Imagine the string “AAAAAAAAA … AA”, where each and every character in the string is A. In this case, each and every suffix of the string shares a common prefix with the original string that is exactly the entirety of the suffix strings.

And this is exactly one of the test cases that your algorithm will fail at. Not only did the test designers gave you a string that only has the letter A, there are also 10^5 of them.

You are speechless.

Z Algorithm

The Basic Idea

If you are familiar with competitive coding or having done some coding exercises on websites like Leetcode or HackerRank, you are probably already familiar with the idea of memoization and dynamic programing.

The idea behind the approach is simple: you try to make use of information that you’ve already obtained to speed up later calculations. In the example of Fibonacci numbers, you can store the value of fib(5) to help with the calculation of fib(6).

If we record the information we’ve already obtained in the prefix problem, by recording the length of longest common prefix at each starting position i in the string, we might have something like this:

stringabceabfgi
i012345678
length900020000
Longest common prefix at various starting positions of a string.

Again if you are comfortable with dynamic programming, this table looks very similar to a recursion table.

So, maybe we can make use of the prefix length we have calculated to our advantage.

Imagine you have a string, AAAFFFFAAAFFF. And it’s easy to spot two repeating patterns of AAA within the string.

By observation, we can see that inside the second AAA, the second A shares the same prefix length as the second A in the first AAA. And the third A shares the same as the third A in the first AAA.

This is a great discovery. With this knowledge in mind, we can begin crafting our algorithm.

The Algorithm

The Naive Approach. We must have something to start with when building our prefix table. So we have to make use the naive algorithm when we have no information available from previous calculations.

The Meoization Approach. We can keep track of how far we’ve scanned the string to find a matching pattern. If you think of the process as scanning from left to right, then we can keep track of the right-most position we have visited.

This right-most position tells us how far to the right, from index 0, can we use already computed values of prefix lengths as a starting point of our calculation for the current position on the string.

stringaaafaaaff
i012345678
length921012300
Longest common prefix at various starting positions of a string.

Consider the prefix table above. We are at index 4, and we have used the naive approach to find out that we have a prefix match up to index 6. In this case, 6 will be our right-most position we have “scanned” so far, with respect to the current position at index 4.

This gives us a very big advantage for positions after index 4. When we move onto index 5, since we know index 5 is within a matched prefix pattern, that the prefix length at index 6 should be no different from the prefix length at index 1. (Index 6 and index 1 are both the second element of the aaa pattern.)

This is when the speed-up comes, when we no longer have to use the naive approach to calculate the prefix length at a certain position.

It is worth noting that the prefix length at index 0 is a special case. It is always the length of the entire string by definition, but this value will be problematic in actual use. For example, at index 4, the prefix length can never be 9, since it’s inside the original string.

To make things simple, we can initialize the prefix length at index 0 to 0. This way, we can avoid the problem mentioned above.

Implementation

def zfunction(string: str) -> list:
    z = [0]  # initialize the prefix table
    j = 0  # right-most index

    for i in range(1, len(string)):
        # make use of old data
        if i < j + z[j]:
            # if current index is to the left of right-most
            # index, meaning we can make use of old data.
            #
            # i-j is the relative position of the current
            # index in the previous matched segment we want
            # to make use of. and z[i-j] is the memoized
            # prefix lenght.
            #
            # j+z[j]-i is the prefix length at current index
            # given by the current matched segment.
            #
            # we choose the min value because we do not yet
            # know how far to the right we can keep matching
            # in our current matched segment. and this task
            # is completed by the naive approach later.
            z[i].append(min(z[i - j], j + z[j] - i))
        else:
            # when no old data can be used, we set it to 0,
            # and let the naive approach do its work
            z[i].append(0)

        # naive approach
        while i + z[i] < len(string) and string[z[i]] == string[i + z[i]]:
            z[i] += 1
        
        # update right-most index, this happens if we can find
        # an even longer prefix match at current index i.
        #
        # here we are essentially comparing the length of each
        # "matched" segment.
        if i + z[i] > j + z[j]:
            j = i
        
    
    z[0] = len(string)  # setting z[0] by definition
    return z

Credits and Further Reading

Please refer to the following awesomely written article that inspired this work.

Z-function and its calculation

Z Algorithm by Ivan Yurchenko

,

Leave a Reply

Your email address will not be published.