Algorithms Weekly by Petr Mitrichev: February 2015

Saturday, February 28, 2015

This week in competitive programming

There were three big competitions planned for this week: Challenge24 2015 Electronic Contest, TopCoder SRM 651 and Open Cup 2014-15 Grand Prix of China. However, the first two faced some serious issues, and the Open Cup is yet to happen tomorrow. Given that I've falled a bit behind on the Open Cup rounds, I will keep this week's summary without any news, and continue describing the Open Cup rounds from February.

Last week we discussed a very nice problem from the Grand Prix of Japan: count the number of pairs of strings (s, t) such that the length of s is N, the length of t is M, each character is from an alphabet of size A, and t is a substring of s.

Suppose the string t is fixed. How do we count the number of strings s of given length that contain t? We will count two things: the number a_n of strings s of each length n between 1 and N that do not contain t, and the number b_n of strings s of each length n that contain t, but only as a suffix and not before. First approximation for a_n is simple: we have to take a string of length n-1 and add one character to it, so we have a_n=a_n-1*A. Of course, that formula is not correct because it doesn't take t into account at all. How could t appear in the length-n string, if the first n-1 characters didn't contain t? Well, it must appear at the rightmost position, so the correct formula is a_n=a_n-1*A-b_n.

But how do we find b_n? Well, the first approximation is also simple: we have to take a string without any appearances of t, and add t in the end: b_n=a_n-M. However, this is also not correct because by adding t in the end, we might've also added some occurrences of t before the end, so we need to subtract something. Well, let's look at the first appearance of t that we added - let's say it ends p (p>0) characters from the end. This can happen only if t has a border (a string that is both its prefix and its suffix) of length M-p - see the picture on the left. And for each string of length n-p that ends with t, there's exactly one way to extend it to a "bad" string of length n. So the correct formula is: b_n=a_n-M-sum(b_n-p), where the sum is taken over all positive p such that there's a border of length M-p in t.

In other words, it turns out that knowing all border lengths of t is sufficient for finding the number of strings of any given length that contain it. At the same time, not all sets of border lengths are possible, since having even one border implies a lot of equalities between the string characters, and in case we have several, the string quickly converges to having all characters equal, and thus all border lengths at the same time. This idea inspires a reasonably simple backtracking: let's try to build all non-contradictory sets of border lengths for a string of length M by recursively trying to include/not include the longest border (of length M-1), then M-2, and so on. We will bail out from the recursion as soon as we have a contradiction. In order to check if we have one, we will build equivalence classes of characters according to borders that do exist, and check if they imply a border that we've declared as not existing.

This recursion runs almost instantaneously for M=50, and reports that there are only 2249 non-contradictory sets of border lengths. We have almost solved our problem: we can iterate over all possible sets of border lengths, and find the number of strings s that contain a string t with such set of border lengths using the dynamic programming described above. But note that we must also multiply that by the number of such strings t - so we also need to find the number of strings with the given set of border lengths.

The latter, on the surface, is also computed easily: it's simply A to the power of the number of equivalence classes defined by the borders. But as you've probably guessed, this will overcount: here we won't throw out the strings that have some additional borders. But since we have just 2249 different sets of borders, we can fix this by simply subtracting the numbers of strings for all supersets of our set of border lengths, obtaining a dynamic programming approach for counting t's. And that completes the solution!

The next round of the Open Cup, Northern Grand Prix, happened three weeks ago (results, top 5 on the left). Just like the Grand Prix of Japan, this round featured many interesting problems, but I'd like to highlight two. Problem B was also about string borders: one needed to count the k-th lexicographically borderless word of length n (up to 64), where a word is borderless when it has no non-trivial borders. Riding the wave of the solution explained above, I've tried to use the same approach by using inclusion-exclusion principle over sets of possible border lengths - but unfortunately, this approach was a bit too slow for the time limit. Can you see how to solve this problem faster?

Problem D considered languages defined by deterministic finite automata. More precisely, one needed to define a language containing just the given word, and no other words. When the given word has length n, it's easy to construct an automaton with n+2 states that accepts the given word and only the given word: a chain of n+1 states that have transitions using the characters of the word, the (n+1)-st state being final and all other transitions leading to (n+2)-nd "sink state" which is not final. This problem, however, limited the automata to n+1 states. It's not hard to see that it's impossible to create an automaton with n+1 states that accepts the given word of length n and only that word. Because of that, you were asked two output two automata with at most n+1 states that have the language with just one word in their intersection - in other words, such that both of them accept the given word of length n, and there's no other word such that both of them accept it. Can you figure this one out?

Thanks for reading, and check back next week!

Sunday, February 22, 2015

This week in competitive programming

Codeforces Round 292 took place late on Tuesday (problems, results, top 5 on the left). Haghani has pulled off an impressive victory given that he was not the rating favorite - but of course he did climb much higher in the ratings as a result, now he's ranked 28th. Congratulations! rng_58, on the other hand, could've probably been the rating favorite if not for his recent strategy of solving the hardest problem first. Once again he became the only person to solve the said hardest problem, but couldn't solve two other problems in time and only placed 10th.

TopCoder SRM 650 started next morning less than 8 hours after the Codeforces round ended (problems, results, top 5 on the left), so big congratulations to Endagorion on winning it in addition to placing 3rd earlier, and for being the only one to solve the hard problem!

There were also four Open Cup rounds in January and February. The problems from those rounds are not published yet, but it's been a month now and it's boring to wait longer, so I will try to explain the problem statements in full below.

Open Cup 2014-15 Grand Prix of Japan took place three weeks ago (results, top 5 on the left). Problem I was, on the surface, a repeat of a problem from Petr Mitrichev Contest 9 back from 2011: it simply asked to count the number of pairs of strings (s, t) such that the length of s is N, the length of t is M, each character is from an alphabet of size A, and t is a substring of s. When this problem was given on my contest, N, M and A were up to 100, and you needed to output a floating-point number - the probability that a random pair has this property. No team could solve this problem during the contest - take a look at problem H in standings. The reference solution made heavy use of the fact that we need to output a fraction and are allowed 10^-9 error, since for longer substrings the probability is almost zero, and for shorter substrings we can iterate over all possible prefix functions of the substring, and when the prefix function is fixed counting superstrings is a relatively straightforward dynamic programming on the Knuth-Morris-Pratt automaton.

Fast forward 3.5 years, and now the harder version is given on a contest: now N is up to 200, M is up to 50, A is up to 1000, and you need to find the answer modulo 10⁹+7, so you can't squeeze by with approximation arguments anymore. And one team does solve this problem - congratulations to Alex, Slava and Artem!

The key to solving the harder version is to notice that we don't, in fact, need the entire prefix function of the substring to count the number of superstrings containing it: it turns out that we can just look at the set of prefixes of the substring that are also its suffixes, in other words at the set of its borders. Can you see why the borders are enough?

Thanks for reading, and check back next week for the full explanation of this problem's solution, and for a new hard problem!

Thursday, February 19, 2015

This week in competitive programming

Theorem (Generalized Cayley's formula?).

Given a graph with n labeled vertices, let s be the number of its connected components and m₁,m₂,...,m_s be the numbers of vertices in those components. Then the number of ways to add s-1 edges to this graph so that it becomes connected is n^s-2∏m_i.

When Mikhail `Endagorion' Tikhomirov told me this theorem, I was very suprised that it looks so simple yet I've never encountered it before. Is it a well-known result? Does it have a name?

This theorem comes in handy when solving a problem from previous week's summary: how many simple undirected graphs are there with n labeleld vertices and k bridges?

Here's how we can solve this problem: in addition to the numbera a_n,k of graphs with n vertices and k bridges, we'll also count the number b_n,k of connected graphs with n vertices and k bridges (including, as k=0 case, the number of biconnected graphs), the value c_n,k which is the sum over all graphs with n vertices and k connected components, such that all its connected components are biconnected, of the products of sizes of the components, and finally the number d_n of connected graphs with n vertices.

Let's start with b_n,k with positive k. Since the graph has at least one bridge, the bridges split it into several biconnected components, and the Generalized Cayley's formula mentioned above helps count the ways to join those biconnected components with bridges into a connected graph, so we get b_n,k=n^k-1*c_n,k+1. Now b_n,0 is simply d_n-sum(b_n,k).

To find c_n,k, we first notice that c_n,1=n*b_n,0. For larger k, we should consider how many vertices are there in the biconnected component the contains the first vertex - let's say it has m vertices, then we have to multiply the number of ways to pick the remaining m-1 vertex in the component by the numbe of ways to pick the remaining components. In other words, c_n,k=sum(comb(n-1,m-1)*c_m,1*c_n-m,k-1).

Finding d_n is a standard question that is solved in a similar way: connected graphs are all graphs minus disconnected graphs, and disconnected graphs can be categorized by the size of the connected component containing the first vertex. In other words, d_n=2^n*(n-1)/2-sum(comb(n-1,m-1)*d_m*2^{(n-m)*(n-m-1)/2}).

And finally, we can find the value that we need: a_n,k. Again, let's look at the connected component containing the first vertex, and suppose it has m vertices and p bridges. Then we get a_n,k=sum(comb(n-1,m-1)*b_m,p*a_n-m,k-p).

We compute O(n²) values, spending O(n) for each value in b, c and d, but O(n²) for each value of a, so the total running time is O(n⁴).

Here's the code:

public class Fragile {
    static final long MODULO = (long) (1e9 + 7);

    public int countGraphs(int N, int K) {
        long[][] a = new long[N + 1][N + 1];
        long[][] b = new long[N + 1][N + 1];
        long[][] c = new long[N + 1][N + 1];
        long[] d = new long[N + 1];
        d[0] = 1;
        long[][] comb = new long[N + 1][N + 1];
        comb[0][0] = 1;
        a[0][0] = 1;
        long[] p2 = new long[N * (N - 1) / 2 + 1];
        p2[0] = 1;
        for (int i = 1; i < p2.length; ++i) {
            p2[i] = 2 * p2[i - 1] % MODULO;
        }

        for (int n = 1; n <= N; ++n) {
            comb[n][0] = 1;
            for (int k = 1; k <= N; ++k) {
                comb[n][k] = (comb[n - 1][k - 1] + comb[n - 1][k]) % MODULO;
            }
            d[n] = p2[n * (n - 1) / 2];
            for (int m = 1; m < n; ++m) {
                d[n] = (d[n] - comb[n - 1][m - 1] * d[m] % MODULO * p2[(n - m) * (n - m - 1) / 2] % MODULO + MODULO) % MODULO;
            }
            for (int k = 2; k <= n; ++k) {
                for (int m = 1; m < n; ++m) {
                    c[n][k] = (c[n][k] + comb[n - 1][m - 1] * c[m][1] % MODULO * c[n - m][k - 1]) % MODULO;
                }
            }
            b[n][0] = d[n];
            long pw = 1;
            for (int k = 1; k < n; ++k) {
                b[n][k] = pw * c[n][k + 1] % MODULO;
                pw = pw * n % MODULO;
                b[n][0] = (b[n][0] - b[n][k] + MODULO) % MODULO;
            }
            c[n][1] = n * b[n][0] % MODULO;
            for (int k = 0; k < n; ++k) {
                for (int m = 1; m <= n; ++m) {
                    for (int p = 0; p <= k; ++p) {
                        a[n][k] = (a[n][k] + comb[n - 1][m - 1] * b[m][p] % MODULO * a[n - m][k - p]) % MODULO;
                    }
                }
            }
        }

        return (int) a[N][K];
    }
}

Do you know a simpler way to solve this problem? Do you know a faster way to solve this problem?

My solution at the contest was even more complicated with O(n⁴) states and O(n⁶) running time, so it didn't complete in 2 seconds and I had to precompute the answers and submit those. I've used the following dynamic programming: how many graphs are there with n vertices, m vertices in the biconnected component containing the first vertex, k bridges, and p bridges going out from the biconnected component containing the first vertex.

Now, let's turn to last week's only regular online competition - TopCoder SRM 649 (problems, results, top 5 on the left). Fourteen people solved all three presented problems, but sdya has managed to create a comfortable 100+ point margin with fast solutions and a challenge - congratulations!

Thanks for reading, and check back next week!

Thursday, February 12, 2015

This week in competitive programming

TopCoder SRM 648 has started the last week's contests as early as Monday (problems, results, top 5 on the left, my screencast). The hard problem, while innocent on the surface, turned out to be exceptionally tricky, requiring some serious combinatorics and dynamic programming mastery - check it out! The statement is as simple as: how many simple undirected graphs are there with N labeled vertices and K bridges? N is up to 50.

Codeforces Round 290 took place just several hours later (problems, results, top 5 on the left, my screencast). Problem D was another cute combinatorics exercise. Its most interesting part boiled down to: given a tree, how many ways are there to remove its leaves one by one until we remove everything? Of course, new leaves may appear after removing some. Had the tree been rooted, this would be a pretty standard dynamic programming application, but how to deal with the fact that there's no root?

Finally, Rockethon 2015 gave everybody a shot at prizes on Saturday (problems, results, top 5 on the left). Gennady has carefully progressed through all problems, not really leaving anybody else a chance - congratulations!

There were also several Open Cup rounds in the past weeks, and the Petrozavodsk training camp has assembled an unprecedented number of ACM ICPC teams - but as the same problems are still going to be used for other training camps, I will postpone discussing those for now. Stay tuned for some really tough problems in the coming weeks!

Tuesday, February 3, 2015

This week in competitive programming

Facebook Hacker Cup 2015 has completed its weekly routine with Round 3 on Saturday night (problems with Facebook login, results with Facebook login, top 5 on the left, my screencast). Top 25 will travel to the United States for the final round on March 6, and big congratulations to Gleb who claimed the first place in this really competitive round!

Let's also come back to the interesting problem I mentioned in last week's summary: you are given a directed graph where each arc has a cost, one of the vertices is marked as starting point, and one as ending point. You need to pick a subset of arcs with the smallest total cost such that every path from the starting point to the ending point (including paths that pass through the same vertex or arc more than once) passes through arcs from this subset exactly once.

Whenever we need to pick a subset of arcs that splits a graph in two, the theory of flows and cuts comes to mind. However, we're not simply asked to find a minimum cut on a given graph, and not even on its condensation, as illustrated by the following acyclic example:

In this graph, the minimum cut is of size 2 and separates vertices S and A from vertices B and T. However, the path S-B-A-T crosses the cut three times (twice going outside, and once going inside), and we need for each path to cross the cut exactly once. Basically, the minimum cut is solving the "at least once" version.

But the power of the flow/cut theory often lies in applying it to slightly modified graphs. For cuts in particular, adding infinite arcs to the graph is usually the solution of choice. Here our goal is to make sure one can't re-enter the cut after leaving it, so for each arc we'll add a reverse arc with infinite capacity, as illustrated by the following picture:

In particular, the newly added A-B arc of infinite capacity changes the incorrect cut displayed above from capacity 2 to infinite capacity (1+1+infinity), and thus the smallest cut is now of size 3, displayed by the dotted line to the left. There's also another cut of size 3 - just cut off the source vertex S. It's not hard to verify that every path from S to T croses this cut exactly once.

Another great thing about this addition of infinite arcs is that it allows us to skip the condensation step. If there's a cycle somewhere in our graph, we'll thus add a reverse infinite cycle, and thus no minimum cut will ever cross any cycle!

Such observations make this solution "click" when you see it: instead of fighting with the problem, everything flows naturally and the solution makes sense from all angles. I especially value such moments, I think they show the beauty of mathematics and algorithms.

However, everything was not that simple in this problem :) It turns out that in order for this solution to work correctly in all cases, we need to prune the graph before adding the infinite arcs. More specifically, we need to remove all vertices that don't lie on any path from S to T from the graph. Can you see where we went wrong, and why doesn't the cut theory save us in this case?

Thanks for reading, and check back next week!