Algorithms Weekly by Petr Mitrichev: A Polish week

TopCoder SRM 690 took place in the very early hours of Wednesday, May 4th (problems, results, top 5 on the left). Snuke has managed to score 50 additional points during the challenge phase, and that’s what allowed him to jump from third to first place – congratulations on the first SRM victory!

VK Cup 2016 Round 3 on Saturday selected 20 lucky teams to advance to the onsite finals in St Petersburg (problems, results, top 5 on the left, online mirror results, analysis). The rating favorite team “Never Lucky” bounced back from the relatively weak 4th place showing in Round 2 to win this round by over 1000 points – great job! Nevertheless, there are many other strong teams in the top 20 who will make sure the St Petersburg final round isn’t easy for subscriber and tourist.

Google Code Jam Round 1C selected the final 1000 participants for Round 2 (problems, results, top 5 on the left, analysis). Among the top scorers were the reigning champion Gennady.Korotkevich and linguo who doesn’t participate in competitions other than Google Code Jam but nevertheless does very well each year – but they couldn’t take the first place from artberryx – congratulations on the win!

Finally, Russian Code Cup 2016 Qualification Round 1 wrapped up the tournament-heavy weekend late on Sunday (problems, results, top 5 on the left, analysis). Subscriber pulled out another victory just a few days before the ACM ICPC World Finals, suggesting that the St Petersburg ITMO team is still very well in the running for the top spots there – way to go!

The last problem E required one to combine two relatively standard but normally unrelated algorithms into one solution: given two trees with at most 50 vertices in each one, find the size of the largest connected subtree in the first tree that has an isomorphic connected subtree in the second tree.

In my last summary, I've mentioned a problem that I couldn't solve: you are given two strings of the same length consisting of lowercase English letters, and want to transform the first string into the second one. You are allowed operations of two kinds: changing one character into another takes 1 second, and changing all occurrences of some letter into some other letter (the same for all occurrences) takes C seconds. What is the fastest way to do the required transformation?

Tomek Jarosik has posted a link to analysis of the original contest of this task in comments. Here's my rough understanding.

First of all, it's not hard to see that we can do all group changes before all individual changes – if we're going to waste 1 second for a given character, we can do it in the end and change it directly to what we need. Because of this, the task can be reformulated: now we're only allowed to do group changes, each costing C, and need to minimize total cost of group changes plus the total number of positions in which the resulting string differs from the goal.

Now let's draw a graph where the vertices are the 26 letters. The edges, somewhat counter-intuitively, will not be the group changes we make – instead, let's draw an edge from each letter to the letter that all its occurrences will change to after all group changes. For example, if we change all As to Bs, then all Cs to As, and then all Bs to Cs, then our graph will have edges A->C, B->C, C->A. Each letter will have 0 or 1 outgoing edge.

Given such graph, it's easy to count the total number of positions in which the resulting string differs from the goal: since we know how each letter ends up, we know the resulting string, and can simply count the differences. The other part of the value we minimize is not so clear: what is the minimum number of group changes required to construct the given graph?

The lower bound on the number of group changes is the total number of edges in this graph. Moreover, in most cases this lower bound is actually the answer. To see that, let's consider the structure of our graph. Since each vertex has 0 our 1 outgoing edge, the connected component of this graph are either directed trees, or a cycle with a directed tree growing into some of its nodes.

In order to find the group operations that construct the given directed tree, we can start from the root (the "sink"), apply the operations for all letters that need to be changed into the root letter, then continue with the letters that need to be changed to those letters, and so on.

When we have a cycle, we can't do the same. In fact, when a connected component of the graph forms a cycle of length K, we can't really implement the changes using just K group operations – we need K+1 in that case. To see that, notice that the first operation we make can not transform one of the letters in the cycle into another, as that would "merge" two letters together that need to be separate in the end. Instead, we must use an auxiliary letter that's not in the cycle. For example, if we need to build A->B->C->D->A, then we need to do something like: A to Z, then D to A, then C to D, then B to C, then Z to B.

However, when a connected component is not just a cycle – in other words has one or more directed trees going into the cycle's vertices – then we don't need an extra group operation for this cycle. For example, suppose the above A->B->C->D->A cycle also had a E->C incoming edge. Then we can do: B to E, then A to B, then D to A, then C to D, then E to C, handling both the cycle of length 4 and an extra incoming edge in 5 total group operations, and achieving the lower bound. If there's more than one edge going into the cycle, we can handle the rest as separate directed trees after handling the cycle.

There's one more constraint that we need to take care of: the standalone cycle resolution mentioned above needs an auxiliary letter (denoted as Z in the example above). In case there's a letter with an outgoing edge but without incoming edges, it can perform the role of that auxiliary letter: first, handle the component containing it, and then use it as auxiliary letter for all standalone cycles. If, however, there's no such letter – in other words, if the entire graph consists of standalone cycles and isolated vertices, then it's not clear what to do. After some more thinking we can notice in case such graph has at least one standalone cycle (in other words, at least one edge), then it's impossible to construct it at all, since it implements a non-identity permutation on the letters, and our very first group operation will merge two letters together.

Let's summarize what we have reduced our problem to. We need to build a graph on the 26 letters as vertices, with 0 or 1 outgoing edge for every vertex. If we have an edge from letter A to letter B, then its cost is the number of mismatches that yields (number of positions where the first string has A and the second string has not B) plus C (for the group operation corresponding to this edge). If we decide not to have an edge from letter A, this also has a cost (number of positions where the first string has A and the second string has not A). In addition, we get an extra cost of C for any standalone cycle in our graph, and the graph must not be all standalone cycles and isolated vertices, but it can be all isolated vertices. We need to minimize the total cost.

Our team got this far at the actual contest, but we couldn't come up with a way to solve this reduced problem. It turns out that solving it relies one one key insight: first, let's build this graph greedily: for each letter, pick the outgoing edge (or lack thereof) that minimizes the contribution to the total cost (taking account the cost C of having the edge, but not the additional bonus for a standalone cycle or the not-all-standalone-cycles restriction). This will give some graph G₀. If this graph does not have standalone cycles, then we're done. And in case it does, it turns out that those are the only cycles that we need to take care of!

More precisely, we can relax the problem as follows without changing the answer: instead of getting an extra cost of C for any standalone cycle, we will only get an extra cost of C for any standalone cycle that is present in G₀. Similarly, instead of forbidding the graph to consist of only standalone cycles and isolated vertices, we will only forbid the graph to be identical to G₀, and only if the latter consists of only standalone cycles and isolated vertices.

To see why the answer doesn't change, consider some optimal solution for the relaxed problem. Suppose it has some unexpected standalone cycle that is not present in G₀. At least one vertex on this cycle has a different outgoing edge in G₀. Let's change that edge to the one from G₀. The total cost will not increase since G₀ is locally optimal, and at the same time the standalone cycle will be destroyed, and no new standalone cycles will form (since it's impossible to do that with just one edge change). By continuing this process, we can show that there exists an optimal solution for the relaxed problem that is also a valid solution for the original problem.

Finally, we can solve the relaxed problem using dynamic programming: we process all letters in order, and pick the outgoing edge (or lack thereof) for each of them, while remembering which cycles from G₀ we have already broken, and whether our graph is still identical to G₀. Since G₀has at most N/2 cycles, the running time is O(2^N/2*N²) which is very small for N=26.

Thanks for reading, and check back soon for the last week's summary. Also make sure to check my twitter for the ACM ICPC 2016 World Finals updates!

Algorithms Weekly by Petr Mitrichev

Tuesday, May 17, 2016

A Polish week

No comments:

Post a Comment