Sometimes, Test-Driven Development (TDD) works great; other times I find myself slipping away from it. This was one of the great times.
I wanted a function to return the longest common substring between two strings. Simple, easy to test. I started with this:
[Test] public void LCS_1()
{Assert.AreEqual("", Matcher.LongestCommonSequence("", ""));}
Which was, of course, very easy to make pass:
public static string LongestCommonSequence(string s1, string s2)
{return "";}
The next step was to make it return the first string if the strings were the same (or, actually, even if they weren't - the test didn't require a comparison):
[Test] public void LCS_2()
{Assert.AreEqual("a", Matcher.LongestCommonSequence("a", "a"));}
Which was also very easy to make pass:
public static string LongestCommonSequence(string s1, string s2)
{return s1;}
Notice a certain amount of duplication in the tests? I did:
private void LCS(string want, string s1, string s2)
{Assert.AreEqual(want, Matcher.LongestCommonSequence(s1, s2));}
[Test] public void LCS_1() {LCS("", "", "");}
[Test] public void LCS_2() {LCS("a", "a", "a");}
Unfortunately, I didn't keep all the intermediate steps. But of course I kept the intermediate
tests! Tests are forever:
[Test] public void LCS_3() {LCS("", "t", "a");}
[Test] public void LCS_4() {LCS("a", "ta", "a");}
[Test] public void LCS_5() {LCS("ta", "ta", "ta");}
[Test] public void LCS_6() {LCS("tgggtaa", "ccccctgggtaacccccc", "gggggggggtgggtaagg");}
This resulted in a working, slow version of the function. I was iterating over all possible lengths (my like-real data set had 700+ character strings), over all possible positions. With a working function and a suite of tests to
keep it working, it was a simple matter to break out of the outer loop as soon as it failed (if a length of 35 didn't work, there was no point in looking for lengths of 36 or greater), to break out of the inner loop as soon as it succeeded (there may be multiple 35-character substrings in common, but we only need one), and to start each inner loop at the starting point of the previous result (if there were no 34-character matches before the 92nd character, there will be no 35-character matches there, either). Here's the final function, which passes all tests and is fast enough:
public static string LongestCommonSequence(string s1, string s2)
{
string result = "";
int start = 0;
for (int length = 1; length <= s1.Length; ++length)
{
for (int i = start; i < s1.Length - length + 1; ++i)
{
string snippet = Substring(s1, i, length);
if (!Regex.Match(snippet.ToLower(), "^[acgt]*$").Success) break;
if (s2.IndexOf(snippet) >= 0)
{
start = i;
result = snippet;
break;
}
}
if (result.Length < length) break;
}
return result;
}
As I look at this, I see that I should probably extract the regex bit into a nice function along the lines of IsJustBases(), and I could squeeze a little more efficiency out of this by ToLower()ing the input parameters instead of the snippet,
and I could probably force a bug by mixing cases across s1 and s2. All the changes will be easy and safe, thanks to the existing tests. I can't justify the ToLower() change based on performance, which is adequate, but I certainly can on correctness, if I can show the bug.
Thanks to Ehsan Akhgari, whose
Code Formatter has made this post more readable.