Justin Bozonier’s Blog

 

Using Graphs to Build Your Own Ruby Pattern Matcher

There's More Than One Way To Skin a Rat!

A while back I posted an article on using dynamic programming to implement a string matching algorithm I cobbled together. That article is here: http://justinbozonier.posterous.com/string-pattern-matching-welcome-to-dynamic-pr

Tonight, I'm writing about solving that same problem but by using a graph. It was a solution suggested by a few people and then finally a couple of days ago my boss (Kelly Leahy) brought up the problem once more as I was asking for suggestions of CS concepts he thinks I should tackle and we got back on the subject of that pattern matcher. If it was mentioned so much in the past why did I jump on it now? Quite simply, I've been learning Ruby lately at the insistence of James Thigpen. He's been ranting and raving (on top of many others) so I thought this would be a great first project.

Tools and Techniques Used

I tried my best to stick to pure TDD and unit testing with a context/spec naming style and test structure using RSpec. For my IDE I tried using RadRails, RubyMine, and NetBeans. Ultimately I chose NetBeans. For some reason it turned out to be the easiest for me. **shrug**

This is my very first Ruby project ever and I typically do C# development so if you see some ugly .NET-isms... please flame away so I might learn ;D

Stating the Problem

The problem to solve is given a pattern such as "a.c*z" we need a way to determine if a string like "abcdefghijklmnopqrstuvwxyz" matches it (it does) and that a string like "afdgkjbdkfdstg" does not. Our pattern matching grammar will have only three basic elements. A literal element that matches a character in a string exactly (ex. a, b, c, ..., x, y, z, 0-9, etc), a match any element which matches any single character (a period ie. "."), and a match all element (an asterisk "*") which matches zero or more occurences of any character or combination thereof.

It's like a REALLY simple regex evaluator.

And I know some people are going to ask, why not just use a regular expression? They're super sexy and simple to use in Ruby. Why reinvent the wheel?

I have this crazy idea that even though frameworks are pretty awesome at abstracting the difficult detais away from our view, living in ignorance of the basic data structures and algorithms that these frameworks are constructed of, that just leaves us using these things not because they save us time, but because we are dependent on them to get the job done. I'm all for frameworks, but I want to use them because I understand why I should use them, not because I can't understand how to live without them.

What Makes Solving This Difficult?

One of the key things that makes this a somewhat difficult problem to solve if you've never encountered it before is probably best shown as an example.

Imagine you have the following test string:
"*a*a*a"

now imagine how you would test that the following string matches it:
"aaaaababaaaaaaaaa"

See the problem is the * matches everything including the a so how do you know whether to treat the a as a part of the asterisk or as a literal? The answer that Kelly helped me to formulate is that you treat it as both. Basically you allow your solution to branch and when you come to the end of the string you're testing, you look to see if any of the branched paths finally completed the pattern.

I designed the program so that every character in the test pattern represents a given transition from one state to another. There are also three different types of states. A starting state where every match algorithm begins life, a terminal state which is where all matches will end life, and a standard state that represents an interim step from character to character.

So given our string pattern above, the program I wrote would generate the following graph:
  • Start State
  • Transitions to itself on any character (the asterisk)
  • Also, transitions to the next standard state via a literal transition edge that matches on the letter "a"
  • Literal Transition State
    • Transitions to itself on any character (the asterisk)
    • Also, transitions to the next standard state via a literal transition edge that matches on the letter "a"
  • Literal Transition State
    • Transitions to itself on any character (the asterisk)
    • Also, transitions to the next standard state via a literal transition edge that matches on the letter "a"
  • Literal Transition State
    • transitions to the terminal state via an always transition edge.
  • Terminal State
  • Getting Down to Brass Tacks

    Let's look at the transitions and states in code to try to make some sense out of what has been a fairly abstract read up until now. First the graph states:

    As you can see the states really do nothing aside from managing their transitions. I do a poor job of encapsulating these from the rest of my string matching world (ie. the objects that use this array could technically add to it directly) and that should be marked for refactoring later (whether or not it is is my lil secret! ;)

    And then the transitions:

    These are even simpler! The only bit of logic to a transition is it's test to determine whether or not the provided character can be transitioned on. 

    Hopefully from just seeing this code you can get a pretty good feeling for how these things might fit together.

    What gets more complicated is the building of the graph. I like to imagine that specs can help to explain the code they test somewhat so let me share some of those first:

    My tests are awkwardly named so I could use some naming help if you care to leave a comment! :)

    This is the class they test:

    Here's a quick high level explanation:
    • First, create the root of the whole graph. This is ALWAYS a start state.
    • Store that as the state we're currently adding transitions to. (line 9)
    • Next, for each character in the test string we are matching against the string pattern create the appropriate transition.
    • Add the newly created transition to the current state
    • Set the current state to point to the destination of the new state provided by our new transition.
    • Rinse, lather, repeat...
    • Once that's done ALWAYS create a terminal state as the last state.
    • Attach it to our current state by adding a transition always transition to the current state that points at the final state.
    Now about this transition always transition. It feels dirty and like needing to know the start and end state should be unecessary. Even in the current state needing to know the start state is probably unecessary, but the final state... I'm not sure how to handle that generically like the other states. Thoughts?

    Also, notice the code block on line 11! That was my first sampling of Ruby awesomeness. I should totally have been able to avoid writing my own string reader by using ruby's string.each_char |c| {} code block but I learned about that after wrapping up and I didn't care to do more refactoring at the time. :)

    Notice how I lied about the difference between a "." and a "*"? They're both the same transitions but they have different destination states. Here it is again:

    The "*" transition acts as more of a loop back into the same state it left. The "." moves us forward to the next state.

    Meat and Potatoes Time!!

    Now all of that was built with the ultimate focus on getting the construction of the actual matcher to be as easy as possible. Here are the tests for how the matcher is ultimately used:

    God it's painful showing off code that I'm sure could've been written more elegantly! I always convince myself it's for the best to not go back over it an edit it so that way it gives a more honest view of my coding process and doesn't turn into Justin just trying to look pretty.

    So without further ado this is the entire implementation of the code that actually utilizes the graph to match a string against a given pattern:

    Here's a general overview of the steps involved so that you can hopefully glean as much from my code as possible:
    • First "compile" the test pattern provided by the user into a graph representation.
    • Next make that graph the first state to be evaluated (a graph is really just represented by a start node that acts as the root.
    • Next we read the next character from the test string provided by the user.
    • We every transition of every state that has been queued for evaluation.
    • For each transition that tests true on its can_do method grab that transition's destination state and place that into our next states list/array.
    • Once we've done that for every state and transition until we've ran out of test string to process there should be at least one state in our next states queue that is ready to transition to the terminal state (assuming there is a match).
    • If there is then return true! We have a winner!
    • Else it's false and these strings don't match according to the given pattern.
    Was It Good For You?

    I try to share these kinds of learning experiences because so many times in the past I've been so thankful when someone else did. Maybe you're like me and don't have a CS background/degree and are trying to find ways to make data structures and algorithms more practical/applicable. It can be hard to take theory and find a way to apply it. Well anyway, hopefully it helps somebody.

    Also, Ruby is a pretty nice language. I might even enjoy it more than Python (I like having an end keyword instead of ambiguous white space ;).

    If the mood strikes you, you can find the whole project on my github account here: http://github.com/jcbozonier/DfaStringMatcher

    Also, I know I call this a DFA in the github title but it isn't. A deterministic finite automata would have no cycles, whereas my graph does.

    So that's that. If this was the least bit helpful let me know by leaving a comment and if you feel like I got something wrong or left something else out entirely leave a comment to that effect as well. Thanks for reading!

    Loading mentions Retweet
    Filed under  //   Algorithms   Computer Science   Ruby  

    Comments [2]

    Mocking Frameworks, How'd They Do That?!

    Writing Your Own Stubber, A How To Guide

    The largest mystery when trying to understand how a mocking framework works has always been just understanding where this object it creates comes from. I mean I give the mocking framework an interface and it comes back with an object that implements everything (at least in the case of a stub). The point of this post is to show that they aren't magical and to hopefully extend what you think is possible in your own problem solving with C#. I'll be assuming you're familiar with mock objects and how/why they're used.

    I developed a very small stub generating utility using TDD in order to explore the concepts that go into the big names such as Rhino.Mocks. This is the first test I wrote:
    http://github.com/jcbozonier/CrutchMocks/blob/master/MiniMock/TestProject/When_instantiating_an_empty_interface.cs

    Hopefully that gives you an idea of how it is used. All of the tests you will see are centered around this single method call 

    MockedObject = Mockery.Mock<IFoo>();

    No matter what interface I'm stubbing out the call to my library will be the exact same. Here's a more complicated set of examples:
    http://github.com/jcbozonier/CrutchMocks/blob/master/MiniMock/TestProject/When_executing_a_mocked_method_that_has_a_return_value_of_some_type.cs

    Here are the rest of the use cases I tested: http://github.com/jcbozonier/CrutchMocks/tree/master/MiniMock/TestProject/

    You already get the mocking frameworks I'm sure so let's cut to the meat and potatoes here:
    http://github.com/jcbozonier/CrutchMocks/blob/master/MiniMock/MiniMock/Mocking/Mockery.cs

    The "magic" of the stubbing is handled through reflection and by emitting IL code. Here's an English translation of that code:
    • First we need to build our new type that will be implementing the interface we were given by the person using our code. We define this type as being public and as being a class. The null at line 25 tells the TypeBuilder that this class has no parent and line 26 tells the TypeBuilder that we're implementing the interface provided to us by the user.
    • Next we need to get a list of all of the methods that we need to implement on our new type. To do this I needed to reflect into the interface type provided and grab all of the methods it declares. I do this on lines 28 and 29.
    • If there aren't any methods we're done! Create that type and create a new instance of it! (lines 49 and 50).
    • If there are methods things get trickier.
      • First, if the method takes in parameters we need to generate a list of the types of those parameters so that we can write the corresponding IL. (Line 35)
      • Each method needs to be built up according to whether or not it has a return value. 
    • Then just create our instance.
    Now the hardest part to figure out was how to create a method body for this new class. Let's look at the simplest example first:

    IL is Sexy Black Magic

    So what's happening here? 
    • First, I'm defining my method. I tell the TypeBuilder to define a method with the following properties:
      • same name as the one on my interface
      • that the method should be public and virtual, 
      • that the return type is void
      • that it will take in the provided set of types as parameters.
    • Then we're going to generate IL that does absolutely nothing with any of this. :)
    • We just generate a single IL instruction that just returns.
    One thing that might seem magical is knowing what IL OpCodes to emit to get the desired effect. For that magic I just wrote an method in C#, compiled it, and opened it up as IL inside Reflector. The code it gave was a little verbose though so I tweaked it a bit to minimize things. 

    Now let's take a look at the slightly more complicated example where I actually need to return a value:

    The only differences here are at lines 16 and 17. At line 16 I emit an instruction to declare a local variable for the value that will be returned. At line 17 I push this value onto the stack and then immediately afterwards I return it. There is a property on the IL Generator to have the system automatically initialize your local variables to their default values. This is great for my stubber because that's all I want it to do! w00t!

    Other than that knowing what the hell to call and how to get these different type and method builders just took a lot of googling.

    <tdd_rant>

    One of the benefits of using TDD in this process is that I could easily try out a chunk of code just to get things working (a spike) and then i could slowly strip portions of it away that I suspected added no value (especially handing for minimizing the IL instructions).  There are currently 12 tests or observations going on here. Because I know that all of my cases I tried to build have all been driven by those observations I can try making changes and I quickly tell how many of my expectations will fail.

    For example, on line 67 I defined a single line method to detect if a method has parameters:

    If I change that to this:

    I get a failing unit test. The bonus is that I know exactly the use case that will fail (because it did ;).

    When_executing_a_mocked_method_that_has_a_return_value_of_null_and_parameters (1 test), 1 test failed
       It_should_return_the_default_value, Failed:

    This is a full list of test cases for my code:
    <TestProject> (12 tests), Success
      TestProject (12 tests), Success
        When_executing_a_method_on_an_interface_with_several_methods_with_no_return_values (1 test), Success
          It_should_execute, Success

        When_executing_a_method_that_returns_an_int_on_a_mocked_object_with_mutliple_methods (1 test), Success
          It_should_return_the_default_value, Success

        When_executing_a_method_that_returns_an_obj_on_a_mocked_object_with_mutliple_methods (1 test), Success
          It_should_return_the_default_value, Success

        When_executing_a_mocked_method_that_has_a_return_value_of_null (1 test), Success
          It_should_return_the_default_value, Success

        When_executing_a_mocked_method_that_has_a_return_value_of_null_and_multiple_parameters (1 test), Success
          It_should_return_the_default_value, Success

        When_executing_a_mocked_method_that_has_a_return_value_of_null_and_parameters (1 test), Success
          It_should_return_the_default_value, Success

        When_executing_a_mocked_method_that_has_a_return_value_of_type_bool (1 test), Success
          It_should_return_the_default_value, Success

        When_executing_a_mocked_method_that_has_a_return_value_of_type_int (1 test), Success
          It_should_return_the_default_value, Success

        When_executing_a_mocked_method_that_has_no_return_or _parameters (1 test), Success
          It_should_be_callable, Success

        When_executing_a_mocked_method_that_has_NO_return_value_and_multiple_parameters (1 test), Success
          It_should_execute_with_no_error, Success

        When_instantiating_an_empty_interface (1 test), Success
          It_should_return_an_object_of_the_same_type, Success

        When_instantiating_an_interface_with_a_single_parameterless_method (1 test), Success
          It_should_return_an_object_of_the_same_type, Success

    Loading mentions Retweet

    Comments [0]

    Markov Chains in Python

    What is a Markov Chain?

    (This was originally posted here on my original blog but since I'm not sure how much longer that will be around I'm reposting it. I'm also updating it slightly.)


    This is my first markov chain! I was very excited to see it producing odd english. Here’s a quote from it: 

    I felt only for i can be swept through to tone. all the language between professional gentlemen, the disparition

    Amazingly, the basic premise is simple enough that I was able to figure out how to build one without a tutorial. I’m sure there are a bunch of things I could do to get the program to create more pleasing quotes but thus far I am happy.

    So what is a markov chain? Essentially in this context it's a collection of words and a set of probabilities that one word will transfer to the next. That's it. You get a word, you look at what percent of the time that word transitions to the others and choose one of the other words it's connected to using the associated probabilities to choose between them.


    You’ll notice that I did do a lil optimization already (I'll leave that up to you to find). I base the next word on the word before it. I also allow the punctuation to remain where it is. I decided to do this because I have seen standard Markov Chains and while funny, they’re pretty bad readability wise. I was hoping that this would produce slightly less silly results.

    If you would like to load your own sample files you can grab book text from Project Gutenberg (located here: http://www.gutenberg.org/)

    The Code 
     
    I’ve pasted the code at the link below which should be able to be copy and pasted directly into your Python editor:

    http://gist.github.com/131289

    A lil Refactoring...


    So after I wrote my own first go at a Markov chain I decided to look up how someone else did it in Python. Some things the other person did better included some pythonisms and others were algorithmic improvements.

    My algorithm was actually too complex. I was finding one word and then picking the next word based on a weighting (that was based on how often the successive word appeared after the current word). All I really needed was to group the words into pairs and then just randomly choose one of the words that appear after the given pair. The quality of the results went up drastically.

    Here is the new algorithm (heavily borrowed concepts from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/194364

    http://gist.github.com/131290


    Hungry for More Knowledge!
     
    If you'd like to check out another Python project I've done that uses markov chains to generate music check this link.
     
    Have fun! 

    Loading mentions Retweet

    Comments [4]

    Using Dynamic Programming to Diff Strings

    My boss left a note on my last article noting how similar my solution to pattern matching was to file diff algorithms he had been reading about. So naturally I had to take it as a great chance to explore how to do a basic diff between two strings. The solution I've created would be pretty easy to upgrade to support whole lines of text rather than individual letters but I'll leave that as an exercise to you. I'll probably do it if/when I get a spare moment I just doubt I'll blog about it. :)

    What's the Real Problem?

    We need to figure out what the minimum amount of editing is that's needing to convert string #1 (the "old string") into string #2 (the "new string"). If we didn't need to minimize the editing we could just say delete all of the old string and insert each character of the new string. That's pretty useless though. That's would be the equivalent of your favorite SCM software telling you that the difference between file A and file B is the whole file. Like I said, useless.

    So what do I mean when I say editing distance? Imagine you only have three ways to modify a string: insert, delete, and flat out changing letters in your new string to match letters in your old string. So for example to change "ba" into "fg" you would need to swap the b for an f and the a for a g. That would mean that your minimum editing distance would be two. Another example would be to change "adbed" to "abcde". This would have a minimum editing distance of 3 since you would delete the first "d", change the "e" to a "c", and then add an "e" to the end of the string.

    Getting Technical

    Remember the technique I used last time for string pattern matching? No? Read it. I'm using the same basic idea here with using a matrix, placing scores for the different combinations of two characters and figuring out how to traverse this matrix in an optimal fashion.

    The trick in this instance is use the old string as your column headings and the new string as your row headings. The values in the cells that make up the table should be comprised of how many editing steps it would take for you to change the given old character (in its current position!) into the new character. Since, for us, deleting, inserting, or swapping a character all have the same cost (the value 1)  and since the edit distance between two identical characters is the value 0 initially it would seem as though we'll just be traversing a matrix of 0's and 1's. However in this case since we are looking for the minimum edit distance we need to take into account the minimum edit distance from the previous step.

    Remember that we will be traversing this list from one corner of the matrix to the other. This is important because it limits the number of cells we need to take into account to figure out the minimum edit distance up to our cell.

    In an attempt to quickly clarify what I mean let's first look at my code for calculating the edit distance of each pair of characters (which represents a single cell in our matrix at position oldCharacterIndex, newCharacterIndex):

    private static int _CalculateMinimumCellEditDistance(List<List<int>> matrix, string oldString, int oldCharacterIndex, string newString, int newCharacterIndex)
    {
        var oldChar = oldString[oldCharacterIndex];
        var newChar = newString[newCharacterIndex];

        var localEditDistance = oldChar == newChar ? 0 : 1;

        var leftArrayValue = localEditDistance + _GetCellValue(matrix, oldCharacterIndex - 1, newCharacterIndex);
        var topLeftArrayValue = localEditDistance + _GetCellValue(matrix, oldCharacterIndex - 1, newCharacterIndex - 1);
        var topArrayValue = localEditDistance + _GetCellValue(matrix, oldCharacterIndex, newCharacterIndex - 1);

        var minEditDistance = Math.Min(Math.Min(leftArrayValue, topLeftArrayValue), topArrayValue);

        return minEditDistance;
    }

    That makes perfect sense if you don't think about the left and top edges. Here we have to introduce a lil finagling to calculate those cells appropriately. Basically the following code means that if you're in the top row looking up the edit distance in column number oldCharacterIndex, the only way to have gotten there is to have made (oldCharacterIndex + 1) count deletions. By the same token, if we're in the left-most column looking for the value in row number newCharacterIndex, the only way to have gotten there is to have made (newCharacterIndex + 1) count insertions.

    private static int _GetCellValue(List<List<int>> matrix, int oldCharacterIndex, int newCharacterIndex)
    {
        if (oldCharacterIndex < 0 && newCharacterIndex < 0)
            return 0;
        if (oldCharacterIndex < 0)
            return newCharacterIndex + 1;
        if (newCharacterIndex < 0)
            return oldCharacterIndex + 1;
        return matrix[oldCharacterIndex][newCharacterIndex];
    }

    Now I'd like to point out, I'm 100% certain that this is **A** "right" way but my code might have bugs so take it with a grain of salt. The main purpose of this post is to show another way of tackling a different dynamic programming problem and the thought process I went through.

    Once I've scored each pair of characters I then set out to traverse the given matrix by going to the smallest edits possible. I also created a class to handle tracking the deletes and edits that would be needed along the way. This class can be used to play back my changes to tell someone how they could change the old string into the newer one. Also, since dynamic programming calls for us to work through the problem backwards, this class reorders the steps to read correctly.

    Now, once again we create a matrix of how well each cell performs according to the above method. In our context a cell's value should essentially be thought of as the number of steps required to convert the character at the cell's column index into the character at cell's row index. In my calculation I'm also taking into account shifting the character into the appropriate position. That means that every cell has a been scored in such a way that they all solved their own version of the problem as though they were the only portion of it. What's beautiful about this is that by bringing these results together we can find the optimal string editing just by following the smallest values backwards from the lower right cell of our matrix. Basically you start at what has to be your end point and assume that you took the optimal path there from the previous three optional cells (the cell immediately to the left, to the top-left, and to the top). If you have a tie between top or left just choose whichever you like. In the case of a tie between the top-left and one of the others though I default to the top-left since that's roughly half the number of cells (and thus string edit steps) we'll need to process.

    Quick Example

    To help pull together all of this here's an edit distance matrix from my program for diffing "bcdeffghi" and "abcdefghij". This string is useful because it exercises all 3 diff operations (insert, swap, and delete):

        b  c  d  e   f  f   g  h  i
     a 1  2  3  4  5  6  7  8  9
     b 1  2  3  4  5  6  7  8  9
     c 2  1  2  3  4  5  6  7  8
     d 3  2  1  2  3  4  5  6  7
     e 4  3  2  1  2  3  4  5  6
     f  5  4  3  2  1  2  3  4  5
     g 6  5  4  3  2  2  2  3  4
     h 7  6  5  4  3  3  3  2  3
     i  8  7  6  5  4  4  4  3  2
     j  9  8  7  6  5  5  5  4  3

    The lower right hand corner (3 in this case) is our optimal edit distance. This means the string on the top of the table can be transformed into the string on the left of the table with exactly three separate steps.

    As always here's the link to my code (It's in the StringMatching folder along with the pattern matcher):

    Loading mentions Retweet

    Comments [1]

    String Pattern Matching: Welcome to Dynamic Programming!

    The Challenge

    So here's the puzzle you're provided: Given a string pattern tell if another given string matches the pattern. The character "." is a wild card that can only match one character and a "*" is a wild card that can match as many characters as possible. Nothing matches a missing character.

    For the pattern "a.c*f", for example, "abcdef" would match as would "azcrwgfdjkgfdkjhdfjkfhf" but "bbcdef" would not.

    How would you go about solving this problem? When I first set out to solve this problem I got horribly stuck. I tried to use just a for loop and then got totally lost in that the possible permutations branched all over the place. It seemed nigh intractable (for me of course since I knew this was solvable since regex does even more coolness than this).

    Side note: That is a great way to tell what computer science challenges deserve your attention most. If you literally can't even come close to tackling a problem understanding how to finally conquer it will make you a better developer by an order of magnitude each problem IMO.

    Think about this for a bit cuz up next is the solution.

    A Speedy Savior

    So first off, one solution would be to create a string for every possible permutation of the pattern (up to the test string's length) and check if one of those patterns matches your test string. There are two issues with this approach: A) It takes up a lot of space and B) it would take a LOT of time.

    To prove it would take a lot of time consider that our solution would essentially need to consider every letter in the place of the wild card. That's 26 additional runs PER single wild card... in the case of the asterisk that could be 26^N where N is the number of letters your wild card might take up. In the larger example above, with that technique, there'd be at least 766,467,265,200,361,890,474,622,976 permutations just to the wild card alone.

    Thankfully as it turns out there's a better way: Dynamic programming! Ok so what is dynamic programming and why does it apply here? Basically, dynamic programming can be helpful whenever you have a problem that breaks down into individual subproblems that can be combined together to form a complete answer. Once you've solved all of the individual subproblems you start from the finishing line and make your program assume that you found your optimal solution and try to work out what the previous step must've been given that the current step is optimal. 

    What some of you might be thinking is... what if my string doesn't match? Assuming correctness seems wrong in that case huh? You definitely don't want to return a false positive but that won't happen. What will happen is that if you're program isn't able to match the strings, then the algorithm will be left incomplete and will never reach the block of code that says to return true. Essentially we only say there is a match if the algorithm starts at our finish line and ends up at the starting line we specify. If that path is broken, then our algorithm will return false.

    Another benefit of this method is that the worst case runtime is O(mn) or for you non-computer science types, the number of steps the algorithm will take is approximately equal to the length of the pattern string multiplied by the length of the test string (the string we're matching against). In our large sample above that would be about 115 steps to figure out the solution (at the worst)!

    Forming a Recurring Subsequence Problem

    This is really the hardest part. You need to think about the problem from several different angles until you can see it as a sum of several smaller problems. Once you see this subsolution though the programming just pops out at you.

    In our example with the strings, I had been thinking about how I could break this up for the better of the day in the back of head during idle cycles. Finally when I sat down to draw out some ideas (I'm very visual) I realized that I could create a table with the pattern string forming the columns and the test string forming the rows. Once that was done I marked the cells that matched with a T (for True) and an F if they didn't. I saw that in cases where I expected a match, I could traverse the table from the upper left corner to the lower right by traversing cells diagonally (or vertically in the case of an asterisk or a period).

    I know this is hard to visualize so I'm going to *try* to give a visual here... bare with me.  :)

    1 is True and 0 is False BTW...

    Pattern: ".a*.j*"

    Test string: "cadeajmn"

    .a*.j*
    c101101
    a111101
    d101101
    e101101
    a101101
    j101111
    m101101
    n101101

    That is the test that made me have my "a-ha" moment. So now looking at this primitative visualization let's see how our pattern matching rules show up here. First notice that the asterisks and periods have their entire columns set to true. The reason for this is that they can stand in for any letter so given appropriate positioning they **could** be true in any of those instances. I got a little tripped up here. I thought that this matrix should represent the fact that my period can only represent one letter but that turned out to be unnecessarily complicated during this phase. The 2nd phase, as you'll see, makes that much easier.

    From here see if you can start at the finish line though and work your way back to the start. Now while it *is* possible in this table that you could move from cell to cell horizontally, our phase 2 rules won't allow for this. If we did allow this that would mean that two pattern characters could match to one test character and hopefully you can see why that makes no sense in our scenario. Also remember that you can only traverse cells vertically if you're in an asterisk column (since that means that that single pattern character matched multiple test strings and that's the only pattern character capable of doing that by definition).

    Note that we know if our traversal was successful because at some point, on some path, we will arrive at cell (0,0). If we miss the starting line and go off the table, then we don't have a match.

    These last two paragraphs we've basically listed out the rules for our recursive algorithm. More clearly, they are as follows:

    Given a cell (i, j) where i is the column, j is the row

    • if we reach cell (0,0) AND this cell's value is true we have a match
    • if we reach a cell where either i or j are negative we have no match (for this path, regardless of the value)
    • if none of the above but the current cell is true then test cell (i-1,j-1) (the cell to the upper left)
    •   if that doesn't have a complete path and we are in an asterisk column try the row above

    Here are those same rules but in code:

    private
    static bool _HasCompletePath(IList<List<bool>> matrix, int patternIndex, int stringIndex, string pattern)
    {
        var result = false;
     
        if(patternIndex < 0 || stringIndex < 0)
            return false;
     
        if (patternIndex == 0 && stringIndex == 0 && matrix[patternIndex][stringIndex])
        {
            result = true;
        }
        else if(matrix[patternIndex][stringIndex])
        {
            var tempResult = _HasCompletePath(matrix, patternIndex - 1, stringIndex - 1, pattern);
            if(!tempResult && _HasCompletePath(matrix, patternIndex, stringIndex-1, pattern) && pattern[patternIndex] == '*')
            {
                tempResult = true;
            }
            result |= tempResult;
        }
     
        return result;
    }

    Wrapping up

    I don't expect this to make perfect sense to someone who has never written a dynamic programming algorithm in their life. My main hope is that by me sharing the insights I gained while I'm still new to this that this will help others to learn this stuff a little easier. If nothing else, it should at least provide a new non-formalized non-mathematical perspective (which is REALLY difficult to find on the web).

    Full source code is available here: http://github.com/jcbozonier/Dynamic-Programming-Sample/tree/master 

    Loading mentions Retweet
    Filed under  //   Algorithms   C#   Computer Science   Dynamic Programming   String Pattern Matching  

    Comments [7]

    Creating Music Procedurally (AKA My Computer Makes Art)

      
    (download)

    My Latest Project

    This week I decided to work on a little script for the session I'll be holding at Portland Code Camp called "Artistic Expression Through Code". The first night I developed the meat of the script: A markov chain based algorithm that you could recite a song to which it would then use to create its own song. This first version didn't have support for timings so everything was quarter beats. It didn't sound as bad as random computer noise, but it didn't sound like a song. It was very... computer-like.

    Tonight I decided to implement a second markov chain to track the timings of the recited song and use those along with the notes to create real songs. The results this time weren't too bad at all. (I've attached a sample of the song to this post)

    Give it a listen. I think you'll be somewhat surprised that that's a completely computer generated song.

    How'd You Do That?!

    There's a fairly simple computer science technique known as a "Markov Chain". Don't let the whole reference to computer science fool you, it's really not tough to grasp. Basically I created a table of numbers that answers the following question: When note X was played what percentage of the time were the other notes played next? So just imagine a table with all of the notes from a to g# laid out on top (these are the notes that we last played) and vertical axis is all of the same notes but this axis represents the probability that that particular note was played next.

    Here's a sample of what my program generates:
         a a# b c  c# d d# e  f   f# g  g#
       [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
        [0, 0, 0, 6, 0, 1, 0, 0, 0, 0, 2, 0], 
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
        [0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0], 
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
        [0, 0, 0, 1, 0, 2, 0, 3, 1, 0, 0, 0], 
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0], 
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
        [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 2, 0], 
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

    You'll have to just imagine that same list of notes listed along the left hand edge of that table since it's too hard for me to put it there myself in this editor. :)

    Here's the timings table (1, 1/2, 1/4, 1/8, 1/16 notes):
      1  2  4   8 16  
     [0, 0, 0,  0, 0], 
     [0, 0, 0,  1, 0], 
     [0, 0, 3,  5, 0], 
     [0, 2, 4, 11, 0], 
     [0, 0, 0,  0, 0]


    But that's it, that's the magic that drives the whole process. As you can see, I'm actually not storing percentages but instead just the count of the number of times a note led to a different note. It works out to be the same thing in the end.

    Grok'd

    So to summarize, my program was able to generate the music because I fed it a sample musical score and it figured out what percentage of the time a c note led to another note. Here's a step by step run through:
    • Give program a note and a timing.
    • When I give it a second note/timing it notes in it's table that the first note led to the 2nd note one time. It also notes that the first timing led to this 2nd timing one time. (note I don't attempt to relate notes/timings, it's not important surprisingly).
    • Enter a third note.
    • The program then notes in its table that the 2nd note led to the 3rd note one time.
    • Continue ad inifinitum
    So that's how we set the system up. Next this is how we get the program to come up with its own song:
    • Choose some random note/timing to start.
    • Ask the computer to suggest what a good note/timing would be to follow those.
    • print out the note/timing (in case it's a work of genius ;)
    • play each note using the python library I'm using, pysynth.
    I know that's pretty general and a pretty quick overview but I'll be giving a more in depth explanation at Portland Code Camp.

    Here's a link to my git repo with all of my Python code: http://github.com/jcbozonier/MarkovMusic/tree/master

    That's all for now!

    Loading mentions Retweet

    Comments [0]

    New Alloy Features Planned

    First I want to touch on what Alloy is and what the guiding vision behind it is.
     
    Put simply, Alloy is a messaging application written for Microsoft Windows. It allows you to IM ur friends on google talk and also let's you send tweets via Twitter. Ok, so what's the big deal?
     
    It's the way Alloy does this that is special. Alloy not only connects you to these services it also connects them together in a way that is intuitive for you. When send out a tweet Alloy also takes that and posts it as your available message on google talk. Why? at it's core that's what a tweet is, a status update.
     
    Alloy is all about understanding the context in which different messaging services are used and making intelligent decisions for you about how it shares your information and cross posts your messages.
     
    Here's another example: have you noticed how when you want to paste a large amount of text (think code or patagraphs of text) you have to stop yourself and think, "that's not appropriate here. I need to paste that onto pastie first and then paste the link in my tweet." that's too much thinking in my opinion.
     
    In Alloy, you simply paste whatever globs of text you want to send to friends and it will handle pasting it somewhere else for you. Alloy is about freeing you the user from having to think about anything but the message you want to communicate.
     
    That's really what Alloy is.
     
    So, what will Alloy be?
     
    Currently the highest priority task on Alloy before I'd even consider it a beta is the addition of a contact list of some sort. My friends and I have some great ideas about how to do this and it's just a matter if choosing a way and getting it done.
     
    Once that's done I'd like to include support for twitter searches and also some light facebook integration (mainly revolving around status updates).
     
    My last high priority item is animation. I'd love to see our UI more fluidly indcate it's intent to you.
     
    So that's all. Hopefully zomeo e else out there ends up discovering Alloy and enjoys it as much as I do. If you try it out but decide to not use it please shoot me an email with all of your criticism. I'd love to know what it would take to convert you into an avid Alloy user.

    Loading mentions Retweet

    Comments [0]

    My big vim cheat sheet in my office

    This is my limited vim vocabulary at the moment. I'm currently using ViEmu to help me improve my vim skills for when I am using slicehost and ssh.

    Sent from my iPhone

    Loading mentions Retweet

    Comments [2]

    First Public Release of Alloy!

    Click here to download:
    Alloy.Setup-0.1.0.115.msi (1009 KB)


    Alpha Version of Alloy Released!

    I just installed the latest copy for myself and then realized.. hey! This is good enough that it hasn't been crashing for me and a lot of the annoyances are gone. Maybe it's time to post an alpha release? Yup!


    I will come back later on and update this post regarding what exactly Alloy is and such but for now I just want to ensure that I can manage my releases like this here on posterous.

    Loading mentions Retweet

    Comments [0]

    Planning Artistic Expression through Code for Portland CodeCamp

    I just realized interactive fiction (IF) is a HUGE place where programming merges with art.

    Loading mentions Retweet

    Comments [0]