Monday, April 4, 2011

More on chemical structure drawing with C#, TDD, and a dash of algorithms

Progress 

So recently I've been working on implementing a chemical structure drawing tool in C# using WinForms. Along the way I've been adhering to TDD (test-driven development), and it's been a pretty positive experience. I've managed to avoid using the debugger and have a very clear separation between UI and logic code.

The "industry standard" for representing chemical structures is an undirected graph, with atoms as nodes and bonds as edges. Although atoms and bonds can carry an extensive set of details, I've been sticking to a small set of simple properties for now. So far the back end of the program is fairly well developed, and on the GUI front end I have basic functionality such as creation and deletion of atoms/bonds.
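To make the graph idea concrete, here is a minimal sketch of what such a model could look like. It is not the project's actual code: the class names (Atom, Bond, Molecule), the fields, and the choice to store bonds in a flat list are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; the real project's classes may differ.
public class Atom
{
    public string Element;   // e.g. "C", "O", "N"
    public double X, Y;      // position on the drawing canvas

    public Atom(string element, double x, double y)
    {
        Element = element; X = x; Y = y;
    }
}

public class Bond
{
    public Atom A, B;
    public int Order;        // 1 = single, 2 = double, 3 = triple

    public Bond(Atom a, Atom b, int order) { A = a; B = b; Order = order; }

    // The edge is undirected: given one endpoint, return the other.
    public Atom Other(Atom atom) { return ReferenceEquals(atom, A) ? B : A; }
}

public class Molecule
{
    public readonly List<Atom> Atoms = new List<Atom>();
    public readonly List<Bond> Bonds = new List<Bond>();

    public Atom AddAtom(string element, double x, double y)
    {
        var atom = new Atom(element, x, y);
        Atoms.Add(atom);
        return atom;
    }

    public Bond AddBond(Atom a, Atom b, int order = 1)
    {
        var bond = new Bond(a, b, order);
        Bonds.Add(bond);
        return bond;
    }

    // Deleting an atom also deletes every bond attached to it,
    // so the graph never holds a bond with a missing endpoint.
    public void RemoveAtom(Atom atom)
    {
        Bonds.RemoveAll(b => ReferenceEquals(b.A, atom) || ReferenceEquals(b.B, atom));
        Atoms.Remove(atom);
    }
}
```

Because nothing here touches WinForms, a model like this can be unit tested on its own, which fits the UI/logic separation mentioned above.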

Quirks 

There's always some random quirk that you end up having to implement; in my case it was allowing the user to click on a bond. Now, a bond is just a line segment between two (x,y) points, so to test for a mouse click on a bond we need to measure the distance from the mouse position to that segment and check it against a small tolerance.
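For reference, the usual way to get that distance is to project the point onto the line through the two atoms, clamp the projection so it stays on the segment, and measure to the clamped point. The sketch below shows that standard technique; it is not the code from the project, and the names (HitTest, DistanceToSegment, IsBondClicked) and the 4-pixel default tolerance are assumptions.

```csharp
using System;

public static class HitTest
{
    // Shortest distance from point (px, py) to the segment (ax, ay)-(bx, by).
    public static double DistanceToSegment(double px, double py,
                                           double ax, double ay,
                                           double bx, double by)
    {
        double dx = bx - ax, dy = by - ay;
        double lengthSquared = dx * dx + dy * dy;

        // Degenerate bond: both atoms sit on the same spot.
        if (lengthSquared == 0.0)
            return Math.Sqrt((px - ax) * (px - ax) + (py - ay) * (py - ay));

        // Project the point onto the infinite line, then clamp the
        // parameter t to [0, 1] so the closest point stays on the segment.
        double t = ((px - ax) * dx + (py - ay) * dy) / lengthSquared;
        t = Math.Max(0.0, Math.Min(1.0, t));

        double cx = ax + t * dx, cy = ay + t * dy;   // closest point on the bond
        return Math.Sqrt((px - cx) * (px - cx) + (py - cy) * (py - cy));
    }

    // tolerance is a hypothetical click radius in pixels.
    public static bool IsBondClicked(double mouseX, double mouseY,
                                     double ax, double ay,
                                     double bx, double by,
                                     double tolerance = 4.0)
    {
        return DistanceToSegment(mouseX, mouseY, ax, ay, bx, by) <= tolerance;
    }
}
```

The clamp is the part that matters: without it, a click far beyond either atom but in line with the bond would still count as a hit.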

Oddly enough, I already had a working C# solution to that problem from a while ago (it was a pain to find one, too; it's adapted from this Stack Overflow question!)

But the other 'quirk' is a lot more interesting than geometry: generating a string of text that represents a chemical compound. One of the important formats for this is called SMILES, and I would eventually like the program to be able to generate and/or parse it.

The problem? The format is kind of complicated and hairy, and has to account for a lot of different chemical properties. So for the moment writing a parser is a task in itself.

Generation using depth-first searching

Luckily, generating SMILES is easier. The simplest way is to do a depth-first search through the graph, while watching for atoms that get reached a second time (these close rings). This algorithm generates "correct" but not canonical or normalized SMILES strings, although the programs I've used will all parse them. Going further requires identifying rings properly, taking other bond/atom properties into account, and all kinds of hairy stuff.
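Here is a rough sketch of that depth-first approach, under some loud assumptions: the molecule is connected, every bond is single, there are no duplicate bonds, all atoms are ones SMILES lets you write without brackets (B, C, N, O, P, S, F, Cl, Br, I), and there are fewer than 100 ring closures. The class name SimpleSmilesWriter and the array-based input are made up for the example. It runs two passes: the first finds the bonds that close rings, the second writes the string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; atoms are identified by their index in 'elements'.
public class SimpleSmilesWriter
{
    private readonly string[] elements;
    private readonly List<int>[] neighbors;
    private readonly List<int>[] children;    // DFS spanning tree
    private readonly List<int>[] ringLabels;  // ring-closure numbers per atom
    private readonly int[] state;             // 0 = unseen, 1 = on DFS path, 2 = done
    private int nextRingLabel = 1;

    private SimpleSmilesWriter(string[] elements, int[][] bonds)
    {
        this.elements = elements;
        int n = elements.Length;
        neighbors = new List<int>[n];
        children = new List<int>[n];
        ringLabels = new List<int>[n];
        state = new int[n];
        for (int i = 0; i < n; i++)
        {
            neighbors[i] = new List<int>();
            children[i] = new List<int>();
            ringLabels[i] = new List<int>();
        }
        foreach (int[] bond in bonds)
        {
            neighbors[bond[0]].Add(bond[1]);
            neighbors[bond[1]].Add(bond[0]);
        }
    }

    // Each bond is a two-element array of atom indices.
    public static string Generate(string[] elements, int[][] bonds)
    {
        if (elements.Length == 0) return "";
        var writer = new SimpleSmilesWriter(elements, bonds);
        writer.FindRings(0, -1);
        var sb = new StringBuilder();
        writer.Emit(0, sb);
        return sb.ToString();
    }

    // Pass 1: a bond leading to an atom still on the current DFS path
    // closes a ring; both of its atoms get the same ring number.
    private void FindRings(int atom, int parent)
    {
        state[atom] = 1;
        foreach (int next in neighbors[atom])
        {
            if (next == parent) continue;
            if (state[next] == 0)
            {
                children[atom].Add(next);
                FindRings(next, atom);
            }
            else if (state[next] == 1)
            {
                ringLabels[atom].Add(nextRingLabel);
                ringLabels[next].Add(nextRingLabel);
                nextRingLabel++;
            }
        }
        state[atom] = 2;
    }

    // Pass 2: write the atom, its ring numbers, then its subtrees.
    // Every subtree except the last goes in parentheses as a branch.
    private void Emit(int atom, StringBuilder sb)
    {
        sb.Append(elements[atom]);
        foreach (int label in ringLabels[atom])
            sb.Append(label < 10 ? label.ToString() : "%" + label);
        for (int i = 0; i < children[atom].Count; i++)
        {
            bool isLast = (i == children[atom].Count - 1);
            if (!isLast) sb.Append('(');
            Emit(children[atom][i], sb);
            if (!isLast) sb.Append(')');
        }
    }
}
```

For a six-carbon ring with bonds 0-1, 1-2, 2-3, 3-4, 4-5 and 5-0, Generate returns "C1CCCCC1". The output depends on which atom the search starts from and the order the bonds were added, which is exactly why it isn't canonical.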

In fact, in the big chemistry libraries (in Java for example), the code to do this is very long and hard to understand. At least, I can't understand it. So for the moment I'm stuck trying to reinvent the wheel, but starting a whole lot simpler.

And that's all for today. Check back next time for more progress and other fun topics.
