When at Catharon, I wrote a regular expression engine for their environment. It worked, basically, but had a couple problems: first, I changed the standard regexp syntax too much in the name of "improvement"; second, the program was not well written, it had no overall design, and was very ad hoc, and so fragile. Getting it to work was a hassle, and took a long time, as each complicated new RE I tried it with seemed not to work, and to require adjustment. First, my "improved" RE syntax was really no better, and because it is so far from standard it was too hard to use. And second, for the last few years I have regretted that program, and figured if I ever had time and inspiration I would redo it properly. Now I have. The main difference now is that the RE is first parsed to make a RE tree, and then this tree is stepped along to build the execution stack, which is then used to generate the solution structure. This has a few advantages: first, it breaks the code up into several independent sections. It is even possible to replace the RE parser with a different one, so if I really want to "improve" it I can support both the traditional syntax and my new one. Second, the code is far, far simpler and more robust. Once I got it working, it basically works for everything, so I have little fear that each new RE will break it. Third, it supports a lot more - unicode characters, greedy and lazy evaluation, alternative parsers, to name a few. And, it is (I think) very well-written and elegant code. The tree and stack walking routines are small and surprisingly simple, and the greedy and lazy search algorithms require almost no coding to work. With that advertisement, you can now look and judge for yourself. Here are the 4 files: re1.c, re-parse.c, re-debug.c, re.h, or here is the zip'ed archive. |
|