Monday, December 11, 2006
Regular Expressions in Multithreaded environments
Just don't do it: you will save yourself a great deal of pain.
The usual clue that you have one of these infestations is someone asking whether you can see anything wrong with a particular regex: they know it's the cause of their transient problem, but they just can't see anything wrong with the code.
The reason is pretty simple: regular expression matchers are almost always implemented as state machines, and many programmers struggle to code state machines nicely for multithreaded environments.
(The reason for using state machines here is not so simple, and one day I might fill out this post with the details, but essentially every regular expression is equivalent to a finite state machine.)
The mistake many coders of these things make is using a variable on the class, or on a shared instance, to hold the matching state; this is where the problem is created. Now when two threads try to do something with a regular expression concurrently, the result is anyone's guess. At best things just crash, but more often the matcher on one side fails and gives a strange result.
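To make this concrete, here is a minimal sketch of the mistake in Python. Everything in it is made up for illustration: `SharedStateMatcher` is a hypothetical toy matcher for the single pattern `a+b`, not code from any real regex library, and the interleaving of the two callers is replayed by hand rather than with real threads so that the failure is repeatable.

```python
class SharedStateMatcher:
    """Hypothetical toy matcher for the pattern a+b.

    It keeps the state machine's current state on the instance,
    which is the mistake described above.
    """

    def __init__(self):
        self.state = "start"  # shared by every caller of this instance

    def begin(self):
        self.state = "start"

    def feed(self, ch):
        # Transitions for a+b: start -a-> in_a -a-> in_a -b-> accept
        if self.state == "start" and ch == "a":
            self.state = "in_a"
        elif self.state == "in_a" and ch == "a":
            self.state = "in_a"
        elif self.state == "in_a" and ch == "b":
            self.state = "accept"
        else:
            self.state = "fail"

    def matched(self):
        return self.state == "accept"


# One caller on its own: "aab" matches a+b, as expected.
m = SharedStateMatcher()
m.begin()
for ch in "aab":
    m.feed(ch)
alone = m.matched()

# Two callers sharing the instance, interleaved the way two threads might be.
m.begin()
m.feed("a")
m.feed("a")                # caller A is partway through "aab"
m.begin()
m.feed("x")                # caller B cuts in with "xb" and overwrites the state
m.feed("b")                # caller A resumes with its last character
interleaved = m.matched()  # caller A is told "aab" does not match
```

Nothing is wrong with the pattern or with either caller's code, which is exactly why these bugs are so hard to spot by reading the regex.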
As the regex is usually buried deep in a library, such as an XML parser, your options include replacing the entire library or even changing programming language. As this is generally not possible, one is stuck with removing the regular expressions...
How can the programmers of regular expression parsers help their customers? Simple, really: don't hold state in the class, and don't reuse an instance (unless you know it's been finished with).
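As a rough sketch of that advice in Python: the hypothetical `StatelessMatcher` below (again a toy for the single pattern `a+b`, with names invented for this example) keeps only a read-only transition table on the class and holds the current state in a local variable, so one instance can be handed to any number of threads.

```python
import threading


class StatelessMatcher:
    """Hypothetical toy matcher for a+b that holds no per-match state."""

    # Read-only transition table: safe to share because nothing writes to it.
    TRANSITIONS = {
        ("start", "a"): "in_a",
        ("in_a", "a"): "in_a",
        ("in_a", "b"): "accept",
    }

    def fullmatch(self, text):
        state = "start"  # local variable: every call gets its own copy
        for ch in text:
            state = self.TRANSITIONS.get((state, ch), "fail")
            if state == "fail":
                return False
        return state == "accept"


def run_concurrently(matcher, inputs):
    """Match every input on its own thread, all sharing one matcher."""
    results = [None] * len(inputs)

    def worker(index, text):
        results[index] = matcher.fullmatch(text)

    threads = [
        threading.Thread(target=worker, args=(i, text))
        for i, text in enumerate(inputs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


shared = StatelessMatcher()
inputs = ["aab", "xb", "ab", "aaa"] * 25
results = run_concurrently(shared, inputs)
```

Because each call's state lives on its own stack, the answers do not depend on how the threads happen to be scheduled. The other option from the advice above, giving each thread its own instance, works too, but it relies on every caller remembering to do so.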
Labels: programming