CodeBetter.Com

Peter's Gekko

public Blog MyNotepad : Imho { }

Use a regular expression to lock up the webserver

I am a rookie when it comes to regular expressions, but I learned one thing that I want to share with you. With a regular expression you can do very sophisticated matching of string patterns. Writing a regular expression is a science in itself. The .NET Framework makes evaluating the expression quite simple.

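using System.Text.RegularExpressions;

string pattern = MyRegularExpression;

// Singleline makes '.' match newline characters as well
RegexOptions options = RegexOptions.Singleline;

Regex r = new Regex(pattern, options);
Match m = r.Match(MyStringToSearch);

if (m.Success)
{
    // work with m.Value, m.Groups, ...
}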

The Match method of the Regex class does the actual matching. What the documentation does not tell you is that this statement can take virtually forever to execute. We had a beautiful expression that worked very well on a test string. Scanning the contents of a 396 KB text file went differently. Starting the match drove the ASP.NET worker process (w3wp on Server 2003) to 100% processor utilization (51% on a hyperthreaded dual-processor machine), which locked up the server. We killed the process after half an hour. Nobody can wait that long.

Browsing the net, I found this post on MSDN, which explained what was going on. I finally found a need for supercomputing@home.
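
One way to keep a runaway match like the one described above from locking up a server is to give the regex a time budget. This is only a sketch, and it assumes .NET Framework 4.5 or later, where the Regex constructor accepts a match timeout (that overload did not exist when this post was written). The class name SafeRegex and the method TryIsMatch are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the framework.
public static class SafeRegex
{
    // Returns true or false when the match attempt completes,
    // or null when the engine gives up because the budget ran out.
    public static bool? TryIsMatch(string input, string pattern, TimeSpan budget)
    {
        // The three-argument constructor bounds every match operation
        // performed with this Regex instance.
        var regex = new Regex(pattern, RegexOptions.Singleline, budget);
        try
        {
            return regex.Match(input).Success;
        }
        catch (RegexMatchTimeoutException)
        {
            // The match took longer than the budget; the caller decides what to do.
            return null;
        }
    }
}
```

A call such as SafeRegex.TryIsMatch(fileContents, pattern, TimeSpan.FromSeconds(2)) would then come back with null instead of pinning the worker process, assuming the pattern is one of the heavily backtracking kind (nested quantifiers such as ^(a+)+$ are a well-known cause). The timeout only limits the damage; the real fix is still a less ambiguous expression.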



Comments

Sam said:

I've found regular expressions to be great for parsing, and I'm not talking about 396K files either, but 800MB to 2GB files.

It all depends on the expression and the complexity of the data. If you know an explicit start/end, it really helps. If you don't, then it can run much slower. Also, one big expression called fewer times is faster than lots of little expressions called many more times, so once you've broken up a file into manageable chunks, using other regexes to break them up into groups can be comparatively slow. Try to do as much as possible in a single go (for a given buffered chunk of a file) and you'll be good. If you can't, then you'll probably have to abandon regexes.
# April 28, 2005 11:51 AM

pvanooijen said:

The file being parsed is an XML file. The first attempt used XPath. In the process of developing the XPath expression, XPath went berserk. Regular expressions popped up as an alternative. They went berserk too.

As stated, I'm not deep into regular expressions at all, and I was surprised to find out they could take that long to match. Looking back, I think XPath went berserk for the same reason. Still investigating that. No finger-pointing yet :)
# April 28, 2005 1:45 PM

Ryan said:

Sed might help to trim down the file before searching through it in some other way. Look at http://www.student.northpark.edu/pemente/sed/sed1line.txt and search for "regexp." (Yes, sed does run on Windows.) Of course, sed works line by line, so if your file doesn't have line breaks in the normal places, it won't work.
# April 28, 2005 7:40 PM

Peter's Gekko said:

At the moment I'm migrating my website. The old version is a bunch of static htm pages scribbled with...
# April 12, 2006 6:33 AM
