CodeBetter.Com

Peter's Gekko

public Blog MyNotepad : Imho { }

Including classic htm pages in an asp.net site (Look mom, no frames)

At the moment I'm migrating my website. The old version is a bunch of static htm pages scribbled together with FrontPage; the new version is an asp.net 2.0 site. As the old site is well read I don't want to break any links: a link to http://www.gekko-software.nl/DotNet/Art01.htm should keep serving the same "Delphi vs. C#" article. The easiest way would be to just copy the files and use IIS as a dumb "htm file server". But I want to have control over the pages: display them in a nice .net (master) page and add functionality as desired. One option would be to use frames, one frame for the aspx and one for the htm. But for many reasons I don't find frames very nice to work with. So here I'll present a pure .net solution.

In the first step, the handling of an htm request to the site (like http://www.gekko-software.nl/DotNet/Art01.htm) has to be handed to asp.net. You do this in the configuration of the virtual directory in IIS: add the .htm extension to the application configuration list and set the executable to aspnet_isapi.dll.

Now every incoming htm request for the virtual directory will be handled by asp.net.

To intercept the request and redirect it to my viewer I install a so-called HttpModule. An HttpModule is a way to be the first or the last in the handling of any request coming to the site. An HttpModule is installed in the web.config:

<system.web>
   <httpModules>
      <add name="UrlRewriter" type="Gekko.WebSite.URLrewriter"/>
   </httpModules>
</system.web>

The module has a name and a type. The type is a class which implements the IHttpModule interface; it lives in the App_Code folder of the site.

using System;
using System.Web;

namespace Gekko.WebSite
{
    public class URLrewriter : IHttpModule
    {
        #region IHttpModule Members

        public void Dispose()
        {
            // Nothing to clean up
        }

        public void Init(HttpApplication context)
        {
            // Get a first look at every incoming request
            context.BeginRequest += new EventHandler(context_BeginRequest);
        }

        #endregion

        void context_BeginRequest(object sender, EventArgs e)
        {
            HttpApplication httpApp = (HttpApplication)sender;
            string pageName = httpApp.Request.AppRelativeCurrentExecutionFilePath;

            // Case-insensitive, so a request for Art01.HTM is caught as well
            if (pageName.EndsWith(".htm", StringComparison.OrdinalIgnoreCase))
            {
                // Substring(2) strips the leading "~/" from the app-relative path
                httpApp.Context.RewritePath(string.Format(@"~/ArticleViewer.aspx?article={0}", pageName.Substring(2)), false);
            }
        }
    }
}

IHttpModule is a nice lean interface. The Init method is passed the HttpApplication object of your web application. I add a handler to its BeginRequest event, which gives my code a first look at every incoming request and even the possibility to change it. The handler filters out any request for an htm file and, using the Context.RewritePath method, rewrites the request url to that of my aspx page with the viewer. It passes the desired htm file in the querystring. Now all requests for an htm file will be served by my asp.net 2.0 code.
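
To see what the rewrite does to a url without setting up IIS, the rule can be pulled out into a plain function. This is only a sketch: the RewriteRule class and its Rewrite method are hypothetical names that are not part of the site's code, and the input is assumed to be an app-relative path starting with "~/", as AppRelativeCurrentExecutionFilePath returns it.

```csharp
using System;

// Hypothetical helper (not in the original module): isolates the rewrite rule
// so it can be checked without a running web application.
public static class RewriteRule
{
    // Returns the rewritten path for an .htm request,
    // or null when the request should be left alone.
    public static string Rewrite(string appRelativePath)
    {
        if (!appRelativePath.EndsWith(".htm", StringComparison.OrdinalIgnoreCase))
            return null;

        // Substring(2) strips the leading "~/" (assumed to be present)
        return string.Format("~/ArticleViewer.aspx?article={0}", appRelativePath.Substring(2));
    }
}
```

With this sketch, RewriteRule.Rewrite("~/DotNet/Art01.htm") gives "~/ArticleViewer.aspx?article=DotNet/Art01.htm", while a request for "~/Default.aspx" gives null and is left untouched.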

(You can do a lot more with HttpModules. There are many events to hook into, so a module can be the first and the last one to handle, bend, modify or analyze every request served by your app. There are loads and loads of good samples to be found all over the web.)

Now the viewer has to display the htm. How will it do that? The easy part is that you can assign any html to the Text property of a Label; the rendered page will then display the htm in its full glory. But I want to be a neat citizen on the web and not render any garbage. The original htm of my pages has a lot of bla-bla FrontPage headers. What my code will do is extract the real content from the htm file and assign that to the label.

An html document (should) look like this:

<html>
   <head>
      <title>This page is about software</title>
      .......
   </head>
   <body>
      .................
   </body>
</html>

The content is between the body tags.

The code takes this approach:

  • Read the htm filename from the querystring
  • Read the htm file into the rawHtml string
  • Extract the page title using a regular expression
  • Assign the title to the viewerpage's title
  • Extract the page body by searching for the body tags
  • Assign the body to the text of a label

private void displayArticle()
{
    object o = Request.Params["article"];
    if (o != null)
    {
        // read in the htm file
        string fullFileName = HttpContext.Current.Server.MapPath(o.ToString());
        StreamReader sr = null;

        try
        {
            sr = new StreamReader(fullFileName);
            string rawHtml = sr.ReadToEnd();

            // Use regex to extract title.
            // The options are combined with | (IgnoreCase & Multiline evaluates to RegexOptions.None);
            // Singleline lets . match newlines, so a title spanning several lines is found as well.
            Regex reTitle = new Regex(@"<title\b[^>]*>(.*?)</title", RegexOptions.IgnoreCase | RegexOptions.Singleline);
            Match titleMatch = reTitle.Match(rawHtml);
            if (titleMatch.Success)
                this.Title = titleMatch.Groups[1].Value;

            // Plain (case-insensitive) search to extract body
            int bodyStart = rawHtml.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
            if (bodyStart >= 0)
            {
                // Find end of body tag
                bodyStart = rawHtml.IndexOf(">", bodyStart);
                int bodyEnd = rawHtml.IndexOf("</body", bodyStart, StringComparison.OrdinalIgnoreCase);
                // No closing body tag: take everything up to the end of the file
                if (bodyEnd < 0)
                    bodyEnd = rawHtml.Length;
                LabelArticle.Text = rawHtml.Substring(bodyStart + 1, bodyEnd - bodyStart - 1);
            }
        }
        catch (Exception)
        {
            // Missing or unreadable file, or malformed html
            LabelArticle.Text = "Article not available";
        }
        finally
        {
            if (sr != null)
                sr.Close();
        }
    }
}

For the code to build you need to include System.IO and System.Text.RegularExpressions in the using list. A regular expression is a nice way to get the title, even when the tags are spelled sloppily, like <tiTle  >. The Groups[1].Value member returns the title enclosed by the tags. It would be tempting to use a regular expression to get the body as well. But due to the many nested <'s and >'s inside the body that would be a pretty complicated one. And when you manage to figure out a working one, there's quite a chance it will take ages to evaluate. Here I know there is at most one pair of body tags, so a linear search will be fast and good enough.
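
To try the two extraction steps outside a web page, they can be moved into a small static class. This is a minimal sketch: HtmlExtractor, ExtractTitle and ExtractBody are hypothetical names that do not appear in the viewer page. It uses the same title expression and the same linear body search as described above, with the regex options combined using | so that sloppy casing really is ignored.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the viewer page): the same two extraction
// steps, without any dependency on System.Web.
public static class HtmlExtractor
{
    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>(.*?)</title",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text between the title tags, or null when there is no title.
    public static string ExtractTitle(string rawHtml)
    {
        Match m = TitleRegex.Match(rawHtml);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns the markup between the body tags, or null when there is no body tag.
    // A missing closing tag is tolerated: everything up to the end is returned.
    public static string ExtractBody(string rawHtml)
    {
        int bodyStart = rawHtml.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (bodyStart < 0)
            return null;

        // Find the end of the opening body tag (it may carry attributes)
        int tagEnd = rawHtml.IndexOf('>', bodyStart);
        if (tagEnd < 0)
            return null;

        int bodyEnd = rawHtml.IndexOf("</body", tagEnd, StringComparison.OrdinalIgnoreCase);
        if (bodyEnd < 0)
            bodyEnd = rawHtml.Length;

        return rawHtml.Substring(tagEnd + 1, bodyEnd - tagEnd - 1);
    }
}
```

Fed a page with a <tiTle  > tag, a <BODY> tag with attributes and no closing body tag at all, ExtractTitle still returns the title text and ExtractBody returns everything after the opening body tag.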

<Update>
In a comment James Curran writes down one regular expression which yields both results in one go. It works like a charm and makes the code even simpler.

private void displayArticle()
{
    object o = Request.Params["article"];
    if (o != null)
    {
        // read in the htm file
        string fullFileName = HttpContext.Current.Server.MapPath(o.ToString());
        StreamReader sr = null;

        try
        {
            sr = new StreamReader(fullFileName);
            string rawHtml = sr.ReadToEnd();

            // Use one regex to extract both title and body.
            // Note: it only matches a bare <body> tag, without attributes.
            Regex reHtml = new Regex(@"<title\b[^>]*>(?<Title>.*)</title\b[^>]*>.*<body>(?<Body>.*)</body>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
            Match m = reHtml.Match(rawHtml);
            if (m.Success)
            {
                this.Title = m.Groups["Title"].Value;
                LabelArticle.Text = m.Groups["Body"].Value;
            }
            else
            {
                LabelArticle.Text = "Article not available";
            }
        }
        catch (Exception)
        {
            // Missing or unreadable file
            LabelArticle.Text = "Article not available";
        }
        finally
        {
            if (sr != null)
                sr.Close();
        }
    }
}

This was too good not to be included in the full story.
</Update>

The result is that all my classic pages are a full part of the asp.net 2.0 site and are still accessible by their classic url. The reader won't even notice.



Comments

Christopher Steen said:

"Atlas" April CTP Release [Via: Rich Ersek ]
[T-SQL] Call a stored procedure once for each row in a...
# April 12, 2006 10:24 PM

James Curran said:

With a slightly more advanced regular expression, the rest can be simplified:

Regex reHtml = new Regex(@"<title\b[^>]*>(?<Title>.*)</title\b[^>]*>.*<body>(?<Body>.*)</body>",
RegexOptions.IgnoreCase | RegexOptions.Singleline);

MatchCollection mc = reHtml.Matches(rawHtml);
string Title = mc[0].Groups["Title"].Value;
string Body =  mc[0].Groups["Body"].Value;

Console.WriteLine("Title:\"{0}\", Body:{1}", Title, Body);
# April 13, 2006 11:04 AM

pvanooijen said:

Thank you James. Your regular expression works well. And fast :)
The only thing it does not cover is html where (even) the closing body tag is missing. IE will render that...
# April 14, 2006 4:48 AM

Peter's Gekko said:

Working a lot with a computer has a couple of health risks, which get bigger and bigger over the...
# June 9, 2006 5:11 AM

Mina said:

First, it is a good way of displaying the HTM files, but I need to ask: if my HTM files contain images, how do I display them?

# October 29, 2007 8:19 PM

pvanooijen said:

As an image is just a link, that should be no problem, as long as you make sure the path to the images can be accessed. Otherwise you have to do some fiddling with those urls.

# October 30, 2007 5:20 AM
