Automatic XML to HTML Converter

"It is one of the tragedies of the Internet," mused the Code Sprite "that given a medium for transferring information freely across the planet, we have fallen into the trap of perpetually adding 'enhancements' that imprison our data behind bars of incompatibility".


Download
htmlgen.zip

I shrugged noncommittally, hoping that this bout of philosophical reflection was going to lead somewhere.

"Take XML for example - what a wonderful idea! With it, we could keep our documents in a repository, and publish them as web pages one day, a magazine article the next, and next week completely change the appearance of the website just by changing the XSL rules that convert the document into a web page!"

I agreed that this was most laudable: separating form from content, and allowing data to flow from one medium to another with the minimum of editing.

"Precisely! But what do we find? A singular lack of standardisation in XSL, and precious little adoption of XML and XSL in web browsers. What is the point of even trying to create documents using XML if one is restricted to displaying them on a tiny minority of computers that have a browser that can interpret the XSL conversion code correctly? And who knows what will be in the final XML/XSL specification?"

I asked if this was just going to be a rant against the realities of software development, or did he actually have an alternative suggestion?

"How funny you should ask - wait here a moment". The Code Sprite hopped down from my desk and rummaged around in the great haversack he habitually carried around. Before long, he was back with a small folder under his arm. He placed it on my computer, and a directory listing appeared on the screen.

"Now, it occurred to me, that just because XSL is barely supported - hardly surprising, they're still working on the specification as far as I know - and just because the alternative, accessing the XML tree using javascript only works on Internet Explorer (and may change with each successive generation of the browser), that's no reason to throw away the benefits that XML can bring. Also, coding up documents in XML now, means that you'll have a terrific headstart when more tools that use XML become available".

"So, I decided to write a program that can take a set of rules about how an XML database should be interpreted, and automatically generate one or more documents from the database. It's only a partial solution; I haven't yet supported Tag attributes, or the alternative form of empty tags - you know, <tag/>, but it works for now, and you can extend its capabilities by simply adding to the list of rules that it uses to parse an XML file".

The Code Sprite went on to describe what his program "HTMLgen" could do, and how to build up a rules file to create HTML pages from an XML data file. I've paraphrased his examples here, to remove the many rants against the inequalities of web browsers.

The General Form of HTMLgen Rules and XML files

It is assumed that a single XML file will contain one or more records, each of which will be used to generate a separate HTML document, and that each of these records will contain sub-records that describe the contents of the page. The rules file contains an entry for each XML tag that must cause some text to be put into the output file. The most important rules are those that tell HTMLgen to start or end a document. To achieve this, there are two special rules-file markers, !!NEW!! and !!CLOSE!!. !!NEW!! creates a new file, whose filename is generated from the root of the XML filename concatenated with a text item extracted from one of the tags within the "new page" tag in the XML file; !!CLOSE!! closes the file again. An example may help!

Test.XML:

	<?xml version="1.0" encoding="UTF-8"?>
	<?xml-stylesheet href="pages.xsl" type="text/xsl"?>
	<document>
	  <page>
	    <documentname>Page1</documentname>
	    <documentsubject>Just a test page</documentsubject>
	    <documentheading>This is a test page</documentheading>
	    <section>
	      <para>Here's a paragraph</para>
	      <para>And this is another</para>
	    </section>
	  </page>
	  <page>
	    <documentname>Page2</documentname>
	    <documentsubject>Another test page</documentsubject>
	    <documentheading>This is another test page</documentheading>
	    <section>
	      <image>
		<filename> bensprite6.gif </filename>
		<text> CodeSprite Logo </text>
	      </image>
	    </section>
	  </page>
	</document>
				
Test.rules:

	<page>!!NEW!! !!documentname!!
	 <html>
	 <head><title>!!documentheading!!</title></head>
	 <body>
	<endrule>
	</page></body></html>!!CLOSE!!
	<endrule>
	<para> <P>!!.!!</P>
	<endrule>
	<image> <IMG src='!!filename!!' alt='!!text!!'></IMG>
	<endrule>
	

How it Works

The XML file is a simple database containing two "page" records, one with some text and the other with a reference to a graphics file. Note that the tag names in the XML file are purely arbitrary - HTMLgen makes no assumptions about the contents of an XML file.

The rules file tells HTMLgen what to do for each of the tag types specified in the rules file. In this example, HTMLgen is to create a new HTML file every time it finds a "<page>" tag. The !!NEW!! marker must be followed by a tag-field identifier, in this case !!documentname!!.
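
For the curious, here is a rough Python sketch of how a rules file like Test.rules might be split into individual rules. It is only a guess at the format, inferred from the example above (a tag, then the text to emit, then an <endrule> line); it is not HTMLgen's own code, and the names parse_rules and RULES_TEXT are made up for illustration.

```python
# Hypothetical sketch: the rules-file format is inferred from Test.rules,
# not taken from HTMLgen's source.
RULES_TEXT = """\
<page>!!NEW!! !!documentname!!
 <html>
 <head><title>!!documentheading!!</title></head>
 <body>
<endrule>
</page></body></html>!!CLOSE!!
<endrule>
<para> <P>!!.!!</P>
<endrule>
<image> <IMG src='!!filename!!' alt='!!text!!'></IMG>
<endrule>
"""


def parse_rules(text):
    """Return a dict mapping each XML tag to the text its rule emits."""
    rules = {}
    for block in text.split("<endrule>"):
        block = block.strip()
        if not block:
            continue  # nothing left after the final <endrule>
        # Assumption: the rule's tag is everything up to the first '>'.
        end = block.index(">") + 1
        rules[block[:end]] = block[end:].strip()
    return rules


rules = parse_rules(RULES_TEXT)
for tag, template in rules.items():
    print(repr(tag), "->", repr(template))
```

Keying the rules on the literal tag text (including the "/" of a closing tag) keeps "<page>" and "</page>" as two separate entries, which matches the way the example rules file treats them.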

Using these rules, every time HTMLgen finds "<page>" in Test.XML, it creates a new HTML file. The filename will be "Test" (the root of the XML filename) concatenated with the contents of the next "documentname" tag. So in this example, TestPage1.htm and TestPage2.htm will be generated.
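
The naming scheme can be sketched in a few lines of Python. This is an illustration only: the function name output_filename is invented, and the ".htm" extension and the trimming of surrounding whitespace are assumptions based on the filenames quoted in this example.

```python
import os


def output_filename(xml_path, name_field):
    """Build an output name from the XML file's root and a tag's contents."""
    # "Test.XML" -> "Test"; any directory part is dropped.
    root, _ext = os.path.splitext(os.path.basename(xml_path))
    # Assumption: surrounding whitespace in the tag is ignored and
    # ".htm" is appended, as in the TestPage1.htm example.
    return root + name_field.strip() + ".htm"


first_page = output_filename("Test.XML", "Page1")
second_page = output_filename("Test.XML", "Page2")
print(first_page, second_page)
```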

The rule for <page> also sets up the HTML header, including using the contents of another tag to set up a title for each HTML page. The rule for </page> finishes off the HTML and tells HTMLgen to close the HTML file.

OK, so that's how to insert the contents of a sub-field into the HTML document. There's another special directive that allows you to insert everything between the current start and end tags into the document. If you look at the rule for <para>, you'll see "!!.!!" in the HTML code to be inserted, which tells HTMLgen that everything between <para> and </para> should be copied into the HTML document.

The last rule in this simple example uses data from two sub-fields when writing HTML code to display an image and alternate text.
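
Both kinds of marker can be imitated in Python, which may make the substitution easier to picture. The sketch below is not HTMLgen's implementation: the expand function and the MARKER pattern are invented for illustration, trimming whitespace from the inserted text is an assumption, and the !!NEW!! and !!CLOSE!! markers are deliberately left out.

```python
import re
import xml.etree.ElementTree as ET

# Matches !!.!! or !!fieldname!! (hypothetical pattern, for illustration).
MARKER = re.compile(r"!!(\.|\w+)!!")


def expand(template, element):
    """Fill a rule's markers using the text found in an XML element."""
    def lookup(match):
        name = match.group(1)
        if name == ".":
            text = element.text            # everything inside the tag itself
        else:
            text = element.findtext(name)  # contents of the sub-field <name>
        # Assumption: a missing field becomes an empty string and
        # surrounding whitespace is trimmed.
        return (text or "").strip()
    return MARKER.sub(lookup, template)


para = ET.fromstring("<para>Here's a paragraph</para>")
image = ET.fromstring(
    "<image><filename> bensprite6.gif </filename>"
    "<text> CodeSprite Logo </text></image>"
)

para_html = expand("<P>!!.!!</P>", para)
image_html = expand("<IMG src='!!filename!!' alt='!!text!!'></IMG>", image)
print(para_html)
print(image_html)
```

Here the <para> rule pulls in the element's own text, while the <image> rule pulls in its two sub-fields, mirroring the two rules in Test.rules.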

Summary

HTMLgen's syntax is extremely simple, but the rules can be used to describe complex web pages. The aims are twofold: first, to promote the use of XML for information storage (XML is rapidly gaining favour for data exchange between applications); and second, to assist with separating form from content, which in the case of a website can dramatically reduce the effort of creating a "new look" for the site.

HTMLgen is a console app. Open an MS-DOS window, change to the directory in which you've unpacked htmlgen.zip, and type:

	htmlgen test.xml test.rules