Archive for May, 2008

Using html2wiki for Google Code

Wednesday, May 14th, 2008

I recently released an HTML to Google Code wiki converter.  Now that the module is published on CPAN, it’s time to provide some usage instructions.

If you’re an experienced Perl hacker, see

perldoc HTML::WikiConverter

and

perldoc HTML::WikiConverter::GoogleCode.

If you’re not a Perl hacker, read on…

There are two ways to use the module; the easiest is via a command-line script, html2wiki:

http://search.cpan.org/~diberri/HTML-WikiConverter/bin/html2wiki

which is provided as part of the HTML::WikiConverter Perl module. The other option is to write your own Perl script that uses the Google Code module. Using html2wiki is easier, as you only have to supply the desired command-line options; writing your own script is more flexible and powerful. I’m going to cover the html2wiki option in detail, and then briefly walk through an example Perl script.

Installation

Before you can use the converter, you’ll need to have Perl installed. I use ActiveState’s binary distribution for Windows; the installation instructions are here:

http://aspn.activestate.com/ASPN/docs/ActivePerl/5.10/install.html

Then you’ll need to install the Perl modules HTML::WikiConverter and HTML::WikiConverter::GoogleCode. There are a couple of ways to install these modules: you can download the tar files from cpan.org, or use ppm if you have the ActiveState Perl distribution. My favorite method is to use the Perl CPAN module, which is part of the core Perl distribution. The following shell commands should get you close:

>perl -MCPAN -e shell
cpan> install HTML::WikiConverter
CPAN: Storable loaded ok (v2.16)
...
cpan> install HTML::WikiConverter::GoogleCode
CPAN: Storable loaded ok (v2.16)
...
cpan>

Line 1 starts the CPAN module in interactive mode. If you have any trouble, just type help at the cpan> prompt.

At this point, html2wiki should be installed. From a command prompt, you should be able to summon up the usage instructions:

>html2wiki
Usage:
    html2wiki [options] [file]
    Commonly used options:
        --dialect=dialect    Dialect name, e.g. "MediaWiki"
...

You should also be able to verify that the Google Code dialect is installed.

>html2wiki --list
Installed dialects:
GoogleCode
...
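
If you would rather check from Perl itself, here is a minimal sketch that tries to load both modules and reports the result. The helper name module_available is made up for this illustration; it only relies on Perl's built-in require.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the named module can be loaded, else 0.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Report on the two modules this post depends on.
for my $module (qw(HTML::WikiConverter HTML::WikiConverter::GoogleCode)) {
    printf "%-35s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```

If either line reports MISSING, go back to the cpan> prompt and repeat the install step for that module.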

Using html2wiki

For this example, I put some random HTML in a file named example.html:

>type example.html
CamelCase words like JavaScript, <pre>JavaScript</pre> and
<b>bold</b> words,and html tokens: 1 < 1

Using the Windows cmd.exe shell, the default conversion to wiki markup looks like this:

>html2wiki --dialect=GoogleCode < example.html
CamelCase words like JavaScript,

{{{
JavaScript
}}}

and *bold* words, and html tokens: 1 &lt; 1

The default conversion escapes HTML tokens such as < by replacing them with the HTML escape sequence (&lt;, in this case). The Google Code wiki renders &lt; literally as &lt;, which is not what we want (<). To turn escaping off, pass the --no-escape-entities option:

html2wiki --dialect=GoogleCode ^
     --no-escape-entities ^
     < example.html

The output is now:

CamelCase words like JavaScript,

{{{
JavaScript
}}}

 and *bold* words, and html tokens: 1 < 1
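
To make the difference between the two outputs concrete, here is a rough sketch of what entity escaping amounts to. This is not the module's code; the function name escape_entities_sketch is hypothetical, and I'm assuming only the three characters &, < and > are affected.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the characters that are unsafe in HTML
# with their entity equivalents, in a single pass so that nothing is
# escaped twice.
sub escape_entities_sketch {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $text =~ s/([&<>])/$entity{$1}/g;
    return $text;
}

print escape_entities_sketch('html tokens: 1 < 1'), "\n";
# prints: html tokens: 1 &lt; 1
```

With --no-escape-entities the text is passed through untouched, which is what the Google Code wiki needs.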

The next adjustment I make is to turn off the Google Code feature that automatically turns CamelCase words into wiki links. The documents I’ve been posting contain many CamelCase words, and for most of them I do not want the automatic link. The wiki markup for disabling link generation is to precede the word with an exclamation mark. I’ve added an option that does this for specific words. For example, to prevent link generation for the words CamelCase and JavaScript, change the html2wiki command to:

html2wiki --dialect=GoogleCode ^
     --no-escape-entities ^
     --escape-autolink=CamelCase ^
     --escape-autolink=JavaScript ^
     < example.html

The generated wiki markup from this command is:

!CamelCase words like !JavaScript,

{{{
JavaScript
}}}

and *bold* words, and html tokens: 1 < 1

In this case, the CamelCase words are escaped, except when the word occurs within a <pre> tag (line 4).
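
The behaviour can be sketched in a few lines of plain Perl. This is only an illustration of the idea, not how the module does it internally; escape_autolinks is a hypothetical helper, and it assumes code blocks are always delimited by {{{ and }}}.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix each listed word with "!" so the wiki does
# not auto-link it, leaving {{{ ... }}} code blocks untouched.
sub escape_autolinks {
    my ($markup, @words) = @_;
    return $markup unless @words;
    my $alternatives = join '|', map { quotemeta($_) } @words;

    # The capture group makes split return the code blocks as well.
    my @parts = split /(\{\{\{.*?\}\}\})/s, $markup;
    for my $part (@parts) {
        next if $part =~ /\A\{\{\{/;    # leave code blocks alone
        # Match whole words only, and skip words already escaped with "!".
        $part =~ s/(?<![!\w])($alternatives)\b/!$1/g;
    }
    return join '', @parts;
}

my $markup = "CamelCase words like JavaScript,\n\n{{{\nJavaScript\n}}}\n";
print escape_autolinks($markup, qw(CamelCase JavaScript));
```

The word-boundary checks matter once the list contains words that are prefixes of each other, such as SqlMap and SqlMapClient.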

The last Google Code feature of note is the ability to embed a page summary and labels within the wiki markup. The page summary is a comment on the first line, which is displayed on the project’s wiki index. Likewise, the labels element is a comment that Google Code interprets. For example, when the label ‘Featured’ is applied to a wiki page, a link to the page is created on the project’s front web page. The final example shows adding a summary and two labels to the wiki markup. The html2wiki command is:

html2wiki --dialect=GoogleCode ^
     --no-escape-entities ^
     --escape-autolink=CamelCase ^
     --escape-autolink=JavaScript ^
     --summary="This is a great page" ^
     --labels=Featured ^
     --labels=Phase-Deploy ^
     < example.html

and the generated wiki markup is:

#summary This is a great page
#labels Featured,Phase-Deploy
!CamelCase words like !JavaScript,

{{{
JavaScript
}}}

and *bold* words, and html tokens: 1 < 1
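
The two header lines follow a simple pattern: the summary goes on a #summary line, and repeated labels are joined with commas on a single #labels line. A minimal sketch of that layout, using a hypothetical helper named wiki_header (not part of the module):

```perl
use strict;
use warnings;

# Hypothetical helper: build the "#summary" and "#labels" header lines
# that Google Code reads from the top of a wiki page.
sub wiki_header {
    my ($summary, @labels) = @_;
    my $header = '';
    $header .= "#summary $summary\n" if defined $summary && length $summary;
    $header .= '#labels ' . join(',', @labels) . "\n" if @labels;
    return $header;
}

print wiki_header('This is a great page', qw(Featured Phase-Deploy));
# prints:
# #summary This is a great page
# #labels Featured,Phase-Deploy
```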

Using H::WC::GoogleCode in a Perl Script

This last example shows a Perl script that makes use of the options shown above. This example is a script I’ve used to convert a Developer’s Guide saved as HTML into wiki markup. The source code for the script follows below.

On line 5, the H::WC module is imported. I didn’t import the H::WC::GoogleCode module since it is imported by H::WC based on the dialect name.

This document has a long list of CamelCase words for which I wanted to suppress generation of links. These words are collected into an array on line 7. Line 22 sets up a hash with keys of the path to the HTML document and values of the path where the generated wiki markup should be stored.

On line 26, a new instance of H::WC is created. Line 27 names the dialect, which pulls in the H::WC::GoogleCode module. Line 28 passes the list of CamelCase words. Line 29 turns off escaping of HTML entities.

For images, the HTML document uses relative links to images stored on the file system (at ../docs/img). In order to support the wiki, I’ve staged the images on my web server under /img and passed in the web-site URI on line 30. The relative image links are turned into absolute URLs pointing to my web server in the wiki markup.

The final option, the summary parameter on line 31, generates a page summary on the wiki.

The balance of the script pulls in the HTML, converts it with the H::WC instance (line 42), and writes the wiki markup to a file.

 #!/usr/bin/perl -w

 package main;
 use strict;
 use HTML::WikiConverter;

 my @toEscape = qw/
              SqlMapClient
              JBati
              SqlMapClientBuilder
              JavaScript
              NovaJug
              ExampleJSS
              SqlLite
              ExamplesJSS
              SqlMap
              MySql
              SqlMapConfig
              DevGuide
 /;

 my %to_process = qw /
      ..\docs\DevelopersGuide.html
 /;

 my $wc = new HTML::WikiConverter(
      dialect => 'GoogleCode',
      escape_autolink => \@toEscape,
      escape_entities => 0,
      base_uri => 'http://beavercreekconsulting.com',
      summary => 'Developers Guide V0.2 - jBati usage and examples'
 );

 foreach my $in (keys %to_process) {

      open(HTML, "<$in") or die "cannot open $in: $!\n";
      my $html = do {local $/; <HTML>};
      close(HTML);

      open(WIKI, '>' . $to_process{$in}) or die "cannot open " .
              $to_process{$in} . ": $!\n";
      my $converted = $wc->html2wiki($html);
      print WIKI $converted, "\n";
      close(WIKI);
 }
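
For newer Perls, the read-convert-write step in the loop could also be written with lexical filehandles and three-argument open. The following is a sketch under that assumption, not a replacement for the script above; convert_file is a hypothetical helper, and the conversion is passed in as a code reference so the helper does not depend on any particular converter.

```perl
use strict;
use warnings;

# Hypothetical helper: read $in, pass its contents through $convert
# (a code reference), and write the result plus a newline to $out.
sub convert_file {
    my ($in, $out, $convert) = @_;

    open(my $html_fh, '<', $in) or die "cannot open $in: $!\n";
    my $html = do { local $/; <$html_fh> };    # slurp the whole file
    close($html_fh);

    open(my $wiki_fh, '>', $out) or die "cannot open $out: $!\n";
    print {$wiki_fh} $convert->($html), "\n";
    close($wiki_fh) or die "cannot close $out: $!\n";
    return;
}

# With the converter from the script above, the call would look like:
#   convert_file($in, $to_process{$in}, sub { $wc->html2wiki($_[0]) });
```

Lexical filehandles close themselves when they go out of scope, and the three-argument form of open keeps odd characters in a file name from being read as a mode.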

HTML to Google Code wiki Converter

Tuesday, May 13th, 2008

I’ve written an HTML::WikiConverter dialect for Google Code. The module, HTML::WikiConverter::GoogleCode, is now available on CPAN at http://search.cpan.org/~martykube/.

I wrote the module to support an open source project I’m working on. The module allows me to write my documentation in a good editor (OpenOffice in this case), save the document as HTML, convert it to Google Code wiki markup, and then post the file to my project wiki via svn. The main advantage is that I can also include the HTML documentation in the release file and thereby have one version of a document that is posted on the wiki and included as part of a release.

That’s the short story; read on for the long story…

I recently started an open source project (shameless plug… the project is jBati, a JavaScript ORM library) hosted at Google Code. One nice feature of Google’s project hosting is that each project has a wiki. The project’s wiki is a nice place to post information, as potential consumers can see the project documentation without having to download a release.

However, there are a couple of drawbacks to the wiki. The wiki editor is a plain HTML text box. This is serviceable enough for short documents but doesn’t work well for larger documents. Also, for someone like me who depends heavily on a spell checker, unless my browser does the checking, there is no spell checking.

The other drawback is that some documents, such as a User’s Guide, need to be posted on the wiki and also distributed with releases. There seemed to be two ways to accomplish this: either convert from the wiki to documents for release (which need to be in a text or HTML format), or convert from the release documents to the wiki format.

I figured I wasn’t the first one with this idea and looked around. I found a wiki to HTML tool, an on-line wiki to HTML converter (http://jtidy.de/), and a really hard to read blog which reported that the Perl module HTML::WikiConverter did a decent job of converting HTML to Google Code wiki syntax.

I’m a Perl guy, so the HTML::WikiConverter module sounded like a winner. HTML::WikiConverter comes with a set of plug-ins to support various wiki dialects. The recommended wiki dialect that gets closest to the Google Code wiki dialect is MoinMoin. The results were decent, yet some hand editing was needed to get perfect results. IMHO, once you choose a generation route like this (HTML to wiki), the tool chain has to work without intervention; otherwise you lose much of the advantage due to things getting out of sync.

So, I decided to fix this one for good and wrote a Google Code wiki dialect. I’m currently writing my project documentation with OpenOffice and saving the files as HTML. I use HTML::WikiConverter to generate wiki markup and then post the wiki files to Google Code via svn.