Version 10 by hannes on 25. July 2006, 11:29

==== Ideas

Some separation of concerns:

# Parsing HTML for skin rendering
# Adding missing tags
# Adding formatting tags
# Entity encoding
# Plugin architecture for stuff like wiki formatting

Potentially of interest: <http://mercury.ccil.org/~cowan/XML/tagsoup/>.

Another interesting package (via *Jürg on helma-dev|http://grazia.helma.at/pipermail/helma-dev/2006-June/002842.html*): <http://htmlparser.sourceforge.net/>

==== Plan A

# Keep <a href="http://adele.helma.org/source/viewcvs.cgi/helma/src/helma/util/HtmlEncoder.java?rev=1.30&cvsroot=hop&content-type=text/vnd.viewcvs-markup">current code</a> as starting point, as I can't find any other code with a similar feature mix (most importantly smart formatting) that looks like it's worth the switch.
# Separate character entity escaping from the formatting/tag closing.
# Update the list of recognized tags from the Tagsoup project.
# Allow for plugins to handle formatting at various stages, e.g. before/after/instead of default formatting.
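The separation in Plan A could be sketched roughly as follows. This is only an illustration: `FormatPlugin`, `EntityEscaper` and `FormatPipeline` are hypothetical names, not existing Helma classes, and `defaultFormat` is a trivial stand-in for the smart formatting in the current HtmlEncoder.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plugin hook (not an existing Helma interface). */
interface FormatPlugin {
    /** Runs before default formatting. */
    default String before(String text) { return text; }
    /** Returns replacement output, or null to let default formatting run. */
    default String instead(String text) { return null; }
    /** Runs after default formatting. */
    default String after(String text) { return text; }
}

/** Character entity escaping only; knows nothing about tags or formatting. */
final class EntityEscaper {
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}

/** Runs registered plugins around (or instead of) the default formatting. */
final class FormatPipeline {
    private final List<FormatPlugin> plugins = new ArrayList<>();

    FormatPipeline register(FormatPlugin plugin) {
        plugins.add(plugin);
        return this;
    }

    String format(String text) {
        for (FormatPlugin p : plugins) text = p.before(text);
        String replaced = null;
        for (FormatPlugin p : plugins) {
            replaced = p.instead(text);
            if (replaced != null) break;   // first plugin that takes over wins
        }
        text = (replaced != null) ? replaced : defaultFormat(text);
        for (FormatPlugin p : plugins) text = p.after(text);
        return text;
    }

    /** Stand-in for smart formatting: turns line breaks into br tags. */
    static String defaultFormat(String text) {
        return text.replace("\n", "<br />\n");
    }
}
```

A wiki formatter would then be one more `FormatPlugin` registered on the pipeline, while entity escaping stays callable on its own.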

==== Plan B

# Keep <a href="http://adele.helma.org/source/viewcvs.cgi/helma/src/helma/util/HtmlEncoder.java?rev=1.30&cvsroot=hop&content-type=text/vnd.viewcvs-markup">current code</a> for character entity escaping only.
# Use Tagsoup to clean up tags and, drawing on the knowledge in Helma's old HTML formatter, to generate break/paragraph tags.
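As a rough idea of what generating break/paragraph tags means, here is a minimal sketch. It assumes plain text input and ignores existing block-level tags, which the real formatter would have to respect. `ParagraphFormatter` is a hypothetical name.

```java
/** Hypothetical helper: blank lines start a new paragraph, single newlines become br tags. */
final class ParagraphFormatter {
    static String format(String text) {
        StringBuilder out = new StringBuilder();
        // Split on one or more blank lines (possibly containing whitespace).
        for (String block : text.trim().split("\\n\\s*\\n")) {
            if (block.isEmpty()) continue;
            out.append("<p>")
               .append(block.trim().replace("\n", "<br />\n"))
               .append("</p>\n");
        }
        return out.toString();
    }
}
```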

==== Open Issues

We should provide a way to allow only certain tag/attribute combinations, either to exclude scripts or simply to keep people from ruining the layout.
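One possible shape for such a feature is a simple whitelist lookup that the tag cleaner consults for every tag and attribute it sees. The sketch below is an assumption about how this could look; `TagWhitelist` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical whitelist of tag names and the attributes allowed on each. */
final class TagWhitelist {
    private final Map<String, Set<String>> allowed = new HashMap<>();

    /** Allows a tag together with the given attributes (names are case-insensitive). */
    TagWhitelist allow(String tag, String... attributes) {
        Set<String> attrs = new HashSet<>();
        for (String a : attributes) attrs.add(a.toLowerCase(Locale.ROOT));
        allowed.put(tag.toLowerCase(Locale.ROOT), attrs);
        return this;
    }

    boolean allowsTag(String tag) {
        return allowed.containsKey(tag.toLowerCase(Locale.ROOT));
    }

    boolean allowsAttribute(String tag, String attribute) {
        Set<String> attrs = allowed.get(tag.toLowerCase(Locale.ROOT));
        return attrs != null && attrs.contains(attribute.toLowerCase(Locale.ROOT));
    }
}
```

With `new TagWhitelist().allow("a", "href").allow("b")`, a script tag or an onclick attribute would be rejected. Checking names alone does not catch script URLs inside an allowed href, so attribute values would need a separate check.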

Skin parsing might start from this code too if we move to HTML/XML-style skin tags.

==== *Tagsoup|http://home.ccil.org/~cowan/XML/tagsoup/* notes

* Very small (around 50 k)
* Implements a plain SAX2 parser
* Never ever throws an exception
* But does tag balancing, tag insertion etc.
* Is pretty good at this tag balancing business, probably better than HtmlParser
* Likes to convert HTML snippets to whole documents, which is not really convenient for some of our purposes (but not a big problem either: just provide convenience methods to drop the html and body tags)
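Since Tagsoup is a SAX2 parser, dropping the html and body wrappers could be done with a standard SAX filter placed between the parser and the real content handler. The sketch below uses only `org.xml.sax` classes from the JDK. `SnippetFilter` is a hypothetical name, and the sketch assumes the wrappers are reported as elements named html and body.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical SAX filter that suppresses the html and body wrapper elements. */
class SnippetFilter extends XMLFilterImpl {

    private static boolean isWrapper(String localName, String qName) {
        // Fall back to the qualified name when namespace processing is off.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "html".equalsIgnoreCase(name) || "body".equalsIgnoreCase(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!isWrapper(localName, qName)) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!isWrapper(localName, qName)) {
            super.endElement(uri, localName, qName);
        }
    }
}
```

To use it, the Tagsoup parser (`org.ccil.cowan.tagsoup.Parser`, an `XMLReader`) would be set as the filter's parent via `setParent`, and the application's handler registered with `setContentHandler`. Everything inside the wrappers passes through unchanged.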

==== *HtmlParser|http://htmlparser.sourceforge.net/* notes

* Sufficiently small (full jar is ~300 k)
* Low-level lexer is even smaller (~70 k), might be enough for us
* NodeFactory class is used to create Nodes
* Most important Node subinterfaces are Text and Tag
* Tag has methods getEnders() and getEndTagEnders() that determine which start and end tags will close this tag; this is how tag balancing/injection of virtual end tags is implemented
* Tag balancing is only done if a matching subclass of CompositeTag exists and is registered
* Nodes and NodeLists have a toHtml() method that converts them back to HTML text
* HtmlParser provides an advanced filtering/nodewalking framework that would be quite useful for tag/attribute filtering
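The "enders" idea noted above can be illustrated without HtmlParser itself. The following is a simplified sketch, not HtmlParser's implementation: `EnderBalancer` and its `ENDERS` table are hypothetical, and only the innermost open tag is checked when a new start tag arrives.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical balancer: a start tag listed as an "ender" closes the innermost open tag. */
final class EnderBalancer {
    // Illustrative table only: for each tag, the start tags that implicitly close it.
    private static final Map<String, Set<String>> ENDERS = new HashMap<>();
    static {
        ENDERS.put("li", new HashSet<>(Arrays.asList("li")));
        ENDERS.put("p", new HashSet<>(Arrays.asList("p", "ul", "div")));
    }

    private final Deque<String> open = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();

    void startTag(String name) {
        String top = open.peek();
        Set<String> enders = (top == null) ? null : ENDERS.get(top);
        if (enders != null && enders.contains(name)) {
            // Inject a virtual end tag for the element being implicitly closed.
            out.append("</").append(open.pop()).append('>');
        }
        open.push(name);
        out.append('<').append(name).append('>');
    }

    void text(String s) {
        out.append(s);
    }

    /** Closes everything still open and returns the balanced markup. */
    String finish() {
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Feeding it the events for an unordered list with two unclosed list items yields markup in which both items and the list itself are properly closed.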