HTML Processing
Ideas
Some separation of concerns:
- Parsing HTML for skin rendering
- Adding of missing tags
- Adding of formatting tags
- Entity encoding
- Plugin architecture for stuff like wiki formatting
Potentially of interest: <http://mercury.ccil.org/~cowan/XML/tagsoup/>.
Another interesting package (via Jürg on helma-dev): <http://htmlparser.sourceforge.net/>
Plan A
- Keep the current code as a starting point, since I can't find any other code with a similar feature mix (most importantly smart formatting) that looks worth switching to.
- Separate character entity escaping from the formatting/tag closing.
- Update the list of recognized tags using the one from the TagSoup project.
- Allow for plugins to handle formatting at various stages, e.g. before/after/instead of default formatting.
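A minimal sketch of how Plan A's split might look, assuming Java 8 or later. The names `EntityEncoder` and `FormattingPipeline` are hypothetical and not part of the existing code; the point is only that entity escaping lives in its own class and that plugins can hook in before, after, or instead of the default formatter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Character entity escaping, kept separate from formatting and tag closing. */
final class EntityEncoder {
    private EntityEncoder() {}

    /** Escapes the four characters that are significant in HTML markup. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}

/** Hypothetical plugin pipeline: before / default (replaceable) / after. */
final class FormattingPipeline {
    private final List<UnaryOperator<String>> before = new ArrayList<>();
    private final List<UnaryOperator<String>> after = new ArrayList<>();
    private UnaryOperator<String> formatter;

    FormattingPipeline(UnaryOperator<String> defaultFormatter) {
        this.formatter = defaultFormatter;
    }

    /** Runs the plugin before the default formatting step. */
    FormattingPipeline addBefore(UnaryOperator<String> plugin) {
        before.add(plugin);
        return this;
    }

    /** Runs the plugin after the default formatting step. */
    FormattingPipeline addAfter(UnaryOperator<String> plugin) {
        after.add(plugin);
        return this;
    }

    /** Uses the plugin instead of the default formatting step. */
    FormattingPipeline replaceDefault(UnaryOperator<String> plugin) {
        formatter = plugin;
        return this;
    }

    String format(String input) {
        String text = input;
        for (UnaryOperator<String> plugin : before) {
            text = plugin.apply(text);
        }
        text = formatter.apply(text);
        for (UnaryOperator<String> plugin : after) {
            text = plugin.apply(text);
        }
        return text;
    }
}
```

A wiki formatting plugin would then be registered with `addBefore(...)`, so that the markup it produces still passes through the default tag closing.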
Plan B
- Keep current code for character entity escaping only.
- Use TagSoup to clean up tags and, drawing on the logic of Helma's old HTML formatter, to generate break/paragraph tags.
Open Issues
We should provide a way to allow only certain tag/attribute combinations, either to exclude scripts or simply to keep people from breaking the layout.
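One possible shape for such a filter is an allow-list keyed by tag name. The sketch below is only an illustration: `TagPolicy` and its methods are hypothetical names, and tag and attribute names are assumed to be compared case-insensitively.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical allow-list of tag/attribute combinations. */
final class TagPolicy {
    // Maps a lower-cased tag name to the lower-cased attributes allowed on it.
    private final Map<String, Set<String>> allowed = new HashMap<>();

    /** Allows a tag together with the given attributes (possibly none). */
    TagPolicy allow(String tag, String... attributes) {
        Set<String> attrs = new HashSet<>();
        for (String attribute : attributes) {
            attrs.add(attribute.toLowerCase(Locale.ROOT));
        }
        allowed.put(tag.toLowerCase(Locale.ROOT), attrs);
        return this;
    }

    /** Tags that were never registered are rejected. */
    boolean isTagAllowed(String tag) {
        return allowed.containsKey(tag.toLowerCase(Locale.ROOT));
    }

    /** An attribute is accepted only on a tag it was registered for. */
    boolean isAttributeAllowed(String tag, String attribute) {
        Set<String> attrs = allowed.get(tag.toLowerCase(Locale.ROOT));
        return attrs != null && attrs.contains(attribute.toLowerCase(Locale.ROOT));
    }
}
```

With `new TagPolicy().allow("a", "href").allow("b")`, a `script` tag and an `onclick` attribute are both rejected. Note that checking names alone is not enough to exclude scripts: attribute values such as `javascript:` URLs in `href` would need a separate check.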
Skin parsing might start from this code too if we move to HTML/XML-style skin tags.
TagSoup notes
- Very small (around 50 KB)
- Implements a plain SAX2 parser
- Never ever throws an exception
- But does tag balancing, tag insertion etc.
- Is pretty good at this tag balancing business, probably better than HtmlParser
- Likes to convert HTML snippets into whole documents, which is inconvenient for some of our purposes (but not a big problem either: we can provide convenience methods that drop the html and body tags)
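The convenience method for dropping the html and body tags could be a small SAX filter, since the parser is a plain SAX2 implementation. This is a sketch using only the standard `org.xml.sax` classes; `WrapperStrippingFilter` is a hypothetical name, and it assumes the parent reader reports element names as either local names or qualified names.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** SAX filter that suppresses the html and body wrapper elements. */
final class WrapperStrippingFilter extends XMLFilterImpl {

    /** The parent can be any SAX2 XMLReader. */
    WrapperStrippingFilter(XMLReader parent) {
        super(parent);
    }

    // Falls back to the qualified name when no local name is reported.
    private static boolean isWrapper(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "html".equalsIgnoreCase(name) || "body".equalsIgnoreCase(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Only the wrapper's own start event is dropped; its children pass through.
        if (!isWrapper(localName, qName)) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!isWrapper(localName, qName)) {
            super.endElement(uri, localName, qName);
        }
    }
}
```

The filter is used like any other `XMLReader`: set a content handler on it and call `parse(...)`. Character data and all other elements reach the handler unchanged.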
HtmlParser notes
- sufficiently small (full jar is ~300 k)
- low level lexer is even smaller (~70 k), might be enough for us
- NodeFactory class is used to create Nodes
- most important Node subinterfaces are Text and Tag
- Tag has methods getEnders() and getEndTagEnders() that determine which (end) tags will close this tag; this is how tag balancing/injection of virtual end tags is implemented
- Tag balancing is only done if a matching subclass of CompositeTag exists and is registered
- Nodes and NodeLists have a toHtml() method that converts them back to HTML text
- HtmlParser provides an advanced filtering/nodewalking framework that would be quite useful for tag/attribute filtering
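To make the "enders" idea from the notes above concrete, here is a self-contained sketch of the technique. It is not HtmlParser's implementation and uses none of its classes: `TagBalancer` and its methods are hypothetical, and it only handles the case where a start tag closes the tags that are currently open.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative tag balancer driven by a table of "enders". */
final class TagBalancer {
    // For each tag, the start tags that implicitly close it.
    private final Map<String, Set<String>> enders = new HashMap<>();
    // Currently open tags, innermost first.
    private final Deque<String> open = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();

    /** Declares that an open tag is closed when any of closedBy starts. */
    TagBalancer ender(String tag, String... closedBy) {
        enders.put(tag, new HashSet<>(Arrays.asList(closedBy)));
        return this;
    }

    TagBalancer startTag(String tag) {
        // Inject virtual end tags for every open tag that this tag ends.
        while (!open.isEmpty()
                && enders.getOrDefault(open.peek(), Collections.<String>emptySet()).contains(tag)) {
            out.append("</").append(open.pop()).append('>');
        }
        open.push(tag);
        out.append('<').append(tag).append('>');
        return this;
    }

    TagBalancer text(String text) {
        out.append(text);
        return this;
    }

    /** Closes whatever is still open and returns the balanced markup. */
    String finish() {
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

For example, with `ender("li", "li")` registered, the sequence `ul`, `li`, text, `li`, text yields a list in which the first item is closed before the second one opens.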