
Html Processing


Some separation of concerns:

  1. Parsing HTML for skin rendering
  2. Adding missing tags (e.g. closing tags that were left open)
  3. Adding formatting tags (line breaks and paragraphs)
  4. Entity encoding
  5. Plugin architecture for things like wiki formatting

Potentially of interest: <http://mercury.ccil.org/~cowan/XML/tagsoup/>.

Another interesting package (via Jürg on helma-dev): <http://htmlparser.sourceforge.net/>

Plan A

  1. Keep the current code as a starting point. I can't find any other code with a similar feature mix (most importantly smart formatting) that looks worth switching to.
  2. Separate character entity escaping from the formatting and tag closing.
  3. Update the list of recognized tags, using the list from the Tagsoup project.
  4. Allow plugins to handle formatting at various stages, e.g. before, after, or instead of the default formatting.
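
To make steps 2 and 4 more concrete, here is a minimal sketch of what separate escaping and before/after/instead plugin hooks could look like. Everything in it is an assumption for discussion: `FormatPlugin`, `FormatPipeline`, and their methods are hypothetical names, not existing Helma classes, and `defaultFormat` is only a stand-in for the current formatter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical plugin contract; none of these names exist in Helma.
interface FormatPlugin {
    // Runs before the default formatting and returns the (possibly modified) text.
    default String before(String text) { return text; }

    // May replace the default formatting entirely; empty means "use the default".
    default Optional<String> instead(String text) { return Optional.empty(); }

    // Runs after formatting, whether default or replaced.
    default String after(String html) { return html; }
}

final class FormatPipeline {
    private final List<FormatPlugin> plugins = new ArrayList<>();

    FormatPipeline register(FormatPlugin plugin) {
        plugins.add(plugin);
        return this;
    }

    // Entity escaping as its own step, independent of formatting and tag closing.
    static String escapeEntities(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Stand-in for the default formatter: turns newlines into <br /> tags.
    static String defaultFormat(String text) {
        return text.replace("\n", "<br />\n");
    }

    String format(String text) {
        String current = text;
        for (FormatPlugin p : plugins) {
            current = p.before(current);
        }
        // The first plugin that returns a value replaces the default formatting.
        String formatted = null;
        for (FormatPlugin p : plugins) {
            Optional<String> replaced = p.instead(current);
            if (replaced.isPresent()) {
                formatted = replaced.get();
                break;
            }
        }
        if (formatted == null) {
            formatted = defaultFormat(current);
        }
        for (FormatPlugin p : plugins) {
            formatted = p.after(formatted);
        }
        return formatted;
    }
}
```

A wiki formatting plugin would then override only the hook it needs, for example `before` to expand wiki markup ahead of the default formatting, or `instead` to bypass it.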

Plan B

  1. Keep the current code for character entity escaping only.
  2. Use Tagsoup for cleaning up tags and, drawing on the knowledge in Helma's old HTML formatter, for generating break/paragraph tags.
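
As a rough illustration of the break/paragraph generation in step 2, the sketch below turns blank lines into paragraph boundaries and single newlines into `<br />`. It is not the logic of Helma's existing formatter and does not involve Tagsoup; `ParagraphFormatter` is a hypothetical name, and the sketch assumes plain text with no block-level tags in the input.

```java
final class ParagraphFormatter {
    // Assumes plain text input: one or more blank lines separate paragraphs,
    // a single newline inside a paragraph becomes a line break.
    static String format(String text) {
        // Normalize Windows and old Mac line endings to "\n" first.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n').trim();
        if (normalized.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (String paragraph : normalized.split("\\n{2,}")) {
            out.append("<p>")
               .append(paragraph.replace("\n", "<br />\n"))
               .append("</p>\n");
        }
        return out.toString();
    }
}
```

A real implementation would also have to leave existing block-level elements (lists, tables, preformatted text) alone, which is where the knowledge from the old formatter comes in.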

Open Issues

We should provide a way to allow only certain tag/attribute combinations, either to exclude scripts or just to keep people from ruining the layout.
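
One possible shape for such a restriction is an allow list that maps each permitted tag to its permitted attributes. The sketch below is only an assumption about how this could be modeled; `TagPolicy` and its methods are hypothetical names, and which tags and attributes to allow is left open.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class TagPolicy {
    // Hypothetical allow list: lower-case tag name -> permitted lower-case attribute names.
    private final Map<String, Set<String>> allowed;

    TagPolicy(Map<String, Set<String>> allowed) {
        this.allowed = allowed;
    }

    // A tag is allowed only if it is listed explicitly.
    boolean allowsTag(String tag) {
        return allowed.containsKey(tag.toLowerCase(Locale.ROOT));
    }

    // An attribute is allowed only on an allowed tag that lists it.
    boolean allowsAttribute(String tag, String attribute) {
        Set<String> attributes = allowed.get(tag.toLowerCase(Locale.ROOT));
        return attributes != null
                && attributes.contains(attribute.toLowerCase(Locale.ROOT));
    }
}
```

With a policy that allows `a` with `href` and `title`, and `b` with no attributes, a `script` tag or an `onclick` attribute is rejected simply because it is not listed. Attribute values would still need their own check, since an allowed `href` can carry a `javascript:` URL.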

Skin parsing might start from this code too if we move to HTML/XML-style skin tags.

Tagsoup notes

HtmlParser notes