I am working on parsing some BBCode and displaying it as HTML (I know it has already been done, but I want to redo it for learning purposes). The language of choice is PHP (this is important, as will become clear soon).
A little bit of background: I checked out how the Text_Wiki module from PEAR (PHP) works, and it seems to be based on regular expressions, a bottom-up approach, and a set of delimiters. I hated the technique and would rather have something more concrete than working within the string itself to represent the parse. Additionally, I believe it's not very stable yet; for example, wiki-style quote blocks just plain destroy the parsing and delete the enclosed text.
After reviewing a little about context-free grammars, that seemed like the ultimate solution, but there are a few quirks I need to sort out.
To start with, parsing the text character by character is out of the question: it would slow the interpreter to a crawl and seriously hinder any attempt to scale an active application. A busy forum, for example, may have 100 visitors viewing pages that are rendered at runtime, if for whatever reason they always have to be rendered dynamically (to save space, for example).
Regular expressions to the rescue:
\[quote\](.*?)\[/quote\]
You'd think it would work like a charm, but all hell breaks loose with this:
[quote]aoihfia fshf[quote]kaska[/quote]
[/quote]
It would match the whole string except for the last end-quote tag, because the lazy match stops at the first closing tag it finds, which belongs to the inner quote.
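To make the failure concrete, here is a minimal sketch that runs the pattern against the nested input above. The variable names are just for illustration, and the brackets are escaped so they match literally instead of forming character classes; the `s` modifier is an assumption added so that `.` can cross newlines in multi-line quotes.

```php
<?php
// The nested sample from above: a quote inside another quote.
$input = "[quote]aoihfia fshf[quote]kaska[/quote]\n[/quote]";

// Escaped brackets match literally; 's' lets '.' match newlines too.
$pattern = '~\[quote\](.*?)\[/quote\]~s';

preg_match($pattern, $input, $m);

echo $m[0], "\n"; // [quote]aoihfia fshf[quote]kaska[/quote]
echo $m[1], "\n"; // aoihfia fshf[quote]kaska
```

The outer opening tag gets paired with the inner closing tag, so the captured text still contains a stray `[quote]` and the final `[/quote]` is left unmatched.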
Another solution would be to use a string-find function to locate the first occurrence of an opening tag. I quickly hit a wall and changed it to the more complex: the first occurrence of any allowed opening tag.
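The "first occurrence of any allowed opening tag" idea could be sketched roughly like this. `findFirstOpeningTag` and the tag list are hypothetical names made up for illustration, and the sketch assumes plain tags such as `[b]` with no attributes (something like `[quote=name]` would need extra handling).

```php
<?php
// Hypothetical helper: returns [position, tagName] for the earliest allowed
// opening tag found in $text at or after $offset, or null if there is none.
function findFirstOpeningTag(string $text, array $allowedTags, int $offset = 0): ?array
{
    $best = null;
    foreach ($allowedTags as $tag) {
        // strpos() returns false when the needle is absent, so compare strictly.
        $pos = strpos($text, '[' . $tag . ']', $offset);
        if ($pos !== false && ($best === null || $pos < $best[0])) {
            $best = [$pos, $tag];
        }
    }
    return $best;
}

$allowed = ['b', 'i', 'quote']; // assumed tag set, not from a real forum config
$result = findFirstOpeningTag('plain [i]text[/i] and [b]bold[/b]', $allowed);
// $result is [6, 'i']
```

Note that this scans the string once per allowed tag, so the cost grows with the number of tags, and it says nothing yet about pairing each opening tag with the right closing one.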
An alternative would be to search for the longest run of text that doesn't match any of the opening tags, but that is just plain hackery, and I hate going down this path.
If anyone has had the chance to work with this kind of string dancing, please lend me some good advice.