Parsing BBcode and formatting it

arithma

So I am working on parsing some BBCode and displaying it as HTML (I know it has already been done, but I want to redo it for learning purposes). The language of choice is PHP (this is important, as will become clear soon).
A little bit of background: I checked out how the Text_Wiki module from PEAR (PHP) works, and it seems to be based on regular expressions, a bottom-up approach, and a set of delimiters. I hated the technique, and would rather have a concrete parse structure than work within the string itself to represent the parsing. Additionally, I believe it's not very stable yet; for example, wiki-style quote blocks just plain destroy the parsing and delete the included text.

Reviewing a little bit about context-free grammars, they seemed like the ultimate solution. But there are a few quirks that I need to sort out.
To start with, parsing the text character by character is out of the question: it would slow the interpreter to a crawl and seriously hinder any attempt to scale an active application (a busy forum, for example, may have a hundred visitors viewing pages that are rendered at runtime, if for whatever reason they always have to be rendered dynamically, to save space for example).

Regular expressions to the rescue:

\[quote\](.*?)\[/quote\]

You'd say it would work like a charm. All hell breaks loose with this:

[quote]aoihfia fshf[quote]kaska[/quote]
[/quote]

It would match the whole string except for the last end-quote tag, because the lazy match stops at the first [/quote] it finds and pairs it with the outer [quote].
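To make the failure concrete, here is a minimal sketch in Python rather than PHP (as far as I know the pattern behaves the same way under PCRE). The brackets are escaped so they are read as literal characters, and the variable names are made up for illustration.

```python
import re

# Lazy pattern from the post, with the literal brackets escaped.
LAZY_QUOTE = re.compile(r"\[quote\](.*?)\[/quote\]", re.DOTALL)

text = "[quote]aoihfia fshf[quote]kaska[/quote]\n[/quote]"

match = LAZY_QUOTE.search(text)
matched = match.group(0)        # what the regex swallowed
captured = match.group(1)       # the supposed quote body
leftover = text[match.end():]   # what was never paired up

print(repr(matched))
print(repr(leftover))
```

The captured body still contains an unmatched inner [quote], and the outer closing tag is left dangling in `leftover`.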

Another solution would be to use a string find function to find the first occurrence of an opening tag. I quickly hit a wall and had to change it to the more complex: the first occurrence of any allowed opening tag.

An alternative would be to search for the longest run of text that doesn't match any of the opening tags, but that is just plain hackery and I hate going down this path.

If anyone has had the chance to work with this kind of string dancing, please lend me some good advice.

rolf

I did this once; it is a headache. I had to set up a stack...
Every time you encounter an opening tag, it gets added to the stack, and when you find a closing tag, it gets removed from the stack. If the closed tag wasn't the last opened tag, then it would throw an error for invalid markup.
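Roughly, the stack check could look like the sketch below. It is in Python rather than PHP, and the tag whitelist, the function name and the error strings are assumptions made for the example, not what I actually wrote back then.

```python
import re

# Hypothetical tag whitelist, chosen only for this sketch.
ALLOWED_TAGS = {"quote", "b", "i"}
TAG = re.compile(r"\[(/?)([a-zA-Z]+)\]")

def check_nesting(text):
    """Return None if the tags nest properly, else a short error message."""
    stack = []
    for m in TAG.finditer(text):
        is_closer, name = m.group(1) == "/", m.group(2).lower()
        if name not in ALLOWED_TAGS:
            continue  # unknown bracketed text is treated as plain text
        if not is_closer:
            stack.append(name)           # opening tag: push
        elif not stack or stack[-1] != name:
            return "unexpected [/%s] at offset %d" % (name, m.start())
        else:
            stack.pop()                  # matches the last opened tag: pop
    if stack:
        return "unclosed [%s]" % stack[-1]
    return None
```

For example, `check_nesting("[quote]a[quote]b[/quote][/quote]")` returns `None`, while overlapping tags are reported at the first closer that does not match the top of the stack.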

Another thing you could do is this:

\[quote\]([^\[]*)\[/quote\]

That will match a quote whose text does not contain brackets.
Replace the matched text with the correctly formatted text, then execute replace again on that, until you get 0 matches.
This way you will be replacing quotes from the shortest one to the longest, parent one, in the correct order.
Note that this will fail if the text contains any brackets, so you need to escape all the non-BB brackets from the user.
This will also fail badly if the user writes invalid markup, for example:

[b] this is bold [i] italic... [/b] bold is gone but italic still there [/i]...
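For reference, the repeated-replace idea for quotes might be sketched like this (Python instead of PHP; rendering a quote as a blockquote element is just an assumption for the example):

```python
import re

# Innermost quote: the body may not contain any "[" at all.
INNER_QUOTE = re.compile(r"\[quote\]([^\[]*)\[/quote\]")

def render_quotes(text):
    """Replace quotes from the inside out until nothing matches."""
    while True:
        text, count = INNER_QUOTE.subn(r"<blockquote>\1</blockquote>", text)
        if count == 0:
            return text

print(render_quotes("[quote]a[quote]b[/quote]c[/quote]"))
```

The inner quote is replaced on the first pass, which removes the brackets that were blocking the outer one, so the outer quote matches on the second pass. A quote whose text contains a stray bracket never matches and is left untouched.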

You can take it further by using lookaheads, so that the body may contain anything except another tag of the same name:

\[([a-zA-Z]+)\]((?:(?!\[/?\1\]).)*)\[/\1\]

Something like that... :-)
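One way to write the lookahead version so that it actually runs is sketched below, in Python rather than PHP. The tag map is hypothetical, and the tag names are listed in the pattern itself so that every match is replaced and the loop is guaranteed to stop.

```python
import re

# Hypothetical BBCode-to-HTML tag map for this sketch.
HTML_TAGS = {"quote": "blockquote", "b": "b", "i": "i"}

# \1 is the tag name. The body advances one character at a time, and only
# where neither [name] nor [/name] for that same tag begins.
INNER_PAIR = re.compile(
    r"\[(quote|b|i)\]((?:(?!\[/?\1\]).)*)\[/\1\]", re.DOTALL
)

def _to_html(match):
    tag = HTML_TAGS[match.group(1)]
    return "<%s>%s</%s>" % (tag, match.group(2), tag)

def render_pairs(text):
    """Repeat the replacement until no tag pair is left."""
    while True:
        text, count = INNER_PAIR.subn(_to_html, text)
        if count == 0:
            return text
```

Unlike the bracket-free pattern, this one tolerates a stray bracket inside the text. It does not help with overlapping tags, though: the [b]...[i]...[/b]...[/i] example still comes out as overlapping HTML tags.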

An even simpler idea would be to just translate "[quote]" into an opening block <div> and "[/quote]" into </div>; the browser will then display one quote inside another. You can also let the browser handle italics and the other stuff.
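A sketch of that direct translation, in Python rather than PHP. The replacement table and the "quote" class name are assumptions for the example, and the HTML escaping step is my addition so that raw HTML typed by the user is not passed through.

```python
import html

# Hypothetical mapping; the class name "quote" is an assumption.
REPLACEMENTS = {
    "[quote]": '<div class="quote">',
    "[/quote]": "</div>",
    "[b]": "<b>", "[/b]": "</b>",
    "[i]": "<i>", "[/i]": "</i>",
}

def translate_tags(text):
    """Blindly swap each BBCode tag for its HTML counterpart."""
    out = html.escape(text, quote=False)  # neutralise raw HTML first
    for bb, markup in REPLACEMENTS.items():
        out = out.replace(bb, markup)
    return out

print(translate_tags("[quote]a[quote]b[/quote][/quote]"))
```

Nothing here checks that tags are balanced, so an unclosed [quote] produces an unclosed div, which can spill over into the rest of the page.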

arithma

Instead of setting up a stack, I usually use the internal function call stack of whatever language I am using (which is exactly why they are called stack variables (in contrast to heap)).
To avoid parsing character by character, I will have to use some "string position" calls. If none succeeds in a certain context, I'd have to assume it's a regular paragraph.
I don't like the idea of a post failing because of failed markup. I'd tend to assume it's just part of the text. I've got to implement it overnight now; I took all the time I needed to think it over.
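Here is a rough sketch of what I have in mind, in Python rather than PHP just to keep it short. Each opening tag triggers a recursive call, so the call stack does the bookkeeping, and anything that fails to pair up is emitted as plain text. The tag map, the depth cap and the function names are all assumptions, and I use a regex search from an offset where a string-position call would do the same job.

```python
import html
import re

HTML_TAGS = {"quote": "blockquote", "b": "b", "i": "i"}  # hypothetical map
TAG = re.compile(r"\[(/?)(quote|b|i)\]")
MAX_DEPTH = 20  # arbitrary cap so hostile input cannot exhaust the call stack

def _esc(s):
    return html.escape(s, quote=False)

def parse(text, pos=0, open_tags=()):
    """Render from pos until the innermost open tag closes.

    Returns (rendered, next_pos, closed).
    """
    out = []
    while True:
        m = TAG.search(text, pos)        # jump straight to the next tag
        if m is None:                    # no more tags: the rest is text
            out.append(_esc(text[pos:]))
            return "".join(out), len(text), False
        out.append(_esc(text[pos:m.start()]))
        is_closer, name = m.group(1) == "/", m.group(2)
        if is_closer:
            if open_tags and name == open_tags[-1]:
                return "".join(out), m.end(), True     # our own closer
            if name in open_tags:
                return "".join(out), m.start(), False  # an ancestor's closer
            out.append(m.group(0))       # stray closer: keep as text
            pos = m.end()
        elif len(open_tags) >= MAX_DEPTH:
            out.append(m.group(0))       # too deep: keep as text
            pos = m.end()
        else:
            inner, pos, closed = parse(text, m.end(), open_tags + (name,))
            if closed:
                tag = HTML_TAGS[name]
                out.append("<%s>%s</%s>" % (tag, inner, tag))
            else:
                out.append(m.group(0) + inner)  # unclosed: opener stays text

def render(text):
    return parse(text)[0]
```

With this, `render("[b] x [i] y [/b] z [/i]")` keeps the bold pair and leaves the mismatched [i] and [/i] in the output as literal text instead of rejecting the post.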

samer

I have used Rolf's approach to solve an invalid bracket problem. Basically, you push a dummy item onto the stack when a bracket is opened and pop an item when a closing bracket is encountered. Additional conditions should be implemented if you want to allow things like

[quote]the symbol ']' is intriguing.[/quote]

ideas?