Shortlog - a log of everyday things

2012-02-18

More thoughts relating to my previous entry grumbling about plaintext markup formats.

It turns out that markdown implementations also accept lists with proper indentation, so long as you start them with spaces. So that's one complaint already addressed (though in my defense, the "spec" doesn't mention it). Yay.

What's more, python-markdown2 supports Google Code-style tables, which, while not my favorite, are certainly better than raw HTML tables.

The ugly-looking headers are probably something I can tear out of an implementation with relative ease.

Most markdown implementations these days also disable the underscores-within-words-becoming-italics nonsense; I'll probably find some way to hack around one of the implementations to get (in my opinion) more reasonable underline, italic, strong, and monospace behavior. But hey, at least someone realized one of the most glaring defects was problematic.
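To make the intra-word rule concrete, here is a minimal Python sketch. The pattern and the name `italicize` are my own assumptions for illustration, not lifted from any particular implementation: an underscore only counts as emphasis when it isn't glued to a word character on its outer side.

```python
import re

# Hypothetical pattern: the opening underscore must not follow a word
# character, and the closing underscore must not precede one.
EMPHASIS = re.compile(r"(?<!\w)_(?=\S)(.+?)(?<=\S)_(?!\w)")

def italicize(text):
    """Wrap _emphasized_ runs in <em>, leaving snake_case words alone."""
    return EMPHASIS.sub(r"<em>\1</em>", text)
```

With this, `italicize("call some_function_name now")` leaves the text untouched, while `_important_` surrounded by spaces gets wrapped in `<em>`.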

Apparently markdown output is producible without actually truly parsing the content - it's done nearly entirely as a pipeline with a series of regex replacements (some quite complex). This makes the code fairly simple, in practice.
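A toy version of that pipeline approach might look like the following. This is only a sketch under my own assumptions (the four rules, their order, and the name `render` are made up for illustration); real implementations have many more stages.

```python
import html
import re

# Each stage is (pattern, replacement). Order matters, since every stage
# sees the output of the stages before it.
PIPELINE = [
    (re.compile(r"^# (.+)$", re.MULTILINE), r"<h1>\1</h1>"),
    (re.compile(r"\*\*(.+?)\*\*"), r"<strong>\1</strong>"),
    (re.compile(r"\*(.+?)\*"), r"<em>\1</em>"),
    (re.compile(r"`(.+?)`"), r"<code>\1</code>"),
]

def render(text):
    """Escape HTML metacharacters, then run every regex stage in order."""
    out = html.escape(text, quote=False)
    for pattern, replacement in PIPELINE:
        out = pattern.sub(replacement, out)
    return out
```

Note that no stage knows about any other: asterisks inside backticks would still be turned into emphasis here, which is exactly the sort of thing that needs extra passes to work around.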

On the downside, it does mean that there isn't really a grammar for markdown, nor any semantic tree for it. Markdown is never parsed, just rendered. And it is very much tied to HTML as the output format, which has both upsides and downsides - on the one hand, you can easily punt to HTML when you need more precise control of output. On the other hand, you can't (sanely) use markdown to output to, say, a PDF, or a LaTeX document. Tradeoffs.
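For contrast, here is roughly what a semantic tree buys you. This is a sketch with hypothetical node types (`Text`, `Emphasis`, `Paragraph`) of my own invention, and it skips escaping entirely; the point is only that one tree can feed more than one output format.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Text:
    value: str

@dataclass
class Emphasis:
    children: List["Node"]

@dataclass
class Paragraph:
    children: List["Node"]

Node = Union[Text, Emphasis, Paragraph]

def to_html(node):
    """Render the tree as HTML (no escaping, for brevity)."""
    if isinstance(node, Text):
        return node.value
    inner = "".join(to_html(child) for child in node.children)
    if isinstance(node, Emphasis):
        return "<em>" + inner + "</em>"
    return "<p>" + inner + "</p>"

def to_latex(node):
    """Render the same tree as LaTeX (no escaping, for brevity)."""
    if isinstance(node, Text):
        return node.value
    inner = "".join(to_latex(child) for child in node.children)
    if isinstance(node, Emphasis):
        return "\\emph{" + inner + "}"
    return inner + "\n\n"

doc = Paragraph([Text("plain and "), Emphasis([Text("emphasized")])])
```

Calling `to_html(doc)` and `to_latex(doc)` on the same `doc` gives two renderings without re-reading the source text, which a chain of HTML-producing regex replacements can't offer.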

If you do want to actually parse markdown (or similar constructs), you're left facing a few unhelpful facts:

- Markdown (like most of these plaintext markup formats) is context-sensitive.
- Context-sensitive parsers cannot have completely separate lexers, since the tokens are ambiguous and depend on context (is that number followed by a period the beginning of an ordered list item, or just some text in a paragraph?).
- Parser generators suck at producing parsers for context-sensitive grammars, since we lack a good way to represent them, unlike for regular grammars (regular expressions, of course!) or even context-free grammars (BNF/EBNF).
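The number-followed-by-a-period ambiguity can be sketched as a lexer that needs to know what came before. The rule below is a simplification I'm assuming for illustration (it is not how any specific implementation decides), and `lex_lines` is a hypothetical name: a line like "N. text" is a list item only at the start of input, after a blank line, or after another list item.

```python
import re

MARKER = re.compile(r"^(\d+)\.\s+(.*)$")

def lex_lines(text):
    """Tag each line, using the previous line's tag as context."""
    tokens = []
    previous = "blank"  # start of input behaves like a blank line
    for line in text.splitlines():
        stripped = line.strip()
        match = MARKER.match(stripped)
        if not stripped:
            kind, value = "blank", ""
        elif match and previous in ("blank", "item"):
            # Same characters, different token: context decides.
            kind, value = "item", match.group(2)
        else:
            kind, value = "text", stripped
        tokens.append((kind, value))
        previous = kind
    return tokens
```

Here "1986. What a year." in the middle of a paragraph stays plain text, while "1. first" after a blank line becomes a list item, even though both match the same pattern.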

So if you want to truly parse something like markdown, you're probably left manually coding the appropriate context-sensitive parser with a codependent lexer (if you bother lexing at all), or modifying the language itself so that it is no longer context-sensitive. It's worth noting that regular expressions with non-greedy (lazy) quantifiers are capable of dealing with this particular type of context-sensitive grammar. It's unclear to me whether this generalizes to all context-sensitive grammars or not, but I'd be interested to know.

With all this in mind, I'm now deciding whether to try to modify a decent markdown implementation to add things I like, since several of the things I complained about are no longer issues, or to continue writing a proper parser that can handle all the things I like about both markdown and dokuwiki.
