I've been playing with some code to handle WordPress exports (I'm planning to consolidate and retool this site--I don't like the schizophrenic two-sites-within-a-site mentality that it has right now) and one thing is clear: WordPress has some issues. It's a nice platform, by and large, but the export, even running the latest stable version, produces invalid XML. The database collation is UTF-8, and there are characters in the dump that are valid UTF-8 but not valid XML. Moreover, the URLs are not properly escaped, so anchors in them make the parser throw invalid charref errors.
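For the curious, the kind of pre-parse cleanup this forces looks roughly like the sketch below, in Python. The function name `sanitize_export` and both regular expressions are illustrative assumptions, not anything WordPress provides. It drops characters outside the XML 1.0 character range and escapes any ampersand that does not begin something shaped like an entity or character reference. It is deliberately naive: it treats the whole dump as one string, so it would also rewrite ampersands inside CDATA sections, and it lets through named entities that XML does not define.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# An ampersand NOT followed by something shaped like "name;", "#123;" or "#x1F;".
_BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def sanitize_export(text: str) -> str:
    """Best-effort cleanup of an export dump before handing it to an XML parser.

    Hypothetical helper: assumes the dump has already been decoded to str.
    """
    # Remove characters that are valid Unicode but illegal in XML 1.0.
    text = _INVALID_XML_CHARS.sub("", text)
    # Escape ampersands that would otherwise read as broken references.
    return _BARE_AMPERSAND.sub("&amp;", text)


if __name__ == "__main__":
    import sys

    # Usage: python sanitize.py < export.xml > export.clean.xml
    sys.stdout.write(sanitize_export(sys.stdin.read()))
```

Running the cleanup on the raw text first keeps the real parsing code unchanged, at the cost of possibly altering content the parser would have accepted as-is.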

Most of the offending posts are, of course, spam from before I got some good CAPTCHA software running (thanks, Zach). These are duly marked as such in the markup and would have been excluded from any of the later processing--except that I am having to spend time hacking around the broken markup just to get to that point.

Oh, well. Such is life.