Hi everyone,
I'm working on a piece of software that reads and parses WordPress XML export files (containing posts and pages).
I'm having trouble with a number of XML files that don't seem to have any P or BR tags to mark new lines in the content field. However, the content does include other HTML tags such as UL and LI.
Example XML looks something like this...
<content:encoded><![CDATA[This is a paragraph.
Another paragraph.
<ul>
<li>Bullet list</li>
<li>Bullet list</li>
</ul>
]]></content:encoded>
Currently my script treats this as HTML content, so I end up with all the text on one line: "This is a paragraph. Another paragraph."
However, if I use the PHP nl2br() function to add the missing line breaks, I end up with this...
<ul><br /><li>Bullet list</li><br /><li>Bullet list</li><br /></ul>
Does anyone have a method of parsing this pseudo-HTML in the XML files so that the line breaks are retained? I notice that the original site has the P tags in the correct places, so something about the import must be stripping them. Unfortunately I'm not the person generating the export file, so I have no control over this.
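One possible explanation, as far as I understand it: WordPress (at least with the classic editor) stores post content without P tags and adds them when the page is rendered, using its wpautop() function. That would explain why the original site shows them while the export doesn't. Below is a minimal sketch of the same idea in plain PHP. The function name simple_autop() is made up for illustration, and it assumes each block-level tag starts its own line, as in the example above. It is a simplification, not a port of wpautop(): text lines inside a multi-line PRE or LI, for instance, would still get wrapped.

```php
<?php
// Hypothetical helper, not part of WordPress: a simplified, line-based
// imitation of what wpautop() does when WordPress renders a post.
function simple_autop(string $content): string
{
    // Lines starting with one of these tags are passed through untouched.
    $blockLine = '~^</?(?:ul|ol|li|table|thead|tbody|tr|td|th|div|blockquote|pre|h[1-6]|p|hr)\b~i';

    // Normalise Windows/Mac line endings, then split into lines.
    $lines = explode("\n", str_replace(["\r\n", "\r"], "\n", $content));

    $out = [];        // finished chunks of HTML
    $paragraph = [];  // text lines collected for the current paragraph

    // Close the paragraph being collected, if there is one.
    $flush = function () use (&$out, &$paragraph) {
        if ($paragraph) {
            $out[] = '<p>' . implode("<br />\n", $paragraph) . '</p>';
            $paragraph = [];
        }
    };

    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            $flush();               // a blank line ends the paragraph
        } elseif (preg_match($blockLine, $line)) {
            $flush();               // block-level markup ends it too
            $out[] = $line;
        } else {
            $paragraph[] = $line;   // plain text: part of a paragraph
        }
    }
    $flush();

    return implode("\n", $out);
}

$content = "This is a paragraph.\nAnother paragraph.\n"
         . "<ul>\n<li>Bullet list</li>\n<li>Bullet list</li>\n</ul>\n";
echo simple_autop($content), "\n";
```

With this sketch, a single newline between text lines becomes a BR inside one P, and a blank line starts a new P, which is roughly how wpautop() behaves. The UL and LI lines come through without any BR tags. If the script can load WordPress itself, calling the real wpautop() would probably be more robust than this.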
Has anyone come across this before or have any ideas?
Thanks in advance :)