People always draw the wrong conclusions in articles like these. He shows a piec...

nailer · on July 9, 2008

As you mention, RegExps are the wrong way to handle markup.

In Python, load the lxml module and parse the content into an etree, a hierarchical data structure for XML elements.

lxml.etree will read comments and processing instructions in and treat them as Comment or ProcessingInstruction nodes.

You can then iterate the tree and collect the comments.
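Roughly what I mean, as a minimal sketch rather than anything definitive. It assumes lxml is installed and that the input is well-formed XML; the sample markup and the collect_comments name are made up for illustration.

```python
from lxml import etree  # third-party package: pip install lxml

# Hypothetical sample markup, invented for this example.
SAMPLE = b"""<page>
  <!-- first comment -->
  <body>
    <p>Some text</p>
    <!-- second comment -->
  </body>
</page>"""


def collect_comments(markup):
    """Parse markup into an etree and return the text of every comment."""
    root = etree.fromstring(markup)
    # Passing etree.Comment to iter() walks the tree in document order
    # and yields only the comment nodes.
    return [node.text.strip() for node in root.iter(etree.Comment)]


print(collect_comments(SAMPLE))
```

Note that root.iter() only sees comments inside the root element. For real-world HTML that isn't well-formed XML you would probably want lxml.html instead of etree.fromstring.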

I've been working with etree for the last couple of weeks, first using XPath expressions to extract data from an HR database that a customer expects my company to look up manually for 6000 people, then to create OpenDocument files. It's really easy, even if you've never done any XML programming before, and the mailing list is very helpful too.

dangoldin · on July 9, 2008

There are Perl packages to do the same - XML::TreeBuilder and HTML::TreeBuilder come to mind. I'm sure for most cases they would both work just fine and it's really just a matter of preference.

nailer · on July 9, 2008

I agree - I wasn't trying to use that as an illustration of the superiority of Python, just to show how the data should be properly parsed, not treated like unstructured data.

That said, I do prefer the 'one, right way to do it' of Python - it makes it a lot easier to pick up on someone else's code.

kingkongrevenge · on July 8, 2008

> He shows a piece of code that uses a regular expression over implicitly-defined variables.

His snippet is perfectly idiomatic Perl code and I find it easily legible. His point is that it's pretty hard to parse that without having already spent some hours with the perldocs. Fair enough.

initself · on July 9, 2008

It might be 'perfectly idiomatic' but jrockway's code is much more typical of what a Perl programmer who wished to communicate clear ideas would write.