IBM Developer Works has just released an article of mine on High-Performance XML Parsing in Python. Although there is nothing publishing-centric about the article itself, it was based on my own experience in dealing with large XML datasets in academic publishing.
Massive XML files are uncommon in the general web development world, where the primary roles of XML are either as configuration files, read only infrequently, or for interchange across the web, in which case the files are necessarily small. It’s rare to encounter XML measured in gigabytes or more; data at that level is usually stored in a relational database.
For that reason I find myself frustrated with many XML tools, even those ostensibly designed to handle large amounts of data. Too often they don’t scale well or at least easily. I don’t believe that scaling should be a black art that each individual developer needs to solve independently. Unfortunately, in commercial products ease-of-use is a key bullet point and computationally-difficult problems are hard to summarize in a user’s guide.