An "XML-ised" version of David Lewis' distribution of the original Reuters News Corpus for Text Classification (Reuters-21578) can be found in the following archive:
Once the archive has been expanded, the resulting files
reut2-001.xml, ..., reut2-021.xml
will contain about 26 MB
of well-formed (though not valid) XML-encoded news items, each
annotated with categories (see cat-descriptions_120396.txt) and other
information such as the TITLE and BODY text. The files can be easily
parsed by any non-validating, event-driven parser such as J. Clark's
XP or any other SAX-conformant parser. More details can be found in
README_LEWIS.txt.
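
As a rough illustration of the event-driven parsing mentioned above, here is a
minimal sketch using Python's standard xml.sax module (a non-validating SAX
parser). It only assumes that each news item carries a TITLE element, as
described above; the class name TitleCollector and the function
collect_titles are hypothetical names chosen for this example, and the exact
nesting of elements in the files should be checked against README_LEWIS.txt.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text content of every TITLE element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # holds text chunks while inside a TITLE element

    def startElement(self, name, attrs):
        if name == "TITLE":
            self._buffer = []

    def characters(self, content):
        # A SAX parser may deliver one text node in several chunks,
        # so accumulate them until the element closes.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "TITLE" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


def collect_titles(xml_bytes):
    """Parse an XML document given as bytes and return its TITLE texts."""
    handler = TitleCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles
```

To process one of the expanded files directly, the same handler can be passed
to xml.sax.parse("reut2-001.xml", handler). Collecting BODY text or the
category annotations would follow the same pattern with additional element
names.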