XML-encoded version of Reuters-21578

An "XML-ised" version of David Lewis' distribution of the original Reuters News Corpus for Text Classification (Reuters-21578) can be found in the following archive:

reuters21578-xml.tgz (5.2Mb)

Once the archive has been expanded, the resulting files reut2-001.xml, ..., reut2-021.xml contain about 26Mb of well-formed (though not valid) XML-encoded news items, each annotated with categories (see cat-descriptions_120396.txt) and other information such as the TITLE and BODY of the text. The files can be parsed easily by any non-validating, event-driven parser, such as J. Clark's XP or any other SAX-conformant parser. More details can be found in README_LEWIS.txt.
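
As an illustration of the event-driven approach, here is a minimal sketch using the SAX interface in Python's standard library. It assumes that each news item is wrapped in a REUTERS element and that TITLE and BODY appear as element names, as in Lewis' original SGML distribution; these names, the NewsItemHandler class and the parse_news_items helper are assumptions made for the example and should be checked against the actual files.

```python
import xml.sax


class NewsItemHandler(xml.sax.ContentHandler):
    """Collect the TITLE and BODY text of each news item.

    The element names are assumptions based on the original SGML
    distribution; adjust them if the XML files use different ones.
    """

    def __init__(self, item_tag="REUTERS", fields=("TITLE", "BODY")):
        super().__init__()
        self.item_tag = item_tag
        self.fields = set(fields)
        self.items = []        # one dict per completed news item
        self._current = None   # item being built, or None outside an item
        self._field = None     # field currently being read, if any

    def startElement(self, name, attrs):
        if name == self.item_tag:
            self._current = {f: "" for f in self.fields}
        elif self._current is not None and name in self.fields:
            self._field = name

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate it.
        if self._field is not None:
            self._current[self._field] += content

    def endElement(self, name):
        if name == self._field:
            self._field = None
        elif name == self.item_tag and self._current is not None:
            self.items.append(self._current)
            self._current = None


def parse_news_items(xml_bytes):
    """Parse a well-formed XML document given as bytes; return the items."""
    handler = NewsItemHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.items
```

To process one of the expanded files directly, the same handler can be passed to xml.sax.parse together with a file name such as reut2-001.xml. Because the handler keeps only the fields it is asked for, memory use is governed by the collected text rather than by the size of the document tree.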