Supported file formats include .doc, .docx, .htm, .html, .pdf, .ps, .tmx, .txt, .vert, .xml. XML files can also be uploaded as plain text, but they should contain only text with structural mark-up (such as document or paragraph boundaries, document metadata, etc.). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly:
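As a hedged illustration of simple structural mark-up, the Python sketch below embeds a small XML sample and checks that it is well-formed and uses only a short list of structural elements. The element names `doc` and `p`, the attributes `title` and `author`, and the helper `uses_only_structural_markup` are assumptions made for this example, not names confirmed by the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are illustrative assumptions.
SAMPLE = """<doc title="Example document" author="Unknown">
<p>First paragraph of plain text.</p>
<p>Second paragraph of plain text.</p>
</doc>"""

# Assumed set of simple structural tags (document and paragraph boundaries).
STRUCTURAL_TAGS = {"doc", "p"}


def uses_only_structural_markup(xml_text, allowed=STRUCTURAL_TAGS):
    # Return True if xml_text is well-formed XML and every element,
    # including the root, has a tag listed in `allowed`.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return all(element.tag in allowed for element in root.iter())


if __name__ == "__main__":
    print(uses_only_structural_markup(SAMPLE))
```

A check like this only approximates "simple" mark-up; it says nothing about how any particular corpus tool will treat the file.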
With regard to PDF files, please bear in mind that they are first converted into plain text in order to create a corpus. This conversion is still an unsolved problem in computer science. In particular, PDF files containing multiple columns, headers/footers, or words split at the end of lines may not be processed correctly.
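To make the line-end word-splitting problem concrete, here is a minimal Python sketch that rejoins words broken by a hyphen at the end of a line in already extracted plain text. The function name `rejoin_hyphenated` is hypothetical, and this is not a description of how the conversion described above works internally; it only shows one clean-up step a user might try before uploading.

```python
import re


def rejoin_hyphenated(text):
    # Join a word split as "<letters>-" at a line end with its continuation
    # on the next line by removing the hyphen and the newline.
    # Caveat: a genuine compound broken at its own hyphen also loses the
    # hyphen, so the result should be reviewed by hand.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


if __name__ == "__main__":
    extracted = "This conver-\nsion is difficult."
    print(rejoin_hyphenated(extracted))
```

Multi-column layouts and repeated headers/footers need different handling and are not addressed by this sketch.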