Read Word and PDF files from PHP

Most large web applications have to work with files in one way or another: writing config files, generating PDF reports, or reading an XML feed and parsing its data. Since most servers that host PHP run Linux, the real question is how to read files that usually come from Windows, such as .doc or .docx files created in Word, or PDF files, and retrieve the data in them. Fortunately, there are a couple of ways to do this, and here we present the simplest ones. For reading Word files we will use antiword. Note that antiword reads the older binary .doc format, not .docx. Installing antiword on a Linux system is really straightforward because it is in most Linux repositories. For example, on Ubuntu we can install it via:

apt-get install antiword

 

After installation, usage is very simple. We load the whole text of the file into one variable and can then work with it as we like.

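// apt-get installs the binary to /usr/bin; adjust the path if yours differs
$fileContent = shell_exec('/usr/bin/antiword '.escapeshellarg($filename));
// shell_exec() returns a single string, so split it into lines first
foreach( explode("\n", (string) $fileContent) as $line )
{
    print $line."\r\n";
}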
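If the file is missing or antiword fails, shell_exec() gives back nothing useful, so it can help to wrap the call in a small function. The following is only a sketch: the function names buildAntiwordCommand and readWordFile are made up for this example, and it assumes antiword can be found on the PATH of the user PHP runs as.

```php
<?php
// Hypothetical helper: builds the antiword command line for one file.
// escapeshellarg() quotes the file name so that spaces or shell
// metacharacters in it cannot change the command.
function buildAntiwordCommand(string $filename, string $binary = 'antiword'): string
{
    return $binary . ' ' . escapeshellarg($filename);
}

// Hypothetical helper: returns the document text as an array of lines.
function readWordFile(string $filename): array
{
    if (!is_readable($filename)) {
        throw new RuntimeException("Cannot read file: $filename");
    }

    $output = shell_exec(buildAntiwordCommand($filename));

    // shell_exec() returns null (or false) when the command fails
    // or prints nothing at all.
    if ($output === null || $output === false) {
        throw new RuntimeException("antiword produced no output for: $filename");
    }

    // Split on Unix, Windows or old Mac line endings.
    return preg_split('/\r\n|\r|\n/', rtrim($output));
}
```

With something like this in place, a call such as readWordFile($filename) either returns the lines or throws, instead of silently looping over an empty value.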

 

The example above outputs the whole file content line by line. PDF files can be read in much the same way. For that we will use the Xpdf package, which includes the pdftotext tool. Installation is a bit more complicated, but you can download binaries for Linux or Mac OS. Once Xpdf is set up, usage is much the same as with antiword:

// The trailing '-' makes pdftotext write the text to stdout instead of a file
$fileContent = shell_exec('/usr/local/bin/pdftotext '.escapeshellarg($filename).' -');
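Since both tools are used the same way, one possible approach is to pick the command from the file extension. This is a rough sketch, not a finished solution: extractionCommand is a hypothetical name, and the bare binary names assume that antiword and pdftotext are on the PATH (use full paths otherwise).

```php
<?php
// Hypothetical helper: returns the shell command that prints the text
// of a .doc or .pdf file to stdout.
function extractionCommand(string $filename): string
{
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    $quoted    = escapeshellarg($filename);

    switch ($extension) {
        case 'doc':
            return 'antiword ' . $quoted;
        case 'pdf':
            // '-' as the output file sends the text to stdout.
            return 'pdftotext ' . $quoted . ' -';
        default:
            throw new InvalidArgumentException("Unsupported file type: $extension");
    }
}
```

The result can be passed straight to shell_exec(), for example $fileContent = shell_exec(extractionCommand($filename));. Any other extension, including .docx, ends in an exception rather than a confusing empty result.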

 

As before, we can then work with the content as we like. If you know an easier way to read Word or PDF files, please leave a comment.
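As an illustration of what working with the content might look like, here is a small sketch that splits the text into pages and searches it for a keyword. It relies on pdftotext separating pages with a form feed character ("\f"). The function names splitPages and findLines are invented for this example.

```php
<?php
// Hypothetical helper: splits pdftotext output into an array of pages.
function splitPages(string $text): array
{
    $pages = explode("\f", $text);

    // The output normally ends with a form feed, which leaves an
    // empty element at the end of the array.
    if (trim(end($pages)) === '') {
        array_pop($pages);
    }

    return array_map('trim', $pages);
}

// Hypothetical helper: returns the lines that contain $needle
// (case-insensitive), keyed by their 1-based line number.
function findLines(string $text, string $needle): array
{
    $matches = [];
    foreach (preg_split('/\r\n|\r|\n/', $text) as $index => $line) {
        if (stripos($line, $needle) !== false) {
            $matches[$index + 1] = trim($line);
        }
    }
    return $matches;
}
```

For example, foreach (splitPages($fileContent) as $page) { print_r(findLines($page, 'total')); } would list every line mentioning "total", page by page.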

18. April 2012 by admin
Categories: PHP | 2 comments

Comments (2)

  1. AntiWord seems to have stopped development 7 years ago and doesn’t handle Word 2007 or 2010 documents.

  2. You may find interesting PHPDocX (http://www.phpdocx.com) that lets you manipulate word documents (.docx) with PHP.

    Best regards,
    Eduardo
