Sitecore 7 PDF and Document Content Search
April 28, 2015
Recently I had to implement PDF content search for a project so content editors could easily find files stored in Sitecore. Since search is built into Sitecore 7, you would think this would be a piece of cake. However, I ran into a number of issues. In theory, you install an IFilter such as Adobe's or Foxit's, rebuild your indexes, and search works. The Foxit IFilter, which is the one Sitecore recommends, is not free, so that was out for me. The Adobe IFilter has issues of its own, which Sitecore has identified as a defect. If you want to use the Adobe IFilter, you will need to follow these steps for content search to work:
- Copy all the Adobe iFilter “.dll” files into the \System32\Inetsrv folder. This is the working directory for IIS on Windows Server. By default, the Adobe iFilter “.dll” files are stored in the C:\Program Files\Adobe\Adobe PDF iFilter 9 for 64-bit platforms\bin folder. You can also use the “IFilter Explorer” tool to find the folder where the “.dll” files are stored. For more details, please see the screenshot.
- Delete all the files under the Website/App_Data/MediaCache folder.
- Rebuild the Sitecore search indexes (Sitecore -> Control Panel -> Indexing -> Indexing Manager).
- Clear the Sitecore cache (the http://{hostname}/sitecore/admin/cache.aspx tool).
- Restart IIS.
Unfortunately, these steps did not work for me. When I tested using Luke, the _content field kept coming back blank for all of my PDFs and documents. If you want more information on how to use Luke, check out John West’s blog post: Using Luke to Understand Sitecore 7 Search. Back to the issue at hand: since neither IFilter was working for me, I decided to write my own method for parsing the content using PDFBox.NET. To do that, I first grabbed the most recent DLLs I could find for PDFBox. Here is the complete zip of the DLLs: PDFBox.NET-1.7.0 DLLs. For the content search, all I needed to add were references to:
- bcmail-jdk15-1.44.dll
- bcprov-jdk15-1.44.dll
- commons-logging.dll
- EPocalipse.IFilter.dll (Document search)
- fontbox-1.7.0.dll
- IKVM.OpenJDK.Core
- IKVM.OpenJDK.SwingAWT
- IKVM.OpenJDK.Util
- IKVM.Runtime
- pdfbox-1.7.0.dll
Once you add those you should be all good to build out the computed field that will override the current _content computed field defined in your Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config file. First let’s create the computed field class:
The class first determines the type of file being indexed and then passes it to the appropriate method to parse the content. Each method passes the media item’s stream to the appropriate text stripper or text reader, and PDFBox handles extracting the content of the document or PDF. If we get content out of the file, we replace every \r\n (a carriage return followed by a line feed) and lowercase the entire content, so that later comparisons don’t run into case mismatches.
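As a rough illustration of the dispatch and clean-up steps described above, here is a minimal sketch in plain C#. It is not the original computed field class: the class name, the method names, the set of extensions, and the choice to replace \r\n with a single space are all assumptions made for the example, and the actual PDFBox and IFilter calls are left out so the snippet compiles on its own.

```csharp
using System;

// Hypothetical helper, not part of Sitecore or PDFBox.
public static class MediaContentHelper
{
    // Maps a media item's extension to the kind of parser that should handle it.
    // The extension list is an assumption; adjust it to the file types you index.
    public static string ParserFor(string extension)
    {
        string ext = (extension ?? string.Empty).TrimStart('.').ToLowerInvariant();
        switch (ext)
        {
            case "pdf":
                return "pdf";       // would be handed to the PDF text stripper
            case "doc":
            case "docx":
                return "document";  // would be handed to the document text reader
            default:
                return null;        // leave other media types alone
        }
    }

    // Cleans up extracted text before it is stored in the _content field.
    // Assumption: each \r\n pair is replaced with a single space.
    public static string Normalize(string rawText)
    {
        if (string.IsNullOrWhiteSpace(rawText))
        {
            return null; // nothing was extracted, so nothing gets indexed
        }
        return rawText.Replace("\r\n", " ").ToLowerInvariant();
    }
}
```

In a real computed field, the value returned by the parser would be passed through something like Normalize before being returned to the indexer, so that a query lowercased the same way matches regardless of the original casing.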
Now that we have our custom class, we need to update the _content computed field assembly reference:
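For reference, the change amounts to pointing the existing _content entry at the new class. The fragment below is only a sketch: the type name MyProject.Search.MediaContentField and the assembly name MyProject are hypothetical placeholders, and the attributes shown are what I would expect in a Sitecore 7 configuration, so keep whatever attributes your own _content entry already has and change only the type.

```xml
<!-- Inside the existing <fields hint="raw:AddComputedIndexField"> element -->
<!-- The type and assembly names below are placeholders for your own class -->
<field fieldName="_content" storageType="no" indexType="tokenized">MyProject.Search.MediaContentField, MyProject</field>
```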
Now we have all the groundwork in place. Go into your Sitecore instance and rebuild your index. Once this finishes, your content search should be good to go.
Happy Coding!
Tags: Sitecore,C#,PDFBox,ContentSearch