Drupal 8 Index pdf files in search - drupal

Does anyone know of a way to include PDF documents in the search for Drupal 8?
I can't find anything to achieve this.

Including PDF documents in a search requires you to index those documents. This may prove difficult on Drupal 8, since there are not many stable solutions.
That being said, try the Search File Attachments module.

Related

Recipes PDF batch extraction

I am working with 500 PDF recipe files that I want to display on my website. How can I batch-extract them and display the information from each PDF on my website? The PDFs contain all the information for the recipes. For each recipe, I need to display its description, image, ingredients, instructions, nutrition label, and so on. Is there any way to avoid doing this manually?
Do these all have the same basic template for how the information is structured? This isn't really specifically a WordPress issue. One thing you can do is use Go to loop through and process all the files. I played with Go and it's incredibly fast to parse large amounts of information. Maybe you can fiddle with it in this library here https://github.com/unidoc/unidoc.
There are a lot of library options to try in PHP also. Here's just one example https://www.pdfparser.org/. There's documentation here and you can install it via composer. https://www.pdfparser.org/documentation
If every recipe follows the same sort of template, and you want to extract specific details in specific sections of the PDF, it should be easy enough. If you don't mind extracting all the text from a PDF and just display that on your website, it should be easy enough using one of the libraries. If you go the Golang route, you could just parse all the text for each PDF, save them to a file, and just upload them using PHP and have the PHP code insert everything into custom post types or something.
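To make the "same template" idea concrete, here is a minimal Python sketch of the step that comes after text extraction: splitting the plain text of one recipe into named sections. It assumes each section starts with a heading on its own line; the heading names and the function name are hypothetical and would need to match the real PDFs. Extracting the text itself is left to whichever library you pick.

```python
import re

# Hypothetical section headings; adjust them to match the real recipe template.
SECTION_HEADINGS = ("Description", "Ingredients", "Instructions", "Nutrition")


def split_recipe(text, headings=SECTION_HEADINGS):
    """Split extracted recipe text into a dict keyed by lower-cased heading."""
    # A heading is a line containing only the heading word and an optional colon.
    pattern = re.compile(
        r"^[ \t]*(" + "|".join(re.escape(h) for h in headings) + r")[ \t]*:?[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[match.end():end].strip()
    return sections
```

Each resulting dict could then be inserted into a custom post type, one field per section. Images would need separate handling, since they are not part of the extracted text.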

Search multiple plone site indexes

I need to implement a central search for multiple Plone sites on different servers/machines. A way to select which sites to search would be a plus, but it is not the primary concern. A few approaches I have come across:
- Export the ZCatalog indexes to an XML file and use a crawler to fetch all the XML files periodically so a search can be run on them, but this does not allow live searching.
- Use a common catalog, but that is not optimal and cannot be implemented on the sites I am working on because of some requirements.
- I read somewhere that Solr has been used for this, but I need help on how to use it.
I would prefer a way to use the existing ZCatalog index rather than create another one, as I think is the case with Solr, because of the extra overhead and the extra index to maintain. I will use Solr if no other solution is possible, though. I am a beginner at search, so please give as much detail as possible.
You should really look into collective.solr:
https://pypi.python.org/pypi/collective.solr/4.1.0
Searching multiple sites is a complex use case, and you most likely need a solution that scales. In the end it will require far less effort to go with Solr than to come up with your own solution. Solr is built for this kind of requirement.
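As a rough illustration of what a central search could look like once the content is in Solr, the Python sketch below builds a query URL for Solr's standard select handler, using the `q`, `fq`, `rows` and `wt` parameters. It rests on assumptions that are not confirmed by collective.solr's documentation: that all sites index into one shared Solr core, and that each document carries a field identifying its site. The field name `site_id`, the function name and the base URL are hypothetical.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, sites=None, rows=10):
    """Return a Solr select URL, optionally restricted to some sites."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if sites:
        # "site_id" is a hypothetical field; the filter query keeps only
        # documents that belong to one of the selected sites.
        quoted = " OR ".join('"{}"'.format(site) for site in sites)
        params.append(("fq", "site_id:({})".format(quoted)))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical core URL and site names.
url = build_solr_query(
    "http://localhost:8983/solr/plone", "annual report", sites=["intranet", "public"]
)
```

Leaving out `sites` searches everything in the core, which covers the optional "select which sites to search" requirement.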
As an alternative, you can also use collective.elasticindex, an extension to index Plone content into ElasticSearch, for this.
According to its documentation:
This doesn’t replace the Plone catalog with ElasticSearch, nor
interact with the Plone catalog at all, it merely index content inside
ElasticSearch when it is modified or published.
In addition to this, it provides a simple search page called
search.html that queries ElasticSearch using Javascript (so Plone is
not involved in searching) and propose the same features than the
default Plone search page. A search portlet let you redirect people to
this new search page as well.
That can be an advantage over collective.solr.

Bulk upload of Microsoft Word files to WordPress pages

I have been asked to upload 200 Microsoft Word documents — many of them containing lengthy, complex math problems or scientific notation — into a WordPress setting. Each Word file would become a separate WordPress post.
I would clearly prefer not to cut-and-paste each file one-by-one into a post and then save it. Does anyone know of a way to automate the process while ensuring the accuracy of the translation, or at least minimizing the number of issues we might find when converting from Word to WordPress? Or am I dreaming the impossible dream?
Thanks for any input you can offer.
Sounds like an interesting problem. I have an idea that might be worth exploring. There are a number of free or shareware tools that can convert Word docs to HTML.
If you can manage to convert them into decently clean markup with one of those tools, I would recommend using the HTML Import 2 WordPress plugin. It can take a batch of HTML files and create Posts / Pages out of them.
It's a two step process, but I bet it'll work. (And certainly be faster than copy/paste 200 times).
Hope that helps, have fun!
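For a sense of what the Word-to-HTML step involves, here is a minimal Python sketch using only the standard library. It assumes the files are .docx (a zip archive whose main text lives in word/document.xml) and it only pulls out paragraph text. Formatting, images and equations stored as Office Math markup are not handled, so for math-heavy documents a dedicated converter is likely still needed. The function name is hypothetical.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_html(source):
    """Convert the paragraph text of a .docx file (path or file object) to <p> tags."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(node.text or "" for node in paragraph.iter(W + "t"))
        if text.strip():
            paragraphs.append("<p>" + html.escape(text) + "</p>")
    return "\n".join(paragraphs)
```

Running this over a folder and writing one .html file per document would give a batch that an HTML import plugin could pick up.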
Well, I found a solution that works for me. It is a bit manual, but it still saves a lot of time.
Here are the steps:
Connect your blog to MS Word 2007/2013.
Make sure remote publishing is enabled in WP.
Copy all the posts into one Word document, or use merge to make a single DOC.
Set the default posting category in WP and save it.
Then, from MS Word, copy each post and publish them one by one.
Tips:
Make a shortcut key for publishing to WP.
Use Ctrl+C to copy the text before publishing.

Generating keywords from a pdf automatically

My application allows user to upload pdf files and store them on the webserver for later viewing. I store the name of the file, location, size, upload date, user name etc in an SQL server database.
I'd like to be able to programmatically generate a list of keywords (maybe everything except common words) just after a file is uploaded, and store them in the SQL database as well so that subsequent users can do keyword searches...
Any suggestions on how to approach this task? Does this type of routine already exist?
EDIT: Just to clarify my requirements, I am not concerned with doing OCR. I don't know the internals of PDFs, but I understand that if one was generated by an app, such as Word->PDF print, the text of the document is searchable... so really my first task, and the intent of my question, is: how do I access the text of a PDF file from an ASP.NET app? OCR on scanned PDFs is probably beyond my requirements at this point.
As a first step you should extract all text from the PDF.
Ghostscript and pdftotext can do this; PDFBox is another option.
There are certainly other tools as well.
Then you can remove all stopwords and duplicates and write it to the database.
It has been mentioned that this does not work for scanned PDF documents, but that is only half the truth. On the one hand, lots of scanned PDFs have text embedded in addition to the image, because that is what some scanner drivers do (Canon CanoScan drivers perform OCR and generate searchable PDFs). On the other hand, documents generated with LaTeX that contain non-ASCII characters return garbage in my experience (even when I copy and paste in Acrobat).
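The "remove stopwords and duplicates, then write to the database" step could look roughly like the Python sketch below. It assumes the text has already been extracted by one of the tools above. sqlite3 stands in for the SQL Server database from the question, and the table name, column names, function names and the short stopword list are all hypothetical.

```python
import re
import sqlite3

# A deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"})


def unique_keywords(text):
    """Return the set of distinct lower-cased words that are not stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in STOPWORDS}


def store_keywords(conn, file_id, text):
    """Store one row per (file, keyword) pair for the given file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_keywords ("
        "file_id INTEGER, keyword TEXT, PRIMARY KEY (file_id, keyword))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO file_keywords VALUES (?, ?)",
        [(file_id, keyword) for keyword in unique_keywords(text)],
    )
    conn.commit()


def find_files(conn, keyword):
    """Return the ids of files whose keyword list contains the given word."""
    rows = conn.execute(
        "SELECT file_id FROM file_keywords WHERE keyword = ? ORDER BY file_id",
        (keyword.lower(),),
    )
    return [row[0] for row in rows]
```

The same table layout should carry over to SQL Server, with the insert statement adapted to its syntax.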
The only problem I foresee with grabbing every non-common word is that you'll dilute your search results and have to query the DB for more PDFs. One website to look at is Scribd, which does something similar to what you are describing: users upload files and people can view them online via a Flash app.
That is a very interesting topic. The question is how many keywords you need to describe one PDF. If you say:
3 to 10 - I would check text categorization methods such as a Bayesian classifier or k-NN (that method will group similar PDF files together). I know that similar algorithms are used to filter spam. But such a system needs training input: for example, if you add keywords to 100 PDFs, the system will learn the patterns. I am not an expert, but this is one way to do it.
more than 10 - then I would suggest brute force -> filter common words -> get most frequent words for a specific document.
I would explore the first option. You should definitely look into methods such as "text categorization", "auto tagging", "text mining", "automatic keyword extraction".
Some links:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Keyword Extraction Using Naive Bayes
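The brute-force option (filter common words, then take the most frequent words in the document) is simple enough to sketch in a few lines of Python. The common-word list here is a small placeholder and the function name is hypothetical. The classifier-based option is not shown, since it needs labelled training documents.

```python
import re
from collections import Counter

# Placeholder list of common words; a real list would be much longer.
COMMON_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"})


def top_keywords(text, n=10, common_words=COMMON_WORDS):
    """Return up to n of the most frequent non-common words in the text."""
    words = [
        word
        for word in re.findall(r"[a-z]+", text.lower())
        if word not in common_words
    ]
    # most_common sorts by descending count.
    return [word for word, _count in Counter(words).most_common(n)]
```

Raw frequency favours words that are common everywhere, so comparing counts against the other uploaded documents may give more distinctive keywords.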
If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that will be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but won't have to worry about filtering out common words. By filtering out common words, you will completely lose the ability to do phrase searches.
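To see why dropping common words breaks phrase searches, consider a toy positional index in Python. This is not how Lucene is implemented; it is only an illustration, with hypothetical function names, of the idea that a phrase query needs the position of every word, common ones included.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {doc_id: [positions]}, keeping common words too."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents containing the words of phrase in order."""
    words = tokenize(phrase)
    if not words or any(word not in index for word in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Every following word must sit exactly one position further on.
            if all(
                start + offset in index[word].get(doc_id, ())
                for offset, word in enumerate(words)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)
```

With "of" and "the" stripped out at indexing time, a query such as "state of the art" could no longer be checked word by word, which is the loss described above.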

In ASP.NET what is the best way to convert a PDF file to HTML?

What my users will do is select a PDF document on their machine, upload it to my website, where I will convert into an HTML document for display on the website. The document will be stored in a database after conversion.
What's the best way to convert a PDF to HTML?
I have been handed a requirement where a user would create a "news" story as a PDF and then upload it to the server, where it will be converted to HTML and displayed on the website.
Any document creation software that can save documents as PDF can save them as HTML. I'm assuming the issue is that your users will be creating rich documents (lots of embedded images), which results in multiple files, and your requirements stem from a desire to make uploading these documents as simple as possible to the user.
There are numerous conversion packages that can probably do this for you, however when you're talking about rich content, you are talking about text plus images. Those images have to be stored somewhere and served somehow, and whatever conversion method you use will require you to examine all image sources to make sure they point to valid locations on your server.
I would like to suggest an alternate way of doing this that you can take to your team: Implement one of the many blog APIs for publishing content. There are free and commercial software packages that use these APIs to publish content directly to a website, such as Windows Live Writer and Microsoft Word. Your users can simply create their content and upload it directly to your website without having to publish it as PDF first then upload it. So the process becomes much smoother for your users, and you get the posts in a form that doesn't require you spend thousands of dollars on developing or buying conversion code.
The two most common APIs are the MetaWeblog API and the Movable Type API. Both are very simple and easy to implement. I think this way would be a MUCH better alternative than what you're thinking about doing.
I don't think converting a PDF to an HTML string is necessarily the best idea, especially if you want to export it back as PDF. PDF files often contain binary elements such as images, so you may be best to convert it to ASCII via an encoding, such as Base64. That way you will have an ASCII string you can save into a text field in the DB and then convert it back out. Could you expand more on the main requirement?
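The Base64 round trip itself is short. The sketch below uses Python for illustration rather than ASP.NET, and the function names are hypothetical. Note that Base64 text is about one third larger than the original bytes, which is one reason a binary (BLOB) column can be the better fit.

```python
import base64


def encode_pdf(data: bytes) -> str:
    """Encode raw PDF bytes as an ASCII string suitable for a text column."""
    return base64.b64encode(data).decode("ascii")


def decode_pdf(text: str) -> bytes:
    """Recover the original PDF bytes from the stored Base64 string."""
    return base64.b64decode(text)
```

The decoded bytes are identical to the uploaded file, so the PDF can be served back unchanged.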
My recommendation would be to not do it that way IF POSSIBLE (but we all know what managers are like) so...
I would recommend that you stay away from converting the PDF to/from HTML (because unless you can find a commercial solution it will be nigh on impossible) and instead do as has already been mentioned and store it as an encoded Base64 string, or BLOB or some other binary format in the database, and then display it to the user with some sort of PDF view plugin for the browser.
All it took was a simple google search for "PDF to HTML": http://www.gnostice.com/pdf2manyOverview_x.asp. I'm sure there are others.
So while it's 'possible', you may want to explain to your manager that this isn't the best content management solution.
Why not use iTextSharp to read the PDF content? Then you could save both the binary PDF and the text content to the database. You could then let users search the content and download the PDF.
You should look into DynamicPDF. They have a converter (currently Beta) out for serving exactly this purpose. We have used their products with great success (especially for dumping Reporting Services reports directly to PDF).
Ref: http://www.dynamicpdf.com/
