In Paw, it used to be easy to share a Paw file in a repository as part of the documentation, because the file was stored as plain text and changes to it were legible.
Artsy does this in its Energy project, and they have written some blog posts about it.
However, since the 3.x release, it appears that the file is now stored in a binary format.
Am I missing a hidden option to save in XML format, or is the feature completely gone?
I am now working with 500 PDF recipe files, which I want to display on my website. How can I batch-extract them and display the information from the PDFs on my website? Each PDF has all of the information for a recipe. For each recipe, I need to display its description, image, ingredients, instructions, nutrition label, and so on. Is there any way to avoid doing all of this manually?
Do these all have the same basic template for how the information is structured? This isn't really specifically a WordPress issue. One thing you can do is use Go to loop through and process all the files. I played with Go and it's incredibly fast at parsing large amounts of information. Maybe you can fiddle with this library: https://github.com/unidoc/unidoc.
There are a lot of library options to try in PHP as well. Here's just one example: https://www.pdfparser.org/. It can be installed via Composer, and the documentation is at https://www.pdfparser.org/documentation.
If every recipe follows the same sort of template and you want to extract specific details from specific sections of the PDF, it should be straightforward. If you don't mind extracting all of the text from a PDF and simply displaying that on your website, that's also easy enough with one of these libraries. If you go the Go route, you could parse all of the text for each PDF, save it to a file, and then have PHP upload the files and insert everything into custom post types or something similar.
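As a rough sketch of that batch step, here is the same loop-extract-save idea in Java using Apache PDFBox (a different library from the two linked above, used here purely for illustration; the folder name and output scheme are made up):

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    import java.io.File;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class BatchExtract {
        public static void main(String[] args) throws Exception {
            File dir = new File("recipes");                 // folder holding the 500 PDFs
            PDFTextStripper stripper = new PDFTextStripper();
            for (File pdf : dir.listFiles((d, name) -> name.toLowerCase().endsWith(".pdf"))) {
                try (PDDocument doc = PDDocument.load(pdf)) {
                    String text = stripper.getText(doc);    // all text in the PDF, page by page
                    // dump the raw text next to the PDF; a later import step (PHP, WP-CLI, ...)
                    // can turn these files into custom post types
                    Files.write(Paths.get(pdf.getPath().replaceAll("\\.pdf$", ".txt")),
                                text.getBytes("UTF-8"));
                }
            }
        }
    }

If the recipes really do share one template, you can then split each text file on known section headings (Ingredients, Instructions, and so on) before importing.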
I downloaded XnView two years ago and added title/subject/keywords metadata to a bunch of JPEGs. However, I recently compared the keywords from XnView and Adobe Bridge and realized that they are different; I believe the title/subject do not match either. I noticed this because some stock photo agencies do not read the metadata written by XnView and usually leave the title field blank. It looks like XnView uses the XMP format whereas Adobe uses IPTC. Any suggestions on how I can batch-convert the metadata?
Does Adobe software even support IPTC metadata these days? They are the authors of XMP, which is a competing standard, so I wonder. Also, see https://en.wikipedia.org/wiki/Extensible_Metadata_Platform#Support_and_acceptance for a list of software with which you can check the actual metadata content of a particular file (I would suggest http://owl.phy.queensu.ca/~phil/exiftool/ as the most versatile option).
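If you go the ExifTool route, it can both show you exactly what each program wrote and copy the XMP values into their IPTC counterparts in bulk. Something along these lines (the tag names are the standard XMP-dc and IPTC ones, but verify them against your own files first):

    # list every metadata tag, with its group name, for one file
    exiftool -a -G1 -s photo.jpg

    # copy XMP title/description/keywords into the matching IPTC fields
    # for all JPEGs in a folder (add -r to recurse into subfolders)
    exiftool "-IPTC:ObjectName<XMP-dc:Title" \
             "-IPTC:Caption-Abstract<XMP-dc:Description" \
             "-IPTC:Keywords<XMP-dc:Subject" *.jpg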
Trying to create custom types, aspects and properties for Alfresco, I followed the Alfresco Developer Series guide. When I reached the localization section I found out that Alfresco does not handle UTF-8 encoding in the .properties files that you create. Greek characters are not displayed correctly in Share.
Checking out other built-in .properties files (/opt/alfresco-4.0.e/tomcat/webapps/alfresco/WEB-INF/classes/alfresco/messages) I noticed that in Japanese, for example, the characters are in this notation: \u3059\u3079\u3066\u306e...
So, the question is: do I have to convert the Greek words into the above-mentioned notation for Share to display them correctly, or is there another, more elegant, way to do it?
The \u#### form is the Java Unicode escape sequence, and it is used to reference Unicode characters without having to worry about the encoding of the file that stores them.
This question has some information on how to create and decode them.
Another way, which is what Alfresco developers tend to use, is the native2ascii tool that ships with the JDK. With that, you can initially write your strings in a UTF-8 (for example) encoded file, then use the tool to turn them into their escaped form.
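For example (the property key and file names are made up, just to show the shape), a line like this in a UTF-8 encoded file:

    type.mymodel.title=Επαφή

can be converted with:

    native2ascii -encoding UTF-8 custom_el.properties custom_el_escaped.properties

which produces:

    type.mymodel.title=\u0395\u03c0\u03b1\u03c6\u03ae

and that escaped version is what Share will read correctly regardless of the file's encoding.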
Is there a pre-existing library to extract plain text from Open XML files (e.g. docx, pptx, and xlsx)?
I require this to populate a lucene.net index.
I've found this example, which extracts text from docx, and it seems to work okay. But before building my own solution based on it, I was wondering if there's something already available for the other file formats.
Before spending cash, it may be worth looking at the IFilter interface - these were/are designed to do exactly what you want.
http://msdn.microsoft.com/en-us/library/ms691105
http://www.codeproject.com/KB/cs/IFilter.aspx
(There are some links at the bottom of the CodeProject article.)
Microsoft provides IFilters for the Office file types.
http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en
I know that we use this technology to index PDFs with Lucene, but I did not write the actual code, so I cannot be of much use there, I'm afraid.
If your Google-fu is strong I am sure you can dig up more examples of using IFilters to do exactly what you want.
Have a look at aspose.com; they have a good library for handling both ppt and pptx.
You can try Toxy, an open source text/data extraction framework for .NET. For now, it supports xls, xlsx, doc, docx. It will support pptx in version 1.5 very soon.
For details, you can check here.
What my users will do is select a PDF document on their machine and upload it to my website, where I will convert it into an HTML document for display on the site. The document will be stored in a database after conversion.
What's the best way to convert a PDF to HTML?
I have been handed a requirement where a user would create a "news" story as a PDF and then upload it to the server, where it would be converted to HTML and displayed on the website.
Any document creation software that can save documents as PDF can also save them as HTML. I'm assuming the issue is that your users will be creating rich documents (lots of embedded images), which results in multiple files, and that your requirements stem from a desire to make uploading these documents as simple as possible for the user.
There are numerous conversion packages that can probably do this for you, however when you're talking about rich content, you are talking about text plus images. Those images have to be stored somewhere and served somehow, and whatever conversion method you use will require you to examine all image sources to make sure they point to valid locations on your server.
I would like to suggest an alternative approach that you can take to your team: implement one of the many blog APIs for publishing content. There are free and commercial software packages that use these APIs to publish content directly to a website, such as Windows Live Writer and Microsoft Word. Your users can simply create their content and upload it directly to your website without having to publish it as a PDF first and then upload it. The process becomes much smoother for your users, and you get the posts in a form that doesn't require you to spend thousands of dollars developing or buying conversion code.
The two most common APIs are the MetaWeblog API and the Movable Type API. Both are very simple and easy to implement. I think this way would be a MUCH better alternative than what you're thinking about doing.
I don't think converting a PDF to an HTML string is necessarily the best idea, especially if you want to export it back out as a PDF. PDF files often contain binary elements such as images, so you may be better off converting the file to ASCII via an encoding such as Base64. That way you will have an ASCII string you can save into a text field in the DB and then convert back out. Could you expand more on the main requirement?
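If you go that route, the encode/decode step is a one-liner in most languages. For example, in Java (the file names are only placeholders):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Base64;

    public class PdfBase64 {
        public static void main(String[] args) throws Exception {
            // read the uploaded PDF and encode it for storage in a text column
            byte[] pdfBytes = Files.readAllBytes(Paths.get("upload.pdf"));
            String encoded = Base64.getEncoder().encodeToString(pdfBytes);

            // later, decode it back to bytes before streaming it to the browser
            byte[] original = Base64.getDecoder().decode(encoded);
            Files.write(Paths.get("roundtrip.pdf"), original);
        }
    }

Bear in mind that Base64 inflates the stored size by roughly a third, which is one reason a BLOB column (suggested below) can be preferable.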
My recommendation would be to not do it that way IF POSSIBLE (but we all know what managers are like) so...
I would recommend that you stay away from converting the PDF to/from HTML (unless you can find a commercial solution, it will be nigh on impossible). Instead, do as has already been mentioned: store it as a Base64-encoded string, a BLOB, or some other binary format in the database, and then display it to the user with some sort of PDF viewer plugin for the browser.
All it took was a simple Google search for "PDF to HTML": http://www.gnostice.com/pdf2manyOverview_x.asp. I'm sure there are others.
So while it's 'possible', you may want to explain to your manager that this isn't the best content management solution.
Why not use iTextSharp to read the PDF content? Then you could save both the binary PDF and the extracted text content to the database. You could then let users search the text content and download the PDF.
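A sketch of what that extraction looks like. The calls below are from the Java iText 5 API, which iTextSharp mirrors almost name for name; the file name is just a placeholder:

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    public class PdfText {
        public static void main(String[] args) throws Exception {
            PdfReader reader = new PdfReader("story.pdf");  // the uploaded news story
            StringBuilder text = new StringBuilder();
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                // plain text of each page; store this alongside the binary PDF
                text.append(PdfTextExtractor.getTextFromPage(reader, page)).append('\n');
            }
            reader.close();
            System.out.println(text);                       // or write it to the database
        }
    }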
You should look into DynamicPDF. They have a converter (currently in beta) out that serves exactly this purpose. We have used their products with great success (especially for dumping Reporting Services reports directly to PDF).
Ref: http://www.dynamicpdf.com/