I am working with PDF files in VB.NET and want to compare two of them.
Is there a DLL that can be used for this purpose?
You can use ABCpdf to do a lot of things with PDFs. It works very well.
However, there is an inherent problem with comparing PDFs, as they are not always very well structured. But, for example, if your PDFs have form fields, you can quite easily compare the form field values with ABCpdf.
Here is a post that shows how to get the text from a PDF with ABCpdf:
https://stackoverflow.com/a/10998043/392362
If you want to check for an exact match, you could simply calculate and compare the files' SHA-1 checksums.
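As a rough illustration of the checksum idea, here is a minimal sketch in Python (chosen for brevity; the same approach carries over to .NET). The function names `file_sha1` and `files_identical` are made up for this example.

```python
import hashlib


def file_sha1(path, chunk_size=65536):
    """Return the hex SHA-1 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """True when both files have the same SHA-1 digest (byte-for-byte match)."""
    return file_sha1(path_a) == file_sha1(path_b)
```

Note that this only detects byte-identical files. Two PDFs that look the same on screen can still differ in bytes (for example, in embedded metadata), in which case the checksums will not match.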
I need your help with a problem. I have a page with a simple embed that displays a PDF file.
I got a request to add another PDF file to the same embed (or at least to do something that looks like it).
I searched for solutions and, not finding a simple one, I'm thinking about using iTextSharp to merge both files: getting their streams from their URLs, merging them into a new PDF file, and displaying the resulting file in the embed.
But that seems like a bit much for such a simple modification... So I'm asking here whether someone has a better idea. From what I found on Stack Overflow and Google, it looks like I will have to go with the merge solution, but hey, you never know '^^
A simpler option would be to merge the two PDF files using either a free online tool or Adobe's Combine Files option, and then add the newly combined PDF to your site. Unless I am missing something, there is no real reason or benefit to doing this in code.
I have a report that gets displayed in a report viewer. I would like to be able to export the report to Excel and have the exported data show up as numeric instead of text.
Is there a way to do this?
If you mean having the rows/columns that contain numbers formatted in Excel as 'Number' instead of 'General' upon export, I doubt it. The reason is that the report doesn't save the data type; the designer does (the RD: tags in the report's definition are all for the report designer specifically and can be deleted without harming the loading of reports). Therefore, Excel would have no real way of knowing whether you wanted a value formatted as a number.
I'm guessing there might be a way around this with Excel macros or something, but for simplicity's sake I am pretty certain it isn't possible.
(I would have posted this as a comment but for some reason you need 50 rep to post comments?)
Is it possible to convert a PostScript (PS) file into a Word (.doc) file using ASP.NET? If yes, how can it be done in C# code?
I don't know of any tool that will convert PostScript to Word. Not only that, but you certainly can't reliably do anything except render the whole thing to an image and insert that as a graphic.
Up to a point you can extract text. What is it you actually want to do?
My application allows users to upload PDF files and stores them on the web server for later viewing. I store the name of the file, location, size, upload date, user name, etc. in a SQL Server database.
I'd like to be able to programmatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the SQL database as well so that subsequent users can do keyword searches...
Any suggestions on how to approach this task? Does this type of routine already exist?
EDIT: Just to clarify my requirements: I'm not concerned with doing OCR. I don't know the insides of PDFs, but I understand that if one was generated by an app, such as Word -> PDF print, the text of the document is searchable. So really my first task, and the intent of my question, is: how do I access the text of a PDF file from an ASP.NET app? OCR on scanned PDFs is probably beyond my requirements at this point.
As a first step you should extract all text from the PDF.
Ghostscript and pdftotext can do this; PDFBox is another option.
There are certainly other tools as well.
Then you can remove all stopwords and duplicates and write it to the database.
It has been mentioned that this does not work for scanned PDF documents, but this is only half the truth. On the one hand, there are lots of scanned PDFs that additionally have text embedded, because that is what some scanner drivers do (Canon CanoScan drivers perform OCR and generate searchable PDFs). On the other hand, documents generated with LaTeX that contain non-ASCII characters return garbage in my experience (even when I copy and paste in Acrobat).
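The stopword and duplicate removal step from this answer could look roughly like the following Python sketch. It assumes the text has already been extracted from the PDF by one of the tools above; the tiny `STOPWORDS` set and the `extract_keywords` name are placeholders for illustration, and a real list would be much longer.

```python
import re

# Illustrative stopword list only; a real application would use a fuller one.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "in", "is", "it", "of", "on", "or", "that", "the", "to", "with",
})


def extract_keywords(text, stopwords=STOPWORDS):
    """Return the distinct non-stopword words in `text`, in first-seen order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    seen = set()
    keywords = []
    for word in words:
        if word in stopwords or word in seen:
            continue
        seen.add(word)
        keywords.append(word)
    return keywords
```

Each returned keyword could then be written to the database as one row linked to the uploaded file.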
The only problem I foresee with grabbing every non-common word is that you'll dilute your search results and have to query the DB for more PDFs. One website to look at is Scribd, which does something similar to what you describe: users upload files and people can view them online via a Flash app.
That is a very interesting topic. The question is how many keywords you need to describe one PDF. If you say:
3 to 10 - I would look at text categorization methods such as a Bayesian classifier or k-NN (that method will group similar PDF files into clusters). I know that similar algorithms are used to filter spam. But such a system needs training input: for example, if you add keywords to 100 PDFs, the system will learn the patterns. I am not an expert, but this is one way to do it.
more than 10 - then I would suggest brute force: filter common words, then take the most frequent words for each document.
I would explore the first option. You should definitely look into topics such as "text categorization", "auto tagging", "text mining", and "automatic keyword extraction".
Some links:
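The brute-force route (filter common words, then keep the most frequent ones) might be sketched in Python as below. The `COMMON_WORDS` set and the `top_keywords` name are assumptions made for this example, not part of any library.

```python
import re
from collections import Counter

# Illustrative list of common words; extend it for real use.
COMMON_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "in", "is", "it", "of", "on", "or", "that", "the", "to", "with",
})


def top_keywords(text, limit=10, common_words=COMMON_WORDS):
    """Return up to `limit` of the most frequent non-common words in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in common_words)
    return [word for word, _ in counts.most_common(limit)]
```

For example, `top_keywords(document_text, limit=15)` would give the fifteen most frequent remaining words of one document, which can then be stored as its keywords.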
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Keyword Extraction Using Naive Bayes
If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that will be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but won't have to worry about filtering out common words. By filtering out common words, you will completely lose the ability to do phrase searches.
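To see why dropping common words breaks phrase searches, here is a toy positional index in Python. It is only an illustration of the idea and is not Lucene's API; `tokenize`, `build_positional_index`, and `phrase_search` are hypothetical names for this sketch.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase `text` and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(documents):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return the ids of documents containing the words of `phrase` consecutively."""
    words = tokenize(phrase)
    if not words:
        return set()
    matches = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every following word must appear at the next position in the same document.
            if all(
                start + offset in index.get(word, {}).get(doc_id, ())
                for offset, word in enumerate(words[1:], start=1)
            ):
                matches.add(doc_id)
                break
    return matches
```

A phrase such as "to be" consists entirely of common words, so an index built after stopword filtering would have nothing left to match against it.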
What my users will do is select a PDF document on their machine and upload it to my website, where I will convert it into an HTML document for display on the website. The document will be stored in a database after conversion.
What's the best way to convert a PDF to HTML?
I have been handed a requirement where a user would create a "news" story as a PDF and then upload it to the server, where it will be converted to HTML and displayed on the website.
Any document creation software that can save documents as PDF can save them as HTML. I'm assuming the issue is that your users will be creating rich documents (lots of embedded images), which results in multiple files, and your requirements stem from a desire to make uploading these documents as simple as possible to the user.
There are numerous conversion packages that can probably do this for you, however when you're talking about rich content, you are talking about text plus images. Those images have to be stored somewhere and served somehow, and whatever conversion method you use will require you to examine all image sources to make sure they point to valid locations on your server.
I would like to suggest an alternate way of doing this that you can take to your team: Implement one of the many blog APIs for publishing content. There are free and commercial software packages that use these APIs to publish content directly to a website, such as Windows Live Writer and Microsoft Word. Your users can simply create their content and upload it directly to your website without having to publish it as PDF first then upload it. So the process becomes much smoother for your users, and you get the posts in a form that doesn't require you spend thousands of dollars on developing or buying conversion code.
The two most common APIs are the MetaWeblog API and the Movable Type API. Both are very simple and easy to implement. I think this way would be a MUCH better alternative than what you're thinking about doing.
I don't think converting a PDF to an HTML string is necessarily the best idea, especially if you want to export it back as PDF. PDF files often contain binary elements such as images, so you may be best to convert it to ASCII via an encoding, such as Base64. That way you will have an ASCII string you can save into a text field in the DB and then convert it back out. Could you expand more on the main requirement?
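A minimal Python sketch of that Base64 round trip follows. The function names are invented for this example, and the database write itself is left out.

```python
import base64


def encode_pdf_bytes(pdf_bytes):
    """Encode raw PDF bytes as an ASCII Base64 string suitable for a text column."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def decode_pdf_text(encoded):
    """Decode a Base64 string back into the original PDF bytes."""
    # validate=True rejects characters outside the Base64 alphabet instead of skipping them.
    return base64.b64decode(encoded, validate=True)
```

Keep in mind that Base64 text is about a third larger than the original bytes, so a binary column avoids that overhead if the database offers one.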
My recommendation would be to not do it that way IF POSSIBLE (but we all know what managers are like) so...
I would recommend that you stay away from converting the PDF to/from HTML (because unless you can find a commercial solution it will be nigh on impossible) and instead do as has already been mentioned and store it as an encoded Base64 string, or BLOB or some other binary format in the database, and then display it to the user with some sort of PDF view plugin for the browser.
All it took was a simple google search for "PDF to HTML": http://www.gnostice.com/pdf2manyOverview_x.asp. I'm sure there are others.
So while it's 'possible', you may want to explain to your manager that this isn't the best content management solution.
Why not use iTextSharp to read the PDF content? Then you could save both the binary PDF and the text content to the database. You could then let users search the content and download the PDF.
You should look into DynamicPDF. They have a converter (currently Beta) out for serving exactly this purpose. We have used their products with great success (especially for dumping Reporting Services reports directly to PDF).
Ref: http://www.dynamicpdf.com/