Is it possible to get the pages of a chapter in iTextSharp, or to somehow split the file on chapters, in VB.NET? Basically, I'm concatenating multiple files into one PDF and separating them as chapters, but sometimes I'll need to read those chapters out separately. Is this possible?
I used this example with great success
Merge PDF files with IText# And .Net
The trick is to keep your pages in memory streams so that you can a) write them out as individual documents and b) hand them to the merger to combine into one file.
It is fast, too: I have code that can produce hundreds of documents, in both merged and singular form, in less than a minute.
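A minimal sketch of that approach, assuming iTextSharp 5.x (the MergePdfs helper and its signature are mine, not from the linked example): each chapter lives in a byte array backed by a MemoryStream, so it can be saved on its own or handed to the merger.

    using System.Collections.Generic;
    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.pdf;

    // Merge several in-memory PDFs (one per chapter) into one bundle.
    static byte[] MergePdfs(IEnumerable<byte[]> chapters)
    {
        using (var output = new MemoryStream())
        {
            var document = new Document();
            var copy = new PdfCopy(document, output);
            document.Open();
            foreach (var chapter in chapters)
            {
                var reader = new PdfReader(chapter);
                // copy every page of this chapter into the bundle
                for (int page = 1; page <= reader.NumberOfPages; page++)
                    copy.AddPage(copy.GetImportedPage(reader, page));
                reader.Close();
            }
            document.Close();
            return output.ToArray();  // still valid after the stream is closed
        }
    }

Each byte[] can just as easily be written straight to disk as its own chapter file, which covers the "read these chapters out separately" case.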
For reporting in our SaaS (LAMP) product we are currently using JasperReports. We find it too cumbersome to develop reports with, and the Word output unworkable. Moreover, a couple of customers have requested the ability to develop simple reports themselves (to be used as mail merge). We would therefore like to develop templates right in Word. The idea is to have an application/webservice that receives the Word template and JSON data from the LAMP application and returns the filled-in report. The report has to support:
Loops inside content (repeating a document section several times while filling in array data)
Filling in tables (populating rows from array)
Filling in chart data in pre-created charts (from array)
This is the functionality we are using in JasperReports right now. Are there existing solutions for this? I've found quite a lot that can substitute simple variables, but no info about the above three points. Will it be a lot of effort to write one from scratch? I would prefer a Windows OpenXML-based solution over a Linux PHPOffice-based one, as I presume the former would handle text split up by the spell-checker and language tags (though I'm not sure).
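For reference, naive substitution with the Open XML SDK looks roughly like the sketch below (the {{Name}}-style placeholders and the helper name are just for illustration). It only works while each placeholder sits inside a single run, which is exactly what spell-checking and language tags tend to break:

    using System.Collections.Generic;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    // Naive placeholder substitution: fails when Word splits a
    // {{placeholder}} across multiple runs (spell-check, language tags).
    static void FillTemplate(string path, IDictionary<string, string> data)
    {
        using (var doc = WordprocessingDocument.Open(path, true))
        {
            foreach (var text in doc.MainDocumentPart.Document.Body.Descendants<Text>())
            {
                foreach (var pair in data)
                    text.Text = text.Text.Replace("{{" + pair.Key + "}}", pair.Value);
            }
            doc.MainDocumentPart.Document.Save();
        }
    }

Loops, table rows, and chart data would each need considerably more than this.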
Windward and Docmosis are both commercial products that support the features you've listed, and they are intended to be added to your application to provide reporting capabilities. Neither is OpenXML based. They can use Word documents as templates and perform the data merge into different output formats. Please note I work for Docmosis.
Aspose Words is another tool; it can populate a template, but most of its power is exposed through code rather than through controls/directives in the template. Given your OpenXML thoughts, perhaps this is more what you are looking for.
More tools are recommended here on StackExchange.
I hope that helps.
ReportBox is a web-based reporting solution that can be used by any software application to generate documents and reports in Microsoft Word / Excel / PowerPoint / HTML (DocX / XLSX / PPTX / HTML) using OpenXML.
The process starts by building a Microsoft Word / Excel / PowerPoint / HTML document as a template and uploading it to the ReportBox portal. Your application either sends data to ReportBox, or ReportBox pulls data from your application database; the data is then merged with the template to produce the finished report. Please note that I work for GreenThoughts.
I am trying to build a system in C#.NET / MVC / ASP.NET where we have thousands of documents in .doc, .xls, .pdf, .txt, etc.; they are movie / serial scripts for subtitling and dubbing.
I have to extract the actual content, i.e. the dialogue, exclude unwanted text from all the templates, and count the number of lines / paragraphs spoken by each character in a single script.
The issue is that there is no predefined / concrete format for these documents, and we can't impose one either, since they come from various countries / states and each has a different way of writing scripts.
If someone has already developed this type of system, or has used any third-party open-source or paid API for it, I would be really grateful to hear about it.
Thanks in advance.
I am working with PDF files in VB.NET. I want to compare two PDF files.
Is there any DLL that can be used for this purpose?
You can use ABCpdf to do a lot of things with PDFs. It works very well.
However, there is an inherent problem with comparing PDFs, as they are not always very well structured. But, for example, if your PDFs have form fields, you can quite easily compare the form field values with ABCpdf.
Here is a post that shows how to get the text from a PDF with ABCpdf:
https://stackoverflow.com/a/10998043/392362
If you want to compare for an exact match, you could simply calculate and compare their SHA-1 checksums.
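For example (the helper name is mine), a byte-for-byte comparison via SHA-1 might look like this; note it only catches files that are byte-identical, since two PDFs with the same visible content can still differ in metadata such as timestamps:

    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    // True only if the two files are byte-for-byte identical.
    static bool AreIdentical(string pathA, string pathB)
    {
        using (var sha1 = SHA1.Create())
        {
            byte[] hashA, hashB;
            using (var a = File.OpenRead(pathA)) hashA = sha1.ComputeHash(a);
            using (var b = File.OpenRead(pathB)) hashB = sha1.ComputeHash(b);
            return hashA.SequenceEqual(hashB);
        }
    }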
I am trying to merge multiple files (PDF and TIF) into one PDF bundle. I want to print all the documents from one workspace. Any suggestions on how to do this with Alfresco?
Thanks,
Rene
There is no PDF merging out of the box in Alfresco. You should have a look at Jared Ottley's PDF Toolkit. It implements Alfresco actions for working with PDF documents, merging PDFs being one of them. Personally I have not used it, and it looks a bit dated, but it should get you started.
My application allows users to upload PDF files and stores them on the web server for later viewing. I store the name of the file, location, size, upload date, user name, etc. in a SQL Server database.
I'd like to be able to programmatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the SQL database as well, so that subsequent users can do keyword searches...
Any suggestions on how to approach this task? Do routines of this type already exist?
EDIT: Just to clarify my requirements: I wouldn't be concerned with doing OCR. I don't know the insides of PDFs, but I understand that if a PDF was generated by an app, such as a Word-to-PDF print, the text of the document is searchable. So really my first task, and the intent of my question, is: how do I access the text of a PDF file from an ASP.NET app? OCR on scanned PDFs is probably beyond my requirements at this point.
As a first step you should extract all text from the PDF.
Ghostscript and pdftotext can do this; PDFBox is another option.
There are certainly other tools as well.
Then you can remove all stopwords and duplicates and write the result to the database.
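As a rough .NET sketch, using iTextSharp's text extractor (another option alongside the tools above; the helper name and the word pattern are illustrative, and the stopword list is supplied by the caller):

    using System.Collections.Generic;
    using System.Text.RegularExpressions;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    // Extract the unique words of a PDF, skipping stopwords.
    static IEnumerable<string> ExtractKeywords(string path, ISet<string> stopwords)
    {
        var words = new HashSet<string>();  // the set drops duplicates
        var reader = new PdfReader(path);
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            string text = PdfTextExtractor.GetTextFromPage(reader, page);
            foreach (Match m in Regex.Matches(text, "[A-Za-z]{3,}"))
            {
                string word = m.Value.ToLowerInvariant();
                if (!stopwords.Contains(word))
                    words.Add(word);
            }
        }
        reader.Close();
        return words;
    }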
It has been mentioned that this does not work for scanned PDF documents, but that is only half the truth. On the one hand, there are lots of scanned PDFs which do have text embedded as well, because that is what some scanner drivers do (Canon CanoScan drivers, for instance, perform OCR and generate searchable PDFs). On the other hand, documents generated with LaTeX that contain non-ASCII characters return garbage in my experience (even when I copy and paste in Acrobat).
The only problem I foresee with grabbing every non-common word is that you'll dilute your search results and have to query the DB for more PDFs. One website to look at is Scribd, which does something similar to what you are talking about: users upload files and people can view them online via a Flash app.
That is a very interesting topic. The question is how many keywords you need to define one PDF. If you say:
3 to 10 - I would check text-categorization methods such as a Bayesian classifier or k-NN (which will group similar PDF files into clusters). I know that similar algorithms are used to filter spam. But it is a system that needs training input: for example, if you manually add keywords to 100 PDFs, the system will learn the schemas. I am not an expert, but this is one way to do it.
more than 10 - then I would suggest brute force: filter common words, then take the most frequent words of a specific document (see the sketch at the end of this answer).
I would explore the first option. You should definitely look into methods such as "text categorization", "auto tagging", "text mining", and "automatic keyword extraction".
Some links:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Keyword Extraction Using Naive Bayes
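The brute-force variant from the second option above could be as small as this sketch (the helper and its signature are mine):

    using System.Collections.Generic;
    using System.Linq;

    // Brute force: drop common words, rank the rest by frequency,
    // keep the top N as the document's keywords.
    static IEnumerable<string> TopKeywords(IEnumerable<string> words,
                                           ISet<string> stopwords, int n)
    {
        return words
            .Select(w => w.ToLowerInvariant())
            .Where(w => !stopwords.Contains(w))
            .GroupBy(w => w)
            .OrderByDescending(g => g.Count())
            .Take(n)
            .Select(g => g.Key);
    }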
If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that will be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but won't have to worry about filtering out common words. By filtering out common words, you will completely lose the ability to do phrase searches.
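For instance, feeding the extracted text into Lucene.Net could look roughly like this (sketched against the Lucene.Net 3.0.3 API; the method and the extractedText parameter are placeholders for text already pulled out of the PDF):

    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;
    using Lucene.Net.Util;

    // Add one PDF's extracted text to a Lucene index; Lucene then
    // handles stopwords, ranking and phrase queries at search time.
    static void IndexPdfText(string indexPath, string pdfPath, string extractedText)
    {
        var dir = FSDirectory.Open(new DirectoryInfo(indexPath));
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var writer = new IndexWriter(dir, analyzer,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("path", pdfPath, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("content", extractedText, Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
    }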