Extract Rules from InfoPath Form

Is there a way to automatically grab the rules from an InfoPath 2007 form in their original, human-readable format? They are probably stored in the manifest.xsf file, but they are not human-readable there. Even a commercial tool would be fine. We're looking to build a summary of the rules as they appear in the design form for easy browsing by a maintenance team.

There is the Logic Inspector built in to InfoPath, which displays a page listing all the rules, data validation, calculated default values, etc. It can be found under Tools -> Logic Inspector. Unfortunately, other than pressing the Print button in the top left corner and printing to PDF, I see no other way of exporting this information from within InfoPath.
To get the same sort of information collected into a report format, I suspect that you will have to extract the files from the XSN and pull the default values, rules, etc. out of them yourself.
As for a commercial tool, one possibility is qDabra Rules.
I have not used it at all and am not sure whether it exports the rules to a human-readable format, but it should point you in the right direction (I hope).
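As a rough illustration of the "extract it yourself" route, here is a minimal Python sketch that walks an already-extracted manifest.xsf and prints one summary line per rule. The sample XML, the element and attribute names in it (rule, caption, condition, assignmentAction) and the namespace URI are assumptions made for illustration and should be checked against a real manifest. The code matches on local element names only, so it does not depend on the exact namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down manifest fragment used only for illustration;
# a real manifest.xsf is extracted from the .xsn package and is much larger.
SAMPLE_MANIFEST = """<xsf:xDocumentClass xmlns:xsf="http://schemas.microsoft.com/office/infopath/2003/solutionDefinition">
  <xsf:ruleSets>
    <xsf:ruleSet name="ruleSet_1">
      <xsf:rule caption="Set status" condition="my:approved = 'true'">
        <xsf:assignmentAction targetField="my:status" expression="&quot;Approved&quot;"/>
      </xsf:rule>
    </xsf:ruleSet>
  </xsf:ruleSets>
</xsf:xDocumentClass>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_rules(manifest_xml):
    """Return readable summary lines for every element named 'rule'."""
    root = ET.fromstring(manifest_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "rule":
            continue
        lines.append("Rule: " + element.get("caption", "(no caption)"))
        if element.get("condition"):
            lines.append("  Condition: " + element.get("condition"))
        # Each child of a rule is treated as one action; dump its attributes.
        for action in element:
            details = ", ".join(f"{k}={v}" for k, v in action.attrib.items())
            lines.append(f"  Action: {local_name(action.tag)} ({details})")
    return lines


print("\n".join(summarize_rules(SAMPLE_MANIFEST)))
```

For a real form, the text of manifest.xsf would be passed in place of SAMPLE_MANIFEST. Validation rules and default values live in other parts of the file and would need similar handling.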

Related

officer: this document contains fields that may refer to other files

I'm creating Microsoft Word outputs using the amazing officer package in R. I'm using a template Word document to specify much of the formatting.
Below is a code snippet that illustrates what I'm doing.
library("officer")
library("flextable")  # provides body_add_flextable()
library("magrittr")   # provides the %>% pipe
read_docx("Output Template Blank.docx") %>%
  body_add_par(value = "Kaplan-Meier Analysis of Time (Months) to HSV-2 Episodes", style = "caption") %>%
  body_add_flextable(my_km_table, align = "left") %>%
  print("Kaplan-Meier Output.docx")
This generally works very well. The only trouble is that opening the document generates an unwanted message in Word.
"This document contains fields that may refer to other files. Do you want to update the fields in the document?"
I can get rid of this by clicking "Yes," slightly altering the document by adding a space, and then hitting save. I'd prefer not to have to do this manually, though, and was hoping there is a better way. I investigated this some time back and recall that there is a way to turn this off in Microsoft Word. I also recall that this was seen as something of a security risk, though I'm not sure how much of a risk. So I am wondering whether this could be a good solution or whether there truly is a better way.
This is a deliberate design decision for security reasons. Some kinds of Word fields can access external data. Microsoft's policy on this point is that responsibility for opening a document (and taking a risk) lies with the user - the user needs to decide whether the document comes from a trusted source.
For this reason, if the fields are set to automatically update, a message will be displayed asking the user whether to allow the update.
It's possible to insert fields and not set the automatic update. In this case, the user will need to manually update the fields or there could be an add-in that takes care of this when any document is opened. Since the user will have made the choice to install the add-in, that's again the user's responsibility.
The only other way to suppress the message is to have opened the document and updated the fields before passing it to the user. Programmatically, this could be done either using Word automation (not server-side) or in an on-premises version of SharePoint that has the Word Automation Services installed.

Is InfoPath right for this purpose?

I'm currently looking for a way to make a dynamic checklist-type document for my job to be used for software upgrades. Right now, we have a generic Word checklist that has all the steps for upgrading a client's software, but due to its nature, not all steps apply to each client, and to list all possible options would make it difficult to navigate and difficult to use, which goes against its purpose.
What I'm looking for is a way to input information (checkboxes, drop-downs, and text fields), and based on that information, produce a list of tasks in some format that is user-readable. For example, if I check one box to indicate that they have a certain feature installed, then add 3 items to the task list.
Is InfoPath the right tool for the job, or am I barking up the wrong tree?
From what you describe, I'd say InfoPath is a very good choice for your project. My first thought would be to work in two different views. The first view would be for your people to input the information about what features are installed (there can be hidden content that only shows if certain answers are given, making it less unwieldy than your Word form). Then I'd have another view designed for printing out and giving to the client, containing only the task list info derived from the data in the first view. Bark away!

Generating keywords from a pdf automatically

My application allows users to upload PDF files and stores them on the web server for later viewing. I store the name of the file, location, size, upload date, user name, etc. in a SQL Server database.
I'd like to be able to programmatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the SQL database as well, so that subsequent users can do keyword searches...
Any suggestions on how to approach this task? Does this type of routine already exist?
EDIT: Just to clarify my requirements, I wouldn't be concerned with doing OCR. I don't know the insides of PDFs, but I understand that if one was generated by an app, such as Word->PDF Print, the text of the document is searchable. So really my first task, and the intent of my question, is: how do I access the text of a PDF file from an ASP.NET app? OCR on scanned PDFs is probably beyond my requirements at this point.
As a first step you should extract all text from the PDF.
Ghostscript and pdftotext can do this; PDFBox is another option.
There are certainly other tools as well.
Then you can remove all stopwords and duplicates and write the result to the database.
It has been mentioned that this does not work for scanned PDF documents, but that is only half the truth. On the one hand, there are lots of scanned PDFs which additionally have text embedded, because that is what some scanner drivers do (the Canon CanoScan drivers perform OCR and generate searchable PDFs). On the other hand, documents generated with LaTeX that contain non-ASCII characters return garbage in my experience (even when I copy and paste in Acrobat).
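The extract-then-filter steps described above could look roughly like the following Python sketch. It assumes the pdftotext command-line tool is installed and on the PATH. The stopword list, the function names and the minimum word length are illustrative choices, not part of any library.

```python
import re
import subprocess

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "with"}


def extract_text(pdf_path):
    """Run pdftotext (assumed to be on PATH); "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def unique_keywords(text):
    """Lower-case the text, drop stopwords and very short words, remove duplicates."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return sorted({w for w in words if w not in STOPWORDS and len(w) > 2})


print(unique_keywords("The quick fox and the lazy dog in the quick race"))
# → ['dog', 'fox', 'lazy', 'quick', 'race']
```

In the upload handler, the result of unique_keywords(extract_text(path)) would then be written to a keyword table keyed by the file's row ID.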
The only problem I foresee with grabbing every non-common word is that you'll dilute your search results and have to query the DB for more PDFs. One website to look at is Scribd, which does something similar to what you describe: users upload files and other people can view them online via a Flash app.
That is a very interesting topic. The question is how many keywords you need to describe one PDF. If you say:
3 to 10 - I would look at text categorization methods such as a Bayesian classifier or k-NN (these methods group similar PDF files together). I know that similar algorithms are used to filter spam. But such a system needs training input: for example, if you add keywords to 100 PDFs by hand, the system will learn the patterns. I am not an expert, but this is one way to do it.
more than 10 - then I would suggest brute force -> filter common words -> get the most frequent words for a specific document.
I would explore the first option. You should definitely look into methods such as "text categorization", "auto tagging", "text mining", and "automatic keyword extraction".
Some links:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Keyword Extraction Using Naive Bayes
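The brute-force option from the list above (filter common words, then keep the most frequent ones) can be sketched in a few lines of Python. The common-word list and the function name are placeholders chosen for illustration.

```python
import re
from collections import Counter

# Illustrative only; a production list of common words would be far larger.
COMMON_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on"}


def top_keywords(text, limit=10):
    """Return up to `limit` most frequent non-common words in the text."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    return [word for word, _ in counts.most_common(limit)]


print(top_keywords("The invoice total and the invoice due date for the invoice total", limit=2))
# → ['invoice', 'total']
```

Raw frequency favours words that are common in every document, so a refinement such as weighting by how rare a word is across all uploaded PDFs may give better keywords.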
If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that will be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but won't have to worry about filtering out common words. By filtering out common words, you will completely lose the ability to do phrase searches.
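To make the phrase-search point concrete, here is a small plain-Python illustration. It is not Lucene's API, just a toy positional index with made-up document names. It shows that a phrase made entirely of common words can only be found if those words and their positions were kept.

```python
STOPWORDS = {"to", "be", "or", "not", "the", "of"}


def tokenize(text):
    return text.lower().split()


def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} so word order is preserved."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return the documents in which the words of `phrase` appear consecutively."""
    words = tokenize(phrase)
    if not words or any(w not in index for w in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            if all(start + i in index[w].get(doc_id, []) for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "hamlet.pdf": "to be or not to be that is the question",
    "report.pdf": "the question of budget",
}
index = build_positional_index(docs)
print(phrase_search(index, "to be or not"))  # → {'hamlet.pdf'}
# With stopwords stripped beforehand, nothing of the phrase is left to search for:
print([w for w in tokenize("to be or not") if w not in STOPWORDS])  # → []
```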

Printing a Calendar or Diary from ASP.NET Application

We have an ASP.NET application that uses the Infragistics WebSchedule control to display appointments, etc. in the same manner as Outlook. The problem we have is that the customer wants to be able to print the page as it appears on the screen, which the control itself does not appear to support directly.
We have developed a Crystal Report that does a fair job, but it is pretty complicated and just a little bit flaky (it does not stretch to accommodate all of the appointments for a particular day, so if there are too many they spill over). Basically we have bullied Crystal into doing something it is not really meant to do: render a graphical representation of a diary rather than list the data in a tabular manner.
Does anyone have a better alternative to this?
Thanks in advance
DayPilot Pro (our product) supports PNG export that allows easy calendar/schedule printing (almost a pixel-by-pixel copy of the HTML control).
It works for both the Calendar (traditional Outlook-like day/week view):
http://www.daypilot.org/demo/Calendar/
and for the Scheduler (showing a time line for multiple resources):
http://www.daypilot.org/demo/Scheduler/
Try the "Print/export" button below the controls.
Well in the end I decided to junk the Crystal Report in this instance. It's fine for tabular data and graph data but not really suitable for a graphical representation of a diary/scheduler.
I opted for an XML/XSLT solution which has turned out better than I expected - especially in terms of speed.
I was able to generate an XML stream and depending on the date range feed it to a suitable XSL template which produced a Weekly or Monthly view of the report. A colleague sprinkled some CSS over it and we're sorted.

In ASP.NET what is the best way to convert a PDF file to HTML?

What my users will do is select a PDF document on their machine, upload it to my website, where I will convert into an HTML document for display on the website. The document will be stored in a database after conversion.
What's the best way to convert a PDF to HTML?
I have been handed a requirement where a user would create a "news" story as a PDF and then upload it to the server, where it will be converted to HTML and displayed on the website.
Any document creation software that can save documents as PDF can save them as HTML. I'm assuming the issue is that your users will be creating rich documents (lots of embedded images), which results in multiple files, and your requirements stem from a desire to make uploading these documents as simple as possible to the user.
There are numerous conversion packages that can probably do this for you, however when you're talking about rich content, you are talking about text plus images. Those images have to be stored somewhere and served somehow, and whatever conversion method you use will require you to examine all image sources to make sure they point to valid locations on your server.
I would like to suggest an alternative way of doing this that you can take to your team: implement one of the many blog APIs for publishing content. There are free and commercial software packages that use these APIs to publish content directly to a website, such as Windows Live Writer and Microsoft Word. Your users can simply create their content and upload it directly to your website without having to publish it as PDF first and then upload it. So the process becomes much smoother for your users, and you get the posts in a form that doesn't require you to spend thousands of dollars on developing or buying conversion code.
The two most common APIs are the MetaWeblog API and the Movable Type API. Both are very simple and easy to implement. I think this way would be a MUCH better alternative than what you're thinking about doing.
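For a sense of how small the MetaWeblog API is, the following Python sketch builds the XML-RPC request that a client such as Windows Live Writer would send for metaWeblog.newPost. Nothing is sent over the network here. The blog ID, credentials and content are placeholder values, and the helper name is made up for illustration. Your site would need to expose an XML-RPC endpoint that accepts this call.

```python
import xmlrpc.client


def build_new_post_request(blog_id, username, password, title, html_body, publish=True):
    """Serialize a metaWeblog.newPost call (blogid, username, password, post struct, publish flag)."""
    post = {"title": title, "description": html_body}  # 'description' carries the post body
    params = (blog_id, username, password, post, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")


# Placeholder values for illustration only.
request_xml = build_new_post_request("1", "editor", "secret", "Hello", "<p>First news story</p>")
print(request_xml)
```

On the server side, handling this call comes down to parsing the struct, saving the title and HTML body, and returning the new post's ID as a string.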
I don't think converting a PDF to an HTML string is necessarily the best idea, especially if you want to export it back as PDF. PDF files often contain binary elements such as images, so you may be best to convert it to ASCII via an encoding, such as Base64. That way you will have an ASCII string you can save into a text field in the DB and then convert it back out. Could you expand more on the main requirement?
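The Base64 round trip mentioned above is short in most languages. A Python sketch with illustrative function names follows; the sample bytes are a stand-in, not a valid PDF.

```python
import base64


def pdf_bytes_to_text(pdf_bytes):
    """Encode raw PDF bytes as an ASCII string suitable for a text column."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def text_to_pdf_bytes(encoded):
    """Decode the stored string back into the original bytes."""
    return base64.b64decode(encoded)


sample = b"%PDF-1.4\n\x00\xff stand-in binary content"
stored = pdf_bytes_to_text(sample)
assert text_to_pdf_bytes(stored) == sample  # lossless round trip
```

Base64 inflates the data by roughly a third, so a binary (BLOB) column is usually the more economical choice when the database supports one.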
My recommendation would be to not do it that way IF POSSIBLE (but we all know what managers are like) so...
I would recommend that you stay away from converting the PDF to/from HTML (because unless you can find a commercial solution it will be nigh on impossible) and instead do as has already been mentioned and store it as an encoded Base64 string, or BLOB or some other binary format in the database, and then display it to the user with some sort of PDF view plugin for the browser.
All it took was a simple Google search for "PDF to HTML": http://www.gnostice.com/pdf2manyOverview_x.asp. I'm sure there are others.
So while it's 'possible', you may want to explain to your manager that this isn't the best content management solution.
Why not use iTextSharp to read the PDF content? Then you could save both the binary PDF and the text content to the database. You could then let users search the content and download the PDF.
You should look into DynamicPDF. They have a converter (currently in beta) for exactly this purpose. We have used their products with great success (especially for dumping Reporting Services reports directly to PDF).
Ref: http://www.dynamicpdf.com/
