Replace text in Word Document via ASP.NET - asp.net

How can I replace a string/word in a Word Document via ASP.NET? I just need to replace a couple words in the document, so I would like to stay AWAY from 3rd party plugins & interop. I would like to do this by opening the file and replacing the text.
The following attempts were made:
I created a StreamReader and Writer to read the file but I think that I am reading and writing in the wrong format. I think that Word Documents are stored in binary?? If word documents are binary, how would I read and write the file in binary?
Dim template As String = Request.MapPath("documentName.doc")
If File.Exists(template) Then
Dim sr As New StreamReader(template)
Dim content As String = sr.ReadToEnd()
sr.Close()
Dim sw As New StreamWriter(template)
content = content.Replace("# T O D A Y S D A T E", Date.Now.ToString("MM/dd/yyyy"))
sw.Write(content)
sw.Close()
Else

Word binary format is proprietary to Microsoft. The specification to read the binary format is complex and will take you ages to learn about the document structure and the internal bit and byte structure. I really dont think you will save yourself anytime going down this path, so consider the below:
Use Open XML
Automate Word
Use third party library like Aspose
Use RTF rather than Doc. You can then look for specific RTF tag with your text and replace it with another set of RTF text block. This is probably the simplest for what you want to do if RTF is an acceptable format.
Personal experience, automating Word isn't as bad as it sounds. It is really not suitable for server high volume environment, but for smaller load, it works well of course if you write your code well to manage the application object and handling exceptions.
EDITED: Corrected about my initial NDA comment mentioned. This was the case when I worked on this back in 2005/6 and didnt realize Microsoft had decided to publish that in the recent year.

Lots of choices:
Some of them expensive (Apose)
Some of them hard (binary formats)
Some of them require Interop (VSTO)
or newer formats (Open XML)
Some of them not mentioned yet, like
running Word on the server and just
writing to that (not recommended by
MSFT, but probably your only real
choice for a) cheap, b) simple
OfficeWriter.

If word documents are binary, how would I read and write the file in binary?
They are, and that's why you should use a third party library to program against them.
I would like to stay AWAY from 3rd party plugins & interop
This requirement makes the task extremely hard. If your documents are in the "old Word format" (.doc), I will almost say that you are out of luck. If you can use Word 2007 documents (.docx) instead, you should be able to solve the problem by unzipping the file (it's essentially a ZIP archive), do search/replace in contained XML files and zip the document up again.
See also: Generating a Word Document with C#

You could perform Word automation on the server to easily do it, but that route is fraught with danger. Automation is not designed to run server side and you will find it regularly hangs when Word pop's up a prompt or confirmation box waiting for input that nobody can see.
You have to make a trade off, use Word automation and accept it may hang pretty regularly (anything from daily to weekly), or buy a third party solution. I use Aspose and it has solved a lot of problems.

Related

How to replace a list of words into a word document automatically?

I have a custom Dictionary I have made in word and it is about 200 words. it contain the common errors that some people mistype and it didn't exist in the original word dictionary.
how to replace any word in my list in any document automatically. I mean whenever word see the wrong word it replace it with the one existed in my list.
I don't want to auto correct every single word every time I open a document.
I used to to something like this some years ago with Windows scripting host and OLE.
So today this would be kind of old fashioned but I bet it still works. However it requires you to have Word installed.
Since Word is remote controllable via COM, you could use nearly every programming language (C++ works natively but is a native nightmare) to remote control it. Then it would be easy to traverse thru and replace the words. Or invoke the search and replace function multiple times.
200 words can easily be held in your source code.
As the "included" feature Windows Scripting host is ok - you can choose between JScript and VB. Just create a .js file and by default it should be executed with WSH. But why don't you use macros (Essentially the same functionality - only seen thru the Word -GUI). But you may also use C# or Delphi....

count words in word file that was uploaded

is there a way so i can count the words in a word file (all versions) in classic asp or asp.net?
what i need is to know how many words and if possible to make an array of word length and how many from each so words of 1,2,3 letters will get less attention from the code later.
i was thinking of using FSO or something like that but that won't work for docx
i can upload the file with aspupload or any other object if needed. if there is an object that can be bought that will upload and count words i don't have a problem purchasing it
thanks in advance
You have several options -
If you can have office installed on the server and don't require this to be an fast solution, you can try Word Interop. See Word count using Microsoft.Office.Interop.Word. A similar option is to have OpenOffice installed and work with that, never did that myself.
You can use the IFilter interface (http://msdn.microsoft.com/en-us/library/ms691105(v=vs.85).aspx). Microsoft already implemented logic to take Word files and give you access to the inner text, so all you'll have to do is count the words. Look at the first answer here Are IFilters necessary to index full text documents using Lucene.NET and the link it provides or How to extract text from MS office documents in C#. You can also look at http://blogs.msdn.com/b/jasonz/archive/2009/08/31/sample-parsing-content-in-c-using-ifilter.aspx
You can use 3rd party tools, I know there are some out there, but I'm not really familiar with any of them. For example see http://www.aspose.com/.net/word-component.aspx
If you don't really need support for ALL word versions, then there are various ways to work with Word 2007+ files - for example - the official openXML or the open source docx
Option (2) seems like the way to go to me.

How to feed Word 2010 (.docx) documents/templates with data from MySQL database?

What would be the best approach to replace placeholders in a .docx document (Word 2010) with data coming from a MySQL database?
Can I just open the file using a server side language and do a string replace per each placeholder?
Is there any existing tool/library available?
Thanks
Disclosure: I work for Invantive.
Using Invantive Composition (http://www.invantive.com/products/invantive-composition) you can fill Word documents (letters, legal pleadings, insurancy policies) with data from a database (IBM DB2, Oracle, MySQL, Teradata and SQL Server) and then fully change the contents at will manually. It is intended for real Microsoft Word end-users (both the guys that make the template and the ones that use it) that access the databases through a central webservice and models with queries. Invantive Composition allows nested repeating groups of data and lay-out. Integrates into Microsoft Word using click once.
In the past, I personally have also been using JasperReports (http://community.jaspersoft.com/project/jasperreports-library) to generate letters using the RTF output target of JasperReports. It is free and works fine as long as you do not want to edit the output more than a few words and have Java/SQL development skills. Just as Invantive Composition it works fine for large numbers of different reports.
As long as you can control the environment completely, you can also consider using RTF as intermediate language (not for end-users, only real developers). Save document as RTF, replace parts of the text you need to be replacable, write a webservice that accepts the parameter and dumps back the resulting RTF. Takes some time to generate more complex tables (tables are obviously something invented by the human race after the RTF specification was written :-) This approach only works with very limited number of templates and when you have sufficient developer time available to get it up and running and stabilized.
As an independent reviewer, I have also seen cases where XML templates were used, but the results were not as good as with JasperReports.
**Disclosure: I lead the docx4j project **
There are heaps of existing tools/libraries available!
Yes, you can just do a string replace, but that is a brittle approach, since Word may have split the string across runs.
You can use MERGEFIELDs, or content control data binding.
docx4j supports all three approaches, but content control data binding is the most powerful.
ContentControlsMergeXML
MERGEFIELDs
VariableReplace
One thing to consider especially is "repeats". If you want say a row of a table in Word, for each matching row in your MySQL table, then you need a way to make this happen.
docx4j does this with a "repeat" content control around the table row; whichever solution you choose, I'd make sure up front that it can handle repeats.
If you want to use PHP the most complete available solution is PHPDocX.
You may check in the tutorial how to substitute placeholder variables by data coming from any data source (like a MySQL DB).
In particular, you may populate table rows with an indefinite number of entries and you may delete whole blocks of the Word document depending on the data fed to the application or build dynamical Word charts.
You may check the available DEMO for a simple but quite illustrative example (its inner workings are explained in the tutorial section).
You can use open Open XML SDK and replace your placeholders like this.
Disclosure: I lead the docxgenjs project
I think you shouldn't have to code everything by yourself, that's why I created a Mustache-like templating engine for docx
Demo:
http://javascript-ninja.fr/docxgenjs/examples/demo.html
Repo
https://github.com/edi9999/docxgenjs
It is JS-based and works client and server side.
Yes, you can use server side language to do it.
Check on apache POI.
http://poi.apache.org
Hello I read the above esp the comments and Ivantive looks impressive - but the solution I needed was much simpler. Use Selection.Range.InsertDatabase in Word to fetch records from an access database or excel spreadsheet or even just another word document. With the access solution you can choose the layout of the records to fetch and have it fetch just particular recordds based on a field (eg ID). Google the words above and it'll take you to MS guidance and an example VB script. Worked well in just a few mins. Now looking for VB script that asks the person what ID they want from the dbase and we're done.
it uses docx templates that have merge fields with java objects (the objects have the information you load from mysql or any other source). The xdoc report is an project for java language, the home page of the project is https://code.google.com/p/xdocreport/.
*Disclosure: I create the templ4docx project *
Hello
You can use templ4docx java library, which is on maven central repository, so you can just add it to your maven dependencies:
<dependency>
<groupId>pl.jsolve</groupId>
<artifactId>templ4docx</artifactId>
<version>2.0.0</version>
</dependency>
Example usage:
Docx docx = new Docx("E:\\template.docx");
Variables variables = new Variables();
variables.addTextVariable(new TextVariable("${firstName}", "John"));
variables.addTextVariable(new TextVariable("${lastName}", "Sky"));
docx.fillTemplate(variables);
docx.save("E:\\filledTemplate.docx");
More details you can find here: http://jsolve.github.io/java/templ4docx/

Programmatically generating editable Word docs from ASP.NET?

The purpose is to generate proposal documents that can manually be edited in Word after the fact, but before sending them out to the customers.
Much proposal content would be drawn from existing HTML website content (backing CMS) and also some custom (non-HTML) injection for certain scenarios. Of course the conditional logic could go into server-side ASP.NET to vary the content appropriately.
I'm open to 3rd-party tools if raw manipulation of the Word API is arduous. In fact a good 3rd party tool might be the answer.
Use the Aspose Words component for .Net.
Aspose Words Component Link
The component natively understands the Microsoft Word file format without having to install any Microsoft Office products on your application environment. You can then start from an existing word template or programatically build up an entire Microsoft Word document from scratch. The Word object model then allows you to export to doc / docx etc and save as a native Word file to wherever you required.
They have plenty of demos set up on their website.
I've not used any third-party tools before, as I've only ever written Office automation applications for PCs which already have Office installed.
Creating documents from scratch, or basing them on a template, is quite straightforward. With templates, you can define bookmarks and mail-merge fields to make finding and replacing document elements easier.
Here's a few things that you may find useful:
Named and Optional Arguments
The Word object model is reasonably easy to work with. VB.NET used to be easier to work with than C#: as the Office automation APIs were originally written with VB in mind, you could take advantage of optional parameters. In earlier versions of C#, you had to specify every argument in API calls, which was quite tedious. I understand that this has changed in Visual C# 2010:
How to: Use Named and Optional Arguments in Office Programming (C# Programming Guide)
http://msdn.microsoft.com/en-us/library/dd264738.aspx
Tutorials
I found these tutorials quite handy:
Automating Office Programs with VB.NET
http://www.xtremevbtalk.com/showthread.php?t=160433
VB.NET Office Automation FAQ
http://www.xtremevbtalk.com/showthread.php?t=160459
Understanding the Word Object Model from a .NET Developer's Perspective
http://msdn.microsoft.com/en-us/library/aa192495%28office.11%29.aspx
Early and Late binding
One point worth mentioning: late-binding is normally recommended against, but it can be very useful if you don't know what version of Office will be deployed on the application's host. Early-binding tends to operate faster, and has the advantage of intellisense in your IDE:
Using early binding and late binding in Automation
http://support.microsoft.com/kb/245115
Early vs. Late Binding
http://word.mvps.org/faqs/interdev/earlyvslatebinding.htm
Search and Replace
One thing to be aware of is that the find and replacement objects may not work as you would expect. Rather than searching the whole document, it searches just the main text. If you have text frames in the document, these will be ignored. Instead, you have to loop through all the StoryRanges, and search the content of each. Here's what I do in VB.NET to search the main text story and text frames:
Private Sub FindReplaceAll(ByVal objDoc As Object, ByVal strFind As String, ByVal strReplacement As String)
Dim rngStory As Object
For Each rngStory In objDoc.StoryRanges
Do
If rngStory.StoryType = wdMainTextStory Or rngStory.StoryType = wdTextFrameStory Then
With rngStory.Find
.Text = strFind
.Replacement.Text = strReplacement
.Wrap = wdFindContinue
.Execute(Replace:=wdReplaceAll)
End With
End If
rngStory = rngStory.NextStoryRange
Loop Until rngStory Is Nothing
Next rngStory
End Sub
StoryRanges Collection Object
http://msdn.microsoft.com/en-us/library/bb178940%28office.12%29.aspx
I have a long history regarding document generation and mail merge. In the old days we were using Office COM extensively even in server side (ASP) applications. In years we have learnt that this approach was causing many problems and today I’m always advocating against using Office COM (Word automation) in almost any scenario.
With the Microsoft’s introduction of Open XML SDK we managed to create a solid mail-merge component that was many times faster and much more robust than the solution(s) with Office COM. In my experience Open XML SDK allows a developer to create a solid solution, but it takes a lot of effort and time to make it useful and robust.
There are several good document generation/processing libraries on the market. We later ended up purchasing one and in my opinion creating your own solution (based on Open XML SDK or Office COM) simply never pays off.
Currently we are using Docentric Toolkit which is a general purpose document processing library and even better template-based/mail-merge toolkit for .NET. It allows template design in MS Word and then populating them with application data and producing final documents in different formats.
You can look into using XSL to generate some WordML.
This technique is definitely convoluted but gives you a lot power in your layout.
You don't need any 3rd party controls to create a Word document. From 2007 and onward Word can read html as a word document. You simply save any web page with the ".doc" extension and Word will sort it out.
Simply create your web page with whatever formatting you want then save it with a .doc extension.
I used HttpWebRequestto call the Url (with parmaters) to my page then used WebResponse and Stream to get my page into a buffer, then StreamReader and StreamWriter to save it to an actual document. I've then got my own custom function to download the file.
If anyone wants my code let me know

Generating keywords from a pdf automatically

My application allows user to upload pdf files and store them on the webserver for later viewing. I store the name of the file, location, size, upload date, user name etc in an SQL server database.
I'd like to be able to programatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the sql database as well so that subsequent users can do keyword searches...
Suggestions on how to approach this task? Does these type of routine already exist?
EDIT: Just to clarify my requirements, I wouldn't be concerned with doing OCR, I don't know the insides' of PDF's, but I understand that if it was generated by an app, such as Word->PDF Print, the text of the document is searchable...so really my first task, and the intent of my question is, how do I access the text of a PDF file from an asp.net app? OCR on scanned PDF's is probably beyond my requirements at this point.
As a first step you should extract all text from the PDF.
ghostscript and pdftotext can do this, the PDFBox is another option.
There are certainly other tools as well.
Then you can remove all stopwords and duplicates and write it to the database.
I has been mentioned that this does not work for scanned PDF documents but this is only half the truth. On the one hand there are lots of scanned PDFs which have text additionally embeded, because that is what some scanners drivers do (Canon CanoScan drivers performs OCR and generate searchable PDFs). On the other hand documents generated with LaTeX that contain non-ASCCII characters return garbage in my experience (even when I copy and paste in acrobat).
The only problem I foresee of grabbing every non-common word is that you'll dilute your search results and have to query the DB for more pdfs. One website to look at is Scribd which does something similar to what you are talking about doing with users uploading files and people being able to view them online via a flash app.
That is very interesting topic. The question is how many keywords do you need to define one PDF. If you say:
3 to 10 - I would check methods of text categorization such as bayesian classifier or K-NN (that method will group PDF files into clusters which are similar). I know that similar algorithms are used to filter spam. But it is a system that need input for example if you add keywords to 100 PDF this system will learn the schemas. I am not an expert but this is one way to do it.
more than 10 - then I would suggest brute force -> filter common words -> get most frequent words for a specific document.
I would explore first option. You must surely check such methods as "text categorization", "auto tagging", "text mining", "automatic keyword extraction".
Some links :
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Keyword Extraction Using Naive Bayes
If you are planning on indexing PDF documents, you should consider using a dedicated text search engine like Lucene. Lucene provides features that will be difficult to implement using only SQL and a relational database. You will still need to extract the text from the PDF documents, but won't have to worry about filtering out common words. By filtering out common words, you will completely lose the ability to do phrase searches.

Resources