I am working on an ASP.NET project that needs to fill in a Word document. My client provides a Word template with last name, first name, birth date, etc. I have all of that information in a SQL database, and the client wants users of the application to be able to download the Word document with the information from the database filled in.
What's the best way to achieve this? Basically, I need to identify the "fillable spots" in the Word document and fill in the information when the application user clicks the download button.
If you can use Office 2007, the way to go is to use the Open XML API to build the documents. The reason you have to go that route is that you can't really use Word Automation in a server environment, as Microsoft explains here: http://support.microsoft.com/kb/257757. (You CAN, but it's a huge pain to get working properly, and it can EASILY break.)
If you can't go the 2007 route, I've actually had pretty good success with just opening a Word template as a stream, finding and replacing the tokens, and serving the result to the user. This has worked surprisingly well in my experience, and it's REALLY simple to implement.
I'm not sure about some of the ASP.NET aspects, but I am working on something similar and you might want to look into using RTF instead. You can use pattern replacement in the RTF. For example, you can add a tag like {USER_FIRST_NAME} to the RTF document. When the user clicks the download button, your application can take the information from the database and replace every instance of {USER_FIRST_NAME} with the data. I am currently doing this with PHP and it works great. Word opens RTF without a problem, which is another reason I chose this method.
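A rough C# equivalent of that idea (the template path, placeholder names, and user object are made-up assumptions):
// RTF is plain text, so a simple string replacement is enough.
// Caveat: escape '\', '{' and '}' in the data before inserting it into RTF.
string rtf = File.ReadAllText(Server.MapPath("~/Templates/letter.rtf"));
rtf = rtf.Replace("{USER_FIRST_NAME}", user.FirstName)
         .Replace("{USER_LAST_NAME}", user.LastName);
Response.ContentType = "application/rtf";
Response.AddHeader("Content-Disposition", "attachment; filename=letter.rtf");
Response.Write(rtf);
Response.End();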
I have used Aspose.Words for .NET. It's a little on the pricey side, but it works extremely well and the API is fairly intuitive for something that is potentially very complex.
If you want to pre-design your documents (or allow others to do that for you), anyone can put fields into the document. Aspose can open the document, find and fill the fields, and save a new filled-out copy for download.
Aspose works okay, but again: it's pricey.
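For reference, the field-filling flow in Aspose.Words looks roughly like this (a sketch from memory, so treat the details as assumptions and check the current docs; the field names and person object are made up):
// Assumes the template contains MERGEFIELD fields named LastName, FirstName, BirthDate.
Aspose.Words.Document doc = new Aspose.Words.Document("template.doc");
doc.MailMerge.Execute(
    new string[] { "LastName", "FirstName", "BirthDate" },
    new object[] { person.LastName, person.FirstName, person.BirthDate });
doc.Save("filled.doc"); // or save to a stream and send it to the browser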
Definitely avoid Office Automation in web apps as much as possible. It just doesn't scale well.
My preferred solution for this kind of problem is XML: specifically, here I recommend WordProcessingML. You create an XML document according to the schema, put a .doc extension on it, and MS Word will open it as if it were native in any version as far back as Office XP. It supports most Word features, and this way you can safely reduce the problem to replacing tokens in a text stream.
Be careful googling for more information on this: there's a lot of confusion between this and the new XML-based format for Office 2007. They're not the same thing.
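Because the template is plain text, the download handler reduces to a string replacement plus the right response headers. A rough sketch (the token names, file path, and person object are illustrative assumptions):
string xml = File.ReadAllText(Server.MapPath("~/Templates/letter.xml"));
// XML-escape the values (e.g. with SecurityElement.Escape) so user data can't break the document.
xml = xml.Replace("{LAST_NAME}", SecurityElement.Escape(person.LastName))
         .Replace("{FIRST_NAME}", SecurityElement.Escape(person.FirstName));
Response.ContentType = "application/msword"; // Word claims the .doc extension
Response.AddHeader("Content-Disposition", "attachment; filename=letter.doc");
Response.Write(xml);
Response.End();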
This code works for WordML text boxes and checkboxes. It's index-based, so just pass in an array of strings for all the text boxes and an array of bools for all the checkboxes.
public void FillInFields(
Stream sourceStream,
Stream destinationStream,
bool[] pageCheckboxFields,
string[] pageTextFields
) {
// The WordprocessingML namespace (kept in a field in the original answer).
XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
// Copy the template into the destination stream so the source is left untouched.
// (Copy is the stream helper shown at the end of this snippet.)
StreamUtil.Copy(sourceStream, destinationStream);
sourceStream.Close();
destinationStream.Seek(0, SeekOrigin.Begin);
Package package = Package.Open(destinationStream, FileMode.Open, FileAccess.ReadWrite);
Uri uri = new Uri("/word/document.xml", UriKind.Relative);
PackagePart packagePart = package.GetPart(uri);
Stream documentPart = packagePart.GetStream(FileMode.Open, FileAccess.ReadWrite);
XmlReader xmlReader = XmlReader.Create(documentPart);
XDocument xdocument = XDocument.Load(xmlReader);
List<XElement> textBookmarksList = xdocument
.Descendants(w + "fldChar")
.Where(e => ((string)e.Attribute(w + "fldCharType") ?? "") == "separate")
.ToList();
var textBookmarks = textBookmarksList.Select(e => new WordMlTextField(w, e, textBookmarksList.IndexOf(e)));
List<XElement> checkboxBookmarksList = xdocument
.Descendants(w + "checkBox")
.ToList();
IEnumerable<WordMlCheckboxField> checkboxBookmarks = checkboxBookmarksList
.Select(e => new WordMlCheckboxField(w, e, checkboxBookmarksList.IndexOf(e)));
for (int i = 0; i < pageTextFields.Length; i++) {
string value = pageTextFields[i];
if (!String.IsNullOrEmpty(value))
SetWordMlElement(textBookmarks, i, value);
}
for (int i = 0; i < pageCheckboxFields.Length; i++) {
bool value = pageCheckboxFields[i];
SetWordMlElement(checkboxBookmarks, i, value);
}
StreamWriter streamWriter = new StreamWriter(packagePart.GetStream(FileMode.Create, FileAccess.Write));
XmlWriter xmlWriter = XmlWriter.Create(streamWriter);
xdocument.Save(xmlWriter);
xmlWriter.Close();
streamWriter.Close();
package.Flush();
destinationStream.Seek(0, SeekOrigin.Begin);
}
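// The original answer omitted SetWordMlElement. A plausible reconstruction
// (an assumption, not the answerer's code): find the field with the given
// index and write the value into it.
private static void SetWordMlElement(IEnumerable<WordMlTextField> fields, int index, string value) {
WordMlTextField field = fields.FirstOrDefault(f => f.Index == index);
if (field != null && field.TextElement != null)
field.TextElement.Value = value;
}
private static void SetWordMlElement(IEnumerable<WordMlCheckboxField> fields, int index, bool value) {
WordMlCheckboxField field = fields.FirstOrDefault(f => f.Index == index);
if (field != null)
field.CheckedElement.SetAttributeValue(field._ns + "val", value ? "1" : "0");
}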
private class WordMlTextField {
// 'log' below is the original answerer's logger; substitute your own logging.
public int? Index { get; set; }
public XElement TextElement { get; set; }
public WordMlTextField(XNamespace ns, XObject element, int index) {
Index = index;
XElement parent = element.Parent;
if (parent == null) throw new InvalidOperationException("fldChar must have a parent.");
if (parent.Name != ns + "r") {
log.Warn("Expected parent of fldChar to be a run for fldChar at position '" + Index + "'");
return;
}
var nextSibling = parent.ElementsAfterSelf().FirstOrDefault();
if (nextSibling == null || nextSibling.Name != ns + "r") {
log.Warn("Expected a 'r' element after the parent of fldChar at position = " + Index);
return;
}
var text = nextSibling.Element(ns + "t");
if (text == null) {
log.Warn("Expected a 't' element inside the 'r' element after the parent of fldChar at position = " + Index);
}
TextElement = text;
}
}
private class WordMlCheckboxField {
public int? Index { get; set; }
public XElement CheckedElement { get; set; }
public readonly XNamespace _ns;
public WordMlCheckboxField(XNamespace ns, XContainer checkBoxElement, int index) {
_ns = ns;
Index = index;
XElement checkedElement = checkBoxElement.Elements(ns + "checked").FirstOrDefault();
if (checkedElement == null) {
checkedElement = new XElement(ns + "checked", new XAttribute(ns + "val", "0"));
checkBoxElement.Add(checkedElement);
}
CheckedElement = checkedElement;
}
}

// The StreamUtil.Copy helper referenced at the top of FillInFields
// (it was nested inside the wrong class in the original answer).
public static void Copy(Stream readStream, Stream writeStream) {
const int Length = 256;
Byte[] buffer = new Byte[Length];
int bytesRead = readStream.Read(buffer, 0, Length);
// write the required bytes
while (bytesRead > 0) {
writeStream.Write(buffer, 0, bytesRead);
bytesRead = readStream.Read(buffer, 0, Length);
}
readStream.Flush();
writeStream.Flush();
}
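A hypothetical call, just to show the shape of the API (the file name and values are made up, and this assumes the method lives on the same class):
using (FileStream template = File.OpenRead("template.docx"))
using (MemoryStream output = new MemoryStream())
{
    // One entry per checkbox and per text field, in document order.
    FillInFields(template, output, new[] { true }, new[] { "Smith", "John" });
    // 'output' now holds the filled-in copy, ready to stream to the user.
}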
In general you are going to want to avoid doing Office automation on a server, and Microsoft has stated that it is a bad idea as well. However, the technique I generally use is the Office Open XML approach noted by aquinas. It takes a bit of time to learn your way around the format, but it is well worth it once you do, as you don't have to worry about the issues involved in Office automation (e.g. processes hanging).
A while back I answered a similar question that you might find useful; you can find it here.
If you need to do this in DOC files (as opposed to DOCX), then the OpenXML SDK won't help you.
Also, just want to add another +1 about the danger of automating the Office apps on servers. You will run into problems with scale - I guarantee it.
To add another reference to a third-party tool that can be used to solve your problem:
http://www.officewriter.com
OfficeWriter lets you control docs with a full API, or with a template-based approach (which matches your requirement) that lets you open, bind, and save DOC and DOCX in scenarios like this with very little code.
Could you not use Microsoft's own Interop framework to utilise Word functionality?
See Here
Due to the limitations of the system, I am not allowed to use HtmlAgilityPack, as I don't have the rights to reference the library. So I can only use what native ASP.NET provides to parse the page.
E.g. I want to scrape this page https://sg.linkedin.com/job/google/jobs/ to get the list of Google jobs (just an example; I am not really planning to get this list, but my own company's). The jobs are in the page markup; how can I extract the job names and descriptions?
My current code is
System.Net.WebClient client = new System.Net.WebClient();
try{
System.IO.Stream myStream = client.OpenRead("https://sg.linkedin.com/job/google/jobs/");
System.IO.StreamReader sr = new System.IO.StreamReader(myStream);
string htmlContent = sr.ReadToEnd();
//do not know how to carry on
}catch(Exception e){
Response.Write(e.Message);
}
How can I carry on?
You can fetch that page and use a regular expression to isolate the useful parts. If you get really lucky, you may end up with a valid XML fragment:
var html = new WebClient().DownloadString("https://sg.linkedin.com/job/google/jobs/");
var jobs = new XmlDocument();
jobs.LoadXml(Regex.Replace(Regex.Match(html,
@"<ul class=""jobs"">[\s\S]*?</ul>").Value,
@"itemscope | itemprop="".*?""", "")); // clean invalid attributes
foreach (XmlElement job in jobs.SelectNodes("//li[@class='job']"))
{
Console.WriteLine(job.SelectSingleNode(".//a[@class='company']").InnerText);
Console.WriteLine(job.SelectSingleNode(".//h2/a").InnerText);
Console.WriteLine(job.SelectSingleNode(".//p[@class='abstract']").InnerText);
Console.WriteLine();
}
I am trying to understand and implement a piece of code for TIFF compression.
I have already used two separate techniques: the third-party library LibTiff.Net (bulky) and the Image.Save method, http://msdn.microsoft.com/en-us/library/ytz20d80%28v=vs.110%29.aspx (works on a Windows 7 machine but not on Windows Server 2003 or 2008).
Now I am looking to explore this third method.
using System.Collections.Generic;
using System.IO;
using System.Windows.Media.Imaging; // BitmapSource, TiffBitmapEncoder, TiffCompressOption
int width = 800;
int height = 1000;
int stride = width / 8; // BlackWhite is 1 bit per pixel, so 8 pixels per byte
// The buffer is zero-initialized; with the BlackWhite format every 0 bit is a
// black pixel, which is why the saved file comes out as a solid black image.
byte[] pixels = new byte[height * stride];
// Try creating a new image with a custom palette.
// (BlackWhite is a fixed pixel format, so this palette is effectively ignored.)
List<System.Windows.Media.Color> colors = new List<System.Windows.Media.Color>();
colors.Add(System.Windows.Media.Colors.Red);
colors.Add(System.Windows.Media.Colors.Blue);
colors.Add(System.Windows.Media.Colors.Green);
BitmapPalette myPalette = new BitmapPalette(colors);
// Creates a new empty image with the pre-defined palette.
BitmapSource image = BitmapSource.Create(
width,
height,
96, // horizontal DPI
96, // vertical DPI
System.Windows.Media.PixelFormats.BlackWhite,
myPalette,
pixels,
stride);
// Original_File is the output path; CCITT4 (Group 4 fax) requires 1bpp data.
using (FileStream stream = new FileStream(Original_File, FileMode.Create))
{
TiffBitmapEncoder encoder = new TiffBitmapEncoder();
encoder.Compression = TiffCompressOption.Ccitt4;
encoder.Frames.Add(BitmapFrame.Create(image));
encoder.Save(stream);
}
But I don't have a full understanding of what is happening here.
There is obviously some kind of memory stream that the compression technique is being applied to, but I am a bit confused about how to apply this to my specific case. I have an original TIFF file; I want to use this method to set its compression to CCITT and save it back. Can anyone help?
I copied the above code and it runs, but my output file is a solid black image (although, on the positive side, it has the correct compression type).
http://msdn.microsoft.com/en-us/library/ms616002%28v=vs.110%29.aspx
http://msdn.microsoft.com/en-us/library/system.windows.media.imaging.tiffcompressoption%28v=vs.100%29.aspx
http://social.msdn.microsoft.com/Forums/vstudio/en-US/1585c562-f7a9-4cfd-9674-6855ffaa8653/parameter-is-not-valid-for-compressionccitt4-on-windows-server-2003-and-2008?forum=netfxbcl
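For the actual goal - recompressing an existing TIFF rather than synthesizing pixels - the same WPF classes can round-trip the file. A sketch (the paths are placeholders, and note Ccitt4 is only valid for 1bpp black-and-white frames):
using (FileStream input = File.OpenRead(sourcePath))
using (FileStream output = File.Create(destPath))
{
    TiffBitmapDecoder decoder = new TiffBitmapDecoder(input,
        BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.OnLoad);
    TiffBitmapEncoder encoder = new TiffBitmapEncoder();
    encoder.Compression = TiffCompressOption.Ccitt4;
    foreach (BitmapFrame frame in decoder.Frames) // copies every page
        encoder.Frames.Add(frame);
    encoder.Save(output);
}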
LibTiff.Net is a little bulky because it's based on libtiff, which has its own set of problems.
My company (Atalasoft) has the ability to do that fairly easily, and the free version of the SDK will do the task you want with a few restrictions. The code for re-encoding a file would look like this:
public bool ReencodeFile(string path)
{
AtalaImage image = new AtalaImage(path);
if (image.PixelFormat == PixelFormat.Pixel1bppIndexed)
{
TiffEncoder encoder = new TiffEncoder();
encoder.Compression = TiffCompression.Group4FaxEncoding;
image.Save(path, encoder, null); // destroys the original - use carefully
return true;
}
return false;
}
Things you should be aware of:
this code will only work properly on 1bpp images
this code will NOT work properly on multi-page TIFFs
this code does NOT preserve metadata within the original file
and I would want the code to at least check for that. If you are inclined to have a solution that better preserves what's in the content of the file, you would want to do this:
public bool ReencodeFile(string origPath, string outputPath)
{
if (origPath == outputPath) throw new ArgumentException("outputPath needs to be different from input path.");
TiffDocument doc = new TiffDocument(origPath);
bool needsReencoding = false;
for (int i=0; i < doc.Pages; i++) {
if (doc.Pages[i].PixelFormat == PixelFormat.Pixel1bppIndexed) {
doc.Pages[i] = new TiffPage(new AtalaImage(origPath, i, null), TiffCompression.Group4FaxEncoding);
needsReencoding = true;
}
}
if (needsReencoding)
doc.Save(outputPath);
return needsReencoding;
}
This solution will respect all pages within the document as well as document metadata.
Actually my requirement is to search PDF files using their content.
I have a folder with a lot of PDF files.
I would like to develop an ASP.NET application that enables the user to search the PDFs for text they provide in a textbox.
How do I perform this task?
Thank you in advance.
Your task may be split into the following subtasks:
Develop an indexer that will index all of your PDF files
Develop the code to locate the relevant PDFs whenever a search is performed (using the index, of course)
Develop functionality that will open the relevant PDF or show a warning if nothing was found
To build the index you may use an integrated solution like Apache Lucene or Lucene.Net, or convert each PDF to text and build the index from the text yourself.
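As a rough idea of the indexer part with Lucene.Net (a sketch assuming the 3.x API; ExtractText stands in for whatever PDF text extractor you choose):
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

FSDirectory dir = FSDirectory.Open(new DirectoryInfo(indexPath));
StandardAnalyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
using (IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (string pdf in Directory.GetFiles(pdfFolder, "*.pdf"))
    {
        Document doc = new Document();
        doc.Add(new Field("path", pdf, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("content", ExtractText(pdf), Field.Store.NO, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }
}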
You may try the Docotic.Pdf library for the indexer part (disclaimer: I work for Bit Miracle).
The library can be used to extract text from PDFs, with or without formatting. The extracted text can be used to build an index.
The library can also retrieve a collection of words with their bounding rectangles from PDFs. This might be useful if you need to know the exact position of text in a file.
If you don't want to build an index, you can still use Docotic.Pdf to perform searches with code like the following:
PdfDocument doc = new PdfDocument("file.pdf");
string textToSearch = "some text";
for (int i = 0; i < doc.Pages.Count; i++)
{
string pageText = doc.Pages[i].GetText();
int count = 0;
int lastStartIndex = pageText.IndexOf(textToSearch, 0, StringComparison.CurrentCultureIgnoreCase);
while (lastStartIndex != -1)
{
count++;
lastStartIndex = pageText.IndexOf(textToSearch, lastStartIndex + 1, StringComparison.CurrentCultureIgnoreCase);
}
if (count != 0)
Console.WriteLine("Page {0}: '{1}' found {2} times", i, textToSearch, count);
}
You can use any PDF library for that; try iTextSharp, it's a free one.
You can read a PDF as text like this:
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText))); // normalizes the extracted bytes to UTF-8
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
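With that helper, a simple (unindexed) search over the folder could look like this (folder and searchText are placeholders for the user's input):
List<string> matches = new List<string>();
foreach (string file in Directory.GetFiles(folder, "*.pdf"))
{
    string content = ReadPdfFile(file);
    if (content.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
        matches.Add(file); // this PDF contains the searched text
}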
Try Zoom Search: it has a plugin for extracting text from PDF documents (which you can then search against), and it's easy to customize the search. You will need the Standard edition, which is not free (about $49). Zoom Search does the searching for you out of the box; you do not need to do any complicated work such as extracting the text from the PDFs and somehow indexing it in a database, or trying to use the Lucene search engine, which would require you to implement and customise quite a bit yourself.
Zoom works well with ASP.NET, and you just use the GUI to customize your search (not a lot of coding is required).
XmlTextReader reader = new XmlTextReader("D://project_elysian//data.xml");
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element)
{
reader.Read();
//Response.Write(reader.Value + "</br>");
//Response.Write(reader.Depth);
switch (reader.Name)
{
case "Id": Response.Write(reader.Value + "</br>");
break;
case "Name": Response.Write(reader.Value + "</br>");
break;
}
}
}
I am trying to read the data.xml file and display the contents of the specified tags, but the resulting page remains blank and no compilation error is given. I am stuck and can't figure out what is wrong with this code.
I suspect if you "View Source" on the resulting page you'll see the data you are expecting to see.
The problem is that your web browser sees these xml elements as unknown html tags and so doesn't know how to display them.
You need to "encode" your output, so your string is literally displayed as is.
Instead of writing:
Response.Write(reader.Value + "</br>");
try
Response.Write(Server.HtmlEncode(reader.Value) + "</br>");
What this does is replace < with &lt;, > with &gt;, and a few others. "&lt;" tells the browser to render "<" rather than treat it as the beginning of a tag.
[Edit - in response to comment]
It sounds like none of your cases are ever true. Without knowing the contents of the source XML file it is hard to say - but have you tried putting a breakpoint on the Response.Write calls in the cases? Are they ever hit?
If not, then this is not related to anything I mentioned above - but you are not getting what you expect from your reader.
Try starting with a small sample of the xml file and step through in the debugger. Try and determine what data (e.g. the reader.Name property) is present on the reader when you hit something you are interested in, and amend the switch statement accordingly.
[2nd Edit - in response to sample xml]
Your mistake is the Read() call just after the check for the XmlNodeType.Element. You are basically reading until you find an element (the Read() call in the while). Once you've found the element, you are then pushing past the element (the other Read() call) before trying to read the element name. This inner reader.Read() makes sure you are no longer on the element by the time you try to check its name.
Try this:
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element)
{
// Capture the element name before pushing past it.
var elementName = reader.Name;
reader.Read();
//Response.Write(reader.Value + "</br>");
//Response.Write(reader.Depth);
switch (elementName)
{
case "Id":
Response.Write(reader.Value);
break;
case "Name":
Response.Write(reader.Value);
break;
}
}
}
The key to finding this sort of thing, is to debug carefully. Start with a cut down xml file and either actually step through in the debugger, or write debug output to a log or the response. It'll make identifying these sort of issues much easier.
Since it is working outside the switch, I guess the name of the node differs in case. Check the casing of the nodes.
Edit:
You are calling reader.Read() twice, and reader.Value will not return the proper value for an element.
If you still want to use XmlReader, check the code below:
XmlTextReader reader = new XmlTextReader(<XML Path>);
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element)
{
switch (reader.Name)
{
case "Id": Response.Write(Server.HtmlEncode(reader.ReadString()) + "</br>");
break;
case "Name": Response.Write(Server.HtmlEncode(reader.ReadString()) + "</br>");
break;
}
}
}
reader.Close();
I would suggest that you take a look at XML (de)serialization instead of using XmlReader. .NET will verify the XML, and you will find it much easier to debug the XML input.
You simply create a class with fields mirroring your XML structure and use the following code:
class searchResult
{
public List<item> itemList { get; set;}
}
A complex field class example
class item
{
public int Id { get; set; }
public string Name{ get; set; }
}
The actual work gets done like so:
XmlSerializer serializerIn = new XmlSerializer(typeof(searchResult));
FileStream fs = new FileStream(@"C:\test.xml", FileMode.Open, FileAccess.Read, FileShare.Read);
searchResult loadTest = (searchResult)serializerIn.Deserialize(fs);
fs.Close();
Where searchResult is the class you're loading the XML into. It is much easier to work this way, because you never need to deal with the raw XML unless it is invalid.
You can find more info here: http://www.codeproject.com/Articles/4491/Load-and-save-objects-to-XML-using-serialization
Probably a better tutorial: https://web.archive.org/web/20211020113423/https://www.4guysfromrolla.com/webtech/012302-1.shtml
I am developing my own blog engine based on file-system storage, and I am very much interested in using Markdown to keep the files compact. I have figured out the flow when the user submits content using a Markdown editor (that's what I am using now while writing this!), but I would also like to support Windows Live Writer and the MetaWeblog API, so it is very important for me to transform in the opposite direction as well (HTML -> Markdown).
I am not able to find any example or specific code snippet that can help me. Advice would be much appreciated.
Reference:
http://code.google.com/p/markdownsharp/
I am using the above repository.
Cheers!
Nilay.
You can use Pandoc with a wrapper as shown in the answer to this question:
Convert Html or RTF to Markdown or Wiki Compatible syntax?
Edit:
Here's a slightly modified version (to dispose of the process resources) of the function that @Rob wrote:
private string Convert(string source) {
string processName = #"C:\Program Files (x86)\Pandoc\bin\pandoc.exe";
string args = "-r html -t markdown";
ProcessStartInfo psi = new ProcessStartInfo(processName, args) {
RedirectStandardOutput = true,
RedirectStandardInput = true,
CreateNoWindow = true,
UseShellExecute = false
};
var outputString = "";
using (var p = new Process()) {
p.StartInfo = psi;
p.Start();
byte[] inputBuffer = Encoding.UTF8.GetBytes(source);
p.StandardInput.BaseStream.Write(inputBuffer, 0, inputBuffer.Length);
p.StandardInput.Close();
using (var sr = new StreamReader(p.StandardOutput.BaseStream)) {
outputString = sr.ReadToEnd();
}
}
return outputString;
}
I'm not sure how practical this is but it works.
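As a quick sanity check of the wrapper (exact spacing in the output can vary between Pandoc versions):
string markdown = Convert("<p>Hello <strong>world</strong></p>");
// markdown should come back as something like: Hello **world**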