PDF content search in asp.net c# - asp.net

My requirement is to search PDF files by their content.
I have a folder with a lot of PDF files.
I would like to develop an ASP.NET application that lets the user
search the PDFs for text they enter in a textbox.
How can I do this?
Thanks in advance.

Your task can be split into the following subtasks:
Develop an indexer that will index all of your PDF files
Develop the code to locate the relevant PDFs whenever a search is performed (using the index, of course)
Develop functionality that will open the relevant PDF or show a warning if nothing was found
To build the index you can use an integrated solution like Apache Lucene or Lucene.Net, or convert each PDF into text and build an index from the text yourself.
You may try Docotic.Pdf library for the indexer part (disclaimer: I work for Bit Miracle).
The library could be used to extract text from PDFs. It can extract text with or without formatting. The extracted text can be used to create an index.
The library can also retrieve a collection of words with their bounding rectangles from PDFs. This might be useful if you need to know exact position of a text in a file.
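As a rough illustration of the "build the index yourself" option, here is a minimal in-memory inverted index. It assumes the text has already been extracted from each PDF by whichever library you choose. The class name SimpleTextIndex and its methods are made up for this sketch and are not part of Docotic.Pdf or Lucene.Net.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal in-memory inverted index: word -> set of file paths containing it.
// SimpleTextIndex is an illustrative name, not part of any library.
public sealed class SimpleTextIndex
{
    private readonly Dictionary<string, HashSet<string>> _postings =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Splits text into words made of letters and digits.
    private static IEnumerable<string> Tokenize(string text)
    {
        int start = -1;
        for (int i = 0; i <= text.Length; i++)
        {
            bool isWordChar = i < text.Length && char.IsLetterOrDigit(text[i]);
            if (isWordChar && start < 0)
            {
                start = i;
            }
            else if (!isWordChar && start >= 0)
            {
                yield return text.Substring(start, i - start);
                start = -1;
            }
        }
    }

    // Adds the text extracted from one PDF to the index.
    public void Add(string filePath, string extractedText)
    {
        foreach (string word in Tokenize(extractedText))
        {
            HashSet<string> files;
            if (!_postings.TryGetValue(word, out files))
            {
                files = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
                _postings[word] = files;
            }
            files.Add(filePath);
        }
    }

    // Returns the files that contain every word of the query (AND semantics),
    // matched case-insensitively. Word order and phrases are not considered.
    public IList<string> Search(string query)
    {
        HashSet<string> result = null;
        foreach (string word in Tokenize(query))
        {
            HashSet<string> files;
            if (!_postings.TryGetValue(word, out files))
                return new List<string>();
            if (result == null)
                result = new HashSet<string>(files, StringComparer.OrdinalIgnoreCase);
            else
                result.IntersectWith(files);
        }
        return result == null
            ? new List<string>()
            : result.OrderBy(f => f, StringComparer.OrdinalIgnoreCase).ToList();
    }
}
```

You would call Add once per file when the folder is indexed and Search on every user query. Since this sketch keeps everything in memory, it would have to be rebuilt when the application restarts; a library such as Lucene.Net handles persistence and ranking for you.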
If you don't want to build an index, you can still use Docotic.Pdf to perform searches with code like the following:
PdfDocument doc = new PdfDocument("file.pdf");
string textToSearch = "some text";
for (int i = 0; i < doc.Pages.Count; i++)
{
string pageText = doc.Pages[i].GetText();
int count = 0;
int lastStartIndex = pageText.IndexOf(textToSearch, 0, StringComparison.CurrentCultureIgnoreCase);
while (lastStartIndex != -1)
{
count++;
lastStartIndex = pageText.IndexOf(textToSearch, lastStartIndex + 1, StringComparison.CurrentCultureIgnoreCase);
}
if (count != 0)
Console.WriteLine("Page {0}: '{1}' found {2} times", i, textToSearch, count);
}

You can use any library for that; try iTextSharp, it's a free one.
You can read a PDF as text like this:
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
// GetTextFromPage already returns a .NET string; round-tripping it through Encoding.Default can lose characters, so no re-encoding is done here.
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
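To search a whole folder rather than a single file, a text-extraction method like the one above can be applied to each PDF in turn. The following is only a sketch: PdfFolderSearch and FindFilesContaining are hypothetical names, and the extraction function is passed in as a delegate so the code does not depend on a particular PDF library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative helper, not part of iTextSharp or any other library.
public static class PdfFolderSearch
{
    // extractText is whatever function turns a PDF path into plain text,
    // for example a method like ReadPdfFile.
    public static List<string> FindFilesContaining(
        string folder, string searchText, Func<string, string> extractText)
    {
        var matches = new List<string>();
        foreach (string path in Directory.EnumerateFiles(
            folder, "*.pdf", SearchOption.AllDirectories))
        {
            string text = extractText(path);
            // Case-insensitive substring match on the extracted text.
            if (text.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0)
                matches.Add(path);
        }
        return matches;
    }
}
```

Note that this re-reads every file on every search, so it is likely to be slow for a large folder; in that case extracting the text once and keeping an index is the better approach.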

Try Zoom Search. It has a plugin for extracting text from PDF documents (which you can then search against), and it's easy to customize your search. You will need the Standard edition, which is not free (about $49). Zoom Search does the searching for you out of the box, so you don't need to do anything complicated, such as extracting the text from the PDFs and indexing it in a database yourself, or using the Lucene search engine, which requires you to implement and customize things (a bit of work).
Zoom works well with ASP.NET, and you just need to use the GUI to customize your search (not a lot of coding is required).

Related

Save text file with code page UTF-8 in Axapta 3.0

How do I make a text file with code page UTF-8 in Axapta 3.0?
I cannot use
myFile = new CommaIo(myFileName, 'W', 65001);
as we can in newer versions of Axapta. In Axapta 3.0, new CommaIo only has the first two parameters.
I've not worked in 3.0, but I have a few ideas for you that might send you the right way.
1) Do you have CommaTextIo or TextIo? Those are the objects where you can specify a code page.
2) Look in the AOT and see if you have a macro called #File, and inside if you have #utf8Format(65001), use the X-Ref (or Ctrl+F) to find other places in the system that use it. Then you can see how they might accomplish UTF-8
3) See if you can combine CommaIo with some .NET code, or just manually generate a CSV. Perhaps generate your CSV and write it, then read it and re-write it using a method like the one below (from the MetadataXMLGenerator job):
void write(str _directory, str _name, str _text)
{
str path;
;
_text = System.Text.RegularExpressions.Regex::Replace(_text, '\n', '\r\n');
if (!System.IO.Directory::Exists(_directory))
{
System.IO.Directory::CreateDirectory(_directory);
}
path = System.IO.Path::Combine(_directory, _name);
System.IO.File::WriteAllText(path, _text, System.Text.Encoding::get_UTF8());
}

c# Tiff files compression methodology

I am trying to understand and implement a piece of code for Tiff compression.
I have already used two separate techniques: the third-party LibTiff.NET DLLs (this first method is bulky) and the Image.Save method, http://msdn.microsoft.com/en-us/library/ytz20d80%28v=vs.110%29.aspx (this second method works only on a Windows 7 machine, not on Windows Server 2003 or 2008).
Now I am looking to explore this 3rd method.
using System.Windows.Forms;
using System.Windows.Media.Imaging;
using System.Drawing.Imaging;
int width = 800;
int height = 1000;
int stride = width/8;
byte[] pixels = new byte[height*stride];
// Try creating a new image with a custom palette.
List<System.Windows.Media.Color> colors = new List<System.Windows.Media.Color>();
colors.Add(System.Windows.Media.Colors.Red);
colors.Add(System.Windows.Media.Colors.Blue);
colors.Add(System.Windows.Media.Colors.Green);
BitmapPalette myPalette = new BitmapPalette(colors);
// Creates a new empty image with the pre-defined palette
BitmapSource image = BitmapSource.Create(
width,
height,
96,
96,
System.Windows.Media.PixelFormats.BlackWhite,
myPalette,
pixels,
stride);
FileStream stream = new FileStream(Original_File, FileMode.Create);
TiffBitmapEncoder encoder = new TiffBitmapEncoder();
encoder.Compression = TiffCompressOption.Ccitt4;
encoder.Frames.Add(BitmapFrame.Create(image));
encoder.Save(stream);
But I don't have a full understanding of what is happening here.
There is obviously some kind of a memory stream that the compression technique is being applied to. But I am a bit confused how to apply this to my specific case. I have an original tiff file, I want to use this method to set its compression to CCITT and save it back. Can anyone help?
I copied the above code and the code runs. But my end output file is a solid black background image. Although on the positive side it is of the correct compression type.
http://msdn.microsoft.com/en-us/library/ms616002%28v=vs.110%29.aspx
http://msdn.microsoft.com/en-us/library/system.windows.media.imaging.tiffcompressoption%28v=vs.100%29.aspx
http://social.msdn.microsoft.com/Forums/vstudio/en-US/1585c562-f7a9-4cfd-9674-6855ffaa8653/parameter-is-not-valid-for-compressionccitt4-on-windows-server-2003-and-2008?forum=netfxbcl
LibTiff.net is a little bulky because it's based off LibTiff, which has its own set of problems.
My company (Atalasoft) has the ability to do that fairly easily, and the free version of the SDK will do the task you want with a few restrictions. The code for re-encoding a file would look like this:
public bool ReencodeFile(string path)
{
AtalaImage image = new AtalaImage(path);
if (image.PixelFormat == PixelFormat.Pixel1bppIndexed)
{
TiffEncoder encoder = new TiffEncoder();
encoder.Compression = TiffCompression.Group4FaxEncoding;
image.Save(path, encoder, null); // destroys the original - use carefully
return true;
}
return false;
}
Things you should be aware of:
this code will only work properly on 1bpp images
this code will NOT work properly on multi-page TIFFs
this code does NOT preserve metadata within the original file
I would want the code to at least check for these conditions. If you are inclined to have a solution that better preserves the content of the file, you would want to do this:
public bool ReencodeFile(string origPath, string outputPath)
{
    if (origPath == outputPath) throw new ArgumentException("outputPath needs to be different from input path.");
    TiffDocument doc = new TiffDocument(origPath);
    bool needsReencoding = false;
    // Re-encode only the 1bpp pages; all other pages are left as they are.
    for (int i = 0; i < doc.Pages.Count; i++) {
        if (doc.Pages[i].PixelFormat == PixelFormat.Pixel1bppIndexed) {
            doc.Pages[i] = new TiffPage(new AtalaImage(origPath, i, null), TiffCompression.Group4FaxEncoding);
            needsReencoding = true;
        }
    }
    // Write the output file only if at least one page was changed.
    if (needsReencoding)
        doc.Save(outputPath);
    return needsReencoding;
}
This solution will respect all pages within the document as well as document metadata.

Annotation in pdfclown

I am trying to put a sticky note at some x,y location. For this I am using the PDF Clown annotation class in .NET.
Below is what is available.
using files = org.pdfclown.files;
public override bool Run()
{
files::File file = new files::File();
Document document = file.Document;
Populate(document);
Serialize(file, false, "Annotations", "inserting annotations");
return true;
}
private void Populate(Document document)
{
Page page = new Page(document);
document.Pages.Add(page);
PrimitiveComposer composer = new PrimitiveComposer(page);
StandardType1Font font = new StandardType1Font(document, StandardType1Font.FamilyEnum.Courier, true, false);
composer.SetFont(font, 12);
annotations::Note note = new annotations::Note(page, new Point(78, 658), "this is my annotation...");
note.IconType = annotations::Note.IconTypeEnum.Help;
note.ModificationDate = new DateTime();
note.IsOpen = true;
composer.Flush();
}
Link for annotation
This puts a sticky note at coordinates 78, 658 in a blank PDF.
The problem is that I want the sticky note in a particular PDF that already has some data. How can I modify it? Thanks for the help.
I'm the author of PDF Clown -- this is the right way to insert an annotation like a sticky note into an existing page:
using org.pdfclown.documents;
using annotations = org.pdfclown.documents.interaction.annotations;
using files = org.pdfclown.files;
using System.Drawing;
. . .
// Open the PDF file!
using(files::File file = new files::File(@"C:\mypath\myfile.pdf"))
{
// Get the document (high-level representation of the PDF file)!
Document document = file.Document;
// Get, e.g., the first page of the document!
Page page = document.Pages[0];
// Insert your sticky note into the page!
annotations::Note note = new annotations::Note(page, new Point(78, 658), "this is my annotation...");
note.IconType = annotations::Note.IconTypeEnum.Help;
note.ModificationDate = new DateTime();
note.IsOpen = true;
// Save the PDF file!
file.Save(files::SerializationModeEnum.Incremental);
}
Please consider that there are lots of options about the way you can save your file (to an output (in-memory) stream, to a distinct path, as a compacted file, as an appended file...).
If you look at the 50+ samples accompanying the library's distribution, along with the API documentation, you can discover how expressive and powerful it is. Its architecture strictly adheres to the official Adobe PDF Reference 1.7.
enjoy!

ASP.NET to PowerPoint: File gets corrupted when adding image

I have used this example when exporting data to PowerPoint:
I have modified the GenerateSlidesFromDB() method:
public void GenerateSlidesFromDB()
{
string slideName = @"C:\Users\x\Desktop\output.pptx";
File.Copy(@"C:\Users\x\Desktop\Test.pptx", slideName, true);
using (PresentationDocument presentationDocument = PresentationDocument.Open(slideName, true))
{
PresentationPart presentationPart = presentationDocument.PresentationPart;
SlidePart slideTemplate = (SlidePart)presentationPart.GetPartById("rId2");
string firstName = "Test User";
SlidePart newSlide = CloneSlidePart(presentationPart, slideTemplate);
InsertContent(newSlide, firstName);
newSlide.Slide.Save();
DeleteTemplateSlide(presentationPart, slideTemplate);
presentationPart.Presentation.Save();
}
}
As you can see I overwrite the placeholder with "Test User", and it works like a charm.
I need to add an image (as a placeholder) to this pptx-file.
When I do that (and run the code again) I get a corrupted pptx-file?
Error message:
PowerPoint removed unreadable content
in output.pptx. You should review
this presentation to determine whether
any content was unexpectedly changed
or removed.
Edit: If I try the original code (slightly modified, since I don't have AdventureWorks), I get a different error message:
This file may have become corrupt or damaged for the following reasons:
Third-party XML editors sometimes create files that are not compatible with Microsoft Office XML specifications.
The file has been purposely corrupted with the intent to harm your computer or your data.
Be cautious when opening a file from an unknown source.
PowerPoint can attempt to recover data from the file, but some presentation data, such as shapes, text, and formatting, may be lost.
Do one of the following:
If you want to recover data from the file, click Yes.
If you do not want to recover data from the file, click No.
Ok, sorry for this useless post. My bad.
Solution:
string imgId = "rIdImg" + i;
ImagePart imagePart = newSlide.AddImagePart(ImagePartType.Jpeg, imgId);
MemoryStream stream3 = new MemoryStream();
using (FileStream file = File.Open(@"C:\Users\x\Desktop\Test.jpg", FileMode.Open))
{
byte[] buffer = new byte[file.Length];
file.Read(buffer, 0, (int)file.Length);
stream3.Write(buffer, 0, buffer.Length);
imagePart.FeedData(new MemoryStream(buffer));
}
SwapPhoto(newSlide, imgId);

Fill a word document in asp.net?

I am working on an ASP.NET project that needs to fill in a Word document. My client provides a Word template with last name, first name, birth date, etc. I have all of that information in a SQL database, and the client wants the users of the application to be able to download the Word document filled in with information from the database.
What's the best way to achieve this? Basically, I need to identify the "fillable spots" in the Word document and fill in the information when the application user clicks the download button.
If you can use Office 2007 the way to go is to use the Open XML API to format the documents:
http://support.microsoft.com/kb/257757. The reason you have to go that route is that you can't really use Word Automation in a server environment. (you CAN, but it's a huge pain to get working properly, and can EASILY break).
If you can't go the 2007 route, I've actually had pretty good success with just opening up a word template as a stream and finding and replacing the tokens and serving that to the user. This has actually worked surprisingly well in my experience and it's REALLY simple to implement.
I'm not sure about some of the ASP.Net aspects, but I am working on something similar and you might want to look into using an RTF instead. You can use pattern replacement in the RTF. For example you can add a tag like {USER_FIRST_NAME} in the RTF document. When the user clicks the download button, your application can take the information from the database and replace every instance of {USER_FIRST_NAME} with the data from the database. I am currently doing this with PHP and it works great. Word will open the RTF without a problem so that is another reason I chose this method.
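The same token replacement can be sketched in C#. This is an illustration only: RtfTemplateFiller is a made-up name, and it assumes each token such as {USER_FIRST_NAME} appears in the RTF source as one unbroken run, either as raw braces (typed into the RTF by hand) or as escaped braces \{ and \} (typed in Word and saved as RTF). Values are escaped so that backslashes, braces and non-ASCII characters do not break the RTF.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative helper (not from any library): fills {TOKEN} placeholders in RTF text.
public static class RtfTemplateFiller
{
    // Escapes a plain value for use inside RTF: backslash and braces are
    // control characters, and non-ASCII characters are written as \uN? escapes
    // (N is the signed 16-bit code unit, '?' is the fallback character).
    public static string EscapeRtf(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            if (c == '\\' || c == '{' || c == '}')
                sb.Append('\\').Append(c);
            else if (c > 127)
                sb.Append("\\u").Append((short)c).Append('?');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Replaces every occurrence of each token with its escaped value.
    // Keys are token names without braces, e.g. "USER_FIRST_NAME".
    // Replacements are applied one token at a time, so a value that itself
    // looks like a later token would be substituted again.
    public static string Fill(string rtfTemplate, IDictionary<string, string> values)
    {
        var result = new StringBuilder(rtfTemplate);
        foreach (KeyValuePair<string, string> pair in values)
        {
            string replacement = EscapeRtf(pair.Value ?? string.Empty);
            result.Replace("\\{" + pair.Key + "\\}", replacement); // escaped form
            result.Replace("{" + pair.Key + "}", replacement);     // raw form
        }
        return result.ToString();
    }
}
```

In an ASP.NET page you would read the template text, call Fill with the values from the database, and write the result to the response as the download. If Word has split a token across several formatting runs, the simple string match will not find it, so it is worth checking the saved RTF source for each token.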
I have used Aspose.Words for .NET. It's a little on the pricey side, but it works extremely well and the API is fairly intuitive for something that is potentially very complex.
If you want to pre-design your documents (or allow others to do that for you), anyone can put fields into the document. Aspose can open the document, find and fill the fields, and save a new filled-out copy for download.
Aspose works okay, but again: it's pricey.
Definitely avoid Office Automation in web apps as much as possible. It just doesn't scale well.
My preferred solution for this kind of problem is xml: specifically here I recommend WordProcessingML. You create an Xml document according to the schema, put a .doc extension on it, and MS Word will open it as if it were native in any version as far back as Office XP. This supports most Word features, and this way you can safely reduce the problem to replacing tokens in a text stream.
Be careful googling for more information on this: there's a lot of confusion between this and new Xml-based format for Office 2007. They're not the same thing.
This code works for WordML text boxes and checkboxes. It's index-based, so just pass in an array of strings for all text boxes and an array of bools for all checkboxes.
public void FillInFields(
Stream sourceStream,
Stream destinationStream,
bool[] pageCheckboxFields,
string[] pageTextFields
) {
StreamUtil.Copy(sourceStream, destinationStream);
sourceStream.Close();
destinationStream.Seek(0, SeekOrigin.Begin);
Package package = Package.Open(destinationStream, FileMode.Open, FileAccess.ReadWrite);
Uri uri = new Uri("/word/document.xml", UriKind.Relative);
PackagePart packagePart = package.GetPart(uri);
Stream documentPart = packagePart.GetStream(FileMode.Open, FileAccess.ReadWrite);
XmlReader xmlReader = XmlReader.Create(documentPart);
XDocument xdocument = XDocument.Load(xmlReader);
List<XElement> textBookmarksList = xdocument
.Descendants(w + "fldChar")
.Where(e => (e.AttributeOrDefault(w + "fldCharType") ?? "") == "separate")
.ToList();
var textBookmarks = textBookmarksList.Select(e => new WordMlTextField(w, e, textBookmarksList.IndexOf(e)));
List<XElement> checkboxBookmarksList = xdocument
.Descendants(w + "checkBox")
.ToList();
IEnumerable<WordMlCheckboxField> checkboxBookmarks = checkboxBookmarksList
.Select(e => new WordMlCheckboxField(w, e, checkboxBookmarksList.IndexOf(e)));
for (int i = 0; i < pageTextFields.Length; i++) {
string value = pageTextFields[i];
if (!String.IsNullOrEmpty(value))
SetWordMlElement(textBookmarks, i, value);
}
for (int i = 0; i < pageCheckboxFields.Length; i++) {
bool value = pageCheckboxFields[i];
SetWordMlElement(checkboxBookmarks, i, value);
}
PackagePart newPart = packagePart;
StreamWriter streamWriter = new StreamWriter(newPart.GetStream(FileMode.Create, FileAccess.Write));
XmlWriter xmlWriter = XmlWriter.Create(streamWriter);
if (xmlWriter == null) throw new Exception("Could not open an XmlWriter to 4311Blank-1.docx.");
xdocument.Save(xmlWriter);
xmlWriter.Close();
streamWriter.Close();
package.Flush();
destinationStream.Seek(0, SeekOrigin.Begin);
}
private class WordMlTextField {
public int? Index { get; set; }
public XElement TextElement { get; set; }
public WordMlTextField(XNamespace ns, XObject element, int index) {
Index = index;
XElement parent = element.Parent;
if (parent == null) throw new NicException("fldChar must have a parent.");
if (parent.Name != ns + "r") {
log.Warn("Expected parent of fldChar to be a run for fldChar at position '" + Index + "'");
return;
}
var nextSibling = parent.ElementsAfterSelf().First();
if (nextSibling.Name != ns + "r") {
log.Warn("Expected a 'r' element after the parent of fldChar at position = " + Index);
return;
}
var text = nextSibling.Element(ns + "t");
if (text == null) {
log.Warn("Expected a 't' element inside the 'r' element after the parent of fldChar at position = " + Index);
}
TextElement = text;
}
}
private class WordMlCheckboxField {
public int? Index { get; set; }
public XElement CheckedElement { get; set; }
public readonly XNamespace _ns;
public WordMlCheckboxField(XNamespace ns, XContainer checkBoxElement, int index) {
_ns = ns;
Index = index;
XElement checkedElement = checkBoxElement.Elements(ns + "checked").FirstOrDefault();
if (checkedElement == null) {
checkedElement = new XElement(ns + "checked", new XAttribute(ns + "val", "0"));
checkBoxElement.Add(checkedElement);
}
CheckedElement = checkedElement;
}
}
// Helper class used by FillInFields above to copy the template stream.
public static class StreamUtil {
    public static void Copy(Stream readStream, Stream writeStream) {
        const int Length = 256;
        Byte[] buffer = new Byte[Length];
        int bytesRead = readStream.Read(buffer, 0, Length);
        // write the required bytes
        while (bytesRead > 0) {
            writeStream.Write(buffer, 0, bytesRead);
            bytesRead = readStream.Read(buffer, 0, Length);
        }
        readStream.Flush();
        writeStream.Flush();
    }
}
In general you are going to want to avoid doing Office automation on a server, and Microsoft has even stated that it is a bad idea. However, the technique that I generally use is the Office Open XML that was noted by aquinas. It does take a bit of time to learn your way around the format, but it is well worth it once you do, as you don't have to worry about some of the issues involved with Office automation (e.g. processes hanging).
Awhile back I answered a similar question to this that you might find useful, you can find it here.
If you need to do this in DOC files (as opposed to DOCX), then the OpenXML SDK won't help you.
Also, just want to add another +1 about the danger of automating the Office apps on servers. You will run into problems with scale - I guarantee it.
To add another reference to a third-party tool that can be used to solve your problem:
http://www.officewriter.com
OfficeWriter lets you control docs with a full API, or a template-based approach (like what your requirement is) that basically lets you open, bind, and save DOC and DOCX in scenarios like this with little code.
Could you not use Microsoft's own Interop framework to utilise Word functionality?
See here.
