How to update a PDF file? - asp.net

I need to replace a word in a PDF document with a new word that the user selects from a drop-down list, in ASP.NET. I am using iTextSharp, but the new PDF that is created comes out distorted because I am not able to extract the formatting/styling information of the original PDF along with its text. Also, is there a way to read a PDF line by line? Please help.
protected void Page_Load(object sender, EventArgs e)
{
    String s = DropDownList1.SelectedValue;
    Response.Write(s);
    ListFieldNames(s);
}

private void CreatePDF(string text)
{
    string outFileName = @"z:\TEMP\PDF\Test_abc.pdf";
    Document doc = new Document();
    doc.SetMargins(30f, 30f, 30f, 30f);
    PdfWriter.GetInstance(doc, new FileStream(outFileName, FileMode.Create));
    doc.Open();
    BaseFont bfTimes = BaseFont.CreateFont(BaseFont.COURIER, BaseFont.CP1252, false);
    Font times = new Font(bfTimes, 12, Font.BOLDITALIC);
    //Chunk ch = new Chunk(text, times);
    Paragraph para = new Paragraph(text, times);
    //para.SpacingAfter = 9f;
    para.Alignment = Element.ALIGN_CENTER;
    //para.IndentationLeft = 100;
    doc.Add(para);
    //doc.Add(new Paragraph(text, times));
    doc.Close();
    Response.Redirect(@"z:\TEMP\PDF\Test_abc.pdf", false);
}

private void ListFieldNames(string s)
{
    ArrayList arrCheck = new ArrayList();
    try
    {
        string pdfTemplate = @"z:\TEMP\PDF\abc.pdf";
        //string dest = @"z:\TEMP\PDF\Test_abc.pdf";
        PdfReader pdfReader = new PdfReader(pdfTemplate);
        string pdfText = string.Empty;
        string extracttext = "";
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
            PdfReader reader = new PdfReader((string)pdfTemplate);
            extracttext = PdfTextExtractor.GetTextFromPage(reader, page, its);
            extracttext = Encoding.Unicode.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.Unicode, Encoding.Default.GetBytes(extracttext)));
            pdfText = pdfText + extracttext;
            pdfText = pdfText.Replace("[xyz]", s);
            pdfReader.Close();
        }
        CreatePDF(pdfText);
    }
    catch (Exception ex)
    {
    }
    finally
    {
    }
}

You are making one wrong assumption after the other.
You assume that the concept of "lines" exists in PDF. This is wrong. In Text State, different snippets of text are drawn on the page at absolute positions. For every "show text" operator, iText will return a TextRenderInfo object with the portion of text that was drawn and its coordinates. One line can consist of multiple text snippets. A text snippet may contain whitespace or may even be empty.
You assume that all text in a PDF keeps its natural reading order. This should be true for PDF/UA (UA stands for Universal Accessibility), but it's certainly not true for most PDFs you can find in the wild. That's why iText provides location-based text extraction (see p521 of iText in Action, Second Edition). As explained on p516, the text "Hello World" can be stored in the PDF as "ld", "Wor", "llo", "He". The LocationTextExtractionStrategy will order all the text snippets, reconstructing words if necessary. For instance: it will concatenate "He" and "llo" to "Hello", because there's not sufficient space between the "He" snippet and the "llo" snippet. However, for reasons unknown (probably ignorance), you're using the SimpleTextExtractionStrategy which doesn't order the text based on its location.
You are completely ignoring all the Graphics State operators, as well as the Text State operators that define the font, and so on.
You assume that PDF is a Word processing format. This is wrong on many levels, as is your code. Please read the intro of chapter 6 of my book.
All these wrong assumptions almost make me want to vote down your question. At the risk of being voted down myself for this answer, I must tell you that you shouldn't try to "do the same". You're asking something that is very complex, and in many cases even impossible!
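That said, if all you need is the page text in something close to reading order, a minimal sketch with the location-based strategy (assuming iTextSharp 5.x; the path and method name are placeholders) looks like this:
// Sketch only: extract page text in reading order with iTextSharp 5.x.
// LocationTextExtractionStrategy sorts the snippets by position, so
// "ld", "Wor", "llo", "He" come back as "Hello World".
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static string ExtractOrderedText(string path)
{
    StringBuilder sb = new StringBuilder();
    using (PdfReader reader = new PdfReader(path))
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }
    }
    return sb.ToString();
}
Note that this only improves extraction; it does not turn the PDF into an editable document you can replace words in and write back.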

Related

We found a problem with some content in "example.xlsx" - using ClosedXML library

I'm using the ClosedXML library to generate a simple Excel file with 2 worksheets.
I keep getting an error message whenever I try to open the file, saying:
"We found a problem with some content in 'example.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes."
If I click Yes, it displays the data as expected and I don't see any problems with it. Also, if I generate only 1 worksheet, this error does not appear.
This is what my stored procedure returns; the first result set is populated in sheet1 and the second result set in sheet2, which works as expected.
Workbook data
Here is the method I am using; it returns 2 result sets and populates both result sets in 2 different worksheets:
[HttpPost]
[ValidateAntiForgeryToken]
public ActionResult POAReport(POAReportVM model)
{
    POAReportVM poaReportVM = reportService.GetPOAReport(model);
    using (var workbook = new XLWorkbook())
    {
        IXLWorksheet worksheet1 = workbook.Worksheets.Add("ProductOrderAccuracy");
        worksheet1.Cell("A1").Value = "DATE";
        worksheet1.Cell("B1").Value = "ORDER";
        worksheet1.Cell("C1").Value = "";
        var countsheet1 = 2;
        for (int i = 0; i < poaReportVM.productOrderAccuracyList.Count; i++)
        {
            worksheet1.Cell(countsheet1, 1).Value = poaReportVM.productOrderAccuracyList[i].CompletedDate.ToString();
            worksheet1.Cell(countsheet1, 2).Value = poaReportVM.productOrderAccuracyList[i].WebOrderID.ToString();
            worksheet1.Cell(countsheet1, 3).Value = poaReportVM.productOrderAccuracyList[i].CompletedIndicator;
            countsheet1++;
        }
        IXLWorksheet worksheet2 = workbook.Worksheets.Add("Summary");
        worksheet2.Cell("A1").Value = "Total Orders Sampled";
        worksheet2.Cell("B1").Value = "Passed";
        worksheet2.Cell("C1").Value = "% Passed";
        worksheet2.Cell(2, 1).Value = poaReportVM.summaryVM.TotalOrdersSampled.ToString();
        worksheet2.Cell(2, 2).Value = poaReportVM.summaryVM.Passed.ToString();
        worksheet2.Cell(2, 3).Value = poaReportVM.summaryVM.PassedPercentage.ToString();
        //save file to memory stream and return it as byte array
        using (var ms = new MemoryStream())
        {
            workbook.SaveAs(ms);
            ms.Position = 0;
            var content = ms.ToArray();
            return File(content, "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        }
    }
}
I had a similar problem. For me, the cause was as mentioned in this answer:
I had this issue when I was using EPPlus to customise an existing template. For me the issue was in the template itself as it contained invalid references to lookup tables. I found this in Formula -> Name Manager.
You may find other solutions there.
Apologies as my reputation is too low to add a comment.

OpenXML - counting paragraphs in a document with altChunk elements

I am trying to count the number of paragraphs or runs in a Word document using OpenXML. For a simple document this can be accomplished this way:
using (WordprocessingDocument document = WordprocessingDocument.Open(file, false))
{
    var paragraphs = document.MainDocumentPart.Document.Body.
        Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>();
    int numberOfParagraphs = paragraphs.ToArray().Length;
    var runs = document.MainDocumentPart.Document.Body.
        Descendants<DocumentFormat.OpenXml.Wordprocessing.Run>();
    int numberOfRuns = runs.ToArray().Length;
}
But I run into trouble with documents that have been created from merging small documents using AltChunks. The values returned are wrong and way too low. So I presume that I am missing all or some of the paragraphs or runs in the chunks.
Any ideas for how to count the paragraph or run elements in an AltChunk? I tried looking at descendants in each AltChunk, but I get 0's.
I have an idea that I'd be able to count if I convert each AltChunk to a document, but the WordprocessingDocument.Open method doesn't work when I pass it an AltChunk.
Any ideas would be appreciated.
The mainDocumentPart will hold all the original paragraphs and the altChunks. You need to iterate over the altChunks to know how many paragraphs each altChunk has.
The following code works for me:
static void Main(string[] args)
{
    string fileName1 = @"Destination.docx";
    string fileName2 = @"Source.docx";
    string testFile = @"Test.docx";
    File.Delete(fileName1);
    File.Copy(testFile, fileName1);
    using (WordprocessingDocument myDoc =
        WordprocessingDocument.Open(fileName1, true))
    {
        string altChunkId = "AltChunkId1";
        MainDocumentPart mainPart = myDoc.MainDocumentPart;
        AlternativeFormatImportPart chunk =
            mainPart.AddAlternativeFormatImportPart(
                AlternativeFormatImportPartType.WordprocessingML, altChunkId);
        using (FileStream fileStream = File.Open(fileName2, FileMode.Open))
            chunk.FeedData(fileStream);
        AltChunk altChunk = new AltChunk();
        altChunk.Id = altChunkId;
        mainPart.Document
            .Body
            .InsertAfter(altChunk, mainPart.Document.Body
                .Elements<Paragraph>().Last());
        mainPart.Document.Save();
        // Counting paragraphs
        int paragraphCount = 0;
        paragraphCount += mainPart.Document.Body.Elements<Paragraph>().Count();
        var altChunks = mainPart.Document.Body.Descendants<AltChunk>().ToList();
        foreach (var aChunk in altChunks)
        {
            paragraphCount += aChunk.Parent.Elements<Paragraph>().Count();
        }
        Console.WriteLine("Paragraph Count:: {0}", paragraphCount);
        Console.ReadLine();
    }
}
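If that count is not what you expect, another route (a sketch, under the assumption that each chunk was fed a WordprocessingML payload, i.e. a full .docx, as in the code above) is to open each AlternativeFormatImportPart as its own WordprocessingDocument and count the paragraphs inside it. This continues from the mainPart variable above:
// Sketch: count paragraphs inside each altChunk by opening the referenced
// AlternativeFormatImportPart as a separate WordprocessingDocument.
// Assumes every chunk was fed a WordprocessingML (.docx) payload.
int chunkParagraphCount = 0;
foreach (AltChunk aChunk in mainPart.Document.Body.Descendants<AltChunk>())
{
    var chunkPart = (AlternativeFormatImportPart)mainPart.GetPartById(aChunk.Id.Value);
    using (Stream chunkStream = chunkPart.GetStream())
    using (WordprocessingDocument chunkDoc = WordprocessingDocument.Open(chunkStream, false))
    {
        chunkParagraphCount += chunkDoc.MainDocumentPart.Document.Body
            .Descendants<Paragraph>().Count();
    }
}
Console.WriteLine("Paragraphs inside altChunks: {0}", chunkParagraphCount);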

How does TextRenderInfo work in iTextSharp?

I got some code online that gives me the font sizes, but I don't understand how TextRenderInfo reads the text. I tried renderInfo.GetText(), which returns a seemingly random number of characters, sometimes 3, sometimes 2, sometimes more or fewer. I need to know how renderInfo reads the data.
My intention is to separate every line and paragraph of the PDF and also read their properties individually, such as font size and font style. If you have any suggestions, please mention them.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace FontSizeDig1
{
    class Program
    {
        static void Main(string[] args)
        {
            // reader ==> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/PdfReader.html#pdfVersion
            PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "document.pdf"));
            TextWithFontExtractionStategy S = new TextWithFontExtractionStategy(); // strategy ==> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextExtractionStrategy.html
            // for (int i = 1; i <= reader.NumberOfPages; i++)
            // {
            string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1/*i*/, S);
            // PdfTextExtractor.GetTextFromPage(reader, 6, S) ==>> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/PdfTextExtractor.html
            Console.WriteLine(F);
            // }
            Console.ReadKey();
            //this.Close();
        }
    }

    public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
    {
        //HTML buffer
        private StringBuilder result = new StringBuilder();
        //Store last used properties
        private Vector lastBaseLine;
        private string lastFont;
        private float lastFontSize;
        //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
        private enum TextRenderMode
        {
            FillText = 0,
            StrokeText = 1,
            FillThenStrokeText = 2,
            Invisible = 3,
            FillTextAndAddToPathForClipping = 4,
            StrokeTextAndAddToPathForClipping = 5,
            FillThenStrokeTextAndAddToPathForClipping = 6,
            AddTextToPathForClipping = 7
        }
        public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
        {
            string curFont = renderInfo.GetFont().PostscriptFontName; // http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextRenderInfo.html#getFont--
            //Check if faux bold is used
            if ((renderInfo.GetTextRenderMode() == 2/*(int)TextRenderMode.FillThenStrokeText*/))
            {
                curFont += "-Bold";
            }
            //This code assumes that if the baseline changes then we're on a newline
            Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
            Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
            iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
            Single curFontSize = rect.Height;
            //See if something has changed, either the baseline, the font or the font size
            if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
            {
                //if we've put down at least one span tag close it
                if ((this.lastBaseLine != null))
                {
                    this.result.AppendLine("</span>");
                }
                //If the baseline has changed then insert a line break
                if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
                {
                    this.result.AppendLine("<br />");
                }
                //Create an HTML tag with appropriate styles
                this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
            }
            //Append the current text
            this.result.Append(renderInfo.GetText());
            Console.WriteLine("me=" + renderInfo.GetText()); //by imtiaj
            //Set currently used properties
            this.lastBaseLine = curBaseline;
            this.lastFontSize = curFontSize;
            this.lastFont = curFont;
        }
        public string GetResultantText()
        {
            //If we wrote anything then we'll always have a missing closing tag so close it here
            if (result.Length > 0)
            {
                result.Append("</span>");
            }
            return result.ToString();
        }
        //Not needed
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo) { }
    }
}
Take a look at this PDF:
What do you see?
I see:
Hello World
Hello People
Now, let's parse this file. What do you expect?
You probably expect:
Hello World
Hello People
I don't.
That's where you and I differ, and that difference explains why you ask this question.
What do I expect?
Well, I'll start by looking inside the PDF, more specifically at the content stream of the first page:
I see 4 strings in the content stream: ld, Wor, llo, and He (in that order). I also see coordinates. Using those coordinates, I can compose what is shown:
Hello World
I don't immediately see "Hello People" anywhere, but I do see a reference to a Form XObject named /Xf1, so let's examine that Form XObject:
Woohoo! I'm in luck, "Hello People" is stored in the document as a single string value. I don't need to look at the coordinates to compose the actual text that I can see with my human eyes.
Now for your question. You say "I need to know how the renderInfo is reading data" and now you know: by default, iText will read all the strings from a page in the order they occur: ld, Wor, llo, He, and Hello People.
Depending on how the PDF is created, you can have output that is easy to read (Hello People), or output that is hard to read (ld, Wor, llo, He). iText comes with "strategies" that reorder all those snippets so that [ld, Wor, llo, He] is presented as [He, llo, Wor, ld], but detecting which of those parts belong to the same line, and which lines belong to the same paragraph, is something you will have to do.
NOTE: at iText Group, we already have plenty of closed source code that could save you plenty of time. Since we are the copyright owner of the iText library, we can ask money for that closed source code. That's something you typically can't do if you're using iText for free (because of the AGPL). However, if you are a customer of iText, we can probably disclose more source code. Do not expect us to give that code for free, as that code has too much commercial value.
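To make the "which snippets belong to the same line" point a little more concrete, here is a rough sketch (mine, not part of the answer above): let LocationTextExtractionStrategy do the per-page reordering and then treat each newline-separated string in its output as an approximate visual line. Grouping lines into paragraphs (for example by looking at vertical gaps) is still up to you. The file path is a placeholder:
// Sketch: approximate "lines" by letting the location-based strategy
// reorder the snippets, then splitting the page text on newlines.
PdfReader reader = new PdfReader("document.pdf"); // placeholder path
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    string pageText = PdfTextExtractor.GetTextFromPage(
        reader, page, new LocationTextExtractionStrategy());
    foreach (string line in pageText.Split('\n'))
    {
        Console.WriteLine(line);
    }
}
reader.Close();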

Extracting text with iText does not work: encoding or encrypted text?

I have a PDF file that has the following security properties: printing: allowed; document assembly: NOT allowed; content copying: allowed; content copying for accessibility: allowed; page extraction: NOT allowed.
I try to get the text with sample code from the documentation, as follows:
pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    text.Append(System.Environment.NewLine);
    text.Append("\n Page Number:" + page);
    text.Append(System.Environment.NewLine);
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
    progressBar1.Value++;
}
pdftext.Text += text.ToString();
pdfReader.Close();
but the output text is lines of "??? ? ???????\n?? ??? ?" values;
it seems that the file is encrypted or that we have an encoding problem...
note that the following lines
var f = pdfReader.IsOpenedWithFullPermissions; -> FALSE
var f1 = pdfReader.IsEncrypted(); -> FALSE
var f2 = pdfReader.ComputeUserPassword(); -> NULL
var f3 = pdfReader.Is128Key(); -> FALSE
var f4 = pdfReader.HasUsageRights();
f, f1, f3, f4 return FALSE ... so it seems that the document is not encrypted,
...so I don't know if it is an encoding problem or something related to encrypted strings...
Can someone help me?
Thanks in advance.
G.G.
Whenever you have trouble extracting text from a document using standard code, the first thing to do is to try to copy & paste the text from it using Adobe Acrobat Reader. Adobe Reader copy & paste implements text extraction according to the recommendations of the PDF specification, and if it fails, that usually means the information required for text extraction is either missing from the document or broken (by accident or by design). To extract the text, one then either needs to customize the code to the specific PDF or resort to OCR.
In case of the document at hand, Adobe Reader copy&paste does result in garbage, too, just like when extracting with iText. Thus, there is something fishy in the document.
Inspecting the document one finds that the fonts contain ToUnicode mappings like this:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (Identity)
  /Supplement 0
>> def
/CMapName /F18 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
44 beginbfrange
<20> <20> <0020>
<21> <21> <E0F9>
<22> <22> <E0F1>
<23> <23> <E0FA>
<24> <24> <E0F7>
<25> <25> <E0A3>
<26> <26> <E084>
<27> <27> <E097>
<28> <28> <E098>
<29> <29> <E09A>
<2A> <2A> <E08A>
<2B> <2B> <E099>
<2C> <2C> <E0A5>
<2D> <2D> <E086>
<2E> <2E> <E094>
<2F> <2F> <E0DE>
<30> <30> <E0A6>
<31> <31> <E096>
<32> <32> <E088>
<33> <33> <E082>
<34> <34> <E04C>
<35> <35> <E0A4>
<36> <36> <E0F6>
<37> <37> <E0F2>
<38> <38> <E0D8>
<39> <39> <E0AA>
<3A> <3A> <E06C>
<3B> <3B> <E087>
<3C> <3C> <E095>
<3D> <3D> <E0C4>
<3E> <3E> <E07E>
<3F> <3F> <E055>
<40> <40> <E089>
<41> <41> <E085>
<42> <42> <E083>
<43> <43> <E070>
<44> <44> <E0E6>
<45> <45> <E080>
<46> <46> <E0C8>
<47> <47> <E0F4>
<48> <48> <E062>
<49> <49> <E0F3>
<4A> <4A> <E04E>
<4B> <4B> <E05E>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
I.e., in case you are not into this: the fonts claim that all their glyphs (with the exception of the space glyph at 0x20) represent characters U+E0xx from the Unicode private use area. As the name of that area indicates, there is no common meaning for characters with these values.
Thus, text extraction according to the PDF specification will return strings of characters with undefined meaning with results as you observed in iText or I saw in Adobe Reader.
Sometimes in such a situation one can still enforce proper text extraction by ignoring the ToUnicode map and using either the font Encoding or information inside the embedded font program.
Unfortunately it turns out that here the Encoding effectively contains the same information as does the ToUnicode map, e.g. for the same font as above
/Differences [ 32 /space /uniE0F9 /uniE0F1 /uniE0FA /uniE0F7 /uniE0A3 /uniE084 /uniE097 /uniE098
/uniE09A /uniE08A /uniE099 /uniE0A5 /uniE086 /uniE094 /uniE0DE /uniE0A6 /uniE096
/uniE088 /uniE082 /uniE04C /uniE0A4 /uniE0F6 /uniE0F2 /uniE0D8 /uniE0AA /uniE06C
/uniE087 /uniE095 /uniE0C4 /uniE07E /uniE055 /uniE089 /uniE085 /uniE083 /uniE070
/uniE0E6 /uniE080 /uniE0C8 /uniE0F4 /uniE062 /uniE0F3 /uniE04E /uniE05E ]
and the fonts turn out to be Type 3 fonts, i.e. there is no embedded font program; each glyph is defined as an individual PDF canvas without further character information.
Thus, nothing to gain here either.
Actually these small PDF canvasses contain inlined bitmap graphics of the respective glyph which also is the cause of the poor graphical quality of the document (if you don't see that immediately, simply zoom in a bit and you'll see the ragged outlines of the glyphs).
By the way, such a construct usually means that the producer of the PDF explicitly wants to prevent text extraction.
If you happen to have to extract text from many such documents, you can try and determine a mapping from their U+E0xx characters to actually sensible Unicode characters and apply that mapping to your extracted text.
If all those fonts in all those documents happen to use the same U+E0xx codepoints for the same actual characters, you'll be able to do text extraction from those documents after investing a certain amount of initial work.
Otherwise do try OCR.
The following code adds pages to a document which map the ToUnicode values to the characters shown:
void AddFontsTo(PdfReader reader, PdfStamper stamper)
{
    int documentPages = reader.NumberOfPages;
    for (int page = 1; page <= documentPages; page++)
    {
        // ignore inherited resources for now
        PdfDictionary pageResources = reader.GetPageResources(page);
        if (pageResources == null)
            continue;
        PdfDictionary pageFonts = pageResources.GetAsDict(PdfName.FONT);
        if (pageFonts == null || pageFonts.Size == 0)
            continue;
        List<BaseFont> fonts = new List<BaseFont>();
        List<string> fontNames = new List<string>();
        HashSet<char> chars = new HashSet<char>();
        foreach (PdfName key in pageFonts.Keys)
        {
            PdfIndirectReference fontReference = pageFonts.GetAsIndirectObject(key);
            if (fontReference == null)
                continue;
            DocumentFont font = (DocumentFont) BaseFont.CreateFont((PRIndirectReference)fontReference);
            if (font == null)
                continue;
            PdfObject toUni = PdfReader.GetPdfObjectRelease(font.FontDictionary.Get(PdfName.TOUNICODE));
            CMapToUnicode toUnicodeCmap = null;
            if (toUni is PRStream)
            {
                try
                {
                    byte[] touni = PdfReader.GetStreamBytes((PRStream)toUni);
                    CidLocationFromByte lb = new CidLocationFromByte(touni);
                    toUnicodeCmap = new CMapToUnicode();
                    CMapParserEx.ParseCid("", toUnicodeCmap, lb);
                }
                catch
                {
                    toUnicodeCmap = null;
                }
            }
            if (toUnicodeCmap == null)
                continue;
            ICollection<int> mapValues = toUnicodeCmap.CreateDirectMapping().Values;
            if (mapValues.Count == 0)
                continue;
            fonts.Add(font);
            fontNames.Add(key.ToString());
            foreach (int value in mapValues)
                chars.Add((char)value);
        }
        if (fonts.Count == 0 || chars.Count == 0)
            continue;
        Rectangle size = (fonts.Count > 10) ? PageSize.A4.Rotate() : PageSize.A4;
        PdfPTable table = new PdfPTable(fonts.Count + 1);
        table.AddCell("Page " + page);
        foreach (String name in fontNames)
        {
            table.AddCell(name);
        }
        table.HeaderRows = 1;
        float[] widths = new float[fonts.Count + 1];
        widths[0] = 2;
        for (int i = 1; i <= fonts.Count; i++)
            widths[i] = 1;
        table.SetWidths(widths);
        table.WidthPercentage = 100;
        List<char> charList = new List<char>(chars);
        charList.Sort();
        foreach (char character in charList)
        {
            table.AddCell(((int)character).ToString("X4"));
            foreach (BaseFont font in fonts)
            {
                table.AddCell(new PdfPCell(new Phrase(character.ToString(), new Font(font))));
            }
        }
        stamper.InsertPage(reader.NumberOfPages + 1, size);
        ColumnText columnText = new ColumnText(stamper.GetUnderContent(reader.NumberOfPages));
        columnText.AddElement(table);
        columnText.SetSimpleColumn(size);
        while ((ColumnText.NO_MORE_TEXT & columnText.Go(false)) == 0)
        {
            stamper.InsertPage(reader.NumberOfPages + 1, size);
            columnText.Canvas = stamper.GetUnderContent(reader.NumberOfPages);
            columnText.SetSimpleColumn(size);
        }
    }
}
I applied it to your document like this:
string input = @"4700198773.pdf";
string output = @"4700198773-fonts.pdf";
using (PdfReader reader = new PdfReader(input))
using (FileStream stream = new FileStream(output, FileMode.Create, FileAccess.Write))
using (PdfStamper stamper = new PdfStamper(reader, stream))
{
    AddFontsTo(reader, stamper);
}
The additional pages look like this:
Now you have to compare the outputs for the different fonts and pages of this document with each other and with those of a representative selection of files. If you find a good enough pattern, you can try this replacement approach.
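If such a stable pattern does exist, the replacement step itself is straightforward; here is a minimal sketch (the mapping entries below are hypothetical placeholders, to be filled in from the comparison described above; requires System.Collections.Generic and System.Text):
// Sketch: map private-use-area characters back to sensible Unicode.
// The entries are hypothetical; derive the real ones by comparing the
// ToUnicode values with the glyphs actually shown on the stamped pages.
static readonly Dictionary<char, char> GlyphMap = new Dictionary<char, char>
{
    { '\uE0F9', 'a' }, // placeholder
    { '\uE0F1', 'b' }, // placeholder
    // ...
};

static string MapExtractedText(string extracted)
{
    StringBuilder sb = new StringBuilder(extracted.Length);
    foreach (char c in extracted)
    {
        char mapped;
        sb.Append(GlyphMap.TryGetValue(c, out mapped) ? mapped : c);
    }
    return sb.ToString();
}
Apply MapExtractedText to the text you get from PdfTextExtractor; characters without a mapping are passed through unchanged.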

Can PDFsharp automatically split a string over multiple pages?

I want to add the ability to generate a PDF of the content in my application (for simplicity, it will be text-only).
Is there any way to automatically work out how much content will fit in a single page, or to get any content that spills over one page to create a second (third, fourth, etc) page?
I can easily work it out for plain blocks of text - just split the text by a number of characters into a string array and then print each page in turn - but when the text has a lot of white space and carriage returns, this doesn't work.
Any advice?
Current code:
public void Generate(string title, string content, string filename)
{
    PdfDocument document = new PdfDocument();
    PdfPage page;
    document.Info.Title = title;
    XFont font = new XFont("Verdana", 10, XFontStyle.Regular);
    List<String> splitText = new List<string>();
    string textCopy = content;
    int ptr = 0;
    int maxCharacters = 3000;
    while (textCopy.Length > 0)
    {
        //find a space in the text near the max character limit
        int textLength = 0;
        if (textCopy.Length > maxCharacters)
        {
            textLength = maxCharacters;
            int spacePtr = textCopy.IndexOf(' ', textLength);
            string startString = textCopy.Substring(ptr, spacePtr);
            splitText.Add(startString);
            int length = textCopy.Length - startString.Length;
            textCopy = textCopy.Substring(spacePtr, length);
        }
        else
        {
            splitText.Add(textCopy);
            textCopy = String.Empty;
        }
    }
    foreach (string str in splitText)
    {
        page = document.AddPage();
        // Get an XGraphics object for drawing
        XGraphics gfx = XGraphics.FromPdfPage(page);
        XTextFormatter tf = new XTextFormatter(gfx);
        XRect rect = new XRect(40, 100, 500, 600);
        gfx.DrawRectangle(XBrushes.Transparent, rect);
        tf.DrawString(str, font, XBrushes.Black, rect, XStringFormats.TopLeft);
    }
    document.Save(filename);
}
You can download PDFsharp together with MigraDoc. MigraDoc will automatically add pages as needed; you just create a document and add your text as paragraphs.
See MigraDoc samples page:
http://pdfsharp.net/wiki/MigraDocSamples.ashx
MigraDoc is the recommended way.
If you want to stick to PDFsharp, you can use the XTextFormatter class (source included with PDFsharp) to create a new class that also supports page breaks (e.g. by returning the count of chars that fit on the current page and have the calling code create a new page and call the formatter again with the remaining text).
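A minimal sketch of the MigraDoc route (assuming the classic MigraDoc 1.x API as used in the samples linked above; the parameters mirror the Generate method in the question):
// Sketch: let MigraDoc lay out the text and add pages automatically.
using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;

static void GeneratePdf(string title, string content, string filename)
{
    Document document = new Document();
    document.Info.Title = title;

    Section section = document.AddSection();
    // Each paragraph flows onto new pages as needed; no manual splitting.
    foreach (string block in content.Split('\n'))
    {
        section.AddParagraph(block);
    }

    PdfDocumentRenderer renderer = new PdfDocumentRenderer(true); // true = unicode
    renderer.Document = document;
    renderer.RenderDocument();
    renderer.PdfDocument.Save(filename);
}
MigraDoc decides the page breaks during RenderDocument, which is exactly the part the manual character-count splitting in the question tries to approximate.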
