OpenXML - counting paragraphs in a document with altChunk elements

I am trying to count the number of paragraphs or runs in a Word document using OpenXML. For a simple document this can be accomplished this way:
using (WordprocessingDocument document = WordprocessingDocument.Open(file, false))
{
var paragraphs = document.MainDocumentPart.Document.Body.
Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>();
int numberOfParagraphs = paragraphs.ToArray().Length;
var runs = document.MainDocumentPart.Document.Body.
Descendants<DocumentFormat.OpenXml.Wordprocessing.Run>();
int numberOfRuns = runs.ToArray().Length;
}
But I run into trouble with documents that have been created from merging small documents using AltChunks. The values returned are wrong and way too low. So I presume that I am missing all or some of the paragraphs or runs in the chunks.
Any ideas for how to count the paragraph or run elements in an AltChunk? I tried looking at descendants in each AltChunk, but I get 0's.
I have an idea that I'd be able to count if I convert each AltChunk to a document, but the WordprocessingDocument.Open method doesn't work when I pass it an AltChunk.
Any ideas would be appreciated.

The mainDocumentPart will hold all the original paragraphs and the altChunks. You need to iterate over the altChunks to know how many paragraphs each altChunk has.
The following code works for me:
static void Main(string[] args)
{
string fileName1 = @"Destination.docx";
string fileName2 = @"Source.docx";
string testFile = @"Test.docx";
File.Delete(fileName1);
File.Copy(testFile, fileName1);
using (WordprocessingDocument myDoc =
WordprocessingDocument.Open(fileName1, true))
{
string altChunkId = "AltChunkId1";
MainDocumentPart mainPart = myDoc.MainDocumentPart;
AlternativeFormatImportPart chunk =
mainPart.AddAlternativeFormatImportPart(
AlternativeFormatImportPartType.WordprocessingML, altChunkId);
using (FileStream fileStream = File.Open(fileName2, FileMode.Open))
chunk.FeedData(fileStream);
AltChunk altChunk = new AltChunk();
altChunk.Id = altChunkId;
mainPart.Document
.Body
.InsertAfter(altChunk, mainPart.Document.Body
.Elements<Paragraph>().Last());
mainPart.Document.Save();
// Counting paragraphs
int paragraphCount = 0;
paragraphCount += mainPart.Document.Body.Elements<Paragraph>().Count();
var altChunks = mainPart.Document.Body.Descendants<AltChunk>().ToList();
foreach (var aChunk in altChunks)
{
paragraphCount += aChunk.Parent.Elements<Paragraph>().Count();
}
Console.WriteLine("Paragraph Count:: {0}", paragraphCount);
Console.ReadLine();
}
}
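Note that the loop above counts the paragraphs of each altChunk's parent (the body) rather than the content of the chunk itself, because an AltChunk element has no paragraph children until Word opens the file and merges the chunk in. A minimal sketch of one way to count the imported content, assuming every chunk was added as WordprocessingML (a complete .docx package) as in the code above: resolve each chunk's part by its relationship id and open that part as a document of its own. The class and method names are hypothetical, and chunks nested inside chunks are not followed.

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class AltChunkCounter
{
    // Counts the paragraphs in the main body plus those inside every
    // altChunk whose imported part is itself a .docx package.
    public static int CountParagraphs(string path)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;
            int count = mainPart.Document.Body.Descendants<Paragraph>().Count();

            foreach (AltChunk altChunk in mainPart.Document.Body.Descendants<AltChunk>())
            {
                // The chunk's content lives in a separate part, referenced by relationship id.
                OpenXmlPart chunkPart = mainPart.GetPartById(altChunk.Id);

                // Copy to a MemoryStream so the nested package gets a seekable stream.
                using (Stream partStream = chunkPart.GetStream(FileMode.Open, FileAccess.Read))
                using (MemoryStream buffer = new MemoryStream())
                {
                    partStream.CopyTo(buffer);
                    buffer.Position = 0;

                    // Assumption: the chunk is a .docx package; HTML or RTF chunks would fail here.
                    using (WordprocessingDocument chunkDoc = WordprocessingDocument.Open(buffer, false))
                    {
                        count += chunkDoc.MainDocumentPart.Document.Body
                            .Descendants<Paragraph>().Count();
                    }
                }
            }
            return count;
        }
    }
}
```

Descendants is used here, as in the question, so paragraphs inside tables are included; swapping Paragraph for Run counts runs the same way.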

Related

We found a problem with some content in "example.xlsx" - using ClosedXML library

I'm using ClosedXML library to generate a simple Excel file with 2 worksheets.
Whenever I try to open the file, I get this error message:
"We found a problem with some content in "example.xlsx". Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes."
If I click Yes, it displays the data as expected, and I don't see any problems with it. The error also does not appear if I generate only one worksheet.
This is what my stored procedure returns, first result set is populated in sheet1 and second result set is populated in sheet2, which works as expected.
Workbook data
Here is the method I am using; it retrieves the 2 result sets and populates them into 2 different worksheets:
[HttpPost]
[ValidateAntiForgeryToken]
public ActionResult POAReport(POAReportVM model)
{
POAReportVM poaReportVM = reportService.GetPOAReport(model);
using (var workbook = new XLWorkbook())
{
IXLWorksheet worksheet1 = workbook.Worksheets.Add("ProductOrderAccuracy");
worksheet1.Cell("A1").Value = "DATE";
worksheet1.Cell("B1").Value = "ORDER";
worksheet1.Cell("C1").Value = "";
var countsheet1 = 2;
for (int i = 0; i < poaReportVM.productOrderAccuracyList.Count; i++)
{
worksheet1.Cell(countsheet1, 1).Value = poaReportVM.productOrderAccuracyList[i].CompletedDate.ToString();
worksheet1.Cell(countsheet1, 2).Value = poaReportVM.productOrderAccuracyList[i].WebOrderID.ToString();
worksheet1.Cell(countsheet1, 3).Value = poaReportVM.productOrderAccuracyList[i].CompletedIndicator;
countsheet1++;
}
IXLWorksheet worksheet2 = workbook.Worksheets.Add("Summary");
worksheet2.Cell("A1").Value = "Total Orders Sampled";
worksheet2.Cell("B1").Value = "Passed";
worksheet2.Cell("C1").Value = "% Passed";
worksheet2.Cell(2, 1).Value = poaReportVM.summaryVM.TotalOrdersSampled.ToString();
worksheet2.Cell(2, 2).Value = poaReportVM.summaryVM.Passed.ToString();
worksheet2.Cell(2, 3).Value = poaReportVM.summaryVM.PassedPercentage.ToString();
//save file to memory stream and return it as byte array
using (var ms = new MemoryStream())
{
workbook.SaveAs(ms);
ms.Position = 0;
var content = ms.ToArray();
return File(content, "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
}
}
}
I had a similar problem. For me the cause was as mentioned in this answer:
I had this issue when I was using EPPlus to customise an existing template. For me the issue was in the template itself as it contained invalid references to lookup tables. I found this in Formula -> Name Manager.
You may find other solutions there.
Apologies as my reputation is too low to add a comment.

Extracting text with iText does not work: encoding problem or encrypted text?

I have a PDF file that has the following security properties: printing: allowed; document assembly: NOT allowed; content copy: allowed; content copy for accessibility: allowed; page extraction: NOT allowed.
I try to get the text with sample code from the documentation, as follows:
pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
text.Append(System.Environment.NewLine);
text.Append("\n Page Number:" + page);
text.Append(System.Environment.NewLine);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
progressBar1.Value++;
}
pdftext.Text += text.ToString();
pdfReader.Close();
but the output text consists of lines like "??? ? ???????\n?? ??? ? ";
it seems that the file is encrypted or that there is an encoding problem...
Note the following values:
var f = pdfReader.IsOpenedWithFullPermissions; // FALSE
var f1 = pdfReader.IsEncrypted(); // FALSE
var f2 = pdfReader.ComputeUserPassword(); // NULL
var f3 = pdfReader.Is128Key(); // FALSE
var f4 = pdfReader.HasUsageRights(); // FALSE
f, f1, f3 and f4 return FALSE, so it seems that the document is not encrypted,
...so I don't know whether this is an encoding problem or something related to encrypted strings...
Can someone help me?
thanks in advance.
G.G.
Whenever you have trouble extracting text from a document using standard code, the first thing to do is to try to copy and paste the text from it using Adobe Acrobat Reader. Adobe Reader's copy and paste implements text extraction according to the recommendations of the PDF specification, and if this fails, it usually means that the information required for text extraction is either missing from the document or broken (by accident or by design). To extract the text, one then needs either to customize the code for that specific PDF or to resort to OCR.
In case of the document at hand, Adobe Reader copy&paste does result in garbage, too, just like when extracting with iText. Thus, there is something fishy in the document.
Inspecting the document one finds that the fonts contain ToUnicode mappings like this:
/CIDInit /ProcSet
findresource begin 12 dict begin begincmap /CIDSystemInfo<</Registry(Adobe)
/Ordering(Identity)
/Supplement 0
>>
def
/CMapName/F18 def
1 begincodespacerange <0000> <FFFF> endcodespacerange
44 beginbfrange
<20> <20> <0020>
<21> <21> <E0F9>
<22> <22> <E0F1>
<23> <23> <E0FA>
<24> <24> <E0F7>
<25> <25> <E0A3>
<26> <26> <E084>
<27> <27> <E097>
<28> <28> <E098>
<29> <29> <E09A>
<2A> <2A> <E08A>
<2B> <2B> <E099>
<2C> <2C> <E0A5>
<2D> <2D> <E086>
<2E> <2E> <E094>
<2F> <2F> <E0DE>
<30> <30> <E0A6>
<31> <31> <E096>
<32> <32> <E088>
<33> <33> <E082>
<34> <34> <E04C>
<35> <35> <E0A4>
<36> <36> <E0F6>
<37> <37> <E0F2>
<38> <38> <E0D8>
<39> <39> <E0AA>
<3A> <3A> <E06C>
<3B> <3B> <E087>
<3C> <3C> <E095>
<3D> <3D> <E0C4>
<3E> <3E> <E07E>
<3F> <3F> <E055>
<40> <40> <E089>
<41> <41> <E085>
<42> <42> <E083>
<43> <43> <E070>
<44> <44> <E0E6>
<45> <45> <E080>
<46> <46> <E0C8>
<47> <47> <E0F4>
<48> <48> <E062>
<49> <49> <E0F3>
<4A> <4A> <E04E>
<4B> <4B> <E05E>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
I.e., if you are not into this, the fonts claim that all their glyphs (with the exception of the space glyph at 0x20) represent characters U+E0xx from the Unicode private use area. As the name of that area indicates, there is no common meaning of characters with these values.
Thus, text extraction according to the PDF specification will return strings of characters with undefined meaning with results as you observed in iText or I saw in Adobe Reader.
Sometimes in such a situation one can still enforce proper text extraction by ignoring the ToUnicode map and using either the font Encoding or information inside the embedded font program.
Unfortunately it turns out that here the Encoding effectively contains the same information as does the ToUnicode map, e.g. for the same font as above
/Differences [ 32 /space /uniE0F9 /uniE0F1 /uniE0FA /uniE0F7 /uniE0A3 /uniE084 /uniE097 /uniE098
/uniE09A /uniE08A /uniE099 /uniE0A5 /uniE086 /uniE094 /uniE0DE /uniE0A6 /uniE096
/uniE088 /uniE082 /uniE04C /uniE0A4 /uniE0F6 /uniE0F2 /uniE0D8 /uniE0AA /uniE06C
/uniE087 /uniE095 /uniE0C4 /uniE07E /uniE055 /uniE089 /uniE085 /uniE083 /uniE070
/uniE0E6 /uniE080 /uniE0C8 /uniE0F4 /uniE062 /uniE0F3 /uniE04E /uniE05E ]
and the fonts turn out to be Type3 fonts, i.e. there is no embedded font program but each glyph is defined as an individual PDF canvas without further character information.
Thus, nothing to gain here either.
Actually these small PDF canvasses contain inlined bitmap graphics of the respective glyph which also is the cause of the poor graphical quality of the document (if you don't see that immediately, simply zoom in a bit and you'll see the ragged outlines of the glyphs).
By the way, such a construct usually means that the producer of the PDF explicitly wants to prevent text extraction.
If you happen to have to extract text from many such documents, you can try and determine a mapping from their U+E0xx characters to actually sensible Unicode characters and apply that mapping to your extracted text.
If all those fonts in all those documents happen to use the same U+E0xx codepoints for the same actual characters, you'll be able to do text extraction from those documents after investing a certain amount of initial work.
Otherwise do try OCR.
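A minimal sketch of applying such a mapping to already extracted text, assuming one table is valid for all fonts and documents involved. The two entries shown are placeholders rather than values determined from the document, and the class name is hypothetical; the real table has to be built by hand from the glyph shapes.

```csharp
using System.Collections.Generic;
using System.Text;

static class PrivateUseRemapper
{
    // Hypothetical table: these pairs are placeholders, not values read from the PDF.
    // Each key is a private use area character, each value the character its glyph shows.
    static readonly Dictionary<char, char> Map = new Dictionary<char, char>
    {
        { '\uE0F9', 'A' },
        { '\uE0F1', 'B' },
    };

    // Replaces every mapped private use character; all other characters pass through unchanged.
    public static string Remap(string extracted)
    {
        StringBuilder result = new StringBuilder(extracted.Length);
        foreach (char c in extracted)
        {
            char mapped;
            result.Append(Map.TryGetValue(c, out mapped) ? mapped : c);
        }
        return result.ToString();
    }
}
```

It would be called on each page's text, e.g. PrivateUseRemapper.Remap(currentText). If the fonts turn out to use different code points for the same character, one table per font is needed and the remapping has to happen during extraction, where the font of each text chunk is still known.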
The following code adds pages to a document which map the ToUnicode values to the characters shown:
void AddFontsTo(PdfReader reader, PdfStamper stamper)
{
int documentPages = reader.NumberOfPages;
for (int page = 1; page <= documentPages; page++)
{
// ignore inherited resources for now
PdfDictionary pageResources = reader.GetPageResources(page);
if (pageResources == null)
continue;
PdfDictionary pageFonts = pageResources.GetAsDict(PdfName.FONT);
if (pageFonts == null || pageFonts.Size == 0)
continue;
List<BaseFont> fonts = new List<BaseFont>();
List<string> fontNames = new List<string>();
HashSet<char> chars = new HashSet<char>();
foreach (PdfName key in pageFonts.Keys)
{
PdfIndirectReference fontReference = pageFonts.GetAsIndirectObject(key);
if (fontReference == null)
continue;
DocumentFont font = (DocumentFont) BaseFont.CreateFont((PRIndirectReference)fontReference);
if (font == null)
continue;
PdfObject toUni = PdfReader.GetPdfObjectRelease(font.FontDictionary.Get(PdfName.TOUNICODE));
CMapToUnicode toUnicodeCmap = null;
if (toUni is PRStream)
{
try
{
byte[] touni = PdfReader.GetStreamBytes((PRStream)toUni);
CidLocationFromByte lb = new CidLocationFromByte(touni);
toUnicodeCmap = new CMapToUnicode();
CMapParserEx.ParseCid("", toUnicodeCmap, lb);
}
catch
{
toUnicodeCmap = null;
}
}
if (toUnicodeCmap == null)
continue;
ICollection<int> mapValues = toUnicodeCmap.CreateDirectMapping().Values;
if (mapValues.Count == 0)
continue;
fonts.Add(font);
fontNames.Add(key.ToString());
foreach (int value in mapValues)
chars.Add((char)value);
}
if (fonts.Count == 0 || chars.Count == 0)
continue;
Rectangle size = (fonts.Count > 10) ? PageSize.A4.Rotate() : PageSize.A4;
PdfPTable table = new PdfPTable(fonts.Count + 1);
table.AddCell("Page " + page);
foreach (String name in fontNames)
{
table.AddCell(name);
}
table.HeaderRows = 1;
float[] widths = new float[fonts.Count + 1];
widths[0] = 2;
for (int i = 1; i <= fonts.Count; i++)
widths[i] = 1;
table.SetWidths(widths);
table.WidthPercentage = 100;
List<char> charList = new List<char>(chars);
charList.Sort();
foreach (char character in charList)
{
table.AddCell(((int)character).ToString("X4"));
foreach (BaseFont font in fonts)
{
table.AddCell(new PdfPCell(new Phrase(character.ToString(), new Font(font))));
}
}
stamper.InsertPage(reader.NumberOfPages + 1, size);
ColumnText columnText = new ColumnText(stamper.GetUnderContent(reader.NumberOfPages));
columnText.AddElement(table);
columnText.SetSimpleColumn(size);
while ((ColumnText.NO_MORE_TEXT & columnText.Go(false)) == 0)
{
stamper.InsertPage(reader.NumberOfPages + 1, size);
columnText.Canvas = stamper.GetUnderContent(reader.NumberOfPages);
columnText.SetSimpleColumn(size);
}
}
}
I applied it to your document like this:
string input = @"4700198773.pdf";
string output = @"4700198773-fonts.pdf";
using (PdfReader reader = new PdfReader(input))
using (FileStream stream = new FileStream(output, FileMode.Create, FileAccess.Write))
using (PdfStamper stamper = new PdfStamper(reader, stream))
{
AddFontsTo(reader, stamper);
}
The additional pages look like this:
Now you have to compare the outputs for the different fonts and pages of this document with each other and with those of a representative selection of files. If you find a good enough pattern, you can try the replacement approach described above.

How to repeat a paragraph which is in a section using ASPOSE.DLL

I need to repeat a particular paragraph that is inside a section of a Word document.
The Word document is divided into sections, and each section contains paragraphs like the ones below:
#Section Start
1) TO RECEIVE AND ADOPT FINANCIAL STATEMENTS FOR THE YEAR ENDED [FYE]
a.That the Financial Statements of the Company for the financial year ended [FYE] together with the Director(s)' Report and Statement thereon be hereby received and adopted.
b. Second paragraph.
c. Third paragraph.
#Section End
I want to repeat point "a" 3 times.
I tried the code below:
// Copy all content including headers and footers from the specified
//pages into the destination document.
ArrayList pageSections = finder.RetrieveAllNodesOnPages(1, doc.Sections.Count, NodeType.Section);
System.Data.DataTable dt = GetDataTable(); //Sample DataTable which is having Keys and Values
int sectionCount = 0;
foreach (Section section in pageSections)
{
NodeCollection paragraphs = section.GetChildNodes(NodeType.Paragraph, true);
for (int i = 0; i < paragraphs.Count; i++)
{
string text = paragraphs[i].Range.Text;
}
}
Please help me repeat a paragraph.
I am working as a Social Media Developer at Aspose. Please use the following sample code to repeat a paragraph using Aspose.Words for .NET.
Document doc = new Document("document.docx");
PageNumberFinder finder = new PageNumberFinder(doc);
// Split nodes which are found across pages.
finder.SplitNodesAcrossPages(true);
// Copy all content including headers and footers from the specified pages into the
//destination document.
ArrayList pageSections = finder.RetrieveAllNodesOnPages(1, doc.Sections.Count, NodeType.Section);
//Sample DataTable which is having Keys and Values
System.Data.DataTable dt = GetDataTable();
int sectionCount = 0;
foreach (Section section in pageSections)
{
NodeCollection paragraphs = section.GetChildNodes(NodeType.Paragraph, true);
for (int i = 0; i < paragraphs.Count; i++)
{
//Paragraph you want to copy
if (i == 10)
{
//Use Document Builder to Navigate to the paragraph
DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveTo(paragraphs[i]);
//Insert a Paragraph break
builder.InsertParagraph();
//Insert the Paragraph to repeat it
builder.Writeln(paragraphs[i].ToString(SaveFormat.Text));
}
}
}
doc.Save("test.docx");
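The sample above re-inserts the paragraph as plain text, so the formatting of its runs is not carried over. An alternative sketch, assuming the paragraph's index in the document is already known: deep-clone the paragraph node and insert the clones right after the original. The helper name and its parameters are hypothetical.

```csharp
using Aspose.Words;

static class ParagraphRepeater
{
    // Inserts `copies` deep clones of the paragraph at `paragraphIndex`
    // (zero-based, counted over the whole document) directly after it.
    public static void Repeat(Document doc, int paragraphIndex, int copies)
    {
        Paragraph source = (Paragraph)doc.GetChild(NodeType.Paragraph, paragraphIndex, true);
        if (source == null)
            return; // no paragraph at that index

        for (int n = 0; n < copies; n++)
        {
            // Clone(true) copies the child runs and their formatting along with the paragraph.
            source.ParentNode.InsertAfter(source.Clone(true), source);
        }
    }
}
```

Called as ParagraphRepeater.Repeat(doc, 10, 2) before doc.Save, this would leave three consecutive instances of paragraph 10.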

Can PDFsharp automatically split a string over multiple pages?

I want to add the ability to generate a PDF of the content in my application (for simplicity, it will be text-only).
Is there any way to automatically work out how much content will fit in a single page, or to get any content that spills over one page to create a second (third, fourth, etc) page?
I can easily work it out for blocks of text - just split the text by a number of characters into a string array and then print each page in turn - but when the text has a lot of white space and carriage returns, this doesn't work.
Any advice?
Current code:
public void Generate(string title, string content, string filename)
{
PdfDocument document = new PdfDocument();
PdfPage page;
document.Info.Title = title;
XFont font = new XFont("Verdana", 10, XFontStyle.Regular);
List<String> splitText = new List<string>();
string textCopy = content;
int ptr = 0;
int maxCharacters = 3000;
while (textCopy.Length > 0)
{
//find a space in the text near the max character limit
int textLength = 0;
if (textCopy.Length > maxCharacters)
{
textLength = maxCharacters;
int spacePtr = textCopy.IndexOf(' ', textLength);
if (spacePtr < 0) spacePtr = textCopy.Length; // no later space: take the rest
string startString = textCopy.Substring(ptr, spacePtr);
splitText.Add(startString);
textCopy = textCopy.Substring(spacePtr);
}
else
{
splitText.Add(textCopy);
textCopy = String.Empty;
}
}
foreach (string str in splitText)
{
page = document.AddPage();
// Get an XGraphics object for drawing
XGraphics gfx = XGraphics.FromPdfPage(page);
XTextFormatter tf = new XTextFormatter(gfx);
XRect rect = new XRect(40, 100, 500, 600);
gfx.DrawRectangle(XBrushes.Transparent, rect);
tf.DrawString(str, font, XBrushes.Black, rect, XStringFormats.TopLeft);
}
document.Save(filename);
}
You can download PDFsharp together with MigraDoc. MigraDoc will automatically add pages as needed, you just create a document and add your text as paragraphs.
See MigraDoc samples page:
http://pdfsharp.net/wiki/MigraDocSamples.ashx
MigraDoc is the recommended way.
If you want to stick to PDFsharp, you can use the XTextFormatter class (source included with PDFsharp) to create a new class that also supports page breaks (e.g. by returning the count of chars that fit on the current page and have the calling code create a new page and call the formatter again with the remaining text).
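A minimal sketch of the MigraDoc route, assuming the MigraDoc build that ships with PDFsharp 1.50 and reusing the Generate signature from the question. Treating every line of the input as one paragraph is an assumption of this sketch, as is the class name.

```csharp
using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;

static class MigraDocExample
{
    public static void Generate(string title, string content, string filename)
    {
        Document document = new Document();
        document.Info.Title = title;

        Section section = document.AddSection();

        // One paragraph per input line; MigraDoc wraps lines and starts new pages as needed.
        foreach (string line in content.Split('\n'))
        {
            Paragraph paragraph = section.AddParagraph(line.TrimEnd('\r'));
            paragraph.Format.Font.Name = "Verdana";
            paragraph.Format.Font.Size = 10;
        }

        // Lay out the document and write the resulting PDF file.
        PdfDocumentRenderer renderer = new PdfDocumentRenderer(true);
        renderer.Document = document;
        renderer.RenderDocument();
        renderer.PdfDocument.Save(filename);
    }
}
```

No character counting is involved: pagination comes from the layout step in RenderDocument, so text with a lot of white space and line breaks is handled the same way as dense text.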

How to update a PDF file?

I am required to replace a word in a PDF document with a new word, selected by the user from a drop-down list, in ASP.NET. I am using iTextSharp, but the new PDF that is created is all distorted, because I am not able to extract the formatting/styling info of the PDF while extracting the text. Also, is there a way to read a PDF line by line? Please help.
protected void Page_Load(object sender, EventArgs e)
{
String s = DropDownList1.SelectedValue;
Response.Write(s);
ListFieldNames(s);
}
private void CreatePDF(string text)
{
string outFileName = @"z:\TEMP\PDF\Test_abc.pdf";
Document doc = new Document();
doc.SetMargins(30f, 30f, 30f, 30f);
PdfWriter.GetInstance(doc, new FileStream(outFileName, FileMode.Create));
doc.Open();
BaseFont bfTimes = BaseFont.CreateFont(BaseFont.COURIER, BaseFont.CP1252, false);
Font times = new Font(bfTimes, 12, Font.BOLDITALIC);
//Chunk ch = new Chunk(text,times);
Paragraph para = new Paragraph(text,times);
//para.SpacingAfter = 9f;
para.Alignment = Element.ALIGN_CENTER;
//para.IndentationLeft = 100;
doc.Add(para);
//doc.Add(new Paragraph(text,times));
doc.Close();
Response.Redirect(@"z:\TEMP\PDF\Test_abc.pdf",false);
}
private void ListFieldNames(string s)
{
ArrayList arrCheck = new ArrayList();
try
{
string pdfTemplate = @"z:\TEMP\PDF\abc.pdf";
//string dest = @"z:\TEMP\PDF\Test_abc.pdf";
PdfReader pdfReader = new PdfReader(pdfTemplate);
string pdfText = string.Empty;
string extracttext = "";
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
PdfReader reader = new PdfReader((string)pdfTemplate);
extracttext = PdfTextExtractor.GetTextFromPage(reader, page, its);
extracttext = Encoding.Unicode.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.Unicode, Encoding.Default.GetBytes(extracttext)));
pdfText = pdfText + extracttext;
pdfText = pdfText.Replace("[xyz]", s);
pdfReader.Close();
}
CreatePDF(pdfText);
}
catch (Exception ex)
{
}
finally
{
}
}
You are making one wrong assumption after the other.
You assume that the concept of "lines" exists in PDF. This is wrong. In Text State, different snippets of text are drawn on the page at absolute positions. For every "show text" operator, iText will return a TextRenderInfo object with the portion of text that was drawn and its coordinates. One line can consist of multiple text snippets. A text snippet may contain whitespace or may even be empty.
You assume that all text in a PDF keeps its natural reading order. This should be true for PDF/UA (UA stands for Universal Accessibility), but it's certainly not true for most PDFs you can find in the wild. That's why iText provides location-based text extraction (see p521 of iText in Action, Second Edition). As explained on p516, the text "Hello World" can be stored in the PDF as "ld", "Wor", "llo", "He". The LocationTextExtractionStrategy will order all the text snippets, reconstructing words if necessary. For instance: it will concatenate "He" and "llo" to "Hello", because there's not sufficient space between the "He" snippet and the "llo" snippet. However, for reasons unknown (probably ignorance), you're using the SimpleTextExtractionStrategy which doesn't order the text based on its location.
You are completely ignoring all the Graphics State operators, as well as the Text State operators that define the font, etc...
You assume that PDF is a Word processing format. This is wrong on many levels, as is your code. Please read the intro of chapter 6 of my book.
All these wrong assumptions almost make me want to vote down your question. At the risk of being voted down myself for this answer, I must tell you that you shouldn't try to "do the same". You're asking something that is very complex, and in many cases even impossible!
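For the extraction part only, a minimal sketch of the location-based strategy mentioned above, assuming iTextSharp 5.x. It orders the text snippets by position on the page; it does not make replacing text in place any easier, and the class name is hypothetical.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class LocationExtraction
{
    // Returns the text of all pages, with snippets ordered by their position on each page.
    public static string ExtractText(string path)
    {
        StringBuilder text = new StringBuilder();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page: a strategy instance accumulates the text it sees.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }
}
```

Unlike the code in the question, the reader is opened once and closed after the last page rather than inside the loop.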
