Extract text with iText not works: encoding or crypted text? - encryption

I have a pdf file that as the follow security properties: printing: allowed; document assembly: NOT allowed; content copy: allowed; content copy for accessibility: allowed; page extraction:NOT allowed;
I try to get text with sample code as documentation sample as follow:
pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
text.Append(System.Environment.NewLine);
text.Append("\n Page Number:" + page);
text.Append(System.Environment.NewLine);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
progressBar1.Value++;
}
pdftext.Text += text.ToString();
pdfReader.Close();
but the output text is lines with ""??? ? ???????\n?? ??? ? " values;
seems that file is crypted or we have a encoding problem...
note that in the follow lines
var f = pdfReader.IsOpenedWithFullPermissions; -> FALSE
var f1 = pdfReader.IsEncrypted(); - > FALSE
var f2 = pdfReader.ComputeUserPassword(); - > NULL
var f3 = pdfReader.Is128Key(); - > FALSE
var f4 = pdfReader.HasUsageRights();
f, f1, f3, f4 return FALSE ...than seems that the document is not crypted,
...so I don't know if is a Encoding problem or question related to encrypet strings...
Someone can help me?
thanks in advance.
G.G.

Whenever you have trouble extracting text from a document using standard code, the first thing to do is try and copy&paste the text from it using Adobe Acrobat Reader. Adobe Reader copy&paste implements text extraction according to the recommendations of the PDF specification, and if this fails, this usually means that the necessary information required for text extraction in the document are either missing or broken (by accident or by design). To extract the text, one either needs to customize the code specifically to the specific PDF or resort to OCR.
In case of the document at hand, Adobe Reader copy&paste does result in garbage, too, just like when extracting with iText. Thus, there is something fishy in the document.
Inspecting the document one finds that the fonts contain ToUnicode mappings like this:
/CIDInit /ProcSet
findresource begin 12 dict begin begincmap /CIDSystemInfo<</Registry(Adobe)
/Ordering(Identity)
/Supplement 0
>>
def
/CMapName/F18 def
1 begincodespacerange <0000> <FFFF> endcodespacerange
44 beginbfrange
<20> <20> <0020>
<21> <21> <E0F9>
<22> <22> <E0F1>
<23> <23> <E0FA>
<24> <24> <E0F7>
<25> <25> <E0A3>
<26> <26> <E084>
<27> <27> <E097>
<28> <28> <E098>
<29> <29> <E09A>
<2A> <2A> <E08A>
<2B> <2B> <E099>
<2C> <2C> <E0A5>
<2D> <2D> <E086>
<2E> <2E> <E094>
<2F> <2F> <E0DE>
<30> <30> <E0A6>
<31> <31> <E096>
<32> <32> <E088>
<33> <33> <E082>
<34> <34> <E04C>
<35> <35> <E0A4>
<36> <36> <E0F6>
<37> <37> <E0F2>
<38> <38> <E0D8>
<39> <39> <E0AA>
<3A> <3A> <E06C>
<3B> <3B> <E087>
<3C> <3C> <E095>
<3D> <3D> <E0C4>
<3E> <3E> <E07E>
<3F> <3F> <E055>
<40> <40> <E089>
<41> <41> <E085>
<42> <42> <E083>
<43> <43> <E070>
<44> <44> <E0E6>
<45> <45> <E080>
<46> <46> <E0C8>
<47> <47> <E0F4>
<48> <48> <E062>
<49> <49> <E0F3>
<4A> <4A> <E04E>
<4B> <4B> <E05E>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
I.e., if you are not into this, the fonts claim that all their glyphs (with the exception of the space glyph at 0x20) represent characters U+E0xx from the Unicode private use area. As the name of that area indicates, there is no common meaning of characters with these values.
Thus, text extraction according to the PDF specification will return strings of characters with undefined meaning with results as you observed in iText or I saw in Adobe Reader.
Sometimes in such a situation one can still enforce proper text extraction by ignoring the ToUnicode map and using either the font Encoding or information inside the embedded font program.
Unfortunately it turns out that here the Encoding effectively contains the same information as does the ToUnicode map, e.g. for the same font as above
/Differences [ 32 /space /uniE0F9 /uniE0F1 /uniE0FA /uniE0F7 /uniE0A3 /uniE084 /uniE097 /uniE098
/uniE09A /uniE08A /uniE099 /uniE0A5 /uniE086 /uniE094 /uniE0DE /uniE0A6 /uniE096
/uniE088 /uniE082 /uniE04C /uniE0A4 /uniE0F6 /uniE0F2 /uniE0D8 /uniE0AA /uniE06C
/uniE087 /uniE095 /uniE0C4 /uniE07E /uniE055 /uniE089 /uniE085 /uniE083 /uniE070
/uniE0E6 /uniE080 /uniE0C8 /uniE0F4 /uniE062 /uniE0F3 /uniE04E /uniE05E ]
and the fonts turns out to be Type3 fonts, i.e. there is no embedded font program but each glyph is defined as an individual PDF canvas without further character information.
Thus, nothing to gain here either.
Actually these small PDF canvasses contain inlined bitmap graphics of the respective glyph which also is the cause of the poor graphical quality of the document (if you don't see that immediately, simply zoom in a bit and you'll see the ragged outlines of the glyphs).
By the way, such a construct usually means that the producer of the PDF explicitly wants to prevent text extraction.
If you happen to have to extract text from many such documents, you can try and determine a mapping from their U+E0xx characters to actually sensible Unicode characters and apply that mapping to your extracted text.
If all those fonts in all those documents happen to use the same U+E0xx codepoints for the same actual characters, you'll be able to do text extraction from those documents after investing a certain amount of initial work.
Otherwise do try OCR.
The following code adds pages to a document which map the ToUnicode values to the characters shown:
void AddFontsTo(PdfReader reader, PdfStamper stamper)
{
int documentPages = reader.NumberOfPages;
for (int page = 1; page <= documentPages; page++)
{
// ignore inherited resources for now
PdfDictionary pageResources = reader.GetPageResources(page);
if (pageResources == null)
continue;
PdfDictionary pageFonts = pageResources.GetAsDict(PdfName.FONT);
if (pageFonts == null || pageFonts.Size == 0)
continue;
List<BaseFont> fonts = new List<BaseFont>();
List<string> fontNames = new List<string>();
HashSet<char> chars = new HashSet<char>();
foreach (PdfName key in pageFonts.Keys)
{
PdfIndirectReference fontReference = pageFonts.GetAsIndirectObject(key);
if (fontReference == null)
continue;
DocumentFont font = (DocumentFont) BaseFont.CreateFont((PRIndirectReference)fontReference);
if (font == null)
continue;
PdfObject toUni = PdfReader.GetPdfObjectRelease(font.FontDictionary.Get(PdfName.TOUNICODE));
CMapToUnicode toUnicodeCmap = null;
if (toUni is PRStream)
{
try
{
byte[] touni = PdfReader.GetStreamBytes((PRStream)toUni);
CidLocationFromByte lb = new CidLocationFromByte(touni);
toUnicodeCmap = new CMapToUnicode();
CMapParserEx.ParseCid("", toUnicodeCmap, lb);
}
catch
{
toUnicodeCmap = null;
}
}
if (toUnicodeCmap == null)
continue;
ICollection<int> mapValues = toUnicodeCmap.CreateDirectMapping().Values;
if (mapValues.Count == 0)
continue;
fonts.Add(font);
fontNames.Add(key.ToString());
foreach (int value in mapValues)
chars.Add((char)value);
}
if (fonts.Count == 0 || chars.Count == 0)
continue;
Rectangle size = (fonts.Count > 10) ? PageSize.A4.Rotate() : PageSize.A4;
PdfPTable table = new PdfPTable(fonts.Count + 1);
table.AddCell("Page " + page);
foreach (String name in fontNames)
{
table.AddCell(name);
}
table.HeaderRows = 1;
float[] widths = new float[fonts.Count + 1];
widths[0] = 2;
for (int i = 1; i <= fonts.Count; i++)
widths[i] = 1;
table.SetWidths(widths);
table.WidthPercentage = 100;
List<char> charList = new List<char>(chars);
charList.Sort();
foreach (char character in charList)
{
table.AddCell(((int)character).ToString("X4"));
foreach (BaseFont font in fonts)
{
table.AddCell(new PdfPCell(new Phrase(character.ToString(), new Font(font))));
}
}
stamper.InsertPage(reader.NumberOfPages + 1, size);
ColumnText columnText = new ColumnText(stamper.GetUnderContent(reader.NumberOfPages));
columnText.AddElement(table);
columnText.SetSimpleColumn(size);
while ((ColumnText.NO_MORE_TEXT & columnText.Go(false)) == 0)
{
stamper.InsertPage(reader.NumberOfPages + 1, size);
columnText.Canvas = stamper.GetUnderContent(reader.NumberOfPages);
columnText.SetSimpleColumn(size);
}
}
}
I applied it to your document like this:
string input = #"4700198773.pdf";
string output = #"4700198773-fonts.pdf";
using (PdfReader reader = new PdfReader(input))
using (FileStream stream = new FileStream(output, FileMode.Create, FileAccess.Write))
using (PdfStamper stamper = new PdfStamper(reader, stream))
{
AddFontsTo(reader, stamper);
}
The additional pages look like this:
Now you have to compare the outputs for the different fonts and pages of this document with each other and with those of a representative selection of file. If you find good enough a pattern, you can try this replacement way.

Related

How does TextRenderInfo work in iTextSharp?

I have got some codes from online and they are providing me the font sizes. I did not understand how the TextRenderInfo is reading text. I tried with renderInfo.GetText()) which is giving random number of characters, sometimes 3 characters, sometimes 2 characters or more or less. I need to know how the renderInfo is reading data ?
My intention is to separate every lines and paragraphs from pdf and also read their properties individually such as font size, font style etc. If you have any suggestion, please mention them.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;
namespace FontSizeDig1
{
class Program
{
static void Main(string[] args)
{
// reader ==> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/PdfReader.html#pdfVersion
PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "document.pdf"));
TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();//strategy==> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextExtractionStrategy.html
// for (int i = 1; i <= reader.NumberOfPages; i++)
// {
string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1/*i*/, S);
// PdfTextExtractor.GetTextFromPage(reader, 6, S) ==>> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/PdfTextExtractor.html
Console.WriteLine(F);
// }
Console.ReadKey();
//this.Close();
}
}
public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{
//HTML buffer
private StringBuilder result = new StringBuilder();
//Store last used properties
private Vector lastBaseLine;
private string lastFont;
private float lastFontSize;
//http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
private enum TextRenderMode
{
FillText = 0,
StrokeText = 1,
FillThenStrokeText = 2,
Invisible = 3,
FillTextAndAddToPathForClipping = 4,
StrokeTextAndAddToPathForClipping = 5,
FillThenStrokeTextAndAddToPathForClipping = 6,
AddTextToPaddForClipping = 7
}
public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
{
string curFont = renderInfo.GetFont().PostscriptFontName; // http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextRenderInfo.html#getFont--
//Check if faux bold is used
if ((renderInfo.GetTextRenderMode() == 2/*(int)TextRenderMode.FillThenStrokeText*/))
{
curFont += "-Bold";
}
//This code assumes that if the baseline changes then we're on a newline
Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;
//See if something has changed, either the baseline, the font or the font size
if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
{
//if we've put down at least one span tag close it
if ((this.lastBaseLine != null))
{
this.result.AppendLine("</span>");
}
//If the baseline has changed then insert a line break
if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
{
this.result.AppendLine("<br />");
}
//Create an HTML tag with appropriate styles
this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
}
//Append the current text
this.result.Append(renderInfo.GetText());
Console.WriteLine("me=" + renderInfo.GetText());//by imtiaj
//Set currently used properties
this.lastBaseLine = curBaseline;
this.lastFontSize = curFontSize;
this.lastFont = curFont;
}
public string GetResultantText()
{
//If we wrote anything then we'll always have a missing closing tag so close it here
if (result.Length > 0)
{
result.Append("</span>");
}
return result.ToString();
}
//Not needed
public void BeginTextBlock() { }
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
Take a look at this PDF:
What do you see?
I see:
Hello World
Hello People
Now, let's parse this file? What do you expect?
You probably expect:
Hello World
Hello People
I don't.
That's where you and I differ, and that difference explains why you ask this question.
What do I expect?
Well, I'll start by looking inside the PDF, more specifically at the content stream of the first page:
I see 4 strings in the content stream: ld, Wor, llo, and He (in that order). I also see coordinates. Using those coordinates, I can compose what is shown:
Hello World
I don't immediately see "Hello People" anywhere, but I do see a reference to a Form XObject named /Xf1, so let's examine that Form XObject:
Woohoo! I'm in luck, "Hello People" is stored in the document as a single string value. I don't need to look at the coordinates to compose the actual text that I can see with my human eyes.
Now for your question. You say "I need to know how the renderInfo is reading data" and now you know: by default, iText will read all the strings from a page in the order they occur: ld, Wor, llo, He, and Hello People.
Depending on how the PDF is created, you can have output that is easy to read (Hello People), or output that is hard to read (ld, Wor, llo, He). iText comes with "strategies" that reorder all those snippets so that [ld, Wor, llo, He] is presented as [He, llo, Wor, ld], but detecting which of those parts belong to the same line, and which lines belong to the same paragraph, is something you will have to do.
NOTE: at iText Group, we already have plenty of closed source code that could save you plenty of time. Since we are the copyright owner of the iText library, we can ask money for that closed source code. That's something you typically can't do if you're using iText for free (because of the AGPL). However, if you are a customer of iText, we can probably disclose more source code. Do not expect us to give that code for free, as that code has too much commercial value.

How to repeat a paragraph which is in a section using ASPOSE.DLL

In my Requirement i want to repeat a particular paragraph which is in section in a word document.
here word document divided into sections, in sections we have paragraphs like below
#Section Start
1) TO RECEIVE AND ADOPT FINANCIAL STATEMENTS FOR THE YEAR ENDED [FYE]
a.That the Financial Statements of the Company for the financial year ended [FYE] together with the Director(s)' Report and Statement thereon be hereby received and adopted.
b. Second paragraph.
c. Third paragraph.
#Section End
i want to repeat "a" point into 3 times
i tried the below code
// Copy all content including headers and footers from the specified
//pages into the destination document.
ArrayList pageSections = finder.RetrieveAllNodesOnPages(1, doc.Sections.Count, NodeType.Section);
System.Data.DataTable dt = GetDataTable(); //Sample DataTable which is having Keys and Values
int sectionCount = 0;
foreach (Section section in pageSections)
{
NodeCollection paragraphs = section.GetChildNodes(NodeType.Paragraph, true);
for (int i = 0; i < paragraphs.Count; i++)
{
string text = paragraphs[i].Range.Text;
}
}
Please help me how to repeat a paragraph.
I am working as Social Media Developer at Aspose. Please use the following sample code to repeat a paragraph using Aspose.Words for .NET.
Document doc = new Document("document.docx");
PageNumberFinder finder = new PageNumberFinder(doc);
// Split nodes which are found across pages.
finder.SplitNodesAcrossPages(true);
// Copy all content including headers and footers from the specified pages into the
//destination document.
ArrayList pageSections = finder.RetrieveAllNodesOnPages(1, doc.Sections.Count, NodeType.Section);
//Sample DataTable which is having Keys and Values
System.Data.DataTable dt = GetDataTable();
int sectionCount = 0;
foreach (Section section in pageSections)
{
NodeCollection paragraphs = section.GetChildNodes(NodeType.Paragraph, true);
for (int i = 0; i < paragraphs.Count; i++)
{
//Paragraph you want to copy
if (i == 10)
{
//Use Document Builder to Navigate to the paragraph
DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveTo(paragraphs[i]);
//Insert a Paragraph break
builder.InsertParagraph();
//Insert the Paragraph to repeat it
builder.Writeln(paragraphs[i].ToString(SaveFormat.Text));
}
}
}
doc.Save("test.docx");

Can PDFsharp automatically split a string over multiple pages?

I want to add the ability to generate a PDF of the content in my application (for simplicity, it will be text-only).
Is there any way to automatically work out how much content will fit in a single page, or to get any content that spills over one page to create a second (third, fourth, etc) page?
I can easily work it out for blocks of text - just split the text by a number of characters into a string array and then print each page in turn - but when the text has a lot of white space and character returns, this doesn't work.
Any advice?
Current code:
public void Generate(string title, string content, string filename)
{
PdfDocument document = new PdfDocument();
PdfPage page;
document.Info.Title = title;
XFont font = new XFont("Verdana", 10, XFontStyle.Regular);
List<String> splitText = new List<string>();
string textCopy = content;
int ptr = 0;
int maxCharacters = 3000;
while (textCopy.Length > 0)
{
//find a space in the text near the max character limit
int textLength = 0;
if (textCopy.Length > maxCharacters)
{
textLength = maxCharacters;
int spacePtr = textCopy.IndexOf(' ', textLength);
string startString = textCopy.Substring(ptr, spacePtr);
splitText.Add(startString);
int length = textCopy.Length - startString.Length;
textCopy = textCopy.Substring(spacePtr, length);
}
else
{
splitText.Add(textCopy);
textCopy = String.Empty;
}
}
foreach (string str in splitText)
{
page = document.AddPage();
// Get an XGraphics object for drawing
XGraphics gfx = XGraphics.FromPdfPage(page);
XTextFormatter tf = new XTextFormatter(gfx);
XRect rect = new XRect(40, 100, 500, 600);
gfx.DrawRectangle(XBrushes.Transparent, rect);
tf.DrawString(str, font, XBrushes.Black, rect, XStringFormats.TopLeft);
}
document.Save(filename);
}
You can download PDFsharp together with MigraDoc. MigraDoc will automatically add pages as needed, you just create a document and add your text as paragraphs.
See MigraDoc samples page:
http://pdfsharp.net/wiki/MigraDocSamples.ashx
MigraDoc is the recommended way.
If you want to stick to PDFsharp, you can use the XTextFormatter class (source included with PDFsharp) to create a new class that also supports page breaks (e.g. by returning the count of chars that fit on the current page and have the calling code create a new page and call the formatter again with the remaining text).

How to update a PDF file?

I am required to replace a word with a new word, selected from a drop-down list by user, in a PDF document in ASP.NET. I am using iTextSharp , but the new PDF that is created is all distorted as I am not able to extract the formatting/styling info of the PDF while extracting. Also, IS There a way to read a pdf line-by-line? Please help..
protected void Page_Load(object sender, EventArgs e)
{
String s = DropDownList1.SelectedValue;
Response.Write(s);
ListFieldNames(s);
}
private void CreatePDF(string text)
{
string outFileName = #"z:\TEMP\PDF\Test_abc.pdf";
Document doc = new Document();
doc.SetMargins(30f, 30f, 30f, 30f);
PdfWriter.GetInstance(doc, new FileStream(outFileName, FileMode.Create));
doc.Open();
BaseFont bfTimes = BaseFont.CreateFont(BaseFont.COURIER, BaseFont.CP1252, false);
Font times = new Font(bfTimes, 12, Font.BOLDITALIC);
//Chunk ch = new Chunk(text,times);
Paragraph para = new Paragraph(text,times);
//para.SpacingAfter = 9f;
para.Alignment = Element.ALIGN_CENTER;
//para.IndentationLeft = 100;
doc.Add(para);
//doc.Add(new Paragraph(text,times));
doc.Close();
Response.Redirect(#"z:\TEMP\PDF\Test_abc.pdf",false);
}
private void ListFieldNames(string s)
{
ArrayList arrCheck = new ArrayList();
try
{
string pdfTemplate = #"z:\TEMP\PDF\abc.pdf";
//string dest = #"z:\TEMP\PDF\Test_abc.pdf";
PdfReader pdfReader = new PdfReader(pdfTemplate);
string pdfText = string.Empty;
string extracttext = "";
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
PdfReader reader = new PdfReader((string)pdfTemplate);
extracttext = PdfTextExtractor.GetTextFromPage(reader, page, its);
extracttext = Encoding.Unicode.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.Unicode, Encoding.Default.GetBytes(extracttext)));
pdfText = pdfText + extracttext;
pdfText = pdfText.Replace("[xyz]", s);
pdfReader.Close();
}
CreatePDF(pdfText);
}
catch (Exception ex)
{
}
finally
{
}
}
You are making one wrong assumption after the other.
You assume that the concept of "lines" exists in PDF. This is wrong. In Text State, different snippets of text are drawn on the page at absolute positions. For every "show text" operator, iText will return a TextRenderInfo object with the portion of text that was drawn and its coordinates. One line can consist of multiple text snippets. A text snippet may contain whitespace or may even be empty.
You assume that all text in a PDF keeps its natural reading order. This should be true for PDF/UA (UA stands for Universal Accessibility), but it's certainly not true for most PDFs you can find in the wild. That's why iText provides location-based text extraction (see p521 of iText in Action, Second Edition). As explained on p516, the text "Hello World" can be stored in the PDF as "ld", "Wor", "llo", "He". The LocationTextExtractionStrategy will order all the text snippets, reconstructing words if necessary. For instance: it will concatenate "He" and "llo" to "Hello", because there's not sufficient space between the "He" snippet and the "llo" snippet. However, for reasons unknown (probably ignorance), you're using the SimpleTextExtractionStrategy which doesn't order the text based on its location.
You are completely ignoring all the Graphics State operators, as well as the Text State operators that define the font, etc...
You assume that PDF is a Word processing format. This is wrong on many levels, as is your code. Please read the intro of chapter 6 of my book.
All these wrong assumptions almost make me want to vote down your question. At the risk of being voted down myself for this answer, I must tell you that you shouldn't try to "do the same". You're asking something that is very complex, and in many cases even impossible!

how do i get section target page number in pdf file using iTextSharp?

I have a pdf file which contains Index Page that includes section with target page.
I could get the section name(Section 1.1, Section 5.2) but i can not get the target page number...
For ex:
http://www.mikesdotnetting.com/Article/84/iTextSharp-Links-and-Bookmarks
Here is my code:
string FileName = AppDomain.CurrentDomain.BaseDirectory + "TestPDF.pdf";
PdfReader pdfreader = new PdfReader(FileName);
PdfDictionary PageDictionary = pdfreader.GetPageN(9);
PdfArray Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);
if ((Annots == null) || (Annots.Length == 0))
return;
foreach (PdfObject oAnnot in Annots.ArrayList)
{
PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(oAnnot);
if (AnnotationDictionary.Keys.Contains(PdfName.A))
{
PdfDictionary oALink = AnnotationDictionary.GetAsDict(PdfName.A);
if (oALink.Get(PdfName.S).Equals(PdfName.GOTO))
{
if (oALink.Keys.Contains(PdfName.D))
{
PdfObject objs = oALink.Get(PdfName.D);
if (objs.IsString())
{
string SectionName = objs.ToString(); // here i could see the section name...
}
}
}
}
}
How do i get the target page number?
also I couldn't access the Section name for some pdf ex: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/adobe_supplement_iso32000.pdf
In this PDF 9th page contains a section I could not get the section.
so please give me solution....
There's two possible types of Link Annotations, either A or Dest. The A is the more powerful type but is often overkill. The Dest type just specifies an indirect reference to a page along with some fitting and zooming options.
The Dest value can be a couple of different things but is usually (as far as I've ever seen) a named string destination. You can look up named destinations in the document's name destination dictionary. So before your main loop add this so that it can be referenced later:
//Get all existing named destinations
Dictionary<string, PdfObject> dests = pdfreader.GetNamedDestinationFromStrings();
Once you've got the Dest as a string you can look that object up as a key in the above dictionary.
PdfArray thisDest = (PdfArray)dests[AnnotationDictionary.GetAsString(PdfName.DEST).ToString()];
The first item in the array returned is the indirect reference that you're used to. (Actually, the first item could be an integer representing a page number in a remote document so you might have to check for that.)
PdfIndirectReference a = (PdfIndirectReference)thisDest[0];
PdfObject thisPage = PdfReader.GetPdfObject(a);
Below is code that puts most of the above together, omitting some of the code that you already have. A and Dest are mutually exclusive per the spec so no annotation should ever have both specified.
//Get all existing named desitnations
Dictionary<string, PdfObject> dests = pdfreader.GetNamedDestinationFromStrings();
foreach (PdfObject oAnnot in Annots.ArrayList) {
PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(oAnnot);
if (AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK)) {
if (AnnotationDictionary.Contains(PdfName.A)) {
//...Do normal A stuff here
} else if (AnnotationDictionary.Contains(PdfName.DEST)) {
if (AnnotationDictionary.Get(PdfName.DEST).IsString()) {//Named-based destination
if (dests.ContainsKey(AnnotationDictionary.GetAsString(PdfName.DEST).ToString())) {//See if it exists in the global name dictionary
PdfArray thisDest = (PdfArray)dests[AnnotationDictionary.GetAsString(PdfName.DEST).ToString()];//Get the destination
PdfIndirectReference a = (PdfIndirectReference)thisDest[0];//TODO, this could actually be an integer for the case of Remote Destinations
PdfObject thisPage = PdfReader.GetPdfObject(a);//Get the actual PDF object
}
} else if(AnnotationDictionary.Get(PdfName.DEST).IsArray()) {
//Technically possible, I think the array matches the code directly above but I don't have a sample PDF
}
}
}
}

Resources