I want to remove the last character (including line breaks) from a TextField that uses multiple TextFormats, without losing those formats.
So far I have:
textField.replaceText(textField.length-1,textField.length-1,'');
I guess this doesn't remove line breaks; it is also very slow and seems to destroy my TextFormats.
Or:
textField.text = textField.text.slice(0,-1);
This is faster but removes all TextFormats as well.
It is a bit tedious, but you can use the htmlText property of TextField, even if you are not formatting your text with StyleSheets. Flash converts all your formatting information into HTML text internally, so even when you set textField.text, you still get XML-formatted text to work with:
textField.text = "A test.";
trace (textField.htmlText);
will actually output:
<P ALIGN="LEFT"><FONT FACE="Times Roman" SIZE="12" COLOR="#000000" LETTERSPACING="0" KERNING="0">A test.</FONT></P>
Text will always appear within <FONT> tags reflecting the changes you made using setTextFormat(). You can, therefore, iterate over the XML contained in this line, and remove only the last character in the last TextNode:
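private function removeLastCharacter (textField:TextField) : void {
    // Parse the generated markup; this assumes a single root <P> element.
    var xml:XML = new XML (textField.htmlText);
    // Walk the children backwards to find the last <FONT> element.
    for ( var i : int = xml.children().length()-1; i >= 0; i-- ){
        var node:XML = xml.children()[i];
        if ( node.name() == "FONT") {
            var tx:String = node.text()[0].toString();
            // Replace the element's content with the text minus its last character.
            node.setChildren (tx.substr (0, tx.length-1));
            break;
        }
    }
    // htmlText expects a String, so serialize the XML explicitly.
    textField.htmlText = xml.toXMLString();
    trace (textField.text); // In the above example, output will be: "A test"
}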
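To make the idea independent of Flash, here is a minimal Python sketch of the same technique: parse the markup, find the last non-empty text position in document order, and trim one character there, leaving all tags and attributes alone. It uses only the standard library; the helper names `text_slots` and `remove_last_character` are made up for this illustration, and it assumes the markup has a single root element (several paragraphs would need a wrapper element first).

```python
import xml.etree.ElementTree as ET


def text_slots(element):
    """Yield (element, attribute) pairs for every text position, in document order."""
    yield element, "text"
    for child in element:
        yield from text_slots(child)
        # A child's tail is the text that follows its closing tag.
        yield child, "tail"


def remove_last_character(markup):
    """Drop the last character of the last non-empty text run, keeping all tags."""
    root = ET.fromstring(markup)
    last_slot = None
    for owner, attr in text_slots(root):
        if getattr(owner, attr):
            last_slot = (owner, attr)
    if last_slot is not None:
        owner, attr = last_slot
        setattr(owner, attr, getattr(owner, attr)[:-1])
    return ET.tostring(root, encoding="unicode")


sample = '<P ALIGN="LEFT"><FONT FACE="Times Roman" SIZE="12">A <B>bold</B> test.</FONT></P>'
print(remove_last_character(sample))
```

Unlike the loop above, this also handles nested formatting tags, because the last character may sit after a nested closing tag rather than directly inside the last FONT element.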
I hope I understand your problem correctly. If you keep your formatting in htmlText, I have one possible solution:
The idea is to keep the formatted text as XML and modify the XML. The XML keeps your formatting intact, so you don't have to do string acrobatics to maintain it. The downsides are, of course, having to keep the formatting XML valid, and the extra variable.
Here's an example:
var tf:TextField = new TextField();
var t:XML = new XML("<html><p>lalala</p><font color='#ff0000'> lol</font></html>");
tf.htmlText = t.toXMLString();
t.font[0] = t.font[0].text().slice(0, -1);
tf.htmlText = t.toXMLString();
addChild(tf);
Sorry in advance if I’m not phrasing this question correctly. I know nothing about InDesign scripting, but this would solve a workflow problem I’m having at my company, so I would appreciate any and all help.
I want to find all strings in an InDesign file that are between angle brackets (e.g. <variable>) and export them as a list. Ideally this list would be a separate document, but if it can just be dumped into a text frame or something, that’s fine.
Any ideas on how to do this? Thank you in advance for any and all help.
Here is something simple:
app.findGrepPreferences = NothingEnum.NOTHING; // reset the GREP search settings
app.findGrepPreferences.findWhat = '<[^<]+?>'; // match any text enclosed in angle brackets
var fnd = app.activeDocument.findGrep(); // execute the search
var temp_str = ''; // collects the results
for (var i = 0; i < fnd.length; i++) { // loop through the results
    temp_str += fnd[i].contents.toString() + '\r'; // append each match on its own line
}
var new_doc = app.documents.add(); // create a new document
app.scriptPreferences.measurementUnit = MeasurementUnits.POINTS; // set measurement units to points
var text_frame = new_doc.pages[0].textFrames.add({geometricBounds:[0,0,100,100]}); // create a text frame on the first page
text_frame.contents = temp_str; // write the collected text to the new frame
For more data see here.
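If you want to check what the search pattern will and will not catch before running it on a real document, you can try it on plain strings first. The following Python sketch uses the same expression as the script above; the sample sentences and the helper name `find_placeholders` are invented for this illustration, and Python's regex engine is only assumed to behave like InDesign's GREP for this simple pattern.

```python
import re

# Same expression as in the script: "<", then one or more characters that are
# not "<" (matched lazily), then the closing ">".
PLACEHOLDER = re.compile(r"<[^<]+?>")


def find_placeholders(text):
    """Return every angle-bracketed string in the order it occurs."""
    return PLACEHOLDER.findall(text)


sample = "Dear <first_name>, your order <order_id> ships on <date>."
print("\n".join(find_placeholders(sample)))
```

Note that a lone "<" that is never closed does not produce a match, so text such as "a < b" is skipped.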
I found some code online that reports font sizes, but I don't understand how TextRenderInfo reads text. renderInfo.GetText() returns a seemingly random number of characters: sometimes 3, sometimes 2, sometimes more or fewer. How does renderInfo read its data?
My goal is to separate every line and paragraph of a PDF and read their properties individually, such as font size and font style. Any suggestions are welcome.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;
namespace FontSizeDig1
{
    class Program
    {
        static void Main(string[] args)
        {
            // reader ==> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/PdfReader.html#pdfVersion
            PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "document.pdf"));
            TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();//strategy==> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextExtractionStrategy.html
            // for (int i = 1; i <= reader.NumberOfPages; i++)
            // {
            string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1/*i*/, S);
            // PdfTextExtractor.GetTextFromPage(reader, 6, S) ==>> http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/PdfTextExtractor.html
            Console.WriteLine(F);
            // }
            Console.ReadKey();
            //this.Close();
        }
    }
    public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
    {
        //HTML buffer
        private StringBuilder result = new StringBuilder();
        //Store last used properties
        private Vector lastBaseLine;
        private string lastFont;
        private float lastFontSize;
        //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
        private enum TextRenderMode
        {
            FillText = 0,
            StrokeText = 1,
            FillThenStrokeText = 2,
            Invisible = 3,
            FillTextAndAddToPathForClipping = 4,
            StrokeTextAndAddToPathForClipping = 5,
            FillThenStrokeTextAndAddToPathForClipping = 6,
            AddTextToPaddForClipping = 7
        }
        public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
        {
            string curFont = renderInfo.GetFont().PostscriptFontName; // http://itextsupport.com/apidocs/itext5/5.5.9/com/itextpdf/text/pdf/parser/TextRenderInfo.html#getFont--
            //Check if faux bold is used
            if ((renderInfo.GetTextRenderMode() == 2/*(int)TextRenderMode.FillThenStrokeText*/))
            {
                curFont += "-Bold";
            }
            //This code assumes that if the baseline changes then we're on a newline
            Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
            Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
            iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
            Single curFontSize = rect.Height;
            //See if something has changed, either the baseline, the font or the font size
            if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
            {
                //if we've put down at least one span tag close it
                if ((this.lastBaseLine != null))
                {
                    this.result.AppendLine("</span>");
                }
                //If the baseline has changed then insert a line break
                if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
                {
                    this.result.AppendLine("<br />");
                }
                //Create an HTML tag with appropriate styles
                this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
            }
            //Append the current text
            this.result.Append(renderInfo.GetText());
            Console.WriteLine("me=" + renderInfo.GetText());//by imtiaj
            //Set currently used properties
            this.lastBaseLine = curBaseline;
            this.lastFontSize = curFontSize;
            this.lastFont = curFont;
        }
        public string GetResultantText()
        {
            //If we wrote anything then we'll always have a missing closing tag so close it here
            if (result.Length > 0)
            {
                result.Append("</span>");
            }
            return result.ToString();
        }
        //Not needed
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo) { }
    }
}
Take a look at this PDF:
What do you see?
I see:
Hello World
Hello People
Now, let's parse this file? What do you expect?
You probably expect:
Hello World
Hello People
I don't.
That's where you and I differ, and that difference explains why you ask this question.
What do I expect?
Well, I'll start by looking inside the PDF, more specifically at the content stream of the first page:
I see 4 strings in the content stream: ld, Wor, llo, and He (in that order). I also see coordinates. Using those coordinates, I can compose what is shown:
Hello World
I don't immediately see "Hello People" anywhere, but I do see a reference to a Form XObject named /Xf1, so let's examine that Form XObject:
Woohoo! I'm in luck, "Hello People" is stored in the document as a single string value. I don't need to look at the coordinates to compose the actual text that I can see with my human eyes.
Now for your question. You say "I need to know how the renderInfo is reading data" and now you know: by default, iText will read all the strings from a page in the order they occur: ld, Wor, llo, He, and Hello People.
Depending on how the PDF is created, you can have output that is easy to read (Hello People), or output that is hard to read (ld, Wor, llo, He). iText comes with "strategies" that reorder all those snippets so that [ld, Wor, llo, He] is presented as [He, llo, Wor, ld], but detecting which of those parts belong to the same line, and which lines belong to the same paragraph, is something you will have to do.
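To illustrate what such a reordering involves, here is a small Python sketch (not iText code, and not how any iText strategy is actually implemented). It sorts chunks by baseline and horizontal position and inserts a space where the gap between two chunks is large enough. The `Chunk` type, the coordinates, the widths and the `space_gap` threshold are all invented for this example; real PDFs need tolerance when comparing baselines and a gap threshold derived from the font.

```python
from itertools import groupby
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    x: float      # left edge of the chunk
    y: float      # baseline; PDF coordinates grow upwards
    width: float


def compose_lines(chunks, space_gap=2.0):
    """Group chunks by baseline and join them from left to right."""
    # Higher baselines first (top of the page), then left to right.
    ordered = sorted(chunks, key=lambda c: (-c.y, c.x))
    lines = []
    for _, group in groupby(ordered, key=lambda c: c.y):
        line, previous_end = "", None
        for chunk in group:
            # A visible gap between two chunks is treated as a word break.
            if previous_end is not None and chunk.x - previous_end > space_gap:
                line += " "
            line += chunk.text
            previous_end = chunk.x + chunk.width
        lines.append(line)
    return lines


# The snippets in content-stream order, with made-up positions.
chunks = [
    Chunk("ld", 88.0, 788.0, 12.0),
    Chunk("Wor", 66.0, 788.0, 22.0),
    Chunk("llo", 48.0, 788.0, 14.0),
    Chunk("He", 36.0, 788.0, 12.0),
    Chunk("Hello People", 36.0, 760.0, 70.0),
]
print(compose_lines(chunks))  # ['Hello World', 'Hello People']
```

Deciding which lines form a paragraph would be a further step on top of this, for example by comparing the vertical distance between consecutive baselines.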
NOTE: at iText Group, we already have plenty of closed source code that could save you plenty of time. Since we are the copyright owner of the iText library, we can ask money for that closed source code. That's something you typically can't do if you're using iText for free (because of the AGPL). However, if you are a customer of iText, we can probably disclose more source code. Do not expect us to give that code for free, as that code has too much commercial value.
Is it possible to export ASPxGridView with column which contains multiline text to excel using ASPxGridViewExporter?
I have an ASPxGridView with column that contains a multiline text.
When I export a grid's data using ASPxGridViewExporter that multiline text is rendered as one line (without line breaks).
I tried both <br/> and "\n" (newline) as line separators.
By the way, the PropertiesTextEdit-EncodeHtml property is set to false on that column.
Thanks
There are 2 ways to achieve what you're looking for.
The first is to specify the WYSIWYG export type in XlsxExportOptionsEx:
XlsxExportOptionsEx options = new XlsxExportOptionsEx()
{
    ExportType = DevExpress.Export.ExportType.WYSIWYG
};
ASPxGridView1.ExportToXlsx("Test.xlsx", options);
The second is to tell the exporter you want a data-aware export and handle the CustomizeCell event to set cell wrapping to true:
XlsxExportOptionsEx options = new XlsxExportOptionsEx()
{
    ExportType = DevExpress.Export.ExportType.DataAware
};
options.CustomizeCell += options_CustomizeCell;
void options_CustomizeCell(DevExpress.Export.CustomizeCellEventArgs e)
{
    e.Formatting.Alignment = new XlCellAlignment() { WrapText = true };
    e.Handled = true;
}
Then use the customized options object for export.
See: https://www.devexpress.com/Support/Center/Question/Details/T381176
There is also the RenderBrick event, which may sometimes be helpful. You can handle it like this:
gveExporter.RenderBrick += gveExporter_RenderBrick;
void gveExporter_RenderBrick(object sender, DevExpress.Web.ASPxGridViewExportRenderingEventArgs e)
{
    ...
    StringFormat sFormat = new StringFormat(StringFormatFlags.NoWrap);
    BrickStringFormat brickSFormat = new BrickStringFormat(sFormat);
    e.BrickStyle.StringFormat = brickSFormat;
    ...
}
But I have not found a way to actually force cell wrapping there, because NoWrap is the only suitable StringFormatFlags value. In my case, long text was already wrapped in the exported Excel document, so I used RenderBrick to switch that wrapping off.
Hope that helps.
I have a PDF file that has the following security properties: printing: allowed; document assembly: NOT allowed; content copy: allowed; content copy for accessibility: allowed; page extraction: NOT allowed.
I try to extract the text with code based on the documentation sample, as follows:
pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    text.Append(System.Environment.NewLine);
    text.Append("\n Page Number:" + page);
    text.Append(System.Environment.NewLine);
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
    progressBar1.Value++;
}
pdftext.Text += text.ToString();
pdfReader.Close();
but the output consists of lines like "??? ? ???????\n?? ??? ? ".
It seems that the file is encrypted or that we have an encoding problem.
Note that the following lines
var f = pdfReader.IsOpenedWithFullPermissions; // FALSE
var f1 = pdfReader.IsEncrypted(); // FALSE
var f2 = pdfReader.ComputeUserPassword(); // NULL
var f3 = pdfReader.Is128Key(); // FALSE
var f4 = pdfReader.HasUsageRights(); // FALSE
show that f, f1, f3 and f4 are all FALSE, so it seems that the document is not encrypted.
So I don't know whether this is an encoding problem or something related to encrypted strings.
Can someone help me?
Thanks in advance.
G.G.
Whenever you have trouble extracting text from a document using standard code, the first thing to do is to try to copy and paste the text from it using Adobe Acrobat Reader. Adobe Reader's copy and paste implements text extraction according to the recommendations of the PDF specification. If it fails, this usually means that the information required for text extraction is either missing from the document or broken (by accident or by design). To extract the text, one then needs either to customize the code for that specific PDF or to resort to OCR.
In case of the document at hand, Adobe Reader copy&paste does result in garbage, too, just like when extracting with iText. Thus, there is something fishy in the document.
Inspecting the document one finds that the fonts contain ToUnicode mappings like this:
/CIDInit /ProcSet
findresource begin 12 dict begin begincmap /CIDSystemInfo<</Registry(Adobe)
/Ordering(Identity)
/Supplement 0
>>
def
/CMapName/F18 def
1 begincodespacerange <0000> <FFFF> endcodespacerange
44 beginbfrange
<20> <20> <0020>
<21> <21> <E0F9>
<22> <22> <E0F1>
<23> <23> <E0FA>
<24> <24> <E0F7>
<25> <25> <E0A3>
<26> <26> <E084>
<27> <27> <E097>
<28> <28> <E098>
<29> <29> <E09A>
<2A> <2A> <E08A>
<2B> <2B> <E099>
<2C> <2C> <E0A5>
<2D> <2D> <E086>
<2E> <2E> <E094>
<2F> <2F> <E0DE>
<30> <30> <E0A6>
<31> <31> <E096>
<32> <32> <E088>
<33> <33> <E082>
<34> <34> <E04C>
<35> <35> <E0A4>
<36> <36> <E0F6>
<37> <37> <E0F2>
<38> <38> <E0D8>
<39> <39> <E0AA>
<3A> <3A> <E06C>
<3B> <3B> <E087>
<3C> <3C> <E095>
<3D> <3D> <E0C4>
<3E> <3E> <E07E>
<3F> <3F> <E055>
<40> <40> <E089>
<41> <41> <E085>
<42> <42> <E083>
<43> <43> <E070>
<44> <44> <E0E6>
<45> <45> <E080>
<46> <46> <E0C8>
<47> <47> <E0F4>
<48> <48> <E062>
<49> <49> <E0F3>
<4A> <4A> <E04E>
<4B> <4B> <E05E>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
In other words, for those not familiar with this syntax: the fonts claim that all their glyphs (with the exception of the space glyph at 0x20) represent characters U+E0xx from the Unicode private use area. As the name of that area indicates, there is no common meaning for characters with these values.
Thus, text extraction according to the PDF specification will return strings of characters with undefined meaning with results as you observed in iText or I saw in Adobe Reader.
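If you want to check other fonts or documents for the same symptom, a quick test is to parse the bfrange entries of a ToUnicode map and see how many targets fall into the private use area (U+E000 to U+F8FF). The following Python sketch does this for a text excerpt like the one above. It is deliberately simplified: the helper names are made up, and it only understands the plain `<first> <last> <target>` form with a single code point as target, not the array form or multi-character targets that a full CMap parser must handle.

```python
import re

# One bfrange entry in its simplest form: <first> <last> <target>
BFRANGE_ENTRY = re.compile(
    r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"
)


def parse_bfranges(cmap_text):
    """Map each character code to the Unicode character it claims to represent."""
    mapping = {}
    for first, last, target in BFRANGE_ENTRY.findall(cmap_text):
        start, end, base = int(first, 16), int(last, 16), int(target, 16)
        for offset in range(end - start + 1):
            mapping[start + offset] = chr(base + offset)
    return mapping


def is_private_use(char):
    """True for characters in the BMP private use area."""
    return 0xE000 <= ord(char) <= 0xF8FF


excerpt = """
44 beginbfrange
<20> <20> <0020>
<21> <21> <E0F9>
<22> <22> <E0F1>
endbfrange
"""
mapping = parse_bfranges(excerpt)
suspicious = sorted(code for code, ch in mapping.items() if is_private_use(ch))
print([hex(code) for code in suspicious])  # ['0x21', '0x22']
```

A font in which nearly every code maps into that range, as here, will not yield meaningful text from standard extraction.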
Sometimes in such a situation one can still enforce proper text extraction by ignoring the ToUnicode map and using either the font Encoding or information inside the embedded font program.
Unfortunately it turns out that here the Encoding effectively contains the same information as does the ToUnicode map, e.g. for the same font as above
/Differences [ 32 /space /uniE0F9 /uniE0F1 /uniE0FA /uniE0F7 /uniE0A3 /uniE084 /uniE097 /uniE098
/uniE09A /uniE08A /uniE099 /uniE0A5 /uniE086 /uniE094 /uniE0DE /uniE0A6 /uniE096
/uniE088 /uniE082 /uniE04C /uniE0A4 /uniE0F6 /uniE0F2 /uniE0D8 /uniE0AA /uniE06C
/uniE087 /uniE095 /uniE0C4 /uniE07E /uniE055 /uniE089 /uniE085 /uniE083 /uniE070
/uniE0E6 /uniE080 /uniE0C8 /uniE0F4 /uniE062 /uniE0F3 /uniE04E /uniE05E ]
and the fonts turn out to be Type3 fonts, i.e. there is no embedded font program; instead, each glyph is defined as an individual PDF canvas without further character information.
Thus, nothing to gain here either.
Actually these small PDF canvasses contain inlined bitmap graphics of the respective glyph which also is the cause of the poor graphical quality of the document (if you don't see that immediately, simply zoom in a bit and you'll see the ragged outlines of the glyphs).
By the way, such a construct usually means that the producer of the PDF explicitly wants to prevent text extraction.
If you happen to have to extract text from many such documents, you can try and determine a mapping from their U+E0xx characters to actually sensible Unicode characters and apply that mapping to your extracted text.
If all those fonts in all those documents happen to use the same U+E0xx codepoints for the same actual characters, you'll be able to do text extraction from those documents after investing a certain amount of initial work.
Otherwise do try OCR.
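Once you have determined such a mapping, applying it to the extracted text is the easy part. Here is a minimal Python sketch of that step; the three mapping entries are placeholders invented for this illustration (the real assignments have to be read off the document), and the helper names are hypothetical.

```python
# Hypothetical mapping from private use characters to real characters.
# These pairs are placeholders, not the actual assignments in any document.
PUA_TO_TEXT = {
    "\ue0f9": "A",
    "\ue0f1": "B",
    "\ue0fa": "C",
}


def remap(extracted, pua_to_text):
    """Replace every mapped private use character; leave everything else as is."""
    return extracted.translate(str.maketrans(pua_to_text))


def unmapped(extracted, pua_to_text):
    """List private use characters that the mapping does not cover yet."""
    return sorted(
        {ch for ch in extracted
         if 0xE000 <= ord(ch) <= 0xF8FF and ch not in pua_to_text}
    )


extracted = "\ue0f9\ue0f1 \ue0fa\ue084"
print(remap(extracted, PUA_TO_TEXT))
print([hex(ord(ch)) for ch in unmapped(extracted, PUA_TO_TEXT)])  # ['0xe084']
```

Reporting the unmapped characters helps to notice when a new document or font uses code points that your mapping has not seen before.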
The following code adds pages to a document which map the ToUnicode values to the characters shown:
void AddFontsTo(PdfReader reader, PdfStamper stamper)
{
    int documentPages = reader.NumberOfPages;
    for (int page = 1; page <= documentPages; page++)
    {
        // ignore inherited resources for now
        PdfDictionary pageResources = reader.GetPageResources(page);
        if (pageResources == null)
            continue;
        PdfDictionary pageFonts = pageResources.GetAsDict(PdfName.FONT);
        if (pageFonts == null || pageFonts.Size == 0)
            continue;
        List<BaseFont> fonts = new List<BaseFont>();
        List<string> fontNames = new List<string>();
        HashSet<char> chars = new HashSet<char>();
        foreach (PdfName key in pageFonts.Keys)
        {
            PdfIndirectReference fontReference = pageFonts.GetAsIndirectObject(key);
            if (fontReference == null)
                continue;
            DocumentFont font = (DocumentFont) BaseFont.CreateFont((PRIndirectReference)fontReference);
            if (font == null)
                continue;
            PdfObject toUni = PdfReader.GetPdfObjectRelease(font.FontDictionary.Get(PdfName.TOUNICODE));
            CMapToUnicode toUnicodeCmap = null;
            if (toUni is PRStream)
            {
                try
                {
                    byte[] touni = PdfReader.GetStreamBytes((PRStream)toUni);
                    CidLocationFromByte lb = new CidLocationFromByte(touni);
                    toUnicodeCmap = new CMapToUnicode();
                    CMapParserEx.ParseCid("", toUnicodeCmap, lb);
                }
                catch
                {
                    toUnicodeCmap = null;
                }
            }
            if (toUnicodeCmap == null)
                continue;
            ICollection<int> mapValues = toUnicodeCmap.CreateDirectMapping().Values;
            if (mapValues.Count == 0)
                continue;
            fonts.Add(font);
            fontNames.Add(key.ToString());
            foreach (int value in mapValues)
                chars.Add((char)value);
        }
        if (fonts.Count == 0 || chars.Count == 0)
            continue;
        Rectangle size = (fonts.Count > 10) ? PageSize.A4.Rotate() : PageSize.A4;
        PdfPTable table = new PdfPTable(fonts.Count + 1);
        table.AddCell("Page " + page);
        foreach (String name in fontNames)
        {
            table.AddCell(name);
        }
        table.HeaderRows = 1;
        float[] widths = new float[fonts.Count + 1];
        widths[0] = 2;
        for (int i = 1; i <= fonts.Count; i++)
            widths[i] = 1;
        table.SetWidths(widths);
        table.WidthPercentage = 100;
        List<char> charList = new List<char>(chars);
        charList.Sort();
        foreach (char character in charList)
        {
            table.AddCell(((int)character).ToString("X4"));
            foreach (BaseFont font in fonts)
            {
                table.AddCell(new PdfPCell(new Phrase(character.ToString(), new Font(font))));
            }
        }
        stamper.InsertPage(reader.NumberOfPages + 1, size);
        ColumnText columnText = new ColumnText(stamper.GetUnderContent(reader.NumberOfPages));
        columnText.AddElement(table);
        columnText.SetSimpleColumn(size);
        while ((ColumnText.NO_MORE_TEXT & columnText.Go(false)) == 0)
        {
            stamper.InsertPage(reader.NumberOfPages + 1, size);
            columnText.Canvas = stamper.GetUnderContent(reader.NumberOfPages);
            columnText.SetSimpleColumn(size);
        }
    }
}
I applied it to your document like this:
string input = @"4700198773.pdf";
string output = @"4700198773-fonts.pdf";
using (PdfReader reader = new PdfReader(input))
using (FileStream stream = new FileStream(output, FileMode.Create, FileAccess.Write))
using (PdfStamper stamper = new PdfStamper(reader, stream))
{
    AddFontsTo(reader, stamper);
}
The additional pages look like this:
Now you have to compare the outputs for the different fonts and pages of this document with each other and with those of a representative selection of files. If you find a good enough pattern, you can try this replacement approach.
I'm parsing a document with AngleSharp. I have a text node (NodeName: "#text") and I want to insert some HTML into it. I can certainly reset NodeValue to whatever I want, but it's still a text node, so all the brackets are escaped.
How do I take the string value of a text node, inject some HTML into it, and then have a parsed DOM representation of that HTML take the place of the original text node?
I guess what you want is to replace a single text node with multiple nodes.
For instance <div>foo</div>, i.e.,
+ root
  + textnode
becomes
+ root
  + textnode (1)
  + element
  + textnode (2)
which could be <div>f<b>o</b>o</div>. The simplest way I can think of is just replacing the node.
var source = @"<div>foo</div>";
var parser = new HtmlParser();
var document = parser.Parse(source);
var div = document.QuerySelector("div");
div.InnerHtml = div.InnerHtml.Replace("foo", "f<b>o</b>o");
Now you can argue that just replacing the text may not be what you want. You may already have elements that you want to insert. Therefore a better (yet more complex) way would be to split the text node and insert the remaining contents.
var source = @"<div>foo</div>";
var parser = new HtmlParser();
var document = parser.Parse(source);
var div = document.QuerySelector("div");
var text = div.TextContent;
div.RemoveChild(div.FirstChild); // assuming there is only one child
var bold = document.CreateElement("b");
bold.TextContent = text.Substring(1, 1); // o
div.Append(
    document.CreateTextNode(text.Substring(0, 1)), // f
    bold,
    document.CreateTextNode(text.Substring(2, 1))); // o
Depending on your use case, there may be a simpler solution.
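The split-and-wrap idea itself is not specific to AngleSharp. As a language-neutral illustration, here is a minimal sketch using Python's standard xml.dom.minidom as a stand-in DOM: the text node is split in place with the DOM splitText method, and the middle part is moved into a new element. The helper name `wrap_range` is made up for this example, and whether your AngleSharp version offers an equivalent split method on text nodes is something to check in its documentation.

```python
from xml.dom.minidom import parseString


def wrap_range(text_node, start, length, tag):
    """Wrap `length` characters of a text node, beginning at `start`, in a new element."""
    document = text_node.ownerDocument
    # First split: text_node keeps the head, `middle` holds the rest.
    middle = text_node.splitText(start)
    # Second split: `middle` keeps the wrapped part, the tail becomes a new sibling.
    middle.splitText(length)
    wrapper = document.createElement(tag)
    middle.parentNode.replaceChild(wrapper, middle)
    wrapper.appendChild(middle)
    return wrapper


document = parseString("<div>foo</div>")
div = document.documentElement
wrap_range(div.firstChild, 1, 1, "b")
print(div.toxml())  # <div>f<b>o</b>o</div>
```

Because the original text node stays in the tree and only shrinks, any siblings before or after it are left untouched, which is the main advantage over rebuilding the parent's content.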