I currently have a method that appends two strings together based on the tag in an XML file that is being loaded. I would also like to add a unique key between these two strings for parsing reasons later. Below are examples of the way it is working now and what I would like it to do.
-CURRENT: strValue~&elem.text()~&
-GOAL: strValue~&elem.text()
// If the tag is "Tag" or "Building append its text to strValue (part of item name)
elem = elemTag.selectSingleNode("ofda:Type",nsmgr);
if(elem && (elem.text() == "Tag" || elem.text() == "Building"))
{
elem = elemTag.selectSingleNode("ofda:Value",nsmgr);
if(elem)
{
strValue += elem.text() + "~&";
}
}
strValue += strValue ? "~&" + elem.text() : elem.text(); ?
Related
I have a pdf file that as the follow security properties: printing: allowed; document assembly: NOT allowed; content copy: allowed; content copy for accessibility: allowed; page extraction:NOT allowed;
I try to get text with sample code as documentation sample as follow:
pdftext.Text = null;
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filename);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
text.Append(System.Environment.NewLine);
text.Append("\n Page Number:" + page);
text.Append(System.Environment.NewLine);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
progressBar1.Value++;
}
pdftext.Text += text.ToString();
pdfReader.Close();
but the output text is lines with ""??? ? ???????\n?? ??? ? " values;
seems that file is crypted or we have a encoding problem...
note that in the follow lines
var f = pdfReader.IsOpenedWithFullPermissions; -> FALSE
var f1 = pdfReader.IsEncrypted(); - > FALSE
var f2 = pdfReader.ComputeUserPassword(); - > NULL
var f3 = pdfReader.Is128Key(); - > FALSE
var f4 = pdfReader.HasUsageRights();
f, f1, f3, f4 return FALSE ...than seems that the document is not crypted,
...so I don't know if is a Encoding problem or question related to encrypet strings...
Someone can help me?
thanks in advance.
G.G.
Whenever you have trouble extracting text from a document using standard code, the first thing to do is try and copy&paste the text from it using Adobe Acrobat Reader. Adobe Reader copy&paste implements text extraction according to the recommendations of the PDF specification, and if this fails, this usually means that the necessary information required for text extraction in the document are either missing or broken (by accident or by design). To extract the text, one either needs to customize the code specifically to the specific PDF or resort to OCR.
In case of the document at hand, Adobe Reader copy&paste does result in garbage, too, just like when extracting with iText. Thus, there is something fishy in the document.
Inspecting the document one finds that the fonts contain ToUnicode mappings like this:
/CIDInit /ProcSet
findresource begin 12 dict begin begincmap /CIDSystemInfo<</Registry(Adobe)
/Ordering(Identity)
/Supplement 0
>>
def
/CMapName/F18 def
1 begincodespacerange <0000> <FFFF> endcodespacerange
44 beginbfrange
<20> <20> <0020>
<21> <21> <E0F9>
<22> <22> <E0F1>
<23> <23> <E0FA>
<24> <24> <E0F7>
<25> <25> <E0A3>
<26> <26> <E084>
<27> <27> <E097>
<28> <28> <E098>
<29> <29> <E09A>
<2A> <2A> <E08A>
<2B> <2B> <E099>
<2C> <2C> <E0A5>
<2D> <2D> <E086>
<2E> <2E> <E094>
<2F> <2F> <E0DE>
<30> <30> <E0A6>
<31> <31> <E096>
<32> <32> <E088>
<33> <33> <E082>
<34> <34> <E04C>
<35> <35> <E0A4>
<36> <36> <E0F6>
<37> <37> <E0F2>
<38> <38> <E0D8>
<39> <39> <E0AA>
<3A> <3A> <E06C>
<3B> <3B> <E087>
<3C> <3C> <E095>
<3D> <3D> <E0C4>
<3E> <3E> <E07E>
<3F> <3F> <E055>
<40> <40> <E089>
<41> <41> <E085>
<42> <42> <E083>
<43> <43> <E070>
<44> <44> <E0E6>
<45> <45> <E080>
<46> <46> <E0C8>
<47> <47> <E0F4>
<48> <48> <E062>
<49> <49> <E0F3>
<4A> <4A> <E04E>
<4B> <4B> <E05E>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
I.e., if you are not into this, the fonts claim that all their glyphs (with the exception of the space glyph at 0x20) represent characters U+E0xx from the Unicode private use area. As the name of that area indicates, there is no common meaning of characters with these values.
Thus, text extraction according to the PDF specification will return strings of characters with undefined meaning with results as you observed in iText or I saw in Adobe Reader.
Sometimes in such a situation one can still enforce proper text extraction by ignoring the ToUnicode map and using either the font Encoding or information inside the embedded font program.
Unfortunately it turns out that here the Encoding effectively contains the same information as does the ToUnicode map, e.g. for the same font as above
/Differences [ 32 /space /uniE0F9 /uniE0F1 /uniE0FA /uniE0F7 /uniE0A3 /uniE084 /uniE097 /uniE098
/uniE09A /uniE08A /uniE099 /uniE0A5 /uniE086 /uniE094 /uniE0DE /uniE0A6 /uniE096
/uniE088 /uniE082 /uniE04C /uniE0A4 /uniE0F6 /uniE0F2 /uniE0D8 /uniE0AA /uniE06C
/uniE087 /uniE095 /uniE0C4 /uniE07E /uniE055 /uniE089 /uniE085 /uniE083 /uniE070
/uniE0E6 /uniE080 /uniE0C8 /uniE0F4 /uniE062 /uniE0F3 /uniE04E /uniE05E ]
and the fonts turns out to be Type3 fonts, i.e. there is no embedded font program but each glyph is defined as an individual PDF canvas without further character information.
Thus, nothing to gain here either.
Actually these small PDF canvasses contain inlined bitmap graphics of the respective glyph which also is the cause of the poor graphical quality of the document (if you don't see that immediately, simply zoom in a bit and you'll see the ragged outlines of the glyphs).
By the way, such a construct usually means that the producer of the PDF explicitly wants to prevent text extraction.
If you happen to have to extract text from many such documents, you can try and determine a mapping from their U+E0xx characters to actually sensible Unicode characters and apply that mapping to your extracted text.
If all those fonts in all those documents happen to use the same U+E0xx codepoints for the same actual characters, you'll be able to do text extraction from those documents after investing a certain amount of initial work.
Otherwise do try OCR.
The following code adds pages to a document which map the ToUnicode values to the characters shown:
void AddFontsTo(PdfReader reader, PdfStamper stamper)
{
int documentPages = reader.NumberOfPages;
for (int page = 1; page <= documentPages; page++)
{
// ignore inherited resources for now
PdfDictionary pageResources = reader.GetPageResources(page);
if (pageResources == null)
continue;
PdfDictionary pageFonts = pageResources.GetAsDict(PdfName.FONT);
if (pageFonts == null || pageFonts.Size == 0)
continue;
List<BaseFont> fonts = new List<BaseFont>();
List<string> fontNames = new List<string>();
HashSet<char> chars = new HashSet<char>();
foreach (PdfName key in pageFonts.Keys)
{
PdfIndirectReference fontReference = pageFonts.GetAsIndirectObject(key);
if (fontReference == null)
continue;
DocumentFont font = (DocumentFont) BaseFont.CreateFont((PRIndirectReference)fontReference);
if (font == null)
continue;
PdfObject toUni = PdfReader.GetPdfObjectRelease(font.FontDictionary.Get(PdfName.TOUNICODE));
CMapToUnicode toUnicodeCmap = null;
if (toUni is PRStream)
{
try
{
byte[] touni = PdfReader.GetStreamBytes((PRStream)toUni);
CidLocationFromByte lb = new CidLocationFromByte(touni);
toUnicodeCmap = new CMapToUnicode();
CMapParserEx.ParseCid("", toUnicodeCmap, lb);
}
catch
{
toUnicodeCmap = null;
}
}
if (toUnicodeCmap == null)
continue;
ICollection<int> mapValues = toUnicodeCmap.CreateDirectMapping().Values;
if (mapValues.Count == 0)
continue;
fonts.Add(font);
fontNames.Add(key.ToString());
foreach (int value in mapValues)
chars.Add((char)value);
}
if (fonts.Count == 0 || chars.Count == 0)
continue;
Rectangle size = (fonts.Count > 10) ? PageSize.A4.Rotate() : PageSize.A4;
PdfPTable table = new PdfPTable(fonts.Count + 1);
table.AddCell("Page " + page);
foreach (String name in fontNames)
{
table.AddCell(name);
}
table.HeaderRows = 1;
float[] widths = new float[fonts.Count + 1];
widths[0] = 2;
for (int i = 1; i <= fonts.Count; i++)
widths[i] = 1;
table.SetWidths(widths);
table.WidthPercentage = 100;
List<char> charList = new List<char>(chars);
charList.Sort();
foreach (char character in charList)
{
table.AddCell(((int)character).ToString("X4"));
foreach (BaseFont font in fonts)
{
table.AddCell(new PdfPCell(new Phrase(character.ToString(), new Font(font))));
}
}
stamper.InsertPage(reader.NumberOfPages + 1, size);
ColumnText columnText = new ColumnText(stamper.GetUnderContent(reader.NumberOfPages));
columnText.AddElement(table);
columnText.SetSimpleColumn(size);
while ((ColumnText.NO_MORE_TEXT & columnText.Go(false)) == 0)
{
stamper.InsertPage(reader.NumberOfPages + 1, size);
columnText.Canvas = stamper.GetUnderContent(reader.NumberOfPages);
columnText.SetSimpleColumn(size);
}
}
}
I applied it to your document like this:
string input = #"4700198773.pdf";
string output = #"4700198773-fonts.pdf";
using (PdfReader reader = new PdfReader(input))
using (FileStream stream = new FileStream(output, FileMode.Create, FileAccess.Write))
using (PdfStamper stamper = new PdfStamper(reader, stream))
{
AddFontsTo(reader, stamper);
}
The additional pages look like this:
Now you have to compare the outputs for the different fonts and pages of this document with each other and with those of a representative selection of file. If you find good enough a pattern, you can try this replacement way.
I have a hidden field that gets populated with a javascript array of ID's. When I try to iterate the hidden field(called "hidExhibitsIDs") it gives me an error(in the title).
this is my loop:
foreach(string exhibit in hidExhibitsIDs.Value)
{
comLinkExhibitToTask.Parameters.AddWithValue("#ExhibitID", exhibit);
}
when I hover over the .value it says it is "string". But when I change the "string exhibit" to "int exhibit" it works, but gives me an internal error(not important right now).
You need to convert string to string array to using in for loop to get strings not characters as your loop suggests. Assuming comma is delimiter character in the hidden field, hidden field value will be converted to string array by split.
foreach(string exhibit in hidExhibitsIDs.Value.Split(','))
{
comLinkExhibitToTask.Parameters.AddWithValue("#ExhibitID", exhibit);
}
Value is returning a String. When you do a foreach on a String, it iterates over the individual characters in it. What does the value actually look like? You'll have to parse it correctly before you try to use the data.
Example of what your code is somewhat doing right now:
var myString = "Hey";
foreach (var c in myString)
{
Console.WriteLine(c);
}
Will output:
H
e
y
You can use Char.ToString in order to convert
Link : http://msdn.microsoft.com/en-us/library/3d315df2.aspx
Or you can use this if you want convert your tab of char
char[] tab = new char[] { 'a', 'b', 'c', 'd' };
string str = new string(tab);
Value is a string, which implements IEnumerable<char>, so when you foreach over a string, it loops over each character.
I would run the debugger and see what the actual value of the hidden field is. It can't be an array, since when the POST happens, it is converted into a string.
On the server side, The Value property of a HiddenField (or HtmlInputHidden) is just a string, whose enumerator returns char structs. You'll need to split it to iterate over your IDs.
If you set the value of the hidden field on the client side with a JavaScript array, it will be a comma-separated string on the server side, so something like this will work:
foreach(string exhibit in hidExhibitsIDs.Value.Split(','))
{
comLinkExhibitToTask.Parameters.AddWithValue("#ExhibitID", exhibit);
}
public static string reversewordsInsentence(string sentence)
{
string output = string.Empty;
string word = string.Empty;
foreach(char c in sentence)
{
if (c == ' ')
{
output = word + ' ' + output;
word = string.Empty;
}
else
{
word = word + c;
}
}
output = word + ' ' + output;
return output;
}
I did read more function but it's not working correctly. I mean I can split my test post and I can cut my string with substring function. And I did this using < !--kamore--> keyword.
But after I cut this with substring and do innerhtml and if there is some html tag before the index the css is going crazy. (< p>< !--kamore-->) I can't solve this. If I'm using regex it just make all of them like text and there is no html tags in my post and it's not good. I mean if there is some links or table in my post they will not showing. They are just text.
Here is my little code.
#region ReadMore
string strContent = drvRow["cont"].ToString();
//strContent = Server.HtmlDecode(strContent);
//strContent = Regex.Replace(strContent, #"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", string.Empty);
// More extension by kad1r
int kaMoreIndex;
kaMoreIndex = strContent.IndexOf("<!--kamore-->");
if (kaMoreIndex > 0)
{
if (strContent.Length >= kaMoreIndex)
{
aReadMore.Visible = true;
article.InnerHtml = strContent.Substring(0, kaMoreIndex);
// if this ends like this there is a problem
// < p>< !--kamore--> or < div>< !--kamore-->
// because there is no end of this tag!
}
else
{
article.InnerHtml = strContent;
}
}
else
{
article.InnerHtml = strContent;
}
#endregion
I fix it. I found this code and after finding I added to my string. Now everything works fine.
http://social.msdn.microsoft.com/Forums/en-US/csharpgeneral/thread/0f06a2e9-ab09-4692-890e-91a6974725c0
I have a standard HTML page with an CKEditor on it wrapped in a form.
The form submits (POSTS) to Send_Emails.aspx
Send_Emails.aspx reads the content of the FCKEditor into a variable
Dim html As String = Request.Form("ck_content")
Then it sends an email.
Problem
Characters such as:
 -> this seems to show as a special character for blank spaces/carriage returns
’ -> this seems to show as apostrophe's
Can you reccomend some methods to cleanze my post data of these non-standard characters?
Thanks
I figured out how to strip unwanted characters by using this function:
function removeMSWordChars(str) {
var myReplacements = new Array();
var myCode, intReplacement;
myReplacements[8216] = 39;
myReplacements[8217] = 39;
myReplacements[8220] = 34;
myReplacements[8221] = 34;
myReplacements[8212] = 45;
for(c=0; c<str.length; c++) {
var myCode = str.charCodeAt(c);
if(myReplacements[myCode] != undefined) {
intReplacement = myReplacements[myCode];
str = str.substr(0,c) + String.fromCharCode(intReplacement) + str.substr(c+1);
}
}
return str;
}
I'm sending back a bunch of image tags via JSON in my .ashx response.
I am not sure how to format this so that the string comes back with real tags. I tried to HtmlEncode and that sort of fixed it but then I ended up with this stupid \u003c crap:
["\u003cimg src=\"http://www.sss.com/image/65.jpg\" alt=\"\"\u003e\u003c/li\u003e","\u003cimg src=\"http://www.xxx.com/image/61.jpg\" alt=\"\"\u003e\u003c/li\u003e"]
What the heck is \u003c ?
here's my code that created the JSON for response to my .ashx:
private void GetProductsJSON(HttpContext context)
{
context.Response.ContentType = "text/plain";
int i = 1;
...do some more stuff
foreach(Product p in products)
{
string imageTag = string.Format(#"<img src=""{0}"" alt=""""></li>", WebUtil.ImageUrl(p.Image, false));
images.Add(imageTag);
i++;
}
string jsonString = images.ToJSON();
context.Response.Write(HttpUtility.HtmlEncode(jsonString));
}
the toJSON is simply using the helper method outlined here:
http://weblogs.asp.net/scottgu/archive/2007/10/01/tip-trick-building-a-tojson-extension-method-using-net-3-5.aspx
\u003c is an escaped less-than character in unicode (Unicode character 0x003C).
The AJAX response is fine. When that string is written to the DOM, it will show up as a normal "<" character.
You are returning JSON array. Once parsed using eval("("+returnValue+")") it is in readily usable condition.
EDIT: This code is from jquery.json.js file:
var escapeable = /["\\\x00-\x1f\x7f-\x9f]/g;
var meta = { // table of character substitutions
'\b': '\\b',
'\t': '\\t',
'\n': '\\n',
'\f': '\\f',
'\r': '\\r',
'"' : '\\"',
'\\': '\\\\'
};
$.quoteString = function(string)
// Places quotes around a string, inteligently.
// If the string contains no control characters, no quote characters, and no
// backslash characters, then we can safely slap some quotes around it.
// Otherwise we must also replace the offending characters with safe escape
// sequences.
{
if (escapeable.test(string))
{
return '"' + string.replace(escapeable, function (a)
{
var c = meta[a];
if (typeof c === 'string') {
return c;
}
c = a.charCodeAt();
return '\\u00' + Math.floor(c / 16).toString(16) + (c % 16).toString(16);
}) + '"';
}
return '"' + string + '"';
};
Hope this gives you some direction to go ahead.
all you need to do is to use javascript eval function to get a pure HTML (XML) markup on the front end.
i.e. in a ajax call to a webservice, this can be the success handler of tha call,
the service returns a complex html element:
...
success: function(msg) {$(divToBeWorkedOn).html(**eval(**msg**)**);alert(eval(msg));},
...