Show all text of a docx in a StringBuilder with docx4j

I need to put all the text of a docx into a StringBuilder, including tabs and hyphens.
I tried org.docx4j.TextUtils, but the tabs do not appear in the resulting string.
String inputfilepath = System.getProperty("user.home") + "/test.docx";
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
org.docx4j.wml.Document wmlDocumentEl = (org.docx4j.wml.Document)documentPart.getJaxbElement();
Writer out = new OutputStreamWriter(System.out);
TextUtils.extractText(wmlDocumentEl, out);
out.close();

As per my answer at http://www.docx4java.org/forums/docx-java-f6/is-it-possible-to-extract-all-text-also-tab-and-hyphen-t1996.html#p6933?sid=b0d58fec2ba349d0f3f49cf66411397c
The problem with tab and hyphen, as I guess you know, is that they aren't represented in the docx as normal characters.
Tab is w:tab
A hyphen might be a hyphen character, or it might be displayed (without being actually in the docx), or it might be:
http://webapp.docx4java.org/OnlineDemo/ecma376/WordML/noBreakHyphen.html
or http://webapp.docx4java.org/OnlineDemo/ecma376/WordML/softHyphen.html
Replicating Word's hyphenation behaviour would be a challenge.
But for the others, there are three approaches which occur to me:
1. generalising your traverse approach (are you using TraversalUtil.getChildrenImpl?)
2. doing it in XSLT (you can do this in docx4j, but XSLT is probably slower, and a mix of technologies)
3. marshal the main document part to a string, do suitable string replacements, then unmarshal, then use TextUtils
For (3), assuming MainDocumentPart mdp, to get it as a String:
String stringContent = mdp.getXML();
Then to inject the modified content:
mdp.setContents((Document)XmlUtils.unmarshalString(stringContent) );
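The "suitable string replacements" step could look roughly like the sketch below. It is plain Java with no docx4j calls, and it rests on assumptions not confirmed above: that the marshalled XML uses the w: prefix, and that tabs and non-breaking hyphens are serialised as the bare self-closing elements <w:tab/> and <w:noBreakHyphen/>. The class and method names are made up for illustration.

```java
// Sketch of the string-replacement step only; not a docx4j API.
class TabMarkupRewriter {

    // Rewrites tab and non-breaking-hyphen elements as ordinary text elements.
    // Only the bare <w:tab/> form is touched; tab-stop definitions such as
    // <w:tab w:val="left" w:pos="720"/> carry attributes and are left alone.
    // Assumption: the marshalled XML uses the "w:" prefix and no inner whitespace.
    static String inlineSpecialCharacters(String documentXml) {
        return documentXml
                .replace("<w:tab/>", "<w:t xml:space=\"preserve\">\t</w:t>")
                .replace("<w:noBreakHyphen/>", "<w:t>-</w:t>");
    }

    public static void main(String[] args) {
        String xml = "<w:r><w:t>a</w:t><w:tab/><w:t>b</w:t></w:r>";
        System.out.println(inlineSpecialCharacters(xml));
    }
}
```

The rewritten string would then go through XmlUtils.unmarshalString and mdp.setContents as shown above, before calling TextUtils.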


Unwanted space after a generated link

I am generating a link using the code below:
string EncryptPath = Common.Encrypt(Path);
string SourceLinkPath= string.Empty;
if (File.Exists(Server.MapPath("Image.txt")))
{
SourceLinkPath = System.IO.File.ReadAllText(Server.MapPath ("Image.txt"));
}
string link2 = SourceLinkPath + EncryptPath;
TxtPathLink2.Text = link2;
The link is generated, but there is whitespace after the source path. The output looks like this:
http://18.10.10.11/test/View.aspx?Value=
67534ERT
I want to generate http://18.10.10.11/test/View.aspx?Value=67534ERT
How can I generate the link on one line?
The .txt file probably ends with whitespace (most likely a trailing line break) that you are overlooking.
Change System.IO.File.ReadAllText(Server.MapPath ("Image.txt"))
To:
System.IO.File.ReadAllText(Server.MapPath("Image.txt")).Trim()
String.Trim() removes all leading and trailing white-space characters from the String object.
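The same idea in Java, as a minimal sketch: the file content is simulated with a string that ends in a line break, and the class and method names are hypothetical.

```java
// Sketch: trim the prefix that was read from a file before appending the token.
class LinkBuilder {

    // String.trim() drops leading and trailing whitespace, including "\r\n".
    static String buildLink(String prefixFromFile, String token) {
        return prefixFromFile.trim() + token;
    }

    public static void main(String[] args) {
        // Simulated file content with the trailing line break that causes the problem.
        String fromFile = "http://18.10.10.11/test/View.aspx?Value=\r\n";
        System.out.println(buildLink(fromFile, "67534ERT"));
    }
}
```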

Find word (not containing substrings) in comma separated string

I'm using a LINQ query where I do something like this:
viewModel.REGISTRATIONGRPS = (From a In db.TABLEA
Select New SubViewModel With {
.SOMEVALUE1 = a.SOMEVALUE1,
...
...
.SOMEVALUE2 = If(commaseparatedstring.Contains(a.SOMEVALUE1), True, False)
}).ToList()
Now my problem is that this doesn't search for whole words but for substrings, so for example:
commaseparatedstring = "EWM,KI,KP"
SOMEVALUE1 = "EW"
It returns True because "EW" is contained in "EWM".
What I need is to find whole words (not substrings) in the comma-separated string.
Option 1: Regular Expressions
Regex.IsMatch(commaseparatedstring, @"\b" + Regex.Escape(a.SOMEVALUE1) + @"\b")
The \b parts are called "word boundaries" and tell the regex engine that you are looking for a "full word". The Regex.Escape(...) ensures that the regex engine will not try to interpret "special characters" in the text you are trying to match. For example, if you are trying to match "one+two", the Regex.Escape method will return "one\+two".
Also, be sure to import the System.Text.RegularExpressions namespace at the top of your code file.
See Regex.IsMatch Method (String, String) on MSDN for more information.
Option 2: Split the String
You could also try splitting the string which would be a bit simpler, though probably less efficient.
commaseparatedstring.Split(new Char[] { ',' }).Contains( a.SOMEVALUE1 )
What about:
- splitting commaseparatedstring on the comma
- calling equals() on each substring instead of contains() on the whole thing?
.SOMEVALUE2 = If(commaseparatedstring.Split(',').Contains(a.SOMEVALUE1), True, False)
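Both options can be sketched in Java as follows. The class and method names are hypothetical. The regex variant here delimits the word with commas or the string ends instead of \b, which is a swapped-in choice so that entries containing non-word characters (such as "one+two") still match as whole entries.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Sketch: whole-entry lookup in a comma-separated string.
class WholeWordMatch {

    // Split on commas and compare each entry for equality.
    static boolean containsEntry(String commaSeparated, String word) {
        return Arrays.asList(commaSeparated.split(",")).contains(word);
    }

    // Regex variant: the word must sit between commas or the ends of the string.
    // Pattern.quote keeps special characters in the word from being interpreted.
    static boolean containsEntryRegex(String commaSeparated, String word) {
        return Pattern.compile("(^|,)" + Pattern.quote(word) + "(,|$)")
                .matcher(commaSeparated)
                .find();
    }

    public static void main(String[] args) {
        System.out.println(containsEntry("EWM,KI,KP", "EW"));      // false
        System.out.println(containsEntryRegex("EWM,KI,KP", "KI")); // true
    }
}
```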

How to encode the plus (+) symbol in a URL

The URL link below will open a new Google mail window. The problem I have is that Google replaces all the plus (+) signs in the email body with blank space. It looks like it only happens with the + sign. How can I remedy this? (I am working on a ASP.NET web page.)
https://mail.google.com/mail?view=cm&tf=0&to=someemail@somedomain.com&su=some subject&body=Hi there+Hello there
(In the body email, "Hi there+Hello there" will show up as "Hi there Hello there")
The + character has a special meaning in [the query segment of] a URL: it means whitespace (' '). If you want to use the literal + sign there, you need to URL-encode it as %2b:
body=Hi+there%2bHello+there
Here's an example of how you could properly generate URLs in .NET:
var uriBuilder = new UriBuilder("https://mail.google.com/mail");
var values = HttpUtility.ParseQueryString(string.Empty);
values["view"] = "cm";
values["tf"] = "0";
values["to"] = "someemail@somedomain.com";
values["su"] = "some subject";
values["body"] = "Hi there+Hello there";
uriBuilder.Query = values.ToString();
Console.WriteLine(uriBuilder.ToString());
The result:
https://mail.google.com:443/mail?view=cm&tf=0&to=someemail%40somedomain.com&su=some+subject&body=Hi+there%2bHello+there
If you want a plus (+) symbol in the body, you have to encode it as %2B.
In order to encode a + value using JavaScript, you can use the encodeURIComponent function.
Example:
var url = "+11";
var encoded_url = encodeURIComponent(url);
console.log(encoded_url); // %2B11
It's safer to always percent-encode all characters except those defined as "unreserved" in RFC-3986.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
So, percent-encode the plus character and other special characters.
The problem you are having with pluses arises because, according to RFC-1866 (the HTML 2.0 specification), paragraph 8.2.1, subparagraph 1, "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped". This way of encoding form data is also given in later HTML specifications; look for the relevant paragraphs about application/x-www-form-urlencoded.
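The form-encoding rule quoted above can be observed with Java's URLEncoder and URLDecoder, which implement application/x-www-form-urlencoded. This is a small sketch; the class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Sketch: how form encoding treats spaces and literal plus signs.
class PlusEncoding {

    // Spaces become '+', a literal '+' becomes %2B.
    static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // The reverse: '+' is read as a space, %2B as a literal '+'.
    static String formDecode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("Hi there+Hello there")); // Hi+there%2BHello+there
        System.out.println(formDecode("Hi there+Hello there")); // Hi there Hello there
    }
}
```

The second line of output reproduces the symptom in the question: an unencoded + in the query is decoded as a space by the receiving side.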
Just to add this to the list:
Uri.EscapeUriString("Hi there+Hello there") // Hi%20there+Hello%20there
Uri.EscapeDataString("Hi there+Hello there") // Hi%20there%2BHello%20there
See https://stackoverflow.com/a/34189188/98491
Usually you want to use EscapeDataString which does it right.
In general, the .NET APIs behave as follows: new Uri("someproto:with+plus").LocalPath or AbsolutePath keeps the plus character in the URL (the plus from the same "someproto:with+plus" string is preserved),
but Uri.EscapeDataString("with+plus") escapes the plus character and produces "with%2Bplus".
To be consistent, I would recommend always escaping the plus character as "%2B" and using that everywhere; then there is no need to guess how each component interprets your plus character.
I'm not sure why decoding a '+' character would produce a space character ' ', but apparently some components do this.

Formatting asp.net label when the value is sourced from a query string

Afternoon all.
A very simple one for you today from thicky Rich.
I have a label I want to display as a lovely number format i.e. {0:N0}
Now, this label text equates to a query string value.
How do I go about formatting a label's text from a query string value in one fell swoop?
I have tried this
lblTotalPurchQS.Text = String.Format("{0:N0}",Request.QueryString["totalpurchasequantity"].ToString());
but with little success.
Any ideas or pointers?
Don't use ToString on the incoming query string parameter, but convert it to an int first:
lblTotalPurchQS.Text = String.Format("{0:N0}", int.Parse(Request.QueryString["totalpurchasequantity"]));
Note:
The above is not safe code. First, the conversion may throw an exception if the value is missing or is not a valid integer. You should also HTML-escape the output to guard against XSS.
This is better:
int totalPurchaseQuantity;
if(int.TryParse(Request.QueryString["totalpurchasequantity"], out totalPurchaseQuantity))
{
lblTotalPurchQS.Text = Server.HtmlEncode(String.Format("{0:N0}", totalPurchaseQuantity));
}
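For comparison, the same parse-then-format pattern in Java, as a hedged sketch: the class name, method name and fallback parameter are hypothetical, and Locale.US is chosen only to get a predictable comma as the grouping separator.

```java
import java.util.Locale;

// Sketch: validate an untrusted string as an integer before formatting it.
class QuantityFormatter {

    // Returns the number with grouping separators, or the fallback when the
    // raw value is missing or is not a valid integer.
    static String formatQuantity(String raw, String fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            int quantity = Integer.parseInt(raw.trim());
            return String.format(Locale.US, "%,d", quantity);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(formatQuantity("1234567", "n/a")); // 1,234,567
        System.out.println(formatQuantity("abc", "n/a"));     // n/a
    }
}
```

Because only a parsed integer ever reaches the formatter, arbitrary query-string text cannot end up in the output.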

How to correctly uppercase Greek words in .NET?

We have an ASP.NET application that runs for different clients around the world. In this application we have a dictionary for each language. The dictionary holds words in lowercase, and sometimes we uppercase them in code for typographic reasons.
var greek= new CultureInfo("el-GR");
string grrr = "Πόλη";
string GRRR = grrr.ToUpper(greek); // "ΠΌΛΗ"
The problem is:
...if you're using capital letters
then they must appear like this: f.e.
ΠΟΛΗ and not like ΠΌΛΗ, same for all
other words written in capital letters
So is it possible to uppercase Greek words correctly in .NET in a generic way? Or should I write my own custom algorithm for Greek uppercase?
How do they solve this problem in Greece?
I suspect that you're going to have to write your own method, if el-GR doesn't do what you want. Don't think you need to go to the full length of creating a custom CultureInfo, if this is all you need. Which is good, because that looks quite fiddly.
What I do suggest you do is read this Michael Kaplan blog post and anything else relevant you can find by him. He's been working on and writing about i18n and language issues for years and years, and his commentary is my first port of call for any such issues on Windows.
I don't know much about ASP.Net but I know how I'd do this in Java.
If the characters are Unicode, I would just post-process the output from ToUpper with some simple substitutions, one being the conversion of \u038C (Ό) to \u039F (Ο) or \u0386 (Ά) to \u0391 (Α).
From the looks of the Greek/Coptic code block (\u0370 through \u03ff), there are only a few characters (6 or 7) you'll need to change.
Check out How do I remove diacritics (accents) from a string in .NET?
How about replacing the wrong characters with the right ones:
/// <summary>
/// Returns the string to uppercase using Greek uppercase rules.
/// </summary>
/// <param name="source">The string that will be converted to uppercase</param>
public static string ToUpperGreek(this string source)
{
Dictionary<char, char> mappings = new Dictionary<char, char>(){
{'Ά','Α'}, {'Έ','Ε'}, {'Ή','Η'}, {'Ί','Ι'}, {'Ό','Ο'}, {'Ύ','Υ'}, {'Ώ','Ω'}
};
source = source.ToUpper();
char[] result = new char[source.Length];
for (int i = 0; i < result.Length; i++)
{
result[i] = mappings.ContainsKey(source[i]) ? mappings[source[i]] : source[i];
}
return new string(result);
}
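A Java sketch of the same idea, using Unicode normalization instead of a hand-written mapping table (a swapped-in technique, not the method shown above): uppercase, decompose, drop only the combining acute accent U+0301 (the tonos), then recompose. The dialytika (U+0308) is deliberately kept. The class and method names are hypothetical, and the string literals use \u escapes so the source stays ASCII. One caveat: this also strips acute accents from non-Greek letters, so it should only be applied to Greek text.

```java
import java.text.Normalizer;
import java.util.Locale;

// Sketch: Greek uppercase without the tonos, via Unicode normalization.
class GreekUpper {

    static String toUpperGreek(String source) {
        String upper = source.toUpperCase(Locale.forLanguageTag("el-GR"));
        // NFD splits an accented capital into base letter + combining mark.
        String decomposed = Normalizer.normalize(upper, Normalizer.Form.NFD);
        // Remove only the combining acute accent (tonos); keep the dialytika.
        String withoutTonos = decomposed.replace("\u0301", "");
        // NFC recomposes whatever combining marks remain.
        return Normalizer.normalize(withoutTonos, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "\u03A0\u03CC\u03BB\u03B7" is the word from the question (Pi, omicron with tonos, lambda, eta).
        System.out.println(toUpperGreek("\u03A0\u03CC\u03BB\u03B7"));
    }
}
```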
