Number of characters in unicode string - len() not working

I am updating some ASP and VB code and have a string whose characters I need to count. Part of the string contains non-English characters. Using len() does not give the number of characters; it gives a longer length, because some of the characters are not English.
For example, len("abc") is 3 but len("אבג") is 6, and the len() of the combined string is 9.
Is there a function, or another way, to calculate the number of characters?

I found the problem: if you save an ASP page as UTF-8 (without a BOM), the len() function does not work as expected. It returns double the actual number of characters, for the non-English characters only (see the example in the question).
To avoid this, save the ASP page as UTF-8 with a BOM; len() then works correctly in all cases.
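The numbers in the question match what you would see if each UTF-8 byte were counted as one character. Here is a small sketch of that effect in Python (used only for illustration, not ASP). Latin-1 stands in for whatever single-byte code page the server falls back to when there is no BOM; that fallback is an assumption, not something confirmed for ASP.

```python
# Sketch only: Latin-1 is a stand-in for the single-byte code page the
# server might assume for a file without a BOM (an assumption).
text = "abcאבג"

utf8_bytes = text.encode("utf-8")        # bytes as stored in a UTF-8 file
misread = utf8_bytes.decode("latin-1")   # every byte treated as one character

char_count = len(text)          # 6 real characters
byte_count = len(utf8_bytes)    # 9 = 3 ASCII bytes + 3 Hebrew letters * 2 bytes
misread_count = len(misread)    # 9, the same as the byte count

print(char_count, byte_count, misread_count)  # 6 9 9
```

Each Hebrew letter takes two bytes in UTF-8, which would explain why only the non-English part of the string is doubled.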

Related

R - intToUtf8 function

Using the gsub function, I am replacing certain strings of text that contain the Unicode character 'START OF GUARDED AREA' (U+0096).
gsub("A string with the following character –","A string without the character")
My code works, but if I close my script and reopen it, that character in my code is replaced by a normal dash.
To work around this problem, I was thinking of replacing the actual character with a function call. I came across the function intToUtf8(), which I thought would return the character in question if I used it as follows:
intToUtf8(150)
However, when I type this in my console, it returns "\u0096".
Question 1: why is the character replaced in my script?
Question 2: why isn't my console returning the character '–'?
Many thanks in advance for your precious help!
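One possible reading, offered as a guess rather than a confirmed diagnosis: the dash visible in the script may really be the en dash (U+2013), which Windows-1252 stores as the single byte 0x96, while code point 150 (U+0096) is an invisible control character. The sketch below shows the difference in Python (used only for illustration; the question is about R). That the data or script was saved as Windows-1252 is an assumption.

```python
import unicodedata

control = chr(150)                       # code point U+0096
dash = bytes([0x96]).decode("cp1252")    # byte 0x96 read as Windows-1252

print(repr(control))                     # '\x96'
print(unicodedata.category(control))     # Cc, i.e. a control character
print(hex(ord(dash)))                    # 0x2013
print(unicodedata.name(dash))            # EN DASH
```

If this is what is happening, the byte value 150 and the code point 150 are two different things, and a console has no visible glyph to show for U+0096.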

How to generate all possible unicode characters?

If we type letters in R, we get all lowercase letters of the English alphabet. However, there are many more possible characters, like ä and é, and there are symbols like $ or (, too. I found this table of Unicode characters, which is exactly what I need. Of course, I do not want to copy and paste hundreds of possible Unicode characters into one vector.
What I've tried so far: the table gives the decimal values for (some of) the Unicode characters. For example, see the following small table:
Glyph   Decimal   Unicode   Usage in R
!       33        U+0021    "\U0021"
So if we type "\U0021" we get a !. Further, paste0("U", format(as.hexmode(33), width= 4, flag="0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid that even if I figure out how to transform numbers to characters using as.hexmode(), there is still the problem that decimals are not given for all Unicode characters (see the table; the decimals end at 591).
Any idea how to generate a vector with all the unicode characters listed in the table linked?
(The question started with a real world problem but now I am mostly simply eager to know how to do this.)

There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of unicode scripts and the block ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can convert the ranges into their sequences.
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all characters we can unlist it. This won't be easy to work with so really it would be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So this seems to be all of them, and the page needs updating. They are stored as integers, so to print them (assuming the glyph is supported) we can do the following, for example to print Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"
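For comparison, the same idea (build characters from integer code points at run time instead of writing escape sequences) can be sketched in Python with the standard unicodedata module. This is only an illustration, not the R approach above. The function name assigned_characters is made up for this example, and the count it gives depends on the Unicode version of your Python build and on which categories you skip, so it need not match the figures above.

```python
import sys
import unicodedata

def assigned_characters():
    """Return every code point that this Python build's Unicode database
    treats as assigned, leaving out surrogates and private-use code points."""
    skipped = {"Cn", "Cs", "Co"}  # unassigned, surrogate, private use
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) not in skipped
    ]

all_chars = assigned_characters()

# Katakana letters U+30A1..U+30FA, the same run printed above.
katakana = "".join(chr(cp) for cp in range(0x30A1, 0x30FB))

print(len(all_chars), "characters, Unicode", unicodedata.unidata_version)
print(katakana)
```

As in the R answer, everything is just an integer range until chr() turns it into a character, so no escape sequence ever has to be pasted together.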

Semicolon in URLs

I have a URL like this: localhost:8080/demo/
When I call localhost:8080/demo/''''''''' it works fine.
But when I try localhost:8080/demo/;;; it does not work and returns HTTP code 404 Not Found.
I tried a few other special characters, # % \ ? / , and it returned 400 too.
Can anyone explain this to me?
Thank you so much!

These special characters are not directly allowed in URLs,
because they have special meanings there.
For example:
/ is separator within the path,
? marks the query part of a URL,
# marks a fragment (a page-internal link),
etc.
Quoted from Wikipedia: Percent-encoding reserved characters:
When a character from the reserved set (a "reserved character")
has special meaning (a "reserved purpose") in a certain context,
and a URI scheme says that it is necessary to use that character
for some other purpose, then the character must be percent-encoded.
Percent-encoding a reserved character involves converting the
character to its corresponding byte value in ASCII and then
representing that value as a pair of hexadecimal digits. The digits,
preceded by a percent sign (%) which is used as an escape character,
are then used in the URI in place of the reserved character.
For example, ; is a reserved character. Therefore, when ; must appear
in a URL without having its special meaning, it needs to be
replaced by %3B.
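As a quick illustration of percent-encoding, here is a sketch using Python's standard urllib.parse module (Python is just the example language here; the question does not say what the server is written in). The final url variable is a hypothetical value built from the path in the question.

```python
from urllib.parse import quote, unquote

# safe="" means no character is exempt from encoding (the default keeps "/").
encoded_semicolons = quote(";;;", safe="")
encoded_others = quote("#%\\?/", safe="")

print(encoded_semicolons)  # %3B%3B%3B
print(encoded_others)      # %23%25%5C%3F%2F
print(unquote("%3B"))      # ;

# Hypothetical URL built from the path in the question.
url = "localhost:8080/demo/" + encoded_semicolons
print(url)                 # localhost:8080/demo/%3B%3B%3B
```

Whether the server then accepts the encoded path still depends on how its routing is set up, so this only shows the encoding step.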

HTML and XML Parsing in Fortran

I am studying mathematical computation and I am completely stuck on this task! I don't even know how to go about starting it!
Write a program in Fortran that can parse a single line of well-formed HTML or XML markup. It takes input on a single line (guaranteed not to exceed 80 characters in total) like
<tag>lots of lovely text</tag>
where
tag might be anything from 1 to 37 ASCII characters and will not contain spaces
text could contain spaces and be anything from 1 to 73 characters in length
so that the program outputs one of two lines:
tag : text if the two occurrences of tag match inside <...> and
syntax error if anything else is input.
Any help is hugely appreciated!

There are a number of intrinsic functions for working with strings that may be helpful.
result = index(string, substring) - returns the starting position of the first occurrence of substring within string, counting from one, or zero if it does not occur. (Fortran 77)
result = scan(string, set) - returns the position of the first character of string that is in set, or zero if there is none. (Fortran 95)
result = verify(string, set) - returns the position of the first character of string that is not in set, or zero if every character is in set. (Fortran 95)
There are a few user-contributed string tokenization functions on the Fortran Wiki that might be helpful:
delim, strtok, and find_field. Also, FLIBS includes some string manipulation and tokenization routines that might be useful as examples.
Finally, there are a number of existing open-source XML parsers written in Fortran: xmlf90 and xml-fortran. Looking at the source code for these libraries should be helpful.
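To make the logic concrete before writing any Fortran, here is a rough sketch of the checks in Python (not Fortran, and not the only way to do it). It assumes the input looks like <tag>text</tag>; the name parse_line is made up for this example. str.find plays roughly the role that index would play in Fortran, except that it counts from zero and returns -1 when nothing is found.

```python
def parse_line(line: str) -> str:
    """Return 'tag : text' for input shaped like <tag>text</tag>,
    otherwise 'syntax error'. Limits follow the task statement."""
    error = "syntax error"
    if len(line) > 80 or not line.startswith("<"):
        return error

    close = line.find(">")              # end of the opening tag
    if close == -1:
        return error
    tag = line[1:close]
    if not 1 <= len(tag) <= 37:
        return error
    if not tag.isascii() or any(c in tag for c in " </"):
        return error

    closing = "</" + tag + ">"          # the closing tag must use the same name
    if not line.endswith(closing):
        return error

    text = line[close + 1 : len(line) - len(closing)]
    if not 1 <= len(text) <= 73 or "<" in text or ">" in text:
        return error
    return f"{tag} : {text}"


print(parse_line("<b>lots of lovely text</b>"))  # b : lots of lovely text
print(parse_line("<b>lots of lovely text</i>"))  # syntax error
```

The same steps map onto Fortran fairly directly: locate the first > with index, compare the tail of the line with the expected closing tag, and check the lengths of the two pieces.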

: is not a valid identifier

A colon cannot be used in the ID of a .NET control. I quote from the following website: http://msdn.microsoft.com/en-us/library/system.web.ui.control.id.aspx
"Only combinations of alphanumeric characters and the underscore character ( _ ) are valid values for this property. Including spaces or other invalid characters will cause an ASP.NET page parser error."
Is there a reason why alphanumeric characters have to be used?

Easy to find on MSDN...
Only combinations of alphanumeric characters and the underscore character ( _ ) are valid values for this property. Including spaces or other invalid characters will cause an ASP.NET page parser error.
As for why, I cannot give an answer, other than it makes sense to me as a developer that you never use anything other than alphanumerics and underscores for variable names. There's no obvious reason why that should not extend to control IDs as well.