I was trying to match a vector of Japanese strings (originally imported from a comma-separated file) with a list of filenames extracted from a folder under Mac OS X.
One element from the vector is a:
> a
[1] "立ち上げる.mp3"
The corresponding element from the filename list is b:
> b
[1] "立ち上げる.mp3"
The problem is that they are not logically equal to each other in R:
> a == b
[1] FALSE
I have already found out that the problem comes from the decomposed (combining-character) form of the Japanese "dakuten" characters (i.e. the げ character, which is け extended with an additional voicing mark). So the two strings are in fact different from each other:
> iconv(a, "latin1")
[1] "ç«\u008bã\u0081¡ä¸\u008aã\u0081\u0092ã\u0082\u008b.mp3"
> iconv(b, "latin1")
[1] "ç«\u008bã\u0081¡ä¸\u008aã\u0081\u0091ã\u0082\u0099ã\u0082\u008b.mp3"
> nchar(a)
[1] 9
> nchar(b)
[1] 10
How do I convert these two versions of the same Japanese characters so that they can be matched validly (i.e. they should be the same) using R?
There is an open-source bridge library to the ICU library, RUnicode. You can normalize the search key to NFD (the Mac OS X style) when running on Mac OS X.
Note that it also normalizes other Japanese characters, such as full-width and half-width katakana, which may or may not be what you want.
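If RUnicode is hard to come by, the stringi package (another open-source ICU wrapper) exposes the same Unicode normalization; here is a minimal sketch, assuming (as in the question) that the CSV side is precomposed (NFC) and the Mac filename side is decomposed (NFD):

library(stringi)

a <- "立ち上げる.mp3"      # precomposed (NFC) form, as imported from the CSV
b <- stri_trans_nfd(a)     # decomposed (NFD) form, the way Mac OS X stores filenames

a == b                     # FALSE: same text, different code-point sequences
# [1] FALSE
stri_trans_nfc(a) == stri_trans_nfc(b)   # TRUE once both sides are normalized to NFC
# [1] TRUE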
I have a dataset with a key column which is basically a 19 digit integer.
I'm using tibbles so I use options(pillar.sigfig = 22) to display larger numbers and not scientific notation.
The problem is that the number stored in the column and the one that is displayed are slightly different; to be specific, the last 3 digits differ.
E.g.:
options(pillar.sigfig = 22)
x <- 1099324498500011011
But when I try to return the number I get 1099324498500011008.
I'm not sure why R would change the last 3 digits, and since it is a key, this makes my data unusable for analysis.
I have tried the usual options(scipen = 999) for suppressing scientific notation but it does not seem to work on tibbles.
How do I get the same 19 digit number as I intend to store it?
Sorry to be the bearer of bad news, but R only has
a numeric type (double) using 64 bits and approximately sixteen decimal digits of precision
an integer type (int) using 32 bits
There is nothing else. You may force the print function to show you nineteen digits, but that just means ... you are looking at three digits of randomness.
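To see this with the value from the question (a quick illustration of the rounding):

> x <- 1099324498500011011      # parsed as a 64-bit double, not as an integer
> print(x, digits = 22)         # forcing more digits only exposes the rounding
[1] 1099324498500011008
> sprintf("%.0f", x)
[1] "1099324498500011008"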
19 digits for (countable) items are common, and often provided by (signed or unsigned) int64_t types, which R does not have natively but approximates via the integer64 type in the bit64 package.
So the following may be your only workaround:
> suppressMessages(library(bit64))
> x <- as.integer64("123456790123456789")
> x
integer64
[1] 123456790123456789
> x - 1
integer64
[1] 123456790123456788
>
The good news is that integer64 is reasonably well supported by data.table and a number of other packages.
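For example (a small sketch; the inline one-column CSV is made up for illustration), data.table::fread() can read such a 19-digit key column straight into integer64:

> library(data.table)
> dt <- fread("id\n1099324498500011011\n", integer64 = "integer64")
> dt$id
integer64
[1] 1099324498500011011
> class(dt$id)
[1] "integer64"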
PS It really is 19 digits where it bites:
> as.integer64(1.2e18) + 1
integer64
[1] 1200000000000000001
> as.integer64(1.2e19) + 1
integer64
[1] <NA>
Warning message:
In as.integer64.double(1.2e+19) : NAs produced by integer64 overflow
>
I have a problem converting a long number to a string in R. How can I easily convert a number to a string while preserving precision? I have a simple example below.
a = -8664354335142704128
toString(a)
[1] "-8664354335142704128"
b = -8664354335142703762
toString(b)
[1] "-8664354335142704128"
a == b
[1] TRUE
I expected toString(a) and toString(b) to produce different strings, but I got the same value for both. I suppose toString() converts the number to a float or something like that before converting it to a string.
Thank you for your help.
Edit:
> -8664354335142704128 == -8664354335142703762
[1] TRUE
> along = bit64::as.integer64(-8664354335142704128)
> blong = bit64::as.integer64(-8664354335142703762)
> along == blong
[1] TRUE
> blong
integer64
[1] -8664354335142704128
I also tried:
> as.character(blong)
[1] "-8664354335142704128"
> sprintf("%f", -8664354335142703762)
[1] "-8664354335142704128.000000"
> sprintf("%f", blong)
[1] "-0.000000"
Edit 2:
My question at first was whether I can convert a long number to a string without loss. Then I realized that in R it is impossible to get the real value of a long number passed into a function, because R automatically reads the value with the loss already applied.
For example, I have the function:
> my_function <- function(long_number){
+ string_number <- toString(long_number)
+ print(string_number)
+ }
If someone uses it and passes a long number, I am not able to tell exactly which number was passed.
> my_function(-8664354335142703762)
[1] "-8664354335142704128"
If I were reading the numbers from a file, this would be easy, but that is not my case; I have to work with whatever the user passed.
I am not an R expert, so I was just curious why this works in other languages but not in R. For example, in Python:
>>> def my_function(long_number):
... string_number = str(long_number)
... print(string_number)
...
>>> my_function(-8664354335142703762)
-8664354335142703762
Now I know that the problem is how R reads and stores numbers; every language can do this differently. I have to change the way numbers are passed to the R function, and that solves my problem.
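For example, one possible way (among others) is simply to have the caller pass the value quoted as a string:

> my_function("-8664354335142703762")   # passed as a string, so R never parses it as a double
[1] "-8664354335142703762"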
So the correct answer to my question is:
""I suppose toString() converts the number to float", nope, you did it yourself (even if unintentionally)." - Nope, R did it itself, that is the way how R reads numbers.
So I marked r2evans answer as the best answer because this user helped me to find the right solution. Thank you!
Bottom line up front: you must (in this case) read in your large numbers as strings before converting them to 64-bit integers:
bit64::as.integer64("-8664354335142704128") == bit64::as.integer64("-8664354335142703762")
# [1] FALSE
Some points about what you've tried:
"I suppose toString() converts the number to float", nope, you did it yourself (even if unintentionally). In R, when creating a number, 5 is a float and 5L is an integer. Even if you had tried to create it as an integer, it would have complained and lost precision anyway:
class(5)
# [1] "numeric"
class(5L)
# [1] "integer"
class(-8664354335142703762)
# [1] "numeric"
class(-8664354335142703762L)
# Warning: non-integer value 8664354335142703762L qualified with L; using numeric value
# [1] "numeric"
more appropriately, when you type it in as a number and then try to convert it, R processes the inside of the parentheses first. That is, with
bit64::as.integer64(-8664354335142704128)
R first has to parse and "understand" everything inside the parentheses before it can be passed to the function. (This is typically a compiler/language-parsing thing, not just an R thing.) In this case, it sees that it appears to be a (large) negative float, so it creates a class numeric (float). Only then does it send this numeric to the function, but by this point the precision has already been lost. Ergo the otherwise-illogical
bit64::as.integer64(-8664354335142704128) == bit64::as.integer64(-8664354335142703762)
# [1] TRUE
In this case, it just happens that the 64-bit version of that number is equal to what you intended.
bit64::as.integer64(-8664254335142704128) # ends in 4128
# integer64
# [1] -8664254335142704128 # ends in 4128, yay! (coincidence?)
If you subtract one, it results in the same effective integer64:
bit64::as.integer64(-8664354335142704127) # ends in 4127
# integer64
# [1] -8664354335142704128 # ends in 4128 ?
This continues for quite a while, until it finally shifts to the next rounding point:
bit64::as.integer64(-8664254335142703617)
# integer64
# [1] -8664254335142704128
bit64::as.integer64(-8664254335142703616)
# integer64
# [1] -8664254335142703104
It is unlikely to be coincidence that the difference is 1024, or 2^10: doubles carry 53 significant bits, so for magnitudes just below 2^63 the gap between adjacent representable values is exactly 2^(63-53) = 1024; that is the double-precision floating-point granularity at this scale.
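A quick check of that spacing, using one of the values above:

x <- -8664254335142703616
# distance between adjacent representable doubles at this magnitude
2^(floor(log2(abs(x))) - 52)
# [1] 1024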
fortunately, bit64::as.integer64 has several S3 methods, useful for converting different formats/classes to an integer64:
library(bit64)
methods(as.integer64)
# [1] as.integer64.character as.integer64.double as.integer64.factor
# [4] as.integer64.integer as.integer64.integer64 as.integer64.logical
# [7] as.integer64.NULL
So, bit64::as.integer64.character can be useful, since precision is not lost when you type it or read it in as a string:
bit64::as.integer64("-8664354335142704128")
# integer64
# [1] -8664354335142704128
bit64::as.integer64("-8664354335142704128") == bit64::as.integer64("-8664354335142703762")
# [1] FALSE
FYI, your number is already near the 64-bit boundary:
-.Machine$integer.max
# [1] -2147483647
-(2^31-1)
# [1] -2147483647
log(8664354335142704128, 2)
# [1] 62.9098
-2^63 # the approximate +/- range of 64-bit integers
# [1] -9.223372e+18
-8664354335142704128
# [1] -8.664354e+18
I have recently started learning the R language and was working on combining vectors. I was following a tutorial, and when I try to print character, complex, and integer values combined with c(), there is a spacing difference between them.
I have attached a snapshot, as I might not be able to articulate it properly in words.
As Roland commented, a vector can only contain one data type. Here, since you have a character element, all the other values are coerced to character.
x <- c(123.56, 21, "rajat", 2+4i); print(x)
The spacing, which as far as I understand should not be a problem, arises because each element of the vector has a different number of characters:
> nchar(x)
[1] 6 2 5 4
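The padding follows the same rule format() applies: every element is padded to the width of the widest one. A quick check:

> format(x)
[1] "123.56" "21    " "rajat " "2+4i  "
> nchar(format(x))
[1] 6 6 6 6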
Now, if all elements have an equal number of characters, the spacing is as expected:
x <- c(123.56, 210000, "rajata", 2+442i); print(x)
[1] "123.56" "210000" "rajata" "2+442i"
nchar(x)
[1] 6 6 6 6
Maybe it is a silly question, but while playing with subsetting I came across this behaviour and I can't understand why it happens.
For example, consider a string, say "a", and an integer, say 3. Why does this expression return TRUE?
"a" >= 3
[1] TRUE
When you try to compare a string to an integer, R will coerce the number into a string, so 3 becomes "3".
Using comparison operators on strings checks whether the condition is true or false according to their ordering in the collating sequence (roughly, alphabetical order). For example:
> "a" < "b"
[1] TRUE
> "b" > "c"
[1] FALSE
These results happen because, for R, the ascending order is a, b, c. Digits usually come before letters in this ordering (just look at files sorted by name when some of the names start with a digit). This is why you get:
"a" >= 3
[1] TRUE
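Put differently, the comparison behaves exactly as if you had done the coercion yourself:

> "a" >= as.character(3)    # same comparison, coercion made explicit
[1] TRUE
> "10" < "9"                # lexicographic, not numeric, ordering
[1] TRUE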
Finally, note that your result can vary depending on your locale and how the alphabetical order is defined on it. The manual says:
Comparison of strings in character vectors is lexicographic within the
strings using the collating sequence of the locale in use: see
locales. The collating sequence of locales such as en_US is normally
different from C (which should use ASCII) and can be surprising.
Beware of making any assumptions about the collation order: e.g. in
Estonian Z comes between S and T, and collation is not necessarily
character-by-character – in Danish aa sorts as a single letter, after
z. In Welsh ng may or may not be a single sorting unit: if it is it
follows g. Some platforms may not respect the locale and always sort
in numerical order of the bytes in an 8-bit locale, or in Unicode
code-point order for a UTF-8 locale (and may not sort in the same
order for the same language in different character sets). Collation of
non-letters (spaces, punctuation signs, hyphens, fractions and so on)
is even more problematic.
This is important and should be kept in mind whenever comparison operators are used to compare strings (whether or not numbers are involved).
EDIT: This bug, found in 32-bit versions of R, was fixed in R version 2.9.2.
This was tweeted to me by #leoniedu today, and I don't have an answer for him, so I thought I would post it here.
I have read the documentation for agrep() (fuzzy string matching) and it appears that I don't fully understand the max.distance parameter. Here's an example:
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
agrep(pattern,x,max.distance=18)
agrep(pattern,x,max.distance=19)
That behaves exactly as I would expect: there are 18 characters of difference between the strings, so I would expect that to be the threshold of a match. Here's what's confusing me:
agrep(pattern,x,max.distance=30)
agrep(pattern,x,max.distance=31)
agrep(pattern,x,max.distance=32)
agrep(pattern,x,max.distance=33)
Why are 30 and 33 matches, but not 31 and 32? To save you some counting,
> nchar("Staatssekretar im Bundeskanzleramt")
[1] 34
> nchar("Bundeskanzleramt")
[1] 16
I posted this on the R list a while back and reported it as a bug on R-bugs-list. I got no useful responses, so I tweeted to see whether the bug was reproducible or I was just missing something. JD Long was able to reproduce it and kindly posted the question here.
Note that, at least in R, agrep is something of a misnomer, since it does not match regular expressions, while grep stands for "Globally search for the Regular Expression and Print". It shouldn't have a problem with patterns longer than the target string, though. (I think!)
On my Linux server all is well, but not so on my Mac and Windows machines.
Mac:
sessionInfo()
R version 2.9.1 (2009-06-26)
i386-apple-darwin8.11.1
locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
agrep(pattern,x,max.distance=30)
[1] 1
agrep(pattern,x,max.distance=31)
integer(0)
agrep(pattern,x,max.distance=32)
integer(0)
agrep(pattern,x,max.distance=33)
[1] 1
Linux:
R version 2.9.1 (2009-06-26)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
agrep(pattern,x,max.distance=30)
[1] 1
agrep(pattern,x,max.distance=31)
[1] 1
agrep(pattern,x,max.distance=32)
[1] 1
agrep(pattern,x,max.distance=33)
[1] 1
I am not sure your example makes sense. For the basic grep(), pattern is usually a simple string or a regular expression, and x is a vector whose elements get matched against pattern. Having pattern be a longer string than x strikes me as odd.
Consider this example, where a short pattern is matched against longer strings, first with grep() and then with agrep():
R> grep("vo", c("foo","bar","baz")) # vo is not in the vector
integer(0)
R> agrep("vo", c("foo","bar","baz"), value=TRUE) # but is close enough to foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.25) # still foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.75) # now all match
[1] "foo" "bar" "baz"
R>
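For reference, ?agrep documents that max.distance can be given either as an absolute number of allowed edits or as a fraction of the pattern length; a small sketch using the "lasy"/"lazy" example from the help page:

R> agrep("lasy", "1 lazy 2", max.distance = 1)     # integer: at most one edit allowed
[1] 1
R> agrep("lasy", "1 lazy 2", max.distance = 0)     # no edits allowed: no exact match present
integer(0)
R> agrep("lasy", "1 lazy 2", max.distance = 0.25)  # fraction of pattern length: 0.25 * 4 = 1 edit
[1] 1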