I have a data frame containing character vectors. The data come from web scraping of a particular website, gathered over the previous year on different computers (PCs). The operating system was probably the same on all of them (Windows). After combining all the pieces into a single data frame, I found that applying gsub() (str_replace() causes the same issue) introduces an encoding distortion, so that some Polish characters become wrongly encoded (see the figure and the system specification below).

voivodeshipRaw is the raw/original data, while voivodeshipProcessed is the character vector after applying gsub() (I tried to remove unusual spaces with "\\s+"). I applied Encoding() and stri_enc_detect() to detect the encoding; the output is shown in the columns encodingType and stri_enc_detect. As you can see, the results differ. The cells with encoding distortions (column voivodeshipProcessed, ids 4, 7, 1946505 and 1946507) have unknown encoding according to Encoding() and windows-1250 according to stri_enc_detect(). I tried to change the encoding of such cells using the following functions:
stri_enc_toutf8 = stri_enc_toutf8(str = voivodeshipRaw)

encoding = sapply(voivodeshipRaw, function(x) {
  Encoding(x) <- "UTF-8"
  return(x)
})

iconv = iconv(voivodeshipRaw, from = "windows-1250", to = "utf-8")
The output is presented in the figure below. As you can see, the distortions still exist.
The charToRaw() output for the voivodeshipRaw column shown in the figure (raw data, śląskie voivodeship, ids 1946504 and 1946505), without and with the encoding issue:
# id: 1946504 -> proper encoding (UTF-8 bytes for "śląskie")
c5 9b 6c c4 85 73 6b 69 65
# id: 1946505 -> wrong encoding (windows-1250 bytes for "śląskie")
9c 6c b9 73 6b 69 65
My question is: how can I avoid such encoding distortions when applying gsub() or the str_/stri_ functions?
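Based on the charToRaw() output above, the raw column apparently mixes UTF-8 and windows-1250 byte sequences, so converting the whole column at once fails. A minimal sketch of a per-element fix, assuming the only two encodings present are UTF-8 and windows-1250:

library(stringi)

# Convert only the elements that are not already valid UTF-8,
# assuming everything that is not UTF-8 is windows-1250.
fix_mixed_encoding <- function(x) {
  bad <- !stri_enc_isutf8(x)
  x[bad] <- stri_encode(x[bad], from = "windows-1250", to = "UTF-8")
  x
}

voivodeshipFixed <- fix_mixed_encoding(voivodeshipRaw)
# gsub() should now leave the Polish characters intact, e.g.:
voivodeshipClean <- gsub("\\s+", " ", voivodeshipFixed)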
[Figure: voivodeshipRaw vs. voivodeshipProcessed with the Encoding() and stri_enc_detect() results]
System specification:
sessionInfo()
# R version 4.1.1 (2021-08-10)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 19045)
# locale:
# [1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C LC_TIME=Polish_Poland.1250
# system code page: 1251
I have the following data:
authors <- c("Fernando Carré", "Adrüne Coça", "Pìso Därço")
And I want to convert the non-English characters into ASCII, but without removing the spaces. This is what I have tried:
gsub("[^[:alnum:]]","",authors)
But it returns:
[1] "FernandoCarré" "AdrüneCoça" "PìsoDärço"
It should return:
"Fernando Carre" "Adrune Coca", "Piso Darco"
Any help will be greatly appreciated.
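A side note (not part of the original post): the gsub() attempt drops the spaces only because the space is excluded from the character class; keeping a space inside the negated class preserves it. A minimal sketch:

# Keep letters, digits and spaces; everything else is removed
gsub("[^[:alnum:] ]", "", authors)
# [1] "Fernando Carré" "Adrüne Coça" "Pìso Därço"
# The diacritics still need transliteration, e.g. via iconv() as in the answer below.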
Thanks to Onyambu for the correction; my earlier statement that "the expression [[:alnum:]] is made for the stringr package only and cannot be used in other packages" is not correct. Here is what I got from the console:
> authors <- c("Fernando Carré", "Adrüne Coça", "Pìso Därço")
> iconv(authors, to = "ASCII//TRANSLIT")
[1] "Fernando Carre" "Adrune Coca" "Piso Darco"
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
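As an aside (not part of the original answer), iconv()'s //TRANSLIT support depends on the underlying iconv implementation, so the result above may differ between Linux, macOS and Windows. A more portable sketch uses stringi's ICU transliterator:

library(stringi)
stri_trans_general(authors, "Latin-ASCII")
# [1] "Fernando Carre" "Adrune Coca"    "Piso Darco"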
I am trying to use an adjacency matrix that has labels in UTF-8. Is there a way to make sure igraph functions use UTF-8, something along the lines of encoding = "UTF-8"? This is to avoid the following result (French text on a Japanese system shows kanji instead of French diacritics). Thanks for any pointers.
> m1 <- graph_from_adjacency_matrix(m, mode = "directed", weighted = TRUE)
> m1
IGRAPH 7a99453 DNW- 391 1454 --
+ attr: name (v/c), weight (e/n)
+ edges from 7a99453 (vertex names):
[1] Accept ->Accepter Acknowledge->Appr馗ier Acknowledge->Confirmer Acknowledge->Conscient
[5] Acknowledge->Consid駻er Acknowledge->Constater Acknowledge->Convenir Acknowledge->Donner
+ ... omitted several edges
As requested:
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C LC_TIME=French_France.1252
Actually, this may not be an issue with igraph but with RStudio, since I just realised that although my tables are properly displayed with the View() command, if I just call them in the console the French diacritics are displayed as Japanese kanji. In any case, igraph does this too.
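A sketch of one thing to check (the file name and read step are assumptions, not from the original post): make sure the labels are flagged as UTF-8 before the graph is built, so that any mangling happens only at display time:

library(igraph)

# Hypothetical import step: declare the file encoding when reading
df <- read.csv("adjacency.csv", row.names = 1, check.names = FALSE,
               fileEncoding = "UTF-8")
m <- as.matrix(df)

# Mark the dimnames explicitly as UTF-8 before building the graph
rownames(m) <- enc2utf8(rownames(m))
colnames(m) <- enc2utf8(colnames(m))

m1 <- graph_from_adjacency_matrix(m, mode = "directed", weighted = TRUE)
Encoding(V(m1)$name)   # should report "UTF-8" rather than "unknown"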
I was trying to match a vector of Japanese strings (originally imported from a comma-separated file) against a list of filenames extracted from a folder under Mac OS X.
One element of the vector is a:
> a
[1] "立ち上げる.mp3"
The corresponding element from the filename is b
> b
[1] "立ち上げる.mp3"
The problem is that they are not logically equal to each other in R:
> a == b
[1] FALSE
I already found out that the problem comes from the two Unicode representations of Japanese "dakuten" characters: one string uses the precomposed form, the other the decomposed form (i.e. the げ character stored as け plus a combining dakuten mark). So they are in fact different from each other:
> iconv(a, "latin1")
[1] "ç«\u008bã\u0081¡ä¸\u008aã\u0081\u0092ã\u0082\u008b.mp3"
> iconv(b, "latin1")
[1] "ç«\u008bã\u0081¡ä¸\u008aã\u0081\u0091ã\u0082\u0099ã\u0082\u008b.mp3"
> nchar(a)
[1] 9
> nchar(b)
[1] 10
How do I convert these two versions of the same Japanese characters so that they can be matched validly (i.e. they should be the same) using R?
There is an open-source bridge library, RUnicode, for calling the ICU library. You may normalize the search key to NFD (the Mac OS X style) when on Mac OS X.
It also normalizes other Japanese characters, such as full-width and half-width katakana, which may or may not suit your purpose.
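A minimal sketch of the same idea with the stringi package (which also wraps ICU): normalize both sides to a common Unicode form before comparing.

library(stringi)
# NFC composes け + combining dakuten into げ; compare the normalized forms
stri_trans_nfc(a) == stri_trans_nfc(b)
# [1] TRUE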
I have a zipped binary file under the Windows operating system that I am trying to read with R. So far it works using the unz() function in combination with the readBin() function.
> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> readBin(bin.con,
+         "double",
+         n = byte_chunk,
+         size = 8L,
+         endian = "little")
> close(bin.con)
Where zip_path is the path to the zip file, file_in_zip is the filename within the zip file that is to be read and byte_chunk the number of bytes that I want to read.
In my use case, the readBin operation is part of a loop and gradually reads the whole binary file. However, I rarely want to read everything and often I know precisely which parts I want to read. Unfortunately, readBin doesn't have a start/skip argument to skip the first n bytes. Therefore I tried to conditionally replace readBin() with seek() in order to skip the actual reading of the unwanted parts.
When I try this, I get an error:
> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> seek(bin.con, where = bytes_to_skip, origin = 'current')
Error in seek.connection(bin.con, where = bytes_to_skip, origin = "current") :
seek not enabled for this connection
> close(bin.con)
So far, I haven't found a way around this error. Similar questions can be found here (unfortunately without a satisfactory answer):
https://stat.ethz.ch/pipermail/r-help/2007-December/148847.html (no answer)
http://r.789695.n4.nabble.com/reading-file-in-zip-archive-td4631853.html (no answer but reproducible example)
Tips all over the internet suggest adding the open = 'r' argument to unz() or dropping the open argument altogether, but that only works for non-binary files (since the default is 'r'). People also suggest unzipping the files first, but since the files are quite big, this is practically impossible.
Is there any work-around to seek in a binary zipped file or read with a byte offset (potentially using C++ via the Rcpp package)?
Update:
Further research seems to indicate that seek() in zip files is not an easy problem. This question suggests a C++ library that can, at best, do a coarse seek. This Python question indicates that an exact seek is completely impossible because of the way zip is implemented (although it doesn't contradict the coarse-seek method).
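A possible workaround (not from the original post): since the unz() connection can only be read sequentially, the unwanted leading bytes can be read into a raw vector and discarded instead of seeking. A minimal sketch, reusing the variable names from above:

bin.con <- unz(zip_path, file_in_zip, open = "rb")
# "Seek" by reading and throwing away bytes_to_skip bytes
invisible(readBin(bin.con, "raw", n = bytes_to_skip))
wanted <- readBin(bin.con, "double", n = byte_chunk, size = 8L,
                  endian = "little")
close(bin.con)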
Here's a bit of a hack that might work for you. Here's a fake binary file:
writeBin(as.raw(1:255), "file.bin")
readBin("file.bin", raw(1), n = 16)
# [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10
And here's the produced zip file:
zip("file.zip", "file.bin")
# adding: file.bin (stored 0%)
readBin("file.zip", raw(1), n = 16)
# [1] 50 4b 03 04 0a 00 02 00 00 00 7b ab 45 4a 87 1f
This uses a temporary intermediate binary file.
system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=5c count=4c"')
# 4+0 records in
# 4+0 records out
# 4 bytes copied, 0.00044964 s, 8.9 kB/s
file.info("tempfile.bin")$size
# [1] 4
readBin("tempfile.bin", raw(1), n = 16)
# [1] 06 07 08 09
This method offloads the "expense" of dealing with the size of the stored binary data onto the shell/pipe, outside of R.
This worked on win10, R-3.3.2. I'm using dd from Git for Windows (version 2.11.0.3, though 2.11.1 is available), and unzip and sh from RTools.
Sys.which(c("dd", "unzip", "sh"))
# dd
# "C:\\PROGRA~1\\Git\\usr\\bin\\dd.exe"
# unzip
# "c:\\Rtools\\bin\\unzip.exe"
# sh
# "c:\\Rtools\\bin\\sh.exe"
EDIT: This bug was found in 32-bit versions of R and was fixed in R version 2.9.2.
This was tweeted to me by @leoniedu today and I don't have an answer for him, so I thought I would post it here.
I have read the documentation for agrep() (fuzzy string matching) and it appears that I don't fully understand the max.distance parameter. Here's an example:
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
agrep(pattern,x,max.distance=18)
agrep(pattern,x,max.distance=19)
That behaves exactly as I would expect. The strings differ by 18 characters, so I would expect that to be the threshold for a match. Here's what's confusing me:
agrep(pattern,x,max.distance=30)
agrep(pattern,x,max.distance=31)
agrep(pattern,x,max.distance=32)
agrep(pattern,x,max.distance=33)
Why are 30 and 33 matches, but not 31 and 32? To save you some counting,
> nchar("Staatssekretar im Bundeskanzleramt")
[1] 34
> nchar("Bundeskanzleramt")
[1] 16
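As a side note (not in the original post), the edit distance the question refers to can be checked directly with adist() from base R's utils package:

adist("Staatssekretar im Bundeskanzleramt", "Bundeskanzleramt")
#      [,1]
# [1,]   18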
I posted this on the R list a while back and reported it on R-bugs-list. I had no useful responses, so I tweeted to see whether the bug was reproducible or I was just missing something. JD Long was able to reproduce it and kindly posted the question here.
Note that, at least in R, agrep is something of a misnomer, since it does not match regular expressions, while grep stands for "globally search for a regular expression and print". It shouldn't have a problem with patterns longer than the target vector. (I think!)
On my Linux server all is well, but not on my Mac and Windows machines.
Mac:
sessionInfo()
R version 2.9.1 (2009-06-26)
i386-apple-darwin8.11.1
locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
agrep(pattern,x,max.distance=30)
[1] 1
agrep(pattern,x,max.distance=31)
integer(0)
agrep(pattern,x,max.distance=32)
integer(0)
agrep(pattern,x,max.distance=33)
[1] 1
Linux:
R version 2.9.1 (2009-06-26)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
agrep(pattern,x,max.distance=30)
[1] 1
agrep(pattern,x,max.distance=31)
[1] 1
agrep(pattern,x,max.distance=32)
[1] 1
agrep(pattern,x,max.distance=33)
[1] 1
I am not sure your example makes sense. For the basic grep(), pattern is often a simple string or a regular expression, and x is a vector whose elements get matched against pattern. Having pattern be a longer string than x strikes me as odd.
Consider this example, where we compare grep() and agrep():
R> grep("vo", c("foo","bar","baz")) # vo is not in the vector
integer(0)
R> agrep("vo", c("foo","bar","baz"), value=TRUE) # but is close enough to foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.25) # still foo
[1] "foo"
R> agrep("vo", c("foo","bar","baz"), value=TRUE, max.dist=0.75) # now all match
[1] "foo" "bar" "baz"
R>