I have a web response being returned in raw format which I'm unable to properly encode. It contains the following values:
ef bc 86
The character is meant to be a Fullwidth Ampersand (to illustrate below):
> as.character("\uFF06")
[1] "＆"
> charToRaw("\uFF02")
[1] ef bc 82
However, no matter what I've tried, it gets converted to ＂. To illustrate:
> rawToChar(charToRaw("\uFF02"))
[1] "＂"
Because of the equivalence of the raw values, I don't think there's anything I can do in my web call to influence the problem I'm having (happy to be corrected). I believe I need to work out how to properly do the character encoding.
I also took an extreme approach of trying all other encodings as follows but none converted to the fullwidth ampersand:
> x_raw <- charToRaw("\uFF02")
> x_raw
[1] ef bc 82
> sapply(
+ stringi::stri_enc_list()
+ ,function(encoding) stringi::stri_encode(str = x_raw, encoding)
+ ) |> # R's new native pipe
+ tibble::enframe(name = "encoding")
# A tibble: 1,203 x 2
encoding value
<chr> <chr>
1 037 "Õ¯b"
2 273 "Õ¯b"
3 277 "Õ¯b"
4 278 "Õ¯b"
5 280 "Õ¯b"
6 284 "Õ¯b"
7 285 "Õ~b"
8 297 "Õ¯b"
9 420 "\u001a\u001ab"
10 424 "\u001a\u001ab"
# ... with 1,193 more rows
My work around at the moment is to replace the strings after the encoding, but this character is just one example of many, and hard-coding every instance doesn't seem practical.
> rawToChar(x_raw)
[1] "＂"
> stringr::str_replace_all(rawToChar(x_raw), c("＂" = "\uFF06"))
[1] "＆"
The substitution workaround is further complicated by characters like the HYPHEN (not HYPHEN-MINUS), where the last two raw values somehow get converted to a string with what appear to be octal escapes:
> as.character("\u2010") # HYPHEN
[1] "‐"
> as.character("\u2010") |> charToRaw() # As raw
[1] e2 80 90
> as.character("\u2010") |> charToRaw() |> rawToChar() # Converted back to string
[1] "â€\u0090"
> charToRaw("â\200\220") # string with equivalent raw
[1] e2 80 90
Any help appreciated.
I'm not totally clear on exactly what you are trying to do, but the problem with getting back your original character is that R cannot determine the encoding automatically from the raw bytes. I assume you are on Windows. If you do
val <- rawToChar(charToRaw("\uFF06"))
val
# [1] "＆"
Encoding(val)
# [1] "unknown"
Encoding(val) <- "UTF-8"
val
# [1] "＆"
Just make sure to set the encoding properly.
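As a minimal sketch of the same idea applied to the bytes quoted in the question: the helper name raw_to_utf8 is hypothetical, and the sketch assumes the web response really is UTF-8. It checks the decoded code points instead of relying on how the console renders the characters.

```r
# Bytes as they might arrive from the web response (values taken from the question).
amp_raw    <- as.raw(c(0xef, 0xbc, 0x86))  # FULLWIDTH AMPERSAND, U+FF06
hyphen_raw <- as.raw(c(0xe2, 0x80, 0x90))  # HYPHEN, U+2010

# Hypothetical helper: turn raw bytes into a string and declare them as UTF-8.
raw_to_utf8 <- function(bytes) {
  out <- rawToChar(bytes)
  Encoding(out) <- "UTF-8"
  out
}

amp    <- raw_to_utf8(amp_raw)
hyphen <- raw_to_utf8(hyphen_raw)

# Inspect the code points; this does not depend on the console's locale.
sprintf("U+%04X", utf8ToInt(amp))     # "U+FF06"
sprintf("U+%04X", utf8ToInt(hyphen))  # "U+2010"
```

Because the encoding is declared once for the whole string, no per-character substitution table should be needed.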
Related
I have an object called data (representing a field of wind strengths at geographical locations) obtained by using some code to read in a .grib file:
> data
ecmf : u-component of wind
Time:
2020/07/09 z00:00 0-0 h
Domain summary:
601 x 351 domain
Projection summary:
proj= latlong
NE = ( 50 , 75 )
SW = ( -10 , 40 )
Data summary:
-89.06099 -50.08242 -41.13694 -43.42623 -34.77617 -25.03278
data is 601 x 351 array of doubles:
> typeof(data)
[1] "double"
> is.array(data)
[1] TRUE
> dim(data)
[1] 601 351
but, as shown above, it also has extra information attached beyond the numerical values of the array elements (Time:, Projection summary etc). How do I extract these? Attempts such as data$time do not seem to work.
As suggested in the comments to the question, I was able to access the values I wanted using attributes(). attributes(data) returns a list of all the relevant elements.
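A small sketch of what that looks like, using a toy array rather than the real GRIB object. The attribute names time and projection are made up for illustration; the actual names on the object come from whatever package read the file.

```r
# Toy stand-in for the wind field: a numeric array with extra attributes attached.
wind <- structure(
  array(c(-89.1, -50.1, -41.1, -43.4, -34.8, -25.0), dim = c(2, 3)),
  time = "2020/07/09 z00:00",            # hypothetical attribute name
  projection = list(proj = "latlong")    # hypothetical attribute name
)

names(attributes(wind))           # lists every attribute, including "dim"
attr(wind, "time")                # one attribute by name
attributes(wind)$projection$proj  # drill into a list-valued attribute
```

wind$time fails here for the same reason data$time did: $ only works on lists and similar recursive objects, not on atomic arrays.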
How can I convert Ａb９８７６５４３２１０ into Ab9876543210? Is there a solution by regular expression?
test <- dput("Ａb９８７６５４３２１０")
Disclaimer: The following works on my machine, but since I can't replicate your full-width string based purely on the example provided, this is a best guess based on my version of the problem (pasting the string into a text file, saving it with UTF-8 encoding, and loading it in with the encoding specified as UTF-8).
Step 1. Reading in the text (I added a half width version for comparison):
> test <- readLines("fullwidth.txt", encoding = "UTF-8")
> test
[1] "Ａb９８７６５４３２１０" "Ab9876543210"
Step 2. Verifying that the full & half width versions are not equal:
# using all.equal()
test1 <- test[1]
test2 <- test[2]
> all.equal(test1, test2)
[1] "1 string mismatch"
# compare raw bytes
> charToRaw(test1)
[1] ef bb bf ef bc a1 62 ef bc 99 ef bc 98 ef bc 97 ef bc 96 ef bc 95 ef
[24] bc 94 ef bc 93 ef bc 92 ef bc 91 ef bc 90
> charToRaw(test2)
[1] 41 62 39 38 37 36 35 34 33 32 31 30
For anyone interested, if you paste the raw byte version into a utf-8 decoder as hexadecimal input, you'll see that except for letter b (mapped from 62 in the 7th byte), the rest of the letters were formed by 3-byte sequences. In addition, the first 3-byte sequence maps to "ZERO WIDTH NO-BREAK SPACE character", so it's not visible when you print the string to console.
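The same decoding can be done inside R instead of an online decoder. This is a sketch using only the first seven bytes from the output above:

```r
# First seven bytes of charToRaw(test1) as printed above.
bytes <- as.raw(c(0xef, 0xbb, 0xbf,   # 3-byte sequence: the invisible leading character
                  0xef, 0xbc, 0xa1,   # 3-byte sequence: fullwidth A
                  0x62))              # single byte: plain ASCII b

# utf8ToInt() decodes UTF-8 bytes into Unicode code points.
code_points <- sprintf("U+%04X", utf8ToInt(rawToChar(bytes)))
code_points
# [1] "U+FEFF" "U+FF21" "U+0062"
```

U+FEFF is the ZERO WIDTH NO-BREAK SPACE (byte order mark) mentioned above, and U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A.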
Step 3. Converting from full width to half width using the Nippon package:
library(Nippon)
test1.converted <- zen2han(test1)
> test1.converted
[1] "Ab9876543210"
# If you want to compare against the original test2 string, remove the zero
# width character in front
> all.equal(substring(test1.converted, 2), test2)
[1] TRUE
Here is a base R solution
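Fullwidth variants of the ASCII characters occupy U+FF01 to U+FF5E, each exactly 0xFEE0 (65248) above its ASCII counterpart, so they can be shifted down like this. (The rest of the block, up to U+FFEF, holds halfwidth katakana and other forms that this offset does not apply to.)
x <- "Ａb９８７６５４３２１０"
iconv(x, to = "UTF-8") |>
utf8ToInt() |>
(\(.) ifelse(. >= 0xFF01 & . <= 0xFF5E, . - 65248, .))() |>
intToUtf8()
[1] "Ab9876543210"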
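If this needs to run over a whole character vector, the offset trick could be wrapped up roughly as follows. The function name to_halfwidth is hypothetical, and dropping the byte order mark and mapping the ideographic space (U+3000) are extra assumptions about what the input may contain.

```r
# Hypothetical helper: map fullwidth ASCII variants to plain ASCII, element by element.
to_halfwidth <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))
    cp <- cp[cp != 0xFEFF]               # drop a byte order mark, if present
    fw <- cp >= 0xFF01 & cp <= 0xFF5E    # fullwidth ASCII variants only
    cp[fw] <- cp[fw] - 0xFEE0            # shift down to U+0021..U+007E
    cp[cp == 0x3000] <- 0x20             # ideographic space -> ordinary space
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# Escapes are used so the example does not depend on the source file's encoding.
to_halfwidth(c("\uFEFF\uFF21b\uFF19\uFF18", "plain"))
# [1] "Ab98"  "plain"
```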
I am writing my bachelor thesis and I don't have much experience with R so far.
My problem is that my dates, which I created with this command:
t<-strptime(x, "%d.%m.%Y %H.%M")
don't work anymore when I save them in a matrix with the other information on those specific dates.
I am a bit confused because it works just fine when I don't put them in a matrix, like this: t[1:10]
But this error appears as soon as I try to save them in a matrix:
matrix1<-matrix(c(t,v2,v3,v4),nrow=length(v2))
Fehler in as.POSIXct.numeric(X[[i]], ...) : 'origin' muss angegeben werden
It's German but it means origin must be supplied.
Any ideas what I have to do to fix it? I am a bit frustrated :)
Roland is right. You can't have POSIXlt objects in a matrix. What you can do is save those dates as numeric timestamps in the matrix and convert them back to dates when you access them.
Converting to numeric timestamp:
> date <- as.numeric(as.POSIXct("2014-02-16 02:13:46", tz = "UTC"))
> date
[1] 1392516826
Then save those timestamps in the matrix as you do. To convert one back to a date, pass it to as.POSIXct() again, this time with origin = "1970-01-01" and the same tz, instead of wrapping the call in as.numeric().
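A sketch of the full round trip, under the assumption that all timestamps should be treated as UTC. The column name v2 and its values are placeholders for the question's other vectors.

```r
# Two example timestamps, parsed with an explicit time zone.
stamps <- as.POSIXct(c("2014-02-16 02:13:46", "2014-02-16 03:13:46"), tz = "UTC")

# Store them as plain numbers (seconds since 1970-01-01) next to other numeric data.
m <- cbind(time = as.numeric(stamps), v2 = c(1.5, 2.5))

# Convert back on access; the origin is what the German error message was asking for.
restored <- as.POSIXct(m[, "time"], origin = "1970-01-01", tz = "UTC")
format(restored, "%Y-%m-%d %H:%M:%S")
# [1] "2014-02-16 02:13:46" "2014-02-16 03:13:46"
```

Using the same tz in both directions avoids the stored numbers depending on the machine's local time zone.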
t (terrible name by the way, easily confused with the t function) is a POSIXlt object, which internally is a list. First you should check, what c(t,v2,v3,v4) returns (I don't know how v2 etc are defined).
Then we can look into the documentation in help("matrix"):
data
an optional data vector (including a list or expression vector). Non-atomic classed R objects are coerced by as.vector and all attributes discarded.
The important bit is "all attributes discarded". This is what you get if you discard the attributes (which include the class attribute) of a POSIXlt object:
x <- strptime(c("2016-05-09 12:00:00", "2016-05-09 13:00:00"), format = "%Y-%m-%d %H:%M:%S")
attributes(x) <- NULL
print(x)
# [[1]]
# [1] 0 0
#
# [[2]]
# [1] 0 0
#
# [[3]]
# [1] 12 13
#
# [[4]]
# [1] 9 9
#
# [[5]]
# [1] 4 4
#
# [[6]]
# [1] 116 116
#
# [[7]]
# [1] 1 1
#
# [[8]]
# [1] 129 129
#
# [[9]]
# [1] 1 1
#
# [[10]]
# [1] "CEST" "CEST"
#
# [[11]]
# [1] NA NA
A matrix can't contain POSIXlt objects (or any objects, i.e., anything with an explicit class).
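If a matrix is not strictly required, one possible alternative (not something the answers above propose) is a data frame, whose columns can each keep their own class. A sketch using the question's date format, with v2 as a placeholder column:

```r
# Parse with the format from the question; strptime() returns POSIXlt.
t_lt <- strptime(c("09.05.2016 12.00", "09.05.2016 13.00"),
                 format = "%d.%m.%Y %H.%M", tz = "UTC")

# Convert to POSIXct (a classed numeric vector) and keep it as a data frame column.
df <- data.frame(time = as.POSIXct(t_lt), v2 = c(1.5, 2.5))

class(df$time)            # still a date-time class, unlike inside matrix()
df$time[2] - df$time[1]   # date arithmetic keeps working
```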
To convert strings to ascii code in R, I typically use:
> strtoi(charToRaw("abcd"),16L)
[1] 97 98 99 100
Is there a function to do the inverse, i.e.
> myDesiredFunc(c(97, 98, 99, 100))
[1] "abcd"
Thanks.
I just noticed that R has intToUtf8 and utf8ToInt functions that do the same thing.
> test<-utf8ToInt("Apples")
> test
[1] 65 112 112 108 101 115
> intToUtf8(test)
[1] "Apples"
Also:
rawToChar(as.raw(c(97,98,99,100)))
Type ?charToRaw at the command prompt for more info.
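Put together, the two directions could look like this. The function name codes_to_string is hypothetical, and the sketch assumes the codes are single bytes (0 to 255); for larger code points intToUtf8 is the better fit.

```r
# String -> byte codes; as.integer() on raw avoids the detour through strtoi().
string_codes <- as.integer(charToRaw("abcd"))
string_codes
# [1]  97  98  99 100

# Hypothetical inverse: byte codes -> string.
codes_to_string <- function(codes) rawToChar(as.raw(codes))
codes_to_string(c(97, 98, 99, 100))
# [1] "abcd"
```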
Best I found is:
readLines(rawConnection(as.raw(c(97,98,99,100,13))))
Although I guess it is better to close the connection
con = rawConnection(as.raw(c(97,98,99,100,13)))
res = readLines(con);
close(con);
show(res);
[1] "abcd"
I must be misunderstanding how read.csv works in R. I have read the help file, but still do not understand how a csv file containing:
40900,-,-,-,241.75,0
40905,244,245.79,241.25,244,22114
40906,244,246.79,243.6,245.5,18024
40907,246,248.5,246,247,60859
read into R using: euk<-data.matrix(read.csv("path/to/csv.csv"))
produces this as a result (using tail):
Date Open High Low Close Volume
[2713,] 15329 490 404 369 240.75 62763
[2714,] 15330 495 409 378 242.50 127534
[2715,] 15331 1 1 1 241.75 0
[2716,] 15336 504 425 385 244.00 22114
[2717,] 15337 504 432 396 245.50 18024
[2718,] 15338 512 442 405 247.00 60859
It must be something obvious that I do not understand. Please be kind in your responses, I am trying to learn.
Thanks!
The issue is not with read.csv, but with data.matrix. read.csv imports any column containing non-numeric text as a factor (the default before R 4.0.0; later versions import it as character, which data.matrix also converts via a factor). The '-' entries in your data are characters, so those columns become factors. When you then pass the result of read.csv to data.matrix, it replaces, as the help states, the levels of each factor with its internal codes.
Basically, you need to ensure that the columns of your data are numeric before you pass the data.frame into data.matrix.
This should work in your case (assuming the only characters are '-'):
euk <- data.matrix(read.csv("path/to/csv.csv", na.strings = "-", colClasses = 'numeric'))
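To see the difference without touching a file, here is a sketch that feeds a few of the question's rows through read.csv's text argument. The header line is assumed from the column names in the printed output.

```r
csv_text <- "Date,Open,High,Low,Close,Volume
40900,-,-,-,241.75,0
40905,244,245.79,241.25,244,22114
40906,244,246.79,243.6,245.5,18024"

# Reproduces the problem: columns containing '-' become factors and
# data.matrix() returns their integer codes instead of the prices.
bad <- data.matrix(read.csv(text = csv_text, stringsAsFactors = TRUE))
bad[2, "Open"]   # a small integer code, not 244

# The fix: treat '-' as missing so every column is read as numeric.
euk <- data.matrix(read.csv(text = csv_text, na.strings = "-",
                            colClasses = "numeric"))
euk[1, "Open"]   # NA
euk[2, "High"]   # 245.79
```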
I'm no R expert, but you may consider using scan() instead, eg:
> data = scan("foo.csv", what = list(x = numeric(), y = numeric()), sep = ",")
Where foo.csv has two columns, x and y, and is comma delimited. I hope that helps.
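A self-contained version of the same call, using scan's text argument in place of a file. The values are made up, and quiet = TRUE only suppresses the "Read N records" message.

```r
# Two comma-separated numeric columns, supplied inline instead of via foo.csv.
parsed <- scan(text = "1.5,10\n2.5,20",
               what = list(x = numeric(), y = numeric()),
               sep = ",", quiet = TRUE)

parsed$x   # 1.5 2.5
parsed$y   # 10 20

# scan() returns a list of columns; convert it if a matrix is needed.
as.matrix(as.data.frame(parsed))
```

Note that scan stops with an error on a non-numeric field such as '-' unless it is declared through na.strings.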
I took a cut/paste of your data, put it in a file and I get this using 'R'
> c<-data.matrix(read.csv("c:/DOCUME~1/Philip/LOCALS~1/Temp/x.csv",header=F))
> c
V1 V2 V3 V4 V5 V6
[1,] 40900 1 1 1 241.75 0
[2,] 40905 2 2 2 244.00 22114
[3,] 40906 2 3 3 245.50 18024
[4,] 40907 3 4 4 247.00 60859
>
There must be more in your data file, for one thing, data for the header line. And the output you show seems to start with row 2713. I would check:
The format of the header line, or get rid of it and add it manually later.
That each row has exactly 6 values.
That the filename uses forward slashes and has no embedded spaces
(use the 8.3 representation as shown in my filename).
Also, if you generated your csv file from MS Excel, the internal representation for a date is a number.