Converting a \u escaped Unicode string to ASCII - r

After reading all about iconv and Encoding, I am still confused.
I am scraping the source of a web page and have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\u003D\\u003Ebig'). I want to convert this to the ASCII string, which should be 'pretty=>big'.
More simply, if I set
x <- 'pretty\\u003D\\u003Ebig'
How do I perform a conversion on x to yield pretty=>big?
Any suggestions?

Use parse, but don't evaluate the results:
x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1

With the stringi package:
> x <- 'pretty\\u003D\\u003Ebig'
> stringi::stri_unescape_unicode(x)
[1] "pretty=>big"

Although I have accepted Hong Ooi's answer, I can't help thinking that parse and eval is a heavyweight solution. Also, as pointed out, it is not secure, although for my application I can be confident that I will not get dangerous quotes.
So, I have devised an alternative, somewhat brutal, approach:
udecode <- function(string){
  uconv <- function(chars) intToUtf8(strtoi(chars, 16L))
  ufilter <- function(string) {
    if (substr(string, 1, 1)=="|") uconv(substr(string, 2, 5)) else string
  }
  string <- gsub("\\\\u([[:xdigit:]]{4})", ",|\\1,", string, perl=TRUE)
  strings <- unlist(strsplit(string, ","))
  string <- paste(sapply(strings, ufilter), collapse='')
  return(string)
}
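For reference, running it on the example string from the question gives:
udecode('pretty\\u003D\\u003Ebig')
# [1] "pretty=>big"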
Any simplifications welcomed!

A use for eval(parse)!
eval(parse(text=paste0("'", x, "'")))
This has its own problems of course, such as having to manually escape any quote marks within the string. But it should work for any valid Unicode sequences that may appear.
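For instance, one way to guard against embedded single quotes before parsing (a minimal sketch, assuming quotes are the only problem characters):
x_esc <- gsub("'", "\\\\'", x)            # turn ' into \' so the quoted literal stays valid
eval(parse(text = paste0("'", x_esc, "'")))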

I sympathise; I have struggled with R and Unicode text in the past, and not always successfully. If your data is in x then first try a global replace, something like this:
x <- gsub("\u003D", "=>", x)
I sometimes use a construction like
lapply(x, utf8ToInt)
to see where the high code points are, e.g. anything over 150. This helps me locate problems caused by non-breaking spaces, for example, which seem to pop up every now and again.
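For example, a non-breaking space (U+00A0) stands out as 160 (an illustrative string, not from the original answer):
utf8ToInt("price\u00a0100")
# [1] 112 114 105  99 101 160  49  48  48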

> iconv('pretty\u003D\u003Ebig', "UTF-8", "ASCII")
[1] "pretty=>big"
but you appear to have an extra escape

The trick here is that '\\u003D' is actually 6 characters while you want '\u003D' which is only one character. The further trick is that to match those backslashes you need to use doubly escaped backslashes in the pattern:
gsub("\\\\u003D\\\\u003E", "\u003D\u003E", x)
#[1] "pretty=>big"
To replace multiple characters with one character you need to target the entire pattern. You cannot simply delete a backslash. (Since you have indicated this is a more general problem, I think the answer might lie in modifications to your as yet undescribed method for downloading this text.)
When I load your functions and the dependencies, this code works:
> freq <- ngram(c('pretty\u003D\u003Ebig'), year_start = 1950)
>
> str(freq)
'data.frame':   59 obs. of  4 variables:
 $ Year     : num  1950 1951 1952 1953 1954 ...
 $ Phrase   : Factor w/ 1 level "pretty=>big": 1 1 1 1 1 1 1 1 1 1 ...
 $ Frequency: num  1.52e-10 6.03e-10 5.98e-10 8.27e-10 8.13e-10 ...
 $ Corpus   : Factor w/ 1 level "eng_2012": 1 1 1 1 1 1 1 1 1 1 ...
(So I guess I am still not clear on the use case.)

Related

Creating functions with eval(parse()) containing numeric vectors

I have several functions as strings which contain a lot of numeric vectors in the form of
c(1,2,3), each with three fixed values (3D coordinates). See test_string below as a small example. I can create a working function test_fun using eval and parse, but there is a problem:
I need these vectors to be recognized as one input, i.e. as double[3] and not as language with the parts 'c' (symbol), 1 (double[1]), 2 (double[1]) and 3 (double[1]). Check this code to see what I mean:
test_string <- "function(x) \n c(1,2,3)*x"
test_fun <- eval(parse(text = test_string))
test_fun(2)
#[1] 2 4 6 <- it's working
View(list(test_fun)) # see 'type' column
str(body(test_fun)[[2]])
# language c(1, 2, 3) <- desired output here: num [1:3] 1 2 3
str(body(test_fun)[[2]][[1]])
# symbol c
Is there an easy solution that works on the full string? I would be very happy to learn about this! If necessary I could also change the code in the function that creates these function strings, where the substrings are concatenated with paste("function(x) \n ","c(1,2,3)","*x",sep = "").
Edit: I made a mistake in the 'View' and 'desired output' lines. They are now correct.
I think I found a solution that works for me. If there is a more elegant solution, please let me know!
I go recursively through the function body and evaluate the parts which are numeric vectors a second time (as @Allan Cameron suggested, thanks!). Here is the function:
evalBodyParts <- function(fun_body){
  for (i in 1:length(fun_body)){ #i=2
    if (typeof(fun_body[[i]])=="language" &&
        typeof(fun_body[[i]][[1]])=="symbol" && fun_body[[i]][[1]]=="c"){
      # if first element is symbol 'c' the whole list is only num [1:3] here
      fun_body[[i]] <- eval(fun_body[[i]])
    } else {
      if(typeof(fun_body[[i]])=="language"){
        fun_body[[i]] <- evalBodyParts(fun_body=fun_body[[i]])
      }
    }
  }
  return(fun_body)
}
Here is a quick example that is a bit more complex than the one in the main question above.
Before:
test_string <- paste("function(x) \n ","c(1,2,3)","*x","+c(7,8,9)",sep = "")
test_fun <- eval(parse(text = test_string))
test_fun(2) # it's working
# [1] 9 12 15
str(body(test_fun)[[2]][[2]])
# language c(1, 2, 3)
str(body(test_fun)[[3]])
# language c(7, 8, 9)
After:
body(test_fun) <- evalBodyParts(fun_body=body(test_fun))
test_fun(2) # it is still working
# [1] 9 12 15
str(body(test_fun)[[2]][[2]])
# num [1:3] 1 2 3
str(body(test_fun)[[3]])
# num [1:3] 7 8 9

Delete duplicate elements in String in R

I've got some problems deleting duplicate elements in a string.
My data look similar to this:
idvisit path
1 1,16,23,59
2 2,14,14,19
3 5,19,23,19
4 10,10
5 23,23,27,29,23
I have a column containing a unique ID and a column containing a path for web page navigation.
The path column contains cases where pages were simply reloaded and therefore tracked twice or even more often.
The pages are separated by commas and saved as factors.
My problem is that I don't want the same page repeated consecutively, so the data should look like this:
idvisit path
1 1,16,23,59
2 2,14,19
3 5,19,23,19
4 10
5 23,27,29,23
The repeated pages next to each other should be removed. I know how to delete a specific duplicated number using regular expressions, but I have about 20,000 different pages and can't do this for all of them.
Does anyone have a solution or a hint for my problem?
Thanks
Sebastian
We can use the tidyverse. Use separate_rows to split the 'path' variable on the delimiter (,) into a long format, then, grouped by 'idvisit', paste the run-length-encoded values back together:
library(tidyverse)
separate_rows(df1, path) %>%
  group_by(idvisit) %>%
  summarise(path = paste(rle(path)$values, collapse=","))
# A tibble: 5 × 2
# idvisit path
# <int> <chr>
#1 1 1,16,23,59
#2 2 2,14,19
#3 3 5,19,23,19
#4 4 10
#5 5 23,27,29,23
Or a base R option is
df1$path <- sapply(strsplit(df1$path, ","), function(x) paste(rle(x)$values, collapse=","))
NOTE: If the 'path' column is factor class, convert to character before passing as argument to strsplit i.e. strsplit(as.character(df1$path), ",")
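A quick round trip with the example data (a sketch, assuming it sits in a data frame called df1 with character paths):
df1 <- data.frame(idvisit = 1:5,
                  path = c("1,16,23,59", "2,14,14,19", "5,19,23,19",
                           "10,10", "23,23,27,29,23"),
                  stringsAsFactors = FALSE)
sapply(strsplit(df1$path, ","), function(x) paste(rle(x)$values, collapse=","))
# [1] "1,16,23,59"  "2,14,19"     "5,19,23,19"  "10"          "23,27,29,23"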
Using the stringr package, with the function str_replace_all, I think you can get what you want with the following regular expression: ([0-9]+),\\1, replacing it with \\1 (we need to escape the \ special character):
library(stringr)
> str_replace_all("5,19,23,19", "([0-9]+),\\1", "\\1")
[1] "5,19,23,19"
> str_replace_all("10,10", "([0-9]+),\\1", "\\1")
[1] "10"
> str_replace_all("2,14,14,19", "([0-9]+),\\1", "\\1")
[1] "2,14,19"
You can use it on a vector: x <- c("5,19,23,19", "10,10", "2,14,14,19"), then:
str_replace_all(x, "([0-9]+),\\1", "\\1")
[1] "5,19,23,19" "10" "2,14,19"
or using sapply:
result <- sapply(x, function(x) str_replace_all(x, "([0-9]+),\\1", "\\1"))
Then:
> result
5,19,23,19 10,10 2,14,14,19
"5,19,23,19" "10" "2,14,19"
Notes:
The first line is the attribute information:
> str(result)
Named chr [1:3] "5,19,23,19" "10" "2,14,19"
- attr(*, "names")= chr [1:3] "5,19,23,19" "10,10" "2,14,14,19"
If you don't want to see them (it does not affect the result), just do:
attributes(result) <- NULL
Then,
> result
[1] "5,19,23,19" "10" "2,14,19"
Explanation about the regular expression used: ([0-9]+),\\1
([0-9]+): Starts with a group 1 delimited by () and finds any digit (at least one)
,: Then comes a punctuation sign: , (we can include spaces here, but the original example only uses this character as delimiter)
\\1: Then comes a string identical to group 1, i.e. the repeated number. If that doesn't happen, the pattern doesn't match.
If the pattern matches, it is replaced with the value of group \\1, i.e. the number as it first appears in the matched pattern.
How to handle more than one duplicated number, for example 2,14,14,14,19?:
Just use this regular expression instead: ([0-9]+)(,\\1)+, which matches when there is at least one repetition of the delimiter followed by the same number. You can try other possibilities using regex101.com (IMHO it is more user friendly than other online regular expression checkers).
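For example, applying the extended pattern to a longer run:
> str_replace_all("2,14,14,14,19", "([0-9]+)(,\\1)+", "\\1")
[1] "2,14,19"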
I hope this works for you; it is a flexible solution, and you just need to adapt the pattern to your needs.

plot function type

For digits I have done this:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use [:punct:] to detect punctuation. This detects
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either via gregexpr
x = c("we are friends!, Good Friends!!")
gregexpr("[[:punct:]]", x)
R> gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice the , is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often they appear:
counts = stringi::stri_count_fixed(x, punct)
Decorate the vector
setNames(counts, punct)
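For instance, with just a few of those marks and the x defined above (a quick illustration, not from the original answer):
marks <- c("!", ",", ".")
setNames(stringi::stri_count_fixed(x, marks), marks)
# ! , . 
# 3 1 0 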
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
https://en.wikipedia.org/wiki/Regular_expression
might help too.

R Error in x$ed : $ operator is invalid for atomic vectors

Here is my code:
x<-c(1,2)
x
names(x)<- c("bob","ed")
x$ed
Why do I get the following error?
Error in x$ed : $ operator is invalid for atomic vectors
From the help file about $ (See ?"$") you can read:
$ is only valid for recursive objects, and is only discussed in the section below on recursive objects.
Now, let's check whether x is recursive
> is.recursive(x)
[1] FALSE
A recursive object has a list-like structure. A vector is not recursive; it is an atomic object instead. Let's check:
> is.atomic(x)
[1] TRUE
Therefore you get an error when applying $ to a vector (non-recursive object), use [ instead:
> x["ed"]
ed
2
You can also use getElement
> getElement(x, "ed")
[1] 2
The reason you are getting this error is that you have a vector.
If you want to use the $ operator, you simply need to convert it to a data.frame. But since you only have one row in this particular case, you would also need to transpose it; otherwise bob and ed will become your row names instead of your column names, which is what I think you want.
x <- c(1, 2)
x
names(x) <- c("bob", "ed")
x <- as.data.frame(t(x))
x$ed
[1] 2
Because $ does not work on atomic vectors. Use [ or [[ instead. From the help file for $:
The default methods work somewhat differently for atomic vectors, matrices/arrays and for recursive (list-like, see is.recursive) objects. $ is only valid for recursive objects, and is only discussed in the section below on recursive objects.
x[["ed"]] will work.
Here x is a vector.
You need to convert it into a data frame to use the $ operator.
x <- as.data.frame(x)
will work for you.
x<-c(1,2)
names(x)<- c("bob","ed")
x <- as.data.frame(x)
will give you output of x as:
    x
bob 1
ed  2
And, will give you output of x$ed as:
NULL
If you want bob and ed as column names then you need to transpose the dataframe like x <- as.data.frame(t(x))
So your code becomes
x <- c(1,2)
x
names(x) <- c("bob","ed")
x <- as.data.frame(t(x))
x$ed
Now the output of x$ed is:
[1] 2
You can get this error even when everything in your code looks correct, because of a conflict caused by one of the packages currently loaded in your R environment.
So, to solve this issue, detach all the packages that are not needed from the R environment. For example, when I had the same issue, I did the following:
detach(package:neuralnet)
Bottom line: detach all the libraries no longer needed for execution, and the problem will be solved.
This solution worked for me:
data <- transform(data, ColonName = as.integer(ColonName))
Recursive collections are accessible with $.
Atomic collections are not; rather, [ or [[ ]] is used.
Browse[1]> is.atomic(list())
[1] FALSE
Browse[1]> is.atomic(data.frame())
[1] FALSE
Browse[1]> is.atomic(class(list(foo="bar")))
[1] TRUE
Browse[1]> is.atomic(c(" lang "))
[1] TRUE
R can be funny sometimes
a = list(1,2,3)
b = data.frame(a)
d = rbind("?",c(b))
e = exp(1)
f = list(d)
print(data.frame(c(list(f,e))))
  X1 X2 X3 X2.71828182845905
1  ?  ?  ?          2.718282
2  1  2  3          2.718282

format numeric without leading zero

What's the best way to format a numeric so that it does NOT show the leading zero? For example:
test = .006
sprintf/format/formatC( ??? ) # should result in ".006"
I believe I answered this once before but can't find it. You cannot tell sprintf() et al. about a format that drops the leading zero ... so you have to do it yourself, e.g. via substring():
R> val <- 0.006
R> aa <- substring(sprintf("%4.3f", val), 2)
R> aa
[1] ".006"
R>
f <- function(x) gsub("^(\\s*[+|-]?)0\\.", "\\1.", as.character(x))
f(0.006)
# ".006"
f(-0.006)
# "-.006"
f("+0.006")
# "+.006"
f(" 0.006")
# " .006"
f(10.05)
# "10.05"
You can always fix it up yourself with regular expression search-and-replace:
library(stringr)
test = .006
str_replace(as.character(test), "^0\\.", ".")
Not the most elegant answer, but it works. Substitute whatever string conversion you like for as.character, such as sprintf with your preferred floating point format.
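For example, pairing it with sprintf for a fixed number of decimal places (three here, purely as an illustration):
str_replace(sprintf("%.3f", test), "^0\\.", ".")
# [1] ".006"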
