I have several functions stored as strings which contain many numeric vectors of the form c(1,2,3), with three fixed values each (3D coordinates). See test_string below for a small example. I can create a working function test_fun using eval and parse, but there is a problem:
I need these vectors to be recognized as a single object, i.e. as double[3], and not as a language object with the parts 'c' (symbol), 1 (double[1]), 2 (double[1]) and 3 (double[1]). The following code shows what I mean:
test_string <- "function(x) \n c(1,2,3)*x"
test_fun <- eval(parse(text = test_string))
test_fun(2)
#[1] 2 4 6 <- it's working
View(list(test_fun)) # see 'type' column
str(body(test_fun)[[2]])
# language c(1, 2, 3) <- desired output here: num [1:3] 1 2 3
str(body(test_fun)[[2]][[1]])
# symbol c
Is there an easy solution that works on the full string? I would be very happy to learn about it! If necessary, I could also change the code in the function which creates these function strings, where the substrings are concatenated with paste("function(x) \n ","c(1,2,3)","*x",sep = "").
Edit: I made a mistake in the 'View' and 'desired output' lines. They are now correct.
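For illustration, if I change the construction code, one string-free way to get the desired double[3] directly would be to splice the evaluated vector into the body with bquote() (just a sketch, not my current code):
# sketch: splice the evaluated vector into the function body,
# so it is stored as double[3] instead of a call to c()
vec <- c(1, 2, 3)
test_fun <- function(x) NULL
body(test_fun) <- bquote(.(vec) * x)
test_fun(2)
# [1] 2 4 6
str(body(test_fun)[[2]])
# num [1:3] 1 2 3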
I think I found a solution that works for me. If there is a more elegant solution, please let me know!
I go recursively through the function body and evaluate the parts which are numeric vectors a second time (as @Allan Cameron suggested, thanks!). Here is the function:
evalBodyParts <- function(fun_body){
  for (i in seq_along(fun_body)){
    if (typeof(fun_body[[i]]) == "language" &&
        typeof(fun_body[[i]][[1]]) == "symbol" &&
        fun_body[[i]][[1]] == "c"){
      # if the first element is the symbol 'c', the whole call is just
      # a numeric vector, so evaluate it in place
      fun_body[[i]] <- eval(fun_body[[i]])
    } else if (typeof(fun_body[[i]]) == "language"){
      # otherwise recurse into nested calls
      fun_body[[i]] <- evalBodyParts(fun_body = fun_body[[i]])
    }
  }
  return(fun_body)
}
Here is a quick example that is a bit more complex than the one in the question above.
Before:
test_string <- paste("function(x) \n ","c(1,2,3)","*x","+c(7,8,9)",sep = "")
test_fun <- eval(parse(text = test_string))
test_fun(2) # it's working
# [1] 9 12 15
str(body(test_fun)[[2]][[2]])
# language c(1, 2, 3)
str(body(test_fun)[[3]])
# language c(7, 8, 9)
After:
body(test_fun) <- evalBodyParts(fun_body=body(test_fun))
test_fun(2) # it is still working
# [1] 9 12 15
str(body(test_fun)[[2]][[2]])
# num [1:3] 1 2 3
str(body(test_fun)[[3]])
# num [1:3] 7 8 9
I'm in a bit of a pickle. I have a bunch (thousands) of .csv files where a few lines contain a vector of numbers instead of a single value. I need to read these into a tibble or data frame with the vector as a character string for further processing. For example:
"col1","col2","col3"
"a",1,integer(0)
"c",c(3,4),5
"e",6,7
should end up as
col1 col2 col3
<chr> <chr> <chr>
1 a 1 integer(0)
2 c c(3,4) 5
3 e 6 7
The vector is only ever in "col2" and contains integers. It usually has 2 entries, but it could be more. In reality, there are two columns in the middle that could contain multiple entries, but I know the positions of both.
I can't work out how to read these into R successfully. read.csv and read_csv can't seem to handle them. Is there a way I could read the files in line by line (they're thankfully not long) and eval() each line, maybe, before splitting on commas? I thought about replacing c( with "c( and ) with )" in bash before reading the files in (and I would have to do the same for integer().
Alternatively, I've thought of splitting the .csvs in bash into files with "normal" lines and files with the vectors (grep for c(), but I'm not sure how to then nest columns 2:(length-1) back into a vector.
However, I'd definitely prefer a method that is self-contained in R. Any ideas appreciated!
I typed your example into a csv file, then brought it in with read.csv, specifying that column 2 is character. Using gsub I remove the letter c and the open and close parentheses. Then I loop through column 2 to find entries that contain a comma and convert those to a list of integers.
data <- read.csv("SO question.csv",
                 colClasses = c("character", "character", "integer"))
# strip the c, ( and ) characters from col2
data$col2 <- gsub("(c|\\(|\\))", "", data$col2)
for (i in 1:nrow(data)) {
  if (grepl(",", data$col2[i])) {
    # entries that contained a comma were vectors:
    # split on the comma and store as a list of integers
    temp <- unlist(strsplit(data$col2[i], ","))
    data$col2[i] <- list(as.integer(temp))
  }
}
data
It looks like both col2 and col3 can have complex contents. Assuming that the possible complex contents are c(...) and integer(0), we enclose both in double quotes, read the result in as character, and convert from character to list in the final line. (We have used the r'{...}' literal string constant notation introduced in R 4.0 to avoid double backslashes. Modify as needed if you are using an earlier version of R.)
library(dplyr)
DF <- "myfile.csv" %>%
readLines %>%
gsub(r'{(c\(.*?\)|integer\(0\))}', r'{"\1"}', .) %>%
read.csv(text = .) %>%
mutate(across(2:3, ~ lapply(., function(x) eval(parse(text = x)))))
giving:
> str(DF)
'data.frame': 3 obs. of 3 variables:
$ col1: chr "a" "c" "e"
$ col2:List of 3
..$ : num 1
..$ : num 3 4
..$ : num 6
$ col3:List of 3
..$ : int
..$ : num 5
..$ : num 7
Note
We assume the file is as shown, reproducibly, below:
Lines <- "\"col1\",\"col2\",\"col3\"\n\"a\",1,integer(0)\n\"c\",c(3,4),5\n\"e\",6,7\n"
cat(Lines, file = "myfile.csv")
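As an aside, on R versions before 4.0 (without raw strings) the gsub step above can be written with doubled backslashes; a quick sketch of the equivalent:
# pre-R 4.0 equivalent of the raw-string gsub step above
lines <- readLines("myfile.csv")
lines <- gsub("(c\\(.*?\\)|integer\\(0\\))", "\"\\1\"", lines)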
Here is my code:
x<-c(1,2)
x
names(x)<- c("bob","ed")
x$ed
Why do I get the following error?
Error in x$ed : $ operator is invalid for atomic vectors
From the help file about $ (See ?"$") you can read:
$ is only valid for recursive objects, and is only discussed in the section below on recursive objects.
Now, let's check whether x is recursive
> is.recursive(x)
[1] FALSE
A recursive object has a list-like structure. A vector is not recursive; it is an atomic object instead. Let's check:
> is.atomic(x)
[1] TRUE
Therefore you get an error when applying $ to a vector (a non-recursive object); use [ instead:
> x["ed"]
ed
2
You can also use getElement
> getElement(x, "ed")
[1] 2
The reason you are getting this error is that you have a vector.
If you want to use the $ operator, you simply need to convert it to a data.frame. But since you only have one observation here, you would also need to transpose it; otherwise bob and ed will become your row names instead of your column names, which is what I think you want.
x <- c(1, 2)
x
names(x) <- c("bob", "ed")
x <- as.data.frame(t(x))
x$ed
[1] 2
Because $ does not work on atomic vectors. Use [ or [[ instead. From the help file for $:
The default methods work somewhat differently for atomic vectors, matrices/arrays and for recursive (list-like, see is.recursive) objects. $ is only valid for recursive objects, and is only discussed in the section below on recursive objects.
x[["ed"]] will work.
Here x is a vector.
You need to convert it into a data frame in order to use the $ operator.
x <- as.data.frame(x)
will work for you.
x<-c(1,2)
names(x)<- c("bob","ed")
x <- as.data.frame(x)
will give you the following output for x:
    x
bob 1
ed  2
and the output of x$ed will be:
NULL
(the single column is named x, so there is no ed column).
If you want bob and ed as column names then you need to transpose the dataframe like x <- as.data.frame(t(x))
So your code becomes:
x <- c(1,2)
names(x) <- c("bob","ed")
x <- as.data.frame(t(x))
x$ed
Now the output of x$ed is:
[1] 2
You can get this error, even when everything looks correct, because of a conflict caused by one of the packages currently loaded in your R environment.
To solve this issue, detach the packages that are not needed from the R environment. For example, when I had the same issue, I did the following:
detach(package:neuralnet)
Bottom line: detach all the libraries no longer needed for execution, and the problem will be solved.
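If many packages are attached, here is a sketch of detaching everything except the standard base set (the list of base packages here is an assumption; adjust as needed):
# sketch: detach all attached packages except the standard base set
base_pkgs <- paste0("package:", c("base", "stats", "graphics", "grDevices",
                                  "utils", "datasets", "methods"))
for (p in setdiff(grep("^package:", search(), value = TRUE), base_pkgs)) {
  detach(p, character.only = TRUE)
}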
This solution worked for me:
data <- transform(data, ColonName = as.integer(ColonName))
Recursive collections are accessible by $.
Atomic collections are not; rather, [ or [[ ]] is used:
> is.atomic(list())
[1] FALSE
> is.atomic(data.frame())
[1] FALSE
> is.atomic(class(list(foo="bar")))
[1] TRUE
> is.atomic(c(" lang "))
[1] TRUE
R can be funny sometimes
a = list(1,2,3)
b = data.frame(a)
d = rbind("?",c(b))
e = exp(1)
f = list(d)
print(data.frame(c(list(f,e))))
X1 X2 X3 X2.71828182845905
1 ? ? ? 2.718282
2 1 2 3 2.718282
After reading all about iconv and Encoding, I am still confused.
I am scraping the source of a web page, and I have a string that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\u003D\\u003Ebig'). I want to convert this to the ASCII string 'pretty=>big'.
More simply, if I set
x <- 'pretty\\u003D\\u003Ebig'
How do I perform a conversion on x to yield pretty=>big?
Any suggestions?
Use parse, but don't evaluate the results:
x1 <- 'pretty\\u003D\\u003Ebig'
x2 <- parse(text = paste0("'", x1, "'"))
x3 <- x2[[1]]
x3
# [1] "pretty=>big"
is.character(x3)
# [1] TRUE
length(x3)
# [1] 1
With the stringi package:
> x <- 'pretty\\u003D\\u003Ebig'
> stringi::stri_unescape_unicode(x)
[1] "pretty=>big"
Although I have accepted Hong Ooi's answer, I can't help thinking that parse and eval is a heavyweight solution. Also, as pointed out, it is not secure, although for my application I can be confident that I will not get dangerous quotes.
So, I have devised an alternative, somewhat brutal, approach:
udecode <- function(string){
  # convert a 4-digit hex code to the corresponding character
  uconv <- function(chars) intToUtf8(strtoi(chars, 16L))
  # entries marked with a leading "|" hold a hex code to convert
  ufilter <- function(string) {
    if (substr(string, 1, 1) == "|") uconv(substr(string, 2, 5)) else string
  }
  # mark each \uXXXX escape, split on the markers, convert, and reassemble
  string <- gsub("\\\\u([[:xdigit:]]{4})", ",|\\1,", string, perl = TRUE)
  strings <- unlist(strsplit(string, ","))
  string <- paste(sapply(strings, ufilter), collapse = '')
  return(string)
}
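For example, with the string from the question:
udecode('pretty\\u003D\\u003Ebig')
# [1] "pretty=>big"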
Any simplifications welcomed!
A use for eval(parse)!
eval(parse(text=paste0("'", x, "'")))
This has its own problems of course, such as having to manually escape any quote marks within the string. But it should work for any valid Unicode sequences that may appear.
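For instance (a quick sketch), a stray single quote breaks the construction unless it is escaped first:
# sketch: escape single quotes before wrapping the string in quotes
x <- "it's pretty\\u003D\\u003Ebig"
eval(parse(text = paste0("'", gsub("'", "\\\\'", x), "'")))
# [1] "it's pretty=>big"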
I sympathise; I have struggled with R and unicode text in the past, not always successfully. If your data is in x then first try a global replace on the literal escape sequences, something like this:
x <- gsub("\\u003D\\u003E", "=>", x, fixed = TRUE)
I sometimes use a construction like
lapply(x, utf8ToInt)
to see where the high code points are e.g. anything over 150. This helps me locate problems caused by non-breaking spaces, for example, which seem to pop up every now and again.
> iconv('pretty\u003D\u003Ebig', "UTF-8", "ASCII")
[1] "pretty=>big"
but you appear to have an extra escape (a literal backslash), so iconv alone won't handle your string
The trick here is that '\\u003D' is actually 6 characters while you want '\u003D' which is only one character. The further trick is that to match those backslashes you need to use doubly escaped backslashes in the pattern:
gsub("\\\\u003D\\\\u003E", "\u003D\u003E", x)
#[1] "pretty=>big"
To replace multiple characters with one character you need to target the entire pattern. You cannot simply delete a backslash. (Since you have indicated this is a more general problem, I think the answer might lie in modifications to your as yet undescribed method for downloading this text.)
When I load your functions and the dependencies, this code works:
> freq <- ngram(c('pretty\u003D\u003Ebig'), year_start = 1950)
>
> str(freq)
'data.frame': 59 obs. of 4 variables:
$ Year : num 1950 1951 1952 1953 1954 ...
$ Phrase : Factor w/ 1 level "pretty=>big": 1 1 1 1 1 1 1 1 1 1 ...
$ Frequency: num 1.52e-10 6.03e-10 5.98e-10 8.27e-10 8.13e-10 ...
$ Corpus : Factor w/ 1 level "eng_2012": 1 1 1 1 1 1 1 1 1 1 ...
(So I guess I am still not clear on the use case.)
Good afternoon,
Thanks for helping me out with this question.
I have a set of >5000 URLs within a list that I am interested in scraping. I have used lapply and readLines to extract the text from these webpages, using the sample code below:
multipleURL <- c("http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1407&start=1&labeltype=all", "http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1975&start=1&labeltype=all")
multipleText <- lapply(multipleURL, readLines)
Now I would like to query each of these texts for the word "radioactive". I simply want to know whether the term is mentioned in the text, and I have been using the logical grepl command:
radioactive <- grepl("radioactive" , multipleText, ignore.case = TRUE)
When I count the number of items in our list that contain the word "radioactive" it returns a count of 0:
count(radioactive)
x freq
1 FALSE 3
However, a cursory review of the webpages for each of these URLs reveals that the first link (http://dailymed.nlm.nih.gov/dailymed/lookup.cfm?ndc=0002-1200&start=1&labeltype=all) DOES in fact contain the word "radioactive". Our multipleText list even includes the word "radioactive", although our grepl command doesn't seem to pick it up.
Any thoughts on what I am doing wrong would be greatly appreciated.
Many thanks,
Chris
I think you should parse your documents using an HTML parser. Here I am using the XML package. I convert each document to an R list and then apply grep to it.
library(XML)
multipleText <- lapply(multipleURL, function(x) {
  y <- xmlToList(htmlParse(x))
  y.flat <- unlist(y, recursive = TRUE)
  length(grep('radioactive', c(y.flat, names(y.flat))))
})
multipleText
[[1]]
[1] 8
[[2]]
[1] 0
[[3]]
[1] 0
EDIT: to search for multiple words:
## define your words here
WORDS <- c('CLINICAL ', 'solution', 'Action', 'radioactive', 'Effects')
library(XML)
multipleText <- lapply(multipleURL, function(x) {
  y <- xmlToList(htmlParse(x))
  y.flat <- unlist(y, recursive = TRUE)
  # count matches for each word (renamed to w to avoid shadowing y)
  sapply(WORDS, function(w)
    length(grep(w, c(y.flat, names(y.flat)))))
})
do.call(rbind, multipleText)
CLINICAL solution Action radioactive Effects
[1,] 6 10 2 8 2
[2,] 1 3 1 0 3
[3,] 6 22 2 0 6
PS: maybe you should use ignore.case = TRUE for the grep command.
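As a sketch of an alternative without HTML parsing (assuming multipleText already holds the readLines output), apply grepl within each page rather than to the list as a whole:
# sketch: one TRUE/FALSE per page; searching each page's lines directly
# avoids grepl coercing the whole list to deparsed strings
radioactive <- sapply(multipleText,
                      function(page) any(grepl("radioactive", page,
                                               ignore.case = TRUE)))
Calling grepl on a list first coerces each element with as.character, so you end up matching against a deparsed representation of each page rather than its actual lines.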
I have a question about aaply. I want to check which columns are numeric, but the return values of aaply are unexpected. Below is example code. Why do I get "data.frame" for all columns (which explains why is.numeric is FALSE even for columns containing numeric vectors)?
Thanks!
data=data.frame(str=rep("str",3),num=c(1:3))
is.numeric(data[,1])
# FALSE
is.numeric(data[,2])
# TRUE
aaply(data,2,is.numeric)
# FALSE FALSE
aaply(data,2,class)
# "data.frame" "data.frame"
EDIT: In other situations this produces a warning message:
aaply(data,2,mean)
# 1: mean(<data.frame>) is deprecated.
# Use colMeans() or sapply(*, mean) instead.
It is the way aaply works; you can even use identity to see what is passed to each function call: a data.frame representing each column of data:
aaply(data, 2, identity)
# $num
# num
# 1 1
# 2 2
# 3 3
#
# $str
# str
# 1 str
# 2 str
# 3 str
So to use aaply the way you want, you would have to use a function that extracts the first column of each data.frame, something like:
aaply(data, 2, function(df)is.numeric(df[[1]]))
# num str
# TRUE FALSE
but it seems much easier to just do:
sapply(data, is.numeric)
# str num
# FALSE TRUE
The basic reason is that you are providing aaply with an argument of a class it is not designed to work with. The first letter of a plyr function signifies the type of its first argument, in this case "a" for array. It does work as you expect if you offer an array:
> xx <- plyr::aaply(matrix(1:10, 2), 2, class)
> xx
1 2 3 4 5
"integer" "integer" "integer" "integer" "integer"
At least that was my understanding until I read the help page. It says that data frame input should be accepted and that an array should be the output. So you have discovered either an error in the documentation or a bug in the function. Either way, the correct place to take this up is the 'manipulatr' Google newsgroup. There is a fair chance that @hadley will be along to clear things up, since he is a valued contributor here as well.