I have a data frame in R where most columns are numeric, but there is one character column. For each column except the character column, I want to find the values that exceed a threshold and obtain the corresponding values in the character column.
I'm unable to find a built-in dataset that contains the pattern of data I want, so a dput of my data can be accessed here.
When I use subsetting, I get the output I'm expecting:
> df[abs(df$PA3) > 0.32,1]
[1] "SSI_01" "SSI_02" "SSI_04" "SSI_05" "SSI_06" "SSI_07" "SSI_08" "SSI_09"
When I try to iterate over the columns of the data frame using apply, I get a recursion error:
> apply(df[2:10], 2, function(x) df[abs(df[[x]])>0.32, 1])
Error in .subset2(x, i, exact = exact) :
recursive indexing failed at level 2
Any suggestions where I'm going wrong?
The reason your solution didn't work is that the x being passed to your user-defined function is actually a column of df (a numeric vector of values, not a column name or index), so df[[x]] tries to index df with those values. Therefore, you could get your solution working with a small modification (replacing df[[x]] with x):
apply(df[2:10], 2, function(x) df[abs(x)>0.32, 1])
You could use the ... argument to apply to pass an extra argument. In this case, you would want to pass the first column:
apply(df[2:10], 2, function(x, y) y[abs(x) > 0.32], y=df[,1])
Yet another variation:
apply(abs(df[-1]) > .32, 2, subset, x=df[[1]])
The cute trick here is to "curry" subset by specifying the x parameter. I was hoping I could do it with [ but that doesn't deal with named parameters in the typical way because it is a primitive function :..(
A quick and non-sophisticated solution might be:
sapply(2:10, function(x) df[abs(df[,x])>0.32, 1])
Try:
lapply(df[,2:10],function(x) df[abs(x)>0.32, 1])
Or using apply:
apply(df[2:10], 2, function(x) df[abs(x)>0.32, 1])
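To make the difference concrete, here is a minimal sketch on a made-up data frame (the original dput is not available here, so the column names item, PA1, PA2 and all values are assumptions for illustration only). It shows that the function receives the column's values, and that lapply returns one character vector per column:

```r
# Hypothetical toy data; the names and values are invented for illustration.
df <- data.frame(
  item = c("SSI_01", "SSI_02", "SSI_03"),
  PA1  = c(0.50, 0.10, -0.40),
  PA2  = c(0.20, -0.60, 0.05),
  stringsAsFactors = FALSE
)

# Indexing with a whole column of values is what the failing call attempted;
# it is expected to raise an error, so it is wrapped in try().
failed <- inherits(try(df[[df$PA1]], silent = TRUE), "try-error")

# Working version: x already holds the column values.
res <- lapply(df[-1], function(x) df$item[abs(x) > 0.32])
res$PA1  # "SSI_01" "SSI_03"
res$PA2  # "SSI_02"
```

lapply is used in this sketch because the number of matches can differ per column, and a list keeps that shape predictable, whereas apply or sapply may return a matrix when all columns happen to have the same number of matches.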
I would like to use paste0 to create a long string containing the conditions for the subset function.
I tried the following:
#rm(list=ls())
set.seed(1)
id<-1:20
ids<-sample(id, 3)
d <- subset(id, noquote(paste0("id==",ids,collapse="|")))
I get the error:
Error in subset.default(id, noquote(paste0("id==", ids, collapse = "|"))) :
'subset' must be logical
I tried the same without noquote. Interestingly, when I run
noquote(paste0("id==",ids,collapse="|"))
I get [1] id==4|id==7|id==1. When I then paste this by hand into the subset call
d2<-subset(id,id==4|id==7|id==1)
everything runs fine. But why does subset(id, noquote(paste0("id==",ids,collapse="|"))) not work, although it seems to be the same? Thanks a lot for your help!
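A possible explanation, offered as a sketch rather than a definitive answer: paste0 returns a character string, and noquote only changes how that string is printed, so subset still receives text instead of a logical vector. Two common workarounds are shown below; the variable names cond, d_parsed and d_in are introduced here for illustration.

```r
set.seed(1)
id  <- 1:20
ids <- sample(id, 3)

# The condition is still a character string, with or without noquote().
cond <- paste0("id==", ids, collapse = "|")
is.character(cond)           # TRUE
is.character(noquote(cond))  # TRUE

# Option 1: turn the text into an expression and evaluate it.
d_parsed <- subset(id, eval(parse(text = cond)))

# Option 2 (usually simpler): skip the string and test membership directly.
d_in <- id[id %in% ids]
```

Both options should select the same elements; the %in% form avoids building and parsing code as text.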
OK, I have a little problem which I believe I can solve with which and grepl (alternatives are welcome), but I am getting lost:
my_query<- c('g1', 'g2', 'g3')
my_data<- c('string2','string4','string5','string6')
I would like to return the index of each element of my_query that matches something in my_data. In the example above, only 'g2' occurs in my_data, so the result would be 2.
It seems to me that there is no easy way to do this without a loop. For each element in my_query, we can use either of the functions below to get TRUE or FALSE:
f1 <- function (pattern, x) length(grep(pattern, x)) > 0L
f2 <- function (pattern, x) any(grepl(pattern, x))
For example,
f1(my_query[1], my_data)
# [1] FALSE
f2(my_query[1], my_data)
# [1] FALSE
Then we use an *apply loop to apply, say, f2 to all elements of my_query:
which(unlist(lapply(my_query, f2, x = my_data)))
# [1] 2
Thanks, that seems to work. To be honest, I preferred your original one-line version. I am not sure why you edited it to create another function that is then called with *apply. Is there any advantage compared to which(lengths(lapply(my_query, grep, my_data)) > 0L)?
Well, I am not entirely sure. When I read ?lengths:
One advantage of ‘lengths(x)’ is its use as a more efficient
version of ‘sapply(x, length)’ and similar ‘*apply’ calls to
‘length’.
I don't know how much more efficient lengths is compared with sapply. Anyway, if it is still a loop, then my original suggestion which(lengths(lapply(my_query, grep, my_data)) > 0L) performs 2 loops. My edit essentially combines the two loops into one, hopefully for a small speed-up (possibly a tiny one).
You can still arrange my new edit into a single line:
which(unlist(lapply(my_query, function (pattern, x) any(grepl(pattern, x)), x = my_data)))
or
which(unlist(lapply(my_query, function (pattern) any(grepl(pattern, my_data)))))
Expanding on a comment posted initially by @Gregor, you could try:
which(colSums(sapply(my_query, grepl, my_data)) > 0)
#g2
# 2
The function colSums is vectorized and represents no problem in terms of performance. The sapply() loop seems inevitable here, since we need to check each element within the query vector. The result of the loop is a logical matrix, with each column representing an element of my_query and each row an element of my_data. By wrapping this matrix into which(colSums(..) > 0) we obtain the index numbers of all columns that contain at least one TRUE, i.e., a match with an entry of my_data.
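The intermediate logical matrix described above can be inspected directly. The sketch below reuses the question's data; the names hits, idx_matrix and idx_loop are introduced here for illustration, and the vapply variant is just one way to write the per-element loop with a fixed return type.

```r
my_query <- c("g1", "g2", "g3")
my_data  <- c("string2", "string4", "string5", "string6")

# One column per query element, one row per element of my_data.
hits <- sapply(my_query, grepl, my_data)
dim(hits)  # 4 rows, 3 columns

# Columns with at least one TRUE are the matching query elements.
idx_matrix <- which(colSums(hits) > 0)

# Equivalent per-element loop; vapply guarantees a logical result.
idx_loop <- which(vapply(my_query,
                         function(p) any(grepl(p, my_data)),
                         logical(1)))
```

Both results are named integer vectors (g2 = 2). If the query strings should be matched literally rather than as regular expressions, passing fixed = TRUE to grepl is worth considering.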
When I perform:
a <- seq(1,1.5,0.1)
b <- c(1,1.1,1.4,1.5)
x <- rep(c(a,b),times=c(2,1))
Error in rep(c(a, b), c(2, 1)) : invalid 'times' argument
Why?
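When we concatenate (c) two vectors, the result is a single vector, here of length 10, so a times argument of length 2 is invalid: times must have length 1 or the same length as the vector being repeated. If the idea is to replicate 'a' twice and 'b' once, we place them in a list and use rep. The output will be a list, which can be unlisted to get a vector.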
unlist(rep(list(a,b), c(2,1)))
The marked answer is already perfect. Here is an alternative using mapply:
unlist(mapply(function(x,n)rep(x,n),list(a,b),c(2,1)))
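As a quick check of the reasoning above, the following sketch confirms the lengths involved and that the list-based call gives the same result as writing the repetition out by hand (x_list and x_manual are names introduced here for illustration):

```r
a <- seq(1, 1.5, 0.1)
b <- c(1, 1.1, 1.4, 1.5)

# c(a, b) is one vector of length 10, so times = c(2, 1) does not fit.
length(c(a, b))  # 10

# Repeat the list elements instead: a twice, b once, then flatten.
x_list <- unlist(rep(list(a, b), times = c(2, 1)))

# The same thing written out by hand.
x_manual <- c(rep(a, 2), b)

identical(x_list, x_manual)  # TRUE
```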
It may be a silly question, but it has bothered me for quite a while. I've seen people use single quotation marks to surround the function name when they are defining a function. I keep wondering what the benefit of doing so is. Below is a naive example:
'row.mean' <- function(mat){
return(apply(mat, 1, mean))
}
Thanks in advance!
Going off Richard's assumption, the back ticks allow you to use symbols in names which are normally not allowed. See:
`add+5` <- function(x) {return(x+5)}
defines a function, but
add+5 <- function(x) {return(x+5)}
returns
Error in add + 5 <- function(x) { : object 'add' not found
To refer to the function, you need to explicitly use the back ticks as well.
> `add+5`(3)
[1] 8
To see the code for this function, simply type its name (in back ticks) without parentheses:
> `add+5`
function(x) {return(x+5)}
See also this comment which deals with the difference between the backtick and quotes in name assignment: https://stat.ethz.ch/pipermail/r-help/2006-December/121608.html
Note, the usage of back ticks is much more general. For example, in a data frame you can have columns named with integers (maybe from using reshape::cast on integer factors).
For example:
test = data.frame(a = "a", b = "b")
names(test) <- c(1,2)
and to retrieve these columns you can use the backtick in conjunction with the $ operator, e.g.:
> test$1
Error: unexpected numeric constant in "test$1"
but
> test$`1`
[1] a
Levels: a
Funnily, back ticks don't help when assigning the data frame column names at creation; the following doesn't give you columns named 1 and 2:
test = data.frame(`1` = "a", `2` = "b")
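As far as I can tell, the call above does run, but data.frame() passes the names through make.names() by default, so they come out altered. A small sketch of that behaviour, assuming the check.names argument is the relevant switch (test_default and test_raw are names introduced here):

```r
# Default: names are made syntactically valid, so 1 and 2 become X1 and X2.
test_default <- data.frame(`1` = "a", `2` = "b")
names(test_default)  # "X1" "X2"

# With check.names = FALSE the names are kept exactly as given.
test_raw <- data.frame(`1` = "a", `2` = "b", check.names = FALSE)
names(test_raw)      # "1" "2"

# Back ticks are then needed again to reach the column with $.
first_col <- test_raw$`1`
```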
And responding to statechular's comments, here are two more use cases.
Infix functions
Using the % symbol we can naively define the dot product between vectors x and y:
`%.%` <- function(x,y){
sum(x * y)
}
which gives
> c(1,2) %.% c(1,2)
[1] 5
for more, see: http://dennisphdblog.wordpress.com/2010/09/16/infix-functions-in-r/
Replacement functions
Here is a great answer demonstrating what these are: What are Replacement Functions in R?
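Since the linked answer is not reproduced here, a minimal sketch of a replacement function follows. The name second<- is a made-up example; the point is only that the function name ends in <- and therefore has to be quoted with back ticks when it is defined.

```r
# A replacement function: the name ends in "<-" and the last
# argument must be called value.
`second<-` <- function(x, value) {
  x[2] <- value
  x
}

v <- 1:5
second(v) <- 10L  # R rewrites this as v <- `second<-`(v, value = 10L)
v                 # 1 10 3 4 5
```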
I'd like to know the reason why the following does not work on the matrix structure I have posted here (I've used the dput command).
When I try running:
apply(mymatrix, 2, sum)
I get:
Error in FUN(newX[, i], ...) : invalid 'type' (list) of argument
However, when I check to make sure it's a matrix I get the following:
is.matrix(mymatrix)
[1] TRUE
I realize that I can get around this problem by unlisting the data into a temp variable and then just recreating the matrix, but I'm curious why this is happening.
?is.matrix says:
'is.matrix' returns 'TRUE' if 'x' is a vector and has a '"dim"'
attribute of length 2) and 'FALSE' otherwise.
Your object is a list with a dim attribute. A list is a type of vector (even though it is not an atomic type, which is what most people think of as vectors), so is.matrix returns TRUE. For example:
> l <- as.list(1:10)
> dim(l) <- c(10,1)
> is.matrix(l)
[1] TRUE
To convert mymatrix to an atomic matrix, you need to do something like this:
mymatrix2 <- unlist(mymatrix, use.names=FALSE)
dim(mymatrix2) <- dim(mymatrix)
# now your apply call will work
apply(mymatrix2, 2, sum)
# but you should really use (if you're really just summing columns)
colSums(mymatrix2)
The elements of your matrix are not numeric; they are lists. To see this, you can do:
apply(m,2, class) # here m is your matrix
So if you want the column sum you have to 'coerce' them to be numeric and then apply colSums which is a shortcut for apply(x, 2, sum)
colSums(apply(m, 2, as.numeric)) # this will give you the sum you want.
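Because the original dput is not available here, the sketch below builds a small list-matrix from scratch to reproduce the situation. The object names m and m_num and the values 1 to 6 are assumptions for illustration.

```r
# A list with a dim attribute: is.matrix() is TRUE, but it is not atomic.
m <- matrix(as.list(1:6), nrow = 2)
is.matrix(m)  # TRUE
is.list(m)    # TRUE
typeof(m)     # "list"

# sum() cannot handle list columns, so apply(m, 2, sum) fails.
sum_fails <- inherits(try(apply(m, 2, sum), silent = TRUE), "try-error")

# Flatten to an atomic vector and restore the dimensions.
m_num <- matrix(unlist(m, use.names = FALSE), nrow = nrow(m))
colSums(m_num)  # 3 7 11
```

Checking typeof() before calling apply is a cheap way to spot this kind of object, since is.matrix() alone does not distinguish list-matrices from atomic ones.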