I'd like to understand what's going on in this piece of R code I was testing. I'd like to replace part of a vector with another vector. The original and replacement values are in a data.frame. I'd like to replace all elements of the vector that match the original column with the corresponding replacement values. I have the answer to the larger question, but I'm unable to understand how it works.
Here's a simple example:
> vecA <- 1:5;
> vecB <- data.frame(orig=c(2,3), repl=c(22,33));
> vecA[vecA %in% vecB$orig] <- vecB$repl #Question-1
> vecA
[1] 1 22 33 4 5
> vecD<-data.frame(orig=c(5,7), repl=c(55,77))
> vecA[vecA %in% vecD$orig] <- vecD$repl #Question-2
Warning message:
In vecA[vecA %in% vecD$orig] <- vecD$repl :
number of items to replace is not a multiple of replacement length
> vecA
[1] 1 22 33 4 55
Here are my questions:
How does the assignment on Line-3 work? The LHS expression is a 2-item vector, whereas the RHS is a 5-element vector.
Why does the assignment on Line-6 give a warning (but still work)?
The First Question
R goes through each element in vecA and checks whether it exists in vecB$orig. The %in% operator returns a logical vector with one value per element of vecA. If you run the command vecA %in% vecB$orig you get the following:
[1] FALSE TRUE TRUE FALSE FALSE
which is telling you that in the vector 1 2 3 4 5 it sees 2 and 3 in vecB$orig.
By subsetting vecA by this command, you are isolating only the TRUE values in vecA, so vecA[vecA %in% vecB$orig] returns:
[1] 2 3
The assignment then writes the values of vecB$repl, in order, into the positions of vecA where vecA %in% vecB$orig is TRUE, which replaces 2 3 in vecA with 22 33.
The Second Question
In this case, the same logic applies for subsetting, but running vecA[vecA %in% vecD$orig] gives you
[1] 5
as 7 does not exist in vecA. You are trying to replace a vector of length 1 with a vector of length 2, which is what triggers the warning. In this case, R uses only the first element of vecD$repl, which happens to be 55, and discards the rest.
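Because the replacement above is positional, it gives the intended result only when the matched elements of vecA happen to line up with the rows of the data frame. A minimal sketch of a key-based alternative, assuming base R's match() and a deliberately reordered table (the names lookup, idx and hit are illustrative, not from the question):

```r
vecA <- 1:5
# Same pairs as vecD, but with the rows swapped to show order independence
lookup <- data.frame(orig = c(7, 5), repl = c(77, 55))

# For each element of vecA, the row of lookup whose orig matches (NA if none)
idx <- match(vecA, lookup$orig)
hit <- !is.na(idx)

# Replace each matched element with the repl value from its own row
vecA[hit] <- lookup$repl[idx[hit]]
vecA
```

Both sides of the assignment have the same length here, so no recycling warning is raised, and 5 is paired with 55 regardless of the row order in the table.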
Related
I have a list, "my_list", below:
$`2015-03-01 00:18:50`
integer(0)
$`2015-03-01 11:19:59`
[1] 4 6
$`2015-03-01 12:18:29`
[1] 12 13
$`2015-03-01 13:19:09`
[1] 1
$`2015-03-01 17:18:44`
integer(0)
$`2015-03-01 22:18:49`
integer(0)
I want to get the element index (not the subelement index) of the values greater than 0 (or where a list subelement is NOT empty). The output expected is a list that looks like:
2,2,3,3,4
I have gotten close with:
indices<-which(lapply(my_list,length)>0)
This piece of code however, only gives me the following and doesn't account for there being more than one subelement within a list element:
2,3,4
Does anyone know how to achieve what I am looking for?
We can use lapply along with a seq_along trick to bring in the indices of each element of the list. Then, for each list element, generate a vector of matching indices. Finally, unlist the entire list to obtain a single vector of matches.
x <- list(a=integer(0),b=c(4,6),c=c(12,13),d=c(1),e=integer(0),f=integer(0))
result <- lapply(seq_along(x), function(i) { rep(i, sum(x[[i]] > 0)) })
unlist(result)
[1] 2 2 3 3 4
You can try this; I hope it is what you expected. It uses lengths() to get the length of each item in the list, then passes each index and its length to rep() to build the final result:
lyst <- list(l1=integer(0), l2= c(1,2), l3=c(3,4), l4=character(0), l5=c(5,6,6))
lyst1 <- lengths(lyst)
unlist(lapply(1:length(lyst1), function(x)rep(x, lyst1[[x]])))
Output:
#> unlist(lapply(1:length(lyst1), function(x)rep(x, lyst1[[x]])))
#[1] 2 2 3 3 5 5 5
Repeat each numeric index by the respective length:
rep(seq_along(x), lengths(x))
#[1] 2 2 3 3 4
Using @Tim's x data.
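The answers above use two slightly different criteria: one counts values greater than 0, the others count every subelement. A small sketch of where they diverge, using a made-up list (x2, by_length and by_value are hypothetical names) that contains a zero and a negative value:

```r
# Hypothetical data: element 2 contains a zero, element 3 a negative value
x2 <- list(a = integer(0), b = c(4, 0), c = c(-1, 13))

# Criterion 1: repeat each index once per subelement (non-empty test)
by_length <- rep(seq_along(x2), lengths(x2))

# Criterion 2: repeat each index once per subelement that is > 0
by_value <- rep(seq_along(x2),
                vapply(x2, function(v) sum(v > 0), integer(1)))

by_length  # 2 2 3 3
by_value   # 2 3
```

For the data in the question the two criteria agree, since every subelement there is positive.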
I am calculating gradient values by using
DF$gradUx <- sapply(1:nrow(DF), function(i) ((DF$V4[i+1])-DF$V4[i]), simplify = "vector")
but when checking class(DF$gradUx), I still get a list. What I want is a numeric vector. What am I doing wrong?
Browse[1]> head(DF)
V1 V2 V3 V4
1 0 0 -2.913692e-09 2.913685e-09
2 1 0 1.574589e-05 3.443367e-09
3 2 0 2.111406e-05 3.520451e-09
4 3 0 2.496275e-05 3.613013e-09
5 4 0 2.735775e-05 3.720385e-09
6 5 0 2.892444e-05 3.841937e-09
You will only get a numeric vector when all return values are of length 1. More accurately, you will get an array if all return values are the same length. From ?sapply "Details":
Simplification in 'sapply' is only attempted if 'X' has length
greater than zero and if the return values from all elements of
'X' are all of the same (positive) length. If the common length
is one the result is a vector, and if greater than one is a matrix
with a column corresponding to each element of 'X'.
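When i == 1, the index i-1 is 0, so DF$V4[i-1] returns numeric(0) and so does your formula. Because the return values then no longer share a common length, the whole return will be a list.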
You need to change your calculation to account for indexing outside the bounds of your vector. DF$V4[1-1] returns numeric(0), and DF$V4[nrow(DF)+1] returns NA. Fix this logic and you should remedy the vector problem.
Edit: for historical reasons, the original question incorrectly calculated the difference as (DF$V4[i+1])-DF$V4[i-1], giving a lag-2 difference, whereas the recently-edited question (and the OP's intent) shows a lag-1 difference.
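One way to act on this advice, sketched under the assumption that the lag-2 formula from the original question is wanted and that NA is acceptable at both ends. It uses the first four V4 values from the question; bad and good are illustrative names:

```r
DF <- data.frame(V4 = c(2.913685e-09, 3.443367e-09, 3.520451e-09, 3.613013e-09))
n <- nrow(DF)

# Unguarded version: i = 1 yields numeric(0), so sapply cannot simplify
bad <- sapply(seq_len(n), function(i) DF$V4[i + 1] - DF$V4[i - 1])
class(bad)   # "list"

# Guarded version: every call returns exactly one number.
# vapply() also enforces that, failing loudly instead of returning a list.
good <- vapply(seq_len(n), function(i) {
  if (i == 1 || i == n) NA_real_ else DF$V4[i + 1] - DF$V4[i - 1]
}, numeric(1))
class(good)  # "numeric"
```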
Instead of sapply I should simply use diff(DF$V3) and write it into a new data.frame:
gradients = data.frame(gradUx=diff(DF$V3),gradUy=diff(DF$V4))
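Note that diff() returns one value fewer than its input, which is why the result goes into a new data.frame here. If the differences should sit next to the original rows instead, one option is to pad with NA. A sketch with made-up numbers (the values in DF below are illustrative only):

```r
# Hypothetical data, not the values from the question
DF <- data.frame(V3 = c(1, 4, 9, 16), V4 = c(2, 3, 5, 8))

gradUx <- diff(DF$V3)
length(gradUx)       # 3, one shorter than nrow(DF)

# Pad the last row with NA so the column fits the original data.frame
DF$gradUx <- c(gradUx, NA)
class(DF$gradUx)     # "numeric"
```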
This calculation can be vectorized very easily if you line up the observations. I use head and tail to drop the first 2 and last 2 observations:
gradUx <- c(NA, tail(df$V4, -2) - head(df$V4, -2), NA)
> gradUx
[1] NA 6.06766e-10 1.69646e-10 1.99934e-10 2.28924e-10 NA
Which provides the same values as your approach, in vector form:
> sapply(1:nrow(df), function(i) ((df$V4[i+1])-df$V4[i-1]), simplify = "vector")
[[1]]
numeric(0)
[[2]]
[1] 6.06766e-10
[[3]]
[1] 1.69646e-10
[[4]]
[1] 1.99934e-10
[[5]]
[1] 2.28924e-10
[[6]]
[1] NA
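The head/tail subtraction above is the same operation as a lag-2 difference, so diff() with its lag argument should give the same numbers. A quick check using the V4 column from the question (by_head_tail and by_diff are illustrative names):

```r
V4 <- c(2.913685e-09, 3.443367e-09, 3.520451e-09,
        3.613013e-09, 3.720385e-09, 3.841937e-09)

# Approach from the answer: line up x[i+1] against x[i-1]
by_head_tail <- c(NA, tail(V4, -2) - head(V4, -2), NA)

# Equivalent built-in: differences between elements two positions apart
by_diff <- c(NA, diff(V4, lag = 2), NA)

all.equal(by_head_tail, by_diff)
```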
In R, for the sake of example, I have a list composed of equal-length numeric vectors of form similar to:
list <- list(c(1,2,3),c(1,3,2),c(2,1,3))
[[1]]
[1] 1 2 3
[[2]]
[1] 1 3 2
[[3]]
[1] 2 1 3
...
Every element of the list is unique. I want to get the index number of the element x <- c(2,1,3), or any other particular numeric vector within the list.
I've attempted using match(x,list), which gives a vector full of NA, and which(list==c(1,2,3)), which gives me a "(list) object cannot be coerced to type 'double'" error. Coercing the list to different types didn't seem to make a difference for the which function. I also attempted various grep* functions, but these don't return exact numeric vector matches. Using find(c(1,2,3),list) or even some fancy sapply which %in% type functions didn't give me what I was looking for. I feel like I have a type problem. Any suggestions?
--Update--
Summary of Solutions
Thanks for your replies. The method in the comment for this question is clean and works well (via akrun).
> which(paste(list)==deparse(x))
[1] 25
The next method didn't work correctly
> which(duplicated(c(x, list(y), fromLast = TRUE)))
[1] 49
> y
[1] 1 2 3
This sounds good, but in the next block you can see the problem
> y<-c(1,3,2)
> which(duplicated(c(list, list(y), fromLast = TRUE)))
[1] 49
More fundamentally, there are only 48 elements in the list I was using.
The last method works well (via BondedDust), and I would guess it is more efficient using an apply function:
> which( sapply(list, identical, y ))
[1] 25
match works fine if you pass it the right data.
L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
match(list(c(2,1,3)), L)
#[1] 3
Beware that this works via coercing lists to character, so fringe cases will fail - with a hat-tip to @nicola:
match(list(1:3),L)
#[1] NA
even though:
1:3 == c(1,2,3)
#[1] TRUE TRUE TRUE
Although arguably:
identical(1:3,c(1,2,3))
#[1] FALSE
identical(1:3,c(1L,2L,3L))
#[1] TRUE
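If the target might arrive as an integer vector, one possible workaround is to coerce it to double before wrapping it in a list. This is a sketch that assumes the list elements are all doubles, as they are in L; target is a hypothetical name:

```r
L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))

target <- 1:3                       # integer, so a direct match() fails
match(list(target), L)              # NA

# Coerce to double so the target has the same type as the list elements
match(list(as.numeric(target)), L)  # 1
```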
You can use duplicated(). If we add the matching vector to the end of the original list and set fromLast = TRUE, we will find the duplicate(s). Then we can use which() to get the index.
which(duplicated(c(list, list(c(2, 1, 3))), fromLast = TRUE))
# [1] 3
Or you could add it as the first element and subtract 1 from the result.
which(duplicated(c(list(c(2, 1, 3)), list))) - 1L
# [1] 3
Note that the type always matters with this type of comparison. When comparing integers and numerics, you will need to convert doubles to integers for this to run without issue. For example, 1:3 is not the same type as c(1, 2, 3).
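The placement of fromLast = TRUE matters: it has to be an argument of duplicated(), not of c(). The sketch below shows what happens when the parenthesis closes too late, which may be what produced an index one past the end of the list in the question's update (combined and wrong are illustrative names):

```r
L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))
target <- c(2, 1, 3)

# Intended call: fromLast is passed to duplicated()
combined <- c(L, list(target))
which(duplicated(combined, fromLast = TRUE))  # 3

# Misplaced parenthesis: fromLast = TRUE becomes a fifth list element,
# and duplicated() scans from the front, flagging the appended copy
wrong <- c(L, list(target), fromLast = TRUE)
length(wrong)            # 5
which(duplicated(wrong)) # 4, i.e. length(L) + 1
```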
> L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
> sapply(L, identical, c(2,1,3))
[1] FALSE FALSE TRUE
> which( sapply(L, identical, c(2,1,3)) )
[1] 3
This would be slightly less restrictive in its test:
> which( sapply(L, function(x,y){all(x==y)}, c(1:3)) )
[1] 1
Try:
vapply(list,function(z) all(z==x),TRUE)
#[1] FALSE FALSE TRUE
Wrapping the above line in which() gives you the index of the matching list element.
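One caveat with all(z == x) is that == recycles, so vectors of different lengths can compare as equal. A possible middle ground between identical() and ==, sketched here with isTRUE(all.equal()), ignores the integer/double distinction but still checks length. L2, loose and strict are hypothetical names, and the list is made up for the demonstration:

```r
# Hypothetical list: the first element is the target repeated twice
L2 <- list(c(1, 2, 1, 2), c(1, 2))
x <- c(1, 2)

# == recycles x, so the length-4 vector also "matches"
loose <- vapply(L2, function(z) all(z == x), logical(1))
loose    # TRUE TRUE

# all.equal() reports a length mismatch, and isTRUE() turns that into FALSE
strict <- vapply(L2, function(z) isTRUE(all.equal(z, x)), logical(1))
strict   # FALSE TRUE

# Integer vs double is still tolerated
isTRUE(all.equal(1:3, c(1, 2, 3)))  # TRUE
```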
Say I have a lookup table as following
dt <- data.frame(name=c("jack","jill","sam","dan"),age=c(20,14,28,13))
name age
1 jack 20
2 jill 14
3 sam 28
4 dan 13
Now I want to convert the following vectors to the vectors containing the ages of each element.
query1 <- c("jack","dan")
query2 <- c("sam")
query3 <- c("jack","sam", "dan")
I can build the following function (which I don't like) to accomplish this task,
library(plyr)  # provides ldply
get.age <- function(x) {
  answer <- list()
  for(i in 1:length(x)){
    # look up the age for the i-th queried name
    answer[[i]] <- dt[dt$name==x[i],"age"]
  }
  ldply(answer)$V1
}
which gets the job done like this
> get.age(query1)
[1] 20 13
> get.age(query2)
[1] 28
> get.age(query3)
[1] 20 28 13
But I don't like the solution because it uses a for loop and some dirty hack. Ideally, I would want to do it in a more R-like way using vector operations like this (which doesn't seem to work)
> dt[dt$name==c("jack","dan"),"age"]
[1] 20 13 #worked
> dt[dt$name==c("jack","sam"),"age"]
[1] 20 # not the right answer
The following solution works, but this requires the prior knowledge of how many things I lookup.
dt[dt$name=="jack" | dt$name=="sam","age"]
[1] 20 28
I would like to know a method, if there is one, that can handle vectors of arbitrary size and convert the keys to elements without using a for loop.
For a true lookup table, the result should be the length of the query and also deal with replication in the query. The approaches using match(...) are the only ones that do this:
query4 <- c("jack","sam", "dan","sam","jack")
dt[match(query4,dt$name),]$age
# [1] 20 28 13 28 20
This is because match(LHS,RHS) returns an integer vector of length(LHS) which contains the row numbers of the RHS which match the corresponding element of LHS.
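Two details of this lookup may be worth seeing in code: a key that is not in the table comes back as NA rather than being dropped, and a named vector gives the same behaviour with plain indexing. A sketch, where "bob" is a made-up key that is not in the table and ages is an illustrative name:

```r
dt <- data.frame(name = c("jack", "jill", "sam", "dan"),
                 age = c(20, 14, 28, 13),
                 stringsAsFactors = FALSE)

query <- c("jack", "sam", "bob", "sam")   # "bob" is not in dt

# match() keeps the length and order of the query; no match gives NA
via_match <- dt$age[match(query, dt$name)]
via_match    # 20 28 NA 28

# Same idea with a named vector used as the lookup table
ages <- setNames(dt$age, dt$name)
via_names <- unname(ages[query])
via_names    # 20 28 NA 28
```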
The approaches based on comparison (==) will generally not work. This is because, when comparing two vectors, R tries to replicate the shorter one however many times are needed to make it the same length as the longer one, and then does an element-by-element comparison. So in the case of dt$name==query1, for example, the RHS gets replicated twice and the comparison is between c("jack","jill","sam","dan") and c("jack","dan","jack","dan").
dt$name==query1 # RHS is: c("jack","dan","jack","dan")
# [1] TRUE FALSE FALSE TRUE
dt$name==query2 # RHS is: c("sam","sam","sam","sam")
# [1] FALSE FALSE TRUE FALSE
dt$name==query3 # RHS is: c("jack","sam", "dan","jack") with warning
# [1] TRUE FALSE FALSE FALSE
# with warning: longer object length is not a multiple of shorter object length
On the other hand, using LHS %in% RHS gives a result with length(LHS) and either T or F depending on whether that element is present in RHS.
dt$name %in% query1
# [1] TRUE FALSE FALSE TRUE
query1 %in% dt$name
# [1] TRUE TRUE
Note that it looks like dt$name %in% query1 and dt$name==query1 give the same result, but that's an artifact of query1 being replicated twice in the latter comparison. See, for example:
dt$name %in% query3
# [1] TRUE FALSE TRUE TRUE
dt$name == query3
# [1] TRUE FALSE FALSE FALSE
You want %in%; it returns a logical vector that is used to subset the data frame
dt[dt$name %in% query3,"age"]
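One thing to keep in mind with this approach is that the rows come back in the order of the table, not the order of the query. A short sketch contrasting it with match(), using a query whose order differs from the table (query, in_table_order and in_query_order are illustrative names):

```r
dt <- data.frame(name = c("jack", "jill", "sam", "dan"),
                 age = c(20, 14, 28, 13),
                 stringsAsFactors = FALSE)

query <- c("dan", "jack")   # reversed relative to the table

# %in% builds a mask over the table rows, so table order wins
in_table_order <- dt[dt$name %in% query, "age"]
in_table_order   # 20 13

# match() returns one row index per query element, so query order wins
in_query_order <- dt$age[match(query, dt$name)]
in_query_order   # 13 20
```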
There are a lot of ways to do this, but I'll throw out one that I find useful: match(). @jlhoward's answer goes into more detail and explains why my previous == examples were wrong.
> match(query1, dt$name) #these give us the index of the *first* matching value
[1] 1 4
> match(query2, dt$name)
[1] 3
> dt$age[match(query1, dt$name)]
[1] 20 13
> dt$age[match(query2, dt$name)]
[1] 28
You can also use %in%. Unlike match, it returns TRUE or FALSE for each element depending on whether it exists in the comparison vector. Be sure to get the order right: dt$name %in% query1 returns TRUE FALSE FALSE TRUE, whereas query1 %in% dt$name returns TRUE TRUE.
> dt[dt$name %in% query1, ][,'age',]
[1] 20 13
With dplyr you can use filter
> require(dplyr)
> filter(dt, name %in% query1)
name age
1 jack 20
2 dan 13
> filter(dt, name %in% query1)$age
[1] 20 13
By mistake, I found that R counts the elements of a vector that includes NA in an interesting way:
> temp <- c(NA,NA,NA,1) # 4 items
> length(temp[temp>1])
[1] 3
> temp <- c(NA,NA,1) # 3 items
> length(temp[temp>1])
[1] 2
At first I assumed R would collapse all the NAs into one NA, but this is not the case.
Can anyone explain? Thanks.
You were expecting only TRUE's and FALSE's (and the results to only be FALSE) but a logical vector can also have NA's. If you were hoping for a length zero result, then you had at least three other choices:
> temp <- c(NA,NA,NA,1) # 4 items
> length(temp[ which(temp>1) ] )
[1] 0
> temp <- c(NA,NA,NA,1) # 4 items
> length(subset( temp, temp>1) )
[1] 0
> temp <- c(NA,NA,NA,1) # 4 items
> length( temp[ !is.na(temp) & temp>1 ] )
[1] 0
You will find the last form in a lot of the internal code of well established functions. I happen to think the first version is more economical and easier to read, but the R Core seems to disagree. I have several times been advised on R help not to use which() around logical expressions. I remain unconvinced. It is correct that one should not combine it with negative indexing.
EDIT The reason not to use the construct "minus which" (negative indexing with which) is that in the case where all the items fail the which-test, and where you would therefore expect all of them to be returned, it returns an unexpected empty vector:
temp <- c(1,2,3,4,NA)
temp[!temp > 5]
#[1] 1 2 3 4 NA As expected
temp[-which(temp > 5)]
#numeric(0) Not as expected
temp[!temp > 5 & !is.na(temp)]
#[1] 1 2 3 4 A correct way to handle negation
I admit that the notion that NA's should select NA elements seems a bit odd, but it is rooted in the history of S and therefore R. There is a section in ?"[" about "NA's in indexing". The rationale is that each NA as an index should return an unknown result, i.e. another NA.
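If positional indices from which() are still wanted for dropping elements, one way to sidestep the empty-index problem is to compute the positions to keep rather than negate the positions to drop. A sketch using setdiff(), which is my suggestion rather than something from the answer above; drop and kept are illustrative names:

```r
temp <- c(1, 2, 3, 4, NA)

drop <- which(temp > 5)   # integer(0): nothing exceeds 5
temp[-drop]               # numeric(0), the surprising result

# Keep every position that is not in drop; an empty drop keeps everything
kept <- temp[setdiff(seq_along(temp), drop)]
kept                      # 1 2 3 4 NA
```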
If you break down each command and look at the output, it's more enlightening:
> tmp = c(NA, NA, 1)
> tmp > 1
[1] NA NA FALSE
> tmp[tmp > 1]
[1] NA NA
So, when we next perform length(tmp[tmp > 1]), it's as if we're executing length(c(NA,NA)). It is fine to have a vector full of NAs - it has a fixed length (as if we'd created it via NA * vector(length = 2), which should be different from NA * vector(length = 3)).
You can use 'sum':
> tmp <- c(NA, NA, NA, 3)
> sum(tmp > 1)
[1] NA
> sum(tmp > 1, na.rm=TRUE)
[1] 1
A bit of explanation: 'sum' expects numbers but 'tmp > 1' is logical. So it is automatically coerced to be numeric: TRUE => 1; FALSE => 0; NA => NA.
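The coercion can be made visible directly, and the same idea extends to counting the NAs themselves or taking a proportion. A short sketch (flags is an illustrative name):

```r
tmp <- c(NA, NA, NA, 3)
flags <- tmp > 1            # NA NA NA TRUE

as.integer(flags)           # NA NA NA 1: the coercion sum() relies on

sum(flags, na.rm = TRUE)    # 1: number of TRUE values
sum(is.na(flags))           # 3: number of comparisons that are unknown
mean(flags, na.rm = TRUE)   # 1: share of TRUE among the non-missing values
```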
I don't think there is anything precisely like this in 'The R Inferno' but this is definitely the sort of question that it is aimed at. http://www.burns-stat.com/pages/Tutor/R_inferno.pdf