By accident, I found that R counts the elements of a vector containing NA in an interesting way:
> temp <- c(NA,NA,NA,1) # 4 items
> length(temp[temp>1])
[1] 3
> temp <- c(NA,NA,1) # 3 items
> length(temp[temp>1])
[1] 2
At first I assumed R would collapse all the NAs into a single NA, but this is not the case.
Can anyone explain? Thanks.
You were expecting only TRUE's and FALSE's (and the results to only be FALSE) but a logical vector can also have NA's. If you were hoping for a length zero result, then you had at least three other choices:
> temp <- c(NA,NA,NA,1) # 4 items
> length(temp[ which(temp>1) ] )
[1] 0
> temp <- c(NA,NA,NA,1) # 4 items
> length(subset( temp, temp>1) )
[1] 0
> temp <- c(NA,NA,NA,1) # 4 items
> length( temp[ !is.na(temp) & temp>1 ] )
[1] 0
You will find the last form in a lot of the internal code of well established functions. I happen to think the first version is more economical and easier to read, but the R Core seems to disagree. I have several times been advised on R help not to use which() around logical expressions. I remain unconvinced. It is correct that one should not combine it with negative indexing.
EDIT: The reason not to use the construct "minus which" (negative indexing with which) is that when none of the items pass the which-test, and you would therefore expect all of them to be returned, it returns an unexpected empty vector:
temp <- c(1,2,3,4,NA)
temp[!temp > 5]
#[1] 1 2 3 4 NA As expected
temp[-which(temp > 5)]
#numeric(0) Not as expected
temp[!temp > 5 & !is.na(temp)]
#[1] 1 2 3 4 A correct way to handle negation
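A minimal sketch of why the empty result appears, with two possible workarounds. It reuses temp from the example above; the names hits, keep and keep2 are only illustrative.

```r
temp <- c(1, 2, 3, 4, NA)

# which() finds no element greater than 5, so it returns integer(0) ...
hits <- which(temp > 5)

# ... and -integer(0) is still integer(0), which selects nothing.
temp[-hits]

# Option 1: drop by position only when there is something to drop.
keep <- if (length(hits) > 0) temp[-hits] else temp

# Option 2: compute the positions to keep; this is safe when hits is empty.
keep2 <- temp[setdiff(seq_along(temp), hits)]
```

Both options keep the NA, like temp[!temp > 5] does; add !is.na(temp) if the NA should go as well.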
I admit that the notion that NA's should select NA elements seems a bit odd, but it is rooted in the history of S and therefore R. There is a section in ?"[" about "NA's in indexing". The rationale is that each NA as an index should return an unknown result, i.e. another NA.
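A small sketch of that rule, assuming nothing beyond base R. It also shows the difference between a logical NA index, which is recycled, and an integer NA index, which is a single position.

```r
temp <- c(NA, NA, NA, 1)

# A logical NA index yields an NA element for each NA in the index.
temp[c(NA, NA, NA, FALSE)]   # three NAs

# A bare NA is logical, so it is recycled to the length of temp.
length(temp[NA])             # 4

# An integer NA index is a single position, so it gives a single NA.
length(temp[NA_integer_])    # 1
```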
If you break down each command and look at the output, it's more enlightening:
> tmp = c(NA, NA, 1)
> tmp > 1
[1] NA NA FALSE
> tmp[tmp > 1]
[1] NA NA
So when we then run length(tmp[tmp > 1]), it is as if we were executing length(c(NA, NA)). It is fine to have a vector full of NAs: it has a fixed length, as if we had created it via NA * vector(length = 2), which is different from NA * vector(length = 3).
You can use 'sum':
> tmp <- c(NA, NA, NA, 3)
> sum(tmp > 1)
[1] NA
> sum(tmp > 1, na.rm=TRUE)
[1] 1
A bit of explanation: 'sum' expects numbers but 'tmp > 1' is logical. So it is automatically coerced to be numeric: TRUE => 1; FALSE => 0; NA => NA.
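The coercion can be seen directly in a short sketch. The variable names n_over and n_missing are only illustrative.

```r
tmp <- c(NA, NA, NA, 3)

# Logical values coerce to numbers: TRUE -> 1, FALSE -> 0, NA -> NA.
as.numeric(c(TRUE, FALSE, NA))

# Count the elements greater than 1, ignoring the NAs.
n_over <- sum(tmp > 1, na.rm = TRUE)

# Count the NAs themselves the same way.
n_missing <- sum(is.na(tmp))
```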
I don't think there is anything precisely like this in 'The R Inferno' but this is definitely the sort of question that it is aimed at. http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
I'd like to understand what's going on in this piece of R code I was testing. I'd like to replace part of a vector with another vector. The original and replacement values are in a data.frame. I'd like to replace all elements of the vector that match the original column with the corresponding replacement values. I have the answer to the larger question, but I'm unable to understand how it works.
Here's a simple example:
> vecA <- 1:5;
> vecB <- data.frame(orig=c(2,3), repl=c(22,33));
> vecA[vecA %in% vecB$orig] <- vecB$repl #Question-1
> vecA
[1] 1 22 33 4 5
> vecD<-data.frame(orig=c(5,7), repl=c(55,77))
> vecA[vecA %in% vecD$orig] <- vecD$repl #Question-2
Warning message:
In vecA[vecA %in% vecD$orig] <- vecD$repl :
number of items to replace is not a multiple of replacement length
> vecA
[1] 1 22 33 4 55
Here are my questions:
How does the assignment on Line-3 (marked #Question-1) work? The LHS expression is a 2-item vector, whereas the RHS is a 5-element vector.
Why does the assignment on Line-6 (marked #Question-2) give a warning (but still work)?
The First Question
R goes through each element in vecA and checks whether it exists in vecB$orig. The %in% operator returns a logical vector with one TRUE or FALSE per element of vecA. If you run the command vecA %in% vecB$orig you get the following:
[1] FALSE TRUE TRUE FALSE FALSE
which is telling you that in the vector 1 2 3 4 5 it sees 2 and 3 in vecB$orig.
By subsetting vecA by this command, you are isolating only the TRUE values in vecA, so vecA[vecA %in% vecB$orig] returns:
[1] 2 3
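The assignment then writes the right-hand side, vecB$repl, into exactly those positions where vecA %in% vecB$orig is TRUE, so 2 3 in vecA is replaced by 22 33. Both sides of the assignment have length 2, which is why there is no warning.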
The Second Question
In this case, the same logic applies for subsetting, but running vecA[vecA %in% vecD$orig] gives you
[1] 5
as 7 does not exist in vecA. You are trying to replace a vector of length 1 with a vector of length 2, which is what triggers the warning. R still performs the assignment, using only the first element of vecD$repl, which happens to be 55.
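Note that the logical-index assignment pairs replacement values with the matched positions in order, not by value. If the aim is for each value to receive its own replacement, one possible sketch uses match() instead; pos and hit are illustrative names, not part of the original code.

```r
vecA <- 1:5
vecD <- data.frame(orig = c(5, 7), repl = c(55, 77))

# For each element of vecA, the row of vecD whose orig equals it (NA if none).
pos <- match(vecA, vecD$orig)

# Only positions with a match are replaced, each by its own repl value.
hit <- !is.na(pos)
vecA[hit] <- vecD$repl[pos[hit]]
vecA
```

Because both sides of the assignment always have the same length, this form does not produce the recycling warning.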
In R, for the sake of example, I have a list composed of equal-length numeric vectors of form similar to:
list <- list(c(1,2,3),c(1,3,2),c(2,1,3))
[[1]]
[1] 1 2 3
[[2]]
[1] 1 3 2
[[3]]
[1] 2 1 3
...
Every element of the list is unique. I want to get the index number of the element x <- c(2,1,3), or any other particular numeric vector within the list.
I've attempted using match(x,list), which gives a vector full of NA, and which(list==c(1,2,3)), which gives me a "(list) object cannot be coerced to type 'double'" error. Coercing the list to different types didn't seem to make a difference for the which function. I also attempted various grep* functions, but these don't return exact numeric vector matches. Using find(c(1,2,3),list) or even some fancy combinations of sapply, which and %in% didn't give me what I was looking for. I feel like I have a type problem. Any suggestions?
--Update--
Summary of Solutions
Thanks for your replies. The method in the comment for this question is clean and works well (via akrun).
> which(paste(list)==deparse(x))
[1] 25
The next method didn't work correctly
> which(duplicated(c(x, list(y), fromLast = TRUE)))
[1] 49
> y
[1] 1 2 3
This sounds good, but in the next block you can see the problem
> y<-c(1,3,2)
> which(duplicated(c(list, list(y), fromLast = TRUE)))
[1] 49
More fundamentally, there are only 48 elements in the list I was using.
The last method works well (via BondedDust), and I would guess it is more efficient using an apply function:
> which( sapply(list, identical, y ))
[1] 25
match works fine if you pass it the right data.
L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
match(list(c(2,1,3)), L)
#[1] 3
Beware that this works by coercing lists to character, so fringe cases will fail (with a hat-tip to @nicola):
match(list(1:3),L)
#[1] NA
even though:
1:3 == c(1,2,3)
#[1] TRUE TRUE TRUE
Although arguably:
identical(1:3,c(1,2,3))
#[1] FALSE
identical(1:3,c(1L,2L,3L))
#[1] TRUE
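One way around the type mismatch, sketched under the assumption that the list holds doubles as in L above, is to convert the lookup vector to the same type before calling match().

```r
L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))

# 1:3 is integer while the list holds doubles, so this finds no match.
match(list(1:3), L)

# Converting to double first makes the types agree.
match(list(as.numeric(1:3)), L)
```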
You can use duplicated(). If we add the matching vector to the end of the original list and set fromLast = TRUE, we will find the duplicate(s). Then we can use which() to get the index.
which(duplicated(c(list, list(c(2, 1, 3))), fromLast = TRUE))
# [1] 3
Or you could add it as the first element and subtract 1 from the result.
which(duplicated(c(list(c(2, 1, 3)), list))) - 1L
# [1] 3
Note that the type always matters with this type of comparison. When comparing integers and numerics, you will need to convert doubles to integers for this to run without issue. For example, 1:3 is not the same type as c(1, 2, 3).
> L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
> sapply(L, identical, c(2,1,3))
[1] FALSE FALSE TRUE
> which( sapply(L, identical, c(2,1,3)) )
[1] 3
This would be slightly less restrictive in its test:
> which( sapply(L, function(x,y){all(x==y)}, c(1:3)) )
[1] 1
Try:
vapply(list,function(z) all(z==x),TRUE)
#[1] FALSE FALSE TRUE
Wrapping the above line in which() gives you the index within the list.
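A sketch of the same idea wrapped in a helper. The name find_vector is hypothetical, and the added length check is an assumption about what is wanted: without it, all(z == x) recycles the shorter vector and can report a match between vectors of different lengths.

```r
# Hypothetical helper: positions in lst whose element equals x by value.
find_vector <- function(lst, x) {
  hits <- vapply(
    lst,
    function(z) length(z) == length(x) && all(z == x),
    logical(1)
  )
  which(hits)
}

L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))
find_vector(L, c(2, 1, 3))   # position of the matching element
find_vector(L, 1:3)          # == compares by value, so integer input matches
find_vector(L, c(9, 9, 9))   # no match gives integer(0)
```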
Sorry, this seems like a really silly question, but are dataframe[ ,-1] and dataframe[-1] the same, and does it work for all data types?
And why are they the same?
Almost.
[-1] uses the fact that a data.frame is a list, so when you do dataframe[-1] it returns another data.frame (list) without the first element (i.e. column).
[ ,-1] uses the fact that a data.frame can also be indexed like a two-dimensional array, so when you do dataframe[, -1] you get the sub-array that does not include the first column.
At first glance they sound the same, but the second form also tries by default to reduce the dimension of the sub-array it returns. So, depending on the dimensions of your data frame, you may get a data.frame or a vector. For example:
> data <- data.frame(a = 1:2, b = 3:4)
> class(data[-1])
[1] "data.frame"
> class(data[, -1])
[1] "integer"
You can use drop = FALSE to override that behavior:
> class(data[, -1, drop = FALSE])
[1] "data.frame"
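To illustrate how the result depends on the number of columns left, here is a short sketch with a hypothetical three-column data frame df.

```r
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)

# Two columns remain, so the result is still a data.frame.
class(df[, -1])

# Only one column remains, so it is dropped to a plain vector.
class(df[, -(1:2)])

# drop = FALSE keeps the data.frame structure in both cases.
class(df[, -(1:2), drop = FALSE])
```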
dataframe[-1] will treat your data in vector form, thus returning all but the very first element, which, as has been pointed out, turns out to be a column, as a data.frame is a list. dataframe[,-1] will treat your data in matrix form, returning all but the first column.
Sorry, I wanted to leave this as a comment but thought it was too big. I just found it interesting that the only one which does not return an integer vector is dataframe[1].
Further to Carl's answer, it seems dataframe[[1]] is treated as a matrix as well.
But dataframe[1] isn't...
But it can't really be treated as a matrix, because the results for dataframe[[1]] and matrix[[1]] are different.
D <- as.data.frame(matrix(1:16,4))
D
M <- (matrix(1:16,4))
M
> D[ ,1] # first column of data frame, as a vector
[1] 1 2 3 4
> D[[1]] # first column of dataframe
[1] 1 2 3 4
> D[1] # First column of dataframe
V1
1 1
2 2
3 3
4 4
>
> class(D[ ,1])
[1] "integer"
> class(D[[1]])
[1] "integer"
> class(D[1])
[1] "data.frame"
>
> M[ ,1] # first column of matrix
[1] 1 2 3 4
> M[[1]] # First element of first row & col
[1] 1
> M[1] # First element of first row & col
[1] 1
>
> class(M[ ,1])
[1] "integer"
> class(M[[1]])
[1] "integer"
> class(M[1])
[1] "integer"
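The difference between D[[1]] and M[[1]] follows from how the two objects are stored, which the following sketch makes visible using the same D and M.

```r
D <- as.data.frame(matrix(1:16, 4))
M <- matrix(1:16, 4)

# A data frame is a list of columns, so [[1]] extracts the first column.
is.list(D)        # TRUE
length(D)         # 4 columns

# A matrix is one vector with a dim attribute, so [[1]] and [1] index cells.
is.list(M)        # FALSE
length(M)         # 16 cells
M[5] == M[1, 2]   # TRUE: cells are numbered down the columns
```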
I don't know why this isn't giving me the desired results.
Here is my vector:
flowers = c("Flower", "Flower", "Vegatative", "Vegatative", "Dead")
Here is my for loop:
Na = 0
for (i in 1:length(flowers)){
if (i != "Dead"){
Na = Na + 1
}
}
Na
Obviously Na should equal 4, but it gives me a result of 5. When I print the flower's status it prints all 5. I don't want it to read the last one. What's my problem?
Thank you.
You seem to be trying to count the number of values in flowers that are not equal to "Dead". In R, the way to do this would be:
sum(flowers != "Dead")
# [1] 4
The bug in your code is this line:
if (i != "Dead"){
To understand why, it would be best to print out the values of i in the loop:
for (i in 1:length(flowers)){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
That is, you are iterating over numbers (indices of a vector), but not actually selecting the value from the vector when you do the test. To access the values, use flowers[i]:
for (i in 1:length(flowers)){
print(flowers[i])
}
[1] "Flower"
[1] "Flower"
[1] "Vegatative"
[1] "Vegatative"
[1] "Dead"
And so, the answer to your original question is this:
Na = 0
for (i in 1:length(flowers)){
if (flowers[i] != "Dead"){
Na = Na + 1
}
}
Na
[1] 4
R offers a lot of facilities for doing computations like this without a loop - it's called vectorization. A good article on it is John Cook's 5 Kinds of Subscripts in R. For example, you could get the same result like this:
length(flowers[flowers != "Dead"])
[1] 4
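As a sketch of why the original test never failed, and of a loop that iterates over the values rather than the indices (status and n_alive are illustrative names):

```r
flowers <- c("Flower", "Flower", "Vegatative", "Vegatative", "Dead")

# Comparing a number with a string coerces the number to a string,
# so every index passes the original test.
1 != "Dead"   # TRUE, because "1" != "Dead"

# Looping over the values instead of the indices.
n_alive <- 0
for (status in flowers) {
  if (status != "Dead") {
    n_alive <- n_alive + 1
  }
}
n_alive       # 4
```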