Differences between vectors _including_ NA

Suppose I have a vector x<-c(1,2,NA,4,5,NA).
I apply some mythological code to that vector, which results in another vector, y<-c(1,NA,3, 4,10,NA)
Now I wish to find out at which positions my two vectors differ, where I count two NAs as being the same, and one NA versus a non-NA as being different (e.g. the second element of the two example vectors).
Specifically, for my example, I would like to end up with a vector holding c(2,3,5).
For my use case I am not after a vector of logicals, but since I can obviously convert one with which(), I'll accept that as well.
I have some solutions like:
simplediff <- x != y                 # NA wherever either value is NA
nadiff <- is.na(x) != is.na(y)       # TRUE where exactly one of the two values is NA
which(simplediff | nadiff)           # which() drops the remaining NAs
but it feels like I'm reinventing the wheel here. Any better options?

How about looping and using identical?
!mapply(identical,x,y)
[1] FALSE TRUE TRUE FALSE TRUE FALSE
And for positions:
seq_along(x)[!mapply(identical,x,y)]
[1] 2 3 5
or
which(!mapply(identical,x,y))
[1] 2 3 5

One possible solution (though I'm sure it is not the best):
(1:length(x))[-which((x-y)==0)]
Note that this treats positions where both vectors are NA as differences (it returns 2 3 5 6 for the example), so it does not quite match the requirement.
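If you want to wrap the NA handling into something reusable, here is a small helper along the same lines (a sketch of my own, not from the thread; diff_positions is a made-up name):
diff_positions <- function(x, y) {
  stopifnot(length(x) == length(y))
  same <- (x == y) | (is.na(x) & is.na(y))  # NA exactly where one value is NA and the other isn't
  which(!same | is.na(same))                # treat those NA comparisons as differences
}
diff_positions(c(1, 2, NA, 4, 5, NA), c(1, NA, 3, 4, 10, NA))
# [1] 2 3 5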

Related

Testing Vector for equality with NAs in subscripts R

I discovered a strange behavior in R and wanted to ask whether it can lead to problems.
Here is an example of what I mean:
x <- c(1,2,3,4,NA,5)
x == 1
this returns:
TRUE FALSE FALSE FALSE NA FALSE
Now say I want to change all x==1 to 100:
x[x==1]<-100
x
returns:
100 2 3 4 NA 5
So for some reason subscripting x with a logical vector that contains an NA worked in the assignment. Why?
To me this behavior of "==" with NAs doesn't make much sense.
I am also worried now that this kind of assignment x[x==y]<-... may have caused (unnoticed) problems with other types of vectors such as in data.frames or matrices. Has anyone encountered any problems with this or does this logical NA referencing work consistently with other types of vectors?
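For what it's worth, the asymmetry between extraction and assignment can be seen directly; here is a quick sketch (not from the original thread) that also shows a more explicit alternative:
x <- c(1, 2, 3, 4, NA, 5)

# In extraction, an NA in a logical index produces an NA element:
x[x == 1]
# [1]  1 NA

# In assignment with a length-one replacement, positions where the index is NA
# are simply left untouched -- which is why x[x == 1] <- 100 appears to work:
x[x == 1] <- 100
x
# [1] 100   2   3   4  NA   5

# If you prefer not to rely on that, keep NAs out of the index explicitly:
x <- c(1, 2, 3, 4, NA, 5)
x[which(x == 1)] <- 100          # which() drops NAs from the index
x[!is.na(x) & x == 1] <- 100     # or build an NA-free logical index
Note that R only skips the NA positions when the replacement value has length one; with a longer replacement it raises an error about NAs not being allowed in subscripted assignments.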

How to count the number of TRUE values in a logical vector before the first FALSE

While trying to find the number of TRUE values in a vector, I came across an existing question as the first Google hit. However, it does not fully meet my requirements: I want the number of TRUE values that appear before the first FALSE, if any. I have a vector a <- c(TRUE,TRUE,TRUE,FALSE,TRUE,TRUE) and want to count the TRUE values before the FALSE, so the output will be 3. Kindly note that it should also work if there are only TRUE values in the vector.
Here is a short way:
sum(cumprod(a))
# [1] 3
where cumprod gives a cumulative product (of zeros and ones in this case); so it zeroes out all TRUEs after the first FALSE, as in
cumprod(a)
# [1] 1 1 1 0 0 0
Using the statement below we can also get the result easily, provided the vector contains at least one FALSE:
which.min(a)-1
If a is all TRUE, which.min(a) returns 1 and this would wrongly give 0, so sum(cumprod(a)) is the safer choice in that case.
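If the all-TRUE case matters, one way to guard the which.min() idea (a sketch of my own; count_leading_true is just an illustrative name):
count_leading_true <- function(a) {
  if (any(!a)) which.min(a) - 1 else length(a)  # fall back to length(a) when there is no FALSE
}

count_leading_true(c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE))
# [1] 3
count_leading_true(c(TRUE, TRUE))
# [1] 2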

How to find common elements of two different-length vectors in R?

I need to find common elements in 2 different length vectors.
For example, I have a vector A with 10 elements, and a vector B with 3 elements.
I need to get the positions at which the elements of A are equal to elements of B.
A=c(1,2,45,3,10,5,11,13,6,7)
B=c(45,3,10)
C would be [3,4,5]
I have already tried the "match" and "intersect" functions, but with no success :(
Thanks a lot! :)
You can use which function.
> which(A %in% B)
[1] 3 4 5
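As a side note (not part of the original answer), match() also works here and returns the positions in B's order, with NA for elements of B that don't occur in A:
A <- c(1,2,45,3,10,5,11,13,6,7)
B <- c(45,3,10)

match(B, A)        # position in A of each element of B
# [1] 3 4 5
which(A %in% B)    # positions in A's order; B elements not found are simply ignored
# [1] 3 4 5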

remove first occurrence data frame R

So I've been playing around with a data frame in R, although I'm still thinking too much in Python and cannot seem to find a solution for my problem.
I have a data frame and one of the columns is a user id. I would like to remove the first occurrence of each number, for instance:
1,2,3,4,3,4,2,1,3,4,6,7,7
I would like to have an output like this:
3,4,2,1,3,4,7
That is, the first time a user_id appears I remove it, but I keep all later occurrences, even if repeated.
With Python I would probably use enumerate or loop over it. For R, I've seen some functions that seem cool, like rle, but I'm not sure how to use them with a data frame.
Any pointers will be really helpful since right now I'm a bit lost about the best approach for this problem.
Thank you all
The function duplicated() is going to be helpful here:
x <- c(1,2,3,4,3,4,2,1,3,4,6,7,7)
> x[duplicated(x)]
[1] 3 4 2 1 3 4 7
This works because duplicated() returns a logical vector indicating whether that element is, well, duplicated:
duplicated(x)
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
You then use this logical vector to subset (extract) the values you want from x. But notice that in the extraction I keep all of the duplicated values, not remove them.
To remove all of the duplicated values (not what you want, but I illustrate regardless), try the negation:
x[!duplicated(x)]
[1] 1 2 3 4 6 7
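Since the question is about a data frame column, the same logical vector can be used to subset rows directly. A sketch assuming the column is called user_id (a made-up name for illustration):
df <- data.frame(user_id = c(1,2,3,4,3,4,2,1,3,4,6,7,7))

# keep only rows whose user_id has already appeared in an earlier row
df[duplicated(df$user_id), , drop = FALSE]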

Different behaviour of intersect on vectors and factors

I try to compare multiple vectors of Entrez IDs (integer vectors) by using Reduce(intersect,...). The vectors are selected from a database using "DISTINCT" so a single vector does not contain duplicates.
length(factor(c(l1$entrez)))
gives the same length (and the same IDs w/o the length function) as
length(c(l1$entrez))
When I compare multiple vectors with
length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
or
length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
the result is not the same. I know that factor!=originalVector but I cannot understand why the result differs although the length and the levels of the initial factors/vectors are the same.
Could somebody please explain the different behaviour of the intersect function on vectors and factors? Is it that the intersect of two factors is again a factor, and duplicates are then treated differently?
Edit - Example:
> head(l1)
entrez
1 1
2 503538
3 29974
4 87769
5 2
6 144568
> head(l2)
entrez
1 1743
2 1188
3 8915
4 7412
5 51082
6 5538
The lists contain around 500 to 20K Entrez IDs each, so the vectors should contain plain integers and should give the intersection among all tested vectors.
> length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
[1] 514
> length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
[1] 338
> length(Reduce(intersect,list(l1$entrez,l2$entrez,l3$entrez,l4$entrez)))
[1] 494
I have to apologize profusely. The different behaviour of the intersect function may be caused by a problem with the data. I have found fields in the dataset containing comma-separated Entrez IDs (22038, 23207, ...). I should have had a more detailed look at the data first. Thank you for the answers and your time. Although I do not understand the different results yet, I am sure that this is the cause of the different behaviour. Can somebody confirm that?
As Roman says, an example would be very helpful.
Nevertheless, one possibility is that your variables l1$entrez, l2$entrez etc have the same levels but in different orders.
intersect converts its arguments via as.vector, which turns factors into character variables. This is usually the right thing to do, as it means that varying level order doesn't make any difference to the result.
Passing factor(l1$entrez) as an argument to intersect also removes the impact of varying level order, as it effectively creates a new factor with level ordering set to the default. However, if you pass c(l1$entrez), you strip the factor attributes off your variable and what you're left with is the raw integer codes which will depend on level ordering.
Example:
a <- factor(letters[1:3], levels=letters)
b <- factor(letters[1:3], levels=rev(letters))
# returns 1 2 3
intersect(c(factor(a)), c(factor(b)))
# returns integer(0)
intersect(c(a), c(b))
I don't see any reason why you should use c() here. Just let R handle factors by itself (although, to be fair, there are other scenarios where you do want to step in).
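As for the edit about fields holding comma-separated Entrez IDs: splitting those fields before intersecting should remove that source of discrepancy. A rough sketch, assuming the entrez columns contain values such as "22038, 23207" (clean_ids is a made-up helper name):
clean_ids <- function(v) {
  # split comma-separated fields, trim whitespace, drop duplicates
  unique(trimws(unlist(strsplit(as.character(v), ","))))
}

length(Reduce(intersect,
              lapply(list(l1$entrez, l2$entrez, l3$entrez, l4$entrez), clean_ids)))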
