Comparison of two vectors of unequal length - r

I was trying to subset a data frame based on whether the values in one vector appear in another vector:
x <- c( 1,2,3,1,2,3 )
df <- data.frame(x=x,y=x)
df[ df$x == c(1,2), ]
expecting to get this:
x y
1 1 1
2 2 2
4 1 1
5 2 2
but I didn't, I got this:
x y
1 1 1
2 2 2
Disregarding the fact that I really wanted this (occurred to me a minute later):
df[ df$x %in% c(1,2), ]
What is the logic behind the result of this:
x == c(1,2)
being this:
[1] TRUE TRUE FALSE FALSE FALSE FALSE
I don't really get it. I am aware that this is likely a duplicate, but I couldn't find one.

It is based on the recycling of c(1,2) to the length of 'x', i.e. we are comparing df$x with
rep(c(1,2),length.out= nrow(df))
#[1] 1 2 1 2 1 2
df$x ==rep(c(1,2),length.out= nrow(df))
#[1] TRUE TRUE FALSE FALSE FALSE FALSE
In other words, == compares each element of 'x' with the element at the same position in the recycled c(1,2); it does not check whether each element of 'x' occurs anywhere in c(1,2).
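A minimal sketch of the difference, using the vector from the question. The variable names (recycled, elementwise, membership) are only illustrative.

```r
x <- c(1, 2, 3, 1, 2, 3)

# What == effectively compares against: c(1, 2) stretched to the length of x
recycled <- rep(c(1, 2), length.out = length(x))

# Position-by-position comparison (what the question's code does)
elementwise <- x == c(1, 2)

# Membership test (what the question actually wanted)
membership <- x %in% c(1, 2)

identical(elementwise, x == recycled)
elementwise
membership
```

Because the length of x is a multiple of the length of c(1, 2), R recycles silently here. When the longer length is not a multiple of the shorter one, R still recycles but emits a warning.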

Related

How to match multiple columns without merge?

I have these two data frames:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I tried to get exact matches. An exact match is when ID and Date are the same in both data frames; for example, "TRZ00897" "2022-08-21" is an exact match because it is present in both.
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE, because "FFF" "2022-09-22" is in both data frames. It is FALSE because the Date "2022-09-22" already occurs in Data2 at index position 1, and match returns only the first position.
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So for the last element the index positions are 6 and 1, which are not equal, hence FALSE.
How can I change this? Which function should I use to get the correct answer?
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.
Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical output, use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
Try match with asplit. Since the two data frames have different column names, I have to remove the names manually with unname; this step can be skipped if both have the same names.
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another option, though a memory-expensive one, is interaction:
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6
With mapply and %in%:
apply(mapply(`%in%`, Data1, Data2), 1, all)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit: for a subset of columns:
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
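One thing that may be worth checking with the column-wise %in% approach: it tests each column on its own, so a row can pass even when its ID and Date never occur together in the other data frame. The small data frames below are made up for illustration (they reuse the question's column names) and compare it against the paste-based key.

```r
# Hypothetical data: the same IDs and dates appear in both,
# but they are paired differently.
Data1 <- data.frame(ID1 = c("TRZ00897", "FFF"),
                    Date1 = c("2022-09-22", "2022-08-21"))
Data2 <- data.frame(ID = c("TRZ00897", "FFF"),
                    Date = c("2022-08-21", "2022-09-22"))

# Column-wise test: is each ID somewhere in ID, and each date somewhere in Date?
columnwise <- apply(mapply(`%in%`, Data1, Data2), 1, all)

# Pair-wise test: does the combined ID/Date key occur in Data2?
pairwise <- paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)

columnwise
pairwise
```

Here columnwise is TRUE for both rows while pairwise is FALSE for both, so the two approaches only agree when the values in each column happen not to cross-match, as in the question's data.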

Compare column to vector in R

I want to compare every row in a column in a dataframe to a single vector.
temp_df <- data.frame(x = c(3, 2, 1),
y = c(4, 4, 2))
> temp_df
x y
1 3 4
2 2 4
3 1 2
I want to compare each y to every single x to see if y is greater than all of the x values. If the y is not greater than all x values then I want to return FALSE.
I can achieve this by looping through my dataframe but I want to avoid that. This is the result I am seeking:
> temp_df
x y z
1 3 4 TRUE
2 2 4 TRUE
3 1 2 FALSE
I am trying to do this in base R but am open to other solutions also.
We can use
temp_df$z <- sapply(temp_df$y, function(u) all(u > temp_df$x))
temp_df$z
[1] TRUE TRUE FALSE
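Since a y value is greater than every x exactly when it is greater than the largest x, the same column can probably be computed without sapply. This is a sketch that assumes x contains no NA values.

```r
temp_df <- data.frame(x = c(3, 2, 1),
                      y = c(4, 4, 2))

# Assumption: no NA in x; otherwise max() returns NA unless na.rm = TRUE is set
temp_df$z <- temp_df$y > max(temp_df$x)
temp_df$z
```

This does one comparison per row instead of one per row and x value, which may matter on larger data frames.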

Conditional statement in sum() function of R

I've started learning R and got a piece of code in which a statement is:
if(sum(C == C[i]) == 1)# C is simply a vector and i is index of a value in this vector which the user specifies in an argument.
How can you pass a conditional statement as an argument of a function? Also explain the meaning of this statement.
Thank you.
Let's take an example to understand
Consider C as a numeric vector from 1 to 10 and let's take i as 3
C <- 1:10
i <- 3
So when we do
C == C[i]
#[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
it compares every element of C with C[i], which is 3, and returns a logical vector that is TRUE only at the 3rd index.
When we sum this logical vector, we get the count of TRUE values (sum treats FALSE as 0 and TRUE as 1), which in this case is 1
sum(C == C[i])
#[1] 1
which is then compared to 1 again to make sure that there is only one C[i] in C
sum(C == C[i]) == 1
#[1] TRUE
The condition is FALSE if C[i] is repeated in C. For example,
C <- c(1:10, 3) #Adding an extra 3 in the end
C
#[1] 1 2 3 4 5 6 7 8 9 10 3
i <- 3
sum(C == C[i]) == 1
#[1] FALSE
The bottom line is the condition is TRUE if C[i] occurs only once in C.
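To make the pieces explicit, here is a small sketch that stores the logical vector before summing it. The helper occurs_once is a hypothetical name introduced only for this illustration; it is not part of the original code.

```r
C <- c(1:10, 3)
i <- 3

# The "condition" is just an expression that evaluates to a logical vector
matches <- C == C[i]

# Positions where C equals C[i]
which(matches)

# sum() coerces TRUE to 1 and FALSE to 0, so this counts the matches
sum(matches)

# Hypothetical helper wrapping the test from the question
occurs_once <- function(v, i) sum(v == v[i]) == 1
occurs_once(C, 3)
occurs_once(C, 4)
```

So nothing special is being passed to sum: the comparison is evaluated first, and sum receives its result like any other vector.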

counting the larger value

Completely new to R and am trying to count how many numbers in a list are larger than the one right before.
This is what I have so far,
count <- 0
number <- function(value) {
for (i in 1:length(value))
{ if(value[i+1] > value[i])
{count <- count + 1}
}
}
x <- c(1,2,1,1,3,5)
number(x)
The output should be 3 based on the list.
Any help or advice would be greatly appreciated!
A base R alternative would be diff
sum(diff(x) > 0)
#[1] 3
Or we can drop the first element from one copy of the vector and the last element from another, then compare the two element by element.
sum(x[-1] > x[-length(x)])
#[1] 3
where
x[-1]
#[1] 2 1 1 3 5
x[-length(x)]
#[1] 1 2 1 1 3
You can lag your vector and count how many times your initial vector is greater than your lagged vector:
library(dplyr)
sum(x>lag(x), na.rm = TRUE)
In detail, lag(x) gives:
> lag(x)
[1] NA 1 2 1 1 3
so x > lag(x) gives
> x>lag(x)
[1] NA TRUE FALSE FALSE TRUE TRUE
The sum of the above, with the NA removed, is 3.
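For comparison, a loop-based version along the lines of the question's attempt might look like the sketch below. The function name count_increases is hypothetical. Compared with the original, the counter lives inside the function, the loop starts at the second element so it never reads past the end of the vector, and the count is returned.

```r
# Hypothetical helper: counts elements that are larger than their predecessor
count_increases <- function(value) {
  count <- 0
  # seq_along(value)[-1] is 2, 3, ..., length(value); empty for length 0 or 1
  for (i in seq_along(value)[-1]) {
    if (value[i] > value[i - 1]) {
      count <- count + 1
    }
  }
  count  # the last expression is the return value
}

x <- c(1, 2, 1, 1, 3, 5)
count_increases(x)
```

The vectorised forms are usually preferable in R, but the loop shows what they compute step by step.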

Subset a data frame based on value pairs stored in independent ordered vectors

I have an R dataframe that I need to subset data from. The subsetting will be based on two columns in the dataframe. For example:
A <- c(1,2,3,3,5,1)
B <- c(6,7,8,9,8,8)
Value <- c(9,5,2,1,2,2)
DATA <- data.frame(A,B,Value)
This is how DATA looks
A B Value
1 6 9
2 7 5
3 8 2
3 9 1
5 8 2
1 8 2
I want those rows of data for which (A,B) combination is (1,6) and (3,8). These pairs are stored as individual (ordered) vectors of A and B:
AList <- c(1,3)
BList <- c(6,8)
Now, I am trying to subset the data basically by comparing if A column is present in AList AND B column is present in BList
DATA[(DATA$A %in% AList & DATA$B %in% BList),]
The subsetted result is shown below. In addition to the value pairs (1,6) and (3,8) I am also getting (1,8). Basically, this filter has given me value pairs for all combinations in AList and BList. How do I restrict it to just (1,6) and (3,8)?
A B Value
1 6 9
3 8 2
1 8 2
This is my desired result:
A B Value
1 6 9
3 8 2
This is a job for merge:
KEYS <- data.frame(A = AList, B = BList)
merge(DATA, KEYS)
# A B Value
# 1 1 6 9
# 2 3 8 2
Edit: after the OP expressed a preference for a logical vector in the comments below, I would suggest one of the following.
Use merge:
df.in.df <- function(x, y) {
common.names <- intersect(names(x), names(y))
idx <- seq_len(nrow(x))
x <- x[common.names]
y <- y[common.names]
x <- transform(x, .row.idx = idx)
idx %in% merge(x, y)$.row.idx
}
or interaction:
df.in.df <- function(x, y) {
common.names <- intersect(names(x), names(y))
interaction(x[common.names]) %in% interaction(y[common.names])
}
In both cases:
df.in.df(DATA, KEYS)
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
You could try match with an appropriate nomatch argument:
sub <- match(DATA$A, AList, nomatch=-1) == match(DATA$B, BList, nomatch=-2)
sub
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
DATA[sub,]
# A B Value
#1 1 6 9
#3 3 8 2
A paste based approach would also be possible:
sub <- paste(DATA$A, DATA$B, sep=":") %in% paste(AList, BList, sep=":")
sub
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
DATA[sub,]
# A B Value
#1 1 6 9
#3 3 8 2
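One case where the match-based and paste-based filters can differ is when a key vector contains a repeated value, because match returns only the first position. The sketch below reuses the question's DATA but swaps in a hypothetical key set, (1,6) and (1,8), to show this.

```r
A <- c(1, 2, 3, 3, 5, 1)
B <- c(6, 7, 8, 9, 8, 8)
Value <- c(9, 5, 2, 1, 2, 2)
DATA <- data.frame(A, B, Value)

# Hypothetical keys with a repeated A value: the pairs (1,6) and (1,8)
AList <- c(1, 1)
BList <- c(6, 8)

# match() always reports position 1 for A == 1, so the pair (1,8) is missed
by_position <- match(DATA$A, AList, nomatch = -1) ==
  match(DATA$B, BList, nomatch = -2)

# Combined keys compare whole pairs, so both pairs are found
by_key <- paste(DATA$A, DATA$B, sep = ":") %in% paste(AList, BList, sep = ":")

by_position
by_key
```

With unique values in AList and BList, as in the question, both filters give the same rows.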
