Uncover values of one data frame with another mask data frame - r

Suppose I have two data frames A and B_mask, where
A <- as.data.frame( matrix(1:20,nrow=4) )
V1 V2 V3 V4 V5
1 1 5 9 13 17
2 2 6 10 14 18
3 3 7 11 15 19
4 4 8 12 16 20
And suppose also,
B_mask <- matrix(FALSE, nrow=4, ncol=5)
B_mask[2:3, 1:3] <- TRUE
B_mask <- as.data.frame(B_mask)
V1 V2 V3 V4 V5
1 FALSE FALSE FALSE FALSE FALSE
2 TRUE TRUE TRUE FALSE FALSE
3 TRUE TRUE TRUE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE
How does one get a result data frame such that:
If an entry in B_mask is TRUE, the corresponding value of A is uncovered? For example, because B_mask[2,1] = TRUE, I would want result[2,1] = A[2,1] = 2.
If an entry in B_mask is FALSE, the corresponding value of A is covered with NA? For example, because B_mask[3,4] = FALSE, I would want result[3,4] = NA.
Thanks!

Create a copy of the dataset ('res'), convert the FALSE values in 'B_mask' to NA (via NA^!B_mask), use that index to subset the corresponding values of 'A', and assign the output back to 'res' with the structure kept intact ([]):
res <- A
res[] <- as.matrix(A)[as.logical(NA^!B_mask)]
Or, as @alexis_laz mentioned, this can also be done with:
is.na(res) <- !as.matrix(B_mask)
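For reference, a quick check (a sketch run on the A and B_mask defined above); either one-liner should produce the same masked data frame:
res <- A
is.na(res) <- !as.matrix(B_mask)   # every FALSE-masked cell becomes NA
res
#   V1 V2 V3 V4 V5
# 1 NA NA NA NA NA
# 2  2  6 10 NA NA
# 3  3  7 11 NA NA
# 4 NA NA NA NA NA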

Related

How to match multiple columns without merge?

I have those two df's:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I tried to find exact matches. An exact match means ID and Date are the same in both df's; for example, "TRZ00897" / "2022-08-21" is an exact match because that combination is present in both df's.
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE, because "FFF" "2022-09-22" is in both df's. The reason it is FALSE is that the Date "2022-09-22" already occurred in Data2 at index position 1.
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So at the end the index positions are 6 and 1, which are not equal --> FALSE.
How can I change this? Which function should I use to get the correct answer?
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.
Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical output use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
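One small caveat (not raised in the original answer): with the default space separator, paste() could in principle glue different ID/Date pairs into identical strings for other data; an unlikely separator guards against that. A sketch using the Data1/Data2 above:
key1 <- paste(Data1$ID1, Data1$Date1, sep = "\r")
key2 <- paste(Data2$ID, Data2$Date, sep = "\r")
match(key1, key2)
# [1]  4  3 NA NA  1  6
key1 %in% key2
# [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE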
Try match with asplit (since the two data frames have different column names, I have to manually remove the names using unname; this can be avoided if both have the same names):
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another, more memory-expensive option is using interaction:
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6
With mapply and %in%:
apply(mapply(`%in%`, Data1, Data2), 1, all)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
Or, equivalently, with rowSums:
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit: for a subset of columns:
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
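Note (an observation, not from the original answer) that the column-wise %in% test checks membership per column independently, so it can report TRUE for a combination that never occurs on the same row of Data2. A sketch with one hypothetical extra row added to Data1:
extra <- rbind(Data1, data.frame(ID1 = "TRZ00897", Date1 = "2022-03-22"))
apply(mapply(`%in%`, extra, Data2), 1, all)
# [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE   (last element TRUE despite no such row in Data2)
paste(extra$ID1, extra$Date1) %in% paste(Data2$ID, Data2$Date)
# [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE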

recoding based on two conditions in r

I have an example dataset that looks like:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
> data
category
1 A
2 B
3 C
4 X1_theta
5 X2_theta
6 AB_theta
7 BC_theta
8 CD_theta
I am trying to generate a logical variable that is TRUE when the category variable contains "theta". However, I would like to assign FALSE when the cell value contains "X1" or "X2".
Here is what I did:
data$logic <- str_detect(data$category, "theta")
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta TRUE
5 X2_theta TRUE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Here, every cell that contains "theta" gets the logical value TRUE.
Then I wrote the line below to assign FALSE when the cell value contains "X":
data$logic <- ifelse(grepl("X", data$category), "FALSE", "TRUE")
> data
category logic
1 A TRUE
2 B TRUE
3 C TRUE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
But this, of course, overwrote the previous result.
What I would like to get is to combine two conditions:
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Any thoughts?
Thanks
We can create 'logic' by detecting the substring 'theta' at the end and requiring that the starting (^) character is not 'X' ([^X]):
library(dplyr)
library(stringr)
library(tidyr)
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"))
If we need to split the column into separate columns based on the conditions
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"),
         category = case_when(logic ~ str_replace(category, "_", ","),
                              TRUE ~ as.character(category))) %>%
  separate(category, into = c("split1", "split2"), sep = ",", remove = FALSE)
# category split1 split2 logic
#1 A A <NA> FALSE
#2 B B <NA> FALSE
#3 C C <NA> FALSE
#4 X1_theta X1_theta <NA> FALSE
#5 X2_theta X2_theta <NA> FALSE
#6 AB,theta AB theta TRUE
#7 BC,theta BC theta TRUE
#8 CD,theta CD theta TRUE
Or in base R
data$logic <- with(data, grepl("^[^X].*theta$", category))
Another option is to combine two grepl conditions:
data$logic <- with(data, grepl("theta$", category) & !grepl("^X\\d+", category))
data$logic
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
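The same two-condition logic can also be written with stringr (a sketch, not from the original answers):
library(stringr)
data$logic <- str_detect(data$category, "theta$") & !str_detect(data$category, "^X\\d+")
data$logic
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE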
Not the cleanest in the world (since it adds 2 unnecessary cols) but it gets the job done:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
data$logic1 <- ifelse(grepl('X',data$category), FALSE, TRUE)
data$logic2 <- ifelse(grepl('theta',data$category),TRUE, FALSE)
data$logic <- ifelse((data$logic1 == TRUE & data$logic2 == TRUE), TRUE, FALSE)
print(data)
I think you can also remove the logic1 and logic2 cols if you want but I usually don't bother (I'm a messy coder haha).
Hope this helped!
EDIT: akrun's grepl solution does what I'm doing way more cleanly (as in, it doesn't require the extra cols). I definitely recommend that approach!

R's duplicated() seems to select the wrong duplicates

I've noticed a couple of times now that when I'm using R to identify duplicates, sometimes it seems to identify the wrong cases.
Here's a data frame that has three columns, each which may be holding duplicate values. I want to isolate the cases that are duplicates of another case on all three variables.
set.seed(100)
test <- data.frame(id = sample(1:15, 20, replace = TRUE),
                   cat1 = sample(letters[1:2], 20, replace = TRUE),
                   cat2 = sample(letters[1:2], 20, replace = TRUE))
Which gives me:
id cat1 cat2
1 5 b a
2 4 b b
3 9 b b
4 1 b b
5 8 a b
6 8 a a
7 13 b b
8 6 b b
9 9 b a
10 3 a a
11 10 a a
12 14 b a
13 5 a a
14 6 b a
15 12 b b
16 11 b a
17 4 a a
18 6 b a
19 6 b b
20 11 a a
I've tried this a couple of ways, such as:
duplicated(test$id) & duplicated(test$cat1) & duplicated(test$cat2)
But this just results in the same as duplicated(test$id):
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[17] TRUE TRUE TRUE TRUE
So instead I tried duplicated(test$id, test$cat1, test$cat2), which produces different results:
[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
[17] FALSE TRUE FALSE FALSE
But it is still incorrect: if I call these cases from the data frame we get:
> test[which(duplicated(test$id, test$cat1, test$cat2)),]
id cat1 cat2
1 5 b a
2 4 b b
3 9 b b
5 8 a b
8 6 b b
14 6 b a
16 11 b a
18 6 b a
As you can see, these are not the rows we should be getting (were it doing what I thought it would do); as far as I can see, the expected rows are:
18 6 b a
19 6 b b
Does anyone know why it's coming up with these results, and where I'm going wrong using it? Is there a simple (ideally non-verbose) way of doing this?
We need to apply duplicated on a data.frame or matrix or vector
i1 <- duplicated(test[c('id', 'cat1')])
i2 <- duplicated(cbind(test$id, test$cat1))
identical(i1, i2)
#[1] TRUE
and not on more than one data.frame or matrix or vector
i3 <- duplicated(test$id, test$cat1)
identical(i1, i3)
#[1] FALSE
It is specified in the documentation (?duplicated):
duplicated(x, incomparables = FALSE, ...)
where
x: a vector or a data frame or an array or NULL.
and not 'x1', 'x2', etc.
As @Aaron mentioned in the comments, to subset the duplicates from the OP's data
test[duplicated(test),]
and if we want all rows involved in duplication, including the first occurrences, then
test[duplicated(test)|duplicated(test, fromLast = TRUE),]
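With the OP's test data (reading the values off the printed data frame above), these should return:
test[duplicated(test), ]
#    id cat1 cat2
# 18  6    b    a
# 19  6    b    b
test[duplicated(test) | duplicated(test, fromLast = TRUE), ]
#    id cat1 cat2
# 8   6    b    b
# 14  6    b    a
# 18  6    b    a
# 19  6    b    b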
Taking duplicates of columns separately is not the same as taking duplicates of a data frame or matrix. This example makes it clearer:
df <- data.frame(x = c(1, 2, 1),
                 y = c(1, 3, 3))
df$dupe  <- duplicated(df$x) & duplicated(df$y)
df$dupe2 <- duplicated(df[, c("x", "y")])
df
Using your method, duplicated says "When I hit the third row, x already had a 1 so it's duplicated. y already had a 3 so it's duplicated." This doesn't mean that it already saw a row where x = 1 and y = 3.
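For reference, df should now print as follows; only dupe, which checks the columns separately, flags row 3, while dupe2, which checks the rows as a whole, does not:
  x y  dupe dupe2
1 1 1 FALSE FALSE
2 2 3 FALSE FALSE
3 1 3  TRUE FALSE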

In R, how to split a data into list of multiple subsets based on multiple categorical variables?

In my data frame, I have a lot of logical variables and I want to split the data frame into multiple subsets, conditional on each logical variable being TRUE. For example, let's suppose this is my df:
V1 V2 V3 V4
1 TRUE TRUE FALSE 2
2 TRUE FALSE TRUE 5
3 FALSE TRUE FALSE 4
So I want to have three subsets:
[1]
V1 V2 V3 V4
1 TRUE TRUE FALSE 2
2 TRUE FALSE TRUE 5
[2]
V1 V2 V3 V4
1 TRUE TRUE FALSE 2
2 FALSE TRUE FALSE 4
[3]
V1 V2 V3 V4
1 TRUE FALSE TRUE 5
Thanks for any help!
A simple lapply loop should do the trick:
read.table(textConnection("V1 V2 V3 V4
T T F 2
T F T 5
F T F 4"), header=T) -> df
lapply(1:(ncol(df) - 1), function(i) {
  subset(df, df[[i]])
})
[[1]]
V1 V2 V3 V4
1 TRUE TRUE FALSE 2
2 TRUE FALSE TRUE 5
[[2]]
V1 V2 V3 V4
1 TRUE TRUE FALSE 2
3 FALSE TRUE FALSE 4
[[3]]
V1 V2 V3 V4
2 TRUE FALSE TRUE 5
Your problem is very simple. For the first subset you can use:
subset1 <- df[df[, 1] == TRUE, ]
which keeps the rows where the first column, V1, is TRUE.
Of course, if you want to set up a whole function for that job for later use, then @thc's solution is best. But in case you just need to get 3 subsets nicely and quickly, try the above.
I'll let you figure out how to do the rest with subset2 and subset3.
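Since the real data apparently has many logical columns, here is a variant of the lapply idea (a sketch, assuming the df read in above) that selects the logical columns by type and names the resulting list:
logical_cols <- names(df)[vapply(df, is.logical, logical(1))]
subsets <- setNames(lapply(logical_cols, function(nm) df[df[[nm]], ]), logical_cols)
subsets$V2   # the rows of df where V2 is TRUE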

Count exact row matches in a dataframe

Let's say I have a dataframe df and an example row sample <- df[1,].
How can I count occurrences of sample in df?
From what I found so far, it should be something like sum(df == sample), but I get an error:
‘==’ only defined for equally-sized data frames.
For example:
df <- data.frame(matrix(rnorm(20), nrow=10))
df <- rbind(df, df[1,])
sample <- df[1,]
unlist(sample)[col(df)]==df
X1 X2
1 TRUE TRUE
2 FALSE FALSE
3 FALSE FALSE
4 FALSE FALSE
5 FALSE FALSE
6 FALSE FALSE
7 FALSE FALSE
8 FALSE FALSE
9 FALSE FALSE
10 FALSE FALSE
11 TRUE TRUE
Use merge then count rows:
# reproducible example data
set.seed(1)
df1 <- data.frame(matrix(rnorm(20), nrow = 10))
# add duplicate row
df1 <- rbind(df1, df1[1,])
df1_sample <- df1[1,]
# merge and get number of rows
nrow(merge(df1_sample, df1))
# [1] 2
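A base-R alternative (a sketch using the df1 and df1_sample above): compare every column against the sample row and count the rows where all columns agree. This relies on the duplicated row being an exact copy, so exact floating-point comparison is fine here.
matches <- Reduce(`&`, Map(`==`, df1, df1_sample))   # TRUE where a row equals the sample in every column
sum(matches)
# [1] 2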
