Identify index that is not shared between two variables in R

I would like to identify the indices for which there is not a match between two variables. The following code identifies the matches rather than the mismatches:
x <- c("a", "b", "c")
y <- c("a", "z", "c")
which(unique(as.character(x))%in% unique(y))
Thoughts on how to get this to identify the False indices (or in this example, 2)?

which(!(unique(as.character(x))%in% unique(y)))
cdeeterman is basically correct; you just need to make sure that the not (!) applies to the entire expression unique(as.character(x)) %in% unique(y), which is why it is wrapped in parentheses.

You could also compare the two vectors element by element with ==, where x == y is TRUE at every position where the elements are equal and FALSE where they differ:
x = c("a", "b", "c")
y = c("a", "z", "c")
z = x == y
which(z == FALSE)
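
Note that the two approaches answer slightly different questions: == compares elements position by position, while %in% only asks whether a value occurs anywhere in the other vector. A minimal sketch of the difference, where y2 is a made-up reordering of x added purely for illustration:

```r
x <- c("a", "b", "c")
y <- c("a", "z", "c")

pos_mismatch <- which(x != y)        # positions where the elements differ
set_mismatch <- which(!(x %in% y))   # positions of x whose value occurs nowhere in y

# y2 is a hypothetical vector: same values as x, in a different order
y2 <- c("c", "a", "b")
pos_mismatch2 <- which(x != y2)      # every position differs
set_mismatch2 <- which(!(x %in% y2)) # no value of x is missing from y2
```

For the vectors in the question both approaches give 2; they only diverge when the order differs or values repeat.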

What about setdiff?
> which( y %in% setdiff(y,x) )
[1] 2

Related

Subsetting a dataframe using %in% and ! in R

I have the following dataframe.
Test_Data <- data.frame(x = c("a", "b", "c"), y = c("d", "e", "f"), z = c("g", "h", "i"))
x y z
1 a d g
2 b e h
3 c f i
I would like to filter it based on multiple conditions. Specifically, I would like to remove any record that has the value "b" in column x or "f" in column y. My subsetted result would be:
x y z
1 a d g
I tried the following solutions:
View(Test_Data %>% subset(!x %in% "b" | !y %in% "f"))
View(Test_Data %>% subset(!x %in% "b" & !y %in% "f"))
View(Test_Data %>% subset(!(x %in% "b" | y %in% "f")))
The last two solutions give me the result I want, however the first one is the only one that makes 'sense' to me because it uses the OR operator and I only need one of the conditions to be met. Why do the last solutions work but not the first?
The subset operation returns the rows that you want to KEEP.
However your set of rules defines the rows you want NOT TO KEEP. Therefore you're getting confused with the negation logic.
The rows you don't want to keep follow a series of rules: r1 | r2 | ....
The NEGATION is: !(r1 | r2 | ...), or: !r1 & !r2 & ...
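
The rule above (De Morgan's law) can be checked directly on the example data. This is only a sketch in base R; the names to_drop, keep_a, keep_b and first_attempt are made up for illustration:

```r
Test_Data <- data.frame(x = c("a", "b", "c"),
                        y = c("d", "e", "f"),
                        z = c("g", "h", "i"))

# Rows we do NOT want: x is "b" OR y is "f"
to_drop <- Test_Data$x %in% "b" | Test_Data$y %in% "f"

# Two equivalent ways to express the rows we want to keep
keep_a <- !to_drop
keep_b <- !(Test_Data$x %in% "b") & !(Test_Data$y %in% "f")

# The first attempt from the question: TRUE whenever at least one
# of the two tests fails, which is the case for every row here
first_attempt <- !(Test_Data$x %in% "b") | !(Test_Data$y %in% "f")

result <- Test_Data[keep_a, ]
```

Here keep_a and keep_b are identical and select only the first row, while first_attempt is TRUE for all three rows, so nothing gets removed.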

Finding in which vector does the element belong to

Suppose I have 3 vectors:
a = c("A", "B", "C")
b = c("D", "E", "F")
c = c("G", "H", "I")
then I have an element:
element = "E"
I want to find which list my element belongs to. In this case, list b.
I would appreciate a more general solution, because my real data set has more than a hundred lists.
element = "E"
names(our_lists)[sapply(our_lists, `%in%`, x = element)]
# [1] "b"
Data
our_lists <- list(
a = c("A", "B", "C"),
b = c("D", "E", "F"),
c = c("G", "H", "I")
)
Using grep.
element <- "E"
l <- mget(c("a", "b", "c"))
names(l)[grep(element, l)]
# [1] "b"
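
One caveat with the grep approach, as far as I understand it: grep coerces the list with as.character, so each vector is deparsed to text such as c("A", "B", "C") and the element is treated as a regular expression. A short sketch of how that can produce false positives, next to an exact-match alternative (the element "c" is chosen only to trigger the problem):

```r
l <- list(
  a = c("A", "B", "C"),
  b = c("D", "E", "F"),
  c = c("G", "H", "I")
)

# Regex search over the deparsed vectors: "c" matches the leading c( of each one
regex_hits <- names(l)[grep("c", l)]

# Exact membership test: "c" is not an element of any vector
exact_hits <- names(l)[vapply(l, function(v) "c" %in% v, logical(1))]
```

With the element "E" from the question both versions return "b", so the difference only shows up for short or special-character elements.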
If you keep the data in individual objects, you need to check for the element in each one individually. Get them in a list.
list_data <- mget(c('a', 'b', 'c'))
names(Filter(any, lapply(list_data, `==`, element)))
#[1] "b"
If all your vectors have the same length then a vectorised idea can be,
c('a', 'b', 'c')[ceiling(which(c(a, b, c) == 'E') / length(a))]
#[1] "b"
You can use dplyr::lst, which creates a named list from variable names, and then purrr::keep to keep only the vectors that contain your element.
library(tidyverse)
lst(a, b, c) %>%
  keep(~ element %in% .x) %>%
  names()
output:
[1] "b"
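
If you need to look up many elements across a hundred or more vectors, one possible approach is to build a lookup table once with stack() from base R and then use match(). This is a sketch, assuming the vectors are already in a named list and that each element appears in at most one vector; the names lookup, elements and found are made up:

```r
our_lists <- list(
  a = c("A", "B", "C"),
  b = c("D", "E", "F"),
  c = c("G", "H", "I")
)

# stack() returns a data frame with columns `values` and `ind` (the list name)
lookup <- stack(our_lists)

# Look up several elements at once; "Z" is not in any vector
elements <- c("E", "G", "Z")
found <- as.character(lookup$ind[match(elements, lookup$values)])
```

Elements that are not found come back as NA, and unlike the ceiling() trick this does not require the vectors to have the same length.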

How to delete entire row for x if y appears at least once in same column?

I would like to run a code in which I delete the entire row for entries of "x", if "y" appears at least once in the same column of "var4". I can't find any solution in R. Below is what I tried.
In the code below, I tried to tell R that if var4 contains at least one y, all rows containing x should be filtered out/removed.
Example for df:
var1 var2 var3 var4
a b b a
b a b x
a b a x
a a a y
if (all(df$var4 %in% c("y"))) {
df <- filter(!var4 %in% c("x"))
}
So, I would like to delete rows 2&3 because y appears in var4. Unfortunately the code above doesn't return any change in df, even though y appears several times in var4.
Many thanks. I appreciate any kind of recommendation.
In the OP's code, the filter statement is never given the data frame, and all() is TRUE only when every value of var4 is "y", so the if block never runs. Instead, it can be
library(dplyr)
if ("y" %in% df$var4) {
  df <- df %>%
    filter(!var4 %in% "x")
}
df
# var1 var2 var3 var4
#1 a b b a
#2 a a a y
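It can also be written without the if, negating the combined condition so that all rows are kept when there is no "y" in var4:
df %>%
  filter(!("y" %in% var4 & var4 %in% "x"))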
data
df <- structure(list(var1 = c("a", "b", "a", "a"), var2 = c("b", "a",
"b", "a"), var3 = c("b", "b", "a", "a"), var4 = c("a", "x", "x",
"y")), class = "data.frame", row.names = c(NA, -4L))
If you want to use base R commands, df[!df$var4 == "x", ] should do it.
df$var4 == "x" will return a vector of TRUE/FALSE
> df$var4 == "x"
[1] FALSE TRUE TRUE FALSE
The ! in front of it flips TRUE and FALSE
> !df$var4 == "x"
[1] TRUE FALSE FALSE TRUE
Then the bracket notation refers to subsetting the object by rows, then columns.
df[rows,columns]
Putting it all together, the following will subset rows based on the criteria supplied, and include all columns.
df[!df$var4 == "x", ]
Note that leaving nothing after the comma means all columns are included.
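
The base R line above drops the "x" rows unconditionally. If the removal should only happen when "y" appears at least once in var4, as the question asks, a minimal sketch could look like this (the data frame is rebuilt here so the snippet runs on its own):

```r
df <- data.frame(
  var1 = c("a", "b", "a", "a"),
  var2 = c("b", "a", "b", "a"),
  var3 = c("b", "b", "a", "a"),
  var4 = c("a", "x", "x", "y")
)

# Only drop the "x" rows when at least one "y" is present in var4
if (any(df$var4 == "y")) {
  df <- df[df$var4 != "x", ]
}
```

This assumes var4 has no missing values; with NA present, df$var4 != "x" returns NA for those rows and the subset would contain all-NA rows.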

Select and count the number of duplicate items with two different outcome values?

Long-time follower, thanks so much for all your help over the years! I have a question that might have an easy answer, but I failed in googling it, and trying various subsetting and bracket notation also fell short. I'm betting someone here has encountered a similar problem.
I have a long-form data set with a set of duplicate ids. I also have a third variable that might be different for the duplicate. By example, if you recreate my data set:
x <- c("a", "a", "b", "c", "c", "d", "d", "d")
y <- c("z", "z", "z", "y", "y", "y", "x", "x")
z <- c(10, 20, 10, 10, 10, 10, 10, 20)
df <- cbind(x, y, z)
df <- as.data.frame(df)
names(df) <- c("id1", "id2", "var1")
df
I want to select the rows in which an id2 has BOTH a 10 and a 20 when they are connected to the same id1. For example, 'z' has two observations connected to id1 ('a') with two different var1 values (a '10' and a '20').
I want to select these cases, as well as count how many cases like this are in the overall data set. Thanks in advance!
One way is with ddply from the plyr package. Something like this:
> library(plyr)
> ddply(df, c('id2', 'id1'), function(x) if(length(unique(x$var1))==2) x)
id1 id2 var1
1 d x 10
2 d x 20
3 a z 10
4 a z 20
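
The question also asks for a count, which the ddply call does not give directly. A possible base R sketch using ave() is shown below. It assumes var1 is stored as numeric (the cbind() in the question turns it into character), and the names n_values, selected and n_pairs are made up for illustration:

```r
df <- data.frame(
  id1  = c("a", "a", "b", "c", "c", "d", "d", "d"),
  id2  = c("z", "z", "z", "y", "y", "y", "x", "x"),
  var1 = c(10, 20, 10, 10, 10, 10, 10, 20)
)

# For every row, the number of distinct var1 values within its (id1, id2) group
n_values <- ave(df$var1, df$id1, df$id2,
                FUN = function(v) length(unique(v)))

# Rows whose group contains two different var1 values
selected <- df[n_values == 2, ]

# Number of (id1, id2) pairs that qualify
n_pairs <- nrow(unique(selected[c("id1", "id2")]))
```

Like the ddply answer, this tests for exactly two distinct values rather than specifically for 10 and 20, which is equivalent only if those are the only values var1 can take.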

Subset a data frame using OR when the column contains a factor

I would like to make a subset of a data frame in R that is based on one OR another value in a column of factors but it seems I cannot use | with factor values.
Example:
# fake data
x <- sample(1:100, 9)
nm <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
fake <- cbind(as.data.frame(nm), as.data.frame(x))
# subset fake to only rows with name equal to a or b
fake.trunk <- fake[fake$nm == "a" | "b", ]
produces the error:
Error in fake$nm == "a" | "b" :
operations are possible only for numeric, logical or complex types
How can I accomplish this?
Obviously my actual data frame has more than 3 values in the factor column so just using != "c" won't work.
You need fake.trunk <- fake[fake$nm == "a" | fake$nm == "b", ]. A more concise way of writing that (especially with more than two conditions) is:
fake[ fake$nm %in% c("a","b"), ]
Another approach would be to use subset() and write
fake.trunk = subset(fake, nm %in% c('a', 'b'))
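
A quick check that the two forms select the same rows. Since nm is a factor, the unused level "c" stays attached after subsetting; droplevels() from base R removes it if that matters for later tables or plots. This is a sketch with a fixed seed and made-up names by_or and by_in:

```r
set.seed(1)  # only so the fake data is reproducible
fake <- data.frame(
  nm = factor(rep(c("a", "b", "c"), each = 3)),
  x  = sample(1:100, 9)
)

by_or <- fake[fake$nm == "a" | fake$nm == "b", ]  # explicit OR of two comparisons
by_in <- fake[fake$nm %in% c("a", "b"), ]         # same rows via %in%

same_rows <- identical(by_or, by_in)

levels_before <- levels(by_in$nm)              # still includes the unused level "c"
levels_after  <- levels(droplevels(by_in)$nm)  # unused level removed
```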
