Subsetting a dataframe using %in% and ! in R

I have the following dataframe.
Test_Data <- data.frame(x = c("a", "b", "c"), y = c("d", "e", "f"), z = c("g", "h", "i"))
x y z
1 a d g
2 b e h
3 c f i
I would like to filter it based on multiple conditions. Specifically, I would like to remove any record that has the value "b" in column x or "f" in column y. My subsetted result would be:
x y z
1 a d g
I tried the following solutions:
View(Test_Data %>% subset(!x %in% "b" | !y %in% "f"))
View(Test_Data %>% subset(!x %in% "b" & !y %in% "f"))
View(Test_Data %>% subset(!(x %in% "b" | y %in% "f")))
The last two solutions give me the result I want, however the first one is the only one that makes 'sense' to me because it uses the OR operator and I only need one of the conditions to be met. Why do the last solutions work but not the first?

The subset operation returns the rows that you want to KEEP.
Your set of rules, however, defines the rows you do NOT want to keep, which is why the negation logic gets confusing.
The rows you don't want to keep satisfy at least one of a series of rules: r1 | r2 | ....
The condition for keeping a row is the NEGATION of that: !(r1 | r2 | ...), which by De Morgan's law is the same as !r1 & !r2 & ....
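As a minimal sketch of the point above, using only base R and the data from the question, the three conditions can be compared directly. The names drop_row, keep_a, keep_b and keep_or are made up for illustration.

```r
# Same data as in the question.
Test_Data <- data.frame(x = c("a", "b", "c"),
                        y = c("d", "e", "f"),
                        z = c("g", "h", "i"))

# Rows to drop: x is "b" OR y is "f".
drop_row <- Test_Data$x %in% "b" | Test_Data$y %in% "f"

# Negating the whole OR equals AND-ing the negated parts (De Morgan).
keep_a <- !drop_row
keep_b <- !(Test_Data$x %in% "b") & !(Test_Data$y %in% "f")
identical(keep_a, keep_b)

# The first attempt keeps every row that fails at least one drop rule,
# which for this data is every row.
keep_or <- !(Test_Data$x %in% "b") | !(Test_Data$y %in% "f")

Test_Data[keep_a, ]   # only row 1 remains
Test_Data[keep_or, ]  # nothing is removed
```

Row 2 fails the rule on y and row 3 fails the rule on x, so the OR of the negated rules lets both through.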

Related

Trying to sort character variable into new variable with new value based on conditions

I want to sort a character variable into two categories in a new variable based on conditions; if the conditions are not met I want it to return "other".
If variable x contains 4 character values "A", "B", "C" & "D", I want to sort them into 2 categories, 1 and 0, in a new variable y, creating a dummy variable.
Ideally I want it to look like this
df <- data.frame(x = c("A", "B", "C" & "D")
y <- if x == "A" | "D" then assign 1 in y
if x == "B" | "C" then assign 0 in y
if x == other then assign NA in y
x y
1 "A" 1
2 "B" 0
3 "C" 0
4 "D" 1
library(dplyr)
df <- df %>% mutate ( y =case_when(
(x %in% df == "A" | "D") ~ 1 ,
(x %in% df == "B" | "C") ~ 1,
x %in% df == ~ NA
))
I got this error message
Error: replacement has 3 rows, data has 2
Here's the proper case_when syntax.
df <- data.frame(x = c("A", "B", "C", "D"))
library(dplyr)
df <- df %>%
mutate(y = case_when(x %in% c("A", "D") ~ 1,
x %in% c("B", "C") ~ 0,
TRUE ~ NA_real_))
df
#> x y
#> 1 A 1
#> 2 B 0
#> 3 C 0
#> 4 D 1
You're combining syntaxes in a way that makes sense in speech but not in code.
Generally you can't use foo == "G" | "H". You need to use foo == "G" | foo == "H", or the handy shorthand foo %in% c("G", "H").
Similarly, x %in% df == "A" doesn't make sense. x %in% df makes sense. df == "A" makes sense. Putting them together as x %in% df == ... does not make sense to R. (Okay, it does make sense to R, but not the same sense it does to you. R follows its operator precedence, which evaluates x %in% df first, gets a result from that, and then checks whether that result == "A", which is not what you want.)
Inside a dplyr function like mutate, you don't need to keep specifying df. You pipe in df and now you just need to use the column x. x %in% df looks like you're testing whether the column x is in the data frame df, which you don't need to do. Instead use x %in% c("A", "D"). Aron's answer shows the full correct syntax; I hope this answer helps you understand why.
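A small sketch of the first point, assuming a throwaway vector foo (not from the question): the spelled-out comparison and the %in% shorthand agree, while the spoken-language form is an error.

```r
foo <- c("G", "H", "I")

# Spell out each comparison...
long_form <- foo == "G" | foo == "H"
# ...or use the shorthand. Both give TRUE TRUE FALSE.
short_form <- foo %in% c("G", "H")
identical(long_form, short_form)

# foo == "G" | "H" fails: the right-hand operand of | is a character
# string, and | only accepts logical, numeric or complex operands.
res <- tryCatch(foo == "G" | "H", error = function(e) "error")
res
```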

How to delete entire row for x if y appears at least once in same column?

I would like to run a code in which I delete the entire row for entries of "x", if "y" appears at least once in the same column of "var4". I can't find any solution in R. Below is what I tried.
In the code below, I tried to tell R that if var4 contains at least one y, all rows containing x should be filtered out/removed.
Example for df:
var1 var2 var3 var4
a b b a
b a b x
a b a x
a a a y
if (all(df$var4 %in% c("y"))) {
df <- filter(!var4 %in% c("x"))
}
So, I would like to delete rows 2&3 because y appears in var4. Unfortunately the code above doesn't return any change in df, even though y appears several times in var4.
Many thanks. I appreciate any kind of recommendation.
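In the OP's code, the filter call is never given the data frame, and all(df$var4 %in% c("y")) is TRUE only when every value of var4 is "y", not when "y" appears at least once. Instead, it can be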
library(dplyr)
if("y" %in% df$var4) {
df <- df %>%
filter(!var4 %in% "x")
}
df
# var1 var2 var3 var4
#1 a b b a
#2 a a a y
It can be also written as
df %>%
filter("y" %in% var4 & !var4 %in% 'x')
data
df <- structure(list(var1 = c("a", "b", "a", "a"), var2 = c("b", "a",
"b", "a"), var3 = c("b", "b", "a", "a"), var4 = c("a", "x", "x",
"y")), class = "data.frame", row.names = c(NA, -4L))
If you want to use base R commands, df[!df$var4 == "x", ] should do it.
df$var4 == "x" will return a vector of TRUE/FALSE
> df$var4 == "x"
[1] FALSE TRUE TRUE FALSE
The ! in front of it flips each TRUE to FALSE and each FALSE to TRUE
> !df$var4 == "x"
[1] TRUE FALSE FALSE TRUE
Then the bracket notation refers to subsetting the object by rows, then columns.
df[rows,columns]
Putting it all together, the following will subset rows based on the criteria supplied, and include all columns.
df[!df$var4 == "x", ]
Note that leaving nothing after the , means include all columns.
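The base R line above removes the "x" rows unconditionally. A hedged sketch that also honours the "only if y appears at least once" requirement from the question, assuming the same example data, might look like this:

```r
# Example data from the question.
df <- data.frame(var1 = c("a", "b", "a", "a"),
                 var2 = c("b", "a", "b", "a"),
                 var3 = c("b", "b", "a", "a"),
                 var4 = c("a", "x", "x", "y"),
                 stringsAsFactors = FALSE)

# Drop the "x" rows only when "y" occurs somewhere in var4.
if ("y" %in% df$var4) {
  df <- df[df$var4 != "x", ]
}
df  # rows 1 and 4 remain
```

Note that != returns NA for missing values, so if var4 can contain NA, the %in% form (!df$var4 %in% "x") is the safer choice.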

Identify index that is not shared between two variables in R

I would like to identify the indices for which there is not a match between two variables. The following code identifies the matches rather than the mismatches:
x <- c("a", "b", "c")
y <- c("a", "z", "c")
which(unique(as.character(x))%in% unique(y))
Thoughts on how to get this to identify the False indices (or in this example, 2)?
which(!(unique(as.character(x))%in% unique(y)))
cdeeterman is basically correct; you just need to make sure that the not (!) applies to the entire relation unique(as.character(x))%in% unique(y)
You could also try using two equal signs, where "x == y" tests whether each element of x is exactly equal to the element of y at the same position
x = c("a", "b", "c")
y = c("a", "z", "c")
z = x == y
which(z == FALSE)
What about setdiff?
> which( y %in% setdiff(y,x) )
[1] 2
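The answers above fall into two groups that only agree when the vectors are aligned: == compares position by position, while %in% and setdiff test set membership. A short sketch with a hypothetical reordered y (not the data from the question) shows the difference:

```r
x <- c("a", "b", "c")
y <- c("c", "z", "a")   # hypothetical: same values as before, reordered

# Position-wise: every position differs.
by_position <- which(x != y)

# Membership: only "b" is missing from y.
by_membership <- which(!(x %in% y))

by_position
by_membership
```

Which one is appropriate depends on whether "match" means "same value at the same index" or "value occurs anywhere in the other vector".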

Subset dataframe by multiple logical conditions of rows to remove

I would like to subset (filter) a dataframe by specifying which rows not (!) to keep in the new dataframe. Here is a simplified sample dataframe:
data
v1 v2 v3 v4
a v d c
a v d d
b n p g
b d d h
c k d c
c r p g
d v d x
d v d c
e v d b
e v d c
For example, if a row of column v1 has a "b", "d", or "e", I want to get rid of that row of observations, producing the following dataframe:
v1 v2 v3 v4
a v d c
a v d d
c k d c
c r p g
I have been successful at subsetting based on one condition at a time. For example, here I remove rows where v1 contains a "b":
sub.data <- data[data[ , 1] != "b", ]
However, I have many, many such conditions, so doing it one at a time is not desirable. I have not been successful with the following:
sub.data <- data[data[ , 1] != c("b", "d", "e")
or
sub.data <- subset(data, data[ , 1] != c("b", "d", "e"))
I've tried some other things as well, like !%in%, but that doesn't seem to exist.
Any ideas?
Try this
subset(data, !(v1 %in% c("b","d","e")))
The ! should be around the outside of the statement:
data[!(data$v1 %in% c("b", "d", "e")), ]
v1 v2 v3 v4
1 a v d c
2 a v d d
5 c k d c
6 c r p g
You can also accomplish this by breaking things up into separate logical statements by including & to separate the statements.
subset(my.df, my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e")
This is not elegant and takes more code but might be more readable to newer R users. As pointed out in a comment above, subset is a "convenience" function that is best used when working interactively.
This answer is meant to explain why rather than how. The '==' operator in R is vectorized in the same way as the '+' operator. It matches the elements of whatever is on the left side to the elements of whatever is on the right side, element by element. For example:
> 1:3 == 1:3
[1] TRUE TRUE TRUE
Here the first test is 1==1 which is TRUE, the second 2==2 and the third 3==3. Notice that this returns a FALSE in the first and third element because the order is wrong:
> 3:1 == 1:3
[1] FALSE TRUE FALSE
Now if one object is shorter than the other, the shorter object is repeated as many times as it takes to match the longer object. If the length of the longer object is not a multiple of the length of the shorter object, you get a warning that not all elements are repeated. For example:
> 1:2 == 1:3
[1] TRUE TRUE FALSE
Warning message:
In 1:2 == 1:3 :
longer object length is not a multiple of shorter object length
Here the first match is 1==1, then 2==2, and finally 1==3 (FALSE) because the left side is smaller. If one of the sides is only one element then that is repeated:
> 1:3 == 1
[1] TRUE FALSE FALSE
The correct operator to test if an element is in a vector is indeed '%in%', which is vectorized only over the left operand (for each element of the left vector, it tests whether that element appears anywhere in the right vector).
Alternatively, you can use '&' to combine two logical statements. '&' takes two elements and checks elementwise if both are TRUE:
> 1:3 == 1 & 1:3 != 2
[1] TRUE FALSE FALSE
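Applied to the question, the recycling described above is why != c("b", "d", "e") does not work. A minimal sketch, assuming v1 holds the first column of the question's data:

```r
# First column of the question's data frame.
v1 <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e")

# != recycles c("b", "d", "e") along v1, so element i is compared with
# only one of the three values. R also warns here because 10 is not a
# multiple of 3; the warning is suppressed to keep the output short.
recycled <- suppressWarnings(v1 != c("b", "d", "e"))
v1[recycled]    # several "b", "d" and "e" values survive

# %in% checks every element of v1 against the whole set.
membership <- !(v1 %in% c("b", "d", "e"))
v1[membership]  # only "a" and "c" values remain
```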
data <- data[-which(data[,1] %in% c("b","d","e")),]
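One caveat with the -which() form: when nothing matches, which() returns an empty integer vector, and negative indexing with an empty vector selects zero rows instead of all of them. A sketch with a small hypothetical data frame (not the question's data) that contains none of the unwanted values:

```r
# Hypothetical data with no "b", "d" or "e" in v1.
data <- data.frame(v1 = c("a", "a", "c"), v2 = 1:3)

hits <- which(data$v1 %in% c("b", "d", "e"))  # integer(0)

# -integer(0) is still integer(0), so every row is dropped.
with_which <- data[-hits, ]
nrow(with_which)

# The logical form keeps all rows, as intended.
with_logical <- data[!(data$v1 %in% c("b", "d", "e")), ]
nrow(with_logical)
```

For that reason the logical-index version is usually the safer default.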
my.df <- read.table(textConnection("
v1 v2 v3 v4
a v d c
a v d d
b n p g
b d d h
c k d c
c r p g
d v d x
d v d c
e v d b
e v d c"), header = TRUE)
my.df[which(my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e" ), ]
v1 v2 v3 v4
1 a v d c
2 a v d d
5 c k d c
6 c r p g
sub.data<-data[ data[,1] != "b" & data[,1] != "d" & data[,1] != "e" , ]
Longer but simple to understand (I guess), and it can be used with multiple columns, even with !is.na( data[,1]).
And also
library(dplyr)
data %>% filter(!v1 %in% c("b", "d", "e"))
or
data %>% filter(v1 != "b" & v1 != "d" & v1 != "e")
or
data %>% filter(v1 != "b", v1 != "d", v1 != "e")
The last form works because the comma implies the & operator.

Subset a data frame using OR when the column contains a factor

I would like to make a subset of a data frame in R that is based on one OR another value in a column of factors but it seems I cannot use | with factor values.
Example:
# fake data
x <- sample(1:100, 9)
nm <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
fake <- cbind(as.data.frame(nm), as.data.frame(x))
# subset fake to only rows with name equal to a or b
fake.trunk <- fake[fake$nm == "a" | "b", ]
produces the error:
Error in fake$nm == "a" | "b" :
operations are possible only for numeric, logical or complex types
How can I accomplish this?
Obviously my actual data frame has more than 3 values in the factor column so just using != "c" won't work.
You need fake.trunk <- fake[fake$nm == "a" | fake$nm == "b", ]. A more concise way of writing that (especially with more than two conditions) is:
fake[ fake$nm %in% c("a","b"), ]
Another approach would be to use subset() and write
fake.trunk = subset(fake, nm %in% c('a', 'b'))
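A quick check that both forms behave the same on an actual factor column; this sketch builds nm with factor() explicitly (an assumption, since recent R versions no longer convert strings to factors by default) and uses fixed numbers instead of sample().

```r
fake <- data.frame(nm = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
                   x  = 1:9)

by_or <- fake[fake$nm == "a" | fake$nm == "b", ]
by_in <- fake[fake$nm %in% c("a", "b"), ]
identical(by_or, by_in)

# Subsetting keeps the unused level "c"; droplevels() removes it if needed.
levels(by_in$nm)
levels(droplevels(by_in)$nm)
```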
