I'm having a hard time understanding how R treats the AND and OR operators when I use filter from dplyr.
Here's an example to illustrate:
library(dplyr)
xy <- data.frame(x=1:6, y=c("a", "b"), z= c(rep("d",3), rep("g",3)))
> xy
x y z
1 1 a d
2 2 b d
3 3 a d
4 4 b g
5 5 a g
6 6 b g
Using filter I want to eliminate all rows where x==1 and z==d. This would lead me to believe I want to use the AND operator: &
> filter(xy, x != 1 & z != "d")
x y z
1 4 b g
2 5 a g
3 6 b g
But this removes all rows that have either x==1 or z==d. What's more confusing is that when I use the OR operator, |, I get the desired result:
> filter(xy, x != 1 | z != "d")
x y z
1 2 b d
2 3 a d
3 4 b g
4 5 a g
5 6 b g
This also works, but it is less convenient when I want to string together == and != in the same condition.
> filter(xy, !(x == 1 & z == "d"))
x y z
1 2 b d
2 3 a d
3 4 b g
4 5 a g
5 6 b g
Can someone explain what I'm missing?
This is a question of Boolean algebra (De Morgan's laws). The logical expression !(x == 1 & z == "d") is equivalent to x != 1 | z != "d", just as -(x + y) is equivalent to -x - y. When you eliminate the parentheses, every == becomes != and every & becomes | (and vice versa). This means that
!(x == 1 & z == "d")
is NOT the same as
x != 1 & z != "d"
but rather
x != 1 | z != "d"
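As a quick sanity check of the equivalence above, here is a minimal base R sketch that rebuilds the xy data frame from the question and compares the three conditions row by row. The variable names negated_and, or_of_negations and and_of_negations are made up for illustration.

```r
# Rebuild the example data from the question
xy <- data.frame(x = 1:6, y = c("a", "b"), z = c(rep("d", 3), rep("g", 3)))

# The condition as intended: NOT (x is 1 AND z is "d")
negated_and <- !(xy$x == 1 & xy$z == "d")

# De Morgan: negate each comparison and swap & for |
or_of_negations <- xy$x != 1 | xy$z != "d"

# The first attempt from the question: negate each comparison but keep &
and_of_negations <- xy$x != 1 & xy$z != "d"

identical(negated_and, or_of_negations)   # TRUE
identical(negated_and, and_of_negations)  # FALSE
```

Only the | version matches the negated condition on every row; the & version also drops rows 2 and 3.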
A couple of tips that won't fit in a comment:
If you're having trouble understanding how something works in R, I'd highly recommend running each individual piece of the operation. With dplyr, it's easy to keep track of intermediate steps and display them all:
mutate(xy,
A = x != 1,
B = z != 'd',
A_and_B = A & B,
A_or_B = A | B
)
# x y z A B A_and_B A_or_B
# 1 1 a d FALSE FALSE FALSE FALSE
# 2 2 b d TRUE FALSE FALSE TRUE
# 3 3 a d TRUE FALSE FALSE TRUE
# 4 4 b g TRUE TRUE TRUE TRUE
# 5 5 a g TRUE TRUE TRUE TRUE
# 6 6 b g TRUE TRUE TRUE TRUE
I think that if you look at the definition of each column its values will make perfect sense. Then, after going one step at a time, hopefully the results will make sense too.
As others have stated in various ways, you're setting yourself up for a hard time from the start with
Using filter I want to eliminate all rows where x==1 and z==d
Don't think of filter as eliminating rows; think of it as keeping rows. If you mentally invert your goal to "keep all rows where..." you'll set yourself up for a more direct translation of words to code.
The result of filter is the rows where the specified condition is true.
Take for example x != 1 & z != "d". What are the rows where this condition is true? The output you got. The other rows were removed, because the condition was not true for those rows.
In this example, your real intention was to eliminate rows where x == 1 and z == "d".
In other words, you want to keep the rows where the condition x == 1 and z == "d" is false.
Putting that into code becomes filter(xy, !(x == 1 & z == "d")).
It's ironic that this looks much like your intention, and very different from what you actually tried to write.
If you forget this logic of filter, you can remind yourself with a simpler experiment: filter(xy, TRUE), which will return all rows, and filter(xy, FALSE), which will return none.
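The "keep rows where the condition is TRUE" idea is the same as logical indexing in base R. The following is only a sketch of that analogy, not dplyr's implementation; keep, kept, all_rows and no_rows are illustrative names.

```r
xy <- data.frame(x = 1:6, y = c("a", "b"), z = c(rep("d", 3), rep("g", 3)))

# One TRUE/FALSE per row: TRUE means "keep this row"
keep <- !(xy$x == 1 & xy$z == "d")
kept <- xy[keep, ]

# An all-TRUE condition keeps every row, an all-FALSE one keeps none
all_rows <- xy[rep(TRUE, nrow(xy)), ]
no_rows  <- xy[rep(FALSE, nrow(xy)), ]

nrow(kept)      # 5
nrow(all_rows)  # 6
nrow(no_rows)   # 0
```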
# x != 1 & z != "d" evaluates to a single TRUE/FALSE vector which subsets the data
# note how & and | behave in isolation:
TRUE & TRUE # T AND T = T
## [1] TRUE
TRUE & FALSE # T AND F = F
## [1] FALSE
FALSE & FALSE # F AND F = F
## [1] FALSE
TRUE | TRUE # T OR T = T
## [1] TRUE
TRUE | FALSE # T OR F = T
## [1] TRUE
FALSE | FALSE # F OR F = F
## [1] FALSE
# Apply over vectors
(x1 <- xy$x != 1)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE
(z1 <- xy$z != "d")
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
x1 & z1 # you get last 3 rows
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
x1 | z1 # you get all but 1st row (which contains 1 and d)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE
Related
I have a dataframe of two columns of T and F.
I want to know
which row is T in the first and F in the second
which row is F in the first and T in the second
which row is F in both
I have very few clues on the matter; can someone shed some light?
You can use case_when:
library(dplyr)
df = data.frame(x = c("T","T","F","F","F"), y = c("T","F","T","F","X"))
df %>%
mutate(condition = case_when(
x == "T" & y == "T" ~ "Both are T",
x == "T" & y == "F" ~ "First is T, second is F",
x == "F" & y == "F" ~ "Both are F",
x == "F" & y == "T" ~ "First is F, second is T",
TRUE ~ "Something else"
))
#> x y condition
#> 1 T T Both are T
#> 2 T F First is T, second is F
#> 3 F T First is F, second is T
#> 4 F F Both are F
#> 5 F X Something else
Created on 2021-08-05 by the reprex package (v2.0.0)
Here is one possible way to solve your problem:
library(dplyr)
df <- data.frame(a = rep(c(T, F, T, F), each=2),
b = rep(c(T, T, F, F), each=2))
# a b
# 1 TRUE TRUE
# 2 TRUE TRUE
# 3 FALSE TRUE
# 4 FALSE TRUE
# 5 TRUE FALSE
# 6 TRUE FALSE
# 7 FALSE FALSE
# 8 FALSE FALSE
df %>%
mutate(newcol = case_when(a & !b ~ "first=T second=F",
!a & b ~ "first=F second=T",
!a & !b ~ "both=F",
TRUE ~ "other"))
# a b newcol
# 1 TRUE TRUE other
# 2 TRUE TRUE other
# 3 FALSE TRUE first=F second=T
# 4 FALSE TRUE first=F second=T
# 5 TRUE FALSE first=T second=F
# 6 TRUE FALSE first=T second=F
# 7 FALSE FALSE both=F
# 8 FALSE FALSE both=F
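If the goal is the row numbers themselves rather than a label column, which() on the same logical conditions is one option. A minimal base R sketch, using the df defined in this answer; the result names are made up for illustration.

```r
df <- data.frame(a = rep(c(TRUE, FALSE, TRUE, FALSE), each = 2),
                 b = rep(c(TRUE, TRUE, FALSE, FALSE), each = 2))

# which() returns the positions where the condition is TRUE
rows_T_then_F <- which(df$a & !df$b)   # first is T, second is F
rows_F_then_T <- which(!df$a & df$b)   # first is F, second is T
rows_both_F   <- which(!df$a & !df$b)  # both are F

rows_T_then_F  # 5 6
rows_F_then_T  # 3 4
rows_both_F    # 7 8
```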
You can treat the [a,b] columns as a vector of 2-bit binary numbers, and a*2+b converts each one from binary to decimal (0 to 3). Adding 1 gives a*2+b+1, which takes the values 1, 2, 3, 4 and can be used as an index into a vector of labels.
Try the base R code below
transform(
df,
newcol = c("both=F", "first=F,second=T", "first=T,second=F", "other")[a * 2 + b + 1]
)
which gives
a b newcol
1 TRUE TRUE other
2 TRUE TRUE other
3 FALSE TRUE first=F,second=T
4 FALSE TRUE first=F,second=T
5 TRUE FALSE first=T,second=F
6 TRUE FALSE first=T,second=F
7 FALSE FALSE both=F
8 FALSE FALSE both=F
Data
df <- data.frame(a = rep(c(T, F, T, F), each=2),
b = rep(c(T, T, F, F), each=2))
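To see how the index a * 2 + b + 1 lines up with the labels, here is a small sketch that enumerates all four combinations. The combos data frame and the labels vector name are illustrative; the label strings are the ones used in the answer.

```r
# All four (a, b) combinations; expand.grid varies the first column fastest
combos <- expand.grid(b = c(FALSE, TRUE), a = c(FALSE, TRUE))

# Logical values are coerced to 0/1 in arithmetic, so this is binary-to-decimal plus 1
idx <- combos$a * 2 + combos$b + 1

labels <- c("both=F", "first=F,second=T", "first=T,second=F", "other")
combos$label <- labels[idx]

idx     # 1 2 3 4
combos
```

The order of the labels vector has to match this encoding: position 1 is a = FALSE, b = FALSE and position 4 is a = TRUE, b = TRUE.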
This is my initial dataset:
x <- c("a","a","b","b","c","c","d","d")
y <- c("a","a","a","b","c","c", "d", "d")
z <- c(5,1,2,6,1,1,5,6)
df <- data.frame(x,y,z)
I am trying to create a column in the data frame that flags whether there is another row in the dataset meeting the following condition:
There is a row in the dataset with the same "x" and "y" values, and at least one of the rows with that "x" and "y" has a "z" value >= 5.
With the example provided, the output should be:
x y z flag
1 a a 5 TRUE
2 a a 1 TRUE
3 b a 2 FALSE
4 b b 6 TRUE
5 c c 1 FALSE
6 c c 1 FALSE
7 d d 5 TRUE
8 d d 6 TRUE
Thank you!
I use data.table package for all my aggregations. With this package I would do the following:
library(data.table)
dt <- as.data.table(df)
# by=.(x, y): grouping by x and y
# find all cases where
# 1. the maximum z value is >= 5
# 2. there is more than 1 entry for that (x, y) combo. .N is data.table syntax for the number of rows in that group
# := is data.table syntax to assign back into the original data.table
dt[, flag := max(z) >= 5 & .N > 1, by=.(x, y)]
# Does x also need to equal y? If so, use this instead
dt[, flag := max(z) >= 5 & .N > 1 & x == y, by=.(x, y)]
# view the result
dt[]
# return back to df
df <- as.data.frame(dt)
df
You can try the code below
> within(df, flag <- x==y & z>=5)
x y z flag
1 a a 5 TRUE
2 a a 1 FALSE
3 b a 2 FALSE
4 b b 6 TRUE
5 c c 1 FALSE
6 c c 1 FALSE
7 d d 5 TRUE
8 d d 6 TRUE
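The expected output in the question appears consistent with a per-group rule: flag every row whose (x, y) group contains at least one z >= 5. Under that assumed reading, which is a guess at the intent, a base R sketch with ave() could look like this:

```r
x <- c("a", "a", "b", "b", "c", "c", "d", "d")
y <- c("a", "a", "a", "b", "c", "c", "d", "d")
z <- c(5, 1, 2, 6, 1, 1, 5, 6)
df <- data.frame(x, y, z)

# ave() applies FUN within each (x, y) group and returns one value per row,
# so any() marks the whole group TRUE when at least one z is >= 5
df$flag <- ave(df$z >= 5, df$x, df$y, FUN = any)
df$flag
# [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
```

This differs from a purely row-wise test such as x == y & z >= 5, which evaluates each row in isolation.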
I am trying to create three new columns with values depending on a particular order of three logical type columns.
For example, I have this:
a b c
1 TRUE TRUE TRUE
2 TRUE FALSE TRUE
3 TRUE FALSE TRUE
Reading across each row, if the values are TRUE, TRUE, TRUE as in row 1, I want to create three new columns with the values 1,1,1, but if the order is TRUE, FALSE, TRUE as in rows 2 and 3, then the values would be 2,3,3. Just to note, a value of TRUE does not mean 1; each value is one I define depending on all three logical values (a total of 8 possible combinations, each defined by three separate numbers). So I get something like this:
a b c d e f
1 TRUE TRUE TRUE 5 5 2
2 TRUE FALSE TRUE 2 3 3
3 TRUE FALSE TRUE 2 3 3
If someone could point me in the right direction to do this as efficiently as possible it would be greatly appreciated as I am relatively new to R.
If there is no pattern behind the values and each combination needs its own rule, you can use if/else.
df[c('d', 'e', 'f')] <- t(apply(df, 1, function(x) {
# x is one row as a logical vector: x[1] is a, x[2] is b, x[3] is c
if (x[1] && x[2] && x[3]) c(5, 5, 2)
else if (x[1] && !x[2] && x[3]) c(2, 3, 3)
#add more conditions
#....
else c(NA, NA, NA) # fallback so rows matching no condition still return three values
}))
df
# a b c d e f
#1 TRUE TRUE TRUE 5 5 2
#2 TRUE FALSE TRUE 2 3 3
#3 TRUE FALSE TRUE 2 3 3
Here's a dplyr solution using case_when. On the left side of the ~ you define your conditions, and on the right side of the ~ you assign a value for when those conditions are met. If none of the conditions is met for a row (for example, a row of all FALSE values), case_when returns NA for that row.
df %>%
mutate(d =
case_when(
a == TRUE & b == TRUE & c == TRUE ~ 5,
a == TRUE & b == FALSE & c == TRUE ~ 2
),
e =
case_when(
a == TRUE & b == TRUE & c == TRUE ~ 5,
a == TRUE & b == FALSE & c == TRUE ~ 3
),
f =
case_when(
a == TRUE & b == TRUE & c == TRUE ~ 2,
a == TRUE & b == FALSE & c == TRUE ~ 3
))
Which gives you:
a b c d e f
<lgl> <lgl> <lgl> <dbl> <dbl> <dbl>
1 TRUE TRUE TRUE 5 5 2
2 TRUE FALSE TRUE 2 3 3
3 TRUE FALSE TRUE 2 3 3
Data:
df <- tribble(
~a, ~b, ~c,
TRUE, TRUE, TRUE,
TRUE, FALSE, TRUE,
TRUE, FALSE, TRUE
)
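With eight possible combinations, writing every case by hand gets repetitive. One alternative, sketched here in base R under the assumption that each combination maps to a fixed triple, is a lookup matrix indexed by a * 4 + b * 2 + c + 1. Only the two combinations given in the question are filled in; the other six rows are left as NA placeholders rather than guessed.

```r
df <- data.frame(a = c(TRUE, TRUE, TRUE),
                 b = c(TRUE, FALSE, FALSE),
                 c = c(TRUE, TRUE, TRUE))

# One row per combination of (a, b, c); row index = a*4 + b*2 + c + 1
lookup <- matrix(NA_real_, nrow = 8, ncol = 3,
                 dimnames = list(NULL, c("d", "e", "f")))
lookup[8, ] <- c(5, 5, 2)  # TRUE, TRUE, TRUE  -> 4 + 2 + 1 + 1 = 8
lookup[6, ] <- c(2, 3, 3)  # TRUE, FALSE, TRUE -> 4 + 0 + 1 + 1 = 6

# Logical columns are coerced to 0/1 in arithmetic
idx <- df$a * 4 + df$b * 2 + df$c + 1

result <- cbind(df, lookup[idx, , drop = FALSE])
result
```

The rule for each combination lives in a single place, so all three output columns stay in sync.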
Recently, I was asked about subsetting a data frame in R. My colleague had this line of code
dd2 <- subset(dd, tret == c("T1", "T2", "T3", "T4"))
which yields only 1/4 of the expected subset. The standard
dd2 <- subset(dd, tret == "T1" | tret == "T2" | tret == "T3" | tret == "T4")
yields 960 rows, whereas the first line of code yields only 240 rows.
The same thing happens with vectors. For instance, given
x <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
the line
y <- x[x == 1 | x == 2]
gives a different vector from
y <- x[x == c(1,2)]
Any insight on the differences? Thank you.
The issue is the recycling of values: when == compares two vectors of different lengths (both greater than 1), the shorter one is recycled to the length of the longer one and the comparison is done element by element.
x == 1:2
#[1] TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
where
x
#[1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4
and the comparison works in the following way
rep(1:2, length.out = length(x))
#[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
In the above example, 1 is compared to the first element of x, 2 to the 2nd element, 1 again to the 3rd element of x, 2 to the 4th, and so on until the end of the vector 'x'. To test membership in a vector of length > 1, use %in%
identical(x[x == 1 | x == 2], x[x %in% 1:2])
#[1] TRUE
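To make the recycling explicit, the sketch below compares x == c(1, 2) with a comparison against the manually recycled vector, then counts the matches each approach finds. The names recycled, n_recycled and n_member are illustrative.

```r
x <- rep(1:4, each = 4)

# What == effectively compares against: c(1, 2) repeated to the length of x
recycled <- rep(c(1, 2), length.out = length(x))
identical(x == c(1, 2), x == recycled)  # TRUE

# Element-wise comparison only matches where position and value happen to line up
n_recycled <- sum(x == c(1, 2))   # 4
# %in% asks "is this element one of the values?" for every element
n_member <- sum(x %in% c(1, 2))   # 8
```

Because 16 is a multiple of 2, R recycles silently here; when the longer length is not a multiple of the shorter one, R issues a warning, which at least hints at the problem.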
Suppose I have a result like this:
df<-data.frame(id=rep(letters[1:4], each=4), stringsAsFactors=FALSE,
test=c(rep(FALSE, 4), rep(c(FALSE, TRUE), 4), rep(TRUE, 4)))
id test
1 a FALSE
2 a FALSE
3 a FALSE
4 a FALSE
5 b FALSE
6 b TRUE
7 b FALSE
8 b TRUE
9 c FALSE
10 c TRUE
11 c FALSE
12 c TRUE
13 d TRUE
14 d TRUE
15 d TRUE
16 d TRUE
What I wanted to see is whether the test results were consistent across each subject. Such that:
id consist
1 a TRUE
2 b FALSE
3 c FALSE
4 d TRUE
What is an easy way to do this in R?
Here is a method using aggregate:
aggregate(test ~ id, data=df, FUN=function(x) min(x) == max(x))
id test
1 a TRUE
2 b FALSE
3 c FALSE
4 d TRUE
For each id, the function checks whether the minimum of the test results equals the maximum.
A second method is to check if there are any differences in the values using diff:
aggregate(test ~ id, data=df, FUN=function(x) max(abs(diff(x))) == 0)
id test
1 a TRUE
2 b FALSE
3 c FALSE
4 d TRUE
Here, the maximum of the absolute values gives the magnitude of the largest difference; it is 0 only when all values in the group are the same.
You could also check whether either TRUE or FALSE is entirely absent within a group, using a combination of table and rowSums
rowSums(table(df) == 0)
# a b c d
# 1 0 0 1
Or closer to your desired output
data.frame(test = rowSums(table(df) == 0) == 1)
# test
# a TRUE
# b FALSE
# c FALSE
# d TRUE
Here is an option using data.table
library(data.table)
setDT(df)[, .(consist= all(test)| all(!test)) , by = id]
# id consist
#1: a TRUE
#2: b FALSE
#3: c FALSE
#4: d TRUE
Or use uniqueN
setDT(df)[,.(consist = uniqueN(test)==1) , by = id]
Another approach using the dplyr package
df %>% group_by(id) %>% summarise(consist = var(test) == 0)
Thanks to @David Arenburg's comment, we can simplify the above using base R:
data.frame(test=with(df, tapply(test, id, var)) == 0)
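A variant of the same idea that does not rely on variance is to count distinct values per group. This is a base R sketch with illustrative names (consist, single_obs_var). One reason to consider it: var() returns NA for a group with a single observation, whereas a one-element group trivially has one unique value.

```r
df <- data.frame(id = rep(letters[1:4], each = 4),
                 test = c(rep(FALSE, 4), rep(c(FALSE, TRUE), 4), rep(TRUE, 4)))

# A group is consistent when it contains exactly one distinct value
consist <- tapply(df$test, df$id, function(v) length(unique(v)) == 1)
consist
#     a     b     c     d
#  TRUE FALSE FALSE  TRUE

# Variance of a single observation is undefined, so var() gives NA here
single_obs_var <- var(TRUE)
is.na(single_obs_var)  # TRUE
```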