How to subset data in R using conditional operation booleans [duplicate]

I would like to subset (filter) a dataframe by specifying which rows not (!) to keep in the new dataframe. Here is a simplified sample dataframe:
data
v1 v2 v3 v4
a v d c
a v d d
b n p g
b d d h
c k d c
c r p g
d v d x
d v d c
e v d b
e v d c
For example, if a row of column v1 has a "b", "d", or "e", I want to get rid of that row of observations, producing the following dataframe:
v1 v2 v3 v4
a v d c
a v d d
c k d c
c r p g
I have been successful at subsetting based on one condition at a time. For example, here I remove rows where v1 contains a "b":
sub.data <- data[data[ , 1] != "b", ]
However, I have many, many such conditions, so doing it one at a time is not desirable. I have not been successful with the following:
sub.data <- data[data[ , 1] != c("b", "d", "e"), ]
or
sub.data <- subset(data, data[ , 1] != c("b", "d", "e"))
I've tried some other things as well, like !%in%, but that doesn't seem to exist.
Any ideas?

Try this
subset(data, !(v1 %in% c("b","d","e")))

The ! should be around the outside of the statement:
data[!(data$v1 %in% c("b", "d", "e")), ]
v1 v2 v3 v4
1 a v d c
2 a v d d
5 c k d c
6 c r p g
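Since the question notes that `!%in%` does not exist, one common workaround is to define a negated operator yourself. The sketch below assumes a user-defined helper named `%notin%` (a hypothetical name, not part of base R) and a cut-down copy of the sample data with only two columns:

```r
# Hypothetical helper: base R has no built-in negated %in%,
# so we build one by wrapping %in% with Negate().
`%notin%` <- Negate(`%in%`)

# Cut-down copy of the sample data from the question (two columns only)
data <- data.frame(
  v1 = c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e"),
  v2 = c("v", "v", "n", "d", "k", "r", "v", "v", "v", "v"),
  stringsAsFactors = FALSE
)

drop_values <- c("b", "d", "e")

# Keep only the rows whose v1 is NOT in drop_values
kept <- data[data$v1 %notin% drop_values, ]
print(kept)
```

This is equivalent to `data[!(data$v1 %in% drop_values), ]`; the helper only changes how the condition reads.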

You can also accomplish this by breaking things up into separate logical statements by including & to separate the statements.
subset(my.df, my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e")
This is not elegant and takes more code but might be more readable to newer R users. As pointed out in a comment above, subset is a "convenience" function that is best used when working interactively.

This answer is meant to explain why rather than how. The '==' operator in R is vectorized in the same way as the '+' operator: it compares the elements on the left side with the elements on the right side, element by element. For example:
> 1:3 == 1:3
[1] TRUE TRUE TRUE
Here the first test is 1==1, which is TRUE, the second is 2==2 and the third is 3==3. Notice that the following returns FALSE in the first and third elements because the order is reversed:
> 3:1 == 1:3
[1] FALSE TRUE FALSE
Now, if one object is shorter than the other, the shorter object is repeated (recycled) as many times as needed to match the longer one. If the length of the longer object is not a multiple of the length of the shorter object, you get a warning that not all elements are repeated. For example:
> 1:2 == 1:3
[1] TRUE TRUE FALSE
Warning message:
In 1:2 == 1:3 :
longer object length is not a multiple of shorter object length
Here the first match is 1==1, then 2==2, and finally 1==3 (FALSE) because the left side is smaller. If one of the sides is only one element then that is repeated:
> 1:3 == 1
[1] TRUE FALSE FALSE
The correct operator to test whether an element is in a vector is indeed '%in%', which is vectorized only over its left operand: each element of the left vector is tested for membership in the right vector as a whole.
Alternatively, you can use '&' to combine two logical statements. '&' takes two elements and checks elementwise if both are TRUE:
> 1:3 == 1 & 1:3 != 2
[1] TRUE FALSE FALSE
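To connect the recycling rule back to the question, here is a small sketch of what `!= c("b", "d", "e")` actually computes on the v1 column of the sample data. The variable names are illustrative only:

```r
# v1 column from the question's sample data
v1 <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e")
drop_values <- c("b", "d", "e")

# What the failed attempt computes: drop_values is recycled to
# b d e b d e b d e b and compared position by position.
# (R also warns here, because 10 is not a multiple of 3.)
recycled <- suppressWarnings(v1 != drop_values)

# The same comparison with the recycling written out explicitly
explicit <- v1 != rep_len(drop_values, length(v1))
stopifnot(identical(recycled, explicit))

# Membership test: each element of v1 is checked against the whole set
member <- !(v1 %in% drop_values)

print(v1[recycled])  # still contains "b", "d" and "e"
print(v1[member])    # only "a" and "c" remain
```

The recycled comparison removes a row only when its value happens to line up with the same value in the repeated pattern, which is why some "b", "d" and "e" rows survive.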

data <- data[-which(data[,1] %in% c("b","d","e")),]
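One caveat with the `-which(...)` form is worth checking before relying on it: when no row matches, `which()` returns an empty integer vector, and negating it still selects nothing. A minimal sketch, using a made-up three-row data frame in which none of the unwanted values occur:

```r
# Illustrative data: v1 contains none of "b", "d", "e"
data <- data.frame(v1 = c("a", "a", "c"), v2 = 1:3,
                   stringsAsFactors = FALSE)

no_match <- which(data$v1 %in% c("b", "d", "e"))  # integer(0)

# -integer(0) is still integer(0), so every row is dropped
via_which <- data[-no_match, ]

# The logical form keeps all rows, as intended
via_logical <- data[!(data$v1 %in% c("b", "d", "e")), ]

print(nrow(via_which))
print(nrow(via_logical))
```

For that reason the logical-index form is usually the safer default when the set of values to drop may not occur in the data.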

my.df <- read.table(textConnection("
v1 v2 v3 v4
a v d c
a v d d
b n p g
b d d h
c k d c
c r p g
d v d x
d v d c
e v d b
e v d c"), header = TRUE)
my.df[which(my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e" ), ]
v1 v2 v3 v4
1 a v d c
2 a v d d
5 c k d c
6 c r p g

sub.data<-data[ data[,1] != "b" & data[,1] != "d" & data[,1] != "e" , ]
Longer, but simple to understand (I guess), and it can be extended to multiple columns or combined with !is.na(data[, 1]).
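The mention of `!is.na()` matters because `!=` and `%in%` treat missing values differently. The sketch below uses a small made-up data frame with one NA in v1 to show the three behaviours side by side:

```r
# Illustrative data with a missing value in v1
data <- data.frame(v1 = c("a", "b", NA, "c"), v2 = 1:4,
                   stringsAsFactors = FALSE)

# != returns NA for the missing value, and an NA row index
# produces a row filled entirely with NA
by_neq <- data[data$v1 != "b", ]

# %in% never returns NA, so the original NA row is kept intact
by_in <- data[!(data$v1 %in% "b"), ]

# Guarding with !is.na() drops the NA row explicitly
by_guard <- data[!is.na(data$v1) & data$v1 != "b", ]

print(by_neq)
print(by_in)
print(by_guard)
```

Which behaviour is wanted depends on whether rows with a missing v1 should survive the filter.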

And also
library(dplyr)
data %>% filter(!v1 %in% c("b", "d", "e"))
or
data %>% filter(v1 != "b" & v1 != "d" & v1 != "e")
or
data %>% filter(v1 != "b", v1 != "d", v1 != "e")
Since the & operator is implied by the comma.

Related

Subsetting a dataframe using %in% and ! in R

I have the following dataframe.
Test_Data <- data.frame(x = c("a", "b", "c"), y = c("d", "e", "f"), z = c("g", "h", "i"))
x y z
1 a d g
2 b e h
3 c f i
I would like to filter it based on multiple conditions. Specifically, I would like to remove any record that has the value of "b" in column x or "f" in column y. My subsetted result would be;
x y z
1 a d g
I tried the following solutions;
View(Test_Data %>% subset(!x %in% "b" | !y %in% "f"))
View(Test_Data %>% subset(!x %in% "b" & !y %in% "f"))
View(Test_Data %>% subset(!(x %in% "b" | y %in% "f")))
The last two solutions give me the result I want, however the first one is the only one that makes 'sense' to me because it uses the OR operator and I only need one of the conditions to be met. Why do the last solutions work but not the first?
The subset operation returns the rows that you want to KEEP.
However your set of rules defines the rows you want NOT TO KEEP. Therefore you're getting confused with the negation logic.
The rows you don't want to keep follow a series of rules: r1 | r2 | ....
The NEGATION is: !(r1 | r2 | ...), or: !r1 & !r2 & ...
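The negation rule above (De Morgan's law) can be checked directly on Test_Data. In this sketch `r1` and `r2` are the two removal rules from the question, and the `keep_*` names are illustrative:

```r
Test_Data <- data.frame(x = c("a", "b", "c"),
                        y = c("d", "e", "f"),
                        z = c("g", "h", "i"),
                        stringsAsFactors = FALSE)

# Rules describing rows to REMOVE
r1 <- Test_Data$x %in% "b"
r2 <- Test_Data$y %in% "f"

# First attempt from the question: keeps a row if it fails EITHER rule
keep_or <- !r1 | !r2

# Correct negations of (r1 | r2): both forms are equivalent
keep_and <- !r1 & !r2
keep_neg <- !(r1 | r2)

print(Test_Data[keep_or, ])   # all three rows survive
print(Test_Data[keep_and, ])  # only the row with x == "a"
```

No row matches both removal rules at once, so `!r1 | !r2` is TRUE everywhere and nothing is filtered out.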

R: Match specific elements from a list of data frames and create new data frame

Let's have a list of data frames:
df1 <- data.frame(V1=c("a", "b", "c"),V2=c("d", "e","f"), V3=c("g","h","i"),V4=c("j","k","l"))
df2 <- data.frame(V1=c("m","n"), V2=c("o","p"), V3=c("q","r"))
l <-list(df1, df2)
> l
[[1]]
V1 V2 V3 V4
1 a d g j
2 b e h k
3 c f i l
[[2]]
V1 V2 V3
1 m o q
2 n p r
Moreover, we have a vector:
ele <- c("a","b","e","g","i","m","p","s","t")
I want to obtain a new data frame constructed by matching elements from vector ele against list l. The data frame should take its column names from the matched elements of the vector and its values from the elements immediately to the right of the matched elements in the list.
For instance:
df3 <-data.frame(a="d",b='e',e="h",g="j",i="l",m="o",p="r")
> df3
a b e g i m p
1 d e h j l o r
As you may notice, there is no specific matching pattern.
Probably there are better solutions somewhere, but this is a possibility:
library(tidyverse)
library(magrittr)
l %<>%
  map(~ t(.x) %>%
    as_tibble() %>%
    flatten_chr())
ele %>%
  map(~ map(l, equals, .x)) %>%
  map_chr(~ {
    lgl <- map_lgl(.x, any)
    if (!any(lgl)) {
      NA
    } else {
      lgl_idx <- min(which(lgl))
      lgl <- l[[lgl_idx]]
      lgl[min(which(.x[[lgl_idx]])) + 1]
    }
  }) %>%
  set_names(ele) %>%
  na.omit()
Needs some more exception handling (such as when the vector equals an element in the last column) but it works on the example you've given.
a b e g i m p
"d" "e" "h" "j" "l" "o" "r"
You can find the position of the element that matches an argument using which with arr.ind = TRUE, and then add a vector to it (in this case c(0, 1)) to shift one column to the right.
ele_list = as.list(ele)
names(ele_list) = ele
unlist(lapply(ele_list, function(e) df1[which(df1 == e, arr.ind = TRUE) + c(0, 1)]))
a b e g i
"d" "e" "h" "j" "l"
I only did it for df1, you could run the third line for both, then combine the vectors and convert to dataframe.
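The last step (running the lookup over both data frames and combining the results) could look roughly like the base R sketch below. The helper `right_of` is a hypothetical name introduced here; it takes the first match in each data frame and ignores matches in the last column, which is an assumption about how that edge case should be handled:

```r
df1 <- data.frame(V1 = c("a", "b", "c"), V2 = c("d", "e", "f"),
                  V3 = c("g", "h", "i"), V4 = c("j", "k", "l"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("m", "n"), V2 = c("o", "p"), V3 = c("q", "r"),
                  stringsAsFactors = FALSE)
l <- list(df1, df2)
ele <- c("a", "b", "e", "g", "i", "m", "p", "s", "t")

# Hypothetical helper: value one column to the right of the first match
# of e in df, or NA if there is no usable match
right_of <- function(df, e) {
  m <- as.matrix(df)
  pos <- which(m == e, arr.ind = TRUE)
  # A match in the last column has nothing to its right: skip it
  pos <- pos[pos[, "col"] < ncol(m), , drop = FALSE]
  if (nrow(pos) == 0) return(NA_character_)
  unname(m[pos[1, "row"], pos[1, "col"] + 1])
}

# For each element, use the first data frame in the list that matches
hits <- vapply(ele, function(e) {
  for (df in l) {
    v <- right_of(df, e)
    if (!is.na(v)) return(v)
  }
  NA_character_
}, character(1))

hits <- hits[!is.na(hits)]   # drop unmatched elements ("s", "t")
df3 <- as.data.frame(as.list(hits), stringsAsFactors = FALSE)
print(df3)
```

Converting each data frame to a character matrix keeps the row/column arithmetic simple, at the cost of assuming all columns hold character data.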

Shortcut for if else

What is the shortest way to express the following decision rule?
df<-data.frame(a=LETTERS[1:5],b=1:5)
index<-df[,"a"]=="F"
if (any(index)) {
  df$new <- "A"
} else {
  df$new <- "B"
}
Shortest is
df$new=c("B","A")[1+any(df$a=="F")]
More elegant is:
df$new <- if (any(df$a == "F")) "A" else "B"
or
df <- transform(df, new = if (any(a == "F")) "A" else "B")
The ifelse operator was suggested twice, but I would reserve it for a different type of operation:
df$new <- ifelse(df$a == "F", "A", "B")
would put an A or a B on every row depending on the value of a in that row only (which is not what your code is currently doing).
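The difference between the scalar `if ... else` and the elementwise `ifelse` is easier to see when the test value actually occurs in the column. The sketch below therefore tests for "C" instead of the question's "F"; that substitution is an assumption made purely for illustration:

```r
df <- data.frame(a = LETTERS[1:5], b = 1:5, stringsAsFactors = FALSE)

# One decision for the whole table: "A" if ANY row matches.
# The single value is then recycled to every row on assignment.
df$whole_table <- if (any(df$a == "C")) "A" else "B"

# One decision per row: "A" only where that row matches
df$per_row <- ifelse(df$a == "C", "A", "B")

print(df)
```

With the original test value "F", which does not occur in the column, both approaches would fill every row with "B", which hides the distinction.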
Maybe using the vectorized version ifelse
> df$new <- ifelse(any(df[,"a"]=="F"), "A", "B")
> df
a b new
1 A 1 B
2 B 2 B
3 C 3 B
4 D 4 B
5 E 5 B
Another solution with ifelse:
df$new <- ifelse("F" %in% df$a,"A","B")
Technically this is shorter than all the foregoing ;)
df$new <- LETTERS[2-any("F"%in%df$a)]

Access a single cell / subsetted column of a data.table

How can I access just a single cell in a data.table in the way as I could for a data.frame:
mdf <- data.frame(a = c("A", "B", "C"), b = rnorm(3), c = 1:3)
mdf[ mdf$a == "B", "c" ]
[1] 2
Doing the analogous operation on a data.table returns a data.table that includes the key column(s):
mdt <- data.table( mdf, key = "a" )
mdt[ "B", c ]
a c
1: B 2
mdt[ "B", c ][ , c]
[1] 2
Did I miss a parameter, or does it have to be done as in the last line?
Either of these will avoid repeating the c but are not as efficient since they involve computing the first [] as well as the final answer:
> mdt[ "B", ][["c"]]
[1] 2
> mdt[ "B", ][, c]
[1] 2
Recent versions of data.table make this easier
mdt[ "B", c]
# [1] 2
Original answer was returning a data.table like:
mdt['B', 'c']
# c
# 1: 2

