Subset a data frame using OR when the column contains a factor


I would like to make a subset of a data frame in R that is based on one OR another value in a column of factors but it seems I cannot use | with factor values.
Example:
# fake data
x <- sample(1:100, 9)
nm <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
fake <- cbind(as.data.frame(nm), as.data.frame(x))
# subset fake to only rows with name equal to a or b
fake.trunk <- fake[fake$nm == "a" | "b", ]
produces the error:
Error in fake$nm == "a" | "b" :
operations are possible only for numeric, logical or complex types
How can I accomplish this?
Obviously my actual data frame has more than 3 values in the factor column so just using != "c" won't work.

You need fake.trunk <- fake[fake$nm == "a" | fake$nm == "b", ]. A more concise way of writing that (especially with more than two conditions) is:
fake[ fake$nm %in% c("a","b"), ]

Another approach would be to use subset() and write
fake.trunk = subset(fake, nm %in% c('a', 'b'))
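For anyone wondering why the original line fails: == binds more tightly than |, so fake$nm == "a" | "b" is read as (fake$nm == "a") | "b", and | does not accept a character operand. Here is a minimal sketch comparing the approaches. It assumes nm is stored as a factor (built explicitly here, since recent R versions no longer convert strings to factors by default); the set.seed() call is only there to make the fake data reproducible.

```r
set.seed(1)  # assumption: added only for reproducibility
fake <- data.frame(nm = factor(rep(c("a", "b", "c"), each = 3)),
                   x  = sample(1:100, 9))

# The original expression errors because "b" is not a logical value
res <- try(fake$nm == "a" | "b", silent = TRUE)
failed <- inherits(res, "try-error")

# Both of these build the same logical index
keep_or <- fake$nm == "a" | fake$nm == "b"
keep_in <- fake$nm %in% c("a", "b")

# Subsetting keeps the unused level "c"; droplevels() removes it
fake.trunk <- droplevels(fake[keep_in, ])
levels(fake.trunk$nm)
```

If you need the unused level to stay (for example for consistent plotting or tabulation), skip the droplevels() call.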

Related

Subsetting a dataframe using %in% and ! in R

I have the following dataframe.
Test_Data <- data.frame(x = c("a", "b", "c"), y = c("d", "e", "f"), z = c("g", "h", "i"))
x y z
1 a d g
2 b e h
3 c f i
I would like to filter it based on multiple conditions. Specifically, I would like to remove any record that has the value of "b" in column x or "f" in column y. My subsetted result would be;
x y z
1 a d g
I tried the following solutions;
View(Test_Data %>% subset(!x %in% "b" | !y %in% "f"))
View(Test_Data %>% subset(!x %in% "b" & !y %in% "f"))
View(Test_Data %>% subset(!(x %in% "b" | y %in% "f")))
The last two solutions give me the result I want, however the first one is the only one that makes 'sense' to me because it uses the OR operator and I only need one of the conditions to be met. Why do the last solutions work but not the first?

The subset operation returns the rows that you want to KEEP.
However your set of rules defines the rows you want NOT TO KEEP. Therefore you're getting confused with the negation logic.
The rows you don't want to keep are described by a series of rules joined with OR: r1 | r2 | ....
The rows you do want to keep are the negation of that: !(r1 | r2 | ...), which by De Morgan's law is the same as !r1 & !r2 & ...
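To make this concrete, here is a small sketch that evaluates the three conditions from the question on Test_Data row by row. The variable names (drop_x, drop_y, keep_*) are made up for illustration.

```r
Test_Data <- data.frame(x = c("a", "b", "c"),
                        y = c("d", "e", "f"),
                        z = c("g", "h", "i"))

drop_x <- Test_Data$x %in% "b"   # FALSE  TRUE FALSE
drop_y <- Test_Data$y %in% "f"   # FALSE FALSE  TRUE

# First attempt: keeps a row if it passes EITHER test, which every row does
keep_either <- !drop_x | !drop_y

# Correct: keep a row only if it passes BOTH tests
keep_both <- !drop_x & !drop_y

# Equivalent form: negate the whole "drop" rule
keep_negated <- !(drop_x | drop_y)

Test_Data[keep_both, ]
```

Row 2 fails the x test but passes the y test, and row 3 is the reverse, so with | both rows survive.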

Can the "c" statement be used along with the "which" statement?

I am using the R programming language. I am interested in seeing whether the "c" statement can be used along with the "which" statement in R. For example, consider the following code (var1 and var2 are both "Factor" variables):
my_file
var1 var2
1 A AA
2 B CC
3 D CC
4 C AA
5 A BB
ouput <- my_file[which(my_file$var1 == c("A", "B", "C") & my_file$var2 !== c("AA", "CC")), ]
But this does not seem to be working.
I can run each of these conditions individually, e.g.
output <- my_file[which(my_file$var1 == "A" | my_file$var1 == "B" | my_file$var1 == "C"), ]
output1 <- output[which(output$var2 == "AA" | output$var2 == "CC" ), ]
But I would like to run them in a more "compact" form, e.g.:
ouput <- my_file[which(my_file$var1 == c("A", "B", "C") & my_file$var2 !== c("AA", "CC")), ]
Can someone please tell me what I am doing wrong?
Thanks

When you compare my_file$var1 == c("A", "B", "C"), the comparison takes place element by element, but because the two sides have different lengths, the shorter one is recycled (with a warning, because the recycling is incomplete):
c("A", "B", "D", "C", "A") == c("A", "B", "C", "A", "B") giving:
c(TRUE, TRUE, FALSE, FALSE, FALSE), which which() then converts to c(1, 2).
The reason it works when you use one letter at a time is that the single element is repeated 5 times: my_file$var1 == "A" leads to c("A", "B", "D", "C", "A") == c("A", "A", "A", "A", "A") and gives the result you expect.
@deschen is right, you should use %in%:
output <- my_file[which(my_file$var1 %in% c("A", "B", "C") & !my_file$var2 %in% c("AA", "CC")), ]
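Here is a short sketch of the difference, using plain character vectors with the same values as my_file (an assumption made to keep the example self-contained; the columns in the question are factors). suppressWarnings() is used only to hide the recycling warning.

```r
var1 <- c("A", "B", "D", "C", "A")
var2 <- c("AA", "CC", "CC", "AA", "BB")

# == recycles c("A", "B", "C") to c("A", "B", "C", "A", "B")
by_equals <- suppressWarnings(var1 == c("A", "B", "C"))
which(by_equals)   # positions 1 and 2 only

# %in% asks, for each element, whether it appears anywhere in the set
by_in <- var1 %in% c("A", "B", "C")
which(by_in)       # positions 1, 2, 4 and 5

# Combined condition from the answer above
rows <- which(by_in & !var2 %in% c("AA", "CC"))
rows               # position 5
```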

As @deschen says in a comment, you should use %in% rather than ==. You can also (1) get rid of the which() (logical indexing works just as well here as indexing by position) and (2) use subset to avoid re-typing my_file.
output <- subset(my_file, var1 %in% c("A", "B", "C") &
!(var2 %in% c("AA", "CC")))
Alternatively, if you like the tidyverse, this would be:
library(dplyr)
output <- my_file %>% dplyr::filter(var1 %in% c("A", "B", "C"),
!(var2 %in% c("AA", "CC")))
(comma-separated conditions in filter() work the same as &).

R: Is there a method in R to substitute the values of a vector using a dictionary (2-column data frame with old and new values)?

Is there a method in R to substitute the values of a vector using a dictionary (a 2-column data frame with old and new values)?
The only method I know is to extract the old values into a data frame and merge it with what I call the dictionary (a two-column data frame with old and new values), then assign the new values back in place of the old ones. However, it seems that when using merge (at least since R v4.1), the order of the x values is not maintained, so I now use join, which keeps the original order of data frame x intact. I think there must be an easier way; I just have not found it. I hope this is understandable, and I appreciate any help.
cheers Hermann

You could use a named character vector as a dict for replacement by unquoting with !!! inside of dplyr::recode. If you have your "dict" stored as a two-column dataframe, then tidyr::deframe might be handy.
library(tidyverse)
x <- c("a", "b", "c")
dict <- tribble(
~old, ~new,
"a", "d",
"b", "e",
"c", "f"
)
recode(x, !!!deframe(dict))
#> [1] "d" "e" "f"
Created on 2021-06-14 by the reprex package (v1.0.0)

You can use match to substitute the values of a vector using a dictionary:
D$new[match(x, D$old)]
#[1] "d" "e" "f"
You can also use the names to get the new values:
L <- setNames(D$new, D$old)
L[x]
#"d" "e" "f"
Data:
x <- c("a", "b", "c")
D <- data.frame(old = c("a", "b", "c"), new = c("d", "e", "f"))
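One thing to watch with match: values of x that are not in the dictionary come back as NA. Here is a minimal sketch that keeps such values unchanged instead. The extra "z" element is made up for illustration, and keeping unmatched values as-is is an assumption about what you would want.

```r
x <- c("a", "b", "z", "c")   # "z" has no entry in the dictionary
D <- data.frame(old = c("a", "b", "c"),
                new = c("d", "e", "f"),
                stringsAsFactors = FALSE)

idx <- match(x, D$old)       # 1 2 NA 3

# Use the new value where a match exists, otherwise keep the old one
out <- ifelse(is.na(idx), x, D$new[idx])
out
```

The order of x is preserved because match returns one position per element of x.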

How to use the same R recode function on multiple variables without coding each?

From the recode examples, what if I have two variables where I want to apply the same recode?
factor_vec1 <- factor(c("a", "b", "c"))
factor_vec2 <- factor(c("a", "d", "f"))
How can I recode the same answer without writing a recode for each factor_vec? These don't work, do I need to learn how to use purrr to do it, or is there another way?
Output 1: recode(c(factor_vec1, factor_vec2), a = "Apple")
Output 2: recode(c(factor_vec2, factor_vec2), a = "Apple", b =
"Banana")

If there are not many items needed to be recoded, you can try a simple lookup table approach using base R.
v1 <- c("a", "b", "c")
v2 <- c("a", "d", "f")
# lookup table
lut <- c("a" ="Apple",
"b" = "Banana",
"c" = "c",
"d" = "d",
"f" = "f")
lut[v1]
lut[v2]
You can reuse the lookup table for any relevant variables. The results are:
> lut[v1]
       a        b        c
 "Apple" "Banana"      "c"
> lut[v2]
      a       d       f
"Apple"     "d"     "f"

Put the vectors in a list, and then you can apply the same function to each of them using lapply/map.
library(dplyr)
list_fac <- lst(factor_vec1, factor_vec2)
list_fac <- purrr::map(list_fac, recode, a = "Apple", b = "Banana")
You can keep the vectors in the list (which is better) or write the changed vectors back to the global environment using list2env.
list2env(list_fac, .GlobalEnv)
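If you would rather stay in base R, here is a rough sketch that combines the two ideas above: a named lookup vector applied to every factor in a list with lapply. The relabel() helper is hypothetical (not from any package). It also shows a pitfall: indexing a lookup vector directly with a factor uses the factor's integer codes, not its labels.

```r
factor_vec1 <- factor(c("a", "b", "c"))
factor_vec2 <- factor(c("a", "d", "f"))
lut <- c(a = "Apple", b = "Banana")

# Pitfall: a factor index is treated as its integer codes (1, 2, 3)
by_code <- unname(lut[factor_vec2])

# Hypothetical helper: rename only the levels that appear in the lookup
relabel <- function(f, map) {
  lv <- levels(f)
  hit <- lv %in% names(map)
  lv[hit] <- map[lv[hit]]
  levels(f) <- lv
  f
}

list_fac <- lapply(list(factor_vec1 = factor_vec1,
                        factor_vec2 = factor_vec2),
                   relabel, map = lut)
list_fac$factor_vec2
```

Because only the levels are renamed, the result is still a factor, and levels missing from the lookup stay as they were.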

Returning the values of a list based on "two" parameters

Very new to R. So I am wondering if you can use two different parameters to get the position of both elements from a list. See the below example...
x <- c("A", "B", "A", "A", "B", "B", "C", "C", "A", "A", "B")
y <- c(which(x == "A"))
[1] 1 3 4 9 10
x[y]
[1] "A" "A" "A" "A" "A"
x[y+1]
[1] "B" "A" "B" "A" "B"
But I would like to return the positions of both y and y+1 together in the same list. My current solution is to merge the two above lists by row number and create a dataframe from there. I don't really like that and was wondering if there is another way. Thanks!

I don't know exactly what you want, but this could help:
newY = c(which(x == "A"),which(x == "A")+1)
After that you can sort it with
finaldata <- newY[order(newY)]
Or you do both in one step:
finaldata <- c(which(x == "A"),which(x == "A")+1)[order(c(which(x == "A"),which(x == "A")+1))]
Then you could also delete duplicates if you want to. Please tell me if this is what you wanted.
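A shorter variant of the same idea, as a sketch: sort() and unique() do the ordering and de-duplication in one line. The bounds check and the paired data frame are additions of mine, assuming you may also want each "A" next to the value that follows it.

```r
x <- c("A", "B", "A", "A", "B", "B", "C", "C", "A", "A", "B")
y <- which(x == "A")

# All positions of "A" plus the positions right after them, sorted, no repeats
pos <- sort(unique(c(y, y + 1)))
pos <- pos[pos <= length(x)]   # guard in case the last element is "A"
x[pos]

# Alternative: keep each "A" paired with the element that follows it
pairs <- data.frame(position = y, value = x[y], next_value = x[y + 1])
pairs
```

If x ends with "A", next_value is NA for that row, because x[y + 1] indexes past the end of the vector.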
