How to filter a table by a value in a row - R

I am writing a script in R that loads some data from a CSV file. I use this function to load my data:
data <- read.csv("info.csv",colClasses = "character")
and my data looks like this:
a b c d ...
1 "A" 3 4 ...
5 "B" 7 8 ...
9 "C" 7 4 ...
9 "C" 2 5 ...
9 "A" 1 6 ...
How could I filter only the rows that contain "C" or "A" or both, or any other string?

For A only, you can try:
data.Aonly <- data[data$b == "A", ]
or using the subset() command:
data.Aonly <- subset(data, b == "A")
For either A or C, you can use the %in% operator:
data.AC <- data[data$b %in% c("A", "C"), ]
or
data.AC <- subset(data, b %in% c("A", "C"))
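To try the approaches above without the original file, here is a minimal sketch that rebuilds the question's table from inline text. The five rows come from the question; reading from text = instead of "info.csv" is an assumption made only so the example is self-contained.

```r
# Recreate the question's table inline (assumption: stands in for info.csv)
data <- read.csv(text = "a,b,c,d
1,A,3,4
5,B,7,8
9,C,7,4
9,C,2,5
9,A,1,6", colClasses = "character")

# Rows where column b is exactly "A"
data.Aonly <- data[data$b == "A", ]

# Rows where column b is "A" or "C"
data.AC <- data[data$b %in% c("A", "C"), ]

nrow(data.Aonly)  # 2
nrow(data.AC)     # 4
```

One difference worth knowing: if column b contained NA, the == form would return an all-NA row for each missing value, while %in% treats NA as simply not matching.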

If data is a data.frame, you can do
data[data$b == "C",]
to get all rows with "C" in column b.

In the dplyr package we take this form:
data %>% filter(b == "A") ## OR...
filter(data, b == "A")
So here it is with a real data set. But dplyr is more about performance-related tasks; if all you truly want is to grab some rows from a small data set, it's likely not the best answer here.
library(dplyr)
filter(mtcars, gear == 4) # OR...
as_tibble(mtcars) %>% filter(gear == 4)
as_tibble(mtcars) %>% filter(gear == 3)
as_tibble(mtcars) %>% filter(gear %in% 3:4)

Related

Trying to sort character variable into new variable with new value based on conditions

I want to sort a character variable into two categories in a new variable based on conditions; if the conditions are not met, I want it to return "other".
If variable x contains 4 character values "A", "B", "C" & "D", I want to sort them into 2 categories, 1 and 0, in a new variable y, creating a dummy variable.
Ideally I want it to look like this
df <- data.frame(x = c("A", "B", "C" & "D")
y <- if x == "A" | "D" then assign 1 in y
if x == "B" | "C" then assign 0 in y
if x == other then assign NA in y
x y
1 "A" 1
2 "B" 0
3 "C" 0
4 "D" 1
library(dplyr)
df <- df %>% mutate ( y =case_when(
(x %in% df == "A" | "D") ~ 1 ,
(x %in% df == "B" | "C") ~ 1,
x %in% df == ~ NA
))
I got this error message
Error: replacement has 3 rows, data has 2
Here's the proper case_when syntax.
df <- data.frame(x = c("A", "B", "C", "D"))
library(dplyr)
df <- df %>%
mutate(y = case_when(x %in% c("A", "D") ~ 1,
x %in% c("B", "C") ~ 0,
TRUE ~ NA_real_))
df
#> x y
#> 1 A 1
#> 2 B 0
#> 3 C 0
#> 4 D 1
You're combining syntaxes in a way that makes sense in speech but not in code.
Generally you can't use foo == "G" | "H". You need to use foo == "G" | foo == "H", or the handy shorthand foo %in% c("G", "H").
Similarly, x %in% df == "A" doesn't make sense. x %in% df makes sense. df == "A" makes sense. Putting them together as x %in% df == ... does not make sense to R. (Okay, it does make sense to R, but not the same sense it does to you. R will use its order of operations, which evaluates x %in% df first and gets a result from that, and then checks whether that result == "A", which is not what you want.)
Inside a dplyr function like mutate, you don't need to keep specifying df. You pipe in df and now you just need to use the column x. x %in% df looks like you're testing whether the column x is in the data frame df, which you don't need to do. Instead use x %in% c("A", "D"). Aron's answer shows the full correct syntax; I hope this answer helps you understand why.
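The points above can be checked in base R without dplyr. This is only a sketch: the variable names attempt, long_form and short_form are made up for illustration, and the nested ifelse is a base R stand-in for the case_when call, not a replacement for it.

```r
x <- c("A", "B", "C", "D")

# x == "A" | "D" fails because `|` cannot take the string "D" as an operand
attempt <- tryCatch(x == "A" | "D", error = function(e) "error")

# The two correct spellings give the same logical vector
long_form  <- x == "A" | x == "D"
short_form <- x %in% c("A", "D")
identical(long_form, short_form)  # TRUE

# Base R equivalent of the case_when logic: 1 for A/D, 0 for B/C, NA otherwise
y <- ifelse(x %in% c("A", "D"), 1,
            ifelse(x %in% c("B", "C"), 0, NA))
y  # 1 0 0 1
```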

Select for every row between two columns based on condition in another column in R

Could someone help me find the relevant answer thread or suggest a method? I cannot find a solution.
What I want to do:
For every row if the value in column "x" is "A" then select the value in column "y" from the same row and if the value in column "x" is "B" then select the value in column "z" from the same row.
Ideally collected in a vector to include as a new column in the df afterwards.
df <- data.frame(x = c("A", "B", "B", "A"), y = c(1,2,3,4), z = c(4,3,2,1), fix.empty.names = FALSE)
df
x y z
1 A 1 4
2 B 2 3
3 B 3 2
4 A 4 1
result
[1] 1 3 2 4
Thank you very much in advance
If we can assume x is always "A" or "B":
ifelse(df$x == "A", df$y, df$z)
More generally:
ifelse(df$x == "A", df$y, ifelse(df$x == "B", df$z, NA))
You can, of course, assign this directly as a new column: df$result <- ifelse...
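Written out in full with the question's data, the assignment might look like the sketch below. The column name result follows the line above; the data frame is rebuilt from the question without the fix.empty.names argument, which is not needed here.

```r
# Data from the question
df <- data.frame(x = c("A", "B", "B", "A"),
                 y = c(1, 2, 3, 4),
                 z = c(4, 3, 2, 1))

# Take y where x is "A", z where x is "B", NA for anything else
df$result <- ifelse(df$x == "A", df$y,
                    ifelse(df$x == "B", df$z, NA))

df$result  # 1 3 2 4
```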
If you like dplyr:
library(dplyr)
df %>%
mutate(
result = case_when(
x == "A" ~ y,
x == "B" ~ z,
TRUE ~ NA_real_
)
)

Replace strings in variable using lookup vector

I have a dataframe df with a character variable and the fromvec and tovec.
df <- tibble(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")
Use strings in fromvec, check them in df and then replace them with the corresponding strings in tovec so that "A" in df gets replaced with "X", "B" with "Y" and so on to get the desired_df.
desired_df <- tibble(var = c("X", "Y", "Z", "X", "E", "D", "Y"))
I tried the following, but I am not getting the desired result!
from_vec <- paste(fromvec, collapse="|")
to_vec <- paste(tovec, collapse="|")
undesired_df <- df %>%
mutate(var = str_replace(str_to_upper(var), from_vec, to_vec))
i.e. this
tibble(var = c("X|Y|Z", "X|Y|Z", "X|Y|Z", "X|Y|Z", "E", "D", "X|Y|Z"))
How can I get the desired_df?
You could use chartr:
df$var <- chartr(paste(fromvec,collapse=""),
paste(tovec,collapse=""),
toupper(df$var))
# # A tibble: 7 x 1
# var
# <chr>
# 1 X
# 2 Y
# 3 Z
# 4 X
# 5 E
# 6 D
# 7 Y
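Note that chartr translates character by character, so it fits this case only because every entry in fromvec and tovec is a single character. For longer strings, a lookup with match() is one possible alternative. The sketch below uses the question's data as a plain character vector; the names lookup_idx and replaced are hypothetical.

```r
var     <- c("A", "B", "C", "a", "E", "D", "b")
fromvec <- c("A", "B", "C")
tovec   <- c("X", "Y", "Z")

# Position of each (upper-cased) value in fromvec, NA when it is not listed
lookup_idx <- match(toupper(var), fromvec)

# Use the replacement where a match exists, otherwise keep the original value
replaced <- ifelse(is.na(lookup_idx), var, tovec[lookup_idx])

replaced  # "X" "Y" "Z" "X" "E" "D" "Y"
```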
Or we can use recode
library(dplyr)
df$var <- recode(toupper(df$var), !!!setNames(tovec,fromvec))
If you really want to use str_replace you could do:
library(purrr)
library(stringr)
df$var <- reduce2(fromvec, tovec, str_replace, .init=toupper(df$var))
The correct way to do this with stringr is with str_replace_all:
mutate(df, var = str_replace_all(str_to_upper(var), setNames(tovec, fromvec)))
(thanks, @Moody_Mudskipper!)
We can use base R
with(df, ifelse(toupper(var) %in% fromvec,
setNames(tovec, fromvec)[toupper(var)], var))
#[1] "X" "Y" "Z" "X" "E" "D" "Y"
which can be also written in two lines by creating a logical condition
i1 <- toupper(df$var) %in% fromvec
df$var[i1] <- setNames(tovec, fromvec)[toupper(df$var)[i1]]
Or using data.table
library(data.table)
setDT(df)[toupper(var) %in% fromvec, var := setNames(tovec, fromvec)[toupper(var)]]
It's not clear whether the result should be case-insensitive.
In my opinion, replacement (update) operations that involve an indeterminate number of changes are best accomplished using JOINs. In this case, it also cements a good practice of tracking your changes in a separate dataframe.
Unfortunately, the tidyverse has no "update dataframe" function....a glaring omission. That means tidyverse-ers must use a work-around, coalesce.
#JOIN Operation
tibble(fromvec, tovec) %>% #< dataframe of changes
right_join(df, by = c("fromvec" = "var")) %>% #< join operation
transmute(var = coalesce(tovec, fromvec)) #< coalesce work-around
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 a
5 E
6 D
7 b
If a case insensitive operation is preferred, consider inserting str_to_upper in the pipeline:
tibble(fromvec, tovec) %>%
right_join(df %>% mutate(var = (str_to_upper(var))), #<modify case
by = c("fromvec" = "var")) %>%
transmute(var = coalesce(tovec, fromvec))
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 X
5 E
6 D
7 Y

R function to filter / subset (programmatically) multiple values over one variable [duplicate]

Is there a function that takes one dataset, one col, one operator, but several values to evaluate a condition?
v1 <- c(1:3)
v2 <- c("a", "b", "c")
df <- data.frame(v1, v2)
Options to subset (programmatically)
result <- df[df$v2 == "a" | df$v2 == "b", ]
result
v1 v2
1 1 a
2 2 b
Or, for more robustness
result1 <- df[ df[[2]] == "a" | df[[2]] == "b", ]
result1
v1 v2
1 1 a
2 2 b
Alternatively, for easier syntax:
library(dplyr)
result2 <- filter(df, v2 == "a" | v2 == "b")
result2
v1 v2
1 1 a
2 2 b
(Am I right to assume that I can safely use dplyr's filter() inside a function?)
I did not include subset() above as it is known to be for interactive use only.
In all the cases above, one has to repeat the condition (v2 == "a" | v2 == "b").
I'm looking for a function to which I could pass a vector to the argument, like c("a", "b") because I would like to pass a large number of values, and automate the process.
Such function could perhaps be something like:
fun(df, col = v2, operator = "|", value = c("a", "b"))
Thank you
We can use %in% if the number of elements to check is more than 1.
df[df$v2 %in% c('a', 'b'),]
# v1 v2
#1 1 a
#2 2 b
Or if we use subset, the df$ can be removed
subset(df, v2 %in% c('a', 'b'))
Or the dplyr::filter
filter(df, v2 %in% c('a', 'b'))
This can be wrapped in a function
f1 <- function(dat, col, val){
# {{ col }} forwards the bare column name to filter()
filter(dat, {{ col }} %in% val)
}
f1(df, v2, c('a', 'b'))
# v1 v2
#1 1 a
#2 2 b
If we need to use ==, we could loop the vector to compare in a list and use Reduce with |
df[Reduce(`|`, lapply(letters[1:2], `==`, df$v2)),]
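For a version with no package dependency, the same idea can be wrapped in a small base R helper. This is a sketch under assumptions: filter_values is a hypothetical name, and it takes the column as a string rather than a bare name, which tends to be simpler to call from other code.

```r
# Hypothetical helper: keep rows of dat whose column `col` is in `values`
filter_values <- function(dat, col, values) {
  dat[dat[[col]] %in% values, , drop = FALSE]
}

v1 <- c(1:3)
v2 <- c("a", "b", "c")
df <- data.frame(v1, v2)

res <- filter_values(df, "v2", c("a", "b"))
nrow(res)  # 2
```

Because values is an ordinary vector, it can be built programmatically and can hold any number of entries.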

Shortcut for if else

What is the shortest way to express the following decision rule?
df<-data.frame(a=LETTERS[1:5],b=1:5)
index<-df[,"a"]=="F"
if(any(index)){
df$new<-"A"
}else{
df$new<-"B"
}
Shortest is
df$new=c("B","A")[1+any(df$a=="F")]
More elegant is:
df$new <- if (any(df$a == "F")) "A" else "B"
or
df <- transform(df, new = if (any(a == "F")) "A" else "B")
The ifelse operator was suggested twice, but I would reserve it for a different type of operation:
df$new <- ifelse(df$a == "F", "A", "B")
would put an A or a B on every row depending on the value of a in that row only (which is not what your code is currently doing).
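The difference is easier to see with a value that actually occurs in the column. In this sketch "C" is swapped in for "F" (an assumption made purely for illustration), and the names whole_table and per_row are hypothetical.

```r
df <- data.frame(a = LETTERS[1:5], b = 1:5)

# Scalar test: one answer for the whole table, recycled to every row
whole_table <- if (any(df$a == "C")) "A" else "B"

# Vectorized test: a separate answer for each row
per_row <- ifelse(df$a == "C", "A", "B")

whole_table  # "A"
per_row      # "B" "B" "A" "B" "B"
```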
Maybe using the vectorized version ifelse
> df$new <- ifelse(any(df[,"a"]=="F"), "A", "B")
> df
a b new
1 A 1 B
2 B 2 B
3 C 3 B
4 D 4 B
5 E 5 B
Another solution with ifelse:
df$new <- ifelse("F" %in% df$a,"A","B")
Technically this is shorter than all the foregoing ;)
df$new <- LETTERS[2-any("F"%in%df$a)]

Resources