Trying to sort character variable into new variable with new value based on conditions - r

I want to sort a character variable into two categories in a new variable based on conditions; if the conditions are not met, I want it to return "other".
If variable x contains 4 character values "A", "B", "C" & "D", I want to sort them into 2 categories, 1 and 0, in a new variable y, creating a dummy variable.
Ideally I want it to look like this
df <- data.frame(x = c("A", "B", "C" & "D")
y <- if x == "A" | "D" then assign 1 in y
if x == "B" | "C" then assign 0 in y
if x == other then assign NA in y
x y
1 "A" 1
2 "B" 0
3 "C" 0
4 "D" 1
library(dplyr)
df <- df %>% mutate ( y =case_when(
(x %in% df == "A" | "D") ~ 1 ,
(x %in% df == "B" | "C") ~ 1,
x %in% df == ~ NA
))
I got this error message
Error: replacement has 3 rows, data has 2

Here's the proper case_when syntax.
df <- data.frame(x = c("A", "B", "C", "D"))
library(dplyr)
df <- df %>%
mutate(y = case_when(x %in% c("A", "D") ~ 1,
x %in% c("B", "C") ~ 0,
TRUE ~ NA_real_))
df
#> x y
#> 1 A 1
#> 2 B 0
#> 3 C 0
#> 4 D 1
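For comparison, here is a minimal base R sketch of the same recoding without dplyr. The extra value "E" is an assumption added only to show what happens to a value outside both groups; it is not in the original data.

```r
# Hypothetical input: the four original values plus one unmatched value "E"
x <- c("A", "B", "C", "D", "E")

# Nested ifelse: 1 for A/D, 0 for B/C, NA for anything else
y <- ifelse(x %in% c("A", "D"), 1,
            ifelse(x %in% c("B", "C"), 0, NA_real_))

data.frame(x = x, y = y)
```

The unmatched "E" ends up as NA, which plays the same role as the TRUE ~ NA_real_ fallback in case_when.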

You're combining syntaxes in a way that makes sense in speech but not in code.
Generally you can't use foo == "G" | "H". You need to use foo == "G" | foo == "H", or the handy shorthand foo %in% c("G", "H").
Similarly, x %in% df == "A" doesn't make sense. x %in% df makes sense. df == "A" makes sense. Putting them together as x %in% df == ... does not make sense to R. (Okay, it does make sense to R, but not the same sense it does to you. R follows its operator precedence, which evaluates x %in% df first, and then checks whether that result == "A", which is not what you want.)
Inside a dplyr function like mutate, you don't need to keep specifying df. You pipe in df and now you just need to use the column x. x %in% df looks like you're testing whether the column x is in the data frame df, which you don't need to do. Instead use x %in% c("A", "D"). Aron's answer shows the full correct syntax; I hope this answer helps you understand why.
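A small base R sketch of the point about foo == "G" | "H", using the question's values. The variable names long_form, short_form and attempt are made up for illustration.

```r
x <- c("A", "B", "C", "D")

# The two spellings that work give the same logical vector
long_form  <- x == "A" | x == "D"
short_form <- x %in% c("A", "D")
identical(long_form, short_form)

# The "spoken" form fails: | cannot take a character operand like "D"
attempt <- tryCatch(x == "A" | "D", error = function(e) "error")
attempt
```

The tryCatch wrapper is only there so the sketch runs to the end; typed directly at the console, x == "A" | "D" simply stops with an error.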

Related

Subsetting a dataframe using %in% and ! in R

I have the following dataframe.
Test_Data <- data.frame(x = c("a", "b", "c"), y = c("d", "e", "f"), z = c("g", "h", "i"))
x y z
1 a d g
2 b e h
3 c f i
I would like to filter it based on multiple conditions. Specifically, I would like to remove any record that has the value of "b" in column x or "f" in column y. My subsetted result would be:
x y z
1 a d g
I tried the following solutions:
View(Test_Data %>% subset(!x %in% "b" | !y %in% "f"))
View(Test_Data %>% subset(!x %in% "b" & !y %in% "f"))
View(Test_Data %>% subset(!(x %in% "b" | y %in% "f")))
The last two solutions give me the result I want, however the first one is the only one that makes 'sense' to me because it uses the OR operator and I only need one of the conditions to be met. Why do the last solutions work but not the first?
The subset operation returns the rows that you want to KEEP.
However your set of rules defines the rows you want NOT TO KEEP. Therefore you're getting confused with the negation logic.
The rows you don't want to keep follow a series of rules: r1 | r2 | ....
The NEGATION is: !(r1 | r2 | ...), or: !r1 & !r2 & ...
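To make the negation rule concrete, here is a short base R sketch on the question's Test_Data. The helper names drop, keep_a, keep_b and keep_wrong are assumptions for illustration only.

```r
Test_Data <- data.frame(x = c("a", "b", "c"),
                        y = c("d", "e", "f"),
                        z = c("g", "h", "i"))

# Rows to throw away: r1 | r2
drop <- Test_Data$x %in% "b" | Test_Data$y %in% "f"

# Two equivalent ways to write "keep everything else"
keep_a <- !drop                                               # !(r1 | r2)
keep_b <- !(Test_Data$x %in% "b") & !(Test_Data$y %in% "f")   # !r1 & !r2

# The first attempt from the question: !r1 | !r2
keep_wrong <- !(Test_Data$x %in% "b") | !(Test_Data$y %in% "f")

Test_Data[keep_a, ]   # only the row a d g
keep_wrong            # TRUE for every row, so nothing is removed
```

No row has both x == "b" and y == "f", so !r1 | !r2 is true everywhere, which is why the first attempt returns the whole data frame.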

Select for every row between two columns based on condition in another column in R

Could someone point me to an existing answer or suggest a method? I cannot find a solution.
What I want to do:
For every row if the value in column "x" is "A" then select the value in column "y" from the same row and if the value in column "x" is "B" then select the value in column "z" from the same row.
Ideally collected in a vector to include as a new column in the df afterwards.
df <- data.frame(x = c("A", "B", "B", "A"), y = c(1,2,3,4), z = c(4,3,2,1), fix.empty.names = FALSE)
df
x y z
1 A 1 4
2 B 2 3
3 B 3 2
4 A 4 1
result
[1] 1 3 2 4
Thank you very much in advance
If we can assume x is always "A" or "B":
ifelse(df$x == "A", df$y, df$z)
More generally:
ifelse(df$x == "A", df$y, ifelse(df$x == "B", df$z, NA))
You can, of course, assign this directly as a new column: df$result <- ifelse...
If you like dplyr:
library(dplyr)
df %>%
mutate(
result = case_when(
x == "A" ~ y,
x == "B" ~ z,
TRUE ~ NA_real_
)
)
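If there are more than two source columns, nested ifelse calls get unwieldy. One possible alternative, sketched here under the assumption that all candidate columns are numeric, is matrix indexing with a (row, column) index. The names lookup, vals and idx are hypothetical.

```r
df <- data.frame(x = c("A", "B", "B", "A"),
                 y = c(1, 2, 3, 4),
                 z = c(4, 3, 2, 1))

# Hypothetical mapping: value of x -> name of the column to read from
lookup <- c(A = "y", B = "z")

# Numeric matrix holding just the candidate columns
vals <- as.matrix(df[unname(lookup)])

# One (row, column) pair per row of df
idx <- cbind(seq_len(nrow(df)), match(df$x, names(lookup)))

df$result <- vals[idx]
df$result
```

A value of x that is missing from lookup produces NA in the result, the same as the NA fallback in the nested ifelse.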

Non duplicate remove subsetting [duplicate]

This question already has answers here:
"Set Difference" between two vectors with duplicate values
(4 answers)
Closed 2 years ago.
a <- c("A", "B", "C", "A", "A", "B")
b <- c("A", "C", "A")
I want to subset a with respect to b such that the following set is obtained:
("B" "A" "B")
Traditional subsetting removes all the "A"s and "C"s from a.
It removes duplicates as well, and I don't want them removed. For example, b has 2 "A"s and 1 "C", so subsetting a with respect to b should remove only two "A"s and one "C" from a. All the remaining elements of a should stay, even if they are "A" or "C".
I just want to know if there is a way of doing this in R.
An easy option is to use vsetdiff from package vecsets, i.e.,
vecsets::vsetdiff(a,b)
such that
> vecsets::vsetdiff(a,b)
[1] "B" "A" "B"
Using tibble and dplyr, you can do:
library(tibble)
library(dplyr)
enframe(a) %>%
transmute(name = value) %>%
group_by(name) %>%
mutate(ID = 1:n()) %>%
left_join(enframe(table(b)), by = c("name" = "name")) %>%
filter(ID > value | is.na(value)) %>%
pull(name)
[1] "B" "A" "B"
Here is a way to do this :
#Count occurrences of `a`
a_count <- table(a)
#Count occurrences of `b`
b_count <- table(b)
#Subtract the count present in b from a
a_count[names(b_count)] <- a_count[names(b_count)] - b_count
#Create a new vector of remaining values
rep(names(a_count), a_count)
#[1] "A" "B" "B"
Or:
a <- c("A", "B", "C", "A", "A", "B")
b <- c("A", "C", "A")
greedy_delete <- function(x, rmv) {
for (i in rmv) {
x <- x[-which(x == i)[1]]
}
x
}
greedy_delete(a, b)
#"B" "A" "B"
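One caveat with the loop above: if rmv contains a value that is not in x, which(x == i)[1] is NA and the subsetting goes wrong. A minimal sketch of a guarded variant, with greedy_delete_safe as a made-up name:

```r
greedy_delete_safe <- function(x, rmv) {
  for (i in rmv) {
    pos <- match(i, x)              # first position of i in x, NA if absent
    if (!is.na(pos)) x <- x[-pos]   # drop that one occurrence only
  }
  x
}

a <- c("A", "B", "C", "A", "A", "B")
b <- c("A", "C", "A")

greedy_delete_safe(a, b)
greedy_delete_safe(a, "Z")   # "Z" is not in a, so a comes back unchanged
```

Like the original, this removes the first remaining occurrence each time, so the order of the surviving elements is preserved.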

Replace strings in variable using lookup vector

I have a dataframe df with a character variable and the fromvec and tovec.
df <- tibble(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")
Use strings in fromvec, check them in df and then replace them with the corresponding strings in tovec so that "A" in df gets replaced with "X", "B" with "Y" and so on to get the desired_df.
desired_df <- tibble(var = c("X", "Y", "Z", "X", "E", "D", "Y"))
I tried the following, but did not get the desired result:
from_vec <- paste(fromvec, collapse="|")
to_vec <- paste(tovec, collapse="|")
undesired_df <- df %>%
mutate(var = str_replace(str_to_upper(var), from_vec, to_vec))
i.e. this
tibble(var = c("X|Y|Z", "X|Y|Z", "X|Y|Z", "X|Y|Z", "E", "D", "X|Y|Z"))
How can I get the desired_df?
You could use chartr (this works here because every string to replace is a single character):
df$var <- chartr(paste(fromvec,collapse=""),
paste(tovec,collapse=""),
toupper(df$var))
# # A tibble: 7 x 1
# var
# <chr>
# 1 X
# 2 Y
# 3 Z
# 4 X
# 5 E
# 6 D
# 7 Y
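A quick sketch of how chartr behaves, to show where it fits and where it does not. The input strings here are made up for illustration.

```r
# chartr translates character by character: A -> X, B -> Y, C -> Z
single <- chartr("ABC", "XYZ", c("A", "B", "C", "E"))
single

# Longer strings are translated one character at a time, so a
# multi-character key such as "AB" cannot be mapped as a unit
multi <- chartr("ABC", "XYZ", c("AB", "CAB"))
multi
```

So chartr is a good fit for the single-letter data in the question, but for longer keys one of the lookup-based answers would be the safer choice.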
Or we can use recode
library(dplyr)
df$var <- recode(toupper(df$var), !!!setNames(tovec,fromvec))
If you really want to use str_replace you could do:
library(purrr)
library(stringr)
df$var <- reduce2(fromvec, tovec, str_replace, .init=toupper(df$var))
The correct way to do this with stringr is with str_replace_all:
mutate(df, var = str_replace_all(str_to_upper(var), setNames(tovec, fromvec)))
(thanks, @Moody_Mudskipper!)
We can use base R
with(df, ifelse(toupper(var) %in% fromvec,
setNames(tovec, fromvec)[toupper(var)], var))
#[1] "X" "Y" "Z" "X" "E" "D" "Y"
which can be also written in two lines by creating a logical condition
i1 <- toupper(df$var) %in% fromvec
df$var[i1] <- setNames(tovec, fromvec)[toupper(df$var)[i1]]
Or using data.table
library(data.table)
setDT(df)[toupper(var) %in% fromvec, var := setNames(tovec, fromvec)[toupper(var)]]
It's not clear whether the result should be case-insensitive.
In my opinion, replacement (update) operations that involve an indeterminate number of changes are best accomplished using JOINs. In this case, it also cements a good practice of tracking your changes in a separate dataframe.
Unfortunately, when this was written the tidyverse had no "update dataframe" function, a glaring omission (dplyr 1.0.0 has since added rows_update()). That meant tidyverse-ers had to use a work-around, coalesce.
#JOIN Operation
tibble(fromvec, tovec) %>% #< dataframe of changes
right_join(df, by = c("fromvec" = "var")) %>% #< join operation
transmute(var = coalesce(tovec, fromvec)) #< coalesce work-around
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 a
5 E
6 D
7 b
If a case insensitive operation is preferred, consider inserting str_to_upper in the pipeline:
tibble(fromvec, tovec) %>%
right_join(df %>% mutate(var = (str_to_upper(var))), #<modify case
by = c("fromvec" = "var")) %>%
transmute(var = coalesce(tovec, fromvec))
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 X
5 E
6 D
7 Y
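The join plus coalesce idea can also be sketched in base R with match(), which does the lookup, and ifelse(), which does the fallback. This is only an illustration of the same logic on a plain character vector; the names recode_lookup, idx and key are hypothetical.

```r
var     <- c("A", "B", "C", "a", "E", "D", "b")
fromvec <- c("A", "B", "C")
tovec   <- c("X", "Y", "Z")

recode_lookup <- function(v, from, to, ignore_case = FALSE) {
  key <- if (ignore_case) toupper(v) else v
  idx <- match(key, from)              # position in the lookup, NA if no match
  ifelse(is.na(idx), key, to[idx])     # fall back to the (possibly uppercased) input
}

recode_lookup(var, fromvec, tovec)                      # case-sensitive
recode_lookup(var, fromvec, tovec, ignore_case = TRUE)  # case-insensitive
```

As in the join version with str_to_upper, the case-insensitive call also uppercases values that have no match.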

How to filter a table by a value in a row

I am writing a script in R that loads some data from a csv file. I use this function to load my data
data <- read.csv("info.csv",colClasses = "character")
and my data looks like this
a b c d ...
1 "A" 3 4 ...
5 "B" 7 8 ...
9 "C" 7 4 ...
9 "C" 2 5 ...
9 "A" 1 6 ...
How can I filter only the rows that contain "C", or "A", or both, or any other string?
For A only, you can try:
data.Aonly <- data[data$b == "A", ]
or using the subset() command:
data.Aonly <- subset(data, b == "A")
For either A or C, you can use the %in% operator:
data.AC <- data[data$b %in% c("A", "C"), ]
or
data.AC <- subset(data, b %in% c("A", "C"))
If data is a data.frame, you can do
data[data$b == "C",]
to get all rows with "C" in column b.
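Here is a self-contained sketch of these filters. Since info.csv is not available, the data frame below is built by hand from the rows shown in the question, with every column as character to mimic colClasses = "character". The last step, searching every column for a string, is an assumption about what "rows that contain" might mean.

```r
data <- data.frame(a = c("1", "5", "9", "9", "9"),
                   b = c("A", "B", "C", "C", "A"),
                   c = c("3", "7", "7", "2", "1"),
                   d = c("4", "8", "4", "5", "6"),
                   stringsAsFactors = FALSE)

# Rows where column b is "A" or "C"
data.AC <- data[data$b %in% c("A", "C"), ]

# All columns are character, so numbers must be compared as strings too
data.c7 <- data[data$c == "7", ]

# Rows where ANY column equals "C"
hit <- rowSums(data == "C") > 0
data[hit, ]
```

Comparing the whole data frame with == gives a logical matrix, and rowSums counts the matches per row.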
In the dplyr package we take this form:
data %>% filter(b == "A") ## OR...
filter(data, b == "A")
So here it is with a real data set. But dplyr is mostly about performance; if all you truly want is to grab some rows from a small data set, it's likely not the best answer here.
library(dplyr)
filter(mtcars, gear == 4) # OR...
as_tibble(mtcars) %>% filter(gear == 4)
as_tibble(mtcars) %>% filter(gear == 3)
as_tibble(mtcars) %>% filter(gear %in% 3:4)

Resources