Changing values in multiple columns given one condition (preferably in dplyr) - r

I'm looking for an easy way to change several values for the same person. Preferably with dplyr or any other package from tidyverse.
Here my example:
df <- data.frame(personid = 1:3, class = c("class1", "class3", "class3"), classlevel = c(1, 11, 3), education = c("BA", "Msc", "BA"))
df
My dataset contains an entry with several mistakes. Person #2 should be part of class 1, at classlevel 1, and his education is BA, not MSc. I use mutate with case_when a lot, but in this case I don't want to change one variable under multiple conditions; I have one condition and want to change multiple values in other variables based on it.
Basically, I'm looking for shorter code to replace this:
df$class[df$personid == 2] <- "class1"
df$classlevel[df$personid == 2] <- 1
df$education[df$personid == 2] <- "BA"
df
or this:
library(tidyverse)
df <- df |>
mutate(class = case_when(personid == 2 ~ "class1", TRUE ~ class)) |>
mutate(classlevel = case_when(personid == 2 ~ 1, TRUE ~ as.numeric(classlevel))) |>
mutate(education = case_when(personid == 2 ~ "BA", TRUE ~ education))
df
In my original data, there are several dozen cases like this, and I find it a bit tedious to use three lines of code for each person. Is there a shorter way?
Thanks for your input!

One way would be to create a data frame of the values to be updated and use rows_update(). Note that this assumes the rows are uniquely identified.
library(dplyr)
df_update <- tribble(
~personid, ~class, ~classlevel, ~education,
# person 2 is the row to correct
2, "class1", 1, "BA"
)
df %>%
rows_update(df_update, by = "personid")
personid class classlevel education
1 1 class1 1 BA
2 2 class1 1 BA
3 3 class3 3 BA
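Since the question mentions several dozen such cases, the same call can apply all corrections at once if they are collected in one table. A minimal sketch, assuming a hypothetical `corrections` table (the values for person 3 are invented for illustration) and that `personid` is unique in both tables:

```r
library(dplyr)

df <- data.frame(
  personid = 1:3,
  class = c("class1", "class3", "class3"),
  classlevel = c(1, 11, 3),
  education = c("BA", "Msc", "BA")
)

# Hypothetical corrections table: one row per person that needs fixing.
corrections <- tribble(
  ~personid, ~class,   ~classlevel, ~education,
  2L,        "class1", 1,           "BA",
  3L,        "class2", 2,           "MSc"
)

# Rows are matched on personid; all other columns are overwritten.
df_fixed <- rows_update(df, corrections, by = "personid")
print(df_fixed)
```

Every column present in `corrections` overwrites the matching cell in `df`, so each row has to carry the full set of corrected values.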

I think I need a little more information to answer your question, but I'll try anyway.
If you want to change the values of some columns based on a single condition across all the rows, I recommend doing this (I created new columns named col_name1 so you can see both the original and the output):
df <- df %>% mutate(class1 = case_when(class != "class1" ~ "class1", TRUE ~ class),
classlevel1 = case_when( classlevel != 1 ~ 1, TRUE ~ as.numeric(classlevel)),
education1 = case_when( education != "BA" ~ "BA", TRUE ~ education))
If that was your problem, then you are probably not familiar with the concept of vectorization. Briefly, a vectorized function runs on all the rows or elements of your vector without you needing to specify that. There are a lot of examples and tutorials on the web if you search "vectorization in R" or something similar.
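As a small illustration of what vectorization means here, the following sketch uses a made-up vector `x` (not part of the question's data) and applies a comparison and `ifelse()` to all elements in one call:

```r
x <- c(12, 8, 3)

# The comparison is applied to every element at once; no loop is needed.
above_five <- x > 5

# ifelse() is vectorized too: it picks a value per element.
size <- ifelse(x > 5, "high", "low")

print(above_five)
print(size)
```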
Otherwise, if your condition changes for each single id (or row) in your data, then the problem is more complicated.
Let me know if that helps and, if it doesn't, consider providing more information in your question.

Related

Change multiple value labels using tidyverse syntax?

The labelled package provides this functionality to modify value labels for multiple variables in one go:
df <- data.frame(v1 = 1:3, v2 = c(2, 3, 1), v3 = 3:1)
val_labels(df[, c("v1", "v3")]) <- c(YES = 1, MAYBE = 2, NO = 3)
val_labels(df)
But I'm wondering if there's a way to do this in tidyverse syntax:
Something like this:
library(tidyverse)
df%>%
mutate(across(V1:V2), ~val_labels(.x)<-c(YES = 1, MAYBE = 2, NO = 3)
We need to assign the labels and then return the column (.x). In addition, when there is more than one expression, wrap them inside {}.
library(dplyr)
library(labelled)
df <- df %>%
mutate(across(v1:v2, ~
{
val_labels(.x) <- c(YES = 1, MAYBE = 2, NO = 3)
.x
}))
-output
> val_labels(df)
$v1
YES MAYBE NO
1 2 3
$v2
YES MAYBE NO
1 2 3
$v3
NULL
I would suggest using haven's labelled class directly, alternatively check out the labelled package's functions made for the dplyr syntax, e.g. add_value_labels.
df <-
df |>
mutate(across(v1:v2,
~ haven::labelled(.,
labels = c(YES = 1,
MAYBE = 2,
NO = 3)
)
)
)
labelled::val_labels(df)
Output:
$v1
YES MAYBE NO
1 2 3
$v2
YES MAYBE NO
1 2 3
$v3
NULL
A side note: Unless you have a very specific reason for using the labelled package, I'd suggest that you keep its usage to a minimum and coerce to factors, especially in the case of value labels. I've learned the hard way that the labelled package (and sjlabelled, for that matter) will often let you do things that seem smart at the outset but aren't in the long run.
A labelled vector is a common data structure in other statistical environments, allowing you to assign text labels to specific values. (...) This class provides few methods, as I expect you'll coerce to a standard R class (e.g. a factor()) soon after importing.
https://haven.tidyverse.org/reference/labelled.html
(My emphasis)
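The coercion that the haven documentation refers to can be sketched as follows. This is a minimal illustration on a made-up vector `x`, assuming haven is installed; `as_factor()` is haven's converter for labelled vectors.

```r
library(haven)

# A labelled vector: numeric codes with text labels attached.
x <- labelled(c(1, 3, 2), labels = c(YES = 1, MAYBE = 2, NO = 3))

# Convert to an ordinary factor whose levels are the labels.
f <- as_factor(x)
print(f)
```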

R: Create Indicator Columns from list of conditions

I have a dataframe and a number of conditions. Each condition is supposed to check whether the value in a certain column of the dataframe is within a set of valid values.
This is what I tried:
# create the sample dataframe
age <- c(120, 45)
sex <- c("x", "f")
df <-data.frame(age, sex)
# create the sample conditions
conditions <- list(
list("age", c(18:100)),
list("sex", c("f", "m"))
)
addIndicator <- function (df, columnName, validValues) {
indicator <- vector()
for (row in df[, toString(columnName)]) {
# for some strange reason, %in% doesn't work correctly here, but always returns FALSe
indicator <- append(indicator, row %in% validValues)
}
df <- cbind(df, indicator)
# rename the column
names(df)[length(names(df))] <- paste0("I_", columnName)
return(df)
}
for (condition in conditions){
columnName <- condition[1]
validValues <- condition[2]
df <- addIndicator(df, columnName, validValues)
}
print(df)
However, this leads to all conditions considered not to be met - which is not what I expect:
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f FALSE FALSE
I figured that %in% does not return the expected result. I checked for the typeof(row) and tried to boil this down into a minimum example. In a simple ME, with the same type and values of the variables, the %in% works properly. So, something must be wrong within the context I try to apply this. Since this is my first attempt to write anything in R, I am stuck here.
What am I doing wrong and how can I achieve what I want?
If you prefer an approach that uses the tidyverse family of packages:
library(tidyverse)
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
imap_dfr(~ .x %in% allowed_values[[.y]]) %>%
rename_with(~ paste0('I_', .x)) %>%
bind_cols(df)
imap_dfr allows you to manipulate each column in df using a lambda function. .x references the column content and .y references the name.
rename_with renames the columns using another lambda function and bind_cols combines the results with the original dataframe.
I borrowed the simplified list of conditions from ben's answer. I find my approach slightly more readable but that is a matter of taste and of whether you are already using the tidyverse elsewhere.
conditions appears to be a nested list. When you use:
validValues <- condition[2]
in your for loop, your result is also a list.
To get the vector of values to use with %in%, extract the element itself with [[:
validValues <- condition[[2]]
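The difference between the two kinds of subsetting can be checked directly. A small sketch, using a single condition shaped like the ones in the question:

```r
condition <- list("age", 18:100)

# Single brackets keep the list wrapper; double brackets return the element.
wrapped   <- condition[2]
unwrapped <- condition[[2]]

print(class(wrapped))    # a list of length 1
print(class(unwrapped))  # the integer vector itself

# %in% only finds the value when given the vector, not the list around it.
in_wrapped   <- 45 %in% wrapped
in_unwrapped <- 45 %in% unwrapped
print(c(in_wrapped, in_unwrapped))
```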
A simplified approach to obtaining indicators could be with a simple list:
conditions_lst <- list(age = 18:100, sex = c("f", "m"))
And using sapply instead of a for loop:
cbind(df, sapply(setNames(names(df), paste("I", names(df), sep = "_")), function(x) {
df[[x]] %in% conditions_lst[[x]]
}))
Output
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f TRUE TRUE
An alternative approach using across and cur_column() (and leaning heavily on severin's solution):
library(tidyverse)
df <- tibble(age = c(12, 45), sex = c('f', 'f'))
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
mutate(across(c(age, sex),
c(valid = ~ .x %in% allowed_values[[cur_column()]])
)
)
Reference: https://dplyr.tidyverse.org/articles/colwise.html#current-column
Related question: Refering to column names inside dplyr's across()
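To get the `I_` prefix the question asked for, the `.names` argument of `across()` can be combined with `cur_column()`. A sketch on the question's sample data, assuming the same `allowed_values` list as above:

```r
library(dplyr)

df <- data.frame(age = c(120, 45), sex = c("x", "f"))
allowed_values <- list(age = 18:100, sex = c("f", "m"))

result <- df %>%
  mutate(across(
    c(age, sex),
    # cur_column() gives the name of the column currently being processed
    ~ .x %in% allowed_values[[cur_column()]],
    # "{.col}" is replaced by the original column name
    .names = "I_{.col}"
  ))
print(result)
```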

R limit output of dataframe?

I have a data frame of transactions.
I am using dplyr to filter the transaction by gender.
Gender in my case is 0 or 1.
I want to filter 2 rows one with Gender == 0 and the second with Gender == 1.
The closest was to do it like this
df %>% arrange(Gender)
and then select 2 transactions in the middle where one is 1 and the second is 0.
Please advise.
To randomly sample a row/cell where condition in another cell is satisfied you can use sample like this:
# Dummy data: X = value of interest, G = Gender (0,1)
df1 <- data.frame("X" = rnorm(10, 0, 1), "G" = sample(c(0,1), replace = T, size = 10))
# Sampling
sample(df1[,'X'][df1[,'G'] == 1], size = 1)
sample(df1[,'X'][df1[,'G'] == 0], size = 1)
This is taking one value of X for each gender (condition of G being set by [df1[,'G'] == 1]).
Building from the comment by docendo discimus you can use the popular dplyr package, using the script below, but note that this runs considerably slower (5 times slower, 3M rows & 1000 iterations) than the sample approach I offered above:
pull(df1 %>% group_by(G) %>% sample_n(1), X)
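If whole rows are wanted rather than a single value, one option in dplyr 1.0 and later is `slice_sample()`, which supersedes `sample_n()`. A minimal sketch on made-up data, using the same column names `X` and `G` as above:

```r
library(dplyr)

# Dummy data with both genders present
df1 <- data.frame(X = 1:10, G = rep(c(0, 1), times = 5))

# One randomly chosen row per gender
one_per_gender <- df1 %>%
  group_by(G) %>%
  slice_sample(n = 1) %>%
  ungroup()
print(one_per_gender)
```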

Conditionally creating multiple variables

I am trying to create multiple variables based on a condition using R dplyr. I currently have to write the same condition three times to get this to work, but I am guessing there is a more efficient way: write the condition once and use it to create multiple variables. The reason is that if the condition changes, it will be easier to update it in one single location instead of three. Please help.
Example:
Current solution:
library(dplyr)
x = c(12,8,3)
df<-data.frame(x)
y<- df %>% mutate( a = ifelse(x>10 ,1,
ifelse(x>5 ,11,0)),
b = ifelse(x>10 ,2,
ifelse(x>5 ,12,0)),
c = ifelse(x>10 ,3,
ifelse(x>5 ,13,0))
)
Looking for something like this:
if x>10 then
{a=1 b=2 c=3}
else if x>5 then
{a=11 b=12 c=13}
else
{a=0 b=0 c=0}
Define a function and use it three times:
cond <- function(x, x1, x2)
case_when(
x > 10 ~ x1,
x > 5 ~ x2,
TRUE ~ 0)
df %>% mutate(a = cond(x, 1, 11), b = cond(x, 2, 12), c = cond(x, 3, 13))
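Another way to write the condition only once, closer to the pseudocode in the question, is to classify each row first and then join the values from a small lookup table. This is a sketch under the assumption that a helper column is acceptable; the names `level` and `lookup` are hypothetical:

```r
library(dplyr)

df <- data.frame(x = c(12, 8, 3))

# One row per outcome of the condition, holding the values for a, b and c
lookup <- tribble(
  ~level, ~a, ~b, ~c,
  "high",  1,  2,  3,
  "mid",  11, 12, 13,
  "low",   0,  0,  0
)

y <- df %>%
  # The condition is written in exactly one place
  mutate(level = case_when(
    x > 10 ~ "high",
    x > 5  ~ "mid",
    TRUE   ~ "low"
  )) %>%
  left_join(lookup, by = "level")
print(y)
```

Changing a threshold then means editing only the `case_when()` call, and changing an output value means editing only `lookup`.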

dplyr filter with condition on multiple columns

I'd like to remove rows corresponding to a particular combination of variables from my data frame.
Here's a dummy data :
father<- c(1, 1, 1, 1, 1)
mother<- c(1, 1, 1, NA, NA)
children <- c(NA, NA, 2, 5, 2)
cousins <- c(NA, 5, 1, 1, 4)
dataset <- data.frame(father, mother, children, cousins)
dataset
father mother children cousins
1 1 NA NA
1 1 NA 5
1 1 2 1
1 NA 5 1
1 NA 2 4
I want to filter this row :
father mother children cousins
1 1 NA NA
I can do it with :
test <- dataset %>%
filter(father==1 & mother==1) %>%
filter (is.na(children)) %>%
filter (is.na(cousins))
test
My question :
I have many columns like grand father, uncle1, uncle2, uncle3 and I want to avoid something like that:
filter (is.na(children)) %>%
filter (is.na(cousins)) %>%
filter (is.na(uncle1)) %>%
filter (is.na(uncle2)) %>%
filter (is.na(uncle3))
and so on...
How can I tell dplyr to filter on all the other columns being NA (keeping the father==1 & mother==1 condition)?
A possible dplyr(0.5.0.9004 <= version < 1.0) solution is:
# > packageVersion('dplyr')
# [1] ‘0.5.0.9004’
dataset %>%
filter(!is.na(father), !is.na(mother)) %>%
filter_at(vars(-father, -mother), all_vars(is.na(.)))
Explanation:
vars(-father, -mother): select all columns except father and mother.
all_vars(is.na(.)): keep rows where is.na is TRUE for all the selected columns.
note: any_vars should be used instead of all_vars if rows where is.na is TRUE for any column are to be kept.
Update (2020-11-28)
As the _at functions and vars have been superseded by the use of across since dplyr 1.0, the following way (or similar) is recommended now:
dataset %>%
filter(across(c(father, mother), ~ !is.na(.x))) %>%
filter(across(c(-father, -mother), is.na))
See more examples of across and how to rewrite previous code with the new approach here: Column-wise operations, or type vignette("colwise") in R after installing the latest version of dplyr.
dplyr >= 1.0.4
If you're using dplyr version >= 1.0.4 you really should use if_any or if_all, which specifically combines the results of the predicate function into a single logical vector making it very useful in filter. The syntax is identical to across, but these verbs were added to help fill this need: if_any/if_all.
library(dplyr)
dataset %>%
filter(if_all(-c(father, mother), ~ is.na(.)), if_all(c(father, mother), ~ !is.na(.)))
Here I have written out the variable names, but you can use any tidy selection helper to specify variables (e.g., column ranges by name or location, regular expression matching, substring matching, starts with/ends with, etc.).
Output
father mother children cousins
1 1 1 NA NA
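As an illustration of the selection helpers mentioned above, here is a sketch that assumes the extra columns share a prefix. The columns `uncle1` to `uncle3` and their values are made up, following the names in the question:

```r
library(dplyr)

dataset <- data.frame(
  father = c(1, 1, 1),
  mother = c(1, 1, NA),
  uncle1 = c(NA, NA, NA),
  uncle2 = c(NA, 3, NA),
  uncle3 = c(NA, NA, 2)
)

# Keep rows where father and mother are 1 and every uncle column is NA
kept <- dataset %>%
  filter(father == 1, mother == 1, if_all(starts_with("uncle"), is.na))
print(kept)
```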
None of the answers seems to be an adaptable solution. I think the intention is not to list all the variables and values to filter the data.
One easy way to achieve this is through merging. If you have all the conditions in df_filter then you can do this:
df_results = df_filter %>% left_join(df_all)
A dplyr solution:
test <- dataset %>%
filter(father==1 & mother==1 & rowSums(is.na(.[,3:4]))==2)
Where '2' is the number of columns that should be NA.
This gives:
> test
father mother children cousins
1 1 1 NA NA
You can apply this logic in base R as well:
dataset[dataset$father==1 & dataset$mother==1 & rowSums(is.na(dataset[,3:4]))==2,]
Here is a base R method using two Reduce functions and [ to subset.
keepers <- Reduce(function(x, y) x == 1 & y == 1, dataset[, 1:2]) &
Reduce(function(x, y) is.na(x) & is.na(y), dataset[, 3:4])
keepers
[1] TRUE FALSE FALSE FALSE FALSE
Each Reduce consecutively takes the variables provided and performs a logical check. The two results are connected with an &. The second argument to the Reduce functions can be adjusted to include whatever variables in the data.frame that you want.
Then use the logical vector to subset
dataset[keepers,]
father mother children cousins
1 1 1 NA NA
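One caveat when extending this to more columns: the second Reduce applies is.na() to its own running result, which is never NA, so with three or more columns it returns FALSE for every row. A sketch that scales to any number of columns computes the per-column checks first and then combines them with `&`; the extra column `uncle1` is hypothetical:

```r
dataset <- data.frame(
  father   = c(1, 1, 1),
  mother   = c(1, 1, NA),
  children = c(NA, NA, 5),
  cousins  = c(NA, 5, 1),
  uncle1   = c(NA, NA, NA)
)

# Check each column separately, then AND the resulting logical vectors
all_one <- Reduce(`&`, lapply(dataset[, 1:2], function(col) col == 1))
all_na  <- Reduce(`&`, lapply(dataset[, 3:5], is.na))

keepers <- all_one & all_na
print(dataset[keepers, ])
```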
This answer builds on @Feng Jiang's answer using the dplyr::left_join() operation, and is more like a reprex. In addition, it ensures the proper order of columns is restored in case the order of variables in df_filter differs from the order of the variables in the original dataset. Also, the dataset was expanded with a duplicate combination to show that duplicates are part of the filtered output (df_out).
library(dplyr)
father<- c(1, 1, 1, 1, 1,1)
mother<- c(1, 1, 1, NA, NA,1)
children <- c(NA, NA, 2, 5, 2,NA)
cousins <- c(NA, 5, 1, 1, 4,NA)
dataset <- data.frame(father, mother, children, cousins)
df_filter <- data.frame( father = 1, mother = 1, children = NA, cousins = NA)
test <- df_filter %>%
left_join(dataset) %>%
relocate(colnames(dataset))
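A filtering join is a possible alternative to the left join. The sketch below assumes dplyr's default behaviour of treating NA as matching NA in joins (`na_matches = "na"`); `semi_join()` keeps the rows and column order of `dataset` as they are, so no `relocate()` is needed:

```r
library(dplyr)

dataset <- data.frame(
  father   = c(1, 1, 1, 1, 1, 1),
  mother   = c(1, 1, 1, NA, NA, 1),
  children = c(NA, NA, 2, 5, 2, NA),
  cousins  = c(NA, 5, 1, 1, 4, NA)
)

# NA_real_ keeps the column types identical to those in dataset
df_filter <- data.frame(
  father = 1, mother = 1,
  children = NA_real_, cousins = NA_real_
)

# Keep only the rows of dataset that have a match in df_filter
df_out <- semi_join(dataset, df_filter, by = names(df_filter))
print(df_out)
```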

Resources