I have a dataset that contains the columns hh_c22j, hh_r02a, and hh_r02b. I want to replace the NAs in these columns with 0. The command below works, but it is redundant because I have to specify the replacement value for each column separately.
df %>% select(case_id, hh_c22j, hh_r02a, hh_r02b) %>% replace_na(list(hh_c22j=0, hh_r02a=0, hh_r02b=0))
I want to select the columns together in an array/list like below.
df %>% select(case_id, hh_c22j, hh_r02a, hh_r02b) %>% replace_na(c(hh_c22j, hh_r02a, hh_r02b), 0)
But I get an error. The error message is:
Error in is_list(replace) : object 'hh_c22j' not found
Error: 1 components of `...` were not used.
We detected these problematic arguments:
* `..1`
Did you misspecify an argument?
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/rlib_error_dots_unused>
1 components of `...` were not used.
We detected these problematic arguments:
* `..1`
Did you misspecify an argument?
Backtrace:
1. `%>%`(...)
5. ellipsis:::action_dots(...)
Run `rlang::last_trace()` to see the full context.
Assuming you have other columns in the data as well but want to change just the three columns, you can do this:
library(dplyr)
df %>% mutate_at(vars(hh_c22j, hh_r02a, hh_r02b), list(~ replace(., which(is.na(.)), 0)))
# Alternatively, using replace_na
df %>% mutate_at(vars(hh_c22j, hh_r02a, hh_r02b), list(~ replace_na(., 0)))
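In current dplyr, mutate_at is superseded by across. A minimal sketch of the same idea with across follows; the toy data frame and the `cols` vector are made up for illustration, and the column names are taken from the question.

```r
library(dplyr)
library(tidyr)  # replace_na() comes from tidyr

# Toy data (assumed): three target columns plus one column left untouched
df <- data.frame(
  case_id = 1:3,
  hh_c22j = c(NA, 1, 2),
  hh_r02a = c(3, NA, 4),
  hh_r02b = c(NA, NA, 5),
  other   = c(NA, "x", "y")
)

cols <- c("hh_c22j", "hh_r02a", "hh_r02b")

# Replace NA with 0 only in the selected columns
res <- df %>%
  mutate(across(all_of(cols), ~ replace_na(.x, 0)))

# Alternative: build the named list that replace_na() expects for a data frame
res2 <- replace_na(df, as.list(setNames(rep(0, length(cols)), cols)))
```

The second form also shows why the original attempt failed: on a data frame, replace_na wants a single named list as its `replace` argument, not a vector of columns followed by a value.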
Just for future reference, a small reproducible sample would go a long way to get better answers!
One clean way to do this is to use the mutate_all function and pass it the function to apply to each column. For example, here I create a dataset similar to yours and replace the NA values with 0:
library(dplyr)
library(tidyr)  # replace_na() comes from tidyr
data <- data.frame(hh_c22j = sample(c(NA, 1), size = 5, replace = TRUE),
                   hh_r02a = sample(c(NA, 1), size = 5, replace = TRUE),
                   hh_r02b = sample(c(NA, 1), size = 5, replace = TRUE))
data %>%
  mutate_all(replace_na, 0)
If you only want to perform this operation on some columns, mutate_at is a similar option where you can specify which column(s) to use this on.
In the apply() function, I need to provide a function name. In my case, that function needs to be chosen based on some other condition. Below is an example:
library(dplyr)
Function = TRUE
as.data.frame(matrix(1:12, 4)) %>%
mutate(Res = apply(as.matrix(.), 1, ifelse(Function, ~mean, ~sd), na.rm = TRUE))
However, with this I get the error below:
Error: Problem with `mutate()` column `Res`.
ℹ `Res = apply(as.matrix(.), 1, ifelse(Function, ~mean, ~sd), na.rm = TRUE)`.
✖ attempt to replicate an object of type 'language'
Run `rlang::last_error()` to see where the error occurred.
Can you please help me with the right way to apply a condition to choose a function?
This should work:
library(dplyr)
Function = TRUE
as.data.frame(matrix(1:12, 4)) %>%
mutate(Res = apply(as.matrix(.), 1, if (Function) mean else sd, na.rm = TRUE))
ifelse is a vectorised function: it takes a logical vector and returns a vector of the same length, with one specified value where the condition is true and another where it is false. To do that it recycles its yes and no arguments, and a formula such as ~mean cannot be recycled, which is why you get "attempt to replicate an object of type 'language'". The separate if and else keywords are for control flow on a single condition, and they can return any object, including a function. Sometimes the two are interchangeable and sometimes they're not.
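To make the difference concrete, here is a small base R sketch. The names `x`, `use_mean`, `f`, and `m` are made up for illustration.

```r
# ifelse() works element by element on a vector
x <- c(1, NA, 3)
filled <- ifelse(is.na(x), 0, x)

# if / else evaluates one condition and can return any object, e.g. a function
use_mean <- TRUE
f <- if (use_mean) mean else sd

# The chosen function can then be handed to apply()
m <- matrix(1:12, 4)
row_res <- apply(m, 1, f, na.rm = TRUE)
```

With `use_mean` set to FALSE, the same apply() call would compute row standard deviations instead.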
I'm recoding survey responses (character) to a set of questions (that are not in continuous columns), and I was thrilled to get the following code to work:
#make a list of the selected columns
fcols <- c(2, 6, 8, 9, 14)
#recode the selected columns
d <- d %>% mutate_at(vars(fcols),
~(recode(.,
"OriginalResponse1" = "NewResponse1",
"OriginalResponse2" = "NewResponse2",
"OriginalResponse3" = "NewResponse3",
"OriginalResponse4" = "NewResponse4",
.default = NA_character_)))
My main question has to do with making this work with "across", since "mutate_at" is apparently superseded.
I tried the below - put in the "across", and make sure to add a new closed paren at the end - but it doesn't work:
d <- d %>% mutate(across(vars(fcols),
~(recode(.,
"OriginalResponse1" = "NewResponse1",
"OriginalResponse2" = "NewResponse2",
"OriginalResponse3" = "NewResponse3",
"OriginalResponse4" = "NewResponse4",
.default = NA_character_))))
Error: Problem with `mutate()` input `..1`.
x Must subset columns with a valid subscript vector.
x Subscript has the wrong type `quosures`.
i It must be numeric or character.
i Input `..1` is `across(...)`.
Also, I've been trying to create a new set of columns (rather than just changing the existing ones) using the .names argument, after the .default argument, but I haven't been able to get that to work, except once only partially - when the columns appeared but they were all empty.
Main question: what am I missing in converting this to "across" from the working "mutate_at" version?
Bonus: how do I get the .names part to work?
With across, when fcols holds column positions you don't need vars:
library(dplyr)
d %>% mutate(across(fcols,
~recode(.,
"OriginalResponse1" = "NewResponse1",
"OriginalResponse2" = "NewResponse2",
"OriginalResponse3" = "NewResponse3",
"OriginalResponse4" = "NewResponse4",
.default = NA_character_)))
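.names is useful when you want to keep the original columns as they are and create new columns. Note that it is an argument of across, not of recode, so it goes after the closing parenthesis of the recode call: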
d %>% mutate(across(fcols,
~recode(.,
"OriginalResponse1" = "NewResponse1",
"OriginalResponse2" = "NewResponse2",
"OriginalResponse3" = "NewResponse3",
"OriginalResponse4" = "NewResponse4",
.default = NA_character_), .names = '{col}_new'))
We can wrap fcols with all_of instead of vars. Also, recode can take a named vector, spliced in with !!!:
library(dplyr)
library(stringr)
nm1 <- setNames(str_c("NewResponse", 1:4),
str_c("OriginalResponse", 1:4))
d %>%
mutate(across(all_of(fcols),
~recode(., !!! nm1,
.default = NA_character_)))
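Putting the pieces together, here is a small self-contained sketch that combines all_of, a spliced lookup vector, and .names. The data frame `d`, the column names `q1` and `q2`, and the `lookup` vector are made up for illustration.

```r
library(dplyr)

# Toy survey data (assumed); "Something else" has no mapping and becomes NA
d <- data.frame(
  id = 1:3,
  q1 = c("OriginalResponse1", "OriginalResponse2", "Something else"),
  q2 = c("OriginalResponse3", "OriginalResponse4", "OriginalResponse1"),
  stringsAsFactors = FALSE
)

# Named vector: names are the old values, elements are the new values
lookup <- c(
  OriginalResponse1 = "NewResponse1",
  OriginalResponse2 = "NewResponse2",
  OriginalResponse3 = "NewResponse3",
  OriginalResponse4 = "NewResponse4"
)

fcols <- c("q1", "q2")

# .names belongs to across(), so the originals are kept and *_new columns are added
res <- d %>%
  mutate(across(all_of(fcols),
                ~ recode(.x, !!!lookup, .default = NA_character_),
                .names = "{.col}_new"))
```

all_of accepts either column names or positions, so the numeric fcols from the question would work the same way.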
Self-taught coder here with no CS background. I seem to run into problems like this all the time, where I don't really understand what is happening behind the scenes with the tidyverse functions I use. I need someone to explain why this isn't working in a way that I will understand.
I'm trying to run this code:
df2.p<- df2 %>% mutate(across(4:9,~./weight))
I understand this code to mean "divide columns 4:9 of df2 by the column named weight which is also in df2"
I get this error:
Error: Problem with mutate() input ..1.
x Input ..1 can't be recycled to size 52.
ℹ Input ..1 is (function (.cols = everything(), .fns = NULL, ..., .names = NULL) ....
ℹ Input ..1 must be size 52 or 1, not 42021.
I've looked at the size of df2. Not sure what is going on.
class(df2) "tbl_df" "tbl" "data.frame"
dim(df2) is 52 x 10
code that created df2 is:
df2<- df1.w %>%
group_by(state) %>%
summarise(weight.s= sum(weight, na.rm= TRUE),
native.s= sum(Native, na.rm= TRUE),
asian.s= sum(Asian, na.rm= TRUE),
black.s= sum(Black, na.rm= TRUE),
pacisland.s= sum(`Pacific Islander`, na.rm= TRUE),
middle.s= sum(`Middle Eastern`, na.rm= TRUE),
white.s= sum(White, na.rm= TRUE),
raceo.s= sum(`Race Other`, na.rm= TRUE),
na.rm= TRUE
)
I created df2 from a df1.w that has 42021 rows. I grouped these rows by state to get to 52 rows. It seems that mutate() is ungrouping df2 and looking at it as df1.w somehow. How do I get this to work?
In the OP's post, the summarise stored the sum of 'weight' under the new name 'weight.s', so a column named 'weight' is not present in the output 'df2': summarise returns only the summarised columns and the grouping columns. We could use across with everything to do the sum on all the columns, which keeps the original column names, and then do the mutate
library(dplyr)
df1.w %>%
group_by(state) %>%
summarise(across(everything(), sum, na.rm= TRUE)) %>%
mutate(across(4:9,~./weight))
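The error most likely happened because an object named 'weight' with 42021 elements exists in the global env (for example, extracted from the original data). mutate found no 'weight' column in 'df2', fell back to that object, and could not recycle it to 52 rows.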
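If you would rather keep the renamed columns, dividing by the summarised name works too. A minimal sketch on made-up data follows; `df1` and its values are hypothetical, and the `.s` column names follow the question.

```r
library(dplyr)

# Toy stand-in for df1.w (assumed data)
df1 <- data.frame(
  state  = c("A", "A", "B"),
  weight = c(1, 3, 2),
  Asian  = c(2, 2, 4),
  White  = c(4, 4, 6)
)

df2 <- df1 %>%
  group_by(state) %>%
  summarise(weight.s = sum(weight, na.rm = TRUE),
            asian.s  = sum(Asian, na.rm = TRUE),
            white.s  = sum(White, na.rm = TRUE))

# Divide by weight.s, the column that actually exists in df2
df2.p <- df2 %>%
  mutate(across(c(asian.s, white.s), ~ .x / weight.s))
```

Selecting the columns by name rather than by position also avoids accidentally dividing weight.s by itself if the column order changes.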
I am making my first baby steps with non standard evaluation (NSE) in dplyr.
Consider the following snippet: it takes a tibble, sorts it according to the values inside a column and replaces the n-k lower values with "Other".
See for instance:
library(dplyr)
df <- cars%>%as_tibble
k <- 3
df2 <- df %>%
arrange(desc(dist)) %>%
mutate(dist2 = factor(c(dist[1:k],
rep("Other", n() - k)),
levels = c(dist[1:k], "Other")))
What I would like is a function such that:
df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)
produces the same result, where old_column is "dist" (the column I use to sort the data set), new_column is "dist2" (the column I generate), and levels_to_keep is "k" (the number of values I explicitly retain).
I am getting lost in enquo, quo_name etc...
Any suggestion is appreciated.
You can do:
library(dplyr)
sort_keep = function(df, old_column, new_column, levels_to_keep) {
  old_column = enquo(old_column)
  new_column = as.character(substitute(new_column))
  df %>%
    arrange(desc(!!old_column)) %>%
    mutate(use = !!old_column,
           !!new_column := factor(c(use[1:levels_to_keep],
                                    rep("Other", n() - levels_to_keep)),
                                  levels = c(use[1:levels_to_keep], "Other")),
           use = NULL)
}
df%>%sort_keep(dist,dist2,3)
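With rlang 0.4.3 or later, the same function can be sketched with the embrace operator instead of enquo and !!. The name `sort_keep2` and the intermediate variables are made up for illustration, and unique() is added so that tied top values do not produce duplicated factor levels.

```r
library(dplyr)

sort_keep2 <- function(df, old_column, new_column, levels_to_keep) {
  # Sort descending by the column passed in
  sorted <- arrange(df, desc({{ old_column }}))
  # Top values as character, since they are mixed with "Other" in the factor
  top <- as.character(pull(sorted, {{ old_column }})[seq_len(levels_to_keep)])
  # "{{ new_column }}" := names the new column after the bare name supplied
  mutate(sorted,
         "{{ new_column }}" := factor(
           c(top, rep("Other", n() - levels_to_keep)),
           levels = unique(c(top, "Other"))
         ))
}

res <- as_tibble(cars) %>% sort_keep2(dist, dist2, 3)
```

The call looks the same as before: bare column names for the first two arguments and a number for the third.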
Something like this?
old_column = "dist"
new_column = "dist2"
levels_to_keep = 3
command = "df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)"
command = gsub('old_column', old_column, command)
command = gsub('new_column', new_column, command)
command = gsub('levels_to_keep', levels_to_keep, command)
eval(parse(text=command))
I am trying to use dplyr computation as below and then call this in a function where I can change the column name and dataset name. The code is as below:
sample_table <- function(byvar = TRUE, dataset = TRUE) {
tcount <-
df2 %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(tcount = n) %>%
left_join(
select(
dataset %>% group_by(.dots = byvar) %>% tally() %>% arrange(byvar) %>% rename(scount = n), byvar, scount
), by = c("byvar")
) %>%
mutate_each(funs(replace(., is.na(.), 0)),-byvar %>% mutate(
tperc = round(tcount / rcount, digits = 2), sperc = round(scount / samplesize, digits = 2),
absdiff = abs(sperc - tperc)
) %>%
select(byvar, tcount, tperc, scount, sperc, absdiff)
return(tcount)
}
category_Sample1 <- sample_table(byvar = "category", dataset = Sample1)
My function name is sample_table.
The error message is below:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* byvar
I know this is a repeat question and I have gone through the below links:
Function writing passing column reference to group_by
Error when combining dplyr inside a function
I am not sure where I am going wrong. rcount is the number of rows in df2 and samplesize is the number of rows in "dataset" dataframe. I have to compute the same thing for another variable with three different "dataset" names.
You mix two kinds of column reference: strings held in a variable (byvar), which need Standard Evaluation, and bare column names (tcount, tperc, etc.), which use Non Standard Evaluation.
Use one or the other consistently, with the appropriate function: select() or select_(). You can fix your issue by using
select(one_of(c(byvar,'tcount')))
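In current dplyr, select_() and the .dots argument are deprecated and one_of() is superseded by all_of() and any_of(). A minimal sketch of the counting step with a column name passed as a string follows; the helper `count_by` and the toy data are made up for illustration and cover only the tally and select part of the original function.

```r
library(dplyr)

# Hypothetical helper: count rows per level of the column named in `byvar`
count_by <- function(data, byvar) {
  data %>%
    group_by(across(all_of(byvar))) %>%   # group by a column given as a string
    tally(name = "tcount") %>%            # name the count column directly
    ungroup() %>%
    select(all_of(c(byvar, "tcount")))    # select by string, no bare names mixed in
}

out <- count_by(data.frame(category = c("a", "a", "b")), byvar = "category")
```

Because every column reference here is a string, the function works for any `byvar` and any data frame that contains that column.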