Self-taught coder here with no CS background. I seem to run into problems like this all the time, where I don't really understand what is happening behind the scenes with the tidyverse functions I use. I need someone to explain why this isn't working in a way that I will understand.
I'm trying to run this code:
df2.p<- df2 %>% mutate(across(4:9,~./weight))
I understand this code to mean "divide columns 4:9 of df2 by the column named weight which is also in df2"
I get this error:
Error: Problem with mutate() input ..1.
x Input ..1 can't be recycled to size 52.
ℹ Input ..1 is (function (.cols = everything(), .fns = NULL, ..., .names = NULL) ....
ℹ Input ..1 must be size 52 or 1, not 42021.
I've looked at the size of df2. Not sure what is going on.
class(df2) "tbl_df" "tbl" "data.frame"
dim(df2) is 52 x 10
code that created df2 is:
df2<- df1.w %>%
group_by(state) %>%
summarise(weight.s= sum(weight, na.rm= TRUE),
native.s= sum(Native, na.rm= TRUE),
asian.s= sum(Asian, na.rm= TRUE),
black.s= sum(Black, na.rm= TRUE),
pacisland.s= sum(`Pacific Islander`, na.rm= TRUE),
middle.s= sum(`Middle Eastern`, na.rm= TRUE),
white.s= sum(White, na.rm= TRUE),
raceo.s= sum(`Race Other`, na.rm= TRUE),
na.rm= TRUE
)
I created df2 from a df1.w that has 42021 rows. I grouped these rows by state to get to 52 rows. It seems that mutate() is ungrouping df2 and looking at it as df1.w somehow. How do I get this to work?
In the OP's code, summarise stores the sum of 'weight' under the name 'weight.s', so the output 'df2' has no column called 'weight' (summarise returns only the summarised columns and the grouping columns). We could instead use across with everything() to sum all the columns under their original names and then do the mutate:
library(dplyr)
df1.w %>%
group_by(state) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%
mutate(across(4:9,~./weight))
The error probably arose because 'weight' was not found in 'df2' and was instead picked up from an object of that name in the global env, one with 42021 elements like the original 'df1.w'.
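A minimal sketch of the alternative fix, keeping the OP's summarise and dividing by the column that actually exists. The toy values below are hypothetical stand-ins for df1.w, and only two of the race columns are shown; the point is that names(df2) contains 'weight.s' rather than 'weight'.

```r
library(dplyr)

# Hypothetical toy stand-in for df1.w
df1.w <- tibble(
  state  = c("A", "A", "B", "B"),
  weight = c(1, 3, 2, 2),
  Asian  = c(1, 1, 2, 0),
  White  = c(0, 2, 1, 1)
)

df2 <- df1.w %>%
  group_by(state) %>%
  summarise(weight.s = sum(weight, na.rm = TRUE),
            asian.s  = sum(Asian, na.rm = TRUE),
            white.s  = sum(White, na.rm = TRUE))

# The summarised table has weight.s, not weight
names(df2)

# Divide the summed columns by the column that is really in df2
df2.p <- df2 %>%
  mutate(across(c(asian.s, white.s), ~ .x / weight.s))
```

Selecting the columns by name rather than by position (4:9) also guards against the column order changing later.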
I have a dataset that contains the columns hh_c22j, hh_r02a, hh_r02b. I want to replace the NAs in these columns with 0. Right now I use the command below, which works but is redundant, as I need to specify the replacement for each column separately.
df %>% select(case_id, hh_c22j, hh_r02a, hh_r02b) %>% replace_na(list(hh_c22j=0, hh_r02a=0, hh_r02b=0))
I want to select the columns together in an array/list like below.
df %>% select(case_id, hh_c22j, hh_r02a, hh_r02b) %>% replace_na(c(hh_c22j, hh_r02a, hh_r02b), 0)
But I got an error. The error msg is :
Error in is_list(replace) : object 'hh_c22j' not found
Error: 1 components of `...` were not used.
We detected these problematic arguments:
* `..1`
Did you misspecify an argument?
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/rlib_error_dots_unused>
1 components of `...` were not used.
We detected these problematic arguments:
* `..1`
Did you misspecify an argument?
Backtrace:
1. `%>%`(...)
5. ellipsis:::action_dots(...)
Run `rlang::last_trace()` to see the full context.
Assuming you have other columns in the data as well but want to change just the three columns, you can do this:
library(dplyr)
df %>% mutate_at(vars(hh_c22j, hh_r02a, hh_r02b), list(~ replace(., which(is.na(.)), 0)))
# Alternatively, using replace_na
df %>% mutate_at(vars(hh_c22j, hh_r02a, hh_r02b), list(~ replace_na(., 0)))
Just for future reference, a small reproducible sample would go a long way to get better answers!
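For newer dplyr versions, the same idea can be sketched with across(), which supersedes mutate_at. The toy data below is hypothetical and only reuses the column names from the question. The second variant shows what replace_na expects when it is called on a whole data frame: a named list, which can be built from a vector of column names.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data using the column names from the question
df <- tibble(
  case_id = 1:3,
  hh_c22j = c(NA, 1, 2),
  hh_r02a = c(3, NA, NA),
  hh_r02b = c(NA, NA, 5),
  other   = c(NA, "a", "b")
)

cols <- c("hh_c22j", "hh_r02a", "hh_r02b")

# Variant 1: apply replace_na column by column to the selected columns
filled <- df %>%
  mutate(across(all_of(cols), ~ replace_na(.x, 0)))

# Variant 2: build the named list list(hh_c22j = 0, hh_r02a = 0, hh_r02b = 0)
# programmatically and pass it to the data-frame method of replace_na
filled2 <- df %>%
  replace_na(as.list(setNames(rep(0, length(cols)), cols)))
```

In both variants the column `other` keeps its NA, since it is not among the selected columns.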
One clean way to do this is to use the mutate_all function and pass it the function to apply to each column. For example, here I create a dataset similar to yours and replace the NA values with 0s:
data <- data.frame(hh_c22j = sample(c(NA, 1), size = 5, replace = TRUE),
hh_r02a = sample(c(NA, 1), size = 5, replace = TRUE),
hh_r02b = sample(c(NA, 1), size = 5, replace = TRUE))
data %>%
mutate_all(replace_na, 0)
If you only want to perform this operation on some columns, mutate_at is a similar option where you can specify which column(s) to use this on.
With the new release of dplyr I am refactoring quite a lot of code and removing functions that are now retired or deprecated. I had a function that is as follows:
processingAggregatedLoad <- function (df) {
defined <- ls()
passed <- names(as.list(match.call())[-1])
if (any(!defined %in% passed)) {
stop(paste("Missing values for the following arguments:", paste(setdiff(defined, passed), collapse=", ")))
}
df_isolated_load <- df %>% select(matches("snsr_val")) %>% mutate(global_demand = rowSums(.)) # we get isolated load
df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind")) # we get isolated quality
df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate_all(~ factor(.), colnames(df_isolated_load_qlty)) %>%
mutate_each(funs(as.numeric(.)), colnames(df_isolated_load_qlty)) # we convert the qlty to factors and then to numeric
df_isolated_load_qlty[df_isolated_load_qlty[]==1] <- 1 # 1 is bad
df_isolated_load_qlty[df_isolated_load_qlty[]==2] <- 0 # 0 is good we mask to calculate the global index quality
df_isolated_load_qlty <- df_isolated_load_qlty %>% mutate(global_quality = rowSums(.)) %>% select(global_quality)
df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
return(df)
}
Basically the function does as follows:
1. The function selects all of the values of a pivoted dataframe and aggregates them.
2. The function selects the quality indicator (character) of a pivoted dataframe.
3. I convert the characters of the quality to factors and then to numeric to get the 2 levels (1 or 2).
4. I replace the numeric values of each of the individual columns by 0 or 1 depending on the level.
5. I row-sum the individual qualities, as I will get 0 if all of the values are good; otherwise the global quality is bad.
The problem is that I am getting the following messages:
1: `funs()` is deprecated as of dplyr 0.8.0.
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
2: `mutate_each_()` is deprecated as of dplyr 0.7.0.
Please use `across()` instead.
I did multiple trials as for instance:
df_isolated_load_qlty %>% mutate(across(.fns = ~ as.factor(), .names = colnames(df_isolated_load_qlty)))
Error: Problem with `mutate()` input `..1`.
x All unnamed arguments must be length 1
ℹ Input `..1` is `across(.fns = ~as.factor(), .names = colnames(df_isolated_load_qlty))`.
But I am still a bit confused about the new dplyr syntax. Would someone be able to guide me a little bit around the right way of doing this?
mutate_each has long been deprecated and was replaced by mutate_all.
mutate_all is now superseded by across.
across has everything() as its default .cols, which means it behaves like mutate_all (as here) unless the columns are specified explicitly.
You can apply multiple functions in the same mutate call, so here factor and as.numeric can be applied together.
Considering all this, you can change your existing function to:
library(dplyr)
processingAggregatedLoad <- function (df) {
defined <- ls()
passed <- names(as.list(match.call())[-1])
if (any(!defined %in% passed)) {
stop(paste("Missing values for the following arguments:",
paste(setdiff(defined, passed), collapse=", ")))
}
df_isolated_load <- df %>%
select(matches("snsr_val")) %>%
mutate(global_demand = rowSums(.))
df_isolated_load_qlty <- df %>% select(matches("qlty_good_ind"))
df_isolated_load_qlty <- df_isolated_load_qlty %>%
mutate(across(everything(), ~ as.numeric(factor(.))))
df_isolated_load_qlty[df_isolated_load_qlty ==1] <- 1
df_isolated_load_qlty[df_isolated_load_qlty==2] <- 0
df_isolated_load_qlty <- df_isolated_load_qlty %>%
mutate(global_quality = rowSums(.)) %>%
select(global_quality)
df <- bind_cols(df, df_isolated_load, df_isolated_load_qlty)
return(df)
}
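A small sketch of the quality-masking step in isolation, on hypothetical data (the values "N"/"Y" and the column prefixes are assumptions, not taken from the question). One thing worth checking in real data: factor() derives its levels from the values present in each column, so a column containing only the "good" value would be coded 1 and then counted as bad. Fixing the levels explicitly avoids that.

```r
library(dplyr)

# Hypothetical quality indicators: "N" = bad, "Y" = good
qlty <- tibble(
  a_qlty_good_ind = c("Y", "N", "Y"),
  b_qlty_good_ind = c("Y", "Y", "Y")   # only one distinct value
)

# Explicit levels keep the coding stable across columns: N -> 1, Y -> 2
qlty_num <- qlty %>%
  mutate(across(everything(),
                ~ as.numeric(factor(.x, levels = c("N", "Y")))))

# Mask as in the function above: 1 stays 1 (bad), 2 becomes 0 (good)
qlty_num[qlty_num == 2] <- 0

# A row sum of 0 means every indicator in that row was good
global_quality <- rowSums(qlty_num)
```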
I'm trying to fill all the NA's in my fields with the mean of each column.
The code I've been using is:
var1<-colnames(DF)
for (i in 1:length(var1)) {
v<-paste0("`",var1[i],"`")
DF<-DF %>%
mutate(v=ifelse(is.na(v),mean(v,na.rm=TRUE),v))
}
After running this piece of code, nothing happens with the DF.
I already tried running for an individual column, and the code works:
DF<-DF%>%
mutate(col1=ifelse(is.na(col1),mean(col1,na.rm=TRUE),col1))
I'm using the ` in the paste part because some of the columns can have spaces between words and I cannot change this. I have the feeling that this part is where the mistake reside.
For multiple columns use mutate_at (for all columns, mutate_all):
DF %>%
mutate_all(~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
It can be made more compact with na.aggregate from zoo, which replaces the NAs with the mean of each column (by default FUN = mean).
library(zoo)
na.aggregate(DF)
If we are using a for loop, then there is no need for a package. Just update the column NA elements with the mean of that column
for(nm in var1) DF[[nm]][is.na(DF[[nm]])] <- mean(DF[[nm]], na.rm = TRUE)
Or with lapply
DF[] <- lapply(DF, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
Or using colMeans
DF[is.na(DF)] <- colMeans(DF, na.rm = TRUE)[col(DF)][is.na(DF)]
data
set.seed(24)
DF <- as.data.frame(matrix(sample(c(NA, 0:5), 20 *5, replace = TRUE), 20, 5))
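As for why the original loop appeared to do nothing: inside mutate, `v = ...` assigns to a column literally named v, and is.na(v) tests the string held in v rather than the column it names. A hedged sketch of two dplyr ways around that, on hypothetical toy data with a space in one column name: across() for all columns at once, and a loop that uses `.data[[nm]]` to look a column up by its string name and `!!nm :=` to assign to it.

```r
library(dplyr)

# Hypothetical toy data; one column name contains a space
DF <- tibble(`col one` = c(1, NA, 3), col2 = c(NA, 4, 6))

# All columns at once, no backtick pasting needed
filled <- DF %>%
  mutate(across(everything(),
                ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))))

# Loop variant: nm is a plain string such as "col one"
filled_loop <- DF
for (nm in colnames(DF)) {
  filled_loop <- filled_loop %>%
    mutate(!!nm := ifelse(is.na(.data[[nm]]),
                          mean(.data[[nm]], na.rm = TRUE),
                          .data[[nm]]))
}
```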
I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
data snapshot
so for student 3, systolic needs two NAs replaced. I used the min and max values for each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>% replace_na(list(systolic= round(sample(runif(1000, 125,130),2),0),
diastolic =round(sample(runif(1000, 85,85),3),0), heart_rate= round(sample(runif(1000, 79,86),2),0),
phys_score = round(sample(runif(1000, 8,9),2),0)))
However, it works only when one NA needs replacing: it successfully replaced a single systolic NA value. When I try to replace more than one NA, this error comes up.
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Is there a simpler way to do this? Any suggestions/comments would be appreciated. Thanks.
A solution that makes things a little more automated but may be unnecessarily complex.
Generated some grouped missing data from the mtcars dataset
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
as_tibble(rownames = "car") %>%
select(car) %>%
separate(car, c("make", "name"), " ") %>%
bind_cols(mtcars[, -1] %>%
map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
size = length(.), replace = TRUE)])) %>%
filter(make %in% c("Mazda", "Hornet", "Merc"))
Function to replace na values from a given variable by sampling within the min and max and depending on some group (here make).
replace_na_sample <- function(df_miss, var, group = "make") {
var <- enquo(var)
df_miss %>%
group_by(across(all_of(group))) %>%
mutate(replace_var := round(runif(n(), min(!!var, na.rm = T),
max(!!var, na.rm = T)), 0)) %>%
rowwise %>%
mutate_at(.vars = vars(!!var),
.funs = ~ replace_na(., replace_var)) %>%
select(-replace_var) %>%
ungroup
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
replace_na_sample(cyl, group = "make") %>%
replace_na_sample(disp, group = "make") %>%
replace_na_sample(hp, group = "make")
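A shorter variant of the same idea, sketched on hypothetical data that borrows the column names from the question. replace_na on a data frame accepts only a single replacement value per column, which is why a length-2 replacement fails; drawing one candidate per row and keeping it only where the original is NA (via coalesce) sidesteps that. This assumes every group has at least one non-missing value in each column.

```r
library(dplyr)

set.seed(1)

# Hypothetical stand-in for the `exercise` data
exercise <- tibble(
  student_id = c("3", "3", "3", "3", "4", "4", "4"),
  systolic   = c(125, NA, 130, NA, 110, 118, NA),
  heart_rate = c(79, 86, NA, 80, NA, 70, 75)
)

filled <- exercise %>%
  group_by(student_id) %>%
  mutate(across(
    c(systolic, heart_rate),
    # one random candidate per row, drawn between the group's min and max;
    # coalesce() uses it only where the original value is NA
    ~ coalesce(.x, round(runif(n(),
                               min(.x, na.rm = TRUE),
                               max(.x, na.rm = TRUE))))
  )) %>%
  ungroup()
```

Because the draw happens inside a grouped mutate, each student's replacements stay within that student's own observed range.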
I'm wondering if this is something possible in R:
I have 2 columns. Column A (primaryhistory2.DEPT) has a bunch of categorical data, column B (primaryhistry2.ACT.ENROLL) has numbers and NAs.
I want to get a summary of column B for each category in column A.
Something like, for "NUT" in column A, I want to see min, max, mean, median, NAs, etc. And I would like to see this for every category. Like when you use summary() command.
Not sure if this is possible.. Thank you all in advance!
#Moody_Mudskipper
The results are what I'm looking for, but without column names they are hard to read.
And the base R version isn't counting NAs, even though I see a lot of NAs in my file.
Very possible using dplyr library:
library(dplyr)
most.of.the.answer = df %>%
group_by(primaryhistory2.DEPT) %>%
summarise(min = min(primaryhistry2.ACT.ENROLL, na.rm = TRUE),
max = max(primaryhistry2.ACT.ENROLL, na.rm = TRUE),
mean = mean(primaryhistry2.ACT.ENROLL, na.rm = TRUE),
median = median(primaryhistry2.ACT.ENROLL, na.rm = TRUE))
(assuming your dataframe is called df)
For counting NA's, try dplyr's filter feature:
count.NAs = df %>% filter(is.na(primaryhistry2.ACT.ENROLL)) %>%
group_by(primaryhistory2.DEPT) %>%
summarise(count.NA = n())
I'll leave it to you to merge the two dataframes.
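The merge can probably be avoided by counting the NAs inside the same summarise call. A minimal sketch, with hypothetical short column names (dept, enroll) standing in for the real ones:

```r
library(dplyr)

# Hypothetical toy data; dept and enroll stand in for the real column names
df <- tibble(
  dept   = c("NUT", "NUT", "NUT", "BIO", "BIO"),
  enroll = c(10, NA, 30, 5, 7)
)

res <- df %>%
  group_by(dept) %>%
  summarise(min    = min(enroll, na.rm = TRUE),
            max    = max(enroll, na.rm = TRUE),
            mean   = mean(enroll, na.rm = TRUE),
            median = median(enroll, na.rm = TRUE),
            n_na   = sum(is.na(enroll)))   # NAs counted per group
```

Unlike the filter approach, this also keeps departments with zero NAs in the result.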
with base R you can do this:
temp <- aggregate(primaryhistory2..ACT.ENROLL ~ primaryhistory2.DEPT, df,
function(x){c(mean = mean(x,na.rm=T),
median = median(x,na.rm=T),
min = min(x,na.rm=T),
max = max(x,na.rm=T),
nas=sum(is.na(x)))})
res <- cbind(temp[1],temp[[2]])
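If the NA count comes out as zero everywhere, the likely cause is that the formula interface of aggregate drops rows with NAs before calling the function (its default is na.action = na.omit). A sketch of the same approach with na.action = na.pass, using hypothetical column names dept and enroll:

```r
# Hypothetical toy data; dept and enroll stand in for the real column names
df <- data.frame(
  dept   = c("NUT", "NUT", "NUT", "BIO", "BIO"),
  enroll = c(10, NA, 30, 5, 7)
)

temp <- aggregate(
  enroll ~ dept, df,
  function(x) c(mean = mean(x, na.rm = TRUE),
                nas  = sum(is.na(x))),
  na.action = na.pass   # keep NA rows so they can be counted
)

# temp[[2]] is a matrix with one named column per statistic
res <- cbind(temp[1], temp[[2]])
```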
If you want to use summary:
summary1 <- sapply(unique(df$primaryhistory2.DEPT),function(x) summary(subset(df,primaryhistory2.DEPT == x)$primaryhistory2..ACT.ENROLL))
colnames(summary1) <- unique(df$primaryhistory2.DEPT)
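Another base R option that keeps the category labels attached is tapply, which returns one summary() result per category, named by category. A minimal sketch with hypothetical column names dept and enroll:

```r
# Hypothetical toy data; dept and enroll stand in for the real column names
df <- data.frame(
  dept   = c("NUT", "NUT", "NUT", "BIO", "BIO"),
  enroll = c(10, NA, 30, 5, 7)
)

# One summary() per department; summary() adds an "NA's" entry
# only for groups that actually contain NAs
by_dept <- tapply(df$enroll, df$dept, summary)

by_dept$NUT   # labelled min, quartiles, mean, max and NA count for "NUT"
```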