How to use %>% in tidymodels in R?

I am trying to split a dataset from tidymodels in R.
library(tidymodels)
data(Sacramento, package = "modeldata")
data_split <- initial_split(Sacramento, prop = 0.75, strata = price)
Sac_train <- training(data_split)
I want to describe the distribution of the training dataset, but the following error occurs.
Sac_train %>%
  select(price) %>%
  summarize(min_sell_price = min(),
            max_sell_price = max(),
            mean_sell_price = mean(),
            sd_sell_price = sd())
# Error: In min() : no non-missing arguments to min; returning Inf
However, the following code works.
Sac_train %>%
  summarize(min_sell_price = min(price),
            max_sell_price = max(price),
            mean_sell_price = mean(price),
            sd_sell_price = sd(price))
My question is: why is select(price) not working in the first example? Thanks.

Assuming your data are a data frame: even though only one column is selected, you still need to tell R/dplyr which column you want to summarize.
In other words, a single-column data frame is not treated as a vector that you can pass straight to a function, i.e.:
Sac_train.vec <- 1:25
mean(Sac_train.vec)
# [1] 13
will calculate the mean, whereas
Sac_train.df <- data.frame(price = 1:25)
mean(Sac_train.df)
does not; it warns that the argument is not numeric or logical and returns NA.
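If you do want a plain vector out of the pipe, one option (a small aside, not part of the original answer) is dplyr's pull(), which extracts a single column as a vector so base functions work on it directly:
library(dplyr)
# pull() returns the price column as a plain vector, so mean() works on it
Sac_train.df %>% pull(price) %>% mean()
# [1] 13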
In the special case of only one column, this may be more parsimonious code:
# Example Data
Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])
Sac_train %>%
  select(price) %>%
  summarize(across(everything(),
                   list(min = min, max = max, mean = mean, sd = sd)))
Output:
#   price_min price_max price_mean price_sd
# 1         1        25         13 7.359801

Related

Can't use survey_mean in sapply

I'm using survey data with packages survey and srvyr and I have some trouble applying survey_mean() to all columns.
Here's an example:
library(survey)
library(srvyr)
data(api)
dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw) %>%
  mutate(api00 = ifelse(api00 == 467, NA, api00),
         api99 = ifelse(api99 == 491, NA, api99))
sapply(dstrata$variables %>% select(api99, api00), function(x){
  x <- enquo(x)
  dstrata %>%
    filter(!is.na(!!x)) %>%
    summarise(stat = srvyr::survey_mean(!!x, na.rm = TRUE)[, 1])
})
Error: Assigned data x must be compatible with existing data.
x Existing data has 198 rows.
x Assigned data has 200 rows.
ℹ Only vectors of size 1 are recycled.
Run rlang::last_error() to see where the error occurred.
Note that:
dstrata %>%
  select(api99, api00) %>%
  summarise_all(.funs = srvyr::survey_mean, na.rm = T)
works with this example but not with my actual data, so I would like to understand why the function above does not work.
I'm using srvyr_0.3.9 and survey_4.0.
I don't know why you would need any kind of NSE here: sapply() passes only the values (each column as a vector), not an expression.
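For instance, a quick check (using mtcars purely as a stand-in data set here) shows that each element handed to the anonymous function is already a plain numeric vector, not a column name:
library(dplyr)
# sapply() iterates over the columns themselves, so class() sees the values
sapply(mtcars %>% select(mpg, hp), function(x) class(x))
#       mpg        hp
# "numeric" "numeric"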
This seems to work:
library(dplyr)
sapply(dstrata$variables %>% select(api99, api00), function(x){
  dstrata %>%
    summarise(stat = srvyr::survey_mean(x, na.rm = TRUE))
})
#            api99    api00
# stat    630.3107 663.4118
# stat_se 10.14777 9.566393

Is there a way of creating a loop that will create a new variable for each of the original 18 variables?

I have a data set with 4 variables; one of them is a dummy stating whether the individual graduated from a particular program (exits). I need to create a loop that, for each of the other 3 variables, creates two new variables (the mean for dummy = 1 and the mean for dummy = 0). This is my code. I want to make it more efficient, since afterwards I want to create a new data.frame for exits == 0 and subtract both.
summary_means_1 = bf %>%
  filter(exits == 1) %>%
  summarise(
    v1_1 = as.double(mean(bf$v25_grad, na.rm = TRUE)),
    v2_1 = as.double(mean(bf$v29_read, na.rm = TRUE)),
    v3_1 = as.double(mean(bf$v30_math, na.rm = TRUE))
  )
You can do this with the dplyr package:
Say this is your data (simplified):
df <- data.frame(Dummy=sample(0:1, 10, T), V1=rnorm(10, 10), V2=rpois(10, 0.5))
This code will calculate the mean of each column, split by dummy:
library(dplyr)
df %>%
  group_by(Dummy) %>%
  summarise(Mean_V1 = mean(V1, na.rm = T),
            Mean_V2 = mean(V2, na.rm = T))
You'll need to add a new line in the summarise() call for each column.
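If there are many columns, a more compact option (an aside, assuming dplyr >= 1.0 for across()) is to apply the same summary to every non-grouping column at once:
library(dplyr)
# mean of every column except the grouping dummy, split by Dummy
df %>%
  group_by(Dummy) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))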
Using base R, you can use colMeans with subsetted data:
colMeans(df[df$Dummy==0, -1])
colMeans(df[df$Dummy==1, -1])
Or you could combine them like this:
data.frame(Col = c("V1", "V2"),
           Mean_0 = colMeans(df[df$Dummy==0, -1]),
           Mean_1 = colMeans(df[df$Dummy==1, -1]))

Using replace_na for multiple data subsets

I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
[data snapshot omitted]
So for student 3, systolic needs two NAs replaced. I used the min and max values of each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(systolic = round(sample(runif(1000, 125, 130), 2), 0),
                  diastolic = round(sample(runif(1000, 85, 85), 3), 0),
                  heart_rate = round(sample(runif(1000, 79, 86), 2), 0),
                  phys_score = round(sample(runif(1000, 8, 9), 2), 0)))
However, it works only when one NA needs replacing (it successfully replaced systolic NA values). When I try to replace more than one NA, this error comes up:
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to do this? Any suggestions/comments would be appreciated. Thanks.
A solution that makes things a little more automated but may be unnecessarily complex.
First, generate some grouped missing data from the mtcars dataset:
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
  as_tibble(rownames = "car") %>%
  select(car) %>%
  separate(car, c("make", "name"), " ") %>%
  bind_cols(mtcars[, -1] %>%
              map_df(~ .[sample(c(TRUE, NA), prob = c(0.8, 0.2),
                                size = length(.), replace = TRUE)])) %>%
  filter(make %in% c("Mazda", "Hornet", "Merc"))
A function to replace NA values in a given variable by sampling between that variable's min and max within some group (here make):
replace_na_sample <- function(df_miss, var, group = "make") {
  var <- enquo(var)
  df_miss %>%
    group_by(.dots = group) %>%
    mutate(replace_var := round(runif(n(), min(!!var, na.rm = T),
                                      max(!!var, na.rm = T)), 0)) %>%
    rowwise %>%
    mutate_at(.vars = vars(!!var),
              .funs = funs(replace_na(., replace_var))) %>%
    select(-replace_var) %>%
    ungroup
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
  replace_na_sample(cyl, group = "make") %>%
  replace_na_sample(disp, group = "make") %>%
  replace_na_sample(hp, group = "make")
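As an aside, group_by(.dots = ), funs(), and bare rowwise are deprecated in current dplyr. A minimal sketch of the same idea in newer syntax (my addition, assuming dplyr >= 1.0, numeric double columns, and the column/group names used above):
library(dplyr)
replace_na_sample2 <- function(df_miss, var, group = "make") {
  df_miss %>%
    group_by(across(all_of(group))) %>%
    # keep observed values; fill NAs with a rounded uniform draw
    # between the group's min and max of the chosen column
    mutate(across({{ var }},
                  ~ coalesce(.x, round(runif(length(.x),
                                             min(.x, na.rm = TRUE),
                                             max(.x, na.rm = TRUE)), 0)))) %>%
    ungroup()
}
mtcars_miss %>% replace_na_sample2(disp)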

dplyr 'object not found' median only

This problem has me stumped.
I have the following data frame:
library(dplyr)
# approximation of data frame
x <- data.frame(doy = sample(c(seq(200, 300)), 20, replace = T),
                year = sample(c("2000", "2005"), 20, replace = T),
                phase = sample(c("pre", "post"), 20, replace = T))
and a simple 'summarize' function that takes in the column name as a variable, and works nicely:
getStats <- function(df, col) {
  col <- as.name(col)
  df %>%
    group_by(year, phase) %>%
    summarize(n = sum(!is.na(col)),
              mean = mean(col, na.rm = T),
              sd = sd(col, na.rm = T),
              se = sd/sqrt(n))
}
> getStats(x, "doy")
Source: local data frame [4 x 6]
Groups: year [?]
    year  phase     n    mean       sd       se
  <fctr> <fctr> <int>   <dbl>    <dbl>    <dbl>
1   2000   post     8 248.625 30.42526 10.75695
2   2000    pre     2 290.000 14.14214 10.00000
3   2005   post     5 231.400 32.86031 14.69558
4   2005    pre     5 274.200 29.79429 13.32441
However, if I modify the function to get the median, it returns an error:
getStats <- function(df, col) {
  col <- as.name(col)
  df %>%
    group_by(year, phase) %>%
    summarize(n = sum(!is.na(col)),
              mean = mean(col, na.rm = T),
              med = median(col, na.rm = T), # new line
              sd = sd(col, na.rm = T),
              se = sd/sqrt(n))
}
> getStats(x, "doy")
Error in median(doy, na.rm = TRUE) : object 'doy' not found
I've tried a host of name and position changes, but all yield the same result: 'median' doesn't accept the column name as a passed variable. I assume I'm missing something so basic I'll do a face palm when someone points it out to me, but in the interim I feel like I'm losing my sanity. I appreciate any insights!
Your proximal problem may be that median doesn't have a ... argument, while mean does (I'm not sure why sd is working ... maybe an interaction between methods and ...?)
In any case, IMO the right way to handle this sort of problem is to use standard evaluation (i.e. not non-standard evaluation: use summarise_ rather than summarise, as illustrated in vignette("nse", package = "dplyr")):
Illustrating how this works in the global environment rather than inside a function, but I think that shouldn't matter ...
col <- "doy"
funs <- c("n", "mean", "stats::median", "sd", "se")
## put together function calls
dots <- c(sprintf("sum(!is.na(%s))", col),
          sprintf("%s(%s,na.rm=TRUE)", funs[2:4], col),
          "sd/sqrt(n)")
names(dots) <- gsub("^.*::", "", funs)  ## ugh
dots
##                               n                    mean
##              "sum(!is.na(doy))"  "mean(doy,na.rm=TRUE)"
##                          median                      sd
## "stats::median(doy,na.rm=TRUE)"   "sd(doy,na.rm=TRUE)"
##                              se
##                    "sd/sqrt(n)"
x %>%
  group_by(year, phase) %>%
  summarise_(.dots = dots)
The only annoying thing here is that for some reason dplyr can't find median unless I call it as stats::median, which means we have to work a little harder to get nice column names. The standard-evaluation method is a little uglier, but that's the price you pay for this kind of flexibility.
Embedding this in a function, I would probably break off getStats in a different place, i.e.
getStats <- function(data, col) {
  ## if you want to pass a string argument instead, remove
  ## the next line
  col <- deparse(substitute(col))
  funs <- c("n", "mean", "stats::median", "sd", "se")
  dots <- c(sprintf("sum(!is.na(%s))", col),
            sprintf("%s(%s,na.rm=TRUE)", funs[2:4], col),
            "sd/sqrt(n)")
  names(dots) <- gsub("^.*::", "", funs)  ## ugh
  summarise_(data, .dots = dots)
}
x %>% group_by(year, phase) %>% getStats(doy)
This gives you more flexibility to do different groupings ...
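A further aside for readers on current dplyr, where summarise_() and the .dots interface have since been deprecated: the same helper can be written with rlang's curly-curly tunneling, {{ }}. This is a sketch of an alternative, not part of the original answer:
library(dplyr)
getStats <- function(df, col) {
  df %>%
    summarise(n    = sum(!is.na({{ col }})),
              mean = mean({{ col }}, na.rm = TRUE),
              med  = median({{ col }}, na.rm = TRUE),
              sd   = sd({{ col }}, na.rm = TRUE),
              se   = sd / sqrt(n))  # later expressions can use n and sd defined above
}
x %>% group_by(year, phase) %>% getStats(doy)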

R: optimize finding max values of function on a data frame and then trim the rest

First of all my data comes from Temperature.xls which can be downloaded from this link: RBook
My code is this:
library(plyr)  # ddply() comes from plyr
temp = read.table("Temperature.txt", header = TRUE)
length(unique(temp$Year)) # number of unique values in the Year vector.
res = ddply(temp, c("Year", "Month"), summarise,
            Mean = mean(Temperature, na.rm = TRUE))
res1 = ddply(temp, .(Year, Month), summarise,
             SD = sd(Temperature, na.rm = TRUE),
             N = sum(!is.na(Temperature))
)
# ordering res1 by sd and year:
res1 = res1[order(res1$Year,res1$SD),];
# finding maximum of SD in res1 by year and displaying just them in a separate data frame
res1_maxsd = ddply(res1, .(Year), summarise, MaxSD = max(SD, na.rm = TRUE)) # find the maxSD in each Year
res1_max = merge(res1_maxsd,res1, all = FALSE) # merge it with the original to see other variables at the max's rows
res1_m = res1_max[res1_max$MaxSD==res1_max$SD,] # find which rows are the ones corresponding to the max value
res1_mm = res1_m[complete.cases(res1_m),] # trim all others (which are NA's)
I know that I can cut the last 4 lines down to fewer lines. Can I somehow execute the last 2 lines in one command? I have stumbled across:
res1_m = res1_max[complete.cases(res1_max$MaxSD==res1_max$SD),]
But this does not give me what I want, which is a smaller data frame containing only the rows (with all the variables) that hold the max SD.
Rather than fixing the last 2 lines, why not start from res1? Reversing the sort order on SD and taking the first row per year gives you an equivalent final data set:
res1 <- res1[order(res1$Year,-res1$SD),]
res_final <- res1[!duplicated(res1$Year),]
The last four lines can be cut down if you use the dplyr package. Since you want to keep some information from the original data set, you probably don't want to use summarise, because it only returns the summarized values and you would then have to merge them back with the original data set; mutate and filter are a better choice:
library(dplyr)
res1_mm1 <- res1 %>% group_by(Year) %>% filter(SD == max(SD, na.rm = T))
You can also use mutate to create the new column MaxSD, which in this case equals SD in the resulting data frame:
res1_mm1 <- res1 %>% group_by(Year) %>% mutate(MaxSD = max(SD, na.rm = T)) %>%
  filter(SD == MaxSD)
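On recent dplyr (>= 1.0.0) there is also slice_max(), which does the keep-the-max-row-per-group step directly; a small sketch, offered as an aside rather than part of the answers above:
library(dplyr)
# one row per Year: the observation with the largest SD
# (with_ties = FALSE keeps a single row even if SDs tie)
res1_mm2 <- res1 %>%
  group_by(Year) %>%
  slice_max(SD, n = 1, with_ties = FALSE) %>%
  ungroup()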
