How to move R code into functions to generalise behaviour - r

I have a huge messy piece of R code with loads of ugly repetition. There is an opportunity to massively reduce it. Starting with this piece of code:
table <-
risk_assigned %>%
group_by(rental_type, room_type) %>%
summarise_all(funs( sum(!is.na(.)) / length(.) ) ) %>%
select(-c(device_id, ts, room, hhi, temp)) %>%
adorn_pct_formatting()
I would like to generalise it into a function so it can be reused.
LayKable = function(kableDetails) {
table <-
risk_assigned %>%
group_by(kableDetails$group1 , kableDetails$group2) %>%
summarise_all(funs( sum(!is.na(.)) / length(.) ) ) #%>%
select(-c(device_id, ts, room, hhi, temp)) %>%
adorn_pct_formatting()
...
kable <- table
return(kable)
}
kableDetails <- list(
group1 = "rental_type",
group2 = "room_type"
)
newKable <- LayKable(kableDetails)
This rather half-hearted attempt serves to explain what I want to do. How can I pass arguments into this function inside a list? (I'm a C programmer, pretending it's a struct.)

When you pass function arguments on to a dplyr verb inside a function, you have to use rlang's tidy evaluation tools. It is simple, though, to define a function that accepts any number of grouping terms:
library(dplyr)
test_func <- function(..., data = mtcars) {
# Passing `data` as a default argument as it's nice to be flexible!
data %>%
group_by(!!!enquos(...)) %>%
summarise(across(everything(), sum), .groups = "drop")
}
test_func(cyl, gear)
#> # A tibble: 8 x 11
#> cyl gear mpg disp hp drat wt qsec vs am carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 3 21.5 120. 97 3.7 2.46 20.0 1 0 1
#> 2 4 4 215. 821 608 32.9 19.0 157. 8 6 12
#> 3 4 5 56.4 215. 204 8.2 3.65 33.6 1 2 4
#> 4 6 3 39.5 483 215 5.84 6.68 39.7 2 0 2
#> 5 6 4 79 655. 466 15.6 12.4 70.7 2 2 16
#> 6 6 5 19.7 145 175 3.62 2.77 15.5 0 1 6
#> 7 8 3 181. 4291. 2330 37.4 49.2 206. 0 0 37
#> 8 8 5 30.8 652 599 7.76 6.74 29.1 0 2 12
Update - adding a list
I see your ideal would be to write a list of arguments for each function call and pass that list, rather than writing out the arguments in each call. You can do this with do.call, which passes a list of named arguments to a function. Again, because dplyr verbs are involved, quote() the variable names when constructing the list (so that R doesn't try to look them up in the global environment while building the list) and use !!enquo() on each argument inside the function:
library(dplyr)
test_func2 <- function(.summary_var, .group_var, data = mtcars) {
data %>%
group_by(!!enquo(.group_var)) %>%
summarise(mean = mean(!!enquo(.summary_var)))
}
# Test with bare arguments
test_func2(hp, cyl)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 82.6
#> 2 6 122.
#> 3 8 209.
# Construct and pass list
args <- list(.summary_var = quote(hp), .group_var = quote(cyl))
do.call(test_func2, args = args)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 82.6
#> 2 6 122.
#> 3 8 209.
A handy guide to tidy evaluation where most of these ideas are explained more clearly.
Created on 2021-12-21 by the reprex package (v2.0.1)
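If the struct-like list holds column names as plain strings, as kableDetails does in the question, another option is the .data pronoun, which looks a column up by its name. The following is only a sketch under assumptions: share_present and details are hypothetical names, mtcars stands in for risk_assigned, and mean(!is.na(.x)) is used as an equivalent of sum(!is.na(.)) / length(.).

```r
library(dplyr)

# Hypothetical helper: `details` is a list whose fields hold column names as strings
share_present <- function(data, details) {
  data %>%
    # .data[[...]] looks each grouping column up by its name
    group_by(.data[[details$group1]], .data[[details$group2]]) %>%
    # share of non-missing values in every non-grouping column
    summarise(across(everything(), ~ mean(!is.na(.x))), .groups = "drop")
}

details <- list(group1 = "cyl", group2 = "gear")
res <- share_present(mtcars, details)
res
```

Because the list only contains strings, it can be built, stored, and reused without any quoting, which is closer to how a C struct of field names would behave.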

Related

R: paired t-test on multiple columns

I am trying to run a t-test on multiple columns. Basically trying to find the change from baseline to year 1 for a number of joint angles. I only want to conduct this on the study side. Below is an image with the first few rows and columns of the data.
I have tried using both of these functions without success:
Code 1:
res <- FAI_SLS %>%
filter(study_side == "Study")%>%
select(-id,-subject,-activity,-side,-study_side,-year) %>%
map_df(~ broom::tidy(t.test(. ~ year)), .id = 'var')
I get the following error:
Error in eval(predvars, data, env) : object 'year' not found
I tried taking out -year but I still have the same issue.
Code 2:
t(sapply(FAI_SLS%>%filter(study_side == "Study")%>%select(-id,-subject,-activity,-side,-study_side,-year), function(x)
unlist(t.test(x~FAI_SLS$year)[c("estimate","p.value","statistic","conf.int")])))
I get the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 't': variable lengths differ (found for 'FAI_SLS$year')
Again I tried taking -year out without success.
Any suggestions on how I can fix this? Thanks
Try fitting the t-test within summarise() on all the columns you want to test (selected in across()). Here's an example with a different dataset:
library(dplyr)
library(tidyr)
data("storms")
storms %>%
filter(year %in% c(2019, 2020)) %>%
summarise(across(-c(name, year, status, category),
~broom::tidy(t.test(. ~ year)))) %>%
pivot_longer(everything(), names_to = "variable") %>%
unnest(value)
#> # A tibble: 9 × 11
#> variable estimate estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 month 0.0917 8.93 8.84 1.15 2.52e- 1 892. -0.0654
#> 2 day 4.29 18.2 13.9 7.49 2.34e-13 641. 3.17
#> 3 hour -0.0596 9.13 9.19 -0.128 8.99e- 1 687. -0.978
#> 4 lat 2.14 25.9 23.7 3.75 1.94e- 4 668. 1.02
#> 5 long 6.06 -60.7 -66.8 4.27 2.25e- 5 736. 3.27
#> 6 wind 8.42 58.8 50.4 4.42 1.18e- 5 529. 4.68
#> 7 pressure -4.46 989. 993. -3.03 2.59e- 3 537. -7.35
#> 8 tropicalst… 7.39 153. 145. 0.810 4.18e- 1 701. -10.5
#> 9 hurricane_… 10.9 24.1 13.2 3.92 1.02e- 4 508. 5.45
#> # … with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
Created on 2022-06-02 by the reprex package (v2.0.1)
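The title of the question asks for a paired test, while the example above runs an unpaired (Welch) t-test per column. A rough base R sketch of the paired version follows. The data frames baseline and year1, the columns hip and knee, and the simulated values are all made up for illustration, and the sketch assumes that both visits can be matched on id.

```r
set.seed(42)

# Made-up data: one row per subject and visit, matched on id
baseline <- data.frame(id = 1:10, hip = rnorm(10, 30, 3), knee = rnorm(10, 60, 5))
year1 <- data.frame(id = 1:10,
                    hip = baseline$hip + rnorm(10, 1),
                    knee = baseline$knee + rnorm(10, -1))

# Put both visits side by side so that each row is one subject
both <- merge(baseline, year1, by = "id", suffixes = c(".base", ".y1"))

angles <- c("hip", "knee")
res <- t(sapply(angles, function(v) {
  # paired = TRUE tests the within-subject differences (year 1 minus baseline)
  tt <- t.test(both[[paste0(v, ".y1")]], both[[paste0(v, ".base")]], paired = TRUE)
  c(mean_change = unname(tt$estimate),
    statistic = unname(tt$statistic),
    p.value = tt$p.value,
    conf.low = tt$conf.int[1],
    conf.high = tt$conf.int[2])
}))
res
```

Merging on id first guards against the two visits being sorted differently, which would silently pair the wrong subjects.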

row bind list columns using dplyr

I would like to find a better way to bind together the results of any number of regressions after adding an identifier for each model. The code below is my current solution but is too manual for a large number of regressions. This is part of a larger tidy workflow so a solution inside of the tidyverse is preferred but whatever works is fine. Thanks
library(tidyverse)
library(broom)
model_dat=mtcars %>%
do(lm_1 = tidy(lm(disp~ wt*vs, data = .),conf.int=T),
lm_2=tidy(lm(cyl ~ wt*vs, data = .),conf.int=T ),
lm_3=tidy(lm(mpg ~ wt*vs, data = .),conf.int=T ))
df=model_dat %>%
select(lm_1) %>%
unnest(c(lm_1)) %>%
mutate(model="one") %>%
select(model,term,estimate,p.value:conf.high) %>%
bind_rows(
model_dat %>%
select(lm_2) %>%
unnest(c(lm_2)) %>%
mutate(model="two") %>%
select(model,term,estimate,p.value:conf.high)) %>%
bind_rows(
model_dat %>%
select(lm_3) %>%
unnest(c(lm_3)) %>%
mutate(model="three") %>%
select(model,term,estimate,p.value:conf.high))
It may be easier with map2: loop across the columns and the corresponding English word for each column's position, pluck the list element, create the 'model' column from the second argument (.y, the English word), select the columns of interest, and combine everything into a single dataset with the _dfr suffix of map2_dfr:
library(purrr)
library(english)
library(dplyr)
library(broom)
map2_dfr(model_dat, as.character(english(seq_along(model_dat))),
~ .x %>%
pluck(1) %>%
mutate(model = .y) %>%
select(model, term, estimate, p.value:conf.high) )
-output
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
Or use summarise with across, unclass and then bind with bind_rows
model_dat %>%
summarise(across(everything(), ~ {
# // get the column name
nm1 <- cur_column()
# // extract the list element (.[[1]])
list(.[[1]] %>%
# // create new column by extracting the numeric part
mutate(model = english(readr::parse_number(nm1))) %>%
# // select the subset of columns, wrap in a list
select(model, term, estimate, p.value:conf.high))
}
)) %>%
# // unclass to list
unclass %>%
# // bind the list elements
bind_rows
-output
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <english> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
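If the models do not have to start life as list columns, a shorter route is to keep the formulas in a named list and let bind_rows() add the identifier through its .id argument. This is a sketch only: the names formulas and model_tbl are hypothetical, and the three models are the ones from the question.

```r
library(dplyr)
library(broom)

# Hypothetical named list: the names become the model identifier
formulas <- list(
  one = disp ~ wt * vs,
  two = cyl ~ wt * vs,
  three = mpg ~ wt * vs
)

model_tbl <- lapply(formulas, function(f) tidy(lm(f, data = mtcars), conf.int = TRUE)) %>%
  # .id turns the list names into a "model" column
  bind_rows(.id = "model") %>%
  select(model, term, estimate, p.value:conf.high)
model_tbl
```

Adding another regression then only means adding one more entry to the list.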

Create new column based on regular expression match

Problem
I would like to create a new column for relative standard deviation using the following formula: stdev * 100 / abs(mean). I have over 40 variables, each with their own stdev and mean (so 80 columns). What I would like to do is use regular expressions to calculate the relative standard deviation from the 2 columns (stdev and mean) based on the preceding names. For example, for columns AceticAcid.stdev and AceticAcid.mean, calculate the relative standard deviation to automatically create a new column AceticAcid.rsd. The equation being: AceticAcid.stdev * 100 / abs(AceticAcid.mean).
Example Dataframe
print(df)
AceticAcid.mean AceticAcid.stdev Glucose.mean Glucose.stdev Propanol.mean Propanol.stdev
1 28.75775 0.911130 48.27333 4.4991249 144.4770 38.34122
2 78.83051 10.562110 28.13337 1.2304387 134.6402 31.76264
3 40.89769 17.848381 37.10283 0.2102977 132.0253 33.76568
4 88.30174 11.028700 32.90534 1.6396036 149.7135 21.56639
5 94.04673 9.132295 14.11699 4.7725182 132.7853 15.88455
Desired Output (Don't care about the order of the new columns)
print(df_rsd)
AceticAcid.mean AceticAcid.stdev Glucose.mean Glucose.stdev Propanol.mean Propanol.stdev AceticAcid.rsd Glucose.rsd Propanol.rsd
1 28.75775 0.911130 48.27333 4.4991249 144.4770 38.34122 3.168294 9.3201039 26.53795
2 78.83051 10.562110 28.13337 1.2304387 134.6402 31.76264 13.398504 4.3735921 23.59076
3 40.89769 17.848381 37.10283 0.2102977 132.0253 33.76568 43.641536 0.5667969 25.57515
4 88.30174 11.028700 32.90534 1.6396036 149.7135 21.56639 12.489788 4.9827894 14.40511
5 94.04673 9.132295 14.11699 4.7725182 132.7853 15.88455 9.710380 33.8069175 11.96258
Repetitive Attempt...
I do not want to write these out 40 times (there has to be a nice regex way to achieve this):
df_rsd <- df %>% mutate(AceticAcid.rsd = AceticAcid.stdev * 100 / abs(AceticAcid.mean),
Glucose.rsd = Glucose.stdev * 100 / abs(Glucose.mean),
Propanol.rsd = Propanol.stdev * 100 / abs(Propanol.mean))
Reproducible Data
structure(list(AceticAcid.mean = c(28.7577520124614, 78.8305135443807,
40.89769218117, 88.3017404004931, 94.0467284293845), AceticAcid.stdev = c(0.911129987798631,
10.5621097609401, 17.8483808878809, 11.0287002893165, 9.13229470606893
), Glucose.mean = c(48.2733338139951, 28.1333662476391, 37.1028254181147,
32.9053360782564, 14.1169873066247), Glucose.stdev = c(4.49912485200912,
1.2304386717733, 0.210297667654231, 1.63960359641351, 4.77251824573614
), Propanol.mean = c(144.476965803187, 134.64017030783, 132.025340688415,
149.713488831185, 132.785289955791), Propanol.stdev = c(38.3412187267095,
31.7626409884542, 33.7656808178872, 21.5663894917816, 15.884545892477
)), class = "data.frame", row.names = c(NA, -5L))
We can use split.default to split the dataset into a list of data.frames, one per variable, by removing the suffix part of the column names. We then loop over the list with lapply, do the calculation (this relies on the mean column coming before the stdev column within each pair), and assign the results to new columns in 'df':
out <- lapply(split.default(df, sub("\\..*", "", names(df))),
function(x) x[[2]]* 100/abs(x[[1]]))
df[paste0(names(out), ".rsd")] <- out
df
# AceticAcid.mean AceticAcid.stdev Glucose.mean Glucose.stdev Propanol.mean Propanol.stdev AceticAcid.rsd Glucose.rsd Propanol.rsd
#1 28.75775 0.911130 48.27333 4.4991249 144.4770 38.34122 3.168294 9.3201039 26.53795
#2 78.83051 10.562110 28.13337 1.2304387 134.6402 31.76264 13.398504 4.3735921 23.59076
#3 40.89769 17.848381 37.10283 0.2102977 132.0253 33.76568 43.641536 0.5667969 25.57515
#4 88.30174 11.028700 32.90534 1.6396036 149.7135 21.56639 12.489788 4.9827894 14.40511
#5 94.04673 9.132295 14.11699 4.7725182 132.7853 15.88455 9.710380 33.8069175 11.96258
Or with tidyverse
library(purrr)
library(dplyr)
library(stringr)
df %>%
split.default(str_remove(names(.), "\\..*")) %>%
map_dfc(~ .x[[2]] * 100/abs(.x[[1]])) %>%
rename_all(~ str_c(., '.rsd')) %>%
bind_cols(df, .)
An alternative, also with the tidyverse:
library(tidyverse)
df_long <- df %>%
mutate(measurement_number=row_number(), .before=1) %>%
pivot_longer(cols=-measurement_number, names_to="var", values_to="value") %>%
separate(var, into=c("var", "indicator")) %>%
pivot_wider(id_cols=c("measurement_number", "var"), names_from = indicator, values_from=value) %>%
mutate(rsd=stdev * 100 / abs(mean)) %>%
arrange(var, measurement_number)
df_long
#> # A tibble: 15 x 5
#> measurement_number var mean stdev rsd
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 AceticAcid 28.8 0.911 3.17
#> 2 2 AceticAcid 78.8 10.6 13.4
#> 3 3 AceticAcid 40.9 17.8 43.6
#> 4 4 AceticAcid 88.3 11.0 12.5
#> 5 5 AceticAcid 94.0 9.13 9.71
#> 6 1 Glucose 48.3 4.50 9.32
#> 7 2 Glucose 28.1 1.23 4.37
#> 8 3 Glucose 37.1 0.210 0.567
#> 9 4 Glucose 32.9 1.64 4.98
#> 10 5 Glucose 14.1 4.77 33.8
#> 11 1 Propanol 144. 38.3 26.5
#> 12 2 Propanol 135. 31.8 23.6
#> 13 3 Propanol 132. 33.8 25.6
#> 14 4 Propanol 150. 21.6 14.4
#> 15 5 Propanol 133. 15.9 12.0
df_wide <- df_long %>%
pivot_wider(id_cols=c("measurement_number"),
names_from = c(var),
values_from = c(mean, stdev, rsd),
names_sep = ".")
df_wide
#> # A tibble: 5 x 10
#> measurement_num~ mean.AceticAcid mean.Glucose mean.Propanol stdev.AceticAcid
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 28.8 48.3 144. 0.911
#> 2 2 78.8 28.1 135. 10.6
#> 3 3 40.9 37.1 132. 17.8
#> 4 4 88.3 32.9 150. 11.0
#> 5 5 94.0 14.1 133. 9.13
#> # ... with 5 more variables: stdev.Glucose <dbl>, stdev.Propanol <dbl>,
#> # rsd.AceticAcid <dbl>, rsd.Glucose <dbl>, rsd.Propanol <dbl>
Created on 2020-05-26 by the reprex package (v0.3.0)
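The split.default approaches above pick the mean and stdev columns by position within each pair. Where the column order cannot be guaranteed, the pairs can be matched by name instead. A minimal base R sketch, assuming every variable has exactly one .mean and one .stdev column; toy and prefixes are illustrative names, and the values are the first two rows of the example data:

```r
# Toy data with one stdev column deliberately placed before its mean column
toy <- data.frame(
  AceticAcid.stdev = c(0.911130, 10.562110),
  AceticAcid.mean = c(28.75775, 78.83051),
  Glucose.mean = c(48.27333, 28.13337),
  Glucose.stdev = c(4.4991249, 1.2304387)
)

# Variable prefixes, taken from the columns that end in ".mean"
prefixes <- sub("\\.mean$", "", grep("\\.mean$", names(toy), value = TRUE))

# Build each <prefix>.rsd column from the matching .stdev and .mean columns
for (p in prefixes) {
  toy[[paste0(p, ".rsd")]] <- toy[[paste0(p, ".stdev")]] * 100 / abs(toy[[paste0(p, ".mean")]])
}
toy
```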

Weird things with Automatically generate new variable names using dplyr mutate

OK this is going to be a long post.
So I am fairly new to R (I am currently using the MR free 3.5, with no checkpoint), but I am trying to work with the tidyverse, which I find very elegant for writing code and often a lot simpler.
I decided to replicate an exercise from guru99 here. It is a simple k-means exercise. However, because I always want to write "generalizable" code, I was trying to automatically rename the variables in mutate with new names. So I searched SO and found this solution here, which is very nice.
First, what works fine.
#library(tidyverse)
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read.csv(link)
rescaled <- df %>% discard(is.factor) %>%
select(-X) %>%
mutate_all(
funs("scaled" = scale)
)
When you download the data with read.csv you get df as a data.frame and everything works.
And now the weird things start. If you download the data with read_csv, or make it a tibble at any point afterwards, and then run the code below, the names change. (Note for future me and others: the first X variable will be named X1, and you need to change is.factor to is.character, because strings are read as character rather than factor unless explicitly asked for.)
df1 <- read_csv(link)
df1 %>% discard(is.character) %>%
select(-X1) %>%
mutate_all(
funs("scaled" = scale)
)
the new variables are named price_scaled[,1] speed_scaled[,1] hd_scaled[,1] ram_scaled[,1] etc. when you view the output in the console, or even if you print() it.
BUT if you call view() on it, you see the output with the names you expect, which are price_scaled speed_scaled hd_scaled etc. ALSO, I am using an R Markdown document for the code, and when I change the chunk output to inline it displays the names correctly, as hd_scaled etc.
Does anyone have any idea how to get the names printed in the console as price_scaled etc.?
Why is this happening?
I thought this would be interesting to ask.
scale() returns a matrix, and dplyr/tibble isn't automatically coercing it to a vector. By changing your mutate_all() call to the below, we can have it return a vector. I identified this is what was happening by calling class(df1$speed_scaled) and seeing the result of "matrix".
library(tidyverse)
link <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/computers.csv"
df <- read_csv(link)
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> price = col_double(),
#> speed = col_double(),
#> hd = col_double(),
#> ram = col_double(),
#> screen = col_double(),
#> cd = col_character(),
#> multi = col_character(),
#> premium = col_character(),
#> ads = col_double(),
#> trend = col_double()
#> )
df %>% discard(is.character) %>%
select(-X1) %>%
mutate_all(
# [, 1] keeps the whole scaled column as a plain vector;
# [[1]] would return only the first value and recycle it down every row
list("scaled" = function(x) scale(x)[, 1])
)
#> # A tibble: 6,259 x 14
#> price speed hd ram screen ads trend price_scaled speed_scaled
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1499 25 80 4 14 94 1 -1.24 -1.28
#> # ... with 6,258 more rows, and 5 more variables: hd_scaled <dbl>,
#> # ram_scaled <dbl>, screen_scaled <dbl>, ads_scaled <dbl>,
#> # trend_scaled <dbl>
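To see what is going on without the dataset, here is a small check on a made-up vector x. It is only an illustration: scale() returns a one-column matrix, [, 1] or as.vector() gives back the full column as a plain vector, while [[1]] extracts just the first element.

```r
x <- c(10, 20, 30, 40, 50)  # made-up values
s <- scale(x)

is.matrix(s)          # scale() returns a matrix
dim(s)                # one row per value, one column

full <- s[, 1]        # whole column as a plain numeric vector
same <- as.vector(s)  # equivalent: drops the dim and scaling attributes
first <- s[[1]]       # only the first scaled value

length(full)
length(first)
```

The matrix column is what makes the console print names such as price_scaled[,1]; once the column is a plain vector, the suffix disappears.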

How can I pass a vector as variable arguments into a function in R [duplicate]

I am not sure if what I am trying to do makes sense in R.
I want to do something like the following:
fun <- function(df, args){
.....
df %>%
group_by(args)
.....
I am trying to pass a character vector as args and then group by its elements as column names, but it does not work.
I have tried get and mget, but they do not work the way I want.
Here's a small example of how to accomplish that. You pass in a character vector of column names as args, and we use syms from rlang to turn it into a list of symbols. We then use the !!! unquote-splice operator to group by those symbols.
library(rlang)
library(dplyr)
fun <- function(df, args){
by <- syms(args)
df %>%
group_by(!!!by) %>%
summarize_all(mean)
}
Using this example with mtcars:
> fun(mtcars, c("cyl"))
# A tibble: 3 x 11
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4.00 26.7 105 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
2 6.00 19.7 183 122 3.59 3.12 18.0 0.571 0.429 3.86 3.43
3 8.00 15.1 353 209 3.23 4.00 16.8 0 0.143 3.29 3.50
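With dplyr 1.0.0 or later, the same thing can be written without converting the strings to symbols, by selecting the grouping columns with across(all_of()). A brief sketch; fun2 is a hypothetical name, and the summary step uses across() in place of summarize_all():

```r
library(dplyr)

# Hypothetical variant of fun(): `args` is a character vector of column names
fun2 <- function(df, args) {
  df %>%
    # all_of() selects exactly the named columns and errors if one is missing
    group_by(across(all_of(args))) %>%
    summarise(across(everything(), mean), .groups = "drop")
}

res <- fun2(mtcars, c("cyl", "am"))
res
```

Because all_of() fails loudly on a misspelled column name, typos in args surface immediately rather than producing an unexpected grouping.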
