I am trying to do the same thing as below, except that the naming order changes. I got the code from here:
rownames(mtcars) <- NULL ## drop the car-name row names
df <- mtcars[,c(2,8,9)]
head(df)
(df
%>% pivot_longer(-cyl) ## spread out variables (vs, am)
%>% group_by(cyl,name)
%>% dplyr::mutate(n=n()) ## obs per cyl/var combo
%>% group_by(cyl,name,value)
%>% dplyr::summarise(prop=n()/n) ## proportion of 0/1 per cyl/var
%>% unique() ## not sure why I need this?
%>% pivot_wider(id_cols=c(cyl,name),names_from=value,values_from=prop)
)
Expected answer
cyl name `0` `1`
4 vs 0.0909 0.909
4 am 0.273 0.727
6 vs 0.429 0.571
6 am 0.571 0.429
8 vs 1 NA
8 am 0.857 0.143
One possible solution involves adding three lines at the end of your code.
Basically, you convert name to a factor whose levels come in the order you specify, so that it is internally coded as 1, 2, ...
Then you group by cyl and sort by name within each group. (As for the unique() you were unsure about: summarise(prop = n()/n) divides the group count by the full-length n column, so each group returns one identical row per underlying observation, and unique() collapses those duplicates.)
(df
%>% pivot_longer(-cyl) ## spread out variables (vs, am)
%>% group_by(cyl,name)
%>% dplyr::mutate(n=n()) ## obs per cyl/var combo
%>% group_by(cyl,name,value)
%>% dplyr::summarise(prop=n()/n) ## proportion of 0/1 per cyl/var
%>% unique() ## not sure why I need this?
%>% pivot_wider(id_cols=c(cyl,name),names_from=value,values_from=prop)
%>% mutate(name = factor(name, levels = c("vs", "am")))
%>% group_by(cyl)
%>% arrange(name, .by_group = TRUE)
)
# A tibble: 6 x 4
# Groups: cyl [3]
cyl name `0` `1`
<dbl> <fct> <dbl> <dbl>
1 4 vs 0.0909 0.909
2 4 am 0.273 0.727
3 6 vs 0.429 0.571
4 6 am 0.571 0.429
5 8 vs 1 NA
6 8 am 0.857 0.143
Different take:
df %>% pivot_longer(!cyl) %>% group_by(cyl, name, value) %>% mutate(cnt = n()) %>%
ungroup() %>% group_by(cyl, name) %>% mutate(prop = cnt/n()) %>% distinct() %>%
pivot_wider(id_cols = c(cyl, name), names_from = value, values_from = prop) %>%
arrange(cyl, desc(name))
# A tibble: 6 x 4
# Groups: cyl, name [6]
cyl name `0` `1`
<dbl> <chr> <dbl> <dbl>
1 4 vs 0.0909 0.909
2 4 am 0.273 0.727
3 6 vs 0.429 0.571
4 6 am 0.571 0.429
5 8 vs 1 NA
6 8 am 0.857 0.143
I'm having trouble figuring out how to use purrr::map() with mutate(across(...)).
I want to do a linear model and pull out the estimate for the slope of multiple columns as predicted by a single column.
Here is what I'm attempting with an example data set:
mtcars %>%
mutate(across(-mpg),
map(.x, lst(slope = ~lm(.x ~ mpg, data = .x) %>%
tidy() %>%
filter(term != "(Intercept") %>%
pull(estimate)
)))
The output I'm looking for would be a new column for each non-mpg column, with _slope appended to the name, i.e. cyl_slope.
In my actual data I'll be grouping by another variable as well, in case that matters, as I need the slope for each group for each predicted variable. I have this working in a standard mutate, doing one variable at a time, as follows:
df %>%
group_by(unitid) %>%
nest() %>%
mutate(tuition_and_fees_as_pct_total_rev_slope = map_dbl(data, ~lm(tuition_and_fees_as_pct_total_rev ~ year, data = .x) %>%
tidy() %>%
filter(term == "year") %>%
pull(estimate)
))
So:
I think my issue is how to pass the column name being predicted into lm()
I don't know whether the solution requires nesting, so it would be appreciated if the mtcars example took that into account.
If we want to run lm() on all the other columns with 'mpg' as the independent variable, one option is to loop over the column names of mtcars except 'mpg', create the formula with reformulate(), apply lm(), convert the result to a tidy format, filter out the intercept, and select the 'estimate' column:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
map_dfc(setdiff(names(mtcars), 'mpg'), ~
lm(reformulate('mpg', response = .x), data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(estimate))
-output
# A tibble: 1 x 10
# estimate...1 estimate...2 estimate...3 estimate...4 estimate...5 estimate...6 estimate...7 estimate...8 estimate...9 estimate...10
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
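As a side note (my addition, not part of the original answer): naming the input vector lets map_dfc() label each estimate column by its response variable. A minimal sketch, assuming purrr's set_names():
map_dfc(set_names(setdiff(names(mtcars), 'mpg')), ~
    lm(reformulate('mpg', response = .x), data = mtcars) %>%
      tidy %>%
      filter(term != "(Intercept)") %>%
      pull(estimate)) # columns come out named cyl, disp, hp, ...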
Or this can be done more easily with a matrix as the dependent variable:
library(stringr)
lm(as.matrix(mtcars[setdiff(names(mtcars), "mpg")]) ~ mpg,
data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(response, estimate) %>%
mutate(response = str_c(response, '_slope'))
-output
# A tibble: 10 x 2
# response estimate
# <chr> <dbl>
# 1 cyl_slope -0.253
# 2 disp_slope -17.4
# 3 hp_slope -8.83
# 4 drat_slope 0.0604
# 5 wt_slope -0.141
# 6 qsec_slope 0.124
# 7 vs_slope 0.0555
# 8 am_slope 0.0497
# 9 gear_slope 0.0588
#10 carb_slope -0.148
Or another option is summarise() with across():
mtcars %>%
summarise(across(-mpg, ~ list(lm(reformulate('mpg',
response = cur_column())) %>%
tidy %>%
filter(term != "(Intercept)") %>%
pull(estimate)), .names = "{.col}_slope")) %>%
unnest(everything())
# A tibble: 1 x 10
# cyl_slope disp_slope hp_slope drat_slope wt_slope qsec_slope vs_slope am_slope gear_slope carb_slope
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
One option could be:
map_dfr(.x = names(select(mtcars, -c(mpg, vs))),
~ mtcars %>%
group_by(vs) %>%
nest() %>%
mutate(variable = .x,
estimate = map_dbl(data, function(y) lm(!!sym(.x) ~ mpg, data = y) %>%
tidy() %>%
filter(term != "(Intercept)") %>%
pull(estimate))) %>%
select(-data))
vs variable estimate
<dbl> <chr> <dbl>
1 0 cyl -0.242
2 1 cyl -0.116
3 0 disp -22.5
4 1 disp -8.01
5 0 hp -10.1
6 1 hp -3.26
7 0 drat 0.0748
8 1 drat 0.0529
9 0 wt -0.192
10 1 wt -0.113
11 0 qsec -0.0357
12 1 qsec -0.0432
13 0 am 0.0742
14 1 am 0.0710
15 0 gear 0.114
16 1 gear 0.0492
17 0 carb -0.0883
18 1 carb -0.0790
I came across something weird with dplyr and across(), or at least something I do not understand.
If I use across() to compute the mean and the standard error of the mean for multiple columns, I am tempted to use the following command:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}"),
across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}")) %>% head()
Which results in
gear mpg cyl se_mpg se_cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 16.1 7.47 NA NA
2 4 24.5 4.67 NA NA
3 5 21.4 6 NA NA
However, if I switch the order of the individual across commands, I get the following:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}"),
across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}")) %>% head()
# A tibble: 3 x 5
gear se_mpg se_cyl mpg cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0.871 0.307 16.1 7.47
2 4 1.52 0.284 24.5 4.67
3 5 2.98 0.894 21.4 6
Why is this the case? Does it have something to do with my usage of everything()? In my situation I'd like the mean and the standard error of the mean calculated across every variable in my dataset.
This appears to be because summarise() evaluates its expressions sequentially, and columns created by an earlier expression are visible to the later ones. In your first version, the first across() creates new length-1 columns named mpg and cyl (because of .names = "{col}"), shadowing the raw columns, so everything() in the second across() only sees those single-value summaries, and sd() of a single value is NA. In the reversed version the se_-prefixed names don't collide with anything, so mpg and cyl still refer to the raw data (the second across() also quietly recomputes the se_ columns, but the mean of a single value is that value, so nothing visibly changes).
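A minimal sketch (my addition, not from the original post) of that sequential behaviour:
mtcars %>%
  group_by(gear) %>%
  summarise(mpg = mean(mpg), # creates a new length-1 mpg per group...
            se = sd(mpg))    # ...so sd() sees only that single mean -> NA
With that in mind, I suggest you write a single across() statement and use a list of lambda functions, as suggested by the across() documentation. This way it doesn't matter whether the mean or the standard error is specified first; you will get no NAs.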
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
mean = ~mean(.x, na.rm = TRUE),
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x)))
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear mean_mpg se_mpg mean_cyl se_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 16.1 0.871 7.47 0.307
# 2 4 24.5 1.52 4.67 0.284
# 3 5 21.4 2.98 6 0.894
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))),
mean = ~mean(.x, na.rm = TRUE)
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear se_mpg mean_mpg se_cyl mean_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 0.871 16.1 0.307 7.47
# 2 4 1.52 24.5 0.284 4.67
# 3 5 2.98 21.4 0.894 6
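One caveat from me (not in the original answer): newer dplyr releases use dotted placeholders in the .names glue spec, so on a current version the same call would be written as:
mtcars %>%
  group_by(gear) %>%
  select(mpg, cyl) %>%
  summarise(across(everything(), list(
    mean = ~mean(.x, na.rm = TRUE),
    se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x)))
  ), .names = "{.fn}_{.col}"))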
I have a data frame similar to the data created below:
ID <- data.frame(ID=rep(c(12,122,242,329,595,130,145,245,654,878),each=5))
Var <- data.frame(Variable=c("Copper","Iron","Lead","Zinc","CaCO"))
n <- 10
Variable <- do.call("rbind",replicate(n,Var,simplify=F))
Location <- rep(c("Alpha","Beta","Gamma"), times=c(20,20,10))
Location <- data.frame(Location)
set.seed(1)
FirstPt<- data.frame(FirstPt=sample(1:100,50,replace=T))
LastPt <- data.frame(LastPt=sample(1:100,50,replace=T))
First3<- data.frame(First3=sample(1:100,50,replace=T))
First5<- data.frame(First5=sample(1:100,50,replace=T))
First7<- data.frame(First7=sample(1:100,50,replace=T))
First10<- data.frame(First10=sample(1:100,50,replace=T))
Last3<- data.frame(Last3=sample(1:100,50,replace=T))
Last5<- data.frame(Last5=sample(1:100,50,replace=T))
Last7<- data.frame(Last7=sample(1:100,50,replace=T))
Last10<- data.frame(Last10=sample(1:100,50,replace=T))
data <- cbind(ID,Location,Variable,FirstPt,LastPt,First3,First5,First7,
First10,Last3,Last5,Last7,Last10)
This may be a two-part question, but I want to write a function that groups all Variables that are the same (for instance, all the observations that are Copper) and conducts a paired t-test between all possible combinations of the numeric columns (FirstPt:Last10). I want it to return the p-values in a data frame like this:
Test P-Value
FirstPt.vs.LastPt …
FirstPt.vs.First3 …
etc. …
This will likely be a second function, but I also want to do this after the observations are grouped by Location so that the output data frame will look like this:
Test P-Value
FirstPt.vs.LastPt.InAlpha
FirstPt.vs.LastPt.InBeta
etc.
You can do both of these with one function:
library(tidyverse)
t.test.by.group.combos <- function(.data, groups){
  # recover the bare group names (e.g. "Variable") from the vars() expression
  by <- gsub(x = rlang::quo_get_expr(enquo(groups)),
             pattern = "\\((.*)?\\)", replacement = "\\1")[-1]
  .data %>%
    group_by(!!!groups) %>%
    select_if(is.integer) %>%
    group_split() %>%                  # one data frame per group
    map(.,
        ~pivot_longer(., cols = FirstPt:Last10,
                      names_to = "name", values_to = "val") %>%
          nest(data = val) %>%         # one row per measurement column
          full_join(., ., by = by) %>% # all pairs of columns within the group
          filter(name.x != name.y) %>%
          mutate(test = paste(name.x, "vs", name.y, !!!groups, sep = "."),
                 p.value = map2_dbl(data.x, data.y,
                                    ~t.test(unlist(.x), unlist(.y))$p.value)) %>%
          select(test, p.value) %>%
          filter(!duplicated(p.value)) # keep one of each mirrored A/B pair
    ) %>%
    bind_rows()
}
t.test.by.group.combos(data, vars(Variable))
#> # A tibble: 225 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO 0.511
#> 2 FirstPt.vs.First3.CaCO 0.184
#> 3 FirstPt.vs.First5.CaCO 0.494
#> 4 FirstPt.vs.First7.CaCO 0.354
#> 5 FirstPt.vs.First10.CaCO 0.893
#> 6 FirstPt.vs.Last3.CaCO 0.496
#> 7 FirstPt.vs.Last5.CaCO 0.909
#> 8 FirstPt.vs.Last7.CaCO 0.439
#> 9 FirstPt.vs.Last10.CaCO 0.146
#> 10 LastPt.vs.First3.CaCO 0.578
#> # … with 215 more rows
t.test.by.group.combos(data, vars(Variable, Location))
#> # A tibble: 674 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO.Alpha 0.850
#> 2 FirstPt.vs.First3.CaCO.Alpha 0.822
#> 3 FirstPt.vs.First5.CaCO.Alpha 0.895
#> 4 FirstPt.vs.First7.CaCO.Alpha 0.810
#> 5 FirstPt.vs.First10.CaCO.Alpha 0.645
#> 6 FirstPt.vs.Last3.CaCO.Alpha 0.870
#> 7 FirstPt.vs.Last5.CaCO.Alpha 0.465
#> 8 FirstPt.vs.Last7.CaCO.Alpha 0.115
#> 9 FirstPt.vs.Last10.CaCO.Alpha 0.474
#> 10 LastPt.vs.First3.CaCO.Alpha 0.991
#> # … with 664 more rows
This is a somewhat lengthy function, but in general: we group by the groups argument, select the grouping and integer columns, and split the data frame into one piece per group. We then map over all pairs of measurement columns and run a t.test() for each pair. Lastly, we bind all the groups back into one data frame.
I think this is what you want. The key is to use group_by() and do() from the tidyverse:
df <- NULL
# loop over every pair of numeric columns (FirstPt is column 4)
for(i in (4:(ncol(data)-1))){
  for(j in ((i+1):ncol(data))){
    df <- rbind(df, data %>%
      group_by(Location) %>%
      do(data.frame(pval = t.test(.[[i]], .[[j]])$p.value)) %>%
      ungroup() %>%
      mutate(Test = paste0(colnames(data)[i], '.vs.', colnames(data)[j]))
    )
  }
}
df$Test <- paste0(df$Test,'.In',df$Location)
You can probably achieve what you want using the code below:
library(dplyr)
library(tidyr)
data %>%
pivot_longer(cols = FirstPt:Last10) %>%
group_by(Variable) %>%
summarise(p_value = list(combn(name, 2, function(x)
t.test(value[name == x[1]], value[name == x[2]])$p.value)),
test = list(combn(name, 2, paste, collapse = "_"))) %>%
unnest(cols = c(test, p_value))
# Variable p_value test
# <fct> <dbl> <chr>
# 1 CaCO 0.915 FirstPt_LastPt
# 2 CaCO 0.529 FirstPt_First3
# 3 CaCO 0.337 FirstPt_First5
# 4 CaCO 0.350 FirstPt_First7
# 5 CaCO 0.395 FirstPt_First10
# 6 CaCO 0.765 FirstPt_Last3
# 7 CaCO 0.204 FirstPt_Last5
# 8 CaCO 0.873 FirstPt_Last7
# 9 CaCO 0.479 FirstPt_Last10
#10 CaCO 1 FirstPt_FirstPt
# … with 24,740 more rows
To group the tests by Location as well, add Location to the group_by() call and keep the rest of the code as it is; see the sketch below.
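For instance, a minimal sketch of that variant (my addition; the same pipeline with Location added to the grouping):
data %>%
  pivot_longer(cols = FirstPt:Last10) %>%
  group_by(Variable, Location) %>%
  summarise(p_value = list(combn(name, 2, function(x)
              t.test(value[name == x[1]], value[name == x[2]])$p.value)),
            test = list(combn(name, 2, paste, collapse = "_"))) %>%
  unnest(cols = c(test, p_value))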
Using the mtcars data, I want to calculate the proportion of mpg for each group of cyl and am. How do I calculate it?
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n(mpg)) %>%
mutate(mpg.gr = mpg/(sum(mpg))
Thanks in advance!
If I understand you correctly, you want the proportion of records for each combination of cyl and am. If so, I believe your code isn't working because n() doesn't accept an argument. You also need to ungroup() before calculating your proportions.
You could simply do:
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n()) %>%
ungroup() %>%
mutate(mpg.gr = mpg / sum(mpg))
#> # A tibble: 6 x 4
#> cyl am mpg mpg.gr
#> <dbl> <dbl> <int> <dbl>
#> 1 4 0 3 0.0938
#> 2 4 1 8 0.25
#> 3 6 0 4 0.125
#> 4 6 1 3 0.0938
#> 5 8 0 12 0.375
#> 6 8 1 2 0.0625
Note that thanks to ungroup(), the proportions are calculated from the counts of all records, not just those within each cyl group as they would be with the grouping left in place.
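For comparison, here is a minimal sketch (my addition) of the within-group version: summarise() drops only the last level of grouping, so leaving out ungroup() yields proportions within each cyl instead:
mtcars %>%
  group_by(cyl, am) %>%
  summarise(n = n()) %>%               # result is still grouped by cyl
  mutate(prop_within_cyl = n / sum(n)) # sums within each cyl group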
I want to use the size of a group as part of a groupwise operation in dplyr::summarise.
E.g. to calculate the proportion of manuals by cylinder, group the cars data by cyl and divide the number of manuals by the size of the group:
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/group_size(.))
But (I think) because group_size() expects a grouped tbl_df and . here is ungrouped, this returns:
Error in mutate_impl(.data, dots) : basic_string::resize
Is there a way to do this?
You can probably use n() to get the number of rows per group:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(zz = sum(am)/n())
# cyl zz
# <dbl> <dbl>
#1 4.00 0.727
#2 6.00 0.429
#3 8.00 0.143
Since am is a 0/1 variable, sum(am)/n() is just a grouped mean:
mtcars %>%
group_by(cyl) %>%
summarise(zz = mean(am))
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
If we really need to use group_size():
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest %>%
mutate(zz = map_dbl(data, ~ sum(.x$am)/group_size(.x))) %>%
arrange(cyl) %>%
select(-data)
# A tibble: 3 x 2
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143
Or using do():
mtcars %>%
group_by(cyl) %>%
do(data.frame(zz = sum(.$am)/group_size(.)))
# A tibble: 3 x 2
# Groups: cyl [3]
# cyl zz
# <dbl> <dbl>
#1 4 0.727
#2 6 0.429
#3 8 0.143