I am trying to calculate several binomial proportion confidence intervals. My data are in a data frame, and though I can successfully extract the estimate from the object returned by prop.test, the conf.int variable seems to be null when run on the data frame.
library(dplyr)
cases <- c(50000, 1000, 10, 2343242)
population <- c(100000000, 500000000, 100000, 200000000)
df <- as.data.frame(cbind(cases, population))
df %>% mutate(rate = prop.test(cases, population, conf.level=0.95)$estimate)
This appropriately returns
cases population rate
1 50000 1e+08 0.00050000
2 1000 5e+08 0.00000200
3 10 1e+05 0.00010000
4 2343242 2e+08 0.01171621
However, when I run
df %>% mutate(confint.lower= prop.test(cases, pop, conf.level=0.95)$conf.int[1])
I sadly get
Error in mutate_impl(.data, dots) :
Column `confint.lower` is of unsupported type NULL
Any thoughts? I know alternative ways to calculate the binomial proportion confidence interval, but I would really like to learn how to use dplyr well.
Thank you!
You can use dplyr::rowwise() to group on rows:
df %>%
rowwise() %>%
mutate(lower_ci = prop.test(cases, pop, conf.level=0.95)$conf.int[1])
By default dplyr takes the column names and treats them like vectors. So vectorized functions, like #Jake Fisher mentioned above, just work without rowwise() added.
This is what I would do to catch all of the confidence interval components at once:
df %>%
rowwise %>%
mutate(tst = list(broom::tidy(prop.test(cases, pop, conf.level=0.95)))) %>%
tidyr::unnest(tst)
As of version 1.0.0, rowwise() is no longer being questioned.
As of version 0.8.3 of dplyr, the lifecycle status of the rowwise() function is "questioning".
As an alternative, I would rather recommend the use of purrr::map2() to achieve the goal:
df %>%
mutate(rate = map2(cases, pop, ~ prop.test(.x, .y, conf.level=0.95) %>%
broom::tidy())) %>%
unnest(rate)
Related
I'm trying to calculate the following proportion for each city: mean(age < 25).
My code so far is the following:
namevar <- data %>% group_by(city) %>% mean (age < 25).
My data is clean and has no NA.
If I use mean(age <25) it works, but when I use the group_by function it doesn't.
This is the message that appears:
In mean.default(unlist(x, use.names = FALSE, recursive = TRUE), :
argument is not numeric or logical: returning NA
Thanks a lot for reading and helping :)
We can use mutate (if we want to create a new column) or summarise (if needed to summarise)
library(dplyr)
data1 <- data %>%
group_by(city) %>%
summarise(Prop = mean(age < 25))
I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
Probably you mean summarise instead of mutate, because with mutate you would just replicate your result for each row.
mutate_at and summarise_at are subseeded and you should use across instead.
the reason why your code wasn't working was because you did not write your function as a formula (you did not add ~ at the beginning), also you were using df$Population instead of Population. When you write Population, summarise knows you're talking about the column Population which, at that point, is grouped like the rest of the dataframe. When you use df$Population you are calling the column of the original dataframe without grouping. Not only it is wrong, but you would also get an error because the length of the variable you are trying to average and the lengths of the weights provided by df$Population would not correspond.
Here is how you could do it:
library(dplyr)
df %>%
group_by(cz) %>%
summarise(across(vlist, weighted.mean, Population),
.groups = "drop")
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
I considered df and vlist like the following:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
I'm summarizing a data frame in dplyr with the summarize_all() function. If I do the following:
summarize_all(mydf, list(mean="mean", median="median", sd="sd"))
I get a tibble with 3 variables for each of my original measures, all suffixed by the type (mean, median, sd). Great! But when I try to capture the within-vector n's to calculate the standard deviations myself and to make sure missing cells aren't counted...
summarize_all(mydf, list(mean="mean", median="median", sd="sd", n="n"))
...I get an error:
Error in (function () : unused argument (var_a)
This is not an issue with my var_a vector. If I remove it, I get the same error for var_b, etc. The summarize_all function is producing odd results whenever I request n or n(), or if I use .funs() and list the descriptives I want to compute instead.
What's going on?
The reason it's giving you problems is because n() doesn't take any arguments, unlike mean() and median(). Use length() instead to get the desired effect:
summarize_all(mydf, list(mean="mean", median="median", sd="sd", n="length"))
Here, we can use the ~ if we want to have finer control, i.e. adding other parameters
library(dplyr)
mtcars %>%
summarise_all(list(mean = ~ mean(.), median = ~median(.), n = ~ n()))
However, getting the n() for each column is not making much sense as it would be the same. Instead create the n() before doing the summarise
mtcars %>%
group_by(n = n()) %>%
summarise_all(list(mean = mean, median = median))
Otherwise, just pass the unquoted function
mtcars %>%
summarise_all(list(mean = mean, median = median))
I tried to do a t-test comparing values between time1/2/3.. and threshold.
here is my data frame:
time.df1<-data.frame("condition" =c("A","B","C","A","C","B"),
"time1" = c(1,3,2,6,2,3) ,
"time2" = c(1,1,2,8,2,9) ,
"time3" = c(-2,12,4,1,0,6),
"time4" = c(-8,3,2,1,9,6),
"threshold" = c(-2,3,8,1,9,-3))
and I tried to compare each two values by:
time.df1%>%
select_if(is.numeric) %>%
purrr::map_df(~ broom::tidy(t.test(. ~ threshold)))
However, I got this error message
Error in eval(predvars, data, env) : object 'threshold' not found
So, I tried another way (maybe it is wrong)
time.df2<-time.df1%>%gather(TF,value,time1:time4)
time.df2%>% group_by(condition) %>% do(tidy(t.test(value~TF, data=.)))
sadly, I got this error. Even I limited the condition to only two levels (A,B)
Error in t.test.formula(value ~ TF, data = .) : grouping factor must have exactly 2 levels
I wish to loop t-test over each time column to threshold column per condition, then using broom::tidy to get the results in tidy format. My approaches apparently aren't working, any advice is much appreciated to improve my codes.
An alternative route would be to define a function with the required options for t.test() up front, then create data frames for each pair of variables (i.e. each combination of 'time*' and 'threshold') and nesting them into list columns and use map() combined with relevant functions from 'broom' to simplify the outputs.
library(tidyverse)
library(broom)
ttestfn <- function(data, ...){
# amend this function to include required options for t.test
res = t.test(data$resp, data$threshold)
return(res)
}
df2 <-
time.df1 %>%
gather(time, "resp", - threshold, -condition) %>%
group_by(time) %>%
nest() %>%
mutate(ttests = map(data, ttestfn),
glances = map(ttests, glance))
# df2 has data frames, t-test objects and glance summaries
# as separate list columns
Now it's easy to query this object to extract what you want
df2 %>%
unnest(glances, .drop=TRUE)
However, it's unclear to me what you want to do with 'condition', so I'm wondering if it is more straightforward to reframe the question in terms of a GLM (as camille suggested in the comments: ANOVA is part of the GLM family).
Reshape the data, define 'threshold' as the reference level of the 'time' factor and the default 'treatment' contrasts used by R will compare each time to 'threshold':
time.df2 <-
time.df1 %>%
gather(key = "time", value = "resp", -condition) %>%
mutate(time = fct_relevel(time, "threshold")) # define 'threshold' as baseline
fit.aov <- aov(resp ~ condition * time, data = time.df2)
summary(fit.aov)
summary.lm(fit.aov) # coefficients and p-values
Of course this assumes that all subjects are independent (i.e. there are no repeated measures). If not, then you'll need to move on to more complicated procedures. Anyway, moving to appropriate GLMs for the study design should help minimise the pitfalls of doing multiple t-tests on the same data set.
We could remove the threshold from the select and then reintroduce it by creating a data.frame which would go into the formula object of t.test
library(tidyverse)
time1.df %>%
select_if(is.numeric) %>%
select(-threshold) %>%
map_df(~ data.frame(time = .x, time1.df['threshold']) %>%
broom::tidy(t.test(. ~ threshold)))
I am currently learning purrr in R. I have code which does the following
Uses the pysch package in r to get the mean, SD, range etc from a list of questions
Returns those statistics in a single data-frame where the list item is added to the table as a column. In the case below its schools.
Below is an example where I'm about 90% there i think. All i want to do is add the names of the schools to the dataframe as a column so as to be able to chart them afterwards. Can anyone help? The method below loses the names as soon as the bind_rows() command is run
library(lavaan)
library(tidyverse)
# function pulls the mean, sd, range, kurtosis and skew
get_stats <- function(x){
row_names <- rownames(x)
mydf_temp <- x %>%
dplyr::select(mean, sd, range, kurtosis, skew) %>%
mutate_if(is.numeric, round, digits=2) %>%
filter(complete.cases(.))
mydf_temp
}
# Generate the data for the reproducible example
mydf <- HolzingerSwineford1939 %>%
select(school, starts_with("x")) %>%
psych::describeBy(., group=.$school, digits = 2)
# Gets the summary statistics per school
stats_summ <- mydf %>%
map(get_stats) %>%
bind_rows()
We can use the .id argument from bind_rows
mydf %>%
map(get_stats) %>%
bind_rows(., .id = 'group')
Using a reproducible example with iris dataset
mydf <- iris %>%
psych::describeBy(., group=.$Species, digits = 2)