Using summary(glm-object) inside ldply() with the summarise() - function


How do I use the summary() function inside an ldply()/summarise() call to extract p-values?
Example data:
(The data frame "Puromycin" is preinstalled)
library(reshape2)
library(plyr)
Puromycin.m <- melt( Puromycin , id=c("state") )
Puro.models <- dlply( Puromycin.m , .(variable) , glm , formula = state ~ value ,
family = binomial )
I can construct this data frame with extracted results:
ldply( Puro.models , summarise , "n in each model" = length(fitted.values) ,
"Coefficients" = coefficients[2] )
But I can't extract the p-values in the same way. I thought this would work, but it does not:
ldply( Puro.models , summarise ,
"n in each model" = length(fitted.values) ,
"Coefficients" = coefficients[2],
"P-value" = function(x) summary(x)$coef[2,4] )
How can I extract the p-values into that data frame? Please help!

Why don't you get them directly?
library(reshape2)
library(plyr)
Puromycin.m <- melt( Puromycin , id=c("state") )
Puro.models <- ddply( Puromycin.m , .(variable), function(x) {
  # fit one logistic regression per level of `variable`
  fit <- glm(state ~ value, data = x, family = "binomial")
  data.frame(n = length(fit$fitted.values),
             coef = coefficients(fit)[2],
             pval = summary(fit)$coef[2,4])
})
> Puro.models
# variable n coef pval
# 1 conc 23 -0.55300908 0.6451550
# 2 rate 23 -0.01555023 0.1272184
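The attempt in the question fails because summarise() evaluates each expression against the model object, and `function(x) ...` only defines a function without ever calling it. As a rough sketch, the same extraction can be done in base R without plyr or reshape2. The helper name extract_row and the loop over column names are my own choices for illustration, not part of the answer above.

```r
# Fit one logistic regression per predictor (Puromycin ships with R)
predictors <- c("conc", "rate")
models <- lapply(predictors, function(v) {
  glm(reformulate(v, response = "state"), data = Puromycin, family = binomial)
})
names(models) <- predictors

# Hypothetical helper: pull n, the slope and its p-value from one fitted model
extract_row <- function(fit) {
  coefs <- coef(summary(fit))  # columns: Estimate, Std. Error, z value, Pr(>|z|)
  data.frame(n    = length(fitted(fit)),
             coef = coefs[2, "Estimate"],
             pval = coefs[2, "Pr(>|z|)"])
}

# One row per model; row names come from the list names
results <- do.call(rbind, lapply(models, extract_row))
results$variable <- rownames(results)
print(results)
```

Indexing the coefficient matrix by column name rather than by position makes it harder to pick the wrong column by accident.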

Related

Add a string as a formula with defined function

I want to define a function that takes a covariate term as a string, inserts it at a specific place in a model formula, and converts the result to a formula. I know my code is incorrect, but I do not know how to write it.
What I want is that when I type covars <- "+s(time,bs= 'cr',fx=TRUE,k=7)", the function adds covars to the formula like this: gam.model <- gam(cvd ~ pm10 +s(time,bs= 'cr',fx=TRUE,k=7), data = chicagoNMMAPS , family =poisson, na.rm=T)
library(dlnm) # use chicagoNMMAPS data
library(mgcv)
# define myfun
myfun <- function(covars){
covars <- covars
gam.model <- gam(cvd ~ pm10 + covars, data = chicagoNMMAPS , family =poisson, na.rm=T)
summary(gam.model)
}
myfun("+s(time,bs= 'cr',fx=TRUE,k=7)")
myfun should do this :
gam.model <- gam(cvd ~ pm10 + covars, data = chicagoNMMAPS , family =poisson, na.rm=T)

I am not sure this is exactly what you are looking for, but try as.formula with paste0:
myfunc_formula <- function(covars){
return(as.formula(paste0('cvd ~ pm10 ', covars)))
}
We can later pass the result to gam: gam(myfunc_formula(covars), data = chicagoNMMAPS , family =poisson, na.rm=T)
## In case someone wants to return the summary of given gam model
myfunc_formula_v1 <- function(covars){
gam1 <- gam(as.formula(paste0('cvd ~ pm10 ', covars)), data = chicagoNMMAPS , family =poisson, na.rm=TRUE)
return(summary(gam1))
}
We can also make it more flexible by adding parameters, such as the name of the target variable.
For example, another version could be:
myfunc_formula_v2 <- function(covars, target='cvd'){
return(as.formula(paste0(target, ' ~ pm10 ', covars)))
}
Output:
> myfunc_formula(covars)
cvd ~ pm10 + s(time, bs = "cr", fx = TRUE, k = 7)
given covars = "+s(time,bs= 'cr',fx=TRUE,k=7)"

paste0 works, but reformulate is marginally more elegant:
myfun <- function(covars){
form <- reformulate(c("pm10",covars), response="cvd")
gam.model <- gam(form, data = chicagoNMMAPS , family =poisson, na.rm=TRUE)
summary(gam.model)
}
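To see that the two approaches build the same formula, here is a small sketch that needs neither mgcv nor the data. One assumption of mine: I drop the leading "+" from covars, since reformulate() inserts the separators itself. The second term "temp" is only an illustrative extra covariate.

```r
# Covariate term as a string, without a leading "+"
covars <- "s(time, bs = 'cr', fx = TRUE, k = 7)"

# Approach 1: paste the pieces together and parse them
f1 <- as.formula(paste0("cvd ~ pm10 + ", covars))

# Approach 2: let reformulate() join the term labels with "+"
f2 <- reformulate(c("pm10", covars), response = "cvd")

print(f1)
print(f2)

# reformulate() also accepts several extra terms at once
# ("temp" is an illustrative second covariate)
f3 <- reformulate(c("pm10", covars, "temp"), response = "cvd")
print(attr(terms(f3), "term.labels"))
```

Because reformulate() takes a character vector, a caller can pass any number of terms without worrying about where the "+" signs go.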

How to generate weighted multi-stage table (frequency as well as percentage) using frequency weight in R?

I have a survey data set with a frequency weight (freq_wgt). The survey design is multi-stage cluster sampling. The data set is as follows.
sector <- c(1,2,1,2,1,2,1,2,1,2,2,2,1,1,2,1,2,1,2,2)
sex <- c(2,1,2,2,2,1,2,1,2,1,1,1,1,2,2,1,2,1,2,2)
group <- c(1,2,3,3, 2,1,1,2,3,3,2,1,1,3,3,1,3,1,2,2)
freq_wgt <- c(2,4,5,6,3,4,5,3,2,5,6,7,5,4,3,5,7,8,9,1)
df <- data.frame(sector, sex, group, freq_wgt)
df$sector <- factor(df$sector, levels = c(1,2), labels = c("Rural", "Urban"))
df$sex <- factor(df$sex, levels = c(1,2), labels = c("Male", "Female"))
df$group <- factor(df$group, levels = c(1,2,3), labels = c("STs", "SCs", "Others"))
I want to generate the following kind of multi-strata table (frequencies as well as column/row percentages) after applying the frequency weight.
mytable <- ftable(xtabs(~ sector + sex + group, data= df))
print(mytable)
Note: I found the wtd.table function in the Hmisc package, but it is not suitable because it only generates a one-stage strata table. Thanks in advance.

The design below assumes simple random sampling. You'll need to look at the technical documentation and/or ?svydesign to see how to make it account for multi-stage cluster sampling.
library(survey)
my_design <- svydesign( ~ 1 , data = df , weights = ~ freq_wgt )
svytable( ~ sector + sex + group , my_design )
svyby( ~ sector , ~ sex + group , my_design , svymean )
svyby( ~ sex , ~ sector + group , my_design , svymean )
svyby( ~ group , ~ sector + sex , my_design , svymean )

Like this example (in section 2.1.4) on the NHANES dataset, you can simply use svytable and then calculate a percentage from the totals.
library(tidyverse)
library(survey)
df <- data.frame(
  sector = c(1,2,1,2,1,2,1,2,1,2,2,2,1,1,2,1,2,1,2,2),
  sex = c(2,1,2,2,2,1,2,1,2,1,1,1,1,2,2,1,2,1,2,2),
  group = c(1,2,3,3,2,1,1,2,3,3,2,1,1,3,3,1,3,1,2,2),
  freq_wgt = c(2,4,5,6,3,4,5,3,2,5,6,7,5,4,3,5,7,8,9,1))
df$sector <- factor(c("Rural", "Urban")[df$sector])
df$sex <- factor(c("Male", "Female")[df$sex])
df$group <- factor(c("STs", "SCs", "Others")[df$group])
# define survey design object
# my_design <- ...
svytable(~sex, design = my_design) %>%
as.data.frame() %>%
mutate(prop = Freq / sum(Freq) * 100)
Similarly, in a more hacky way, you could add a dummy variable to the dataset and take svymean of that, which will give you percentages for categorical variables:
# (define data as above)
# Create dummy variable
df$dummy <- 1
# define design object
# my_design <- ...
svyby(~sex, ~dummy, svymean, design = my_design)
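If only the weighted counts and percentages are needed, a base-R sketch with xtabs and prop.table may be enough. This is an assumption-laden shortcut: it treats freq_wgt as a plain frequency weight and produces point estimates only, with no design-based standard errors, so it does not replace a proper svydesign for the multi-stage design.

```r
df <- data.frame(
  sector = factor(c(1,2,1,2,1,2,1,2,1,2,2,2,1,1,2,1,2,1,2,2),
                  levels = 1:2, labels = c("Rural", "Urban")),
  sex = factor(c(2,1,2,2,2,1,2,1,2,1,1,1,1,2,2,1,2,1,2,2),
               levels = 1:2, labels = c("Male", "Female")),
  group = factor(c(1,2,3,3,2,1,1,2,3,3,2,1,1,3,3,1,3,1,2,2),
                 levels = 1:3, labels = c("STs", "SCs", "Others")),
  freq_wgt = c(2,4,5,6,3,4,5,3,2,5,6,7,5,4,3,5,7,8,9,1))

# Weighted frequencies: putting the weight on the left-hand side sums it per cell
wtab <- xtabs(freq_wgt ~ sector + sex + group, data = df)
print(ftable(wtab))

# Row percentages: distribution over group within each sector x sex stratum
row_pct <- prop.table(wtab, margin = c(1, 2)) * 100
print(ftable(round(row_pct, 1)))

# Column percentages: distribution over sector x sex within each group
col_pct <- prop.table(wtab, margin = 3) * 100
print(ftable(round(col_pct, 1)))
```

The weighted cell counts should agree with what svytable returns for a design that uses the same weights; the survey package is still the right tool once standard errors or confidence intervals are required.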

Raking Multiple Imputed dataset

This is a continuation from my previous post
Error with svydesign using imputed data sets
I would like to run the rake() function on my imputed data sets. However, it seems unable to find the input variable. Below is some sample code:
library(mitools)
library(survey)
library(mice)
data(nhanes)
nhanes2$hyp <- as.factor(nhanes2$hyp)
imp <- mice(nhanes2,method=c("polyreg","pmm","logreg","pmm"), seed = 23109)
imp_list <- lapply( 1:5 , function( n ) complete( imp , action = n ) )
des<-svydesign(id=~1, data=imputationList(imp_list))
age.dist <- data.frame(age = c("20-39","40-59", "60-99"),
Freq = nrow(des) * c(0.5, 0.3, .2))
small.svy.rake <- rake(design = des,
sample.margins = list(~age),
population.margins = list(age.dist))
Error in eval(expr, envir, enclos) : object 'age' not found
The code works if I change the input data to a single dataset. That is, instead of des<-svydesign(id=~1, data=imputationList(imp_list)), I have this:
data3 <- complete(imp,1)
des<-svydesign(id=~1, data=data3)
How can I edit the code so that rake() recognizes that the input is a multiply-imputed data set?

# copy over the structure of your starting multiply-imputed design
small.svy.rake <- des
# loop through each of the implicates
# applying the `rake` function to each
small.svy.rake$designs <-
lapply(
des$designs ,
rake ,
sample.margins = list(~age),
population.margins = list(age.dist)
)
# as you'd expect, the overall number changes..
MIcombine( with( des , svymean( ~ bmi ) ) )
MIcombine( with( small.svy.rake , svymean( ~ bmi ) ) )
# ..but the within-age-category numbers do not
MIcombine( with( des , svyby( ~ bmi , ~ age , svymean ) ) )
MIcombine( with( small.svy.rake , svyby( ~ bmi , ~ age , svymean ) ) )
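The pattern above is "apply the same adjustment to every implicate, then combine". A base-R sketch of that idea follows. It is not the survey package's algorithm: with a single margin, raking reduces to rescaling the weights so each category's weighted total hits its target, and that is all this does. The data, the weight column w and the helper rake_one_margin are hypothetical stand-ins.

```r
# Hypothetical stand-in for five completed data sets:
# age is fully observed, bmi differs between implicates
age <- factor(rep(c("20-39", "40-59", "60-99"), times = c(12, 7, 6)))
imp_list <- lapply(1:5, function(i) {
  set.seed(i)
  data.frame(age = age, bmi = rnorm(25, mean = 26, sd = 4), w = 1)
})

# Target weighted totals per age category (same shares as in the question)
target <- c("20-39" = 0.5, "40-59" = 0.3, "60-99" = 0.2) * 25

# Hypothetical helper: rescale weights so each age category sums to its target
rake_one_margin <- function(d, target) {
  group_total <- ave(d$w, d$age, FUN = sum)
  d$w <- d$w * unname(target[as.character(d$age)]) / group_total
  d
}

# Apply the same adjustment to every implicate
raked <- lapply(imp_list, rake_one_margin, target = target)

# Weighted mean of bmi per implicate, then a simple average of point estimates
est <- sapply(raked, function(d) weighted.mean(d$bmi, d$w))
print(mean(est))
```

Averaging the point estimates is only the first half of pooling; MIcombine also combines the within- and between-imputation variances, which this sketch leaves out.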

jackknife paired t-test in R

Consider the following dummy data:
x <- rnorm(15,mean = 3,sd = 1)
y <- rnorm(15,mean = 3,sd = 1)
xy <- c(x,y)
factor <- c(rep('A',15),rep('B',15))
df1 <- data.frame(xy,factor)
df1$PAIR_IDENTIFIER <- 1:15
Now we want to test whether the means differ between factor == 'A' and factor == 'B', so we run a paired t-test:
paired_t_test <- t.test(xy ~ factor, df1, paired = T)
paired_t_test$p.value
Now we want to extend this with a jackknife resample.
That is, we leave out one pair (identified by PAIR_IDENTIFIER) and re-run the t-test. We want to re-run the test 15-1 times.
I have tried to implement the jackknife function from the bootstrap library
library(bootstrap)
n <- length(df1$xy)
theta <- function(x,df1){ t.test(xy ~ factor, df1, paired = T)}
results <- jackknife(1:n,theta,df1)
But I am not sure how to write the function to remove a PAIR_IDENTIFIER for each iteration.

You were close. There's really no need to remove that variable: t.test will only use what's specified in the formula.
theta.fun <- function(x, mydata) {
t.test(xy ~ factor,
data = mydata[mydata$PAIR_IDENTIFIER %in% x, ],
paired = T)$p.value
}
jackknife(x = 1:15, theta = theta.fun, mydata = df1)
$jack.se
[1] 0.5257458
$jack.bias
[1] 0.4501173
$jack.values
[1] 0.4064047 0.1164558 0.6187378 0.2853089 0.5913767 0.3906702 0.3528575 0.5142996 0.2430837 0.5590809 0.5015720 0.6029110 0.3881225
[14] 0.5223167 0.3734025
$call
jackknife(x = 1:15, theta = theta.fun, mydata = df1)
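For readers without the bootstrap package, the leave-one-pair-out loop can be sketched in base R. Two assumptions on my part: the standard error and bias use the usual jackknife formulas, and the test is called as t.test(x, y, paired = TRUE) because, to my knowledge, R 4.4.0 and later reject the paired argument in the formula interface. The seed is arbitrary, so the numbers will differ from the output above.

```r
set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
x <- rnorm(15, mean = 3, sd = 1)
y <- rnorm(15, mean = 3, sd = 1)
n <- length(x)

# p-value on the full set of pairs
p_full <- t.test(x, y, paired = TRUE)$p.value

# Leave out pair i (the i-th element of both x and y) and re-run the test
jack_values <- sapply(seq_len(n), function(i) {
  t.test(x[-i], y[-i], paired = TRUE)$p.value
})

# Usual jackknife standard error and bias estimates
jack_se   <- sqrt((n - 1) / n * sum((jack_values - mean(jack_values))^2))
jack_bias <- (n - 1) * (mean(jack_values) - p_full)

print(jack_values)
print(c(se = jack_se, bias = jack_bias))
```

Dropping the same index from x and y is what keeps the pairing intact, which is the role PAIR_IDENTIFIER plays in the data-frame version.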

Combining lapply, svyby, svyratio to calculate many ratios with confidence intervals

I am using the survey package in R to work with the U.S. Census' PUMS dataset for population. I've created a Boolean for each broad industry and a character variable MigrationStatus with three values (Stayed,Left,Entered). I'd like to examine the ratios of workers in each industry by migration status.
This works fine:
AGR_ratio=svyby(~JobAGR, by=~MigrationStatus, denominator=~EmployedAtWork, design=subset(pums_design,EmployedAtWork==1), svyratio, vartype='ci')
But this produces an error:
varlist=names(pums_design$variables)[32:50]
job_ratios = lapply(varlist, function(x) {
svyby(substitute(~i, list(i = as.name(x))), by=~MigrationStatus, denominator=~EmployedAtWork, design=subset(pums_design,EmployedAtWork==1), svyratio, vartype='ci')
})
#Error in svyby.default(substitute(~i, list(i = as.name(x))), by = ~MigrationStatus, :
#object 'byfactor' not found
varlist
#[1] "JobADM" "JobAGR" "JobCON" "JobEDU" "JobENT" "JobEXT" "JobFIN" "JobINF" "JobMED" "JobMFG" "JobMIL" "JobPRF" "JobRET" "JobSCA" "JobSRV"
#[16] "JobTRN" "JobUNE" "JobUTL" "JobWHL"

How about this?
# setup
library(survey)
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
# single example
svyby(~api99, by = ~stype, denominator = ~api00 , dclus1, svyratio)
# multiple example
variables <- c( "api99" , "pcttest" )
# breaks
lapply(variables, function(x) svyby(substitute(~i, list(i = as.name(x))), by=~stype, denominator=~api00, design=dclus1, svyratio, vartype='ci'))
# works
lapply( variables , function( z ) svyby( as.formula( paste0( "~" , z ) ) , by = ~stype, denominator = ~api00 , dclus1, svyratio , vartype = 'ci' ) )
By the way, you might be interested in this uspums data automation syntax.
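The difference between the two lapply calls can be seen without the survey package. substitute() returns an unevaluated call, whereas as.formula() returns an actual formula object, which is presumably what svyby needs. This sketch only inspects the objects; it makes no claim about svyby's internals.

```r
x <- "api99"

# substitute() builds an unevaluated call to `~`, not a formula object
f_sub <- substitute(~i, list(i = as.name(x)))
print(class(f_sub))

# as.formula() parses and evaluates the text, giving a real formula
f_ok <- as.formula(paste0("~", x))
print(class(f_ok))

# Evaluating the call turns it into a formula as well
print(class(eval(f_sub)))

# reformulate() is another way to build one formula per variable name
variables <- c("api99", "pcttest")
formulas <- lapply(variables, reformulate)
print(formulas)
```

So an alternative fix for the failing version would be to wrap the substitute() call in eval(), although building the formula with as.formula or reformulate reads more directly.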
