I am currently using R's survey library to analyze survey data. I have two samples from two different time periods. My goal is to test if the difference between the two weighted sample means is equal to 0. Question: How do I approach this using R's survey library?
I have tried two approaches to doing this:
Approach 1: Create two different postStratify objects. Toy example:
library(survey)
q1 = c(1,1,1,1,0)
group = c(0,0,0,1,1)
df = data.frame(q1, group)
svy_design = svydesign(ids = ~1, data = df)
pop_data = data.frame(group = c(0,1), Freq = c(10,90))
ps_design = postStratify(svy_design, strata = ~group, pop_data)
first = svymean(~q1, ps_design) # Weighted mean of first sample
q1 = c(1,1,1,0,0)
g2 = c(1,1,0,0,0)
df2 = data.frame(q1, g2)
pop_data_2 = data.frame(g2 = c(0,1), Freq = c(20,80))
svd_2 = svydesign(ids = ~1, data = df2)
psd_2 = postStratify(svd_2, strata = ~g2, pop_data_2)
second = svymean(~q1, psd_2) # Weighted mean of second sample
The problem with this approach is that I do not know how to conduct the difference-of-means test on "first" and "second", the two svymean objects.
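For what it's worth, the closest I can think of is a hand-rolled normal-approximation comparison - just a rough sketch that assumes the two post-stratified samples are independent, and I am not sure this is the right way to use the svymean objects:
# rough sketch, assuming the two samples are independent (unverified)
est_diff <- coef(first) - coef(second)
se_diff <- sqrt(SE(first)^2 + SE(second)^2)
z <- est_diff / se_diff
2 * pnorm(-abs(z))   # two-sided p-value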
Approach 2: Create only one postStratify object. Toy example:
q1 = c(1,1,1,1,0, 1,1,0,0,1)
group = c(0,0,0,1,1, 0,0,1,1,1)
time = c(0,0,0,0,0, 1,1,1,1,1) #Variable that distinguishes between the samples
df = data.frame(q1, group, time)
svy_design = svydesign(ids = ~1 , data = df)
pop_data = data.frame(group = c(0,1), Freq = c(10,90))
ps_design = postStratify(svy_design, strata = ~group, pop_data)
svyby(~q1, ~time, ps_design, svymean)
svyttest(q1~time, ps_design)
The problem with this approach is that when I run svyby just to check the created mean values, the output is not what I expect: it gives mean = 0.5714 for time = 0, whereas the theoretical weighted mean for that period is 0.55. Any insight into why the theoretical mean differs from the svyby result would be greatly appreciated.
Thank you so much for your time.
Are you looking for this? Thanks.
library(survey)
q1 = c(1,1,1,1,0, 1,1,0,0,1)
# edited #
group = c(0,0,0,1,1, 2,2,3,3,3)
time = c(0,0,0,0,0, 1,1,1,1,1) #Variable that distinguishes between the samples
df = data.frame(q1, group, time)
svy_design = svydesign(ids = ~1 , data = df)
# edited #
pop_data = data.frame(group = c(0,1,2,3), Freq = c(10,90,20,80))
ps_design = postStratify(svy_design, strata = ~group, pop_data)
svyby(~q1, ~time, ps_design, svymean)
svyttest(q1~time, ps_design)
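To see where your 0.5714 comes from: with a single post-stratification over the pooled sample, every group-0 respondent gets weight 10/5 = 2 and every group-1 respondent gets weight 90/5 = 18, so within time = 0 the implied population split is 6:36 rather than the 10:90 you intended. With the per-period strata above, the time = 0 respondents get weights 10/3 and 90/2 = 45, which restores the intended split. A quick back-of-the-envelope check (my own arithmetic, not output from the designs):
weighted.mean(c(1,1,1,1,0), c(2,2,2,18,18))            # 0.5714..., your Approach 2
weighted.mean(c(1,1,1,1,0), c(10/3,10/3,10/3,45,45))   # 0.55, the edited design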
I want to run a between-within design MANCOVA with R, with two dependent variables (Planned and Unplanned), two between-subject variables (Genre [Male, Female] and Urb [Yes, No]), one within-subject variable (Period [Before, During]), and one covariate (BMI).
Here is what I've done (see here for similar calculation: https://stats.stackexchange.com/questions/183441/correct-way-to-perform-a-one-way-within-subjects-manova-in-r):
# Create dummy data
data <- data.frame(Quest_before_planned = sample(1:100, 10),
Quest_during_planned = sample(1:100, 10),
Quest_before_unplanned = sample(1:100, 10),
Quest_during_unplanned = sample(1:100, 10),
Genre = sample(rep(c("Male", "Female"), each = 5)),
Urb = sample(rep(c("Yes", "No"), each = 5)),
BMI = sample(1:100, 10))
# Define the within-subjects factor
# (the columns of data.model are ordered before_planned, during_planned,
#  before_unplanned, during_unplanned, so period must alternate)
period <- as.factor(rep(c('before', 'during'), times = 2))
idata <- data.frame(period)
# Create the data structure for the linear model
data.model <- with(data, cbind(Quest_before_planned, Quest_during_planned,
Quest_before_unplanned, Quest_during_unplanned))
# Build the multivariate linear model
mod.mlm <- lm(data.model ~ Genre * Urb, data = data)
# Run the MANOVA (Anova() comes from the car package)
library(car)
mav.blpaq <- Anova(mod.mlm, idata = idata, idesign = ~ period, type = 2)
print(mav.blpaq)
Thus, the between-within design MANOVA works well here. However, I have not managed to add the covariate (i.e., BMI) to this model. Do you know how I can achieve this?
N.B.: I also tried using the (great) mancova() function, which includes a covariate parameter; but with this function, I do not know how to specify that Period is a within-subject variable...
blpaq_macov <- mancova(data_tidy,
deps = c("Quest_planned", "Quest_unplanned"),
factors = c("Genre", "Period", "Urb"),
covs = "BMI",
multivar = "pillai")
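The only idea I have so far - untested, and I am not sure it is statistically sound - is to put BMI on the right-hand side of the multivariate lm, so that car::Anova() carries it as a between-subjects covariate. Is that the correct way to handle the covariate here?
# untested sketch: BMI added as a between-subjects covariate
mod.mlm.cov <- lm(data.model ~ BMI + Genre * Urb, data = data)
mav.cov <- Anova(mod.mlm.cov, idata = idata, idesign = ~ period, type = 2)
print(mav.cov)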
Consider the following data frame:
dat1 <- data.frame(Loc = rep(c("NY","MA","FL","GA"), each = 1000),
Region = rep(c("a","b","c","d"),each = 1000),
ID = rep(c(1:10), each=200),
var1 = rnorm(1000),
var2=rnorm(1000),
var3=rnorm(1000))
Loc and Region are two grouping variables for ID. Assume I have several other data frames like dat1. I am trying to write a function that will automatically fit a random forest model to the data. I want to specify the dataframe, grouping variable, and columns that I want it to use.
I have tried variants of the following functions, but I keep getting error messages that say Error in get(dat, envir = .GlobalEnv) : invalid first argument when I try to run them.
library(caret)
library(randomForest)
rand.f <- function(dat,groupvar,cols){
model <- train(groupvar ~ paste0(cols,collapse = "+"), data = dat, method = "rf", trControl = trainControl("cv", number = 10), importance = T)
c.e <- model$finalModel$confusion[, "class.error"]
print(c.e)
}
rand.f(dat="dat1", groupvar = "Region", cols = 5:6)
model$bestTune
##################
rand.f <- function(dat,groupvar,cols){
model <- train(get(dat, envir=.GlobalEnv)[,groupvar] ~ paste0(cols,collapse = "+"), data = dat, method = "rf", trControl = trainControl("cv", number = 10), importance = T)
c.e <- model$finalModel$confusion[, "class.error"]
print(c.e)
}
rand.f(dat="dat1", groupvar = "Region", cols = 5:6)
model$bestTune
What am I doing wrong?
The following should work:
rand.f <- function(dat, outcome, cols){
  model <- train(x = dat[, cols, drop = FALSE]      # predictor columns
                 , y = as.factor(dat[, outcome])    # outcome coerced to factor for classification
                 , method = "rf"
                 , trControl = trainControl("cv", number = 2)
                 , importance = TRUE)
  c.e <- model$finalModel$confusion[, "class.error"]
  return(c.e)
}
which also works for numbers as well as vectors for the column names, e.g.
cols <- colnames(dat1)[5:6]
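and then, as an illustrative call (my own example, with cols passed in as the third argument of the revised function):
rand.f(dat = dat1, outcome = "Region", cols = cols)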
Note that I renamed the 'grouping' variable, as it is a bit unclear what the grouping variable should be in this context. I have called it outcome to highlight that it is the variable being predicted. If you did indeed intend to predict the region, you can ignore this comment.
If you do want to trigger this function for different groups in your data, i.e. separate forests for different subsets, then you would best do that outside of this function.
How do I go about calculating an index/score from principal component analysis?
Here is a reproducible example:
set.seed(1)
dat <- data.frame(
Diet = sample(1:2),
Outcome1 = sample(1:10),
Outcome2 = sample(11:20),
Outcome3 = sample(21:30),
Response1 = sample(31:40),
Response2 = sample(41:50),
Response3 = sample(51:60)
)
ir.pca <- prcomp(dat[,3:5], center = TRUE, scale. = TRUE)
summary(ir.pca)
loadings <- ir.pca$rotation
scores <- ir.pca$x
correlations <- t(loadings)*ir.pca$sdev
This generates three principal components. Could I use these to calculate a score or an index called 'Response Index' for each row in the above data?
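The only sketch I have come up with so far (an assumption on my part, not something I know to be standard practice) is to use the scores on the first principal component as the index, rescaled to 0-100 for readability. Is that a reasonable way to build such an index?
idx <- ir.pca$x[, 1]                                   # scores on PC1
dat$ResponseIndex <- (idx - min(idx)) / (max(idx) - min(idx)) * 100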
I am having trouble passing data to forecast.lm inside a dplyr do. I want to fit several models based on a factor - hour - and then forecast from these models using new data.
Building on previous excellent examples, here is my data example:
require(dplyr)
require(forecast)
# Training set
df.h <- data.frame(
hour = factor(rep(1:24, each = 100)),
price = runif(2400, min = -10, max = 125),
wind = runif(2400, min = 0, max = 2500),
temp = runif(2400, min = - 10, max = 25)
)
# Forecasting set
df.f <- data.frame(
hour = factor(rep(1:24, each = 10)),
wind = runif(240, min = 0, max = 2500),
temp = runif(240, min = - 10, max = 25)
)
# Bind training & forecasting
df <- rbind(df.h, data.frame(df.f, price=NA))
# Do a training model and then forecast using the new data
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- Arima(hist$price, xreg = hist[,3:4], order = c(1,1,0))
data.frame(fore[], price=forecast.Arima(fit, xreg = fore[ ,2:3])$mean)
})
res
This works excellently with a time series model, but with an lm I have a problem passing the data into the forecasting part.
My corresponding lm example looks like this:
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- lm(hist$price ~ wind + temp, data = hist)
data.frame(fore[], price = forecast.lm(fit, newdata = fore[, 2:3])$mean)
})
The problem is that I can't get the data into the newdata argument. If you add hist$ in the fit section, you can't reference the forecast data, and for some reason if you add data = fore it can't find it - but it can in the time series example.
The problem is that forecast.lm expects that fit has a data component. If you use glm or tslm, that is true. But lm objects don't generally have a data component. So you need to manually add fit$data <- hist for forecast.lm to work properly.
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- lm(price ~ wind + temp, data = hist)
fit$data <- hist # have to add data manually
data.frame(fore[], price = forecast.lm(fit, newdata = fore[, 2:3])$mean)
})
This is actually a known issue.
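As an aside (my own variation, not something the original fix relies on): if you only need point forecasts, you can sidestep forecast.lm entirely, since stats::predict() on an lm object accepts newdata directly:
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- lm(price ~ wind + temp, data = hist)
data.frame(fore, price = predict(fit, newdata = fore))  # point forecasts only
})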
I have data of students from several schools. I want to show a histogram of the percentage of all students that passed the test in each school, using R.
My data looks like this (id,school,passed/failed):
432342 school1 passed
454233 school2 failed
543245 school1 failed
etc.
(The point is that I am only interested in the percentage of students that passed; obviously, those that didn't pass have failed. I want one column for each school showing the percentage of students in that school that passed.)
Thanks
There are many ways to do that.
One is:
df <- data.frame(ID = sample(100),
                 school = factor(sample(3, 100, TRUE), labels = c("School1", "School2", "School3")),
                 result = factor(sample(2, 100, TRUE), labels = c("passed", "failed")))
p <- aggregate((result == "passed") ~ school, data = df, FUN = mean)
barplot(p[, 2] * 100, names.arg = as.character(p[, 1]))
My previous answer didn't go all the way. Here's a redo. The example data is the one from @eyjo's answer.
students <- 400
schools <- 5
df <- data.frame(
id = 1:students,
school = sample(paste("school", 1:schools, sep = ""), size = students, replace = TRUE),
results = sample(c("passed", "failed"), size = students, replace = TRUE, prob = c(.8, .2)))
r <- aggregate(results ~ school, FUN = table, data = df)
r <- data.frame(school = r$school, r$results)  # "flatten" the table result into failed/passed columns
r$sum <- r$failed + r$passed
r$perc.passed <- round(with(r, (passed / sum) * 100), 0)
library(ggplot2)
ggplot(r, aes(x = school, y = perc.passed)) +
theme_bw() +
geom_bar(stat = "identity")
Since you have individual records (id) and want to calculate by an index (school), I would suggest tapply for this.
students <- 400
schools <- 5
df <- data.frame("id" = 1:students,
"school" = sample(paste("school", 1:schools, sep = ""),
size = students, replace = TRUE),
"results" = sample(c("passed", "failed"),
size = students, replace = TRUE, prob = c(.8, .2)))
p <- tapply(df$results == "passed", df$school, mean) * 100
barplot(p)
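For completeness, the same percentages can also be read straight off a two-way table (a small addition of mine, reusing the df from above):
tab <- prop.table(table(df$school, df$results), margin = 1) * 100  # row percentages per school
barplot(tab[, "passed"])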