How to make a working t.test command line into a function?

I have a line of code that works on its own, but when I try to turn it into a function, the function does not work.
Data set:
cooper <- data.frame(preDist = c(2454, 2666, 2153, 2144, 2957, 2407, 2167, 2259,
                                 1993, 2351, 1642, 2121, 2603, 2669, 2064),
                     postDist = c(2763, 2710, 2272, 2342, 3256, 2617, 2515, 2469,
                                  2257, 2637, 1597, 2331, 2616, 2679, 2114),
                     group = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
                                    labels = c("Group1", "Group2", "Cont")))
Working code:
t.test(cooper$postDist[cooper$group == "Group1"],
       cooper$preDist[cooper$group == "Group1"],
       alternative = "greater",
       paired = TRUE)$p.value
This returns the correct value for my chosen group (Group1).
Not-working function:
pairtest <- function(grp) {
  pvalue <- t.test(cooper$postDist[cooper$group == "grp"],
                   cooper$preDist[cooper$group == "grp"],
                   alternative = "greater", paired = TRUE)$p.value
  return(pvalue)
}
pairtest(Group1)
Reports "not enough 'x' observations".

The quotes are the problem: inside the function, cooper$group == "grp" compares against the literal string "grp", so the subsets are empty. Pass the group label as a string, pass the data frame as an argument, and drop the quotes:
pairtest <- function(grp, df) {     # add the data frame to your input
  with(df[df$group == grp, ],       # filter the data frame on the input
       t.test(postDist, preDist, alternative = "greater", paired = TRUE)$p.value)
  # changed pre to preDist and post to postDist;
  # postDist comes first so the order matches the working one-liner
}
pairtest("Group1", cooper)

Related

Pooled average marginal effects from survey-weighted and multiple-imputed data

I am working with survey data and their associated weights, in addition to missing data that I imputed using mice(). The model I'm eventually running contains complex interactions between variables, for which I want the average marginal effects.
This task seems trivial in Stata, but I'd rather stay in R since that's what I know best. It seems easy to retrieve AMEs for each separate imputed dataset and average the estimates. However, I need to make use of pool() (from mice) to make sure I'm getting the correct standard errors.
Here is a reproducible example:
library(tidyverse)
library(survey)
library(mice)
library(margins)
df <- tibble(y = c(0, 5, 0, 4, 0, 1, 2, 3, 1, 12),
             region = c(1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
             weight = c(7213, 2142, 1331, 4342, 9843, 1231, 1235, 2131, 7548, 2348),
             x1 = c(1.14, 2.42, -0.34, 0.12, -0.9, -1.2, 0.67, 1.24, 0.25, -0.3),
             x2 = c(12, NA, 10, NA, NA, 12, 11, 8, 9, 9))
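The question does not show how imputed_df and surv_obj are created; a minimal sketch of the missing setup (my assumption, using mitools::imputationList so that svydesign() returns the svyimputationList the errors below refer to) might look like this:
library(mitools)
# hypothetical setup, not part of the original question
imputed_df <- mice(df, m = 5, seed = 123)
surv_obj <- svydesign(ids = ~ 1, weights = ~ weight,
                      data = imputationList(complete(imputed_df, "all")))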
Using margins() on a simple (non-multiple) svyglm works without a hitch. Running svyglm on each imputation using with() and pooling the results also works well.
m <- with(surv_obj, svyglm(y ~ x1 * x2))
pool(m)
However, wrapping margins() into with() returns an error: "Error in .svycheck(design) : argument "design" is missing, with no default"
with(surv_obj, margins(svyglm(y ~ x1 * x2), design = surv_obj))
If I specify the design in the svyglm call, I get "Error in UseMethod("svyglm", design) : no applicable method for 'svyglm' applied to an object of class "svyimputationList""
with(surv_obj, margins(svyglm(y ~ x1 * x2, design = surv_obj), design = surv_obj))
If I drop the survey layer, and simply try to run the margins on each imputed set and then pool, I get a warning: "Warning in get.dfcom(object, dfcom) : Infinite sample size assumed.".
m1 <- with(imputed_df, margins(lm(y ~ x1 * x2)))
pool(m1)
This worries me given that pool() may use sample size in its calculations.
Does anyone know of any method to either (a) use with(), margins() and pool() to retrieve the pooled average marginal effects or (b) know what elements of margins() I should pass to pool() (or pool.scalar()) to achieve the desired result?
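A side note on the "Infinite sample size assumed" warning: assuming a mice version in which pool() exposes the dfcom argument (the complete-data degrees of freedom), one possible sketch, not from the original post, is to supply it explicitly:
# 10 observations minus the 4 coefficients of y ~ x1 * x2
m1 <- with(imputed_df, margins(lm(y ~ x1 * x2)))
pool(m1, dfcom = nrow(df) - 4)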
Update following Vincent's comment
I wanted to update this post following Vincent's comment and the related marginaleffects package, which ended up fixing my issue. Hopefully this will be helpful to others stuck on similar problems.
I implemented the code from the vignette linked in Vincent's comment, adding a few steps that allow for survey weighting and modeling. It's worth noting that svydesign() will drop any observations with missing values on the clustering/weighting variables, so marginaleffects() can't predict values back onto the original "dat" data and will throw an error. Pooling my actual data still throws an "infinite sample size assumed" warning, which (as noted) should be fine, but I'm still looking into fixes.
library(tidyverse)
library(survey)
library(mice)
library(marginaleffects)
fit_reg <- function(dat) {
  # weighted survey design and model for one completed (imputed) dataset
  svy <- svydesign(ids = ~ 1, cluster = ~ region, weights = ~ weight, data = dat)
  mod <- svyglm(y ~ x1 + x2 * factor(x3), design = svy)
  # marginal effects, predicted back onto the completed data
  out <- marginaleffects(mod, newdata = dat)
  class(out) <- c("custom", class(out))  # so pool() picks up the tidy method below
  return(out)
}
tidy.custom <- function(x, ...) {
  out <- marginaleffects:::tidy.marginaleffects(x, ...)
  out$term <- paste(out$term, out$contrast)  # keep factor contrasts as distinct terms
  return(out)
}
df <- tibble(y = c(0, 5, 0, 4, 0, 1, 2, 3, 1, 12),
             region = c(1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
             weight = c(7213, 2142, 1331, 4342, 9843, 1231, 1235, 2131, 7548, 2348),
             x1 = c(1.14, 2.42, -0.34, 0.12, -0.9, -1.2, 0.67, 1.24, 0.25, -0.3),
             x2 = c(12, NA, 10, NA, NA, 12, 11, 8, 9, 9),
             x3 = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))
imputed_df <- mice(df, m = 2, seed = 123)
dat_mice <- complete(imputed_df, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
summary(mod_imputation)

Stratified Sampling a Dataset and Averaging a Variable within the Train Dataset

I'm currently trying to do a stratified split in R to create train and test datasets.
The problem posed to me is the following:
split the data into a train and test sample such that 70% of the data
is in the train sample. To ensure a similar distribution of price
across the train and test samples, use createDataPartition from the
caret package. Set groups to 100 and use a seed of 1031. What is the
average house price in the train sample?
The dataset is a set of houses with prices (along with other data points).
For some reason, when I run the following code, the output I get is marked as incorrect by the practice problem simulator. Can anyone spot an issue with my code? Any help is much appreciated, since I'm trying to avoid learning this language incorrectly.
dput(head(houses))
library(ISLR); library(caret); library(caTools)
options(scipen=999)
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = houses$price,p = 0.7,list = F, groups = 100)
train = houses[split,]
test = houses[-split,]
nrow(train)
nrow(test)
nrow(houses)
mean(train$price)
mean(test$price)
Output
> dput(head(houses))
structure(list(id = c(7129300520, 6414100192, 5631500400, 2487200875,
1954400510, 7237550310), price = c(221900, 538000, 180000, 604000,
510000, 1225000), bedrooms = c(3, 3, 2, 4, 3, 4), bathrooms = c(1,
2.25, 1, 3, 2, 4.5), sqft_living = c(1180, 2570, 770, 1960, 1680,
5420), sqft_lot = c(5650, 7242, 10000, 5000, 8080, 101930), floors = c(1,
2, 1, 1, 1, 1), waterfront = c(0, 0, 0, 0, 0, 0), view = c(0,
0, 0, 0, 0, 0), condition = c(3, 3, 3, 5, 3, 3), grade = c(7,
7, 6, 7, 8, 11), sqft_above = c(1180, 2170, 770, 1050, 1680,
3890), sqft_basement = c(0, 400, 0, 910, 0, 1530), yr_built = c(1955,
1951, 1933, 1965, 1987, 2001), yr_renovated = c(0, 1991, 0, 0,
0, 0), age = c(59, 63, 82, 49, 28, 13)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
>
> library(ISLR); library(caret); library(caTools)
> options(scipen=999)
>
> set.seed(1031)
> #STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
> split = createDataPartition(y = houses$price,p = 0.7,list = F, groups = 100)
>
> train = houses[split,]
> test = houses[-split,]
>
> nrow(train)
[1] 15172
> nrow(test)
[1] 6441
> nrow(houses)
[1] 21613
>
> mean(train$price)
[1] 540674.2
> mean(test$price)
[1] 538707.6
I tried to reproduce it manually using sample_frac from the dplyr package and the cut2 function from the Hmisc package. The results are almost the same, but still not identical.
It looks like there might be a difference in the pseudo-random number generation or in some rounding.
In my opinion, your code looks correct.
Is it possible that in a previous step you were supposed to remove some outliers or pre-process the dataset in some other way?
library(caret)
options(scipen=999)
library(dplyr)
library(ggplot2) # to use diamonds dataset
library(Hmisc)
diamonds$index = 1:nrow(diamonds)
set.seed(1031)
# I use diamonds dataset from ggplot2 package
# g parameter (in cut2) - number of quantile groups
split = diamonds %>%
  group_by(cut2(diamonds$price, g = 100)) %>%
  sample_frac(0.7) %>%
  pull(index)
train = diamonds[split,]
test = diamonds[-split,]
> mean(train$price)
[1] 3932.75
> mean(test$price)
[1] 3932.917
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = diamonds$price,p = 0.7,list = T, groups = 100)
train = diamonds[split$Resample1,]
test = diamonds[-split$Resample1,]
> mean(train$price)
[1] 3932.897
> mean(test$price)
[1] 3932.572
This sampling procedure should result in sample means that approximate the population mean.
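As a quick additional check (my own addition, using the diamonds split above), you can compare the price distributions rather than only the means; the stratified split should give nearly identical quantiles in train and test:
quantile(train$price, probs = c(0.25, 0.5, 0.75))
quantile(test$price, probs = c(0.25, 0.5, 0.75))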

Join two data frames and add NAs for missing values of joining column

I am trying to join two data frames using dplyr left_join. The code is provided below:
file = c('f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15')
word_count = c(14806, 804, 168, 27172, 782, 699, 1891, 64385, 738, 9497, 74, 181183, 173, 37, 2321)
wc_df <- cbind.data.frame(file, word_count)
variant = c('bcause', 'bcos', 'bcos', 'bcos', 'bcos', 'bcos', 'bcos', 'bcos', 'bcos', 'bcos', 'bcus', 'bcus', 'bcus', 'because', 'because', 'because', 'because', 'because', 'because', 'because', 'because', 'because', 'because', 'becos', 'becos', 'becos', 'becos', 'becos', 'becos', 'becos', 'becos', 'becos', 'bicos', 'cos', 'cos', 'cos', 'cos', 'cos', 'cos', 'cos', 'cos', 'cos', 'cus', 'cus', 'cus', 'cus', 'cus', 'cuz', 'cuz', 'cuz', 'cuz', 'cuz', 'cuz', 'cz')
file = c('f1', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12')
freq = c(2, 14, 3, 1, 3, 11, 14, 5, 19, 4, 2, 1, 1, 23, 2, 51, 1, 1, 4, 52, 2, 29, 4, 7, 3, 1, 112, 12, 7, 7, 2, 8, 646, 15, 1, 1, 7, 9, 1, 13, 14, 11, 1, 6, 1, 4, 2, 4, 5, 2, 3, 1, 3, 1)
freq_df <- cbind.data.frame(file, variant, freq)
new_df <- left_join(freq_df, wc_df)
As can be seen, the file column is used to join the two data frames. The problem is that freq_df will always have a few values (or levels) of file missing: wc_df has 15 levels of file, but freq_df only has 12 of them. When I join these two using the above-mentioned dplyr function, the values/levels of file that are not present in the first data frame are lost.
After searching around I came across the complete() function from tidyr. I can apply it as follows to get all combinations of the existing file and variant values (12 * 10 = 120 rows):
new_df <- left_join(freq_df, wc_df) %>% tidyr::complete(file, variant, fill = list(freq = 0))
However, I want the missing values/levels of file (i.e. those present only in wc_df) to appear in new_df as well, with NA (or 0) in the third column freq for those combinations of variant and file (so that the total number of rows is 15 * 10 = 150).
How can I do it?
I am not sure I fully understand the question, but it sounds like you need dplyr::full_join() instead of dplyr::left_join to preserve non-matched values of file from either of the two joined data frames.
Thanks to the above answer with full_join, I have come up with a solution to solve my problem. The full_join will introduce NAs in freq and variant columns. I replace the NAs in each column step-by-step as follows:
new_df <- full_join(freq_df, wc_df) %>% as.data.frame() # full join and convert to a data frame
# replace NAs with the previous values; not converting to a data frame above
# causes an error that the column cannot be modified because it is a grouping variable
new_df <- tidyr::fill(new_df, variant)
#NAs in freq column are simply replaced by zeroes
new_df$freq <- replace(new_df$freq, is.na(new_df$freq), 0)
Now all of the values of the joining column file are present in the new_df. The NAs introduced have been manually patched.
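For completeness, a sketch of an alternative (my own, not from the answers above) that builds the full 15 * 10 grid of file and variant directly and fills the missing frequencies with 0:
library(dplyr)
library(tidyr)
full_grid <- expand_grid(file = wc_df$file, variant = unique(freq_df$variant))
new_df2 <- full_grid %>%
  left_join(freq_df, by = c("file", "variant")) %>% # attach observed frequencies
  left_join(wc_df, by = "file") %>%                 # attach word counts for every file
  mutate(freq = replace_na(freq, 0))
nrow(new_df2) # 150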

SumIfs in R - creating a subset off of multiple criteria and summing a specific column

I have a set of panel data similar to:
city <- c("ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR", "ARI", "ATL", "BAL", "BUF", "CAR")
week <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5)
df <- as.data.frame(cbind(city, week))
df$week <- as.numeric(df$week)
df$x <- c(6, 3, 9, 12, 4, 3, 7, 8, 2, 12, 15, 6, 3, 9, 0, 14, 18, 2, 21, 15, 17, 9, 10, 1, 22)
I would like to create a new variable, df$y, that sums df$x for each city, and for each week, prior to the week currently being observed. So, for example, df$y[25] should equal 31 because sum(df[df$city == "CAR" & df$week < 5, 3]) equals 31.
My question is, how can I write this in a function to do this automatically?
To use sum(df[df$city == "CAR" & df$week < 5, 3]) for each team and week combination would be tedious. My natural inclination is to write something like df$y <- sum(df[df$city == df$city & df$week < df$week, 3]), but that doesn't make sense. I'm new to R and don't fully understand functions; but, is that the best route for what I'm trying to do?
Thanks for your help!
One option with dplyr
library(dplyr)
res <- df %>%
  group_by(city) %>%
  mutate(y = cumsum(lag(x, default = 0)))
res[25,]
# A tibble: 1 x 4
# Groups: city [1]
# city week x y
# <fctr> <dbl> <dbl> <dbl>
#1 CAR 5 22 31
One option with data.table
library(data.table)
setDT(df)[, y := c(0, cumsum(x[-length(x)])), by = 'city']
df
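The same idea in base R, as a small sketch (assuming the rows are ordered by week within each city, as in the example data):
# per city, cumulative sum of x over the previous weeks
df$y <- ave(df$x, df$city, FUN = function(v) cumsum(c(0, head(v, -1))))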

Leave-one-out cross validation by leaving out two IDs during the training process

I have a dataframe df
df<-structure(list(ID = c(4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5,
6, 6, 6, 6, 8, 8, 8, 9, 9), Y = c(2268.14043972082, 2147.62290922552,
2269.1387550775, 2247.31983098201, 1903.39138268307, 2174.78291538358,
2359.51909126411, 2488.39004804939, 212.851575751527, 461.398994384333,
567.150629704352, 781.775113821961, 918.303706148872, 1107.37695799186,
1160.80594193377, 1412.61328924168, 1689.48879626486, 685.154353165934,
574.088067465695, 650.30821636616, 494.185166497016, 436.312162090908
), P = c(1750.51986303926, 1614.11541634798, 951.847023338079,
1119.3682884872, 1112.38984390156, 1270.65773075982, 1234.72262170166,
1338.46096616983, 1198.95775346458, 1136.69287367165, 1265.46480803983,
1364.70149818063, 1112.37006707489, 1346.49240261316, 1740.56677791104,
1410.99217295647, 1693.18871380948, 275.447173420805, 396.449789014179,
251.609239829704, 215.432550271042, 55.5336257666349), A = c(49,
50, 51, 52, 53, 54, 55, 56, 1, 2, 3, 4, 5, 14, 15, 16, 17, 163,
164, 165, 153, 154), TA = c(9.10006221322572, 7.65505467142961,
8.21480062559674, 8.09251754304318, 8.466220758789, 8.48094407814006,
8.77304120569444, 8.31727518543397, 8.14410265791868, 8.80921738865237,
9.04091478341757, 9.66233618146246, 8.77015716015164, 9.46037931956657,
9.59702379240667, 10.1739258740118, 9.39524442215692, -0.00568604734662462,
-2.12940164413048, -0.428603434930109, 1.52337963973006, -1.04714984064565
), TS = c(9.6499861763085, 7.00622420539595, 7.73511170298675,
7.68006974050443, 8.07442411510912, 8.27687965909096, 8.76025039592727,
8.3345638889156, 9.23658956753677, 8.98160722605782, 8.98234210211611,
9.57066566368204, 8.74444401914267, 8.98719629775988, 9.18169205278566,
9.98225438314085, 9.56196773059615, 5.47788158053928, 2.58106090926808,
3.22420704848299, 1.36953555753786, 0.241334267522977), R = c(11.6679680423377,
11.0166459173372, 11.1851268491296, 10.7404563561694, 12.1054055597684,
10.9551321815546, 11.1975918244469, 10.7242192465965, 10.1661703705992,
11.4840412725324, 11.1248456370953, 11.2529612597628, 10.7694642397996,
12.3300887767583, 12.0478558531771, 12.3212362249214, 11.5650773932264,
9.56070414783612, 9.61762902218185, 10.2076240621201, 11.8234628013552,
10.9184029778985)), .Names = c("ID", "Y", "P", "A", "TA", "TS",
"R"), na.action = structure(77:78, .Names = c("77", "78"), class = "omit"), row.names = c(NA,
22L), class = "data.frame")
I am currently running a linear regression in leave-one-out cross-validation mode. In other words, during training I remove one site (ID) in each iteration and test the model on the site left out. The procedure is below:
df$prediction <- NA
for (id in unique(df$ID)) {
  train.df <- df[df$ID != id, ]
  test.df <- df[df$ID == id, c("P", "A", "TA", "TS", "R")]
  lm.df <- glm(Y ~ P + A + TA + TS + R, data = train.df)
  step.df <- step(lm.df, direction = "backward")
  df.pred <- predict(object = step.df, newdata = test.df)
  df$prediction[df$ID == id] <- df.pred
}
However, I would like to remove two IDs in each iteration of the cross-validation instead of one, so that my test set contains two IDs every time. Does anyone know how I could do this?
If you change == into %in% and unique(df$ID) into split(unique(df$ID), c(1, 1, 2, 2, 3)), it seems to work. Essentially, in each iteration you pass two IDs instead of one, so the test.df set contains those two.
See this:
df$prediction <- NA
for (id in split(unique(df$ID), c(1, 1, 2, 2, 3))) {
  print(id)
  train.df <- df[!df$ID %in% id, ]
  test.df <- df[df$ID %in% id, c("P", "A", "TA", "TS", "R")]
  lm.df <- glm(Y ~ P + A + TA + TS + R, data = train.df)
  step.df <- step(lm.df, direction = "backward", trace = 0)
  df.pred <- predict(object = step.df, newdata = test.df)
  df$prediction[df$ID %in% id] <- df.pred
}
Output:
[1] 4 5
[1] 6 8
[1] 9
I have set trace to zero above so that the loop only prints the IDs passed in each iteration. As you can see, you get two at a time (apart from the last one, obviously). split() splits the vector unique(df$ID) into 2-element pieces, which we can then use within the loop.
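If instead you want every possible pair of IDs to be held out once (rather than a fixed partition into pairs), a sketch along the same lines could use combn(); note that each ID then gets several predictions, which you would have to combine in some way:
id_pairs <- combn(unique(df$ID), 2, simplify = FALSE) # all 10 pairs of the 5 IDs
for (ids in id_pairs) {
  train.df <- df[!df$ID %in% ids, ]
  test.df <- df[df$ID %in% ids, c("P", "A", "TA", "TS", "R")]
  step.df <- step(glm(Y ~ P + A + TA + TS + R, data = train.df),
                  direction = "backward", trace = 0)
  df.pred <- predict(step.df, newdata = test.df)
}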
