Calculate the average of a subsample with cast in R reshape2 - r

I am attempting to calculate the average of a subsample with the function acast. As the subsample, I want to use the data within a percentile range, which I select by calling quantile inside the subset argument. The problem seems to be that the quantiles are calculated before the data are arranged by group, so the same cut-off values are used for every group. See the example below:
library(reshape2)
library(plyr)
data(airquality)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
## here I calculate the length for each group for the whole sample
acast(aqm, variable + Month ~ . , length, value.var = "value")
## here I calculate the length for the range within the quantiles 0.05 - 0.5
acast(aqm, variable + Month ~ . , length, value.var = "value", subset = .(value >= quantile(value,c(0.05)) & value <= quantile(value,c(0.5))))
With the subset I should get roughly half of the observations for each group, but instead I get way less than half in some cases and way more in others. It seems to me that the quantiles are calculated on the whole melted data set, so the function applies the same quantile values to all groups.
Does anyone have an idea how to make sure the quantiles are calculated for each group? Any help would be appreciated. I know this would be possible by looping over the categories, but I want to see if there is a way to do it all at once.
Thanks,
Sergio René
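One way to get group-specific cut-offs, sketched here in base R so that it runs without extra packages, is to compute the in-range flag within each variable/Month group with ave() and to filter the data before counting. The names vars, in_range, aqm_sub, n_all and n_sub are made up for this illustration, and the long data frame is built by hand to mimic the melt() call above.

```r
data(airquality)
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Long format, mimicking melt(airquality, id = c("Month", "Day"), na.rm = TRUE)
aqm <- do.call(rbind, lapply(vars, function(v) {
  data.frame(Month = airquality$Month, Day = airquality$Day,
             variable = v, value = airquality[[v]])
}))
aqm <- aqm[!is.na(aqm$value), ]

# ave() evaluates the function separately within each variable x Month group,
# so every group gets its own 5% and 50% quantiles (result is stored as 0/1)
in_range <- ave(aqm$value, aqm$variable, aqm$Month, FUN = function(v) {
  v >= quantile(v, 0.05) & v <= quantile(v, 0.5)
}) == 1
aqm_sub <- aqm[in_range, ]

# Group sizes before and after the per-group filter
n_all <- tapply(aqm$value, list(aqm$variable, aqm$Month), length)
n_sub <- tapply(aqm_sub$value, list(aqm_sub$variable, aqm_sub$Month), length)
round(n_sub / n_all, 2)
```

The filtered aqm_sub could then presumably be passed to acast() without the subset argument, since the selection has already been made per group.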

Related

Weighted dataset after IPTW using weightit?

I'm trying to get a weighted dataset after IPTW using weightit. Unfortunately, I'm not even sure where to start. Any help would be appreciated.
library(WeightIt)
library(cobalt)
library(survey)
W.out <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde, estimand = "ATT", method = "ps")
bal.tab(W.out)
# pre-weighting dataset
lalonde
# post-weighting dataset??
The weightit() function produces balance weights. In your case, setting method = "ps" will produce propensity scores that are transformed into weights. More details of how it produces those weights can be found with ?method_ps. You can extract the weights from your output and store them as a column in a data.frame via: data.frame(w = W.out[["weights"]]). The output is a vector of weights with a length equal to the number of non-NA rows in your data (lalonde).
What you mean by "weighted dataset" is ambiguous for two reasons. First, analyses that use those weights typically do not produce a new dataset; rather, they weight the contribution of each row to the likelihood. This is substantively different from analyzing a dataset in which each row's values have been multiplied by its weight, and it will produce different results for many models. Second, you are asking how to get a weighted dataset that has non-numeric columns. For example, lalonde$race is categorical, and multiplying 5*"black" doesn't make much sense.
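To illustrate the first point, here is a small sketch on made-up data (d, w, fit_w and fit_m are hypothetical names, and the weights are random numbers, not output from weightit()). It fits the same linear model once with a weights argument and once on row-multiplied data:

```r
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
w <- runif(50, 0.5, 2)   # made-up weights, one per row

# Weighted fit: each row's contribution to the fit is scaled by its weight
fit_w <- lm(y ~ x, data = d, weights = w)

# Fit on data in which every value in a row was multiplied by the row's weight
d_mult <- sweep(d, MARGIN = 1, STATS = w, FUN = `*`)
fit_m <- lm(y ~ x, data = d_mult)

rbind(weighted = coef(fit_w), multiplied = coef(fit_m))
```

The two rows of coefficients differ, which is why the weights are usually passed to the modelling function rather than applied to the data.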
If you are indeed intent on multiplying every value in every row of your data by the row's respective weight, you will need to convert your race variable to numeric indicators, remove it from your data, then you can apply sweep():
library(dplyr)
df <- lalonde %>%
  mutate(black = if_else(race == "black", 1, 0),
         hispan = if_else(race == "hispan", 1, 0),
         white = if_else(race == "white", 1, 0)) %>%
  select(-race)
# MARGIN = 1 multiplies each row by its own weight
sweep(df, MARGIN = 1, W.out[["weights"]], `*`)

How to reverse the group comparison order for t.test()?

I would like to conduct a two-sample independent t-test and my grouping variable, "group" is factored as '0' and '1'. When calling the t.test(), it calculates the difference as mean for group 0 - mean for group 1, giving me a negative difference and negative confidence intervals.
How can I reverse the order of comparison so that R estimates mean for group 1 - mean for group 0? I know it compares alphabetically, so I assume it compares numbers in increasing value. I already tried fct_rev() from the forcats package to set group '1' as the reference group, but that did not change the order in which t.test compared and still gave negative outputs. This is important as it is a 1 sided test so I have expectations on the sign of the difference.
Thanks!
EDIT: I'd like to keep the names as '0' and '1' because they need to be coded as such for subsequent analyses to work
You can use the levels argument when constructing the factor. It is possible to reverse the levels (see code below), and it is also possible to specify an arbitrary order. The labels stay '0' and '1'; only their order changes:
# two samples with different mean
x <- c(rnorm(10, 5, 1), rnorm(10, 8, 1))
# the grouping variable
f1 <- factor(rep(0:1, each=10))
# the grouping variable with reverse order of levels
f2 <- factor(f1, levels=rev(levels(f1)))
t.test(x ~ f1)
t.test(x ~ f2)
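As a quick check of the approach above, the following sketch (with a fixed seed; diff1, diff2 and t_one are names introduced only for this illustration) confirms that reversing the levels flips the sign of the estimated difference and shows how a one-sided alternative could then be specified:

```r
set.seed(123)
x <- c(rnorm(10, 5, 1), rnorm(10, 8, 1))
f1 <- factor(rep(0:1, each = 10))
f2 <- factor(f1, levels = rev(levels(f1)))   # labels are still "0" and "1"

t1 <- t.test(x ~ f1)   # compares group 0 with group 1
t2 <- t.test(x ~ f2)   # compares group 1 with group 0

# Difference of the two group means in the order used by each test
diff1 <- unname(t1$estimate[1] - t1$estimate[2])
diff2 <- unname(t2$estimate[1] - t2$estimate[2])
c(diff1 = diff1, diff2 = diff2)

# One-sided test of mean(group 1) > mean(group 0)
t_one <- t.test(x ~ f2, alternative = "greater")
t_one$p.value
```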
I cannot comment yet, so I am posting this as an answer: could you not simply recode your variable and rerun the test so that the groups are compared in the order you want?
e.g.:
mydata <- read.csv("yourdataset.csv", na.strings = "")
# make a new "recoded group" variable (as character, so new labels can be assigned)
mydata$group.recoded <- as.character(mydata$group)
# relabel group "0" as "2" so that it sorts after group "1"
mydata$group.recoded[mydata$group.recoded == "0"] <- "2"

R Winsorizing with specific cut-off values does not work

I want to winsorize my data using the mean plus (/minus) 2 standard deviations of the variable as cut-off points. I thus want to winsorize every variable one by one.
The variable I want to winsorize in the example code I provided below has 5 outliers.
I have created two benchmarks (high and low) and have inserted them in minval and maxval.
Just to prevent misunderstandings: I have several timepoints and groups in my data frame, the grepl-part is to just get one group at one measurement point for the winsorizing.
My code so far:
library(DescTools)
benchhigh <- mean(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm=TRUE) +
2*sd(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm=TRUE)
benchlow <- mean(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm=TRUE) -
2*sd(ds$RRS[grepl('^34.*', ds$QUESTNNR)], na.rm=TRUE)
ds$RRSout <- Winsorize( ds$RRS[ grepl('^34.*', ds$QUESTNNR) ],
minval = benchlow , maxval = benchhigh, na.rm = TRUE)
The error I get is:
"Error in $<-.data.frame(*tmp*, RRSout, value = c(1, 1.33333333333333, :
replacement has 38 rows, data has 510"
My replacement only has 38 rows because the ^34.* group has only 38 participants. I have to winsorize the outliers per group and measurement point though...
How can I replace/winsorize the outliers of the specific participant group in the RRS variable?
Thank you very much in advance!
Your input to Winsorize() is restricted to certain observations (grepl('^34.*', ds$QUESTNNR)), so its result has only as many elements as that group. You can only assign the result to the same number of (and ideally the exact same) rows:
ds$RRSout[ grepl('^34.*', ds$QUESTNNR) ] <-
Winsorize( ds$RRS[ grepl('^34.*', ds$QUESTNNR) ],
minval = benchlow , maxval = benchhigh, na.rm = TRUE)
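The same indexing idea can be tried without DescTools. The sketch below uses a made-up data frame ds and replaces Winsorize() with a plain pmin()/pmax() clamp, which is a stand-in and not the DescTools implementation; idx, lo and hi are names chosen for this example only.

```r
set.seed(42)
# Made-up data: 38 participants in the "34..." group, 62 in another group
ds <- data.frame(
  QUESTNNR = rep(c("34a", "35a"), times = c(38, 62)),
  RRS = c(rnorm(36, 2, 0.5), 9, -5, rnorm(62, 3, 0.5))
)

idx <- grepl("^34", ds$QUESTNNR)          # rows of the group to be winsorized
m <- mean(ds$RRS[idx], na.rm = TRUE)
s <- sd(ds$RRS[idx], na.rm = TRUE)
lo <- m - 2 * s
hi <- m + 2 * s

# Start from the original values so the other groups stay unchanged,
# then overwrite only the selected rows with the clamped values
ds$RRSout <- ds$RRS
ds$RRSout[idx] <- pmin(pmax(ds$RRS[idx], lo), hi)

range(ds$RRS[idx])
range(ds$RRSout[idx])
```

Initialising RRSout with the original values is a design choice here; assigning into a column that does not exist yet would leave NA in the rows of all other groups.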

How can I use Aggregate to Group and Impute Data?

I need to impute data by grouping across categories and then replacing missing values with the 75th percentile.
I found the aggregate function, which let me do this:
GRPAVG = aggregate(INCWAGE ~ AGE + RCE, data = lps1, mean)
Which works beautifully for mean. However, I was unable to get quantile to work here, how can I aggregate across the 75th percentile? IE, I want to group by Age and Race and find the 75th percentile of data in that cross-group.
And furthermore, is there a way to replace missing values of a different variable with the output of aggregate?
aggregate has the argument FUN (as you know). If the function passed to this argument takes further arguments, you pass them along through the ... argument. If you're calculating quantiles, you will want to pass in the probs argument.
data("ChickWeight")
head(ChickWeight)
aggregate(weight ~ Chick + Diet, data = ChickWeight,
FUN = quantile, probs = 0.75)
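The second part of the question (filling missing values with the group statistic) is not covered above. One possible sketch, on a small made-up data frame that borrows the column names from the question, uses ave() to compute each row's group quantile and ifelse() to fill only the missing entries; d, key, grp_q75 and INCWAGE_imp are hypothetical names.

```r
# Made-up data with the column names from the question
d <- data.frame(
  AGE = c(30, 30, 30, 40, 40, 40),
  RCE = c("a", "a", "a", "b", "b", "b"),
  INCWAGE = c(10, 20, NA, 100, NA, 300)
)

# Group-level table of 75th percentiles (rows with NA are dropped by default)
q75 <- aggregate(INCWAGE ~ AGE + RCE, data = d, FUN = quantile, probs = 0.75)

# One grouping key per row; drop = TRUE avoids empty AGE x RCE combinations
key <- interaction(d$AGE, d$RCE, drop = TRUE)

# ave() returns, for every row, the 75th percentile of that row's group
grp_q75 <- ave(d$INCWAGE, key,
               FUN = function(v) quantile(v, 0.75, na.rm = TRUE))

# Fill only the missing values
d$INCWAGE_imp <- ifelse(is.na(d$INCWAGE), grp_q75, d$INCWAGE)
d
```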

Linear regression on subsets with dependent variable per column using dlply() in R

I would like to automatically produce linear regressions for a data frame for each category separately.
My data frame includes one column with time categories, one column (slope$Abs) as the dependent variable, several columns, which should be used as the independent variable.
head(slope)
timepoint Abs In1 In2 In3 Out1 Out2 Out3 ...
1: t0 275.0 2.169214 2.169214 2.169214 2.069684 2.069684 2.069684
2: t0 275.5 2.163937 2.163937 2.163937 2.063853 2.063853 2.063853
3: t0 276.0 2.153298 2.158632 2.153298 2.052088 2.052088 2.057988
4: ...
All in all for each timepoint I have 40 variables, and I want to end up with a linear regression for each combination. Such as In1~Abs[t0], In1~Abs[t1] and so on for each column.
Of course I can do this manually, but I guess there must be a more elegant way to do the work.
I did my research and found out that dlply() might be the function I'm looking for. However, my attempt results in an error.
So I tried to combine the answers from two previous questions I found, one on individual variables per column and one on subsets per category.
I came up with a function like this:
lm.fun <- function(x) {summary(lm(x ~ slope$Abs, data=slope))}
lm.list <- dlply(.data=slope, .variables=slope$timepoint, .fun=lm.fun )
But I get the following error:
Error in eval.quoted(.variables, data) :
envir must be either NULL, a list, or an environment.
Hope someone can help me out.
Thanks a lot in advance!
Based on my research, the dplyr package in R does not handle formulas of the form y~x well inside its functions, so the alternative is to calculate the regression somewhat manually. First, note that slope = cor(x,y)*sd(y)/sd(x) (reference found here: http://faculty.cas.usf.edu/mbrannick/regression/regbas.html) and that intercept = mean(y) - slope*mean(x). The intercept formula uses the centroid (mean(x), mean(y)) as the point of reference because the least-squares line always passes through it. Using any other single point would only give you the intercept of a line through that individual point, not the overall intercept.
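As a sanity check of these two formulas, the following sketch compares them with lm() for a single pair of mtcars variables (the choice of wt and mpg is arbitrary, and slope_hat and intercept_hat are names used only here):

```r
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form simple-regression coefficients
slope_hat <- cor(x, y) * sd(y) / sd(x)
intercept_hat <- mean(y) - slope_hat * mean(x)

# Reference fit
fit <- lm(mpg ~ wt, data = mtcars)

rbind(formula = c(intercept_hat, slope_hat),
      lm = unname(coef(fit)))
```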
Now for this explanation, I will be using the mtcars data set. I only wanted a subset of the data so I am using variables c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec') to basically mimic your dataset. In my example, my grouping variable is 'cyl', which is the equivalent of your 'timepoint' variable. The variable 'mpg' is the y-variable in this case, which is equivalent to 'Abs' in your data.
Based on my explanation of slope and intercept above, it is clear that we need three tables/datasets: a correlation dataset for your y with respect to your x for each group, a standard deviation table for each variable and group, and a table of means for each group and each variable.
To get the correlation dataset, group by 'cyl' and calculate the correlation coefficients of the y-variable with every other numeric variable:
library(dplyr)
df <- mtcars[c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec')]
corrs <- data.frame(df %>% group_by(cyl) %>% do(head(data.frame(cor(.[,c(1,3:7)])), n = 1)))
Because of the way my dataset is structured, the second variable (df[ ,2]) is 'cyl'. For you, you should use
do(head(data.frame(cor(.[,c(2:40)])), n = 1)))
since your first column is the grouping variable and it is not numeric. Essentially, you want to go across all numeric variables. Not using head will produce a correlation matrix, but since you are interested in finding the slope independent of each other x-variable, you only need the row that has the correlation coefficient of your y-variable equal to 1 (r_yy = 1).
To get standard deviation and means for each group, each variable, use
sds <- data.frame(df %>% group_by(cyl) %>% summarise_all(sd))
means <- data.frame(df %>% group_by(cyl) %>% summarise_all(mean))
Your group names will be the first column, so make sure to rename your rows for each dataset corrs, sds, and means and delete column 1.
rownames(corrs) <- rownames(means) <- rownames(sds) <- corrs[ ,1]
corrs <- corrs[ ,-1]; sds <- sds[ ,-1]; means <- means[ ,-1]
Now we need to calculate sd(y)/sd(x). The best way I have found (and seen) to do this is with a function from the apply family.
sdst <- data.frame(t(apply(sds, 1, function(X) X[1]/X)))
I use X[1] because the first variable in sds is my y-variable. The first variable after you have deleted timepoint is Abs which is your y-variable. So use that.
Now the rest is pretty straightforward. Since everything is saved as a data frame, all you need to do to find the slopes and intercepts is
slopes <- sdst*corrs
inter <- slopes*means
intercept <- data.frame(t(apply(inter, 1, function(x) x[1]-x)))
Again here, since our y-variable is in the first column, we use x[1]. To check if all is well, your slopes for your y-variable should be 1 and the intercept should be 0.
I have solved the issue with a simpler approach, so I wanted to update the answer.
To make life easier I converted the data frame structure so that all columns are converted into rows with the melt() function of the reshape package.
slope <- melt(slope, id = c("Abs", "timepoint"), variable_name = "Sites")
The output's column name is by default "value".
Then create one column that combines both grouping variables with paste().
slope$FullTreat <- paste(slope$Sites,slope$timepoint, sep="_")
Run a function through the dataset to create separate models for each treatment combination.
models <- dlply(slope, ~ FullTreat, function(df) {
lm(value ~ Abs, data = df)
})
To extract the coefficients simply run
coefs <- ldply(models, coef)
Then split the FullTreat column into separate columns again with colsplit() also from reshape. Plus, add the Intercept and slope to the new data frame:
coefs <- cbind(colsplit(coefs$FullTreat, split="_",
c("Sites","Timepoint")), coefs[,2:3])
I haven't worked on a function that plots all the regressions from the models, but I guess this is feasible with the ldply() function.
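For readers without reshape and plyr, the same workflow can be sketched in base R. The data below are made up to resemble the slope data frame (two timepoints, two measured columns), and stack(), split() and lapply() stand in for melt(), dlply() and ldply(); long, groups and coefs are names chosen for this illustration.

```r
set.seed(3)
# Made-up stand-in for the `slope` data
slope <- data.frame(
  timepoint = rep(c("t0", "t1"), each = 20),
  Abs = rep(seq(275, 284.5, by = 0.5), times = 2)
)
slope$In1 <- 0.01 * slope$Abs + rnorm(40, sd = 0.01)
slope$Out1 <- -0.02 * slope$Abs + rnorm(40, sd = 0.01)

# Long format: one row per timepoint, Abs and measured column
long <- cbind(slope[c("timepoint", "Abs")], stack(slope[c("In1", "Out1")]))
names(long)[3:4] <- c("value", "Sites")

# One data frame per Sites x timepoint combination
groups <- split(long, list(long$Sites, long$timepoint), drop = TRUE)

# Fit value ~ Abs in every group and collect intercepts and slopes
coefs <- do.call(rbind, lapply(groups, function(df) {
  coef(lm(value ~ Abs, data = df))
}))
coefs
```

Each row of coefs is named after its group (for example In1.t0), so the group labels do not have to be split apart again afterwards.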
