How can I use aggregate() to group and impute data in R?

I need to impute data by grouping across categories and then replacing missing values with the 75th percentile.
I found the aggregate function, which lets me do this:
GRPAVG <- aggregate(INCWAGE ~ AGE + RACE, data = lps1, mean)
This works beautifully for the mean. However, I was unable to get quantile to work here. How can I aggregate on the 75th percentile? That is, I want to group by age and race and find the 75th percentile of the data in each cross-group.
And furthermore, is there a way to replace missing values of a different variable with the output of aggregate?

aggregate has the argument FUN (as you know). If the function passed to this argument takes additional arguments, you pass them along through .... Since you are calculating quantiles, you will want to pass in the probs argument.
data("ChickWeight")
head(ChickWeight)
aggregate(weight ~ Chick + Diet, data = ChickWeight,
          FUN = quantile, probs = 0.75)
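To answer the second part of the question: one way (a sketch, assuming the column names INCWAGE, AGE, and RACE and the data frame lps1 from the question) is to merge the group quantiles back onto the data and fill the NAs:
# 75th percentile of INCWAGE per AGE x RACE group (names from the question)
q75 <- aggregate(INCWAGE ~ AGE + RACE, data = lps1,
                 FUN = quantile, probs = 0.75)
names(q75)[3] <- "INCWAGE_q75"
# Merge back and replace missing INCWAGE values with the group quantile
lps1 <- merge(lps1, q75, by = c("AGE", "RACE"), all.x = TRUE)
lps1$INCWAGE <- ifelse(is.na(lps1$INCWAGE), lps1$INCWAGE_q75, lps1$INCWAGE)
lps1$INCWAGE_q75 <- NULL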

Related

Weighted dataset after IPTW using weightit?

I'm trying to get a weighted dataset after IPTW using weightit. Unfortunately, I'm not even sure where to start. Any help would be appreciated.
library(WeightIt)
library(cobalt)
library(survey)
W.out <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde, estimand = "ATT", method = "ps")
bal.tab(W.out)
# pre-weighting dataset
lalonde
# post-weighting dataset??
The weightit() function produces balance weights. In your case, setting method = "ps" produces propensity scores that are transformed into weights. More details on how it produces those weights can be found with ?method_ps. You can extract the weights from your output and store them as a column in a data.frame via data.frame(w = W.out[["weights"]]). The output is a vector of weights whose length equals the number of non-NA rows in your data (lalonde).
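For example, a small sketch:
# Attach the estimated weights to the original data as a new column
lalonde_w <- cbind(lalonde, w = W.out[["weights"]])
head(lalonde_w)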
What you actually mean by "weighted dataset" is ambiguous for two reasons. First, any analysis that uses those weights will typically not produce a new dataset; rather, it will weight each row's contribution to the likelihood. This is substantively different from analyzing a dataset in which each row's values have been multiplied by its weight, and it will produce different results for many models. Second, you are asking for a weighted dataset that has character vectors in its columns. For example, lalonde$race is a character vector, and multiplying 5 * "black" doesn't make much sense.
If you are indeed intent on multiplying every value in every row of your data by the row's respective weight, you will need to convert your race variable to numeric indicators and remove it from your data; then you can apply sweep():
library(dplyr)
df <- lalonde %>%
  mutate(black  = if_else(race == "black",  1, 0),
         hispan = if_else(race == "hispan", 1, 0),
         white  = if_else(race == "white",  1, 0)) %>%
  select(-race)
# MARGIN = 1 multiplies each row by its corresponding weight
sweep(df, MARGIN = 1, W.out[["weights"]], `*`)
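For completeness, a minimal sketch of the more usual workflow, using the survey package loaded above: pass the weights into a design object and fit weighted models, rather than transforming the data itself (the outcome re78 is assumed from the lalonde data):
# Weighted estimation: the weights enter the fit, not the data
d.w <- svydesign(ids = ~1, weights = W.out[["weights"]], data = lalonde)
fit <- svyglm(re78 ~ treat, design = d.w)
summary(fit)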

Aggregate function in R to create time series data

I am working on a dataset of 35 variables. I have derived dummy variables to classify the age of patients into different age groups. Now I want to aggregate the total number of cases and the number of cases in each age group by the date and location variables. The following is the code I have tried; however, I am not getting the sum of the case values in each age group. For example, if there are 10 cases in total, those ten cases should be divided among the age groups, but NAs are appearing instead. In some rows, counts of 1 or 2 cases appear in a few age groups, which is not representative of the total cases.
df_sa2 <- aggregate(cbind(cases = df_sa1$cases, agecat1 = df_sa1$agecat1,
                          agecat2 = df_sa1$agecat2, agecat3 = df_sa1$agecat3,
                          agecat4 = df_sa1$agecat4, agecat5 = df_sa1$agecat5),
                    by = list(Date = df_sa1$date, location = df_sa1$location),
                    FUN = sum)
I have checked the datatypes they are all numeric.
Please suggest what is wrong with the code. Thank you.
Consider the formula style of aggregate, which can read better, and use the data argument to avoid the numerous df_sa1$ qualifiers.
With the formula style, numeric columns are placed to the left of ~ and the categorical grouping variables to the right. Doing so also renders cbind and list unnecessary.
fml <- cases ~ date + location + agecat1 + agecat2 + agecat3 + agecat4 + agecat5
df_sa2 <- aggregate(fml, data=df_sa1, FUN=sum)
# TO ACCOUNT FOR POTENTIAL MISSING VALUES IN df_sa1$cases
df_sa2 <- aggregate(fml, data=df_sa1, FUN=function(x) sum(x, na.rm=TRUE), na.action=na.pass)
If you need individual age category groupings, adjust the formula accordingly:
fml <- cases ~ date + location + agecat1
fml <- cases ~ date + location + agecat2
...
fml <- cases ~ date + location + agecat5
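If the goal is instead the sum of each age-category column per date and location, as the question describes, a sketch using the multi-response form, which keeps cbind on the left-hand side:
fml2 <- cbind(cases, agecat1, agecat2, agecat3, agecat4, agecat5) ~ date + location
# na.rm inside FUN plus na.action = na.pass keeps rows with NAs from being dropped
df_sa3 <- aggregate(fml2, data = df_sa1,
                    FUN = function(x) sum(x, na.rm = TRUE), na.action = na.pass)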

Calculate the average of a subsample with cast in R reshape2

I am attempting to calculate the average of a subsample with the function acast. As the subsample, I want to use data within a percentile range, for which I use quantile within subset. The problem seems to be that the quantiles are calculated before the data are arranged into groups, so the same cutoff values are used for every group. See the example below:
library(reshape2)
library(plyr)
data(airquality)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
## here I calculate the length for each group for the whole sample
acast(aqm, variable + Month ~ . , length, value.var = "value")
## here I calculate the length for the range within the quantiles 0.05 - 0.5
acast(aqm, variable + Month ~ ., length, value.var = "value",
      subset = .(value >= quantile(value, 0.05) & value <= quantile(value, 0.5)))
With the subset I should get half of the observations for each group, but instead I get far fewer than half in some cases and more in others. It seems that the quantiles are calculated on the whole melted dataset, so the function applies the same quantile cutoffs to all groups.
Does anyone have an idea how to get the quantiles calculated for each group? Any help would be appreciated. I know this would be possible with a loop over the categories, but I want to see if there is a way to do it all at once.
Thanks,
Sergio René
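One way to get per-group quantiles with the same plyr/reshape2 toolchain (a sketch, not from an original answer) is to do the filtering within each group first with ddply(), then cast the already-filtered result:
# Subset each variable/Month group by its own quantiles, then count
aqm_sub <- ddply(aqm, .(variable, Month), function(d)
  subset(d, value >= quantile(value, 0.05) & value <= quantile(value, 0.5)))
acast(aqm_sub, variable + Month ~ ., length, value.var = "value")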

Linear regression on subsets with dependent variable per column using dlply() in R

I would like to automatically produce linear regressions for a data frame, for each category separately.
My data frame includes one column with time categories, one column (slope$Abs) as the dependent variable, and several columns which should be used as independent variables.
head(slope)
timepoint Abs In1 In2 In3 Out1 Out2 Out3 ...
1: t0 275.0 2.169214 2.169214 2.169214 2.069684 2.069684 2.069684
2: t0 275.5 2.163937 2.163937 2.163937 2.063853 2.063853 2.063853
3: t0 276.0 2.153298 2.158632 2.153298 2.052088 2.052088 2.057988
4: ...
All in all, for each timepoint I have 40 variables, and I want to end up with a linear regression for each combination, such as In1 ~ Abs[t0], In1 ~ Abs[t1], and so on for each column.
Of course I can do this manually, but I guess there must be a more elegant way to do the work.
I did my research and found out that dlply() might be the function I'm looking for. However, my attempt results in an error.
So I somehow tried to combine the answers from previous questions I have found:
On individual variables per column and on subsets per category
I came up with a function like this:
lm.fun <- function(x) {summary(lm(x ~ slope$Abs, data=slope))}
lm.list <- dlply(.data=slope, .variables=slope$timepoint, .fun=lm.fun )
But I get the following error:
Error in eval.quoted(.variables, data) :
envir must be either NULL, a list, or an environment.
Hope someone can help me out.
Thanks a lot in advance!
The dplyr package in R does not do well at accepting formulas in the form y ~ x into its functions, based on my research. So the alternative is to calculate it somewhat manually. Now let me first inform you that slope = cor(x, y) * sd(y) / sd(x) (reference: http://faculty.cas.usf.edu/mbrannick/regression/regbas.html) and that intercept = mean(y) - slope * mean(x). Simple linear regression requires that we use the centroid as our point of reference when finding the intercept, because it is an unbiased estimator. Using a single point will only get you the intercept through that individual point, not the overall intercept.
Now for this explanation, I will be using the mtcars data set. I only wanted a subset of the data so I am using variables c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec') to basically mimic your dataset. In my example, my grouping variable is 'cyl', which is the equivalent of your 'timepoint' variable. The variable 'mpg' is the y-variable in this case, which is equivalent to 'Abs' in your data.
Based on my explanation of slope and intercept above, it is clear that we need three tables/datasets: a correlation dataset for your y with respect to your x for each group, a standard deviation table for each variable and group, and a table of means for each group and each variable.
To get the correlation dataset, we want to group by 'cyl' and calculate the correlation coefficients of y with each x for each group:
library(dplyr)
df <- mtcars[c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec')]
corrs <- data.frame(df %>% group_by(cyl) %>%
                      do(head(data.frame(cor(.[, c(1, 3:7)])), n = 1)))
Because of the way my dataset is structured, the second variable (df[, 2]) is 'cyl'. For you, you should use
do(head(data.frame(cor(.[, c(2:40)])), n = 1))
since your first column is the grouping variable and it is not numeric. Essentially, you want to go across all numeric variables. Not using head will produce a full correlation matrix, but since you are interested in finding each slope independently of the other x-variables, you only need the row whose correlation coefficient with your y-variable equals 1 (r_yy = 1).
To get the standard deviations and means for each group and each variable, use:
sds <- data.frame(df %>% group_by(cyl) %>% summarise_each(funs(sd)))
means <- data.frame(df %>% group_by(cyl) %>% summarise_each(funs(mean)))
Your group names will be in the first column, so make sure to rename the rows of each dataset (corrs, sds, and means) and then delete column 1.
rownames(corrs) <- rownames(means) <- rownames(sds) <- corrs[ ,1]
corrs <- corrs[ ,-1]; sds <- sds[ ,-1]; means <- means[ ,-1]
Now we need to calculate sd(y)/sd(x). The best way I have done this, and seen it done, is with an apply-family function.
sdst <- data.frame(t(apply(sds, 1, function(X) X[1]/X)))
I use X[1] because the first variable in sds is my y-variable. The first variable after you have deleted timepoint is Abs, which is your y-variable, so use that.
Now the rest is pretty straightforward. Since everything is saved as a data frame, all you need to do to find the slopes and intercepts is
slopes <- sdst*corrs
inter <- slopes*means
intercept <- data.frame(t(apply(inter, 1, function(x) x[1]-x)))
Again, since our y-variable is in the first column, we use x[1]. To check that all is well: the slope of your y-variable on itself should be 1 and its intercept should be 0.
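As a quick check (a hypothetical example, using the objects built above), the manual arithmetic for one group should agree with lm() on the same subset:
# Compare the manual slope/intercept for the cyl == 4 group against lm()
coef(lm(mpg ~ wt, data = subset(df, cyl == 4)))
c(intercept = intercept["4", "wt"], slope = slopes["4", "wt"])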
I have solved the issue with a simpler approach, so I wanted to update the answer.
To make life easier, I converted the data frame structure so that all columns are turned into rows with the melt() function of the reshape package.
slope <- melt(slope, id = c("Abs", "timepoint"), variable_name = "Sites")
The output's column name is by default "value".
Then create one column that combines both predictors with paste().
slope$FullTreat <- paste(slope$Sites,slope$timepoint, sep="_")
Run a function through the dataset to create separate models for each treatment combination.
models <- dlply(slope, ~ FullTreat, function(df) {
  lm(value ~ Abs, data = df)
})
To extract the coefficents simply run
coefs <- ldply(models, coef)
Then split the FullTreat column into separate columns again with colsplit(), also from reshape, and add the intercept and slope to the new data frame:
coefs <- cbind(colsplit(coefs$FullTreat, split = "_",
                        names = c("Sites", "Timepoint")),
               coefs[, 2:3])
I haven't worked on a function that plots all the regressions from the models, but I guess this is feasible with the ldply() function.
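For the plotting, one possibility (a sketch using ggplot2 rather than ldply(), assuming the melted data and FullTreat column from above) is to let geom_smooth() refit each regression per facet:
library(ggplot2)
ggplot(slope, aes(x = Abs, y = value)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ FullTreat)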

How to use weighted gini function within aggregate function?

I am trying to calculate the Gini coefficient with sample weights for different groups in my data. I prefer to use aggregate because I later use its output to plot the coefficients. I found alternative ways to do it, but in those cases the output wasn't exactly what I needed.
library(reldist)  # to get the gini function
dat <- data.frame(country = rep(LETTERS, each = 10)[1:50],
                  replicate(3, sample(11, 10)),
                  year = sample(1990:1994, 50, TRUE),
                  wght = sample(1:5, 50, TRUE))
dat[51, ] <- c(NA, 11, 2, 6, 1992, 3)  # add one more row with NA for country
gini(dat$X1)  # usual gini for all
gini(dat$X1, weight = dat$wght)  # gini with weight; that's what I actually need
print(a1<-aggregate( X1 ~ country+year, data=dat, FUN=gini))
#Works perfectly fine without weight.
But now, how can I specify the weight option within aggregate? I know there are other ways (as shown here):
print(b1 <- by(dat, list(dat$country, dat$year),
               function(x) with(x, gini(x$X1, x$wght)))[])
# by() works with the weight, but now the output has NAs in it
print(s1 <- sapply(split(dat, dat$country), function(x) gini(x$X1, x$wght)))
# This seems a good alternative, but I couldn't find a way to split by two variables
library(plyr)
print(p1 <- ddply(dat, .(country, year), summarise, value = gini(X1, wght)))
# Yet another alternative, but the output includes NAs for the missing country
If someone could show me way to use weighted gini function within aggregate that would be very helpful, as it produces the output exactly in the way I need. Otherwise, I guess I will work with one of the alternatives.
# using aggregate
aggregate(X1 ~ country + year, data = dat, FUN = gini, weights = dat$wght)
# gives a different answer than data.table and dplyr: aggregate subsets X1 by
# group, but the full dat$wght vector is passed to gini() for every group, so
# the weights do not line up with each group's values
# using data.table
library(data.table)
DT <- data.table(dat)
DT[, list(mygini = gini(X1, wght)), by = .(country, year)]
# using dplyr
library(dplyr)
dat %>%
  group_by(country, year) %>%
  summarise(mygini = gini(X1, wght))
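If aggregate() itself is required, one workaround (a sketch, not from the answers above) is to aggregate row indices, so that each group's values and weights can be subset together inside the function:
# Aggregate row indices so the weights line up with each group's values
a2 <- aggregate(seq_len(nrow(dat)) ~ country + year, data = dat,
                FUN = function(i) gini(dat$X1[i], dat$wght[i]))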
