I am running the following imputation task in R as a for loop:
myData <- essuk[c(2,3,4,5,6,12)]
myDataImp <- matrix(0,dim(myData)[1],dim(myData)[2])
lower <- c(0)
upper <- c(Inf)
for (k in c(1:5))
{
gmm.fit1 <- gmm.tmvnorm(matrix(myData[,k],length(myData[,k]),1), lower=lower, upper=upper)
useMu <- matrix(gmm.fit1$coefficients[1],1,1)
useSigma <- matrix(gmm.fit1$coefficients[2],1,1)
replaceThese <- myData[,k]<=0
myDataImp[,k] <- myData[,k]
myDataImp[replaceThese,k] <- rtmvnorm(n=sum(replaceThese), c(useMu), c(useSigma), c(-Inf), c(0))
}
The steps are pretty straightforward:
1. Define the data set and an empty imputation data set.
2. For each of columns 1-5, fit a model.
3. Extract the model estimates to be used for imputation.
4. Draw new values using the model estimates and use them to replace values <= 0 in the imputation data set.
However, I want to do this separately for multiple groups, rather than for the full sample. Column 12 in the data set contains information on group membership (integers ranging from 1-72).
I have tried several options, including splitting the data frame with data_list <- split(myData, myData$V12) and using the lapply() function. However, this does not work because of how the model estimates are formatted:
Error in as.data.frame.default(data) :
cannot coerce class "gmm" to a data.frame
I have also thought about the possibility of doing a nested for loop, although I am not sure how that could be accomplished. Any suggestions are much appreciated.
What about using subset()?
myData$V12 <- as.factor(myData$V12)
listofresults <- list()  # a list keeps each group's result as a separate element
for (i in levels(myData$V12)) {
  data <- subset(myData, V12 == i)
  # your analysis here, run on `data`: result saved in myDataImp
  listofresults[[i]] <- myDataImp
}
Not the most elegant, but it should work.
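As an alternative to the explicit loop, the split-then-lapply idea from the question can be sketched as below. This is only a minimal illustration under stated assumptions: the data frame `toy` holds made-up values, and `impute_group()` is a hypothetical stand-in that fills values <= 0 with the group mean of the positive values. It is not the gmm.tmvnorm/rtmvnorm model from the question; the point is only that the per-group function should return a data frame rather than the fitted model object.

```r
# Toy data: two measurement columns plus a group column (made-up values).
toy <- data.frame(
  x   = c(1.2, -0.5, 3.1, 0.0, 2.4, 4.0),
  y   = c(0.7, 2.2, -1.0, 1.5, -0.3, 0.9),
  V12 = c(1, 1, 1, 2, 2, 2)
)

# Hypothetical stand-in for the model-based step: within one group, replace
# values <= 0 in each measurement column with the mean of the positive values.
impute_group <- function(d, cols) {
  for (k in cols) {
    replaceThese <- d[[k]] <= 0
    d[replaceThese, k] <- mean(d[!replaceThese, k])
  }
  d  # return a data frame, not the fitted model object
}

# Split by group, impute within each group, then restore the original row order.
pieces  <- split(toy, toy$V12)
imputed <- unsplit(lapply(pieces, impute_group, cols = c("x", "y")), toy$V12)
```

To use the real model, the body of `impute_group()` would be swapped for the fit-and-draw steps from the original loop, applied to the rows of one group at a time.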
I'm completely new to programming and R, but have a dataset that can only be analyzed with a more powerful statistics program such as R.
I have a large but simple dataset consisting of thousands of different groups with multiple samples each, and I want to compare every group against the control group with a Mann-Whitney U test. The data structure is shown below.
Group, Measurements
a 0.14534
cont 0.42574
d 0.36347
c 0.14284
a 0.23593
d 0.36347
cont 0.33514
cont 0.29210
b 0.36345
...
The problem is that the test requires exactly two groups to be designated, so it does not work when the data contain more than two groups.
This is what I have so far. As you can see, it does not run in a repeated fashion and only works if I have two groups in my input file.
data1 = read.csv(file.choose(), header=TRUE, stringsAsFactors=FALSE)
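# measurement is the response and group the two-level grouping factor;
# paired=FALSE is the default and is omitted here
testoutput <- wilcox.test(measurement ~ group, data=data1, mu=0, alt="two.sided", conf.int=TRUE, conf.level=0.95, exact=FALSE, correct=TRUE)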
write.table(testoutput$p.value, file="mwUtest.tsv", sep="\t")
How do I write and loop the test properly so that it tests all my groups against my designated control group? I assume the sapply or lapply functions are used around wilcox.test, but I don't know how.
I'm sorry if this simple question has been brought up before, but I could not find any previous question regarding this specific problem.
In R, there are often many solutions to the same problem. Here's how I would solve this.
First, I would split my data and have one dataframe with experiments and one with controls:
experiments <- dat[dat$group!="cont",]
controls <- dat[dat$group=="cont",]
Then I would split my experimental data by group, and feed that to my test together with my control measurements. Note that this construction makes it easy to extract more values from the test: just return a (named) vector.
result <- lapply(split(experiments, experiments$group),function(x){
mytest = wilcox.test(x$measurement,controls$measurement,mu=0, alt="two.sided", conf.int=TRUE, conf.level=0.95, paired=FALSE, exact=FALSE, correct=TRUE)
return(mytest$p.value)
})
Combining to a table is then easy:
output <- do.call(rbind,result)
Data used:
set.seed(123)
nobs=100
dat <- data.frame(group=sample(c(LETTERS[1:6],"cont"),nobs,T),
measurement=runif(nobs),stringsAsFactors=F)
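The remark about returning a named vector can be sketched as follows, reusing the simulated data above. The choice of which values to return (test statistic, p-value, location estimate, group size) and their names are illustrative assumptions, not part of the original answer.

```r
# Same simulated data as above.
set.seed(123)
nobs <- 100
dat <- data.frame(group = sample(c(LETTERS[1:6], "cont"), nobs, TRUE),
                  measurement = runif(nobs), stringsAsFactors = FALSE)
experiments <- dat[dat$group != "cont", ]
controls    <- dat[dat$group == "cont", ]

# One test per experimental group, each returning a named vector.
result <- lapply(split(experiments, experiments$group), function(x) {
  mytest <- wilcox.test(x$measurement, controls$measurement,
                        alternative = "two.sided", conf.int = TRUE,
                        exact = FALSE, correct = TRUE)
  c(W        = unname(mytest$statistic),
    p.value  = mytest$p.value,
    estimate = unname(mytest$estimate),  # available because conf.int = TRUE
    n        = nrow(x))
})

# One row per group, one column per returned value.
output <- do.call(rbind, result)
```

Because every element of `result` has the same names, `do.call(rbind, result)` yields a matrix with the group labels as row names, which can be passed to write.table() directly.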
I am looking for a robust way to partition a dataset without using the sample() function, and would appreciate some feedback.
Ideally, I'd like to get rid of the randomness inherent in the use of sample().
samp<-data.frame(qldat) # convert zoo time-series object to data.frame
ind <- sample(2,nrow(samp),replace = TRUE, prob=c(0.8,0.2)) # splitting
#data series between training and test sets
tsamp<- samp[ind==1,] # training dataset
vsamp<- samp[ind==2,] # test set
After some research, I figured out that subset() could help, but it would involve a bit of hard-coding. By hard-coding I mean that, for an 80:20 split using nrow(samp), it is possible to subset the data from row 1 to row 0.8 * nrow(samp), for instance, although that might not be a very efficient solution.
I've also tried createDataPartition(), but it did not meet my expectations, since samp does not hold any categorical data that I could rely on for the split (e.g. createDataPartition(y=samp$categoricaldata, p=0.8, list=FALSE)).
PS: What I like about the ind <- line is the inclusion of prob=c(0.8,0.2), so the proportions of the split are handled automatically. Any similar idea that produces tsamp and vsamp without random splitting would be much appreciated.
Best,
Is this what you are looking for?
n <- nrow(samp)
train_i <- 1:round(0.8*n)
test_i <- round(0.8*n+1):n
train <- samp[train_i,]
test <- samp[test_i,]
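The same idea can be wrapped in a small helper so that the proportion is a parameter, much like prob=c(0.8,0.2) in the question. This is a sketch only: `split_in_order()` and the toy `samp` are hypothetical names and values introduced for illustration, and seq_len() is used so that an empty training or test set does not produce a reversed index range.

```r
# Deterministic split: the first `p` share of rows go to training, the rest to test.
split_in_order <- function(d, p = 0.8) {
  n <- nrow(d)
  n_train <- round(p * n)
  train_i <- seq_len(n_train)
  test_i  <- setdiff(seq_len(n), train_i)
  list(train = d[train_i, , drop = FALSE],
       test  = d[test_i, , drop = FALSE])
}

# Toy stand-in for the time series converted to a data frame.
samp  <- data.frame(t = 1:10, value = (1:10)^2)
parts <- split_in_order(samp, p = 0.8)
tsamp <- parts$train  # training dataset: rows 1 to 8
vsamp <- parts$test   # test set: rows 9 to 10
```

For time-ordered data this keeps the test set strictly after the training set, which is usually what is wanted when the random property of sample() is the concern.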
I have a data set "base_data" which has missing values. I have therefore used the package 'Amelia' to impute the missing values into an object "a.output".
I have been able to find the mean for some variables within the imputed results using the following code:
q.out<-NULL
se.out<-NULL
for(i in 1:m) {
  # the completed data sets are stored in the imputations element of the amelia output
  dclus <- svydesign(id=~site, data=a.output$imputations[[i]])
  q.out <- rbind(q.out, coef(svymean(~hh_expenditure, dclus)))
  se.out <- rbind(se.out, SE(svymean(~hh_expenditure, dclus)))
}
I have combined the results using:
svymean.combine <- mi.meld(q = q.out, se = se.out)
This gives me the mean and standard error for household expenditure (hh_expenditure) across the population.
However, I have a variable that splits the population into wealth quintiles (wealth_quin).
I now want to find the mean and standard error of household expenditure per wealth_quin (a variable that takes the value 1, 2, 3, 4 or 5).
I initially tried subsetting the imputed data, but this produced many errors.
Is there a way to do this without having to split the data into the 5 wealth quintiles before imputing?
Cheers,
Timothy
EDIT: HERE IS A WORKABLE EXAMPLE
require(Amelia)
require(survey)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
So now we have a data frame to work with, with missing values for hh_inc that I want to impute.
First step: set the number of imputations.
m<-5
Now run the imputation:
a.output <- amelia(x = impute, m=m, autopri=0.5,cs="X",
idvars=c("site","var2"),
logs=c("hh_inc","var3"))
a.output now holds the data from the 5 imputations.
What I now want to do is find the average (and standard error) of hh_inc for site 1 and site 2 separately, using the imputed values from amelia.
How can I do that? I know it is possible if I just ignore the NAs, but this might introduce bias, which is why I imputed the values in the first place.
Cheers,
Timothy
EDIT:
I have placed a bounty on this. If no one knows the exact way to do it, then the results from the individual imputed data sets can be combined using Rubin's formula (http://sites.stat.psu.edu/~jls/mifaq.html#minf)
As such, I will award the bounty to someone who can transform the 5 separate imputed datasets from the Amelia object into 5 separate, complete data frames.
require(Amelia)
require(survey)
require(data.table)
require(plotrix)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
summary(impute)
m <- 5
a.output <- amelia(x = impute, m=m, autopri=0.5,cs="X",
idvars=c("site","var2"),
logs=c("hh_inc","var3"))
stats.out <- NULL
for(i in 1:m){
df2 <- data.table(a.output$imputations[[i]])
df3 <- data.frame(dataset=i,df2[,list(std.error(hh_inc),mean(hh_inc)), by="site"])
stats.out <- rbind(stats.out, df3)
}
colnames(stats.out) <- c("dataset","site","stdError","mean")
stats.out
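The per-imputation rows in stats.out still need to be combined into one estimate per site. A minimal base R sketch of Rubin's rules is given below. It assumes a table in the same layout as stats.out; `stats.toy` holds made-up numbers for illustration only, and `pool_rubin()` is a hypothetical helper name. The pooled mean is the average of the per-imputation means, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```r
# Illustrative per-imputation results in the same layout as stats.out
# (made-up numbers: m = 3 imputations, 2 sites).
stats.toy <- data.frame(
  dataset  = rep(1:3, each = 2),
  site     = rep(1:2, times = 3),
  stdError = c(1, 2, 1, 2, 1, 2),
  mean     = c(10, 5, 12, 5, 14, 5)
)

# Rubin's rules for one group: pooled estimate and total standard error.
pool_rubin <- function(q, se) {
  m       <- length(q)
  within  <- mean(se^2)  # average within-imputation variance
  between <- var(q)      # between-imputation variance
  c(mean = mean(q), stdError = sqrt(within + (1 + 1/m) * between))
}

# Apply the pooling separately for each site.
pooled <- do.call(rbind, lapply(split(stats.toy, stats.toy$site),
                                function(d) pool_rubin(d$mean, d$stdError)))
```

With the real output, `stats.toy` would simply be replaced by stats.out; the mi.meld() call shown earlier in the question performs the same kind of combination on matrices of estimates and standard errors.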
I'm not sure I understand your question or the structure of your data (specifically, why it matters whether the data is imputed or not), but here's how I've done some summary stats by group.
require(data.table)
require(plotrix)
# create some data
df1 <- data.frame(id=seq(1,50,1), wealth = runif(50)*1000)
df1$cutter <- cut(df1$wealth, 5, labels=FALSE)
head(df1)
# put the data into a data.table to speed things up
df2 <- as.data.table(df1)
head(df2)
grp1StdErr <- df2[,std.error(wealth), by="cutter"]
grp1Mean <- df2[,mean(wealth), by="cutter"]
Hope this helps.
Or, in one grouping step:
df2[,list(std.error(wealth),mean(wealth)), by=cut(wealth,5,labels=FALSE)]
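For readers without data.table or plotrix installed, a rough base R equivalent of the grouped summary might look like this. It is a sketch under the assumption that the standard error of the mean is sd(x) / sqrt(n) with no missing values; `se()` is a hypothetical helper defined here, not the plotrix function.

```r
# Same kind of toy data as above.
set.seed(1)
df1 <- data.frame(id = 1:50, wealth = runif(50) * 1000)
df1$cutter <- cut(df1$wealth, 5, labels = FALSE)

# Standard error of the mean, assuming no missing values.
se <- function(x) sd(x) / sqrt(length(x))

# Mean and standard error of wealth within each bin.
by_group <- aggregate(wealth ~ cutter, data = df1,
                      FUN = function(x) c(mean = mean(x), stdError = se(x)))

# aggregate() returns a matrix column here; flatten it into ordinary columns.
by_group <- do.call(data.frame, by_group)
```

The result has one row per bin with columns cutter, wealth.mean and wealth.stdError.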