Making a for loop for two variables in R - r

I have a dataset on fecal egg counts (FEC) in sheep and would like to test the effect of different treatments. I have 120 farms treated with combinations of different drugs. I am using the "eggCounts" package, which takes a Bayesian approach, to simulate and produce the outputs for each farm and treatment (Drug).
I want to make a loop for two variables; i) Farm and ii) Drug.
This means the for loop can generate the output for each farm and drug, similar to the subset function.
setwd("C:/Data")
dat <- read.csv("McMaster_UTC.csv", header=T, sep="'", na.string=NA)
dat$EPG1 <- as.numeric(dat$EPG1)
dat$EPG2 <- as.numeric(dat$EPG2)
# Subset for Farm & Drug, if I want to run the data for each Farm and Drug.
# I assume with a loop I don't need the subset function.
dat1 <-subset(dat, Farm==1 & Drug=="CLO", select=1:9)
# A model to compare the two groups (unpaired)
model <- fecr_stan(egg.1$EPG1/50, egg.1$EPG2/50, rawCounts=FALSE, paired=FALSE, indEfficacy=FALSE)
#This is the for loop that I have in my mind, but not sure how to fit these in the loop:
out <- vector()
for(i in 1:ncol(dat)) {
for(j in 1:ncol(dat)) {
....
....
set.seed(1234)
model <- fecr_stan(egg.1$EPG1/50, egg.1$EPG2/50, rawCounts=FALSE, paired=FALSE, indEfficacy=FALSE)
model[i] <- out[i]
Here is a subset of my data:
Sample data for 2 farms with two drugs

If I understand what you want correctly, you just want to loop over each combination of Farm and Drug. Step 1 is to work out which values each of these variables takes. If both columns were read in as factors you could use the levels function, but Farm looks numeric in your data and, since R 4.0.0, read.csv no longer converts strings to factors by default, so unique is the safer choice:
Farms = unique(dat$Farm)
Drugs = unique(dat$Drug)
Then to loop over each, you can just do:
set.seed(1234) ## Only put this inside your loop if you want to use the same
## sequence of random numbers for each model.
for(farm in Farms){
for(drug in Drugs){
dat1 <-subset(dat, Farm == farm & Drug == drug, select=1:9)
model <- fecr_stan(dat1$EPG1/50, dat1$EPG2/50,
rawCounts = TRUE, paired=FALSE,
indEfficacy=FALSE)
## Note you'll have to do something to save or store the model output here
}
}
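One way to handle the "save or store the model output" step is to collect each fit in a named list. The sketch below is only an illustration: the toy data, the drug label "IVM" and the fit_model() helper are made up so that the example runs without the real file or package. In practice you would call fecr_stan() where fit_model() is called.

```r
# Toy data standing in for the real file (column names follow the question).
set.seed(1234)
dat <- data.frame(
  Farm = rep(1:2, each = 8),
  Drug = rep(rep(c("CLO", "IVM"), each = 4), times = 2),
  EPG1 = rpois(16, 400),
  EPG2 = rpois(16, 100)
)

# Hypothetical stand-in for the real model call; swap in fecr_stan(...) here.
fit_model <- function(pre, post) {
  c(reduction = 1 - mean(post) / mean(pre))
}

results <- list()
for (farm in unique(dat$Farm)) {
  for (drug in unique(dat$Drug)) {
    dat1 <- subset(dat, Farm == farm & Drug == drug)
    if (nrow(dat1) == 0) next  # skip Farm/Drug combinations that do not occur
    key <- paste(farm, drug, sep = "_")
    results[[key]] <- fit_model(dat1$EPG1 / 50, dat1$EPG2 / 50)
  }
}

names(results)
```

Each element can then be pulled out by name, for example results[["1_CLO"]]. The nrow() check matters when not every drug was used on every farm.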
Thanks James for your code and comments, much appreciated. First, there are a couple of errors in my code: egg.1$EPG1/50 and egg.1$EPG2/50 should be dat$EPG1/50 and dat$EPG2/50. I must say that I am not good with loops, so apologies if I am asking silly questions here. Yes, I want to compare the difference between dat$EPG1/50 and dat$EPG2/50 for each Drug within each Farm, and save the output in a vector. When you say "loop over each", what do you mean by that? Are you suggesting that I put model <- fecr_stan(dat$EPG1/50, dat$EPG2/50, rawCounts=TRUE, paired=FALSE, indEfficacy=FALSE) in the loop, and it will work? Should I put set.seed(12345) outside the loop? I appreciate having your thoughts! I attached a sample of my data as an image, if you want to have a look. I didn't know how to paste a sample of my data here.

Related

Print multiple Outputs stored in a vector with a Loop

I'm new to R and coding in general...
I have run ANOVA analyses on multiple columns (16 in total).
For that purpose, the purrr package helped me:
anova_results_5sector <- purrr::map(df_anova_ch[,3:18], ~aov(.x ~ df_anova_ch$Own_5sector))
summary(anova_results_5sector[[1]])
So the dumbest way to retrieve output (p-value, etc) is the following method
summary(anova_results_5sector$Env_Pillar)
summary(anova_results_5sector$Gov_Pillar)
summary(anova_results_5sector$Soc_Pillar)
summary(anova_results_5sector$CSR_Strat)
summary(anova_results_5sector$Comm)
summary(anova_results_5sector$ESG_Comb)
summary(anova_results_5sector$ESG_Contro)
summary(anova_results_5sector$ESG_Score)
summary(anova_results_5sector$Env_Innov)
summary(anova_results_5sector$Human_Ri)
summary(anova_results_5sector$Management)
summary(anova_results_5sector$Prod_Resp)
I've tried to use a loop :
for(i in 1:length(anova_results_5sector)){
summary(anova_results_5sector$[i])
}
It didn't work. I don't know, and could not find, how to deal with $ in a for loop.
Here is a look at the structure of the output vector:
Structure of output
I have tried several times with other methods, more or less complicated. Often the examples found online are too simple and do not allow me to adapt them to my data.
Any tips?
Thank you, and sorry for such a newbie question.
Whenever I use a loop for an analysis I like to store the results in a data.frame, since it lets me keep a good overview. Since you did not provide a reproducible example, I used the iris dataset:
data("iris")
#make a data frame to store the results with as many columns and rows as you need
anova_results <- data.frame(matrix(ncol = 3, nrow = 3))
#one column per value you want to store and one row per anova you want to run
x <- c("number", "Mean_Sq", "p_value") #assign all values you want to store as column names
colnames(anova_results) <- x
anova_results$number <- 1:3 #assign a number to each anova you want to run, e.g. 3
In the loop you can now extract the results of the anova that you are interested in. I use mean squares and p-value as an example, but you can of course add others. Don't forget to add a column for each other value you want to store.
for (i in 2:4){
my_anova <- aov(iris[[1]] ~ iris[[i]])
p <- summary(my_anova)[[1]][["Pr(>F)"]][1] #extract the p value
anova_results$p_value[anova_results$number == i-1] <- p
mean_sq <- summary(my_anova)[[1]][["Mean Sq"]][1] #extract the mean squares
anova_results$Mean_Sq[anova_results$number == i-1] <- mean_sq
}
View(anova_results)
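On the original question of how to use $ inside a loop: $ only accepts a literal name, so the usual way to index a list by position (or by a name held in a variable) is [[i]]. Results are also not auto-printed inside a for loop, so summary() needs an explicit print(). A minimal sketch on iris, using lapply() in place of purrr::map() so it runs in base R (the grouping by Species is just for illustration):

```r
data("iris")

# Fit one ANOVA per numeric column and keep the fits in a named list.
anova_results <- lapply(iris[, 1:4], function(x) aov(x ~ iris$Species))

# Auto-printing is switched off inside a for loop, so call print() explicitly.
for (i in seq_along(anova_results)) {
  cat("\n==", names(anova_results)[i], "==\n")
  print(summary(anova_results[[i]]))
}

# Collect the p-values into a named numeric vector.
p_values <- sapply(anova_results,
                   function(fit) summary(fit)[[1]][["Pr(>F)"]][1])
p_values
```

The same [[ ]] indexing should work on a list returned by purrr::map(), since that is an ordinary named list.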

Running for loop across multiple groups

I am running the following imputation task in R as a for loop:
myData <- essuk[c(2,3,4,5,6,12)]
myDataImp <- matrix(0,dim(myData)[1],dim(myData)[2])
lower <- c(0)
upper <- c(Inf)
for (k in c(1:5))
{
gmm.fit1 <- gmm.tmvnorm(matrix(myData[,k],length(myData[,k]),1), lower=lower, upper=upper)
useMu <- matrix(gmm.fit1$coefficients[1],1,1)
useSigma <- matrix(gmm.fit1$coefficients[2],1,1)
replaceThese <- myData[,k]<=0
myDataImp[,k] <- myData[,k]
myDataImp[replaceThese,k] <- rtmvnorm(n=sum(replaceThese), c(useMu), c(useSigma), c(-Inf), c(0))
}
The steps are pretty straightforward
Define the data set and an empty imputation data set.
For column 1-5, fit a model.
Extract model estimates to be used for imputation.
Run a model using model estimates and replace values <= 0 with the new values in the imputation data set.
However, I want to do this separately for multiple groups, rather than for the full sample. Column 12 in the data set contains information on group membership (integers ranging from 1-72).
I have tried several options, including splitting the data frame with data_list <- split(myData, myData$V12) and use the lapply() function. However, this does not work due to how model estimates are formatted:
Error in as.data.frame.default(data) :
cannot coerce class "gmm" to a data.frame
I have also thought about the possibility of doing a nested for loop, although I am not sure how that could be accomplished. Any suggestions are much appreciated.
What about using subset()?
myData$V12 <- as.factor(myData$V12)
listofresults <- list() # a list keeps each group's imputed matrix separate
for (i in levels(myData$V12)){
data <- subset(myData, V12 == i)
# your analysis here, run on data instead of myData: result saved in myDataImp
listofresults[[i]] <- myDataImp
}
Not the most elegant, but should work.
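The same per-group idea can also be written with split(), lapply() and unsplit(), which puts the rows back in their original order. The sketch below is an assumption-laden illustration: the toy data and the impute_column() helper are made up, and the helper simply replaces values <= 0 with the mean of the positive values, standing in for the gmm.tmvnorm()/rtmvnorm() step from the question.

```r
# Toy data: two measurement columns plus a group column named V12.
myData <- data.frame(
  x1  = c(-1, 2, 3, 0, 5, 6),
  x2  = c(4, -2, 6, 7, 0, 9),
  V12 = c(1, 1, 1, 2, 2, 2)
)

# Hypothetical stand-in for the model-based imputation step:
# replace values <= 0 with the mean of the positive values in that column.
impute_column <- function(x) {
  replace_these <- x <= 0
  x[replace_these] <- mean(x[!replace_these])
  x
}

value_cols <- c("x1", "x2")
groups <- split(myData, myData$V12)           # one data frame per group
imputed <- lapply(groups, function(g) {
  g[value_cols] <- lapply(g[value_cols], impute_column)
  g
})
myDataImp <- unsplit(imputed, myData$V12)     # back to the original row order
myDataImp
```

Because the function passed to lapply() returns a data frame rather than the fitted model object, nothing has to be coerced from class "gmm".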

How to loop a test against one designated control group in R?

I'm completely new to programming and R, but have a dataset that can only be analyzed with a more powerful statistics program such as R.
I have a large but simple dataset consisting of thousands of different groups with multiple samples, which I want to compare against the control group with a Mann-Whitney U test. The data structure is shown below.
Group, Measurements
a 0.14534
cont 0.42574
d 0.36347
c 0.14284
a 0.23593
d 0.36347
cont 0.33514
cont 0.29210
b 0.36345
...
The problem is that the test requires exactly two groups to be designated. As I have more than one group to compare against the control, it does not work.
This is what I have so far. As you can see, it does not repeat the test and only works if I have two groups in my input file.
data1 = read.csv(file.choose(), header=TRUE, stringsAsFactors=FALSE)
attach(data1)
testoutput <- wilcox.test(group ~ measurement, mu=0, alt="two.sided", conf.int=TRUE, conf.level=0.95, paired=FALSE, exact=FALSE, correct=TRUE)
write.table(testoutput$p.value, file="mwUtest.tsv", sep="\t")
How do I write and loop the test properly so that it tests all my groups against my designated control group? I assume the sapply or lapply functions are used before the wilcox.test, but I don't know how.
I'm sorry if this simple question has been brought up before, but I could not find any previous question regarding this specific problem.
In R, there are often many solutions to the same problem. Here's how I would solve this.
First, I would split my data and have one dataframe with experiments and one with controls:
experiments <- dat[dat$group!="cont",]
controls <- dat[dat$group=="cont",]
Then I would split my experimental data by group, and feed that to my test together with my control measurements. Note that this construction makes it easy to extract more values from the test: just return a (named) vector.
result <- lapply(split(experiments, experiments$group),function(x){
mytest = wilcox.test(x$measurement,controls$measurement,mu=0, alt="two.sided", conf.int=TRUE, conf.level=0.95, paired=FALSE, exact=FALSE, correct=TRUE)
return(mytest$p.value)
})
Combining to a table is then easy:
output <- do.call(rbind,result)
Data used:
set.seed(123)
nobs=100
dat <- data.frame(group=sample(c(LETTERS[1:6],"cont"),nobs,TRUE),
measurement=runif(nobs),stringsAsFactors=FALSE)
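To illustrate the remark about returning a (named) vector, here is a sketch that keeps the test statistic and the location estimate alongside the p-value. It reuses the simulated data above; the names W, p.value and estimate in the returned vector are my own choice.

```r
set.seed(123)
nobs <- 100
dat <- data.frame(group = sample(c(LETTERS[1:6], "cont"), nobs, TRUE),
                  measurement = runif(nobs), stringsAsFactors = FALSE)

experiments <- dat[dat$group != "cont", ]
controls <- dat[dat$group == "cont", ]

# One test per experimental group, each against the same control measurements.
result <- lapply(split(experiments, experiments$group), function(x) {
  mytest <- wilcox.test(x$measurement, controls$measurement,
                        conf.int = TRUE, exact = FALSE)
  c(W = unname(mytest$statistic),
    p.value = mytest$p.value,
    estimate = unname(mytest$estimate))
})

# One row per group, one column per returned value.
output <- as.data.frame(do.call(rbind, result))
output
```

The row names of output are the group labels, so write.table(output, file="mwUtest.tsv", sep="\t") would give one line per comparison.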

Data Partition in R

I am looking for a robust way to partition a dataset without using the sample() function, and hope to get some feedback.
In fact, I'd ideally like to get rid of the randomness inherent in using sample():
samp<-data.frame(qldat) # convert zoo time-series object to data.frame
ind <- sample(2,nrow(samp),replace = TRUE, prob=c(0.8,0.2)) # splitting
#data series between training and test sets
tsamp<- samp[ind==1,] # training dataset
vsamp<- samp[ind==2,] # test set
After some research, I figured out that subset() could help, but it would involve a bit of hard-coding against the dataset. By hard-coding I mean that, for an 80:20 split (%) using nrow(samp), it's possible to subset the data from row 1 to row 0.8 * nrow(samp), for instance, though that might not be a very efficient solution.
I've also tried createDataPartition(), but it did not match my expectations since samp does not hold any categorical data that I could rely on for the split (e.g. createDataPartition(y=samp$categoricaldata, p=0.8, list=FALSE)).
PS: What I like in ind <- is the inclusion of prob=c(0.8,0.2), so the slice is sorted out automatically. Hence any similar idea that avoids randomly splitting tsamp and vsamp would be much appreciated.
Best,
Is this what you are looking for?
n <- nrow(samp)
train_i <- 1:round(0.8*n)
test_i <- round(0.8*n+1):n
train <- samp[train_i,]
test <- samp[test_i,]
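If you want the proportion as a parameter, much like prob=c(0.8,0.2) in the original call, the same ordered split can be wrapped in a small function. This is only a sketch: split_in_order() is a hypothetical helper name, and the toy samp stands in for the real time series.

```r
# Hypothetical helper: the first share `p` of the rows goes to training,
# the remaining rows go to the test set, with no randomness involved.
split_in_order <- function(df, p = 0.8) {
  n <- nrow(df)
  n_train <- round(p * n)
  list(train = df[seq_len(n_train), , drop = FALSE],
       test  = df[seq_len(n - n_train) + n_train, , drop = FALSE])
}

samp <- data.frame(t = 1:10, value = (1:10)^2)  # toy stand-in for the real data
parts <- split_in_order(samp, p = 0.8)
nrow(parts$train)
nrow(parts$test)
```

Using seq_len() rather than 1:n avoids the surprise of 1:0 counting backwards when one of the two parts turns out to be empty.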

Finding the mean of a variable within an imputed data set for population quintiles

I have a data set "base_data" which has missing values. I have therefore used the package 'Amelia' to impute the missing values into an object "a.output".
I have been able to find the mean for some variables within the imputed results using the following code:
q.out<-NULL
se.out<-NULL
for(i in 1:m) {
dclus <- svydesign(id=~site, data=a.output$base_data[[i]])
q.out <- rbind(q.out, coef(svymean(~hh_expenditure, dclus)))
se.out <- rbind(se.out, SE(svymean(~hh_expenditure, dclus)))}
I have combined the results using:
svymean.combine <- mi.meld(q = q.out, se = se.out)
Which gives me the mean and standard error for household expenditure (hh_expenditure) across the population.
However I have a variable which splits the population into wealth quintiles (wealth_quin).
As such, I am now wanting to find the average, and standard error, of the household expenditure per wealth_quin (a variable which is either 1,2,3,4,or 5).
I initially tried subsetting the imputed data, but this came up with many errors.
Is there a way to do this without having to split up the data into the 5 wealth quintiles before imputing the data?
Cheers,
Timothy
EDIT: HERE IS A WORKABLE EXAMPLE
require(Amelia)
require(survey)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
So now we have a data frame to work with, with missing values for hh_inc, which I want to impute.
First step: set the number of imputations.
m<-5
Now run the imputation:
a.output <- amelia(x = impute, m=m, autopri=0.5,cs="X",
idvars=c("site","var2"),
logs=c("hh_inc","var3"))
a.output now holds the data from the 5 imputations.
What I now want to do is find the average (and standard error) hh_inc for site 1 and site 2 separately using the imputed values from amelia.
How is that possible to do? I know it is possible to do if I just ignore the NA's. But this might introduce bias, hence why I imputed the values in the first place.
Cheers,
Timothy
EDIT:
I have placed a bounty on this. If no one knows the exact way to do it, then the results from the individual imputed data sets can be combined using Rubin's formula (http://sites.stat.psu.edu/~jls/mifaq.html#minf)
As such, I will award the bounty to someone who can transform the 5 separate imputed datasets from the Amelia object into 5 separate, complete, data frames.
require(Amelia)
require(survey)
require(data.table)
require(plotrix)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
summary(impute)
m <- 5
a.output <- amelia(x = impute, m=m, autopri=0.5,cs="X",
idvars=c("site","var2"),
logs=c("hh_inc","var3"))
stats.out <- NULL
for(i in 1:m){
df2 <- data.table(a.output$imputations[[i]])
df3 <- data.frame(dataset=i,df2[,list(std.error(hh_inc),mean(hh_inc)), by="site"])
stats.out <- rbind(stats.out, df3)
}
colnames(stats.out) <- c("dataset","site","stdError","mean")
stats.out
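Once the per-imputation means and standard errors are available for each site, they can be pooled with Rubin's rules, which the question mentions. The sketch below writes the calculation out by hand in base R. The numbers are made up purely for illustration, and combine_rubin() is a hypothetical helper, not a function from Amelia. As far as I know, mi.meld(), which the question already uses for the overall mean, does the same kind of pooling.

```r
# Illustrative numbers only: rows are imputations (m = 3), columns are sites.
q  <- rbind(c(4800, 5200), c(5000, 5100), c(4900, 5300))  # per-imputation means
se <- rbind(c(900, 1100), c(950, 1000), c(920, 1050))     # per-imputation SEs
colnames(q) <- colnames(se) <- c("site1", "site2")

# Hypothetical helper implementing Rubin's rules column by column.
combine_rubin <- function(q, se) {
  m <- nrow(q)
  pooled_mean <- colMeans(q)
  within_var  <- colMeans(se^2)      # average sampling variance
  between_var <- apply(q, 2, var)    # variance of the estimates across imputations
  total_var   <- within_var + (1 + 1 / m) * between_var
  rbind(mean = pooled_mean, se = sqrt(total_var))
}

res <- combine_rubin(q, se)
res
```

With real output, q and se would be built from stats.out, one row per imputed dataset and one column per site.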
I'm not sure I understand your question or the structure of your data (specifically the importance of whether the data is imputed or not) but here's how I've done some summary stats by group.
require(data.table)
require(plotrix)
# create some data
df1 <- data.frame(id=seq(1,50,1), wealth = runif(50)*1000)
df1$cutter <- cut(df1$wealth, 5, labels=FALSE)
head(df1)
# put the data into a data.table to speed things up
df2 <- as.data.table(df1)
head(df2)
grp1StdErr <- df2[,std.error(wealth), by="cutter"]
grp1Mean <- df2[,mean(wealth), by="cutter"]
Hope this helps.
Or, in one grouping step :
df2[,list(std.error(wealth),mean(wealth)), by=cut(wealth,5,labels=FALSE)]

Resources