Loops and ANOVA - r

So I have already done this analysis in SAS and am trying to replicate it in R, but I am new to R and know virtually nothing right now. I have tried a bunch of things but seem to get an error everywhere I go. I will try to simplify things, because I figure that if I can make it work on a small scale I can extrapolate it to a larger scale.
Basically, I have a huge data set in which every subject has a value for each metabolite. I want to run an ANOVA test on ALL of these metabolites; there are 600+ of them. I want to find their p-values and put them all into a table with the metabolite label and the p-value. Here is an example of what the data could look like.
Subject   Treatment   Antibiotic   Metabolite1   Metabolite2   ...   Metabolite600
MG_1      MD          No           1.257         2.578               5.12
MG_2      MS          1SS          3.59          1.052               1.5201
MG_3      MD1SS       No           1.564         1.7489              1.310
etc...
I know I can run:
fit1 <- aov(Metabolite1 ~ TREATMENT * ANTIBIOTIC, data=data1)
to calculate it for just the first metabolite. I am trying to write a for loop just to try it out. Basically, I want to know whether I can use the aov() function without having to copy/paste it 600 times and type in 1 to 600 everywhere.
In SAS I could create a macro variable and assign it a number, so that when I build a name I could simply write Metabolite&i for the y value and fit&i to save the results. Is there any way to do this in R?
I've tried Metabolite[i] inside for (i in 1:20), but that doesn't work. Is there any way to actually reference the i in a loop? What is the proper syntax, if there is one?
Edit: I really don't know how to make this any simpler than it is; my data set is huge, and I literally have only about 3 lines of code right now.
library(gdata)
data1 <- read.xls("~data1", sheet = 1)
fit1 <- aov(Metabolite1 ~ TREATMENT * ANTIBIOTIC, data = data1)
summary(fit1)
This is literally all I have. As I mentioned above I tried doing
for (i in 1:20) {
  fit[i] <- aov(Metabolite[i] ~ TREATMENT * ANTIBIOTIC, data = data1)
}
which does NOT work. It just says object 'Metabolite' not found; it completely ignores my reference to the i value. I am just trying to start small at first.

It's difficult to debug the following code without data, but I would try something like the following:
library(tidyverse)
library(broom)
# Reshape to long format (one row per subject per metabolite), then nest by metabolite
data_nested <- data1 %>%
  gather(key = MetaboliteType, value = Metabolite,
         -Subject, -Treatment, -Antibiotic) %>%
  group_by(MetaboliteType) %>%
  nest()
# Fit the two-way ANOVA for one metabolite's nested data frame
aov_fun <- function(df) {
  aov(Metabolite ~ Treatment * Antibiotic, data = df)
}
# Fit every metabolite, tidy each result, and unnest into one table
(results <- data_nested %>%
  mutate(fit = map(data, aov_fun), tidy = map(fit, tidy)) %>%
  unnest(tidy))
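If you would rather stay close to the for loop you were attempting, here is a minimal base-R sketch. It assumes the response columns are literally named Metabolite1 through Metabolite600 and that the design columns are called Treatment and Antibiotic (adjust to your actual names); building the formula from a string plays the role of SAS's Metabolite&i.
# Base-R sketch: build each formula as a string, roughly like Metabolite&i in SAS
fits <- vector("list", 600)
pvals <- data.frame(Metabolite = character(0), p.value = numeric(0))
for (i in 1:600) {
  yvar <- paste0("Metabolite", i)
  fits[[i]] <- aov(as.formula(paste(yvar, "~ Treatment * Antibiotic")), data = data1)
  # Pr(>F) column of the ANOVA table; row 1 is the Treatment main effect
  p <- summary(fits[[i]])[[1]][["Pr(>F)"]][1]
  pvals <- rbind(pvals, data.frame(Metabolite = yvar, p.value = p))
}
Here pvals ends up with one row per metabolite, holding the metabolite name and the p-value for the Treatment main effect.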

Related

Iterating Effect Size Calculations Through Columns

I am currently comparing the size of 159 brain regions (ROIs) between an at-risk population and a normal population in R. I originally calculated lm model p-values using this loop:
storage <- list()
for (i in names(ThalPC)[-c(1:8)]) {
  storage[[i]] <- lm(get(i) ~ Status, ThalPC)
}
table <- storage %>%
  tibble(
    dvsub = names(.),
    untidied = .
  ) %>%
  mutate(tidy = map(untidied, broom::tidy)) %>%
  unnest(tidy)
tab <- as.data.frame(table)
to <- subset(tab, select = -c(2))
newtable <- filter(to, term == "StatusControl")
ThalPC = my data frame
Status = their status (Control or at-risk population)
Now I have around 59 regions with significant p-values, and I am hoping to calculate effect sizes for them. Currently I am trying to use this loop:
stor <- list()
for (i in names(ThalPC)[-c(1:9)]) {
  stor[[i]] <- lm(get(i) ~ Status, ThalPC)
  try <- effectsize(stor[[i]], type = "eta")
}
However, I get the following error:
Error in get(i) : object 'Left_LGN' not found
(Left_LGN being a region that I am studying, all the 159 regions are set up as columns through the data frame)
Perhaps I am overthinking it; does anyone know a simpler solution or a better approach to getting the effect sizes?
I am still a beginner in R and statistics so I really appreciate your input!!
Thank you!
I would guess you used attach(ThalPC) before running your first script to add columns of ThalPC to the search path. Instead, try constructing your call to lm as:
stor[[i]] <- lm(as.formula(paste(i, "~ Status")), data = ThalPC)
It also looks like you might want to collect the output of effectsize as elements of a list; otherwise you are overwriting it on every iteration.
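Putting both suggestions together, a minimal sketch (reusing your effectsize(..., type = "eta") call and assuming the effectsize package is loaded):
# Build each formula from the column name; collect the fits and effect sizes in lists
library(effectsize)
stor <- list()
eff <- list()
for (i in names(ThalPC)[-c(1:9)]) {
  stor[[i]] <- lm(as.formula(paste(i, "~ Status")), data = ThalPC)
  eff[[i]] <- effectsize(stor[[i]], type = "eta")
}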

Plotting a logistic model with the visreg package doesn't work: data not found

My second question of the day: I want to use the visreg package to plot my logistic regression models. As long as I don't use the "by" argument it works like a charm, but when I add it I get an error. The code I used to create my model is the following:
m3 <- glm(alive ~ seatbelt * dvcat + sex + ageOFocc + airbag, family = binomial, data = nassCDS)
summary(m3)
If I then use:
visreg(m3, "seatbelt", scale = "response")
I get a plot of the model, which is just fine. But if I now add the "by" argument I get an error:
visreg(m3, "seatbelt", by="dvcat", scale ="response")
I googled, and as far as I understand it the function can't find the data needed to plot the model. But where can I supply the data? I already tried the "data=" argument, but it wasn't working for me (or I did it wrong). There is no console output that I can provide, only the message on the graph itself. Can somebody help me? Kind regards, Jan :)
EDIT: I used the "nassCDS" dataset from Vincent Arel-Bundock's GitHub, which you can find here: https://vincentarelbundock.github.io/Rdatasets/datasets.html I just created the column alive from the column dead so that I am able to use logistic regression. For that I used the dplyr package with the following code:
nassCDS <- nassCDS %>%
  mutate(dead1 = as.integer(dead)) %>%
  mutate(alive = sjmisc::rec(dead1, rec = "2=0; 1=1")) %>%
  select(seatbelt, dead, alive, dvcat, sex, ageOFocc, everything()) %>%
  select(-dead1)
Furthermore, I changed the columns airbag and seatbelt to numeric, as suggested by another Stack Overflow user.
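One thing worth trying (a sketch, not a verified fix): if visreg cannot recover the model's data from the calling environment, you can hand it the data frame explicitly through its data argument:
# Pass the data used to fit m3 explicitly so visreg can find it
visreg(m3, "seatbelt", by = "dvcat", scale = "response", data = nassCDS)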

How to correctly take out zero observations in panel data in R

I'm running into some problems while running plm regressions on my panel data. Basically, I have to take out one year from my data and also all observations where a certain variable is zero. I tried to make a reproducible example using a dataset from the AER package.
library(AER)
library(plm)
data("Grunfeld", package = "AER")
View(Grunfeld)
#Here I randomly set some observations of the third variable (capital) to zero, to mimic my dataset
for (i in 1:220) {
  x <- rnorm(10, 0, 1)
  if (mean(x) >= 0) {
    Grunfeld[i, 3] <- 0
  }
}
View(Grunfeld)
panel <- Grunfeld
#First Method
#This is how I was originally manipulating my data and running my regression
panel <- Grunfeld
dd <-pdata.frame(panel, index = c('firm', 'year'))
dd <- dd[dd$year!=1935, ]
dd <- dd[dd$capital !=0, ]
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
summary(ols_model_2)
#However, I couldn't plot the variables of the dataset in graphs, because they weren't vectors. So I tried another way:
#Second Method
panel <- panel[panel$year!= 1935, ]
panel <- panel[panel$capital != 0,]
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
summary(ols_model)
#But this gave extremely different results for the OLS regression!
In my understanding, both approaches should have yielded the same output in the OLS regression. Now I'm afraid my entire analysis is wrong, because I was doing it the first way. Could anyone explain to me what is happening?
Thanks in advance!
You are running two different models, so I am not sure why you would expect the results to be the same.
Your first model is:
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
While the second is:
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
As you can see from the summaries, both are "Oneway (individual) effect Within Model" fits. In the first one you don't specify the index, since dd is a pdata.frame object. In the second you do specify the index, because panel is a plain data.frame. However, this makes no difference at all.
The difference is that the first model uses capital while the second uses log(capital).
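To see this, fit the same specification on both objects; the estimates should then agree (reusing dd and panel as defined in the question):
# Same formula on both data objects; the results should now match
m_pdata <- plm(log(value) ~ log(capital), data = dd)
m_df <- plm(log(value) ~ log(capital), data = panel, index = c("firm", "year"))
summary(m_pdata)
summary(m_df)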
As a side note, leaving out zero observations is often very problematic. If you do that, make sure you also try alternative ways of dealing with zeros and see how much your results change. You can get started here: https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a generalized linear model on some data and, when I run it through confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double- and triple-checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two inputs to confusionMatrix disagree. I've run almost exactly the same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the confusionMatrix/predict step. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1] + a[2]
# 1543
b <- table(predict(modelFit, trainTR2))
dim(b)
# [1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure, since I don't know your data structure, but I wonder if this is due to the way you set up modelFit, using the formula method. In that case you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
This specifies y = training2$hold1yes0no and x = trainPC2.
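For completeness, a short sketch of the follow-up steps with that x/y interface (assuming hold1yes0no is a factor, and reusing the objects from the question):
# Fit on the preprocessed training predictors, then evaluate on the held-out set
modelFit <- train(x = trainPC2, y = training2$hold1yes0no, method = "glm")
preds <- predict(modelFit, newdata = trainTR2)
confusionMatrix(data = preds, reference = testing2$hold1yes0no)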

Removing character level outlier in R

I have a linear model, model1 <- lm(divorce_rate ~ marriage_rate + median_age + population), for which the leverage plot shows an outlier at observation 28 (the State variable id for "Nevada"). I'd like to fit the model without Nevada in the dataset. I tried the following but got stuck.
data<-read.dta("census.dta")
attach(data)
data1<-data.frame(pop,divorce,marriage,popurban,medage,divrate,marrate)
attach(data1)
model1<-lm(divrate~marrate+medage+pop,data=data1)
summary(model1)
layout(matrix(1:4,2,2))
plot(model1)
dfbetaPlots(lm(divrate~marrate+medage+pop),id.n=50)
vif(model1)
dataNV<-data[!data$state == "Nevada",]
attach(dataNV)
model3<-lm(divrate~marrate+medage+pop,data=dataNV)
The last line of the above code gives me
Error in model.frame.default(formula = divrate ~ marrate + medage + pop, :
variable lengths differ (found for 'medage')
I suspect that you have some glitch in your code such that attach()ed copies are still lying around in your environment; that's why it's really best practice not to use attach(). The following code works for me:
library(foreign)
## best not to call data 'data'
mydata <- read.dta("http://www.stata-press.com/data/r8/census.dta")
I didn't find divrate or marrate in the data set: I'm going to speculate that you want the per capita rates:
## best practice to use a new name rather than transforming 'in place'
mydata2 <- transform(mydata,marrate=marriage/pop,divrate=divorce/pop)
model1 <- lm(divrate~marrate+medage+pop,data=mydata2)
library(car)
plot(model1)
dfbetaPlots(model1)
This works fine for me in a clean session:
dataNV <- subset(mydata2,state != "Nevada")
## update() may be nice to avoid repeating details of the
## model specification (not really necessary in this case)
model3 <- update(model1,data=dataNV)
Or you can use the subset argument:
model4 <- update(model1,subset=(state != "Nevada"))
