Iterating Effect Size Calculations Through Columns - r

I am currently comparing the size of 159 brain regions (ROIs) between an at-risk and a normal population in R. I originally calculated lm model p-values using this loop:
storage <- list()
for (i in names(ThalPC)[-c(1:8)]) {
  storage[[i]] <- lm(get(i) ~ Status, ThalPC)
}
table <- storage %>%
  tibble(
    dvsub = names(.),
    untidied = .
  ) %>%
  mutate(tidy = map(untidied, broom::tidy)) %>%
  unnest(tidy)
tab <- as.data.frame(table)
to <- subset(tab, select = -c(2))
newtable <- filter(to, term == "StatusControl")
ThalPC = my data frame
Status = their status as control or at-risk population
Now, I have around 59 regions with significant p-values and I am hoping to calculate the effect sizes for them. Currently I am trying to use this loop:
stor <- list()
for(i in names(ThalPC)[-c(1:9)]) {
stor[[i]] <- lm(get(i) ~ Status, ThalPC)
try <- effectsize(stor[[i]], type="eta")
}
However, I get the following error:
Error in get(i) : object 'Left_LGN' not found
(Left_LGN being a region that I am studying, all the 159 regions are set up as columns through the data frame)
Perhaps I am overthinking it. Does anyone know a simpler solution or a better approach to getting the effect sizes for them?
I am still a beginner in R and statistics so I really appreciate your input!!
Thank you!

I would guess you used attach(ThalPC) before running your first script to add columns of ThalPC to the search path. Instead, try constructing your call to lm as:
stor[[i]] <- lm(as.formula(paste(i, "~ Status")),
                data = ThalPC)
It looks like you might want to collect the output of effectsize as elements of a list too, otherwise you're overwriting it each time.
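To make both suggestions concrete, here is a minimal sketch on made-up data. The data frame `toy`, its column names, and the helper `eta_sq` are all hypothetical stand-ins, and eta squared is computed by hand from the ANOVA table rather than with the effectsize package, so nothing beyond base R is assumed. Column names that are not syntactically valid R names would need backticks inside the pasted formula.

```r
# Toy stand-in for ThalPC: a grouping column plus a few numeric region columns.
# All names and values here are invented for illustration.
set.seed(1)
toy <- data.frame(
  Status    = factor(rep(c("AtRisk", "Control"), each = 20)),
  Left_LGN  = rnorm(40, mean = rep(c(10, 11), each = 20)),
  Right_LGN = rnorm(40, mean = rep(c(10, 10), each = 20)),
  Left_MGN  = rnorm(40, mean = rep(c(5, 6.5), each = 20))
)

regions <- setdiff(names(toy), "Status")

# Eta squared = SS_effect / SS_total, read off the ANOVA table of the fit.
eta_sq <- function(fit) {
  tab <- anova(fit)
  tab["Status", "Sum Sq"] / sum(tab[["Sum Sq"]])
}

fits    <- list()
effects <- list()
for (i in regions) {
  # Build the formula from the column name instead of relying on get()
  fits[[i]]    <- lm(as.formula(paste(i, "~ Status")), data = toy)
  # Store one effect size per region so nothing is overwritten
  effects[[i]] <- eta_sq(fits[[i]])
}

effect_table <- data.frame(region      = names(effects),
                           eta_squared = unlist(effects),
                           row.names   = NULL)
print(effect_table)
```

With a single predictor, this eta squared is the same number as the model's R squared, which gives a quick sanity check.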

Related

How to correctly take out zero observations in panel data in R

I'm running into some problems while running plm regressions in my panel database. Basically, I have to take out a year from my base and also all observations from some variable that are zero. I tried to make a reproducible example using a dataset from AER package.
library(AER)
library(plm)
data("Grunfeld", package = "AER")
View(Grunfeld)
#Here I randomize some observations of the third variable (capital) as zero, to reproduce my dataset
for (i in 1:220) {
  x <- rnorm(10,0,1)
  if (mean(x) >=0) {
    Grunfeld[i,3] <- 0
  }
}
View(Grunfeld)
panel <- Grunfeld
#First Method
#This is how I was originally manipulating my data and running my regression
panel <- Grunfeld
dd <-pdata.frame(panel, index = c('firm', 'year'))
dd <- dd[dd$year!=1935, ]
dd <- dd[dd$capital !=0, ]
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
summary(ols_model_2)
#However, I couldn't plot the variables of the datasets in graphs, because they weren't vectors. So I tried another way:
#Second Method
panel <- panel[panel$year!= 1935, ]
panel <- panel[panel$capital != 0,]
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
summary(ols_model)
#But this gave extremely different results for the ols regression!
In my understanding, both approaches should have yielded the same output in the OLS regression. Now I'm afraid my entire analysis is wrong, because I was doing it the first way. Could anyone explain to me what is happening?
Thanks in advance!
You are running two different models. I am not sure why you would expect the results to be the same.
Your first model is:
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
While the second is:
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
As you can see from the summary of the models, both are "Oneway (individual) effect Within Model". In the first one you don't specify the index, since dd is a pdata.frame object. In the second you do specify the index, because panel is a simple data.frame. However, this makes no difference at all.
The difference is that the second model uses log(capital) as the regressor, while the first uses capital without the log.
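A small sketch of why the two specifications cannot be expected to agree. It uses plain lm() on simulated data instead of plm() on Grunfeld, so the variable names and numbers are assumptions made only for illustration.

```r
# Simulated data where log(value) is linear in log(capital) with slope 0.5.
set.seed(42)
capital <- runif(100, min = 1, max = 500)
value   <- exp(2 + 0.5 * log(capital) + rnorm(100, sd = 0.1))
toy     <- data.frame(value, capital)

fit_level <- lm(log(value) ~ capital,      data = toy)  # shape of the first model
fit_log   <- lm(log(value) ~ log(capital), data = toy)  # shape of the second model

# Change in log(value) per one unit of capital
coef(fit_level)["capital"]
# Elasticity of value with respect to capital
coef(fit_log)["log(capital)"]
```

The two slopes answer different questions and live on different scales, so a large gap between them is expected rather than a sign that the data handling went wrong.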
As a side note, leaving out zero observations is often very problematic. If you do that, make sure you also try alternative ways of dealing with zeros, and see how much your results change. You can get started here: https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros
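As a rough sketch of that kind of sensitivity check, the snippet below compares dropping the zeros with two transformations that are defined at zero, log1p() and asinh(). The data are simulated and the choice of transformations is only an assumption about what might be worth trying, not a recommendation for this particular dataset.

```r
# Simulated regressor with 30 exact zeros out of 100 observations.
set.seed(7)
x <- c(rep(0, 30), runif(70, 1, 100))
y <- exp(1 + 0.3 * log1p(x) + rnorm(100, sd = 0.2))
d <- data.frame(x, y)

fit_drop  <- lm(log(y) ~ log(x),   data = subset(d, x != 0))  # zeros removed
fit_log1p <- lm(log(y) ~ log1p(x), data = d)                  # log(1 + x), keeps zeros
fit_asinh <- lm(log(y) ~ asinh(x), data = d)                  # inverse hyperbolic sine

# Compare how many rows each fit used and the slope it estimated
data.frame(
  model = c("drop zeros", "log1p", "asinh"),
  n     = c(nobs(fit_drop), nobs(fit_log1p), nobs(fit_asinh)),
  slope = c(coef(fit_drop)[2], coef(fit_log1p)[2], coef(fit_asinh)[2])
)
```

If the slope moves a lot between these rows, the conclusions depend on how the zeros were treated.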

Library "TableOne" multiple comparisons. Calculate line by line p-values

I received a comment from a reviewer who wanted to have p-values for each line of specific variable levels in a demographic characteristics table (Table 1). Even though the request appears quite strange (and inexact) to me, I would like to comply with his suggestion.
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
print(tableOne, nonnormal = c("bili","chol","copper","alk.phos","trig"), exact = c("status","stage"), smd = TRUE)
the output:
I need to have the p-values for each level of the variables status, edema and stage, with Bonferroni correction. I went through the documentation without success.
In addition, is it correct to use chi-squared to compare sample sizes across rows?
UPDATE:
I'm not sure if my approach is correct, but I would like to share it with you. For the variable status, I generated a dummy variable for each level, then I calculated the chi-squared test for each dummy against trt.
library(tableone)
## Load data
library(survival); data(pbc)
d <- pbc[,c("status", "trt")]
# Convert dummy variables
d$status.0 <- ifelse(d$status==0, 1,0)
d$status.1 <- ifelse(d$status==1, 1,0)
d$status.2 <- ifelse(d$status==2, 1,0)
t <- rbind(
  chisq.test(d$status.0, d$trt),
  # p-value = 0.7202
  chisq.test(d$status.1, d$trt),
  # p-value = 1
  chisq.test(d$status.2, d$trt)
  # p-value = 0.7818
)
t
BONFERRONI ADJ FOR MULTIPLE COMPARISONS:
p <- t[,"p.value"]
p.adjust(p, method = "bonferroni")
This question was posted some time ago, so I suppose you have already answered the reviewer.
I don't really understand why you would compute adjusted p-values for just three variables. Adjusting p-values depends on the number of comparisons made. If you use p.adjust() with a vector of 3 p-values, the results will not really be "adjusted" for the number of comparisons made (you really did more than a dozen and a half!).
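A short illustration of that point: the Bonferroni correction multiplies each p-value by the length of the vector passed to p.adjust(), capped at 1, so the same raw p-value gets a different adjusted value depending on how many tests are included. The vectors below are hypothetical values invented for the example.

```r
# The same raw p-value (0.004) adjusted within 3 tests and within 8 tests.
p_three <- c(0.004, 0.30, 0.75)
p_eight <- c(0.004, 0.30, 0.75, 0.03, 0.012, 0.2, 0.049, 0.6)

adj_three <- p.adjust(p_three, method = "bonferroni")
adj_eight <- p.adjust(p_eight, method = "bonferroni")

adj_three[1]  # 0.004 * 3
adj_eight[1]  # 0.004 * 8

# Bonferroni by hand: multiply by the number of tests, cap at 1
manual_eight <- pmin(1, p_eight * length(p_eight))
all.equal(adj_eight, manual_eight)
```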
I show how to extract all p-values so you can compute the adjusted ones.
There are three ways to extract the pValues from a tableone object: one through the object's attributes (explained first), and two quick and dirty ways (at the bottom).
To extract them, first I copy your code to create your tableOne:
library(tableone)
## Load data
library(survival); data(pbc)
# drop ID from variable list
vars <- names(pbc)[-1]
## Create Table 1 stratified by trt (can add more stratifying variables)
tableOne <- CreateTableOne(vars = vars, strata = c("trt"), data = pbc, factorVars = c("status","edema","stage"))
You can see what your "tableOne" object has via attributes()
attributes(tableOne)
You can see that a tableOne object usually has one table for continuous and one for categorical variables. You can use attributes() on them too
attributes(tableOne$CatTable)
# you can notice $pValues
Now that you know "where" the pValues are, you can extract them with attr()
attr(tableOne$CatTable, "pValues")
Something similar with numerical variables:
attributes(tableOne$ContTable)
# $pValues are there
attr(tableOne$ContTable, "pValues")
You have pValues for Normal and NonNormal variables.
As you set them before, you can extract both
mypCont <- attr(tableOne$ContTable, "pValues") # put them in an object
nonnormal = c("bili","chol","copper","alk.phos","trig") # copied from your code
mypCont[rownames(mypCont) %in% c(nonnormal), "pNonNormal"] # extract NonNormal
"%!in%" <- Negate("%in%")
mypCont[rownames(mypCont) %!in% c(nonnormal), "pNormal"] # extract Normal
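The attribute lookup used above is plain base R, so it can be tried without the package. The sketch below builds a mock object (`mock_table`, with invented variable names and p-values) that carries a "pValues" attribute, then picks the normal or nonnormal p-value per row and adjusts the combined vector. It only mimics the layout described above and is not the real tableone structure.

```r
# Mock object with a "pValues" attribute; all values are made up.
mock_table <- structure(
  list(dummy = 1),
  pValues = data.frame(pNormal    = c(0.40, 0.02, 0.75),
                       pNonNormal = c(0.35, 0.03, 0.80),
                       row.names  = c("age", "bili", "albumin")),
  class = "mockTable"
)

names(attributes(mock_table))         # lists the attributes, including "pValues"
pvals <- attr(mock_table, "pValues")  # pull out the data frame of p-values

# Use the nonnormal p-value for variables declared nonnormal, else the normal one
nonnormal    <- c("bili")
is_nonnormal <- rownames(pvals) %in% nonnormal
chosen <- ifelse(is_nonnormal, pvals$pNonNormal, pvals$pNormal)
names(chosen) <- rownames(pvals)

adjusted <- p.adjust(chosen, method = "bonferroni")
adjusted
```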
All that said, and your pValues extracted, I think there are two much more convenient quick and dirty ways to accomplish the same:
Quick and dirty way A: using dput() with your printed tableOne. Then search in the console where the pValues are and copy-paste them to the script, to store them in an object
Quick and dirty way B: If you look in tableOne vignette there is an "Exporting" section, you can use print(tableOne, quote = TRUE) and then just copy and paste to a spreadsheet (like LibreOffice, Excel...).
Then I would select the column with pValue, transpose it, and get it back to R, to compute adjusted p values with p.adjust() and copy them back to the spreadsheet for journal submission

Loops and ANOVA

So I have done this analysis in SAS already and am trying to replicate it in R, but I am new to R and know virtually nothing right now. I have tried a bunch of things but seem to get an error everywhere I go. I will try to simplify things because I figure if I can make it work on a small scale I can extrapolate it to a larger scale.
Basically I have this huge data set with subjects that all have a value for a Metabolite. I want to run an ANOVA test on ALL these metabolites, there are 600+ of them. I want to find their P-values and put them all into a nice table with the Metabolite label and the p-value. Here is an example of what the data could look like.
Subject # Treatment Antibiotic Metabolite1 Metabolite2.... Metabolite600
MG_1 MD No 1.257 2.578 5.12
MG_2 MS 1SS 3.59 1.052 1.5201
MG_3 MD1SS No 1.564 1.7489 1.310
etc...
I know I can run:
fit1 <- aov(Metabolite1 ~ TREATMENT * ANTIBIOTIC, data=data1)
to calculate it for just the first Metabolite. I am trying to do a for loop just to try it out. Basically I want to know if I can use the aov function without having to type or copy/paste it and type in 1 to 600 for everything.
In SAS I could write a macro variable and assign it a number so that when I make a name I could simply say Metabolite&i for the y value and fit&i to save the results. Is there any way to do this in R?
I've tried doing Metabolite[i] with a For (i in 1:20) but that doesn't work. Is there any way to actually reference the i in a loop? What is the proper syntax if there is?
Edit: I really don't know how to make this any simpler than it is, my data set is huge, I literally only have about 3 lines of code right now.
library(gdata)
testing = read.xls("~data1", sheet=1)
fit1 <- aov(Metabolite1 ~ TREATMENT * ANTIBIOTIC, data=data1)
summary(fit1)
This is literally all I have. As I mentioned above I tried doing
For (i in 1:20) {
fit[i] <- aov(Metabolite[i] ~ TREATMENT * ANTIBIOTIC, data=data1)
}
which does NOT work. It will just say object Metabolite not found. It totally ignores my reference to the i value. I am just trying to start small at first.
It's difficult to debug the following code without data, but I would try something like the following:
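library(tidyverse)
library(broom)
# Reshape to long format: one row per subject and metabolite
data_nested <- data1 %>%
  gather(key = MetaboliteType, value = Metabolite,
         -Subject, -Treatment, -Antibiotic) %>%
  group_by(MetaboliteType) %>%
  nest()
# Fit the two-way ANOVA on the data of a single metabolite
aov_fun <- function(df) {
  aov(Metabolite ~ Treatment * Antibiotic, data = df)
}
# Fit every metabolite, tidy each ANOVA table, and stack the results
(results <- data_nested %>%
  mutate(fit = map(data, aov_fun), tidy = map(fit, broom::tidy)) %>%
  unnest(tidy))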
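For comparison, here is a base R sketch of the same idea that stays closer to the SAS macro style asked about: loop over the column names and paste each one into the formula. The data frame below is simulated, with only three metabolite columns, and the helper name `get_pvalues` is made up for the example.

```r
# Simulated stand-in for data1 with a balanced 2 x 2 design.
set.seed(123)
n <- 24
data1 <- data.frame(
  Subject     = paste0("MG_", seq_len(n)),
  Treatment   = factor(rep(c("MD", "MS"), each = n / 2)),
  Antibiotic  = factor(rep(c("No", "Yes"), times = n / 2)),
  Metabolite1 = rnorm(n),
  Metabolite2 = rnorm(n),
  Metabolite3 = rnorm(n)
)

# Every column whose name starts with "Metabolite"
metabolites <- grep("^Metabolite", names(data1), value = TRUE)

# Fit one ANOVA for a given response column and return its p-values
get_pvalues <- function(response) {
  f     <- as.formula(paste(response, "~ Treatment * Antibiotic"))
  tab   <- summary(aov(f, data = data1))[[1]]
  terms <- trimws(rownames(tab))   # row names are padded with spaces
  keep  <- terms != "Residuals"
  data.frame(Metabolite = response,
             term       = terms[keep],
             p.value    = tab[["Pr(>F)"]][keep])
}

results <- do.call(rbind, lapply(metabolites, get_pvalues))
results
```

The key step is `as.formula(paste(...))`, which plays the role of `Metabolite&i` in the SAS macro.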

Traversing a Binary Tree to Get The Splitting Conditions - ctree(party), recursive function

I am trying to reproduce the error which I get with my dataset using a general dataset. Please correct me if I am missing something.
After fitting a Classification tree using library(party), I am trying to get the split conditions of the tree on each node. I managed to write a code, which i believed was working fine, until I found a bug. Could anyone help me to solve it?
My code:
require(party)
iris$Petal.Width <- as.factor(iris$Petal.Width)#imp to convert to factorial
(ct <- ctree(Species ~ ., data = iris))
plot(ct)
#print(ct)
a <- ct #convert it to s4 object
t <- a@tree
#recursive function to traverse the tree and get the splitting conditions
recurse_tree <- function(tree,ret_list=list(),sub_list=list()){
  if(!tree$terminal){
    sub_list$assign <-list(tree$psplit$splitpoint,tree$psplit$variableName,class(tree$psplit))
    names(sub_list)[which(names(sub_list)=="assign")] <- paste("node",tree$nodeID,sep="")
    ret_list <- recurse_tree(tree$left, ret_list, sub_list)
    ret_list <- recurse_tree(tree$right, ret_list, sub_list)
  }
  if(tree$terminal){
    ret_list$assign <- c(sub_list, tree$prediction)
    names(ret_list)[which(names(ret_list)=="assign")] <- paste("node",tree$nodeID,sep="")
    return(ret_list)
  }
  return(ret_list)
}
result <- recurse_tree(t) #call to the functions
Now, the result gives me the list of all nodes with their split conditions and predictions (I assumed). But when I check the split conditions for Node5, they do not match:
expected output on Node5 (from printing the tree with print(ct)): {1.1, 1.2, 1.6, 1.7 }
output I get on Node5 from my function: {"1" , "1.3" ,"1.4" ,"1.5" }, which is basically the split condition of Node6 and therefore wrong. This is how I got it:
z <- result[2] #I know node5 is second in the list
z <- unlist(z,recursive = F,use.names = T) #unlist
levels(z[[3]][[1]]) [which((z[[3]][[1]])==0)] #to find levels of corresponding values
I suspect that my function (recurse_tree) always gives me the split conditions of the right terminal node and not the left node. Any help will be appreciated.
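One thing visible in recurse_tree is that the same sub_list is handed to both the left and the right child, so two sibling leaves end up with identical recorded conditions and nothing says which side of the split each one is on. The sketch below shows one way to record the direction while recursing. It runs on a toy nested-list tree whose field names (id, split, left, right, prediction) are hypothetical and do not match party's node structure.

```r
# Toy tree built from nested lists; field names are invented for illustration.
leaf <- function(id, prediction) {
  list(id = id, terminal = TRUE, prediction = prediction)
}
node <- function(id, split, left, right) {
  list(id = id, terminal = FALSE, split = split, left = left, right = right)
}

toy_tree <- node(1, "x in {a, b}",
                 leaf(2, "setosa"),
                 node(3, "y in {c}",
                      leaf(4, "versicolor"),
                      leaf(5, "virginica")))

# Return one entry per leaf: the path of (split, direction) steps that reaches it
collect_paths <- function(tree, path = character(0)) {
  if (tree$terminal) {
    out <- list(list(path = path, prediction = tree$prediction))
    names(out) <- paste0("node", tree$id)
    return(out)
  }
  c(
    collect_paths(tree$left,  c(path, paste(tree$split, "-> left"))),
    collect_paths(tree$right, c(path, paste(tree$split, "-> right")))
  )
}

paths <- collect_paths(toy_tree)
paths$node4$path
paths$node5$path
```

Here node4 and node5 share the same parent split but differ in the recorded direction, which is the information needed to tell which set of levels belongs to which leaf.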

Ideas to re-write looping regression with 'for' loops

I'm having a brain freeze, and hoping one of you can point me in the right direction. My end goal is the output of various regression coefficients (mainly interested in price elasticity), which I achieved via simple multiple regression, using the "by" function.
I am using the "by" function to loop through the regression formula for each iteration of the "State.UPC" variable. Since my data is quite large (~1MM rows), I had to subset my data into groups of 3-4 states (see mystates1...mystates10). I am then performing the regression on those subsets, each time changing my data source in the "datastep3" data frame. And this is where I need your help:
What is the best way to efficiently re-write this with a combination of my existing "by" regression function, and the "for" loops, so I can bypass the step of constantly changing the data frame name in "datastep3" and the "write.csv" steps. Essentially R looping through each "mystates" data subset and doing the regression by the "State.UPC" attributes?
I have tried several combinations with no success. Pardon the amateurish question...still learning R. Here is my code:
data <-read.csv("PriceData.csv")
datastep1 <-subset(data, subset=c(X..Vol>0, Unit.Vol>0))
datastep2 <- transform(datastep1, State.UPC = paste(State,UPC, sep="."))
mystates1 <- c("AL","AR","AZ")
mystates2 <- c("CA","CO","FL")
mystates3 <- c("GA","IA","IL")
mystates4 <- c("IN","KS","KY")
mystates5 <- c("LA","MI","MN")
mystates6 <- c("MO","MS","NC")
mystates7 <- c("NJ","NM","NV")
mystates8 <- c("NY","OH","OK")
mystates9 <- c("SC","TN","TX")
mystates10 <- c("UT","VA","WI","WV")
datastep3 <-subset(datastep2, subset=State %in% mystates10)
datastep4 <-na.omit(datastep3)
PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
lm(log(Unit.Vol)~log(Price) + Distribution+Independence.Day+Labor.Day+Memorial.Day+Thanksgiving+Christmas+New.Years+
Year+Month, data=df))
x <- do.call("rbind",lapply(PEbyItem, coef))
y <-data.frame(x)
write.csv(x, file="mystates10.csv", row.names=TRUE)
Impossible to test this because you do not provide any data, but theoretically you could just combine the various mystatesN into a list and then run lapply(...) on that.
## Not tested...
get.PEbyItem <- function(i) {
  datastep3 <- subset(datastep2, subset = State %in% mystates[[i]])
  datastep4 <- na.omit(datastep3)
  PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
    lm(log(Unit.Vol) ~ log(Price) + Distribution + Independence.Day + Labor.Day +
         Memorial.Day + Thanksgiving + Christmas + New.Years + Year + Month,
       data = df))
  x <- do.call("rbind", lapply(PEbyItem, coef))
  y <- data.frame(x)
  write.csv(x, file = paste(names(mystates[i]), "csv", sep = "."), row.names = TRUE)
}
mystates <- list(ms1=mystates1, ms2=mystates2, ..., ms10=mystates10)
lapply(seq_along(mystates), get.PEbyItem)
There are lots of other things that could be improved but without the dataset it's pointless to try.
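Since the code above could not be tested, here is a small runnable sketch of the same pattern on simulated data. The data frame `toy`, the two state groups, the reduced formula, and the helper `fit_group` are assumptions made for the example, and files are written to a temporary directory rather than the working directory.

```r
# Simulated price data: 4 states x 2 UPCs x 12 weeks.
set.seed(99)
toy <- expand.grid(State = c("AL", "AR", "CA", "CO"),
                   UPC   = c("u1", "u2"),
                   week  = 1:12,
                   stringsAsFactors = FALSE)
toy$Price     <- runif(nrow(toy), 1, 5)
toy$Unit.Vol  <- exp(3 - 1.2 * log(toy$Price) + rnorm(nrow(toy), sd = 0.1))
toy$State.UPC <- paste(toy$State, toy$UPC, sep = ".")

# Named list of state groups; the names become the output file names
mystates <- list(ms1 = c("AL", "AR"), ms2 = c("CA", "CO"))
out_dir  <- tempdir()

# Fit one regression per State.UPC within a group and write the coefficients
fit_group <- function(group_name) {
  d     <- subset(toy, State %in% mystates[[group_name]])
  fits  <- by(d, d$State.UPC, function(df) lm(log(Unit.Vol) ~ log(Price), data = df))
  coefs <- do.call(rbind, lapply(fits, coef))
  write.csv(coefs, file = file.path(out_dir, paste0(group_name, ".csv")),
            row.names = TRUE)
  coefs
}

# Loop over the group names; the result is a named list of coefficient matrices
all_coefs <- lapply(setNames(names(mystates), names(mystates)), fit_group)
all_coefs$ms1
```

Iterating over the names instead of the positions keeps each result labelled with its group, which makes the file names and the returned list line up.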
