How to use a string as a formula in R

I'm trying to do an ANOVA of all of my data frame columns against time_of_day which is a factor. The rest of my columns are all doubles and of equal length.
x = 0
pdf("Time_of_Day.pdf")
for (i in names(data_in)){
if(x > 9){
test <- aov(paste(i, "~ time_of_day"), data = data_in)
}
x = x+1
}
dev.off()
Running this code gives me this error:
Error: $ operator is invalid for atomic vectors
Where is my code calling $? How can I fix this? Sorry, I'm new to r and am quite lost.
My research question is to see if time of day has an effect on brain volume at different ROIs in the brain. Time of day is divided into three categories: morning, afternoon or night.
Edit: SOLVED
Treating the string as a formula allows this to run, although I have been advised not to have this many independent values, as it will inflate the statistical results of the model. I am not removing this in case someone has a similar problem with the aov() call.
x = 0
pdf("Time_of_Day.pdf")
for (i in names(data_in)){
if(x > 9){
test <- aov(as.formula(paste(i, "~ time_of_day")), data = data_in)
}
x = x+1
}
dev.off()
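As a minimal sketch of the same fix on a built-in data set (iris stands in for data_in and Species for time_of_day; both are assumptions for illustration only), reformulate() builds the formula from strings directly, and the fits are kept in a named list instead of being overwritten on each pass:

```r
# Illustrative stand-ins: iris for data_in, Species for time_of_day
responses <- setdiff(names(iris), "Species")

fits <- list()
for (resp in responses) {
  # reformulate() turns character strings into a formula,
  # e.g. Sepal.Length ~ Species
  f <- reformulate("Species", response = resp)
  fits[[resp]] <- aov(f, data = iris)
}

# One ANOVA table per response column
for (resp in names(fits)) {
  cat("Response:", resp, "\n")
  print(summary(fits[[resp]]))
}
```

With many response columns, the resulting p-values would normally need a multiple-comparison correction, for example with p.adjust().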

I guess your problem is that you don't have an ANOVA formula integrated into your aov() function. See the following working example:
data_in <- data.frame(c(1,2,3),c(4,5,6),c(7,8,9))
names(data_in) <- c("first","second","third")
for (i in seq_along(names(data_in))){
test <- aov(data_in$first ~ data_in$second, data = data_in)
print(summary(test))
}
However, it seems that you tried to calculate an ANOVA for each column, whereas you need at least two variables. That is, a nominal scaled condition variable and an interval scaled dependent variable (e.g. gender and weight). So I'm generally wondering if an ANOVA is the correct method for your question. Anyways, in order to answer this question, sample data and a summary of your research question would be needed.

Related

R: How to fit gamlss in a for loop with a variable (character)

I have a tricky problem. I have a dataframe with more than 1000 variables and want to fit each variable to age using the fp smoothing function. I know how to use gamlss() for a specific variable (vari), but it's not practical to repeat this explicitly more than 1000 times. Moreover, I want to plot the fits for all 1000 variables in a single figure. What I did is:
variables <- colnames(data)[7:dim(data)[2]]
for(vari in variables) {
print("ROI is:")
print(vari)
model_fem <- gamlss(vari ~ fp(age), family=GG, data=females)
}
But I got errors:
Error in model.frame.default(formula = vari ~ fp(age), data = females) :
variable lengths differ (found for 'fp(age)')
I think the tricky part comes from fp(). I tried as.formula, but it didn't work. Also, females$vari returns NULL, which is why we get this error.
Do you have any solution for this?
Thank you
Character values are very different from formulas. Formulas contain symbols, and you need to rebuild them properly to make them dynamic. There are lots of different ways to do that, but here's one that uses reformulate() to turn characters into formulas and update() to modify a base formula.
variables <- colnames(data)[7:dim(data)[2]]
form_resp <- ~ fp(age)
for(vari in variables) {
print("ROI is:")
form_model <- update(form_resp, reformulate(".", response=vari))
print(form_model)
model_fem <- gamlss(form_model, family=GG, data=females)
}
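The same update() and reformulate() pattern can be tried without gamlss. The following is only a sketch in base R, assuming mtcars as stand-in data, wt + hp as the fixed right-hand side and lm() in place of gamlss(); none of these come from the question:

```r
# Hypothetical stand-ins: mtcars for the data, wt + hp for the fixed right-hand side
rhs <- ~ wt + hp
responses <- c("mpg", "qsec")

models <- lapply(setNames(responses, responses), function(resp) {
  # reformulate(".", response = resp) gives `resp ~ .`;
  # update() then replaces "." with the right-hand side of rhs
  f <- update(rhs, reformulate(".", response = resp))
  lm(f, data = mtcars)
})

# Inspect the formula each model was fitted with
lapply(models, formula)
```

Because the response is injected as a symbol, the model looks it up in the data argument rather than treating the character value itself as the response.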

How to correctly take out zero observations in panel data in R

I'm running into some problems while running plm regressions in my panel database. Basically, I have to take out a year from my base and also all observations from some variable that are zero. I tried to make a reproducible example using a dataset from AER package.
library(AER)
library(plm)
data("Grunfeld", package = "AER")
View(Grunfeld)
#Here I randomize some observations of the third variable (capital) as zero, to reproduce my dataset
for (i in 1:220) {
x <- rnorm(10,0,1)
if (mean(x) >=0) {
Grunfeld[i,3] <- 0
}
}
View(Grunfeld)
panel <- Grunfeld
#First Method
#This is how I was originally manipulating my data and running my regression
panel <- Grunfeld
dd <-pdata.frame(panel, index = c('firm', 'year'))
dd <- dd[dd$year!=1935, ]
dd <- dd[dd$capital !=0, ]
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
summary(ols_model_2)
#However, I couldn't plot the variables of the datasets in graphs, because they weren't vectors. So I tried another way:
#Second Method
panel <- panel[panel$year!= 1935, ]
panel <- panel[panel$capital != 0,]
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
summary(ols_model)
#But this gave extremely different results for the ols regression!
In my understanding, both approaches should have yielded the same output in the OLS regression. Now I'm afraid my entire analysis is wrong, because I was doing it the first way. Could anyone explain to me what is happening?
Thanks in advance!
You are running two different models. I am not sure why you would expect the results to be the same.
Your first model is:
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
While the second is:
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
As you can see from the summary of the models, both are "Oneway (individual) effect Within Model". In the first one you don't specify the index, since dd is a pdata.frame object. In the second you do specify the index, because panel is a plain data.frame. However, this makes no difference at all.
The difference is using the log of capital or capital without log.
As a side note, leaving out 0 observations is often very problematic. If you do that, make sure you also try alternative ways of dealing with zero, and see how much your results change. You can get started here https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros
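A quick way to check this is to fit the same formula through both routes and compare. This is a sketch only: it assumes the plm package is installed, uses the Grunfeld data shipped with plm rather than the AER copy, skips the random zeroing step, and the names m1, m2 and m3 are illustrative.

```r
library(plm)
data("Grunfeld", package = "plm")

# Same filtered rows for every model
g <- Grunfeld[Grunfeld$year != 1935, ]

# Route 1: convert to a pdata.frame first
m1 <- plm(log(value) ~ log(capital),
          data = pdata.frame(g, index = c("firm", "year")))

# Route 2: plain data.frame plus the index argument
m2 <- plm(log(value) ~ log(capital), data = g, index = c("firm", "year"))

# Different specification: capital without the log
m3 <- plm(log(value) ~ capital, data = g, index = c("firm", "year"))

all.equal(coef(m1), coef(m2))  # the two routes agree
coef(m3)                       # the formula change is what moves the estimate
```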

error with rda test in vegan r package. Variable not being read correctly

I am trying to perform a simple RDA using the vegan package to test the effects of depth, basin and sector on genetic population structure using the following data frame.
datafile.
The "ALL" variable is the genetic population assignment (structure).
In case the link to my data doesn't work well, I'll paste a snippet of my data frame here.
I read in the data this way:
RDAmorph_Oct6 <- read.csv("RDAmorph_Oct6.csv")
My problems are two-fold:
1) I can't seem to get my genetic variable to read correctly. I have tried three things to fix this.
gen=rda(ALL ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Error in eval(specdata, environment(formula), enclos = globalenv()) :
object 'ALL' not found
In addition: There were 12 warnings (use warnings() to see them)
so, I tried things like:
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
so I specified numeric
> RDAmorph_Oct6$ALL = as.numeric(RDAmorph_Oct6$ALL)
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I am really baffled. I've also tried specifying each variable with dataset$variable, but this doesn't work either.
The strange thing is, I can get an rda to work if I look the effects of the environmental variables on a different, composite, variable
MC = RDAmorph_Oct6[,5:6]
H_morph_var=rda(MC ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Note that I did try to just extract the ALL column for the genetic rda above. This didn't work either.
Regardless, this leads to my second problem.
When I try to plot the rda I get a super weird plot. Note the five dots in three places. I have no idea where these come from.
I will have to graph the genetic rda, and I figure I'll come up with the same issue, so I thought I'd ask now.
I've been though several tutorials and tried many iterations of each issue. What I have provided here is I think the best summary. If anyone can give me some clues, I would much appreciate it.
The documentation, ?rda, says that the left-hand side of the formula specifying your model needs to be a data matrix. You can't pass it the name of a variable in the data object as the left-hand side (or at least if this was ever anticipated, doing so exposes bugs in how we parse the formula which is what leads to further errors).
What you want is a data frame containing a variable ALL for the left-hand side of the formula.
This works:
library('vegan')
df <- read.csv('~/Downloads/RDAmorph_Oct6.csv')
ALL <- df[, 'ALL', drop = FALSE]
Notice the drop = FALSE, which stops R from dropping the empty dimension (i.e. converting the single-column data frame to a vector).
Then your original call works:
ord <- rda(ALL ~ Basin + Depth + Sector, data = df, na.action = 'na.exclude')
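The effect of drop = FALSE can be seen in base R alone. The small data frame below is made up purely for illustration and is not the poster's data:

```r
# Small made-up data frame, for illustration only
df <- data.frame(ALL = c(0.1, 0.9, 0.5), Depth = c(10, 20, 30))

as_vector <- df[, "ALL"]                # dimension dropped: plain numeric vector
as_frame  <- df[, "ALL", drop = FALSE]  # still a one-column data frame

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE
dim(as_frame)             # 3 1
```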
The problem is that rda expects a separate df for the first part of the formula (ALL in your code), and does not use the one in the data = argument.
As mentioned above, you can create a new df with the variable needed for analysis, but here's a oneline solution that should also work:
gen <- rda(RDAmorph_Oct6$ALL ~ Depth + Basin + Sector, data = RDAmorph_Oct6, na.action = na.exclude)
This is partly similar to Gavin Simpson's answer. There is also a problem with the categorical vectors in your data frame. You can either use library(data.table) and the rowid function to set the categorical variables to unique integers or, preferably, not use them at all. I also wanted to set the ID vector as site names, but I am too lazy now.
library(data.table)
RDAmorph_Oct6 <- read.csv("C:/........../RDAmorph_Oct6.csv")
#remove NAs before. I like looking at my dataframes before I analyze them.
RDAmorph_Oct6 <- na.omit(RDAmorph_Oct6)
#I removed one duplicate
RDAmorph_Oct6 <- RDAmorph_Oct6[!duplicated(RDAmorph_Oct6$ID),]
#Create vector with only ALL
ALL <- RDAmorph_Oct6$ALL
#Create data frame with only numeric vectors and remove ALL
dfn <- RDAmorph_Oct6[,-c(1,4,11,12)]
#Select all categorical vectors.
dfc <- RDAmorph_Oct6[,c(1,11,12)]
#Give the categorical vectors unique integers doesn't do this for ID (Why?).
dfc2 <- as.data.frame(apply(dfc, 2, function(x) rowid(x)))
#Bind back with numeric data frame
dfnc <- cbind.data.frame(dfn, dfc2)
#Select only what you need
df <- dfnc[c("Depth", "Basin", "Sector")]
#The rest you know
rda.out <- rda(ALL ~ ., data=df, scale=T)
plot(rda.out, scaling = 2, xlim=c(-3,2), ylim=c(-1,1))
#Also plot correlations
plot(cbind.data.frame(ALL, df))
Sector and depth have the highest variation. Almost logical, since there are only three vectors used. The assignment of integers to the categorical vector has probably no meaning at all. The function assigns from top to bottom unique integers to the following unique character string. I am also not really sure which question you want to answer. Based on this you can organize the data frame.

Linear regression for multivariate time series in R

As part of my data analysis, I am using linear regression analysis to check whether I can predict tomorrow's value using today's data.
My data are about 100 time series of company returns. Here is my code so far:
returns <- read.zoo("returns.csv", header=TRUE, sep=",", format="%d-%m-%y")
returns_lag <- lag(returns)
lm_univariate <- lm(returns_lag$companyA ~ returns$companyA)
This works without problems. Now I wish to run a linear regression for each of the 100 companies. Since setting up each linear regression model manually would take too much time, I would like to use some kind of loop (or apply function) to shorten the process.
My approach:
test <- lapply(returns_lag ~ returns, lm)
But this leads to the error "unexpected symbol in "test2" " since the tilde is not being recognized there.
So, basically I want to run a linear regression for every company separately.
The only question that looks similar to what I wanted is Linear regression of time series over multiple columns; however, there the data seems to be stored in a matrix and the code example is quite messy compared to what I was looking for.
Formulas are great when you know the exact names of the variables you want to include in the regression. When you are looping over values, they aren't so great. Here's an example that uses indexing to extract the columns of interest for each iteration.
#sample data
library(zoo)
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1
returns <- zoo(cbind(companya=rnorm(5), companyb=rnorm(5)), x.Date)
returns_lag <- lag(returns)
#loop over columns/companies
xx<-lapply(setNames(1:ncol(returns),names(returns)), function(i) {
today <-returns_lag[,i]
yesterday <-head(returns[,i], -1)
lm(today~yesterday)
})
xx
This will return the results for each column as a list.
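For comparison, the same one-regression-per-column idea can be sketched in base R without zoo. This assumes the returns sit in a plain data frame with one column per company and rows in time order; the data here are simulated and the names tomorrow and today are illustrative:

```r
set.seed(1)
# Simulated returns, one column per company, rows in time order
returns <- data.frame(companya = rnorm(10), companyb = rnorm(10))

fits <- lapply(returns, function(x) {
  tomorrow <- x[-1]          # values from the second observation onwards
  today    <- x[-length(x)]  # the same series shifted back by one step
  lm(tomorrow ~ today)
})

# Intercept and slope for each company, one column per company
sapply(fits, coef)
```

Since lapply() iterates over the columns of a data frame, the resulting list is already named by company.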
Using the dyn package (which loads zoo) we can do this:
library(dyn)
z <- zoo(EuStockMarkets) # test data
lapply(as.list(z), function(z) dyn$lm(z ~ lag(z, -1)))

Dynamic time-series prediction and rollapply

I am trying to get a rolling prediction of a dynamic timeseries in R (and then work out squared errors of the forecast). I based a lot of this code on this StackOverflow question, but I am very new to R so I am struggling quite a bit. Any help would be much appreciated.
require(zoo)
require(dynlm)
set.seed(12345)
#create variables
x<-rnorm(mean=3,sd=2,100)
y<-rep(NA,100)
y[1]<-x[1]
for(i in 2:100) y[i]=1+x[i-1]+0.5*y[i-1]+rnorm(1,0,0.5)
int<-1:100
dummydata<-data.frame(int=int,x=x,y=y)
zoodata<-as.zoo(dummydata)
prediction<-function(series)
{
mod<-dynlm(formula = y ~ L(y) + L(x), data = series) #get model
nextOb<-nrow(series)+1
#make forecast
predicted<-coef(mod)[1]+coef(mod)[2]*zoodata$y[nextOb-1]+coef(mod)[3]*zoodata$x[nextOb-1]
#strip timeseries information
attributes(predicted)<-NULL
return(predicted)
}
rolling<-rollapply(zoodata,width=40,FUN=prediction,by.column=FALSE)
This returns:
20 21 ..... 80
10.18676 10.18676 10.18676
Which has two problems I was not expecting:
Runs from 20->80, not 40->100 as I would expect (as the width is 40)
The forecasts it gives out are constant: 10.18676
What am I doing wrong? And is there an easier way to do the prediction than to write it all out? Thanks!
The main problem with your function is the data argument to dynlm. If you look in ?dynlm you will see that the data argument must be a data.frame or a zoo object. Unfortunately, I just learned that rollapply splits your zoo objects into array objects. This means that dynlm, after noting that your data argument was not of the right form, searched for x and y in your global environment, which of course were defined at the top of your code. The solution is to convert series into a zoo object. There were a couple of other issues with your code, I post a corrected version here:
prediction<-function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
# nextOb <- nrow(series)+1 # This will always be 21. I think you mean:
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# make forecast
# predicted<-coef(mod)[1]+coef(mod)[2]*zoodata$y[nextOb-1]+coef(mod)[3]*zoodata$x[nextOb-1]
# That would work, but there is a very nice function called predict
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
# I'm not sure why you used nextOb-1
attributes(predicted)<-NULL
# I added the square error as well as the prediction.
c(predicted=predicted,square.res=(predicted-zoodata[nextOb,'y'])^2)
}
}
rollapply(zoodata,width=20,FUN=prediction,by.column=F,align='right')
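The point that rollapply hands each window to FUN as a plain matrix rather than a zoo object can be checked with a small sketch. It assumes the zoo package is installed and uses a made-up two-column series; window_is_zoo and window_is_matrix are illustrative names:

```r
library(zoo)

# Made-up two-column series with the default index 1..6
z <- zoo(cbind(x = 1:6, y = 6:1))

# Record what each window looks like when it reaches FUN
window_is_zoo <- logical(0)
window_is_matrix <- logical(0)

res <- rollapply(z, width = 3, by.column = FALSE, FUN = function(w) {
  window_is_zoo    <<- c(window_is_zoo, inherits(w, "zoo"))
  window_is_matrix <<- c(window_is_matrix, is.matrix(w))
  0  # dummy return value; only the recorded classes matter here
})

window_is_zoo     # whether any window arrived as a zoo object
window_is_matrix  # whether the windows arrived as matrices

# Wrapping a matrix with as.zoo() gives a zoo object again,
# which is what the corrected function relies on
inherits(as.zoo(matrix(1:6, ncol = 2)), "zoo")
```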
Your second question, about the numbering of your results, can be controlled by the align argument of rollapply. left would give you 1..61, center (the default) would give you 20..80 and right gets you 40..100.
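The effect of align on the index of the result can be seen on a toy series. This is a minimal sketch assuming the zoo package is installed; the series of 10 values and the width of 4 are arbitrary choices for illustration:

```r
library(zoo)

z <- zoo(1:10)  # index is 1..10

right <- rollapply(z, width = 4, FUN = sum, align = "right")
left  <- rollapply(z, width = 4, FUN = sum, align = "left")

index(right)  # each result is labelled with the last point of its window
index(left)   # each result is labelled with the first point of its window
```

With 10 observations and a width of 4 there are 7 windows, so the right-aligned result is indexed 4 to 10 and the left-aligned one 1 to 7.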
