Using Zelig with ggplot2, graphing simulations and models - r

I am attempting to use ggplot2 to graph some basic simulations and multi-variable regression models but am at a loss.
I am using Zelig 3.5 (as newer versions of Zelig have glitches with simulations).
Based on a blog I found, I tried this:
AppMod1 <- (s1$qi)
AppMod1 <- data.frame(AppMod1$ev)
AppMod1 <- melt(AppMod1, measure=1:86)
AppMod1 <- ggplot(AppMod1, aes(approve, year)) +
geom_point() +
geom_smooth(colour="blue") +
theme_tufte()
AppMod1
This didn't work. I got an error
"Error: measure variables not found in data:NA"
My models are m1, m2, and m3, and my simulations are s1 and s2. I am using the "approval" data set which comes with Zelig.
The models are calculated as follows
data(approval)
m1 <- zelig(approve~avg.price, model="ls", data=approval)
m2 <- zelig(approve~avg.price+sept.oct.2001+iraq.war, model="ls", data=approval)
m3 <- zelig(approve~avg.price+sept.oct.2001+avg.price:sept.oct.2001, model="ls", data=approval)
And the simulations are
x1 <- setx(m2, sept.oct.2001= 1)
s1 <- sim(m2, x=x1)
summary(s1)
x1 <- setx(m2, sept.oct.2001= 0)
s1 <- sim(m2, x=x1)
summary(s1)
oilprice <- min(approval$avg.price):max(approval$avg.price)
x2 <- setx(m2, sept.oct.2001=0, avg.price=oilprice)
s2 <-sim (m2, x=x2)
plot.ci(s2)
oilprice <- min(approval$avg.price):max(approval$avg.price)
x2 <- setx(m2, sept.oct.2001=1, avg.price=oilprice)
s2 <-sim (m2, x=x2)
plot.ci(s2)

It looks like the error resulted from your call to melt.
Note that in the second line of code AppMod1 <- data.frame(AppMod1$ev) you overwrite the assignment you made in your first line of code AppMod1 <- (s1$qi). So after these two lines of code AppMod1 is equal to a data frame with the single column ev.
Now melt tries to melt this data frame and the call to melt indicates that there are 86 columns of measure.vars, when in fact there is only one column in the data frame. That results in the error you described.
I can't quite tell from your code what you're expecting AppMod1 to look like. When I run your code, s1$qi contains only NULL values. At the very least, you'll need AppMod1 to include columns for approve and year in order for your ggplot code to work as written.
Hopefully this is enough information to go on for now. It will be easier to provide additional help if you show what you expect AppMod1 to look like before and after the call to melt.
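As a rough illustration of the reshaping step, here is a minimal base R sketch. It does not use Zelig output: ev is a made-up matrix standing in for simulated expected values, with one column per scenario, and the column names are hypothetical. It uses stack() as a base R stand-in for melt(), and checks the number of columns first, since asking for 86 measure columns fails when the data frame has fewer.

```r
set.seed(1)

# Hypothetical stand-in for simulated expected values:
# 1000 draws (rows) for each of 3 scenarios (columns)
ev <- matrix(rnorm(3000, mean = 50, sd = 2), ncol = 3,
             dimnames = list(NULL, c("price_100", "price_150", "price_200")))
ev_df <- data.frame(ev)

# Check the shape before reshaping
n_scen <- ncol(ev_df)

# Wide to long: one row per draw and scenario
ev_long <- stack(ev_df)
names(ev_long) <- c("value", "scenario")

# Per-scenario mean and 95% interval, ready for plotting
ev_summary <- data.frame(scenario = colnames(ev),
                         mean  = colMeans(ev),
                         lower = apply(ev, 2, quantile, probs = 0.025),
                         upper = apply(ev, 2, quantile, probs = 0.975))
print(ev_summary)
```

A summary table like ev_summary could then be passed to ggplot(), for example with geom_pointrange() and ymin/ymax mapped to lower and upper.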

Related

How to correctly take out zero observations in panel data in R

I'm running into some problems while running plm regressions in my panel database. Basically, I have to take out a year from my base and also all observations from some variable that are zero. I tried to make a reproducible example using a dataset from AER package.
library(AER)
library(plm)
data("Grunfeld", package = "AER")
View(Grunfeld)
#Here I randomize some observations of the third variable (capital) as zero, to reproduce my dataset
for (i in 1:220) {
x <- rnorm(10,0,1)
if (mean(x) >=0) {
Grunfeld[i,3] <- 0
}
}
View(Grunfeld)
panel <- Grunfeld
#First Method
#This is how I was originally manipulating my data and running my regression
panel <- Grunfeld
dd <-pdata.frame(panel, index = c('firm', 'year'))
dd <- dd[dd$year!=1935, ]
dd <- dd[dd$capital !=0, ]
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
summary(ols_model_2)
#However, I couldn't plot the variables of the datasets in graphs, because they weren't vectors. So I tried another way:
#Second Method
panel <- panel[panel$year!= 1935, ]
panel <- panel[panel$capital != 0,]
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
summary(ols_model)
#But this gave extremely different results for the ols regression!
In my understanding, both approaches should have yielded the same output in the OLS regression. Now I'm afraid my entire analysis is wrong, because I was doing it the first way. Could anyone explain what is happening?
Thanks in advance!
You are running two different models. I am not sure why you would expect the results to be the same.
Your first model is:
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
While the second is:
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
As you can see from the summary of the models, both are "Oneway (individual) effect Within Model". In the first one you don't specify the index, since dd is already a pdata.frame object. In the second you do specify the index, because panel is a plain data.frame. However, this makes no difference at all.
The difference is that the second model uses log(capital) as the regressor, while the first uses capital without the log.
As a side note, leaving out 0 observations is often very problematic. If you do that, make sure you also try alternative ways of dealing with zero, and see how much your results change. You can get started here https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros
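To make the point concrete, here is a minimal base R sketch on made-up panel data (the data, the column values and the use of lm() with firm dummies are all assumptions for illustration, not the Grunfeld data or plm itself). For a one-way individual effect, least squares with firm dummies gives the same slope as the within estimator, so it serves as a stand-in here. It shows that the two ways of filtering give the same rows, and that the slope changes only because the regressor changes.

```r
set.seed(42)

# Hypothetical toy panel: 5 firms observed over 10 years
panel <- data.frame(firm = rep(1:5, each = 10),
                    year = rep(1935:1944, times = 5))
panel$capital <- exp(rnorm(50, mean = 4, sd = 1))
panel$value <- exp(1 + 0.5 * log(panel$capital) + 0.1 * panel$firm +
                     rnorm(50, sd = 0.2))
panel$capital[c(3, 17, 30)] <- 0   # plant a few zero observations

# Filtering in two steps or in one step keeps exactly the same rows
a <- panel[panel$year != 1935, ]
a <- a[a$capital != 0, ]
b <- subset(panel, year != 1935 & capital != 0)
same_rows <- isTRUE(all.equal(a, b))

# Same data, different regressor: this is what changes the estimate
fit_level <- lm(log(value) ~ capital + factor(firm), data = b)
fit_log   <- lm(log(value) ~ log(capital) + factor(firm), data = b)

slope_level <- coef(fit_level)[["capital"]]
slope_log   <- coef(fit_log)[["log(capital)"]]
print(c(level = slope_level, log = slope_log))
```

The two slopes are on different scales, so they are not expected to agree even though the sample is identical.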

Exporting output of custom multiple regressions from R to Latex

I am trying to export the results of multiple regressions in a single table. Ideally, it should be formatted similar to stargazer() output.
The problem is that I have not found reliably working R functions for the kind of regressions I need (Fama-MacBeth regressions), so I use my custom regression functions, which produce all necessary output (estimates of coefficients, standard errors, t-stat, R^2).
Does stargazer() or a similar function have parameters that allow me to export the results of multiple regressions to LaTeX in a nice form when the output of my regression is just a data frame?
EDIT: I was just wondering whether it is possible to create publication-style tables, looking like this:
Here's a simple example that might help you forward (example is too long for a comment, so making this an answer):
library(stargazer)
library(broom)
## generate dummy data
set.seed(123)
x <- runif(1000)
z <- x^0.5
y <- x + z + rnorm(1000, sd=.05)
model1 <- lm(y ~ x)
model2 <- lm(y ~ z)
## transform model summaries into dataframes
tidy(model1) -> model1_tidy
tidy(model2) -> model2_tidy
merge(model1_tidy, model2_tidy, by='term', all.x=T, all.y=T) -> output
stargazer(output, type='latex', summary=FALSE)
You will need to figure out the column headers by yourself but I believe you get the idea.
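If you don't want to depend on broom, the same kind of table can be put together in base R. This is only a sketch: coef_table() is a hypothetical helper name, the column labels model_1 and model_2 are made up, and the "estimate (std. error)" cell format is just one possible layout. With custom regression functions you would build the same two columns from your own output instead of summary().

```r
set.seed(123)
x <- runif(200)
z <- sqrt(x)
y <- x + z + rnorm(200, sd = 0.05)

# Hypothetical helper: one row per term, one "estimate (std. error)" cell
coef_table <- function(fit, label) {
  cf <- summary(fit)$coefficients
  out <- data.frame(term = rownames(cf),
                    cell = sprintf("%.3f (%.3f)",
                                   cf[, "Estimate"], cf[, "Std. Error"]),
                    stringsAsFactors = FALSE)
  names(out)[2] <- label
  out
}

# One column per model; terms missing from a model show up as NA
tab <- merge(coef_table(lm(y ~ x), "model_1"),
             coef_table(lm(y ~ z), "model_2"),
             by = "term", all = TRUE)
print(tab)
```

The resulting data frame can then be handed to stargazer(tab, type='latex', summary=FALSE) as in the example above.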

Linear regression for multivariate time series in R

As part of my data analysis, I am using linear regression analysis to check whether I can predict tomorrow's value using today's data.
My data are about 100 time series of company returns. Here is my code so far:
returns <- read.zoo("returns.csv", header=TRUE, sep=",", format="%d-%m-%y")
returns_lag <- lag(returns)
lm_univariate <- lm(returns_lag$companyA ~ returns$companyA)
This works without problems. Now I wish to run a linear regression for each of the 100 companies. Since setting up each linear regression model manually would take too much time, I would like to use some kind of loop (or apply function) to shorten the process.
My approach:
test <- lapply(returns_lag ~ returns, lm)
But this leads to the error "unexpected symbol in "test2" " since the tilde is not being recognized there.
So, basically I want to run a linear regression for every company separately.
The only question that looks similar to what I wanted is "Linear regression of time series over multiple columns"; however, there the data seems to be stored in a matrix and the code example is quite messy compared to what I was looking for.
Formulas are great when you know the exact name of the variables you want to include in the regression. When you are looping over values, they aren't so great. Here's an example that uses indexing to extract the columns of interest for each iteration
library(zoo)
#sample data
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1
returns <- zoo(cbind(companya=rnorm(10), companyb=rnorm(10)), x.Date)
returns_lag <- lag(returns)
#loop over columns/companies
xx<-lapply(setNames(1:ncol(returns),names(returns)), function(i) {
today <-returns_lag[,i]
yesterday <-head(returns[,i], -1)
lm(today~yesterday)
})
xx
This will return the results for each column as a list.
Using the dyn package (which loads zoo) we can do this:
library(dyn)
z <- zoo(EuStockMarkets) # test data
lapply(as.list(z), function(z) dyn$lm(z ~ lag(z, -1)))
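For comparison, here is a rough base R version of the same per-column idea, without zoo or dyn. It assumes the series are regularly spaced with no gaps, so that "yesterday" is simply the previous row. It uses the built-in EuStockMarkets data converted to log returns as test data, and the names today and yesterday are only illustrative.

```r
# Test data: daily closing prices of four indices, turned into log returns
prices  <- as.data.frame(EuStockMarkets)
returns <- as.data.frame(lapply(prices, function(p) diff(log(p))))

# One regression per column: today's return on the previous day's return
fits <- lapply(returns, function(r) {
  d <- data.frame(today = r[-1], yesterday = r[-length(r)])
  lm(today ~ yesterday, data = d)
})

# Collect the slope of each model into a named vector
slopes <- sapply(fits, function(f) coef(f)[["yesterday"]])
print(slopes)
```

Because lapply() runs over the columns of a data frame, the resulting list keeps the column names, so fits$DAX is the model for the DAX series.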

Ideas to re-write looping regression with 'for' loops

I'm having a brain freeze, and hoping one of you can point me in the right direction. My end goal is the output of various regression coefficients (mainly interested in price elasticity), which I achieved via simple multiple regression, using the "by" function.
I am using the "by" function to loop through the regression formula for each iteration of the "State.UPC" variable. Since my data is quite large (~1MM rows), I had to subset my data into groups of 3-4 states (see mystates1...mystates10). I am then performing the regression on those subsets, each time changing my data source in the "datastep3" data frame. And this is where I need your help:
What is the best way to efficiently re-write this with a combination of my existing "by" regression function, and the "for" loops, so I can bypass the step of constantly changing the data frame name in "datastep3" and the "write.csv" steps. Essentially R looping through each "mystates" data subset and doing the regression by the "State.UPC" attributes?
I have tried several combinations with no success. Pardon the amateurish question...still learning R. Here is my code:
data <-read.csv("PriceData.csv")
datastep1 <- subset(data, subset = X..Vol > 0 & Unit.Vol > 0)  # keep rows where both volumes are positive
datastep2 <- transform(datastep1, State.UPC = paste(State,UPC, sep="."))
mystates1 <- c("AL","AR","AZ")
mystates2 <- c("CA","CO","FL")
mystates3 <- c("GA","IA","IL")
mystates4 <- c("IN","KS","KY")
mystates5 <- c("LA","MI","MN")
mystates6 <- c("MO","MS","NC")
mystates7 <- c("NJ","NM","NV")
mystates8 <- c("NY","OH","OK")
mystates9 <- c("SC","TN","TX")
mystates10 <- c("UT","VA","WI","WV")
datastep3 <-subset(datastep2, subset=State %in% mystates10)
datastep4 <-na.omit(datastep3)
PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
lm(log(Unit.Vol)~log(Price) + Distribution+Independence.Day+Labor.Day+Memorial.Day+Thanksgiving+Christmas+New.Years+
Year+Month, data=df))
x <- do.call("rbind",lapply(PEbyItem, coef))
y <-data.frame(x)
write.csv(x, file="mystates10.csv", row.names=TRUE)
Impossible to test this because you do not provide any data, but theoretically you could just combine the various mystatesN into a list and then run lapply(...) on that.
## Not tested...
get.PEbyItem <- function(i) {
datastep3 <-subset(datastep2, subset=State %in% mystates[[i]])
datastep4 <-na.omit(datastep3)
PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
lm(log(Unit.Vol)~log(Price) + Distribution+Independence.Day+Labor.Day+
Memorial.Day+Thanksgiving+Christmas+New.Years+Year+Month,
data=df))
x <- do.call("rbind",lapply(PEbyItem, coef))
y <-data.frame(x)
write.csv(x, file=paste(names(mystates[i]),"csv",sep="."), row.names=TRUE)
}
mystates <- list(ms1=mystates1, ms2=mystates2, ..., ms10=mystates10)
lapply(1:length(mystates),get.PEbyItem)
There are lots of other things that could be improved but without the dataset it's pointless to try.
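Since the code above could not be tested, here is a small self-contained sketch of the same pattern on made-up data. Everything in it is an assumption for illustration: the toy data frame, the reduced formula with only log(Price), and the group names ms1 and ms2. It uses split() plus lapply() in place of by(), and keeps the coefficients in a list instead of writing CSV files.

```r
set.seed(1)

# Hypothetical toy data: 4 states x 2 UPCs, 20 rows per State.UPC
toy <- data.frame(State = rep(c("AL", "AR", "CA", "CO"), each = 40),
                  UPC   = rep(rep(c("u1", "u2"), each = 20), times = 4),
                  stringsAsFactors = FALSE)
toy$Price     <- runif(nrow(toy), 1, 5)
toy$Unit.Vol  <- exp(3 - 1.2 * log(toy$Price) + rnorm(nrow(toy), sd = 0.1))
toy$State.UPC <- paste(toy$State, toy$UPC, sep = ".")

# Named list of state groups, standing in for mystates1, mystates2, ...
state_groups <- list(ms1 = c("AL", "AR"), ms2 = c("CA", "CO"))

# For each group: subset, fit one model per State.UPC, stack the coefficients
coef_by_group <- lapply(state_groups, function(states) {
  d    <- toy[toy$State %in% states, ]
  fits <- lapply(split(d, d$State.UPC), function(df)
    lm(log(Unit.Vol) ~ log(Price), data = df))
  do.call(rbind, lapply(fits, coef))
})

print(coef_by_group$ms1)
```

Each element of coef_by_group is a matrix with one row per State.UPC, so it could be passed to write.csv() with a file name built from names(state_groups) if separate files are still wanted.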

Removing character level outlier in R

I have a linear model1<-lm(divorce_rate~marriage_rate+median_age+population) for which the leverage plot shows an outlier at 28 (State variable id for "Nevada"). I'd like to specify a model without Nevada in the dataset. I tried the following but got stuck.
data<-read.dta("census.dta")
attach(data)
data1<-data.frame(pop,divorce,marriage,popurban,medage,divrate,marrate)
attach(data1)
model1<-lm(divrate~marrate+medage+pop,data=data1)
summary(model1)
layout(matrix(1:4,2,2))
plot(model1)
dfbetaPlots(lm(divrate~marrate+medage+pop),id.n=50)
vif(model1)
dataNV<-data[!data$state == "Nevada",]
attach(dataNV)
model3<-lm(divrate~marrate+medage+pop,data=dataNV)
The last line of the above code gives me
Error in model.frame.default(formula = divrate ~ marrate + medage + pop, :
variable lengths differ (found for 'medage')
I suspect that you have some glitch in your code such that you have attach()ed copies that are still lying around in your environment -- that's why it's really best practice not to use attach(). The following code works for me:
library(foreign)
## best not to call data 'data'
mydata <- read.dta("http://www.stata-press.com/data/r8/census.dta")
I didn't find divrate or marrate in the data set: I'm going to speculate that you want the per capita rates:
## best practice to use a new name rather than transforming 'in place'
mydata2 <- transform(mydata,marrate=marriage/pop,divrate=divorce/pop)
model1 <- lm(divrate~marrate+medage+pop,data=mydata2)
library(car)
plot(model1)
dfbetaPlots(model1)
This works fine for me in a clean session:
dataNV <- subset(mydata2,state != "Nevada")
## update() may be nice to avoid repeating details of the
## model specification (not really necessary in this case)
model3 <- update(model1,data=dataNV)
Or you can use the subset argument:
model4 <- update(model1,subset=(state != "Nevada"))
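If you want to try this without downloading census.dta, the same two approaches can be checked on the built-in state.x77 data. This is only a sketch: the variables Murder, Illiteracy and Income are stand-ins chosen for illustration, not the divorce and marriage rates from the question.

```r
# Built-in state data, with the state name as an ordinary column
states <- data.frame(state = rownames(state.x77), state.x77,
                     stringsAsFactors = FALSE)

model_all <- lm(Murder ~ Illiteracy + Income, data = states)

# Option 1: drop Nevada from the data, then refit
model_a <- update(model_all, data = subset(states, state != "Nevada"))

# Option 2: keep the data and use the subset argument
model_b <- update(model_all, subset = state != "Nevada")

# Both fits use the same 49 rows and give the same coefficients
print(nobs(model_a))
print(all.equal(coef(model_a), coef(model_b)))
```

Nothing is attach()ed here, so every variable is looked up in the data frame passed to lm(), which avoids the mismatched lengths from the question.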
