Using lm() in R in data with many zeroes gives error

I'm new to data analysis, and I have a couple questions about using lm() in R to create a linear regression model of my data.
My data looks like this:
testID userID timeSpentStudying testGrade
12345 007 10 90
09876 008 0 75
And my model:
model <- lm(formula = data$testGrade ~ timeSpentStudying, data = data)
I'm getting the following warning (twice) in RStudio, with just under 60 rows of data:
Warning messages:
1: In sqrt(crit * p * (1 - hh)/hh) : NaNs produced
2: In sqrt(crit * p * (1 - hh)/hh) : NaNs produced
My question is, does the problem have to do with the data containing many instances of zero being the value, such as above under the 'timeSpentStudying' column? If so, how do I handle that? Shouldn't lm() be able to handle values of zero, especially if that would give significance to the data itself?
Thanks!

So far I have been unable to replicate this, e.g.:
dd <- data.frame(y=rnorm(1000),x=c(rep(0,990),1:10))
model <- lm(y~x, data = dd)
summary(model)
Searching the R code base for the code listed in your error and tracing back indicates that the relevant lines are in plot.lm, the function that plots diagnostics, and that the problem is that you are somehow getting a value >1 for the leverage or "hat values" of one of your data points. However, I can't see how you could be achieving that. Data would make this much clearer!
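As a rough way to check the leverage values mentioned above, here is a minimal sketch on simulated data. The data frame `study` and its values are made up for illustration (they are not the asker's data), and the formula refers to columns by name rather than through `data$`. It fits the model and looks at the hat values directly with `hatvalues()`:

```r
# Simulated stand-in for the asker's data (hypothetical values):
# 60 rows, 40 of which have zero study time
set.seed(123)
study <- data.frame(
  timeSpentStudying = c(rep(0, 40), sample(1:20, 20, replace = TRUE))
)
study$testGrade <- 70 + 1.2 * study$timeSpentStudying + rnorm(60, sd = 5)

# Refer to columns by bare name; `data =` tells lm() where to find them
model <- lm(testGrade ~ timeSpentStudying, data = study)

# Leverage ("hat value") of every observation
hh <- hatvalues(model)
range(hh)

# Indices of observations whose leverage is non-finite or greater than 1;
# these would be the suspicious points according to the answer above
suspect <- which(!is.finite(hh) | hh > 1)
suspect
```

If `suspect` comes back empty on the real data as well, the zeros themselves are unlikely to be the cause, and posting the data (or a reproducible subset) would be the next step.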

Related

Fama MacBeth regression pmg function error in R

I've been trying to run a Fama Macbeth regression using the pmg function for my data "Dev_Panel" but I keep getting this error message:
Error in pmg(BooktoMarket ~ Returns + Profitability + BEtoMEpersistence, :
Insufficient number of time periods
I've read in other posts on here that this could be due to NAs in the data. But I've already removed these from the panel.
Additionally, I've used the pmg function on the data frame "Em_Panel" for which I have undertaken the exact same data cleaning measures as for the "Dev_Panel". The regression for this panel worked, but it only produces a coefficient for the intercept. The other coefficients are NA.
Here's the code I used for the Em_Panel:
require(foreign)
require(plm)
require(lmtest)
Em_Panel <- read.csv2("Em_Panel.csv", na="NA")
FMR_Em <- pmg(BooktoMarket~Returns+Profitability+BEtoMEpersistence, Em_Panel, index = c("companyID", "years"))
And here's the code for the Dev_Panel:
Em_Panel <- read.csv2("Dev_Panel.csv", na="NA")
FMR_Dev <- pmg(BooktoMarket~Returns+Profitability+BEtoMEpersistence, Dev_Panel, index = c("companyID", "years"))
Since this seemingly is a problem concerning my data I will gladly provide it:
http://www.filedropper.com/empanel
http://www.filedropper.com/devpanel
Thank you so much for any help!!!
Edit
After switching the arguments as suggested the error is now produced by the Dev_Panel and not the Em_Panel.
Also the regression for the Em_Panel now only provides a coefficient for the intercept. The other coefficients are NA.

R won't run model because it insists the data is not an UnmarkedFrame Occu Subject

I am trying to create a dual-species occupancy model using unmarkedFrameOccuMulti. I've been successful in producing the UMF and have even got a basic plot of the detections, but when I try to run an individual model I get the error message:
Error in occu(~1, ~Vill_Dist, umf) : Data is not an unmarkedFrameOccu object.
I've made sure the csvs have the same number of rows etc. I'm a bit mystified because I can't find much online, and the UMF itself ran perfectly; R just can't seem to separate out the parts of it?
S <- 2 # number of species
M <- 354 #number of sites - i.e. number of sites with actual data (#i.e. not NAs/transects that were taken - some transects were done 14 times, others as little as 2 times)
J <- 9.07 #average number of visits per transect
y <- list(matrix(rbinom(354, 1, 0.456)), #species 1 leopard
matrix(rbinom(354, 1, 0.033))) #species 2 wolf
So the above is code I'm following from the R help on unmarkedFrameOccuMulti. The ordering of the numbers is based on the rbinom function, i.e. wolves were seen at 3.3% of the sites surveyed (probability 0.033).
obscov <- read.csv("grazcov2.csv")
Error message is ObsCovData needs M*obsNum of rows
umf <- unmarkedFrameOccuMulti(y=y, siteCovs = predcovs2, obsCovs = NULL)
predcovs2
summary(umf)
plot(umf)
umf
m1 <- occu(~1, ~Vill_Dist, umf) # this is the line that doesn't work; Vill_Dist is one of the covariates in the csv, spelled exactly the same
I was expecting to produce a model that would predict occurrence of leopards/wolves based on the covariates.
As I was writing this out I had an idea for what might be going wrong. I couldn't get the model to work previously because I was putting in the detection data in csv format rather than using the simple binomial function.
Is it simply that R cannot mix csv/imported data and the binomial data?

Error using esttab in R

I'm trying to use esttab to output regression results in R. However, every time I run it I get an error:
Error in FUN(X[[i]], ...) : variable names are limited to 10000 bytes
Any ideas how to solve it? My code is below:
reg <- lm(y ~ ln_gdp + diffxln_gdp + diff + year, data=df)
eststo(reg)
esttab(store=reg)
The input data comes from approx 25,000 observations. It's all coded as numeric. I can share more information that is deemed relevant but I don't know what that would be right now.
Thanks!

R newbie having issues with lm function

I have the following code to get the famafrench regression of a set of data:
#Regression
ff_reg = lm(e25 ~ rmrf+smb+hml, data=dat);
However, I keep getting the error "invalid type (list) for variable e25".
e25 was defined earlier in the program as a set of data obtained from subtracting 'rf' from a matrix made up of 25 columns:
e25 = (dat[,7:31]) - dat$rf;
(where dat is a CSV file read into R and rf is one of the columns within that file)
Why is this error coming up and how can I resolve it?
On advice, here is the full code that I am running...
dat = read.csv("ff2014.csv", as.is=TRUE);
##excess portfolio returns
e25 = (dat[,7:31]) - dat$rf;
#print(e25);
#Regression
ff_reg = lm(e25 ~ rmrf+smb+hml, data=dat);
print(summary(ffreg));
From help("lm"):
If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix.
So, if that's what you intend to do, you need to make your data.frame a matrix before you call lm:
e25 <- as.matrix(e25)
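To illustrate, here is a small sketch with simulated data. The three portfolio columns `p1` to `p3` and all values are hypothetical stand-ins for the 25 columns in the question. Subtracting a vector from a data.frame gives back a data.frame (a list), which is what lm() complains about; after as.matrix() the same call fits one regression per column:

```r
# Simulated stand-in for the CSV (hypothetical values and column names)
set.seed(1)
n <- 50
dat <- data.frame(rf = runif(n, 0, 0.01),
                  rmrf = rnorm(n), smb = rnorm(n), hml = rnorm(n))
ports <- data.frame(p1 = rnorm(n), p2 = rnorm(n), p3 = rnorm(n))

# data.frame minus vector is still a data.frame, i.e. a list
e_df <- ports - dat$rf
is.list(e_df)

# Convert to a numeric matrix so lm() treats it as a multi-column response
e_mat <- as.matrix(e_df)

# One least-squares fit per column of e_mat
ff_reg <- lm(e_mat ~ rmrf + smb + hml, data = dat)

# Coefficients: one row per term, one column per portfolio
coef(ff_reg)
```

The fitted object has class "mlm", and summary(ff_reg) prints a separate summary for each response column.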

Forecasting with `tslm` returning dimension error

I'm having a problem similar to the one the questioners in the two posts below had with the linear model predict function, but I am trying to use the "time series linear model" function from Rob Hyndman's forecast package.
Predict.lm in R fails to recognize newdata
predict.lm with newdata
totalConv <- ts(varData[,43])
metaSearch <- ts(varData[,45])
PPCBrand <- ts(varData[,38])
PPCGeneric <- ts(varData[,34])
PPCLocation <- ts(varData[,35])
brandDisplay <- ts(varData[,29])
standardDisplay <- ts(varData[,3])
TV <- ts(varData[,2])
richMedia <- ts(varData[,46])
df.HA <- data.frame(totalConv, metaSearch,
PPCBrand, PPCGeneric, PPCLocation,
brandDisplay, standardDisplay,
TV, richMedia)
As you can see I've tried to avoid the names issues by creating a data frame of the time series objects.
However, I then fit a tslm object (time series linear model) as follows -
fit1 <- tslm(totalConv ~ metaSearch
+ PPCBrand + PPCGeneric + PPCLocation
+ brandDisplay + standardDisplay
+ TV + richMedia, data = df.HA
)
Despite having created a data frame and named all the objects properly I get the same dimension error as these other users have experienced.
Error in forecast.lm(fit1) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 696 rows
2: 'newdata' had 10 rows but variables found have 696 rows
The model frame seems to give sensible names to all of the variables, so I don't know what is up with the forecast function:
names(model.frame(fit1))
[1] "totalConv" "metaSearch" "PPCBrand" "PPCGeneric" "PPCLocation" "brandDisplay"
[7] "standardDisplay" "TV" "richMedia"
Can anyone suggest any other improvements to my model specification that might help the forecast function to run?
EDIT 1: Ok, just so there's a working example, I've used the data given in Irsal's answer to this question (converting to time series objects) and then fitted the tslm. I get the same error (different dimensions obviously):-
Is there an easy way to revert a forecast back into a time series for plotting?
I'm really confused about what I'm doing wrong, my code looks identical to that used in all of the examples on this....
data <- c(11,53,50,53,57,69,70,65,64,66,66,64,61,65,69,61,67,71,74,71,77,75,85,88,95,
93,96,89,95,98,110,134,127,132,107,94,79,72,68,72,70,66,62,62,60,59,61,67,
74,87,112,134,51,50,38,40,44,54,52,51,48,50,49,49,48,57,52,53,50,50,55,50,
55,60,65,67,75,66,65,65,69,72,93,137,125,110,93,72,61,55,51,52,50,46,46,45,
48,44,45,53,55,65,89,112,38,7,39,35,37,41,51,53,57,52,57,51,52,49,48,48,51,
54,48,50,50,53,56,64,71,74,66,69,71,75,84,93,107,111,112,90,75,62,53,51,52,
51,49,48,49,52,50,50,59,58,69,95,148,49,83,40,40,40,53,57,54,52,56,53,55,
55,51,54,45,49,46,52,49,50,57,58,63,73,66,63,72,72,71,77,105,97,104,85,73,
66,55,52,50,52,48,48,46,48,53,49,58,56,72,84,124,76,4,40,39,36,38,48,55,49,
51,48,46,46,47,44,44,45,43,48,46,45,50,50,56,62,53,62,63)
data2 <- c(rnorm(237))
library(forecast)
nData <- ts(data)
nData2 <- ts(data2)
dat.ts <- tslm(nData~nData2)
forecast(dat.ts)
Error in forecast.lm(dat.ts) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 237 rows
2: 'newdata' had 10 rows but variables found have 237 rows
EDIT 2: Same error even if I combine both series into a data frame.
nData.df <- data.frame(nData, nData2)
dat.ts <- tslm(nData~nData2, data = nData.df)
forecast(dat.ts)
tslm fits a linear regression model. You need to provide the future values of the explanatory variables if you want to forecast. These should be provided via the newdata argument of forecast.lm.
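A minimal sketch of what that could look like, on simulated series. The names `x`, `y`, `train` and `future` are made up for illustration, and the "future" predictor values are random numbers standing in for whatever you actually expect the explanatory variable to be:

```r
library(forecast)

# Simulated response and predictor (hypothetical data)
set.seed(42)
n <- 100
x <- ts(rnorm(n))
y <- ts(50 + 5 * x + rnorm(n))
train <- data.frame(y = y, x = x)

# Fit the time series linear model
fit <- tslm(y ~ x, data = train)

# Future values of the predictor: one row per step to forecast,
# with the same column name as in the model formula
future <- data.frame(x = rnorm(10))

# Pass them through the newdata argument
fc <- forecast(fit, newdata = future)
fc$mean
```

The forecast horizon is then the number of rows in `future`. A model that only uses `trend` and `season` as predictors does not need `newdata`, because tslm can generate those terms for future periods itself.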
