With this data input:
A B C D
0.0513748973337 0.442624990365 0.044669941640565 12023787.0495
-0.047511808790502 0.199057057555 0.067542653775225 6674747.75598
0.250333519823608 0.0400359422093 -0.062361320324768 10836244.44
0.033600922318947 0.118359141703 0.048493523722074 7521473.94034
0.00492552770819 0.0851342003243 0.027123088894137 8742685.39098
0.02053037069955 0.0535545969759 0.06352586720282 8442677.4204
0.09050961131549 0.044871795257 0.049363888991624 7223126.70424
0.082789930841618 0.0230375009412 0.090676778601245 8974611.5623
0.06396481119371 0.0467280364963 0.128097065131764 8167179.81463
and this code:
library(plm);
mydata <- read.csv("reproduce_small.csv", sep = "\t");
plm(C ~ log(D), data = mydata, model = "pooling"); # works
plm(A ~ log(B), data = mydata, model = "pooling"); # error
the second plm call returns the following error:
Error in Math.factor(B) : ‘log’ not meaningful for factors
reproduce_small.csv contains the ten lines of data pasted above. Obviously, B is not a factor; it is clearly a numeric vector. This means that plm thinks it is a factor. The questions are "why?" and, more importantly, "how do I fix this?"
Things I've tried:
#1) mydata$B.log <- log(mydata$B) results in
Error in model.frame.default(formula = y ~ X - 1, drop.unused.levels = TRUE) :
variable lengths differ (found for 'X')
which is in itself weird, since A and B.log clearly have the same length.
#2) plm(A ~ log(D), data = mydata, model = "pooling"); results in the same error as #1.
#3) plm(C ~ log(B), data = mydata, model = "pooling"); results in the same original error (log not meaningful for factors).
#4) plm(A ~ log(B + 1), data = mydata, model = "pooling"); results in
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(B, 1) : ‘+’ not meaningful for factors
#5) plm(A ~ as.numeric(as.character(log(B))), data = mydata, model = "pooling"); results in the same original error (log not meaningful for factors).
EDIT: As suggested, I'm including the result of str(mydata):
> str(mydata)
'data.frame': 9 obs. of 4 variables:
$ A: num 0.05137 -0.04751 0.25033 0.0336 0.00493 ...
$ B: num 0.4426 0.1991 0.04 0.1184 0.0851 ...
$ C: num 0.0447 0.0675 -0.0624 0.0485 0.0271 ...
$ D: num 12023787 6674748 10836244 7521474 8742685 ...
Also trying mydata <- read.csv("reproduce_small.csv", sep = "\t", stringsAsFactors = FALSE); didn't work.
Helix123 pointed out in the comments that the data.frame should be converted to a pdata.frame. So, for instance, a solution to this toy example would be:
mydata$E <- c("x", "x", "x", "x", "x", "y", "y", "y", "y"); # Create E as an "index"
mydata <- pdata.frame(mydata, index = "E"); # convert to pdata.frame
plm(A ~ log(B), data = mydata, model = "pooling"); # now it works!
EDIT:
As to "why" this happens, as Helix123 pointed out in comments, is that, when passed a data.frame instead of a pdata.frame, plm quietly assumes that the first two columns are indexes, and converts them to factor under the hood. Then plm will throw an unhelpful error, instead of launching a warning that the object passed is not of the correct type, or that it made an assumption at all.
Related
I am quite new to R and new in this forum. I have been trying to use a gam model to model phytoplankton species count data against environmental predictors, but I am stuck with an error.
My code is the following:
library(mgcv)     # for gam() and s()
library(caret)    # for createDataPartition()
library(magrittr) # for the %>% pipe
file <- read.csv("sg1.csv", header = TRUE, sep = ";", dec = ".", check.names = FALSE, na.strings = c("", "NA")) # my dataset contains empty cells that I substitute with NA
data.selected <- file[, c(5, 6, 14:19)] # I select only the columns I am interested in
data.no_na <- na.omit(data.selected)
colnames(data.no_na) <- c("T", "S", "P", "Si", "DIN", "DIN_P", "Si_DIN", "Diato")
set.seed(123)
training.samples <- data.no_na$Diato %>% createDataPartition(p = 0.8, list = FALSE) # to use Diatom as outcome variable
train.data <- data.no_na[training.samples, ]
test.data <- data.no_na[-training.samples, ]
model <- gam(Diato ~ s(T) + s(S) + s(P) + s(Si), data = train.data)
When I run the code, I get this error: Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In mean.default(xx) : argument is not numeric or logical: returning NA
2: In Ops.factor(xx, shift[i]) : ‘-’ not meaningful for factors
I noticed that this happens only when I include the T variable in the formula, and when I inspect the values with data.no_na$T I get the list of values with '3011 Levels: 1.25321 10 10.001 10.0043 10.0094 10.025 10.0304 ... S' at the end.
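For reference, a diagnostic sketch (an assumption: the 'Levels:' output means T was read in as a factor, likely because of a stray non-numeric entry such as the trailing "S"):
str(data.no_na$T) # shows 'Factor w/ 3011 levels ...' instead of 'num'
bad <- is.na(suppressWarnings(as.numeric(as.character(data.no_na$T))))
data.no_na[bad, ] # the offending rows, e.g. the one containing the stray "S"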
Can someone help me understand what is going on and what I am doing wrong? Thank you in advance!
Please let me know if you need any further information.
I fitted a model using the lmer() function (it works well). I have 11 explanatory variables. Three of them, if present in the model, cause the step() function (from package lmerTest) to return the error "Variables length differ (found on "...")", where "..." is the formula call.
I don't have any NA values in the data: there are 600 rows and all three of the problematic variables (H, I, J) are factors.
My code is:
library(purrr) ## for rdunif()
library(lmerTest)
data2 = as.data.frame(matrix(c(rdunif(600*7, 1, 5),
                               rdunif(600*3, 0, 1),
                               rdunif(600, 1, 9),
                               rep(c("a", "b"), 300)),
                             nrow = 600, byrow = FALSE))
names(data2) = c("A","B","C","D", "E","F","G","H","I","J","Z","M")
data2[,7:10] = lapply(data2[,7:10],factor)
data2[,c(1:6,11)] = lapply(data2[,c(1:6,11)],as.numeric)
mod1 = lmer(Z ~ A+B+C+D+E+F+G+
#H+
#I+
#J+
(1|M),data2)
step.mod1 = lmerTest::step(mod1) #it works
#
mod2 = lmer(Z ~ A+B+C+D+E+F+G+H+
#I+
#J+
(1|M),data2)
step.mod2 = lmerTest::step(mod2) #it does not work and returns: Variables length differ (found on "A+B+C+D+E+F+G+")
mod3 = lmer(Z ~ A+B+C+D+E+F+G+H+I+J+
(1|M),data2)
step.mod3 = lmerTest::step(mod3) #it does not work and returns: Variables length differ (found on "A+B+C+D+E+F+G+H+I+")
I know that this error is common when there are NAs, but what is the error in this case? How can I fix it?
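For what it's worth, a diagnostic sketch to rule out the usual NA explanation (these checks are an assumption about where to look, not a fix):
sapply(model.frame(mod2), length) # every variable in the model frame should have length 600
anyNA(data2)                      # FALSE, so plain NAs are not the culprit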
After splitting the dataset via SplitUplift, the two resulting sets, training and validating, as well as split.data1 itself, are lists.
When I then call the DualUplift function, the result is an error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I tried to convert the SplitUplift output into a data frame using:
split.data1<- SplitUplift(data1, 0.5, group = c("train","visit"))
str(split.data1)
split.data2 <- data.frame(split.data1)
str(split.data2)
which results in
Error in training[, 1:9] : incorrect number of dimensions.
data1 <- read.csv2(*Used dataset*)
library(tools4uplift)
library(dummies)
set.seed(1988)
group = c("train", "visit")
split.data1<- SplitUplift(data1, 0.5, group = c("train", "visit"))
str(split.data1)
split.data2 <- data.frame(split.data1)
str(split.data2)
training <- split.data1[[1]]
str(training)
validating <- split.data1[[2]]
"base.tm" <- DualUplift(training, "train", "visit", predictors = colnames(training[,1:9]))
I expect an outcome for base.tm instead of an error message.
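One thing worth checking first (a diagnostic sketch; the contrasts error typically means a factor among the predictors ends up with fewer than two levels in the training split):
# report the number of levels of each factor column in training;
# any factor with fewer than 2 levels here will trigger the contrasts error
sapply(training, function(col) if (is.factor(col)) nlevels(col) else NA)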
I found other questions regarding this topic, such as this one; however, I keep getting the error message:
Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ
Below is the code I am using:
library(DAAG)
attach(ultrasonic)
g.poly = lm(UR ~ poly(MD, 3), data = ultrasonic)
cv.poly <- cv.lm(ultrasonic, g.poly ,m=3, plotit=TRUE, printit=TRUE, dots=FALSE, seed=29)
Of course, the lengths are the same:
> length(UR)
[1] 214
> length(MD)
[1] 214
Note that in the same script, I perform another linear regression with cross-validation, which works.
library(DAAG)
g.lin = lm(log(UR) ~ MD, data = ultrasonic)
cv.lin <- cv.lm(ultrasonic, g.lin ,m=3, plotit=TRUE, printit=TRUE, dots=FALSE, seed=29)
Any idea why the polynomial regression cross-validation does not work?
EDIT
To get the data:
install.packages('nlsmsn')
library('nlsmsn')
data(Ultrasonic)
# names differ: I am using a copy on my local machine with a lower-case u (ultrasonic) and different column names, but the data are identical.
#UR = y
#MD = x
DAAG::cv.lm obviously does not support everything you can do with lm; e.g., it does not support functions in the formula. You need to take an intermediate step.
mf <- as.data.frame(model.matrix(y ~ poly(x), data = Ultrasonic))
mf$y <- Ultrasonic$y
mf$`(Intercept)` <- NULL
#sanitize names
names(mf) <- make.names(names(mf))
#[1] "poly.x." "y"
g.poly.san <- lm(y ~ ., data = mf)
cv.poly <- cv.lm(mf, g.poly.san, m=3, plotit=TRUE, printit=TRUE, dots=FALSE, seed=29)
#works
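The same intermediate step should carry over to the cubic fit from the question (a sketch under the same assumptions, using poly(x, 3)):
mf3 <- as.data.frame(model.matrix(y ~ poly(x, 3), data = Ultrasonic))
mf3$y <- Ultrasonic$y
mf3$`(Intercept)` <- NULL
names(mf3) <- make.names(names(mf3)) # sanitize the poly(x, 3)1, poly(x, 3)2, poly(x, 3)3 names
g.poly3 <- lm(y ~ ., data = mf3)
cv.poly3 <- cv.lm(mf3, g.poly3, m = 3, plotit = TRUE, printit = TRUE, dots = FALSE, seed = 29)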
I'm having some problems with the predict function when using bayesglm. I've read some posts that say this problem may arise when the out-of-sample data has more levels than the in-sample data, but I'm using the same data for the fit and predict functions. predict works fine with regular glm, but not with bayesglm. Example:
library(arm) # bayesglm() comes from the arm package
control <- y ~ x1 + x2
# this works fine:
glmObject <- glm(control, myData, family = binomial())
predicted1 <- predict.glm(glmObject , myData, type = "response")
# this gives an error:
bayesglmObject <- bayesglm(control, myData, family = binomial())
predicted2 <- predict.bayesglm(bayesglmObject , myData, type = "response")
Error in X[, piv, drop = FALSE] : subscript out of bounds
# Edit... I just discovered this works.
# Should I be concerned about using these results?
# Not sure why it fails when I specify the dataset
predicted3 <- predict(bayesglmObject, type = "response")
Can't figure out how to predict with a bayesglm object. Any ideas? Thanks!
One of the reasons could be the default setting of the parameter "drop.unused.levels" in the bayesglm command. By default, this parameter is set to TRUE, so if there are unused factor levels, they get dropped during model building. The predict function, however, still uses the original data with the unused levels present in the factor variable. This causes a difference in levels between the data used for model building and the data used for prediction (even if it is the same data frame, in your case myData). I have given an example below:
library(arm) # for bayesglm()
n <- 100
x1 <- rnorm (n)
x2 <- as.factor(sample(c(1,2,3),n,replace = TRUE))
# Replacing 3 with 2 makes level 3 unused
x2[x2==3] <- 2
y <- as.factor(sample(c(1,2),n,replace = TRUE))
myData <- data.frame(x1 = x1, x2 = x2, y = y)
control <- y ~ x1 + x2
# this works fine:
glmObject <- glm(control, myData, family = binomial())
predicted1 <- predict.glm(glmObject , myData, type = "response")
# this gives an error - this uses default drop.unused.levels = TRUE
bayesglmObject <- bayesglm(control, myData, family = binomial())
predicted2 <- predict.bayesglm(bayesglmObject , myData, type = "response")
Error in X[, piv, drop = FALSE] : subscript out of bounds
# this works fine - value of drop.unused.levels is set to FALSE
bayesglmObject <- bayesglm(control, myData, family = binomial(),drop.unused.levels = FALSE)
predicted2 <- predict.bayesglm(bayesglmObject , myData, type = "response")
I think a better way would be to use droplevels to drop the unused levels from the data frame beforehand and use it for both model building and prediction.
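For instance, a sketch on the toy data above:
myData <- droplevels(myData) # drop the unused level "3" from x2 before fitting
bayesglmObject <- bayesglm(control, myData, family = binomial())
predicted4 <- predict(bayesglmObject, myData, type = "response") # levels now match, no subscript error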