Error in array, regression loop using "plyr" - r

Good morning,
I´m currently trying to run a truncated regression loop on my dataset. In the following I will give you a reproducible example of my dataframe.
library(plyr)
library(truncreg)
df <- data.frame("grid_id" = rep(c(1,2), 6),
"htcm" = rep(c(160,170,175), 4),
stringsAsFactors = FALSE)
View(df)
Now I tried to run a truncated regression on the variable "htcm" grouped by grid_id to receive only coefficients (intercept such as sigma), which I then stored into a dataframe. This code is written based on the ideas of #hadley
reg <- dlply(df, "grid_id", function(.)
truncreg(htcm ~ 1, data = ., point = 160, direction = "left")
)
regcoef <- ldply(reg, coef)
As this code works for one of my three datasets, I receive error messages for the other two ones. The datasets do not differ in any column but in their absolute length
(length(df1) = 4,000; length(df2) = 100,000; length(df3) = 13,000)
The error message which occurs is
"Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of type vector, was 'NULL'
I do not even know how to reproduce an example where this error code occurs, because this code works totally fine with one of my three datasets.
I already accounted for missing values in both columns.
Does anyone has a guess what I can fix to this code?
Thanks!!
EDIT:
I think I found the origin of error in my code, the problem is most likely about that in a truncated regression model, the standard deviation is calculated which automatically implies more than one observation for any group. As there are also groups with only n = 1 observations included, the standard deviation equals zero which causes my code to detect a vector of length = NULL. How can I drop the groups with less than two observations within the regression code?

Related

Error in `$<-.data.frame`(`*tmp*`, "P_value", value = 9.66218350888067e-05) : replacement has 1 row, data has 0

Can some-one help me with my code, i have a code which is calculating a lot of logistic regression at the same time. i used this code also for a lm model and then it worked quite wel, however i tried to adapt it to a glm model but it does not work anymore.
Output_logistic <- data.frame()
glm_output = glm(test[,1] ~ test_2[,1], family = binomial ('logit'))
Output_2 <- data.frame(R_spuared = summary(glm_output)$r.squared)
Output_2$P_value <- summary(glm_output)$coefficients[2,4]
Output_2$Variabele <- paste(colnames(test))
Output_2$Variabele_1 <- paste(colnames(test_2))
Output_2$N_NA <- length(glm_output$na.action)
Output_2$df <- paste(glm_output$df.residual)
Output_logistic <- rbind(Output_logistic,Output_2)
running this code gives the next error:
Error in $<-.data.frame(*tmp*, "P_value", value = 9.66218350888067e-05) :
replacement has 1 row, data has 0
does anybody know what i have to adapt so that the code will work?
Thanks in advance
Your Output_2 is an empty data.frame (it has no rows) because summary(glm_output)$r.squared does not exist, because glm doesn’t report this value.
If you need the R-squared value you’ll have to calculate it yourself. But to fix the error you can simply change your code to construct the data-frame from the existing data in the summary:
output_2 = data.frame(
P_value = summary(glm_output)$coefficients[2, 4],
Variable = colnames(test),
# … etc.
)

variable lengths differ error when rollapply lm

I am trying to run a rolling window regression on a number of time series but encountered this strange problem. The following codes reproduce my data. I have a data frame containing returns named "rt" and a data frame containing factors named "factors". Then I produce a function to obtain the regression constant variable.
mat<-as.data.frame(matrix(runif(88*6), nrow = 88, ncol = 6))
colnames(mat)<-c("MKT","SMB","HML","AA","BB","CC")
rt<-mat[,c(4,6)]
factors<-mat[,c(1:3)]
coeffstat_alpha<-function(x){
fit<-lm(x~MKT+SMB+HML,data=factors,na.action=na.omit)
nn<-c(t(coeftest(fit)))[1]
return(nn)
}
When I run this function on the whole sample, it works.
apply(rt,2,FUN=coeffstat_alpha)
but when I rollapply the function, I received the error message
rollapply(reg[,1],width=24,FUN=coeffstat_alpha,by=1,align="left")
"Error in model.frame.default(formula = x ~ MKT + SMB + HML, data = factors, :
variable lengths differ (found for 'MKT')"
I have tried to fixed the problem by search online but couldn't find a post with the similar question. Can anyone help? Thanks!
As the error message suggests the length of variables differ meaning you are passing x in the function which is of length 24 (width) whereas using factors matrix which has 88 rows in it. For this to run you need to have equal length of x as well as factor. You can change the function to
library(lmtest)
coeffstat_alpha<-function(x){
fit<-lm(rt[x, 1]~MKT+SMB+HML,data=factors[x, ],na.action=na.omit)
nn<-c(t(coeftest(fit)))[1]
return(nn)
}
and use sapply as :
sapply(1:(nrow(rt)-23), function(x) coeffstat_alpha(x:(x+23)))

Error related to randomisation test within lapply() function in R

I have 30 datasets that are conbined in a data list. I wanted to analyze spatial point pattern by L function along with randomisation test. Codes are following.
The first code works well for a single dataset (data1) but once it is applied to a list of dataset with lapply() function as shown in 2nd code, it gives me a very long error like so,
"Error in Kcross(X, i, j, ...) : No points have mark i = Acoraceae
Error in envelopeEngine(X = X, fun = fun, simul = simrecipe, nsim =
nsim, : Exceeded maximum number of errors"
Can anybody tell me what is wrong with 2nd code?
grp <- factor(data1$species)
window <- ripras(data1$utmX, data1$utmY)
pp.grp <- ppp(data1$utmX, data1$utmY, window=window, marks=grp)
L.grp <- alltypes(pp.grp, Lest, correlation = "Ripley")
LE.grp <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
plot(L.grp)
plot(LE.grp)
L.LE.sp <- lapply(data.list, function(x) {
grp <- factor(x$species)
window <- ripras(x$utmX, x$utmY)
pp.grp <- ppp(x$utmX, x$utmY, window = window, marks = grp)
L.grp <- alltypes(pp.grp, Lest, correlation = "Ripley")
LE.grp <- alltypes(pp.grp, Lcross, envelope = TRUE)
result <- list(L.grp=L.grp, LE.grp=LE.grp)
return(result)
})
plot(L.LE.sp$LE.grp[1])
This question is about the R package spatstat.
It would help if you could add a minimal working example including data which demonstrate this problem.
If that is not available, please generate the error on your computer, then type traceback() and capture the output and post it here. This will trace the location of the error.
Without this information, my best guess is the following:
The error message says No points have mark i=Acoraceae. That means that the code is expecting a point pattern to include points of type Acoraceae but found that there were none. This can happen because in alltypes(... envelope=TRUE) the code generates random point patterns according to complete spatial randomness. In the simulated patterns, the number of points of type Acoraceae (say) will be random according to a Poisson distribution with a mean equal to the number of points of type Acoraceae in the observed data. If the number of Acoraceae in the actual data is small then there is a reasonable chance that the simulated pattern will contain no Acoraceae at all. This is probably what is causing the error message No points have mark i=Acoraceae.
If this interpretation is correct then you should be able to suppress the error by including the argument fix.marks=TRUE, that is,
alltypes(pp.grp, Lcross, envelope=TRUE, fix.marks=TRUE, nsim=99)
I'm not suggesting this is necessarily appropriate for your application, but this should remove the error message if my guess is correct.
In the latest development version of spatstat, available on github, the code for envelope has been tweaked to detect this error.

Errors while performing caret tuning in R

I am building a predictive model with caret/R and I am running into the following problems:
When trying to execute the training/tuning, I get this error:
Error in if (tmps < .Machine$double.eps^0.5) 0 else tmpm/tmps :
missing value where TRUE/FALSE needed
After some research it appears that this error occurs when there missing values in the data, which is not the case in this example (I confirmed that the data set has no NAs). However, I also read somewhere that the missing values may be introduced during the re-sampling routine in caret, which I suspect is what's happening.
In an attempt to solve problem 1, I tried "pre-processing" the data during the re-sampling in caret by removing zero-variance and near-zero-variance predictors, and automatically inputting missing values using a carets knn automatic imputing method preProcess(c('zv','nzv','knnImpute')), , but now I get the following error:
Error: Matrices or data frames are required for preprocessing
Needless to say I checked and confirmed that the input data set are indeed matrices, so I dont understand why I get this second error.
The code follows:
x.train <- predict(dummyVars(class ~ ., data = train.transformed),train.transformed)
y.train <- as.matrix(select(train.transformed,class))
vbmp.grid <- expand.grid(estimateTheta = c(TRUE,FALSE))
adaptive_trctrl <- trainControl(method = 'adaptive_cv',
number = 10,
repeats = 3,
search = 'random',
adaptive = list(min = 5, alpha = 0.05,
method = "gls", complete = TRUE),
allowParallel = TRUE)
fit.vbmp.01 <- train(
x = (x.train),
y = (y.train),
method = 'vbmpRadial',
trControl = adaptive_trctrl,
preProcess(c('zv','nzv','knnImpute')),
tuneGrid = vbmp.grid)
The only difference between the code for problem (1) and (2) is that in (1), the pre-processing line in the train statement is commented out.
In summary,
-There are no missing values in the data
-Both x.train and y.train are definitely matrices
-I tried using a standard 'repeatedcv' method in instead of 'adaptive_cv' in trainControl with the same exact outcome
-Forgot to mention that the outcome class has 3 levels
Anyone has any suggestions as to what may be going wrong?
As always, thanks in advance
reyemarr
I had the same problem with my data, after some digging i found that I had some Inf (infinite) values in one of the columns.
After taking them out (df <- df %>% filter(!is.infinite(variable))) the computation ran without error.

Path Analysis with plspm

I am attempting to build a Partial Least Squares Path Model using 'plspm'. After reading through the tutorial and formatting my data I am getting hung up on an error:
"Error in if (w.dif < tol || itermax == iter) break : missing value where TRUE/FALSE needed".
I assume that this error is the result of missing values for some of the latent variables (e.g. Soil_Displaced) has a lot of NAs because this variable was only measured in a subset of the replicates in the experiment. Is there a way to get around this error and work with variables with a lot of missing values. I am attaching my code and dateset here and the dataset can also be found in this dropbox file; https://www.dropbox.com/sh/51x08p4yf5qlbp5/-al2pwdCol
this is my code for now:
# inner model matrix
warming = c(0,0,0,0,0,0)
Treatment=c(0,0,0,0,0,0)
Soil_Displaced = c(1,1,0,0,0,0)
Mass_Lost_10mm = c(1,1,0,0,0,0)
Mass_Lost_01mm = c(1,1,0,0,0,0)
Daily_CO2 = c(1,1,0,1,0,0)
Path_inner = rbind(warming, Treatment, Soil_Displaced, Mass_Lost_10mm, Mass_Lost_01mm,Daily_CO2 )
innerplot(Path_inner)
#develop the outter model
Path_outter = list (3, 4:5, 6, 7, 8, 9)
# modes
#designates the model as a reflective model
Path_modes = rep("A", 6)
# Run it plspm(Data, inner matrix, outer list, modes)
Path_pls = plspm(data.2011, Path_inner, Path_outter, Path_modes)
Any input on this issue would be helpful. Thanks!
plspm does work limited with missing values, you have to set the scaling to numeric.
for your example the code looks as follows:
example_scaling = list(c("NUM"),
c("NUM", "NUM"),
c("NUM"),
c("NUM"),
c("NUM"),
c("NUM"))
Path_pls = plspm(data.2011, Path_inner, Path_outter, Path_modes, scaling = example_scaling)
But heres the limitation:
However if your dataset contains one observation where all indicators of a latent variable are missing values, this won't work.
First Case: F.e. the latent variable "Treatment" has 2 indicators, if one of them is NA, it works fine.
Second Case: But if there is just one observation where both indicators are NA, it won't work.
Since youre measuring the other 5 latent variables with just one indicator and you say your data contains lots of missing values, the second one will likely be the case.
PLSPM will not work with missing values therefore I had to interpolate some of the missing values from known observations. When this was done the code above worked great!.

Resources