Error: negative length vectors are not allowed - r

I have a relatively big data frame in R called df, about 2.9 GB in size, with dimensions
3701578 rows by 94 columns. I am trying to run the following command with the pls package to perform a principal component regression (PCR):
set.seed(1)
y_cols = tail(colnames(df), 1) # select the last column as the dependent variable
x_cols = colnames(df)[-c(1, 2, 93, 94)] # PCA applied only to columns 3 to 92; their components become the regressors for the PCR
formula = as.formula(
  paste0("`", y_cols, "`", " ~ ", paste(paste0("`", x_cols, "`"), collapse = " + "))
) # to ease writing out the formula
model <- pcr(formula=formula, data=df[df$date<19801231,], scale=FALSE, center=FALSE)
I get the following error:
Error in array(0, dim = c(npred, nresp, ncomp)): negative length vectors are not allowed
Traceback:
1. pcr(formula = formula, data = df[df$date < 19801231, ], scale = FALSE,
. center = FALSE)
2. eval(cl, parent.frame())
3. eval(cl, parent.frame())
4. pls::mvr(formula = formula, data = df[df$date < 19801231, ],
. scale = FALSE, center = FALSE, method = "svdpc")
5. fitFunc(X, Y, ncomp, Y.add = Y.add, center = center, ...)
6. array(0, dim = c(npred, nresp, ncomp))
Subsetting the data frame as in the call gives a smaller data frame of 751024 rows × 94 columns. At first I thought (based on similar cases I found online) that this could be due to a memory limit, but I have around 1000 GB of RAM available, so that is definitely not the case. Funnily enough, I have no problem if I run the same command on the entire data frame df. Creating a new object, e.g. new <- df[df$date < 19801231, ], and then running the code does not help either. I did manage to get it running by setting the missing data (relatively few values) to zero in new. Yet if I keep the missing data, the pcr command runs smoothly on the entire (bigger) df. Does anybody have an idea about this behavior?
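One pattern consistent with this behavior (a sketch with toy data, not a confirmed diagnosis): pcr() typically drops incomplete rows via the default na.action = na.omit, so if the date subset happens to concentrate the missing values and leaves few or zero complete cases, the internal dimensions passed to array() can become degenerate. Checking complete cases in the subset is a cheap first diagnostic:

```r
# Toy data standing in for df (hypothetical): all pre-1980 rows are incomplete.
df_toy <- data.frame(date = c(19750101, 19750615, 19900101),
                     x1   = c(NA, NA, 1.2),
                     y    = c(2.0, 3.1, 4.5))
sub <- df_toy[df_toy$date < 19801231, ]
colSums(is.na(sub))        # NAs per column in the subset
sum(complete.cases(sub))   # 0 -> nothing left for pcr() after na.omit
```

If the full df keeps enough complete rows but the subset does not, that would explain why the same call succeeds on the whole data frame and fails on the slice.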


How to input matrix data into brms formula?

I am trying to input matrix data into the brm() function to run a signal regression. brm() is from the brms package, which provides an interface for fitting Bayesian models using Stan. Signal regression is when you model one covariate using another within the bigger model, using the by parameter like this: model <- brm(response ~ s(matrix1, by = matrix2) + ..., data = Data). The problem is that I cannot input my matrices through the 'data' parameter, because it only accepts a single data.frame object.
Here are my code and the errors I obtained from trying to get around that constraint...
First off, my reproducible code leading up to the model-building:
library(brms)
#100 rows, 4 columns. Each cell contains a number between 1 and 10
Data <- data.frame(runif(100,1,10),runif(100,1,10),runif(100,1,10),runif(100,1,10))
#Assign names to the columns
names(Data) <- c("d0_10","d0_100","d0_1000","d0_10000")
Data$Density <- as.matrix(Data)%*%c(-1,10,5,1)
#the coefficients we are modelling
d <- c(-1,10,5,1)
#Made a matrix with 4 columns with values 10, 100, 1000, 10000 which are evaluation points. Rows are repeats of the same column numbers
Bins <- 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T)
Bins
As mentioned above, since 'data' only allows one data.frame object to be inputted, I've tried other ways of inputting my matrix data. These methods include:
1) making the matrix within the brm() function using as.matrix()
signalregression.brms <- brm(Density ~ s(Bins,by=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])),data = Data)
#Error in is(sexpr, "try-error") :
argument "sexpr" is missing, with no default
2) making the matrix outside the formula, storing it in a variable, then calling that variable inside the brm() function
Donuts <- as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])
signalregression.brms <- brm(Density ~ s(Bins,by=Donuts),data = Data)
#Error: The following variables can neither be found in 'data' nor in 'data2':
'Bins', 'Donuts'
3) inputting a list containing the matrix using the 'data2' parameter
signalregression.brms <- brm(Density ~ s(Bins,by=donuts),data = Data,data2=list(Bins = 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T),donuts=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])))
#Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
None of the above worked; each had its own errors, and troubleshooting was difficult because I couldn't find similar answers or examples online in the context of brms.
I was able to use the above techniques just fine with gam() from the mgcv package - you don't have to define a data.frame through 'data', you can refer to variables defined outside the gam() formula, and you can build matrices inside the gam() call itself. See below:
library(mgcv)
signalregression2 <- gam(Data$Density ~ s(Bins,by = as.matrix(Data[,c("d0_10","d0_100","d0_1000","d0_10000")]),k=3))
#Works!
It seems like brms is less flexible... :(
My question: does anyone have any suggestions on how to make my brm() function run?
Thank you very much!
My understanding of signal regression is limited enough that I'm not convinced this is correct, but I think it's at least a step in the right direction. The problem seems to be that brm() expects everything in its formula to be a column in data. So we can get the model to compile by ensuring all the things we want are present in data:
library(tidyverse)
signalregression.brms = brm(
  Density ~ s(cbind(d0_10_bin, d0_100_bin, d0_1000_bin, d0_10000_bin),
              by = cbind(d0_10, d0_100, d0_1000, d0_10000),
              k = 3),
  data = Data %>%
    mutate(d0_10_bin = 10,
           d0_100_bin = 100,
           d0_1000_bin = 1000,
           d0_10000_bin = 10000))
Writing out each column by hand is a little annoying; I'm sure there are more general solutions.
For reference, here are my installed package versions:
map_chr(unname(unlist(pacman::p_depends(brms)[c("Depends", "Imports")])), ~ paste(., ": ", pacman::p_version(.), sep = ""))
[1] "Rcpp: 1.0.6" "methods: 4.0.3" "rstan: 2.21.2" "ggplot2: 3.3.3"
[5] "loo: 2.4.1" "Matrix: 1.2.18" "mgcv: 1.8.33" "rstantools: 2.1.1"
[9] "bayesplot: 1.8.0" "shinystan: 2.5.0" "projpred: 2.0.2" "bridgesampling: 1.1.2"
[13] "glue: 1.4.2" "future: 1.21.0" "matrixStats: 0.58.0" "nleqslv: 3.3.2"
[17] "nlme: 3.1.149" "coda: 0.19.4" "abind: 1.4.5" "stats: 4.0.3"
[21] "utils: 4.0.3" "parallel: 4.0.3" "grDevices: 4.0.3" "backports: 1.2.1"

I am facing this problem in R, AOV function

This is my code:
> av = aov(r ~ tf);av
r = matrix with numerical data
tf= factor data
This is the error:
Error in model.frame.default(formula = r ~ tf, drop.unused.levels = TRUE) :
variable lengths differ (found for 'tf')
What could be wrong? I am very new to this, and I have checked my previous steps and everything seems right. Please let me know if you need any additional information.
The number of rows of the matrix must equal the length of the vector 'tf'; otherwise you get the variable-lengths error. The code below works because 'r' has 10 rows and 'tf' has length 10:
r <- matrix(rnorm(5 * 10), 10, 5)
tf <- factor(sample(letters[1:3], 10, replace = TRUE))
aov(r ~ tf)
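For contrast, a sketch of the mismatched case (length 9 against 10 rows) that reproduces the question's error:

```r
r <- matrix(rnorm(5 * 10), 10, 5)
tf_bad <- factor(sample(letters[1:3], 9, replace = TRUE))  # length 9, not 10
# model.frame() rejects the mismatch: "variable lengths differ (found for 'tf_bad')"
res <- try(aov(r ~ tf_bad), silent = TRUE)
inherits(res, "try-error")   # TRUE
```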

"Input datasets must be dataframes" error in kamila package in R

I have a mixed-type data set, one continuous variable and eight categorical variables, so I wanted to try kamila clustering. It gives me an error when I use one continuous variable, but it works when I use two continuous variables.
library(kamila)
data <- read.csv("mixed.csv",header=FALSE,sep=";")
conInd <- 9
conVars <- data[,conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[,c(1,2,3,4,5,6,7,8)]
catVarsFac[] <- lapply(catVarsFac, factor)
kamRes <- kamila(conVars, catVarsFac, numClust=5, numInit=10,calcNumClust = "ps",numPredStrCvRun = 10, predStrThresh = 0.5)
Error in kamila(conVar = conVar[testInd, ], catFactor =
catFactor[testInd, : Input datasets must be dataframes
I think the problem is that the function assumes that you have at least two of both data types (i.e. >= 2 continuous variables, and >= 2 categorical variables). It looks like you supplied a single column index (conInd = 9, just column 9), so you have only one continuous variable in your data. Try adding another continuous variable to your continuous data.
I had the same problem (with categoricals) and this approach fixed it for me.
I think the ultimate source of the error in the program is at around line 170 of the source code. Here's the relevant snippet...
numObs <- nrow(conVar)
numInTest <- floor(numObs/2)
for (cvRun in 1:numPredStrCvRun) {
for (ithNcInd in 1:length(numClust)) {
testInd <- sample(numObs, size = numInTest, replace = FALSE)
testClust <- kamila(conVar = conVar[testInd,],
catFactor = catFactor[testInd, ],
numClust = numClust[ithNcInd],
numInit = numInit, conWeights = conWeights,
catWeights = catWeights, maxIter = maxIter,
conInitMethod = conInitMethod, catBw = catBw,
verbose = FALSE)
When the code partitions your data into a training set, it selects rows from a one-column data.frame; with a single column, [ subsetting drops the result to a vector by default (drop = TRUE). So you end up with "not a data.frame" even though you did supply a data.frame, and that's where the error comes from.
If you can't dig up another variable to add to your data, you could edit the code such that the calls to kamila in the cvRun for loop wrap the data.frame() function around any subsetted conVar or catFactor, e.g.
testClust <- kamila(conVar = data.frame(conVar[testInd,]),
catFactor = data.frame(catFactor[testInd,], ... )
and just save that as your own version of the function called say, my_kamila, which you could use instead.
Hope this helps.
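The drop-to-vector behavior described above is easy to demonstrate; drop = FALSE is the standard way to keep the data.frame shape:

```r
conVar <- data.frame(v1 = rnorm(6))       # a single continuous column
testInd <- c(1, 3, 5)
class(conVar[testInd, ])                  # "numeric" -- no longer a data.frame
class(conVar[testInd, , drop = FALSE])    # "data.frame" -- shape preserved
```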

How to interpolate values in a whole database with approx function?

I have recorded values from different treatments at different moments, and I did a linear interpolation of these points using the approx() function, obtaining predicted values. So far I have done this for only one repetition belonging to one treatment; now I want to do it for the whole database. The approach I chose was to create a new column (see "polyname" below) combining treatment and block, and then apply approx() with polyname as the criterion for splitting the database, but I could not figure out how to perform this (see the output error below). Any help will be really appreciated.
Here is the script and the link with the database.
http://www.filedropper.com/dataexample
names(dataexample)
# To summarize the categorical variables
str(dataexample)
# Transform
dataexample$x<-as.numeric(as.character(dataexample$x))
str(dataexample)
# Create a new column (polyname) combining treatment and block, separated by ","
dataexample$polyname <- paste(dataexample$treat, dataexample$block, sep=",")
#Split the database and run approx function with the new column polyname
model1<-lapply (split(dataexample, dataexample$polyname), approx(x, y, method="linear", xout=7:148, yleft=0, yright=0, rule = 1, f = 0, ties = mean))
model1
Output Error:
> #Split the database and run approx function
> model1<-lapply (split(dataexample, dataexample$polyname), approx(x, y, method="linear", xout=7:148, yleft=0, yright=0, rule = 1, f = 0, ties = mean))
Error in xy.coords(x, y) : object 'y' not found
> model1
Error: object 'model1' not found
Thanks in advance.
Regards.
Matías.
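One likely issue in the snippet above is that lapply() is being handed the result of a single approx() call rather than a function to apply. A hedged sketch of the intended pattern, with toy data standing in for dataexample (column names assumed from the question):

```r
# Toy stand-in for 'dataexample'; treat/block/x/y mirror the question.
dataexample <- data.frame(treat = rep(c("A", "B"), each = 5),
                          block = 1,
                          x = rep(c(7, 42, 77, 112, 147), 2),
                          y = rnorm(10))
dataexample$polyname <- paste(dataexample$treat, dataexample$block, sep = ",")
# lapply() needs a function; wrap approx() so each split piece supplies x and y.
model1 <- lapply(split(dataexample, dataexample$polyname),
                 function(d) approx(d$x, d$y, method = "linear", xout = 7:148,
                                    yleft = 0, yright = 0, ties = mean))
length(model1)   # one interpolation result per treatment/block combination
```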

Error in multiple regression: number of items to replace is not a multiple of replacement length

I am trying to split my data into training and test data using code obtained from my professor, but I am getting errors. I thought it was because of the data's format, but I went back and hard-coded it and nothing works. The data is in matrix form right now, and I believe the code is meant to assess how accurate the logistic regression is.
A = matrix(
c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,1,1,1,0,1,0,0,0,0,1),
nrow = 10,
ncol = 6,
byrow = FALSE)
n<-row(A);
K<-ncol(A)-1;
x<-matrix(0,n,K);
for(i in 1:K){x[,i]<-A[,i];}
# A[,i] is 10 long and x[,i] is 1 long.
A[,i:length(x[,i])]=x[,i]
y<-A[,K+1];
#training/test data split:
idx<-sample(1:n,floor(n/2),replace=FALSE);
xtr<-x[idx,]; ytr<-y[idx];
xts<-x[-idx,]; yts<-y[-idx];
#fit the logistic model to it
myglm<-glmnet(xtr,ytr,family = "binomial");
#Error in if (is.null(np) | (np[2] <= 1)) stop("x should be a matrix with 2 or more columns") : argument is of length zero
#apply traning data to test data
mypred<-predict(myglm,newx=xts,type="response",s=0.01);
posteriprob<-mypred[,,1];
yhat<-matrix(1,nrow(xts),1);
for(i in 1:nrow(xts))
{
yhat[i]<-which.max(posteriprob[i,]);
}
acc<-sum(yhat+2==yts)/nrow(xts);
cat("accuracy of test data:", acc, "\n");
The first forloop gives me this error:
Error in x[, i] <- A[, i]:
Number of items to replace is not a multiple of replacement length
When I run the logistic model using xtr/ytr I get error in if (is.null(np) | (np[2] <= 1)) stop("x should be a matrix with 2 or more columns"):
argument is of length zero
For the first error, it was a typo: change n <- row(A) to n <- nrow(A) and it should work. After that, though, A[, i:length(x[, i])] = x[, i] produces another error, since A is 10 x 6 while length(x[, i]) is 10. You probably intended something different here from what is currently coded.
For the second error, xtr should have a size of at least n x 2. Also, your data must be appropriate for a binomial glm: observations should be either 1 or 0.
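A hedged sketch of the corrected setup (synthetic predictors in place of the original A, and the glmnet call itself omitted so the sketch stays self-contained): nrow() replaces row(), and the predictor matrix is built by plain column subsetting, so the stray A[, i:length(x[, i])] assignment disappears.

```r
set.seed(1)
A <- cbind(matrix(rnorm(50), 10, 5),          # 5 numeric predictors
           c(1, 1, 1, 0, 1, 0, 0, 0, 0, 1))   # 0/1 response in last column
n <- nrow(A)    # row(A) returns a 10 x 6 matrix of row indices, not a count
K <- ncol(A) - 1
x <- A[, 1:K]   # take the whole predictor block at once; no per-column loop
y <- A[, K + 1]
idx <- sample(1:n, floor(n / 2))
xtr <- x[idx, , drop = FALSE]; ytr <- y[idx]
dim(xtr)        # 5 x 5 -- a real matrix, so glmnet's column check passes
```

With xtr shaped n/2 x 5 and ytr a 0/1 vector, glmnet(xtr, ytr, family = "binomial") should no longer hit the "x should be a matrix with 2 or more columns" check.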
