I'm trying to estimate a model using speedglm in R. The dataset is large (~69.88 million rows and 38 columns). Multiplying the number of rows and columns results in ~2.7 billion, which exceeds R's integer limit (2,147,483,647). I can't provide the data, but the following examples recreate the issue.
library(speedglm)
# large example that works
require(biglm)
n <- 500000
k <- 500
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
working.example <- speedglm(fo, data = da, family = Gamma(log))
# repeat with large enough size to break
k <- 5000 # 10 times larger than above
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
failed.example <- speedglm(fo, data = da, family = Gamma(log))
# attempting to resolve error with chunksize
attempted.fixed.example <- speedglm(fo, data = da, family = Gamma(log), chunksize = 10^6)
This causes an error and integer overflow warning.
Error in if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n, :
missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(X) * ncol(X) : NAs produced by integer overflow
I understand the warning, but I do not understand the error. They seem to be related in this case as they appear together after each attempt.
Removing columns allows the estimation to complete. It does not seem to matter which columns are removed; removing interacted or non-interacted variables will both result in a completed estimation. The chunksize option was added after receiving the error initially, but has not helped.
My questions are: (1) what causes the first error? (2) is there a way to estimate models using data such that the number of rows by the number of columns is larger than the integer limit? (3) is there a better na.action to use in this case?
Thanks,
JP.
Running: R version 3.3.3 (2017-03-06)
Actual code below:
dft_var <- c("cltvV0", "cltvV60", "cltvV120", "VCFLBRQ", "ageV0",
"ageV1", "ageV8", "ageV80", "FICOV300", "FICOV650",
"FICOV900", "SingleHouse", "Apt", "Mobile", "Duplex",
"Row", "Modular", "Rural", "FirstTimeBuyer",
"FirstTimeBuyerMissing", "brwtotinMissing", "IncomeRatio",
"VintageBefore2001", "NFLD", "yoy.fcpwti:province_n")
logit1 <- speedglm(formula = paste("DefaultFlag ~ ",
paste(dft_var, collapse = "+"),
sep = ""),
family = binomial(logit),
na.action = na.exclude,
data = default.data,
chunksize = 1*10^7)
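As a side note on the formula construction used in both snippets above: the same formula can be built with reformulate() from the stats package instead of pasting strings. This is only a small sketch; it reuses a few of the variable names from the question and does not touch the overflow issue itself.

```r
# A few of the term labels from the question, including the interaction term
terms_subset <- c("cltvV0", "ageV0", "yoy.fcpwti:province_n")

# reformulate() returns a formula object rather than a character string
fo_subset <- reformulate(terms_subset, response = "DefaultFlag")
print(fo_subset)

# all.vars() lists the variables the formula refers to
all.vars(fo_subset)
```

The result is a proper formula object, so it can be passed anywhere a formula is expected.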
Update:
Based on my investigation below, @James figured out that the problem can be avoided by providing a non-NULL value for the sparse parameter in the call to speedglm, as this prevents the internal call to the is.sparse function.
Using the example above, the following should now work:
speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
My original answer:
Both the warning and the error come from the same line in the function is.sparse in the package speedglm.
The line is:
sample(X,round((nrow(X)*ncol(X)*camp),digits=0),replace=FALSE)
The warning happens because of the use of nrow(X)*ncol(X) for a large matrix. The nrow and ncol functions return integer values, and their product can overflow. Here is an illustration.
nr = 1000000L
nc = 1000000L
nr*nc
# [1] NA
# Warning message:
# In nr * nc : NAs produced by integer overflow
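For comparison, here is a minimal sketch of how the same product can be computed without overflowing, by converting one operand to double first. The helper name safe_cell_count is made up for this illustration and is not part of speedglm.

```r
nr <- 1000000L
nc <- 1000000L

# Largest value an R integer can hold
.Machine$integer.max

# Converting one operand to double makes the whole product a double
cells <- as.numeric(nr) * nc
cells

# Hypothetical helper: cell count of a matrix, computed in double precision
safe_cell_count <- function(X) as.numeric(nrow(X)) * ncol(X)
small <- matrix(0, nrow = 4, ncol = 3)
safe_cell_count(small)

# Row and column counts from the question: the product exceeds the integer limit
as.numeric(69880000) * 38 > .Machine$integer.max
```

Doubles represent whole numbers exactly up to 2^53, so a cell count of this size is not a problem once it is no longer stored as an integer.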
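The error happens because sample is then called on a large X with size = NA: the condition it tests (size <= n/2 in the error message from the question) evaluates to NA, and if cannot branch on NA. Here is an illustration: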
sample(matrix(1,3000,1000000), NA, replace=FALSE)
# Error in if (useHash) .Internal(sample2(n, size)) else .Internal(sample(n, :
# missing value where TRUE/FALSE needed
Thanks to @Andrey's guidance I was able to solve the problem. The issue was the sample function in the is.sparse check. To bypass this I set sparse=FALSE in the options for speedglm (this should work for sparse=TRUE as well, though I haven't tried). This is because speedglm calls is.sparse via speedglm.wfit in the following way:
if (is.null(sparse))
sparse <- is.sparse(x = x, sparsellim, camp)
So setting sparse avoids the is.sparse function.
Using the example above, the following should now work:
speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
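To make the mechanism concrete, here is a toy sketch of the same guard pattern. Both function names are hypothetical stand-ins, not speedglm code; the stand-in for the detection step simply fails so that it is obvious when it runs.

```r
# Hypothetical stand-in for the auto-detection step that fails on huge inputs
detect_sparse <- function(x) {
  stop("auto-detection was called")
}

# Hypothetical stand-in for the fitting function: detection only runs
# when the caller leaves 'sparse' at its NULL default
fit_stub <- function(x, sparse = NULL) {
  if (is.null(sparse)) {
    sparse <- detect_sparse(x)
  }
  sparse
}

# Supplying a value skips detect_sparse() entirely
fit_stub(1:3, sparse = FALSE)

# Leaving it NULL triggers the failing detection step
res <- try(fit_stub(1:3), silent = TRUE)
inherits(res, "try-error")
```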
Here is my code:
data <-data.frame(matrix(0,nrow = 9,ncol = 2))
data[,1] <- c(0,15,41,81,146,211,438,958,1733)
data[,2] <-c(0.000000,5.7013061,13.2662515,26.0874534,42.2710547,55.6050052,75.597474,112.6755999,109.45890071)
rownames(data) <- c("E0_TAP","E3_TAP","E4_TAP","E5_TAP","E6_TAP","E7_TAP","E8_TAP","E10_TAP","E12_TAP")
colnames(data) <- c('S','v')
This is the light saturation curve of photosystem II in Chlamydomonas reinhardtii. I would like to find the best fit for my curve using the Michaelis-Menten model. I tried the drm() command like this:
model.drm <- drm (v ~ cluster(S), data = data, fct = MM.2())
When I run this code the calculation of the fitting starts, but it's interrupted by an error that I do not really comprehend:
Error in parse(text = paste(paste(rep("c(", nrep - 1), collapse = ""), :
<text>:2:39: unexpected ')'
1: mu[(1+( 1 * (i - 1))),] %*%
2: mu[( 2 + ( 1 * (i - 1))),drop=FALSE,])
^
In addition: Warning message:
In cbind(mu[, 2:(nclass - 1)], 1) - mu[, seq(nclass - 1)] :
longer object length is not a multiple of shorter object length
Timing stopped at: 0 0 0
Although I will keep trying to solve the problem myself, I would really appreciate it if someone could help me fix it more quickly or find an alternative way to perform the analysis.
Thanks in advance!
Thanks to the help of a friend here follows the answer:
data <-data.frame(matrix(0,nrow = 9,ncol = 2))
data[,1] <- c(0,15,41,81,146,211,438,958,1733)
data[,2] <-c(0.000000,5.7013061,13.2662515,26.0874534,42.2710547,55.6050052,75.597474,112.6755999,109.45890071)
rownames(data) <- c("E0_TAP","E3_TAP","E4_TAP","E5_TAP","E6_TAP","E7_TAP","E8_TAP","E10_TAP","E12_TAP")
colnames(data) <- c('S','v')
data <- t(data) # transpose
data1 <- cbind(data,data) #duplicate
data1 <- cbind(data1,data1) # quadruplicate
data <- as.data.frame(t(data1)) #transpose
model.drm <- drm (v ~ cluster(S), data = data, fct = MM.2()) #fitting analysis
S <- data[,1]
v <- data[,2]
mml <- data.frame(S = seq(0, max(S)+9000, length.out = 200))
mml$v <- predict(model.drm, newdata = mml)
s <- mml[,1]
v <- mml[,2]
plot(s,v)
lines(s,v,lty=2,col="red",lwd=3)
coeff <- as.data.frame(coef(summary(model.drm)))
The issue comes from the dataset itself. To bypass the error, the data had to be replicated n times (quadruplicated here). I assume it would be even better to have more real replicates of the experiment instead of cloning the same measurements.
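The replication step on its own can also be written without the transpose/cbind round trip, by repeating the row indices. This is only a sketch of that one step; the names lsc, n_copies and lsc_rep are made up here, and the drm() call is left out.

```r
# Same measurements as in the question
lsc <- data.frame(
  S = c(0, 15, 41, 81, 146, 211, 438, 958, 1733),
  v = c(0, 5.7013061, 13.2662515, 26.0874534, 42.2710547,
        55.6050052, 75.597474, 112.6755999, 109.45890071)
)

n_copies <- 4  # same factor as the quadruplication in the answer

# Repeating the row indices stacks n_copies copies of the data frame
lsc_rep <- lsc[rep(seq_len(nrow(lsc)), times = n_copies), ]

nrow(lsc_rep)
table(lsc_rep$S)
```

Row names are made unique automatically when rows are repeated this way.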
Please leave a comment!
I am attempting to fit a hierarchical negative binomial model with bayesm. Though my data is proprietary, I was able to recreate the same error with the margarine dataset. The error I get is as follows:
> look <- rhierNegbinRw(Data = list(regdata = dat1), Mcmc = list(R = 1000,
nprint = 100))
Z not specified - using a column of ones instead
Error in alpha <= 0 :
comparison (4) is possible only for atomic and list types
I set up the mock data as follows (the regression is completely nonsensical; I'm just trying to get the thing to work):
data(margarine)
chpr <- margarine$choicePrice
chpr$hhid <- as.factor(chpr$hhid)
N <- nlevels(chpr$hhid)
dat1 <- vector(mode = "list", length = N)
for (i in 1:N) {
dat1[[i]]$y <- chpr[chpr$hhid==levels(chpr$hhid)[i], "PPk_Stk"]
dat1[[i]]$X <- model.matrix( ~ choice + PBB_Stk,
data = chpr[chpr$hhid == levels(chpr$hhid)[i], ])
}
I would greatly appreciate any insight into this issue.
I am trying to compute prediction intervals for my neural network created with the neuralnet package.
I use R in Tableau Software, by creating .RData files that contain my functions and loading them in Tableau.
It's a simple NN, with one hidden layer containing 5 nodes. I searched and found this package: nnetpredint
So I tried to use it, using their examples.
I tried also to change the way I use it (train/test in same data frame, separated data frames with the same columns names etc.)
And the best result I had was the prediction, but without the lowerBound and upperBound columns.
In fact, I got exactly the same result as when I use compute(myNN, etc.), but I don't have the second and third columns.
Thanks for your help,
EDIT:
My data comes from Tableau, and my function takes five parameters:
ValuesToExplain,train1,train2,test1,test2.
Then I create and train my NN with the first three and try to compute predictions for the last two.
(test1 = k*train1 and test2 = k2*train2 for now, but this will probably change in the future.)
Here is my whole code :
NNetwork <- function(objectiveValues, knownValues1, knownValues2, newData, newData2){
numberOfColumn = 3
##Create the training dataframe
training <- data.frame(objectiveValues, knownValues1,knownValues2)
training[which(is.na(training[,"objectiveValues"])),"objectiveValues"]<- mean(training[,"objectiveValues"], na.rm = TRUE)
training[which(is.na(training[,"knownValues1"])),"knownValues1"]<- mean(training[,"knownValues1"], na.rm = TRUE)
training[which(is.na(training[,"knownValues2"])),"knownValues2"]<- mean(training[,"knownValues2"], na.rm = TRUE)
## Create the testing dataframe
testing <- data.frame(objectiveValues,newData,newData2)
names(testing) <- c("objectiveValues", "knownValues1", "knownValues2")
testing[which(is.na(testing[,"objectiveValues"])),"objectiveValues"]<- mean(testing[,"objectiveValues"], na.rm = TRUE)
testing[which(is.na(testing[,"knownValues1"])),"knownValues1"]<- mean(testing[,"knownValues1"], na.rm = TRUE)
testing[which(is.na(testing[,"knownValues2"])),"knownValues2"]<- mean(testing[,"knownValues2"], na.rm = TRUE)
## Scaling
maxs <- apply(training, 2, max)
mins <- apply(training, 2, min)
trainingScaled <- as.data.frame(scale(training, center = mins, scale = maxs - mins))
testingScaled <- as.data.frame(scale(testing, center = mins, scale = maxs - mins))
### NeuralNetwork Part
library(neuralnet)
n <- names(trainingScaled)
f <- as.formula(paste("objectiveValues ~", paste(n[!n %in% "objectiveValues"], collapse = " + ")))
# Training NN
nn <- neuralnet(f, data=trainingScaled,hidden=5,linear.output=TRUE)
# Using NN
computedTrainingScaled <- compute(nn,trainingScaled[,2:numberOfColumn])
computedFromNNScaled <- compute(nn,testingScaled[,2:numberOfColumn])
# UnScaling
computedTraining <- computedTrainingScaled$net.result*(max(training$objectiveValues)-min(training$objectiveValues))+min(training$objectiveValues)
computedFromNN <- computedFromNNScaled$net.result*(max(training$objectiveValues)-min(training$objectiveValues))+min(training$objectiveValues)
RSquare = (1-( (sum((training$objectiveValues - computedTraining)^2))/(sum((training$objectiveValues - mean(training$objectiveValues))^2)) ))*100
RSE = sum((training$objectiveValues - computedTraining)^2)/nrow(training)
res <- (1:nrow(training))
library(nnetpredint) # Getting prediction confidence interval
x <- trainingScaled[,2:numberOfColumn]
y <- trainingScaled[1]
newData <- testingScaled[,2:numberOfColumn]
# S3 generic method: Object of nn
yPredInt <- nnetPredInt(nn, x, y, newData)
for(i in 1:nrow(training)){
res[i] <- paste(computedFromNN[i],RSquare,RSE, sep="#")
}
return(res)
}
save(NNetwork, file = "NNetwork.RData")
Here, I removed the part using the nnetpredint package because it was not working, but it was like this:
library(nnetpredint)
y <- trainingScaled
x <- trainingScaled[,2:3]
newData <- testingScaled[,2:3]
yPredInt <- nnetPredInt(nn, x, y, newData)
My problem is that when I try to access yPredInt$lowerBound or yPredInt$upperBound, they don't exist.
Normally when topic modeling I use something along the lines of:
matrix <- create_matrix(cbind(as.vector(lda_data)), language="english", removeNumbers=TRUE, weighting=weightTf)
k <- 20 #Hardcoded temp value
lda <- LDA(matrix, k, method = "Gibbs", control = list(iter = 1000, burnin = 1000))
Terms <- terms(lda, 20)
But with a mid sized data set (3.2M rows) I get the following error calculating the matrix:
Warning message:
In nr * nc : NAs produced by integer overflow
Error in as.matrix(textColumns) :
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
Is there a different library/approach that avoids this error? (The code works fine on small data sets)
Alternatively, when using a TermDocumentMatrix as the matrix for the LDA, my resulting Terms are entirely numerical. Is there a way to get strings (words) instead?
I've used an alternate approach to creating the matrix which works on the large data set:
dtm <- DocumentTermMatrix(donation_message,
control = list(stemming = TRUE, stopwords = TRUE,
removeNumbers = TRUE, removePunctuation = TRUE))
dtm <- removeSparseTerms(dtm, 0.99)
rowTotals <- apply(dtm , 1, sum) #Find the sum of words in each Document
dtm <- dtm[rowTotals> 0, ] #Remove all docs without words
k <- 20 #Hardcoded temp value
lda <- LDA(dtm, k, method = "Gibbs", control = list(iter = 1000, burnin = 1000)) #seed = 1000, thin = 100
Terms <- terms(lda, 20)
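One caveat about the row-total step above: on a very large matrix, apply() may first convert the sparse matrix to a dense one, which is the kind of conversion that overflowed originally. Below is a toy base-R sketch of computing row totals directly from a triplet (row index, column index, value) representation. All names and values are made up for illustration. I believe the slam package provides row_sums() for this purpose on real document-term matrices, but treat that as an assumption to check.

```r
# Toy sparse matrix in triplet form: entry k has value v[k] at (i[k], j[k])
i <- c(1L, 1L, 3L)   # row (document) indices
j <- c(1L, 2L, 2L)   # column (term) indices
v <- c(2, 1, 4)      # counts
n_docs <- 4L         # documents 2 and 4 contain no terms

# Sum the stored values per row without building the dense matrix
row_totals <- numeric(n_docs)
sums <- tapply(v, i, sum)
row_totals[as.integer(names(sums))] <- sums

row_totals
keep <- row_totals > 0   # documents that contain at least one term
keep
```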
I want to use the ICC::ICCbare function within a loop. However, the ICCbare uses the concrete variable names as input, e.g.:
ICCbare(x = group, y = variable1, data = dat)
whereby both "group" and "variable1" are columns of the data.frame "dat" (i.e., dat$variable1); ICCbare cannot be used with y = dat[, i].
In order to program a loop I therefore need to evaluate some R code within the function call of ICCbare. My idea was the following:
for(i in 1:10){
ICCbare(group, names(dat)[i], data = dat)
}
However, this does not work. The following error is printed:
Error in `[.data.frame`(data, yc) : undefined columns selected
Is there a way to evaluate the statement names(dat)[i] first before it is passed to the function call?
Here is a minimum working example for my problem:
# Create data set
dat <- data.frame(group=c(rep("A",5),
rep("B",5)),
variable1=1:10,
variable2=rnorm(10))
# Loop
for (i in names(dat)[2:3]){
ICCbare("group", i, data = dat)
}
I agree with @agstudy. This is a bad example of non-standard evaluation. You can use this as a workaround:
v <- "variable1"
ICCbare("group", v, data = dat)
#Error in `[.data.frame`(data, yc) : undefined columns selected
eval(bquote(ICCbare("group", .(v), data = dat)))
#$ICC
#[1] 0.8275862
It is a bug in ICCbare, which tries to handle its arguments as names in a bad manner.
function (x, y, data)
{
ICCcall <- Call <- match.call()
xc <- as.character(ICCcall[[2L]]) ## this is ugly!
yc <- as.character(ICCcall[[3L]])
inds <- unique(data[xc])[[1]]
tdata <- data.frame(data[yc], data[xc])
Personally I would remove the first lines and just assume that the arguments are column names.
ICCbare_simple <-
function (xc, yc, data)
{
## remove lines before this one
inds <- unique(data[xc])[[1]]
## the rest of the code
.....
}
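The behaviour described in the two answers above can be reproduced with a toy function that captures its argument the same way, via match.call() and as.character(). The function name toy_capture is made up for this sketch and it is not the actual ICCbare code.

```r
# Toy function: returns the *unevaluated* y argument as character,
# mimicking the match.call()/as.character() lines quoted above
toy_capture <- function(x, y, data) {
  cl <- match.call()
  as.character(cl[[3L]])
}

v <- "variable1"

# The variable name is captured, not its value
toy_capture(group, v, data = NULL)

# An expression is split into its pieces, which are not column names
toy_capture(group, names(dat)[2], data = NULL)

# bquote() substitutes the value of v into the call before it is evaluated
eval(bquote(toy_capture(group, .(v), data = NULL)))
```

The first call yields "v" rather than "variable1", which is why the column lookup fails, and the bquote() version yields the intended column name.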
I'm the maintainer of ICC and I want to thank you for the excellent discussion. I know this is a very late reply, but I just updated the package and the new version (v2.3.0) should fix the "ugly" code and the problem encountered by the OP. See examples in this gist.
I just wanted to post this here in case anyone was searching with a similar problem. Thanks again, sorry for the delay.
Here is the content of the gist:
ICC non-standard evaluation examples
The ICC package for R calculates the intraclass correlation coefficient (ICC) from a one-way analysis of variance. Recently, the package was updated to better execute R's non-standard evaluation within each function (version 2.3.0 and higher). The package functions should now be able to handle a range of possible scenarios for calling the functions in, what I hope, is a less grotesque and more standard way of writing R functions. To demonstrate, below are some of those scenarios. Note, the examples use the ICCbare function, but the way in which the function arguments are supplied will apply to all of the functions in ICC.
First, load the package (and make sure the version is >= 2.3.0)
library(ICC)
packageVersion("ICC")
Columns of a data.frame
Here we supply the column names and the data.frame that contains the data to calculate the ICC. We will use the ChickWeight data frame.
data(ChickWeight)
ICCbare(x = Chick, y = weight, data = ChickWeight)
#$ICC
#[1] 0.1077609
Iterating through columns of a data.frame
In this case, we might have a data.frame in which we want to estimate the ICC for a number of different types of measurements that each has the same grouping or factor variable (e.g., x). The extreme of this might be in a simulation or bootstrapping scenario or even with some fancy high-throughput phenotyping/data collection. The point being, we want to automate the calculation of the ICC for each column.
First, we will simulate our own dataset with 3 traits to use in the example:
set.seed(101)
n <- 15 # number of individuals/groups/categories/factors
k <- 3 # number of measures per 'n'
va <- 1 # variance among
icc <- 0.6 # expected ICC
vw <- (va * (1 - icc)) / icc # solve for variance within
simdf <- data.frame(ind = rep(LETTERS[1:n], each = k),
t1 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
t2 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
t3 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)))
Two ways to run through the columns come to mind: iteratively pass the name of each column or iteratively pass the column index. I will demonstrate both below. I do these in for loops so it is easier to see, but an easy extension would be to vectorise this by using something from the apply family of functions. First, passing the name:
for(i in names(simdf)[-1]){
cat(i, ":")
tmp.icc <- ICCbare(x = ind, y = i, data = simdf)
cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
or even like this:
for(i in 1:3){
cat(paste0("t", i), ": ")
tmp.icc <- ICCbare(x = ind, y = paste0("t", i), data = simdf)
cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
Alternatively, pass the column index:
for(i in 2:ncol(simdf)){
cat(names(simdf)[i], ": ")
tmp.icc <- ICCbare(x = ind, y = simdf[, i], data = simdf)
cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
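The text above mentions that these loops could be vectorised with the apply family. Here is a rough sketch of that idea using sapply(). To keep it self-contained it does not call ICCbare; instead it uses a hand-rolled helper, icc_oneway, which is a name invented for this example. The helper assumes balanced groups and uses the usual one-way ANOVA estimator, and I have not checked its output against ICCbare.

```r
set.seed(101)
n <- 15   # number of groups
k <- 3    # measures per group
simdf <- data.frame(ind = rep(LETTERS[1:n], each = k),
                    t1 = rep(rnorm(n, 10, 1), each = k) + rnorm(n * k),
                    t2 = rep(rnorm(n, 10, 1), each = k) + rnorm(n * k),
                    t3 = rep(rnorm(n, 10, 1), each = k) + rnorm(n * k))

# Hypothetical helper: ICC from a one-way ANOVA, assuming balanced groups
icc_oneway <- function(values, groups) {
  groups <- factor(groups)
  tab <- anova(aov(values ~ groups))
  msb <- tab[["Mean Sq"]][1]              # mean square among groups
  msw <- tab[["Mean Sq"]][2]              # mean square within groups
  k <- length(values) / nlevels(groups)   # measures per group
  (msb - msw) / (msb + (k - 1) * msw)
}

# One estimate per trait column, returned as a named numeric vector
iccs <- sapply(names(simdf)[-1], function(nm) icc_oneway(simdf[[nm]], simdf$ind))
iccs
```

Passing the column as simdf[[nm]] sidesteps non-standard evaluation altogether, at the cost of not using the package function.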
Passing a character as an argument is deprecated
Note that the function will still work if a character is passed directly (e.g., "t1"), albeit with a warning. The warning just means that this may no longer work in future versions of the package. For example:
ICCbare(x = ind, y = "t1", data = simdf)
#[1] 0.60446
#Warning message:
#In ICCbare(x = ind, y = "t1", data = simdf) :
# passing a character string to 'y' is deprecated since ICC version
# 2.3.0 and will not be supported in future versions. The argument
# to 'y' should either be an unquoted column name of 'data' or an object
Note, however, that an expression evaluating to a character (e.g., paste0("t", 1)) doesn't throw the warning, which is nice!