Error in summary.formula : matrix variables must have column dimnames - r

I am new to R, and after an hour of searching I still can't fix this bug. I could not find a similar problem posted before.
I followed the instructions at https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/ and want to test the proportional odds assumption for my data.
Here is my code:
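library(Hmisc)  # provides summary.formula
# Logit of the proportion of responses at or above each cut point
sf <- function(y) {
  c('Y>=1' = qlogis(mean(y >= 1)),
    'Y>=2' = qlogis(mean(y >= 2)),
    'Y>=3' = qlogis(mean(y >= 3)),
    'Y>=4' = qlogis(mean(y >= 4)),
    'Y>=5' = qlogis(mean(y >= 5)))
}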
(s <- with(dat, summary(as.numeric(implied_rating) ~ GDP + importance, fun = sf)))
But I get the following error:
"Error in summary.formula(matrix(as.numeric(implied_rating)) ~
matrix(GDP) + : matrix variables must have column dimnames"
What should I do?
Many thanks in advance!

Solved. I had assumed dimnames was the same as colnames...
I just manually set dimnames on every column.
But I still wonder whether there is a better way to solve the problem.
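One possible alternative to setting dimnames by hand is to flatten one-column matrix variables back into plain vectors before calling summary. This is a minimal base R sketch under the assumption that the columns of dat are one-column matrices without dimnames; the toy data frame and its values are hypothetical, and the sketch does not call summary.formula itself.

```r
# Hypothetical stand-in for `dat`: one column is a one-column matrix
dat <- data.frame(implied_rating = c(1, 3, 2, 5))
dat$GDP <- matrix(c(1.2, 3.4, 2.2, 0.7), ncol = 1)

is.matrix(dat$GDP)          # TRUE: this is a matrix variable
is.null(colnames(dat$GDP))  # TRUE: it has no column dimnames

# Flatten every one-column matrix into an ordinary vector
dat_flat <- dat
dat_flat[] <- lapply(dat_flat, function(col) {
  if (is.matrix(col) && ncol(col) == 1) as.vector(col) else col
})

is.matrix(dat_flat$GDP)     # FALSE: now a plain numeric vector
```

With plain vectors the variables should no longer be treated as matrix variables, so the dimnames requirement should not apply.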

Related

BVP in R: how to get a numerical solution?

Good day to you all :)
I need some help with a boundary value problem.
This is my equation:
sin(1)*y'' + (1+cos(1)*x^2)*y = -1
-1 < x < 1
y(-1) = y(1) = 0
I'd like to solve this equation with a built-in method.
I asked the same question on the Maple forum (https://www.mapleprimes.com/questions/225225-Dsolve-Does-Not-Work-solving-A-Boundary-Value-Problem) and was told that there is no built-in method:
Maple has no built-in collocation, least-squares, or Galerkin methods for solving ODEs.
dsolve has a series method, but it doesn't work in your case.
After that, I decided to try solving the equation in R.
library('bvpSolve')
# First-order system with y[1] = y and y[2] = y'
fun <- function(x, y, pars) {
  d1 = y[2]                                       # y'
  d2 = (-1 - (1 + cos(1)*x^2)*y[1]) / sin(1)      # y'' solved from the ODE
  return(list(c(d1, d2)))
}
init = c(y = 0, dy = NA)
end = c(y = 0, dy = NA)
sol = bvpcol(yini = init, x = seq(-1, 1, 0.01),
             func = fun, yend = end, parms = NULL)
x = sol[, 1]
y = sol[, 2]
plot(x, y, type = "l")
It produces a plot (probably correct; at least it matches the plot from Maple), but I still can't get the numerical solution.
I'd like to get the same result as in http://fredrik-j.blogspot.com/2009/02/galerkins-method-in-sympy.html, but in R.
I'd be very grateful if you could suggest built-in numerical methods in R that would help solve this equation.
I don't really think it's a good idea to explicitly interpolate the result (as suggested by the Maple community). Is there a simple and effective way to do this?
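Note that sol[, 1] and sol[, 2], as used for the plot, are already numeric vectors on the grid. As an independent cross-check that needs no package, here is a minimal finite-difference sketch in base R. It is a different technique from the collocation used by bvpcol, and the grid size n is an assumption chosen for illustration.

```r
# Finite-difference sketch for
#   sin(1)*y'' + (1 + cos(1)*x^2)*y = -1,  y(-1) = y(1) = 0
n  <- 201                         # number of grid points (assumed)
x  <- seq(-1, 1, length.out = n)
h  <- x[2] - x[1]
xi <- x[2:(n - 1)]                # interior nodes
m  <- length(xi)

# Central differences: sin(1)*(y[i-1] - 2*y[i] + y[i+1])/h^2 + q(x[i])*y[i] = -1
A <- matrix(0, m, m)
diag(A) <- -2 * sin(1) / h^2 + (1 + cos(1) * xi^2)
A[cbind(1:(m - 1), 2:m)] <- sin(1) / h^2   # super-diagonal
A[cbind(2:m, 1:(m - 1))] <- sin(1) / h^2   # sub-diagonal

# Solve for the interior values and attach the boundary values
y <- c(0, solve(A, rep(-1, m)), 0)

fd_solution <- data.frame(x = x, y = y)    # numerical solution as a table
head(fd_solution)
fd_solution[(n + 1) / 2, ]                 # value at the midpoint of the interval
```

The table fd_solution holds the approximate values of y at every grid point, so individual values can be read off by row without interpolating.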

How to resolve integer overflow errors in R estimation

I'm trying to estimate a model using speedglm in R. The dataset is large (~69.88 million rows and 38 columns). Multiplying the number of rows and columns results in ~2.7 billion which is outside the integer limit. I can't provide the data, but the following examples recreate the issue.
library(speedglm)
# large example that works
require(biglm)
n <- 500000
k <- 500
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
working.example <- speedglm(fo, data = da, family = Gamma(log))
# repeat with large enough size to break
k <- 5000 # 10 times larger than above
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
failed.example <- speedglm(fo, data = da, family = Gamma(log))
# attempting to resolve error with chunksize
attempted.fixed.example <- speedglm(fo, data = da, family = Gamma(log), chunksize = 10^6)
This causes an error and integer overflow warning.
Error in if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n, :
missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(X) * ncol(X) : NAs produced by integer overflow
I understand the warning, but I do not understand the error. They seem to be related in this case as they appear together after each attempt.
Removing columns allows the estimation to complete. It does not seem to matter which columns are removed; removing interacted or non-interacted variables will both result in a completed estimation. The chunksize option was added after receiving the error initially, but has not helped.
My questions are: (1) what causes the first error? (2) is there a way to estimate models using data such that the number of rows by the number of columns is larger than the integer limit? (3) is there a better na.action to use in this case?
Thanks,
JP.
Running: R version 3.3.3 (2017-03-06)
Actual code below:
dft_var <- c("cltvV0", "cltvV60", "cltvV120", "VCFLBRQ", "ageV0",
"ageV1", "ageV8", "ageV80", "FICOV300", "FICOV650",
"FICOV900", "SingleHouse", "Apt", "Mobile", "Duplex",
"Row", "Modular", "Rural", "FirstTimeBuyer",
"FirstTimeBuyerMissing", "brwtotinMissing", "IncomeRatio",
"VintageBefore2001", "NFLD", "yoy.fcpwti:province_n")
logit1 <- speedglm(formula = paste("DefaultFlag ~ ",
paste(dft_var, collapse = "+"),
sep = ""),
family = binomial(logit),
na.action = na.exclude,
data = default.data,
chunksize = 1*10^7)
Update:
Based on my investigation below, @James figured out that the problem can be avoided by providing a non-NULL value for the sparse parameter in the call to speedglm, as this prevents the internal call to the is.sparse function.
Using the example above, the following should now work:
speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
My original answer:
Both the warning and the error come from the same line in the function is.sparse in the package speedglm.
The line is:
sample(X,round((nrow(X)*ncol(X)*camp),digits=0),replace=FALSE)
The warning happens because nrow(X)*ncol(X) is evaluated for a large matrix. The nrow and ncol functions return integer values, and their product can overflow. Here is an illustration:
nr = 1000000L
nc = 1000000L
nr*nc
# [1] NA
# Warning message:
# In nr * nc : NAs produced by integer overflow
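For comparison, here is a short sketch of how the same product can be computed without overflow by converting one factor to double first. This only illustrates the arithmetic; it is not a patch to is.sparse.

```r
nr <- 1000000L
nc <- 1000000L

.Machine$integer.max                   # largest representable integer in R

prod_int <- suppressWarnings(nr * nc)  # integer arithmetic overflows to NA
prod_dbl <- as.numeric(nr) * nc        # double arithmetic keeps the value

is.na(prod_int)
prod_dbl
```

Any code that multiplies nrow(X) by ncol(X) in integer arithmetic hits the same limit once the product exceeds .Machine$integer.max.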
The error happens because the sample function is confused when X is a large matrix and size = NA. Here is an illustration:
sample(matrix(1,3000,1000000), NA, replace=FALSE)
# Error in if (useHash) .Internal(sample2(n, size)) else .Internal(sample(n, :
# missing value where TRUE/FALSE needed
Thanks to @Andrey's guidance I was able to solve the problem. The issue was the sample function in the is.sparse check. To bypass it, I set sparse=FALSE in the options for speedglm (this should work with sparse=TRUE as well, though I haven't tried it). This works because speedglm calls is.sparse via speedglm.wfit in the following way:
if (is.null(sparse))
sparse <- is.sparse(x = x, sparsellim, camp)
So setting sparse avoids the is.sparse function.
Using the example above, the following should now work:
speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)

How to add columns to data.table using eval

I have a data table of observations and model output, each of which is either yes or no. For simplicity I have assumed only two groups. I want to calculate some categorical statistics and control which ones are computed. I know how to do this using eval and save the result in another data.table, but I want to add the results to the existing data.table, as I have only one row for each group. Could anyone help me?
First I create the contingency table for each group.
DT <- data.table::data.table(obs = rep(c("yes","no"), 5), mod = c(rep("yes",5), rep("no", 5)), groupBy = c(1,1,1,1,1,2,1,1,2,1))
categorical <- DT[, .(a = sum(obs == category[1] & mod == category[1]),
b = sum(obs == category[2] & mod == category[1]),
c = sum(obs == category[1] & mod == category[2]),
d = sum(obs == category[2] & mod == category[2])), by = groupBy]
Then define the statistics
my_exprs = quote(list(
n = a+b+c+d,
s = (a+c)/(a+b+c+d),
r = (a+b)/(a+b+c+d)))
If I use the following lines, I get a new data.table:
statList <- c("n","s")
w = which(names(my_exprs) %in% statList)
categorical[, eval(my_exprs[c(1,w)]), by = groupBy]
How can I use := in this example to add the results to my existing data.table, here called categorical? I tried the following and got an error message:
categorical[, `:=`(eval(my_exprs[c(1,w)])), by = groupBy]
Error in `[.data.table`(categorical, , `:=`(eval(my_exprs[c(1, w)])), :
In `:=`(col1=val1, col2=val2, ...) form, all arguments must be named.
Thanks,
I cannot reproduce your example, but it might work to keep your my_exprs, but define
my_newcols = as.call(c(quote(`:=`), my_exprs))
as in Arun's answer.
Alternately, you could just construct the expression with a := at the start:
my_newcols = quote(`:=`(n = a+b+c+d, s = a+c))
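A hedged sketch of building the := call programmatically, so that every argument is named and only the statistics in statList are kept. The construction uses base R only. The toy categorical table and its counts are hypothetical, and the data.table step runs only if the package is installed.

```r
my_exprs <- quote(list(
  n = a + b + c + d,
  s = (a + c) / (a + b + c + d),
  r = (a + b) / (a + b + c + d)))
statList <- c("n", "s")

# Split the quoted list(...) call into its named arguments
args <- as.list(my_exprs)[-1]             # drop the leading `list` symbol
args <- args[names(args) %in% statList]   # keep only the requested statistics

# Rebuild as `:=`(n = ..., s = ...), with each argument named
my_newcols <- as.call(c(as.name(":="), args))
print(my_newcols)

if (requireNamespace("data.table", quietly = TRUE)) {
  # Hypothetical contingency counts, one row per group
  categorical <- data.table::data.table(groupBy = c(1, 2),
                                        a = c(3, 1), b = c(2, 1),
                                        c = c(2, 0), d = c(1, 0))
  categorical[, eval(my_newcols), by = groupBy]   # adds n and s by reference
  print(categorical)
}
```

Splicing the arguments in one by one matters here: wrapping the whole list(...) call as a single unnamed argument of := would give an unnamed call again.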

R apply function to data based on index column value

Example:
require(data.table)
example = matrix(c(rnorm(15, 5, 1), rep(1:3, each=5)), ncol = 2, nrow = 15)
example = data.table(example)
setnames(example, old=c("V1","V2"), new=c("target", "index"))
example
threshold = 100
accumulating_cost = function(x,y) { x-cumsum(y) }
whats_left = accumulating_cost(threshold, example$target)
whats_left
I want whats_left to consist of the difference between threshold and the cumulative sum of values in example$target for which example$index = 1, and 2, and 3. So I used the following for loop:
rm(whats_left)
whats_left = vector("list")
for(i in 1:max(example$index)) {
whats_left[[i]] = accumulating_cost(threshold, example$target[example$index==i])
}
whats_left = unlist(whats_left)
whats_left
plot(whats_left~c(1:15))
I know for loops aren't the devil in R, but I'm habituating myself to use vectorization when possible (including getting away from apply, being a for loop wrapper). I'm pretty sure it's possible here, but I can't figure out how to do it. Any help would be much appreciated.
All you are trying to do is accumulate the cost by index, so you might want to use the by argument, as in
example[, accumulating_cost(threshold, target), by = index]
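For readers without data.table, roughly the same grouped cumulative sum can be sketched in base R with ave. This is an alternative to the by approach, not the answer's method; the seed and the use of a plain data frame are assumptions for illustration.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible
example <- data.frame(target = rnorm(15, 5, 1),
                      index  = rep(1:3, each = 5))
threshold <- 100

# ave() applies cumsum within each index group and returns the values
# in the original row order
whats_left <- threshold - ave(example$target, example$index, FUN = cumsum)
whats_left
```

The result has one value per row, restarting from threshold at the first row of each index group.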

How to apply a distribution function for each row in data frame

I know similar questions have been asked on this site here, here, and here, but none of them tackles my problem.
I have a data frame and want to apply the rdirichlet function (from gtools) to each row, so that each row is used as alpha.
data = NULL
data <- data.frame(rbind(
oct = c(60, 32, 8),
sep = c(53, 35, 12),
ago = c(54, 40, 6)
))
data <- data/100*1000
library(gtools) # contains the function
sim <- 10000 # simulation
My first attempt was to use apply. It works, but the output is not convenient for further analysis; each row's computation becomes a single vector:
p = apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
I also tried a loop, without success:
p = NULL
for(i in 1:length(data)) {
p[i] <- rdirichlet(sim, alpha = data[i] + 1)
}
Any tips on how I can solve this?
Well, firstly you might want to change data in the anonymous function inside apply to x, to match the x in function(x):
apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
This works for me, as in it provides an output with three columns and 30000 rows.
Two important things here. First, vectorizing is the best way to go:
ans <- apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
By doing this, you'll receive each row's computations as a single vector: the sim draws for each of the k components are stacked on top of each other in one column.
Then you'll need to subset it, for example:
margin <- ans[1:sim, 1] - ans[(sim + 1):(2 * sim), 1]
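An alternative that avoids the stacked layout is to keep one matrix of draws per row in a list. The sketch below is self-contained and uses rdirichlet_sketch, a hypothetical base R stand-in based on normalised Gamma draws, so that it runs without gtools. With gtools loaded, rdirichlet could be used in its place. The smaller sim and the seed are assumptions.

```r
# Hypothetical stand-in for gtools::rdirichlet:
# normalised independent Gamma draws follow a Dirichlet distribution
rdirichlet_sketch <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  g / rowSums(g)
}

data <- data.frame(rbind(
  oct = c(60, 32, 8),
  sep = c(53, 35, 12),
  ago = c(54, 40, 6)
))
data <- data / 100 * 1000
sim <- 1000   # smaller than in the question, to keep the sketch quick
set.seed(1)   # assumed seed

# One sim-by-3 matrix of draws per row of the data frame
draws <- lapply(seq_len(nrow(data)), function(i) {
  rdirichlet_sketch(sim, alpha = unlist(data[i, ]) + 1)
})
names(draws) <- rownames(data)

dim(draws$oct)                                  # sim rows, one column per component
margin_oct <- draws$oct[, 1] - draws$oct[, 2]   # difference between two components
```

Keeping a list of matrices means each month's draws can be indexed by name and by component, without computing row offsets.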
