I have an R program for a regression that gives me an error message I do not understand. The regression model takes heat data (Q_htg) and the corresponding ambient temperature data (T_amb) as input and builds a linear regression for those two variables. Afterwards I want to use the trained regression model to predict some outputs. Here is the code:
linearPowerScaling2.function <-
  function(Dataset,
           numberOfDaysForAggregation,
           normOutsideTemperature) {
    heatingPower <- Dataset$Q_htg
    outSideTemperature <- Dataset$T_amb
    aggregationLevel <- numberOfDaysForAggregation * 1440
    index <- 0
    meanValuesOutsideTemperature <-
      vector(, length(outSideTemperature) / aggregationLevel)
    for (i in seq(1, length(outSideTemperature), aggregationLevel)) {
      sum <- 0
      for (j in seq(i, i + aggregationLevel - 1, 1)) {
        sum <- sum + outSideTemperature[j]
      }
      index <- index + 1
      meanValuesOutsideTemperature[index] <- sum / aggregationLevel
    }
    index <- 0
    meanValuesHeatingDemand <-
      vector(, length(heatingPower) / aggregationLevel)
    for (i in seq(1, length(heatingPower), aggregationLevel)) {
      sum <- 0
      for (j in seq(i, i + aggregationLevel - 1, 1)) {
        sum <- sum + heatingPower[j]
      }
      index <- index + 1
      meanValuesHeatingDemand[index] <- sum / aggregationLevel
    }
    linearModel <-
      lm(meanValuesHeatingDemand ~ meanValuesOutsideTemperature)
    abline(linearModel, col = "red")
    pred <- predict(linearModel, data.frame(meanValuesOutsideTemperature = c(normOutsideTemperature)))
    List <- list(meanValuesHeatingDemand, meanValuesOutsideTemperature)
    List2 <- vector("list", length(heatingPower) / aggregationLevel)
    for (i in seq(1, length(meanValuesHeatingDemand), 1)) {
      List2[[i]] <- c(meanValuesHeatingDemand[i], meanValuesOutsideTemperature[i])
    }
    List3 <- List2[order(sapply(List2, function(x) x[1], simplify = TRUE), decreasing = FALSE)]
    firstTemperatureWithHeatingDemand <- 0
    firstHeatingDemand <- 0
    for (i in seq(1, length(List3), 1)) {
      if (List3[[i]][1] > 0) {
        firstTemperatureWithHeatingDemand <- List3[[i]][2]
        firstHeatingDemand <- List3[[i]][1]
        break
      }
    }
    regression2ValuesX <- vector(, 5)
    regression2ValuesY <- vector(, 5)
    regression2ValuesX[1] <- firstTemperatureWithHeatingDemand
    regression2ValuesY[1] <- firstHeatingDemand
    List3 <- List2[order(sapply(List2, function(x) x[1], simplify = TRUE), decreasing = TRUE)]
    for (i in seq(1, length(regression2ValuesX) - 1, 1)) {
      regression2ValuesX[i + 1] <- List3[[i]][2]
      regression2ValuesY[i + 1] <- List3[[i]][1]
    }
    plot(regression2ValuesX, regression2ValuesY)
    linearModel2 <-
      lm(regression2ValuesY ~ regression2ValuesX)
    abline(linearModel2, col = "blue")
    pred <- predict(linearModel2, data.frame(regression2ValuesX = c(normOutsideTemperature)))
    paste("Predicted heating demand:", round(pred))
  }
When I run it with the command
linearPowerScaling2.function(data_heat_test, 1, -12)
I get the error message:
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...) :
  plot.new has not been called yet
3. int_abline(a = a, b = b, h = h, v = v, untf = untf, ...)
2. abline(linearModel, col = "red") at LinearPowerScaling2_Function.R#33
1. linearPowerScaling2.function(data_heat_test, 1, -12)
The data itself should be okay. Can anyone tell me what the problem is?
Without a minimal reproducible example it's hard to be sure, but the error message tells you that you are calling abline() before calling plot().
That's exactly what happens on line 33...
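A minimal sketch of a fix (assuming you want the first regression line drawn over the aggregated data; the variable names are taken from your function, and the plot() call goes right before the existing abline()):
plot(meanValuesOutsideTemperature, meanValuesHeatingDemand)   # open a plotting device with the data
linearModel <- lm(meanValuesHeatingDemand ~ meanValuesOutsideTemperature)
abline(linearModel, col = "red")                              # now abline() has a plot to draw on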
Hope this helps.
Check here to see how to make a minimal reproducible example.
Related
I am trying to implement NLPCA (Nonlinear PCA) on a data set using the homals package in R but I keep on getting the following error message:
Error in dimnames(x) <- dn : length of 'dimnames' [1] not equal to array extent
The data set I use can be found in the UCI ML Repository (it is called dat once imported into R): https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29
Here is my code (some code is provided once the data set is downloaded):
nlpcasouthgerman <- homals(dat, rank=1, level=c('nominal','numerical',rep('nominal',2),
                                                'numerical','nominal',
                                                rep('ordinal',2), rep('nominal',2),
                                                'ordinal','nominal','numerical',
                                                rep('nominal',2), 'ordinal',
                                                'nominal','ordinal',rep('nominal',3)),
                           active=c(FALSE, rep(TRUE, 20)), ndim=3, verbose=1)
I am trying to predict the first attribute, therefore I set it to be active=FALSE.
The output looks like this (skipped all iteration messages):
Iteration: 1 Loss Value: 0.000047
Iteration: 2 Loss Value: 0.000044
...
Iteration: 37 Loss Value: 0.000043
Iteration: 38 Loss Value: 0.000043
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
I don't understand why this error comes up. I have used the same code on some other data set and it worked fine so I don't see why this error persists. Any suggestions about what might be going wrong and how I could fix this issue?
Thanks!
It seems the error comes from code inside the homals function that generates NAs, specifically for the number_credits levels in your data. That breaks sort(as.numeric((rownames(clist[[i]])))) as well as the attempt to catch the error afterwards, since one of the levels does not give an NA value.
So either you have to modify the homals function to take care of such an edge case, or change problematic factor levels. This might be something to file as a bug report to the package maintainer.
As a work-around in your case you could do something like:
levels(dat$number_credits)[1] <- "_1"
and the function should run without problems.
Edit:
I think one solution would be to change one line of code in the homals function, but there is no guarantee this works as intended. Better to submit a bug report to the package author/maintainer - see https://cran.r-project.org/web/packages/homals/ for the address.
Using rnames <- as.numeric(rownames(clist[[i]]))[order(as.numeric(rownames(clist[[i]])))] instead of rnames <- sort(as.numeric((rownames(clist[[i]])))) would allow the following code to identify NAs, but I am not sure why the author did not try to preserve factor levels outright.
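The difference between the two lines in a nutshell (a small illustration with a made-up vector rn, not part of the original data): sort() silently drops NAs by default, so the result gets shorter, while order() keeps them at the end (na.last = TRUE), so the length is preserved and the is.na() check that follows in homals can still fall back to the original rownames.
rn <- c("2", "3", "_1", "4")                  # one entry does not parse as a number
sort(as.numeric(rn))                          # 2 3 4     -> NA dropped, length shrinks
as.numeric(rn)[order(as.numeric(rn))]         # 2 3 4 NA  -> NA kept, length preserved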
Anyway, you could run a modified function in your local environment, which requires explicitly calling internal (not exported) homals functions, as shown below. Not necessarily the best approach, but it would help you out in a pinch.
homals <- function (data, ndim = 2, rank = ndim, level = "nominal", sets = 0,
active = TRUE, eps = 0.000001, itermax = 1000, verbose = 0) {
dframe <- data
name <- deparse(substitute(dframe))
nobj <- nrow(dframe)
nvar <- ncol(dframe)
vname <- names(dframe)
rname <- rownames(dframe)
for (j in 1:nvar) {
dframe[, j] <- as.factor(dframe[, j])
levfreq <- table(dframe[, j])
if (any(levfreq == 0)) {
newlev <- levels(dframe[, j])[-which(levfreq == 0)]
}
else {
newlev <- levels(dframe[, j])
}
dframe[, j] <- factor(dframe[, j], levels = sort(newlev))
}
varcheck <- apply(dframe, 2, function(tl) length(table(tl)))
if (any(varcheck == 1))
stop("Variable with only 1 value detected! Can't proceed with estimation!")
active <- homals:::checkPars(active, nvar)
rank <- homals:::checkPars(rank, nvar)
level <- homals:::checkPars(level, nvar)
if (length(sets) == 1)
sets <- lapply(1:nvar, "c")
if (!all(sort(unlist(sets)) == (1:nvar))) {
print(cat("sets union", sort(unlist(sets)), "\n"))
stop("inappropriate set structure !")
}
nset <- length(sets)
mis <- rep(0, nobj)
for (l in 1:nset) {
lset <- sets[[l]]
if (all(!active[lset]))
(next)()
jset <- lset[which(active[lset])]
for (i in 1:nobj) {
if (any(is.na(dframe[i, jset])))
dframe[i, jset] <- NA
else mis[i] <- mis[i] + 1
}
}
for (j in 1:nvar) {
k <- length(levels(dframe[, j]))
if (rank[j] > min(ndim, k - 1))
rank[j] <- min(ndim, k - 1)
}
x <- cbind(homals:::orthogonalPolynomials(mis, 1:nobj, ndim))
x <- homals:::normX(homals:::centerX(x, mis), mis)$q
y <- lapply(1:nvar, function(j) homals:::computeY(dframe[, j], x))
sold <- homals:::totalLoss(dframe, x, y, active, rank, level, sets)
iter <- pops <- 0
repeat {
iter <- iter + 1
y <- homals:::updateY(dframe, x, y, active, rank, level, sets,
verbose = verbose)
smid <- homals:::totalLoss(dframe, x, y, active, rank, level,
sets)/(nobj * nvar * ndim)
ssum <- homals:::totalSum(dframe, x, y, active, rank, level, sets)
qv <- homals:::normX(homals:::centerX((1/mis) * ssum, mis), mis)
z <- qv$q
snew <- homals:::totalLoss(dframe, z, y, active, rank, level,
sets)/(nobj * nvar * ndim)
if (verbose > 0)
cat("Iteration:", formatC(iter, digits = 3, width = 3),
"Loss Value: ", formatC(c(smid), digits = 6,
width = 6, format = "f"), "\n")
r <- abs(qv$r)/2
ops <- sum(r)
aps <- sum(La.svd(crossprod(x, mis * z), 0, 0)$d)/ndim
if (iter == itermax) {
stop("maximum number of iterations reached")
}
if (smid > sold) {
warning(cat("Loss function increases in iteration ",
iter, "\n"))
}
if ((ops - pops) < eps)
break
else {
x <- z
pops <- ops
sold <- smid
}
}
ylist <- alist <- clist <- ulist <- NULL
for (j in 1:nvar) {
gg <- dframe[, j]
c <- homals:::computeY(gg, z)
d <- as.vector(table(gg))
lst <- homals:::restrictY(d, c, rank[j], level[j])
y <- lst$y
a <- lst$a
u <- lst$z
ylist <- c(ylist, list(y))
alist <- c(alist, list(a))
clist <- c(clist, list(c))
ulist <- c(ulist, list(u))
}
dimlab <- paste("D", 1:ndim, sep = "")
for (i in 1:nvar) {
if (ndim == 1) {
ylist[[i]] <- cbind(ylist[[i]])
ulist[[i]] <- cbind(ulist[[i]])
clist[[i]] <- cbind(clist[[i]])
}
options(warn = -1)
# Here is the line that I changed in the code:
# rnames <- sort(as.numeric((rownames(clist[[i]]))))
rnames <- as.numeric(rownames(clist[[i]]))[order(as.numeric(rownames(clist[[i]])))]
options(warn = 0)
if ((any(is.na(rnames))) || (length(rnames) == 0))
rnames <- rownames(clist[[i]])
if (!is.matrix(ulist[[i]]))
ulist[[i]] <- as.matrix(ulist[[i]])
rownames(ylist[[i]]) <- rownames(ulist[[i]]) <- rownames(clist[[i]]) <- rnames
rownames(alist[[i]]) <- paste(1:dim(alist[[i]])[1])
colnames(clist[[i]]) <- colnames(ylist[[i]]) <- colnames(alist[[i]]) <- dimlab
colnames(ulist[[i]]) <- paste(1:dim(as.matrix(ulist[[i]]))[2])
}
names(ylist) <- names(ulist) <- names(clist) <- names(alist) <- colnames(dframe)
rownames(z) <- rownames(dframe)
colnames(z) <- dimlab
dummymat <- as.matrix(homals:::expandFrame(dframe, zero = FALSE, clean = FALSE))
dummymat01 <- dummymat
dummymat[dummymat == 2] <- NA
dummymat[dummymat == 0] <- Inf
scoremat <- array(NA, dim = c(dim(dframe), ndim), dimnames = list(rownames(dframe),
colnames(dframe), paste("dim", 1:ndim, sep = "")))
for (i in 1:ndim) {
catscores.d1 <- do.call(rbind, ylist)[, i]
dummy.scores <- t(t(dummymat) * catscores.d1)
freqlist <- apply(dframe, 2, function(dtab) as.list(table(dtab)))
cat.ind <- sequence(sapply(freqlist, length))
scoremat[, , i] <- t(apply(dummy.scores, 1, function(ds) {
ind.infel <- which(ds == Inf)
ind.minfel <- which(ds == -Inf)
ind.nan <- which(is.nan(ds))
ind.nael <- which((is.na(ds) + (cat.ind != 1)) ==
2)
ds[-c(ind.infel, ind.minfel, ind.nael, ind.nan)]
}))
}
disc.mat <- apply(scoremat, 3, function(xx) {
apply(xx, 2, function(cols) {
(sum(cols^2, na.rm = TRUE))/nobj
})
})
result <- list(datname = name, catscores = ylist, scoremat = scoremat,
objscores = z, cat.centroids = clist, ind.mat = dummymat01,
loadings = alist, low.rank = ulist, discrim = disc.mat,
ndim = ndim, niter = iter, level = level, eigenvalues = r,
loss = smid, rank.vec = rank, active = active, dframe = dframe,
call = match.call())
class(result) <- "homals"
result
}
I am trying to implement logistic regression and the function works manually, but for some reason I get the error "Error in nrow(X) : object 'X' not found", even though X is defined before the nrow command. I use the UCI "Adult" data set to test it.
If I try to run the function code manually, there is no error. Can anyone explain that?
# Sigmoid function
sigmoid <- function(z){
  g <- 1/(1+exp(-z))
  return(g)
}
# Cost function
cost <- function(theta){
  n <- nrow(X)
  g <- sigmoid(X %*% theta)
  J <- (1/n)*sum((-Y*log(g)) - ((1-Y)*log(1-g)))
  return(J)
}
log_reg <- function(datafr, m){
  # Train/test split
  sample <- sample(1:nrow(datafr), m)
  df_train <- datafr[sample,]
  df_test <- datafr[-sample,]
  num_features <- ncol(datafr) - 1
  num_label <- ncol(datafr)
  label_levels <- levels(datafr[, num_label])
  datafr[, num_features+1] <- ifelse(datafr[, num_label] == names(table(datafr[,num_label]))[1], 0, 1)
  # Predictor variables
  X <- as.matrix(df_train[, 1:num_features])
  X_test <- as.matrix(df_test[, 1:num_features])
  # Add ones to X
  X <- cbind(rep(1, nrow(X)), X)
  X_test <- cbind(rep(1, nrow(X_test)), X_test)
  # Response variable
  Y <- as.matrix(df_train[, num_label])
  Y <- ifelse(Y == names(table(Y))[1], 0, 1)
  Y_test <- as.matrix(df_test[, num_label])
  Y_test <- ifelse(Y_test == names(table(Y_test))[1], 0, 1)
  # Initial theta
  initial_theta <- rep(0, ncol(X))
  # Derive theta using gradient descent via the optim function
  theta_optim <- optim(par=initial_theta, fn=cost)
  predictions <- ifelse(sigmoid(X_test%*%theta_optim$par)>=0.5, 1, 0)
  # Generalization error
  error_rate <- sum(predictions!=Y_test)/length(Y_test)
  return(error_rate)
}
### Adult Data
data <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                   sep = ',', fill = F, strip.white = T)
colnames(data) <- c('age', 'workclass', 'fnlwgt', 'education',
                    'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
                    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')
# Feature selection
datafr <- data[, c("age", "education_num", "hours_per_week", "income")]
log_reg(datafr = datafr, m = 20)
You are calling cost(), which refers to X, but X is not defined inside cost() and is not visible from it. Either define cost() within log_reg() after you have defined X, or, better, make X a parameter of cost():
cost <- function(theta, X, Y){
  n <- nrow(X)
  g <- sigmoid(X %*% theta)
  J <- (1/n)*sum((-Y*log(g)) - ((1-Y)*log(1-g)))
  return(J)
}
And later
theta_optim <- optim(par=initial_theta, fn=cost, X=X, Y=Y)
In general, try to avoid using variables inside a function that are not passed explicitly as arguments to that function. Otherwise you will keep running into problems like this one.
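The reason the code works when you step through it manually is R's lexical scoping: cost() looks up free variables such as X in the environment where it was defined (the global environment), not in the environment it is called from. Run by hand, X ends up in your global environment and is found; inside log_reg() it only exists in that function's local environment. A minimal illustration with made-up names f, g and x:
g <- function() x           # x is free in g; it is looked up where g was defined (globally)
f <- function() {
  x <- 1                    # local to f, invisible to g
  g()
}
f()                         # Error: object 'x' not found
x <- 1                      # defining x globally ...
f()                         # ... makes the call work and return 1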
Also, how did I find it out? I used traceback():
> traceback()
5: nrow(X) at #2
4: fn(par, ...)
3: (function (par)
fn(par, ...))(c(0, 0, 0, 0))
2: optim(par = initial_theta, fn = cost) at #33
1: log_reg(datafr = datafr, m = 20)
I'm trying to implement the R function ecdf().
I'm considering two cases: one where t is a single value, the other where t is a vector.
# First case
my.ecdf <- function(x, t) {
  indicator <- ifelse(x <= t, 1, 0)
  out <- sum(indicator)/length(x)
  out
}
# Second case
my.ecdf <- function(x, t) {
  out <- length(t)
  for (i in 1:length(t)) {
    indicator <- ifelse(x <= t[i], 1, 0)
    out[i] <- sum(indicator)/length(t)
  }
  out
}
How can I check whether I'm doing the right thing against the R function ecdf()? That function takes just x as an argument, so I can't specify the value of t.
You could just plot the results and see that it gives something very similar:
# slightly improved version of my.ecdf
my.ecdf <- function(x, t) {
  out <- numeric(length(t))
  for (i in 1:length(t)) {
    indicator <- as.numeric(x <= t[i])
    out[i] <- sum(indicator)/length(x)  # divide by the number of data points, length(x)
  }
  out
}
# test 1
x <- rnorm(1000)
plot(ecdf(x))
lines(seq(-4, 4, length=1000),
      my.ecdf(x, seq(-4, 4, length=1000)),
      col='red')
# test 2
x <- rexp(1000)
plot(ecdf(x))
lines(seq(0, 8, length=1000),
      my.ecdf(x, seq(0, 8, length=1000)),
      col='red')
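Beyond eyeballing the plots, you can also compare numerically: ecdf() returns a function, so you can evaluate it at the same t values you pass to my.ecdf() (a small sketch, assuming the corrected my.ecdf above):
x <- rnorm(1000)
t <- seq(-4, 4, length = 100)
F_hat <- ecdf(x)                     # ecdf() returns a step function
all.equal(my.ecdf(x, t), F_hat(t))   # should be TRUE if my.ecdf is correct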
A general tip - you can view the source code of a function by typing its name into the console without parentheses or arguments:
ecdf
function (x)
{
    x <- sort(x)
    n <- length(x)
    if (n < 1)
        stop("'x' must have 1 or more non-missing values")
    vals <- unique(x)
    rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/n,
        method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")
    class(rval) <- c("ecdf", "stepfun", class(rval))
    assign("nobs", n, envir = environment(rval))
    attr(rval, "call") <- sys.call()
    rval
}
I have some matrix calculations in my code that are taking a long time to run. So far the only way I can think of to speed it up is to use foreach instead of a for loop, but I feel like there's more that can be done. Is there some way of vectorizing things, or an alternative to the for loop, that I'm missing out on?
Thanks!
require(foreach)
require(mvtnorm)
# some dummy input values
omega.input.jP <- matrix(rnorm(3000*5, 0.1, 0.1), 3000, 5)
nsteps.obs <- ncol(omega.input.jP)
sigma.j <- rnorm(3000, 0.02, 0.05)
rho1.j <- rnorm(3000, 0.8, 0.1)
rho2.j <- rnorm(3000, 0.05, 0.1)
y.lastobs <- 0.3
mu.input.jP <- matrix(NA, nrow(omega.input.jP), ncol(omega.input.jP))
# note: j indexes the sample (3000 samples in total) and P denotes the time step (5 time steps here)
mu.input.jP <- foreach (j = 1:nrow(mu.input.jP), .combine = "rbind") %do% {
  omega <- omega.input.jP[j, ]
  Sigma.mu <- GetSigmaMu(nsteps = nsteps.obs, sigma_ar = sigma.j[j], rho1 = rho1.j[j], rho2 = rho2.j[j])
  mu.input.P <- GetConditionalMu(omega = omega, Sigma.mu = Sigma.mu, y = y.lastobs)
  return(mu.input.P)
}
GetSigmaMu <- function( # Get Sigma.mu, a \code{nsteps} x \code{nsteps} matrix, for AR(2) process
  nsteps,
  sigma_ar,
  rho1,
  rho2
) {
  rho <- c(rho1, rho2)
  cor <- ARMAacf(ar = rho, pacf = FALSE, lag.max = nsteps) # phi's, first element is phi0 = 1
  var <- sigma_ar^2/(1 - sum(rho*cor[2:3])) # stationary variance # cor[2:3] gives first two phi's; cor[1] gives phi0 = 1 # change JR, 20140304
  cov <- cor*var
  Sigma.mu <- matrix(NA, nsteps, nsteps)
  for (i in 1:nsteps) {
    for (k in 1:nsteps) {
      Sigma.mu[i,k] <- cov[abs(i-k)+1]
    }
  }
  return(Sigma.mu)
}
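As a side note (a sketch only, not tested against the rest of the pipeline): the double loop that fills Sigma.mu builds a Toeplitz matrix of the autocovariances, Sigma.mu[i, k] = cov[abs(i - k) + 1], so inside GetSigmaMu it can be replaced by a single vectorized expression, for example:
Sigma.mu <- toeplitz(cov[1:nsteps])
# or, equivalently, using outer():
# Sigma.mu <- matrix(cov[abs(outer(1:nsteps, 1:nsteps, "-")) + 1], nsteps, nsteps)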
GetConditionalMu <- function( # Get values of mu given y
  omega,
  Sigma.mu,
  y,
  method = "svd" # Method to get eigenvalues in matrix. Default method does not work, "svd" used instead.
) {
  nsteps <- length(omega)
  one <- rep(1, nsteps)
  mean.mu.cond <- c(omega + (1/(sum(Sigma.mu)))*(Sigma.mu %*% one)*c(nsteps*y - t(one) %*% omega))
  Sigma.mu.cond <- Sigma.mu - (1/(sum(Sigma.mu)))*(Sigma.mu %*% one %*% t(one) %*% Sigma.mu)
  mu.cond <- rmvnorm(1, mean.mu.cond, Sigma.mu.cond, method = method)
  return(mu.cond)
}
I am actually trying to get the DB index for k = 2:63. The problem only occurs when I run kmeans in a loop; otherwise it does not show the error.
Here is my main code:
for (i in 2:63) {
  print(i)
  # kmeans++ algorithm
  r_cluster_result = kmpp(r_cluster_mat, i)
  r_cluster_data = data.frame(r_cluster_result$cluster)
  db = index.DB(x = r_cluster, cl = r_cluster_result$cluster, d = NULL, centrotypes = "centroids", p = 2, q = 1)
  db_values = c(db_values, db$DB)
}
Here is my kmeans++ function code
kmpp <- function(X, k) {
  n <- nrow(X)
  C <- numeric(k)
  C[1] <- sample(1:n, 1)
  for (i in 2:k) {
    dm <- distmat(X, X[C, ])
    pr <- apply(dm, 1, min)
    pr[C] <- 0
    C[i] <- sample(1:n, 1, prob = pr)
  }
  kmeans(X, X[C, ])
}
The error message is:
Error in sample(1:n, 1, prob = pr) : NA in probability vector
In addition: Warning message:
In sqrt(XX + YY - 2 * XY) : NaNs produced
dput(X) is
dput(x)