R Caret: NA Errors when there is no missing value

I am attempting to run a classification algorithm for a dataset with no missing values. Here is the dataset description:
'data.frame': 59977 obs. of 6 variables:
$ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 1 1 2 2 ...
$ age : num 35.7 35.7 35.7 35.7 35.7 ...
$ code : Factor w/ 492 levels "ADN105","AXN16B",..: 128 128 128 363 363 363 104 104 221 221 ...
$ totalflags : num 4 4 4 4 4 4 3 3 2 2 ...
$ measure2 : num 30 30 30 1 1 1 23 23 22 22 ...
$ outcome : num 1 1 1 0 0 0 1 1 1 1 ...
- attr(*, "na.action")=Class 'omit' Named int [1:138] 3718 3719 5493 5494 5495 5496 7302 7303 8415 8416 ...
.. ..- attr(*, "names")= chr [1:138] "4929" "4930" "7384" "7385" ...
When I run the following command
x <- Mydataset[,1:5]
y <- Mydataset[,6]
fit <- glmnet(x, y, family="binomial", alpha=0.5, lambda=0.001)
I get
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning message:
In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
NAs introduced by coercion
Before running the glm model, I did this:
Mydataset <- na.omit(Mydataset)
And checked to make sure no NA's exist:
sapply(Mydataset, function(y) sum(length(which(is.na(y)))))
and I got:
gender age code totalflags measure2 outcome
0 0 0 0 0 0
I looked at other questions but couldn't find anything relevant. I'd appreciate any thoughts and help with this.
EDIT: ANSWER
I did a little digging and decided to change the data frame to numeric matrix and the model ran without complaining. This is the code that helped me:
x <- data.matrix(Mydataset[,1:5])
y <- data.matrix(Mydataset[,6])

The most likely cause is a small or zero number of observations in one or more levels of the factor variables. Try dropping unused levels first:
Mydataset [ c('gender', 'code') ] <-
lapply( Mydataset [ c('gender', 'code') ], factor)
If that's not effective, then you should show the actual code used, with a better description and the names of all objects involved. At the moment we don't even know what x and y are.
EDIT: The glmnet function does not have a formula interface and is not set up to handle data.frames and factors the way that typical R regression functions would allow. After looking at the structure of x (still a list/dataframe) and reviewing the help page for ?glmnet and doing a bit of searching for the correct way to handle factors when a numeric matrix is the expected input, I suggest converting your factors to dummies with model.matrix. It's going to be easier for interpretation of the results if you change the default contrast scheme for treatment contrasts (See https://stats.stackexchange.com/questions/69804/group-categorical-variables-in-glmnet):
contr.Dummy <- function(contrasts, ...){
conT <- contr.treatment(contrasts=FALSE, ...)
conT
}
options(contrasts=c(ordered='contr.Dummy', unordered='contr.Dummy'))
x.m <- model.matrix( ~.-1, x)
fit <- glmnet(x=x.m, y, family="binomial", alpha=0.5, lambda=0.001)
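To illustrate why a data frame with factor columns can trigger "NAs introduced by coercion" even when it contains no missing values, here is a minimal base R sketch. The toy data frame and its values are hypothetical, and the coercion shown is only a plausible reconstruction of what happens inside glmnet, not its actual code.

```r
# Hypothetical toy data mirroring the structure of the question's predictors
toy <- data.frame(
  gender = factor(rep(c("F", "M"), 10)),
  age    = seq(20, 58, by = 2),
  code   = factor(rep(c("ADN105", "AXN16B", "BXN20C", "ADN105"), 5))
)

# as.matrix() on mixed columns gives a character matrix; forcing that to
# numeric turns every factor label into NA
chr <- as.matrix(toy)
num <- suppressWarnings(matrix(as.numeric(chr), nrow = nrow(chr)))
any(is.na(toy))   # FALSE: the data frame itself has no missing values
any(is.na(num))   # TRUE: the NAs appear only after coercion

# data.matrix() replaces each factor with its integer codes (no NAs, but the
# codes impose an arbitrary ordering on the levels)
dm <- data.matrix(toy)

# model.matrix() expands each factor into 0/1 indicator columns instead
mm <- model.matrix(~ ., data = toy)
colnames(mm)
```

With many-level factors such as code, indicator columns are usually the safer encoding, since integer codes treat the levels as if they were points on a numeric scale.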

Related

Error in MEEM(object, conLin, control$niterEM) in lme function

I'm trying to apply the lme function to my data, but the model gives the following message:
mod.1 = lme(lon ~ sex + month2 + bat + sex*month2, random=~1|id, method="ML", data = AA_patch_GLM, na.action=na.exclude)
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
dput for data, copy from https://pastebin.com/tv3NvChR (too large to include here)
str(AA_patch_GLM)
'data.frame': 2005 obs. of 12 variables:
$ lon : num -25.3 -25.4 -25.4 -25.4 -25.4 ...
$ lat : num -51.9 -51.9 -52 -52 -52 ...
$ id : Factor w/ 12 levels "24641.05","24642.03",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ bat : int -3442 -3364 -3462 -3216 -3216 -2643 -2812 -2307 -2131 -2131 ...
$ year : chr "2005" "2005" "2005" "2005" ...
$ month : chr "12" "12" "12" "12" ...
$ patch_id: Factor w/ 45 levels "111870.17_1",..: 34 34 34 34 34 34 34 34 34 34 ...
$ YMD : Date, format: "2005-12-30" "2005-12-31" "2005-12-31" ...
$ month2 : Ord.factor w/ 7 levels "January"<"February"<..: 7 7 7 7 7 1 1 1 1 1 ...
$ lonsc : num [1:2005, 1] -0.209 -0.213 -0.215 -0.219 -0.222 ...
$ batsc : num [1:2005, 1] 0.131 0.179 0.118 0.271 0.271 ...
What's the problem?
I saw a solution that uses the lme4::lmer function, but is there another option that lets me continue to use the lme function?
The problem is that you have collinear combinations of predictors. In particular, here are some diagnostics:
## construct the fixed-effect model matrix for your problem
X <- model.matrix(~ sex + month2 + bat + sex*month2, data = AA_patch_GLM)
lc <- caret::findLinearCombos(X)
colnames(X)[lc$linearCombos[[1]]]
## [1] "sexM:month2^6" "(Intercept)" "sexM" "month2.L"
## [5] "month2.C" "month2^4" "month2^5" "month2^6"
## [9] "sexM:month2.L" "sexM:month2.C" "sexM:month2^4" "sexM:month2^5"
This is in a weird order, but it suggests that the sex × month interaction is causing problems. Indeed:
with(AA_patch_GLM, table(sex, month2))
## sex January February March April May June December
## F 367 276 317 204 43 0 6
## M 131 93 90 120 124 75 159
shows that you're missing data for one sex/month combination (i.e., no females were sampled in June).
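The same diagnosis can be sketched in base R without caret, by comparing the rank of the fixed-effect model matrix with its number of columns. The toy data below are hypothetical and deliberately contain no females in June.

```r
# Hypothetical toy data with one empty sex/month cell (no "F" in "June")
toy <- data.frame(
  sex   = factor(c("F", "F", "M", "M", "M", "F", "M", "M")),
  month = factor(c("May", "May", "May", "June", "June", "May", "May", "June")),
  bat   = c(-3442, -3364, -3462, -3216, -2643, -2812, -2307, -2131)
)

# Fixed-effect design matrix including the interaction
X <- model.matrix(~ sex * month + bat, data = toy)

# A rank below the column count means some columns are linear combinations
# of the others
ncol(X)      # 5
qr(X)$rank   # 4

# The cross-tabulation shows which combination is missing
table(toy$sex, toy$month)
```

Checking the cross-tabulation first is often the quickest way to spot this kind of problem before fitting.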
You can:
construct the sex/month interaction yourself (data$SM <- with(data, interaction(sex, month2, drop = TRUE))) and use ~ SM + bat — but then you'll have to sort out main effects and interactions yourself (ugh)
construct the model matrix by hand (as above), drop the redundant column(s), then include all the resulting columns in the model:
d2 <- with(AA_patch_GLM,
data.frame(lon,
as.data.frame(X),
id))
## drop linearly dependent column
## note data.frame() has "sanitized" variable names (:, ^ both converted to .)
d2 <- d2[names(d2) != "sexM.month2.6"]
lme(reformulate(colnames(d2)[2:15], response = "lon"),
random=~1|id, method="ML", data = d2)
Again, the results will be uglier than the simpler version of the model.
use a patched version of nlme (I submitted a patch here but it hasn't been considered)
remotes::install_github("bbolker/nlme")

how to calculate GBM accuracy in r

I used the gbm() function to create the model and I want to get the accuracy. Here is my code:
df<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
str(df)
F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])
library(caret)
set.seed(1000)
intrain<-createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train<-df[intrain, ]
test<-df[-intrain, ]
install.packages("gbm")
library("gbm")
df_boosting<-gbm(Creditability~.,distribution = "bernoulli", n.trees=100, verbose=TRUE, interaction.depth=4,
shrinkage=0.01, data=train)
summary(df_boosting)
yhat.boost<-predict (df_boosting ,newdata =test, n.trees=100)
mean((yhat.boost-test$Creditability)^2)
However, when using the summary function, an error appears. The error message is as follows.
Error in plot.window(xlim, ylim, log = log, ...) :
need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
And when measuring the MSE with the mean function, the following warning also appears:
Warning message:
In Ops.factor(yhat.boost, test$Creditability) :
'-' not meaningful for factors
Do you know why these two errors appear? Thank you in advance.
In your code the problem is in the definition of the (binary) response variable Creditability. You declare it as a factor, but gbm with distribution = "bernoulli" needs a numeric 0/1 response, so column 1 must be left out of the columns converted to factors.
Here is the code:
df <- read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
F <- c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])
str(df)
Creditability now is a binary numerical variable:
'data.frame': 1000 obs. of 21 variables:
$ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
$ Account.Balance : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2 ...
$ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
$ Payment.Status.of.Previous.Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3 ...
$ Purpose : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4 ...
...
... and the remaining part of the code works nicely:
library(caret)
set.seed(1000)
intrain <- createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train <- df[intrain, ]
test <- df[-intrain, ]
library("gbm")
df_boosting <- gbm(Creditability~., distribution = "bernoulli",
n.trees=100, verbose=TRUE, interaction.depth=4,
shrinkage=0.01, data=train)
par(mar=c(3,14,1,1))
summary(df_boosting, las=2)
##########
var rel.inf
Account.Balance Account.Balance 36.8578980
Credit.Amount Credit.Amount 12.0691120
Duration.of.Credit..month. Duration.of.Credit..month. 10.5359895
Purpose Purpose 10.2691646
Payment.Status.of.Previous.Credit Payment.Status.of.Previous.Credit 9.1296524
Value.Savings.Stocks Value.Savings.Stocks 4.9620662
Instalment.per.cent Instalment.per.cent 3.3124252
...
##########
yhat.boost <- predict(df_boosting , newdata=test, n.trees=100)
mean((yhat.boost-test$Creditability)^2)
[1] 0.2719788
Hope this can help you.
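Since the question asks for accuracy rather than MSE, here is a minimal base R sketch of that step. By default predict() on a gbm bernoulli model returns values on the link (log-odds) scale; passing type = "response" returns probabilities instead. The scores, labels, and the 0.5 cutoff below are hypothetical stand-ins, not output from the model above.

```r
# Hypothetical link-scale (log-odds) predictions and true 0/1 labels
log_odds <- c(2.1, -0.4, 0.3, -1.7, 0.9, -0.2)
truth    <- c(1,    0,   1,    0,   0,    1)

# Inverse logit: convert log-odds to probabilities
prob <- plogis(log_odds)

# Classify at an assumed cutoff of 0.5
label <- as.integer(prob > 0.5)

# Accuracy is the share of predictions that match the true labels
accuracy <- mean(label == truth)

# Confusion matrix for a fuller picture than a single number
table(predicted = label, actual = truth)
```

With real model output, log_odds would be replaced by yhat.boost and truth by test$Creditability.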

Extract multiple objects from list in R

I have some output from the vegan function specaccum. It is a list of 8 objects of varying lengths;
> str(SPECIES)
List of 8
$ call : language specaccum(comm = PRETEND.DATA, method = "rarefaction")
$ method : chr "rarefaction"
$ sites : num [1:5] 1 2 3 4 5
$ richness : num [1:5] 20.9 34.5 42.8 47.4 50
$ sd : num [1:5] 1.51 2.02 1.87 1.35 0
$ perm : NULL
$ individuals: num [1:5] 25 50 75 100 125
$ freq : num [1:50] 1 2 3 2 4 3 3 3 4 2 ...
- attr(*, "class")= chr "specaccum"
I want to extract three of the list elements ('richness', 'sd' and 'individuals') and convert them to columns in a data frame. I have developed a workaround;
SPECIES.rich <- data.frame(SPECIES[["richness"]])
SPECIES.sd <- data.frame(SPECIES[["sd"]])
SPECIES.individuals <- data.frame(SPECIES[["individuals"]])
SPECIES.df <- cbind(SPECIES.rich, SPECIES.sd, SPECIES.individuals)
But this seems clumsy and protracted. I wonder if anyone could suggest a neater solution? (Should I be looking at something with lapply??) Thanks!
Example data to generate the specaccum output;
set.seed(100)
PRETEND.DATA <- matrix(sample(0:1, 250, replace = TRUE), 5, 50)
library(vegan)
SPECIES <- specaccum(PRETEND.DATA, method = "rarefaction")
We can concatenate the names in a vector and extract it
SPECIES.df <- data.frame(SPECIES[c("richness", "sd", "individuals")])
Another alternative, similar to akrun's, is:
ctoc1 = as.data.frame(cbind(SPECIES$richness, SPECIES$sd, SPECIES$individuals))
Please note that in both cases (my answer and akrun's) you will get an error if the lengths of the columns do not match.
e.g.: SPECIES.df <- data.frame(SPECIES[c( "sd", "freq")])
Error in data.frame(richness = c(20.5549865665613, 33.5688503093388, 41.4708434700877, :
arguments imply differing number of rows:7, 47
If so, pad the shorter element with the length() replacement function:
length(SPECIES$sd) <- 47 # pads sd with NAs up to length 47
SPECIES.df <- data.frame(SPECIES[c("sd", "freq")])
SPECIES.df # data frame with 2 columns and 47 rows
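Both cases can be sketched in base R on a plain list. The list res below is a hypothetical stand-in for the specaccum output, and padding to the longest element is one possible way to handle unequal lengths.

```r
# Hypothetical list standing in for the specaccum output
res <- list(
  method      = "rarefaction",
  richness    = c(20.9, 34.5, 42.8, 47.4, 50),
  sd          = c(1.51, 2.02, 1.87, 1.35, 0),
  individuals = c(25, 50, 75, 100, 125),
  freq        = c(1, 2, 3, 2, 4, 3, 3)
)

# Equal-length elements: subset the list by name, then convert
same_len <- data.frame(res[c("richness", "sd", "individuals")])

# Unequal lengths: pad every selected element with NA up to the longest one
pick   <- res[c("sd", "freq")]
n      <- max(lengths(pick))
padded <- data.frame(lapply(pick, function(v) { length(v) <- n; v }))
padded
```

Padding inside lapply() leaves the original list untouched, which avoids modifying the specaccum object itself.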

daply: Correct results, but confusing structure

I have a data.frame mydf, that contains data from 27 subjects. There are two predictors, congruent (2 levels) and offset (5 levels), so overall there are 10 conditions. Each of the 27 subjects was tested 20 times under each condition, resulting in a total of 10*27*20 = 5400 observations. RT is the response variable. The structure looks like this:
> str(mydf)
'data.frame': 5400 obs. of 4 variables:
$ subject : Factor w/ 27 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
$ congruent: logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ offset : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 5 1 2 5 5 2 2 3 5 ...
$ RT : int 330 343 457 436 302 311 595 330 338 374 ...
I've used daply() to calculate the mean RT of each subject in each of the 10 conditions:
myarray <- daply(mydf, .(subject, congruent, offset), summarize, mean = mean(RT))
The result looks just the way I wanted, i.e. a 3d-array; so to speak 5 tables (one for each offset condition) that show the mean of each subject in the congruent=FALSE vs. the congruent=TRUE condition.
However if I check the structure of myarray, I get a confusing output:
List of 270
$ : num 417
$ : num 393
$ : num 364
$ : num 399
$ : num 374
...
# and so on
...
[list output truncated]
- attr(*, "dim")= int [1:3] 27 2 5
- attr(*, "dimnames")=List of 3
..$ subject : chr [1:27] "1" "2" "3" "5" ...
..$ congruent: chr [1:2] "FALSE" "TRUE"
..$ offset : chr [1:5] "1" "2" "3" "4" ...
This looks totally different from the structure of the prototypical ozone array from the plyr package, even though it's a very similar format (3 dimensions, only numerical values).
I want to compute some further summarizing information on this array, by means of aaply. Precisely, I want to calculate the difference between the congruent and the incongruent means for each subject and offset.
However, even the most basic application of aaply(), such as aaply(myarray,2,mean), returns nonsensical output:
FALSE TRUE
NA NA
Warning messages:
1: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
2: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
I have no idea why the daply() function returns such weirdly structured output and thereby prevents any further use of aaply. Any help is appreciated; I frankly admit that I have hardly any experience with the plyr package.
Since you haven't included your data it's hard to know for sure, but I tried to make a dummy set based on your str() output. You can do what you want (I'm guessing) with two uses of ddply: first the means, then the difference of the means.
library(plyr)
#Make dummy data
mydf <- data.frame(subject = rep(1:5, each = 150),
congruent = rep(c(TRUE, FALSE), each = 75),
offset = rep(1:5, each = 15),
RT = sample(300:500, 750, replace = TRUE))
#Make means
mydf.mean <- ddply(mydf, .(subject, congruent, offset), summarise, mean.RT = mean(RT))
#Calculate difference between congruent and incongruent
mydf.diff <- ddply(mydf.mean, .(subject, offset), summarise, diff.mean = diff(mean.RT))
head(mydf.diff)
# subject offset diff.mean
# 1 1 1 39.133333
# 2 1 2 9.200000
# 3 1 3 20.933333
# 4 1 4 -1.533333
# 5 1 5 -34.266667
# 6 2 1 -2.800000
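As for why the array came out as a list: summarize returns a data frame for each group, so daply() stores those pieces in a list-array; a function that returns a single number per group should give a numeric array instead. A base R sketch of the same computation, with tapply() swapped in for daply(), is shown here. The toy data are hypothetical.

```r
# Hypothetical toy data: 3 subjects, 2 congruency levels, 5 offsets, 4 trials each
set.seed(42)
toy <- expand.grid(trial = 1:4, subject = factor(1:3),
                   congruent = c(FALSE, TRUE), offset = factor(1:5))
toy$RT <- sample(300:500, nrow(toy), replace = TRUE)

# tapply() returns a plain numeric array: subject x congruent x offset
arr <- with(toy, tapply(RT,
                        list(subject = subject, congruent = congruent, offset = offset),
                        mean))
str(arr)

# Congruent minus incongruent mean for each subject and offset
diff_arr <- arr[, "TRUE", ] - arr[, "FALSE", ]
diff_arr
```

Because arr is an ordinary numeric array, functions such as apply() or aaply() can then be used on it without the "argument is not numeric or logical" warning.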

I get error "Error in nnet.default(x, y, w, ...) : too many (77031) weights" while training neural networks

I am trying to train neural networks in R using package nnet. Following is the information about my training data.
str(traindata)
'data.frame': 10327 obs. of 196 variables:
$ stars : num 5 5 5 3.5 3.5 4.5 3.5 5 5 3.5 ...
$ open : num 1 1 1 1 1 1 1 1 1 1 ...
$ city : Factor w/ 61 levels "ahwatukee","anthem",..: 36 38
$ review_count : int 3 5 4 5 14 6 21 4 14 10 ...
$ name : Factor w/ 8204 levels " leftys barber shop",..:
$ longitude : num -112 -112 -112 -112 -112 ...
$ latitude : num 33.6 33.6 33.5 33.4 33.7 ...
$ greek : int 0 0 0 0 0 0 0 0 0 0 ...
$ breakfast...brunch : int 0 0 0 0 0 0 0 0 0 0 ...
$ soup : int 0 0 0 0 0 0 0 0 0 0 ...
I have truncated this information.
When I run the following:
library(nnet)
m4 <- nnet(stars~.,data=traindata,size=10, maxit=1000)
I get the following error:
Error in nnet.default(x, y, w, ...) : too many (84581) weights
When I try changing weights in the argument like:
m4 <- nnet(stars~.,data=traindata,size=10, maxit=1000,weights=1000)
Then I get the following error:
Error in model.frame.default(formula = stars ~ ., data = traindata, weights = 1000) :
variable lengths differ (found for '(weights)')
What is the mistake I am making? How do I avoid or correct this error? Maybe the problem is with my understanding of "weights".
Either increase MaxNWts to something that will accommodate the size of your model, or reduce size to make your model smaller.
You probably also want to think some more on exactly which variables to include in the model. Just looking at the data provided, name is a factor with more than 8000 levels; you're not going to get anything sensible out of it with only 10000 observations. city might be more useful, but again, 61 levels in something as complex as a neural net is likely to be marginal.
Increase 'MaxNWts' option to something larger than 84581.
The argument that raises the number of weights allowed in the network is MaxNWts, not weights (which supplies a case weight for each sample).
Increase the MaxNWts parameter by passing it directly:
m4 <- nnet(stars~.,data=traindata,size=10, maxit=1000,MaxNWts=84581)
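To see where a weight count like this comes from, here is a small base R sketch of the arithmetic for a single-hidden-layer network without skip-layer connections. The helper n_weights and the toy data are hypothetical; the figure of 8456 input columns is inferred from the reported count, not taken from the question's data.

```r
# Hypothetical helper: (inputs + 1) * size weights into the hidden layer
# (the +1 is the bias), plus (size + 1) * outputs weights into the output layer
n_weights <- function(n_inputs, size, n_outputs = 1) {
  (n_inputs + 1) * size + (size + 1) * n_outputs
}

# Hypothetical toy data: a 4-level factor expands to 3 dummy columns
toy <- data.frame(stars = c(5, 3.5, 4, 2, 4.5, 3),
                  city  = factor(c("a", "b", "c", "d", "a", "b")),
                  review_count = c(3, 5, 4, 5, 14, 6))

# Number of input columns after factor expansion, excluding the intercept
p <- ncol(model.matrix(stars ~ ., data = toy)) - 1
p                            # 4
n_weights(p, size = 10)      # 61

# 8456 input columns with size = 10 reproduce the count in the error message
n_weights(8456, size = 10)   # 84581
```

Most of those input columns would come from the name factor, so dropping it before fitting shrinks the network far more than reducing size does.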
