RMSLE validation between rpart model and test set showing NA and zero - r

I am seeking a little help as I have hit a wall.
I have trained a CART model on a training dataset and want to validate its accuracy with RMSLE on a test set.
I have the following:
'data.frame': 5463 obs. of 15 variables:
$ Start_date: chr "2011-01-20 02:00:00" "2011-01-20 05:00:00" "2011-01-20 06:00:00"
$ Season : Factor w/ 4 levels "spring","summer",..: 1 1 1 1 1 1 1 1 1 1
$ Holiday : Factor w/ 2 levels "Not","Holiday": 1 1 1 1 1 1 1 1 1 1 ...
$ Workingday: int 1 1 1 1 1 1 1 1 1 1 ...
$ Weather : Factor w/ 4 levels "Clear","Cloudy",..: 1 1 1 1 1 2 1 2 2 2..
$ Temp : num 10.66 9.84 9.02 9.02 9.02 ...
$ Humidity : int 56 60 60 55 55 52 48 45 42 45 ...
$ Windspeed : num 0 15 15 15 19 ...
$ Count : num 1 1 1 8 18 6 3 4 5 3 ...
$ Date : chr "2011-01-20" "2011-01-20" "2011-01-20" "2011-01-20" ...
$ Hour : Factor w/ 24 levels "00","01","02",..: 3 6 7 8 9 10 11 12 .
$ Year : chr "2011" "2011" "2011" "2011" ...
$ Month : chr "01" "01" "01" "01" ...
$ Weekday : Factor w/ 7 levels "Friday","Monday",..: 5 5 5 5 5 5 5 5 5 5
$ Hour_Bin : num 0 0 0 0 0 0 0 0 0 0 ...
$ temp_Bin : num 1 1 1 2 2 2 2 2 2 2 ...
$ year_Bin : num 1 1 1 1 1 1 1 1 1 1 ...
The predicted values are a named numeric vector:
Named num [1:5463] 9 9 9 9 9 9 9 9 9 9 ...
- attr(*, "names")= chr [1:5463] "9266" "9267" "9268" "9269" ...
I have used the function:
Evaluate_Model <- function (test, pred) {
return(sqrt(1/nrow(test)*sum((log(pred+1)-log(test$Count+1))^2)))
}
and also tried the Metrics package
library('Metrics')
rmsle(test$Count, pred)
When I try to get the Root Mean Squared Logarithmic Error, I get either [0] or [NA].
I have gone through the process of converting the Count variable to different data types, and also tried putting the predictions into a data frame and evaluating them from there.
I have also trained a model with a single attribute and tried to evaluate it, but I am still getting the same result.
My target variable (Count) and the other attributes contain zeros, but these are real values, not NAs.
Is it the training of the algorithm, or the data types?
Any help would be appreciated.
A sample of the model code:
model3 <- rpart(Count~Month+Temp, data = train)
# round prediction
pred <- round(predict(model3, newdata = test))
Evaluate_Model(test, pred)
Thanks in advance.
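One way to narrow this down is to validate the inputs before computing the metric. The following is a minimal sketch in base R, not a diagnosis of the model above: the helper name rmsle_checked and the toy vectors are made up for illustration, and it assumes the NA comes from missing or negative values reaching log(). The length check is there because, in base R, arithmetic with a zero-length vector gives a zero-length result whose sum() is 0, so a NULL column (for example a misspelled test$count instead of test$Count) would make Evaluate_Model return exactly 0.

```r
# Hypothetical helper: RMSLE that reports problematic pairs instead of
# silently returning NA/NaN or 0. Base R only.
rmsle_checked <- function(actual, pred) {
  pred <- unname(pred)                       # drop the labels attached by predict()
  stopifnot(length(actual) == length(pred))  # catches NULL or mismatched inputs
  bad <- is.na(actual) | is.na(pred) | actual < 0 | pred < 0
  if (any(bad)) {
    message(sum(bad), " pair(s) skipped: NA or negative value")
  }
  # log1p(x) is log(1 + x), the same transform as log(x + 1) in Evaluate_Model
  sqrt(mean((log1p(pred[!bad]) - log1p(actual[!bad]))^2))
}

# Toy data, not the data from the question
actual <- c(0, 1, 8, 18, NA)
pred   <- c(1, 1, 9,  9,  9)
score  <- rmsle_checked(actual, pred)
```

If the check passes and the score is still 0, comparing summary(pred) with summary(test$Count) would show whether the two vectors really are identical.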

Related

SensoMineR panelperf error in contrasts: contrasts can be applied only to factors with 2 or more levels

I am trying to run the following code using the panelperf function from the SensoMineR package:
panelperf<-panelperf(data,formul="~Product+Panelist+rep+Product:Panelist+Product:rep+Panelist:rep", firstvar=4)
My data frame consists of:
column 1: panelists, filled with the names of the panelists
column 2: product, filled with the names of the products
column 3: rep, the replicate of the product measured
column 4 to the end: the variables that were measured (light fruit, dark fruit, alcohol, sourness, etc.)
All my variables are dbl
But I get the following error when running the function:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In xtfrm.data.frame(x) : cannot xtfrm data frames
3: In xtfrm.data.frame(x) : cannot xtfrm data frames
sapply(lapply(data, unique), length)
Product Panelist rep Lightfruit Darkfruit Applepear Citrus DryFruit Nutty Vegetables Earthy Floral
13 12 2 87 83 72 67 76 69 76 67 66
Woddy hgt Chemical Chocolate Honey Cheesy Alcohol Overallaroma Astringent Sour Hot Viscocity
62 64 80 57 65 69 86 85 88 85 85 83
Sweet Bitter
87 86
So none of the variables has fewer than 2 distinct values, which is what the error seems to suggest.
I have been reading answers about this error, but either they don't apply to my case or, as a not very experienced R user, I cannot follow what they suggest.
I would appreciate your help a lot!
Thank you!
And let me know if you need more information!
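The factor levels can be checked with code like the following, which keeps the factors that have more than one level together with all non-factor columns: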
i1 <- sapply(data, \(x) is.factor(x) && nlevels(x) > 1)
i2 <- !sapply(data, is.factor)
data1 <- data[i1|i2]
Here is an example that shows the issue. With data in which every factor has more than one level, it works:
library(SensoMineR)
data(chocolates)
res <- panelperf(sensochoc, firstvar = 5, formul = "~Product+Panelist+
Session+Product:Panelist+Session:Product+Panelist:Session")
> str(res)
List of 4
$ p.value : num [1:14, 1:6] 8.85e-14 6.44e-08 1.75e-28 3.74e-40 1.18e-22 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:14] "CocoaA" "MilkA" "CocoaF" "MilkF" ...
.. ..$ : chr [1:6] "Product " "Panelist " "Session " "Product:Panelist" ...
$ variability: num [1:14, 1:6] 0.139 0.103 0.397 0.532 0.267 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:14] "CocoaA" "MilkA" "CocoaF" "MilkF" ...
.. ..$ : chr [1:6] "Product " "Panelist " "Session " "Product:Panelist" ...
$ res : num [1:14, 1] 1.87 1.89 1.41 1.47 1.63 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:14] "CocoaA" "MilkA" "CocoaF" "MilkF" ...
.. ..$ : chr "stdev residual"
$ r2 : num [1:14, 1] 0.673 0.761 0.846 0.882 0.862 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:14] "CocoaA" "MilkA" "CocoaF" "MilkF" ...
.. ..$ : chr "r2"
> str(sensochoc)
'data.frame': 348 obs. of 18 variables:
$ Panelist : Factor w/ 29 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...
$ Session : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ Rank : Factor w/ 6 levels "1","2","3","4",..: 1 6 3 5 2 4 1 4 3 5 ...
$ Product : Factor w/ 6 levels "choc1","choc2",..: 6 3 2 1 4 5 4 3 6 2 ...
$ CocoaA : int 7 6 8 7 8 7 6 4 5 5 ...
$ MilkA : int 6 7 6 8 5 5 1 2 1 2 ...
$ CocoaF : int 6 2 5 8 4 3 8 3 8 8 ...
$ MilkF : int 5 7 4 3 4 5 1 4 1 1 ...
$ Caramel : int 5 8 7 3 4 6 0 0 0 0 ...
$ Vanilla : int 3 4 4 2 4 2 0 0 0 0 ...
$ Sweetness : int 7 7 5 4 5 5 1 5 1 0 ...
$ Acidity : int 2 2 5 7 6 4 0 0 0 0 ...
$ Bitterness : int 4 2 6 8 6 7 3 0 3 6 ...
$ Astringency: int 5 3 6 6 4 4 0 0 0 0 ...
$ Crunchy : int 8 3 7 3 6 6 8 4 6 8 ...
$ Melting : int 3 8 5 2 3 6 5 8 2 2 ...
$ Sticky : int 4 6 4 3 7 4 0 3 1 4 ...
$ Granular : int 3 5 3 5 3 7 0 1 1 0 ...
If we set level 2 of 'Session' (a factor with only 2 levels) to NA, a single level remains and the error appears:
levels(sensochoc$Session)[2] <- NA
res1 <- panelperf(sensochoc, firstvar = 5, formul = "~Product+Panelist+
Session+Product:Panelist+Session:Product+Panelist:Session")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
With the OP's check, the data still appear to have more than 1 level, because unique() also returns NA when it is present, so the length includes the NA element; here it is 2 in total:
> sapply(lapply(sensochoc, unique), length)
Panelist Session Rank Product CocoaA MilkA CocoaF MilkF Caramel Vanilla Sweetness Acidity
29 2 6 6 11 11 11 11 11 10 11 11
Bitterness Astringency Crunchy Melting Sticky Granular
11 11 11 11 11 11
whereas with the code in this post, nlevels() ignores the NA and returns only the count of non-NA levels:
i1 <- sapply(sensochoc, \(x) is.factor(x) && nlevels(x) > 1)
i2 <- !sapply(sensochoc, is.factor)
names(sensochoc)[i1]
[1] "Panelist" "Rank" "Product"
names(sensochoc)[sapply(sensochoc, is.factor)]
[1] "Panelist" "Session" "Rank" "Product"
Session is omitted. We may need to change the formula to omit the terms that include Session.
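The difference between the two checks can be reproduced on a toy factor. This is a minimal base R sketch; the helper name single_level_factors and the demo data frame are hypothetical and are not part of SensoMineR.

```r
# Toy factor with two levels; the second level is then set to NA
f <- factor(c("1", "2", "1", "2"))
levels(f)[2] <- NA

n_unique <- length(unique(f))  # NA is counted as a distinct value
n_levels <- nlevels(f)         # NA is not counted as a level

# Hypothetical helper: names of factor columns with fewer than 2 levels in use
single_level_factors <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  n_lev <- vapply(df[is_fac], function(x) nlevels(droplevels(x)), integer(1))
  as.character(names(n_lev)[n_lev < 2])
}

demo <- data.frame(
  Session = f,
  Product = factor(c("a", "b", "a", "b")),
  score   = 1:4
)
flagged <- single_level_factors(demo)
```

Here n_unique is 2 while n_levels is 1, and flagged contains only "Session". Running the helper on the real data before calling panelperf would list the columns to fix or to drop from the formula.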

Issues producing a ROC curve with a KNN Model - undefined columns

I do apologize if this is rudimentary; however, I have gone through the traceback and tried googling to no real avail. Every time I run my code to produce a ROC curve I get:
Error in `[.data.frame`(data, , class) : undefined columns selected
I have checked the data and they are single-column character vectors (as required).
library(cutpointr)
Temp1 <- predict(KnnModel, newdata=TestData, type="prob")
KnnProbs <- predict(object = KnnModel, newdata = TestData, type = "prob")
KnnProbs <- as.character(KnnProbs$`0`)
clch <- as.character(TrainData$loan_status)
KnnROC <- roc(data = TestData$loan_status, x = KnnProbs, class = clch)
plot(KnnROC, print.auc = T)
Any ideas as to what I am doing wrong and how to fix this?
EDIT: The TrainData has the following structure:
'data.frame': 1500 obs. of 13 variables:
$ loan_amnt : num 6000 17625 8500 5000 10000 ...
$ loan_status : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ int_rate : num 13.33 15.61 6.68 6.92 14.98 ...
$ term : num 1 1 1 1 1 1 2 1 1 1 ...
$ installment : num 203 616 261 154 347 ...
$ grade : num 3 4 1 1 3 4 4 3 2 1 ...
$ emp_length : num 10 11 3 3 8 3 3 3 2 1 ...
$ annual_inc : num 30000 49000 53100 60000 37000 ...
$ dti : num 25.5 12.2 26.2 27.8 31.4 ...
$ sub_grade : num 13 16 3 4 13 16 18 12 8 4 ...
$ verification_status: num 1 2 2 3 3 3 3 1 3 1 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 5 2 2 2 2 6 2 ...
$ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
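Two things in the code above may be worth checking: the class labels are taken from TrainData while the probabilities come from TestData, and the probabilities are converted to character. The error text also suggests that roc() indexes data as a data frame using class as a column name, although that is a reading of the message rather than a confirmed description of the package. As a package-independent cross-check, the sketch below computes ROC points and a trapezoidal AUC in base R. The helper names roc_points and auc_trapezoid and the toy data are hypothetical, scores and labels are assumed to come from the same rows, and tied scores are simply taken in sorted order.

```r
# Hypothetical helpers, base R only; this is not the cutpointr API.
roc_points <- function(score, label, positive) {
  stopifnot(is.numeric(score), length(score) == length(label))
  is_pos <- as.character(label) == positive
  ord <- order(score, decreasing = TRUE)  # highest score first
  is_pos <- is_pos[ord]
  list(
    fpr = c(0, cumsum(!is_pos) / sum(!is_pos)),
    tpr = c(0, cumsum(is_pos) / sum(is_pos))
  )
}

# Area under the curve by the trapezoidal rule
auc_trapezoid <- function(r) {
  sum(diff(r$fpr) * (head(r$tpr, -1) + tail(r$tpr, -1)) / 2)
}

# Toy data: score is the predicted probability of class "0"
score <- c(0.9, 0.8, 0.3, 0.1)
label <- factor(c("0", "0", "1", "1"))
r <- roc_points(score, label, positive = "0")
auc_value <- auc_trapezoid(r)
# plot(r$fpr, r$tpr, type = "s") would draw the curve
```

With the objects from the question, the equivalent inputs would be the numeric KnnProbs$`0` column and TestData$loan_status, which have one entry per test row.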

R - geeglm Error: contrasts can be applied only to factors with 2 or more levels

I have applied GEE to the following dataset (str as below). Everything is fine.
> str(cd4.5m2)
'data.frame': 1300 obs. of 7 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Treatment: Factor w/ 4 levels "Alternating",..: 2 2 2 2 2 1 1 1 1 1 ...
$ Age : num 36.4 36.4 36.4 36.4 36.4 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ logcd4 : num 3.14 3.04 2.77 2.83 3.22 ...
$ Week : num 0 7.57 15.57 23.57 32.57 ...
$ Time : int 0 1 2 3 4 0 1 2 3 4 ...
I then transformed the outcome variable because we want to monitor the change over time. The str of the transformed data is shown below; it is almost exactly the same as the previous one (other than some name changes).
> str(cd4.5m1)
'data.frame': 1300 obs. of 6 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Treatment : Factor w/ 4 levels "Alternating",..: 2 1 4 1 3 3 1 4 1 3 ...
$ Age : num 36.4 35.9 47.5 37.3 42.7 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 2 2 ...
$ Week : num 1 1 1 1 1 1 1 1 1 1 ...
$ cd4.change.norm: num 0.572 0.572 0.572 0.572 0.572 ...
I then ran the GEE again and it gave me the error.
> gee1.default <- geeglm(cd4.change.norm ~ Treatment, data=cd4.5m1, id=id, family=gaussian, corstr="unstructured")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I also tested all variables in the data; they all contain multiple values. So I'm completely lost here. I have also seen a lot of posts on this error, but none seems to address my issue. Help appreciated!
I changed the correlation structure to AR1, and it worked. I did test the correlation (it decreased over time) and AR1 is the correct structure to use.
But normally unstructured should be the safe option?
I just reordered my data and it works. I'd suggest you try reordering your data by id, like cd4.5m1<-cd4.5m1[order(cd4.5m1$id),]. Credits: KDG
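The reordering suggestion fits the str output above, where the id column of cd4.5m1 runs 1 2 3 4 5 ... instead of repeating within each subject. As I read the geeglm help page, rows belonging to the same cluster are expected to be contiguous. A minimal base R sketch of that check follows; the helper name ids_contiguous and the toy data are hypothetical.

```r
# Hypothetical helper: TRUE if every id occupies one contiguous run of rows
ids_contiguous <- function(id) {
  runs <- rle(as.character(id))$values  # one entry per run of equal ids
  anyDuplicated(runs) == 0              # an id appearing in two runs is split
}

# Toy data with interleaved ids, similar in layout to cd4.5m1
toy <- data.frame(id = factor(c(1, 2, 3, 1, 2, 3)), y = c(5, 6, 7, 8, 9, 10))
before <- ids_contiguous(toy$id)

toy_sorted <- toy[order(toy$id), ]      # same reordering as suggested above
after <- ids_contiguous(toy_sorted$id)
```

Here before is FALSE and after is TRUE. Running the same check on cd4.5m1$id before fitting would confirm whether the ordering is the problem.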

lmer Error: number of levels of each grouping factor must be < number of observations

I would like to do an ANOVA to find out where there are significant differences. I have already searched for an answer to my problem but cannot find the mistake.
names:
[1] "Tier_ID" "species" "Klima" "Ressource" "Datum" "Gewicht" "IngestionRate"
data frame:
'data.frame': 70 obs. of 7 variables:
$ Tier_ID : Factor w/ 70 levels "Raupe1","Raupe10",..: 1 12 23 34 45 56 67 69 70 2 ...
$ species : Factor w/ 1 level "Agrotis exclamationis": 1 1 1 1 1 1 1 1 1 1 ...
$ Klima : Factor w/ 1 level "BL4": 1 1 1 1 1 1 1 1 1 1 ...
$ Ressource : Factor w/ 3 levels "Kontrolle","N",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Datum : Factor w/ 1 level "06.08.2015": 1 1 1 1 1 1 1 1 1 1 ...
$ Gewicht : num 7.8 4.1 10.8 51.2 33.3 17.9 40.6 11.7 35.1 7.1 ...
$ IngestionRate: num 0.385 1.057 1.598 0.164 0.396 ...
I made subsets like this:
K_NPK<-subset(Agro,Agro$Ressource!="N")
my model:
mod4 <- lmer(IngestionRate~Ressource+(1|Gewicht), data=K_NPK)
it answers:
Error: number of levels of each grouping factor must be < number of observations
But if I do this subset
N_K<-subset(Agro,Agro$Ressource!="NPK")
and this model
mod4 <- lmer(IngestionRate~Ressource+(1|Gewicht), data=N_K)
it runs without error.
I hope you understand what I am trying to do.
Can anybody tell me what's wrong?
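The error message states the condition being checked: the grouping factor, here Gewicht in (1|Gewicht), must have fewer levels than there are observations. A quick way to see which subset violates that is to compare the two counts. This is a minimal base R sketch; the helper name levels_vs_rows and the toy data frames are hypothetical and only borrow the column name from the question.

```r
# Hypothetical helper: number of distinct grouping values versus number of rows
levels_vs_rows <- function(df, group) {
  c(levels = length(unique(df[[group]])), rows = nrow(df))
}

# Toy data: every weight is distinct, so each row would be its own group
toy_all_distinct <- data.frame(Gewicht = c(7.8, 4.1, 10.8, 51.2), y = 1:4)
# Toy data: one weight repeats, so there are fewer groups than rows
toy_with_repeat <- data.frame(Gewicht = c(7.8, 4.1, 7.8, 51.2), y = 1:4)

a <- levels_vs_rows(toy_all_distinct, "Gewicht")
b <- levels_vs_rows(toy_with_repeat, "Gewicht")
```

If levels_vs_rows(K_NPK, "Gewicht") shows equal counts while levels_vs_rows(N_K, "Gewicht") does not, that would explain why only one subset fails. It would also raise the question of whether a continuous weight is a sensible grouping factor at all.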

Error in `row.names<-.data.frame using mlogit in R language

Here are the steps I'm following to fit a multinomial logit model.
> z<-read.table("2008 Racedata.txt", header=TRUE, sep="\t", row.names=NULL)
> head(z)
datekey raceno horseno place winner draw winodds log_odds jwt hwt
1 2008091501 1 8 1 1 2 12.0 2.484907 128 1170
2 2008091501 1 11 2 0 3 8.6 2.151762 123 1135
3 2008091501 1 6 3 0 5 7.0 1.945910 127 1114
4 2008091501 1 12 4 0 10 23.0 3.135494 123 1018
5 2008091501 1 14 5 0 4 11.0 2.397895 113 1027
6 2008091501 1 5 6 0 14 50.0 3.912023 131 972
> x<-mlogit.data(z,choice="winner",shape="long",id.var="datekey",alt.var="horseno")
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.8", "1.11", "1.6", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘10.2’, ‘10.4’, ‘10.8’,
‘100.7’, ‘101.12’, ‘102.1’, ‘102.3’, ‘103.2’, ‘103.4’,
‘103.6’, ‘104.12’, ‘104.3’, ‘104.9’, ‘105.1’, ‘105.5’,
‘105.6’, ‘105.8’, ‘106.11’, ‘106.12’, ‘106.13’, ‘106.7’,
‘107.10’, ‘107.14’, ‘107.3’, ‘108.12’, ‘108.2’, ‘108.6’,
‘108.9’, ‘109.1’, ‘109.14’, ‘109.7’, ‘11.12’, ‘11.5’,
‘11.9’, ‘110.2’, ‘110.3’, ‘110.4’, ‘110.9’, ‘111.1’,
‘111.7’, ‘112.12’, ‘112.3’, ‘112.6’, ‘112.8’, ‘113.10’,
‘113.13’, ‘113.7’, ‘114.12’, ‘114.2’, ‘114.9’, ‘115.10’,
‘115.13’, ‘115.5’, ‘116.11’, ‘116.6’, ‘117.14’, ‘117.3’,
‘117.7’, ‘118.1’, ‘118.13’, ‘118.2’, ‘118.9’, ‘119.10’,
‘119.5’, ‘119.6’, ‘119.8’, ‘12.1’, ‘12.10’, ‘12.3’,
‚Äò12.6‚Äô, ‚Äò120.2‚Äô, ‚Äò120.4‚Äô, ‚Äò120.7‚ [... truncated]
>
What step am I missing here? Why the duplicates in row.names?
Thanks,
Walt
Two problems.
You seem to have an encoding problem, since we are seeing lots of umlauts and accent marks in that error message. Furthermore, I am wondering whether that datekey column got converted to a factor.
In this case the error refers to the construction of the row.names attribute of the new object, x. If you do:
with( z, table( datekey, horseno) )
... you may see a horse with multiple entries on the same day.
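A table over 733 datekey values is hard to scan by eye, so the same check can be done by listing only the rows whose key combination occurs more than once. This is a minimal base R sketch; the helper name duplicated_keys and the toy data are hypothetical.

```r
# Hypothetical helper: all rows whose key combination appears more than once
duplicated_keys <- function(df, keys) {
  key_df <- df[keys]
  dup <- duplicated(key_df) | duplicated(key_df, fromLast = TRUE)
  df[dup, , drop = FALSE]
}

# Toy data, not the race data from the question
toy <- data.frame(
  datekey = c(2008091501, 2008091501, 2008091501, 2008091502),
  horseno = c(8, 11, 8, 8),
  winner  = c(1, 0, 0, 1)
)
dups <- duplicated_keys(toy, c("datekey", "horseno"))
```

An empty result from duplicated_keys(z, c("datekey", "horseno")) would rule out repeated datekey and horseno pairs as the source of the duplicate row names.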
Actually there were no duplicate datekey x horseno combos. Converting horseno and datekey with as.character() and then switching the shape argument from "long" to "wide" runs without error and gives this result:
z$datekey <- as.character(z$datekey)
z$horseno <- as.character(z$horseno)
x<-mlogit.data(z,choice="winner",shape="wide",id.var="datekey",alt.var="horseno")
str(x)
#----------
Classes ‘mlogit.data’ and 'data.frame': 18312 obs. of 11 variables:
$ datekey : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
$ raceno : int 1 1 1 1 1 1 1 1 1 1 ...
$ horseno : chr "0" "1" "0" "1" ...
$ place : int 1 1 2 2 3 3 4 4 5 5 ...
$ winner : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ draw : int 2 2 3 3 5 5 10 10 4 4 ...
$ winodds : num 12 12 8.6 8.6 7 7 23 23 11 11 ...
$ log_odds: num 2.48 2.48 2.15 2.15 1.95 ...
$ jwt : int 128 128 123 123 127 127 123 123 113 113 ...
$ hwt : int 1170 1170 1135 1135 1114 1114 1018 1018 1027 1027 ...
$ chid : num 1 1 2 2 3 3 4 4 5 5 ...
- attr(*, "index")='data.frame': 18312 obs. of 3 variables:
..$ chid: Factor w/ 9156 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
..$ alt : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 1 2 ...
..$ id : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "choice")= chr "winner"
