Contrasts can be applied only to factor - r

I have a question about R.
I am using a test called levene.test to test homogeneity of variance.
I know that you need a factor variable with at least two levels in order for this to work. And from what I see, I do have at least two levels for the factor variable that I am using. But somehow I keep getting the error of:
> nocorlevene <- levene.test(geno1rs11809462$SIF1, geno1rs11809462$k, correction.method = "correction.factor")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I even tried generating a variable from a binomial distribution:
k <- rbinom(1304, 1, 0.5)
and then used that as a factor, but it still does not work.
Lastly, I created a variable with 3 levels:
k <- sample(c(1, 0, 2), 1304, replace = TRUE)
but somehow it still does not work, and I get the same error:
nocorlevene <- levene.test(geno1rs11809462$SIF1, geno1rs11809462$k, correction.method="zero.removal")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
This is the output of the type of the variable in the data:
> str(geno1rs11809462)
'data.frame': 1304 obs. of 16 variables:
$ id : chr "WG0012669-DNA_A03_K05743" "WG0012669-DNA_A04_K05752" "WG0012669-DNA_A05_K05761" "WG0012669-DNA_A06_K05785" ...
$ rs11809462 : Factor w/ 2 levels "2/1","2/2": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr "WG0012669-DNA_A03_K05743" "WG0012669-DNA_A04_K05752" "WG0012669-DNA_A05_K05761" "WG0012669-DNA_A06_K05785" ...
$ FID : chr "9370" "9024" "14291" "4126" ...
$ AGE_CALC : num 61 47 NA 62.5 55.6 59.7 46.6 41.2 NA 46.6 ...
$ MREFSUM : num 185 325 NA 211 212 ...
$ NORSOUTH : Factor w/ 3 levels "0","1","NA": 1 1 3 1 1 1 1 1 3 1 ...
$ smoke1 : Factor w/ 3 levels "0","1","NA": 2 2 3 1 1 1 2 1 3 1 ...
$ smoke2 : Factor w/ 3 levels "0","1","NA": 1 1 3 2 2 2 1 2 3 2 ...
$ ANYCG60 : num 0 0 NA 1 0 0 0 0 NA 1 ...
$ DCCT_HBA_MEAN: num 7.39 6.93 NA 7.37 7.56 7.86 6.22 8.88 NA 8.94 ...
$ EDIC_HBA : num 7.17 7.63 NA 8.66 9.68 7.74 6.59 9.34 NA 7.86 ...
$ HBAEL : num 7.3 8.82 NA 9.1 9.3 ...
$ ELDTED_HBA : num 7.23 7.76 NA 8.36 9.21 7.92 6.64 9.64 NA 9.09 ...
$ SIF1 : num 19.6 17 NA 23.8 24.1 ...
$ sex : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
$ k : Factor w/ 3 levels "0","1","2": 1 1 2 3 1 3 3 3 1 2 ...
As you can see, the variables k and sex have 3 and 2 levels respectively, but I still get that error message.
> head(geno1rs11809462)
id rs11809462 FID AGE_CALC MREFSUM NORSOUTH smoke1 smoke2 ANYCG60
1 WG0012669-DNA_A03_K05743 2/2 9370 61.0 184.5925 0 1 0 0
2 WG0012669-DNA_A04_K05752 2/2 9024 47.0 325.0047 0 1 0 0
3 WG0012669-DNA_A05_K05761 2/2 14291 NA NA NA NA NA NA
4 WG0012669-DNA_A06_K05785 2/2 4126 62.5 211.2557 0 0 1 1
5 WG0012669-DNA_A08_K05802 2/2 11280 55.6 212.2922 0 0 1 0
6 WG0012669-DNA_A09_K05811 2/2 11009 59.7 261.0116 0 0 1 0
DCCT_HBA_MEAN EDIC_HBA HBAEL ELDTED_HBA SIF1 sex k
1 7.39 7.17 7.30 7.23 19.6136 0 0
2 6.93 7.63 8.82 7.76 17.0375 0 0
3 NA NA NA NA NA 1 1
4 7.37 8.66 9.10 8.36 23.8333 1 2
5 7.56 9.68 9.30 9.21 24.1338 1 0
6 7.86 7.74 8.53 7.92 25.7272 1 2
If anyone can give me some hints as to why this is happening, that would be great. I just don't know why the variables k or sex, which have multiple levels, give me this error when I run the test.
Thank you.

I think I may have solved the problem. I believe it is due to NA values in the data, because after I removed the rows containing NA using, say,
x <- na.omit(original_data)
and then applied the Levene test to x, the error message disappeared.
Hopefully this is the cause of the problem.
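A minimal sketch of how one might check this before running the test. It assumes that rows with a missing response or group are effectively dropped, which the thread does not confirm for levene.test itself; the data frame df and its columns y and g are made-up stand-ins for the real data.

```r
# Toy data standing in for the real data frame (hypothetical values).
df <- data.frame(
  y = c(1.2, NA, 3.4, 2.2, NA, 4.1),
  g = factor(c("0", "1", "0", "0", "1", "0"))
)

# Keep only rows where both the response and the group are observed.
cc <- complete.cases(df[, c("y", "g")])

# Usable observations per group; a group with 0 rows has effectively vanished.
usable <- table(df$g[cc])
print(usable)

# Number of groups that still have data once incomplete rows are dropped.
n_groups <- nlevels(droplevels(df$g[cc]))
print(n_groups)
```

If n_groups comes out below 2, the grouping factor has fewer than two usable levels even though str() still reports more.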

If your factor has only one level, you will get this error. To check to see the levels of your factor variables, use lapply(df, levels). It will return nothing for non-factor variables, but will easily let you identify which variable is the offender. This is especially helpful if, like me, you have hundreds of variables.
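As a sketch of the same idea that flags the offenders directly: the snippet below counts the levels actually in use per factor column (after droplevels(), since declared but unused levels still show up in levels()). The data frame and column names are invented for illustration.

```r
# Toy data frame (hypothetical): 'a' has one level, 'b' has two, 'n' is numeric.
df <- data.frame(
  a = factor(c("x", "x", "x")),
  b = factor(c("u", "v", "u")),
  n = c(1, 2, 3)
)

# Levels in use for each factor column; NA for non-factor columns.
n_used <- sapply(df, function(col) {
  if (is.factor(col)) nlevels(droplevels(col)) else NA_integer_
})

# Names of factor columns with fewer than two levels in use.
offenders <- names(n_used)[!is.na(n_used) & n_used < 2]
print(offenders)
```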

You need to actually convert your variable to a factor. Just having three (or any finite number of) distinct values does not necessarily make it a factor.
Use x <- factor(x) to convert.
When you look at the output of str(), it shows you the type of each variable:
<..cropped..>
$ SIF1 : num 19.6 17 NA 23.8 24.1 ...
$ sex : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
$ k : Factor w/ 3 levels "0","1","2": 1 1 2 3 1 3 3 3 1 2 ...
notice that $k is a factor but SIF1 is not
Thus, use
geno1rs11809462$SIF1 <- factor(geno1rs11809462$SIF1)

Related

Issues producing a ROC curve with a KNN Model - undefined columns

I apologize if this is rudimentary, but I have gone through the traceback and tried googling to no real avail. Every time I run my code to produce a ROC curve, I get
Error in `[.data.frame`(data, , class) : undefined columns selected
I have checked the data and they are single-column character vectors (as required):
library(cutpointr)
Temp1 <- predict(KnnModel, newdata=TestData, type="prob")
KnnProbs <- predict(object = KnnModel, newdata = TestData, type = "prob")
KnnProbs <- as.character(KnnProbs$`0`)
clch <- as.character(TrainData$loan_status)
KnnROC <- roc(data = TestData$loan_status, x = KnnProbs, class = clch)
plot(KnnROC, print.auc = T)
Any ideas as to what I am doing wrong and how to fix this?
EDIT: The TrainData is of the following
'data.frame': 1500 obs. of 13 variables:
$ loan_amnt : num 6000 17625 8500 5000 10000 ...
$ loan_status : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ int_rate : num 13.33 15.61 6.68 6.92 14.98 ...
$ term : num 1 1 1 1 1 1 2 1 1 1 ...
$ installment : num 203 616 261 154 347 ...
$ grade : num 3 4 1 1 3 4 4 3 2 1 ...
$ emp_length : num 10 11 3 3 8 3 3 3 2 1 ...
$ annual_inc : num 30000 49000 53100 60000 37000 ...
$ dti : num 25.5 12.2 26.2 27.8 31.4 ...
$ sub_grade : num 13 16 3 4 13 16 18 12 8 4 ...
$ verification_status: num 1 2 2 3 3 3 3 1 3 1 ...
$ home_ownership : Factor w/ 6 levels "ANY","MORTGAGE",..: 6 6 6 5 2 2 2 2 6 2 ...
$ pymnt_plan : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...

Import from Web CSV to the data frame?

I've got a CSV file from the link Hearthstone Arena Card Pickup probability
It's just a list of vectors now, and I want to convert it into a 9-column data frame, so it may look like:
My current code is as follows, but it's not working at all.
hsd <- read.csv("hearthstonedraw.csv", header = TRUE)
hsd1 <- as.data.frame(hsd,ncol = 9)
hsd1
Credit for this answer goes to Maurits Evers and Adam Sampson.
read.csv can read directly from the address you indicate. It works out the number of columns itself and converts character columns into factors (the default behaviour in R versions before 4.0.0; from 4.0.0 on, pass stringsAsFactors = TRUE to get factors).
hsd1 <- read.csv("https://bnetcmsus-a.akamaihd.net/cms/gallery/LN4X4GN4W59R1532566073433.csv", header = TRUE)
str(hsd1)
# 'data.frame': 3931 obs. of 9 variables:
# $ Draft.Class : Factor w/ 9 levels "Druid","Hunter",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Card.Name : Factor w/ 995 levels "Abominable Bowman",..: 716 813 646 500 263 964 549 186 509 984 ...
# $ Rarity : Factor w/ 5 levels "basic","common",..: 1 1 2 2 2 1 1 2 5 2 ...
# $ Type : Factor w/ 3 levels "Minion","Spell",..: 2 2 2 2 1 2 2 1 2 2 ...
# $ Card.Class : Factor w/ 10 levels "druid","hunter",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Average : num 1.47 1.45 1.44 1.17 1.03 ...
# $ P.1.or.more.: num 0.78 0.776 0.774 0.696 0.649 ...
# $ P.2.or.more.: num 0.436 0.431 0.428 0.327 0.273 ...
# $ P.3.or.more.: num 0.1784 0.1757 0.1724 0.1081 0.0819 ...
ncol(hsd1)
# [1] 9
# There are 9 columns in the data frame
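Since the default for stringsAsFactors changed to FALSE in R 4.0.0, setting it explicitly removes the ambiguity. A small sketch, using an inline CSV via the text argument instead of the URL so it runs offline; the two rows are placeholder values, not the real card data.

```r
# Inline CSV with made-up rows, standing in for the downloaded file.
csv_text <- "Draft.Class,Card.Name,Average\nDruid,CardA,1.47\nHunter,CardB,1.03\n"

# Character columns stay character.
hsd_chr <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Character columns become factors, as in the str() output above.
hsd_fct <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

str(hsd_chr)
str(hsd_fct)
```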

R - geeglm Error: contrasts can be applied only to factors with 2 or more levels

I have applied GEE to the following dataset (str as below). Everything is fine.
> str(cd4.5m2)
'data.frame': 1300 obs. of 7 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 1 1 1 1 2 2 2 2 2 ...
$ Treatment: Factor w/ 4 levels "Alternating",..: 2 2 2 2 2 1 1 1 1 1 ...
$ Age : num 36.4 36.4 36.4 36.4 36.4 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ logcd4 : num 3.14 3.04 2.77 2.83 3.22 ...
$ Week : num 0 7.57 15.57 23.57 32.57 ...
$ Time : int 0 1 2 3 4 0 1 2 3 4 ...
I then transformed the outcome variable, reason being we want to monitor the change over time. So the str of the transformed data looks like below, which is almost exactly the same as the previous one (other than some name changes).
> str(cd4.5m1)
'data.frame': 1300 obs. of 6 variables:
$ id : Factor w/ 260 levels "1","5","29","32",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Treatment : Factor w/ 4 levels "Alternating",..: 2 1 4 1 3 3 1 4 1 3 ...
$ Age : num 36.4 35.9 47.5 37.3 42.7 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 2 2 ...
$ Week : num 1 1 1 1 1 1 1 1 1 1 ...
$ cd4.change.norm: num 0.572 0.572 0.572 0.572 0.572 ...
I then ran the GEE again and it gave me the error.
> gee1.default <- geeglm(cd4.change.norm ~ Treatment, data=cd4.5m1, id=id, family=gaussian, corstr="unstructured")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I also tested all variables in the data, they all contain multiple values. So I'm completely lost here. I also saw a lot of posts on this Error, but none seem to be able to address my issue here. Help appreciated!
I changed the correlation structure to AR1, and it worked. I did test the correlation (decreased over time) and AR1 is the correct structure to use.
But normally unstructured should be the safe option?
I just reordered my data and it works. I'd suggest you try reordering your data by id, like cd4.5m1 <- cd4.5m1[order(cd4.5m1$id), ]. Credit: KDG
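Why reordering may help: as far as I know, the geepack documentation for geeglm asks that rows belonging to the same cluster be contiguous, and in the str() output of the transformed data the ids run 1, 2, 3, ... rather than in blocks. A sketch of a contiguity check on toy data; is_contiguous is a hypothetical helper, not part of geepack.

```r
# Toy long-format data (hypothetical): ids are interleaved, not in blocks.
d <- data.frame(
  id = factor(c(1, 2, 3, 1, 2, 3)),
  y  = c(0.5, 0.7, 0.2, 0.6, 0.9, 0.1)
)

# TRUE when every id occupies exactly one contiguous run of rows.
is_contiguous <- function(id) {
  runs <- rle(as.character(id))
  length(runs$values) == length(unique(runs$values))
}

before <- is_contiguous(d$id)

# Sort by id so each cluster's rows sit together.
d_sorted <- d[order(d$id), ]
after <- is_contiguous(d_sorted$id)

print(c(before = before, after = after))
```

If observations within a cluster have a meaningful time order, sorting by id and then by the time variable would keep that order intact.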

lmer Error: number of levels of each grouping factor must be < number of observations

I would like to run an ANOVA to find out where the significant differences are. I have already searched for an answer to my problem but can't find the mistake.
names:
[1] "Tier_ID" "species" "Klima" "Ressource" "Datum" "Gewicht" "IngestionRate"
data frame:
'data.frame': 70 obs. of 7 variables:
$ Tier_ID : Factor w/ 70 levels "Raupe1","Raupe10",..: 1 12 23 34 45 56 67 69 70 2 ...
$ species : Factor w/ 1 level "Agrotis exclamationis": 1 1 1 1 1 1 1 1 1 1 ...
$ Klima : Factor w/ 1 level "BL4": 1 1 1 1 1 1 1 1 1 1 ...
$ Ressource : Factor w/ 3 levels "Kontrolle","N",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Datum : Factor w/ 1 level "06.08.2015": 1 1 1 1 1 1 1 1 1 1 ...
$ Gewicht : num 7.8 4.1 10.8 51.2 33.3 17.9 40.6 11.7 35.1 7.1 ...
$ IngestionRate: num 0.385 1.057 1.598 0.164 0.396 ...
I made subsets like this:
K_NPK<-subset(Agro,Agro$Ressource!="N")
my model:
mod4 <- lmer(IngestionRate~Ressource+(1|Gewicht), data=K_NPK)
it returns:
Error: number of levels of each grouping factor must be < number of observations
But if I do this subset
N_K<-subset(Agro,Agro$Ressource!="NPK")
and this model
mod4 <- lmer(IngestionRate~Ressource+(1|Gewicht), data=N_K)
then it runs without error.
I hope you understand what I am trying to do.
Can anybody tell me what's wrong?
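One diagnostic that follows directly from the error message, sketched on invented data: compare the number of distinct values of the grouping variable in (1|Gewicht) with the number of rows. Gewicht is numeric, so if every animal has a unique weight in a subset, the two counts are equal. Whether that is what distinguishes the two subsets here is an assumption; the toy values below are not the real data.

```r
# Toy subset (hypothetical values): every weight is unique.
toy <- data.frame(
  IngestionRate = c(0.39, 1.06, 1.60, 0.16, 0.40, 0.75),
  Gewicht       = c(7.8, 4.1, 10.8, 51.2, 33.3, 17.9),
  Ressource     = factor(c("Kontrolle", "Kontrolle", "Kontrolle",
                           "NPK", "NPK", "NPK"))
)

# Number of levels the grouping variable would have versus number of rows.
n_levels <- length(unique(toy$Gewicht))
n_obs    <- nrow(toy)

# The error message requires n_levels < n_obs.
print(c(levels = n_levels, observations = n_obs))
print(n_levels < n_obs)
```

Running the same two lines on K_NPK and N_K would show whether the subsets differ in this respect.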

Error when I try to predict class probabilities in R - caret

I've built a model using caret. When the training was completed I got the following warning:
Warning message:
In train.default(x, y, weights = w, ...) :
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
The names of the variables are:
str(train)
'data.frame': 7395 obs. of 30 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 2 8 6 6 11 6 1 6 3 8 ...
$ alchemy_category_score : num 3737 2052 4801 3816 3179 ...
$ avglinksize : num 2.06 3.68 2.38 1.54 2.68 ...
$ commonlinkratio_1 : num 0.676 0.508 0.562 0.4 0.5 ...
$ commonlinkratio_2 : num 0.206 0.289 0.322 0.1 0.222 ...
$ commonlinkratio_3 : num 0.0471 0.2139 0.1202 0.0167 0.1235 ...
$ commonlinkratio_4 : num 0.0235 0.1444 0.0426 0 0.0432 ...
$ compression_ratio : num 0.444 0.469 0.525 0.481 0.446 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0908 0.0987 0.0724 0.0959 0.0249 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.246 0.203 0.226 0.266 0.229 ...
$ image_ratio : num 0.00388 0.08865 0.12054 0.03534 0.05047 ...
$ is_news : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 2 1 2 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 1 2 ...
$ linkwordscore : num 24 40 55 24 14 12 21 5 17 14 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5424 4973 2240 2737 12032 ...
$ numberOfLinks : num 170 187 258 120 162 55 93 132 194 326 ...
$ numwords_in_url : num 8 9 11 5 10 3 3 4 7 4 ...
$ parametrizedLinkRatio : num 0.1529 0.1818 0.1667 0.0417 0.0988 ...
$ spelling_errors_ratio : num 0.0791 0.1254 0.0576 0.1009 0.0826 ...
$ label : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ isVideo : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 2 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 2 2 1 2 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 2 2 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 2 1 2 2 ...
$ noOfMetaTags : num 10 12 6 10 13 2 6 6 9 5 ...
My code is the following:
ctrl <- trainControl(method = "CV",
                     number = 10,
                     classProbs = TRUE,
                     allowParallel = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(476)
rfFit <- train(formula,
               data = train,
               method = "rf",
               tuneGrid = expand.grid(.mtry = seq(4, 20, by = 2)),
               ntrees = 1000,
               importance = TRUE,
               metric = "ROC",
               trControl = ctrl)
pred <- predict.train(rfFit, newdata = test, type = "prob")
I get the error: Error in [.data.frame(out, , obsLevels, drop = FALSE) :
undefined columns selected
The variable names on the test data set are:
str(test)
'data.frame': 3171 obs. of 29 variables:
$ alchemy_category : Factor w/ 13 levels "arts_entertainment",..: 8 4 12 4 10 12 12 8 1 2 ...
$ alchemy_category_score : num 5307 4825 1 6708 5416 ...
$ avglinksize : num 2.56 3.77 2.27 2.52 1.85 ...
$ commonlinkratio_1 : num 0.39 0.462 0.496 0.706 0.471 ...
$ commonlinkratio_2 : num 0.257 0.205 0.385 0.346 0.161 ...
$ commonlinkratio_3 : num 0.0441 0.0513 0.1709 0.123 0.0323 ...
$ commonlinkratio_4 : num 0.0221 0 0.1709 0.0906 0 ...
$ compression_ratio : num 0.49 0.782 1.25 0.449 0.454 ...
$ embed_ratio : num 0 0 0 0 0 0 0 0 0 0 ...
$ frameTagRatio : num 0.0671 0.0429 0.0588 0.0581 0.093 ...
$ hasDomainLink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ html_ratio : num 0.23 0.366 0.162 0.147 0.244 ...
$ image_ratio : num 0.19944 0.08 10 0.00596 0.03571 ...
$ is_news : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 2 1 1 ...
$ lengthyLinkDomain : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
$ linkwordscore : num 15 62 42 41 34 35 15 22 41 7 ...
$ news_front_page : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ non_markup_alphanum_characters: num 5643 382 2420 5559 2209 ...
$ numberOfLinks : num 136 39 117 309 155 266 55 145 110 1 ...
$ numwords_in_url : num 3 2 1 10 10 7 1 9 5 0 ...
$ parametrizedLinkRatio : num 0.2426 0.1282 0.5812 0.0388 0.0968 ...
$ spelling_errors_ratio : num 0.0806 0.1765 0.125 0.0631 0.0653 ...
$ isVideo : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 1 2 2 ...
$ isFashion : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
$ isFood : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ hasComments : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 1 2 2 1 ...
$ hasGoogleAnalytics : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 2 1 1 ...
$ hasInlineCSS : Factor w/ 2 levels "0","1": 2 2 2 1 1 2 2 2 1 1 ...
$ noOfMetaTags : num 3 6 5 9 16 22 6 9 7 0 ...
If I omit the type="prob" part, I get no error.
Any ideas?
Could it be the length of the variable "alchemy_category" which is appended with the respective factor levels e.g. "alchemy_categoryarts_entertainment" inside the model??
The answer is in bold at the top of your post =]
What are you modeling? Is it alchemy_category? The code only says formula and we can't see it.
When you ask for class probabilities, model predictions are a data frame with a separate column for each class/level. If the outcome doesn't have levels that are valid column names, data.frame converts them to valid names. That creates a problem because the code is looking for a specific name but the data frame has a different (but valid) name.
For example, if I had
> test <- factor(c("level1", "level 2"))
> levels(test)
[1] "level 2" "level1"
> make.names(levels(test))
[1] "level.2" "level1"
the code would be looking for "level 2" but there is only "level.2".
As stated above, the class values must be factors and their levels must be valid names. Another way to ensure this is
levels(all.dat$target) <- make.names(levels(factor(all.dat$target)))
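A small sketch of what that renaming does on a toy 0/1 outcome, to show that each observation keeps its class and only the labels change. The vector outcome is made up; the resulting names match the X0, X1 mentioned in the caret warning.

```r
# Toy outcome with levels that are not valid R names (hypothetical data).
outcome <- factor(c("1", "0", "0", "1", "1"))

# Rename the levels in place; observations stay attached to their level.
renamed <- outcome
levels(renamed) <- make.names(levels(outcome))

# Cross-tabulate old against new labels to confirm the mapping.
print(table(old = outcome, new = renamed))
print(levels(renamed))
```

Doing this before splitting into training and test sets keeps the labels consistent across both.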
I have read through the answers above while facing a similar problem. A formal solution is to do this on the train and test datasets. Make sure you include the response variable in the feature.names too.
feature.names <- names(train)
for (f in feature.names) {
  if (is.factor(train[[f]])) {
    # Rename levels in place so each new label stays attached to its own level
    levels(train[[f]]) <- make.names(levels(train[[f]]))
  }
}
This creates syntactically correct labels for all factors.
As @Sam Firke already pointed out in the comments (but I overlooked it), the levels TRUE/FALSE also don't work, so I converted them to yes/no.
As per the above example, relabelling the outcome variable will usually fix the problem. It's better to make the change in the original dataset, before partitioning into training and test datasets:
data$outcome <- factor(data$outcome)
levels(data$outcome) <- make.names(levels(data$outcome))
As others pointed out earlier, this problem only occurs when classProbs=TRUE, which causes the train function to generate additional statistics related to the outcome class.

Resources