I have a column of type factor in my data whose summary looks as follows
$COL_256
0 1 <NA>
31557 0 0
As you can see, there are only three levels for this column and two of them have zero occurrences, which means it's basically just one factor level.
The trouble with this is that, when I do certain operations like regression, I get an error which says,
contrasts can be applied only to factors with 2 or more levels
How can I remove columns like this one, which have all their occurrences in just one factor level?
EDIT : I tried droplevels(df) as suggested but now my column looks as follows and gives the same error.
$COL_256
0
31557
You could test each variable to see whether it is constant, and drop it if it is. E.g.:
dat <- data.frame(y=1:3,x=factor("a",levels=c("a","b")),x2=letters[c(1,2,1)])
# y = numeric, x=constant with 2 factor levels, x2=not constant with 2 factor levels
dat[sapply(dat,function(x) length(levels({if(is.factor(x)) droplevels(x) else x}))!=1 )]
# y x2
#1 1 a
#2 2 b
#3 3 a
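A more general variant, as a rough sketch: count the distinct non-missing values in each column and keep only the columns with more than one. Note that this also drops constant numeric columns, which may or may not be what you want. The name keep_varying is made up for this example.

```r
dat <- data.frame(y = 1:3,
                  x = factor("a", levels = c("a", "b")),
                  x2 = letters[c(1, 2, 1)])

# TRUE for columns that take more than one distinct non-missing value
keep_varying <- vapply(
  dat,
  function(col) length(unique(col[!is.na(col)])) > 1,
  logical(1)
)

# Keep only the varying columns (here: y and x2)
dat_clean <- dat[keep_varying]
names(dat_clean)
```

Because it looks at the observed values rather than the declared levels, it does not matter whether unused levels were dropped beforehand.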
I have a factor variable Category (e.g., Category A, B, C), and I am trying to fill any blank values in the column with NA. However, when I run this command, the Category variable is replaced with integer values representing the levels of the factor. How do I go about retaining the actual factor characters?
full$CATEGORY <- ifelse(full$CATEGORY == "", NA, full$CATEGORY)
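Short answer: avoid ifelse for this use case. It does not keep the factor attributes of its yes/no arguments, so you get the underlying integer codes back. If you define your data something like this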
> full=data.frame(CATEGORY=factor(c('a','b','','d','a','')))
Then you can accomplish what you want with an assignment statement. This will avoid having to spend time converting and then converting back.
> full$CATEGORY[full$CATEGORY=='']=NA
> full
CATEGORY
1 a
2 b
3 <NA>
4 d
5 a
6 <NA>
>
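One thing to be aware of, shown here as a small sketch on the same made-up data: after the assignment the empty string is still listed as a level, just with zero occurrences. If that matters for later modelling, droplevels() removes it.

```r
full <- data.frame(CATEGORY = factor(c("a", "b", "", "d", "a", "")))

# Replace blanks with NA; the column stays a factor
full$CATEGORY[full$CATEGORY == ""] <- NA

# The "" level is still present, now unused
levels_before <- levels(full$CATEGORY)

# Drop unused levels so only a, b and d remain
full$CATEGORY <- droplevels(full$CATEGORY)
levels_after <- levels(full$CATEGORY)
```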
Problem:
Confusion Matrix needed for measuring Sensitivity & Specificity.
Issue:
confusionMatrix() requires that the data not have more levels than the reference. My levels appear to match (see the listing below), so what is the error 'data cannot have more levels than the reference' referring to? The reference is model_prediction (see the last length() / str() output below). My dependent variable is a factor variable.
Effort Tried:
For the R code, first I ran prediction with factor results and included na.action:
loans_predict_fcm <- factor(predict(full, newdata = data_train, type = "response", na.action = na.pass))
With results from a separate table(), e.g. pred_table, I was able to calculate sensitivity and specificity by formula. However, I would like to cross-check this with confusionMatrix(), and I'm having trouble getting confusionMatrix() to work.
Sensitivity <- 100*(pred_table[1,1])/sum(pred_table[1,1] + pred_table[1,2])
Specificity <- 100*(pred_table[2,2])/sum(pred_table[2,1] + pred_table[2,2])
When I run confusionMatrix() on the factored predict() output, it fails. Inspecting the levels shows that they do not match, which is why confusionMatrix() fails with 'data cannot have more levels than the reference' (the reference in this run is model_prediction).
confusionMatrix(loans_predict_fcm, model_prediction, positive="1")
identical (levels(loans_predict_fcm), levels(model_prediction))
> FALSE
> length(loans_predict_fcm)
[1] 27724
> str(loans_predict_fcm)
Factor w/ 27424 levels "0.13079979710253",..: 15967 9625 15966 10703 7830 12394 21291 15023 17920 18442 ...
- attr(*, "names")= chr [1:27724] "11413" "2561" "25337" "1643" ...
> length(loans_train_data$statusRank)
[1] 27724
> str(loans_train_data$statusRank)
Factor w/ 2 levels "Bad","Good": 2 2 2 1 1 2 2 2 1 2 ...
> length(model_prediction)
[1] 27724
> str(model_prediction)
Factor w/ 2 levels "Bad","Good": 2 2 2 2 2 2 2 2 2 2 ...
>
For the confusion matrix sensitivity and specificity issue: the column names and row names differed, so I was able to solve it with a custom function that takes setdiff() on colnames() and rownames() and creates a zero matrix with mat.or.vec() for the missing column names.
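For what it's worth, the str() output above shows the predicted factor has 27424 levels because predict(..., type = "response") returns probabilities, and factor() turns every distinct probability into its own level. A possible alternative, sketched below on made-up numbers, is to threshold the probabilities and build the factor with the reference's levels. The 0.5 cutoff, the toy vectors and the assumption that the probabilities refer to the "Good" class are all mine, not from the question.

```r
# Hypothetical predicted probabilities of "Good" and the observed labels
probs  <- c(0.91, 0.35, 0.78, 0.12, 0.66, 0.40)
actual <- factor(c("Good", "Bad", "Good", "Bad", "Bad", "Good"),
                 levels = c("Bad", "Good"))

# Threshold at 0.5 (an assumption) and reuse the reference's levels
predicted <- factor(ifelse(probs > 0.5, "Good", "Bad"),
                    levels = levels(actual))
same_levels <- identical(levels(predicted), levels(actual))

# Cross-tabulate and compute the rates by hand, treating "Good" as positive
tab <- table(Predicted = predicted, Actual = actual)
sensitivity <- tab["Good", "Good"] / sum(tab[, "Good"])
specificity <- tab["Bad", "Bad"] / sum(tab[, "Bad"])

# With caret installed, the same two factors could be passed on, e.g.
# caret::confusionMatrix(predicted, actual, positive = "Good")
```

Since both factors are built with the same levels, the table is always square, even when one class is never predicted.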
I have a dataframe with two columns. One is an ID column (string), the second consists of strings several hundred characters long (DNA sequences). I want to identify the unique DNA sequences and group the unique groups together.
Using:
data$duplicates<-duplicated(data$seq, fromLast = TRUE)
I have successfully identified whether a specific row is a duplicate or not. This is not sufficient: I want to know whether I have 2, 3, etc. duplicates, and which IDs they correspond to (it is important that the ID always stays with its corresponding sequence).
Maybe something like:
for data$duplicates = TRUE... "add number in data$grouping
corresponding to the set of duplicates."
I don't know how to write the code for the last part.
I appreciate any and all help, thank you.
Edit: As an example:
df <- data.frame(ID = c("seq1","seq2","seq3","seq4","seq5"),seq= c("AAGTCA","AGTCA","AGCCTCA","AGTCA","AGTCAGG"))
I would like the output to be a new column (e.g.: df$grouping) where a numeric value is given to each unique group, so in this case:
("1","2","3","2","4")
If df$seq is a factor (the default for strings in data.frame() before R 4.0.0), we can just use the level number. This is what you get when a factor is coerced to an integer.
df$grouping = as.integer(df$seq)
df
# ID seq grouping
# 1 seq1 AAGTCA 1
# 2 seq2 AGTCA 3
# 3 seq3 AGCCTCA 2
# 4 seq4 AGTCA 3
# 5 seq5 AGTCAGG 4
If, in your real data, the seq column is not of class factor, you can still use df$grouping = as.integer(factor(df$seq)). By default the order of the groups will be alphabetical---you can modify this by giving the levels argument to factor in the order you want. For example, df$grouping = as.integer(factor(df$seq, levels = unique(df$seq))) will put the levels (and thus the grouping integers) in the order in which they first occur.
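As a sketch of the first-occurrence ordering on the example data, match() against unique() gives the same numbering without building a factor, and ave() can attach the size of each group to every row. The column name n_in_group and the variable ids_by_group are made up for illustration.

```r
df <- data.frame(ID  = c("seq1", "seq2", "seq3", "seq4", "seq5"),
                 seq = c("AAGTCA", "AGTCA", "AGCCTCA", "AGTCA", "AGTCAGG"),
                 stringsAsFactors = FALSE)

# Group number = position of the sequence among the unique sequences,
# in order of first appearance
df$grouping <- match(df$seq, unique(df$seq))

# Number of rows sharing each row's sequence
df$n_in_group <- ave(seq_along(df$seq), df$seq, FUN = length)

# IDs collected per group, so each ID stays tied to its sequence
ids_by_group <- split(df$ID, df$grouping)
```

On this data df$grouping comes out as 1, 2, 3, 2, 4, which is the numbering asked for in the question.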
If you want to see the number of rows in each group, use table, e.g.
table(df$seq)
# AAGTCA AGCCTCA AGTCA AGTCAGG
# 1 1 2 1
table(df$grouping)
# 1 2 3 4
# 1 1 2 1
sort(table(df$seq), decreasing = TRUE)
# AGTCA AAGTCA AGCCTCA AGTCAGG
# 2 1 1 1
I am having an issue with displaying the correct grouping of a factor variable after using MICE. I believe this is an R thing, but I included it with mice just to be sure.
So, I run my mice algorithm; here is a snippet of how I format the variable before passing it to mice. Note, I want it to be 0 for no drug, and 1 for yes drug, so I coerce it to be a factor with levels 0 and 1 before I run it
mydat$drug=factor(mydat$drug,levels=c(0,1),labels=c(0,1))
I then run mice and it runs logistic regression (this is the default) on drug, along with my other variables to be imputed.
I can extract the results of one of the imputations when it is complete by
drug=complete(imp,1)$drug
We can view it
> head(drug)
[1] 0 0 1 0 1 1
attr(,"contrasts")
2
0 0
1 1
Levels: 0 1
So the data is certainly 0,1.
However, when I do something with it, like cbind, it changes to 1's and 2's
> head(cbind(drug))
drug
[1,] 1
[2,] 1
[3,] 2
[4,] 1
[5,] 2
[6,] 2
Even when I coerce it to a numeric
> head(as.numeric(drug))
[1] 1 1 2 1 2 2
I want to say it has something to do with the contrasts, but when I delete the contrast by doing
attr(drug,"contrasts")=NULL
It still shows up with 1's and 2's when called and printed by others.
I am able to get it to print correctly by using I()
> head(I(drug))
[1] 0 0 1 0 1 1
Levels: 0 1
So, I believe that this is an R issue, but I don't know how to remedy it. Is using I() the correct solution, or is it just a workaround that happens to work here? What is actually happening behind the scenes that is making the output display as 1's and 2's?
Thanks
Factors start with the first level being represented internally by 1.
Your two options:
1) Adjust for 1-based index of levels:
as.numeric(drug) - 1
2) Take the labels of the factors and convert to numeric:
as.numeric(as.character(drug))
Some people will point you in the direction of the faster option that does the same thing:
as.numeric(levels(drug))[drug]
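A small sketch comparing the options on toy data (the vectors drug and dose below are made up). It also shows why option 1 is the fragile one: subtracting 1 only recovers the labels when they happen to be 0, 1, 2, and so on.

```r
drug <- factor(c(0, 0, 1, 0, 1, 1), levels = c(0, 1))

codes_minus_one <- as.numeric(drug) - 1           # option 1
via_labels      <- as.numeric(as.character(drug)) # option 2
via_levels      <- as.numeric(levels(drug))[drug] # faster equivalent of option 2

# A factor whose labels are not 0, 1, ...: option 1 no longer gives the labels
dose <- factor(c(5, 10, 5))
dose_codes_minus_one <- as.numeric(dose) - 1           # 0 1 0
dose_via_labels      <- as.numeric(as.character(dose)) # 5 10 5
```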
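I'd also consider using logical values instead of a factor in the first place. Apply this to the original numeric 0/1 column, before any conversion to a factor, since as.logical on a factor works from the labels rather than the codes.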
mydat$drug = as.logical(mydat$drug)
The 0s and 1s are the names of your levels. The underlying integer corresponding to the names is 1 and 2. You can see with str,
str(drug)
# Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 2 2
When you coerce the factor to numeric, you drop the names and get the integer representation.
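To make the codes-versus-labels distinction concrete, here is a short sketch on a made-up vector. cbind() builds a plain matrix, which cannot hold a factor, so it falls back to the integer codes, whereas data.frame() keeps the column as a factor.

```r
drug <- factor(c(0, 0, 1, 0, 1, 1), levels = c(0, 1))

# The stored integer codes and the labels they point to
codes  <- as.integer(drug)   # 1 1 2 1 2 2
labels <- levels(drug)       # "0" "1"

# cbind() returns a matrix of the codes, so the labels are lost
m <- cbind(drug)

# data.frame() keeps the factor, so the labels 0/1 are preserved
d <- data.frame(drug = drug)
```

So if the goal is only to combine the imputed column with others, a data frame may be enough and no numeric conversion is needed.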
This is how R encodes factors. The underlying numeric representation of a factor always starts at 1, as you can see with the following two examples:
as.numeric(factor(c(0,1)))
as.numeric(factor(c("A","B")))
Not sure about the specifics about how MICE works, but if it requires a factor instead of a simple 0/1 numeric variable to use logistic regression, you can always hack the results with something like the following:
as.numeric(as.character(factor(c(0,1))))
or in your specific case
drug <- as.numeric(as.character(drug))
I applied mode imputation to replace the missing values in a categorical variable. The original values are in variable A, and the imputed values are in variable B. Variable A consists of the values 1 and 2, as follows:
A
1
2
1
1
2
The imputed values included in variable B are shown below.
B
2
2
2
2
2
The question is how can I compute the percentage of correctly classified values of the categorical variable as a measurement of error performance?
Your (example) data:
A <- c(1,2,1,1,2)
B <- c(2,2,2,2,2)
If you want to see which Bs were classified correctly, you can use
A == B
which is TRUE if B matches A, and FALSE otherwise.
Then, for the proportion correct (multiply by 100 for a percentage), you could use:
sum(A == B)/length(A)
, where sum(A==B) counts how many elements were correctly classified.
Or
mean(A == B)
is a more concise way to say the same thing, since the mean of a logical vector is the proportion of TRUE values.
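Putting that together on the example data, as a minimal sketch (pct_correct and confusion are just names chosen here). A cross-table of original against imputed values can also help show where the misclassifications fall.

```r
A <- c(1, 2, 1, 1, 2)
B <- c(2, 2, 2, 2, 2)

# Percentage of imputed values that equal the original values
pct_correct <- 100 * mean(A == B)

# Cross-table of original vs imputed values
confusion <- table(Original = A, Imputed = B)
```

Here two of the five values match, so pct_correct is 40.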