labs = letters[3:7]
vec = rep(1:5,2)
How do I get a factor whose levels are "c" "d" "e" "f" "g" ?
You can do something like this:
labs = letters[3:7]
vec = rep(1:5,2)
factorVec <- factor(x=vec, levels=sort(unique(vec)), labels = c( "c", "d", "e", "f", "g"))
I sorted unique(vec) to make the result consistent: unique() returns values in order of first occurrence, so specifying the order explicitly makes the code more robust.
Specifying both the levels and the labels also makes the code more readable, in my opinion.
EDIT
If you look in the documentation using ?factor, you will find :
levels
an optional vector of the values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x))
So there is already some sorting inside the factor function itself. Still, in my opinion one should supply the levels explicitly, so as to make the code more readable.
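A minimal sketch of the points above, using the vectors from the question. The `shuffled` vector is a made-up example, added only to show the difference between the first-occurrence order of unique() and the sorted default levels of factor().

```r
labs <- letters[3:7]          # "c" "d" "e" "f" "g"
vec  <- rep(1:5, 2)

# Default: the levels are the sorted unique values of vec
f_default <- factor(vec)
levels(f_default)             # "1" "2" "3" "4" "5"

# Explicit levels and labels: value 1 is shown as "c", 2 as "d", and so on
f_labeled <- factor(vec, levels = sort(unique(vec)), labels = labs)
levels(f_labeled)             # "c" "d" "e" "f" "g"

# unique() alone keeps first-occurrence order (hypothetical unsorted input)
shuffled <- c(3, 1, 2, 3, 1)
unique(shuffled)              # 3 1 2
levels(factor(shuffled))      # "1" "2" "3"
```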
Related
I'm trying to figure out a simple way to do something like this with dplyr (data set = COL, variable = SEX):
COL[COL$SEX == "MACHO","SEX"] <- "M"
COL[COL$SEX == "HEMBRA","SEX"] <- "F"
This should be simple, but this (using only the command line) is the best I can do at the moment. Is there an easier way?
Instead of multiple assignments, one option is to convert to a factor, specifying both levels and labels
COL$SEX <- factor(COL$SEX, levels = c("MACHO", "HEMBRA"), labels = c("M", "F"))
Or another option is to convert to a logical vector, then change it to numeric index by adding 1, and replace the values based on the index
COL$SEX <- c("M", "F")[1 + (COL$SEX == "HEMBRA")]
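Both options can be tried on a small made-up data frame; the four rows of `COL` below are an assumption for illustration, not the asker's data.

```r
# Hypothetical stand-in for the asker's data set
COL <- data.frame(SEX = c("MACHO", "HEMBRA", "HEMBRA", "MACHO"),
                  stringsAsFactors = FALSE)

# Option 1: factor with explicit levels and labels
sex_factor <- factor(COL$SEX, levels = c("MACHO", "HEMBRA"), labels = c("M", "F"))

# Option 2: logical vector -> index (FALSE + 1 = 1, TRUE + 1 = 2) -> lookup
sex_char <- c("M", "F")[1 + (COL$SEX == "HEMBRA")]

as.character(sex_factor)      # "M" "F" "F" "M"
sex_char                      # "M" "F" "F" "M"
```

One difference to keep in mind: option 1 returns a factor and turns any value other than "MACHO" or "HEMBRA" into NA, while option 2 returns a character vector and maps every non-"HEMBRA" value to "M".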
I used h2o.relevel to reorder the levels of a factor df$x. But when I tried to get the min or max using h2o.which_min(df$x) and h2o.which_max, the output was: NAN. This tells me that h2o.relevel does not impose an increasing order on the levels.
Example:
x: factor w/4 levels "B" "D" "A" "C". df is the dataframe.
I tried this: With h2o.relevel(df$x, levels = c("A", "B", "C", "D")), I'm able to rearrange the levels TO "A", "B", "C", "D", but A is not the minimum and D is not the maximum. h2o.which_min(df$x) and h2o.which_max return NAN.
How can I make A the min value and D the max value? Please help. Thank you
Enums (aka factors, aka categoricals) in H2O are not ordinal.
So it's not possible to do comparisons in this way.
If you really want to do this, I recommend duplicating the column so that the original remains a factor and the duplicate is an integer.
I am trying to re-code the values of a vector based on some levels and labels. Importantly, I can have a multitude of values to replace (levels) with a multitude of other values (labels), and I don't know in advance how many I have. Additionally, two levels can have the same label.
Here is an example: I have a vector "a". I would like to re-code each value in "a_levels" by the corresponding labels in "a_labels".
a = c(5,6,5,5,7,8,7)
a_levels = c(5, 6, 7, 8)
a_labels = c('a', 'a', 'c', 'd')
(I can assume that the first value of a_levels corresponds to the first value of a_labels, etc.)
So I would like to get
[1] "a" "a" "a" "a" "c" "d" "c"
Importantly, I have some constraints that do not allow me to apply some common solutions:
1) Note that a_labels contains the label "a", twice, so I cannot use
factor(a, levels = a_levels,
labels = a_labels)
2) In my data I have a lot of values to replace, and I don't even know
in advance which levels I need to replace with which labels.
I only get the two vectors a_levels and a_labels
For these reasons I cannot use several ifelse() statements, or the recode function from dplyr.
recode(a,
'5' = 'a',
'6' = 'a',
'7' = 'c',
'8' = 'd')
because I don't know the values and labels in advance.
It should be simple to do that, but I did not find a way.
Thanks to nicola. The following works very well.
a_labels[ match(a,a_levels) ]
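A short walk-through of why this works, using the vectors from the question: match() returns, for each element of a, its position in a_levels, and that position is then used to index a_labels. The intermediate names `idx` and `recoded` are introduced here for illustration only.

```r
a        <- c(5, 6, 5, 5, 7, 8, 7)
a_levels <- c(5, 6, 7, 8)
a_labels <- c("a", "a", "c", "d")

# Position of each value of a within a_levels
idx <- match(a, a_levels)     # 1 2 1 1 3 4 3

# Use those positions to look up the labels; duplicated labels are no problem
recoded <- a_labels[idx]      # "a" "a" "a" "a" "c" "d" "c"

# A value that is absent from a_levels yields NA
a_labels[match(9, a_levels)]  # NA
```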
I'll divide this question into two parts: the first is a general question, and the second a specific one.
First - I would like to know if there is a way to label numeric factors but still keep their original numeric levels. This is especially confusing since I realised that when we pass a labels argument to factor, the labels then become the factor's levels, for example:
x<- factor(c(1,2,3, 2, 3, 1, 2), levels = c(1, 2, 3), labels = c("a", "b", "c"))
levels(x)
#[1] "a" "b" "c"
labels(x)
#[1] "1" "2" "3" "4" "5" "6" "7"
I would like to know if there is a way, as there is in Stata, to label the categories of a factor. I want to be able to sum x while its elements show as "a", "b" or "c", but keep the values 1, 2, or 3.
Second- I'm asking this because I have a very large data set which has columns with numeric categories. This data set comes with a dictionary in xlsx which I read and treat into R, so each column has its numeric categories and their respective labels. I'm attempting to read the dictionary, create a list of categories and labels inside a list of columns and then read the data set, loop through the columns and label the variables. These labels are important so I don't have to look at the dictionary every time I have to interpret something on the data set. And the numeric levels are important because since I have a lot of dummy variables (yes or no variables) I want to be able to sum them.
Here's my code (I use the data.table package):
dic<- readRDS(dictionary_filename)
# Reading data set #
data <- fread(dataset_filename, header = TRUE, sep = "|", encoding = "UTF-8", na.strings = c("NA", ""))
# Treating the data.set #
# Identifying which lines of the dictionary have categorized variables. This is very specific to my dictionary structure #
index<- which(!is.na(dic$num.categoria))
# storing the names of columns that have categorized variables #
names_var<- dic$`Var name`[index]
names_var<- names_var[!is.na(names_var)]
# Creating a data frame with categorized variables which will be later split into lists #
df<- as.data.frame(dic[index,])
# Transforming the index column to factor so it is possible to split the data frame into a list with sublists for each categorized column #
df$N<- as.factor(df$N)
# Splitting the data frame to list
lst<- split(df, df$N)
# Creating a labels list and a levels list #
lbs<- list()
lvs<- list()
for (i in seq_along(lst)) {
  lbs[[i]] <- as.vector(lst[[i]]$category)
  lvs[[i]] <- as.vector(lst[[i]]$category.number)
}
# Changing the data set columns into factors with their respective levels and labels #
k <- 1
for (var in names_var) {
  set(data, j = var, value = factor(data[[var]], levels = lvs[[k]], labels = lbs[[k]]))
  k <- k + 1
}
I realize the code is a bit abstract since I don't provide the data set or the dictionary, but it is just so you can get the idea. My code works: it runs with no error and it does what I hoped it would do (all the categorized columns now show their labels, for example "yes" or "no" where before it was 1 or 0), except for the fact that I can no longer access the original numbers in the levels, which I need in the next part of my project.
It would be preferable if there is a general way of doing so, since I run this code in a function, with many columns with different data sets and different dictionaries. Is there a way to accomplish this?
PS.: I have read the documentation in R and the answers to those questions:
Factor, levels, and original values
Having issues using order function in R
But unfortunately I wasn't able to figure it out by myself, it just became obvious that using the "labels" argument in "factor" was not the way to get it done.
Thank you so much!
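One possible approach, offered only as a sketch: a factor stores the position of each element's level, so if the vector of original codes passed as levels is kept around, the numbers can be recovered by indexing that vector with as.integer(). The names `codes`, `dummy_codes`, `original` and `dummy_values` are hypothetical, and this assumes the codes vector is the same one given to the levels argument.

```r
# Original numeric categories and their labels (hypothetical dictionary entries)
codes <- c(1, 2, 3)
x <- factor(c(1, 2, 3, 2, 3, 1, 2), levels = codes, labels = c("a", "b", "c"))

# as.integer(x) is the position of each level; indexing codes restores the values
original <- codes[as.integer(x)]
sum(original)                 # 14

# With 0/1 dummies the positions are 1 and 2, so as.integer() alone is not enough
dummy_codes <- c(0, 1)
d <- factor(c(1, 0, 1, 1), levels = dummy_codes, labels = c("no", "yes"))
as.integer(d)                 # 2 1 2 2
dummy_values <- dummy_codes[as.integer(d)]
sum(dummy_values)             # 3
```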
I am trying to redefine the levels that are assigned when I am using cbind to create a dataframe from select columns of other dataframes. The dataframes contain integers, and the rownames are strings:
outTable<-data.frame(cbind(contRes$wt, bRes$log2FoldChange, cRes$log2FoldChange, dRes$log2FoldChange, aRes$log2FoldChange), row.names=row.names(aRes))
Using the following, I get the levels of the columns:
levels(as.factor(colnames(outTable)))
[1] "F" "N" "RH" "RK" "W"
I would like to change that order by passing something like:
levels(as.factor(colnames(outTable)))<-c("W", "RK", "RH", "F", "N")
but I get the error:
could not find function "as.factor<-"
The end purpose is to set the X axis order of a boxplot in ggplot2. Am I approaching this the right way? If so, what am I missing, and if not, what would be the best way to do it?
Use
factor(colnames(outTable), levels=c("W", "RK", "RH", "F", "N"))
If you use levels()<- you will simply rename/replace the level names; you don't re-order them. This is certainly not the behavior you want. The best way to re-order them all is to just use factor()
You can specify levels as an argument in the factor function (as.factor does not take a levels argument)
factor(colnames(outTable), levels = c("W", "RK", "RH", "F", "N"), ordered = TRUE)
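The difference between renaming and re-ordering can be seen in a small sketch. The `cols` vector is a hypothetical stand-in for colnames(outTable), chosen to have the same five names as in the question.

```r
cols <- c("W", "F", "N", "RH", "RK")   # hypothetical column names

f <- factor(cols)
levels(f)                     # "F" "N" "RH" "RK" "W"

# levels<- renames level by level, so the data itself changes
renamed <- f
levels(renamed) <- c("W", "RK", "RH", "F", "N")
as.character(renamed)         # "N" "W" "RK" "RH" "F"

# factor(..., levels = ) keeps the data and only changes the level order
reordered <- factor(cols, levels = c("W", "RK", "RH", "F", "N"))
as.character(reordered)       # "W" "F" "N" "RH" "RK"
levels(reordered)             # "W" "RK" "RH" "F" "N"
```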