Defining the levels of dataframe columns in R

I am trying to redefine the levels that are assigned when I am using cbind to create a dataframe from select columns of other dataframes. The dataframes contain integers, and the rownames are strings:
outTable<-data.frame(cbind(contRes$wt, bRes$log2FoldChange, cRes$log2FoldChange, dRes$log2FoldChange, aRes$log2FoldChange), row.names=row.names(aRes))
Using the following, I get the levels of the columns:
levels(as.factor(colnames(outTable)))
[1] "F" "N" "RH" "RK" "W"
I would like to change that order by passing something like:
levels(as.factor(colnames(outTable)))<-c("W", "RK", "RH", "F", "N")
but I get the error:
could not find function "as.factor<-"
The end purpose is to set the x-axis order of a boxplot in ggplot2. Am I approaching this the right way? If so, what am I missing, and if not, what would be the best way to do it?

Use
factor(colnames(outTable), levels=c("W", "RK", "RH", "F", "N"))
If you use levels()<- you will simply rename/replace level names; you don't re-order them. This is certainly not the behavior you want. The best way to re-order them all is to just use factor().
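To make the difference concrete, here is a minimal sketch. The vector `cols` is a hypothetical stand-in for `colnames(outTable)`, using the names from the question. It also shows why the original attempt failed: `levels(x) <- value` needs a named variable to assign into, so it cannot be applied to the result of `as.factor(...)` directly.

```r
# Hypothetical column names, standing in for colnames(outTable)
cols <- c("W", "RK", "RH", "F", "N")

# Default: levels are sorted alphabetically
f <- factor(cols)
levels(f)

# Re-ordering: the values are untouched, only the level order changes
reordered <- factor(cols, levels = c("W", "RK", "RH", "F", "N"))
as.character(reordered)

# Renaming: assigning to levels() relabels the existing codes,
# so the values themselves end up changed
renamed <- f
levels(renamed) <- c("W", "RK", "RH", "F", "N")
as.character(renamed)
```

In the last case the element that used to be "W" now reads "N", which is almost certainly not what you want for an axis order.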

You can specify levels as an argument to the factor function (as.factor does not take a levels argument):
factor(colnames(outTable), levels = c("W", "RK", "RH", "F", "N"), ordered = TRUE)

Related

replace values within the same column

I'm trying to figure out a simple way to do something like this with dplyr (data set = COL, variable = SEX):
COL[COL$SEX == "MACHO","SEX"] <- "M"
COL[COL$SEX == "HEMBRA","SEX"] <- "F"
This should be simple, but it is the best I can do at the moment. Is there an easier way?
Instead of multiple assignments, an option is to convert to factor with both levels and labels specified:
COL$SEX <- factor(COL$SEX, levels = c("MACHO", "HEMBRA"), labels = c("M", "F"))
Or another option is to convert to a logical vector, then change it to numeric index by adding 1, and replace the values based on the index
COL$SEX <- c("M", "F")[1 + (COL$SEX == "HEMBRA")]
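A small sketch comparing the two approaches. The data frame `COL` here is made up for illustration; only the column name and the values "MACHO"/"HEMBRA" come from the question.

```r
# Hypothetical data, standing in for the question's COL data set
COL <- data.frame(SEX = c("MACHO", "HEMBRA", "HEMBRA", "MACHO"),
                  stringsAsFactors = FALSE)

# Approach 1: factor with explicit levels and labels (result is a factor)
as_factor <- factor(COL$SEX, levels = c("MACHO", "HEMBRA"), labels = c("M", "F"))

# Approach 2: logical comparison turned into an index 1 or 2
# (result is a character vector)
as_index <- c("M", "F")[1 + (COL$SEX == "HEMBRA")]

as.character(as_factor)
as_index
```

One difference worth keeping in mind: with the factor approach, any value other than "MACHO" or "HEMBRA" becomes NA, while the index approach maps every non-missing value that is not "HEMBRA" to "M".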

Renaming values in R after binning with cut()

I had a list of numerical values that I wanted to bin using cut(). Now each row has been replaced with the range it fell into, written with brackets, e.g. [0,140] meaning between 0 and 140 inclusive.
The problem is these names are lengthy, and eventually require exponent notation, making them even longer, and it makes the graph illegible. Using typeof() it appears it's still in integer form, but I can't figure out how to rename them the way I would with factors. When I tried with factor() and the labels parameter, I was told that sort only worked on atomic lists.
As an example, here's essentially what I tried on my dataset, except with the built-in iris dataset:
data(iris)
iris[1] <- cut(iris[[1]], 10, include.lowest=TRUE)
iris[1] <- factor(iris[1], labels = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"))
It returns the error:
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
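A likely cause of the error is that iris[1] is a one-column data frame (a list), not a vector, so factor() cannot sort it; iris[[1]] extracts the column itself. Also, the result of cut() is already a factor, and typeof() reports "integer" for any factor because of how factors are stored. A minimal sketch of two ways to get short labels, using the variable names `binned` and `binned2` purely for illustration:

```r
data(iris)
new_labels <- letters[1:10]

# Option 1: let cut() assign the short labels directly
binned <- cut(iris[[1]], 10, include.lowest = TRUE, labels = new_labels)

# Option 2: bin first, then rename the levels of the resulting factor
binned2 <- cut(iris[[1]], 10, include.lowest = TRUE)
levels(binned2) <- new_labels

class(binned)    # a factor, even though typeof() says "integer"
levels(binned)
```

Either result can then be assigned back with iris[[1]] <- binned.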

R conditional variable replacement in dataframe

I need to recode variable (column) values in a dataframe. The following snippet replaces my values with what looks like array indexes instead of the categorical values:
CMlist <- c("CMdysphagiascreen","CMStrokeUnit","CMVTE","CMantithromd2")
for (i in CMlist) {
RHSSP[[i]] <- ifelse(RHSSP[[i]] == "NDOC", "Y", RHSSP[[i]])
RHSSP[[i]] <- ifelse(RHSSP[[i]] == "U", "N", RHSSP[[i]])
RHSSP[[i]] <- ifelse(is.NULL(RHSSP[[i]]), "N", RHSSP[[i]])
}
No doubt there's a better method for doing this. Can someone explain what's wrong with my attempt and maybe a better way of going about it?
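Assuming the columns are factors (an assumption, but it would explain the "array indexes"), ifelse() drops the factor attributes and returns the underlying integer codes for the unchanged values. Separately, R has no is.NULL; the function is is.null, and it is not element-wise anyway, so missing entries are normally tested with is.na. A sketch of one way around both issues, with a made-up `RHSSP` and only two of the column names from the question:

```r
# Hypothetical stand-in for the question's RHSSP data frame
RHSSP <- data.frame(
  CMdysphagiascreen = factor(c("Y", "NDOC", "U", NA)),
  CMStrokeUnit      = factor(c("U", "Y", NA, "NDOC"))
)
CMlist <- c("CMdysphagiascreen", "CMStrokeUnit")

for (i in CMlist) {
  x <- as.character(RHSSP[[i]])        # work on the labels, not the codes
  x[x %in% "NDOC"] <- "Y"              # %in% never returns NA
  x[x %in% "U" | is.na(x)] <- "N"      # treat "U" and missing as "N"
  RHSSP[[i]] <- x
}

RHSSP
```

If factors are needed afterwards, wrapping the last assignment in factor() converts each column back.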

R How to convert a numeric into factor with predefined labels

labs = letters[3:7]
vec = rep(1:5,2)
How do I get a factor whose levels are "c" "d" "e" "f" "g" ?
You can do something like this:
labs = letters[3:7]
vec = rep(1:5,2)
factorVec <- factor(x=vec, levels=sort(unique(vec)), labels = c( "c", "d", "e", "f", "g"))
I have sorted the unique(vec), so as to make results consistent. unique() will return unique values based on the first occurrence of the element. By specifying the order, the code becomes more robust.
Also by specifying the levels and labels both, I think that code will become more readable.
EDIT
If you look in the documentation using ?factor, you will find :
levels
an optional vector of the values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x))
So you can see that there is some sorting inside the factor function itself. But it is my opinion that one should add the levels information anyway, so as to make the code more readable.
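A quick sketch of that point: with the default levels the result matches the explicit version, and the order only changes if you pass unsorted levels yourself. The vector `shuffled` is an extra illustration, not part of the question.

```r
labs <- letters[3:7]
vec  <- rep(1:5, 2)

# Default levels: sorted unique values of vec
implicit <- factor(vec, labels = labs)

# Explicit levels: same result, but the mapping is visible in the code
explicit <- factor(vec, levels = sort(unique(vec)), labels = labs)

levels(implicit)
identical(as.character(implicit), as.character(explicit))

# Where the order does matter: unsorted levels are kept as given
shuffled <- c(3, 1, 2)
levels(factor(shuffled))
levels(factor(shuffled, levels = unique(shuffled)))
```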

R - show only levels used in a subset of data frame

I have a rather large data frame with a factor that has a lot of levels (more than 4,000). I have another column in the same data frame that I'm using as a reference, and what I'd like to find is a subset of the levels whenever this reference column is NA.
The first step I'm using is subsetrows <- which(is.na(mydata$reference)) but after that I'm stuck. I want something like levels(mydata[subsetrows,mydata$factor]) but unfortunately, this command shows me all the levels and not just the ones existing in subsetrows. I suppose I could create a new vector outside of my data frame of only my subset rows and then drop any unused levels, but is there any easier/cleaner way to do this, possibly without copying my data outside the data frame?
As an example of what I want returned, if my data frame has factor levels from A to Z, but in my subset only P, R and Y appear, I want something that returns the levels P, R and Y.
You can certainly accomplish this with base functions. But my personal preference is to use dplyr with chained operations such as this:
library(dplyr)
d %>%
  filter(is.na(ref)) %>%
  select(field) %>%
  distinct()
data
d <- data.frame(
  field = c("A", "B", "C", "A", "B", "C"),
  ref = c(NA, "a", "b", NA, "c", NA)
)
I modified a suggestion in the comments by Marat to use the function unique that seems to return the correct levels.
Solution:
subsetrows <- which(is.na(mydata$reference))
unique(as.character(mydata$factor[subsetrows]))
While I like learning new packages and functions, this solution seems better at this point since it's more compact and easier for me to understand if I need to revisit this code at some distant point in the future.
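For completeness, a base R alternative is droplevels(), which removes unused levels from a subsetted factor without changing the original data frame. This is a sketch on made-up data; the column names `factor` and `reference` follow the question. Subsetting a factor keeps all of its original levels, which is why levels() alone reports every level.

```r
# Hypothetical data, standing in for the question's mydata
mydata <- data.frame(
  factor    = factor(c("P", "A", "R", "B", "Y", "P")),
  reference = c(NA, 1, NA, 2, NA, NA)
)

subsetrows <- which(is.na(mydata$reference))

# Subsetting alone keeps every level
levels(mydata$factor[subsetrows])

# droplevels() keeps only the levels that actually occur in the subset
used <- levels(droplevels(mydata$factor[subsetrows]))
used
```

Unlike unique(as.character(...)), which returns values in order of first appearance, this returns them in level order.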
