replace values within the same column - r

I'm trying to figure out a simple way to do something like this with dplyr (data set = COL, variable = SEX):
COL[COL$SEX == "MACHO","SEX"] <- "M"
COL[COL$SEX == "HEMBRA","SEX"] <- "F"
It should be simple, but this is the best I can do at the moment. Is there an easier way?

Instead of multiple assignments, one option is to convert the column to a factor, specifying the levels and their corresponding labels:
COL$SEX <- factor(COL$SEX, levels = c("MACHO", "HEMBRA"), labels = c("M", "F"))
Another option is to build a logical vector, turn it into a numeric index by adding 1 (FALSE becomes 1, TRUE becomes 2), and pick the replacement values by that index:
COL$SEX <- c("M", "F")[1 + (COL$SEX == "HEMBRA")]
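A named lookup vector is another way to write the same recoding. This is only a minimal sketch in base R, assuming SEX is a character column containing nothing but "MACHO" and "HEMBRA"; the toy COL data and the name `lookup` are made up for illustration. It also checks that a factor() call with levels and labels and the index trick agree with it.

```r
# Toy data, made up for illustration
COL <- data.frame(SEX = c("MACHO", "HEMBRA", "HEMBRA", "MACHO"),
                  stringsAsFactors = FALSE)

# Named lookup vector: names are the old values, elements are the new ones
lookup <- c(MACHO = "M", HEMBRA = "F")
recoded <- unname(lookup[COL$SEX])

# The two approaches from the answer, for comparison
via_factor <- as.character(factor(COL$SEX,
                                  levels = c("MACHO", "HEMBRA"),
                                  labels = c("M", "F")))
via_index <- c("M", "F")[1 + (COL$SEX == "HEMBRA")]

recoded
# "M" "F" "F" "M"
```

Note that the factor() version returns a factor rather than a character vector, which may or may not be what you want downstream.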

Related

How can I identify specific strings in a dataframe and assign a new column to specific values?

Apologies if my question is a bit rubbish! - I'm quite new to R, and have never been great at coding.
Background: I have a dataframe of gene names and other values. I have a list of specific gene names which I want to label as of interest in a new column. I.e. column name: "OfInterest", with values either "Y" or "N".
Here's what I've tried so far:
DataframeName <- import("filename.xlsx") %>%
as_tibble()
GenesOfInterest <- c('GeneA', 'GeneB', 'GeneC', etc...)
DataframeName$OfInterest <- 'N'
DataframeName$OfInterest <- (if_else(DataframeName$GeneSymbol == GenesOfInterest, DataframeName$OfInterest <- 'Y', DataframeName$OfInterest <- 'N', NULL))
So basically... I'm trying to say that if any of the strings within my "GenesOfInterest" list are found within my column "GeneSymbol", the corresponding row should be filled in as "Y" in my "OfInterest" column.
I'm then gonna ggplot the data in to a volcano plot, wherein I'll have anything from the "OfInterest" column colour coded accordingly - But I think I should be okay with this part.
As it is, the code seems to find the very first string in my list (i.e. GeneA) and marks that as "Y" accordingly into the correct column, but stops after this first one, with warning message:
Warning message:
In DataframeName$GeneSymbol == GenesOfInterest :
longer object length is not a multiple of shorter object length
Unfortunately, I don't understand what this means!
If anyone would be able to kindly offer help/suggestions, I'd be very thankful :) I feel like it is probably an easy fix, but I'm pretty inexperienced with R.
Thanks for your time! :)
DataframeName$OfInterest[DataframeName$GeneSymbol %in% GenesOfInterest] <- 'Y'
DataframeName$OfInterest[DataframeName$OfInterest != 'Y'] <- 'N'
We can use case_when
library(dplyr)
DataframeName %>%
mutate(OfInterest = case_when(GeneSymbol %in% GenesOfInterest ~ 'Y', TRUE ~ 'N'))
Don't assign the value in ifelse. Try this -
DataframeName$OfInterest <- ifelse(DataframeName$GeneSymbol %in% GenesOfInterest, 'Y', 'N')
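To see where the warning in the question comes from, here is a small sketch with made-up gene names (the vectors below are assumptions for illustration, not the asker's data). `==` compares the two vectors position by position and recycles the shorter one, whereas `%in%` tests each value for membership in the whole set.

```r
# Toy data, made up for illustration
GeneSymbol <- c("GeneA", "GeneX", "GeneB", "GeneC", "GeneY")
GenesOfInterest <- c("GeneA", "GeneB", "GeneC")

# `==` works element by element and recycles the shorter vector:
# row 2 is compared with "GeneB", row 3 with "GeneC", row 4 with "GeneA", ...
# (the recycling warning from the question is suppressed here)
elementwise <- suppressWarnings(GeneSymbol == GenesOfInterest)
elementwise
# TRUE FALSE FALSE FALSE FALSE

# `%in%` asks, for each row, whether the value occurs anywhere in the set
membership <- GeneSymbol %in% GenesOfInterest
membership
# TRUE FALSE TRUE TRUE FALSE

OfInterest <- ifelse(membership, "Y", "N")
OfInterest
# "Y" "N" "Y" "Y" "N"
```

That is why only the first gene was flagged: it happened to line up with the first element of the list.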

efficient way to replace vector of values according to second vector of bin values in R

I have a vector of numeric values (vals.to.convert in the example code below) representing elevations (in meters). I need to replace each value with a related metric that is associated with 1-meter bins (the 'becomes' column of the conversion.df data.frame below).
Right now I'm using cut() with conversion.df$becomes as the labels then coercing with as.character() and as.numeric() to get the binned numeric conversion.
Can anyone recommend a more efficient and elegant way to do this?
For example, with a raster, you can use raster::reclassify and a data.frame structured like conversion.df to make the substitution.
Here is example code:
vals.to.convert <- sample(1:80, 500, replace = T)
conversion.df <- data.frame(from = 0:79,
to = 1:80,
becomes = runif(80))
converted <- as.numeric(as.character(cut(vals.to.convert, 0:nrow(conversion.df), labels = conversion.df$becomes)))
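You could use findInterval. With left.open = TRUE the bins are right-closed, like cut's default, so the returned index can be used directly:
converted <- conversion.df$becomes[
findInterval(vals.to.convert, conversion.df$from, left.open = TRUE)]
Or use cut and index by the resulting factor (its integer codes are used):
converted <- conversion.df$becomes[cut(vals.to.convert, 0:80)]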
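As a quick sanity check, the sketch below rebuilds the example data (the set.seed() call is my addition, only to make it reproducible) and compares the two lookups. It assumes right-closed bins (0,1], (1,2], ..., which is cut's default; findInterval is given left.open = TRUE to match.

```r
set.seed(1)  # added for reproducibility; not in the original example
vals.to.convert <- sample(1:80, 500, replace = TRUE)
conversion.df <- data.frame(from = 0:79,
                            to = 1:80,
                            becomes = runif(80))

# cut() returns a factor; indexing with it uses the integer bin codes
via_cut <- conversion.df$becomes[cut(vals.to.convert, 0:80)]

# findInterval() with left.open = TRUE counts the breaks strictly below
# each value, which is the index of the right-closed bin it falls in
bin <- findInterval(vals.to.convert, conversion.df$from, left.open = TRUE)
via_findInterval <- conversion.df$becomes[bin]

identical(via_cut, via_findInterval)
# TRUE
```

Both versions skip the as.character()/as.numeric() round trip, so the looked-up values are not pushed through a text representation.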

Recoding values based on two vectors (levels and labels) with identical labels and multiple values to replace

I am trying to re-code the values of a vector based on some levels and labels. Importantly, I can have many values to replace (levels) with many other values (labels), and I don't know in advance how many there are. Additionally, two levels can have the same label.
Here is an example: I have a vector "a". I would like to re-code each value in "a_levels" by the corresponding labels in "a_labels".
a = c(5,6,5,5,7,8,7)
a_levels = c(5, 6, 7, 8)
a_labels = c('a', 'a', 'c', 'd')
(I can assume that the first value of a_levels corresponds to the first value of a_labels, etc.)
So I would like to get
[1] "a" "a" "a" "a" "c" "d" "c"
Importantly, I have some constraints that do not allow me to apply some common solutions:
1) Note that a_labels contains the label "a", twice, so I cannot use
factor(a, levels = a_levels,
labels = a_labels)
2) In my data I have a lot of values to replace, and I don't even know
in advance which levels I need to replace with which labels.
I only get the two vectors a_levels and a_labels.
For these reasons I cannot use several ifelse() statements, or the recode function from dplyr.
recode(a,
'5' = 'a',
'6' = 'a',
'7' = 'c',
'8' = 'd')
because I don't know the values and labels in advance.
It should be simple to do that, but I did not find a way.
Thanks to nicola. The following works very well.
a_labels[ match(a,a_levels) ]
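For reference, here is that one-liner run on the example from the question, as a small sketch. The second vector, `b`, is made up to show what happens to a value that is not in a_levels: match() returns NA for it, so the result is NA as well.

```r
a <- c(5, 6, 5, 5, 7, 8, 7)
a_levels <- c(5, 6, 7, 8)
a_labels <- c('a', 'a', 'c', 'd')

# match() gives, for each element of a, its position in a_levels;
# that position is then used to pick the label
recoded <- a_labels[match(a, a_levels)]
recoded
# "a" "a" "a" "a" "c" "d" "c"

# A value with no entry in a_levels (9 here, hypothetical) becomes NA
b <- c(5, 9)
a_labels[match(b, a_levels)]
# "a" NA
```

Duplicate labels are not a problem here because nothing requires a_labels to be unique, only a_levels.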

R How to convert a numeric into factor with predefined labels

labs = letters[3:7]
vec = rep(1:5,2)
How do I get a factor whose levels are "c" "d" "e" "f" "g" ?
You can do something like this:
labs = letters[3:7]
vec = rep(1:5,2)
factorVec <- factor(x=vec, levels=sort(unique(vec)), labels = c( "c", "d", "e", "f", "g"))
I have sorted the unique(vec), so as to make results consistent. unique() will return unique values based on the first occurrence of the element. By specifying the order, the code becomes more robust.
Also by specifying the levels and labels both, I think that code will become more readable.
EDIT
If you look in the documentation using ?factor, you will find :
levels
an optional vector of the values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x))
So you can see that there is some sorting inside the factor function itself. But it is my opinion that one should still supply the levels explicitly, to make the code more readable.
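A small sketch of that point: because the default levels are the sorted unique values, passing only labels gives the same result as spelling out the levels. The shuffled vector `vec2` is made up for illustration.

```r
labs <- letters[3:7]
vec <- rep(1:5, 2)

# Explicit levels, as in the answer
f_explicit <- factor(vec, levels = sort(unique(vec)), labels = labs)

# Default levels: sort(unique(vec)) is applied internally
f_default <- factor(vec, labels = labs)

levels(f_explicit)
# "c" "d" "e" "f" "g"

# Order of appearance does not matter: labels follow the sorted values
vec2 <- c(3, 1, 2)
as.character(factor(vec2, labels = c("c", "d", "e")))
# "e" "c" "d"
```

Spelling out levels still helps when some values might be absent from the data, since the labels are then tied to specific values rather than to whatever happens to occur.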

Converting two columns of a data frame to a named vector

I need to convert a multi-row two-column data.frame to a named character vector.
My data.frame would be something like:
dd = data.frame(crit = c("a","b","c","d"),
name = c("Alpha", "Beta", "Caesar", "Doris")
)
and what I actually need would be:
whatiwant = c("a" = "Alpha",
"b" = "Beta",
"c" = "Caesar",
"d" = "Doris")
Use the names function:
whatyouwant <- as.character(dd$name)
names(whatyouwant) <- dd$crit
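as.character is necessary because, in R versions before 4.0.0, data.frame and read.table turn character columns into factors with default settings (stringsAsFactors = TRUE). From R 4.0.0 on the default is FALSE, and the call is harmless.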
If you want a one-liner:
whatyouwant <- setNames(as.character(dd$name), dd$crit)
You can also use deframe(x) from the tibble package for this.
tibble::deframe()
It converts the first column to names and second column to values.
You can make a vector from dd$name, and add names using names(), but you can do it all in one step with structure():
whatiwant <- structure(as.character(dd$name), names = as.character(dd$crit))
Here is a very general, easy, tidy way:
library(dplyr)
iris %>%
pull(Sepal.Length, Species)
The first argument is the values, the second argument is the names.
For variety, try split and unlist:
unlist(split(as.character(dd$name), dd$crit))
# a b c d
# "Alpha" "Beta" "Caesar" "Doris"
There's also a magrittr solution to this via the exposition pipe (%$%):
library(magrittr)
dd %$% set_names(as.character(name), crit)
Minor advantage over tibble::deframe is that one doesn't have to have exactly a two-column frame/tibble as argument (i.e., avoid a select(value_col, name_col) %>%).
Note that magrittr::set_names and base::setNames are interchangeable here. I simply prefer the former since it matches "set_(col|row)?names".
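To tie the base R variants together, here is a short sketch using the dd data frame from the question. It checks that setNames(), structure() and split()/unlist() produce the same named vector, and then uses the result as a lookup table (the keys "c" and "a" are picked arbitrarily).

```r
dd <- data.frame(crit = c("a", "b", "c", "d"),
                 name = c("Alpha", "Beta", "Caesar", "Doris"))

# as.character() guards against factor columns in older R versions
via_setNames  <- setNames(as.character(dd$name), as.character(dd$crit))
via_structure <- structure(as.character(dd$name),
                           names = as.character(dd$crit))
via_split     <- unlist(split(as.character(dd$name), dd$crit))

identical(via_setNames, via_structure)
# TRUE

# A named vector doubles as a lookup table
via_setNames[c("c", "a")]
#        c        a
# "Caesar"  "Alpha"
```

One caveat with the split()/unlist() route: it only matches the others when every key occurs exactly once, as it does here.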
