How to fix a subsetting issue in R - r

I am trying to subset my data frame, but when I do, some of the rows I expect are left behind instead of being brought into the subset.
When I try this code it gives me a data frame that has 2,048 obs, but when I then run the second line I still have COW, Negative Control, and Positive Control in the subset.
Controls_data <- subset(data_all, SampleID == c('COW', 'Negative Control', 'Positive Control'))
Sample_data <- subset(data_all, SampleID != c("COW", "Negative Control", "Positive Control"))
I should have 6,144 in the Controls_data. I double checked this in excel because I thought that maybe they were spelled differently or had spaces.

As @arg0naut and @Gregor both suggest, your problem is that == applies R's standard recycling rules and then compares element by element. That is not what you want here.
Compare the outputs of the following lines of code:
letters == c("c", "e")
letters %in% c("c", "e")
letters == c("c", "e", "d")
Notice the warning in the last case. In your case, the length of the left-hand side happens to be a multiple of the length of the right-hand side, so you are not warned.
You could also use the match function in your case, which returns the positions of the matches:
match(c("c", "e", "d"), letters)
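Applied to the subsetting in the question, the %in% version could look like the sketch below. The toy data_all and its value column are made up for illustration; only the SampleID column name and the three control labels come from the question.

```r
# Toy stand-in for data_all (hypothetical values, for illustration only)
data_all <- data.frame(
  SampleID = c("COW", "S1", "Negative Control", "S2", "Positive Control", "S3"),
  value = 1:6
)

controls <- c("COW", "Negative Control", "Positive Control")

# %in% tests each SampleID for membership in the set, with no recycling
Controls_data <- subset(data_all, SampleID %in% controls)
Sample_data <- subset(data_all, !(SampleID %in% controls))
```

With %in%, every control row ends up in Controls_data and none of them remain in Sample_data, regardless of how the rows are ordered.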

Related

Renaming values in R after binning with cut()

I had a list of numerical values that I wanted to bin using cut(). Now each row has been replaced with the range that it fell into, written with brackets, e.g. [0,140], meaning between 0 and 140 inclusive.
The problem is these names are lengthy, and eventually require exponent notation, making them even longer, and it makes the graph illegible. Using typeof() it appears it's still in integer form, but I can't figure out how to rename them the way I would with factors. When I tried with factor() and the labels parameter, I was told that sort only worked on atomic lists.
As an example, here's essentially what I tried on my dataset, except with the built-in iris dataset:
data(iris)
iris[1] <- cut(iris[[1]], 10, include.lowest=TRUE)
iris[1] <- factor(iris[1], labels = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"))
It returns the error:
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
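One likely cause of this error is that iris[1] is a one-column data frame (a list), while factor() expects an atomic vector such as iris[[1]]. The result of cut() is already a factor, which is stored as integer codes, hence the typeof() result. A minimal sketch of two ways to get short labels, assuming the letters a to j are the names wanted:

```r
iris_a <- datasets::iris
iris_b <- datasets::iris

# Option 1: bin first, then overwrite the factor levels
binned <- cut(iris_a[[1]], 10, include.lowest = TRUE)
levels(binned) <- letters[1:10]
iris_a[[1]] <- binned

# Option 2: let cut() assign the labels directly
iris_b[[1]] <- cut(iris_b[[1]], 10, include.lowest = TRUE,
                   labels = letters[1:10])
```

Both options use [[1]] to work on the column vector itself rather than on a one-column data frame.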

SparkR: Assign values of a column with condition

I want to replace values of a column with a certain condition.
Example of R data frame:
df <- data.frame(id=c(1:7),value=c("a", "b", "c", "d", "e", "c", "c"))
I want to replace values "c" and "d", in column value by "e".
In R, it can be done this way
df[df$value %in% c("c","d"),]$value <- "e"
I tried to do the same thing in sparkR. Tried ifelse, when functions but couldn't give me the desired result.
Has anyone run into the same issue?
The first comment by mtoto works well (with Spark 3.0.1) and should be turned into an answer and accepted.
df$value <- ifelse(df$value %in% c("c","d"), "e", df$value)
Another valid, slightly different method to replace strings in a column is the following (note that this pattern only replaces "c"):
df$value <- regexp_replace(df$value, "c", "e")

R conditional variable replacement in dataframe

I need to recode variable (column) values in a data frame. The following snippet replaces my values with what look like array indexes instead of the categorical values:
CMlist <- c("CMdysphagiascreen","CMStrokeUnit","CMVTE","CMantithromd2")
for (i in CMlist) {
  RHSSP[[i]] <- ifelse(RHSSP[[i]] == "NDOC", "Y", RHSSP[[i]])
  RHSSP[[i]] <- ifelse(RHSSP[[i]] == "U", "N", RHSSP[[i]])
  RHSSP[[i]] <- ifelse(is.NULL(RHSSP[[i]]), "N", RHSSP[[i]])
}
No doubt there's a better method for doing this. Can someone explain what's wrong with my attempt and maybe a better way of going about it?
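A likely explanation is that the columns are factors: ifelse() drops the factor attributes of its "no" argument, so the untouched values come back as the underlying integer codes. Also, base R has no is.NULL function (is.null exists, but it tests a whole object rather than each element). A minimal sketch of one way around both issues, assuming the missing entries are NA; the toy RHSSP below is made up and only reuses two column names from the question:

```r
# Toy stand-in for RHSSP (hypothetical values)
RHSSP <- data.frame(
  CMVTE = factor(c("Y", "NDOC", "U", NA)),
  CMStrokeUnit = factor(c("U", "N", "NDOC", "Y"))
)
CMlist <- c("CMVTE", "CMStrokeUnit")

for (i in CMlist) {
  x <- as.character(RHSSP[[i]])     # work on the labels, not the integer codes
  x[x %in% "NDOC"] <- "Y"           # %in% never returns NA, so indexing is safe
  x[x %in% "U" | is.na(x)] <- "N"   # assumption: missing values are NA
  RHSSP[[i]] <- factor(x)
}
```

Converting to character first keeps the labels intact, and the final factor() call restores the column type with only the levels that remain.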

R How to convert a numeric into factor with predefined labels

labs = letters[3:7]
vec = rep(1:5,2)
How do I get a factor whose levels are "c" "d" "e" "f" "g" ?
You can do something like this:
labs = letters[3:7]
vec = rep(1:5,2)
factorVec <- factor(x=vec, levels=sort(unique(vec)), labels = c( "c", "d", "e", "f", "g"))
I have sorted the unique(vec), so as to make results consistent. unique() will return unique values based on the first occurrence of the element. By specifying the order, the code becomes more robust.
Also by specifying the levels and labels both, I think that code will become more readable.
EDIT
If you look in the documentation using ?factor, you will find :
levels
an optional vector of the values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x))
So note that there is some sorting inside the factor function itself. Still, in my opinion one should supply the levels explicitly to make the code more readable.
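To illustrate the point about default sorting, here is a small sketch comparing the implicit and explicit forms on an unsorted vector. The values in vec are made up for this example; labs is the one from the question.

```r
labs <- letters[3:7]
vec <- c(3, 1, 5, 2, 4, 1)  # deliberately unsorted

# Levels default to the sorted unique values of vec
implicit <- factor(vec, labels = labs)

# Same mapping, with the levels spelled out
explicit <- factor(vec, levels = sort(unique(vec)), labels = labs)

as.character(explicit)
```

In both cases 1 maps to "c" and 5 maps to "g", because the levels are sorted before the labels are attached.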

R - show only levels used in a subset of data frame

I have a rather large data frame with a factor that has a lot of levels (more than 4,000). I have another column in the same data frame that I'm using as a reference, and what I'd like to find is a subset of the levels whenever this reference column is NA.
The first step I'm using is subsetrows <- which(is.na(mydata$reference)) but after that I'm stuck. I want something like levels(mydata[subsetrows,mydata$factor]) but unfortunately, this command shows me all the levels and not just the ones existing in subsetrows. I suppose I could create a new vector outside of my data frame of only my subset rows and then drop any unused levels, but is there any easier/cleaner way to do this, possibly without copying my data outside the data frame?
As an example of what I want returned, if my data frame has factor levels from A to Z, but in my subset only P, R and Y appear, I want something that returns the levels P, R and Y.
You can certainly accomplish this with base functions. But my personal preference is to use dplyr with chained operations such as this:
library(dplyr)
d %>%
  filter(is.na(ref)) %>%
  select(field) %>%
  distinct()
data
d <- data.frame(
  field = c("A", "B", "C", "A", "B", "C"),
  ref = c(NA, "a", "b", NA, "c", NA)
)
I modified a suggestion in the comments by Marat to use the function unique that seems to return the correct levels.
Solution:
subsetrows <- which(is.na(mydata$reference))
unique(as.character(mydata$factor[subsetrows]))
While I like learning new packages and functions, this solution seems better at this point since it's more compact and easier for me to understand if I need to revisit this code at some distant point in the future.
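For comparison, the "drop any unused levels" idea from the question can also be done in one line with base R's droplevels(), without keeping a separate copy around. A minimal sketch on a made-up mydata (the column names follow the question, the values are hypothetical):

```r
# Toy stand-in for mydata (hypothetical values)
mydata <- data.frame(
  factor = factor(c("A", "P", "R", "Y", "Z", "P")),
  reference = c(1, NA, NA, NA, 2, NA)
)

subsetrows <- which(is.na(mydata$reference))

# Levels that actually occur in the subset, in level order
used_levels <- levels(droplevels(mydata$factor[subsetrows]))

# The unique() approach returns them in order of first appearance
used_values <- unique(as.character(mydata$factor[subsetrows]))
```

The two approaches return the same set of values; they can differ in ordering, since levels() follows the factor's level order while unique() follows the row order.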
