I have a simple data frame read from a csv:
df <- read.csv("test.csv", na.strings = c("", NA))
that gives this (I am using R):
x y
1 a 1
2 <NA> 1
3 <NA> 2
4 a 1
5 b 2
6 <NA> 3
I would like to change the NA values in df$x to another value based on the value in df$y: if df$x is NA and df$y == 1, df$x should become "p1"; if df$x is NA and df$y == 2, it should become "p2"; and so on, so that the result looks like this:
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
df <- df %>% replace_na(list(x = "p1"))
This lets me replace every NA in df$x with a single value, but I cannot find anything that makes the replacement conditional. It seems I need an ifelse statement, but I cannot get the syntax right.
mutate(df, x = ifelse(is.na(x) && y == 1, x == 'p1', x == 'p2'))
Thanks for any help.
You can use the following solution. Note that the <NA> values you are trying to change are stored as character strings in my copy of the data (df1), so they are not considered NA values (you can verify that with is.na(df1$x)). If your column holds real NA values, test with is.na(x) instead of x == "<NA>".
library(dplyr)
df1 %>%
mutate(x = ifelse(x == "<NA>" & y == 1, "p1", ifelse(x == "<NA>" & y == 2,
"p2", ifelse(x == "<NA>" & y == 3,
"p3", x))))
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
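The answer above tests for the literal string "<NA>", while the question reads real missing values via na.strings. A minimal base R sketch of the difference, using small made-up vectors (x_strings, x_missing and y are illustrative names, not objects from the original post):

```r
# Illustrative vectors: one holds the text "<NA>", the other real missing values
x_strings <- c("a", "<NA>", "<NA>")
x_missing <- c("a", NA, NA)
y <- c(1L, 1L, 2L)

# The literal string is ordinary text, so is.na() does not flag it
is.na(x_strings)
is.na(x_missing)

# With real NA values, one vectorised ifelse() covers every value of y
filled <- ifelse(is.na(x_missing), paste0("p", y), x_missing)
filled
```

If is.na() returns FALSE everywhere for a column that prints <NA>, the column most likely contains text rather than missing values.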
Here is another very concise way that handles any number of replacement values:
library(dplyr)
library(stringr)
library(glue)
df1 %>%
mutate(x = str_replace(x, "<NA>", glue::glue("p{y}")))
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
You can try the code below using replace + is.na
transform(
df,
x = replace(x, is.na(x), paste0("p", y[is.na(x)]))
)
which gives
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
Data
> dput(df)
structure(list(x = c("a", NA, NA, "a", "b", NA), y = c(1L, 1L,
2L, 1L, 2L, 3L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))
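In base R you can use a simple one-liner (as.character() guards against x being a factor, in which case ifelse would return the integer codes instead of the labels):
df$x <- ifelse(is.na(df$x), paste0('p', df$y), as.character(df$x))
# x y
# 1 a 1
# 2 p1 1
# 3 p2 2
# 4 a 1
# 5 b 2
# 6 p3 3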
Related
I have this type of data, where Sequ is a grouping variable:
df <- data.frame(
Sequ = c(1,1,1,
2,2,2,
3,3,
4,4),
Answerer = c("A", NA, NA, "A", NA, NA, "B", NA, "C", NA),
PP_by = c(rep("A",5), rep("B",5)),
pp = c(0.1,0.2,0.3, 1, NA, NA, NA, NA, NA, NA)
)
I need to remove any Sequ where
(i) Answerer == PP_by AND
(ii) there is any NA in pp
I've tried this, but it obviously implements just the second condition (ii):
library(dplyr)
df %>%
group_by(Sequ) %>%
filter(
all(!is.na(pp))
)
The expected result is:
Sequ Answerer PP_by pp
1 1 A A 0.1
2 1 <NA> A 0.2
3 1 <NA> A 0.3
9 4 C B NA
10 4 <NA> B NA
EDIT:
I've come up with this solution:
df %>%
group_by(Sequ) %>%
filter(
first(Answerer) != first(PP_by)
|
all(!is.na(pp))
)
Here's another way:
df %>%
group_by(Sequ) %>%
filter(!(
any(Answerer == PP_by, na.rm = TRUE) &
any(is.na(pp))
))
# # A tibble: 5 × 4
# # Groups: Sequ [2]
# Sequ Answerer PP_by pp
# <dbl> <chr> <chr> <dbl>
# 1 1 A A 0.1
# 2 1 NA A 0.2
# 3 1 NA A 0.3
# 4 4 C B NA
# 5 4 NA B NA
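For comparison, the same group-level test can be sketched in base R with ave(), which evaluates a function per group and recycles the result to every row of that group. This is only a sketch under the assumption that the data are exactly as in the question; match_in_group, na_in_group and kept are names introduced here for illustration.

```r
df <- data.frame(
  Sequ = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  Answerer = c("A", NA, NA, "A", NA, NA, "B", NA, "C", NA),
  PP_by = c(rep("A", 5), rep("B", 5)),
  pp = c(0.1, 0.2, 0.3, 1, NA, NA, NA, NA, NA, NA)
)

# (i) does any row of the group have Answerer equal to PP_by? (NA counts as no match)
match_in_group <- ave(!is.na(df$Answerer) & df$Answerer == df$PP_by,
                      df$Sequ, FUN = any)

# (ii) does the group contain any NA in pp?
na_in_group <- ave(is.na(df$pp), df$Sequ, FUN = any)

# Drop the groups where both conditions hold
kept <- df[!(match_in_group & na_in_group), ]
kept
```

Only groups 1 and 4 remain: group 1 has a match but no NA in pp, and group 4 has NA values in pp but no match.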
I have a dataframe like this (a small sample; the actual df is bigger):
LOW 1 4 NA
MID 3 4 4
HIG 2 5 4
I would like to get the difference between MID and each of LOW and HIG, so the resulting df would look like this:
LOW 2 0 NA
MID 3 4 4
HIG 1 1 0
So for the first column you get LOW = 3 - 1 = 2 and HIG = 3 - 2 = 1. I can do it with VBA macros, but I want to scale it up with R.
It can be done with mutate_if/mutate_at
library(dplyr)
df1 %>%
mutate_if(is.numeric, ~ case_when(grp != 'MID' ~
abs(. - .[grp == 'MID']), TRUE ~ .))
# grp v1 v2 v3
#1 LOW 2 0 NA
#2 MID 3 4 4
#3 HIG 1 1 0
Or in base R
i1 <- df1$grp == 'MID'
df1[!i1, -1] <- abs(df1[!i1, -1] - rep(unlist(df1[i1, -1]), each = sum(!i1)))
data
df1 <- structure(list(grp = c("LOW", "MID", "HIG"), v1 = c(1L, 3L, 2L
), v2 = c(4L, 4L, 5L), v3 = c(NA, 4L, 4L)), class = "data.frame", row.names = c(NA,
-3L))
You can overwrite the 'LOW' and 'HIG' rows with their absolute difference from the 'MID' row:
df1[df1$grp == 'LOW', -1] <- abs(df1[df1$grp == 'MID',-1]- df1[df1$grp == 'LOW',-1])
df1[df1$grp == 'HIG', -1] <- abs(df1[df1$grp == 'MID',-1]- df1[df1$grp == 'HIG',-1])
df1
# grp v1 v2 v3
#1 LOW 2 0 NA
#2 MID 3 4 4
#3 HIG 1 1 0
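When there are many rows besides 'MID', writing one assignment per row does not scale. A possible sketch with base R's sweep(), which subtracts a vector of per-column values from every row of a matrix, assuming all columns after the first are numeric (is_mid, mid_vals and diffs are names introduced here):

```r
df1 <- data.frame(
  grp = c("LOW", "MID", "HIG"),
  v1 = c(1L, 3L, 2L),
  v2 = c(4L, 4L, 5L),
  v3 = c(NA, 4L, 4L)
)

is_mid <- df1$grp == "MID"
mid_vals <- unlist(df1[is_mid, -1])   # one reference value per numeric column

# Subtract the MID value of each column from every row, then take absolute values
diffs <- abs(sweep(as.matrix(df1[, -1]), 2, mid_vals))

# Write the differences back for every row except MID
df1[!is_mid, -1] <- diffs[!is_mid, ]
df1
```

The NA in v3 for LOW stays NA because any arithmetic with NA yields NA.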
I am trying to replace the 0's in my dataframe of thousands of rows and columns with half the minimum value greater than zero from that column. I also do not want to include the first four columns, as they are indexes.
So if I start with something like this:
index <- c("100p", "200p", "300p", "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA, "actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0, 3, 5, 2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 0 0 1
200p 4 NA NA 3 0 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I would hope to gain a result such as this:
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 1 2 1
200p 4 NA NA 3 2 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I have tried the following code:
apply(q[-4], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
but I keep getting the error: Error in min(x[x > 0])/2 : non-numeric argument to binary operator
Any help on this? Thank you very much!
We can use lapply and replace the 0 values in each column with half of that column's minimum positive value.
cols<- 5:7
q[cols] <- lapply(q[cols], function(x) replace(x, x == 0, min(x[x>0], na.rm = TRUE)/2))
q
# index ratio gene species a1 b1 c1
#1 100p 5 gapdh mouse 1 2 1
#2 200p 4 <NA> <NA> 3 2 2
#3 300p 3 <NA> <NA> 5 4 3
#4 400p 2 actb rat 2 6 4
In dplyr, we can use mutate_at
library(dplyr)
q %>% mutate_at(cols,~replace(., . == 0, min(.[.>0], na.rm = TRUE)/2))
data
q <- structure(list(index = structure(1:4, .Label = c("100p", "200p",
"300p", "400p"), class = "factor"), ratio = c(5, 4, 3, 2), gene = structure(c(2L,
NA, NA, 1L), .Label = c("actb", "gapdh"), class = "factor"),
species = structure(c(1L, NA, NA, 2L), .Label = c("mouse",
"rat"), class = "factor"), a1 = c(0, 3, 5, 2), b1 = c(0,
0, 4, 6), c1 = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA, -4L))
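The replacement rule can also be wrapped in a small helper, which makes the edge cases explicit. The sketch below is an illustration only: half_min_fill is a hypothetical name, and leaving a column unchanged when it has no positive values is an assumption, not something the question specifies.

```r
# Hypothetical helper: replace zeros with half the smallest positive value
half_min_fill <- function(x) {
  positive <- x[!is.na(x) & x > 0]
  if (length(positive) == 0) return(x)  # assumption: nothing positive, leave as is
  replace(x, !is.na(x) & x == 0, min(positive) / 2)
}

q <- data.frame(
  index = c("100p", "200p", "300p", "400p"),
  ratio = c(5, 4, 3, 2),
  gene = c("gapdh", NA, NA, "actb"),
  species = c("mouse", NA, NA, "rat"),
  a1 = c(0, 3, 5, 2),
  b1 = c(0, 0, 4, 6),
  c1 = c(1, 2, 3, 4)
)

num_cols <- 5:7   # skip the four index columns
q[num_cols] <- lapply(q[num_cols], half_min_fill)
q
```

Guarding against a column with no positive values avoids the warning and the Inf that min() would otherwise return for an empty vector.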
A slightly different (and potentially faster for large datasets) dplyr option with a bit of maths could be:
q %>%
mutate_at(vars(5:length(.)), ~ (. == 0) * min(.[. != 0])/2 + .)
index ratio gene species a1 b1 c1
1 100p 5 gapdh mouse 1 2 1
2 200p 4 <NA> <NA> 3 2 2
3 300p 3 <NA> <NA> 5 4 3
4 400p 2 actb rat 2 6 4
And the same with base R:
q[, 5:length(q)] <- lapply(q[, 5:length(q)], function(x) (x == 0) * min(x[x != 0])/2 + x)
For reference, considering your original code, I believe your function was not the issue. Instead, the error comes from applying the function to non-numeric data.
# original data
index <- c("100p", "200p", "300p" , "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA,"actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0,3,5,2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
# data frame
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
# examine structure (all cols are factors in R < 4.0, character in R >= 4.0)
str(q)
# convert factors (or character strings) to numeric
fac_to_num <- function(x){
x <- as.numeric(as.character(x))
x
}
# apply to cols 5 thru 7 only
q[, 5:7] <- apply(q[, 5:7],2,fac_to_num)
# examine structure
str(q)
# use original function only on numeric data
apply(q[, 5:7], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
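The root cause described above can be checked directly: cbind() on vectors of mixed type builds a character matrix, whereas data.frame() keeps each column's own type. A short sketch with shortened example vectors (m and q_ok are names introduced here):

```r
index <- c("100p", "200p")
a1 <- c(0, 3)

# cbind() must find one common type for the whole matrix: character
m <- cbind(index, a1)
typeof(m)

# data.frame() stores each vector as its own column, so a1 stays numeric
q_ok <- data.frame(index, a1)
is.numeric(q_ok$a1)
```

Building the data frame with data.frame() from the start therefore removes the need for the conversion step.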
I am trying to create a new variable based on some conditions.
My data looks like
a b
1 NA
2 3
3 3
NA 2
NA NA
What I want is a variable c such that
when a is not NA, b is NA, c = a
when a is NA, b is not NA, c = b
when a is NA, b is NA, c = NA
when a is not NA, b is not NA, and a == b, c = a
when a is not NA, b is not NA, and a != b, c = "multiple_values"
How can I do this?
It seems like ifelse() can't do what I want.
All conditions except one are already handled by coalesce; the exception is the case where 'a' and 'b' are both non-NA and not equal to each other. So we can use case_when to generate "multiple_values" for that last condition and apply coalesce for all the others.
library(dplyr)
df1 %>%
mutate(c = case_when(!is.na(a) & !is.na(b) & a != b ~ "multiple_values",
TRUE ~ as.character(coalesce(a, b))))
# a b c
#1 1 NA 1
#2 2 3 multiple_values
#3 3 3 3
#4 NA 2 2
#5 NA NA <NA>
data
df1 <- structure(list(a = c(1L, 2L, 3L, NA, NA), b = c(NA, 3L, 3L, 2L,
NA)), class = "data.frame", row.names = c(NA, -5L))
In base R you could use within.
dat <- within(dat, {
c <- NA
c[!is.na(a) & is.na(b)] <- a[!is.na(a) & is.na(b)]
c[is.na(a) & !is.na(b)] <- b[is.na(a) & !is.na(b)]
# # c[is.na(a) & is.na(b)] <- NA # redundant
c[!is.na(a) & !is.na(b) & a == b] <- a[!is.na(a) & !is.na(b) & a == b]
c[!is.na(a) & !is.na(b) & a != b] <- "multiple_values"
})
dat
# a b c
# 1 1 NA 1
# 2 2 3 multiple_values
# 3 3 3 3
# 4 NA 2 2
# 5 NA NA <NA>
Data: dat <- data.frame(a=c(1:3, NA, NA), b=c(NA, 3, 3, 2, NA))
ifelse can do what you want; it just takes a lot of nested statements:
df$c <- with(df, ifelse(!is.na(a) & is.na(b), a,
ifelse(is.na(a) & !is.na(b), b,
ifelse(is.na(a) & is.na(b), NA,
ifelse(!is.na(a) & !is.na(b) & a == b, a, "multiple_values")))))
df
# a b c
#1 1 NA 1
#2 2 3 multiple_values
#3 3 3 3
#4 NA 2 2
#5 NA NA <NA>
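The nesting can be avoided by splitting the logic into two steps: first take whichever value is available, then overwrite only the conflicting rows. A base R sketch of that idea, assuming the same five rows as in the question (merged and conflict are names introduced here):

```r
df <- data.frame(a = c(1, 2, 3, NA, NA), b = c(NA, 3, 3, 2, NA))

# Step 1: take a where present, otherwise b (a base R stand-in for coalesce)
merged <- ifelse(is.na(df$a), df$b, df$a)

# Step 2: flag rows where both values exist and disagree
conflict <- !is.na(df$a) & !is.na(df$b) & df$a != df$b

df$c <- as.character(merged)
df$c[conflict] <- "multiple_values"
df
```

The column has to be character because a single vector cannot mix numbers with the "multiple_values" label.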
Here is another base R answer that uses mapply to loop through the pairs of values, a simple function that combines them and drops NAs, and uses switch to decide on the outcome.
df1$c <-
mapply(function(x, y) {
z <- c(x, y)
z <- unique(z[!is.na(z)])
switch(length(z) + 1L, NA, z, "many")
}, df1$a, df1$b)
which returns
df1
a b c
1 1 NA 1
2 2 3 many
3 3 3 3
4 NA 2 2
5 NA NA <NA>
Using data.table, you can:
df1 <- structure(list(a = c(1L, 2L, 3L, NA, NA), b = c(NA, 3L, 3L, 2L,
NA)), class = "data.frame", row.names = c(NA, -5L))
library(data.table)
df1 <- as.data.table(df1)
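# start with a character column so that every later assignment has the same type
df1[, c := "NONE"]
df1[!is.na(a) & is.na(b), c := as.character(a)]
df1[is.na(a) & !is.na(b), c := as.character(b)]
df1[is.na(a) & is.na(b), c := NA_character_]
df1[!is.na(a) & !is.na(b) & a == b, c := as.character(a)]
df1[!is.na(a) & !is.na(b) & a != b, c := "multiple_values"]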
I have multiple inputs like:
a <- x y z
1 2 2
2 3 2
3 2 4
4 2 4
5 2 1
b <- c(1,2)
c <- c(2,3)
I want to subset this data so that, for each i, I keep the rows where a$x is greater than or equal to b[i] and less than or equal to c[i].
The output should look like:
d <- x y z
1 2 2
2 3 2
2 3 2
3 2 4
I have tried this:
d = as.data.frame(matrix(ncol=3, nrow=0))
names(d) = names(a)
for (i in seq_along(b)) {
d <- rbind(d,a[which(a$x>=b[i] & a$x<=c[i]),])
}
Using the dplyr::filter function:
library(dplyr)
sub_list <- lapply(seq_along(b), function(i) a %>% filter(x >= b[i] & x <= c[i]))
do.call(rbind, sub_list)
x y z
1 1 2 2
2 2 3 2
3 2 3 2
4 3 2 4
Input data:
a <- structure(list(x = 1:5, y = c(2L, 3L, 2L, 2L, 2L), z = c(2L,
2L, 4L, 4L, 1L)), .Names = c("x", "y", "z"), class = "data.frame", row.names = c(NA,
-5L))
b <- c(1,2)
c <- c(2,3)
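The same result can be sketched in base R without dplyr by pairing the bounds with Map(). In this sketch the bounds are renamed to lower and upper (names introduced here) so that the vector c does not shadow the base function c(); pieces and d are likewise illustrative names.

```r
a <- data.frame(
  x = 1:5,
  y = c(2L, 3L, 2L, 2L, 2L),
  z = c(2L, 2L, 4L, 4L, 1L)
)
lower <- c(1, 2)   # plays the role of b in the question
upper <- c(2, 3)   # plays the role of c in the question

# One subset per (lower, upper) pair; rows in overlapping ranges appear twice
pieces <- Map(function(lo, hi) a[a$x >= lo & a$x <= hi, ], lower, upper)

d <- do.call(rbind, pieces)
rownames(d) <- NULL
d
```

The row with x = 2 appears twice because it falls inside both ranges, which matches the expected output in the question.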