remove rows containing NA based on condition - r

df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))
From df I want to remove rows containing NA in y based on the condition that the x value in that row is smaller than the x value in the row with the minimum y value to obtain this data frame.
data.frame(x = 3:7, y = c(5, 10, NA, 20, 30))
dplyr solutions preferable!

We can use which.min to get the index of the minimum 'y' value and look up the 'x' value at that position. Comparing every 'x' against that value, combining the comparison with is.na(y), and negating (!) the result keeps the rows we want:
subset(df, !(x< x[which.min(y)] & is.na(y)))
-output
x y
3 3 5
4 4 10
5 5 NA
6 6 20
7 7 30
Or the same logic can be applied with dplyr::filter
library(dplyr)
df %>%
filter(!(x< x[which.min(y)] & is.na(y)))
-output
x y
1 3 5
2 4 10
3 5 NA
4 6 20
5 7 30
data
df <- structure(list(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30)),
class = "data.frame", row.names = c(NA,
-7L))
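To see why row 5 survives the filter, it may help to build the two conditions separately. This is only a sketch of the same logic in base R; the intermediate names (x_at_min, before_min, missing_y, result) are made up for illustration.

```r
df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))

# x value in the row where y is smallest (which.min skips NA values)
x_at_min <- df$x[which.min(df$y)]

before_min <- df$x < x_at_min   # rows located before the minimum-y row
missing_y  <- is.na(df$y)       # rows where y is NA

# Drop only the rows that satisfy both conditions at once
result <- df[!(before_min & missing_y), ]
result
```

Row 5 has a missing y but does not come before the minimum-y row, so only one of the two conditions holds and the row is kept.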

Use a logical index for each of the conditions, combine them with logical AND, &, and negate the result:
df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))
i <- is.na(df$y)
j <- df$x < df$x[which.min(df$y)]
df[!(i & j), ]
# x y
#3 3 5
#4 4 10
#5 5 NA
#6 6 20
#7 7 30

Related

Forming a new column from whichever of two columns isn’t NA [duplicate]

This question already has answers here:
Replace a value NA with the value from another column in R
I have a simplified dataframe:
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
I want to create a new column rating that has the value of the number in either column x or column y. The dataset is such a way that whenever there's a numeric value in x, there's a NA in y. If both columns are NAs, then the value in rating should be NA.
In this case, the expected output is: 1,2,3,3,2,NA
With coalesce:
library(dplyr)
test %>%
mutate(rating = coalesce(x, y))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
library(dplyr)
test %>%
mutate(rating = if_else(is.na(x),
y, x))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
Here are several solutions.
# Input
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
# Base R solution
test$rating <- ifelse(!is.na(test$x), test$x,
ifelse(!is.na(test$y), test$y, NA))
# dplyr solution
library(dplyr)
test <- test %>%
mutate(rating = case_when(!is.na(x) ~ x,
!is.na(y) ~ y,
TRUE ~ NA_real_))
# data.table solution
library(data.table)
setDT(test)
test[, rating := ifelse(!is.na(x), x, ifelse(!is.na(y), y, NA))]
Created on 2022-12-23 with reprex v2.0.2
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
test$rating <- dplyr::coalesce(test$x, test$y)
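If dplyr is not available, the same "first non-NA value" idea can be sketched in base R. coalesce_base below is a hypothetical helper written for this example, not a function from any package, and it assumes all inputs have the same length and a compatible type.

```r
test <- data.frame(
  x = c(1, 2, 3, NA, NA, NA),
  y = c(NA, NA, NA, 3, 2, NA),
  a = c(NA, NA, NA, NA, NA, TRUE)
)

# coalesce_base (hypothetical name): element by element, return the first
# non-NA value among the vectors passed in, working from left to right
coalesce_base <- function(...) {
  Reduce(function(first, second) ifelse(is.na(first), second, first), list(...))
}

test$rating <- coalesce_base(test$x, test$y)
test$rating
```

Because it folds over its arguments with Reduce, the helper also accepts more than two vectors, e.g. coalesce_base(x, y, z).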

How do I change NA in df$x when df$y == 1

I have a simple data frame read from a csv:
df <- read.csv("test.csv", na.strings = c("", NA))
that gives this (I am using R):
x y
1 a 1
2 <NA> 1
3 <NA> 2
4 a 1
5 b 2
6 <NA> 3
I would like to change the NA in df$x to another value based on the value in df$y. Such that, if df$x = NA and df$y = 1, df$x = "p1", if df$x = NA and df$y = 2, df$x = "p2" and so on to look like this:
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
df <- df %>% replace_na(list(x = "p1"))
Lets me change the whole of df$x, but I am unable to find anything that puts a stringent condition on it. It would seem I need to use an ifelse statement but I cannot seem to get the syntax correct.
mutate(df, x = ifelse(is.na(x) && y == 1, x == 'p1', x == 'p2'))
Thanks for any help.
You can use the following solution. Note that the <NA> values you are trying to change are stored as character strings here, so they are not considered NA values (you can verify that with is.na(df1$x)).
library(dplyr)
df1 %>%
mutate(x = ifelse(x == "<NA>" & y == 1, "p1", ifelse(x == "<NA>" & y == 2,
"p2", ifelse(x == "<NA>" & y == 3,
"p3", x))))
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
Here is another very concise way that handles any number of replacements:
library(dplyr)
library(stringr)
library(glue)
df1 %>%
mutate(x = str_replace(x, "<NA>", glue::glue("p{y}")))
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
You can try the code below using replace + is.na
transform(
df,
x = replace(x, is.na(x), paste0("p", y[is.na(x)]))
)
which gives
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
Data
> dput(df)
structure(list(x = c("a", NA, NA, "a", "b", NA), y = c(1L, 1L,
2L, 1L, 2L, 3L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))
In base R you can use a simple one-liner (as.character() guards against x having been read in as a factor):
df$x <- ifelse(is.na(df$x), paste0('p', df$y), as.character(df$x))
# x y
# 1 a 1
# 2 p1 1
# 3 p2 2
# 4 a 1
# 5 b 2
# 6 p3 3
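When the replacement cannot be built by pasting a prefix onto y, a named lookup vector is one possible alternative. The sketch below assumes x holds real NA values (as in the dput above); the lookup vector and its entries are hypothetical.

```r
df <- data.frame(
  x = c("a", NA, NA, "a", "b", NA),
  y = c(1L, 1L, 2L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: names are the y values, entries the replacements
lookup <- c("1" = "p1", "2" = "p2", "3" = "p3")

missing_x <- is.na(df$x)
# Index the lookup by name; unname() drops the lookup names from the result
df$x[missing_x] <- unname(lookup[as.character(df$y[missing_x])])
df
```

A y value with no entry in lookup yields NA, so such rows simply stay missing.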

Replace values outside range with NA using replace_with_na function

I have the following dataset
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 34
3 9 1 77
4 2 9 88
5 9 12 33
6 8 NA 60
From column b I only want values between 4-9. Column c between 50-80. Replacing the values outside the range with NA, resulting in
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, NA, 9, NA,
NA), c = c(50, NA, 77, NA, NA, 60)), class = "data.frame", row.names = c(NA,
-6L))
a b c
1 2 4 50
2 1 5 NA
3 9 NA 77
4 2 9 NA
5 9 NA NA
6 8 NA 60
I've tried several things with replace_with_na_at function where this seemed most logical:
test <- replace_with_na_at(data = test, .vars="c",
condition = ~.x < 2 & ~.x > 2)
However, nothing I tried works. Does somebody know why? Thanks in advance! :)
You can subset with a logical vector testing your conditions.
x$b[x$b < 4 | x$b > 9] <- NA
x$c[x$c < 50 | x$c > 80] <- NA
x
# a b c
#1 2 4 50
#2 1 5 NA
#3 9 NA 77
#4 2 9 NA
#5 9 NA NA
#6 8 NA 60
Data:
x <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
Yet another base R solution, this time with function is.na<-
is.na(test$b) <- with(test, b < 4 | b > 9)
is.na(test$c) <- with(test, c < 50 | c > 80)
A package naniar solution with a pipe could be
library(naniar)
library(magrittr)
test %>%
replace_with_na_at(
.vars = 'b',
condition = ~(.x < 4 | .x > 9)
) %>%
replace_with_na_at(
.vars = 'c',
condition = ~(.x < 50 | .x > 80)
)
You should mention the packages you are using. From googling, I'm guessing you are using naniar. The problem appears to be that you did not specify the condition properly; the following should work:
library(naniar)
test <- structure(list(a = c(2, 1, 9, 2, 9, 8),
b = c(4, 5, 1, 9, 12, NA),
c = c(50, 34, 77, 88, 33, 60)),
class = "data.frame",
row.names = c(NA, -6L))
replace_with_na_at(test, "c", ~.x < 50 | .x > 80)
#> a b c
#> 1 2 4 50
#> 2 1 5 NA
#> 3 9 1 77
#> 4 2 9 NA
#> 5 9 12 NA
#> 6 8 NA 60
Created on 2020-06-02 by the reprex package (v0.3.0)
You simply could use Map to replace your values with NA.
dat[2:3] <- Map(function(x, y) {x[!x %in% y] <- NA;x}, dat[2:3], list(4:9, 50:80))
dat
# a b c
# 1 2 4 50
# 2 1 5 NA
# 3 9 NA 77
# 4 2 9 NA
# 5 9 NA NA
# 6 8 NA 60
Data:
dat <- structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12,
NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA,
-6L))
We can use map2
library(purrr)
library(dplyr)
df1[c('b', 'c')] <- map2(df1 %>%
select(b, c), list(c(4, 9), c(50,80)), ~
replace(.x, .x < .y[1]|.x > .y[2], NA))
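For more than a couple of columns, the ranges could also be kept in a named list and applied in a loop. This is a base R sketch; the ranges list is a hypothetical specification with inclusive bounds, and unlike the %in% approach it also works for non-integer values.

```r
dat <- data.frame(
  a = c(2, 1, 9, 2, 9, 8),
  b = c(4, 5, 1, 9, 12, NA),
  c = c(50, 34, 77, 88, 33, 60)
)

# Hypothetical specification: column name -> c(lower, upper), both inclusive
ranges <- list(b = c(4, 9), c = c(50, 80))

for (col in names(ranges)) {
  lower <- ranges[[col]][1]
  upper <- ranges[[col]][2]
  # which() drops NA comparisons, so values that are already NA are left alone
  outside <- which(dat[[col]] < lower | dat[[col]] > upper)
  dat[[col]][outside] <- NA
}
dat
```

Columns not named in ranges (here a) are never touched.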

How can I replace zeros with half the minimum value within a column?

I am trying to replace 0's in my dataframe of thousands of rows and columns with half the minimum value greater than zero from that column. I also do not want to include the first four columns, as they are indexes.
So if I start with something like this:
index <- c("100p", "200p", "300p", "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA, "actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0, 3, 5, 2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 0 0 1
200p 4 NA NA 3 0 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I would hope to gain a result such as this:
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 1 2 1
200p 4 NA NA 3 2 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I have tried the following code:
apply(q[-4], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
but I keep getting the error: Error in min(x[x > 0])/2 : non-numeric argument to binary operator
Any help on this? Thank you very much!
We can use lapply and replace the 0 values with half the minimum positive value in each column.
cols<- 5:7
q[cols] <- lapply(q[cols], function(x) replace(x, x == 0, min(x[x>0], na.rm = TRUE)/2))
q
# index ratio gene species a1 b1 c1
#1 100p 5 gapdh mouse 1 2 1
#2 200p 4 <NA> <NA> 3 2 2
#3 300p 3 <NA> <NA> 5 4 3
#4 400p 2 actb rat 2 6 4
In dplyr, we can use mutate_at
library(dplyr)
q %>% mutate_at(cols,~replace(., . == 0, min(.[.>0], na.rm = TRUE)/2))
data
q <- structure(list(index = structure(1:4, .Label = c("100p", "200p",
"300p", "400p"), class = "factor"), ratio = c(5, 4, 3, 2), gene = structure(c(2L,
NA, NA, 1L), .Label = c("actb", "gapdh"), class = "factor"),
species = structure(c(1L, NA, NA, 2L), .Label = c("mouse",
"rat"), class = "factor"), a1 = c(0, 3, 5, 2), b1 = c(0,
0, 4, 6), c1 = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA, -4L))
A slightly different (and potentially faster for large datasets) dplyr option with a bit of maths could be:
q %>%
mutate_at(vars(5:length(.)), ~ (. == 0) * min(.[. != 0])/2 + .)
index ratio gene species a1 b1 c1
1 100p 5 gapdh mouse 1 2 1
2 200p 4 <NA> <NA> 3 2 2
3 300p 3 <NA> <NA> 5 4 3
4 400p 2 actb rat 2 6 4
And the same with base R:
q[, 5:length(q)] <- lapply(q[, 5:length(q)], function(x) (x == 0) * min(x[x != 0])/2 + x)
For reference, considering your original code, I believe your function was not the issue. Instead, the error comes from applying the function to non-numeric data.
# original data
index <- c("100p", "200p", "300p" , "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA,"actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0,3,5,2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
# data frame
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
# examine structure (all cols are character; they were factors before R 4.0)
str(q)
# convert factors to numeric
fac_to_num <- function(x){
x <- as.numeric(as.character(x))
x
}
# apply to cols 5 thru 7 only
q[, 5:7] <- apply(q[, 5:7],2,fac_to_num)
# examine structure
str(q)
# use original function only on numeric data
apply(q[, 5:7], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
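With thousands of columns, it may be safer to pick the numeric columns by type and to guard against columns that contain no positive value (where min() would otherwise return Inf with a warning). The following is a sketch on a reduced version of the example data; half_min_fill is a hypothetical helper name, and excluding "ratio" by name is an assumption about which numeric columns are identifiers.

```r
q <- data.frame(
  index = c("100p", "200p", "300p", "400p"),
  ratio = c(5, 4, 3, 2),
  a1 = c(0, 3, 5, 2),
  b1 = c(0, 0, 4, 6),
  c1 = c(1, 2, 3, 4)
)

# half_min_fill (hypothetical name): replace zeros with half the smallest
# positive value; columns without any positive value are returned unchanged
half_min_fill <- function(x) {
  positive <- x[!is.na(x) & x > 0]
  if (length(positive) == 0) return(x)
  x[which(x == 0)] <- min(positive) / 2
  x
}

# Select numeric columns by type, then exclude the identifier column "ratio"
is_num <- vapply(q, is.numeric, logical(1))
cols <- setdiff(names(q)[is_num], "ratio")
q[cols] <- lapply(q[cols], half_min_fill)
q
```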

filtering a dataframe in R based on how many elements in a Row are filled out

I have the following data frame (dput at end):
> d
a b d
1 1 NA NA
2 NA NA NA
3 2 2 2
4 3 3 NA
I want to keep only the rows that have at least two items that are not NA. How do I get the following result?
> d
a b d
3 2 2 2
4 3 3 NA
> dput(d)
structure(list(a = c(1, NA, 2, 3), b = c(NA, NA, 2, 3), d = c(NA,
NA, 2, NA)), .Names = c("a", "b", "d"), row.names = c(NA, -4L
), class = "data.frame")
We can get the rowSums of the logical matrix (is.na(d)), use that to create a logical vector (..<2) to subset the rows.
d[rowSums(is.na(d))<2,]
# a b d
#3 2 2 2
#4 3 3 NA
Or, as @DavidArenburg mentioned, it can also be done with Reduce
d[Reduce(`+`, lapply(d, is.na)) < 2, ]
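Note that "fewer than two NAs" matches "at least two filled values" only because d has exactly three columns. A sketch that counts the non-NA entries directly does not depend on the column count; min_filled and kept are names chosen here for illustration.

```r
d <- data.frame(
  a = c(1, NA, 2, 3),
  b = c(NA, NA, 2, 3),
  d = c(NA, NA, 2, NA)
)

min_filled <- 2  # hypothetical threshold: minimum number of non-NA values per row

# Count the filled cells in each row and keep the rows that reach the threshold
kept <- d[rowSums(!is.na(d)) >= min_filled, ]
kept
```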
