Editing a data frame with certain conditions - r

I have a data frame and would like to blank out specific values according to a simple rule: wherever x equals 2, y should be NA.
Here is an example:
x <- c(1, 2, 1, 2, 1, 2, 1, 2)
y <- c(5, 5, NA, NA, 6, 6, 4, 4)
df <- data.frame(x, y)
df
x y
1 1 5
2 2 5
3 1 NA
4 2 NA
5 1 6
6 2 6
7 1 4
8 2 4
And the output should look like this:
x y
1 1 5
2 2 NA
3 1 NA
4 2 NA
5 1 6
6 2 NA
7 1 4
8 2 NA
Is there a way to solve that with ifelse? I am grateful for any help.

You could do
df$y[df$x == 2] <- NA
df
# x y
#1 1 5
#2 2 NA
#3 1 NA
#4 2 NA
#5 1 6
#6 2 NA
#7 1 4
#8 2 NA
Or with replace
df$y <- replace(df$y, df$x == 2, NA)
Using the same logic in dplyr's mutate
library(dplyr)
df %>%
mutate(y = replace(y, x==2, NA))
Or the ifelse version
df$y <- ifelse(df$x == 2, NA, df$y)
df %>%
mutate(y = ifelse(x == 2, NA, y))
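As a quick check, the three base R variants above can be compared on the example data. This is only a sketch; the names by_index, by_replace and by_ifelse are made up for the comparison.

```r
x <- c(1, 2, 1, 2, 1, 2, 1, 2)
y <- c(5, 5, NA, NA, 6, 6, 4, 4)
df <- data.frame(x, y)

# 1. Logical subsetting with assignment
by_index <- df$y
by_index[df$x == 2] <- NA

# 2. replace() returns a modified copy of the vector
by_replace <- replace(df$y, df$x == 2, NA)

# 3. ifelse() picks NA or the original value element by element
by_ifelse <- ifelse(df$x == 2, NA, df$y)

# Compare the three results
identical(by_index, by_replace) && identical(by_index, by_ifelse)
```

All three leave df itself untouched here; assign the result back to df$y to update the data frame.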

Related

Replacing leading NAs by group with 0s, but Keep other NAs

I have a COVID data frame grouped by state with 60 columns. Because COVID started at different times in different states, each state has NAs before its first values. The different indicators (columns) also start at different times. Below is a sample df I made for demonstration.
state <- c(rep("A", 6), rep("B", 6))
time <- c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6)
x1 <- c(NA, NA, NA, 4, 5, 6, NA, NA, 3, 4, 5, NA)
x2 <- c(NA, 2, 3, NA, 5, 6, NA, NA, NA, 4, 5, 6)
x3 <- c(NA, NA, 3, 4, 5, NA, NA, 2, NA, 4, 5, 6)
df <- data.frame(state, time, x1, x2, x3)
df
state time x1 x2 x3
1 A 1 NA NA NA
2 A 2 NA 2 NA
3 A 3 NA 3 3
4 A 4 4 NA 4
5 A 5 5 5 5
6 A 6 6 6 NA
7 B 1 NA NA NA
8 B 2 NA NA 2
9 B 3 3 NA NA
10 B 4 4 4 4
11 B 5 5 5 5
12 B 6 NA 6 6
I'm trying to replace all the leading NAs with 0 for each state, but keep the other NAs. The result should look like this:
state time x1 x2 x3
1 A 1 0 0 0
2 A 2 0 2 0
3 A 3 0 3 3
4 A 4 4 NA 4
5 A 5 5 5 5
6 A 6 6 6 NA
7 B 1 0 0 0
8 B 2 0 0 2
9 B 3 3 0 NA
10 B 4 4 4 4
11 B 5 5 5 5
12 B 6 NA 6 6
One solution I came up with is to replace NAs by the condition of the cumulative sums, as below:
df1 <- df %>%
group_by(state) %>%
mutate(
check.sum1 = cumsum(replace_na(x1, 0)),
x1 = if_else(check.sum1 != 0, x1, 0),
check.sum2 = cumsum(replace_na(x2, 0)),
x2 = if_else(check.sum2 != 0, x2, 0),
check.sum3 = cumsum(replace_na(x3, 0)),
x3 = if_else(check.sum3 != 0, x3, 0)
)
df1
This method works fine, but since there are 60 columns, I want to wrap it in a function and/or use apply(). My attempts produce error messages:
df2 <- df %>%
group_by(state) %>%
apply(
df[3:5], MARGIN = 2, FUN = function(x) mutate(
check.sum = cumsum(replace_na(x, 0)),
x = if_else(check.sum != 0, x, 0)
)
)
Error in FUN(newX[, i], ...) : unused argument (df[3:5])
#or
func <- function(x) {
mutate(
check.sum = cumsum(replace_na(x, 0)),
x = if_else(check.sum != 0, x, 0)
)
}
df3 <- df %>%
group_by(state) %>%
apply(
df[3:5], MARGIN = 2, func
)
Error in match.fun(FUN) :
'df[3:5]' is not a function, character or symbol
So I have three specific questions:
How do I write a user-defined function that takes columns as arguments?
How do I use apply() here?
Are there other ways to do this with existing functions, such as na.locf() or na.trim()?
Thank you!
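Using by(), replace an element when it is NA and its NA status did not just switch on, i.e. the difference of is.na() from the previous element is less than or equal to zero. Note that this relies on non-leading NAs occurring one at a time: in a run of consecutive NAs after the first value, the second and later NAs would also be set to 0.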
do.call(rbind, by(df, df$state, \(x) {
x[] <- lapply(x, \(z) {z[is.na(z) & c(0, diff(is.na(z))) <= 0] <- 0; z})
return(x)
}))
# state time x1 x2 x3
# A.1 A 1 0 0 0
# A.2 A 2 0 2 0
# A.3 A 3 0 3 3
# A.4 A 4 4 NA 4
# A.5 A 5 5 5 5
# A.6 A 6 6 6 NA
# B.7 B 1 0 0 0
# B.8 B 2 0 0 2
# B.9 B 3 3 0 NA
# B.10 B 4 4 4 4
# B.11 B 5 5 5 5
# B.12 B 6 NA 6 6
Note: The \(x) shorthand for anonymous functions requires R >= 4.1; on older versions write function(x) instead.
Using dplyr, we can do
library(dplyr)
df %>%
group_by(state) %>%
mutate(across(starts_with('x'), ~ replace(., !cumsum(!is.na(.)), 0))) %>%
ungroup
# A tibble: 12 × 5
state time x1 x2 x3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 0
2 A 2 0 2 0
3 A 3 0 3 3
4 A 4 4 NA 4
5 A 5 5 5 5
6 A 6 6 6 NA
7 B 1 0 0 0
8 B 2 0 0 2
9 B 3 3 0 NA
10 B 4 4 4 4
11 B 5 5 5 5
12 B 6 NA 6 6
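Regarding the first two questions, one way to wrap the logic in a function and apply it to many columns is sketched below in base R. The helper name fill_leading_na and the value_cols vector are illustrative, not from the original posts; ave() applies the helper within each state.

```r
state <- c(rep("A", 6), rep("B", 6))
time <- c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6)
x1 <- c(NA, NA, NA, 4, 5, 6, NA, NA, 3, 4, 5, NA)
x2 <- c(NA, 2, 3, NA, 5, 6, NA, NA, NA, 4, 5, 6)
x3 <- c(NA, NA, 3, 4, 5, NA, NA, 2, NA, 4, 5, 6)
df <- data.frame(state, time, x1, x2, x3)

# Hypothetical helper: replace the NAs that come before the first observed value
fill_leading_na <- function(v, fill = 0) {
  seen_value <- cumsum(!is.na(v)) > 0  # FALSE until the first non-NA
  v[!seen_value] <- fill
  v
}

# In the real data this would name the indicator columns
value_cols <- c("x1", "x2", "x3")

# Apply the helper to each column, separately within each state
df[value_cols] <- lapply(df[value_cols], function(col) {
  ave(col, df$state, FUN = fill_leading_na)
})
df
```

Because the helper only looks at whether a value has been seen yet, runs of NAs after the first value are left alone. A group that is entirely NA would be filled with 0 throughout, which may or may not be wanted.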

Modify variables in longitudinal data sets (keep first appearance of values on person-level)

I have a dataframe:
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
t <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
x <- c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1)
y <- c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
dat <- data.frame(i, t, x, y, j1)
dat
i t x y j1
1 1 1 0 5 NA
2 1 2 0 6 NA
3 1 3 0 7 NA
4 1 4 0 8 NA
5 2 1 0 4 NA
6 2 2 1 5 5
7 2 3 0 6 NA
8 2 4 1 7 7
9 3 1 0 6 NA
10 3 2 0 7 NA
11 3 3 1 8 8
12 3 4 1 8 8
The dataframe refers to 3 persons "i" at 4 points in time "t". "j1" switches to "y" when "x" turns from 0 to 1 for a person "i". While "x" stays on 1 for a person, "j1" does not change within time (see person 3). When "x" is 0, "j1" is always NA.
Now I want to add a new variable "j2" to the dataframe which is a modification of "j1". The difference should be the following: For each person "i", there should be only one value for "j2". Namely, it should be the first value for "j1" for each person (the first change from 0 to 1 in "x").
Accordingly, the result should look like this:
dat
i t x y j1 j2
1 1 1 0 5 NA NA
2 1 2 0 6 NA NA
3 1 3 0 7 NA NA
4 1 4 0 8 NA NA
5 2 1 0 4 NA NA
6 2 2 1 5 5 5
7 2 3 0 6 NA NA
8 2 4 1 7 7 NA
9 3 1 0 6 NA NA
10 3 2 0 7 NA NA
11 3 3 1 8 8 8
12 3 4 1 8 8 NA
I appreciate suggestions on how to address this with dplyr
Somewhat more concise than the others:
library(tidyverse)
dat <- structure(list(i = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), t = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), x = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1), y = c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8), j1 = c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)), class = "data.frame", row.names = c(NA, -12L))
dat %>%
group_by(i) %>%
mutate(j2 = ifelse(1:n() == which(x == 1)[1], y, NA)) %>%
ungroup()
#> # A tibble: 12 × 6
#> i t x y j1 j2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 5 NA NA
#> 2 1 2 0 6 NA NA
#> 3 1 3 0 7 NA NA
#> 4 1 4 0 8 NA NA
#> 5 2 1 0 4 NA NA
#> 6 2 2 1 5 5 5
#> 7 2 3 0 6 NA NA
#> 8 2 4 1 7 7 NA
#> 9 3 1 0 6 NA NA
#> 10 3 2 0 7 NA NA
#> 11 3 3 1 8 8 8
#> 12 3 4 1 8 8 NA
A possible solution:
library(tidyverse)
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
t <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
x <- c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1)
y <- c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
df <- data.frame(i, t, x, y, j1)
tmp <- df %>%
filter(x == 1) %>%
group_by(i) %>%
slice(1) %>%
ungroup() %>%
rename(j2 = j1)
left_join(df, tmp)
#> Joining, by = c("i", "t", "x", "y")
#> i t x y j1 j2
#> 1 1 1 0 5 NA NA
#> 2 1 2 0 6 NA NA
#> 3 1 3 0 7 NA NA
#> 4 1 4 0 8 NA NA
#> 5 2 1 0 4 NA NA
#> 6 2 2 1 5 5 5
#> 7 2 3 0 6 NA NA
#> 8 2 4 1 7 7 NA
#> 9 3 1 0 6 NA NA
#> 10 3 2 0 7 NA NA
#> 11 3 3 1 8 8 8
#> 12 3 4 1 8 8 NA
Created on 2021-09-08 by the reprex package (v2.0.1)
Function f sets every element after the first non-NA value of a vector x to NA. It is applied to j1 within each group defined by i.
f <- function(x){
ind <- which(!is.na(x))[1]
if(is.na(ind) || ind == length(x)) return(x)
x[(ind + 1):length(x)] <- NA
x
}
dat %>%
group_by(i) %>%
mutate(j2 = f(j1)) %>%
ungroup()
Option 1
You can use dplyr's mutate() with replace() to set to NA every value of j1 for which both the current and the previous (lag()) value are non-NA:
library(dplyr)
dat %>% group_by(i) %>%
mutate(j2=replace(j1, !is.na(j1) & !is.na(lag(j1)), NA))
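Option 2
You can use replace() to set to NA all values in j1 except the first non-NA value (the negative index -which(!is.na(j1))[1] selects everything but that position). Unlike Option 1, this also removes a later value that follows a gap, such as the 7 for person 2.
dat %>% group_by(i) %>%
mutate(j2=replace(j1, -which(!is.na(j1))[1], NA))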
Option 3
You can use purrr::accumulate() too. Call accumulate() to compare consecutive (.x, .y) values from the j1 vector. If they are the same, the output will be NA.
library(dplyr)
dat %>% group_by(i) %>%
mutate(j2=purrr::accumulate(j1, ~ifelse(.x %in% .y, NA, .y)))
Output (Options 1 and 3; only consecutive repeats are removed, so the 7 in row 8 remains)
# A tibble: 12 x 6
# Groups: i [3]
i t x y j1 j2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 5 NA NA
2 1 2 0 6 NA NA
3 1 3 0 7 NA NA
4 1 4 0 8 NA NA
5 2 1 0 4 NA NA
6 2 2 1 5 5 5
7 2 3 0 6 NA NA
8 2 4 1 7 7 7
9 3 1 0 6 NA NA
10 3 2 0 7 NA NA
11 3 3 1 8 8 8
12 3 4 1 8 8 NA
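For comparison, a base R sketch that keeps only the first non-NA value of j1 per person, including the case where a later value follows a gap. The helper name keep_first_non_na is hypothetical, and only the columns i and j1 are rebuilt here to keep the example short.

```r
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
dat <- data.frame(i, j1)

# Hypothetical helper: NA everywhere except at the first non-NA position
keep_first_non_na <- function(v) {
  out <- v
  out[] <- NA                    # same type and length as v, all NA
  first <- which(!is.na(v))[1]   # NA when the group has no value at all
  if (!is.na(first)) out[first] <- v[first]
  out
}

# Apply the helper within each person i
dat$j2 <- ave(dat$j1, dat$i, FUN = keep_first_non_na)
dat
```

Person 1 has no value at all, so j2 stays NA for every row of that group.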

How best to create a new column for each two-column comparison using purrr?

Say I have the following dataframe:
ABC1_old <- c(1, 5, 3, 4, 3, NA, NA, NA, NA, NA)
ABC2_old <- c(4, 2, 1, 1, 5, NA, NA, NA, NA, NA)
ABC1_adj <- c(NA, NA, NA, NA, NA, 5, 5, 1, 2, 4)
ABC2_adj <- c(NA, NA, NA, NA, NA, 3, 2, 1, 4, 2)
df <- data.frame(ABC1_old, ABC2_old, ABC1_adj, ABC2_adj)
I want to create a column that compares each ABCn_old with its corresponding ABCn_adj. (So ABC1_old would be compared against ABC1_adj, etc.) The resulting column would be called ABCn_new. The rule is: if ABCn_old is NA, fill in the corresponding value from ABCn_adj; otherwise use ABCn_old's value. The new columns would look like this:
df$ABC1_new <- c(1, 5, 3, 4, 3, 5, 5, 1, 2, 4)
df$ABC2_new <- c(4, 2, 1, 1, 5, 3, 2, 1, 4, 2)
I know a simple mutate could work here, but I would like to use some kind of tidyverse looping via purrr if possible since the dataset is much larger in reality. Any ideas for the best way to achieve this?
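Split the columns by their common prefix and coalesce each group with purrr's map_dfc:
library(tidyverse)
map_dfc(split.default(df, str_remove(names(df), "_.*")), ~coalesce(!!!.x))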
# A tibble: 10 x 2
ABC1 ABC2
<dbl> <dbl>
1 1 4
2 5 2
3 3 1
4 4 1
5 3 5
6 5 3
7 5 2
8 1 1
9 2 4
10 4 2
Putting it together:
df %>%
split.default(str_replace(names(.), "_.*", "_new")) %>%
map_dfc(~coalesce(!!!.x))%>%
cbind(df, .)
ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
1 1 4 NA NA 1 4
2 5 2 NA NA 5 2
3 3 1 NA NA 3 1
4 4 1 NA NA 4 1
5 3 5 NA NA 3 5
6 NA NA 5 3 5 3
7 NA NA 5 2 5 2
8 NA NA 1 1 1 1
9 NA NA 2 4 2 4
10 NA NA 4 2 4 2
Using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = c(".value", 'grp'),
names_sep = '_', values_drop_na = TRUE) %>%
select(-grp, -rn) %>%
rename_all(~ str_c(., '_new')) %>% bind_cols(df, .)
# ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
#1 1 4 NA NA 1 4
#2 5 2 NA NA 5 2
#3 3 1 NA NA 3 1
#4 4 1 NA NA 4 1
#5 3 5 NA NA 3 5
#6 NA NA 5 3 5 3
#7 NA NA 5 2 5 2
#8 NA NA 1 1 1 1
#9 NA NA 2 4 2 4
#10 NA NA 4 2 4 2
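Or using dplyr's across(), building the new name by swapping 'old' for 'new' (a plain '{.col}_new' would yield ABC1_old_new)
df %>%
mutate(across(ends_with('old'),
~ coalesce(., get(str_replace(cur_column(),
'old', 'adj'))), .names = "{str_replace(.col, 'old', 'new')}"))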
I have a package on GitHub to solve this and similar problems. In this case we could use dplyover::across2 to apply one (or more) functions to two sets of columns, which can be selected with tidyselect. In the .names argument we can specify "{pre}" to refer to the common prefix of both sets of columns.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df %>%
mutate(across2(ends_with("_old"),
ends_with("_adj"),
~ coalesce(.x, .y),
.names = "{pre}_new"))
#> ABC1_old ABC2_old ABC1_adj ABC2_adj ABC1_new ABC2_new
#> 1 1 4 NA NA 1 4
#> 2 5 2 NA NA 5 2
#> 3 3 1 NA NA 3 1
#> 4 4 1 NA NA 4 1
#> 5 3 5 NA NA 3 5
#> 6 NA NA 5 3 5 3
#> 7 NA NA 5 2 5 2
#> 8 NA NA 1 1 1 1
#> 9 NA NA 2 4 2 4
#> 10 NA NA 4 2 4 2
Created on 2021-05-16 by the reprex package (v0.3.0)
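If no extra packages are wanted, the same pairing can be sketched with a plain base R loop over the column prefixes. This is an alternative to the purrr approaches above, not a replacement for them; the variable names prefixes, old and adj are illustrative, and the loop assumes every _old column has a matching _adj column.

```r
ABC1_old <- c(1, 5, 3, 4, 3, NA, NA, NA, NA, NA)
ABC2_old <- c(4, 2, 1, 1, 5, NA, NA, NA, NA, NA)
ABC1_adj <- c(NA, NA, NA, NA, NA, 5, 5, 1, 2, 4)
ABC2_adj <- c(NA, NA, NA, NA, NA, 3, 2, 1, 4, 2)
df <- data.frame(ABC1_old, ABC2_old, ABC1_adj, ABC2_adj)

# Derive the common prefixes ("ABC1", "ABC2") from the *_old column names
prefixes <- sub("_old$", "", grep("_old$", names(df), value = TRUE))

for (p in prefixes) {
  old <- df[[paste0(p, "_old")]]
  adj <- df[[paste0(p, "_adj")]]  # assumed to exist for every prefix
  # Take the adjusted value only where the old one is missing
  df[[paste0(p, "_new")]] <- ifelse(is.na(old), adj, old)
}
df
```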

Merge two columns containing NA values in complementing rows

Suppose I have this dataframe
df <- data.frame(
x=c(1, NA, NA, 4, 5, NA),
y=c(NA, 2, 3, NA, NA, 6))
which looks like this
x y
1 1 NA
2 NA 2
3 NA 3
4 4 NA
5 5 NA
6 NA 6
How can I merge the two columns into one? Basically the NA values are in complementary rows. It would be nice to also obtain (in the process) a flag column containing 0 if the entry comes from x and 1 if the entry comes from y.
We can try using the coalesce function from the dplyr package:
library(dplyr)
df$merged <- coalesce(df$x, df$y)
df$flag <- ifelse(is.na(df$y), 0, 1)
df
x y merged flag
1 1 NA 1 0
2 NA 2 2 1
3 NA 3 3 1
4 4 NA 4 0
5 5 NA 5 0
6 NA 6 6 1
We can also use base R: max.col on the logical matrix !is.na(df) gives the column index of the non-NA entry in each row, and cbind-ing it with the row index lets us extract those values
df$merged <- df[cbind(seq_len(nrow(df)), max.col(!is.na(df)))]
df$flag <- +(!is.na(df$y))
df
# x y merged flag
#1 1 NA 1 0
#2 NA 2 2 1
#3 NA 3 3 1
#4 4 NA 4 0
#5 5 NA 5 0
#6 NA 6 6 1
Or we can use fcoalesce from data.table which is written in C and is multithreaded for numeric and factor types.
library(data.table)
setDT(df)[, c('merged', 'flag' ) := .(fcoalesce(x, y), +(!is.na(y)))]
df
# x y merged flag
#1: 1 NA 1 0
#2: NA 2 2 1
#3: NA 3 3 1
#4: 4 NA 4 0
#5: 5 NA 5 0
#6: NA 6 6 1
You can do that using dplyr as follows:
library(dplyr)
# Creating dataframe
df <-
data.frame(
x = c(1, NA, NA, 4, 5, NA),
y = c(NA, 2, 3, NA, NA, 6))
df %>%
# If x is NA, take the value from y
mutate(merged = coalesce(x, y),
# Flag is 1 if the value came from y (x is NA), else 0
flag = if_else(is.na(x), 1, 0))
# x y merged flag
# 1 NA 1 0
# NA 2 2 1
# NA 3 3 1
# 4 NA 4 0
# 5 NA 5 0
# NA 6 6 1
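All of the answers above rely on exactly one of x and y being NA in every row. A base R sketch that checks this assumption first and then builds both columns might look like the following; the xor() check is an addition for illustration, not part of the original answers.

```r
df <- data.frame(
  x = c(1, NA, NA, 4, 5, NA),
  y = c(NA, 2, 3, NA, NA, 6))

# Assumption check: in each row exactly one of x and y is missing
stopifnot(xor(is.na(df$x), is.na(df$y)))

# Take y where x is missing, otherwise x
df$merged <- ifelse(is.na(df$x), df$y, df$x)

# 0 if the value came from x, 1 if it came from y
df$flag <- as.integer(is.na(df$x))
df
```

If the check fails, some row has both values or neither, and the flag would no longer say unambiguously where the merged value came from.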

eliminating categories with a certain number of non-NA values in R

I have a data frame df which looks like this
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)
where g is the category (1 to 6) and m are values in that category.
I've managed to find the number of non-NA values per category with:
aggregate(m ~ g, data=df, function(x) {sum(!is.na(x))}, na.action = NULL)
g m
1 1 1
2 2 3
3 3 2
4 4 1
5 5 2
6 6 3
I would now like to eliminate the rows (categories) where the number of non-NA values is 1, and keep only those where the number of non-NA values is 2 or more.
The desired outcome would be
g m
5 2 3
6 2 NA
7 2 2
8 2 1
9 3 3
10 3 NA
11 3 3
12 3 NA
17 5 NA
18 5 2
19 5 1
20 5 NA
21 6 7
22 6 3
23 6 NA
24 6 1
Every row with g=1 or g=4 is eliminated because, as shown, there is only one non-NA value in each of those categories.
Any suggestions?
If you want base R, then I suggest you use your aggregation:
df2 <- aggregate(m ~ g, data=df, function(x) {sum(!is.na(x))}, na.action = NULL)
df[ ! df$g %in% df2$g[df2$m < 2], ]
# g m
# 5 2 3
# 6 2 NA
# 7 2 2
# 8 2 1
# 9 3 3
# 10 3 NA
# 11 3 3
# 12 3 NA
# 17 5 NA
# 18 5 2
# 19 5 1
# 20 5 NA
# 21 6 7
# 22 6 3
# 23 6 NA
# 24 6 1
If you want to use dplyr, perhaps
library(dplyr)
group_by(df, g) %>%
filter(sum(!is.na(m)) > 1) %>%
ungroup()
# # A tibble: 16 × 2
# g m
# <dbl> <dbl>
# 1 2 3
# 2 2 NA
# 3 2 2
# 4 2 1
# 5 3 3
# 6 3 NA
# 7 3 3
# 8 3 NA
# 9 5 NA
# 10 5 2
# 11 5 1
# 12 5 NA
# 13 6 7
# 14 6 3
# 15 6 NA
# 16 6 1
A dplyr-based alternative: grouping by g gives the non-NA count per category. Note that this returns one summary row per retained category rather than the original rows.
library(dplyr)
df %>% group_by(g) %>%
filter(!is.na(m)) %>%
filter(n() >=2) %>%
summarise(count = n())
#Result
# # A tibble: 4 x 2
# g count
# <dbl> <int>
# 1 2.00 3
# 2 3.00 2
# 3 5.00 2
# 4 6.00 3
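To get the original rows back without a separate aggregation step, one base R option is ave(), which returns the per-group count aligned with every row. The names n_non_na and kept below are illustrative.

```r
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)

# Number of non-NA values in each row's category, repeated for every row
n_non_na <- ave(as.integer(!is.na(df$m)), df$g, FUN = sum)

# Keep only rows whose category has at least two non-NA values
kept <- df[n_non_na >= 2, ]
kept
```

The original row names are preserved, so the result lines up with the desired outcome shown in the question.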
