Replacing leading NAs by group with 0s, but Keep other NAs - r

I have a COVID data frame grouped by state with 60 columns. Because COVID started at different times in different states, each state has NAs before its first values. The indicators (columns) also start at different times. Below is a sample df I made for demonstration.
state <- c(rep("A", 6), rep("B", 6))
time <- c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6)
x1 <- c(NA, NA, NA, 4, 5, 6, NA, NA, 3, 4, 5, NA)
x2 <- c(NA, 2, 3, NA, 5, 6, NA, NA, NA, 4, 5, 6)
x3 <- c(NA, NA, 3, 4, 5, NA, NA, 2, NA, 4, 5, 6)
df <- data.frame(state, time, x1, x2, x3)
df
state time x1 x2 x3
1 A 1 NA NA NA
2 A 2 NA 2 NA
3 A 3 NA 3 3
4 A 4 4 NA 4
5 A 5 5 5 5
6 A 6 6 6 NA
7 B 1 NA NA NA
8 B 2 NA NA 2
9 B 3 3 NA NA
10 B 4 4 4 4
11 B 5 5 5 5
12 B 6 NA 6 6
I'm trying to replace all the leading NAs with 0 for each state, but keep other NAs. The results should look like below:
state time x1 x2 x3
1 A 1 0 0 0
2 A 2 0 2 0
3 A 3 0 3 3
4 A 4 4 NA 4
5 A 5 5 5 5
6 A 6 6 6 NA
7 B 1 0 0 0
8 B 2 0 0 2
9 B 3 3 0 NA
10 B 4 4 4 4
11 B 5 5 5 5
12 B 6 NA 6 6
One solution I came up with is to replace NAs by the condition of the cumulative sums, as below:
df1 <- df %>%
group_by(state) %>%
mutate(
check.sum1 = cumsum(replace_na(x1, 0)),
x1 = if_else(check.sum1 != 0, x1, 0),
check.sum2 = cumsum(replace_na(x2, 0)),
x2 = if_else(check.sum2 != 0, x2, 0),
check.sum3 = cumsum(replace_na(x3, 0)),
x3 = if_else(check.sum3 != 0, x3, 0)
)
df1
This method worked fine. But since there are 60 columns, I want to wrap it up with a function and/or use apply(). But it gives out error messages:
df2 <- df %>%
group_by(state) %>%
apply(
df[3:5], MARGIN = 2, FUN = function(x) mutate(
check.sum = cumsum(replace_na(x, 0)),
x = if_else(check.sum != 0, x, 0)
)
)
Error in FUN(newX[, i], ...) : unused argument (df[3:5])
#or
func <- function(x) {
mutate(
check.sum = cumsum(replace_na(x, 0)),
x = if_else(check.sum != 0, x, 0)
)
}
df3 <- df %>%
group_by(state) %>%
apply(
df[3:5], MARGIN = 2, func
)
Error in match.fun(FUN) :
'df[3:5]' is not a function, character or symbol
So there are three specific questions:
How do I create a user-defined function that takes columns as arguments?
How do I use the apply() function here?
Are there other ways to do the job with existing functions, such as na.locf() or na.trim()?
Thank you!

Using by() and, within each group, selecting the positions where a column is NA and the NA does not follow a non-NA value, i.e. where the difference of the is.na() flags is less than or equal to zero.
do.call(rbind, by(df, df$state, \(x) {
x[] <- lapply(x, \(z) {z[is.na(z) & c(0, diff(is.na(z))) <= 0] <- 0; z})
return(x)
}))
# state time x1 x2 x3
# A.1 A 1 0 0 0
# A.2 A 2 0 2 0
# A.3 A 3 0 3 3
# A.4 A 4 4 NA 4
# A.5 A 5 5 5 5
# A.6 A 6 6 6 NA
# B.7 B 1 0 0 0
# B.8 B 2 0 0 2
# B.9 B 3 3 0 NA
# B.10 B 4 4 4 4
# B.11 B 5 5 5 5
# B.12 B 6 NA 6 6
Note: The \(x) function shorthand requires R >= 4.1; on older versions, write function(x) instead.
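One thing to check before using the diff()-based test on the full data: it flags every NA whose predecessor is also NA, so the second NA of an interior run would be zeroed as well. The sample data has no such run. A small sketch on a made-up vector z (not from the question), compared with a cumsum()-based test that only flags leading NAs:

```r
# z is a hypothetical vector with an interior run of two NAs
z <- c(NA, 1, NA, NA, 2)

# Test used in the by() answer: NA, and not directly after a non-NA value
diff_based <- is.na(z) & c(0, diff(is.na(z))) <= 0

# Alternative test: no non-NA value has been seen yet
leading_only <- cumsum(!is.na(z)) == 0

which(diff_based)    # 1 4  (position 4 is an interior NA)
which(leading_only)  # 1
```

If interior runs of NAs can occur in the real data, the cumsum()-based condition is probably the safer of the two.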

Using dplyr, we can do
library(dplyr)
df %>%
group_by(state) %>%
mutate(across(starts_with('x'), ~ replace(., !cumsum(!is.na(.)), 0))) %>%
ungroup
# A tibble: 12 × 5
state time x1 x2 x3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 0
2 A 2 0 2 0
3 A 3 0 3 3
4 A 4 4 NA 4
5 A 5 5 5 5
6 A 6 6 6 NA
7 B 1 0 0 0
8 B 2 0 0 2
9 B 3 3 0 NA
10 B 4 4 4 4
11 B 5 5 5 5
12 B 6 NA 6 6
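On the first question (a user-defined function that takes a column): apply() works on the margins of a matrix and ignores dplyr grouping, so a per-column helper applied within each group is a closer fit. Below is a minimal base R sketch; the helper name zero_leading_na and the value_cols vector are made up for illustration, and ave() is used here in place of group_by().

```r
# Hypothetical helper: set NAs that precede the first non-NA value to 0
zero_leading_na <- function(v) {
  seen_value <- cumsum(!is.na(v)) > 0  # FALSE until the first non-NA appears
  v[!seen_value] <- 0
  v
}

# Sample data from the question
state <- c(rep("A", 6), rep("B", 6))
time <- c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6)
x1 <- c(NA, NA, NA, 4, 5, 6, NA, NA, 3, 4, 5, NA)
x2 <- c(NA, 2, 3, NA, 5, 6, NA, NA, NA, 4, 5, 6)
x3 <- c(NA, NA, 3, 4, 5, NA, NA, 2, NA, 4, 5, 6)
df <- data.frame(state, time, x1, x2, x3)

# Apply the helper to each value column, separately within each state
value_cols <- c("x1", "x2", "x3")
df[value_cols] <- lapply(
  df[value_cols],
  function(col) ave(col, df$state, FUN = zero_leading_na)
)
df
```

With 60 columns, value_cols could be built from names(df) instead of being typed out. Note that a group with no values at all in a column becomes all zeros under this helper.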

Related

Mutate with function over multiple columns

I have questionnaire (EHP30) answers from a list of participants, who rate each item between 0 and 4, or -9 for not relevant. The overall score is the sum of the scores scaled to 100. Any not relevant answers are ignored (unless all of them are not relevant, in which case the output is missing). Any missing item sets the whole output to missing.
I have written a function that calculates the score from an input vector:
ehp30_sexual <- function(scores = c(0, 0, 0, 0, 0)){
if(anyNA(scores)){
return(NA)
} else if(!all(scores %in% c(-9, 0, 1, 2, 3, 4))){
stop("Values not in correct range (-9, 0, 1, 2, 3, 4)")
} else if(length(scores) != 5){
stop("Must be vector length of 5")
} else if(all(scores == -9)){
return(NA)
} else if(any(scores == -9)){
newscores <- scores[which(scores != -9)]
sum(newscores) * 100 / (4 * length(newscores))
} else {
sum(scores) * 100 / (4 * length(scores))
}
}
I wish to apply this function to each row of a dataframe using mutate if possible (or apply if not):
ans <- c(NA, -9, 0, 1, 2, 3, 4)
set.seed(1)
data <- data.frame(id = 1:10,
ePainAfterSex = sample(ans, 10, TRUE),
eWorriedSex = sample(ans, 10, TRUE),
eAvoidSex = sample(ans, 10, TRUE),
eGuiltyNoSex = sample(ans, 10, TRUE),
eFrustratedNoSex = sample(ans, 10, TRUE))
Any ideas? I'm happy to rewrite the function or use a case_when solution if it is any simpler.
Using dplyr::rowwise() and c_across() (inspired by @edvinsyk's answer):
set.seed(1)
library(dplyr)
data %>%
rowwise() %>%
mutate(score = ehp30_sexual(
c_across(ePainAfterSex:eFrustratedNoSex)
)) %>%
ungroup()
# A tibble: 10 × 7
id ePainAfterSex eWorriedSex eAvoidSex eGuiltyNoSex eFrustratedNoSex score
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 0 NA -9 -9 NA
2 2 1 0 4 3 3 55
3 3 4 NA 2 NA 4 NA
4 4 NA 2 2 1 1 NA
5 5 -9 2 NA 4 1 NA
6 6 2 -9 NA NA 1 NA
7 7 4 3 3 1 -9 68.8
8 8 0 3 2 0 1 30
9 9 3 -9 2 3 NA NA
10 10 -9 4 -9 -9 4 100
Is something like this what you're after? Seems easier than the function you supplied.
data = tibble(data)
data |>
mutate(across(where(is.numeric), ~ ifelse(.x == -9, NA, .x))) |>
rowwise() |>
mutate(index = sum(c_across(2:6), na.rm = TRUE)) |>
ungroup() |>
mutate(score = round(scales::rescale(index, to = c(0, 100))))
id ePainAfterSex eWorriedSex eAvoidSex eGuiltyNoSex eFrustratedNoSex index score
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 0 NA NA NA 0 0
2 2 1 0 4 3 3 11 100
3 3 4 NA 2 NA 4 10 91
4 4 NA 2 2 1 1 6 55
5 5 NA 2 NA 4 1 7 64
6 6 2 NA NA NA 1 3 27
7 7 4 3 3 1 NA 11 100
8 8 0 3 2 0 1 6 55
9 9 3 NA 2 3 NA 8 73
10 10 NA 4 NA NA 4 8 73
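For larger data, the scoring rule from the question can also be computed for all rows at once without rowwise(). The sketch below is one possible base R version, run on a small made-up data frame (items and its column names q1 to q5 are hypothetical). It assumes the values are already in the valid range, so it skips the range and length checks that ehp30_sexual() performs.

```r
# Hypothetical answers: one row per participant, five items each
items <- data.frame(
  q1 = c(1, 4, -9, NA),
  q2 = c(0, 3, -9, 2),
  q3 = c(4, 3, -9, 2),
  q4 = c(3, 1, -9, 1),
  q5 = c(3, -9, -9, 1)
)

m <- as.matrix(items)
relevant <- m != -9               # NA where an answer is missing
n_relevant <- rowSums(relevant)   # NA if any item in the row is missing
total <- rowSums(m * relevant)    # -9 entries contribute 0 to the sum

# Missing if nothing was relevant; otherwise scale the sum to 0-100
score <- ifelse(n_relevant == 0, NA, total * 100 / (4 * n_relevant))
score
```

Rows 1 and 2 reproduce the 55 and 68.75 seen for the same answer patterns in the rowwise() output; rows 3 and 4 come out as NA because they are all not relevant and partly missing, respectively.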

Modify variables in longitudinal data sets (keep first appearance of values on person-level)

I have a dataframe:
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
t <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
x <- c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1)
y <- c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
dat <- data.frame(i, t, x, y, j1)
dat
i t x y j1
1 1 1 0 5 NA
2 1 2 0 6 NA
3 1 3 0 7 NA
4 1 4 0 8 NA
5 2 1 0 4 NA
6 2 2 1 5 5
7 2 3 0 6 NA
8 2 4 1 7 7
9 3 1 0 6 NA
10 3 2 0 7 NA
11 3 3 1 8 8
12 3 4 1 8 8
The dataframe refers to 3 persons "i" at 4 points in time "t". "j1" switches to "y" when "x" turns from 0 to 1 for a person "i". While "x" stays on 1 for a person, "j1" does not change within time (see person 3). When "x" is 0, "j1" is always NA.
Now I want to add a new variable "j2" to the dataframe which is a modification of "j1". The difference should be the following: For each person "i", there should be only one value for "j2". Namely, it should be the first value for "j1" for each person (the first change from 0 to 1 in "x").
Accordingly, the result should look like this:
dat
i t x y j1 j2
1 1 1 0 5 NA NA
2 1 2 0 6 NA NA
3 1 3 0 7 NA NA
4 1 4 0 8 NA NA
5 2 1 0 4 NA NA
6 2 2 1 5 5 5
7 2 3 0 6 NA NA
8 2 4 1 7 7 NA
9 3 1 0 6 NA NA
10 3 2 0 7 NA NA
11 3 3 1 8 8 8
12 3 4 1 8 8 NA
I appreciate suggestions on how to address this with dplyr
Somewhat more concise than the others:
library(tidyverse)
dat <- structure(list(i = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), t = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4), x = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1), y = c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8), j1 = c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)), class = "data.frame", row.names = c(NA, -12L))
dat %>%
group_by(i) %>%
mutate(j2 = ifelse(1:n() == which(x == 1)[1], y, NA)) %>%
ungroup()
#> # A tibble: 12 × 6
#> i t x y j1 j2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 5 NA NA
#> 2 1 2 0 6 NA NA
#> 3 1 3 0 7 NA NA
#> 4 1 4 0 8 NA NA
#> 5 2 1 0 4 NA NA
#> 6 2 2 1 5 5 5
#> 7 2 3 0 6 NA NA
#> 8 2 4 1 7 7 NA
#> 9 3 1 0 6 NA NA
#> 10 3 2 0 7 NA NA
#> 11 3 3 1 8 8 8
#> 12 3 4 1 8 8 NA
possible solution
library(tidyverse)
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
t <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
x <- c(0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1)
y <- c(5, 6, 7, 8, 4, 5, 6, 7, 6, 7, 8, 8)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
df <- data.frame(i, t, x, y, j1)
tmp <- df %>%
filter(x == 1) %>%
group_by(i) %>%
slice(1) %>%
ungroup() %>%
rename(j2 = j1)
left_join(df, tmp)
#> Joining, by = c("i", "t", "x", "y")
#> i t x y j1 j2
#> 1 1 1 0 5 NA NA
#> 2 1 2 0 6 NA NA
#> 3 1 3 0 7 NA NA
#> 4 1 4 0 8 NA NA
#> 5 2 1 0 4 NA NA
#> 6 2 2 1 5 5 5
#> 7 2 3 0 6 NA NA
#> 8 2 4 1 7 7 NA
#> 9 3 1 0 6 NA NA
#> 10 3 2 0 7 NA NA
#> 11 3 3 1 8 8 8
#> 12 3 4 1 8 8 NA
Created on 2021-09-08 by the reprex package (v2.0.1)
Function f sets every element after the first non-NA value in vector x to NA. It is applied to j1 within each group defined by i.
f <- function(x){
ind <- which(!is.na(x))[1]
if(is.na(ind) || ind == length(x)) return(x)
x[(which.min(is.na(x))+1):length(x)] <- NA
x
}
dat %>%
group_by(i) %>%
mutate(j2 = f(j1)) %>%
ungroup()
Option1
You can use dplyr with mutate(), starting from j1 and using replace() to set to NA the values for which both the current and the previous (lag()) value are non-NA:
library(dplyr)
dat %>% group_by(i) %>%
mutate(j2=replace(j1, !is.na(j1) & !is.na(lag(j1)), NA))
Option2
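You can use replace() to set to NA every value in j1 except the first non-NA value (which(!is.na(j1))[1]); the negative index selects all other positions. Unlike Options 1 and 3, this also sets the 7 in row 8 to NA, which is what the question's expected result shows.
dat %>% group_by(i) %>%
mutate(j2=replace(j1, -which(!is.na(j1))[1], NA))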
Option3
You can use purrr::accumulate() too. Call accumulate() to compare consecutive (.x, .y) values from the j1 vector. If they are the same, the output will be NA.
library(dplyr)
dat %>% group_by(i) %>%
mutate(j2=purrr::accumulate(j1, ~ifelse(.x %in% .y, NA, .y)))
Output
# A tibble: 12 x 6
# Groups: i [3]
i t x y j1 j2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 5 NA NA
2 1 2 0 6 NA NA
3 1 3 0 7 NA NA
4 1 4 0 8 NA NA
5 2 1 0 4 NA NA
6 2 2 1 5 5 5
7 2 3 0 6 NA NA
8 2 4 1 7 7 7
9 3 1 0 6 NA NA
10 3 2 0 7 NA NA
11 3 3 1 8 8 8
12 3 4 1 8 8 NA
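For comparison, the same "keep only the first non-NA value per person" rule can be sketched in base R with ave(). The helper name keep_first is made up for illustration; only the i and j1 columns from the question are needed.

```r
# Data from the question (only the columns used here)
i <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
j1 <- c(NA, NA, NA, NA, NA, 5, NA, 7, NA, NA, 8, 8)
dat <- data.frame(i, j1)

# Hypothetical helper: keep the first non-NA value, set everything else to NA
keep_first <- function(v) {
  first <- which(!is.na(v))[1]          # NA if the group has no value at all
  out <- rep(NA_real_, length(v))
  if (!is.na(first)) out[first] <- v[first]
  out
}

# Apply the helper separately for each person i
dat$j2 <- ave(dat$j1, dat$i, FUN = keep_first)
dat
```

This gives NA for the 7 in row 8, matching the expected result in the question.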

Merge two columns containing NA values in complementing rows

Suppose I have this dataframe
df <- data.frame(
x=c(1, NA, NA, 4, 5, NA),
y=c(NA, 2, 3, NA, NA, 6))
which looks like this
x y
1 1 NA
2 NA 2
3 NA 3
4 4 NA
5 5 NA
6 NA 6
How can I merge the two columns into one? Basically the NA values are in complementary rows. It would be nice to also obtain (in the process) a flag column containing 0 if the entry comes from x and 1 if the entry comes from y.
We can try using the coalesce function from the dplyr package:
library(dplyr)
df$merged <- coalesce(df$x, df$y)
df$flag <- ifelse(is.na(df$y), 0, 1)
df
x y merged flag
1 1 NA 1 0
2 NA 2 2 1
3 NA 3 3 1
4 4 NA 4 0
5 5 NA 5 0
6 NA 6 6 1
We can also use base R methods with max.col on the logical matrix to get the column index, cbind with row index and extract the values that are not NA
df$merged <- df[cbind(seq_len(nrow(df)), max.col(!is.na(df)))]
df$flag <- +(!is.na(df$y))
df
# x y merged flag
#1 1 NA 1 0
#2 NA 2 2 1
#3 NA 3 3 1
#4 4 NA 4 0
#5 5 NA 5 0
#6 NA 6 6 1
Or we can use fcoalesce from data.table which is written in C and is multithreaded for numeric and factor types.
library(data.table)
setDT(df)[, c('merged', 'flag' ) := .(fcoalesce(x, y), +(!is.na(y)))]
df
# x y merged flag
#1: 1 NA 1 0
#2: NA 2 2 1
#3: NA 3 3 1
#4: 4 NA 4 0
#5: 5 NA 5 0
#6: NA 6 6 1
You can do that using dplyr as follows:
library(dplyr)
# Creating dataframe
df <-
data.frame(
x = c(1, NA, NA, 4, 5, NA),
y = c(NA, 2, 3, NA, NA, 6))
df %>%
# If x is NA, take the value from y
mutate(merged = coalesce(x, y),
# flag is 1 if x is NA (value came from y), otherwise 0
flag = if_else(is.na(x), 1, 0))
# x y merged flag
# 1 NA 1 0
# NA 2 2 1
# NA 3 3 1
# 4 NA 4 0
# 5 NA 5 0
# NA 6 6 1
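All of the answers above rely on the rows being complementary, i.e. exactly one of x and y is non-NA in each row. A package-free sketch that checks this assumption first and then merges with ifelse():

```r
df <- data.frame(
  x = c(1, NA, NA, 4, 5, NA),
  y = c(NA, 2, 3, NA, NA, 6)
)

# Stop early if some row has both values or neither
stopifnot(xor(is.na(df$x), is.na(df$y)))

df$merged <- ifelse(is.na(df$x), df$y, df$x)  # take y where x is missing
df$flag <- as.integer(is.na(df$x))            # 0 = from x, 1 = from y
df
```

If the check fails on real data, the flag would need a third state for rows where both columns are filled or both are missing.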

Editing a data frame with certain conditions

I have a dataframe and would like to remove some specific cases depending on a simple rule: if x equals 2, y should be NA.
Here is an example:
x <- c(1, 2, 1, 2, 1, 2, 1, 2)
y <- c(5, 5, NA, NA, 6, 6, 4, 4)
df <- data.frame(x, y)
df
x y
1 1 5
2 2 5
3 1 NA
4 2 NA
5 1 6
6 2 6
7 1 4
8 2 4
And the output should look like that:
x y
1 1 5
2 2 NA
3 1 NA
4 2 NA
5 1 6
6 2 NA
7 1 4
8 2 NA
Is there a way to solve that with ifelse? I am grateful for any help.
You could do
df$y[df$x == 2] <- NA
df
# x y
#1 1 5
#2 2 NA
#3 1 NA
#4 2 NA
#5 1 6
#6 2 NA
#7 1 4
#8 2 NA
Or with replace
df$y <- replace(df$y, df$x == 2, NA)
Using same logic in dplyr mutate
library(dplyr)
df %>%
mutate(y = replace(y, x==2, NA))
Or the ifelse version
df$y <- ifelse(df$x == 2, NA, df$y)
df %>%
mutate(y = ifelse(x == 2, NA, y))
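Base R also has an assignment form of is.na() that expresses "set these positions to NA" directly. A short sketch on the data from the question:

```r
x <- c(1, 2, 1, 2, 1, 2, 1, 2)
y <- c(5, 5, NA, NA, 6, 6, 4, 4)
df <- data.frame(x, y)

# Mark y as missing wherever x equals 2
is.na(df$y) <- df$x == 2
df
```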

eliminating categories with a certain number of non-NA values in R

I have a data frame df which looks like this
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)
where g is the category (1 to 6) and m are values in that category.
I've managed to find the number of non-NA values per category with:
aggregate(m ~ g, data=df, function(x) {sum(!is.na(x))}, na.action = NULL)
g m
1 1 1
2 2 3
3 3 2
4 4 1
5 5 2
6 6 3
and would now like to eliminate the rows (categories) where the number of non-NA values is 1 and keep only those where the number of non-NA values is 2 or more.
the desired outcome would be
g m
5 2 3
6 2 NA
7 2 2
8 2 1
9 3 3
10 3 NA
11 3 3
12 3 NA
17 5 NA
18 5 2
19 5 1
20 5 NA
21 6 7
22 6 3
23 6 NA
24 6 1
Every row with g=1 or g=4 is eliminated because, as shown, there is only 1 non-NA value in each of those categories.
any suggestions :)?
If you want base R, then I suggest you use your aggregation:
df2 <- aggregate(m ~ g, data=df, function(x) {sum(!is.na(x))}, na.action = NULL)
df[ ! df$g %in% df2$g[df2$m < 2], ]
# g m
# 5 2 3
# 6 2 NA
# 7 2 2
# 8 2 1
# 9 3 3
# 10 3 NA
# 11 3 3
# 12 3 NA
# 17 5 NA
# 18 5 2
# 19 5 1
# 20 5 NA
# 21 6 7
# 22 6 3
# 23 6 NA
# 24 6 1
If you want to use dplyr, perhaps
library(dplyr)
group_by(df, g) %>%
filter(sum(!is.na(m)) > 1) %>%
ungroup()
# # A tibble: 16 × 2
# g m
# <dbl> <dbl>
# 1 2 3
# 2 2 NA
# 3 2 2
# 4 2 1
# 5 3 3
# 6 3 NA
# 7 3 3
# 8 3 NA
# 9 5 NA
# 10 5 2
# 11 5 1
# 12 5 NA
# 13 6 7
# 14 6 3
# 15 6 NA
# 16 6 1
One can try a dplyr based solution. group_by on g will help to get the desired count.
library(dplyr)
df %>% group_by(g) %>%
filter(!is.na(m)) %>%
filter(n() >=2) %>%
summarise(count = n())
#Result
# # A tibble: 4 x 2
# g count
# <dbl> <int>
# 1 2.00 3
# 2 3.00 2
# 3 5.00 2
# 4 6.00 3
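The summarise() version above returns counts rather than the original rows. If the rows themselves are wanted without dplyr, one option is to attach the per-category count to every row with ave() and filter on it. A minimal sketch, where non_na_count and kept are illustrative names:

```r
g <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
m <- c(1, NA, NA, NA, 3, NA, 2, 1, 3, NA, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 7, 3, NA, 1)
df <- data.frame(g, m)

# For every row, the number of non-NA m values in that row's category
non_na_count <- ave(as.numeric(!is.na(df$m)), df$g, FUN = sum)

# Keep categories with at least two non-NA values
kept <- df[non_na_count >= 2, ]
kept
```

The original row names (5 to 12 and 17 to 24) are preserved, as in the desired outcome from the question.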
