I want to replace value(s) with NA by group.
have <- data.frame(id = c(1,1,1,1,2,2,2),
value = c(1,2,3,4,5,6,7))
want1 <- data.frame(id = c(1,1,1,1,2,2,2),
value = c(1,2,3,NA,5,6,NA))
want2 <- data.frame(id = c(1,1,1,1,2,2,2),
value = c(1,2,NA,NA,5,NA,NA))
want1 corresponds to replacing the last observation of value in each group with NA, and want2 corresponds to replacing the last two observations of value in each group with NA. I'm currently trying to do this with the dplyr package but can't seem to get any traction. Any help would be much appreciated. Thanks!
We can use row_number() to test the current row against n(), the total number of rows in the group.
library(dplyr)
have |>
group_by(id) |>
mutate(
last1 = ifelse(row_number() == n(), NA, value),
last2 = ifelse(row_number() >= n() - 1, NA, value)
)
# # A tibble: 7 × 4
# # Groups: id [2]
# id value last1 last2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 1 2 2 2
# 3 1 3 3 NA
# 4 1 4 NA NA
# 5 2 5 5 5
# 6 2 6 6 NA
# 7 2 7 NA NA
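To see why the comparison works, it can help to materialise the two quantities as columns. This is only an illustrative sketch: the column names pos, size and is_last are made up here and are not part of the answer above.

```r
library(dplyr)

have <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2),
                   value = c(1, 2, 3, 4, 5, 6, 7))

# pos, size and is_last are hypothetical names used only for illustration
check <- have |>
  group_by(id) |>
  mutate(
    pos     = row_number(),  # position of the row within its group
    size    = n(),           # number of rows in the group
    is_last = pos == size    # TRUE only for the final row of each group
  ) |>
  ungroup()

check
```

Replacing `pos == size` with `pos > size - k` would flag the last k rows of each group instead.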
And a general way to provide variants as different data frames.
lapply(
1:2,
function(k) {
have %>%
group_by(id) %>%
mutate(value=ifelse(row_number() <= (n() - k), value, NA))
}
)
[[1]]
# A tibble: 7 × 2
# Groups: id [2]
id value
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 1 NA
5 2 5
6 2 6
7 2 NA
[[2]]
# A tibble: 7 × 2
# Groups: id [2]
id value
<dbl> <dbl>
1 1 1
2 1 2
3 1 NA
4 1 NA
5 2 5
6 2 NA
7 2 NA
Here is a base R way.
have <- data.frame(id = c(1,1,1,1,2,2,2),
value = c(1,2,3,4,5,6,7))
want1 <- data.frame(id = c(1,1,1,1,2,2,2),
value = c(1,2,3,NA,5,6,NA))
want2 <- data.frame(id = c(1,1,1,1,2,2,2),
value = c(1,2,NA,NA,5,NA,NA))
with(have, ave(value, id, FUN = \(x){
x[length(x)] <- NA
x
}))
#> [1] 1 2 3 NA 5 6 NA
with(have, ave(value, id, FUN = \(x){
x[length(x)] <- NA
if(length(x) > 1)
x[length(x) - 1L] <- NA
x
}))
#> [1] 1 2 NA NA 5 NA NA
Created on 2022-06-09 by the reprex package (v2.0.1)
Then reassign these results to column value.
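A minimal sketch of that reassignment, generalised to the last k observations per group. The helper name na_last and its k argument are assumptions introduced here, not part of the answer above.

```r
have <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2),
                   value = c(1, 2, 3, 4, 5, 6, 7))

# Hypothetical helper: set the last k elements of a vector to NA.
# If the vector has fewer than k elements, all of them become NA.
na_last <- function(x, k = 1) {
  n <- length(x)
  if (k > 0 && n > 0) {
    x[seq(max(1, n - k + 1), n)] <- NA
  }
  x
}

# ave() applies the helper within each id and returns a vector in the
# original row order, so it can be assigned straight back to the column.
want1 <- transform(have, value = ave(value, id, FUN = function(x) na_last(x, 1)))
want2 <- transform(have, value = ave(value, id, FUN = function(x) na_last(x, 2)))
```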
Related
I have made a very complex solution to something I feel should have a much simpler solution.
In short what I want:
I want to compute a new column containing the minimum value across 3 columns
I want to ignore zeros and NAs
If I only have zeros and NAs I want a zero
If I have only NAs I want a NA
Here is my solution, it works, but it is very complex and produces a warning.
> library(dplyr)
> df <- data.frame(
+ id = c(1, 2, 3, 4),
+ test1 = c( NA, NA, 2 , 3),
+ test2 = c( NA, 0, 1 , 1),
+ test3 = c(NA, NA, 0 , 2)
+ )
> df2 <- df %>%
+ mutate(nieuw = apply(across(test1:test3), 1, function(x) min(x[x>0]))) %>%
+ rowwise() %>%
+ mutate(nieuw = if_else(is.na(nieuw), max(across(test1:test3), na.rm = TRUE), nieuw)) %>%
+ mutate(nieuw = ifelse(is.infinite(nieuw), NA, nieuw))
> df
id test1 test2 test3
1 1 NA NA NA
2 2 NA 0 NA
3 3 2 1 0
4 4 3 1 2
> df2
# A tibble: 4 x 5
# Rowwise:
id test1 test2 test3 nieuw
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA NA
2 2 NA 0 NA 0
3 3 2 1 0 1
4 4 3 1 2 1
Warning message:
Problem while computing `nieuw = if_else(...)`.
i no non-missing arguments to max; returning -Inf
i The warning occurred in row 1.
You can create a helper function and then apply it rowwise:
library(dplyr)
safe <- function(x, f, ...) ifelse(all(is.na(x)), NA,
ifelse(all(is.na(x) | x == 0),
0, f(x[x > 0], na.rm = TRUE, ...)))
df %>%
rowwise() %>%
mutate(a = safe(c_across(test1:test3), min))
# A tibble: 4 × 5
# Rowwise:
id test1 test2 test3 a
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA NA
2 2 NA 0 NA 0
3 3 2 1 0 1
4 4 3 1 2 1
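The three cases are easier to check on plain vectors. The sketch below is a variant written with if/else instead of nested ifelse; the name safe_min is made up for illustration, and it assumes the values are non-negative, as in the question.

```r
# Hypothetical variant of the helper, specialised to min()
safe_min <- function(x) {
  if (all(is.na(x))) return(NA_real_)        # only NAs: result is NA
  if (all(is.na(x) | x == 0)) return(0)      # only zeros and NAs: result is 0
  min(x[x > 0], na.rm = TRUE)                # otherwise: smallest positive value
}

# The four rows of df from the question, one vector per row
r1 <- safe_min(c(NA, NA, NA))
r2 <- safe_min(c(NA, 0, NA))
r3 <- safe_min(c(2, 1, 0))
r4 <- safe_min(c(3, 1, 2))
```

Because every branch returns before min() can be called on an empty vector, this version should not trigger the "no non-missing arguments" warning.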
Here is another option. It replaces zeros and NAs with very large sentinel values, takes the row-wise minimum, and then recodes the sentinels at the end:
library(tidyverse)
get_min <- function(data, cols){
data[is.na(data)] <- 1e6
data[data == 0] <- 1e5
nums <- do.call(pmin, select(data, all_of(cols)))
recode(nums, `1e+06` = NA_real_, `1e+05` = 0.)
}
df %>%
mutate(nieuw = get_min(., c("test1", "test2", "test3")))
#> id test1 test2 test3 nieuw
#> 1 1 NA NA NA NA
#> 2 2 NA 0 NA 0
#> 3 3 2 1 0 1
#> 4 4 3 1 2 1
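A possible base R variation on the same idea uses Inf as the sentinel, so that no real data value can collide with it. This is a sketch under the assumption that all values are non-negative; the intermediate names m, pos and row_min are illustrative.

```r
df <- data.frame(
  id = c(1, 2, 3, 4),
  test1 = c(NA, NA, 2, 3),
  test2 = c(NA, 0, 1, 1),
  test3 = c(NA, NA, 0, 2)
)

m <- as.matrix(df[c("test1", "test2", "test3")])

# Mask zeros and NAs with Inf so they can never win the minimum
pos <- m
pos[is.na(pos) | pos == 0] <- Inf
row_min <- apply(pos, 1, min)

# Rows without any positive value fall back to 0 if at least one
# non-NA value (necessarily a zero) exists, and to NA otherwise
nieuw <- ifelse(is.finite(row_min),
                row_min,
                ifelse(rowSums(!is.na(m)) > 0, 0, NA))

df$nieuw <- nieuw
```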
I am trying to select a value by group from one column and pass it as the value in another column, extended across the whole group. This is similar to the question asked here. But some groups do not have this value; in that case, I need to fill the column with NAs. How can I do this?
Dummy example:
dd1 <- data.frame(type = c(1,1,1),
grp = c('a', 'b', 'd'),
val = c(1,2,3))
dd2 <- data.frame(type = c(2,2),
grp = c('a', 'b'),
val = c(8,2))
dd3 <- data.frame(type = c(3,3),
grp = c('b', 'd'),
val = c(7,4))
dd <- rbind(dd1, dd2, dd3)
Create new column:
dd %>%
group_by(type) %>%
mutate(#val_a = ifelse(grp == 'a', val , NA),
val_a2 = val[grp == 'a'])
Expected outcome:
type grp val val_a # pass in `val_a` value of the group 'a'
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA # value for 'a' is missing from group 3
You were close with your first approach; use any to apply the condition to all observations in the group:
dd %>%
group_by(type) %>%
mutate(val_a = ifelse(any(grp == "a"), val[grp == "a"] , NA))
type grp val val_a
<dbl> <chr> <dbl> <dbl>
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA
Try this:
dd %>%
group_by(type) %>%
mutate(val_a2 = val[which(c(grp == 'a'))[1]])
# # A tibble: 7 x 4
# # Groups: type [3]
# type grp val val_a2
# <dbl> <chr> <dbl> <dbl>
# 1 1 a 1 1
# 2 1 b 2 1
# 3 1 d 3 1
# 4 2 a 8 8
# 5 2 b 2 8
# 6 3 b 7 NA
# 7 3 d 4 NA
This also controls against the possibility that there could be more than one match, which may cause bad results (with or without a warning).
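The behaviour relies on how base R indexing treats an empty match. A small sketch with plain vectors (the vector names are illustrative) shows the difference between logical indexing and which(...)[1]:

```r
# A group without an 'a' row, like type 3 in the question
val <- c(7, 4)
grp <- c("b", "d")

no_match    <- val[grp == "a"]            # zero-length vector: mutate() cannot recycle it
first_match <- val[which(grp == "a")[1]]  # which() is empty, [1] gives NA, so the result is NA

# A group with two 'a' rows (hypothetical data)
val2 <- c(8, 2, 5)
grp2 <- c("a", "b", "a")

all_matches <- val2[grp2 == "a"]            # two values: wrong length for the group
one_match   <- val2[which(grp2 == "a")[1]]  # always exactly one value
```

Taking the first match is a design choice; if duplicates should be treated as an error instead, an explicit length check would be needed.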
I have questionnaire data (rows = individuals, cols = scores on questions) and would like to compute a sum score for individuals if they answered a given number of questions; otherwise the sum score variable should be NA. The code below computes row sums, counts the number of NAs, assigns an otherwise non-occurring value to the row sum variable when the number of NAs is large, and then replaces that value with NA. The code works, but I bet there is a more elegant way. Suggestions much appreciated.
dum<-tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum<-dum %>%
mutate(sumsum = rowSums(select(., x:z), na.rm = TRUE))
dum<-dum %>%
mutate(countna=rowSums(is.na(select(.,x:z))))
dum<-dum %>%
mutate(sumsum=case_when(countna>=2 ~ 100,TRUE~sumsum))
dum<-dum %>%
mutate(sumsum = na_if(sumsum, 100))
You can combine your code into one statement:
library(dplyr)
dum <- tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum <- dum %>%
mutate(sumsum = replace(rowSums(select(., x:z), na.rm = TRUE),
rowSums(is.na(select(., x:z))) >= 2, NA))
dum
# A tibble: 5 × 4
# x y z sumsum
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 3
#2 NA 2 NA NA
#3 2 3 2 7
#4 3 NA 3 6
#5 4 5 4 13
You can also try this:
dum<-tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum2 <- dum %>% mutate(sumsum = ifelse(rowSums(is.na(select(.,x:z)))>=2, NA,rowSums(select(., x:z), na.rm = TRUE)))
dum2
# A tibble: 5 × 4
x y z sumsum
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 3
2 NA 2 NA NA
3 2 3 2 7
4 3 NA 3 6
5 4 5 4 13
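If the threshold needs to vary, the same logic can be wrapped in a small function. This is a base R sketch; the function name sum_if_enough and the max_na parameter are assumptions made for illustration.

```r
# Hypothetical helper: row sums that become NA when a row has
# more than max_na missing answers
sum_if_enough <- function(data, max_na = 1) {
  m <- as.matrix(data)
  s <- rowSums(m, na.rm = TRUE)         # sum of the answered questions
  s[rowSums(is.na(m)) > max_na] <- NA   # too many missing answers
  s
}

dum <- data.frame(x = c(1, NA, 2, 3, 4),
                  y = c(1, 2, 3, NA, 5),
                  z = c(1, NA, 2, 3, 4))

dum$sumsum <- sum_if_enough(dum[c("x", "y", "z")], max_na = 1)
```

Here max_na = 1 matches the `>= 2` condition used in the answers above.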
I have the following data
df <- tibble(Type=c(1,2,2,1,1,2),ID=c(6,4,3,2,1,5))
Type ID
1 6
2 4
2 3
1 2
1 1
2 5
For each of the type 2 rows, I want to find the IDs of the type 1 rows just below and above them. For the above dataset, the output will be:
Type ID IDabove IDbelow
1 6 NA NA
2 4 6 2
2 3 6 2
1 2 NA NA
1 1 NA NA
2 5 1 NA
Naively, I can write a for loop to achieve this, but that would be too time consuming for the dataset I am dealing with.
One approach uses dplyr's lead and lag to get the next and previous values, respectively, and data.table's rleid to create groups of consecutive Type values.
library(dplyr)
library(data.table)
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = rleid(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
# Type ID IDabove IDbelow
# <dbl> <dbl> <dbl> <dbl>
#1 1 6 NA NA
#2 2 4 6 2
#3 2 3 6 2
#4 1 2 NA NA
#5 1 1 NA NA
#6 2 5 1 NA
A dplyr only solution:
You could create your own rleid function and then apply the logic provided by Ronak (many thanks, upvoted).
library(dplyr)
my_func <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
# this part is the same as provided by Ronak.
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = my_func(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
Output:
Type ID IDabove IDbelow
<dbl> <dbl> <dbl> <dbl>
1 1 6 NA NA
2 2 4 6 2
3 2 3 6 2
4 1 2 NA NA
5 1 1 NA NA
6 2 5 1 NA
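To make the grouping step concrete, here is what a run-id function returns for the Type column. The sketch compares an rle-based version with a cumsum-based alternative; both function names are made up here, and the cumsum version assumes a non-empty vector without NAs.

```r
# Run ids via rle(): one id per run, repeated for the length of the run
run_id_rle <- function(x) {
  lens <- rle(x)$lengths
  rep(seq_along(lens), times = lens)
}

# Alternative sketch: start a new id whenever the value differs from
# the previous one (assumes length(x) >= 1 and no NAs)
run_id_cumsum <- function(x) {
  cumsum(c(TRUE, x[-1] != x[-length(x)]))
}

Type <- c(1, 2, 2, 1, 1, 2)
ids_rle    <- run_id_rle(Type)
ids_cumsum <- run_id_cumsum(Type)
```

Rows 2 and 3 share one id, which is what lets first() and last() spread the neighbouring Type 1 IDs across the whole run of Type 2 rows.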
How do I get from
# A tibble: 6 x 2
group_var psbl_NAs
<chr> <dbl>
1 a 1
2 a NA
3 a NA
4 b 1
5 b 1
6 b NA
to
# A tibble: 6 x 2
group_var psbl_NAs
<chr> <dbl>
1 b 1
2 b 1
3 b NA
using the fact that psbl_NAs in group "a" are present in more than 50% of the data?
tibble(
group_var = c(rep("a",3), rep("b",3)),
psbl_NAs = c(1, NA, NA, 1, 1, NA)
) %>%
group_by(group_var) %>%
??????
Assuming the tibble from the question is stored as d, we can group_by, mutate, and then filter:
d %>%
group_by(group_var) %>%
# calculate % of NA values by group
mutate(pct_na = mean(is.na(psbl_NAs))) %>%
# only keep where % of NA values < 0.5
filter(pct_na < 0.5) %>%
select(-pct_na) # remove % NA column
# group_var psbl_NAs
# <chr> <dbl>
# 1 b 1
# 2 b 1
# 3 b NA
It might be instructive to see the result of our calculation of % NAs:
d %>%
group_by(group_var) %>%
# calculate % of NA values by group
mutate(pct_na = mean(is.na(psbl_NAs)))
# group_var psbl_NAs pct_na
# <chr> <dbl> <dbl>
# 1 a 1 0.667
# 2 a NA 0.667
# 3 a NA 0.667
# 4 b 1 0.333
# 5 b 1 0.333
# 6 b NA 0.333
Here's a one-line base R solution using ave (again assuming the data is stored as d):
d[with(d, ave(psbl_NAs, group_var, FUN = function(x) mean(is.na(x)))) < 0.5,]
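For readers who prefer to see the per-group proportions first, the same filter can be split into steps with tapply. This is a sketch that assumes the question's data is stored as d; the names pct_na, keep and res are illustrative.

```r
d <- data.frame(
  group_var = c(rep("a", 3), rep("b", 3)),
  psbl_NAs = c(1, NA, NA, 1, 1, NA)
)

# mean() of a logical vector is the share of TRUE values,
# so this gives the proportion of NAs in each group
pct_na <- tapply(is.na(d$psbl_NAs), d$group_var, mean)

# Groups where fewer than half of the values are NA
keep <- names(pct_na)[pct_na < 0.5]

res <- d[d$group_var %in% keep, ]
```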