I would like to replace missing values in one column with values from another column, but only if all values within a group are missing. Here is an example and something I thought would work. There can be any number of groups.
library(tidyverse)
df <- tibble(ID = c("A", "A", "A", "B", "B", "B"),
val1 = c(1,2,3,4,5,6),
val2 = c(NA, NA, NA, NA, 2, 3))
df %>%
group_by(ID) %>%
mutate(val2 = ifelse(all(is.na(val2)), val1, val2))
# Groups: ID [2]
ID val1 val2
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 1
4 B 4 NA
5 B 5 NA
6 B 6 NA
What I would like is for val2 to take its values from val1 when all val2 values are missing within a group. Right now it seems to give me only the first value of val1 for every row in the group. Nothing should change in groups where not all values are missing.
Desired result:
# A tibble: 6 x 3
ID val1 val2
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 2
3 A 3 3
4 B 4 NA
5 B 5 2
6 B 6 3
Does this work:
library(dplyr)
df %>% group_by(ID) %>% mutate(val2 = case_when(all(is.na(val2)) ~ val1, TRUE ~ val2))
# A tibble: 6 x 3
# Groups: ID [2]
ID val1 val2
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 2
3 A 3 3
4 B 4 NA
5 B 5 2
6 B 6 3
You almost had it. I create a group-level indicator, then ungroup and use it to replace the values row by row:
df %>%
group_by(ID) %>%
mutate(val3 = ifelse(all(is.na(val2)),1,0)) %>%
ungroup() %>%
mutate(val2 = ifelse(val3 == 1, val1, val2)) %>%
select(-val3)
Output:
# A tibble: 6 x 3
ID val1 val2
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 2
3 A 3 3
4 B 4 NA
5 B 5 2
6 B 6 3
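The original attempt returned only the first value because ifelse() returns a result shaped like its test, and all(is.na(val2)) has length one per group. As a minimal sketch of one alternative (assuming the df from the question), a plain if/else inside mutate() is evaluated once per group and returns the whole vector:

```r
library(dplyr)

df <- tibble(ID = c("A", "A", "A", "B", "B", "B"),
             val1 = c(1, 2, 3, 4, 5, 6),
             val2 = c(NA, NA, NA, NA, 2, 3))

# ifelse() shapes its result like the test, so a length-one test
# yields a length-one result (only the first element of `yes`)
single <- ifelse(TRUE, c(1, 2, 3), NA)

# if/else is evaluated once per group and returns a full-length vector
res <- df %>%
  group_by(ID) %>%
  mutate(val2 = if (all(is.na(val2))) val1 else val2) %>%
  ungroup()
```

The length-one result of ifelse() is then recycled across the group by mutate(), which is why every row in group A received the same value.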
Related
I am trying to select a value by group from one column and pass it as the value of another column, extending it to the whole group. This is similar to a question asked here. But some groups do not have this value: in that case, I need to fill the column with NAs. How can I do this?
Dummy example:
dd1 <- data.frame(type = c(1,1,1),
grp = c('a', 'b', 'd'),
val = c(1,2,3))
dd2 <- data.frame(type = c(2,2),
grp = c('a', 'b'),
val = c(8,2))
dd3 <- data.frame(type = c(3,3),
grp = c('b', 'd'),
val = c(7,4))
dd <- rbind(dd1, dd2, dd3)
Create new column:
dd %>%
group_by(type) %>%
mutate(#val_a = ifelse(grp == 'a', val , NA),
val_a2 = val[grp == 'a'])
Expected outcome:
type grp val val_a # `val_a` holds the `val` of grp 'a' within each type
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA # value for 'a' is missing from group 3
You were close with your first approach; use any to apply the condition to all observations in the group:
dd %>%
group_by(type) %>%
mutate(val_a = ifelse(any(grp == "a"), val[grp == "a"] , NA))
type grp val val_a
<dbl> <chr> <dbl> <dbl>
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA
Try this:
dd %>%
group_by(type) %>%
mutate(val_a2 = val[which(grp == 'a')[1]])
# # A tibble: 7 x 4
# # Groups: type [3]
# type grp val val_a2
# <dbl> <chr> <dbl> <dbl>
# 1 1 a 1 1
# 2 1 b 2 1
# 3 1 d 3 1
# 4 2 a 8 8
# 5 2 b 2 8
# 6 3 b 7 NA
# 7 3 d 4 NA
Indexing with which(...)[1] returns NA when there is no match, and it also guards against the possibility of more than one match, which may otherwise cause bad results (with or without a warning).
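To illustrate the multiple-match case, here is a small sketch. The data frame dd_dup is hypothetical (not from the question): it has two 'a' rows in type 1 and none in type 2.

```r
library(dplyr)

# Hypothetical data: type 1 has two 'a' rows, type 2 has none
dd_dup <- data.frame(type = c(1, 1, 1, 2, 2),
                     grp  = c("a", "a", "b", "b", "d"),
                     val  = c(1, 9, 2, 7, 4))

res <- dd_dup %>%
  group_by(type) %>%
  # which(...)[1] is the first matching position, or NA if none match
  mutate(val_a = val[which(grp == "a")[1]]) %>%
  ungroup()
```

With this indexing, type 1 takes the value of its first 'a' row and type 2 is filled with NA.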
I'm working on creating some error reports, and one of the items I'm trying to address is potential errors within the ID column id_1. I've made an alternative ID column, id_2, from various identifying features within the rows. To help, I've also created a date_lag column on date to catch items that were entered within a specific period after the initial entry. The main problem is returning the entire group that meets the criteria, including the first entry, which has an NA in date_lag. If I allow the NA values through, I get more than just the items I'm looking for (id_1 1 and 2 below).
Example:
#id_1 where potential errors lie
#id_2 alternative id col I'm using to test
df <- data.table(id_1 = c(1:4, 1:4),
id_2 = c(rep(c("b", "a"), c(2, 2))),
date = c(rep(1,4),rep(20,2), rep(10,2)))
df %>%
group_by(id_2) %>%
mutate(date_lag = date - lag(date)) %>%
filter(between(date_lag, 0, 10) | is.na(date_lag))
# A tibble: 7 x 4
# Groups: id_2 [2]
id_1 id_2 date date_lag
<int> <chr> <dbl> <dbl>
1 b 1 NA
2 b 1 0
3 a 1 NA
4 a 1 0
2 b 20 0
3 a 10 9
4 a 10 0
Expected:
# A tibble: 4 x 4
# Groups: id_2 [1]
id_1 id_2 date date_lag
<int> <chr> <dbl> <dbl>
3 a 1 NA
4 a 1 NA
3 a 10 9
4 a 10 9
Perhaps we can use diff, grouping by id_1:
library(dplyr)
df %>%
group_by(id_1) %>%
filter(between(diff(date), 0, 10))
Output:
# A tibble: 4 x 3
# Groups: id_1 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 3 a 1
#2 4 a 1
#3 3 a 10
#4 4 a 10
When grouping by id_2 instead, concatenate with NA, because diff returns a vector one element shorter than the original data:
df %>%
group_by(id_2) %>%
filter(between(c(NA, diff(date)), 0, 10))
# A tibble: 5 x 3
# Groups: id_2 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 2 b 1
#2 4 a 1
#3 2 b 20
#4 3 a 10
#5 4 a 10
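As a rough sketch of why the NA padding is needed, the following uses the dates of group "a" from the example as a plain vector. It shows that c(NA, diff(x)) matches x - lag(x), and that between() returns NA for the padded element, which filter() then drops.

```r
library(dplyr)

date <- c(1, 1, 10, 10)    # dates of group "a" in the example

d      <- diff(date)        # one element shorter than `date`
padded <- c(NA, d)          # same length as `date` again
lagged <- date - lag(date)  # the date_lag column from the question

# between() propagates NA; filter() keeps only rows where this is TRUE
keep <- between(padded, 0, 10)
```

Because the first row of each group has an NA difference, it is never kept by this filter unless it is allowed through explicitly, which is the behaviour the question describes.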
I have a data frame:
df = tibble(a=c(7,6,10,12,12), b=c(3,5,8,8,7), c=c(4,4,12,15,20), week=c(1,2,3,4,5))
# A tibble: 5 x 4
a b c week
<dbl> <dbl> <dbl> <dbl>
1 7 3 4 1
2 6 5 4 2
3 10 8 12 3
4 12 8 15 4
5 12 7 20 5
and I want, for each of the columns a, b and c, the first week in which the observation equals or exceeds 10.
I.e. for column a it would be week 3, for column b it would be NA, and for column c it would be week 3 as well.
A desired outcome could look like this:
tibble(abc=c("a", "b", "c"), value=c(10, NA, 12), week=c(3, NA, 3))
# A tibble: 3 x 3
abc value week
<chr> <dbl> <dbl>
1 a 10 3
2 b NA NA
3 c 12 3
One way would be to get the data in long format and, for each column name, select the first value that is greater than or equal to 10. Columns with no such value drop out, so we fill the missing combinations with complete.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -week, names_to = 'abc') %>%
group_by(abc) %>%
slice(which(value >= 10)[1]) %>%
ungroup() %>%
complete(abc = names(df)[-4])
# A tibble: 3 x 3
# abc week value
# <chr> <dbl> <dbl>
#1 a 3 10
#2 b NA NA
#3 c 3 12
Another way is to first calculate what we want and then transform the dataset into long format.
df %>%
summarise(across(a:c, list(week = ~week[which(. >= 10)[1]],
value = ~.[. >= 10][1]))) %>%
pivot_longer(cols = everything(),
names_to = c('abc', '.value'),
names_sep = "_")
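Both approaches rely on the same idiom: which(cond)[1] gives the position of the first match, or NA when nothing matches, and indexing with NA returns NA. A minimal sketch on plain vectors taken from the question's data:

```r
week <- c(1, 2, 3, 4, 5)
a    <- c(7, 6, 10, 12, 12)
b    <- c(3, 5, 8, 8, 7)

first_a <- which(a >= 10)[1]  # position of the first value >= 10
first_b <- which(b >= 10)[1]  # no match, so this is NA

week_a <- week[first_a]       # week of the first match in a
week_b <- week[first_b]       # indexing with NA gives NA
```

This is why column b ends up with NA for both week and value rather than causing an error.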
I have a dataframe...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e")
)
Families will only contain 2 members at most (so they're either individuals or pairs).
I need a new column 'random' that assigns the number 1 to families where there is only one member (e.g. c, d and e) and randomly assigns 0 or 1 to families containing 2 members (a and b in the example).
By the end the data should look like the following (depending on the random assignment of 0/1)...
df <- tibble(
id = 1:7,
family = c("a","a","b","b","c", "d", "e"),
random = c(1, 0, 0, 1, 1, 1, 1)
)
I would like to be able to do this with a combination of group_by and mutate since I am mostly using Tidyverse.
I tried the following (but this didn't randomly assign 0/1 within families)...
df %>%
group_by(family) %>%
mutate(
random = if_else(
condition = n() == 1,
true = 1,
false = as.double(sample(0:1, 1, replace = TRUE))
)
)
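You could sample along the sequence length of the family group and take the result modulo 2. A single-member family always gets 1, while a pair gets a random permutation of 1 and 2, i.e. one 1 and one 0: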
df %>%
group_by(family) %>%
mutate(random = sample(seq(n())) %% 2)
#> # A tibble: 7 x 3
#> # Groups: family [5]
#> id family random
#> <int> <chr> <dbl>
#> 1 1 a 0
#> 2 2 a 1
#> 3 3 b 0
#> 4 4 b 1
#> 5 5 c 1
#> 6 6 d 1
#> 7 7 e 1
We can use if/else
library(dplyr)
df %>%
group_by(family) %>%
mutate(random = if(n() == 1) 1 else sample(rep(0:1, length.out = n())))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
#1 1 a 0
#2 2 a 1
#3 3 b 1
#4 4 b 0
#5 5 c 1
#6 6 d 1
#7 7 e 1
Another option
df %>%
group_by(family) %>%
mutate(random = 2 - sample(1:n()))
# A tibble: 7 x 3
# Groups: family [5]
# id family random
# <int> <chr> <dbl>
# 1 1 a 1
# 2 2 a 0
# 3 3 b 1
# 4 4 b 0
# 5 5 c 1
# 6 6 d 1
# 7 7 e 1
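The attempt in the question fails because sample(0:1, 1) returns a single value per group, which is recycled so both members of a pair get the same number. A quick sketch to check the property the answers aim for: singles get 1 and each pair gets one 0 and one 1. The set.seed() call and the check table are assumptions added here for illustration only.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed only so the sketch is repeatable

df <- tibble(id = 1:7,
             family = c("a", "a", "b", "b", "c", "d", "e"))

res <- df %>%
  group_by(family) %>%
  mutate(random = sample(seq(n())) %% 2) %>%
  ungroup()

# Per family: a single has sum 1, and a pair has 0 + 1 = 1
check <- res %>%
  group_by(family) %>%
  summarise(members = n(), total = sum(random))
```

Whatever the seed, total should be 1 for every family. Only the order of 0 and 1 within a pair varies.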
For the dataframe below, I want to keep the original values of VAR_X after a group_by on ID and event and a max() on quest, but I cannot get my code right. Any suggestions? By the way, in my original dataframe more than one column needs to be added.
df <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,3,3,3),
quest = c(1,1,2,2,3,3,1,2,3,1,2,3),
event = c("A","B","A","B","A",NA,"C","D","C","D","D",NA),
VAR_X = c(2,4,3,6,3,NA,6,4,5,7,5,NA))
Code:
df %>%
group_by(ID,event) %>%
summarise(quest = max(quest))
Desired output:
ID quest event VAR_X
1 1 2 B 6
2 1 3 A 3
3 2 2 D 4
4 2 3 C 5
5 3 2 D 5
Start by omitting the NA values and, at the end, do an inner_join with the original data set.
df %>%
na.omit() %>%
group_by(ID, event) %>%
summarise(quest = max(quest)) %>%
inner_join(df, by = c("ID", "event", "quest"))
## A tibble: 5 x 4
## Groups: ID [3]
# ID event quest VAR_X
# <dbl> <fct> <dbl> <dbl>
#1 1 A 3 3
#2 1 B 2 6
#3 2 C 3 5
#4 2 D 2 4
#5 3 D 2 5
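Another option is to filter within each group, which keeps every original column without a join (drop_na() comes from tidyr):
df %>%
drop_na() %>% # remove if necessary ..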
group_by(ID, event) %>%
filter(quest == max(quest)) %>%
ungroup()
# A tibble: 5 x 4
# ID quest event VAR_X
#<dbl> <dbl> <chr> <dbl>
# 1 1 2 B 6
# 2 1 3 A 3
# 3 2 2 D 4
# 4 2 3 C 5
# 5 3 2 D 5
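Note that filter(quest == max(quest)) returns every tied row if a group's maximum occurs more than once. If exactly one row per group is wanted, one possible variant (a sketch, assuming dplyr 1.0.0 or later for slice_max()) is:

```r
library(dplyr)

df <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,3,3,3),
                 quest = c(1,1,2,2,3,3,1,2,3,1,2,3),
                 event = c("A","B","A","B","A",NA,"C","D","C","D","D",NA),
                 VAR_X = c(2,4,3,6,3,NA,6,4,5,7,5,NA))

res <- df %>%
  filter(!is.na(event)) %>%          # drop rows without an event
  group_by(ID, event) %>%
  # keep the single row with the largest quest in each group
  slice_max(quest, n = 1, with_ties = FALSE) %>%
  ungroup()
```

All other columns, including VAR_X, are carried along with the selected row, so no join back to the original data is needed.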