Filling NA with values from another table in R

I have the following datasets in RStudio:
df =
a b
1 A
1 NA
1 A
1 NA
2 C
2 NA
2 B
3 A
3 NA
3 C
3 D
and fill_with =
a b
1 A
2 B
3 C
How do I fill the NA values in column b of df based on column a?
For example, when a=1 and b=NA, I look up a=1 in fill_with and see that b should be filled with A.
In the end it should look like this:
df =
a b
1 A
1 A
1 A
1 A
2 C
2 B
2 B
3 A
3 C
3 C
3 D

We can use ifelse together with match:
df$b <- ifelse(is.na(df$b) ,
fill_with$b[match(df$a , fill_with$a)] , df$b)
Output
a b
1 1 A
2 1 A
3 1 A
4 1 A
5 2 C
6 2 B
7 2 B
8 3 A
9 3 C
10 3 C
11 3 D
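To make the lookup step explicit, here is a minimal base R sketch of the same idea, split into intermediate steps. The names idx and lookup are only illustrative and are not part of the answer above.

```r
# Sample data from the question
df <- data.frame(
  a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  b = c("A", NA, "A", NA, "C", NA, "B", "A", NA, "C", "D"),
  stringsAsFactors = FALSE
)
fill_with <- data.frame(a = 1:3, b = c("A", "B", "C"), stringsAsFactors = FALSE)

# idx[i] is the row of fill_with whose a equals df$a[i] (NA if there is none)
idx <- match(df$a, fill_with$a)
# lookup[i] is the replacement candidate for row i of df
lookup <- fill_with$b[idx]
# keep existing values and use the candidate only where b is missing
df$b <- ifelse(is.na(df$b), lookup, df$b)
print(df)
```

If a value of a has no row in fill_with, match() returns NA for it, so the corresponding b simply stays NA.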

library(tidyverse)
df <- read_table("a b
1 A
1 NA
1 A
1 NA
2 C
2 NA
2 B
3 A
3 NA
3 C
3 D")
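# Note: fill() takes each NA from neighbouring rows of the same group (next value first, then previous).
# It does not use fill_with and only happens to match the desired output for this data.
df %>%
  group_by(a) %>%
  fill(b, .direction = "updown")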
# A tibble: 11 x 2
# Groups: a [3]
a b
<dbl> <chr>
1 1 A
2 1 A
3 1 A
4 1 A
5 2 C
6 2 B
7 2 B
8 3 A
9 3 C
10 3 C
11 3 D

Base R
tmp=which(is.na(df$b))
df$b[tmp]=fill_with$b[match(df$a,fill_with$a)[tmp]]
a b
1 1 A
2 1 A
3 1 A
4 1 A
5 2 C
6 2 B
7 2 B
8 3 A
9 3 C
10 3 C
11 3 D

library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
b = c("A", NA, "A", NA, "C", NA, "B", "A", NA, "C", "D")
)
fill_with <- data.frame(
stringsAsFactors = FALSE,
a = c(1L, 2L, 3L),
b = c("A", "B", "C")
)
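# Note: rows_update() overwrites every matched b, not only the NAs (row 5 becomes B instead of C),
# so the result below differs from the desired output. dplyr::rows_patch() takes the same
# arguments and only overwrites NA values.
rows_update(x = df, y = fill_with, by = "a")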
#> a b
#> 1 1 A
#> 2 1 A
#> 3 1 A
#> 4 1 A
#> 5 2 B
#> 6 2 B
#> 7 2 B
#> 8 3 C
#> 9 3 C
#> 10 3 C
#> 11 3 C
Created on 2022-08-22 with reprex v2.0.2

Related

How can I stack my dataset so each observation relates to all other observations but itself, within a group?

EDIT: I removed one observation from the data frame in the original post and changed some values so that it is easier to write out by hand. I have also added the desired output to make the question easier to follow.
This is a continuation of a question I asked in another post:
How can I stack my dataset so each observation relates to all other observations but itself?
In that post, I asked how to make a row relate to all other observations but itself. I am trying to apply the answers to my dataset, but it is structured as country-year-party. In my actual dataset, I want each observation to relate to every other observation within the same country-year.
Say for example I have a data frame with 2 countries (id1) A and B:
df <- data.frame(id1 = c("A","A","A","B","B","B"),
id2 = c("a", "b", "c", "a", "b", "c" ),
x1 = c(1,2,3,1,2,3))
df
id1 id2 x1
1 A a 1
2 A b 2
3 A c 3
4 B a 1
5 B b 2
6 B c 3
Each value in column id2 identifies one person (a, b or c). I want each person to relate to every other person within the same country, so person a will be related to persons b and c, but only within their country. I have tried the following code:
df <- df %>% group_by(id1) %>% merge( df, by = NULL) %>%
filter(id2.x != id2.y)
or even:
df <- df %>% group_by(id2) %>%
left_join(df, df, by = character()) %>%
filter(id2.x != id2.y)
But it leads to the following result:
id1.x id2.x x1.x id1.y id2.y x1.y
1 A b 2 A a 1
2 A c 3 A a 1
3 B b 2 A a 1
4 B c 3 A a 1
5 A a 1 A b 2
6 A c 3 A b 2
7 B a 1 A b 2
8 B c 3 A b 2
9 A a 1 A c 3
10 A b 2 A c 3
11 B a 1 A c 3
12 B b 2 A c 3
13 A b 2 B a 1
14 A c 3 B a 1
15 B b 2 B a 1
16 B c 3 B a 1
17 A a 1 B b 2
18 A c 3 B b 2
19 B a 1 B b 2
20 B c 3 B b 2
21 A a 1 B c 3
22 A b 2 B c 3
23 B a 1 B c 3
24 B b 2 B c 3
Notice that in observation 3, person b in country B is related to person a in country A. This is what I am trying to avoid. I want person a to relate to b and c, but only within each country. How can I do that?
The desired output would be something like this:
id1.x id2.x x1.x id1.y id2.y x1.y
1 A a 1 A b 2
2 A a 1 A c 3
3 A b 2 A a 1
4 A b 2 A c 3
5 A c 3 A a 1
6 A c 3 A b 2
7 B a 1 B b 2
8 B a 1 B c 3
9 B b 2 B a 1
10 B b 2 B c 3
11 B c 3 B a 1
12 B c 3 B b 2
So, within each country A and B, each person a, b, c relates to every other person but not to themselves. I have tried to clarify and simplify my example; let me know if anything is still unclear.
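library(tidyverse)
df %>%
  group_by(id1) %>%
  # for each row, keep all other rows of the same group
  # (cur_data_all() is deprecated as of dplyr 1.1.0)
  mutate(vals = map(row_number(), ~ cur_data_all()[-.x, ])) %>%
  unnest(vals, names_sep = "_")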
# A tibble: 12 × 6
# Groups: id1 [2]
id1 id2 x1 vals_id1 vals_id2 vals_x1
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 A a 1 A b 2
2 A a 1 A c 3
3 A b 2 A a 1
4 A b 2 A c 3
5 A c 3 A a 1
6 A c 3 A b 2
7 B a 1 B b 2
8 B a 1 B c 3
9 B b 2 B a 1
10 B b 2 B c 3
11 B c 3 B a 1
12 B c 3 B b 2
Here is a base R option:
df <- data.frame(id1 = c("A","A","A","A","B","B","B","B"),
id2 = c("a", "b", "c", "d", "a", "b", "c", "d"),
x1 = c(1,2,3,4, 5,6,7,8))
# base option (note: combn() yields each unordered pair once, so a-b appears but b-a does not)
by(df, df$id1, \(x){
  rws <- t(combn(seq(nrow(x)), 2))
  cbind(x[rws[,1],], x[rws[,2],2:3]) |>
    `colnames<-`(c("id1", "id2.x","x1.x", "id2.y", "x1.y"))
}) |>
  do.call(what = rbind.data.frame)|>
  `row.names<-`(NULL)
#> id1 id2.x x1.x id2.y x1.y
#> 1 A a 1 b 2
#> 2 A a 1 c 3
#> 3 A a 1 d 4
#> 4 A b 2 c 3
#> 5 A b 2 d 4
#> 6 A c 3 d 4
#> 7 B a 5 b 6
#> 8 B a 5 c 7
#> 9 B a 5 d 8
#> 10 B b 6 c 7
#> 11 B b 6 d 8
#> 12 B c 7 d 8
EDIT: Here is a tidyverse option:
library(tidyverse)
full_join(df, df, by = "id1") |>
filter(id2.x != id2.y)
#> id1 id2.x x1.x id2.y x1.y
#> 1 A a 1 b 2
#> 2 A a 1 c 3
#> 3 A a 1 d 4
#> 4 A b 2 a 1
#> 5 A b 2 c 3
#> 6 A b 2 d 4
#> 7 A c 3 a 1
#> 8 A c 3 b 2
#> 9 A c 3 d 4
#> 10 A d 4 a 1
#> 11 A d 4 b 2
#> 12 A d 4 c 3
#> 13 B a 5 b 6
#> 14 B a 5 c 7
#> 15 B a 5 d 8
#> 16 B b 6 a 5
#> 17 B b 6 c 7
#> 18 B b 6 d 8
#> 19 B c 7 a 5
#> 20 B c 7 b 6
#> 21 B c 7 d 8
#> 22 B d 8 a 5
#> 23 B d 8 b 6
#> 24 B d 8 c 7
Building on @RitchieSacramento's solution from your previous question, you can use expand_grid() inside group_modify().
library(dplyr)
library(tidyr)
df %>%
group_by(id1) %>%
group_modify(~ expand_grid(.x, .x, .name_repair = make.unique)) %>%
ungroup() %>%
filter(id2 != id2.1)
# A tibble: 12 × 5
id1 id2 x1 id2.1 x1.1
<chr> <chr> <dbl> <chr> <dbl>
1 A a 1 b 2
2 A a 1 c 3
3 A b 2 a 1
4 A b 2 c 3
5 A c 3 a 1
6 A c 3 b 2
7 B a 1 b 2
8 B a 1 c 3
9 B b 2 a 1
10 B b 2 c 3
11 B c 3 a 1
12 B c 3 b 2
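For comparison, the same "self-join within country, then drop self-pairs" idea can be sketched in base R only. This is a minimal illustration on the question's sample data, and the name pairs is mine rather than from any of the answers.

```r
df <- data.frame(
  id1 = c("A", "A", "A", "B", "B", "B"),
  id2 = c("a", "b", "c", "a", "b", "c"),
  x1  = c(1, 2, 3, 1, 2, 3)
)

# Self-join on the country: every row is paired with every row of the same id1
pairs <- merge(df, df, by = "id1")
# Drop the pairs in which a person is matched with themselves
pairs <- pairs[pairs$id2.x != pairs$id2.y, ]
rownames(pairs) <- NULL
print(pairs)
```

Because the join key is id1, no row of country A is ever paired with a row of country B, which is what the cross join in the question was missing.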

How to create unique data frame rows from inputs of 2 lists

In this example I have a list with 4 values (Lot) and another with 3 (Method).
Lot <- c("A", "B", "C", "D")
Method <- c(1,2,3)
I need to create a data frame with a Lot and Method column where their values are repeated so each row is unique. I need it to look like this:
# Lot Method
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 1
# 5 B 2
# 6 B 3
# 7 C 1
# 8 C 2
# 9 C 3
# 10 D 1
# 11 D 2
# 12 D 3
How can this be done without creating 2 long repetitive lists like this:
Lot <- c("A","A","A", "B","B","B","C","C","C","D","D","D")
Method <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)
Using expand.grid you could do
expand.grid(
Lot = c("A", "B", "C", "D"),
Method = c(1,2,3)
)
#> Lot Method
#> 1 A 1
#> 2 B 1
#> 3 C 1
#> 4 D 1
#> 5 A 2
#> 6 B 2
#> 7 C 2
#> 8 D 2
#> 9 A 3
#> 10 B 3
#> 11 C 3
#> 12 D 3
Or to get the right order we could do (thanks to @onyambu for pointing that out):
rev(expand.grid(
Method = c(1,2,3),
Lot = c("A", "B", "C", "D")
))
#> Lot Method
#> 1 A 1
#> 5 A 2
#> 9 A 3
#> 2 B 1
#> 6 B 2
#> 10 B 3
#> 3 C 1
#> 7 C 2
#> 11 C 3
#> 4 D 1
#> 8 D 2
#> 12 D 3
Or using the tidyverse you could do:
tidyr::expand_grid(
Lot = c("A", "B", "C", "D"),
Method = c(1,2,3)
)
#> # A tibble: 12 × 2
#> Lot Method
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 3
#> 4 B 1
#> 5 B 2
#> 6 B 3
#> 7 C 1
#> 8 C 2
#> 9 C 3
#> 10 D 1
#> 11 D 2
#> 12 D 3
With data.table, we can use CJ:
library(data.table)
CJ(Lot = c("A", "B", "C", "D"),
Method = c(1, 2, 3))
Output
Lot Method
1: A 1
2: A 2
3: A 3
4: B 1
5: B 2
6: B 3
7: C 1
8: C 2
9: C 3
10: D 1
11: D 2
12: D 3
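If no helper function is wanted at all, the same table can also be built with rep(). This is a small base R sketch; the name result is only illustrative.

```r
Lot <- c("A", "B", "C", "D")
Method <- c(1, 2, 3)

result <- data.frame(
  # repeat each Lot once per Method: A A A B B B ...
  Lot    = rep(Lot, each = length(Method)),
  # cycle through all Methods once per Lot: 1 2 3 1 2 3 ...
  Method = rep(Method, times = length(Lot))
)
print(result)
```

This gives the rows in the order shown in the question, with Lot varying slowest.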

Group by cumulative sums with conditions

In this dataframe:
df <- data.frame(
ID = c("C", "B", "B", "B", NA, "C", "A", NA, "B", "B", "B")
)
I'd like to group the rows using cumsum with two conditions: (i) cumsum should not continue if is.na(ID) and (ii) it should not continue if the next ID value is the same as the prior. I do meet condition (i) with this:
df %>%
group_by(grp = cumsum(!is.na(ID)))
# A tibble: 11 x 2
# Groups: grp [9]
ID grp
<chr> <int>
1 C 1
2 B 2
3 B 3
4 B 4
5 NA 4
6 C 5
7 A 6
8 NA 6
9 B 7
10 B 8
11 B 9
but I don't know how to implement condition (ii) too, to obtain the desired result:
1 C 1
2 B 2
3 B 2
4 B 2
5 NA 2
6 C 3
7 A 4
8 NA 4
9 B 5
10 B 5
11 B 5
I tried it with this but it doesn't work:
df %>%
group_by(grp = cumsum(!is.na(ID) |!lag(ID,1) == ID))
Use na.locf0 from zoo to fill in the NAs and then apply rleid from data.table:
library(data.table)
library(zoo)
rleid(na.locf0(df$ID))
## [1] 1 2 2 2 2 3 4 4 5 5 5
Using tidyr and dplyr, you could do (note that the group numbers start at 0 here; add 1 if they should start at 1):
df %>%
mutate(grp = fill(., ID) %>% pull(),
grp = cumsum(grp != lag(grp, default = first(grp))))
ID grp
1 C 0
2 B 1
3 B 1
4 B 1
5 <NA> 1
6 C 2
7 A 3
8 <NA> 3
9 B 4
10 B 4
11 B 4
Using rle
library(zoo)
with(rle(na.locf0(df$ID)), rep(seq_along(values), lengths))
#[1] 1 2 2 2 2 3 4 4 5 5 5
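The two steps behind these answers (carry the last non-NA value forward, then start a new group whenever the value changes) can also be sketched without extra packages. This is only an illustration; it assumes the first ID is not NA, and the names idx, filled and grp are mine.

```r
ID <- c("C", "B", "B", "B", NA, "C", "A", NA, "B", "B", "B")

# Step 1: last observation carried forward.
# idx counts the non-NA values seen so far, so it points at the most recent one.
# Assumption: ID[1] is not NA (otherwise idx would start at 0).
idx <- cumsum(!is.na(ID))
filled <- ID[!is.na(ID)][idx]

# Step 2: a new group starts at row 1 and wherever the filled value changes
grp <- cumsum(c(TRUE, filled[-1] != filled[-length(filled)]))
print(grp)
```

On the sample data this reproduces the desired grouping from the question.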

Assign ID to consecutive groups column r

I would like to add a column to a data.frame that numbers the consecutive occurrences of each group (column s in dummy_df):
dummy_df = data.frame(s = c("a", "a", "b","b", "b", "c","c", "a", "a", "c", "c","a","a"),
desired_output= c(1,1,1,1,1,1,1,2,2,2,2,3,3))
dummy_df$rleid_output = rleid(dummy_df$s)
dummy_df
s desired_output rleid_output
1 a 1 1
2 a 1 1
3 b 1 2
4 b 1 2
5 b 1 2
6 c 1 3
7 c 1 3
8 a 2 4
9 a 2 4
10 c 2 5
11 c 2 5
12 a 3 6
13 a 3 6
I would say it's similar to what rleid() does, but restarting the count when a new group is seen. However, I can't find a straightforward way to do it. Thanks.
You can do:
dummy_df$out <- with(rle(dummy_df$s), rep(ave(lengths, values, FUN = seq_along), lengths))
Result:
s desired_output out
1 a 1 1
2 a 1 1
3 b 1 1
4 b 1 1
5 b 1 1
6 c 1 1
7 c 1 1
8 a 2 2
9 a 2 2
10 c 2 2
11 c 2 2
12 a 3 3
13 a 3 3
If you are willing to use data.table (rleid is part of the package), you can do it in two steps as follows:
library(data.table)
dummy_df = data.frame(s = c("a", "a", "b", "b", "b", "c", "c", "a", "a", "c", "c", "a", "a"))
# cast data.frame to data.table
setDT(dummy_df)
# create auxiliary variable
dummy_df[, rleid_output := rleid(s)]
# obtain desired output
dummy_df[, desired_output := rleid(rleid_output), by = "s"]
# end result
dummy_df
#> s rleid_output desired_output
#> 1: a 1 1
#> 2: a 1 1
#> 3: b 2 1
#> 4: b 2 1
#> 5: b 2 1
#> 6: c 3 1
#> 7: c 3 1
#> 8: a 4 2
#> 9: a 4 2
#> 10: c 5 2
#> 11: c 5 2
#> 12: a 6 3
#> 13: a 6 3
Created on 2020-10-16 by the reprex package (v0.3.0)
You can try the tidyverse in combination with the base R rle function:
library(tidyverse)
rle(dummy_df$s) %>%
  with(., data.frame(a = .$lengths, b = .$values)) %>%
  group_by(b) %>%
  mutate(n = 1:n()) %>%
  with(., rep(n, times = a)) %>%
  bind_cols(dummy_df, res = .)
s desired_output res
1 a 1 1
2 a 1 1
3 b 1 1
4 b 1 1
5 b 1 1
6 c 1 1
7 c 1 1
8 a 2 2
9 a 2 2
10 c 2 2
11 c 2 2
12 a 3 3
13 a 3 3
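All three answers follow the same pattern: find the runs, number the runs separately for each value, then expand back to row level. A step-by-step base R sketch of that pattern, with illustrative names runs, occurrence and out:

```r
s <- c("a", "a", "b", "b", "b", "c", "c", "a", "a", "c", "c", "a", "a")

# 1. Collapse consecutive duplicates into runs
runs <- rle(s)   # values: a b c a c a, lengths: 2 3 2 2 2 2

# 2. Number the runs separately for each distinct value
occurrence <- ave(seq_along(runs$values), runs$values, FUN = seq_along)

# 3. Expand the run-level counter back to one entry per original row
out <- rep(occurrence, runs$lengths)
print(out)
```

The result matches the desired_output column from the question.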

Conditional statement within group

I have a data frame in which I want to create a new column whose values depend on a condition within groups. For the data frame below, I want a new column n_actions which gives:
Cond 1. the number 2 for the whole group if a 6 appears in column step
Cond 2. the number 3 for the whole group if a 9 appears in column step
Cond 3. the number 1 if neither a 6 nor a 9 appears in column step for the group
#dataframe start
dataframe <- data.frame(group = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D", "D", "D", "D", "D", "D"),
step = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9))
# dataframe desired
dataframe$n_actions <- c(rep(1, 3), rep(2, 6), rep(1, 3), rep(3, 9))
Try out:
library(dplyr)
dataframe %>%
group_by(group) %>%
mutate(n_actions = ifelse(9 %in% step, 3,
ifelse(6 %in% step, 2, 1)))
# A tibble: 21 x 3
# Groups: group [4]
group step n_actions
<fctr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 1
4 B 1 2
5 B 2 2
6 B 3 2
7 B 4 2
8 B 5 2
9 B 6 2
10 C 1 1
# ... with 11 more rows
Another way with dplyr's case_when:
library(dplyr)
dataframe %>%
group_by(group) %>%
mutate(
n_actions = case_when(
9 %in% step ~ 3,
6 %in% step ~ 2,
TRUE ~ 1
)
)
Output:
# A tibble: 21 x 3
# Groups: group [4]
group step n_actions
<fct> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 1
4 B 1 2
5 B 2 2
6 B 3 2
7 B 4 2
8 B 5 2
9 B 6 2
10 C 1 1
11 C 2 1
12 C 3 1
13 D 1 3
14 D 2 3
15 D 3 3
16 D 4 3
17 D 5 3
18 D 6 3
19 D 7 3
20 D 8 3
21 D 9 3
For this data, it seems you could also take the maximum step per group and integer-divide it by 3 with %/%:
dataframe <- transform(dataframe,
n_actions2 = ave(step, group, FUN = function(x) max(x) %/% 3))
dataframe
# group step n_actions n_actions2
#1 A 1 1 1
#2 A 2 1 1
#3 A 3 1 1
#4 B 1 2 2
#5 B 2 2 2
#6 B 3 2 2
#7 B 4 2 2
#8 B 5 2 2
#9 B 6 2 2
#10 C 1 1 1
#11 C 2 1 1
#12 C 3 1 1
#13 D 1 3 3
#14 D 2 3 3
#15 D 3 3 3
#16 D 4 3 3
#17 D 5 3 3
#18 D 6 3 3
#19 D 7 3 3
#20 D 8 3 3
#21 D 9 3 3
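The max(x) %/% 3 shortcut relies on the steps running from 1 up to 3, 6 or 9. A base R sketch that checks the stated conditions directly could look like this; the helper name n_actions_for is hypothetical and only used for illustration.

```r
dataframe <- data.frame(
  group = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C",
            "D", "D", "D", "D", "D", "D", "D", "D", "D"),
  step = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# Hypothetical helper: one value per group, checking for a 9 first, then a 6
n_actions_for <- function(step) {
  if (9 %in% step) 3 else if (6 %in% step) 2 else 1
}

# ave() applies the helper per group and puts the result back on every row
dataframe$n_actions <- ave(
  dataframe$step, dataframe$group,
  FUN = function(x) rep(n_actions_for(x), length(x))
)
print(dataframe)
```

Checking for 9 before 6 matters because a group such as D contains both values.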
