rbind dataframes by filling missing rows from the first dataframe - r

I have 4 datasets from 4 rounds of a survey, with the first round containing 5 variables and the next ones containing only 3. This is because the ID (same sample) and the other two variables (v1 and v2) are fixed over time.
df1 <- data.frame(id = c(1:5), round=1, v1 = c(6:10), v2 = c(11:15), v3=c(16:20))
df2 <- data.frame(id = c(1:5), round=2, v3=c(26:30))
df3 <- data.frame(id = c(1:5), round=3, v3=c(36:40))
df4 <- data.frame(id = c(1:5), round=4, v3=c(46:50))
# rbind
library(dplyr)
list(df1, df2, df3, df4) %>%
  bind_rows(.id = 'grp') %>%
  group_by(id)
Now when I rbind them, I end up with missing values (NA) for the two fixed variables in rounds 2 to 4:
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 NA NA 26
7 2 2 2 NA NA 27
8 2 3 2 NA NA 28
9 2 4 2 NA NA 29
10 2 5 2 NA NA 30
11 3 1 3 NA NA 36
12 3 2 3 NA NA 37
13 3 3 3 NA NA 38
14 3 4 3 NA NA 39
15 3 5 3 NA NA 40
16 4 1 4 NA NA 46
17 4 2 4 NA NA 47
18 4 3 4 NA NA 48
19 4 4 4 NA NA 49
20 4 5 4 NA NA 50
but I need v1 and v2 to be filled for the next rounds as well by matching the respective ID.
Please let me know if there is any way to do this in R (or in Python).
Thank you.

library(dplyr)
library(tidyr)
list(df1, df2, df3, df4) %>%
  bind_rows(.id = 'grp') %>%
  group_by(id) %>%
  fill(v1:v3) # from tidyr; within each id, an NA takes the value from the row above
#fill(4:6) # alternative syntax: columns 4-6
#fill(-c(1:3)) # alternative syntax: everything except columns 1:3
#fill(everything()) # alternative syntax: fill NAs in all columns
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 6 11 26
7 2 2 2 7 12 27
8 2 3 2 8 13 28
9 2 4 2 9 14 29
10 2 5 2 10 15 30
11 3 1 3 6 11 36
12 3 2 3 7 12 37
13 3 3 3 8 13 38
14 3 4 3 9 14 39
15 3 5 3 10 15 40
16 4 1 4 6 11 46
17 4 2 4 7 12 47
18 4 3 4 8 13 48
19 4 4 4 9 14 49
20 4 5 4 10 15 50
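As an alternative sketch that does not depend on row order, the fixed variables can be joined onto the later rounds by id, since v1 and v2 only exist in the first round. The object names fixed, later and result are illustrative and do not come from the original posts.

```r
library(dplyr)

df1 <- data.frame(id = 1:5, round = 1, v1 = 6:10, v2 = 11:15, v3 = 16:20)
df2 <- data.frame(id = 1:5, round = 2, v3 = 26:30)
df3 <- data.frame(id = 1:5, round = 3, v3 = 36:40)
df4 <- data.frame(id = 1:5, round = 4, v3 = 46:50)

# Lookup table of the time-invariant variables, taken from round 1
fixed <- select(df1, id, v1, v2)

# Stack the later rounds and attach v1/v2 by matching on id
later <- bind_rows(df2, df3, df4) %>%
  left_join(fixed, by = "id")

# Put round 1 back on top and restore the original column order
result <- bind_rows(df1, later) %>%
  select(id, round, v1, v2, v3)
```

Unlike fill(v1:v3), this only touches v1 and v2, so a genuinely missing v3 in a later round would stay NA instead of being carried forward from the previous round.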

Related

filter() rows from dataframe with condition on previous and next row, keeping NA values

I have a dataframe like this:
AA<-c(1,2,4,5,6,7,10,11,12,13,14,15)
BB<-c(32,21,21,NA,27,31,31,12,28,NA,48,7)
df<- data.frame(AA,BB)
I want to remove rows where the BB value is equal to both the previous and the next row, so that only the first and last occurrences of each run of values in the BB column are kept. I also want to keep the NA rows. I arrived at this code, which is not far from what I want:
lighten_df <- df %>% filter(BB!=lag(BB) | BB!=lead(BB) | is.na(BB) )
which gives me:
> lighten_df
AA BB
1 1 32
2 2 21
3 5 NA
4 6 27
5 7 31
6 10 31
7 11 12
8 12 28
9 13 NA
10 14 48
11 15 7
My problem is that I would like to keep both the first and the last 21 in column BB. This is the result I expect:
AA BB
1 1 32
2 2 21
3 4 21
4 5 NA
5 6 27
6 7 31
7 10 31
8 11 12
9 12 28
10 13 NA
11 14 48
12 15 7
Any idea?
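I would suggest a different approach: define a grouping variable that identifies runs of consecutive equal values, and keep the first and last rows within each group:
library(dplyr)
df %>%
  group_by(grp = data.table::rleid(BB)) %>%
  slice(unique(c(1, n()))) # unique() avoids duplicating the row when a run has length 1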
# # A tibble: 12 × 3
# # Groups: grp [10]
# AA BB grp
# <dbl> <dbl> <int>
# 1 1 32 1
# 2 2 21 2
# 3 4 21 2
# 4 5 NA 3
# 5 6 27 4
# 6 7 31 5
# 7 10 31 5
# 8 11 12 6
# 9 12 28 7
# 10 13 NA 8
# 11 14 48 9
# 12 15 7 10
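For context, the original filter() drops the row with AA = 4 because BB != lead(BB) is NA there (the next value is NA), FALSE | NA evaluates to NA, and filter() discards rows whose condition is NA. A minimal sketch of an NA-safe version of that filter follows; the helper same_value() is hypothetical and not part of dplyr.

```r
library(dplyr)

df <- data.frame(
  AA = c(1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15),
  BB = c(32, 21, 21, NA, 27, 31, 31, 12, 28, NA, 48, 7)
)

# Hypothetical helper: TRUE where a and b hold the same value, never NA.
# Two NAs count as equal; an NA and a number count as different.
same_value <- function(a, b) {
  (is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)
}

# Keep NA rows, and drop a row only if it equals BOTH neighbours
lighten_df <- df %>%
  filter(is.na(BB) | !(same_value(BB, lag(BB)) & same_value(BB, lead(BB))))
```

With the sample data no value repeats three times in a row, so all 12 rows are kept, which matches the expected result in the question.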

Completing a sequence of integers by group with tidyverse in R

Given a dataset that contains a grouping variable and an incomplete column of integers (it contains NAs), where the first and last integer vary by group, the length of each group varies, and the first or last value may itself be NA, how might one fill in the NA values by completing the sequence?
The following dataset may be used as an example:
library(dplyr)
set.seed(5112021)
dat1 <- bind_rows(data.frame(Group=1,Seq=(3:20)),
data.frame(Group=2,Seq=(-1:25))) %>%
mutate(rn = rnorm(45,mean=0.5,sd=1),
Seq = ifelse(rn < 0.4,NA,Seq)) %>%
select(-rn) %>%
group_by(Group) %>%
mutate(Seq = ifelse(Seq==-1,NA,Seq))
dat1
Group Seq
1 1 NA
2 1 NA
3 1 NA
4 1 6
5 1 7
6 1 8
7 1 NA
8 1 10
9 1 11
10 1 NA
11 1 13
12 1 NA
13 1 15
14 1 NA
15 1 NA
16 1 NA
17 1 NA
18 1 20
19 2 NA
20 2 0
21 2 NA
22 2 2
23 2 3
24 2 NA
25 2 5
26 2 6
27 2 7
28 2 8
29 2 NA
30 2 10
31 2 NA
32 2 12
33 2 NA
34 2 NA
35 2 NA
36 2 16
37 2 17
38 2 NA
39 2 NA
40 2 NA
41 2 NA
42 2 22
43 2 NA
44 2 NA
45 2 NA
One way to do this is to make use of row numbers within each group (since they are themselves a sequence of consecutive integers): calculate the difference between the non-missing values and the row number (which is a single constant value per group) and then add that value back to the row number.
For example:
dat2 <- dat1 %>%
group_by(Group) %>%
mutate(rn = row_number(),
diff = mean(Seq-rn,na.rm=T)) %>%
mutate(New_Seq = rn+diff) %>%
select(-rn,-diff)
dat2
Group Seq New_Seq
1 1 NA 3
2 1 NA 4
3 1 NA 5
4 1 6 6
5 1 7 7
6 1 8 8
7 1 NA 9
8 1 10 10
9 1 11 11
10 1 NA 12
11 1 13 13
12 1 NA 14
13 1 15 15
14 1 NA 16
15 1 NA 17
16 1 NA 18
17 1 NA 19
18 1 20 20
19 2 NA -1
20 2 0 0
21 2 NA 1
22 2 2 2
23 2 3 3
24 2 NA 4
25 2 5 5
26 2 6 6
27 2 7 7
28 2 8 8
29 2 NA 9
30 2 10 10
31 2 NA 11
32 2 12 12
33 2 NA 13
34 2 NA 14
35 2 NA 15
36 2 16 16
37 2 17 17
38 2 NA 18
39 2 NA 19
40 2 NA 20
41 2 NA 21
42 2 22 22
43 2 NA 23
44 2 NA 24
45 2 NA 25
While this works, it doesn't seem very elegant and may be slow for very large datasets with many grouping variables. I'm curious whether there is a more 'Tidyverse' way to do this.
You could do something like:
library(dplyr)
dat1 %>%
  group_by(Group) %>%
  mutate(newseq = seq_along(Group) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq)))) - 1) %>%
  ungroup()
Or
dat1 %>%
  group_by(Group) %>%
  mutate(newseq = seq(first(na.omit(Seq)) - sum(cumall(is.na(Seq))), length.out = n())) %>%
  ungroup()
Or
dat1 %>%
  group_by(Group) %>%
  mutate(newseq = 0:(n() - 1) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq))))) %>%
  ungroup()
All these do the same thing: shift the start of the sequence by the difference of the first non-NA value and the number of NAs before it.
Output
Group Seq newseq
<int> <int> <dbl>
1 1 NA 3
2 1 NA 4
3 1 NA 5
4 1 6 6
5 1 7 7
6 1 8 8
7 1 NA 9
8 1 10 10
9 1 11 11
10 1 NA 12
# ... with 35 more rows
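The shift described in this answer can be traced on a short vector. This is only an illustrative sketch; the variable names leading_na, first_val, start and newseq are made up for the example.

```r
library(dplyr)

Seq <- c(NA, NA, NA, 6, 7, NA, 9)

# cumall() stays TRUE only while every element so far is TRUE,
# so summing it counts the NAs that come before the first known value
leading_na <- sum(cumall(is.na(Seq)))   # 3

# First non-missing value (same value as first(na.omit(Seq)))
first_val <- Seq[!is.na(Seq)][1]        # 6

# The sequence must start leading_na steps before the first known value
start  <- first_val - leading_na        # 3
newseq <- seq(start, length.out = length(Seq))
newseq                                  # 3 4 5 6 7 8 9
```

Only the first non-missing value is used, so the later known values (7 and 9 here) are assumed to be consistent with it rather than checked.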
First create a row number, then take the maximum difference between Seq and the row number and add it to the row number:
dat1 %>%
group_by(Group) %>%
mutate(rn = row_number(),
Seq = rn + max(Seq - rn, na.rm = TRUE)) %>%
ungroup() %>%
select(-rn)
Output:
Group Seq
<dbl> <int>
1 1 3
2 1 4
3 1 5
4 1 6
5 1 7
6 1 8
7 1 9
8 1 10
9 1 11
10 1 12
11 1 13
12 1 14
13 1 15
14 1 16
15 1 17
16 1 18
17 1 19
18 1 20
19 2 -1
20 2 0
21 2 1
22 2 2
23 2 3
24 2 4
25 2 5
26 2 6
27 2 7
28 2 8
29 2 9
30 2 10
31 2 11
32 2 12
33 2 13
34 2 14
35 2 15
36 2 16
37 2 17
38 2 18
39 2 19
40 2 20
# … with 5 more rows
data:
set.seed(5112021)
dat1 <- bind_rows(data.frame(Group=1,Seq=(3:20)),
data.frame(Group=2,Seq=(-1:25))) %>%
mutate(rn = rnorm(45,mean=0.5,sd=1),
Seq = ifelse(rn < 0.4,NA,Seq)) %>%
select(-rn) %>%
group_by(Group) %>%
mutate(Seq = ifelse(Seq==-1,NA,Seq))
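Both the max-based approach above and the mean-based one in the question assume that Seq minus the row number is the same for every non-missing row of a group. A small sketch of how that assumption could be checked before filling, using a hand-made data frame (dat, offsets and n_offsets are illustrative names):

```r
library(dplyr)

# Small hand-made example: group 1 runs 3..7, group 2 runs -1..3, with gaps
dat <- data.frame(
  Group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Seq   = c(NA, 4, NA, 6, 7, NA, 0, 1, NA, 3)
)

# Count the distinct offsets (Seq - row number) per group, ignoring NAs
offsets <- dat %>%
  group_by(Group) %>%
  mutate(offset = Seq - row_number()) %>%
  summarise(n_offsets = n_distinct(offset, na.rm = TRUE), .groups = "drop")

offsets
```

A group with n_offsets greater than 1 has known values that do not lie on a single run of consecutive integers, in which case mean() and max() would give different (and both questionable) results.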

Repeat the first two rows for each id two times

I would like to repeat the first two rows for each id two times. I don't know how to do that. Does anyone have a suggestion?
id <- rep(1:4,each=6)
scored <- c(12,13,NA,NA,NA,NA,14,20,NA,NA,NA,NA,23,56,NA,NA,NA,NA, 45,78,NA,NA,NA,NA)
df <- data.frame(id,scored)
df
id scored
1 1 12
2 1 13
3 1 NA
4 1 NA
5 1 NA
6 1 NA
7 2 14
8 2 20
9 2 NA
10 2 NA
11 2 NA
12 2 NA
13 3 23
14 3 56
15 3 NA
16 3 NA
17 3 NA
18 3 NA
19 4 45
20 4 78
21 4 NA
22 4 NA
23 4 NA
24 4 NA
I want it to look like:
df
id score
1 1 12
2 1 13
3 1 12
4 1 13
5 1 12
6 1 13
7 2 14
8 2 20
9 2 14
10 2 20
11 2 14
12 2 20
13 3 23
14 3 56
15 3 23
16 3 56
17 3 23
18 3 56
19 4 45
20 4 78
21 4 45
22 4 78
23 4 45
24 4 78
We can group by 'id' and use rep() to recycle the non-NA elements of 'scored' to the length of each group:
library(dplyr)
df %>%
group_by(id) %>%
mutate(scored = rep(scored[!is.na(scored)], length.out = n()))
# A tibble: 24 x 2
# Groups: id [4]
# id scored
# <int> <dbl>
# 1 1 12
# 2 1 13
# 3 1 12
# 4 1 13
# 5 1 12
# 6 1 13
# 7 2 14
# 8 2 20
# 9 2 14
#10 2 20
# … with 14 more rows
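The recycling that rep() performs with length.out can be seen on its own, outside the grouped mutate. A minimal sketch (first_two, six and five are illustrative names):

```r
first_two <- c(12, 13)

# The vector is repeated and then cut to the requested length
six  <- rep(first_two, length.out = 6)  # 12 13 12 13 12 13
five <- rep(first_two, length.out = 5)  # 12 13 12 13 12
```

Note that the answer recycles all non-NA values of a group. If a group could contain more than two non-NA values and only the first two should repeat, wrapping the subset in head(..., 2) would be one way to enforce that.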

Rolling sum in dplyr

set.seed(123)
df <- data.frame(x = sample(1:10, 20, replace = T), id = rep(1:2, each = 10))
For each id, I want to create a column which has the sum of previous 5 x values.
df %>% group_by(id) %>% mutate(roll.sum = c(x[1:4], zoo::rollapply(x, 5, sum)))
# Groups: id [2]
x id roll.sum
<int> <int> <int>
3 1 3
8 1 8
5 1 5
9 1 9
10 1 10
1 1 36
6 1 39
9 1 40
6 1 41
5 1 37
10 2 10
5 2 5
7 2 7
6 2 6
2 2 2
9 2 39
3 2 32
1 2 28
4 2 25
10 2 29
The 6th row should be 35 (3 + 8 + 5 + 9 + 10), the 7th row should be 33 (8 + 5 + 9 + 10 + 1) and so on.
However, the code above also includes the current row in the calculation. How can I fix it?
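library(dplyr)
library(zoo)
df %>% group_by(id) %>%
  mutate(Sum_prev = rollapply(x, list(-(1:5)), sum, fill = NA, align = "right", partial = FALSE))
# the width list(-(1:5)) is a set of offsets -1 to -5 relative to the current row, so the row itself is excluded
# you can use rollapply(x, list(1:5), sum, fill = NA, align = "left", partial = FALSE)
# to sum the next 5 elements, skipping the current one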
x id Sum_prev
1 3 1 NA
2 8 1 NA
3 5 1 NA
4 9 1 NA
5 10 1 NA
6 1 1 35
7 6 1 33
8 9 1 31
9 6 1 35
10 5 1 32
11 10 2 NA
12 5 2 NA
13 7 2 NA
14 6 2 NA
15 2 2 NA
16 9 2 30
17 3 2 29
18 1 2 27
19 4 2 21
20 10 2 19
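A dplyr-only alternative, sketched here as a different technique from the rollapply() answer: the sum of the previous 5 values equals the cumulative sum up to the previous row minus the cumulative sum six rows back. The x values below are typed in from the output printed in the question, and the helper column cs is illustrative.

```r
library(dplyr)

df <- data.frame(
  x  = c(3, 8, 5, 9, 10, 1, 6, 9, 6, 5,
         10, 5, 7, 6, 2, 9, 3, 1, 4, 10),
  id = rep(1:2, each = 10)
)

res <- df %>%
  group_by(id) %>%
  mutate(
    cs = cumsum(x),
    # cs one row back minus cs six rows back = sum of the 5 rows before this one;
    # rows without 5 predecessors get NA
    Sum_prev = if_else(row_number() > 5,
                       lag(cs) - lag(cs, 6, default = 0),
                       NA_real_)
  ) %>%
  ungroup() %>%
  select(-cs)
```

This avoids the zoo dependency, but it is tied to sums; other window functions (median, max, and so on) still need a rolling helper such as rollapply().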
There is the rollify function in the tibbletime package that you could use. You can read about it in this vignette: Rolling calculations in tibbletime.
library(tibbletime)
library(dplyr)
rollig_sum <- rollify(.f = sum, window = 5)
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) #added lag() here
# A tibble: 20 x 3
# Groups: id [2]
# x id roll.sum
# <int> <int> <int>
# 1 3 1 NA
# 2 8 1 NA
# 3 5 1 NA
# 4 9 1 NA
# 5 10 1 NA
# 6 1 1 35
# 7 6 1 33
# 8 9 1 31
# 9 6 1 35
#10 5 1 32
#11 10 2 NA
#12 5 2 NA
#13 7 2 NA
#14 6 2 NA
#15 2 2 NA
#16 9 2 30
#17 3 2 29
#18 1 2 27
#19 4 2 21
#20 10 2 19
If you want the NAs replaced by some other value, you can use, for example, if_else():
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) %>%
mutate(roll.sum = if_else(is.na(roll.sum), x, roll.sum))

count row number first and then insert new row by condition [duplicate]

This question already has answers here:
How to create missing value for repeated measurement data?
(2 answers)
Closed 4 years ago.
I need to count the number of rows in each group after a group_by() and, if a group has fewer than 6 rows, add new row(s) so that it has 6.
My df has three variables (v1, v2, v3): v1 = group name, v2 = row number (i.e., 1,2,3,4,5,6). In the new row(s), I want to repeat the v1 value, have v2 continue the row count, and set v3 = NA.
sample df
v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32
expected output
v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
2 5 NA #insert
2 6 NA #insert
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32
3 6 NA #insert
I tried to count the rows first with dplyr, but I don't know whether or how I can add this if/else condition using the pipe. Or is there an easier function?
My code
df %>%
group_by(v1) %>%
dplyr::summarise(N=n()) %>%
if (N < 6) {
# sth like that?
}
Thanks!
We can use complete() from tidyr, which adds a row for every missing combination of v1 and v2:
library(tidyverse)
complete(df, v1, v2)
# A tibble: 18 x 3
# v1 v2 v3
# <int> <int> <int>
# 1 1 1 79
# 2 1 2 32
# 3 1 3 53
# 4 1 4 33
# 5 1 5 76
# 6 1 6 11
# 7 2 1 32
# 8 2 2 42
# 9 2 3 44
#10 2 4 12
#11 2 5 NA
#12 2 6 NA
#13 3 1 22
#14 3 2 12
#15 3 3 12
#16 3 4 67
#17 3 5 32
#18 3 6 NA
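complete(df, v1, v2) only expands to the v2 values that already occur somewhere in the data, so it reaches 6 rows per group here because group 1 happens to have all six. A sketch of a variant that states the target range explicitly, on a smaller made-up data frame where no group is complete (the data and the name out are illustrative):

```r
library(tidyr)

# No group reaches v2 = 6 in this hand-made example
df <- data.frame(
  v1 = c(1L, 1L, 2L, 2L, 2L),
  v2 = c(1L, 2L, 1L, 2L, 3L),
  v3 = c(79, 32, 32, 42, 44)
)

# Supplying v2 = 1:6 forces every group to have rows 1 to 6;
# v3 is NA in the rows that had to be added
out <- complete(df, v1, v2 = 1:6)
```

Here out has 12 rows (2 groups times 6), of which 7 have v3 = NA.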
Here is a way to do it using merge.
df <- read.table(text =
"v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32", header = T)
# Build the full grid of v1 x v2 combinations, then left-join the data onto it
toMerge <- data.frame(v1 = rep(1:3, each = 6), v2 = rep(1:6, times = 3))
m <- merge(toMerge, df, by = c("v1", "v2"), all.x = TRUE)
m
v1 v2 v3
1 1 1 79
2 1 2 32
3 1 3 53
4 1 4 33
5 1 5 76
6 1 6 11
7 2 1 32
8 2 2 42
9 2 3 44
10 2 4 12
11 2 5 NA
12 2 6 NA
13 3 1 22
14 3 2 12
15 3 3 12
16 3 4 67
17 3 5 32
18 3 6 NA
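The grid in the merge answer is written out for exactly three groups. A sketch of the same idea with expand.grid(), which builds the grid from whatever groups are present (the name grid is illustrative; the target of 6 rows per group is taken from the question):

```r
df <- read.table(text = "v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32", header = TRUE)

# Every observed group crossed with row numbers 1 to 6
grid <- expand.grid(v1 = unique(df$v1), v2 = 1:6)

# Left join: keep every grid row, fill v3 with NA where df has no match
m <- merge(grid, df, by = c("v1", "v2"), all.x = TRUE)
```

merge() sorts the result by the key columns by default, so the rows come out ordered by v1 and then v2.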

Resources