Completing a sequence of integers by group with tidyverse in R

Given a dataset with a grouping variable and an integer column that is incomplete (contains NAs), where the first and last integers vary by group and the length of each group varies (and could be NA), how might one fill in the NA values by completing the sequence?
The following dataset may be used as an example:
library(dplyr)
set.seed(5112021)
dat1 <- bind_rows(data.frame(Group=1,Seq=(3:20)),
data.frame(Group=2,Seq=(-1:25))) %>%
mutate(rn = rnorm(45,mean=0.5,sd=1),
Seq = ifelse(rn < 0.4,NA,Seq)) %>%
select(-rn) %>%
group_by(Group) %>%
mutate(Seq = ifelse(Seq==-1,NA,Seq))
dat1
Group Seq
1 1 NA
2 1 NA
3 1 NA
4 1 6
5 1 7
6 1 8
7 1 NA
8 1 10
9 1 11
10 1 NA
11 1 13
12 1 NA
13 1 15
14 1 NA
15 1 NA
16 1 NA
17 1 NA
18 1 20
19 2 NA
20 2 0
21 2 NA
22 2 2
23 2 3
24 2 NA
25 2 5
26 2 6
27 2 7
28 2 8
29 2 NA
30 2 10
31 2 NA
32 2 12
33 2 NA
34 2 NA
35 2 NA
36 2 16
37 2 17
38 2 NA
39 2 NA
40 2 NA
41 2 NA
42 2 22
43 2 NA
44 2 NA
45 2 NA
One way to do this is to use row numbers by group, since they are themselves a sequence of integers: calculate the difference between the non-missing values and the row number (which is constant within a group), then add that difference back to the row number.
For example:
dat2 <- dat1 %>%
group_by(Group) %>%
mutate(rn = row_number(),
diff = mean(Seq-rn,na.rm=T)) %>%
mutate(New_Seq = rn+diff) %>%
select(-rn,-diff)
dat2
Group Seq New_Seq
1 1 NA 3
2 1 NA 4
3 1 NA 5
4 1 6 6
5 1 7 7
6 1 8 8
7 1 NA 9
8 1 10 10
9 1 11 11
10 1 NA 12
11 1 13 13
12 1 NA 14
13 1 15 15
14 1 NA 16
15 1 NA 17
16 1 NA 18
17 1 NA 19
18 1 20 20
19 2 NA -1
20 2 0 0
21 2 NA 1
22 2 2 2
23 2 3 3
24 2 NA 4
25 2 5 5
26 2 6 6
27 2 7 7
28 2 8 8
29 2 NA 9
30 2 10 10
31 2 NA 11
32 2 12 12
33 2 NA 13
34 2 NA 14
35 2 NA 15
36 2 16 16
37 2 17 17
38 2 NA 18
39 2 NA 19
40 2 NA 20
41 2 NA 21
42 2 22 22
43 2 NA 23
44 2 NA 24
45 2 NA 25
While this works, it doesn't seem very elegant and may be slow for very large datasets with many grouping variables. I'm curious whether there is a more 'Tidyverse' way to do this.

You could do something like this (here df stands for the question's dat1):
df %>%
group_by(Group) %>%
mutate(newseq = seq_along(Group) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq)))) - 1) %>%
ungroup()
Or
df %>%
group_by(Group) %>%
mutate(newseq = seq(first(na.omit(Seq)) - sum(cumall(is.na(Seq))), length.out = n())) %>%
ungroup()
Or
df %>%
group_by(Group) %>%
mutate(newseq = 0:(n() - 1) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq))))) %>%
ungroup()
All these do the same thing: shift the start of the sequence by the difference of the first non-NA value and the number of NAs before it.
Output
Group Seq newseq
<int> <int> <dbl>
1 1 NA 3
2 1 NA 4
3 1 NA 5
4 1 6 6
5 1 7 7
6 1 8 8
7 1 NA 9
8 1 10 10
9 1 11 11
10 1 NA 12
# ... with 35 more rows
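As a rough illustration of the offset arithmetic described above, here is a base R sketch on a single made-up vector. The name seq_vals and its values are hypothetical and not taken from the question's data.

```r
# Hypothetical input: three leading NAs, first observed value is 6
seq_vals <- c(NA, NA, NA, 6, 7, NA, 9)

# Position of the first non-NA value, and how many NAs come before it
first_pos <- which(!is.na(seq_vals))[1]
leading_nas <- first_pos - 1

# Start of the completed sequence: first observed value minus the leading NAs
start <- seq_vals[first_pos] - leading_nas

# Rebuild the whole sequence from that start
completed <- seq(start, length.out = length(seq_vals))
completed
```

This assumes the observed values really are consecutive integers aligned with their row positions; it does not check for gaps or inconsistent values.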

First create row number, then take the max difference of Seq and row_number and add to row number:
dat1 %>%
group_by(Group) %>%
mutate(rn = row_number(),
Seq = rn + max(Seq - rn, na.rm = TRUE)) %>%
ungroup() %>%
select(-rn)
Output:
Group Seq
<dbl> <int>
1 1 3
2 1 4
3 1 5
4 1 6
5 1 7
6 1 8
7 1 9
8 1 10
9 1 11
10 1 12
11 1 13
12 1 14
13 1 15
14 1 16
15 1 17
16 1 18
17 1 19
18 1 20
19 2 -1
20 2 0
21 2 1
22 2 2
23 2 3
24 2 4
25 2 5
26 2 6
27 2 7
28 2 8
29 2 9
30 2 10
31 2 11
32 2 12
33 2 13
34 2 14
35 2 15
36 2 16
37 2 17
38 2 18
39 2 19
40 2 20
# … with 5 more rows
data:
set.seed(5112021)
dat1 <- bind_rows(data.frame(Group=1,Seq=(3:20)),
data.frame(Group=2,Seq=(-1:25))) %>%
mutate(rn = rnorm(45,mean=0.5,sd=1),
Seq = ifelse(rn < 0.4,NA,Seq)) %>%
select(-rn) %>%
group_by(Group) %>%
mutate(Seq = ifelse(Seq==-1,NA,Seq))
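One case worth guarding against: if a group has no non-NA value at all, max(..., na.rm = TRUE) returns -Inf with a warning, so the completed column would be -Inf for that group. The following base R sketch shows one possible guard. The helper name complete_seq and the toy vectors are assumptions made for illustration, not part of the answer above.

```r
# Hypothetical helper: complete one group's sequence from its row offsets,
# returning NA for a group that has no observed value at all
complete_seq <- function(s) {
  idx <- seq_along(s)
  if (all(is.na(s))) {
    return(rep(NA_real_, length(s)))
  }
  offset <- max(s - idx, na.rm = TRUE)
  idx + offset
}

# Toy data: group 2 is entirely NA
groups <- c(1, 1, 1, 2, 2, 3, 3)
vals   <- c(NA, 4, NA, NA, NA, 10, NA)

# ave() applies the helper within each group and keeps the original row order
filled <- ave(vals, groups, FUN = complete_seq)
filled
```

The same helper could be called inside a grouped mutate(); ave() is used here only to keep the sketch free of package dependencies.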

Related

How can I insert values from one dataframe into another dataframe

Data similar to mine:
dat1<-read.table (text=" ID Rat Garden Class Time1 Time2 Time3
1 12 12 0 15 16 20
1 13 0 1 NA NA NA
2 13 11 0 18 12 16
2 9 0 1 NA NA NA
1 6 13 0 17 14 14
1 7 0 2 NA NA NA
2 4 14 0 17 16 12
2 3 0 2 NA NA NA
", header=TRUE)
dat2<-read.table (text=" ID Value1 Value2
1 6 7
2 5 4
", header=TRUE)
I want to insert the values of dat2 into the Time1 column of dat1, in the rows where the Class column is 1 or 2.
The desired outcome is:
ID Rat Garden Class Time1 Time2 Time3
1 12 12 0 15 16 20
1 13 0 1 6
2 13 11 0 18 12 16
2 9 0 1 5
1 6 13 0 17 14 14
1 7 0 2 7
2 4 14 0 17 16 12
2 3 0 2 4
We may group by 'ID' and replace the NA values in 'Time1' with the unlisted 'Value' columns of 'dat2' for the row where the ID matches:
library(dplyr)
dat1 %>%
group_by(ID) %>%
mutate(Time1 = replace(Time1, is.na(Time1),
unlist(dat2[-1][dat2$ID == cur_group()$ID,]))) %>%
ungroup
-output
# A tibble: 8 × 7
ID Rat Garden Class Time1 Time2 Time3
<int> <int> <int> <int> <int> <int> <int>
1 1 12 12 0 15 16 20
2 1 13 0 1 6 NA NA
3 2 13 11 0 18 12 16
4 2 9 0 1 5 NA NA
5 1 6 13 0 17 14 14
6 1 7 0 2 7 NA NA
7 2 4 14 0 17 16 12
8 2 3 0 2 4 NA NA
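The replacement above relies on the NA rows of each ID appearing in the same order as the Value columns. A possible alternative is to look each value up explicitly, as in this base R sketch. It assumes that Class k corresponds to column Value k of dat2, which the question implies but does not state, and it keeps only the columns needed for the illustration.

```r
# Reduced copies of the question's data (only the columns used here)
dat1 <- data.frame(ID    = c(1, 1, 2, 2, 1, 1, 2, 2),
                   Class = c(0, 1, 0, 1, 0, 2, 0, 2),
                   Time1 = c(15, NA, 18, NA, 17, NA, 17, NA))
dat2 <- data.frame(ID = c(1, 2), Value1 = c(6, 5), Value2 = c(7, 4))

# Lookup matrix: one row per ID, one column per Class (assumption: Class k -> Value k)
lookup <- as.matrix(dat2[, c("Value1", "Value2")])

# Rows that need a value, then index the matrix by (row of ID, Class)
rows <- which(is.na(dat1$Time1))
dat1$Time1[rows] <- lookup[cbind(match(dat1$ID[rows], dat2$ID), dat1$Class[rows])]
dat1
```

Because each value is found by ID and Class rather than by position, the result does not depend on how the rows of dat1 are sorted.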
Here is a wild ride:
First we pull the values from dat2 as a vector.
Then we interleave an NA after each value so that the vector has as many elements as dat1 has rows, and
finally we use coalesce after cbind:
library(dplyr)
library(tidyr)
vector <- dat2 %>%
pivot_longer(-ID) %>%
arrange(name) %>%
pull(value)
col_x <- c(sapply(vector, c, rep(NA, 1)))
cbind(dat1, col_x) %>%
mutate(col_x = lag(col_x)) %>%
mutate(Time1= coalesce(Time1, col_x), .keep="unused")
ID Rat Garden Class Time1 Time2 Time3
1 1 12 12 0 15 16 20
2 1 13 0 1 6 NA NA
3 2 13 11 0 18 12 16
4 2 9 0 1 5 NA NA
5 1 6 13 0 17 14 14
6 1 7 0 2 7 NA NA
7 2 4 14 0 17 16 12
8 2 3 0 2 4 NA NA
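The interleaving step can be hard to picture, so here is a small base R sketch of the same idea. The vectors vals and time1 are typed in by hand from the question's data, and ifelse() stands in for coalesce() so that no packages are needed.

```r
# Values from dat2, ordered Value1 then Value2 (as after arrange(name))
vals <- c(6, 5, 7, 4)

# rbind() stacks a row of NAs on top of the values; c() then reads the
# 2 x 4 matrix column by column, giving NA, 6, NA, 5, NA, 7, NA, 4
interleaved <- c(rbind(NA, vals))

# Time1 column from dat1
time1 <- c(15, NA, 18, NA, 17, NA, 17, NA)

# Take the interleaved value wherever Time1 is missing
filled <- ifelse(is.na(time1), interleaved, time1)
filled
```

Putting the NA first produces the shifted vector directly, so the separate lag() step is not needed in this variant. Like the original, it only works when the missing rows alternate exactly as in the example.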

rbind dataframes by filling missing rows from the first dataframe

I have 4 datasets from 4 rounds of a survey, with the first round containing 5 variables and the next ones containing only 3. This is because the ID (same sample) and the other two variables (v1 and v2) are fixed over time.
df1 <- data.frame(id = c(1:5), round=1, v1 = c(6:10), v2 = c(11:15), v3=c(16:20))
df2 <- data.frame(id = c(1:5), round=2, v3=c(26:30))
df3 <- data.frame(id = c(1:5), round=3, v3=c(36:40))
df4 <- data.frame(id = c(1:5), round=4, v3=c(46:50))
rbind:
list(df1, df2, df3, df4) %>%
bind_rows(.id = 'grp') %>%
group_by(id)
Now when I rbind them, I end up with missing values for the two fixed variables in rounds 2 to 4:
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 NA NA 26
7 2 2 2 NA NA 27
8 2 3 2 NA NA 28
9 2 4 2 NA NA 29
10 2 5 2 NA NA 30
11 3 1 3 NA NA 36
12 3 2 3 NA NA 37
13 3 3 3 NA NA 38
14 3 4 3 NA NA 39
15 3 5 3 NA NA 40
16 4 1 4 NA NA 46
17 4 2 4 NA NA 47
18 4 3 4 NA NA 48
19 4 4 4 NA NA 49
20 4 5 4 NA NA 50
but I need v1 and v2 to be filled for the next rounds as well by matching the respective ID.
Please let me know if there is any way to do this in R (or in Python).
Thank you.
list(df1, df2, df3, df4) %>%
bind_rows(.id = 'grp') %>%
group_by(id) %>%
fill(v1:v3) # from tidyr
#fill(4:6) # alternative syntax: columns 4-6
#fill(-c(1:3)) # alternative syntax: everything except columns 1:3
#fill(everything()) # alternative syntax: fill NAs in all columns
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 6 11 26
7 2 2 2 7 12 27
8 2 3 2 8 13 28
9 2 4 2 9 14 29
10 2 5 2 10 15 30
11 3 1 3 6 11 36
12 3 2 3 7 12 37
13 3 3 3 8 13 38
14 3 4 3 9 14 39
15 3 5 3 10 15 40
16 4 1 4 6 11 46
17 4 2 4 7 12 47
18 4 3 4 8 13 48
19 4 4 4 9 14 49
20 4 5 4 10 15 50
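fill() copies values downward within each id, so it depends on the round 1 row coming first in every group. A join on id avoids that dependence. The following is a minimal base R sketch of that alternative, using the question's data frames; the intermediate names fixed, later, later_full and all_rounds are hypothetical.

```r
df1 <- data.frame(id = 1:5, round = 1, v1 = 6:10, v2 = 11:15, v3 = 16:20)
df2 <- data.frame(id = 1:5, round = 2, v3 = 26:30)
df3 <- data.frame(id = 1:5, round = 3, v3 = 36:40)
df4 <- data.frame(id = 1:5, round = 4, v3 = 46:50)

# The time-invariant variables, one row per id
fixed <- df1[, c("id", "v1", "v2")]

# Stack the later rounds, then attach v1 and v2 by matching on id
later <- rbind(df2, df3, df4)
later_full <- merge(later, fixed, by = "id")

# Put the columns in the same order as df1 and combine with round 1
all_rounds <- rbind(df1, later_full[, names(df1)])
all_rounds <- all_rounds[order(all_rounds$round, all_rounds$id), ]
rownames(all_rounds) <- NULL
all_rounds
```

With dplyr, left_join(later, fixed, by = "id") would play the same role as merge() here.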

Add rows to dataframe in R based on values in column

I have a dataframe with 2 columns: time and day. There are 3 days, and for each day time runs from 1 to 12. I want to add new rows for each day with times: -2, 1 and 0. How do I do this?
I have tried using add_row and specifying the row number to add to, but the row positions change each time a new row is added, which makes the process tedious. Thanks in advance.
picture of the dataframe
We could use add_row, then slice the desired sequence, and bind everything into one dataframe:
library(tibble)
library(dplyr)
df1 <- df %>%
add_row(time = -2:0, Day = c(1,1,1), .before = 1) %>%
slice(1:15)
df2 <- bind_rows(df1, df1, df1) %>%
mutate(Day = rep(row_number(), each=15, length.out = n()))
Output:
# A tibble: 45 x 2
time Day
<dbl> <int>
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
Here's a fast way to create the desired dataframe from scratch using expand.grid(), rather than adding individual rows:
df <- expand.grid(-2:12,1:3)
colnames(df) <- c("time","day")
Results:
df
time day
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
You can use tidyr::crossing
library(dplyr)
library(tidyr)
add_values <- c(-2, 1, 0)
crossing(time = add_values, Day = unique(day$Day)) %>%
bind_rows(day) %>%
arrange(Day, time)
# A tibble: 45 x 2
# time Day
# <dbl> <int>
# 1 -2 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 2 1
# 6 3 1
# 7 4 1
# 8 5 1
# 9 6 1
#10 7 1
# … with 35 more rows
If you meant -2, -1 and 0 you can also use complete.
tidyr::complete(day, Day, time = -2:0)
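For comparison, the same result can be sketched in base R by building the extra rows once and sorting, which sidesteps the shifting row positions that make add_row tedious. The original data was only shown as a picture, so the day data frame below is a reconstruction from the description, and the sketch assumes the intended times are -2, -1 and 0.

```r
# Reconstructed input (assumption): 3 days, time 1 to 12 on each day
day <- data.frame(time = rep(1:12, times = 3), Day = rep(1:3, each = 12))

# One extra row per combination of new time and existing Day
extra <- expand.grid(time = -2:0, Day = unique(day$Day))

# Append, then sort so each day's rows run from -2 to 12
out <- rbind(extra, day)
out <- out[order(out$Day, out$time), ]
rownames(out) <- NULL
head(out, 15)
```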

Repeat the first two rows for each id two times

I would like to repeat the first two rows for each id two times. I don't know how to do that. Does anyone have a suggestion?
id <- rep(1:4,each=6)
scored <- c(12,13,NA,NA,NA,NA,14,20,NA,NA,NA,NA,23,56,NA,NA,NA,NA, 45,78,NA,NA,NA,NA)
df <- data.frame(id,scored)
df
id scored
1 1 12
2 1 13
3 1 NA
4 1 NA
5 1 NA
6 1 NA
7 2 14
8 2 20
9 2 NA
10 2 NA
11 2 NA
12 2 NA
13 3 23
14 3 56
15 3 NA
16 3 NA
17 3 NA
18 3 NA
19 4 45
20 4 78
21 4 NA
22 4 NA
23 4 NA
24 4 NA
I want it to look like:
df
id score
1 1 12
2 1 13
3 1 12
4 1 13
5 1 12
6 1 13
7 2 14
8 2 20
9 2 14
10 2 20
11 2 14
12 2 20
13 3 23
14 3 56
15 3 23
16 3 56
17 3 23
18 3 56
19 4 45
20 4 78
21 4 45
22 4 78
23 4 45
24 4 78
We can group by 'id' and use rep to recycle the non-NA elements of 'scored' to the length of each group:
library(dplyr)
df %>%
group_by(id) %>%
mutate(scored = rep(scored[!is.na(scored)], length.out = n()))
# A tibble: 24 x 2
# Groups: id [4]
# id scored
# <int> <dbl>
# 1 1 12
# 2 1 13
# 3 1 12
# 4 1 13
# 5 1 12
# 6 1 13
# 7 2 14
# 8 2 20
# 9 2 14
#10 2 20
# … with 14 more rows
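The same recycling idea can be sketched without dplyr. This base R version uses ave() to apply the function within each id and is offered only as an equivalent illustration of the technique above.

```r
id <- rep(1:4, each = 6)
scored <- c(12, 13, NA, NA, NA, NA, 14, 20, NA, NA, NA, NA,
            23, 56, NA, NA, NA, NA, 45, 78, NA, NA, NA, NA)
df <- data.frame(id, scored)

# Within each id, keep the non-NA values and recycle them to the group length
df$scored <- ave(df$scored, df$id,
                 FUN = function(v) rep(v[!is.na(v)], length.out = length(v)))
df
```

Note that this recycles however many non-NA values a group has, so it matches "the first two rows" only when exactly two values per id are observed, as in the example.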

Rolling sum in dplyr

set.seed(123)
df <- data.frame(x = sample(1:10, 20, replace = T), id = rep(1:2, each = 10))
For each id, I want to create a column which has the sum of previous 5 x values.
df %>% group_by(id) %>% mutate(roll.sum = c(x[1:4], zoo::rollapply(x, 5, sum)))
# Groups: id [2]
x id roll.sum
<int> <int> <int>
3 1 3
8 1 8
5 1 5
9 1 9
10 1 10
1 1 36
6 1 39
9 1 40
6 1 41
5 1 37
10 2 10
5 2 5
7 2 7
6 2 6
2 2 2
9 2 39
3 2 32
1 2 28
4 2 25
10 2 29
The 6th row should be 35 (3 + 8 + 5 + 9 + 10), the 7th row should be 33 (8 + 5 + 9 + 10 + 1) and so on.
However, the above function is also including the row itself for calculation. How can I fix it?
library(zoo)
df %>% group_by(id) %>%
mutate(Sum_prev = rollapply(x, list(-(1:5)), sum, fill = NA, align = "right", partial = FALSE))
# you can use rollapply(x, list(1:5), sum, fill = NA, align = "left", partial = FALSE)
# to sum the next 5 elements, skipping the current one
x id Sum_prev
1 3 1 NA
2 8 1 NA
3 5 1 NA
4 9 1 NA
5 10 1 NA
6 1 1 35
7 6 1 33
8 9 1 31
9 6 1 35
10 5 1 32
11 10 2 NA
12 5 2 NA
13 7 2 NA
14 6 2 NA
15 2 2 NA
16 9 2 30
17 3 2 29
18 1 2 27
19 4 2 21
20 10 2 19
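If adding a package is not an option, the sum of the previous 5 values can also be sketched in base R with cumulative sums. The helper name prev_sum is hypothetical, the x values are typed in from the question's printed output rather than regenerated with sample(), and the approach assumes x contains no NAs.

```r
# Hypothetical helper: sum of the k values before each position, NA where
# fewer than k previous values exist
prev_sum <- function(x, k = 5) {
  n <- length(x)
  cs <- c(0, cumsum(x))          # cs[j] is the sum of x[1..(j - 1)]
  out <- rep(NA_real_, n)
  if (n > k) {
    i <- (k + 1):n
    out[i] <- cs[i] - cs[i - k]  # sum of x[(i - k)..(i - 1)]
  }
  out
}

# x values as printed in the question
df <- data.frame(x  = c(3, 8, 5, 9, 10, 1, 6, 9, 6, 5,
                        10, 5, 7, 6, 2, 9, 3, 1, 4, 10),
                 id = rep(1:2, each = 10))

# Apply the helper within each id
df$Sum_prev <- ave(df$x, df$id, FUN = prev_sum)
df
```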
There is the rollify function in the tibbletime package that you could use. You can read about it in this vignette: Rolling calculations in tibbletime.
library(tibbletime)
library(dplyr)
rollig_sum <- rollify(.f = sum, window = 5)
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) #added lag() here
# A tibble: 20 x 3
# Groups: id [2]
# x id roll.sum
# <int> <int> <int>
# 1 3 1 NA
# 2 8 1 NA
# 3 5 1 NA
# 4 9 1 NA
# 5 10 1 NA
# 6 1 1 35
# 7 6 1 33
# 8 9 1 31
# 9 6 1 35
#10 5 1 32
#11 10 2 NA
#12 5 2 NA
#13 7 2 NA
#14 6 2 NA
#15 2 2 NA
#16 9 2 30
#17 3 2 29
#18 1 2 27
#19 4 2 21
#20 10 2 19
If you want the NAs to be some other value, you can use, for example, if_else
df %>%
group_by(id) %>%
mutate(roll.sum = lag(rollig_sum(x))) %>%
mutate(roll.sum = if_else(is.na(roll.sum), x, roll.sum))
