Repeat the first two rows for each id two times - r

I would like to repeat the first two rows for each id two times. I don't know how to do that. Does anyone have a suggestion?
id <- rep(1:4,each=6)
scored <- c(12,13,NA,NA,NA,NA,14,20,NA,NA,NA,NA,23,56,NA,NA,NA,NA, 45,78,NA,NA,NA,NA)
df <- data.frame(id,scored)
df
id scored
1 1 12
2 1 13
3 1 NA
4 1 NA
5 1 NA
6 1 NA
7 2 14
8 2 20
9 2 NA
10 2 NA
11 2 NA
12 2 NA
13 3 23
14 3 56
15 3 NA
16 3 NA
17 3 NA
18 3 NA
19 4 45
20 4 78
21 4 NA
22 4 NA
23 4 NA
24 4 NA
I want it to look like:
df
id scored
1 1 12
2 1 13
3 1 12
4 1 13
5 1 12
6 1 13
7 2 14
8 2 20
9 2 14
10 2 20
11 2 14
12 2 20
13 3 23
14 3 56
15 3 23
16 3 56
17 3 23
18 3 56
19 4 45
20 4 78
21 4 45
22 4 78
23 4 45
24 4 78

We can group by 'id' and rep() the non-NA elements of 'scored' out to the size of each group:
library(dplyr)
df %>%
group_by(id) %>%
mutate(scored = rep(scored[!is.na(scored)], length.out = n()))
# A tibble: 24 x 2
# Groups: id [4]
# id scored
# <int> <dbl>
# 1 1 12
# 2 1 13
# 3 1 12
# 4 1 13
# 5 1 12
# 6 1 13
# 7 2 14
# 8 2 20
# 9 2 14
#10 2 20
# … with 14 more rows
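If you would rather not load dplyr, the same idea can be sketched in base R with ave(). This assumes every id has at least one non-NA score; the anonymous function is just for illustration.

```r
id <- rep(1:4, each = 6)
scored <- c(12, 13, NA, NA, NA, NA, 14, 20, NA, NA, NA, NA,
            23, 56, NA, NA, NA, NA, 45, 78, NA, NA, NA, NA)
df <- data.frame(id, scored)

# Within each id, recycle the non-NA scores to the length of the group.
df$scored <- ave(df$scored, df$id,
                 FUN = function(x) rep(x[!is.na(x)], length.out = length(x)))
df
```

ave() splits 'scored' by 'id', applies the function to each piece, and puts the results back in the original row order.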

Related

filter() rows from dataframe with condition on previous and next row, keeping NA values

I have a dataframe like this:
AA<-c(1,2,4,5,6,7,10,11,12,13,14,15)
BB<-c(32,21,21,NA,27,31,31,12,28,NA,48,7)
df<- data.frame(AA,BB)
I want to remove rows where the BB value is equal to both the previous and the next row, so that only the first and last occurrences of each run of BB values are kept. I also want to keep NA rows. I arrived at this code, which is not far from what I want:
lighten_df <- df %>% filter(BB!=lag(BB) | BB!=lead(BB) | is.na(BB) )
which gives me:
> lighten_df
AA BB
1 1 32
2 2 21
3 5 NA
4 6 27
5 7 31
6 10 31
7 11 12
8 12 28
9 13 NA
10 14 48
11 15 7
My problem is that I would like to keep first and last 21 value for col BB. That's the result I expect:
AA BB
1 1 32
2 2 21
3 4 21
4 5 NA
5 6 27
6 7 31
7 10 31
8 11 12
9 12 28
10 13 NA
11 14 48
12 15 7
Any Idea?
I would suggest a different approach: define a grouping variable and keep the first and last rows within each group:
df %>%
group_by(grp = data.table::rleid(BB)) %>%
slice(unique(c(1, n())))
# # A tibble: 12 × 3
# # Groups: grp [10]
# AA BB grp
# <dbl> <dbl> <int>
# 1 1 32 1
# 2 2 21 2
# 3 4 21 2
# 4 5 NA 3
# 5 6 27 4
# 6 7 31 5
# 7 10 31 5
# 8 11 12 6
# 9 12 28 7
# 10 13 NA 8
# 11 14 48 9
# 12 15 7 10
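For reference, a rough base R sketch of the same "first and last row of each run" idea, without data.table. The helper name first_last_of_runs is made up for this example. Note that rle() treats every NA as its own run, so NA rows are always kept here.

```r
# Hypothetical helper: TRUE for the first and last element of each run in x.
first_last_of_runs <- function(x) {
  r <- rle(x)
  # Run id for every element: 1, 2, 2, 3, ...
  grp <- rep(seq_along(r$lengths), r$lengths)
  # First occurrence of a run id, or last occurrence of a run id.
  !duplicated(grp) | !duplicated(grp, fromLast = TRUE)
}

AA <- c(1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15)
BB <- c(32, 21, 21, NA, 27, 31, 31, 12, 28, NA, 48, 7)
df <- data.frame(AA, BB)

res <- df[first_last_of_runs(df$BB), ]
res
```

With this particular data no run is longer than two, so all 12 rows survive; a run such as 5, 5, 5, 5 would be reduced to its first and last row.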

rbind dataframes by filling missing rows from the first dataframe

I have 4 datasets from 4 rounds of a survey, with the first round containing 5 variables and the next ones containing only 3. This is because the ID (same sample) and the other two variables (v1 and v2) are fixed over time.
df1 <- data.frame(id = c(1:5), round=1, v1 = c(6:10), v2 = c(11:15), v3=c(16:20))
df2 <- data.frame(id = c(1:5), round=2, v3=c(26:30))
df3 <- data.frame(id = c(1:5), round=3, v3=c(36:40))
df4 <- data.frame(id = c(1:5), round=4, v3=c(46:50))
rbind:
list(df1, df2, df3, df4) %>%
bind_rows(.id = 'grp') %>%
group_by(id)
Now when I rbind them, I end up with missing values for the two fixed variables in rounds 2 to 4:
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 NA NA 26
7 2 2 2 NA NA 27
8 2 3 2 NA NA 28
9 2 4 2 NA NA 29
10 2 5 2 NA NA 30
11 3 1 3 NA NA 36
12 3 2 3 NA NA 37
13 3 3 3 NA NA 38
14 3 4 3 NA NA 39
15 3 5 3 NA NA 40
16 4 1 4 NA NA 46
17 4 2 4 NA NA 47
18 4 3 4 NA NA 48
19 4 4 4 NA NA 49
20 4 5 4 NA NA 50
but I need v1 and v2 to be filled for the next rounds as well by matching the respective ID.
Please let me know if there is any way to do this in R (or in Python).
Thank you.
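Bind the rounds together, then group by id and use fill() from tidyr to carry v1 and v2 down into the later rounds:
library(dplyr)
library(tidyr)
list(df1, df2, df3, df4) %>%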
bind_rows(.id = 'grp') %>%
group_by(id) %>%
fill(v1:v3) # from tidyr
#fill(4:6) # alternative syntax: columns 4-6
#fill(-c(1:3)) # alternative syntax: everything except columns 1:3
#fill(everything()) # alternative syntax: fill NAs in all columns
grp id round v1 v2 v3
<chr> <int> <dbl> <int> <int> <int>
1 1 1 1 6 11 16
2 1 2 1 7 12 17
3 1 3 1 8 13 18
4 1 4 1 9 14 19
5 1 5 1 10 15 20
6 2 1 2 6 11 26
7 2 2 2 7 12 27
8 2 3 2 8 13 28
9 2 4 2 9 14 29
10 2 5 2 10 15 30
11 3 1 3 6 11 36
12 3 2 3 7 12 37
13 3 3 3 8 13 38
14 3 4 3 9 14 39
15 3 5 3 10 15 40
16 4 1 4 6 11 46
17 4 2 4 7 12 47
18 4 3 4 8 13 48
19 4 4 4 9 14 49
20 4 5 4 10 15 50
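Another way to look at it, sketched here in base R: since v1 and v2 only exist in round 1, you can treat them as a lookup table keyed by id and merge them onto the stacked rounds. The names fixed and varying are only for this example.

```r
df1 <- data.frame(id = 1:5, round = 1, v1 = 6:10, v2 = 11:15, v3 = 16:20)
df2 <- data.frame(id = 1:5, round = 2, v3 = 26:30)
df3 <- data.frame(id = 1:5, round = 3, v3 = 36:40)
df4 <- data.frame(id = 1:5, round = 4, v3 = 46:50)

# Time-invariant variables, one row per id.
fixed <- df1[, c("id", "v1", "v2")]
# Round-specific variables, stacked over all four rounds.
varying <- rbind(df1[, c("id", "round", "v3")], df2, df3, df4)

# Attach the fixed variables to every round by matching on id.
res <- merge(varying, fixed, by = "id")
res <- res[order(res$round, res$id), c("id", "round", "v1", "v2", "v3")]
res
```

Unlike fill(), this does not depend on the rows being sorted so that round 1 comes first within each id.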

Add rows to dataframe in R based on values in column

I have a dataframe with 2 columns: time and day. There are 3 days, and for each day time runs from 1 to 12. I want to add new rows for each day with times: -2, 1 and 0. How do I do this?
I have tried using add_row and specifying the row number to add to, but the row number changes each time a new row is added, which makes the process tedious. Thanks in advance.
We could use add_row, then slice the desired sequence and bind everything into one dataframe:
library(tibble)
library(dplyr)
df1 <- df %>%
add_row(time = -2:0, Day = c(1,1,1), .before = 1) %>%
slice(1:15)
df2 <- bind_rows(df1, df1, df1) %>%
mutate(Day = rep(row_number(), each=15, length.out = n()))
Output:
# A tibble: 45 x 2
time Day
<dbl> <int>
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
Here's a fast way to create the desired dataframe from scratch using expand.grid(), rather than adding individual rows:
df <- expand.grid(-2:12,1:3)
colnames(df) <- c("time","day")
Results:
df
time day
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
You can use tidyr::crossing
library(dplyr)
library(tidyr)
add_values <- c(-2, 1, 0)
crossing(time = add_values, Day = unique(day$Day)) %>%
bind_rows(day) %>%
arrange(Day, time)
# A tibble: 45 x 2
# time Day
# <dbl> <int>
# 1 -2 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 2 1
# 6 3 1
# 7 4 1
# 8 5 1
# 9 6 1
#10 7 1
# … with 35 more rows
If you meant -2, -1 and 0 you can also use complete.
tidyr::complete(day, Day, time = -2:0)
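If the data already exists and you only want to append the extra times, a base R sketch could look like the following. It assumes the intended times are -2, -1 and 0, and it rebuilds a stand-in for the original dataframe (called day here, as in the answer above) because the question only showed it as a picture.

```r
# Stand-in for the original data: 3 days, times 1 to 12 (assumed layout).
day <- data.frame(time = rep(1:12, times = 3), Day = rep(1:3, each = 12))

# One row per extra time per day.
extra <- expand.grid(time = -2:0, Day = unique(day$Day))

# Stack and sort so the extra times come first within each day.
res <- rbind(extra, day)
res <- res[order(res$Day, res$time), ]
rownames(res) <- NULL
res
```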

Completing a sequence of integers by group with tidyverse in R

Given a dataset which contains a grouping variable and a column of integers that is incomplete (contains NAs), where the first and last integer vary by group (and could themselves be NA) and the length of each group varies, how might one fill in the NA values by completing the sequence?
The following dataset may be used as an example:
library(dplyr)
set.seed(5112021)
dat1 <- bind_rows(data.frame(Group=1,Seq=(3:20)),
data.frame(Group=2,Seq=(-1:25))) %>%
mutate(rn = rnorm(45,mean=0.5,sd=1),
Seq = ifelse(rn < 0.4,NA,Seq)) %>%
select(-rn) %>%
group_by(Group) %>%
mutate(Seq = ifelse(Seq==-1,NA,Seq))
dat1
Group Seq
1 1 NA
2 1 NA
3 1 NA
4 1 6
5 1 7
6 1 8
7 1 NA
8 1 10
9 1 11
10 1 NA
11 1 13
12 1 NA
13 1 15
14 1 NA
15 1 NA
16 1 NA
17 1 NA
18 1 20
19 2 NA
20 2 0
21 2 NA
22 2 2
23 2 3
24 2 NA
25 2 5
26 2 6
27 2 7
28 2 8
29 2 NA
30 2 10
31 2 NA
32 2 12
33 2 NA
34 2 NA
35 2 NA
36 2 16
37 2 17
38 2 NA
39 2 NA
40 2 NA
41 2 NA
42 2 22
43 2 NA
44 2 NA
45 2 NA
One way to do this is to use row numbers by group (since they are a sequence of consecutive integers): calculate the difference between the non-missing values and the row number, which is constant within a group, and then add that difference back to the row number.
For example:
dat2 <- dat1 %>%
group_by(Group) %>%
mutate(rn = row_number(),
diff = mean(Seq-rn,na.rm=TRUE)) %>%
mutate(New_Seq = rn+diff) %>%
select(-rn,-diff)
dat2
Group Seq New_Seq
1 1 NA 3
2 1 NA 4
3 1 NA 5
4 1 6 6
5 1 7 7
6 1 8 8
7 1 NA 9
8 1 10 10
9 1 11 11
10 1 NA 12
11 1 13 13
12 1 NA 14
13 1 15 15
14 1 NA 16
15 1 NA 17
16 1 NA 18
17 1 NA 19
18 1 20 20
19 2 NA -1
20 2 0 0
21 2 NA 1
22 2 2 2
23 2 3 3
24 2 NA 4
25 2 5 5
26 2 6 6
27 2 7 7
28 2 8 8
29 2 NA 9
30 2 10 10
31 2 NA 11
32 2 12 12
33 2 NA 13
34 2 NA 14
35 2 NA 15
36 2 16 16
37 2 17 17
38 2 NA 18
39 2 NA 19
40 2 NA 20
41 2 NA 21
42 2 22 22
43 2 NA 23
44 2 NA 24
45 2 NA 25
While this works, it doesn't seem very elegant and may be slow for very large datasets with many grouping variables. I'm curious if there is a more 'Tidyverse' way to do this.
You could do something like:
dat1 %>%
group_by(Group) %>%
mutate(newseq = seq_along(Group) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq)))) - 1) %>%
ungroup()
Or
dat1 %>%
group_by(Group) %>%
mutate(newseq = seq(first(na.omit(Seq)) - sum(cumall(is.na(Seq))), length.out = n())) %>%
ungroup()
Or
dat1 %>%
group_by(Group) %>%
mutate(newseq = 0:(n() - 1) + (first(na.omit(Seq)) - sum(cumall(is.na(Seq))))) %>%
ungroup()
All these do the same thing: shift the start of the sequence by the difference of the first non-NA value and the number of NAs before it.
Output
Group Seq newseq
<int> <int> <dbl>
1 1 NA 3
2 1 NA 4
3 1 NA 5
4 1 6 6
5 1 7 7
6 1 8 8
7 1 NA 9
8 1 10 10
9 1 11 11
10 1 NA 12
# ... with 35 more rows
First create a row number within each group, then take the maximum difference between Seq and the row number and add it back to the row number:
dat1 %>%
group_by(Group) %>%
mutate(rn = row_number(),
Seq = rn + max(Seq - rn, na.rm = TRUE)) %>%
ungroup() %>%
select(-rn)
Output:
Group Seq
<dbl> <int>
1 1 3
2 1 4
3 1 5
4 1 6
5 1 7
6 1 8
7 1 9
8 1 10
9 1 11
10 1 12
11 1 13
12 1 14
13 1 15
14 1 16
15 1 17
16 1 18
17 1 19
18 1 20
19 2 -1
20 2 0
21 2 1
22 2 2
23 2 3
24 2 4
25 2 5
26 2 6
27 2 7
28 2 8
29 2 9
30 2 10
31 2 11
32 2 12
33 2 13
34 2 14
35 2 15
36 2 16
37 2 17
38 2 18
39 2 19
40 2 20
# … with 5 more rows
data:
set.seed(5112021)
dat1 <- bind_rows(data.frame(Group=1,Seq=(3:20)),
data.frame(Group=2,Seq=(-1:25))) %>%
mutate(rn = rnorm(45,mean=0.5,sd=1),
Seq = ifelse(rn < 0.4,NA,Seq)) %>%
select(-rn) %>%
group_by(Group) %>%
mutate(Seq = ifelse(Seq==-1,NA,Seq))
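The row-number offset used in the answers above can also be written in base R. This is only a sketch on a small made-up dataset; it assumes each group has at least one non-NA value and that the true sequence increases by 1 per row. The helper name complete_seq is hypothetical.

```r
# Small made-up example: two groups with gaps at the start, middle and end.
dat <- data.frame(Group = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                  Seq   = c(NA, 4, NA, 6, NA, NA, 1, 2, NA))

# Hypothetical helper: rebuild a consecutive sequence from any known value.
complete_seq <- function(x) {
  rn <- seq_along(x)
  # Offset between a known value and its row number; constant within a group.
  offset <- (x - rn)[!is.na(x)][1]
  rn + offset
}

dat$New_Seq <- ave(dat$Seq, dat$Group, FUN = complete_seq)
dat
```

Group 1 becomes 3, 4, 5, 6 and group 2 becomes -1, 0, 1, 2, 3.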

R - Index position with condition

I have data like this:
w<-c(0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0)
I would like an index position that starts at each value of 1 and restarts at the next 1.
Desired output: NA,NA,NA,NA,NA,1,2,3,4,5,6,7,1,2,3,4,5,1,2,3,4,5,6,7,8,9
Ideally the solution should be applicable to a data frame.
Thanks
Edit: w is a data frame, and what I have so far is roughly this:
m<-as.data.frame(w)
m[m!=1] <- row(m)[m!=1]
m
w
1 1
2 2
3 3
4 4
5 5
6 1
7 7
8 8
9 9
10 10
11 11
12 12
13 1
14 14
15 15
16 16
17 17
18 1
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
but I want the count to return to 1 whenever the value is 1:
> m
w wanted
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 1 1
7 7 2
8 8 3
9 9 4
10 10 5
11 11 6
12 12 7
13 1 1
14 14 2
15 15 3
16 16 4
17 17 5
18 1 1
19 19 2
20 20 3
21 21 4
22 22 5
23 23 6
24 24 7
25 25 8
26 26 9
Thanks
This assumes that the data is ordered in the way shown in the example.
m$wanted <- with(m, ave(w, cumsum(c(TRUE,diff(w) <0)), FUN=seq_along))
m$wanted
#[1] 1 2 3 4 5 1 2 3 4 5 6 7 1 2 3 4 5 1 2 3 4 5 6 7 8 9
For the given data including repeated 1's and non-sequential input, the following works:
m[9,1] <- 100
m[3,1] <- 55
m[14,1] <- 60
m[25,1] <- 1
m[19,1] <- 1
m$result <- 1:nrow(m) - which(m$w == 1)[cumsum(m$w == 1)] + 1
But if the data does not start with 1:
m[1,1] <- 2
Then this works:
firstone <- which(m$w == 1)[1]
subindex <- m[firstone:nrow(m),'w'] == 1
m$result <- c(rep(NA,firstone-1),1:length(subindex) - which(subindex)[cumsum(subindex)] + 1)
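Working directly from the original 0/1 vector in the question, a compact sketch is to use cumsum(w == 1) as a group id and number the elements within each group. Elements before the first 1 fall in group 0 and are set to NA. The variable names grp and wanted are just for this example.

```r
w <- c(0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0)

# Group id: increases by one at every 1; 0 before the first 1.
grp <- cumsum(w == 1)

# Number the positions within each group, restarting at 1.
wanted <- ave(seq_along(w), grp, FUN = seq_along)

# Positions before the first 1 have no index.
wanted[grp == 0] <- NA
wanted
```

For a data frame, the same three steps can be applied to the column, e.g. with m$w in place of w.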
