R: Add rows to each group so each group has the same number of rows, and specify the other variable

This is my df:
library(tibble)
df <- tibble(week = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
             session = c(1, 2, 1, 2, 1, 2, 3, 1, 2, 3, 4),
             work = rep("done", 11))
df
# A tibble: 11 x 3
week session work
<dbl> <dbl> <chr>
1 1 1 done
2 1 2 done
3 2 1 done
4 2 2 done
5 3 1 done
6 3 2 done
7 3 3 done
8 4 1 done
9 4 2 done
10 4 3 done
11 4 4 done
For each week there should be 4 rows with session 1 to 4.
How can I add the "missing" session rows (with the other variables set to NA) so the df becomes:
df1 <- tibble(week = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
              session = rep(1:4, 4),
              work = c("done", "done", NA, NA, "done", "done", NA, NA,
                       "done", "done", "done", NA, rep("done", 4)))
df1
week session work
<dbl> <int> <chr>
1 1 1 done
2 1 2 done
3 1 3 NA
4 1 4 NA
5 2 1 done
6 2 2 done
7 2 3 NA
8 2 4 NA
9 3 1 done
10 3 2 done
11 3 3 done
12 3 4 NA
13 4 1 done
14 4 2 done
15 4 3 done
16 4 4 done

tidyr's complete() does exactly this: it expands the data to every combination of the week and session values found in the data, filling the remaining columns with NA.
tidyr::complete(df, week, session)
# A tibble: 16 x 3
week session work
<dbl> <dbl> <chr>
1 1 1 done
2 1 2 done
3 1 3 NA
4 1 4 NA
5 2 1 done
6 2 2 done
7 2 3 NA
8 2 4 NA
9 3 1 done
10 3 2 done
11 3 3 done
12 3 4 NA
13 4 1 done
14 4 2 done
15 4 3 done
16 4 4 done
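If some session values never occur anywhere in the data, complete() cannot infer them from the observed values; a sketch of the fix is to supply the desired range explicitly, using complete()'s support for named value vectors:
tidyr::complete(df, week, session = 1:4)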

Here's a data.table solution in case speed is important:
# load package
library(data.table)
# set as data table
setDT(df)
# cross join to get all combinations
week <- 1:4
session <- 1:4
z <- CJ(week, session)
# join df onto the complete grid (rows missing from df become NA)
df_1 <- df[z, on = .(week, session)]
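If the week and session ranges are not known in advance, one variant (a sketch) builds the cross join from the observed values instead of hardcoding 1:4:
z <- CJ(week = unique(df$week), session = unique(df$session))
df_1 <- df[z, on = .(week, session)]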

Related

Creating an indexed column in R, grouped by user_id, that does not increase when NA

I want to create a column (in R) that indexes the presence of a number in another column, grouped by a user_id column. When the other column is NA, the new column should not increase.
An example should bring clarity.
I have this df:
data <- data.frame(user_id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                   one = c(1, NA, 3, 2, NA, 0, NA, 4, 3, 4, NA))
   user_id one
1        1   1
2        1  NA
3        1   3
4        2   2
5        2  NA
6        2   0
7        2  NA
8        3   4
9        3   3
10       3   4
11       3  NA
I want to make a new column looking like "desired" in the following df:
> cbind(data, data.frame(desired = c(1,1,2,1,1,2,2,1,2,3,3)))
   user_id one desired
1        1   1       1
2        1  NA       1
3        1   3       2
4        2   2       1
5        2  NA       1
6        2   0       2
7        2  NA       2
8        3   4       1
9        3   3       2
10       3   4       3
11       3  NA       3
How can I solve this?
Using cumsum and group_by gets me close, but the count does not start over from 1 when the user_id changes...
> data %>% group_by(user_id) %>% mutate(desired = cumsum(!is.na(one)))
   user_id   one desired
     <dbl> <dbl>   <int>
 1       1     1       1
 2       1    NA       1
 3       1     3       2
 4       2     2       3
 5       2    NA       3
 6       2     0       4
 7       2    NA       4
 8       3     4       5
 9       3     3       6
10       3     4       7
11       3    NA       7
Given the sample data you provided (with the one column), your grouped code works unchanged; if the counts fail to restart per group, a likely culprit is another package (e.g. plyr) masking dplyr::mutate. Base R and dplyr versions are shown below for demonstration.
base R
data$out <- ave(data$one, data$user_id, FUN = function(z) cumsum(!is.na(z)))
data
# user_id one out
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
dplyr
library(dplyr)
data %>%
  group_by(user_id) %>%
  mutate(out = cumsum(!is.na(one))) %>%
  ungroup()
# # A tibble: 11 × 3
# user_id one out
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
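For completeness, a data.table equivalent is a one-liner (a sketch assuming the same data frame and one column as above):
library(data.table)
setDT(data)[, out := cumsum(!is.na(one)), by = user_id]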

Expanding Data Frame with cumsum in R

I've got a data frame with historic F1 data that looks like this:
Driver    Race number   Position   Number of Career Podiums
Farina    1             1          1
Fagioli   1             2          1
Parnell   1             3          1
Fangio    2             1          1
Ascari    2             2          1
Chiron    2             3          1
...       ...           ...        ...
Moss      47            1          4
Fangio    47            2          23
Kling     47            3          2
Now I want to extend it so that every race contains not only the top 3 of that specific race but also everyone who has had a top-3 finish before, so I can create a racing bar chart. The final data frame should look like this:
Driver    Race number   Position   Number of Career Podiums
Farina    1             1          1
Fagioli   1             2          1
Parnell   1             3          1
Fangio    2             1          1
Ascari    2             2          1
Chiron    2             3          1
Farina    2             NA         1
Fagioli   2             NA         1
Parnell   2             NA         1
Parsons   3             1          1
Holland   3             2          1
Rose      3             3          1
Farina    3             NA         1
Fagioli   3             NA         1
Parnell   3             NA         1
Fangio    3             NA         1
Ascari    3             NA         1
Chiron    3             NA         1
Is there any easy way to do this? I couldn't find anyone with a similar problem on Google.
If I understand your problem correctly, you only have observations for the top-3 drivers of every race, but you want observations for every driver that has ever achieved a top-3 position in your dataset, across all races.
For example, in the following dataset driver D only has an observation for the second race, where they finished first, but not for the other races:
dat <- data.frame(driver = c("A", "B", "C", "D", "A", "B", "B", "A", "C"),
                  race_number = rep(1:3, each = 3),
                  position = rep(1:3, 3),
                  previous_podium_positions = c(1, 1, 1, 1, 2, 2, 3, 3, 2))
print(dat)
  driver race_number position previous_podium_positions
1      A           1        1                         1
2      B           1        2                         1
3      C           1        3                         1
4      D           2        1                         1
5      A           2        2                         2
6      B           2        3                         2
7      B           3        1                         3
8      A           3        2                         3
9      C           3        3                         2
To add entries for driver D for races 1 and 3, you could use tidyr's expand() function; in base R you could achieve the same with expand.grid() and unique(). Either way you get a data frame containing all possible combinations of driver and race number. Afterwards you simply left- or right-join the result with the initial data frame.
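For reference, a minimal base R sketch of the expand.grid() route just mentioned; merge() with all.x = TRUE plays the role of the left join:
full <- expand.grid(driver = unique(dat$driver),
                    race_number = unique(dat$race_number))
merge(full, dat, by = c("driver", "race_number"), all.x = TRUE)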
A solution using standard tidyverse packages tidyr and dplyr could look like this:
library(dplyr)
library(tidyr)
dat %>%
  expand(driver, race_number) %>%
  left_join(dat)
# A tibble: 12 x 4
driver race_number position previous_podium_positions
<chr> <int> <int> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 2 3
4 B 1 2 1
5 B 2 3 2
6 B 3 1 3
7 C 1 3 1
8 C 2 NA NA
9 C 3 3 2
10 D 1 NA NA
11 D 2 1 1
12 D 3 NA NA
Note that the "new" observations will naturally have NAs for the position and the number of previous podium positions. The latter can be filled in easily via the following approach, which counts each driver's previous podium finishes with a grouped cumulative sum:
dat %>%
  expand(driver, race_number) %>%
  left_join(dat) %>%
  arrange(race_number) %>%
  mutate(previous_podium_positions = ifelse(is.na(previous_podium_positions), 0, 1)) %>%
  group_by(driver) %>%
  mutate(previous_podium_positions = cumsum(previous_podium_positions))
Joining, by = c("driver", "race_number")
# A tibble: 12 x 4
# Groups: driver [4]
driver race_number position previous_podium_positions
<chr> <int> <int> <dbl>
1 A 1 1 1
2 B 1 2 1
3 C 1 3 1
4 D 1 NA 0
5 A 2 2 2
6 B 2 3 2
7 C 2 NA 1
8 D 2 1 1
9 A 3 2 3
10 B 3 1 3
11 C 3 3 2
12 D 3 NA 1
I hope this helps. A brief disclaimer: these may well not be the most resource- or time-efficient solutions, but they are the quickest/easiest way to solve the issue.

dplyr solution to split dataset, but keep IDs in same splits

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine id in the "test" data below is an ID variable, and the dataset has many other columns.
test <- data.frame(id = c(1, 2, 3, 4, 4, 4, 4, 4, 6, 7, 8, 9, 9, 9, 9, 10),
                   val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Here ID=4 would end up in chunks 1 and 2. I am wondering how to code this so that all rows with ID=4 end up in the same chunk (it doesn't matter which one). I looked at the split function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.
You can add a chunk column with mutate() and ntile(). Note, though, that applied directly to id this can still split an id across chunks (see id 4 below):
library(dplyr)
test <- tibble(id = c(1, 2, 3, 4, 4, 4, 4, 4, 6, 7, 8, 9, 9, 9, 9, 10),
               value = 1:16)
out <- test %>%
  mutate(new_column = ntile(id, 3))
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or, given Frank's comment, you could run ntile() on the distinct/unique values of the id, then join the original table back on id, so each id lands in exactly one chunk:
test <- tibble(id = c(1, 2, 3, 4, 4, 4, 4, 4, 6, 7, 8, 9, 9, 9, 9, 10),
               value = 1:16)
out <- test %>%
  distinct(id) %>%
  mutate(new_column = ntile(id, 3)) %>%
  right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16
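To work with one chunk at a time, as the question asks, you can then filter on the new column (a sketch using the joined result above); split(out, out$new_column) would give all chunks at once if ever needed:
chunk1 <- out %>% filter(new_column == 1)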

Shifting rows up within columns and flushing the remaining rows

I have a problem with moving rows up by one row within each group. When rows become completely NA I would like to flush those rows. My current approach, however, still keeps the second row of each group.
Here is my approach
data <- data.frame(gr = rep(1:3, each = 2),
                   A = c(1, NA, 2, NA, 4, NA),
                   B = c(NA, 1, NA, 3, NA, 7),
                   C = c(1, NA, 4, NA, 5, NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
So, using this approach:
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As you can see, I still end up with six rows (the shorter columns just get recycled) instead of one row per group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per column within each gr, you can use na.omit and take the first value from it:
library(dplyr)
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# the [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
You can do it with dplyr and tidyr like this:
library(dplyr)
library(tidyr)
data$ind <- rep(c(1, 2), length.out = nrow(data))
data %>% fill(A, B, C) %>% filter(ind == 2) %>% mutate(ind = NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
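If the alternating pattern is less regular, a more robust sketch (same packages, starting from the original data frame before the ind helper column) fills within each group in both directions and then keeps one row per group:
data %>%
  group_by(gr) %>%
  fill(A, B, C, .direction = "downup") %>%
  slice(n()) %>%
  ungroup()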
One more solution using data.table:
data <- data.frame(gr = rep(1:3, each = 2),
                   A = c(1, NA, 2, NA, 4, NA),
                   B = c(NA, 1, NA, 3, NA, 7),
                   C = c(1, NA, 4, NA, 5, NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5
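The three na.locf() calls can also be collapsed into one step over .SD. A sketch, assuming as above at most one non-NA value per column within each gr, starting again from the freshly defined data frame:
setDT(data)[, lapply(.SD, function(x) na.omit(x)[1]), by = gr]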

Calculate each chunk by group using dplyr?

How can I get the expected calculation using the dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df <- read.table(header = TRUE, text = '
row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate this for each chunk (rows 1-3, 4-5, 6-7), even though rows 1-3 and 6-7 are labelled with the same group number?
Here is one approach: create a new grouping variable using cumsum. Whenever the difference between two consecutive values of group is not 0, a new group number is assigned. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
  group_by(foo) %>%
  mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
r <- rle(df$group)
df$aux <- rep(seq_along(r$values), times = r$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table. Version 1.9.5 (then in development) introduced the functions rleid() and shift() (whose default type is "lag" with fill = NA), which are useful for this:
library(data.table)
setDT(df)[, expected := value - shift(value), by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
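With current dplyr (1.1.0 or later), consecutive_id() builds the run-length group id directly, so no helper column is needed; a sketch:
library(dplyr)
df %>%
  group_by(aux = consecutive_id(group)) %>%
  mutate(expected = value - lag(value)) %>%
  ungroup()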
