Expanding Data Frame with cumsum in R

I've got a data frame with historic F1 data that looks like this:
Driver   Race number  Position  Number of Career Podiums
Farina   1            1         1
Fagioli  1            2         1
Parnell  1            3         1
Fangio   2            1         1
Ascari   2            2         1
Chiron   2            3         1
...      ...          ...       ...
Moss     47           1         4
Fangio   47           2         23
Kling    47           3         2
Now I want to extend it so that for every race the data frame contains not only the top 3 of that specific race but also every driver who has finished in the top 3 before, so I can create a racing bar chart. The final data frame should look like this:
Driver   Race number  Position  Number of Career Podiums
Farina   1            1         1
Fagioli  1            2         1
Parnell  1            3         1
Fangio   2            1         1
Ascari   2            2         1
Chiron   2            3         1
Farina   2            NA        1
Fagioli  2            NA        1
Parnell  2            NA        1
Parsons  3            1         1
Holland  3            2         1
Rose     3            3         1
Farina   3            NA        1
Fagioli  3            NA        1
Parnell  3            NA        1
Fangio   3            NA        1
Ascari   3            NA        1
Chiron   3            NA        1
Is there an easy way to do this? I couldn't find anyone with a similar problem on Google.

If I understand your problem correctly, you only have observations for the top-3 drivers of every race, but you want observations for every driver who has ever achieved a top-3 position in your dataset, across all races.
For example, in the following dataset driver D only has an observation for the second race, where they took first place, but none for the other races:
dat <- data.frame(driver = c("A", "B", "C", "D", "A", "B", "B", "A", "C"),
                  race_number = rep(1:3, each = 3),
                  position = rep(1:3, 3),
                  # career podium count at the time of each race; this column
                  # is referenced (and shown) in the outputs below
                  previous_podium_positions = c(1, 1, 1, 1, 2, 2, 3, 3, 2))
print(dat)
  driver race_number position previous_podium_positions
1      A           1        1                         1
2      B           1        2                         1
3      C           1        3                         1
4      D           2        1                         1
5      A           2        2                         2
6      B           2        3                         2
7      B           3        1                         3
8      A           3        2                         3
9      C           3        3                         2
To add entries for driver D for races 1 and 3 you could use tidyr's expand() function, or, if you want to stay in base R, you could achieve the same with expand.grid() and unique(). Either way you end up with a data frame containing all possible combinations of drivers and race numbers. Afterwards you simply left- or right-join the result with the initial data frame, as sketched below.
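For the base R route just mentioned, a minimal sketch using expand.grid() and merge() for the join:
# all possible driver/race combinations
all_combos <- expand.grid(driver = unique(dat$driver),
                          race_number = unique(dat$race_number),
                          stringsAsFactors = FALSE)
# left join the combinations back onto the original observations
merge(all_combos, dat, by = c("driver", "race_number"), all.x = TRUE)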
A solution using the standard tidyverse packages tidyr and dplyr could look like this:
library(dplyr)
library(tidyr)

dat %>%
  expand(driver, race_number) %>%
  left_join(dat)

# A tibble: 12 x 4
   driver race_number position previous_podium_positions
   <chr>        <int>    <int>                     <dbl>
 1 A                1        1                         1
 2 A                2        2                         2
 3 A                3        2                         3
 4 B                1        2                         1
 5 B                2        3                         2
 6 B                3        1                         3
 7 C                1        3                         1
 8 C                2       NA                        NA
 9 C                3        3                         2
10 D                1       NA                        NA
11 D                2        1                         1
12 D                3       NA                        NA
Note that the "new" observations will naturally have NAs for the position and the number of previous podium positions. The latter can be filled in easily with the following approach, which counts each driver's previous podium finishes with cumsum():
dat %>%
  expand(driver, race_number) %>%
  left_join(dat) %>%
  arrange(race_number) %>%
  mutate(previous_podium_positions = ifelse(is.na(previous_podium_positions), 0, 1)) %>%
  group_by(driver) %>%
  mutate(previous_podium_positions = cumsum(previous_podium_positions))

Joining, by = c("driver", "race_number")
# A tibble: 12 x 4
# Groups:   driver [4]
   driver race_number position previous_podium_positions
   <chr>        <int>    <int>                     <dbl>
 1 A                1        1                         1
 2 B                1        2                         1
 3 C                1        3                         1
 4 D                1       NA                         0
 5 A                2        2                         2
 6 B                2        3                         2
 7 C                2       NA                         1
 8 D                2        1                         1
 9 A                3        2                         3
10 B                3        1                         3
11 C                3        3                         2
12 D                3       NA                         1
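Applied to the F1 data from the question, the same recipe might look like the sketch below. The data frame name f1 is an assumption, and the backticked column names are taken from the question's table:
library(dplyr)
library(tidyr)

# expand to every driver/race combination, join the podium results back on,
# then recount career podiums cumulatively per driver
f1 %>%
  expand(Driver, `Race number`) %>%
  left_join(f1, by = c("Driver", "Race number")) %>%
  arrange(`Race number`) %>%
  group_by(Driver) %>%
  mutate(`Number of Career Podiums` = cumsum(!is.na(Position)))
As with the toy data, this also produces zero-count rows for races before a driver's first podium; filter those out afterwards if they are not wanted in the bar chart race.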
I hope this helps. A brief disclaimer: these may well not be the most resource- or time-efficient solutions, but they are the quickest ones to write.

Related

Creating an indexed column in R, grouped by user_id, that does not increase on NA

I want to create a column (in R) that indexes the presence of a number in another column, grouped by a user_id column. When the other column is NA, the new column should not increase.
The example below should bring clarity.
I have this df:
data <- data.frame(user_id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                   one = c(1, NA, 3, 2, NA, 0, NA, 4, 3, 4, NA))
   user_id tobeindexed
1        1           1
2        1          NA
3        1           3
4        2           2
5        2          NA
6        2           0
7        2          NA
8        3           4
9        3           3
10       3           4
11       3          NA
I want to make a new column looking like "desired" in the following df:
> cbind(data, data.frame(desired = c(1,1,2,1,1,2,2,1,2,3,3)))
   user_id tobeindexed desired
1        1           1       1
2        1          NA       1
3        1           3       2
4        2           2       1
5        2          NA       1
6        2           0       2
7        2          NA       2
8        3           4       1
9        3           3       2
10       3           4       3
11       3          NA       3
How can I solve this?
Using cumsum and group_by gets me close, but the count does not start over from 1 when the user_id changes...
> data %>% group_by(user_id) %>% mutate(desired = cumsum(!is.na(tobeindexed)))
   user_id tobeindexed desired
     <dbl>       <dbl>   <int>
 1       1           1       1
 2       1          NA       1
 3       1           3       2
 4       2           2       3
 5       2          NA       3
 6       2           0       4
 7       2          NA       4
 8       3           4       5
 9       3           3       6
10       3           4       7
11       3          NA       7
Given the sample data you provided (with the one column), this works unchanged. The code is retained below for demonstration.
base R
data$out <- ave(data$one, data$user_id, FUN = function(z) cumsum(!is.na(z)))
data
#    user_id one out
# 1        1   1   1
# 2        1  NA   1
# 3        1   3   2
# 4        2   2   1
# 5        2  NA   1
# 6        2   0   2
# 7        2  NA   2
# 8        3   4   1
# 9        3   3   2
# 10       3   4   3
# 11       3  NA   3
dplyr
library(dplyr)
data %>%
  group_by(user_id) %>%
  mutate(out = cumsum(!is.na(one))) %>%
  ungroup()
# # A tibble: 11 × 3
#    user_id   one   out
#      <dbl> <dbl> <int>
#  1       1     1     1
#  2       1    NA     1
#  3       1     3     2
#  4       2     2     1
#  5       2    NA     1
#  6       2     0     2
#  7       2    NA     2
#  8       3     4     1
#  9       3     3     2
# 10       3     4     3
# 11       3    NA     3
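A data.table variant of the same cumsum-by-group idea, in case that fits your workflow better (a sketch):
library(data.table)
# count non-NA entries cumulatively within each user_id
setDT(data)[, out := cumsum(!is.na(one)), by = user_id][]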

R: Add rows to each group so each group has the same number of rows, with the other variables set to NA

This is my df:
df <- tibble(week = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
             session = c(1, 2, 1, 2, 1, 2, 3, 1, 2, 3, 4),
             work = rep("done", 11))
df
# A tibble: 11 x 3
    week session work
   <dbl>   <dbl> <chr>
 1     1       1 done
 2     1       2 done
 3     2       1 done
 4     2       2 done
 5     3       1 done
 6     3       2 done
 7     3       3 done
 8     4       1 done
 9     4       2 done
10     4       3 done
11     4       4 done
For each week there should be 4 rows, with sessions 1 to 4.
How can I add the "missing" session rows (with the remaining variables set to NA) so that the df becomes:
df1 <- tibble(week = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
              session = rep(1:4, 4),
              work = c("done", "done", NA, NA,
                       "done", "done", NA, NA,
                       "done", "done", "done", NA,
                       rep("done", 4)))
df1
    week session work
   <dbl>   <int> <chr>
 1     1       1 done
 2     1       2 done
 3     1       3 NA
 4     1       4 NA
 5     2       1 done
 6     2       2 done
 7     2       3 NA
 8     2       4 NA
 9     3       1 done
10     3       2 done
11     3       3 done
12     3       4 NA
13     4       1 done
14     4       2 done
15     4       3 done
16     4       4 done
tidyr::complete(df, week, session)
# A tibble: 16 x 3
    week session work
   <dbl>   <dbl> <chr>
 1     1       1 done
 2     1       2 done
 3     1       3 NA
 4     1       4 NA
 5     2       1 done
 6     2       2 done
 7     2       3 NA
 8     2       4 NA
 9     3       1 done
10     3       2 done
11     3       3 done
12     3       4 NA
13     4       1 done
14     4       2 done
15     4       3 done
16     4       4 done
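As a side note, if you would rather have a placeholder than NA in the added rows, complete() also takes a fill argument; a small sketch (the "missed" label is just an illustration):
# fill the work column of the new rows with a label instead of NA
tidyr::complete(df, week, session, fill = list(work = "missed"))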
Here's a data.table solution in case speed is important:
# load package
library(data.table)
# set as data table
setDT(df)
# cross join to get the complete combination of weeks and sessions
week <- 1:4
session <- 1:4
z <- CJ(week, session)
# join back onto the original data
df_1 <- df[z, on = .(week, session)]
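If you would rather derive the grid from the observed data than hardcode 1:4, a sketch of the same cross join:
# build the full grid from the values actually present in df
z <- CJ(week = unique(df$week), session = unique(df$session))
df_1 <- df[z, on = .(week, session)]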

Shifting rows up in columns and flushing the remaining ones

I have a problem with moving rows up by one row. When rows then become completely NA, I would like to flush (drop) them. My current approach, however, still keeps the second row of each group.
Here is my approach
data <- data.frame(gr = rep(1:3, each = 2),
                   A = c(1, NA, 2, NA, 4, NA),
                   B = c(NA, 1, NA, 3, NA, 7),
                   C = c(1, NA, 4, NA, 5, NA))
> data
  gr  A  B  C
1  1  1 NA  1
2  1 NA  1 NA
3  2  2 NA  4
4  2 NA  3 NA
5  3  4 NA  5
6  3 NA  7 NA
So, using this approach:
data.frame(apply(data, 2, function(x) { x[complete.cases(x)] }))
  gr A B C
1  1 1 1 1
2  1 2 3 4
3  2 4 7 5
4  2 1 1 1
5  3 2 3 4
6  3 4 7 5
As you can see, I still end up with the second rows in each group!
The expected output:
> data
  gr A B C
1  1 1 1 1
2  2 2 3 4
3  3 4 7 5
thanks!
If there's at most one valid value per column within each gr, you can use na.omit and then take the first value from it:
library(dplyr)
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# the [1] is optional depending on your actual data
# A tibble: 3 x 4
#     gr     A     B     C
#  <int> <dbl> <dbl> <dbl>
#1     1     1     1     1
#2     2     2     3     4
#3     3     4     7     5
You can do it with dplyr and tidyr like this:
library(dplyr)
library(tidyr)
# mark the first and second row of each group
data$ind <- rep(c(1, 2), length.out = nrow(data))
# fill NAs downwards, then keep only the completed second rows
data %>% fill(A, B, C) %>% filter(ind == 2) %>% mutate(ind = NULL)
  gr A B C
1  1 1 1 1
2  2 2 3 4
3  3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution, using data.table and zoo:
data <- data.frame(gr = rep(1:3, each = 2),
                   A = c(1, NA, 2, NA, 4, NA),
                   B = c(NA, 1, NA, 3, NA, 7),
                   C = c(1, NA, 4, NA, 5, NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
   gr A B C
1:  1 1 1 1
2:  2 2 3 4
3:  3 4 7 5
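The three per-column assignments can also be collapsed into one using .SDcols; a sketch under the same assumption that each group carries exactly one non-NA value per column:
cols <- c("A", "B", "C")
# run na.locf over all three columns at once, per group
data[, (cols) := lapply(.SD, na.locf), by = gr, .SDcols = cols]
data <- unique(data)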

Recoding missing data in longitudinal data frames with R

I have a data frame with a longitudinal structure similar to data:
data <- data.frame(ID = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
                   period = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                   size = c(3, 3, NA, NA, NA, 1, 14, 14, 14))
The values of the variable size are fixed within each ID, so every period of a given ID has the same value of size. Yet some observations have missing values. My aim is to replace these missing values with the value of size from the periods where it is not missing (e.g. 3 for ID "a" and 1 for ID "b").
The desired data frame should look something like this:
data.1
ID  period  value
a        1      3
a        2      3
a        3      3
b        1      1
b        2      1
b        3      1
c        1     14
c        2     14
c        3     14
I have tried different combinations of the formula below but I don't get the result I am looking for.
library(dplyr)
data.1 <- data %>%
  group_by(ID) %>%
  mutate(new.size = ifelse(is.na(size), !is.na(size),
                           ifelse(!is.na(size), size, 0)))
That yields the following:
data.1
Source: local data frame [9 x 4]
Groups: ID [3]
      ID period  size new.size
  (fctr)  (dbl) (dbl)    (dbl)
1      a      1     3        3
2      a      2     3        3
3      a      3    NA        0
4      b      1    NA        0
5      b      2    NA        0
6      b      3     1        1
7      c      1    14       14
8      c      2    14       14
9      c      3    14       14
I would be grateful if someone could give me a hint on how to get the right solution.
Here is another solution using dplyr with na.omit:
group_by(data, ID) %>%
  mutate(value = na.omit(size)[1])
Source: local data frame [9 x 4]
Groups: ID [3]
      ID period  size value
  <fctr>  <dbl> <dbl> <dbl>
1      a      1     3     3
2      a      2     3     3
3      a      3    NA     3
4      b      1    NA     1
5      b      2    NA     1
6      b      3     1     1
7      c      1    14    14
8      c      2    14    14
9      c      3    14    14
Note that you can replace na.omit(size)[1] with max(size, na.rm = TRUE) if, for example, you are looking for the maximum.
How about this with base R:
vals <- unique(na.omit(data[, c("ID", "size")]))
data$size <- vals$size[match(data$ID, vals$ID)]
  ID period size
1  a      1    3
2  a      2    3
3  a      3    3
4  b      1    1
5  b      2    1
6  b      3    1
7  c      1   14
8  c      2   14
9  c      3   14
To correct your code, you can try the following with dplyr:
library(dplyr)
data %>% group_by(ID) %>%
  mutate(new.size = ifelse(is.na(size), size[!is.na(size)], size))
#       ID period  size new.size
#   (fctr)  (dbl) (dbl)    (dbl)
# 1      a      1     3        3
# 2      a      2     3        3
# 3      a      3    NA        3
# 4      b      1    NA        1
# 5      b      2    NA        1
# 6      b      3     1        1
# 7      c      1    14       14
# 8      c      2    14       14
# 9      c      3    14       14
Or a base R alternative with ave:
data$new.size <- ave(data$size, data$ID, FUN = function(x) unique(x[!is.na(x)]))
data$new.size
# [1]  3  3  3  1  1  1 14 14 14
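For completeness, tidyr's fill() with .direction = "downup" is another common way to propagate the single non-missing value within each group; a sketch:
library(dplyr)
library(tidyr)

data %>%
  group_by(ID) %>%
  fill(size, .direction = "downup") %>%  # fill NAs from both directions
  ungroup()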

Calculate each chunk by group using dplyr?

How can I get the expected calculation using the dplyr package?
row  value  group  expected
1        2      1  =NA
2        4      1  =4-2
3        5      1  =5-4
4        6      2  =NA
5       11      2  =11-6
6       12      1  =NA
7       15      1  =15-12
I tried:
df <- read.table(header = TRUE, text = '
row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')

library(dplyr)
df %>% group_by(group) %>% mutate(expected = value - lag(value))
How can I calculate the difference for each chunk (rows 1-3, 4-5, 6-7), even though rows 1-3 and 6-7 are labelled with the same group number?
Here is a similar approach. I created a new grouping variable using cumsum: whenever the difference between two consecutive numbers in group is not 0, a new group number is assigned. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
  group_by(foo) %>%
  mutate(out = value - lag(value))

#  row value group foo out
#1   1     2     1   1  NA
#2   2     4     1   1   2
#3   3     5     1   1   1
#4   4     6     2   2  NA
#5   5    11     2   2   5
#6   6    12     1   3  NA
#7   7    15     1   3   3
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
  row value group aux expected
1   1     2     1   1       NA
2   2     4     1   1        2
3   3     5     1   1        1
4   4     6     2   2       NA
5   5    11     2   2        5
6   6    12     1   3       NA
7   7    15     1   3        3
Here is an option using data.table 1.9.5. The devel version introduced the new functions rleid() and shift() (whose default type is "lag" with fill = NA), which are useful for this:
library(data.table)
setDT(df)[, expected := value - shift(value), by = rleid(group)][]
#   row value group expected
#1:   1     2     1       NA
#2:   2     4     1        2
#3:   3     5     1        1
#4:   4     6     2       NA
#5:   5    11     2        5
#6:   6    12     1       NA
#7:   7    15     1        3
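For reference, the same chunk-wise difference can be done in base R by combining the cumsum trick from the first answer with ave(); a sketch:
# build run IDs with cumsum, then diff within each run
df$expected <- ave(df$value, cumsum(c(TRUE, diff(df$group) != 0)),
                   FUN = function(x) c(NA, diff(x)))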
