How to select all rows between two events in R

Here is a representation of my dataset
set.seed(1)
library(tidyverse)

Date <- c(1:6, 1:8, 1:10)
ID <- c(rep(1, 3 * 2), rep(2, 4 * 2), rep(3, 5 * 2))
Surgery <- c(c("Surg1", NA, NA, NA, "Surg2", NA),
             c(NA, NA, NA, "Surg.a", NA, NA, "Surg.f", NA),
             c("Surg.C", NA, NA, "Surg.A", NA, NA, "Surg.X", NA, NA, NA))
Complication <- sample(c(rep("Infection", 8), rep("Pain", 7), rep("bleeding", 5), rep("Oedema", 4)))
NumberOfSurgery <- c(rep(2, 6), rep(2, 8), rep(3, 10))
OrderOfSurgery <- c(1, rep(NA, 3), 2, rep(NA, 4), 1, NA, NA, 2, NA, 1, NA, NA, 2, NA, NA, 3, rep(NA, 3))
mydata <- data.frame(ID, Date, Surgery, Complication, NumberOfSurgery, OrderOfSurgery)
mydata
There are three patients. The first has had two surgeries over time, the second has had two as well, and the third has had three.
I would like to select, for every individual, all complications from the date of the first surgery up to and including the date of the second surgery, in order to obtain the dataset below:
ID Date Surgery Complication NumberOfSurgery OrderOfSurgery
1 1 1 Surg1 Infection 2 1
2 1 2 <NA> Infection 2 NA
3 1 3 <NA> Infection 2 NA
4 1 4 <NA> Infection 2 NA
5 1 5 Surg2 Pain 2 2
10 2 4 Surg.a bleeding 2 1
11 2 5 <NA> Pain 2 NA
12 2 6 <NA> Infection 2 NA
13 2 7 Surg.f bleeding 2 2
15 3 1 Surg.C Pain 3 1
16 3 2 <NA> Pain 3 NA
17 3 3 <NA> Pain 3 NA
18 3 4 Surg.A bleeding 3 2
Here is how I proceeded:
The dates are already sorted.
First, I dropped, for the second individual, all observations that occurred before the first surgery (that is, dates 1 to 3).
mydata2 <- mydata %>%
  mutate(Surgery2 = Surgery, OrderOfSurgery2 = OrderOfSurgery) %>%
  group_by(ID) %>%
  fill(Surgery2, OrderOfSurgery2) %>%
  filter(!is.na(Surgery2))
Then I kept only the observations that follow the first surgery:
mydata3 <- mydata2 %>% filter(OrderOfSurgery2 == 1)
What I still need is to also include the row at the date of the second surgery, as shown in the expected output above.

Using lag() to shift OrderOfSurgery2 down by one position within each ID group, filling the first element with a default of 1, should do it.
mydata3 <- mydata2 %>%
  mutate(OrderOfSurgery2 = lag(OrderOfSurgery2, default = 1)) %>%
  filter(OrderOfSurgery2 == 1)
# # A tibble: 13 x 8
# # Groups: ID [3]
# ID Date Surgery Complication NumberOfSurgery OrderOfSurgery Surgery2
# <dbl> <int> <chr> <chr> <dbl> <dbl> <chr>
# 1 1 1 Surg1 Infection 2 1 Surg1
# 2 1 2 NA Infection 2 NA Surg1
# 3 1 3 NA Infection 2 NA Surg1
# 4 1 4 NA Infection 2 NA Surg1
# 5 1 5 Surg2 Pain 2 2 Surg2
# 6 2 4 Surg.a bleeding 2 1 Surg.a
# 7 2 5 NA Pain 2 NA Surg.a
# 8 2 6 NA Infection 2 NA Surg.a
# 9 2 7 Surg.f bleeding 2 2 Surg.f
# <Omitted>
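As an aside, a similar selection can be made directly from mydata, without the helper columns, by counting how many surgeries have occurred so far within each patient. This is only a sketch against the example data above (the surgeries_so_far helper is made up for illustration), not a tested general solution:
library(dplyr)
mydata %>%
  group_by(ID) %>%
  # count how many surgery rows have occurred up to and including each date
  mutate(surgeries_so_far = cumsum(!is.na(OrderOfSurgery))) %>%
  # keep everything from the first surgery up to, and including, the second one
  filter(surgeries_so_far == 1 | OrderOfSurgery %in% 2) %>%
  select(-surgeries_so_far) %>%
  ungroup()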

Related

Expanding Data Frame with cumsum in R

I've got a data frame with historic F1 data that looks like this:
Driver    Race number  Position  Number of Career Podiums
Farina              1         1                         1
Fagioli             1         2                         1
Parnell             1         3                         1
Fangio              2         1                         1
Ascari              2         2                         1
Chiron              2         3                         1
...               ...       ...                       ...
Moss               47         1                         4
Fangio             47         2                        23
Kling              47         3                         2
Now I want to extend it so that for every race there is not only the top 3 of that specific race but also everyone who has had a top-3 finish before, so I can create a racing bar chart. The final data frame should look like this:
Driver    Race number  Position  Number of Career Podiums
Farina              1         1                         1
Fagioli             1         2                         1
Parnell             1         3                         1
Fangio              2         1                         1
Ascari              2         2                         1
Chiron              2         3                         1
Farina              2        NA                         1
Fagioli             2        NA                         1
Parnell             2        NA                         1
Parsons             3         1                         1
Holland             3         2                         1
Rose                3         3                         1
Farina              3        NA                         1
Fagioli             3        NA                         1
Parnell             3        NA                         1
Fangio              3        NA                         1
Ascari              3        NA                         1
Chiron              3        NA                         1
Is there an easy way to do this? I couldn't find anyone with a similar problem on Google.
If I understand your problem correctly, you only have observations for the top-3 drivers of every race, but you want observations for every driver that has ever achieved a top-3 position anywhere in your dataset, across all races.
For example, in the following dataset driver D only has an observation for the second race, where they achieved first place, but not for the other races:
dat <- data.frame(driver = c("A", "B", "C", "D", "A", "B", "B", "A", "C"),
                  race_number = rep(1:3, each = 3),
                  position = rep(1:3, 3),
                  previous_podium_positions = c(1, 1, 1, 1, 2, 2, 3, 3, 2))
print(dat)
  driver race_number position previous_podium_positions
1      A           1        1                         1
2      B           1        2                         1
3      C           1        3                         1
4      D           2        1                         1
5      A           2        2                         2
6      B           2        3                         2
7      B           3        1                         3
8      A           3        2                         3
9      C           3        3                         2
To add entries for driver D for races 1 and 3 you could use tidyr's expand() function, or in base R you could achieve the same with expand.grid() and unique(). Either way you get a data frame containing all possible combinations of driver and race number, which you then left- or right-join back onto the initial data frame (a base-R sketch follows below).
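A minimal base-R version of that idea, assuming the example dat from above:
# all driver/race_number combinations, then a left (outer) join back onto dat
all_combos <- expand.grid(driver = unique(dat$driver),
                          race_number = unique(dat$race_number),
                          stringsAsFactors = FALSE)
merge(all_combos, dat, by = c("driver", "race_number"), all.x = TRUE)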
A solution using standard tidyverse packages tidyr and dplyr could look like this:
library(dplyr)
library(tidyr)

dat %>%
  expand(driver, race_number) %>%
  left_join(dat)
# A tibble: 12 x 4
driver race_number position previous_podium_positions
<chr> <int> <int> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 2 3
4 B 1 2 1
5 B 2 3 2
6 B 3 1 3
7 C 1 3 1
8 C 2 NA NA
9 C 3 3 2
10 D 1 NA NA
11 D 2 1 1
12 D 3 NA NA
Note that the "new" observations will naturally have NAs for the position and the number of previous podium positions. The latter can be filled in easily with the following approach, which counts each driver's podium finishes cumulatively:
dat %>%
  expand(driver, race_number) %>%
  left_join(dat) %>%
  arrange(race_number) %>%
  mutate(previous_podium_positions = ifelse(is.na(previous_podium_positions), 0, 1)) %>%
  group_by(driver) %>%
  mutate(previous_podium_positions = cumsum(previous_podium_positions))
Joining, by = c("driver", "race_number")
# A tibble: 12 x 4
# Groups: driver [4]
driver race_number position previous_podium_positions
<chr> <int> <int> <dbl>
1 A 1 1 1
2 B 1 2 1
3 C 1 3 1
4 D 1 NA 0
5 A 2 2 2
6 B 2 3 2
7 C 2 NA 1
8 D 2 1 1
9 A 3 2 3
10 B 3 1 3
11 C 3 3 2
12 D 3 NA 1
I hope this helps. A brief disclaimer: these may well not be the most resource- or time-efficient solutions, but they are the quickest/easiest way to solve the issue.
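As a side note, tidyr::complete() wraps the expand-plus-join step into a single call; a sketch with the same dat:
dat %>%
  complete(driver, race_number) %>%
  arrange(race_number, driver)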

R Add rows to each group so each group has same number, and specify other variable

This is my df:
library(tibble)

df <- tibble(week = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
             session = c(1, 2, 1, 2, 1, 2, 3, 1, 2, 3, 4),
             work = rep("done", 11))
df
# A tibble: 11 x 3
week session work
<dbl> <dbl> <chr>
1 1 1 done
2 1 2 done
3 2 1 done
4 2 2 done
5 3 1 done
6 3 2 done
7 3 3 done
8 4 1 done
9 4 2 done
10 4 3 done
11 4 4 done
For each week there should be 4 rows, with sessions 1 to 4.
How can I add the "missing" session rows (with the other variables set to NA) so that the df becomes:
df1 <- tibble(week = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
              session = rep(1:4, 4),
              work = c("done", "done", NA, NA, "done", "done", NA, NA,
                       "done", "done", "done", NA, rep("done", 4)))
df1
week session work
<dbl> <int> <chr>
1 1 1 done
2 1 2 done
3 1 3 NA
4 1 4 NA
5 2 1 done
6 2 2 done
7 2 3 NA
8 2 4 NA
9 3 1 done
10 3 2 done
11 3 3 done
12 3 4 NA
13 4 1 done
14 4 2 done
15 4 3 done
16 4 4 done
tidyr::complete(df, week, session)
# A tibble: 16 x 3
week session work
<dbl> <dbl> <chr>
1 1 1 done
2 1 2 done
3 1 3 NA
4 1 4 NA
5 2 1 done
6 2 2 done
7 2 3 NA
8 2 4 NA
9 3 1 done
10 3 2 done
11 3 3 done
12 3 4 NA
13 4 1 done
14 4 2 done
15 4 3 done
16 4 4 done
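If the padded rows should be labelled rather than left as NA, complete() also accepts a fill list, and naming the session argument pins the full 1:4 range even when a session never appears in the data. A sketch (the "missed" label is just an illustration):
tidyr::complete(df, week, session = 1:4, fill = list(work = "missed"))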
Here's a data.table solution in case speed is important
# load package
library(data.table)
# set as data table
setDT(df)
# cross join to get complete combination
week <- 1:4
session <- 1:4
z <- CJ(week,session)
# join
df_1 <- df[z, on=.(week, session)]

Calculate price differences in an unbalanced set

I have an unbalanced data frame with dates, localities and prices. I would like to calculate the price difference between localities for each date. Because my data is unbalanced, my idea for getting every price difference is to add rows (localities) to balance the data.
My data looks like:
library(dplyr)

set.seed(123)
df <- data.frame(date = 1:3,
                 locality = rbinom(21, 3, 0.2),
                 price = rnorm(21, 50, 20))
df %>%
  arrange(date, locality)
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 2 0 26.68910
9 2 1 100.56673
10 2 1 48.88628
11 2 1 48.29153
12 2 2 29.02214
13 2 2 45.68269
14 2 2 43.59887
15 3 0 60.98193
16 3 0 75.89527
17 3 0 43.30174
18 3 0 71.41221
19 3 0 33.62969
20 3 1 34.31236
21 3 1 23.76955
To balance the data, I am thinking of something like:
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 1 2 NA
9 1 2 NA
10 2 0 26.68910
10 2 0 NA
10 2 0 NA
11 2 1 100.56673
12 2 1 48.88628
13 2 1 48.29153
14 2 2 29.02214
15 2 2 45.68269
16 2 2 43.59887
etc...
Finally, to get the price differences between pairs of localities, I am thinking of:
> date diff(price, 0-1) diff(price, 0-2) diff(price, 1-2)
1 1 60.07625-54.76426 60.07625-47.09213 etc...
2 1 35.32994-66.51080 35.32994-NA
3 1 63.69872-28.28602 63.69872-NA
You don't need to balance your data. If you use dcast, it will add the NAs for you.
First, transform the data so that each locality gets its own column:
library(data.table)
library(tidyverse)
setDT(df)
df[, rid := rowid(date, locality)]
df2 <- dcast(df, rid + date ~ locality, value.var = 'price')
# rid date 0 1 2
# 1: 1 1 60.07625 54.76426 47.09213
# 2: 1 2 26.68910 100.56673 29.02214
# 3: 1 3 60.98193 34.31236 NA
# 4: 2 1 35.32994 66.51080 NA
# 5: 2 2 NA 48.88628 45.68269
# 6: 2 3 75.89527 23.76955 NA
# 7: 3 1 63.69872 28.28602 NA
# 8: 3 2 NA 48.29153 43.59887
# 9: 3 3 43.30174 NA NA
# 10: 4 3 71.41221 NA NA
# 11: 5 3 33.62969 NA NA
Then create a data frame to_diff of differences to calculate, and pmap over that to calculate the differences. Here c0_1 corresponds to what you call in your question diff(price, 0-1).
to_diff <- CJ(0:2, 0:2)[V1 < V2]

pmap(to_diff, ~ df2[[as.character(.x)]] - df2[[as.character(.y)]]) %>%
  setNames(paste0('c', to_diff[[1]], '_', to_diff[[2]])) %>%
  bind_cols(df2[, 1:2])
# A tibble: 11 x 5
# c0_1 c0_2 c1_2 rid date
# <dbl> <dbl> <dbl> <int> <int>
# 1 5.31 13.0 7.67 1 1
# 2 -73.9 -2.33 71.5 1 2
# 3 26.7 NA NA 1 3
# 4 -31.2 NA NA 2 1
# 5 NA NA 3.20 2 2
# 6 52.1 NA NA 2 3
# 7 35.4 NA NA 3 1
# 8 NA NA 4.69 3 2
# 9 NA NA NA 3 3
# 10 NA NA NA 4 3
# 11 NA NA NA 5 3
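If the locality codes are not known in advance, the pair table can be built from the data rather than hard-coding 0:2; a sketch:
locs <- sort(unique(df$locality))
to_diff <- CJ(V1 = locs, V2 = locs)[V1 < V2]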

R Logic For assigning HH Travel Surveys

I am trying to write a logical statement that arranges the tours of a household.
A tour, by definition, is a series of trips that starts at home and ends at home. In this dataset, Triptype carries that information: 1 = home, and 2-5 are non-home trips. I want to create a tour number for each sampno (sample number) and perno (person number). There can be more than one person in a household, each with completely different trips from the others. For sampno 1032558 there is only one person in the household, and his tour #1 is 1-4-4-5-4-4-1. I included a link to the whole dataset in case it helps, but I just need help with a loop that creates a new Tour# column based on sampno, perno and Triptype, such that the tour count resets whenever sampno or perno changes, since that would be a different person or household.
Data From Household Travel Survey
 sampno  perno  tripno  plano  Trip Type
1032558      1      NA      1          1
1032558      1       1      2          4
1032558      1       2      3          4
1032558      1       3      4          5
1032558      1       4      5          4
1032558      1       5      6          4
1032558      1       6      7          1
1033660      1      NA      1          1
1033660      1      NA      1          1
1033660      1      NA      1          1
1034462      1      NA      1          1
1034462      2      NA      1          1
Consider randomly generating a number per grouping with by(). The anonymous function uses tryCatch() for by slices that contain no records, which can happen when grouping by more than one factor. Be sure to sample from a sufficiently large range, such as 1:10000, to accommodate your data groupings.
df_list <- by(tourdf, tourdf[c("sampno", "perno", "triptype")], function(sub) {
  sub$tour_number <- tryCatch({
    set.seed(sub$sampno[1] + sub$perno[1] + sub$triptype[1])
    sample(1:10000, 1, replace = TRUE)
  }, error = function(e) return(NA))
  return(sub)
})
new_tourdf <- do.call(rbind, df_list)
new_tourdf
# sampno perno tripno plano triptype tour_number
# 1 1032558 1 NA 1 1 111
# 2 1032558 1 6 7 1 111
# 3 1033660 1 NA 1 1 1561
# 4 1033660 1 NA 1 1 1561
# 5 1033660 1 NA 1 1 1561
# 6 1034462 1 NA 1 1 8855
# 7 1034462 2 NA 1 1 5636
# 8 1032558 1 1 2 4 1727
# 9 1032558 1 2 3 4 1727
# 10 1032558 1 4 5 4 1727
# 11 1032558 1 5 6 4 1727
# 12 1032558 1 3 4 5 804
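If what is needed is instead a sequential tour counter that resets for each sampno/perno, a dplyr sketch along these lines might work (column names taken from the sample above; untested against the full survey, and it would need adjusting if a home row both closes one tour and opens the next):
library(dplyr)
tourdf %>%
  group_by(sampno, perno) %>%
  # start a new tour each time a home row (triptype 1) is followed by a non-home trip
  mutate(tour_number = cumsum(triptype == 1 & lead(triptype, default = 1) != 1)) %>%
  ungroup()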

imputing forward / backward

I am trying to impute some longitudinal data in the following way (see below). For each individual (id), if the first values are NA, I would like to impute them using the first observed value for that individual, regardless of when it occurs. Then I would like to impute forward based on the last observed value for each individual (see imputed below).
The var values do not necessarily increase monotonically, and they might be a character vector.
I have tried several ways to do this, but I still cannot get a satisfactory solution.
Any ideas?
library(data.table)

id <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
time <- c(1, 2, 3, 4, 5, 6, 7, 3, 5, 7, 9)
var <- c(NA, NA, 1, NA, 2, 3, NA, NA, 2, 3, NA)
imputed <- c(1, 1, 1, 1, 2, 3, 3, 2, 2, 3, 3)
dat <- data.table(id, time, var, imputed)
id time var imputed
1: 1 1 NA 1
2: 1 2 NA 1
3: 1 3 1 1
4: 1 4 NA 1
5: 1 5 2 2
6: 1 6 3 3
7: 1 7 NA 3
8: 2 3 NA 2
9: 2 5 2 2
10: 2 7 3 3
11: 2 9 NA 3
library(zoo)
dat[, newimp := na.locf(na.locf(var, FALSE), fromLast=TRUE), by = id]
dat
# id time var imputed newimp
# 1: 1 1 NA 1 1
# 2: 1 2 NA 1 1
# 3: 1 3 1 1 1
# 4: 1 4 NA 1 1
# 5: 1 5 2 2 2
# 6: 1 6 3 3 3
# 7: 1 7 NA 3 3
# 8: 2 3 NA 2 2
# 9: 2 5 2 2 2
#10: 2 7 3 3 3
#11: 2 9 NA 3 3
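The same fill-down-then-up idea also works with tidyr::fill() in a dplyr-based workflow; a sketch (newimp2 is just an illustrative column name):
library(dplyr)
library(tidyr)
dat %>%
  mutate(newimp2 = var) %>%   # keep the original column intact
  group_by(id) %>%
  fill(newimp2, .direction = "downup") %>%
  ungroup()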
