In my clinical dataset, each row is uniquely identified by patient ID and time, and the variable of interest looks like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
V1 <- c(1,1,NA,2,1,NA,1,3,NA,NA,1,NA)
Data <- data.frame(patientid=patientid, time=time, V1=V1)
Timepoint 3 is blank for each patient. I want to fill in timepoint 3 for each patient based on the following criteria. If the variable is coded as 2 or 3 at either timepoint 1 or 2, then timepoint 3 should be coded as 2. If the variable is coded as 1 at both timepoints 1 and 2, then timepoint 3 should be coded as 1. If data are missing at timepoint 1 or 2, then timepoint 3 should be missing. So for the toy example it should look like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
V1 <- c(1,1,1,2,1,2,1,3,2,NA,1,NA)
Data <- data.frame(patientid=patientid, time=time, V1=V1)
You can use pivot_wider from tidyr to convert your data to wide format, then mutate the `3` column with your logic, applying a custom function row by row with pmap_dbl from the purrr package. You can return to the original shape of the data frame using pivot_longer.
library(tidyverse)
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
V1 <- c(1,1,NA,2,1,NA,1,3,NA,NA,1,NA)
df <- data.frame(patientid=patientid, time=time, V1=V1)
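# Scalar rule for timepoint 3, given the values at timepoints 1 and 2
flag <- function(t1, t2){
  if(is.na(t1) || is.na(t2)){
    NA_real_                                    # any missing input -> missing
  } else if(t1 %in% c(2,3) || t2 %in% c(2,3)){
    2                                           # a 2 or 3 at either timepoint
  } else if(t1 == 1 && t2 == 1){
    1                                           # 1 at both timepoints
  } else {
    NA_real_                                    # any other combination
  }
}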
df %>%
as_tibble() %>%
pivot_wider(names_from = time, values_from = V1) %>%
mutate(`3` = pmap_dbl(list(`1`,`2`),flag )) %>%
pivot_longer(-1, names_to = "time", values_to = "V1")
#> # A tibble: 12 x 3
#> patientid time V1
#> <dbl> <chr> <dbl>
#> 1 100 1 1
#> 2 100 2 1
#> 3 100 3 1
#> 4 101 1 2
#> 5 101 2 1
#> 6 101 3 2
#> 7 102 1 1
#> 8 102 2 3
#> 9 102 3 2
#> 10 104 1 NA
#> 11 104 2 1
#> 12 104 3 NA
Created on 2021-01-29 by the reprex package (v0.3.0)
This should do it!
library(tidyverse)
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
V1 <- c(1,1,NA,2,1,NA,1,3,NA,NA,1,NA)
Data <- data.frame(patientid=patientid, time=time, V1=V1)
Data <- Data %>% pivot_wider(names_from = "time", values_from = "V1",
names_prefix = "timepoint_")
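timepoint_impute <- function(x,y) {
  if(is.na(x) || is.na(y)) {
    return(NA_real_)
  } else if(2 %in% c(x,y) || 3 %in% c(x,y)) {
    return(2)
  } else if(x==1 && y==1) {
    return(1)
  }
  NA_real_  # any other combination; avoids returning NULL
}
# map2_dbl() returns a numeric vector (map2() would create a list column)
Data$timepoint_3 <- map2_dbl(.x = Data$timepoint_1, .y = Data$timepoint_2,
                             .f = timepoint_impute)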
You end up with wide data format but if you need long data format you can just use tidyr::pivot_longer. This approach writes a custom function to handle the logic you need.
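If reshaping is not needed, the same rule can also be applied directly to the long data. The following is only a minimal sketch: the helper name impute_t3 is made up for illustration, and it assumes each patient has exactly one row per timepoint.

```r
library(dplyr)

Data <- data.frame(
  patientid = c(100,100,100,101,101,101,102,102,102,104,104,104),
  time      = c(1,2,3,1,2,3,1,2,3,1,2,3),
  V1        = c(1,1,NA,2,1,NA,1,3,NA,NA,1,NA)
)

# Hypothetical helper: takes the values at timepoints 1 and 2 of one patient
# and returns the value to use at timepoint 3.
impute_t3 <- function(prior) {
  if (anyNA(prior)) return(NA_real_)            # missing at timepoint 1 or 2
  if (any(prior %in% c(2, 3))) return(2)        # a 2 or 3 at either timepoint
  if (all(prior == 1)) return(1)                # 1 at both timepoints
  NA_real_                                      # any other combination
}

filled <- Data %>%
  group_by(patientid) %>%
  mutate(V1 = if_else(time == 3, impute_t3(V1[time %in% c(1, 2)]), V1)) %>%
  ungroup()

filled
```

Because the rule is evaluated once per patient and only written into the rows where time == 3, the data stay in long format throughout.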
ID   Date
101  10-17-2021
101  10-19-2021
101  10-20-2021
101  10-31-2021
101  11-01-2021
For each ID I want to remove observations that are within 7 days of each other. I want to keep the earliest date of the dates that are within 7 days of each other. So in this case I would want to keep "10-17-2021" and "10-31-2021". This process would continue until I have unique dates for each ID that are at least 7 days apart and do not contain other dates in between.
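You can do it using the group_by() and slice() functions, but first the Date column must be converted with as.Date(). Here is code that removes observations within a 7-day interval and keeps only the earliest date. Note that diff() compares each date with the row immediately before it, not with the last date that was kept, so this assumes the dates are sorted within each ID and it can drop too many rows when a run of closely spaced dates spans more than 7 days: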
library(tidyverse)
df$Date <- as.Date(df$Date, format = "%m-%d-%Y")
df %>%
group_by(ID) %>%
slice(c(1, which(c(0, diff(Date)) >= 7)))
output
ID Date
101 2021-10-17
101 2021-10-31
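To see how comparing consecutive rows differs from comparing against the last kept date, here is a small sketch on made-up dates (these three dates are an assumption for illustration, not part of the question). Each date is 4 days after the previous one, so no consecutive gap reaches 7 days, although the third date is 8 days after the first.

```r
library(dplyr)

# Made-up example: consecutive gaps of 4 days each
df_demo <- data.frame(
  ID   = 101,
  Date = as.Date(c("2021-10-17", "2021-10-21", "2021-10-25"))
)

kept <- df_demo %>%
  group_by(ID) %>%
  slice(c(1, which(c(0, as.numeric(diff(Date), units = "days")) >= 7))) %>%
  ungroup()

kept  # only the first row survives, even though 2021-10-25 is 8 days after it
```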
In your example, you can't evaluate every observation independently because some of them may be removed when compared to the first value. Perhaps I'm not thinking about it the right way, but I think you need a loop to do this. Here's what I came up with (note: I made the sequence of dates longer to make sure it works):
library(dplyr)
d <- tibble(
ID = 101,
Date = seq(lubridate::mdy("01-01-2023"),
lubridate::mdy("02-07-2023"), by="days")
)
i <- 1
while(i < nrow(d)){
d <- d %>% mutate(diff = Date - d$Date[i])
d <- d %>% filter(diff <= 0 | diff > 7)
if(i < nrow(d)){
i <- i+1
}
}
d <- d %>% select(-diff)
d
#> # A tibble: 5 × 2
#> ID Date
#> <dbl> <date>
#> 1 101 2023-01-01
#> 2 101 2023-01-09
#> 3 101 2023-01-17
#> 4 101 2023-01-25
#> 5 101 2023-02-02
Created on 2023-02-08 by the reprex package (v2.0.1)
Essentially, the loop starts with the first observation and removes every later observation within seven days of it. If more observations remain, it increments the counter, moves to the next remaining date, and evaluates all subsequent dates from there, keeping everything that came before.
These loops are difficult to do in the tidyverse, but you could split the data by group, run the loop on each group and then put the groups back together. Here's an example:
library(dplyr)
d <- tibble(
ID = 101,
Date = seq(lubridate::mdy("01-01-2023"),
lubridate::mdy("02-07-2023"), by="days")
)
d2 <- d %>% mutate(ID = 102)
alldat <- bind_rows(d, d2)
split_dat <- alldat %>%
group_by(ID) %>%
group_split()
result <- purrr::map(split_dat, function(d){
i <- 1
while(i < nrow(d)){
d <- d %>% mutate(diff = Date - d$Date[i])
d <- d %>% filter(diff <= 0 | diff > 7)
if(i < nrow(d)){
i <- i+1
}
}
d <- d %>% select(-diff)
d
})
result <- bind_rows(result)
result
#> # A tibble: 10 × 2
#> ID Date
#> <dbl> <date>
#> 1 101 2023-01-01
#> 2 101 2023-01-09
#> 3 101 2023-01-17
#> 4 101 2023-01-25
#> 5 101 2023-02-02
#> 6 102 2023-01-01
#> 7 102 2023-01-09
#> 8 102 2023-01-17
#> 9 102 2023-01-25
#> 10 102 2023-02-02
Created on 2023-02-08 by the reprex package (v2.0.1)
You can try using a recursive function as in this answer.
f <- function(d, ind = 1) {
ind.next <- dplyr::first(which(difftime(d, d[ind], units="days") > 7))
if (is.na(ind.next))
return(ind)
else
return(c(ind, f(d, ind.next)))
}
After the first date, the function will get the next index ind.next where the date is more than 7 days away. Recursively, add that index and get the next date after that. In the end, just return all the row indexes.
The code to use this function can group_by(ID) and slice to retain those rows based on indexes returned.
library(dplyr)
df %>%
group_by(ID) %>%
slice(f(Date))
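The same greedy selection can also be written without recursion, which may be preferable for long date vectors. This is a sketch under stated assumptions: keep_spaced and its min_gap argument are hypothetical names, the dates must be sorted ascending within each ID, and a date is kept only if it is more than min_gap days after the last kept date (the same strict comparison the function above uses).

```r
library(dplyr)

# Hypothetical helper: returns the indexes of the dates to keep.
keep_spaced <- function(dates, min_gap = 7) {
  if (length(dates) == 0) return(integer(0))
  kept <- 1L            # always keep the earliest date
  last <- dates[1]      # most recently kept date
  for (i in seq_along(dates)[-1]) {
    if (as.numeric(dates[i] - last, units = "days") > min_gap) {
      kept <- c(kept, i)
      last <- dates[i]
    }
  }
  kept
}

df <- data.frame(
  ID   = 101,
  Date = as.Date(c("10-17-2021", "10-19-2021", "10-20-2021",
                   "10-31-2021", "11-01-2021"), format = "%m-%d-%Y")
)

spaced <- df %>%
  arrange(ID, Date) %>%          # the helper relies on sorted dates
  group_by(ID) %>%
  slice(keep_spaced(Date)) %>%
  ungroup()

spaced
```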
I have a table of data which includes, among others, an ID, a (somehow sorted) grouping column and a date. For each ID, based on the minimum value of the date for a given group, I would like to filter out the rows of another given group that occurred after that date.
I thought about using pivot_wider and pivot_longer, but I was not able to operate on columns containing list values and single values simultaneously.
How can I do it efficiently (using any tidyverse method, if possible)?
For instance, given
library(dplyr)
tbl <- tibble(id = c(rep(1,5), rep(2,5)),
type = c("A","A","A","B","C","A","A","B","B","C"),
dat = as.Date("2021-12-07") - c(3,0,1,2,0,3,6,2,4,3))
# A tibble: 10 × 3
# id type dat
# <int> <chr> <date>
# 1 1 A 2021-12-04
# 2 1 A 2021-12-07
# 3 1 A 2021-12-06
# 4 1 B 2021-12-05
# 5 1 C 2021-12-07
# 6 2 A 2021-12-04
# 7 2 A 2021-12-01
# 8 2 B 2021-12-05
# 9 2 B 2021-12-03
# 10 2 C 2021-12-04
I would like the following result, where I discarded A-typed elements that occurred after the first of the B-typed ones, but none of the C-typed ones:
# A tibble: 7 × 3
# id type dat
# <int> <chr> <date>
# 1 1 A 2021-12-04
# 2 1 B 2021-12-05
# 3 1 C 2021-12-07
# 4 2 A 2021-12-01
# 5 2 B 2021-12-05
# 6 2 B 2021-12-03
# 7 2 C 2021-12-04
I like to use pivot_wider and pivot_longer in this case (note that the example below uses a simplified tbl with a single row per id). It does the trick, but maybe you are looking for something shorter.
tbl <- tibble(id = 1:5, type = c("A","A","A","B","C"), dat = as.Date("2021-12-07") - c(3,4,1,2,0)) %>%
pivot_wider(names_from = type, values_from = dat) %>%
filter(A < min(B, na.rm = TRUE) | is.na(A)) %>%
pivot_longer(2:4, names_to = "type", values_to = "dat") %>%
na.omit()
# A tibble: 4 × 3
id type dat
<int> <chr> <date>
1 1 A 2021-12-04
2 2 A 2021-12-03
3 4 B 2021-12-05
4 5 C 2021-12-07
An easy way, using SQL-like logic:
tbl_to_delete <- tbl %>% dplyr::filter(type == "A" & dat > min(tbl$dat[tbl$type=="B"]))
tbl2 <- tbl %>% dplyr::anti_join(tbl_to_delete,by=c("type","dat"))
First you isolate the rows you want to delete, then you discard them from your original data. Note that min() is taken here over all B rows in the table rather than per id, and the anti-join matches on type and dat only.
You can of course merge the two previous lines into one for better code management:
tbl %>% anti_join(tbl %>% filter(type == "A" & dat > min(tbl$dat[tbl$type=="B"])),by=c("type","dat"))
Or, if you really dislike base R:
tbl %>% anti_join(tbl %>% filter(type == "A" & dat > tbl %>% filter(type == "B") %>% pull(dat) %>% min()),by=c("type","dat"))
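Since the question asks for the cutoff to be computed for each id, a grouped filter is another option. This is a minimal sketch, assuming every id has at least one B row (otherwise min() on an empty vector would warn); the column name first_b is introduced only for illustration.

```r
library(dplyr)

tbl <- tibble(id = c(rep(1,5), rep(2,5)),
              type = c("A","A","A","B","C","A","A","B","B","C"),
              dat = as.Date("2021-12-07") - c(3,0,1,2,0,3,6,2,4,3))

result <- tbl %>%
  group_by(id) %>%
  mutate(first_b = min(dat[type == "B"])) %>%   # earliest B date within this id
  filter(type != "A" | dat <= first_b) %>%      # drop only A rows after that date
  ungroup() %>%
  select(-first_b)

result
```

On the example data this keeps 7 rows: one A row per id plus all B and C rows.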
I have a data frame:
df1 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7",NA,"17",NA,NA,NA)),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
And I want this:
df2 <- data.frame(Object = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
PointA = as.numeric(c("7","18","17",NA,NA,"17")),
PointB = as.numeric(c("18","22",NA,NA,"17",NA)),
measure = c("1","2","1","2","1","2")
)
That is, if there is no value of PointA for an Object at measure == 2, I want it replaced with PointB from measure == 1 of the same Object.
First thing that comes to mind is:
library(dplyr)
df1$PointA <- coalesce(df1$PointA, df1$PointB)
But as far as I know there is no way to make this conditional.
Then I thought maybe something like:
df1$PointA[is.na(df1$PointA)] <- df1$PointB
But this does not differentiate by measure.
So I thought about:
df1$PointA <- ifelse(df1$measure == 2 & is.na(df1$PointA), df1$PointB, df1$PointA)
But that does not take into account that I need the corresponding value from measure == 1.
Now I am at a loss. I am out of ideas for how to approach this. Help?
Edit: I got two very good solutions already, but both rely on the row order in the data frame. I tried, but obviously my example was too simple. I am looking for something that also works under the following condition:
df1 <- df1[sample(nrow(df1)), ]
One possible option is using row_number() from dplyr. In case you need to sort your dataframe first, you can insert an arrange statement.
library(dplyr)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
mutate(PointA = if_else(measure == 2 & is.na(PointA), PointB[row_number()-1], PointA))
# A tibble: 6 x 4
# Groups: Object [3]
# Object PointA PointB measure
# <chr> <dbl> <dbl> <chr>
# 1 Daniel NA 17 1
# 2 Daniel 17 NA 2
# 3 Klaus 7 18 1
# 4 Klaus 18 22 2
# 5 Peter 17 NA 1
# 6 Peter NA NA 2
You could use coalesce +lag as shown below:
library(tidyverse)
df1 %>%
arrange(Object, measure) %>%
group_by(Object) %>%
mutate(PointA = coalesce(PointA, lag(PointB)))
# A tibble: 6 x 4
# Groups:   Object [3]
#   Object PointA PointB measure
#   <chr>   <dbl>  <dbl> <chr>
# 1 Daniel     NA     17 1
# 2 Daniel     17     NA 2
# 3 Klaus       7     18 1
# 4 Klaus      18     22 2
# 5 Peter      17     NA 1
# 6 Peter      NA     NA 2
This could be condensed, but it should be relatively clear and doesn't rely on the row order at all. Beware if you have multiple rows for the same Object/Measure pair - the self-join will have multiple matches and you'll end up with a lot more rows than you started with.
library(dplyr)
df_fill = df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L)
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(df_fill) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)
result
# Object PointA PointB measure
# 1 Klaus 7 18 1
# 2 Klaus 18 22 2
# 3 Peter 17 NA 1
# 4 Peter NA NA 2
# 5 Daniel NA 17 1
# 6 Daniel 17 NA 2
Same as above but without saving the intermediate object:
result = df1 %>%
mutate(needs_fill = if_else(measure == 2 & is.na(PointA), 1L, NA_integer_)) %>%
left_join(
df1 %>%
filter(measure == 1) %>%
select(Object, fill_in = PointB) %>%
mutate(needs_fill = 1L)
) %>%
mutate(PointA = coalesce(PointA, fill_in)) %>%
select(-fill_in, -needs_fill)
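Another order-independent option is to look up the measure == 1 row inside each group with match(). This is only a sketch, assuming at most one row per Object/measure pair; the rows are shuffled first to show that the result does not depend on their order.

```r
library(dplyr)

df1 <- data.frame(
  Object  = c("Klaus","Klaus","Peter","Peter","Daniel","Daniel"),
  PointA  = c(7, NA, 17, NA, NA, NA),
  PointB  = c(18, 22, NA, NA, 17, NA),
  measure = c("1","2","1","2","1","2")
)

set.seed(1)
shuffled <- df1[sample(nrow(df1)), ]   # random row order

filled <- shuffled %>%
  group_by(Object) %>%
  mutate(PointA = if_else(measure == "2" & is.na(PointA),
                          PointB[match("1", measure)],  # PointB of the measure 1 row
                          PointA)) %>%
  ungroup() %>%
  arrange(Object, measure)             # sorted only for display

filled
```

If an Object has no measure == 1 row, match() returns NA and PointA simply stays missing.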
Having gotten my data into the format:
pId fId
1 1 0
2 1 108
3 1 940
4 1 972
5 1 993
6 2 0
7 3 0
8 3 32
9 3 108
10 3 176
My goal is (for a much longer set of data) to determine which fIds each pId has in common with each other pId, and from that how many they have in common. My plan was to summarise the data into a single row per pId, where fId is a list of that pId's fIds, and then loop a function such as intersect() across the rows, for an ideal output of the form:
pId1 pId2 together
1 1 2 1
2 1 3 2
3 1 4 N
4 2 3 1
etc....
EDIT: trying to work with the data in one of these ways
pId allfId allfIdSplit
1 1 0,901,940,972,993 c("0", "901", "940", "972", "993")
2 2 0 0
3 3 0,32,108,176 c("0", "32", "108", "176")
4 4 0,200,561,602,629,772,825,991 c("0", "200", "561", "602", "629", "772", "825", "991")
5 5 0 0
With the code I have so far, where df_a is the starting point shown above, I get the output shown in the edit:
df_c <- df_a %>%
group_by(pId) %>%
arrange(pId) %>%
summarize(allFlights = paste(unique(flightId), collapse = ",")) %>%
mutate(allFlightsSplit = str_split(allFlights, ",")) %>%
print()
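The plan described above (one list of fIds per pId, then intersect() over pairs) could be sketched in base R as follows. This is an illustration under assumptions: it uses the column names pId and fId from the first table, and the object names f_by_p, pid_pairs and pair_counts are made up.

```r
df_a <- data.frame(
  pId = c(1, 1, 1, 1, 1, 2, 3, 3, 3, 3),
  fId = c(0, 108, 940, 972, 993, 0, 0, 32, 108, 176)
)

# One vector of fIds per pId (a named list, names are the pIds as strings)
f_by_p <- split(df_a$fId, df_a$pId)

# Every unordered pair of pIds
pid_pairs <- combn(names(f_by_p), 2, simplify = FALSE)

# Count the shared fIds for each pair
pair_counts <- do.call(rbind, lapply(pid_pairs, function(p) {
  data.frame(
    pId1     = as.numeric(p[1]),
    pId2     = as.numeric(p[2]),
    together = length(intersect(f_by_p[[p[1]]], f_by_p[[p[2]]]))
  )
}))

pair_counts
```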
Here's one way to do it in the tidyverse. See comments in the code.
library(tidyverse)
library(magrittr)
df.counts <- combn(unique(df$pId), 2) %>% # unique pairs of pIds
  t %>% # transpose so that each pair is a row
  as.data.frame() %>% # to data frame
  set_colnames(c('pId1', 'pId2')) %>% # name the columns
  left_join(df, by = c(pId1 = 'pId')) %>% # join the original data to pId1
  left_join(df, by = c(pId2 = 'pId')) %>% # join the original data to pId2
  filter(fId.x == fId.y) %>% # keep only fIds shared by both pIds
  count(pId1, pId2) # count the shared fIds per pair
pId1 pId2 n
<int> <int> <int>
1 1 2 1
2 1 3 2
3 2 3 1
An alternative using loops
Loops are usually not the best way to handle these types of problems in R, but since operations like combn seem to be too expensive on your real data, this may be more performant.
pids <- unique(df$pId)
result <- list()
for (x in pids) {
  x.vals <- df$fId[df$pId == x]
  for (y in pids[pids > x]) { # visit each unordered pair once
    y.vals <- df$fId[df$pId == y]
    together <- length(intersect(x.vals, y.vals))
    result[[length(result) + 1]] <- data.frame(pId1 = x, pId2 = y, together = together)
  }
}
df.new <- do.call(rbind, result)
  pId1 pId2 together
1    1    2        1
2    1    3        2
3    2    3        1
And here is a version that preallocates the size of the final data frame, which may be even more performant:
pids <- unique(df$pId)
n.pairs <- length(pids) * (length(pids) - 1) / 2
result <- data.frame(pId1 = rep(NA, n.pairs), pId2 = NA, together = NA)
row.num <- 1
for (x in pids) {
  x.vals <- df$fId[df$pId == x]
  for (y in pids[pids > x]) { # each unordered pair once, matching the n.pairs rows
    y.vals <- df$fId[df$pId == y]
    together <- length(intersect(x.vals, y.vals))
    result[row.num, 'pId1'] <- x
    result[row.num, 'pId2'] <- y
    result[row.num, 'together'] <- together
    row.num <- row.num + 1
  }
}
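If the loops are still too slow, one further idea is to build an fId-by-pId incidence matrix and let crossprod() count the shared fIds for all pairs at once. This is a sketch, not a benchmarked solution; it assumes the incidence matrix fits in memory, and the names inc, shared, idx and pairs are made up for illustration.

```r
df <- data.frame(
  pId = c(1, 1, 1, 1, 1, 2, 3, 3, 3, 3),
  fId = c(0, 108, 940, 972, 993, 0, 0, 32, 108, 176)
)

# Rows are fIds, columns are pIds; 1 if that pId has that fId, else 0
inc <- (unclass(table(df$fId, df$pId)) > 0) * 1

# Entry [i, j] is the number of fIds that pId i and pId j have in common
shared <- crossprod(inc)

# Keep each unordered pair once (upper triangle, excluding the diagonal)
idx <- which(upper.tri(shared), arr.ind = TRUE)
pairs <- data.frame(
  pId1     = as.numeric(rownames(shared)[idx[, "row"]]),
  pId2     = as.numeric(colnames(shared)[idx[, "col"]]),
  together = shared[idx]
)

pairs
```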
I tried asking this question before, but it was poorly stated. This is a new attempt because I haven't solved it yet.
I have a dataset with winners, losers, date, winner_points and loser_points.
For each row, I want two new columns, one for the winner and one for the loser that shows how many points they have scored so far (as both winners and losers).
Example data:
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
I want the output to be:
winner_points_sum <- c(0, 0, 1, 3, 1, 3, 5, 3, 5)
loser_points_sum <- c(0, 2, 2, 1, 4, 5, 4, 7, 4)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points, winner_points_sum, loser_points_sum)
How I've solved it thus far is to do a for loop such as:
library(dplyr)
test_data$winner_points_sum_loop <- 0
test_data$loser_points_sum_loop <- 0
for(i in row.names(test_data)) {
test_data[i,]$winner_points_sum_loop <-
(
test_data %>%
dplyr::filter(winner == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(winner_points, na.rm = TRUE))
+
test_data %>%
dplyr::filter(loser == test_data[i,]$winner & date < test_data[i,]$date) %>%
dplyr::summarise(points = sum(loser_points, na.rm = TRUE))
)
}
test_data$winner_points_sum_loop <- unlist(test_data$winner_points_sum_loop)
Any suggestions on how to tackle this problem? The queries take quite some time as the number of rows grows. I've tried experimenting with the ave() function; I can use it on one column to sum a player's points as winner, but I can't figure out how to add their points as loser.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
test_data %>%
unite(winner, winner, winner_points) %>% # unite winner columns
unite(loser, loser, loser_points) %>% # unite loser columns
gather(type, pl_pts, winner, loser, -date) %>% # reshape
separate(pl_pts, c("player","points"), convert = T) %>% # separate columns
arrange(date) %>% # order dates (in case it's not)
group_by(player) %>% # for each player
mutate(sum_points = cumsum(points) - points) %>% # get points up to that date
ungroup() %>% # forget the grouping
unite(pl_pts_sumpts, player, points, sum_points) %>% # unite columns
spread(type, pl_pts_sumpts) %>% # reshape
separate(loser, c("loser", "loser_points", "loser_points_sum"), convert = T) %>% # separate columns and give appropriate names
separate(winner, c("winner", "winner_points", "winner_points_sum"), convert = T) %>%
select(winner, loser, date, winner_points, loser_points, winner_points_sum, loser_points_sum) # select the order you prefer
# # A tibble: 9 x 7
# winner loser date winner_points loser_points winner_points_sum loser_points_sum
# * <int> <int> <date> <int> <int> <int> <int>
# 1 1 3 2017-10-01 2 1 0 0
# 2 2 1 2017-10-02 1 0 0 2
# 3 3 1 2017-10-03 2 1 1 2
# 4 1 2 2017-10-04 1 0 3 1
# 5 2 1 2017-10-05 2 1 1 4
# 6 3 1 2017-10-06 1 0 3 5
# 7 1 3 2017-10-07 2 1 5 4
# 8 2 1 2017-10-08 1 0 3 7
# 9 3 2 2017-10-09 2 1 5 4
I finally understood what you want. I took the approach of computing the cumulative points of each player at each point in time and then joining the result to the original test_data data frame.
winner <- c(1,2,3,1,2,3,1,2,3)
loser <- c(3,1,1,2,1,1,3,1,2)
date <- c("2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05","2017-10-06","2017-10-07","2017-10-08","2017-10-09")
winner_points <- c(2,1,2,1,2,1,2,1,2)
loser_points <- c(1,0,1,0,1,0,1,0,1)
test_data <- data.frame(winner, loser, date = as.Date(date), winner_points, loser_points)
library(dplyr)
library(tidyr)
cum_points <- test_data %>%
gather(end_game_status, player_id, winner, loser) %>%
gather(which_point, how_many_points, winner_points, loser_points) %>%
filter(
(end_game_status == "winner" & which_point == "winner_points") |
(end_game_status == "loser" & which_point == "loser_points")) %>%
arrange(date) %>%
group_by(player_id) %>%
mutate(cumulative_points = cumsum(how_many_points)) %>%
mutate(cumulative_points_sofar = lag(cumulative_points, default = 0)) %>% # points scored before this game
ungroup() %>%
select(player_id, date, cumulative_points_sofar) # keep only what the joins below need
output <- test_data %>%
left_join(cum_points, by = c('date', 'winner' = 'player_id')) %>%
rename(winner_points_sum = cumulative_points_sofar) %>%
left_join(cum_points, by = c('date', 'loser' = 'player_id')) %>%
rename(loser_points_sum = cumulative_points_sofar)
output
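The same idea (stack winners and losers into one long table, accumulate per player, join back) can be written a little more compactly. This is a sketch under the assumption that date uniquely identifies each game, as in the example data; the names long, role, player, points and points_sum are introduced only for illustration.

```r
library(dplyr)

test_data <- data.frame(
  winner = c(1,2,3,1,2,3,1,2,3),
  loser  = c(3,1,1,2,1,1,3,1,2),
  date   = as.Date("2017-10-01") + 0:8,
  winner_points = c(2,1,2,1,2,1,2,1,2),
  loser_points  = c(1,0,1,0,1,0,1,0,1)
)

# One row per player per game
long <- bind_rows(
  transmute(test_data, date, role = "winner", player = winner, points = winner_points),
  transmute(test_data, date, role = "loser",  player = loser,  points = loser_points)
) %>%
  arrange(date) %>%
  group_by(player) %>%
  mutate(points_sum = cumsum(points) - points) %>%  # points scored before this game
  ungroup()

# Join the running totals back, once per role
result <- test_data %>%
  left_join(long %>% filter(role == "winner") %>%
              select(date, winner_points_sum = points_sum), by = "date") %>%
  left_join(long %>% filter(role == "loser") %>%
              select(date, loser_points_sum = points_sum), by = "date")

result
```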
The difference to the previous question of the OP is that the OP is now asking for the cumulative sum of points each player has scored so far, i.e., before the actual date. Furthermore, the sample data set now contains a date column which uniquely identifies each row.
So, my previous approach can be used here as well, with some modifications. The solution below reshapes the data from wide to long format, reshaping two value variables simultaneously, computes the cumulative sums for each player id, and finally reshapes from long back to wide format. To sum only the points scored before the actual date, the points are lagged by one row.
It is important to note that the winner and loser columns contain the respective player ids.
library(data.table)
cols <- c("winner", "loser")
setDT(test_data)[
# reshape multiple value variables simultaneously from wide to long format
, melt(.SD, id.vars = "date",
measure.vars = list(cols, paste0(cols, "_points")),
value.name = c("id", "points"))][
# rename variable column
, variable := forcats::lvls_revalue(variable, cols)][
# order by date and cumulate the lagged points by id
order(date), points_sum := cumsum(shift(points, fill = 0)), by = id][
# reshape multiple value variables simultaneously from long to wide format
, dcast(.SD, date ~ variable, value.var = c("id", "points", "points_sum"))]
date id_winner id_loser points_winner points_loser points_sum_winner points_sum_loser
1: 2017-10-01 1 3 2 1 0 0
2: 2017-10-02 2 1 1 0 0 2
3: 2017-10-03 3 1 2 1 1 2
4: 2017-10-04 1 2 1 0 3 1
5: 2017-10-05 2 1 2 1 1 4
6: 2017-10-06 3 1 1 0 3 5
7: 2017-10-07 1 3 2 1 5 4
8: 2017-10-08 2 1 1 0 3 7
9: 2017-10-09 3 2 2 1 5 4