I am working with a dataset that has a participant ID, the date of the exam, and the date of the biopsy. There are multiple observations per participant. The dataset looks like this:
df <- data.frame(ID = c("A", "A", "B", "B", "B", "C", "C", "C", "C"),
date_of_exam = c("2020-05-03", "2020-07-08", "2020-04-12", "2020-04-01", "2020-03-12", "2020-08-12", "2020-08-10", "2020-09-12", "2020-10-01"),
date_of_biopsy = c("2020-05-01", "2020-07-06", "2020-04-15", "2020-04-03", "2020-03-09", "2020-08-15", "2020-08-09", "2020-09-14", "2020-10-05"))
Whenever the date of exam is before the date of biopsy (date_of_exam minus date_of_biopsy < 0), I want to use the next higher date of the exam.
How can I create a for loop (or something else) that checks the condition (date_of_exam minus date_of_biopsy < 0); if it is true, chooses the next higher exam date and checks the condition again; if the condition is no longer met, takes that value, and otherwise moves on to the next higher value, and so on?
I was thinking about chaining multiple ifelse() conditions in dplyr:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(ID) %>%
mutate(exam_2nd_value = ifelse(test = (df$data_of_exam - df_date_of_biopsy < 0) == TRUE,
                               yes = df$date_of_exam[min(n(), 2)],
                               no = df$date_of_exam[min(n(), 1)])) %>%
ungroup()
And then some more ifelse calls, but I feel like this is not a good approach and is probably not going to work.
Also, when I use the code mentioned above, I get this error:
Error: Problem with `mutate()` input `new`. x Input `new` can't be recycled to size 3.
Can you tell me why I get this error and what I can do to solve my problem?
Thanks a lot,
Phil
If I am understanding the problem correctly, something like this should work:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(date_of_exam = as.Date(date_of_exam),
         date_of_biopsy = as.Date(date_of_biopsy)) %>%
  arrange(date_of_exam) %>%
  mutate(lead_exam_date = lead(date_of_exam),
         exam_2nd_value = if_else(date_of_exam > date_of_biopsy,
                                  date_of_exam,
                                  lead_exam_date)) %>%
  ungroup()
Explanation
group by ID
convert all strings to Date objects (which internally are just day counts since the epoch and can thus be added/subtracted like numbers)
arrange by date in each group (lowest first)
use lead to shift the date_of_exam column up by one for each group
do the comparison you want: if the exam date is greater than the biopsy date (i.e., the exam falls after the biopsy), use the current exam date; otherwise use the next exam date (the one created using lead)
This will give you:
ID date_of_exam date_of_biopsy lead_exam_date exam_2nd_value
<chr> <date> <date> <date> <date>
1 B 2020-03-12 2020-03-09 2020-04-01 2020-03-12
2 B 2020-04-01 2020-04-03 2020-04-12 2020-04-12
3 B 2020-04-12 2020-04-15 NA NA
4 A 2020-05-03 2020-05-01 2020-07-08 2020-05-03
5 A 2020-07-08 2020-07-06 NA 2020-07-08
6 C 2020-08-10 2020-08-09 2020-08-12 2020-08-10
7 C 2020-08-12 2020-08-15 2020-09-12 2020-09-12
8 C 2020-09-12 2020-09-14 2020-10-01 2020-10-01
9 C 2020-10-01 2020-10-05 NA NA
I'm not sure, based on the way the original problem is framed, what to do when there is no next exam date (hence the NAs). To deal with those, lead has a default argument which will be supplied when there is no next value.
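A minimal illustration of that default argument (the fallback date here is arbitrary):

library(dplyr)
lead(as.Date(c("2020-01-01", "2020-02-01")), default = as.Date("2020-03-01"))
# [1] "2020-02-01" "2020-03-01"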
As for the error you encountered: I couldn't reproduce it, but I did get an error for trying to subtract two strings (the two dates in the mutate call). Yours might be the same issue expressed differently due to version differences or something similar.
Additionally, ifelse doesn't work well here, not because it isn't vectorized (it is), but because it drops attributes such as the Date class and returns bare numbers; you want if_else, which is type-stable and keeps dates as dates (the similar names are super confusing).
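A quick demonstration of the difference:

ifelse(TRUE, as.Date("2020-05-03"), as.Date("2020-07-08"))
# [1] 18385  (the Date class is dropped; you get the underlying day count)
dplyr::if_else(TRUE, as.Date("2020-05-03"), as.Date("2020-07-08"))
# [1] "2020-05-03"  (the Date class is preserved)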
Finally, I don't think the calls to the original dataframe with df$ would have worked since those aren't aware of the grouping you imposed earlier in the chain.
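One more thought: the question as phrased describes repeatedly advancing until the condition holds, whereas lead only looks one exam ahead. If you really need that iterative behaviour, here is a hedged sketch that jumps straight to the earliest exam on or after each biopsy date (assuming "on or after" is the acceptance condition; swap >= for > if same-day exams should not count):

library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(across(c(date_of_exam, date_of_biopsy), as.Date)) %>%
  mutate(exam_2nd_value = as.Date(vapply(date_of_biopsy, function(b) {
    ok <- date_of_exam[date_of_exam >= b]          # exams not earlier than the biopsy
    if (length(ok)) as.numeric(min(ok)) else NA_real_
  }, numeric(1)), origin = "1970-01-01")) %>%
  ungroup()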
Related
I'm trying to visualize some bird data; however, after grouping by month, the resulting output is out of order relative to the original data. The original is in order for December, January, February, and March, but after the manipulation it comes out as December, February, January, March. Any ideas how I can fix this or sort the rows?
This is the code:
library(dplyr)
library(tidyr)

BirdDataTimeClean <- BirdDataTimes %>%
  group_by(Date) %>%
  summarise(Gulls = sum(Gulls), Terns = sum(Terns), Sandpipers = sum(Sandpipers),
            Plovers = sum(Plovers), Pelicans = sum(Pelicans), Oystercatchers = sum(Oystercatchers),
            Egrets = sum(Egrets), PeregrineFalcon = sum(Peregrine_Falcon), BlackPhoebe = sum(Black_Phoebe),
            Raven = sum(Common_Raven))

BirdDataTimeClean2 <- BirdDataTimeClean %>%
  pivot_longer(!Date, names_to = "Species", values_to = "Count")
You haven't shared any workable data, but I run into this a lot when reading from CSV: all the dates and data come in as character. As suggested, convert the date column to the Date class using the lubridate package or base as.Date(); then arrange() in dplyr will work, and so will group_by().
Example with toy data:

library(data.table)
birds <- data.table(dates = c("2020-Feb-20", "2020-Jan-20", "2020-Dec-20", "2020-Apr-20"),
                    species = c("Gulls", "Turns", "Gulls", "Sandpiper"),
                    Counts = c(20, 30, 40, 50))
str(birds) will show that dates is character (and I have deliberately not kept the rows in order).
Using lubridate, convert the dates; birds$dates %>% lubridate::ymd() changes them to the Date data type:

birds$dates %>% ymd() %>% str()
# Date[1:4], format: "2020-02-20" "2020-01-20" "2020-12-20" "2020-04-20"

Save it with birds$dates <- ymd(birds$dates), or do it in your pipeline as follows. Now simply do the dplyr analysis:
birds %>%
  group_by(Months = ymd(dates)) %>%
  summarise(N = n(),
            Species_Count = sum(Counts)) %>%
  arrange(Months)

This gives:
# A tibble: 4 x 3
  Months          N Species_Count
  <date>      <int>         <dbl>
1 2020-01-20      1            30
2 2020-02-20      1            20
3 2020-04-20      1            50
4 2020-12-20      1            40
However, if you want Apr, Jan, etc. instead of numbers and apply as.Date() with a format string, the dates become character again. I would suggest keeping your data as Date and formatting it only when presenting output to others, with as.Date()/format() or, if you are using DT or other data-table widgets, their output formatting options. That way your original data remains intact and users still see what they want.
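A sketch of that suggestion on the toy data: keep Months as a real Date inside the pipeline and only create the pretty character form at the very end.

birds %>%
  group_by(Months = ymd(dates)) %>%
  summarise(N = n(),
            Species_Count = sum(Counts)) %>%
  arrange(Months) %>%                            # sorts correctly, because Months is still a Date
  mutate(Months = format(Months, "%Y-%b-%d"))    # character, for display only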
This, for example, makes it character again:
birds %>%
  group_by(Months = as.character.Date(dates)) %>%
  summarise(N = n(),
            Species_Count = sum(Counts)) %>%
  arrange(Months)
# A tibble: 4 x 3
Months N Species_Count
<chr> <int> <dbl>
1 2020-Apr-20 1 50
2 2020-Dec-20 1 40
3 2020-Feb-20 1 20
4 2020-Jan-20 1 30
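And if what you ultimately want is month names that still sort in calendar order, lubridate's month(label = TRUE) returns an ordered factor, which avoids the character-sorting trap entirely; a sketch on the same toy data:

birds %>%
  group_by(Month = month(ymd(dates), label = TRUE)) %>%
  summarise(N = n(),
            Species_Count = sum(Counts)) %>%
  arrange(Month)   # Jan < Feb < ... < Dec, regardless of spelling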
I have a dataframe containing repeated measurements of a number of variables for a large number of samples. I would like to look at each variable / sample combination, compare the measurement dates with the dates in another column, and keep only the row with the minimum date difference.
start_date <- as.Date('2021-01-01')
end_date <- as.Date('2021-12-30')
set.seed(1984)
# The dataframe looks something like this
cheese <- data.frame(sample_id = c('1','1','1','1','2','2','2','2','1','1','1','1','2','2','2','2'),
variable = c('a','a','b','b','a','a','b','b',
'a','a','b','b','a','a','b','b'),
value = runif(n = 16, min = 1, max = 10),
measurement_date = c('2021-05-04','2021-08-22','2021-05-04','2021-08-22',
'2021-05-04','2021-08-22','2021-05-04','2021-08-22',
'2021-05-04','2021-08-22','2021-05-04','2021-08-22',
'2021-05-04','2021-08-22','2021-05-04','2021-08-22'),
date2 = as.Date(sample( as.numeric(start_date): as.numeric(end_date), 16,
replace = T),
origin = '1970-01-01'))
And I'd like it to end up like this:
sample_id variable measurement_date date2
1 a 2021-05-04 2021-06-08
1 b 2021-05-04 2021-03-21
2 a 2021-05-04 2021-01-27
2 b 2021-05-04 2021-03-15
1 a 2021-08-22 2021-09-10
1 b 2021-08-22 2021-03-22
2 a 2021-08-22 2021-11-27
2 b 2021-08-22 2021-08-13
I know that the answer falls somewhere in the realm of this similar question, but my example has a different structure that I can't seem to wrap my head around. I've tried using dplyr's filter option, but my solution only returns a single row where the smallest difference across all samples occurs. It doesn't do this for every sample / variable combination.
library(dplyr)
filtered <- cheese %>%
  filter(abs(difftime(measurement_date, date2)) == min(abs(difftime(measurement_date, date2))))
Based on the comment from @caldwellst: I hadn't grouped the variables first, which is why I was getting a single value rather than one for each grouping:
library(dplyr)
filtered <- cheese %>%
  group_by(sample_id, variable, measurement_date) %>%
  filter(abs(difftime(measurement_date, date2)) == min(abs(difftime(measurement_date, date2))))
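As a side note, if you are on dplyr >= 1.0.0, slice_min() expresses the same idea a bit more directly; a sketch under the same grouping (with_ties = FALSE keeps a single row per group if two dates are equally close):

library(dplyr)
filtered <- cheese %>%
  group_by(sample_id, variable, measurement_date) %>%
  slice_min(abs(difftime(measurement_date, date2)), with_ties = FALSE) %>%
  ungroup()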
I want to use manual inputs to a QAQC 'log file' to update an existing dataframe. The following log file indicates date ranges (bounded by datetime_min and datetime_max) over which observations of a particular variable (or 'all' of them) should be omitted from the dataframe (set to NA).
library(tidyverse)
library(lubridate)
QC_log <- tibble(
variable = c("SpCond", "pH", "pH", "all"),
datetime_min = ymd_hms(c("2021-06-01 18:00:00","2021-07-19 18:00:00","2021-08-19 18:00:00","2021-11-23 18:00:00")),
datetime_max = ymd_hms(c("2021-06-02 18:00:00","2021-07-25 21:00:00","2021-08-19 20:00:00","2021-11-26 05:00:00"))
)
The log should modify the following example dataframe, removing observations for each variable that fall between the datetime min/max (for now I am not worried about 'all').
df <- tibble(
Datetime = ymd_hms(c("2021-06-01 17:00:00","2021-06-01 18:00:00","2021-06-01 19:00:00","2021-11-23 16:00:00","2021-11-23 17:00:00","2021-11-23 18:00:00")),
SpCond = c(220,225,224,230,231,235),
pH = c(7.8,7.9,8.0,7.7,7.8,7.7)
)
I have tried pmap like this:
df%>%
{pmap(QC_log, mutate(., ..1 = ifelse(Datetime > ..2 & Datetime < ..3, "NA", ..1)))}
I assumed pmap was taking ..1,2,3 from QC_log where ..1 is 'variable', ..2 is datetime_min, and ..3 is datetime_max, passing those as arguments into mutate one QC_log row at a time, which then conditionally replaces observations with NA if they fall into the specified date range.
I think I am having a hard time understanding non-standard evaluation and how arguments get passed through functions, among other things. Hopefully this is simple for now; I would eventually like this functionality to be more sophisticated (e.g., changing all observations to NA when variable = 'all', adding separate actions such as flagging data rather than omitting it, or omitting observations based on a specific criterion (e.g., "<10") rather than a date range).
You can do the following:
inner_join(
df %>% pivot_longer(cols=c("SpCond","pH")),
QC_log,
by=c("name" = "variable")
) %>%
filter((Datetime<datetime_min) | (Datetime>datetime_max)) %>%
select(Datetime, name, value) %>%
distinct() %>%
pivot_wider(id_cols = Datetime)
Output
Datetime SpCond pH
<dttm> <dbl> <dbl>
1 2021-06-01 17:00:00 220 7.8
2 2021-06-01 18:00:00 NA 7.9
3 2021-06-01 19:00:00 NA 8
4 2021-11-23 16:00:00 230 7.7
5 2021-11-23 17:00:00 231 7.8
6 2021-11-23 18:00:00 235 7.7
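Note that pivot_wider() above leans on its defaults, names_from = name and values_from = value, which are exactly the columns pivot_longer() created; the variable/timestamp combinations dropped by the filter come back as NA when widening.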
And here is a data.table approach:
library(data.table)

dcast(
  unique(
    melt(setDT(df), id = "Datetime")[setDT(QC_log), on = .(variable), allow.cartesian = TRUE, nomatch = 0] %>%
      .[(Datetime < datetime_min) | (Datetime > datetime_max), .(Datetime, variable, value)]
  ),
  Datetime ~ variable, value.var = "value"
)
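If you would rather stay close to the pmap idea from the question (and get the 'all' handling for free), here is a hedged sketch using purrr::reduce() to walk QC_log one row at a time; vars_all is an assumed list of the data columns, adjust it to your real dataframe:

library(tidyverse)

vars_all <- c("SpCond", "pH")  # assumption: every data column in df

df_qc <- reduce(
  seq_len(nrow(QC_log)),
  function(d, i) {
    # blank out the window from row i of QC_log, for one variable or all of them
    vars <- if (QC_log$variable[i] == "all") vars_all else QC_log$variable[i]
    d %>%
      mutate(across(all_of(vars),
                    ~ replace(.x, Datetime >= QC_log$datetime_min[i] &
                                   Datetime <= QC_log$datetime_max[i], NA)))
  },
  .init = df
)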
I am working with these data in R. These are the first six rows (not counting the first column, which the write.csv function always adds):
> head(my_data)
client_id contract_id contract_start contract_end inter_complex
1 1 15/07/2019 15/07/2020 18092+18458i
3 3 1/01/2015 1/01/2015 16436+16436i
5 5 12/06/2020 12/06/2020 18425+18425i
13 13 1/01/2015 1/01/2015 16436+16436i
18 18 1/01/2015 1/01/2015 16436+16436i
19 19 1/01/2015 1/01/2015 16436+16436i
Each row represents a different contract. The variable inter_complex is a complex number whose real part is the numeric representation of the date when a contract started, and whose imaginary part analogously represents the date when a contract ended. In case you're wondering, you can obtain that column by executing this:
library(tidyverse)
library(lubridate)
chars_2_cplex = function(start, end) {
cbind(start, end) %>%
apply(2, compose(as.numeric, dmy)) %*% rbind(1, 1i)
}
my_data %>% transmute(inter_complex = chars_2_cplex(contract_start, contract_end))
What I want is, for each client id and each contract, to identify how many contracts associated with that same client id intersect with that contract. In other words, I want to create a new column called simultaneous that depicts, for each row (i.e., for each contract), how many active contracts the corresponding client has during the period that the current contract is active. If no intersection with any other contract is found for a given contract, then the value of simultaneous should be 1, since while that contract is active it is also the client's only active contract.
I figured it would help to obtain the combinations of inter_complex, then turn those combinations of complex numbers into combinations of intervals, and then use lubridate's intersect function to discern whether or not each combination of intervals intersects. For that purpose, I have written the following code:
## This function turns complex numbers into intervals.
cplex_2_inter = function(x) {
start = x %>% Re() %>% as.integer()
end = x %>% Im() %>% as.integer()
interval(as_date(start), as_date(end))
}
## This other function returns a list whose j-th element is a data frame that shows the intersections
## between combinations of j + 1 intervals.
get_intersections = function(x) {
max_m = length(x)
output = vector(mode = "list", length = max_m - 1)
for (i in 2:max_m) {
output[[i - 1]] = combn(x, m = i) %>% t() %>% as.data.frame() %>%
mutate_all(cplex_2_inter) %>% rowid_to_column("id") %>%
pivot_longer(-id) %>% group_by(id) %>%
mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", i, 1L))
}
return(output)
}
In order to get a better grasp on what the function get_intersections does, I propose that you run the following:
example = my_data %>% filter(client_id == 1) %>% pull(inter_complex) %>% get_intersections()
The data frame example[[1]] shows whether there are intersections (or, for a better word, overlaps) between pairs of intervals. The data frame example[[2]] shows whether there are overlaps between groups of three intervals, and so on.
You may notice that, according to example[[1]], the interval 2019-07-15 UTC--2020-07-15 UTC overlaps with some other interval (hence its simultaneous value of 2), while according to example[[2]] that very same interval gets a simultaneous value of 3. Naturally, the idea is to assign to each interval its highest simultaneous value.
Since I do not care about global overlaps but rather about overlaps within each client id, I figured I would need to work on a grouped data frame. The furthest I got on this project was writing this:
my_data %>% group_by(client_id) %>% group_map(~ get_intersections(.x$inter_complex))
Now onto my questions:
1) I have executed the line above, but the process is not very efficient. It has already been running for more than a whole day and has not finished. Recently I came across the concept of interval trees, but I am not a computer scientist and would need help to tackle this problem in a smarter way.
2) If we stick to my not-so-smart approach, I would still need a function that accesses each element of the list returned by get_intersections, so as to identify and retrieve the highest simultaneous value associated with each interval. I would need help with that as well.
Edit
Regarding Wimpel's answer, I have examined their data table and I found this.
> DT %>% filter(client_id == 502 & contract_id == 3093) %>%
+   select(contract_start, contract_end, contract_intersect)
# Output
contract_start contract_end contract_intersect
1: 2018-01-11 2019-01-11 7
That is, the displayed contract allegedly overlaps with seven other contracts that the same client has.
On the other hand, let's see if this holds true when using my combinations-based approach.
combs_10_502 = my_data %>% filter(client_id == 502) %>% pull(inter_complex) %>%
combn(10) %>% t() %>% as.data.frame() %>% mutate_all(cplex_2_inter) %>%
rowid_to_column("id") %>% pivot_longer(-id) %>% group_by(id) %>%
mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
ungroup() %>%
mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", 10L, 1L))
> combs_10_502 %>% filter(simultaneous == 10) %>% slice(11:20)
# A tibble: 10 x 4
id name value simultaneous
<int> <chr> <Interval> <int>
1 24311 V1 2018-01-11 UTC--2019-01-11 UTC 10
2 24311 V2 2018-03-01 UTC--2019-03-01 UTC 10
3 24311 V3 2018-07-11 UTC--2019-07-11 UTC 10
4 24311 V4 2018-04-20 UTC--2019-04-20 UTC 10
5 24311 V5 2018-05-21 UTC--2019-05-21 UTC 10
6 24311 V6 2018-08-10 UTC--2019-08-10 UTC 10
7 24311 V7 2018-08-09 UTC--2019-08-09 UTC 10
8 24311 V8 2018-09-27 UTC--2019-09-27 UTC 10
9 24311 V9 2020-01-03 UTC--2021-01-03 UTC 10
10 24311 V10 2019-12-19 UTC--2020-12-19 UTC 10
The same contract is displayed on the first row of the tibble above. As can be seen, that contract actually overlaps with nine other contracts of the given client (the nine displayed on the remaining rows).
I don't know how Wimpel's solution got this one wrong, but I checked that it does get the number of intersections right for several other contracts. I now know that a data.table-based solution is what I am looking for, since it is very fast, but there seems to be an issue with the proposed solution.
I believe you are looking for something like this?
library(data.table)
DT <- fread("https://raw.githubusercontent.com/pazos-feren/Data/main/contracts.csv")
#set dates as real dates
DT[, contract_start := as.Date(contract_start, format = "%d/%m/%Y")]
DT[, contract_end := as.Date(contract_end, format = "%d/%m/%Y")]
setkey(DT, V1)
DT[DT, c("contract_intersect", "contract_intersect_ids") := {
val = DT[ !V1 == i.V1 & client_id == i.client_id &
contract_start <= i.contract_end & contract_end >= i.contract_start, ]
list( nrow(val), paste0(val$contract_id, collapse = ";") )
}, by = .EACHI]
# V1 client_id contract_id contract_start contract_end inter_complex contract_intersect contract_intersect_ids
# 1: 1 1 1 2019-07-15 2020-07-15 18092+18458i 2 4162;4168
# 2: 2 3 3 2015-01-01 2015-01-01 16436+16436i 0
# 3: 3 5 5 2020-06-12 2020-06-12 18425+18425i 0
# 4: 4 13 13 2015-01-01 2015-01-01 16436+16436i 0
# 5: 5 18 18 2015-01-01 2015-01-01 16436+16436i 0
# 6: 6 19 19 2015-01-01 2015-01-01 16436+16436i 0
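Following up on the interval-tree idea from the question: data.table's foverlaps() performs an indexed overlap join, which may scale better than subsetting DT once per row as above. A hedged sketch, assuming DT already has contract_start/contract_end converted to Date as in the code above (contracts with zero overlaps simply drop out of counts):

library(data.table)

setkey(DT, client_id, contract_start, contract_end)
ov <- foverlaps(DT, DT, nomatch = NULL)    # pair each contract with every overlapping contract of the same client
ov <- ov[contract_id != i.contract_id]     # drop each contract's match with itself
counts <- ov[, .(contract_intersect = .N), by = .(client_id, i.contract_id)]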
I am trying to move entries in my date column around to set them up as start/stop times. The first row for each id is set up fine, but I need to shift the rest down and over for this to work.
I have tried using dplyr to mutate the entries into new columns, but the problem is that the time entry is in another column, so I am trying to work around that.
#This is what my data looks like
mydata<-data.frame(id=c(rep(1,3),rep(2,2)),baseline=c(rep("2018-07-14",3),
rep("2018-06-16",2)),
date=c("2018-08-23","2018-09-20","2018-10-05","2018-07-04","2018-08-08"))
head(mydata)
expecteddata<-data.frame(id=c(rep(1,3),rep(2,2)),
start=c("2018-07-14","2018-08-23","2018-09-20","2018-06-16","2018-07-04"),
end=c("2018-08-23","2018-09-20","2018-10-05","2018-07-04","2018-08-08"))
head(expecteddata)
This is what I am hoping to get. It might also be nice to increment the start times, since different rows would belong to different risk sets, but that is a separate issue. Any help or pointers on how to proceed would be greatly appreciated.
Ensure that the date variables are the correct class and try:
library(dplyr)
mydata %>%
group_by(id) %>%
mutate(baseline = lag(date, default = first(baseline))) %>%
rename(start = baseline, end = date)
# A tibble: 5 x 3
# Groups: id [2]
id start end
<dbl> <date> <date>
1 1 2018-07-14 2018-08-23
2 1 2018-08-23 2018-09-20
3 1 2018-09-20 2018-10-05
4 2 2018-06-16 2018-07-04
5 2 2018-07-04 2018-08-08
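For completeness, a sketch of the class conversion the answer assumes (the columns in mydata are character as constructed above):

library(dplyr)
mydata %>%
  mutate(across(c(baseline, date), as.Date)) %>%   # make both columns real Dates first
  group_by(id) %>%
  mutate(baseline = lag(date, default = first(baseline))) %>%
  rename(start = baseline, end = date)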