x <- data.frame(ID = c(1,2,3,4),
Line_name = c("AB", "CD", "AB", "CD"),
start_dt = c("12/1/2020", "2/1/2021", "2/1/2021", "3/1/2021"),
end_dt = c("4/1/2021", "4/1/2021", "3/1/2021", "4/1/2021"))
ID Line_name start_dt end_dt
1 AB 12/1/2020 4/1/2021
2 CD 2/1/2021 4/1/2021
3 AB 2/1/2021 3/1/2021
4 CD 3/1/2021 4/1/2021
I have a dataframe that looks like this. It has items that are used within a date range (start date to end date). I need to count how often each item is used in every month. The resulting output would look something like this.
Line_name Jan2021 Feb2021 Mar2021 Apr2021
1 AB 1 2 2 1
2 CD 0 1 2 2
In Jan, only AB was used. For ID 1, the date range runs from Jan to April, so we would need to count that row for every month from Jan to April.
I am not sure how to do it. I was thinking that, for January for instance, I would check whether the date 1/1/2021 falls within start_dt and end_dt, and if that condition is true, then count the row.
(date %within% interval(start_dt, end_dt))
An option is to get a sequence of dates by month between the 'start_dt', and 'end_dt' columns with map2 into a list, then unnest the list column, get the count and reshape back from 'long' to 'wide' with pivot_wider
library(lubridate)
library(dplyr)
library(tidyr)
library(purrr) # provides map2
x %>%
transmute(Line_name, Year_month = map2(mdy(start_dt), mdy(end_dt),
~ format(seq(.x, .y, by = '1 month'), '%b%Y'))) %>%
unnest(c(Year_month)) %>%
count(Line_name,
Year_month = factor(Year_month, levels = unique(Year_month))) %>%
pivot_wider(names_from = Year_month, values_from = n, values_fill = 0)
-output
# A tibble: 2 x 5
Line_name Jan2021 Feb2021 Mar2021 Apr2021
<chr> <int> <int> <int> <int>
1 AB 1 2 2 1
2 CD 0 1 2 2
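The check described in the question (does the first day of a month fall inside a row's date range?) can also be sketched in base R without extra packages. This is a minimal sketch, not the answer's method: it assumes a month counts as "used" only if its first day lies within [start_dt, end_dt], which holds for this data because every date is the first of a month. The names start, end, months and counts are introduced here for illustration. Because ID 1 starts on 12/1/2020 in the data as given, the result also contains a Dec2020 column.

```r
# Base R sketch: test each month start against every [start_dt, end_dt] range.
x <- data.frame(ID = c(1, 2, 3, 4),
                Line_name = c("AB", "CD", "AB", "CD"),
                start_dt = c("12/1/2020", "2/1/2021", "2/1/2021", "3/1/2021"),
                end_dt = c("4/1/2021", "4/1/2021", "3/1/2021", "4/1/2021"))

start <- as.Date(x$start_dt, format = "%m/%d/%Y")
end   <- as.Date(x$end_dt,   format = "%m/%d/%Y")

# First day of every month spanned by the data
months <- seq(min(start), max(end), by = "month")

# For each month, count the rows per Line_name whose range contains it
counts <- sapply(months, function(m) {
  tapply(start <= m & m <= end, x$Line_name, sum)
})
colnames(counts) <- format(months, "%b%Y")
counts
```

The result is a matrix with one row per Line_name and one column per month; as.data.frame(counts) converts it if a data frame is needed.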
Related
I was working on the following problem. I've got monthly data from a survey, let's call it df1:
df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))
ID reported_value anchor_month
1 1200 3
2 31000 5
So, the first row was reported in March, but there's no way to know whether it reports the March or the February value, and it may also be only an approximation of the real value. I also have a table with the actual values for each ID, let's call it df2:
df2 = tibble( ID = c('1', '2') %>% rep(4) %>% sort,
real_value = c(1200,1230,11000,10,25000,3100,100,31030),
month = c(1,2,3,4,2,3,4,5))
ID real_value month
1 1200 1
1 1230 2
1 11000 3
1 10 4
2 25000 2
2 3100 3
2 100 4
2 31030 5
So there are two challenges: first, I only care about the anchor month OR the month before the anchor month for each ID; second, I want to match to the closest value (sounds like a fuzzy join). My first challenge was to filter my second table so it only has the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month- 1)))
df2 = df2 %>%
inner_join(filter_aux , by=c('ID', 'month' = 'anchor_month')) %>% distinct(ID, real_value, month)
Reducing df2 to:
ID real_value month
1 1230 2
1 11000 3
2 100 4
2 31030 5
Now I tried a difference_inner_join by ID and reported_value = real_value (df1 %>% difference_inner_join(df2, by = c('ID', 'reported_value' = 'real_value'))), but it throws a "non-numeric argument to binary operator" error, I'm guessing because ID is a string in my actual data. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
1 1200 3 1230 2
2 31000 5 31030 5
Thanks!
It was easier without fuzzy_join:
df3 = df1 %>% left_join(df2 , by='ID') %>%
mutate(dif = abs(real_value - reported_value)) %>%
group_by(ID) %>% filter(dif == min(dif))
Output:
ID reported_value anchor_month real_value month dif
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1200 3 1230 2 30
2 2 31000 5 31030 5 30
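For reference, the whole task (restrict to the anchor month or the month before it, then keep the closest real value per ID) can be sketched in base R in one pass. This is a sketch under assumptions: the data frames are rebuilt from the question with data.frame instead of tibble, and the names m, prev_month and closest are introduced here for illustration.

```r
# Rebuild the question's data
df1 <- data.frame(ID = c("1", "2"),
                  reported_value = c(1200, 31000),
                  anchor_month = c(3, 5))
df2 <- data.frame(ID = rep(c("1", "2"), each = 4),
                  real_value = c(1200, 1230, 11000, 10, 25000, 3100, 100, 31030),
                  month = c(1, 2, 3, 4, 2, 3, 4, 5))

# Pair every reported value with every real value of the same ID
m <- merge(df1, df2, by = "ID")

# Keep only the anchor month or the month before it (January wraps to December)
prev_month <- ifelse(m$anchor_month == 1, 12, m$anchor_month - 1)
m <- m[m$month == m$anchor_month | m$month == prev_month, ]

# Within each ID, keep the row(s) with the smallest absolute difference
m$dif <- abs(m$real_value - m$reported_value)
closest <- m[m$dif == ave(m$dif, m$ID, FUN = min), ]
closest
```

Ties are kept, so an ID can return more than one row if two real values are equally close.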
I have a dataframe with two pairs of start and stop dates that looks like this:
ID G1_START G1_END G2_START G2_END LOCATION
1 1/1/2021 5/31/2021 2/1/2021 5/31/2021 A
2 12/1/2020 3/31/2021 10/1/2020 5/31/2021 B
What I would like to do is create one row per month per patient where the months overlap between the four dates. For example
ID MONTH ACTIVE LOCATION
1 2/1/2021 1 A
1 3/1/2021 1 A
1 4/1/2021 1 A
1 5/1/2021 1 A
2 12/1/2020 1 B
2 1/1/2021 1 B
2 2/1/2021 1 B
2 3/1/2021 1 B
Where active means the ID was on both G1 and G2 during these months.
Here is a method in tidyverse
Reshape data from wide to long format - pivot_longer
Convert the date columns 'START', 'END' to Date class (mdy)
Loop over the 'START', 'END' with map2, get the sequence by '1 month'
Floor the date by month - floor_date
grouped by ID, LOCATION, MONTH, filter the groups where 'Categ' distinct elements are 2
Create 'ACTIVE' column of 1 after returning the distinct rows
library(dplyr)
library(tidyr)
library(lubridate)
library(purrr)
pivot_longer(df1, cols = contains("_"),
names_to = c("Categ", ".value"), names_sep= "_") %>%
transmute(ID, LOCATION, Categ, MONTH = map2(mdy(START), mdy(END), ~
floor_date(seq(.x, .y, by = '1 month'), 'month'))) %>%
unnest(MONTH) %>%
group_by(ID, LOCATION, MONTH) %>%
filter(n_distinct(Categ) == 2) %>%
ungroup %>%
distinct(ID, LOCATION, MONTH) %>%
mutate(ACTIVE = 1) %>%
select(ID, MONTH, ACTIVE, LOCATION)
-output
# A tibble: 8 x 4
ID MONTH ACTIVE LOCATION
<int> <date> <dbl> <chr>
1 1 2021-02-01 1 A
2 1 2021-03-01 1 A
3 1 2021-04-01 1 A
4 1 2021-05-01 1 A
5 2 2020-12-01 1 B
6 2 2021-01-01 1 B
7 2 2021-02-01 1 B
8 2 2021-03-01 1 B
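Since only the months common to both ranges are wanted, the overlap can also be computed directly: it starts at the later of the two start dates and ends at the earlier of the two end dates. The following base R sketch illustrates that idea. It is not the method above, and it assumes a month counts as active if any day of it falls inside the overlap, which can differ from the sequence-based approach when a start date falls mid-month. The names d, ov_start, ov_end, month_start and result are introduced here for illustration.

```r
df1 <- data.frame(ID = 1:2,
                  G1_START = c("1/1/2021", "12/1/2020"),
                  G1_END = c("5/31/2021", "3/31/2021"),
                  G2_START = c("2/1/2021", "10/1/2020"),
                  G2_END = c("5/31/2021", "5/31/2021"),
                  LOCATION = c("A", "B"))

# Parse the four date columns
d <- lapply(df1[c("G1_START", "G1_END", "G2_START", "G2_END")],
            as.Date, format = "%m/%d/%Y")

# The overlap runs from the later start to the earlier end
ov_start <- pmax(d$G1_START, d$G2_START)
ov_end   <- pmin(d$G1_END, d$G2_END)
month_start <- as.Date(format(ov_start, "%Y-%m-01"))  # floor to the first of the month

rows <- lapply(seq_len(nrow(df1)), function(i) {
  if (ov_start[i] > ov_end[i]) return(NULL)  # the two ranges do not overlap
  data.frame(ID = df1$ID[i],
             MONTH = seq(month_start[i], ov_end[i], by = "month"),
             ACTIVE = 1,
             LOCATION = df1$LOCATION[i])
})
result <- do.call(rbind, rows)
result
```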
data
df1 <- structure(list(ID = 1:2, G1_START = c("1/1/2021", "12/1/2020"
), G1_END = c("5/31/2021", "3/31/2021"), G2_START = c("2/1/2021",
"10/1/2020"), G2_END = c("5/31/2021", "5/31/2021"), LOCATION = c("A",
"B")), class = "data.frame", row.names = c(NA, -2L))
I have a problem with a time series which I don't know how to solve.
I have a tibble with 4 different variables. In my real dataset there are over 10,000 documents.
document date author label
1 2018-04-05 Mr.X 1
2 2018-02-05 Mr.Y 0
3 2018-04-17 Mr.Z 1
My problem is that, as a first step, I want to count the articles that occur in each month of each year across my time series. I know that I can filter for a specific month in a year like this:
tibble%>%
filter(date > "2018-02-01" & date < "2018-02-28")
The result of this would be a tibble with 1 observation, but my problem is that I have 360 different time periods in my data. Can I write a function to solve this, or do I need to do 360 separate calculations?
The best solution for me would be a table with 360 columns, where each column holds the number of articles counted in that month. Is this possible?
Thank you so much in advance.
If you want each result into a separate list, you can do something like this
suppressMessages(library(dplyr))
df %>% mutate(date = as.Date(date)) %>%
group_split(substr(date, 1, 7), .keep = FALSE)
<list_of<
tbl_df<
document: integer
date : date
author : character
label : integer
>
>[2]>
[[1]]
# A tibble: 1 x 4
document date author label
<int> <date> <chr> <int>
1 2 2018-02-05 Mr.Y 0
[[2]]
# A tibble: 2 x 4
document date author label
<int> <date> <chr> <int>
1 1 2018-04-05 Mr.X 1
2 3 2018-04-17 Mr.Z 1
You can further use list2env() to save each element of this list as a separate object.
To count the number of rows for each month-year combination, in tidyverse you can do :
library(dplyr)
library(tidyr)
df %>%
mutate(date = as.Date(date),
year_mon = format(date, '%Y-%m')) %>%
select(year_mon) %>%
pivot_wider(names_from = year_mon, values_from = year_mon,
values_fn = length, values_fill = 0)
# `2018-04` `2018-02`
# <int> <int>
#1 2 1
and in base R :
df$date <- as.Date(df$date)
table(format(df$date, '%Y-%m'))
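Both versions above only report months that actually occur in the data. If every month in the series should appear, including those with no articles, one possible approach is to build the full month sequence first and count against it as factor levels. This is a base R sketch: the sample data is rebuilt from the question, and the names dates, first_month, all_months and monthly are introduced here for illustration.

```r
df <- data.frame(document = 1:3,
                 date = c("2018-04-05", "2018-02-05", "2018-04-17"),
                 author = c("Mr.X", "Mr.Y", "Mr.Z"),
                 label = c(1L, 0L, 1L))

dates <- as.Date(df$date)

# Every month from the first to the last observation, as "YYYY-MM" labels
first_month <- as.Date(format(min(dates), "%Y-%m-01"))
all_months  <- format(seq(first_month, max(dates), by = "month"), "%Y-%m")

# Using the full sequence as factor levels keeps empty months in the table
monthly <- table(factor(format(dates, "%Y-%m"), levels = all_months))
monthly
```

The table can be turned into a one-row data frame with as.data.frame(t(as.matrix(monthly))) if one column per month is preferred.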
I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had a maximum of 3 different types of medication administered within 365 days of each other. I am not sure whether it is best to use days of life or the date to get to this expected outcome. Any help is appreciated.
An option would be to convert the 'date' to Date class, grouped by 'ID', get the absolute difference of 'date' and the lag of the column, check whether it is greater than 365, create a grouping index with cumsum, get the number of distinct elements of 'meds' in summarise
library(dplyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
units = 'days')))) %>%
group_by(grp = cumsum(diffd > 365), .add = TRUE) %>%
summarise(N = n_distinct(meds)) %>%
group_by(ID) %>%
summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
You can try:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = as.Date(date),
lag_date = abs(date - lag(date)) <= 365,
lead_date = abs(date - lead(date)) <= 365) %>%
mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
filter(coalesce(lag_date, lead_date)) %>%
summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1
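Another way to read the question is as a sliding window: for each administration, look at everything given to the same ID in the following 365 days and count the distinct meds, then take the largest count per ID. The base R sketch below follows that reading, which is an assumption about what "within 365 days of each other" means. The helper max_meds and the name res are hypothetical, introduced here for illustration.

```r
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
  date = as.Date(c("2003-11-24", "2003-11-24", "2004-01-09", "2013-11-25",
                   "2013-11-26", "2013-11-26", "2008-06-05", "2008-04-07",
                   "2014-11-25", "2014-12-22")),
  meds = c("lasiks", "vigab", "lacos", "pheno", "vigab", "lasiks",
           "pheno", "vigab", "pheno", "pheno"))

# Hypothetical helper: largest number of distinct meds in any window
# that starts at an observed date and spans `window` days
max_meds <- function(d, m, window = 365) {
  max(sapply(seq_along(d), function(i) {
    in_window <- d >= d[i] & d <= d[i] + window
    length(unique(m[in_window]))
  }))
}

# Apply the helper to each ID separately
res <- sapply(split(df, df$ID), function(g) max_meds(g$date, g$meds))
res
```

Starting a window at each observed date is enough, because any 365-day window can be shifted forward to its first observation without losing rows. The loop is quadratic in the number of rows per ID, which is fine for small groups.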
I have two data frames, shown below:
DF_1
ID Date1
1 12/01/2017
2 15/02/2017
3 18/03/2017
DF_2
ID Date1
1 05/01/2017
1 15/01/2017
1 18/01/2017
2 10/02/2017
2 13/02/2017
2 15/02/2017
3 22/03/2017
For each ID in DF_1, I want to calculate the difference between its date and the most recent earlier date for the same ID in DF_2.
For example: for ID=1 the date in DF_1 is 12-01-2017, and the most recent earlier date in DF_2 is 05-01-2017, because 15 and 18 are both later than the DF_1 date.
Required Output:
ID Date1 Count
1 12/01/2017 7
2 15/02/2017 0
3 18/03/2017 -4
The following reproduces your expected output:
library(tidyverse);
df1 <- read.table(text =
"ID Date1
1 12/01/2017
2 15/02/2017
3 18/03/2017", header = T) %>%
mutate(Date1 = as.Date(Date1, format = "%d/%m/%Y"));
df2 <- read.table(text =
"ID Date1
1 05/01/2017
1 15/01/2017
1 18/01/2017
2 10/02/2017
2 13/02/2017
2 15/02/2017
3 22/03/2017", header = T) %>%
mutate(Date1 = as.Date(Date1, format = "%d/%m/%Y"));
left_join(df1, df2, by = "ID") %>%
mutate(Count = Date1.x - Date1.y) %>%
group_by(ID) %>%
slice(ifelse(
all(Count < 0),
which.min(abs(Count)),
which.min(Count[Count >= 0]))) %>%
select(ID, Date1.x, Count)
## A tibble: 3 x 3
## Groups: ID [3]
# ID Date1.x Count
# <int> <date> <time>
#1 1 2017-01-12 7
#2 2 2017-02-15 0
#3 3 2017-03-18 -4
Explanation: Calculate the time difference between df1$Date1 and df2$Date1, group entries by ID, and keep only the row with the smallest non-negative time difference, unless all time differences are negative, in which case keep the row with the smallest absolute time difference.
I think your last row is wrong, since for ID=3 the df2 value is not before the df1 value. Assuming that reading is correct, you can do this...
df3 <- df2 %>% rename(Date2=Date1) %>%
left_join(df1) %>%
mutate(datediff=as.Date(Date1,format="%d/%m/%Y")-as.Date(Date2,format="%d/%m/%Y")) %>%
filter(datediff>=0) %>%
group_by(ID) %>%
summarise(Date1=first(Date1),Count=min(datediff))
df3
ID Date1 Count
1 1 12/01/2017 7
2 2 15/02/2017 0
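The same "most recent earlier date" lookup can be sketched in base R while keeping every ID from the first table. This is a minimal sketch, not either answer's method: it assumes that an ID with no earlier (or equal) date in df2 should get NA rather than a negative count, and it parses the dates up front. The name earlier is introduced here for illustration.

```r
df1 <- data.frame(ID = 1:3,
                  Date1 = as.Date(c("12/01/2017", "15/02/2017", "18/03/2017"),
                                  format = "%d/%m/%Y"))
df2 <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3),
                  Date1 = as.Date(c("05/01/2017", "15/01/2017", "18/01/2017",
                                    "10/02/2017", "13/02/2017", "15/02/2017",
                                    "22/03/2017"),
                                  format = "%d/%m/%Y"))

df1$Count <- sapply(seq_len(nrow(df1)), function(i) {
  # df2 dates for the same ID that are not later than the df1 date
  earlier <- df2$Date1[df2$ID == df1$ID[i] & df2$Date1 <= df1$Date1[i]]
  if (length(earlier) == 0) return(NA_real_)  # assumption: no earlier date gives NA
  as.numeric(df1$Date1[i] - max(earlier))     # days since the most recent one
})
df1
```

Unlike the filter-based version, ID 3 stays in the result, with Count set to NA.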