My problem is easy to explain:
I have one table with start dates and end dates and n rows ordered by start date (see image below; the yellow rows are the ones I want to collapse onto one unique row with the first start date and the last end date).
[Image: table with rows of consecutive date ranges]
I would like to regroup dates onto one row when the start date of row n+1 immediately follows the end date of row n (start date n+1 == end date n + 1). Here is an example of the result I need (image below):
[Image: the result I need]
I tried to use for loops that compare the two vectors of dates (vectors extracted from the columns), but it does not really work. I tried something like this to identify the start date and end date:
a <- sort(data$Date_debut)
b <- sort(data$Date_fin)
for (i in 1:(length(a) - 1)) {
  for (j in 2:length(a)) {
    datedeb <- a[j - 1]
    if (b[i] + 1 == a[j]) {
      # walk forward while the ranges stay consecutive
      while (b[i] + 1 == a[j]) {
        datefin <- b[i + 1]
        i <- i + 1
      }
    }
  }
}
(datedeb = start date, datefin = end date)
Thank you for your help, I am open to ideas / ways to deal with this.
Here is one approach using the tidyverse. For each Var1 group, create subgroups with an index that increments whenever the start date does not continue the previous row's end date (rows that chain together keep the same index). Then group by both Var1 and the index, and take the first start date and last end date as your date ranges.
library(tidyverse)

df %>%
  group_by(Var1) %>%
  mutate(i = cumsum(Start_date != lag(End_date, default = as.Date(-Inf)) + 1)) %>%
  group_by(i, .add = TRUE) %>%
  summarise(Start_date = first(Start_date), End_date = last(End_date)) %>%
  select(-i)
Output
Var1 Start_date End_date
<chr> <date> <date>
1 A 2019-01-02 2019-04-09
2 A 2019-10-11 2019-10-11
3 B 2019-12-03 2019-12-20
4 C 2019-12-29 2019-12-31
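If the indexing step is unclear, running just the grouping and mutate on the sample data (under Data below) shows the subgroup index. Within group A, i should come out as 1, 1, 1, 2, since the first three rows chain end-to-start and the 2019-10-11 row starts a new run, while B and C each get a single subgroup:
df %>%
  group_by(Var1) %>%
  mutate(i = cumsum(Start_date != lag(End_date, default = as.Date(-Inf)) + 1))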
Data
df <- structure(list(Var1 = c("A", "A", "A", "A", "B", "C"), Start_date = structure(c(17898,
17962, 17993, 18180, 18233, 18259), class = "Date"), End_date = structure(c(17961,
17992, 17995, 18180, 18250, 18261), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))
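If the real table is large, the same cumsum idea translates to data.table. A sketch, assuming the df above (the fill value just forces the first row of each group to start a new run):
library(data.table)
setDT(df)[, grp := cumsum(Start_date != shift(End_date, fill = Start_date[1] - 2) + 1), by = Var1][
  , .(Start_date = first(Start_date), End_date = last(End_date)), by = .(Var1, grp)][
  , grp := NULL][]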
I have two data frames:
df1 <- data.frame(ts = c('2020-01-15', '2020-01-16', '2020-01-17',
                         '2021-01-14', '2021-01-15', '2021-01-16',
                         '2021-01-24', '2021-01-25', '2021-01-26'),
                  aa_h = c(1, 2, 3, 6, 4, 5, 7, 9, 8),
                  bh = c(12, 13, 14, 11, 11, 11, 122, 12, 56))

df2_mx <- data.frame(ts = c('2020-01-17', '2021-01-16', '2021-01-26'),
                     aa = NA)
Now I want to compare the dates in df2_mx against df1 and, where they match, take the max value of aa_h over the current day and the two previous days and insert it into the aa column of df2_mx.
Example
The 1st row of df2_mx, '2020-01-17', matches the 3rd row of df1, so it should look two days back and take max(c(1, 2, 3)) --> 3, which goes into the aa column of df2_mx.
Expected Output:
df2_mx <- data.frame(ts = c('2020-01-17', '2021-01-16', '2021-01-26'),
                     aa = c(3, 6, 9))
Tryout Code
n <- 1
for (i in 1:nrow(df1)) {
  ifelse(which(as.Date(df1[i, 1]) == as.Date(df2_mx[, 1])),
         oh_df_mx[which(as.Date(df1[i, 1]) == as.Date(df2_mx[, 1])), n + 1] <- which.max(df1[(i - 2):i, 3]),
         invisible())
}
An option with the fuzzyjoin package:
library(dplyr)

df1 %>%
  mutate(ts = as.Date(ts)) %>%
  fuzzyjoin::fuzzy_right_join(df2_mx %>%
                                mutate(ts = as.Date(ts), ts_2_day = ts - 2),
                              by = c('ts', 'ts' = 'ts_2_day'),
                              match_fun = c(`<=`, `>=`)) %>%
  group_by(ts = ts.y) %>%
  summarise(aa_h = max(aa_h, na.rm = TRUE))
# ts aa_h
# <date> <dbl>
#1 2020-01-17 3
#2 2021-01-16 6
#3 2021-01-26 9
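If you prefer to avoid the extra package, a base R sketch of the same idea (this assumes every date in df2_mx occurs in df1 with at least two earlier rows, as in the example):
idx <- match(as.Date(df2_mx$ts), as.Date(df1$ts))
df2_mx$aa <- sapply(idx, function(i) max(df1$aa_h[(i - 2):i]))
# matches the expected aa = c(3, 6, 9)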
I want to find the first date greater than a given date in a column.
e.g.:
Pnp,Date1,Date2
A100,1/1/2020,1/1/2020
A100,1/1/2020,1/7/2020
A100,1/1/2020,1/1/2021
A100,1/1/2020,1/7/2021
Sample output:
Pnp,Date1,Date2,New Column
A100,1/1/2020,1/1/2020,1/7/2020
A100,1/1/2020,1/7/2020,1/7/2020
A100,1/1/2020,1/1/2021,1/7/2020
A100,1/1/2020,1/7/2021,1/7/2020
That is, based on the date in Date1, the first value in Date2 that is greater than Date1 should be put in the new column.
Sample code:
library(dplyr)
library(sqldf)

monthly_sequence_03 <- data.frame('Pnp' = 'A100', 'Frequency' = 3, 'Duration' = c('Month'),
                                  'Date1' = seq(as.Date('2020-01-01'), as.Date('2025-6-30'), by = '3 months'))
monthly_sequence_06 <- data.frame('Pnp' = 'A100', 'Frequency' = 6, 'Duration' = c('Month'),
                                  'Date2' = seq(as.Date('2020-01-01'), as.Date('2025-6-30'), by = '6 months'))

new_df <- sqldf("select a.*, b.Date2 from monthly_sequence_03 as a
                 left join monthly_sequence_06 as b
                 on a.pnp = b.pnp")

new_df <- new_df[order(new_df[, 3], new_df[, 4]), ]
Any help is highly appreciated.
I would calculate the lead of Date2 and join it back to your data frame.
new_df %>%
  left_join(new_df %>%
              transmute(Date2, Date3 = lead(Date2)) %>%
              distinct(),
            by = c("Date1" = "Date2"))
If you are trying to keep the greater of the two dates in those columns, write a quick function for that and apply it over the rows of the data frame to create the new column. That could be written like this:
new_df$Date <- as.Date(sapply(1:nrow(new_df), function(x) {
  Date1 <- new_df$Date1[x]
  Date2 <- new_df$Date2[x]
  if (Date1 > Date2) {
    return(Date1)
  } else {
    return(Date2)
  }
}), origin = "1970-01-01")
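As a side note, the row-wise loop above is equivalent to the vectorised pmax(), which keeps the Date class without the origin round-trip:
new_df$Date <- pmax(new_df$Date1, new_df$Date2)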
Thanks eastclintwood and Pceam. I combined the logic and added one part of my own, and it gave me the required result.
ppp <- filter(new_df, Date2 > Date1)
ere <- ppp %>%
  group_by(Pnp, Frequency, Duration, Date1) %>%
  mutate(new_Date_11 = first(Date2))
Thanks again.
I am using the sqldf library to manipulate data frames in R. Currently, I have a data frame like this:
ID Start_Date End_Date
1 08-29 09-01
I want to use sqldf to create a new data frame with the range of dates between Start_Date and End_Date. For example, for ID 1, I want the final data frame to look like:
ID Date_Range
1 08-29
1 08-30
1 08-31
1 09-01
I think I can just create a new data frame. But I am wondering if it is possible to implement in sqldf?
Here is one way to expand the date ranges using tidyverse functions.
library(dplyr)

df %>%
  mutate(across(ends_with('Date'), as.Date, '%m-%d'),
         # You don't need the line above if the columns are already Date/POSIXct
         Date_Range = purrr::map2(Start_Date, End_Date, seq, by = '1 day')) %>%
  tidyr::unnest(Date_Range) %>%
  mutate(Date_Range = format(Date_Range, '%m-%d')) %>%
  select(-Start_Date, -End_Date)
# ID Date_Range
# <int> <chr>
#1 1 08-29
#2 1 08-30
#3 1 08-31
#4 1 09-01
Data
df <- structure(list(ID = 1L, Start_Date = "08-29", End_Date = "09-01"),
class = "data.frame", row.names = c(NA, -1L))
I am trying to do a count of rows that fall on and between two dates (minimum and maximum) per group. The only caveat is each group has a different pair of dates. See example below.
This is my raw dataset.
raw <- data.frame("Group" = c("A", "B", "A", "A", "B"),
                  "Date" = c("2017-01-01", "2017-02-02", "2017-09-01", "2017-12-31", "2017-05-09"))
I would like it to return this...
clean <- data.frame("Group" = c("A", "B"),
                    "Min" = c("2017-01-01", "2017-02-02"),
                    "Max" = c("2017-12-31", "2017-05-09"),
                    "Count" = c(3, 2))
How would I be able to do this? The min and max variables are not crucial, but I would definitely like to know how to get the count variable. Thank you!
Is the date range given, or do you want to calculate it from the data as well? If the latter, this should do it:
library(tidyverse)

raw %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(Group) %>%
  summarise(min_date = min(Date), max_date = max(Date), count = n())
Output:
# A tibble: 2 x 4
Group min_date max_date count
<fct> <date> <date> <int>
1 A 2017-01-01 2017-12-31 3
2 B 2017-02-02 2017-05-09 2
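If instead the ranges are given per group, join them on and filter before counting; a sketch with a hypothetical lookup table ranges:
library(dplyr)
ranges <- data.frame(Group = c("A", "B"),
                     Min = as.Date(c("2017-01-01", "2017-02-02")),
                     Max = as.Date(c("2017-12-31", "2017-05-09")))
raw %>%
  mutate(Date = as.Date(Date)) %>%
  inner_join(ranges, by = "Group") %>%
  filter(Date >= Min, Date <= Max) %>%
  group_by(Group, Min, Max) %>%
  summarise(Count = n())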
Each day I have a new csv file with IDs and some variables. The IDs can differ from day to day. I would like to take the IDs of one day and follow how a variable evolves over time.
My goal is to create an area plot like this:
[Image: example stacked area plot]
For example, I take all the IDs on 31 March; then each day I join with those IDs and count, grouped by the variable code. If IDs are missing (present on 31 March but not on day D), their code becomes "NA", to show how many IDs I "lose" over time. I hope I'm clear enough.
Here is how I calculate this kind of plot (my real data look like li, not datas):
library(plyr)
library(dplyr)

datas <- data.frame(id1 = c("x", "y", "x", "y", "z", "x", "z"),
                    id2 = c("x2", "y2", "x2", "y2", "z2", "x2", "z2"),
                    code = c("code1", "code2", "code1", "code2", "code2", "code1", "code2"),
                    var = runif(7),
                    date = do.call(c, mapply(rep, seq(Sys.Date() - 2, Sys.Date(), by = 1), c(2, 3, 2))))
li <- split(datas, datas$date)

dateStart <- Sys.Date() - 2
dateEnd <- Sys.Date()

# A "filter" in case I want to start or end with another date than the min/max date
li <- li[as.Date(names(li)) >= dateStart & as.Date(names(li)) <= dateEnd]
dfCounts <- ldply(li, function(x)
  left_join(li[[1]], x, by = c("id1", "id2")) %>%
    group_by(code.y) %>%
    count(code = code.y) %>%
    mutate(freq = n / sum(n),
           code = ifelse(is.na(code), "NA", code)),
  .id = "date")
> dfCounts
date code n freq
1 2015-07-04 1 1 0.5
2 2015-07-04 2 1 0.5
3 2015-07-05 1 1 0.5
4 2015-07-05 2 1 0.5
5 2015-07-06 1 1 0.5
6 2015-07-06 NA 1 0.5
dfCounts %>%
  ggplot(aes(date, freq)) +
  geom_area(aes(fill = code), position = "stack")
# I have no idea why nothing shows in the plot in this example, but it works on my real data
So it works, but if I want to observe a longer period, I have to join over many days (files), and it can be slow. Do you have any ideas for doing the same thing without the per-day joins, using the row-bound data (the object datas rather than li) with dplyr or data.table? In your opinion, which approach is better?
Thanks!
(Sorry for the title, I couldn't find better.)
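One possible direction, sketched with data.table on the stacked datas: build the grid of day-one IDs against all dates once, then do a single keyed join instead of one join per day. This assumes datas, dateStart and dateEnd as defined above, and unique id pairs within each day:
library(data.table)
dt <- as.data.table(datas)
base <- unique(dt[date == dateStart, .(id1, id2)])
days <- sort(unique(dt[date >= dateStart & date <= dateEnd, date]))
grid <- base[, .(date = days), by = .(id1, id2)]  # every day-one ID on every day
counts <- dt[grid, on = .(id1, id2, date)][       # code is NA where an ID is "lost"
  , .N, by = .(date, code)][
  , freq := N / sum(N), by = date][]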