Here is what my current dataframe looks like:
df <- data.frame(name = c("A", "A", "A", "B", "B"),
start_date = c("2020-01-23", "2019-10-15", "2019-07-28", "2020-03-15", "2019-04-23"),
end_date = c("2020-05-15", "2020-01-27", "2019-10-17", "2020-07-25", "2020-02-13"),
value = c(8.1, 3.3, 9.1, 9.4, 15.3))
name start_date end_date value
A 2020-01-23 2020-05-15 8
A 2019-10-15 2020-01-27 3
A 2019-07-28 2019-10-17 9
B 2020-03-15 2020-07-25 9
B 2019-04-23 2020-02-13 15
The dates are in POSIXct, are not necessarily consecutive, and can overlap.
I would like my output dataframe to look something like this:
name date value
A 2020-01-23 8.1
A 2020-01-24 8.1
A ... 8.1
A 2020-05-14 8.1
A 2020-05-15 8.1
A 2019-10-15 3.3
A 2019-10-16 3.3
A ... 3.3
A 2020-01-26 3.3
A 2020-01-27 3.3
A 2019-07-28 9.1
A 2019-07-29 9.1
A ... 9.1
A 2019-10-16 9.1
A 2019-10-17 9.1
B 2020-03-15 9.4
B 2020-03-16 9.4
B ... 9.4
B 2020-07-24 9.4
B 2020-07-25 9.4
B 2019-04-23 15.3
B 2019-04-24 15.3
B ... 15.3
B 2020-02-12 15.3
B 2020-02-13 15.3
Here is what I have been trying:
library(data.table)
setDT(df) [, .(date = seq(as.Date(start_date), as.Date(end_date), by = "day")), by = end_date]
But I have been getting the following error:
Error in seq.Date(as.Date(start_date), as.Date(end_date), by = "day") :
'from' must be of length 1
How should I do this? I am open to using other packages rather than data.table if they work better.
Here, we need to group by the row sequence, so that seq() receives a single start and end date per group
library(data.table)
setDT(df)[, .(date = seq(as.Date(start_date), as.Date(end_date),
by = 'day')), .(rn = seq_len(nrow(df)), name, value)][, rn := NULL][]
Or use Map to loop over the corresponding elements of 'start_date' and 'end_date', creating a list column of date sequences, and then unnest that list column
library(tidyr)
library(magrittr)
setDT(df)[, .(name, date = Map(seq, MoreArgs = list(by = '1 day'),
as.Date(start_date), as.Date(end_date)), value)] %>%
unnest(date)
# A tibble: 731 x 3
# name date value
# <chr> <date> <dbl>
# 1 A 2020-01-23 8.1
# 2 A 2020-01-24 8.1
# 3 A 2020-01-25 8.1
# 4 A 2020-01-26 8.1
# 5 A 2020-01-27 8.1
# 6 A 2020-01-28 8.1
# 7 A 2020-01-29 8.1
# 8 A 2020-01-30 8.1
# 9 A 2020-01-31 8.1
#10 A 2020-02-01 8.1
# … with 721 more rows
Another approach using purrr
df <- data.frame(name = c("A", "A", "A", "B", "B"),
start_date = c("2020-01-23", "2019-10-15", "2019-07-28", "2020-03-15", "2019-04-23"),
end_date = c("2020-05-15", "2020-01-27", "2019-10-17", "2020-07-25", "2020-02-13"),
value = c(8.1, 3.3, 9.1, 9.4, 15.3))
library(dplyr)
library(purrr)
# function that takes the name, start, end and value of one row and returns a tibble with one row per day
generate_fill <- function(name, start, end, value) {
tibble(name = name,
date = seq(as.Date(start), as.Date(end), by = "1 day"),
value = value)
}
# Map the function to original df and combine the result
bind_rows(
pmap(list(df[["name"]], df[["start_date"]], df[["end_date"]], df[["value"]]),
generate_fill))
Output
# A tibble: 731 x 3
name date value
<chr> <date> <dbl>
1 A 2020-01-23 8.1
2 A 2020-01-24 8.1
3 A 2020-01-25 8.1
4 A 2020-01-26 8.1
5 A 2020-01-27 8.1
6 A 2020-01-28 8.1
7 A 2020-01-29 8.1
8 A 2020-01-30 8.1
9 A 2020-01-31 8.1
10 A 2020-02-01 8.1
# … with 721 more rows
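For reference, the row-by-row idea behind the answers above can also be sketched in base R. This is only a minimal illustration on a hypothetical two-row data frame (the rows and the name `expanded` are made up for the example); it calls seq() once per row, which is what avoids the "'from' must be of length 1" error.

```r
# Hypothetical two-row input, shaped like the question's data frame
df <- data.frame(
  name = c("A", "B"),
  start_date = c("2020-01-30", "2020-02-27"),
  end_date = c("2020-02-02", "2020-03-01"),
  value = c(8.1, 9.4)
)

# seq() needs a single 'from' and 'to', so build one small data frame
# per row and stack them afterwards
expanded <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(
    name = df$name[i],
    date = seq(as.Date(df$start_date[i]), as.Date(df$end_date[i]), by = "day"),
    value = df$value[i]
  )
}))

expanded
```

For large inputs the data.table or purrr versions above are likely to be faster, since rbind on many small data frames is slow.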
Related
I have two datasets that I would like to join based on date. One is a survey dataset, and the other is a list of prices at various dates. The dates don't match exactly, so I would like to join on the nearest date in the survey dataset (the price data is weekly).
Here's a brief snippet of what the survey dataset looks like (there are many other variables, but here's the two most relevant):
ID          actual.date
20120377    2012-09-26
2020455822  2020-11-23
20126758    2012-10-26
20124241    2012-10-25
2020426572  2020-11-28
And here's the price dataset (also much larger, but you get the idea):
date        price.var1        price.var2
2017-10-30  2.74733926399869  2.73994826674735
2015-03-16  2.77028200438506  2.74079930272231
2010-10-18  3.4265947805337   3.41591263539176
2012-10-29  4.10095806545397  4.14717556976502
2012-01-09  3.87888859352037  3.93074237884497
What I would like to do is join the price dataset to the survey dataset, joining on the nearest date.
I've tried a number of different things, none of which have worked to my satisfaction.
#reading in sample data
library(data.table)
library(dplyr)
survey <- fread(" ID actual.date
1: 20120377 2012-09-26
2: 2020455822 2020-11-23
3: 20126758 2012-10-26
4: 20124241 2012-10-25
5: 2020426572 2020-11-28
") %>% select(-V1)
price <- fread("date price.var1 price.var2
1: 2017-10-30 2.747339 2.739948
2: 2015-03-16 2.770282 2.740799
3: 2010-10-18 3.426595 3.415913
4: 2012-10-29 4.100958 4.147176
5: 2012-01-09 3.878889 3.930742") %>% select(-V1)
#using data.table
setDT(survey)[,DT_DATE := actual.date]
setDT(price)[,DT_DATE := date]
survey_price <- survey[price,on=.(DT_DATE),roll="nearest"]
#This works, and they join, but it drops a ton of observations, which won't work
#using dplyr
library(dplyr)
survey_price <- left_join(survey,price,by=c("actual.date"="date"))
#this joins them without dropping observations, but all of the price variables become NAs
You were almost there.
In the DT[i, on] syntax, i should be survey, so that the join keeps all of its rows:
setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price
date price.var1 price.var2 ID
<IDat> <num> <num> <int>
1: 2012-09-26 4.100958 4.147176 20120377
2: 2020-11-23 2.747339 2.739948 2020455822
3: 2012-10-26 4.100958 4.147176 20126758
4: 2012-10-25 4.100958 4.147176 20124241
5: 2020-11-28 2.747339 2.739948 2020426572
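To make explicit what "nearest" means here, the same matching can be sketched in base R. This is an illustrative sketch, not the data.table implementation: it uses a subset of the sample rows from the question, and the names `nearest` and `matched` are made up for the example.

```r
# A few rows from the question's sample data
survey <- data.frame(
  ID = c(20120377, 2020455822, 20126758),
  actual.date = as.Date(c("2012-09-26", "2020-11-23", "2012-10-26"))
)
price <- data.frame(
  date = as.Date(c("2017-10-30", "2015-03-16", "2010-10-18",
                   "2012-10-29", "2012-01-09")),
  price.var1 = c(2.747339, 2.770282, 3.426595, 4.100958, 3.878889)
)

# Work on day counts so the distance is a plain number
survey_num <- as.numeric(survey$actual.date)
price_num <- as.numeric(price$date)

# For every survey date, the index of the price row with the smallest gap
nearest <- vapply(survey_num,
                  function(d) which.min(abs(price_num - d)),
                  integer(1))

# Attach the matched price rows; every survey row is kept
matched <- price[nearest, ]
rownames(matched) <- NULL
survey_price <- cbind(survey, matched)
survey_price
```

This compares every survey date with every price date, so it is only practical for small tables; the rolling join above is the better choice for real data.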
Convert the dates to numeric, use Closest() from DescTools to find the survey date closest to each price date, and replace the price date with that value.
Example datasets
library(tibble)
library(lubridate)
survey <- tibble(
ID = sample(20000:40000, 9, replace = TRUE),
actual.date = seq(today() %m+% days(5), today() %m+% days(5) %m+% months(2),
"week")
)
price <- tibble(
date = seq(today(), today() %m+% months(2), by = "week"),
price_1 = sample(2:6, 9, replace = TRUE),
price_2 = sample(2:6, 9, replace = TRUE)
)
survey
# A tibble: 9 x 2
ID actual.date
<int> <date>
1 34592 2022-05-07
2 37846 2022-05-14
3 22715 2022-05-21
4 22510 2022-05-28
5 30143 2022-06-04
6 34348 2022-06-11
7 21538 2022-06-18
8 39802 2022-06-25
9 36493 2022-07-02
price
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-02 6 6
2 2022-05-09 3 2
3 2022-05-16 6 4
4 2022-05-23 6 2
5 2022-05-30 2 6
6 2022-06-06 2 4
7 2022-06-13 2 2
8 2022-06-20 3 5
9 2022-06-27 5 6
library(tidyverse)
library(lubridate)
library(DescTools)
price <- price %>%
mutate(date = Closest(survey$actual.date %>%
as.numeric, date %>%
as.numeric) %>%
as_date())
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-07 6 6
2 2022-05-14 3 2
3 2022-05-21 6 4
4 2022-05-28 6 2
5 2022-06-04 2 6
6 2022-06-11 2 4
7 2022-06-18 2 2
8 2022-06-25 3 5
9 2022-07-02 5 6
merge(survey, price, by.x = "actual.date", by.y = "date")
actual.date ID price_1 price_2
1 2022-05-07 34592 6 6
2 2022-05-14 37846 3 2
3 2022-05-21 22715 6 4
4 2022-05-28 22510 6 2
5 2022-06-04 30143 2 6
6 2022-06-11 34348 2 4
7 2022-06-18 21538 2 2
8 2022-06-25 39802 3 5
9 2022-07-02 36493 5 6
I have a dataset of weekly mortgage rate data.
The data looks very simple:
library(tibble)
library(lubridate)
df <- tibble(
Date = as_date(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"), format = "%m/%d/%Y"),
Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)
I am trying to group it and summarize by month.
This blogpost and this answer are not what I want, because they just add the month column.
They give me the output:
month Date summary_variable
2008-02-01 2008-02-07 5.67
2008-02-01 2008-02-14 5.72
2008-02-01 2008-02-21 6.04
2008-02-01 2008-02-28 6.24
My desired output (ideally the last day of the month):
Month Average rate
2/28/2008 6
3/31/2008 6.1
4/30/2008 5.9
In the output above I put random numbers, not real calculations.
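We can convert each date to the last day of its month (as.yearmon from zoo, then as.Date with frac = 1), group by that column and take the mean. The code below uses the full dataset's names (df1, DATE, MORTGAGE30US) rather than the short example above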
library(dplyr)
library(lubridate)
library(zoo)
df1 %>%
group_by(Month = as.Date(as.yearmon(mdy(DATE)), 1)) %>%
summarise(Average_rate = mean(MORTGAGE30US))
-output
# A tibble: 151 x 2
# Month Average_rate
# <date> <dbl>
# 1 2008-02-29 5.92
# 2 2008-03-31 5.97
# 3 2008-04-30 5.92
# 4 2008-05-31 6.04
# 5 2008-06-30 6.32
# 6 2008-07-31 6.43
# 7 2008-08-31 6.48
# 8 2008-09-30 6.04
# 9 2008-10-31 6.2
#10 2008-11-30 6.09
# … with 141 more rows
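The same grouping can be tried on the five-row example from the question without extra packages. The sketch below is one possible base R version; `month_end` is a hypothetical helper written for this example, not a function from any package.

```r
# The question's example data, built with base R only
df <- data.frame(
  Date = as.Date(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"),
                 format = "%m/%d/%Y"),
  Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)

# Hypothetical helper: last day of the month for each date
month_end <- function(d) {
  first_of_month <- as.Date(format(d, "%Y-%m-01"))
  # the 1st plus 31 days always lands in the following month
  first_of_next <- as.Date(format(first_of_month + 31, "%Y-%m-01"))
  first_of_next - 1
}

df$Month <- month_end(df$Date)

# One row per month with the mean rate
monthly <- aggregate(Rate ~ Month, data = df, FUN = mean)
names(monthly)[2] <- "Average_rate"
monthly
```

The Month column stays a Date, so it can be reformatted for display with format(monthly$Month, "%m/%d/%Y") if the month/day/year layout from the question is wanted.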
I want to use the Prophet() function in R, but I cannot transform my column "YearWeek" to a as.Date() column.
I have a column "YearWeek" that stores values from 201401 up to 201937 i.e. starting in 2014 week 1 up to 2019 week 37.
I don't know how to declare this column as a date in the form yyyy-ww needed to use the Prophet() function.
Does anyone know how to do this?
Thank you in advance.
One solution could be to append a 01 to the end of your yyyy-ww formatted dates.
Data:
library(tidyverse)
df <- cross2(2014:2019, str_pad(1:52, width = 2, pad = 0)) %>%
map_df(set_names, c("year", "week")) %>%
transmute(date = paste(year, week, sep = "")) %>%
arrange(date)
head(df)
#> # A tibble: 6 x 1
#> date
#> <chr>
#> 1 201401
#> 2 201402
#> 3 201403
#> 4 201404
#> 5 201405
#> 6 201406
Now let's append the 01 and convert to date:
df %>%
mutate(date = paste(date, "01", sep = ""),
new_date = as.Date(date, "%Y%U%w"))
#> # A tibble: 312 x 2
#> date new_date
#> <chr> <date>
#> 1 20140101 2014-01-05
#> 2 20140201 2014-01-12
#> 3 20140301 2014-01-19
#> 4 20140401 2014-01-26
#> 5 20140501 2014-02-02
#> 6 20140601 2014-02-09
#> 7 20140701 2014-02-16
#> 8 20140801 2014-02-23
#> 9 20140901 2014-03-02
#> 10 20141001 2014-03-09
#> # ... with 302 more rows
Created on 2019-10-10 by the reprex package (v0.3.0)
More info about a numeric week of the year can be found here.
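A variation that may be useful, sketched under the assumption that the weeks follow the %U convention (week 1 starts on the first Sunday of the year): append a single weekday digit and parse it with %u, so that every YearWeek value maps to the Monday of that week. The vector `yearweek` and the helper `yearweek_to_date` are hypothetical names for this example. If the data uses ISO weeks instead, this mapping will be off and a dedicated ISO-week package would be the safer route.

```r
# Hypothetical YearWeek values, stored as numbers like in the question
yearweek <- c(201401, 201402, 201937)

# Hypothetical helper: parse "yyyyww" plus a weekday digit (1 = Monday)
yearweek_to_date <- function(x, weekday = 1) {
  as.Date(paste0(x, weekday), format = "%Y%U%u")
}

ds <- yearweek_to_date(yearweek)
ds
weekdays(ds)
```

Prophet expects the date column to be named ds, so the result can go straight into a data frame such as data.frame(ds = ds, y = ...).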
I have two datasets with the following data.
maindata = data.frame(eventid=c(1:10),
district=c(rep("lucknow",2),rep("allahabad",1), rep("kanpur", 2)),
date = c(rep("2018-01-01", 2), rep("2018-01-02", 1), rep("2018-01-03", 2)))
weather = data.frame(district=c(rep("lucknow", 4), rep("allahabad", 3), rep("kanpur", 3)),
date = c(rep("2017-01-01", 4), rep("2017-01-02", 3), rep("2017-01-03", 3)),
temperature=c(rep("19.3",2),rep("22.1",1), rep("24.1", 2)))
A few considerations:
The year in "date" differs between the two data frames, and that is OK; matching on MM-DD is sufficient
The datasets have different lengths - df1 is my main dataset, to which "temp" should be added
The merging must happen over "district" and "date"
maindata has the district column in lowercase
What I tried: (doing some silly conversions.. will fix them)
weather$District<-as.factor(tolower(weather$District))
weather$Date<-as.Date(as.character(weather$Date),format="%m/%d/%Y")
maindata$md<-strftime(data$createDate, "%m-%d")
weather$mdr<-strftime(weather$Date, "%m-%d")
maindata<-left_join(maindata, weather, by = c("md" = "mdr", "district" = "District"))
The final expected answer would be something like below in maindata
eventid district date temperature
1 lucknow 2018-01-01 19.3
2 lucknow 2018-01-01 19.3
3 allahabad 2018-01-03 24.1
4 kanpur 2018-01-03 NA
5 kanpur 2018-01-02 22.1
6 lucknow 2018-01-01 19.3
7 lucknow 2018-01-01 19.3
8 allahabad 2018-01-03 24.1
9 kanpur 2018-01-03 NA
10 kanpur 2018-01-02 22.1
Can anybody please help !!!
I don't understand your logic rules for merging; specifically I don't see how date comes in.
It is entirely possible to reproduce your expected output without considering date at all, by simply matching df1$district with df2$dist:
library(tidyverse);
left_join(df1, df2, by = c("district" = "dist")) %>%
distinct() %>%
select(-date.y)
# eventid date.x district temp
#1 1 2017-01-01 dist-1 19.3
#2 2 2017-01-01 dist-1 19.3
#3 3 2017-01-01 dist-1 19.3
#4 4 2017-01-01 dist-1 19.3
#5 5 2017-01-02 dist-2 22.1
#6 6 2017-01-02 dist-2 22.1
#7 7 2017-01-02 dist-2 22.1
#8 8 2017-01-03 dist-3 24.10
#9 9 2017-01-03 dist-3 24.10
#10 10 2017-01-03 dist-3 24.10
Could you provide sample data that is more representative of what you're trying to do, and where the role/importance of merging on date becomes clear?
A quick note - you should really post what you have tried before asking for help on SO.
To the answer -
What you should be using is the merge function available by default in R.
After reproducing the data frames that you have provided, try the chunk of code below
#Since dates don't matter, df2 can be reduced to a new df with only dist and temp
df3 <- df2[,c("dist","temp")]
df3 <- unique(df3)
df4 <- merge(df1,df3,by.x = "district",by.y = "dist",all.x = TRUE)
The deduplication avoids creating numerous rows for each combination of dates in df1 and df2.
all.x = TRUE ensures that you're getting a left join (where all rows of df1 are present in your final output)
Perhaps something like this (with the updated data)
library(tidyverse)
df1 %>%
mutate(date = as.POSIXct(date),
date1 = format(date, "%d/%m")) %>%
left_join(df2 %>%
mutate(date = as.POSIXct(date),
date1 = format(date, "%d/%m")), by = c("date1" = "date1", "district" = "dist")) %>%
select(-date1, - date.y) %>%
rename(date = date.x) %>%
filter(!duplicated(eventid))
#output
eventid date district temp
1 1 2017-01-01 dist-1 19.3
2 2 2017-01-01 dist-1 19.3
3 3 2017-01-01 dist-1 19.3
4 4 2017-01-01 dist-1 19.3
5 5 2017-01-02 dist-2 <NA>
6 6 2017-01-02 dist-2 <NA>
7 7 2017-01-02 dist-2 <NA>
8 8 2017-01-03 dist-3 24.10
9 9 2017-01-03 dist-3 24.10
10 10 2017-01-03 dist-3 24.10
Convert date in both data frames to POSIXct, make a %d/%m column and join by it and district, and then clean up
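The month-day join described above can also be sketched in base R. The data below is hypothetical (modelled on the question's maindata and weather, with one deliberately unmatched day), and the key column `md` is an illustrative name.

```r
# Hypothetical data modelled on the question
maindata <- data.frame(
  eventid  = 1:5,
  district = c("lucknow", "lucknow", "allahabad", "kanpur", "kanpur"),
  date     = as.Date(c("2018-01-01", "2018-01-01", "2018-01-02",
                       "2018-01-03", "2018-01-03"))
)
weather <- data.frame(
  district    = c("Lucknow", "Allahabad", "Kanpur"),
  date        = as.Date(c("2017-01-01", "2017-01-02", "2017-01-04")),
  temperature = c(19.3, 22.1, 24.1)
)

# Join keys: lower-case district plus month-day, ignoring the year
maindata$md <- format(maindata$date, "%m-%d")
weather$md <- format(weather$date, "%m-%d")
weather$district <- tolower(weather$district)

# Left join: keep every event, NA where no weather row matches
merged <- merge(maindata,
                weather[, c("district", "md", "temperature")],
                by = c("district", "md"), all.x = TRUE)
merged <- merged[order(merged$eventid),
                 c("eventid", "district", "date", "temperature")]
merged
```

Events 4 and 5 get NA here because the hypothetical weather table has no kanpur reading for 01-03.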
Maybe you want this.
df2[, 2] <- as.numeric(as.character(df2[, 2]))
m1 <- merge(df1, df2, by.x = "district", by.y = "dist", all.x = TRUE)[-5]
names(m1)[3] <- "date"
m1 <- unique(m1[, c(2, 3, 1, 4)])
rownames(m1) <- NULL
> m1
eventid date district temp
1 1 2017-01-01 dist-1 19.3
2 2 2017-01-01 dist-1 19.3
3 3 2017-01-01 dist-1 19.3
4 4 2017-01-01 dist-1 19.3
5 5 2017-01-02 dist-2 22.1
6 6 2017-01-02 dist-2 22.1
7 7 2017-01-02 dist-2 22.1
8 8 2017-01-03 dist-3 24.1
9 9 2017-01-03 dist-3 24.1
10 10 2017-01-03 dist-3 24.1
I have a data.frame that looks like this
DATE MEAN SUM MAX MIN SAISON JAHR
1 1995-09-01 00:00:00 2.370833 56.9 7.4 0 S 1995
2 1995-09-01 01:00:00 2.225000 53.4 7.4 0 S 1995
3 1995-09-01 02:00:00 2.091667 50.2 7.4 0 S 1995
4 1995-09-01 03:00:00 1.929167 46.3 7.4 0 S 1995
5 1995-09-01 04:00:00 1.745833 41.9 7.4 0 S 1995
6 1995-09-01 05:00:00 1.558333 37.4 7.4 0 S 1995
....
With the dplyr package I am able to extract the highest SUM for every SAISON and JAHR:
gJahrSAISON_24 <- group_by(.data = dataframe, JAHR, SAISON)
summarise(gJahrSAISON_24, hoechsterNiederschlag = max(SUM))
Do you have any idea how to extract the ten(!) highest sums for every JAHR and SAISON?
You can use slice with arrange
library(dplyr)
df1 %>%
group_by(JAHR, SAISON) %>%
arrange(desc(SUM)) %>%
slice(1:10)
Or filter with min_rank/dense_rank
df1 %>%
group_by(JAHR, SAISON) %>%
filter(dense_rank(desc(SUM)) <= 10)
Similar options using data.table are
library(data.table)#v1.9.5+
setDT(df1)[order(-SUM), .SD[1:10], by = .(JAHR, SAISON)]
Or
setDT(df1)[, .SD[frank(-SUM, ties.method='first') <=10], by = .(JAHR, SAISON)]
Or using sqldf
library(sqldf)
sqldf('select * from df1 i
where rowid in
(select rowid from df1
where JAHR = i.JAHR and SAISON=i.SAISON
order by SUM desc
limit 10)
order by i.JAHR, i.SAISON, i.SUM desc')
Or with base R
df1[with(df1, ave(SUM, SAISON, JAHR, FUN=function(x)
rank(-x, ties.method='first'))<=10),]
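To check any of these on a small scale, here is a self-contained base R sketch using hypothetical data and n = 2 instead of 10. Note that the rank has to be computed on the negated SUM (or on desc(SUM) in dplyr) so that rank 1 is the highest value.

```r
# Hypothetical miniature of the question's data
df1 <- data.frame(
  JAHR   = c(1995, 1995, 1995, 1995, 1996, 1996, 1996),
  SAISON = c("S", "S", "S", "W", "S", "S", "S"),
  SUM    = c(56.9, 53.4, 50.2, 46.3, 41.9, 37.4, 60.0)
)

n <- 2  # number of top rows to keep per group

# Rank within each JAHR/SAISON group; negating SUM makes rank 1 the largest
rank_in_group <- ave(-df1$SUM, df1$JAHR, df1$SAISON,
                     FUN = function(x) rank(x, ties.method = "first"))

top_n <- df1[rank_in_group <= n, ]
top_n <- top_n[order(top_n$JAHR, top_n$SAISON, -top_n$SUM), ]
top_n
```

Groups with fewer than n rows (1995/W here) simply return all of their rows.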