I have a dataframe 'my_data' which looks like this:
Calendar_Day Name
2018-03-31 ABC
2018-03-31 XYZ
2018-03-31 OPR
2019-01-31 ABC
2019-01-31 RTE
2019-10-31 YUD
2018-03-31 RYT
I wish to have another column that will serve as a primary key, with the format
YEAR + MONTH + 6-digit sequence, e.g. 201803000001.
I am new to R and couldn't find a way to implement this.
The dataframe should then look like:
Calendar_Day Name ID
2018-03-31 ABC 201803000001
2018-03-31 XYZ 201803000002
2018-03-31 OPR 201803000003
2019-01-31 ABC 201901000001
2019-01-31 RTE 201901000002
2019-10-31 YUD 201910000001
2018-03-31 RYT 201803000004
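For reference, a minimal reproducible construction of the input data (storing Calendar_Day as character is an assumption):
my_data <- data.frame(
  Calendar_Day = c("2018-03-31", "2018-03-31", "2018-03-31", "2019-01-31",
                   "2019-01-31", "2019-10-31", "2018-03-31"),
  Name = c("ABC", "XYZ", "OPR", "ABC", "RTE", "YUD", "RYT"),
  stringsAsFactors = FALSE
)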
library(dplyr)
library(lubridate)
my_data %>%
  mutate(Calendar_Day = ymd(Calendar_Day)) %>%
  group_by(tmp1 = year(Calendar_Day), tmp2 = month(Calendar_Day)) %>%
  mutate(ID = paste0(year(Calendar_Day),
                     sprintf("%02d", month(Calendar_Day)),
                     sprintf("%06d", row_number()))) %>%
  ungroup() %>%
  select(-tmp1, -tmp2)
#> # A tibble: 7 x 3
#>   Calendar_Day Name  ID
#>   <date>       <chr> <chr>
#> 1 2018-03-31   ABC   201803000001
#> 2 2018-03-31   XYZ   201803000002
#> 3 2018-03-31   OPR   201803000003
#> 4 2019-01-31   ABC   201901000001
#> 5 2019-01-31   RTE   201901000002
#> 6 2019-10-31   YUD   201910000001
#> 7 2018-03-31   RYT   201803000004
You could also use the tidyverse like this:
library(tidyverse)
my_data %>%
  mutate(Date2 = format(as.Date(Calendar_Day), "%Y%m")) %>%
  group_by(Date2) %>%
  mutate(ID = paste0(Date2, str_pad(1:n(), width = 6, side = "left", pad = "0"))) %>%
  ungroup() %>%
  select(-Date2)
The main idea is to use the format function: format(mydate, "%Y") returns the year of a date object and format(mydate, "%m") returns the month; "%Y%m" gives both at once.
I paste this together with the six-digit sequence.
I use str_pad to add leading zeros to the sequence.
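A base R sketch of the same idea, using ave() for the within-month counter (this assumes Calendar_Day is a Date or a "YYYY-MM-DD" character column):
ym <- format(as.Date(my_data$Calendar_Day), "%Y%m")
seq_in_month <- ave(seq_along(ym), ym, FUN = seq_along)  # running count within each year-month
my_data$ID <- paste0(ym, sprintf("%06d", seq_in_month))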
I have weather data that I need to summarize across different date intervals.
Here is the weather data:
library(dplyr)
rainfall_data <- read.csv(text = "
date,rainfall_daily_mm
01/01/2019,0
01/02/2019,1
01/03/2019,3
01/04/2019,45
01/05/2019,0
01/06/2019,0
01/07/2019,0
01/08/2019,43
01/09/2019,5
01/10/2019,0
01/11/2019,55
01/12/2019,6
01/13/2019,0
01/14/2019,7
01/15/2019,0
01/16/2019,7
01/17/2019,8
01/18/2019,89
01/19/2019,65
01/20/2019,3
01/21/2019,0
01/22/2019,0
01/23/2019,2
01/24/2019,0
01/25/2019,0
01/26/2019,0
01/27/2019,0
01/28/2019,22
01/29/2019,3
01/30/2019,0
01/31/2019,0
") %>%
mutate(date = as.Date(date, format = "%d/%m/%Y"))
And here are the date intervals I need to get summaries of from the weather file:
intervals <- read.csv(text= "
treatment,initial,final
A,01/01/2019,01/05/2019
B,01/13/2019,01/20/2019
C,01/12/2019,01/26/2019
D,01/30/2019,01/31/2019
E,01/11/2019,01/23/2019
F,01/03/2019,01/19/2019
G,01/01/2019,01/24/2019
H,01/26/2019,01/28/2019
") %>%
mutate(initial = as.Date(initial, format = "%d/%m/%Y"),
final = as.Date(final, format = "%d/%m/%Y"))
The expected outcome is the total rainfall summed over each treatment's date interval.
This is what I've tried based on a similar question:
summary_by_date_interval <- rainfall_data %>%
mutate(group = cumsum(grepl(intervals$initial|intervals$final, date))) %>%
group_by(group) %>%
summarise(rainfall = sum(rainfall_daily_mm))
And this is the error I got:
Error in `mutate()`:
! Problem while computing `group = cumsum(grepl(intervals$initial |
intervals$final, date))`.
Caused by error in `Ops.Date()`:
! | not defined for "Date" objects
Run `rlang::last_error()` to see where the error occurred.
Any help will be really appreciated.
First, %d/%m/%Y needs to be %m/%d/%Y (or you'll get wrong dates and many NAs).
Then you could, for example, use lubridate's interval() and %within%:
library(dplyr)
library(lubridate)
intervals |>
group_by(treatment) |>
mutate(test = sum(rainfall_data$rainfall_daily_mm[rainfall_data$date %within% interval(initial, final)])) |>
ungroup()
Output:
# A tibble: 8 × 4
treatment initial final test
<chr> <date> <date> <int>
1 A 2019-01-01 2019-01-05 49
2 B 2019-01-13 2019-01-20 179
3 C 2019-01-12 2019-01-26 187
4 D 2019-01-30 2019-01-31 0
5 E 2019-01-11 2019-01-23 242
6 F 2019-01-03 2019-01-19 333
7 G 2019-01-01 2019-01-24 339
8 H 2019-01-26 2019-01-28 22
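For comparison, a rowwise() sketch that avoids lubridate by using plain date comparisons (assuming both data frames were parsed with the corrected %m/%d/%Y format):
intervals %>%
  rowwise() %>%
  mutate(total_rain = sum(rainfall_data$rainfall_daily_mm[
    rainfall_data$date >= initial & rainfall_data$date <= final])) %>%  # sum rain on days inside each interval
  ungroup()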
I have a question related to a calculation based on timestamps.
I have a big dataframe df with Timestamp (covering the whole year), Export_Country, Import_Country and the respective Value for each hour of the year.
For example here is the sample dataframe df:
df <- data.frame(Timestamp=c("2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000"),
Export_Country=c('AT','DE','CH','DE','CZ','DE'),
Import_Country=c('DE','AT','DE','CH','DE','CZ'),
Value=c(170.06,289.37,1133.47,0,68.29,0.32),
stringsAsFactors=FALSE)
I want to write a function which can calculate the net value between two countries for each timestamp. The output should look like the dataframe df2:
df2 <- data.frame(Timestamp=c("2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000"),
Export_Country=c('DE','CH','CZ'),
Import_Country=c('AT','DE','DE'),
Value=c(119.31,1133.47,67.97),
stringsAsFactors=FALSE)
I was trying to do something like:
df3 <- df %>%
group_by(Timestamp,Export_Country,Import_Country) %>%
summarise(Value=sum(Value))
Note: this is the output of str() on my full dataframe:
'data.frame': 65520 obs. of 4 variables:
$ DateTime : chr "2020-01-02 12:00:00.000" "2020-01-02 12:00:00.000" "2020-01-02 12:00:00.000" "2020-01-02 12:00:00.000" ...
$ Export_Country: Factor w/ 70 levels "AL","AT","BA",..: 15 13 15 10 13 2 53 13 46 10 ...
$ Import_Country: Factor w/ 70 levels "AL","AT","BA",..: 10 46 13 15 2 13 10 15 13 53 ...
$ FlowValue : num 417 251 898 0 1089 ...
Can anyone help me? Thank you.
Using the tidyverse, we can pivot the data to longer format and net the flows per country pair:
library(tidyverse)

# for each row index after pivoting, get the index of its counterpart row (the other country from the same original row)
funcp <- function(x) x + 1 - 2 * (x%%2 == 0)
df %>%
# pivoting to longer format in order to facilitate data manipulation
pivot_longer(cols=ends_with("Country"), values_to = "country") %>%
# strip "_Country" from the name and flip the sign for imports (an import counts as -Value)
mutate(name=sub("_.+","", name), Value=Value*(1-2*(name=="Import"))) %>%
# adding a with column that contains the counterpart
tibble(with=.$country[funcp(1:nrow(.))]) %>%
# finally grouping by the Timestamp, the country and the counterpart to get the actual Net value
group_by(Timestamp, country, with) %>% summarise(Value=sum(Value)) -> df2
df2
#> # A tibble: 6 x 4
#> # Groups: Timestamp, country [4]
#> Timestamp country with Value
#> <chr> <chr> <chr> <dbl>
#> 1 2020-01-01 00:00:00.000 AT DE -119.
#> 2 2020-01-01 00:00:00.000 CH DE 1133.
#> 3 2020-01-01 00:00:00.000 CZ DE 68.0
#> 4 2020-01-01 00:00:00.000 DE AT 119.
#> 5 2020-01-01 00:00:00.000 DE CH -1133.
#> 6 2020-01-01 00:00:00.000 DE CZ -68.0
If you want to get only the positive nets, you can filter the results:
df2 %>% filter(Value >=0)
#> # A tibble: 3 x 4
#> # Groups: Timestamp, country [3]
#> Timestamp country with Value
#> <chr> <chr> <chr> <dbl>
#> 1 2020-01-01 00:00:00.000 CH DE 1133.
#> 2 2020-01-01 00:00:00.000 CZ DE 68.0
#> 3 2020-01-01 00:00:00.000 DE AT 119.
Note: the CZ-to-DE value is rounded when printed but equals 67.97 in the tibble.
I know that the following function is complicated and that there are probably simpler solutions, but it seems to work.
fun <- function(X){
  # sign of Value: positive if Export_Country sorts before Import_Country, negative otherwise
  f <- function(x){
    x[[3]]*(2*(x[[1]] < x[[2]]) - 1)
  }
  icontr <- grep("Country", names(X), value = TRUE)
  X[["Value"]] <- f(X[c(icontr, "Value")])
  # put each country pair in alphabetical order so both directions fall into one group
  X[icontr] <- t(apply(X[icontr], 1, sort))
  # aggregate Value by Timestamp and the (sorted) country pair
  fmla <- paste(c("Timestamp", icontr), collapse = "+")
  fmla <- paste("Value", fmla, sep = "~")
  fmla <- as.formula(fmla)
  out <- aggregate(fmla, X, sum)
  # where the net is negative, swap exporter/importer and flip the sign
  i <- out[["Value"]] < 0
  tmp <- out[["Export_Country"]][i]
  out[["Export_Country"]][i] <- out[["Import_Country"]][i]
  out[["Import_Country"]][i] <- tmp
  out[["Value"]][i] <- -out[["Value"]][i]
  out
}
fun(df)
# Timestamp Export_Country Import_Country Value
#1 2020-01-01 00:00:00.000 DE AT 119.31
#2 2020-01-01 00:00:00.000 CH DE 1133.47
#3 2020-01-01 00:00:00.000 CZ DE 67.97
all.equal(fun(df), df2)
#[1] TRUE
You might want to combine Import_Country and Export_Country into a single character string. Then you can group_by this and take the difference between the two Values present. This assumes you only have two countries to combine for each Timestamp; it also subtracts imports from exports consistently.
library(tidyverse)
df %>%
mutate(CountryDyad = paste(pmin(Export_Country, Import_Country),
pmax(Export_Country, Import_Country),
sep = "-")) %>%
group_by(Timestamp, CountryDyad) %>%
summarise(Value = Value[which(startsWith(CountryDyad, Import_Country))] -
Value[which(startsWith(CountryDyad, Export_Country))])
Output
Timestamp CountryDyad Value
<chr> <chr> <dbl>
1 2020-01-01 00:00:00.000 AT-DE 119.
2 2020-01-01 00:00:00.000 CH-DE -1133.
3 2020-01-01 00:00:00.000 CZ-DE -68.0
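For completeness, a self-join sketch of the same netting (an alternative, not taken from the answers above): each directed flow is matched with its reverse-direction flow, assuming each directed pair appears at most once per Timestamp.
library(dplyr)

df %>%
  left_join(df,
            by = c("Timestamp",
                   "Export_Country" = "Import_Country",
                   "Import_Country" = "Export_Country"),
            suffix = c("", "_rev")) %>%
  mutate(Value = Value - coalesce(Value_rev, 0)) %>%  # net flow in this direction
  filter(Value > 0) %>%                               # keep only the positive direction; pairs that net to exactly 0 drop out
  select(Timestamp, Export_Country, Import_Country, Value)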
I want to use the Prophet() function in R, but I cannot transform my "YearWeek" column into a Date column with as.Date().
I have a column "YearWeek" that stores values from 201401 up to 201937 i.e. starting in 2014 week 1 up to 2019 week 37.
I don't know how to declare this column as a date in the form yyyy-ww needed to use the Prophet() function.
Does anyone know how to do this?
Thank you in advance.
One solution could be to append a 01 to the end of your yyyy-ww formatted dates.
Data:
library(tidyverse)
df <- cross2(2014:2019, str_pad(1:52, width = 2, pad = 0)) %>%
map_df(set_names, c("year", "week")) %>%
transmute(date = paste(year, week, sep = "")) %>%
arrange(date)
head(df)
#> # A tibble: 6 x 1
#> date
#> <chr>
#> 1 201401
#> 2 201402
#> 3 201403
#> 4 201404
#> 5 201405
#> 6 201406
Now let's append the 01 and convert to date:
df %>%
mutate(date = paste(date, "01", sep = ""),
new_date = as.Date(date, "%Y%U%w"))
#> # A tibble: 312 x 2
#> date new_date
#> <chr> <date>
#> 1 20140101 2014-01-05
#> 2 20140201 2014-01-12
#> 3 20140301 2014-01-19
#> 4 20140401 2014-01-26
#> 5 20140501 2014-02-02
#> 6 20140601 2014-02-09
#> 7 20140701 2014-02-16
#> 8 20140801 2014-02-23
#> 9 20140901 2014-03-02
#> 10 20141001 2014-03-09
#> # ... with 302 more rows
Created on 2019-10-10 by the reprex package (v0.3.0)
More info about a numeric week of the year can be found here.
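Applied to the question's own column, the same trick would look roughly like this (a sketch; a data frame `d` with a YearWeek column is an assumption, and prophet expects the date column to be called ds). As with %U week numbering in general, it is worth checking a few converted dates against a calendar:
library(dplyr)

d %>%
  mutate(ds = as.Date(paste0(YearWeek, "01"), format = "%Y%U%w"))  # e.g. 201401 -> "2014-01-05"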
I've got a data set with reservation data in the format below:
property <- c('casa1', 'casa2', 'casa3')
check_in <- as.Date(c('2018-01-01', '2018-01-30','2018-02-28'))
check_out <- as.Date(c('2018-01-02', '2018-02-03', '2018-03-02'))
total_paid <- c(100,110,120)
df <- data.frame(property,check_in,check_out, total_paid)
My goal is to have the total_paid amount divided across the days of the stay and assigned to the correct month, for budgeting reasons.
While there's no issue for casa1, casa2 and casa3 have days reserved in two different months, so the totals get skewed.
Any help much appreciated!
Here you go:
library(dplyr)
library(tidyr)
df %>%
  mutate(id = seq_along(property),            # make a few helper variables
         day_paid = total_paid / as.numeric(check_out - check_in),
         date = check_in) %>%
  group_by(id) %>%
  complete(date = seq.Date(check_in, (check_out - 1), by = "day")) %>%  # one row per day of stay (except the last)
  ungroup() %>%
  mutate(month = cut(date, breaks = "month")) %>%  # determine the month of each date
  fill(property, check_in, check_out, total_paid, day_paid) %>%
  group_by(id, month) %>%
  summarise(property = unique(property),
            check_in = unique(check_in),
            check_out = unique(check_out),
            total_paid = unique(total_paid),
            paid_month = sum(day_paid))        # summarise per month
result:
# A tibble: 5 x 7
# Groups: id [3]
id month property check_in check_out total_paid paid_month
<int> <fct> <fct> <date> <date> <dbl> <dbl>
1 1 2018-01-01 casa1 2018-01-01 2018-01-02 100 100
2 2 2018-01-01 casa2 2018-01-30 2018-02-03 110 55
3 2 2018-02-01 casa2 2018-01-30 2018-02-03 110 55
4 3 2018-02-01 casa3 2018-02-28 2018-03-02 120 60
5 3 2018-03-01 casa3 2018-02-28 2018-03-02 120 60
I hope it's somewhat readable, but please ask if there is something I should explain. The convention is that guests don't pay for the last day of a stay, so I took that into account.
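As a quick sanity check (a sketch, assuming the result above has been stored in `res`), the monthly pieces should add back up to total_paid for each stay:
res %>%
  group_by(id) %>%
  summarise(total_paid = unique(total_paid),
            sum_of_pieces = sum(paid_month))  # sum_of_pieces should equal total_paid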
Suppose I have a daily rain data.frame like this:
df.meteoro = data.frame(Dates = seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"),
rain = rnorm(length(seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"))))
I'm trying to sum the accumulated rain over 14-day intervals with this code:
library(tidyverse)
library(lubridate)
df.rain <- df.meteoro %>%
  mutate(TwoWeeks = round_date(Dates, "14 days")) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_rain = sum(rain))
The problem is that it isn't starting on 2017-01-19 but on 2017-01-15 and I was expecting my output dates to be:
"2017-02-02" "2017-02-16" "2017-03-02" "2017-03-16" "2017-03-30" "2017-04-13"
"2017-04-27" "2017-05-11" "2017-05-25" "2017-06-08" "2017-06-22" "2017-07-06" "2017-07-20"
"2017-08-03" "2017-08-17" "2017-08-31" "2017-09-14" "2017-09-28" "2017-10-12" "2017-10-26"
"2017-11-09" "2017-11-23" "2017-12-07" "2017-12-21" "2018-01-04" "2018-01-18"
TL;DR I have a year long daily rain data.frame and want to sum the accumulate rain for the dates above.
Please help.
Use of round_date in the way you have shown will not give you 14-day periods starting where you expect. I have taken a different approach in this solution: generate a sequence of dates between your first and last dates, group these into 14-day periods, then join the dates to your observations.
startdate = min(df.meteoro$Dates)
enddate = max(df.meteoro$Dates)
dateseq =
data.frame(Dates = seq.Date(startdate, enddate, by = 1)) %>%
mutate(group = as.numeric(Dates - startdate) %/% 14) %>%
group_by(group) %>%
mutate(starts = min(Dates))
df.rain <- df.meteoro %>%
right_join(dateseq) %>%
group_by(starts) %>%
summarise(sum_rain = sum(rain))
head(df.rain)
> head(df.rain)
# A tibble: 6 x 2
starts sum_rain
<date> <dbl>
1 2017-01-19 6.09
2 2017-02-02 5.55
3 2017-02-16 -3.40
4 2017-03-02 2.55
5 2017-03-16 -0.12
6 2017-03-30 8.95
The right join to the date sequence ensures that if there were missing observation days spanning a complete time period, that period would still be listed in the result (though in your case you have a complete year of dates anyway).
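Incidentally, base R's cut() method for Date vectors can produce the same 14-day groups anchored at the first observation; a compact sketch (the labels are the start date of each period):
df.meteoro %>%
  mutate(TwoWeeks = as.Date(cut(Dates, breaks = "14 days"))) %>%  # breaks start at min(Dates)
  group_by(TwoWeeks) %>%
  summarise(sum_rain = sum(rain))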
round_date rounds to the nearest multiple of the unit (here, 14 days) since some epoch (probably the Unix epoch, 1970-01-01 00:00:00), which doesn't line up with your purpose.
To get what you want, you can do the following:
df.rain = df.meteoro %>%
mutate(days_since_start = as.numeric(Dates - as.Date("2017/1/18")),
TwoWeeks = as.Date("2017/1/18") + 14*ceiling(days_since_start/14)) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
This computes days_since_start as the days since 2017/1/18 and then manually rounds to the next multiple of two weeks.
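For example, for the first observation (2017-01-19) the pieces evaluate as follows:
as.numeric(as.Date("2017-01-19") - as.Date("2017-01-18"))  # 1 day since start
ceiling(1 / 14)                                            # 1
as.Date("2017-01-18") + 14 * 1                             # "2017-02-01", the group label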
Assuming you want to round each date to the closest of the dates you have specified, I guess the following will work:
library(plyr)  # for ddply(); note it masks some dplyr verbs if loaded after dplyr

targetDates <- seq(ymd("2017-02-02"), ymd("2018-01-18"), by = '14 days')
df.meteoro$Dates = targetDates[sapply(df.meteoro$Dates, function(x) which.min(abs(interval(targetDates, x))))]
sum_rain = ddply(df.meteoro, .(Dates), summarize, sum_rain = sum(rain, na.rm = T))
As you can see, not all dates have the same number of observations. The date "2017-02-02", for instance, gets all the records from "2017-01-19" up to "2017-02-09", which is 22 records. From "2017-02-10" on, dates are rounded to "2017-02-16", and so on.
This may be a cheat, but assuming each row/observation is a separate day, why not just group every 14 rows and sum?
# Assign interval groups, each 14 rows
df.meteoro$my_group <-rep(1:100, each=14, length.out=nrow(df.meteoro))
# Grab Interval Names
my_interval_names <- df.meteoro %>%
select(-rain) %>%
group_by(my_group) %>%
slice(1)
# Summarise
df.meteoro %>%
group_by(my_group) %>%
summarise(rain = sum(rain)) %>%
left_join(., my_interval_names)
#> Joining, by = "my_group"
#> # A tibble: 27 x 3
#> my_group rain Dates
#> <int> <dbl> <date>
#> 1 1 3.86 2017-01-19
#> 2 2 -0.581 2017-02-02
#> 3 3 -0.876 2017-02-16
#> 4 4 1.80 2017-03-02
#> 5 5 3.79 2017-03-16
#> 6 6 -3.50 2017-03-30
#> 7 7 5.31 2017-04-13
#> 8 8 2.57 2017-04-27
#> 9 9 -1.33 2017-05-11
#> 10 10 5.41 2017-05-25
#> # ... with 17 more rows
Created on 2018-03-01 by the reprex package (v0.2.0).