I have a dataset (data.weather) with one weather variable (TMAX) for two locations (combinations of LAT and LON) and two years. In this mock example, TMAX is available for ten days per year and location. I need to calculate the mean TMAX (mean_TMAX) for each of the four rows in data.locs; that dataset indicates the date range, DATE_0 to DATE_1, over which each mean should be calculated.
Here is my code:
library(dplyr)
library(lubridate)
data.weather <-read.csv(text = "
LAT,LON,YEAR,DATE,TMAX
36,-89,2010,1/1/2010,25
36,-89,2010,1/2/2010,25
36,-89,2010,1/3/2010,25
36,-89,2010,1/4/2010,28
36,-89,2010,1/5/2010,28
36,-89,2010,1/6/2010,29
36,-89,2010,1/7/2010,25
36,-89,2010,1/8/2010,25
36,-89,2010,1/9/2010,25
36,-89,2010,1/10/2010,28
36,-89,2011,1/1/2011,26
36,-89,2011,1/2/2011,25
36,-89,2011,1/3/2011,28
36,-89,2011,1/4/2011,26
36,-89,2011,1/5/2011,27
36,-89,2011,1/6/2011,27
36,-89,2011,1/7/2011,28
36,-89,2011,1/8/2011,29
36,-89,2011,1/9/2011,27
36,-89,2011,1/10/2011,26
40,-96,2010,1/1/2010,29
40,-96,2010,1/2/2010,28
40,-96,2010,1/3/2010,25
40,-96,2010,1/4/2010,25
40,-96,2010,1/5/2010,28
40,-96,2010,1/6/2010,29
40,-96,2010,1/7/2010,26
40,-96,2010,1/8/2010,28
40,-96,2010,1/9/2010,26
40,-96,2010,1/10/2010,25
40,-96,2011,1/1/2011,29
40,-96,2011,1/2/2011,27
40,-96,2011,1/3/2011,29
40,-96,2011,1/4/2011,25
40,-96,2011,1/5/2011,28
40,-96,2011,1/6/2011,29
40,-96,2011,1/7/2011,29
40,-96,2011,1/8/2011,25
40,-96,2011,1/9/2011,25
40,-96,2011,1/10/2011,26
") %>%
mutate(DATE = as.Date(DATE, format = "%m/%d/%Y"))
data.locs <-read.csv(text = "
LAT,LON,YEAR,DATE_0,DATE_1,GEN,PR
36,-89,2010,1/2/2010,1/9/2010,MN103,35
36,-89,2011,1/1/2011,1/10/2011,IA100,33
40,-96,2010,1/4/2010,1/8/2010,MN103,36
40,-96,2011,1/2/2011,1/6/2011,IA100,34
") %>%
mutate(DATE_0 = as.Date(DATE_0, format = "%m/%d/%Y"),
DATE_1 = as.Date(DATE_1, format = "%m/%d/%Y"))
tmax.calculation <- data.locs %>%
group_by(LAT,LON,YEAR, GEN) %>%
mutate(mean_TMAX = mean(data.weather$TMAX[data.weather$DATE %within% interval(DATE_0, DATE_1)]))
This is the expected result:
LAT LON YEAR DATE_0 DATE_1 GEN PR mean_TMAX
36 -89 2010 1/2/2010 1/9/2010 MN103 35 26.25
36 -89 2011 1/1/2011 1/10/2011 IA100 33 26.90
40 -96 2010 1/4/2010 1/8/2010 MN103 36 27.20
40 -96 2011 1/2/2011 1/6/2011 IA100 34 27.60
However, this is what I am getting:
LAT LON YEAR DATE_0 DATE_1 GEN PR mean_TMAX
36 -89 2010 1/2/2010 1/9/2010 MN103 35 26.5625
36 -89 2011 1/1/2011 1/10/2011 IA100 33 27.0500
40 -96 2010 1/4/2010 1/8/2010 MN103 36 27.1000
40 -96 2011 1/2/2011 1/6/2011 IA100 34 27.1000
The problem is that, although the mean is computed over the correct date interval, it is computed across both locations (combinations of LAT and LON). I couldn't find a way to calculate the mean separately for each LAT/LON combination.
This should do it:
tmax.calculation <- data.locs %>%
  group_by(LAT, LON, YEAR, GEN) %>%
  do(data.frame(LAT = .$LAT,
                LON = .$LON,
                YEAR = .$YEAR,
                GEN = .$GEN,
                DATE = seq(.$DATE_0, .$DATE_1, by = "days"))) %>%
  left_join(data.weather, by = c("LAT", "LON", "YEAR", "DATE")) %>%
  summarise(mean_TMAX = mean(TMAX))
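A shorter sketch of the same fix is to filter data.weather by location as well as date for each row of data.locs with rowwise() (a dplyr verb, shown here on compact stand-in data rather than the full example above):

```r
library(dplyr)

# compact stand-ins for the question's data (2 locations, 1 year, 5 days each)
data.weather <- data.frame(
  LAT  = rep(c(36, 40), each = 5),
  LON  = rep(c(-89, -96), each = 5),
  YEAR = 2010,
  DATE = rep(as.Date("2010-01-01") + 0:4, times = 2),
  TMAX = c(25, 25, 25, 28, 28,  29, 28, 25, 25, 28)
)
data.locs <- data.frame(
  LAT = c(36, 40), LON = c(-89, -96), YEAR = 2010,
  DATE_0 = as.Date(c("2010-01-02", "2010-01-02")),
  DATE_1 = as.Date(c("2010-01-04", "2010-01-04"))
)

# mean TMAX per row of data.locs, restricted to the matching LAT/LON
# and to that row's DATE_0..DATE_1 window
tmax.calculation <- data.locs %>%
  rowwise() %>%
  mutate(mean_TMAX = mean(data.weather$TMAX[
    data.weather$LAT == LAT &
    data.weather$LON == LON &
    data.weather$DATE >= DATE_0 &
    data.weather$DATE <= DATE_1
  ])) %>%
  ungroup()

tmax.calculation$mean_TMAX
# 26 26  (means of 25,25,28 and 28,25,25)
```

Because each row now only sees weather records from its own location, the per-location means come out correctly.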
I have some data in a format like the reproducible example below (code for data input below the question, at the end). Two things:
Not all dates have a value (i.e. many dates are missing).
Some dates have multiple values, e.g. 16 June 2020.
#> date value
#> 1 30-Jun-20 20
#> 2 29-Jun-20 -100
#> 3 26-Jun-20 -4
#> 4 16-Jun-20 -13
#> 5 16-Jun-20 40
#> 6 9-Jun-20 -6
For two-week periods ending on Tuesdays, I would like to take a sum of the value column.
So in the example data above, I want to sum ending on:
two weeks ending on Tuesday 16 June 2020 (i.e. from 3 June 2020 - 16 June 2020, inclusive)
two weeks ending on Tuesday 30 June 2020 (17 June 2020 - 30 June 2020 inclusive)
I'd ultimately like the code to keep summing the two-week periods ending on every second Tuesday as more data arrives.
So my desired output is:
#2_weeks_end total
#30-Jun-20 -84
#16-Jun-20 21
Tidyverse and lubridate solutions would be my first preference.
Code for data input below:
df <- data.frame(
stringsAsFactors = FALSE,
date = c("30-Jun-20","29-Jun-20",
"26-Jun-20","16-Jun-20","16-Jun-20","9-Jun-20"),
value = c(20L, -100L, -4L, -13L, 40L, -6L)
)
df
A solution using findInterval():
library(dplyr)
library(lubridate)  # for dmy()

df$date <- dmy(df$date)
df_intervals <- seq(as.Date("2020-06-03"), as.Date("2020-06-03") + 14 * 3, 14)
df %>%
  mutate(interval = findInterval(date, df_intervals)) %>%
  mutate(`2_weeks_end` = df_intervals[interval + 1] - 1) %>%
  group_by(`2_weeks_end`) %>%
  summarise(total = sum(value))
Returns:
# A tibble: 2 x 2
2_weeks_end total
<date> <int>
1 2020-06-16 21
2 2020-06-30 -84
Here is an option if weekly (or any other unit lubridate supports by default) is enough:
library(dplyr)
library(lubridate)
df %>%
  mutate(date = as.Date(date, format = "%d-%b-%y")) %>%
  group_by(week_ceil = ceiling_date(date - 1L, unit = "week", week_start = 2L)) %>%
  summarize(sums = sum(value))
Here is a data.table approach that creates a reference table followed by a non-equi join:
library(data.table)
library(lubridate)  # for floor_date()

setDT(df)
df[, date := as.Date(date, format = "%d-%b-%y")]
ref_dt = df[, .(beg_date = seq.Date(from = floor_date(min(date), unit = "week", week_start = 3L),
                                    to = max(date),
                                    by = "2 weeks"))]
ref_dt[, end_date := beg_date + 13L]
df[ref_dt,
   on = .(date >= beg_date,   # >= so the Wednesday that starts each fortnight is included
          date <= end_date),
   sum(value),
   by = .EACHI]
## date date V1
##1: 2020-06-03 2020-06-16 21
##2: 2020-06-17 2020-06-30 -84
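For comparison, the same fortnight totals can be had with base R only; this sketch uses cut() with an explicit vector of fortnight start dates (each the Wednesday after a period-ending Tuesday):

```r
df <- data.frame(
  date = as.Date(c("2020-06-30", "2020-06-29", "2020-06-26",
                   "2020-06-16", "2020-06-16", "2020-06-09")),
  value = c(20L, -100L, -4L, -13L, 40L, -6L)
)

# fortnights start on the Wednesday after each period-ending Tuesday
starts <- seq(as.Date("2020-06-03"), by = "14 days", length.out = 4)

# cut.Date labels each date with the start of its (left-closed) fortnight
df$period_start <- cut(df$date, breaks = starts)
aggregate(value ~ period_start, df, sum)
#   period_start value
# 1   2020-06-03    21
# 2   2020-06-17   -84
```

The labels are period starts rather than the period ends the asker wanted; adding 13 days to the label converts one to the other.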
There are 2 issues:
I have time data in factor format and I want to change it into date format for later manipulation.
The goal is to sum values of precipitation of the same time unit, eg. precipitation per hour.
I tried to convert the time using as.POSIXct() or as.Date() (and lubridate) but always get NA values once I specify the format. This is the code I used:
tt <- as.POSIXct(FixUseNew$StartTimestamp)
df$time <- as.Date(df$time, "%d-%m-%Y")
If I leave out the format and do the following :
tt=as.POSIXct(df$time)
tt
hour(tt)
The date data looks like this now: "0010-07-14 00:38:00 LMT"
I wanted to use aggregate function to sum the precipitation in the same hour interval or day but couldn't do it as I am stuck with the date format.
Just a brain dump: I was going to change the factor date into character, then into date format, as follows. Please advise if that is a good idea.
df$time <-paste(substr(df$time,6,7),
substr(df$time,9,10),
substr(df$time,1,4),sep="/")
Here is a subset of the data, hope this helps to illustrate the question better:
Id <- c(1,2,3,4)
Time <- c("10/7/2014 12:30:00 am", "10/7/2014 01:00:05 am","10/7/2014 01:30:10 am", "10/7/2014 02:00:15 am")
Precipitation <- c(0.06, 0.02,0,0.25)
cbind(Id, Time, Precipitation)
Thank you so much.
Here is the outcome. The order seems distorted; the dates appear to be sorted as text rather than as dates:
6 1/1/15 0:35 602
7 1/1/15 0:36 582
8 1/1/15 0:37 958
9 1/1/15 0:38 872
10 1/10/14 0:31 500
11 1/10/14 0:32 571
12 1/10/14 0:33 487
13 1/10/14 0:34 220
14 1/10/14 0:35 550
15 1/10/14 0:36 582
16 1/10/14 0:37 524
17 1/10/14 0:38 487
⋮
106 10/10/14 15:16 494
107 10/10/14 7:53 37
108 10/10/14 7:56 24
109 10/10/14 8:01 3
110 10/11/14 0:30 686
111 10/11/14 0:31 592
112 10/11/14 0:32 368
113 10/11/14 0:33 702
114 10/11/14 0:34 540
115 10/11/14 0:35 564
Using dplyr and lubridate packages we can extract the hour from each Time and sum.
library(dplyr)
library(lubridate)
df %>%
mutate(hour = hour(dmy_hms(Time))) %>%
group_by(hour) %>%
summarise(Precipitation = sum(Precipitation, na.rm = TRUE))
For aggregation by date, we can do
df %>%
mutate(day = as.Date(dmy_hms(Time))) %>%
group_by(day) %>%
summarise(Precipitation = sum(Precipitation, na.rm = TRUE))
Using base R, we could do
df$Hour <- format(as.POSIXct(df$Time, format = "%d/%m/%Y %I:%M:%S %p"), "%H")
df$Day <- as.Date(as.POSIXct(df$Time, format = "%d/%m/%Y %I:%M:%S %p"))
#Aggregation by hour
aggregate(Precipitation~Hour, df, sum, na.rm = TRUE)
#Aggregation by date
aggregate(Precipitation~Day, df, sum, na.rm = TRUE)
EDIT
Based on updated data and information, we can do
df <- readxl::read_xlsx("/path/to/file/df (1).xlsx")
hour_df <- df %>%
group_by(hour = hour(Time)) %>%
summarise(Precipitation = sum(Precipitation, na.rm = TRUE))
day_df <- df %>%
mutate(day = as.Date(Time)) %>%
group_by(day) %>%
summarise(Precipitation = sum(Precipitation, na.rm = TRUE))
So hour_df has the hourly sums of Precipitation regardless of date, and day_df has the sum of Precipitation for each day.
data
Id <- c(1,2,3,4)
Time <- c("10/7/2014 12:30:00 am", "10/7/2014 01:00:05 am",
"10/7/2014 01:30:10 am", "10/7/2014 02:00:15 am")
Precipitation <- c(0.06, 0.02,0,0.25)
df <- data.frame(Id, Time, Precipitation)
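Since the asker's Time column is stored as a factor, it may also help to convert it to character explicitly before parsing. This small sketch uses the same %d/%m/%Y %I:%M:%S %p format as the base R answer above (%I is the 12-hour clock, %p the am/pm marker):

```r
# a factor column like the asker's
Time <- factor(c("10/7/2014 12:30:00 am", "10/7/2014 01:00:05 am"))

# convert factor -> character -> POSIXct; %I/%p handle the 12-hour clock
tt <- as.POSIXct(as.character(Time), format = "%d/%m/%Y %I:%M:%S %p", tz = "UTC")
format(tt, "%H")
# "00" "01"  -- 12:30 am is hour 00
```

The hours can then be fed straight into aggregate() or group_by() as in the answer above.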
I have a data frame with a datetime column. I want to know the number of rows by hour of the day. However, I care only about the rows between 8 AM and 10 PM.
The lubridate package requires us to filter hours of the day using the 24-hour convention.
library(tidyverse)
library(lubridate)
### Fake Data with Date-time ----
x <- seq.POSIXt(as.POSIXct('1999-01-01'), as.POSIXct('1999-02-01'), length.out=1000)
df <- data.frame(myDateTime = x)
### Get all rows between 8 AM and 10 PM (inclusive)
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= 8, myHour <= 22) %>% ## between 8 AM and 10 PM (both inclusive)
count(myHour) ## number of rows
Is there a way for me to use 10:00 PM rather than the integer 22?
You can use the ymd_hm and hour functions to do 12-hour to 24-hour conversions.
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= hour(ymd_hm("2000-01-01 8:00 AM")), ## hour() ignores year, month, date
myHour <= hour(ymd_hm("2000-01-01 10:00 PM"))) %>% ## between 8 AM and 10 PM (both inclusive)
count(myHour)
A more elegant solution.
## custom function to convert 12 hour time to 24 hour time
hourOfDay_12to24 <- function(time12hrFmt){
  out <- paste("2000-01-01", time12hrFmt)
  out <- hour(ymd_hm(out))
  out
}
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= hourOfDay_12to24("8:00 AM"),
myHour <= hourOfDay_12to24("10:00 PM")) %>% ## between 8 AM and 10 PM (both inclusive)
count(myHour)
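Called on its own, the helper converts 12-hour clock strings to 24-hour integers (the date "2000-01-01" it pastes on is an arbitrary dummy, since hour() ignores it):

```r
library(lubridate)

# same helper as above
hourOfDay_12to24 <- function(time12hrFmt){
  out <- paste("2000-01-01", time12hrFmt)
  out <- hour(ymd_hm(out))
  out
}

hourOfDay_12to24("8:00 AM")   # 8
hourOfDay_12to24("10:00 PM")  # 22
```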
You can also use base R to do this
#Extract the hour
df$hour_day <- as.numeric(format(df$myDateTime, "%H"))
#Subset data between 08:00 AM and 10:00 PM
new_df <- df[df$hour_day >= as.integer(format(as.POSIXct("08:00 AM", format = "%I:%M %p"), "%H")) &
             df$hour_day <= as.integer(format(as.POSIXct("10:00 PM", format = "%I:%M %p"), "%H")), ]
#Count the frequency
stack(table(new_df$hour_day))
# values ind
#1 42 8
#2 42 9
#3 41 10
#4 42 11
#5 42 12
#6 41 13
#7 42 14
#8 41 15
#9 42 16
#10 42 17
#11 41 18
#12 42 19
#13 42 20
#14 41 21
#15 42 22
This gives the same output as the tidyverse/lubridate approach
library(tidyverse)
library(lubridate)
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= hour(ymd_hm("2000-01-01 8:00 AM")),
myHour <= hour(ymd_hm("2000-01-01 10:00 PM"))) %>%
count(myHour)
I have a table like this:
customer ID startdate enddate
11 22 2015-01-01 2015-03-01
11 55 2018-04-03 2018-06-16
22 33 2017-02-01 2017-04-01
And This is the output I want:
customer Id YearMonth
11 22 201501
11 22 201502
11 22 201503
11 55 201804
11 55 201805
11 55 201806
22 33 201702
22 33 201703
22 33 201704
I've Started writing this function:
datseq <- function(t1, t2) {
  seq(as.Date(t1), as.Date(t2), by = "month")
}
My questions are:
a. How can I correct the function to return the YYYYMM format?
b. How can I implement this function on the data frame so that each customer and ID gets the appropriate list of months? The output should be a data frame.
Thank you
We can do this with data.table: group by row, create a monthly sequence from 'startdate' to 'enddate', and format the Date values to return the expected "%Y%m" format.
library(data.table)
setDT(df1)[, .(customer = customer[1], Id = ID[1],
YearMonth = format(seq(startdate, enddate, by = '1 month'), "%Y%m")),
by = 1:nrow(df1)]
This can also be done with tidyverse
library(tidyverse)
df1 %>%
  mutate(YearMonth = map2(startdate, enddate,
                          ~ seq(.x, .y, by = "1 month") %>%
                            format(., format = "%Y%m"))) %>%
  select(-startdate, -enddate) %>%
  unnest(YearMonth)
If we need a base R option, Map can be used:
lst <- Map(function(x, y) seq(x, y, by = '1 month'), df1$startdate, df1$enddate)
Replicate the rows of the dataset by the lengths of the list elements, then create a 'YearMonth' column by concatenating the list elements and formatting them:
data.frame(df1[rep(1:nrow(df1), lengths(lst)), 1:2],
YearMonth = format(do.call(c, lst), "%Y%m"))
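For part (a) on its own, wrapping the month sequence in format() returns the YYYYMM strings directly (a base R sketch):

```r
# corrected helper: builds the monthly sequence, then formats as YYYYMM
datseq <- function(t1, t2) {
  format(seq(as.Date(t1), as.Date(t2), by = "month"), "%Y%m")
}

datseq("2015-01-01", "2015-03-01")
# "201501" "201502" "201503"
```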
I have a "date" vector, that contains dates in mm/dd/yyyy format:
head(Entered_Date,5)
[1] 1/5/1998 1/5/1998 1/5/1998 1/5/1998 1/5/1998
I am trying to plot a frequency variable against the date, but I want to group the dates that it is by month or year. As it is now, there is a frequency per day, but I want to plot the frequency by month or year. So instead of having a frequency of 1 for 1/5/1998, 1 for 1/7/1998, and 3 for 1/8/1998, I would like to display it as 5 for 1/1998. It is a relatively large data set, with dates from 1998 to present, and I would like to find some automated way to accomplish this.
> dput(head(Entered_Date))
structure(c(260L, 260L, 260L, 260L, 260L, 260L), .Label = c("1/1/1998",
"1/1/1999", "1/1/2001", "1/1/2002", "1/10/2000", "1/10/2001",
"1/10/2002", "1/10/2003", "1/10/2005", "1/10/2006", "1/10/2007",
"1/10/2008", "1/10/2011", "1/10/2012", "1/10/2013", "1/11/1999",
"1/11/2000", "1/11/2001", "1/11/2002", "1/11/2005", "1/11/2006",
"1/11/2008", "1/11/2010", "1/11/2011", "1/11/2012", "1/11/2013",
"1/12/1998", "1/12/1999", "1/12/2001", "1/12/2004", "1/12/2005", ...
The floor_date() function from the lubridate package does this nicely.
data %>%
group_by(month = lubridate::floor_date(date, "month")) %>%
summarize(summary_variable = sum(value))
Thanks to Roman Cheplyaka
https://ro-che.info/articles/2017-02-22-group_by_month_r
See more on how to use the function: https://lubridate.tidyverse.org/reference/round_date.html
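A quick self-contained illustration with dates like the asker's (a value column of ones stands in for the per-day frequencies, so the January 1998 total is 5 as in the question):

```r
library(dplyr)
library(lubridate)

# five January 1998 observations: 1/5, 1/7, and three on 1/8
data <- data.frame(
  date  = as.Date(c("1998-01-05", "1998-01-07",
                    "1998-01-08", "1998-01-08", "1998-01-08")),
  value = 1
)

data %>%
  group_by(month = floor_date(date, "month")) %>%
  summarize(summary_variable = sum(value))
# one row: month 1998-01-01, summary_variable 5
```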
Here is an example using dplyr. You simply use the corresponding date format string for month %m or year %Y in the format statement.
set.seed(123)
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
head(df)
date value
1 1998-01-01 2
2 1998-01-02 4
3 1998-01-03 3
4 1998-01-04 5
5 1998-01-05 5
6 1998-01-06 1
library(dplyr)
df %>%
mutate(month = format(date, "%m"), year = format(date, "%Y")) %>%
group_by(month, year) %>%
summarise(total = sum(value))
Source: local data frame [25 x 3]
Groups: month [?]
month year total
(chr) (chr) (int)
1 01 1998 105
2 01 1999 91
3 01 2000 3
4 02 1998 74
5 02 1999 77
6 03 1998 96
7 03 1999 86
8 04 1998 91
9 04 1999 95
10 05 1998 93
.. ... ... ...
Just to add to @cdeterman's answer, you can use lubridate along with dplyr to make this even easier:
df <- data.frame(date = seq.Date(from =as.Date("01/01/1998", "%d/%m/%Y"),
to=as.Date("01/01/2000", "%d/%m/%Y"), by="day"),
value = sample(seq(5), 731, replace = TRUE))
library(dplyr)
library(lubridate)
df %>%
mutate(month = month(date), year = year(date)) %>%
group_by(month, year) %>%
summarise(total = sum(value))
Maybe you can just add a column to your data like this (note the %m/%d/%Y format, since the dates are mm/dd/yyyy):
Year <- format(as.Date(Entered_Date, "%m/%d/%Y"), "%Y")
You don't need dplyr; look at ?as.POSIXlt:
df$date <- as.POSIXlt(df$date)
mon <- df$date$mon   # 0-based: January is 0
yr <- df$date$year   # years since 1900
monyr <- as.factor(paste(mon, yr, sep = "/"))
df$date <- monyr
You don't need ggplot2 either, but it's nice for this kind of thing:
p <- ggplot(df, aes(factor(date)))
p + geom_bar()
If you want to see the actual numbers:
df2 <- aggregate(. ~ date, data = df, FUN = length)
df2
date value
1 0/98 31
2 0/99 31
3 1/98 28
4 1/99 28
5 10/98 30
6 10/99 30
7 11/97 1
8 11/98 31
9 11/99 31
10 2/98 31
11 2/99 31
12 3/98 30
13 3/99 30
14 4/98 31
15 4/99 31
16 5/98 30
17 5/99 30
18 6/98 31
19 6/99 31
20 7/98 31
21 7/99 31
22 8/98 30
23 8/99 30
24 9/98 31
25 9/99 31
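The "0/98"-style labels in this output come from POSIXlt's conventions: $mon is 0-based and $year counts from 1900. Adding the offsets gives readable labels (a small sketch):

```r
d <- as.POSIXlt(as.Date("1998-01-05"))
d$mon    # 0  (January)
d$year   # 98 (years since 1900)

# add the offsets for human-readable month/year labels
sprintf("%d/%d", d$mon + 1, d$year + 1900)
# "1/1998"
```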
There is a super easy way using the cut() function:
dates <- as.Date(c("1998-5-2", "1993-4-16", "1998-5-10"))
cut(dates, breaks = "month")
and you will get this:
[1] 1998-05-01 1993-04-01 1998-05-01
62 Levels: 1993-04-01 1993-05-01 1993-06-01 1993-07-01 1993-08-01 ... 1998-05-01
Another solution is slider::slide_period:
library(slider)
library(dplyr)
monthly_summary <- function(data) {
  summarise(data, date = format(max(date), "%Y-%m"), value = sum(value))
}
slide_period_dfr(df, df$date, "month", monthly_summary)
date value
1 1998-01 92
2 1998-02 82
3 1998-03 113
4 1998-04 94
5 1998-05 92
6 1998-06 74
7 1998-07 89
8 1998-08 92
9 1998-09 91
10 1998-10 100
...
There is also group_by(month_yr = cut(date, breaks = "1 month")), which relies only on base R's cut() and doesn't need lubridate or other packages.