I need some help with R time series. I have daily temperature values for a 30-year period = 365*30 days = 10950 days (if leap years are not considered). I want to create a "daily climatology", that is, the average of
each calendar day (the 30 values for 1 January, the 30 values for 2 January, etc.), to create a time series with 365 values. Could anyone help me with this topic? Thanks in advance.
Something like this with dplyr + lubridate:
library(dplyr)
library(lubridate)
df %>%
group_by(month = month(date), day = day(date)) %>%
summarize(avg_value = mean(value)) %>%
pull(avg_value) %>%
ts() %>%
plot(ylab = "avg_value")
Result:
> df %>%
+ group_by(month = month(date), day = day(date)) %>%
+ summarize(avg_value = mean(value))
# A tibble: 366 x 3
# Groups: month [?]
month day avg_value
<dbl> <int> <dbl>
1 1 1 0.19750444
2 1 2 0.30492408
3 1 3 0.16760465
4 1 4 -0.09357058
5 1 5 0.10606383
6 1 6 -0.14456526
7 1 7 0.23384988
8 1 8 -0.11987095
9 1 9 -0.01166687
10 1 10 -0.08134161
# ... with 356 more rows
Data:
df = data.frame(date = seq.Date(as.Date("1970-1-1"), as.Date("2000-12-31"), "days"),
value = rnorm(length(seq.Date(as.Date("1970-1-1"), as.Date("2000-12-31"), "days"))))
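The result above has 366 rows because 29 February forms its own group. Here is a minimal base R sketch, assuming a 365-value climatology is wanted and that simply dropping 29 February is acceptable. The names `no_leap`, `md` and `clim` are illustrative and not part of the answer.

```r
# Hypothetical example data with the same shape as the answer's data
set.seed(1)
dates <- seq(as.Date("1970-01-01"), as.Date("2000-12-31"), by = "day")
df <- data.frame(date = dates, value = rnorm(length(dates)))

# Drop 29 February so that every year contributes the same 365 calendar days
no_leap <- df[format(df$date, "%m-%d") != "02-29", ]

# Month-day key such as "01-01"; these keys sort in calendar order
no_leap$md <- format(no_leap$date, "%m-%d")

# One mean per calendar day
clim <- aggregate(value ~ md, data = no_leap, FUN = mean)
nrow(clim)  # 365

# Time series with 365 values
clim_ts <- ts(clim$value)
```

Whether to drop 29 February or to merge it into a neighbouring day depends on the application; this sketch only shows the first option.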
I had the same problem to solve and found an answer here:
Daily average calculation from multiple year daily weather data?
It took me some time to understand and reorder all the comments because there was no complete code.
So here I give a complete example based on the link above.
As an example 3 years of random precipitation and temperature data:
test_data <- data.frame("date"= seq(from = as.Date("1990/1/1"), to = as.Date("1992/12/31"), by = "day"),"prec" =runif(1096, 0, 10),"temp" = runif(1096, 0, 10))
The next step is to add a new column with the variable over which the average will be calculated. In this example that is the calendar day (month and day):
test_data$day <- format(test_data$date, format='%m-%d')
In this column every calendar day appears 3 times because of the 3 years (29 February appears only once, since only 1992 is a leap year). So we can calculate the mean for every day:
test_data_daily_mean <- aggregate(cbind(prec, temp) ~ (day), data=test_data, FUN=mean)
Hint: For this solution the date column has to be of class Date. Otherwise you have to convert it to R dates first, like this:
as.Date(data$date, format='%d-%m-%Y')
This answer is a little late, but maybe it helps someone else!
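As a small illustration of the hint about date conversion, here is a sketch that assumes the dates arrive as day-month-year strings. The data frame `raw` and its values are made up for the example.

```r
# Hypothetical raw data where the dates were read in as character strings
raw <- data.frame(
  date = c("01-01-1990", "02-01-1990", "01-01-1991"),
  temp = c(3.2, 4.1, 2.7),
  stringsAsFactors = FALSE
)
class(raw$date)  # "character"

# Convert to Date, stating the input format explicitly (day-month-year here)
raw$date <- as.Date(raw$date, format = "%d-%m-%Y")
class(raw$date)  # "Date"

# The month-day key used for the daily mean can now be built
raw$day <- format(raw$date, format = "%m-%d")
raw$day  # "01-01" "01-02" "01-01"
```

If the format string does not match the input, `as.Date()` returns `NA` rather than raising an error, so it is worth checking for `NA` values after the conversion.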
Related
I have some test data with two columns. The column "hour" shows hourly values (p.m.). The column "day" indicates the corresponding day, i.e. on day 1 there are hourly values from 7 to 11 o'clock.
I now want to calculate how big the time span is for each day and store these values in a vector.
Something like:
timespan <- c(5,7,3)
How could I calculate this in a loop?
I thought about something like length(unique...)
Thanks in advance!
Here is the code:
day <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
hour <- c(7,7,8,10,11,5,6,6,7,11,9,10,10,11,11)
df <- data.frame(day,hour)
library(dplyr)
df %>%
group_by(day) %>%
summarise(time_span = max(hour) - min(hour) + 1)
## A tibble: 3 x 2
# day time_span
# <dbl> <dbl>
# 1 1 5
# 2 2 7
# 3 3 3
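Since the question asks for the result as a plain vector, a base R sketch with `tapply()` may also be useful. It uses the same data as above; the names `span_per_day` and `distinct_hours` are illustrative. It also shows why `length(unique(...))` gives a different answer: that counts distinct hours rather than the span from first to last hour.

```r
day  <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
hour <- c(7, 7, 8, 10, 11, 5, 6, 6, 7, 11, 9, 10, 10, 11, 11)

# Span from the first to the last hour of each day, both ends included
span_per_day <- tapply(hour, day, function(h) max(h) - min(h) + 1)
timespan <- as.vector(span_per_day)
timespan  # 5 7 3

# Counting distinct hours is not the same thing: gaps are not counted
distinct_hours <- as.vector(tapply(hour, day, function(h) length(unique(h))))
distinct_hours  # 4 4 3
```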
So I have a data frame of daily stock prices. I also have a variable that indicates the week of the year (1, 2, 3, 4, ..., 51, 52); this is repeated for 22 companies. I would like to create a new variable that averages the daily prices within each week.
The above equation has d = day and t = week. My challenge is taking this average of days across each week. Therefore, I should have 52 values per stock that I observe.
Using ave().
dat <- transform(dat, avg_week_price=ave(price, week, company))
head(dat, 9)
# week company wday price avg_week_price
# 1 1 1 a 16.16528 15.47573
# 2 2 1 a 18.69307 15.13812
# 3 3 1 a 11.01956 12.99854
# 4 1 2 a 15.92029 14.56268
# 5 2 2 a 12.26731 13.64916
# 6 3 2 a 17.40726 17.27226
# 7 1 3 a 11.83037 13.02894
# 8 2 3 a 13.09144 12.95284
# 9 3 3 a 12.08950 15.81040
Data:
set.seed(42)
dat <- expand.grid(week=1:3, company=1:5, wday=letters[1:7])
dat$price <- runif(nrow(dat), 10, 20)
An option with dplyr
library(dplyr)
dat %>%
group_by(week, company) %>%
mutate(avg_week_price = mean(price))
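Both answers keep one row per day and repeat the weekly mean on each row. If one row per week and company is wanted instead (the question mentions 52 values per stock), a base R sketch with `aggregate()` could look like this. It rebuilds the example data so that it runs on its own; the name `weekly` is illustrative.

```r
# Same example data as in the answer above
set.seed(42)
dat <- expand.grid(week = 1:3, company = 1:5, wday = letters[1:7])
dat$price <- runif(nrow(dat), 10, 20)

# Collapse to one mean price per week and company
weekly <- aggregate(price ~ week + company, data = dat, FUN = mean)
names(weekly)[names(weekly) == "price"] <- "avg_week_price"

nrow(weekly)  # 15, i.e. 3 weeks x 5 companies
head(weekly)
```

With real data, the week number repeats every year, so a year column would probably have to be added to the grouping.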
I'm trying to calculate the number of days that a patient spent during a given state in R.
An image of example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking that if I can create a date column in column 4 that holds the first recorded date for each state, then I can subtract it from column 2 and get the days I am looking for.
I tried a group_by(MRN, STATE), but the problem is that it groups the second set of 1's together with the first set of 1's, and likewise for the 2's, which is not what I want.
Use mdy_hm to change OBS_DTM to POSIXct type, then group_by MRN and the rleid of STATE so that the first set of 1's is handled separately from the second set. Use difftime to calculate the difference in days between OBS_DTM and the minimum value in the group.
If your data is called data :
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
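To make the run-length grouping more concrete, here is a base R sketch on made-up data. It assumes the rows are sorted by patient and time. The run id is built with `cumsum()` instead of `data.table::rleid()`, which should give the same ids for a single vector. The data frame `obs` and the columns `run` and `days` are hypothetical.

```r
# Hypothetical observations for one patient: state 1, then 2, then 1 again
obs <- data.frame(
  MRN = rep(1, 6),
  OBS_DTM = as.POSIXct(
    c("2020-01-01 08:00", "2020-01-02 08:00", "2020-01-03 08:00",
      "2020-01-05 08:00", "2020-01-06 08:00", "2020-01-08 08:00"),
    tz = "UTC"
  ),
  STATE = c(1, 1, 2, 2, 1, 1)
)

# A new run starts whenever STATE differs from the previous row
obs$run <- cumsum(c(TRUE, obs$STATE[-1] != obs$STATE[-nrow(obs)]))
obs$run  # 1 1 2 2 3 3

# Earliest timestamp (in seconds) within each patient and run
secs <- as.numeric(obs$OBS_DTM)
first_in_run <- ave(secs, obs$MRN, obs$run, FUN = min)

# Days elapsed since the start of the current run
obs$days <- (secs - first_in_run) / 86400
obs$days  # 0 1 0 2 0 2
```

The second block of state 1 gets run id 3, so its clock restarts at 0 instead of continuing from the first block.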
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14
I have a dataframe containing daily prices from a stock exchange with the corresponding dates for several years. The dates are trading dates, so weekends and holidays are excluded. Ex:
df$date <- as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04"))
I have used lubridate to extract a column containing the month of each date, but I am struggling to create a column that, for each month of every year, gives the number of the trading day within that month. In the example, the counter should be 1 for 2017-04-03, because it is the first observation of the month (not 3, its calendar day), and it should run up to the last observation of the month. The column would look like this:
df$DayofMonth <- c(22, 23, 1, 2)
and not
df$DayofMonth <- c(30, 31, 3, 4)
Is there anybody that can help me?
Maybe this helps:
library(data.table)
library(stringr)
df <- setDT(df)
df[,YearMonth:=str_sub(Date,1,7)]
df[, DayofMonth := seq(.N), by = YearMonth]
This creates a column called YearMonth with values like '2020-01'.
Then, within each group (month), each date gets an index, which in your case corresponds to the trading day.
This gives 1 for the date '2017-04-03', since it is the first trading day of that month. It only works if your df is sorted from the earliest to the latest date and contains every trading day of each month.
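The same row-counting idea can be sketched in base R with `ave()`. The example builds a hypothetical trading calendar for March and April 2017 by keeping all weekdays (holidays are ignored here, which is an assumption), so that the counter can be checked against the values in the question.

```r
# Hypothetical trading calendar: every Monday to Friday in March and April 2017
all_days <- seq(as.Date("2017-03-01"), as.Date("2017-04-30"), by = "day")
wday <- as.POSIXlt(all_days)$wday          # 0 = Sunday, 6 = Saturday
df <- data.frame(date = all_days[!wday %in% c(0, 6)])

# The counter relies on row order, so sort by date first
df <- df[order(df$date), , drop = FALSE]

# Number the rows within each year-month group
df$DayofMonth <- ave(seq_along(df$date), format(df$date, "%Y-%m"),
                     FUN = seq_along)

# Look up the four dates from the question
wanted <- as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04"))
df[df$date %in% wanted, ]
# DayofMonth is 22, 23, 1, 2 for these dates
```

Because the counter is just a row number per month, it reflects holidays automatically when the real data already excludes them.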
There is a way using lubridate to extract the date components and dplyr.
library(dplyr)
library(lubridate)
df <- data.frame(date = as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04")))
df %>%
mutate(month = month(date),
year = year(date),
day = day(date)) %>%
group_by(year, month) %>%
mutate(DayofMonth = day - min(day) + 1)
# A tibble: 4 x 5
# Groups: year, month [2]
date month year day DayofMonth
<date> <dbl> <dbl> <int> <dbl>
1 2017-03-30 3 2017 30 1
2 2017-03-31 3 2017 31 2
3 2017-04-03 4 2017 3 1
4 2017-04-04 4 2017 4 2
You can try the following :
For each date find out the first day of that month.
Count how many working days are present between first_day_of_month and current date.
library(dplyr)
library(lubridate)
df %>%
mutate(first_day_of_month = floor_date(date, 'month'),
day_of_month = purrr::map2_dbl(first_day_of_month, date,
~sum(!weekdays(seq(.x, .y, by = 'day')) %in% c('Saturday', 'Sunday'))))
# date first_day_of_month day_of_month
#1 2017-03-30 2017-03-01 22
#2 2017-03-31 2017-03-01 23
#3 2017-04-03 2017-04-01 1
#4 2017-04-04 2017-04-01 2
You can drop the first_day_of_month column if not needed.
data
df <- data.frame(date = as.Date(c("2017-03-30", "2017-03-31",
"2017-04-03", "2017-04-04")))
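One caveat with `weekdays()` is that it returns day names in the current locale, so comparing against 'Saturday' and 'Sunday' may fail on a non-English system. Here is a locale-independent base R sketch of the same counting approach. The function name `weekday_number_in_month` is made up for this example, and holidays are not handled.

```r
# Count Monday-to-Friday days from the first of the month up to each date
weekday_number_in_month <- function(d) {
  first <- as.Date(format(d, "%Y-%m-01"))
  vapply(seq_along(d), function(i) {
    days <- seq(first[i], d[i], by = "day")
    # $wday is numeric (0 = Sunday, 6 = Saturday), independent of locale
    sum(!as.POSIXlt(days)$wday %in% c(0, 6))
  }, numeric(1))
}

dates <- as.Date(c("2017-03-30", "2017-03-31", "2017-04-03", "2017-04-04"))
weekday_number_in_month(dates)  # 22 23 1 2
```

Unlike a row counter, this does not need the full price history, but it will overcount in months that contain exchange holidays.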
I have a date variable called DATE as follows:
DATE
2019-12-31
2020-01-01
2020-01-05
2020-01-09
2020-01-25
I am trying to return a result that counts how many dates fall in each week, where week 1 starts at the minimum of the DATE variable. So it would look something like this:
Week Count
1 3
2 1
3 0
4 1
Thanks in advance.
From base R
dates <- c('2019-12-31','2020-01-01','2020-01-05','2020-01-09','2020-01-25')
weeks <- strftime(dates, format = "%V")
table(weeks)
We subtract the minimum DATE from each DATE value to get the difference in days. We divide that difference by 7, round down and add 1 to get the week number, and count the rows per week. We then use complete to fill in the weeks that have no dates.
library(dplyr)
df %>%
dplyr::count(week = floor(as.integer(DATE - min(DATE))/7) + 1) %>%
tidyr::complete(week = min(week):max(week), fill = list(n = 0))
# week n
# <dbl> <dbl>
#1 1 3
#2 2 1
#3 3 0
#4 4 1
If your DATE column is not of date class, first run this :
df$DATE <- as.Date(df$DATE)
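For comparison, the same week counting can be sketched in base R without extra packages. Turning the week number into a factor with all levels from 1 to the last week makes `table()` report empty weeks as 0. The names `week` and `counts` are illustrative.

```r
DATE <- as.Date(c("2019-12-31", "2020-01-01", "2020-01-05",
                  "2020-01-09", "2020-01-25"))

# Week number counted from the earliest date: days 0-6 are week 1, 7-13 week 2, ...
week <- as.integer(DATE - min(DATE)) %/% 7 + 1
week  # 1 1 1 2 4

# Include every week up to the last one so that empty weeks show up as 0
counts <- table(factor(week, levels = seq_len(max(week))))

data.frame(Week = as.integer(names(counts)), Count = as.vector(counts))
# Count is 3, 1, 0, 1 for weeks 1 to 4
```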