My data is in a dataframe which has a structure like this:
df2 <- data.frame(Year = c("2007"), Week = c(1:12), Measurement = c(rnorm(12, mean = 4, sd = 1)))
Unfortunately I do not have the complete date (e.g. days are missing) for each measurement, only the Year and the Weeks (these are ISO weeks).
Now I want to aggregate the median of each month's measurements (i.e. the weekly measurements per month of the given year) into a new column, Months. I did not find a convenient way to do this without having the exact day of the measurements available. Any input is much appreciated!
When it is necessary to allocate a week to a single month, the rule for first week of the year might be applied, although ISO 8601 does not consider this case. (Wikipedia)
For example, the 5th week of 2007 belongs to February, because the Thursday of the 5th week was the 1st of February.
I am using the data.table and ISOweek packages. The example below shows how to compute the month of each week; you can then do any aggregation by month.
require(data.table)
require(ISOweek)
df2 <- data.table(Year = c("2007"), Week = c(1:12),
Measurement = c(rnorm(12, mean = 4, sd = 1)))
# Generate Thursday as year, week of the year, day of week according to ISO 8601
df2[, thursday_ISO := paste(Year, sprintf("W%02d", Week), 4, sep = "-")]
# Convert Thursday to date format
df2[, thursday_date := ISOweek2date(thursday_ISO)]
# Compute month
df2[, month := format(thursday_date, "%m")]
df2
Suggestion by Uwe to compute a year-month string.
# Compute year-month
df2[, yr_mon := format(ISOweek2date(sprintf("%s-W%02d-4", Year, Week)), "%Y-%m")]
df2
Finally, you can either aggregate into a new table or add the median as a column.
df2[, median(Measurement), by = yr_mon]
df2[, median := median(Measurement), by = yr_mon]
df2
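For readers who would rather not depend on the ISOweek package, the Thursday rule described above can be sketched in base R. This is only an illustration: it relies on the fact that 4 January always falls in ISO week 1, and the helper name iso_week_month is made up for this example.

```r
# Hypothetical helper (not part of any package): the month an ISO week
# belongs to, decided by the Thursday of that week.
iso_week_month <- function(year, week) {
  year <- as.integer(year)
  jan4 <- as.Date(sprintf("%d-01-04", year))   # 4 January is always in ISO week 1
  # ISO weekday of 4 January: Monday = 1, ..., Sunday = 7
  iso_wday <- (as.POSIXlt(jan4)$wday + 6) %% 7 + 1
  week1_monday <- jan4 - (iso_wday - 1)        # Monday that starts ISO week 1
  thursday <- week1_monday + 7 * (week - 1) + 3
  format(thursday, "%Y-%m")
}

# Weeks 1 to 4 of 2007 fall in January; week 5 is the first February week
iso_week_month("2007", 1:6)
```

The returned year-month strings can be used as a grouping column in the same way as yr_mon above.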
If I understand correctly, you don't know the exact day, only the week number and year. My answer takes the first day of the year as the starting date and then computes one-week intervals from it. You can probably refine the answer.
Based on an answer by mnel, using the lubridate package.
library(lubridate)
# Prepare week, month, year information ready for the merge
# Make sure you have all the necessary dates
wmy <- data.frame(Day = seq(ymd('2007-01-01'),ymd('2007-04-01'),
by = 'weeks'))
wmy <- transform(wmy,
Week = isoweek(Day),
Month = month(Day),
Year = isoyear(Day))
# Merge this information with your data
merge(df2, wmy, by = c("Year", "Week"))
Year Week Measurement Day Month
1 2007 1 3.704887 2007-01-01 1
2 2007 10 1.974533 2007-03-05 3
3 2007 11 4.797286 2007-03-12 3
4 2007 12 4.291169 2007-03-19 3
5 2007 2 4.305010 2007-01-08 1
6 2007 3 3.374982 2007-01-15 1
7 2007 4 3.600008 2007-01-22 1
8 2007 5 4.315184 2007-01-29 1
9 2007 6 4.887142 2007-02-05 2
10 2007 7 4.155411 2007-02-12 2
11 2007 8 4.711943 2007-02-19 2
12 2007 9 2.465862 2007-02-26 2
Using dplyr, you can try:
require(dplyr)
df2 %>% mutate(Date = as.Date(paste("1", Week, Year, sep = "-"), format = "%w-%W-%Y"),
Year_Mon = format(Date,"%Y-%m")) %>% group_by(Year_Mon) %>%
summarise(result = median(Measurement))
As @djhrio pointed out, the Thursday determines which month a week belongs to. So simply switch paste("1", to paste("4", in the code above.
This can be done relatively simply in dplyr.
library(dplyr)
df2 %>%
mutate(Month = rep(1:3, each = 4)) %>%
group_by(Month) %>%
summarise(MonthlyMedian = stats::median(Measurement))
Basically, add a new column to define your months. I'm presuming since you don't have days, you are going to allocate 4 weeks per month?
Then you just group by your Month variable and calculate the median. Very simple
Hope this helps
Been a little stuck on this for a couple days.
Let's say I have a cohort of 2 people.
Person 1 was in cohort from 01/01/2000 to 01/03/2001.
Person 2 was in cohort from 01/01/1999 to 31/12/2001.
This means person 1 was in the cohort for all of 2000 and 25% of 2001.
Person 2 was in the cohort for all of 1999, all of 2000, and all of 2001.
Adding this together means that, in total, the cohort contributed 1 year of person-time in 1999,
2 years of person-time in 2000, and 1.25 years of person-time in 2001.
Does anyone know of any R functions that might help with dividing up/summing time elapsed between dates like this? I could write it all from scratch, but I'd like to use existing functions if they're out there, and Google has got me nowhere.
Thanks!
Using data.table and lubridate:
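library(data.table)
library(lubridate)
# Assumes Data is a data.table with columns Person, Start and End (both of class Date)
Data <- Data[, .(Start, Start2 = seq(Start, End, by="year"), End), by=.(Person)]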
Data[, End2 := Start2+years(1)-days(1)]
Data[year(Start2) != year(Start), Start := Start2]
Data[year(End2) != year(End), End := End2]
Data[, c("Year", "Contribution") := list(year(Start), (month(End)-month(Start)+1)/12)]
Data <- Data[, .(Contribution = sum(Contribution)), by=.(Year)][order(Year)]
Which gives:
> Data
Year Contribution
1: 1999 1.00
2: 2000 2.00
3: 2001 1.25
This is a possible generalized tidyverse approach also using lubridate. This creates rows for each year and appropriate time intervals for each person-year. The intersection between the calendar year and person-year interval will be the contribution summed up in the end. Note that Jan 1 to Mar 1 here would be considered 2 months or 1/6 of a year contribution (not 25%).
library(lubridate)
library(tidyverse)
df <- data.frame(
  person = c("Person 1", "Person 2"),
  start = c("01/01/2000", "01/01/1999"),
  end = c("01/03/2001", "31/12/2001")
)
# Parse the day/month/year strings into Date objects (dmy() comes from lubridate)
df$start <- dmy(df$start)
df$end <- dmy(df$end)
df %>%
mutate(date_int = interval(start, end),
year = map2(year(start), year(end), seq)) %>%
unnest(year) %>%
mutate(
year_int = interval(
as.Date(paste0(year, '-01-01')), as.Date(paste0(year, '-12-31'))
),
year_sect = intersect(date_int, year_int)
) %>%
group_by(year) %>%
summarise(contribute = signif(sum(as.numeric(year_sect, "years")), 2))
Output
year contribute
<int> <dbl>
1 1999 1
2 2000 2
3 2001 1.2
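The same overlap idea can also be sketched in base R without any packages. This is a rough illustration, not a replacement for the answers above: it counts whole days, treats both the start and the end date as part of the cohort time (an assumption), and divides by the number of days in each calendar year. The function name person_years is made up for this example.

```r
# Hypothetical helper: person-years contributed to each calendar year,
# counted in whole days with start and end dates both included (an assumption).
person_years <- function(start, end) {
  s <- as.numeric(start)
  e <- as.numeric(end)
  years <- seq(as.integer(format(min(start), "%Y")),
               as.integer(format(max(end), "%Y")))
  contribution <- sapply(years, function(y) {
    y_start <- as.numeric(as.Date(sprintf("%d-01-01", y)))
    y_end   <- as.numeric(as.Date(sprintf("%d-12-31", y)))
    overlap <- pmin(e, y_end) - pmax(s, y_start) + 1   # days of each person inside year y
    sum(pmax(overlap, 0)) / (y_end - y_start + 1)      # divide by 365 or 366
  })
  data.frame(year = years, contribution = contribution)
}

cohort <- data.frame(
  start = as.Date(c("2000-01-01", "1999-01-01")),
  end   = as.Date(c("2001-03-01", "2001-12-31"))
)
py <- person_years(cohort$start, cohort$end)
py
```

Under this day-count convention, 1 January to 1 March 2001 is 60 of 365 days, so Person 1 contributes roughly 0.16 of a year in 2001 rather than 25%.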
I want to calculate log returns for a stock in R. The issue is that my financial year is from April 1 to March 31. I have tried using packages tidyquant and tidyverse. The code I have tried is as follows:
library(tidyquant)
RIL<- tq_get("RELIANCE.NS") # download the stock price data of Reliance Industries Limited listed on NSE of India. The data is from January 2011 to May 2021.
library(tidyverse)
RIL1<- RIL %>% mutate(CalYear = year(date),
Month = month(date),
FinYear = if_else(Month<4,CalYear,CalYear+1)) # This creates a new variable called FinYear, which correctly shows the financial year. If the month is >3 (ie March), the financial year is calendar year +1.
RIL_Returns<- RIL1 %>%
group_by(FinYear) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "log") #This part of the code has the problem.
From this code, I get two values for log returns for each year. This can't be right. I want a table with columns FinYear and Log_Returns, where Log_Returns is defined as ln(adjusted close price on the last trading day of the given FinYear / adjusted close price on the first trading day of the given FinYear). How can I do this?
Perhaps this is not the most elegant, but I think it works: I obtained the first and last trading day of each calendar year manually and computed the log returns accordingly.
# Get data
library("dplyr")   # provides %>%, mutate() and if_else()
library("tibble")
library("tidyquant")
RIL<- tq_get("RELIANCE.NS")
RIL1<- RIL %>% mutate(CalYear = year(date),
Month = month(date),
FinYear = if_else(Month<4,CalYear,CalYear+1))
# Get minimum and max dates in each year
start_dates = c()
end_dates = c()
for(year in format(min(RIL1$date),"%Y"):format(max(RIL1$date),"%Y")){
start_dates =
c(start_dates,
min(RIL1$date[format(RIL1$date, "%Y") == format(as.Date(ISOdate(year, 1, 1)),"%Y")])
)
end_dates =
c(end_dates,
max(RIL1$date[format(RIL1$date, "%Y") == format(as.Date(ISOdate(year, 1, 1)),"%Y")])
)
}
# Get filtered data
RIL2 <- RIL1[(RIL1$date %in% start_dates | RIL1$date %in% end_dates),]
# Get log returns, even indexes represent end of each year rows
end_adjusted = RIL2$adjusted[1:length(RIL2$adjusted) %% 2 == 0]
beginning_adjusted = RIL2$adjusted[1:length(RIL2$adjusted) %% 2 != 0]
log_returns = log(end_adjusted/beginning_adjusted)
# Put log returns and years in a tibble.
result = tibble(log_returns ,format(RIL2$date[1:length(RIL2$date) %% 2 == 0], "%Y"))
# Result
result
Outputs
# A tibble: 11 x 2
log_returns `format(RIL2$date[1:length(RIL2$date)%%2 == 0],…
<dbl> <chr>
1 -0.412 2011
2 0.185 2012
3 0.0739 2013
4 0.0117 2014
5 0.145 2015
6 0.0743 2016
7 0.537 2017
8 0.215 2018
9 0.306 2019
10 0.287 2020
11 0.0973 2021
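Note that the loop above groups by calendar year (format(date, "%Y")), not by the April to March financial year asked about in the question. A minimal base R sketch that groups by FinYear instead could look like the following. The function name fin_year_log_returns and the toy prices are invented for illustration, so nothing is downloaded.

```r
# Hypothetical helper: log return per financial year (1 April to 31 March),
# using the first and last available price within each financial year.
fin_year_log_returns <- function(date, adjusted) {
  ord <- order(date)                       # make sure prices are in date order
  date <- date[ord]
  adjusted <- adjusted[ord]
  cal_year <- as.integer(format(date, "%Y"))
  month <- as.integer(format(date, "%m"))
  # Same rule as in the question: January to March belong to the current
  # calendar year's FinYear, April onwards to the next one
  fin_year <- ifelse(month < 4, cal_year, cal_year + 1L)
  first_px <- tapply(adjusted, fin_year, function(p) p[1])
  last_px  <- tapply(adjusted, fin_year, function(p) p[length(p)])
  data.frame(FinYear = as.integer(names(first_px)),
             Log_Returns = as.numeric(log(last_px / first_px)))
}

# Toy data (made up): two financial years with two prices each
toy <- data.frame(
  date = as.Date(c("2020-04-01", "2021-03-31", "2021-04-01", "2022-03-31")),
  adjusted = c(100, 110, 200, 100)
)
res <- fin_year_log_returns(toy$date, toy$adjusted)
res
```

With real data, the same function could be called as fin_year_log_returns(RIL$date, RIL$adjusted), assuming those columns exist as in the question.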
I am trying to create a monthly average of precipitation values of two different time sets, but I can't get the data to be split into two before making the aggregation.
I have a dataset of daily precipitation data from 01-01-2006 to 31-12-2099 and I want to aggregate per month over the time period of (01-01-2015 to 31-12-2054) and (01-01-2055 to 31-12-2099).
I have used the aggregate function to create an average per month like this. But now I have the average per month over the entire data set (2006-2100) and I want to have two lists (one for 01-01-2015 to 31-12-2054 and one for 01-01-2055 to 31-12-2099). I think I need to make a subset or split the data, but I cannot find how to combine this with the aggregate function. Thank you so much!
months = Alentejo_RCP4.5_Average$Month
Alentejo_RCP4.5_Average.myma = aggregate(x = Alentejo_RCP4.5_Average,
by = list(months), FUN = mean)
I also tried this but it just takes the dates and not the attached values to the date.
df <- data.frame(date=as.Date("2015-01-01")+1:365, x=1:365)
list <- split(df,df$date<as.Date("2055-01-01"))
zz <- " Year Month Day Date Average_P
2006 1 1 2006-01-01 6.5
2007 1 2 2007-01-02 2.8
2055 3 3 2055-03-03 3.5
2058 3 4 2058-03-04 5.1
2060 5 5 2060-05-05 3.2"
Data <- read.table(text=zz, header = TRUE)
Instead of splitting the datasets you can create a new column to distinguish between the two groups and take mean of each group and each month.
library(dplyr)
Data %>%
  mutate(Date = as.Date(Date),
         group = ifelse(Date < as.Date("2055-01-01"),
                        'below_2055', 'above_2055'),
         month = format(Date, '%m-%Y')) %>%
  # Group by the period and the month column created above
  group_by(group, month) %>%
  summarise(Average_P = mean(Average_P)) -> result
Or in base R :
Data$Date <- as.Date(Data$Date)
aggregate(Average_P~group + month,
transform(Data,
group = ifelse(Date < as.Date("2055-01-01"),
'below_2055', 'above_2055'),
month = format(Date, '%m-%Y')), mean) -> result
If you need final output as list you can then use split.
split(result,result$group)
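If the goal is one average per calendar month within each of the two periods named in the question (2015 to 2054 and 2055 to 2099), a base R sketch along the following lines may help. The sample values are invented, the period labels are arbitrary, and rows before 2015 are dropped because cut() assigns them NA.

```r
# Toy data (made up) in the same shape as the question's Date / Average_P columns
Data <- data.frame(
  Date = as.Date(c("2006-01-01", "2015-01-10", "2015-01-20",
                   "2055-03-03", "2058-03-04", "2060-05-05")),
  Average_P = c(6.5, 2, 4, 3.5, 5.1, 3.2)
)

yr <- as.integer(format(Data$Date, "%Y"))
# (2014, 2054] and (2054, 2099]; years outside these ranges become NA
Data$period <- cut(yr, breaks = c(2014, 2054, 2099),
                   labels = c("2015-2054", "2055-2099"))
Data$Month <- as.integer(format(Data$Date, "%m"))

# Mean per period and calendar month; rows with an NA period are omitted
monthly <- aggregate(Average_P ~ period + Month, data = Data, FUN = mean)

# One data frame per period, if a list is preferred
by_period <- split(monthly, monthly$period)
by_period
```

Grouping on the month number alone (rather than month and year) is a design choice here, so that each period ends up with at most twelve rows.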
I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent at the hospital per year, so I need to deal with cases like ID 3, whose stay runs past the end of the year, and ID 4, whose stay is longer than one year. There is also the problem that some people do have a record in the following year, and I would like to add the 'surplus' days to that record when this happens.
So far I have come up with this solution:
library(lubridate)
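# Last day of the admission year, and the days left until then
year_end <- as.Date(paste(year(data$date), "12-31", sep = "-"), format = "%Y-%m-%d")
days_left <- as.numeric(year_end - data$date)
# Cap the stay at the days remaining in the admission year
ndays_new <- ifelse(days_left < data$ndays, days_left, data$ndays)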
However, I can't think of a way to take the 'surplus' days that run past the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact. But, I tried to employ dplyr and did the following. I initially changed column names for my own understanding. I calculated another date (i.e., date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, that means you do not have to consider the following year. If the years do not match, you need to consider the following year. ndays.2 is basically ndays for the following year. Then, I reshaped the data using do. After filtering unnecessary rows with NAs, I changed date to year and aggregated the data by ID and year.
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
date.2 = date.1 + (60 * 60 * 24) * ndays.1,
ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")),"-01-01"), format = "%Y-%m-%d")),
ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
filter(complete.cases(ndays)) %>%
mutate(date = as.numeric(format(date, "%Y"))) %>%
rename(year = date) %>%
group_by(ID, year) %>%
summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
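For stays that may span more than two calendar years, one simple (if not very memory-efficient) alternative is to expand each stay into one row per day and count the rows per ID and year. The sketch below uses base R only. It assumes the admission date counts as day 1 of the stay, and the helper name days_per_year is made up for this example.

```r
# Hypothetical helper: split each stay across calendar years by expanding it
# to one row per day. Assumes the admission date counts as day 1 of the stay.
days_per_year <- function(id, date, ndays) {
  pieces <- lapply(seq_along(id), function(k) {
    days <- seq(date[k], by = "day", length.out = ndays[k])
    data.frame(ID = id[k], year = as.integer(format(days, "%Y")), ndays = 1L)
  })
  long <- do.call(rbind, pieces)
  # Count the days per ID and calendar year
  aggregate(ndays ~ ID + year, data = long, FUN = sum)
}

stays <- data.frame(
  ID    = c(1, 2, 3, 4, 4),
  date  = as.Date(c("2005-06-01", "2005-06-15", "2005-12-25",
                    "2005-01-01", "2006-06-04")),
  ndays = c(15, 60, 20, 400, 15)
)
per_year <- days_per_year(stays$ID, stays$date, stays$ndays)
per_year[order(per_year$ID, per_year$year), ]
```

Because the days are summed per ID and year, the surplus days of ID 4 are automatically added to that person's separate 2006 record.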
I need to convert a date (m/d/y format) into 3 separate columns on which I hope to run an algorithm (I'm trying to convert my dates into Julian Day Numbers). I saw this suggestion to another user for separating data into multiple columns using Oracle. I'm using R and am thoroughly stuck on how to code this appropriately. Would A1, A2, ... represent my new column headings, and how would the format of the "update set" section differ?
update <tablename> set A1 = substr(ORIG, 1, 4),
A2 = substr(ORIG, 5, 6),
A3 = substr(ORIG, 11, 6),
A4 = substr(ORIG, 17, 5);
I'm trying hard to improve my skills in R but cannot figure this one...any help is much appreciated. Thanks in advance... :)
I use the format() method for Date objects to pull apart dates in R. Using Dirk's datetxt, here is how I would go about breaking up a date into its constituent parts:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
datetxt <- as.Date(datetxt)
df <- data.frame(date = datetxt,
year = as.numeric(format(datetxt, format = "%Y")),
month = as.numeric(format(datetxt, format = "%m")),
day = as.numeric(format(datetxt, format = "%d")))
Which gives:
> df
date year month day
1 2010-01-02 2010 1 2
2 2010-02-03 2010 2 3
3 2010-09-10 2010 9 10
Note what several others have said; you can get the Julian dates without splitting out the various date components. I added this answer to show how you could do the breaking apart if you needed it for something else.
Given a text variable x, like this:
> x
[1] "10/3/2001"
then:
> as.Date(x,"%m/%d/%Y")
[1] "2001-10-03"
converts it to a date object. Then, if you need it:
> julian(as.Date(x,"%m/%d/%Y"))
[1] 11598
attr(,"origin")
[1] "1970-01-01"
gives you a Julian date (relative to 1970-01-01).
Don't try the substring thing...
See help(as.Date) for more.
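If what is needed is the astronomical Julian Day Number rather than days since 1970-01-01, the two differ only by a constant offset, since 1970-01-01 has Julian Day Number 2440588. A minimal sketch, where the helper name to_jdn is made up for this example and the default format assumes m/d/Y text as in the question:

```r
# Hypothetical helper: Julian Day Number of a date given as m/d/Y text.
# 2440588 is the Julian Day Number of 1970-01-01, R's Date origin.
to_jdn <- function(x, format = "%m/%d/%Y") {
  d <- as.Date(x, format = format)
  as.numeric(d) + 2440588
}

to_jdn("10/3/2001")
```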
Quick ones:
Julian date converters already exist in base R, see eg help(julian).
One approach may be to parse the date as a POSIXlt and to then read off the components. Other date / time classes and packages will work too but there is something to be said for base R.
Parsing dates as string is almost always a bad approach.
Here is an example:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
dates <- as.Date(datetxt) ## you could examine these as well
plt <- as.POSIXlt(dates) ## now as POSIXlt types
plt[["year"]] + 1900 ## years are with offset 1900
#[1] 2010 2010 2010
plt[["mon"]] + 1 ## and months are on the 0 .. 11 interval
#[1] 1 2 9
plt[["mday"]]
#[1] 2 3 10
df <- data.frame(year=plt[["year"]] + 1900,
month=plt[["mon"]] + 1, day=plt[["mday"]])
df
# year month day
#1 2010 1 2
#2 2010 2 3
#3 2010 9 10
And of course
julian(dates)
#[1] 14611 14643 14862
#attr(,"origin")
#[1] "1970-01-01"
To convert a date (m/d/y format) into 3 separate columns, consider the following df:
df <- data.frame(date = c("01-02-18", "02-20-18", "03-23-18"))
df
date
1 01-02-18
2 02-20-18
3 03-23-18
Convert to date format
df$date <- as.Date(df$date, format="%m-%d-%y")
df
date
1 2018-01-02
2 2018-02-20
3 2018-03-23
To get three separate columns with year, month and day:
library(lubridate)
df$year <- year(ymd(df$date))
df$month <- month(ymd(df$date))
df$day <- day(ymd(df$date))
df
date year month day
1 2018-01-02 2018 1 2
2 2018-02-20 2018 2 20
3 2018-03-23 2018 3 23
Hope this helps.
Hi Gavin: another way [using your idea] is:
The data-frame we will use is oilstocks which contains a variety of variables related to the changes over time of the oil and gas stocks.
The variables are:
colnames(stocks)
"bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC"
"emMN" "emMN.1" "chdate" "chV" "cbO" "chC" "chMN" "chMX"
One of the first things to do is change the emdate field, which is an integer vector, into a date vector.
realdate<-as.Date(emdate,format="%m/%d/%Y")
Next we want to split the emdate column into three separate columns representing month, day and year, using the idea you supplied.
dfdate <- data.frame(date = realdate)
year <- as.numeric(format(realdate, "%Y"))
month <- as.numeric(format(realdate, "%m"))
day <- as.numeric(format(realdate, "%d"))
ls() will include the individual vectors, day, month, year and dfdate.
Now merge the dfdate, day, month, year into the original data-frame [stocks].
ostocks<-cbind(dfdate,day,month,year,stocks)
colnames(ostocks)
"date" "day" "month" "year" "bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC" "emMN" "emMX" "chdate" "chV"
"cbO" "chC" "chMN" "chMX"
Similar results and I also have date, day, month, year as separate vectors outside of the df.