This question already has answers here:
Summarizing by group of two variables
(2 answers)
Replacing Missing Value in R
(4 answers)
Closed 4 years ago.
This is my first Stack Overflow post. I researched extensively but have not found a similar post.
I am trying to impute the median for NA values based on two conditions.
Here is my code:
#Create sample of original data for reproducibility
Date<-c("2009-05-01","2009-05-02","2009-05-03","2009-06-01","2009-06-02",
"2009-06-03", "2010-05-01","2010-05-02","2010-05-03","2010-06-01",
"2010-06-02","2010-06-03","2011-05-01","2011-05-02","2011-05-03",
"2011-06-01","2011-06-02","2011-06-03")
Month<- c("May","May","May","June","June","June",
"May","May","May","June","June","June",
"May","May","May","June","June","June")
DayType<- c("Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
"Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
"Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday")
Qty<- c(NA,NA,NA,NA,NA,NA,
1,2,1,10,15,13,
3,2,5,20,14,16)
#Combine into dataframe
Example<-data.frame(Date,Month,DayType,Qty)
#Test output
Example
# Make a separate dataframe to calculate the median value based on day of the month
test1 <- ddply(Example,. (DayType,Month),summarize,median=median(Qty,na.rm=TRUE))
This works as expected. Test1 output looks like this:
DayType Month Median
Monday June 15.0
Monday May 2.0
Tuesday June 14.5
Tuesday May 2.0
Wednesday June 14.5
Wednesday May 3.0
My second step replaces "NA" values in the original dataset with the medians calculated in test1. This is where my issue comes in.
Example$Qty[is.na(Example$Qty)] <- test1$median[match(Example$DayType,test1$DayType,Example$Month,test1$Month)][is.na(Example$Qty)]
Example
Match[] only matches on the median value for each day, rather than the median value for each day by month. The output is the same seven repeating values for the entire set. I have not figured out how to match on both columns simultaneously.
Output:
Date DayType Month GSEvtQty
2009-05-01 Monday May 15.0 *should be 2.0, matching to June
2009-05-02 Tuesday May 14.5 *should be 2.0, matching to June
2009-05-03 Wednesday May 14.5 *should be 3.0, matching to June
2009-06-01 Monday June 15.0 *imputes correctly
2009-06-02 Tuesday June 14.5 *imputes correctly
2009-06-03 Wednesday June 14.5 *imputes correctly
2010-05-01 Monday May 1.0
2010-05-02 Tuesday May 2.0
2010-05-03 Wednesday May 1.0
2010-06-01 Monday June 10.0
2010-06-02 Tuesday June 15.0
2010-06-03 Wednesday June 13.0
I have also tried using %in%:
Example$Qty[is.na(Example$Qty)] <- test1$median[Example$DayType %in% test1$DayType & Example$Month %in% test1$Month][is.na(Example$Qty)]
But that does not match correctly and only outputs a limited number of values rather than over the entire series of NAs.
Using na.aggregate via the Zoo package as cleverly suggested by #Jaap:
setDT(Example)[, Value := na.aggregate("Qty", FUN = median), by = c("DayType","Month")]
For some reason does not transform the NAs:
Output:
Date Month DayType Qty
2009-05-01 May Monday NA
2009-05-02 May Tuesday NA
2009-05-03 May Wednesday NA
2009-06-01 June Monday NA
Any suggestions would be greatly appreciated! Thanks for sticking with this post for so long and look forward to paying the assistance forward in the future.
This is what merge was created for.
info$GSEvtQty[is.na(info$GSEvtQty)]<- merge(info[is.na(info$GSEvtQty,)], test1, by=c("DayType", "Month"))[,"GSEvtQty"]
Related
Hello I am trying to find the week number for a series of date over three years. However R is not giving the correct week number. I am generating a seq of dates from 2016-04-01 to 2019-03-30 and then I am trying to calculate week over three years such that I get the week number 54, 55 , 56 and so on.
However when I check the week 2016-04-03 R shows the week number as 14 where as when cross checked with excel it is the week number 15 and also it simply calculates 7 days and does not reference the actual calendar days. Also the week number starts from 1 for every start of year
The code looks like this
days <- seq(as.Date("2016-04-03"),as.Date("2019-03-30"),'days')
weekdays <- data.frame('days'=days, Month = month(days), week = week(days),nweek = rep(1,length(days)))
This is how the results looks like
days week
2016-04-01 14
2016-04-02 14
2016-04-03 14
2016-04-04 14
2016-04-05 14
2016-04-06 14
2016-04-07 14
2016-04-08 15
2016-04-09 15
2016-04-10 15
2016-04-11 15
2016-04-12 15
However when checked from excel this is what I get
days week
2016-04-01 14
2016-04-02 14
2016-04-03 15
2016-04-04 15
2016-04-05 15
2016-04-06 15
2016-04-07 15
2016-04-08 15
2016-04-09 15
2016-04-10 16
2016-04-11 16
2016-04-12 16
Can someone please help me identify wherever I am going wrong.
Thanks a lot in advance!!
Not anything that you're doing wrong per se, there is just a difference in how R (I presume you're using the lubridate package) and Excel calculate week numbers.
R will calculate week numbers based on the seven day block from 1 January that year; but
Excel calculates week numbers based on a week starting from Sunday.
Taking the first few days of January 2016 for an example. On, Friday, 1 January 2016, both R and Excel will say this is week 1.
On Sunday, 3 January 2016:
this is within the first seven days of the start of the year so R will return week number 1; but
it is a Sunday, so Excel ticks over to week number 2.
Try this:
ifelse(test = weekdays.Date(days[1]) == "Sunday", yes = epiweek(days[1]), no = epiweek(days[1]) + 1) + cumsum(weekdays.Date(days) == "Sunday")
This tests whether the first day is a Sunday or not and returns an appropriate week number starting point, then adds on one more week number each Sunday. Gives the same week number if there's overlap between years.
I have two daily time series ranging from 1st of Jan 2016 to 1st of Aug 2016, however one my my series only includes data from business days (i.e weekends and bank holidays omitted), the other has data for everyday. My question is, how do I merge the two series so that for both time series I have only the business day data left over (deleting those extra days from the second time series)
The question was tagged also with data.table so I guess that the two time series are stored as data.frames or data.tables.
By default, joins in data.table are right joins. So, if you know in advance which one the "shorter" time series is you can write:
library(data.table)
dt_long[dt_short, on = "date"]
# date weekday i.weekday
#1: 2017-03-30 4 4
#2: 2017-03-31 5 5
#3: 2017-04-03 1 1
#4: 2017-04-04 2 2
#5: 2017-04-05 3 3
#6: 2017-04-06 4 4
If you are not sure which the "shorter" time series is you can use an inner join:
dt_short[dt_long, on = "date", nomatch = 0]
nomatch = 0 specifies the inner join.
If your time series are not already data.tables as the sample data here but are stored as data.frames, you need to coerce them to data.table class beforehand by:
setDT(dt_long)
setDT(dt_short)
Data
As the OP hasn't provided any reproducible data, we need to prepare sample data on our own (similar to this answer but as data.table):
library(data.table)
dt_long <- data.table(date = as.Date("2017-03-30") + 0:7)
# add payload: integer weekday according ISO (week starts on Monday == 1L)
dt_long[, weekday := as.integer(format(date, "%u"))]
# remove weekends
dt_short <- dt_long[weekday < 6L]
We have two data.frames df_long that contains weekends and df_short that doesn't include weekends
Date <- as.Date(seq(as.Date("2003-03-03"), as.Date("2003-03-17"), by = 1), format="%Y-%m-%d")
weekday <- weekdays(as.Date(Date))
df_long <- data.frame(Date, weekday)
df_short<- df_long[ c(1:5, 8:12, 15), ]
You can join them using dplyr::inner_join to delete the weekends and holidays from df_long and keep just the business days.
library(dplyr)
df_join <- df_long %>% inner_join(., df_short, by ="Date")
> df_join
Date weekday.x weekday.y
1 2003-03-03 Monday Monday
2 2003-03-04 Tuesday Tuesday
3 2003-03-05 Wednesday Wednesday
4 2003-03-06 Thursday Thursday
5 2003-03-07 Friday Friday
6 2003-03-10 Monday Monday
7 2003-03-11 Tuesday Tuesday
8 2003-03-12 Wednesday Wednesday
9 2003-03-13 Thursday Thursday
10 2003-03-14 Friday Friday
11 2003-03-17 Monday Monday
I have a dataframe of the date and the air temperature of each hour like this:
date t.air
1 2013-05-01 11.1
2 2013-05-01 10.3
3 2013-05-01 9.5
4 2013-05-01 9.0
5 2013-05-01 8.6
6 2013-05-01 8.0
I should have 24 rows of 2013-05-01, t.air and then 24 rows of the next day and so on.
For computing the daily minimum and maximum temperature, I want to only consider days with 24 measurements.
How can I program something like
"If there are less than 24 values of "day", delete all rows with "day""?
Any help would be appreciated :)
Using data.table. If dat is the dataset
library(data.table)
setDT(dat)[,.SD[.N>=24], by=date]
You can use
subset(dat, ave(date, date, FUN = length) >= 24)
where dat is the name of your data frame.
I am finding this to be quite tricky. I have an R time series data frame, consisting of a value for each day for about 50 years of data. I would like to compute the mean of only the last 5 values for each month. This would be simple if each month ended in the same 31st day, in which case I could just subset. However, as we all know some months end in 31, some in 30, and then we have leap years. So, is there a simple way to do this in R without having to write a complex indexing function to take account of all the possibilities including leap years? Perhaps a function that works on zoo type objects? The data frame is as follows:
Date val
1 2014-01-06 1.49
2 2014-01-03 1.38
3 2014-01-02 1.34
4 2013-12-31 1.26
5 2013-12-30 2.11
6 2013-12-26 3.20
7 2013-12-25 3.00
8 2013-12-24 2.89
9 2013-12-23 2.90
10 2013-12-22 4.5
tapply Try this where dd is your data frame and we have assumed that the Date column is of class "Date". (If dd is already sorted in descending order of Date as it appears it might be in the question then we can shorten it a bit by replacing the anonymous function with function(x) mean(head(x, 5)) .)
> tapply(dd$val, format(dd$Date, "%Y-%m"), function(x) mean(tail(sort(x), 5)))
2013-12 2014-01
2.492000 1.403333
aggregate.zoo In terms of zoo we can do this which returns another zoo object and its index is of class "yearmon". (In the case of zoo it does not matter whether dd is sorted or not since zoo will sort it automatically.)
> library(zoo)
> z <- read.zoo(dd)
> aggregate(z, as.yearmon, function(x) mean(tail(x, 5)))
Dec 2013 Jan 2014
2.492000 1.403333
REVISIONS. Made some corrections.
Is there a good way to get a year + week number converted a date in R? I have tried the following:
> as.POSIXct("2008 41", format="%Y %U")
[1] "2008-02-21 EST"
> as.POSIXct("2008 42", format="%Y %U")
[1] "2008-02-21 EST"
According to ?strftime:
%Y Year with century. Note that whereas there was no zero in the
original Gregorian calendar, ISO 8601:2004 defines it to be valid
(interpreted as 1BC): see http://en.wikipedia.org/wiki/0_(year). Note
that the standard also says that years before 1582 in its calendar
should only be used with agreement of the parties involved.
%U Week of the year as decimal number (00–53) using Sunday as the
first day 1 of the week (and typically with the first Sunday of the
year as day 1 of week 1). The US convention.
This is kinda like another question you may have seen before. :)
The key issue is: what day should a week number specify? Is it the first day of the week? The last? That's ambiguous. I don't know if week one is the first day of the year or the 7th day of the year, or possibly the first Sunday or Monday of the year (which is a frequent interpretation). (And it's worse than that: these generally appear to be 0-indexed, rather than 1-indexed.) So, an enumerated day of the week needs to be specified.
For instance, try this:
as.POSIXlt("2008 42 1", format = "%Y %U %u")
The %u indicator specifies the day of the week.
Additional note: See ?strptime for the various options for format conversion. It's important to be careful about the enumeration of weeks, as these can be split across the end of the year, and day 1 is ambiguous: is it specified based on a Sunday or Monday, or from the first day of the year? This should all be specified and tested on the different systems where the R code will run. I'm not certain that Windows and POSIX systems sing the same tune on some of these conversions, hence I'd test and test again.
Day-of-week == zero in the POSIXlt DateTimesClasses system is Sunday. Not exactly Biblical and not in agreement with the R indexing that starts at "1" convention either, but that's what it is. Week zero is the first (partial) week in the year. Week one (but day of week zero) starts with the first Sunday. And all the other sequence types in POSIXlt have 0 as their starting point. It kind of interesting to see what coercing the list elements of POSIXlt objects do. The only way you can actually change a POSIXlt date is to alter the $year, the $mon or the $mday elements. The others seem to be epiphenomena.
today <- as.POSIXlt(Sys.Date())
today # Tuesday
#[1] "2012-02-21 UTC"
today$wday <- 0 # attempt to make it Sunday
today
# [1] "2012-02-21 UTC" The attempt fails
today$mday <- 19
today
#[1] "2012-02-19 UTC" Success
I did not come up with this myself (it's taken from a blog post by Forester), but nevertheless I thought I'd add this to the answer list because it's the first implementation of the ISO 8601 week number convention that I've seen in R.
No doubt, week numbers are a very ambiguous topic, but I prefer an ISO standard over the current implementation of week numbers via format(..., "%U") because it seems that this is what most people agreed on, at least in Germany (calendars etc.).
I've put the actual function def at the bottom to facilitate focusing on the output first. Also, I just stumbled across package ISOweek, maybe worth a try.
Approach Comparison
x.days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
x.names <- sapply(1:length(posix), function(x) {
x.day <- as.POSIXlt(posix[x], tz="Europe/Berlin")$wday
if (x.day == 0) {
x.day <- 7
}
out <- x.days[x.day]
})
data.frame(
posix,
name=x.names,
week.r=weeknum,
week.iso=ISOweek(as.character(posix), tzone="Europe/Berlin")$weeknum
)
# Result
posix name week.r week.iso
1 2012-01-01 Sun 1 4480458
2 2012-01-02 Mon 1 1
3 2012-01-03 Tue 1 1
4 2012-01-04 Wed 1 1
5 2012-01-05 Thu 1 1
6 2012-01-06 Fri 1 1
7 2012-01-07 Sat 1 1
8 2012-01-08 Sun 2 1
9 2012-01-09 Mon 2 2
10 2012-01-10 Tue 2 2
11 2012-01-11 Wed 2 2
12 2012-01-12 Thu 2 2
13 2012-01-13 Fri 2 2
14 2012-01-14 Sat 2 2
15 2012-01-15 Sun 3 2
16 2012-01-16 Mon 3 3
17 2012-01-17 Tue 3 3
18 2012-01-18 Wed 3 3
19 2012-01-19 Thu 3 3
20 2012-01-20 Fri 3 3
21 2012-01-21 Sat 3 3
22 2012-01-22 Sun 4 3
23 2012-01-23 Mon 4 4
24 2012-01-24 Tue 4 4
25 2012-01-25 Wed 4 4
26 2012-01-26 Thu 4 4
27 2012-01-27 Fri 4 4
28 2012-01-28 Sat 4 4
29 2012-01-29 Sun 5 4
30 2012-01-30 Mon 5 5
31 2012-01-31 Tue 5 5
Function Def
It's taken directly from the blog post, I've just changed a couple of minor things. The function is still kind of sketchy (e.g. the week number of the first date is far off), but I find it to be a nice start!
ISOweek <- function(
date,
format="%Y-%m-%d",
tzone="UTC",
return.val="weekofyear"
){
##converts dates into "dayofyear" or "weekofyear", the latter providing the ISO-8601 week
##date should be a vector of class Date or a vector of formatted character strings
##format refers to the date form used if a vector of
## character strings is supplied
##convert date to POSIXt format
if(class(date)[1]%in%c("Date","character")){
date=as.POSIXlt(date,format=format, tz=tzone)
}
# if(class(date)[1]!="POSIXt"){
if (!inherits(date, "POSIXt")) {
print("Date is of wrong format.")
break
}else if(class(date)[2]=="POSIXct"){
date=as.POSIXlt(date, tz=tzone)
}
print(date)
if(return.val=="dayofyear"){
##add 1 because POSIXt is base zero
return(date$yday+1)
}else if(return.val=="weekofyear"){
##Based on the ISO8601 weekdate system,
## Monday is the first day of the week
## W01 is the week with 4 Jan in it.
year=1900+date$year
jan4=strptime(paste(year,1,4,sep="-"),format="%Y-%m-%d")
wday=jan4$wday
wday[wday==0]=7 ##convert to base 1, where Monday == 1, Sunday==7
##calculate the date of the first week of the year
weekstart=jan4-(wday-1)*86400
weeknum=ceiling(as.numeric((difftime(date,weekstart,units="days")+0.1)/7))
#########################################################################
##calculate week for days of the year occuring in the next year's week 1.
#########################################################################
mday=date$mday
wday=date$wday
wday[wday==0]=7
year=ifelse(weeknum==53 & mday-wday>=28,year+1,year)
weeknum=ifelse(weeknum==53 & mday-wday>=28,1,weeknum)
################################################################
##calculate week for days of the year occuring prior to week 1.
################################################################
##first calculate the numbe of weeks in the previous year
year.shift=year-1
jan4.shift=strptime(paste(year.shift,1,4,sep="-"),format="%Y-%m-%d")
wday=jan4.shift$wday
wday[wday==0]=7 ##convert to base 1, where Monday == 1, Sunday==7
weekstart=jan4.shift-(wday-1)*86400
weeknum.shift=ceiling(as.numeric((difftime(date,weekstart)+0.1)/7))
##update year and week
year=ifelse(weeknum==0,year.shift,year)
weeknum=ifelse(weeknum==0,weeknum.shift,weeknum)
return(list("year"=year,"weeknum"=weeknum))
}else{
print("Unknown return.val")
break
}
}