Reassign the starting Julian date (July 1st = Julian date 1, southern hemisphere) - r

My data set includes observations taken at different stages throughout the year. The columns are:
year: year when samples were collected
site: location of measurement
Class: physical stage at the time of measurement
date: date of measurement
Julian: Julian date
The final measurements usually occur in the early part of the new year, which is summer in the southern hemisphere (the seasons are reversed relative to the northern hemisphere, so northern winter is southern summer and northern fall is southern spring).
year site Class date Julian
1 2009 10C Early 2008-09-15 259
2 2009 10C L2 2008-09-29 273
3 2009 10C L3 2008-12-15 350
4 2010 10C Early 2009-08-31 243
5 2010 10C L2 2009-09-14 257
6 2010 10C L3 2009-12-11 345
7 2012 10C Early 2011-08-23 235
8 2012 10C L2 2011-09-22 265
9 2012 10C L3 2011-12-03 337
10 2012 10C LSample 2012-03-26 86
11 2013 10C Early 2012-09-07 251
12 2013 10C L2 2012-09-30 274
13 2013 10C L3 2012-12-17 352
14 2014 10C Early 2013-09-02 245
15 2014 10C L2 2013-09-16 259
16 2014 10C L3 2013-12-16 350
17 2014 10C LMid 2014-01-07 7
18 2015 10C Early 2014-09-08 251
19 2015 10C L2 2014-09-30 273
20 2015 10C L3 2014-12-01 335
I am having a difficult time converting/reassigning the Julian start date to July 1st instead of January 1st. The dot plot below illustrates the final sampling that occurs at the beginning of the year (February-March).
The chron package has an option to reorder the origin but I cannot get it to work properly with my data.
library(chron)
library(dplyr)
data.date <- data %>%
mutate(July.Julian = chron(date,format = c(dates = "ymd"), options(chron.origin = c(month=7, day=1, year=2008))))
Error in chron(c("2008-09-15", "2008-09-29", "2008-12-15", "2009-08-31", :
misspecified chron format(s) length
or
July.Julian = chron(data$date, format = c(dates = "ymd"), options(chron.origin = c(month=7, day=1, year=2008)))
Error in chron(c("2008-09-15", "2008-09-29", "2008-12-15", "2009-08-31", :
misspecified chron format(s) length
I am trying to start the Julian date as 1 instead of 182.
Thoughts or suggestions are welcome.

Assuming that July.Julian is supposed to be Julian days past July 1st:
transform(date.data, July.Julian = as.chron(sprintf("%d-07-01", year)) + Julian)
or
date.data %>% mutate(July.Julian = as.chron(sprintf("%d-07-01", year)) + Julian)
Note that one does not actually need chron here. Just replace as.chron with as.Date and either of these works.
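If the goal is instead a day counter that restarts at 1 on every July 1st, it can be derived from the date column alone. The following is a minimal sketch in base R, not the answerer's method; the helper name july_julian is hypothetical, and it assumes that dates before July belong to the season that began on the previous July 1st.

```r
# Hypothetical helper: day-of-season counter with July 1st = day 1.
july_julian <- function(d) {
  d <- as.Date(d)
  yr <- as.integer(format(d, "%Y"))
  mon <- as.integer(format(d, "%m"))
  # Assumption: January-June dates belong to the season that started
  # on July 1st of the previous calendar year.
  start_year <- ifelse(mon >= 7L, yr, yr - 1L)
  season_start <- as.Date(sprintf("%d-07-01", start_year))
  as.integer(d - season_start) + 1L
}

july_julian(c("2008-07-01", "2008-09-15", "2014-01-07"))
# 1 77 191
```

With dplyr this could be used as mutate(July.Julian = july_julian(date)), so that late-season samples taken in January to March sort after the December samples.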

Related

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame of hourly observational climate data spanning multiple years. The dummy data frame below should illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2012-12-31"),
by=(60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I need to group by year (or, in this example, by month) to find out whether df$WS has 75% or more valid data for that month. My filtering criterion is NA, because 0 is still a valid observation. The real data contain genuine NAs because they are observational climate data.
I have tried dplyr piping with %>% to filter by a new column "Month", and I have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put this into a longer script that loops through all my stations, and all the years for each station, and produces a wind rose whenever this criterion is met for that year and station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one is quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
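Here is a method using dplyr. It still works when some hourly rows are missing altogether, because it compares the count of valid observations with the number of hours the month should contain (24 times the days in the month).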
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
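The answers above report a percentage per month; the 75% test itself is then a single comparison. Here is a minimal base R sketch on made-up toy data (the toy data frame and the keep_months name are assumptions for illustration). Note that it uses the number of rows present as the denominator, as in the split/sapply answer, not the number of hours the month should contain.

```r
# Toy data: 10 hourly readings in each of two months (hypothetical values)
toy <- data.frame(
  dateTime = as.POSIXct(c(sprintf("2012-01-01 %02d:00", 0:9),
                          sprintf("2012-02-01 %02d:00", 0:9)), tz = "UTC"),
  WS = c(1:8, NA, NA,        # January: 8 of 10 readings valid (80%)
         1:5, rep(NA, 5))    # February: 5 of 10 readings valid (50%)
)
toy$month <- format(toy$dateTime, "%Y-%m")

# Share of non-NA wind speed readings per month
valid_share <- tapply(!is.na(toy$WS), toy$month, mean)

# Months that meet the 75% criterion
keep_months <- names(valid_share)[valid_share >= 0.75]
keep_months
# "2012-01"
```

Inside a loop over stations and years, a wind rose would only be drawn for the groups listed in keep_months.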

ggplot: Best way to plot this using R with facets

Quarter Team Year Units Sales
2015Q3 A 2015 25000 61.1038751
2015Q3 B 2015 1370 4.5081774
2015Q3 C 2015 19103 34.9492249
2015Q3 D 2015 10757 0.5222169
2015Q3 E 2015 2658 6.0959838
2015Q2 A 2015 2500 10.38751
2015Q2 B 2015 370 3.508
2015Q2 C 2015 1103 2.94
2015Q2 D 2015 757 4.5222169
2015Q2 E 2015 658 5.09
...
2005Q3 A 2015 25000 31.1038751
2005Q3 B 2015 1370 4.5081774
2005Q3 C 2015 12345 4.9492249
2005Q3 D 2015 102 3.5222169
2005Q3 E 2015 2658 5.0959838
I have the above data set, which contains quarterly sales for every team. I am trying to figure out the best way to plot it so that each team's sales and units are shown clearly per year and per quarter. I have a large number of teams, for example 50.
I was trying to do something like this:
library(ggplot2)
ggplot(df) +
geom_line(aes(x=Quarter,y=Sales,group=Year))+
facet_grid(.~Year,scales="free")
But this doesn't give a good idea of what's going on. I was going over http://docs.ggplot2.org/0.9.3.1/facet_wrap.html but was unable to grasp how I should plot it.

R - split data to hydrological quarters

I wish to split my data sets into quarters according to the definition of the hydrological year. According to Wikipedia, "Due to meteorological and geographical factors, the definition of the water years varies". In the USA, the hydrological year runs from October 1st of one year to September 30th of the next.
I use the definition of the hydrological year for Poland (it starts on November 1st and ends on October 31st).
The sample data set looks as follows:
sampleData <- structure(list(date = structure(c(15946, 15947, 15875, 15910, 15869, 15888, 15823, 16059, 16068, 16067), class = "Date"),`example value` = c(-0.325806595888448, 0.116001346459147, 1.68884381116696, -0.480527505762716, -0.50307381813168,-1.12032214801472, -0.659699514672226, -0.547101497279717, 0.729148872679021,-0.769760735764215)), .Names = c("date", "example value"), row.names = c(NA, -10L), class = "data.frame")
For some reason, the cut function in my code complains that "breaks" and "labels" differ in length (but they don't). If I omit the "labels" argument in cut (as below), the function works perfectly.
What is wrong with labels?
ToHydroQuarters <-function(df)
{
result <- df
yearStart <- as.numeric(format(min(df$date),'%Y'))-1
#Hydrological year in Poland starts at November 1st
DateStart <- as.Date(paste(yearStart,"-11-01",sep=""))
breaks <- seq(from=DateStart, to=max(df$date)+90, by="quarter")
breakYear <- format(breaks,'%Y')
#Please, do not create labels in such way.
#Please note that for November and December we have next hydrological year - since it started at 1st November. So, we need to check month to decide which year we have (?) or use cut function again as mentioned here: http://stackoverflow.com/questions/22073881/hydrological-year-time-series
labels <- c(paste("Winter",breakYear[1]),
paste("Spring",breakYear[2]),
paste("Summer",breakYear[3]),
paste("Autumn",breakYear[4]),
paste("Autumn",breakYear[5]))
######Here is problem - once I add labels parameter, function complains about different lengths
result$hydroYear <- cut(df$date, breaks)
result
}
First, I think it is unwise to hardcode labels inside a function, since they cannot be checked without a reproducible example. However, I can see what you're trying to achieve.
You say that your breaks and labels have matching lengths, but the function itself doesn't always work, even without the labels: cut does not process the last portion of the dates.
For example:
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
a <- ToHydroQuarters(df)
tail(a)
returns:
date hydroYear
971 2011-08-29 <NA>
972 2011-08-30 <NA>
973 2011-08-31 <NA>
974 2011-09-01 <NA>
975 2011-09-02 <NA>
976 2011-09-03 <NA>
Doing something like breaks <- seq(from=DateStart, to=max(df$date)+90, by="quarter"), does resolve that issue, as it forces a break to actually exist. This might solve your labelling issue that you've had in your function, but it does not make the function "generic".
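The length rule behind the error can be checked on a small example: cut() needs exactly one label per interval, that is, one fewer label than there are break points. A minimal sketch with made-up dates (not the question's data):

```r
# Three made-up dates and four quarterly break points starting on November 1st
dates  <- as.Date(c("2013-11-15", "2014-02-10", "2014-05-20"))
breaks <- seq(as.Date("2013-11-01"), by = "quarter", length.out = 4)

# Four break points define three intervals, so three labels are needed
labels <- c("Nov_Jan", "Feb_Apr", "May_Jul")
res <- cut(dates, breaks = breaks, labels = labels)
as.character(res)
# "Nov_Jan" "Feb_Apr" "May_Jul"

# One label too many makes cut() stop with an error; capture the message
too_many <- tryCatch(
  cut(dates, breaks = breaks, labels = c(labels, "Aug_Oct")),
  error = function(e) conditionMessage(e)
)
```

So if the number of breaks changes with the date range of the input, a fixed-length label vector will only match for some inputs.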
Personally on the coding side I think it would be better to convert the month, and year parts separately, because it would be easier to understand. For example, you could use library(lubridate) to easily extract the month and specify the breaks and the labels as you normally would. I was thinking the function could look something like this:
thq <- function(date) {
mnth <- cut(month(date), breaks=c(1,4,7, 10, 12),
right=FALSE, include.lowest=TRUE,
labels=c("Spring", "Summer", "Autumn", "Winter"))
return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
So then using some dummy data ...
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="month")))
thq <- function(date) {
mnth <- cut(month(date), breaks=c(1,4,7, 10, 12),
right=FALSE, include.lowest=TRUE,
labels=c("Spring", "Summer", "Autumn", "Winter"))
return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
df$newdate <- thq(df$date)
Which has the following output:
date newdate
1 2009-01-01 Spring 2009
2 2009-02-01 Spring 2009
3 2009-03-01 Spring 2009
4 2009-04-01 Summer 2009
5 2009-05-01 Summer 2009
6 2009-06-01 Summer 2009
7 2009-07-01 Autumn 2009
8 2009-08-01 Autumn 2009
9 2009-09-01 Autumn 2009
10 2009-10-01 Winter 2010
11 2009-11-01 Winter 2010
12 2009-12-01 Winter 2010
13 2010-01-01 Spring 2010
14 2010-02-01 Spring 2010
15 2010-03-01 Spring 2010
16 2010-04-01 Summer 2010
17 2010-05-01 Summer 2010
18 2010-06-01 Summer 2010
19 2010-07-01 Autumn 2010
20 2010-08-01 Autumn 2010
21 2010-09-01 Autumn 2010
22 2010-10-01 Winter 2011
23 2010-11-01 Winter 2011
24 2010-12-01 Winter 2011
25 2011-01-01 Spring 2011
26 2011-02-01 Spring 2011
27 2011-03-01 Spring 2011
28 2011-04-01 Summer 2011
29 2011-05-01 Summer 2011
30 2011-06-01 Summer 2011
31 2011-07-01 Autumn 2011
32 2011-08-01 Autumn 2011
33 2011-09-01 Autumn 2011
You can shift the months using the modulo operator if it is in a weird order...
thq <- function(date) {
mnth <- cut(((month(date)+1) %% 12), breaks=c(0, 3, 6, 9, 12),
right=FALSE, include.lowest=TRUE,
labels=c("Nov_Jan", "Feb_Apr", "May_Jul", "Aug_Oct")
)
# you will need to alter the return statement yourself, because
# I feel there is enough information for you to do it, rather than
# me changing it every time you change the question.
return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
df$new <- thq(df$date)
head(df)
output:
> head(df)
date new
1 2009-01-01 Nov_Jan 2009
2 2009-01-02 Nov_Jan 2009
3 2009-01-03 Nov_Jan 2009
4 2009-01-04 Nov_Jan 2009
5 2009-01-05 Nov_Jan 2009
6 2009-01-06 Nov_Jan 2009
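One way the return statement could be completed for a hydrological year that starts on November 1st is sketched below in base R. The function name hydro_quarter is hypothetical, and the sketch assumes the hydrological year is named after the calendar year in which it ends, so November and December are assigned to the following year.

```r
# Hypothetical helper: quarter label plus hydrological year (November 1st start)
hydro_quarter <- function(date) {
  date <- as.Date(date)
  m <- as.integer(format(date, "%m"))
  y <- as.integer(format(date, "%Y"))
  # Shift months so that November maps to 0, December to 1, ..., October to 11
  shifted <- (m + 1L) %% 12L
  quarter <- c("Nov_Jan", "Feb_Apr", "May_Jul", "Aug_Oct")[shifted %/% 3L + 1L]
  # Assumption: November and December count towards the next hydrological year
  hydro_year <- ifelse(m >= 11L, y + 1L, y)
  paste(quarter, hydro_year)
}

hydro_quarter(c("2013-10-31", "2013-11-01", "2014-01-15", "2014-02-01"))
# "Aug_Oct 2013" "Nov_Jan 2014" "Nov_Jan 2014" "Feb_Apr 2014"
```

Integer division of the shifted month by 3 replaces the cut() call, which avoids having to keep breaks and labels in sync.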

Changing X-axis values in Time Series plot with R

I'm a newer R user and I need help with a time series plot. I created a time series plot, and cannot figure out how to change my x-axis values to correspond to my sample dates. My data is as follows:
Year Month Level
2009 8 350
2009 9 210
2009 10 173
2009 11 166
2009 12 153
2010 1 141
2010 2 129
2010 3 124
2010 4 103
2010 5 69
2010 6 51
2010 7 49
2010 8 51
2010 9 51
Let's say this data is saved as the name "data.csv"
data = read.table("data.csv", sep = ",", header = T)
data.ts = ts(data, frequency = 1)
plot(data.ts[, 3], ylab = "level", main = "main", axes = T)
I've also tried passing start = c(2009, 8) to the ts function, but I still get the wrong values.
When I plot this, my x-axis does not correspond to August 2009 through September 2010. It either increases by year or just by decimal. I've looked at many examples online and at the ? help in R, but I cannot find a way to relabel my axis values. Any help would be appreciated.
Using base R, you can accomplish this in a few steps. As described in this SO answer, you can turn your "Month" and "Year" data into a date by combining the as.Date and paste functions and adding a day (i.e., the first day of the month, "1"). For the purposes of this answer, I will simply refer to the data you provided as df:
df$date<-with(df,as.Date(paste(Year,Month,'1',sep='-'),format='%Y-%m-%d'))
df
Year Month Level date
1 2009 8 350 2009-08-01
2 2009 9 210 2009-09-01
3 2009 10 173 2009-10-01
4 2009 11 166 2009-11-01
5 2009 12 153 2009-12-01
6 2010 1 141 2010-01-01
7 2010 2 129 2010-02-01
8 2010 3 124 2010-03-01
9 2010 4 103 2010-04-01
10 2010 5 69 2010-05-01
11 2010 6 51 2010-06-01
12 2010 7 49 2010-07-01
13 2010 8 51 2010-08-01
14 2010 9 51 2010-09-01
Then you can use your basic plot, axis, and mtext functions to control how you want to visualize the data and your axes. For instance:
xmin<-min(df$date,na.rm=T);xmax<-max(df$date,na.rm=T) #ESTABLISH X-VALUES (MIN & MAX)
ymin<-min(df$Level,na.rm=T);ymax<-max(df$Level,na.rm=T) #ESTABLISH Y-VALUES (MIN & MAX)
xseq<-seq.Date(xmin,xmax,by='1 month') #CREATE DATE SEQUENCE THAT INCREASES BY MONTH FROM DATE MINIMUM TO MAXIMUM
yseq<-round(seq(0,ymax,by=50),0) # CREATE SEQUENCE FROM 0-350 BY 50
par(mar=c(1,1,0,0),oma=c(6,5,3,2)) #CONTROLS YOUR IMAGE MARGINS
plot(Level~date,data=df,type='b',ylim=c(0,ymax),axes=F,xlab='',ylab='');box() #PLOT LEVEL AS A FUNCTION OF DATE, REMOVE AXES FOR FUTURE CUSTOMIZATION
axis.Date(side=1,at=xseq,format='%Y-%m',labels=T,las=3) #ADD X-AXIS LABELS WITH "YEAR-MONTH" FORMAT
axis(side=2,at=yseq,las=2) #ADD Y-AXIS LABELS
mtext('Date (Year-Month)',side=1,line=5) #X-AXIS LABEL
mtext('Level',side=2,line=4) #Y-AXIS LABEL
Alternatively, using ggplot2 with data.table:
library(data.table)
library(ggplot2)
library(scales)
data<-data.table(datetime=seq(as.POSIXct("2009/08/01",format="%Y/%m/%d"),
as.POSIXct("2010/09/01",format="%Y/%m/%d"),by="1 month"),
Level=c(350,210,173,166,153,141,129,124,103,69,51,49,51,51))
ggplot(data)+
geom_point(aes(x=datetime,y=Level),col="brown1",size=1)+
scale_x_datetime(labels = date_format("%Y/%m"),breaks = "1 month")+
theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.3))
Example using xts package:
library(xts)
ts1 <- xts(data$Level, as.POSIXct(sprintf("%d-%d-01", data$Year, data$Month)))
# or ts1 <- xts(data$Level, as.yearmon(data$Year + (data$Month-1)/12))
plot(ts1)
If you are using ggplot2:
library(ggplot2)
autoplot(ts1)
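Staying with ts() is also possible. The sketch below assumes the Level column is a monthly series starting in August 2009, so it passes frequency = 12 together with start = c(2009, 8) (the question used frequency = 1), and then draws its own axis labels. The names level.ts and month_labels are made up for this example.

```r
# Level values from the question, treated as a monthly series from Aug 2009
level <- c(350, 210, 173, 166, 153, 141, 129, 124, 103, 69, 51, 49, 51, 51)
level.ts <- ts(level, start = c(2009, 8), frequency = 12)

# One "YYYY-MM" label per observation
month_labels <- format(
  seq(as.Date("2009-08-01"), by = "month", length.out = length(level)),
  "%Y-%m"
)

# Suppress the default x-axis, then add one tick per month
plot(level.ts, ylab = "Level", xlab = "", xaxt = "n", type = "b")
axis(1, at = as.numeric(time(level.ts)), labels = month_labels, las = 2)
```

With frequency = 12 the time index advances by 1/12 per observation, which is why the series ends at September 2010 as expected.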

Calculating and plotting forecasts using 2 dataframes and storing means in a dataframe

I want to forecast v1 to v6 with different exogenous variables, exo1 to exo3
> ts.cnt
v1 v2 v3 v4 v5 v6
Jan 2012 892 1091 175 615 9 139
Feb 2012 749 1057 148 701 20 123
Mar 2012 1021 1312 245 824 23 151
Apr 2012 878 1178 243 811 16 131
May 2012 894 1249 223 799 5 132
Jun 2012 925 1141 242 782 15 117
and its exogenous variables:
> head(ev)
exo1 exo2 exo3
75155 77628 76113
82914 76113 77653
82719 77653 81304
79342 81304 81745
76493 81745 79163
76911 79163 77188
The following for-loop runs a model for every combination of v1-v6 and exo1-exo3:
library(forecast)
for (i in 1:6) {
for (j in 1:ncol(ev)) {
fit <- auto.arima(x = ts.cnt[,i],
xreg = ev[,j])
fcast <- forecast(fit, h = 12, xreg = ev[1:12,j])
# plot() dispatches to the forecast plot method
plot(fcast,
main = paste(colnames(ts.cnt)[i], 'with', colnames(ev)[j]))
lines(fitted(fcast), col = 2)
}
}
I also want to save forecast values, fcast$mean, into an object within the loop so that the final result is going to look like this:
exo v1 v2 v3 v4 v5 v6
Jan 2015 exo1 1091 175 615 9 139 2924
Feb 2015 exo1 1057 148 701 20 123 2798
Mar 2015 exo1 1312 245 824 23 151 3576
....
Jan 2015 exo2 1178 243 811 16 131 3257
Feb 2015 exo2 1249 223 799 5 132 3304
Mar 2015 exo2 1141 242 782 15 117 3224
....
Jan 2015 exo3 1234 243 811 16 131 3257
Feb 2015 exo3 1249 223 799 5 132 3304
Mar 2015 exo3 1111 242 782 15 117 3224
Thanks in advance.
I used more or less the same looping logic, with these changes:
- Instead of looping over column indices, I looped over column names, which simplifies a number of things.
- I used ev in the level-1 loop and ts.cnt in the level-2 loop to make writing to output.df easier.
- I changed h=12 to h=6 in the forecast call and replaced 1:12 with 1:6, assuming that 12 and 1:12 were oversights given the dimensions of the provided sample data.
Since I'm not familiar with this field, please make sure that the generated output makes sense.
output.df <- data.frame(per=rep(rownames(ts.cnt), 3), exo=rep(1:3, each=6),
v1=NA, v2=NA, v3=NA, v4=NA, v5=NA, v6=NA)
rows.offset <- 0 # row offset used when writing to output.df
for(ev.col in colnames(ev)) {
for (ts.cnt.col in colnames(ts.cnt)) {
fit <- auto.arima(x = ts.cnt[,ts.cnt.col], xreg = ev[,ev.col])
fcast <- forecast(fit, h=6, xreg=ev[1:6, ev.col])
plot(fcast, main = paste(ts.cnt.col, "with", ev.col)) # plot() dispatches to the forecast plot method
lines(fitted(fcast), col = 2)
for(i in seq_along(fcast$mean)) {
output.df[rows.offset + i, ts.cnt.col] <- fcast$mean[i]
}
}
rows.offset <- rows.offset + 6
}
Results
per exo v1 v2 v3 v4 v5 v6
1 Jan 2012 1 848.9718 1114.304 202.0891 766.3091 14.14224 125.7534
2 Feb 2012 1 936.6196 1229.345 222.9528 835.6403 15.60229 138.7362
3 Mar 2012 1 934.4169 1226.453 222.4284 833.8978 15.56559 138.4099
4 Apr 2012 1 896.2693 1176.384 213.3478 803.7224 14.93013 132.7593
5 May 2012 1 864.0862 1134.142 205.6869 778.2649 14.39402 127.9922
6 Jun 2012 1 868.8081 1140.340 206.8109 782.0000 14.47268 128.6916
7 Jan 2012 2 878.2537 1152.167 209.6261 743.3959 14.32666 129.8641
8 Feb 2012 2 861.1136 1129.681 205.5350 728.8877 14.04706 127.3296
9 Mar 2012 2 878.5365 1152.538 209.6936 743.6353 14.33127 129.9059
10 Apr 2012 2 919.8426 1206.726 219.5528 778.5987 15.00509 136.0137
11 May 2012 2 924.8319 1213.272 220.7436 782.8219 15.08647 136.7514
12 Jun 2012 2 895.6201 1174.949 213.7712 758.0957 14.60995 132.4320
13 Jan 2012 3 862.2396 1131.347 205.7280 755.9879 14.20652 127.5946
14 Feb 2012 3 879.6854 1154.237 209.8905 793.2517 14.49396 130.1762
15 Mar 2012 3 921.0454 1208.506 219.7589 881.5961 15.17542 136.2967
16 Apr 2012 3 926.0412 1215.061 220.9509 892.2671 15.25773 137.0359
17 May 2012 3 896.7913 1176.682 213.9719 829.7897 14.77580 132.7075
18 Jun 2012 3 874.4177 1147.325 208.6336 782.0000 14.40717 129.3967
Sample Graph
Final note
If need be, you could save the fits and the forecasts in lists, for instance with the following code:
fits <- list()
fcasts <- list()
for (ts.cnt.col in colnames(ts.cnt)) {
for(ev.col in colnames(ev)) {
fits[[paste(ts.cnt.col, ev.col, sep=".")]] <-
auto.arima(x = ts.cnt[,ts.cnt.col], xreg = ev[,ev.col])
fcasts[[paste(ts.cnt.col, ev.col, sep=".")]] <-
forecast(fits[[paste(ts.cnt.col, ev.col, sep=".")]], h=6, xreg=ev[1:6, ev.col])
}
}
That way you can use a function such as lapply to repeat the same operation(s) on all items in the list.
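The same bookkeeping can also be done without a running row offset, by building one block of rows per exogenous variable and binding the blocks at the end. The sketch below uses made-up toy numbers and, so that it runs without the forecast package, lm() as a plain stand-in for auto.arima(); the names y, x, x_future and blocks are assumptions for illustration only.

```r
# Toy inputs (made-up numbers, not the question's data)
y <- data.frame(v1 = c(10, 12, 14, 16, 18, 20),
                v2 = c(5, 7, 6, 8, 7, 9))
x <- data.frame(exo1 = 1:6, exo2 = c(2, 1, 4, 3, 6, 5))
x_future <- data.frame(exo1 = 7:8, exo2 = c(8, 7))  # regressors for 2 future steps

blocks <- list()
for (x_col in colnames(x)) {
  # One block of rows per exogenous variable, one column per target series
  block <- data.frame(step = seq_len(nrow(x_future)), exo = x_col)
  for (y_col in colnames(y)) {
    train <- data.frame(y = y[[y_col]], x = x[[x_col]])
    fit <- lm(y ~ x, data = train)  # stand-in for auto.arima(..., xreg = ...)
    new_x <- data.frame(x = x_future[[x_col]])
    block[[y_col]] <- as.numeric(predict(fit, newdata = new_x))
  }
  blocks[[x_col]] <- block
}

# Stack the per-regressor blocks into one data frame
output.df <- do.call(rbind, blocks)
rownames(output.df) <- NULL
output.df
```

With real models, the predict() line would be replaced by the fcast$mean values from the forecast call, and the step column by the forecast period labels.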
Data
ts.cnt <- read.csv(text = "v1,v2,v3,v4,v5,v6
Jan 2012,892,1091,175,615,9,139
Feb 2012,749,1057,148,701,20,123
Mar 2012,1021,1312,245,824,23,151
Apr 2012,878,1178,243,811,16,131
May 2012,894,1249,223,799,5,132
Jun 2012,925,1141,242,782,15,117")
ev <- read.csv(text="exo1,exo2,exo3
75155,77628,76113
82914,76113,77653
82719,77653,81304
79342,81304,81745
76493,81745,79163
76911,79163,77188")

Resources