Split and randomly reassemble a time series, but maintain leap years in R

I need to create datasets of weather data to use for modeling over the next 50 years. I am planning to do this by using historical weather data (daily, 1980-2012), but mixing up the years in a random order and then relabeling them with 2014-2054. However, I cannot be completely random, because it is important to maintain leap years. I want to have as many datasets as possible so I can get an average response of the model to different weather patterns.
Here is an example of what the historical data looks like (except there is data for every day). How could I reassemble it so the years are in a different order, but make sure years with 366 days (1980, 1984, 1988) end up in future leap years (2016, 2020, 2024, 2028, 2052)? And then do that at least 50 more times?
year day radn maxt
1980 1 5.827989 -1.59375
1980 2 5.655813 -1.828125
1980 3 6.159346 -0.96875
1981 4 6.065136 -1.84375
1981 5 5.961181 -2.34375
1981 6 5.758733 -2.0625
1981 7 6.458055 -2.90625
1982 8 6.73056 -2.890625
1982 9 6.89472 -1.796875
1983 10 6.687879 -2.140625
1984 11 6.585833 -1.609375
1984 12 6.466392 -0.71875
1984 13 7.100092 -0.515625
1985 14 7.176402 -1.734375
1985 15 7.236122 -2.5
1985 16 7.455515 -2.375
1986 17 7.395174 -1.390625
1986 18 7.341537 -2.21875
1987 19 7.678102 -2.828125
1987 20 7.539239 -2.875
1987 21 7.231031 -2.390625
1988 22 7.397067 -0.21875
1988 23 7.947912 -0.5
1989 24 8.355059 -1.03125
1990 25 8.145792 -1.5
1990 26 8.591616 -2.078125

Here is a function that scrambles the years of a passed data frame df, returning a new data frame:
scramble.years = function(df) {
  # Build convenience vectors of years
  early.leap = seq(1980, 2012, 4)
  late.leap = seq(2016, 2052, 4)
  early.nonleap = seq(1980, 2012)[!seq(1980, 2012) %in% early.leap]
  late.nonleap = seq(2014, 2054)[!seq(2014, 2054) %in% late.leap]
  # Build map from late years to early years
  map = data.frame(from=c(sample(early.leap, length(late.leap), replace=TRUE),
                          sample(early.nonleap, length(late.nonleap), replace=TRUE)),
                   to=c(late.leap, late.nonleap))
  # Build a new data frame with the correct years/days for later period
  return.list = lapply(2014:2054, function(x) {
    get.df = subset(df, year == map$from[map$to == x])
    get.df$year = x
    return(get.df)
  })
  return(do.call(rbind, return.list))
}
You can call scramble.years any number of times to get new scrambled data frames.
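As a usage sketch, the function can be wrapped in replicate() to build the 50 or more data sets the question asks for. The toy data frame hist.df, the helper is.leap, and the seed are assumptions made purely for illustration; scramble.years is repeated from above (in slightly condensed form) only so that the block runs on its own.

```r
# Repeated from the answer above so that this block is self-contained
scramble.years <- function(df) {
  early.leap <- seq(1980, 2012, 4)
  late.leap <- seq(2016, 2052, 4)
  early.nonleap <- setdiff(1980:2012, early.leap)
  late.nonleap <- setdiff(2014:2054, late.leap)
  map <- data.frame(
    from = c(sample(early.leap, length(late.leap), replace = TRUE),
             sample(early.nonleap, length(late.nonleap), replace = TRUE)),
    to = c(late.leap, late.nonleap)
  )
  do.call(rbind, lapply(2014:2054, function(x) {
    get.df <- subset(df, year == map$from[map$to == x])
    get.df$year <- x
    get.df
  }))
}

# Hypothetical stand-in for the historical data: one row per day, 1980-2012
is.leap <- function(y) (y %% 4 == 0 & y %% 100 != 0) | y %% 400 == 0
set.seed(123)  # only to make the illustration reproducible
hist.df <- do.call(rbind, lapply(1980:2012, function(y) {
  n <- if (is.leap(y)) 366 else 365
  data.frame(year = y, day = seq_len(n), maxt = rnorm(n))
}))

# A list of 50 independently scrambled data frames
scrambled <- replicate(50, scramble.years(hist.df), simplify = FALSE)

# Check the first one: number of days assigned to each future year
days.per.year <- table(scrambled[[1]]$year)
days.per.year[c("2015", "2016")]
```

Because the years are sampled with replace=TRUE, a historical year can appear more than once within a single scrambled data set.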

Related

Create time series in R with weekly measurements for 30 years period

I have a set of weekly data for 30 years (1991 - 2020). The data was collected weekly between 5 May and 10 October every year. This gives me 23 weeks of data every year for 30 years.
I want to create a time series in R with this data. How do I do that, please? There should be just 690 entries in the output, but it is generating 1531 entries. See my code and data below:
I saw a similar question HERE, but mine repeats for 30 years.
myts <- ts(df$Kc_Kamble, start = c(1991, 1), end = c(2020, 23), frequency = 52)
Output in R:
Time Series:
Start = c(1991, 1)
End = c(2020, 23)
Frequency = 52
Sample data:
Year Week Kc_Kamble
1991 1 0.357445197
1991 2 0.36902168
1991 3 0.383675947
1991 4 0.400703221
1991 5 0.418901921
1991 6 0.437049406
1991 7 0.453742803
1991 8 0.467291036
1991 9 0.475942834
1991 10 0.476898402
1991 11 0.464632341
1991 12 0.436298927
1991 13 0.396338825
1991 14 0.352731819
1991 15 0.313539638
1991 16 0.283932169
1991 17 0.2627343
1991 18 0.247373874
1991 19 0.235647483
1991 20 0.225655859
1991 21 0.216663659
1991 22 0.208550065
1991 23 0.203605036
1992 1 0.336754943
1992 2 0.334735193
1992 3 0.342654691
1992 4 0.363520428
1992 5 0.397733301
1992 6 0.4399758
1992 7 0.483592219
1992 8 0.521920773
1992 9 0.548597061
1992 10 0.560150059
1992 11 0.557210705
1992 12 0.542114151
1992 13 0.5173071
1992 14 0.485236257
1992 15 0.448348321
1992 16 0.409089999
1992 17 0.369907993
1992 18 0.333162073
1992 19 0.300014261
1992 20 0.270225988
1992 21 0.243406301
1992 22 0.219247646
1992 23 0.204966601
Let me suggest the following steps to set up and start analyzing your time series.
Initialize your time series by creating a 'dates' sequence and 'data' (set to NA). Use the library xts to create the time series.
library(xts)
dates <- seq(as.Date("1991-01-01"), as.Date("2020-12-31"), by = "weeks")
data <- rep(NA, length(dates))
myxts <- xts(x = data, order.by = dates)
str(myxts); head(myxts); tail(myxts)
Collect your data.
Data is collected weekly between 5th may - 10 October every year.
Let's read the data and work with Weekly Total Precipitation for year 2014.
ts_data <- read.table("https://www.dropbox.com/s/k2cxpja3cpsyoyc/ts_data.txt?dl=1", header =TRUE, sep="\t")
year.2014 <- ts_data[which(ts_data$Year == 2014),]
year.2014 # 23 rows of data for 2014.
start <- as.Date("2014-5-5"); end <- as.Date("2014-10-10")
collect <- which ( index(myxts) >= start & index(myxts) <= end )
myxts[collect] <- year.2014$PRPtot
# year.2014 and collect must have the same number of rows
Verify the collected data. You should see data inside each time window, and NA outside the time windows.
myxts2 <- window(myxts, start=start-50, end=end+50)
str(myxts2); myxts2
Visualize the collected data. You could view the complete time series (i.e. myxts). Note that autoplot drops all NAs.
library(ggplot2)
autoplot(myxts2, geom = "point")
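If a plain ts object is enough, the 1531 entries in the question can be traced to frequency = 52: ts() then expects 52 slots per year, so 1991 week 1 to 2020 week 23 spans 29 * 52 + 23 = 1531 positions, and the 690 values are recycled to fill them. A minimal sketch, assuming the 23 observed weeks are treated as the whole cycle (frequency = 23) and using random numbers (kc, a hypothetical stand-in) in place of df$Kc_Kamble:

```r
set.seed(42)
kc <- runif(30 * 23)  # hypothetical stand-in for df$Kc_Kamble (690 values)

# frequency = 52: the short input is recycled to fill 29 * 52 + 23 slots
ts52 <- ts(kc, start = c(1991, 1), end = c(2020, 23), frequency = 52)
length(ts52)  # 1531

# frequency = 23: one slot per observed week, nothing is recycled
myts <- ts(kc, start = c(1991, 1), frequency = 23)
length(myts)  # 690
```

This treats week 23 of one year as immediately followed by week 1 of the next, so the October to May gap is not represented; the xts approach above keeps the real calendar dates instead.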

How do I calculate days since value exceeded in R?

I'm working with daily discharge data over 30 years. Discharge is measured in cfs, and my dataset looks like this:
date ddmm year cfs
1/04/1986 1-Apr 1986 2560
2/04/1986 2-Apr 1986 3100
3/04/1986 3-Apr 1986 2780
4/04/1986 4-Apr 1986 2640
...
17/01/1987 17-Jan 1987 1130
18/01/1987 18-Jan 1987 1190
19/01/1987 19-Jan 1987 1100
20/01/1987 20-Jan 1987 864
21/01/1987 21-Jan 1987 895
22/01/1987 22-Jan 1987 962
23/01/1987 23-Jan 1987 998
24/01/1987 24-Jan 1987 1140
I'm trying to calculate the number of days preceding each date that the discharge exceeds 1000 cfs and put it in a new column ("DaysGreater1000") that will be used in a subsequent analysis.
In this example, DaysGreater1000 would be 0 for all of the dates in April 1986. DaysGreater1000 would be 1 on 20 Jan, 2 on 21 Jan, 3 on 22 Jan, etc.
Do I first need to create a column (event) of binary data for when the threshold is exceeded? I have been reading several old questions and it looks like I need to use ifelse but I can't figure out how to make a new column of data and then how to make the next step to calculate the number of preceding days.
Here are the questions that I have been examining:
Calculate days since last event in R
Calculate elapsed time since last event
... And this is the code that looks promising, but I can't quite put it all together!
df %>%
  mutate(event = as.logical(event),
         last_event = if_else(event, true = date, false = NA_integer_)) %>%
  fill(last_event) %>%
  mutate(event_age = date - last_event)
summary(df)
I'm sorry if I'm not being very eloquent! I'm feeling a bit rusty as I haven't used R in a while.
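One possible way to put this together in base R is sketched below. It assumes one row per day in date order, reads "exceeds" as cfs > 1000, and uses only the January 1987 rows from the example; the intermediate names exceeds and last_row are made up for illustration.

```r
# The January 1987 rows from the question (dates are day/month/year)
df <- data.frame(
  date = as.Date(c("17/01/1987", "18/01/1987", "19/01/1987", "20/01/1987",
                   "21/01/1987", "22/01/1987", "23/01/1987", "24/01/1987"),
                 format = "%d/%m/%Y"),
  cfs = c(1130, 1190, 1100, 864, 895, 962, 998, 1140)
)

# Binary event column: TRUE on days when the threshold is exceeded
exceeds <- df$cfs > 1000

# Row index of the most recent exceedance at or before each row
last_row <- cummax(ifelse(exceeds, seq_along(exceeds), 0L))
last_row[last_row == 0L] <- NA  # no exceedance seen yet

# Days elapsed since that exceedance (0 on exceedance days themselves)
df$DaysGreater1000 <- as.integer(df$date - df$date[last_row])
df
```

Rows that come before the first exceedance in the data get NA rather than 0, so whether those should be set to 0 is a decision for the subsequent analysis.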

How to calculate the average year

I have a 20-year monthly XTS time series
Jan 1990 12.3
Feb 1990 45.6
Mar 1990 78.9
..
Jan 1991 34.5
..
Dec 2009 89.0
I would like to get the average (12-month) year, or
Jan xx
Feb yy
...
Dec kk
where xx is the average of every January, yy of every February, and so on.
I have tried apply.yearly and lapply, but these return 1 value, which is the 20-year total average.
Would you have any suggestions? I appreciate it.
The lubridate package could be useful for you. I would use the functions year() and month() in conjunction with aggregate():
library(xts)
library(lubridate)
#set up some sample data
dates = seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="month")
df = data.frame(rand1 = runif(length(dates)), rand2 = runif(length(dates)))
my_xts = xts(df, dates)
#get the mean by year
aggregate(my_xts$rand1, by=year(index(my_xts)), FUN=mean)
This outputs something like:
2000 0.5947939
2001 0.4968154
2002 0.4941752
2003 0.5291211
2004 0.6631564
To find the mean for each month you can do:
#get the mean by month
aggregate(my_xts$rand1, by=month(index(my_xts)), FUN=mean)
which will output something like
1 0.5560279
2 0.6352220
3 0.3308571
4 0.6709439
5 0.6698147
6 0.7483192
7 0.5147294
8 0.3724472
9 0.3266859
10 0.5331233
11 0.5490693
12 0.4642588
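The same grouping idea can be sketched in base R with tapply(), without lubridate. The dates and values below are made-up stand-ins for the 20-year monthly series; with an xts object, index() and coredata() would supply the dates and the values instead.

```r
# Hypothetical 20-year monthly series, Jan 1990 to Dec 2009
dates <- seq(as.Date("1990-01-01"), as.Date("2009-12-01"), by = "month")

# Toy values: 10 * month number, shifted by -1 and +1 in alternating years,
# so the long-run mean of each calendar month is exactly 10 * month number
month_no <- as.integer(format(dates, "%m"))
values <- month_no * 10 + rep(c(-1, 1), each = 12, length.out = length(dates))

# Average year: one mean per calendar month, named "01" to "12"
avg_year <- tapply(values, format(dates, "%m"), mean)
avg_year
```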

How to find out how many trading days in each month in R?

I have a data frame like this, spanning 10 years. It is Chinese market data, and because China observes lunar holidays, the holidays fall on different dates of the Western calendar each year.
When it is a holiday, the stock market does not open, so it is a non-trading day. Weekends are non-trading days too.
I want to find out which month of which year has the fewest trading days and, most importantly, what that number is.
There are no repeated days.
date change open high low close volume
1 1995-01-03 -1.233 637.72 647.71 630.53 639.88 234518
2 1995-01-04 2.177 641.90 655.51 638.86 653.81 422220
3 1995-01-05 -1.058 656.20 657.45 645.81 646.89 430123
4 1995-01-06 -0.948 642.75 643.89 636.33 640.76 487482
5 1995-01-09 -2.308 637.52 637.55 625.04 625.97 509851
6 1995-01-10 -2.503 616.16 617.60 607.06 610.30 606925
If there are no repeated days, you can count days per month and year by:
library(data.table)
library(lubridate)
dt <- as.data.table(dt)
dt_days <- dt[, .(count_day=.N), by=.(year(date), month(date))]
Then you only need to do this to get the min:
dt_days[count_day==min(count_day)]
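For comparison, the same count can be sketched in base R with table(). The January rows are the dates shown in the question; the three February dates are hypothetical additions so that there is more than one month to compare.

```r
# Trading dates: January rows from the question plus made-up February rows
dates <- as.Date(c("1995-01-03", "1995-01-04", "1995-01-05", "1995-01-06",
                   "1995-01-09", "1995-01-10",
                   "1995-02-01", "1995-02-02", "1995-02-03"))

# Number of rows (trading days) per year-month
counts <- table(format(dates, "%Y-%m"))

# Year-month(s) with the fewest trading days, together with the count
fewest <- counts[counts == min(counts)]
fewest
```

In this toy input, 1995-02 has the fewest trading days with 3; on the real data every tied year-month would be listed.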
The chron and bizdays packages deal with business days, but neither actually contains a usable calendar of holidays, which limits their usefulness.
We will use chron below assuming you have defined the .Holidays vector of dates that are holidays. (If you run the code below without doing that only weekdays will be regarded as business days as the default .Holidays vector supplied by chron has very few dates in it.) DF has 120 rows (one row for each year/month) and the last line subsets that to just the month in each year having least business days.
library(chron)
library(zoo)
st <- as.yearmon("2001-01")
en <- as.yearmon("2010-12")
ym <- seq(st, en, 1/12) # sequence of year/months of interest
# no of business days in each yearmonth
busdays <- sapply(ym, function(x) {
s <- seq(as.Date(x), as.Date(x, frac = 1), "day")
sum(!is.weekend(s) & !is.holiday(s))
})
# data frame with one row per year/month
yr <- as.integer(ym)
DF <- data.frame(year = yr, month = cycle(ym), yearmon = ym, busdays)
# data frame with one row per year
wx.min <- ave(busdays, yr, FUN = function(x) which.min(x) == seq_along(x))
DF[wx.min == 1, ]
giving:
year month yearmon busdays
2 2001 2 Feb 2001 20
14 2002 2 Feb 2002 20
26 2003 2 Feb 2003 20
38 2004 2 Feb 2004 20
50 2005 2 Feb 2005 20
62 2006 2 Feb 2006 20
74 2007 2 Feb 2007 20
95 2008 11 Nov 2008 20
98 2009 2 Feb 2009 20
110 2010 2 Feb 2010 20

converting continuous number into a binary value

I have a dataset with a column called BirthYear containing the years in which people were born. I need to create a new column that contains "young" if BirthYear is > 1993 and "old" if BirthYear is < 1993. I've tried using the if function, but I can't seem to get it to work. I would appreciate it if you could let me know how to do it, thanks!
I also like cut() for this, especially if you want the result to be a factor.
year <- sample(1989:1999, size=20, replace=T) # Arbitrary vector of years
breaks <- c(-Inf, 1993, Inf) # The 3 bounds of the 2 intervals
labels <- c("old", "young") # The 2 labels of the 2 intervals
binary <- cut(x=year, breaks=breaks, labels=labels, right=F)
# Inspect
data.frame(year, binary)
The result:
year binary
1 1993 young
2 1997 young
3 1989 old
4 1998 young
5 1999 young
6 1989 old
7 1994 young
8 1991 old
9 1991 old
10 1991 old
...
This is close to a duplicate, but involves custom labels.
If you have to inspect more than one variable eventually, look at dplyr::case_when().
Another option could be to use dplyr::recode_factor, as below:
library(dplyr)
set.seed(1)
year <- sample(1970:2005, size=10, replace=T)
year
#[1] 2001 1975 1979 1994 1974 1973 1985 1994 1975 1981
recode_factor(as.factor(year > 1993), 'TRUE' = "young", 'FALSE' = "old")
#[1] young old old young old old old young old old
#Levels: young old
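Since the question mentions trying if, the vectorized ifelse() is probably the most direct route. A minimal sketch, assuming a data frame called people (a hypothetical name) and the new column name AgeGroup; the question does not say what should happen for 1993 itself, so here it falls into "old":

```r
# Hypothetical data frame with the BirthYear column from the question
people <- data.frame(BirthYear = c(1989, 1993, 1994, 2001))

# ifelse() tests every element at once, unlike if(), which expects one value
people$AgeGroup <- ifelse(people$BirthYear > 1993, "young", "old")
people
```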
