How to construct a random data set for different years in R?

The code below generates uniformly distributed data at a daily time step for the year 2009. Suppose I want to construct a similar data set that includes the years 2009, 2012, 2015, and 2019. How would I do that? I am basically trying to avoid repeating the code or using filter to grab the data for the years of interest.
library(tidyverse)
library(lubridate)
set.seed(500)
DF1 <- data.frame(Date = seq(as.Date("2009-01-01"), to = as.Date("2009-12-31"), by = "day"),
Flow = runif(365,20,60))

Here is an option: create a vector of years, loop over it, build the sequence of dates for each year after converting to Date class, and draw 'Flow' from a uniform distribution.
year <- c(2009, 2012, 2015, 2019)
lst1 <- lapply(year, function(yr) {
dates <- seq(as.Date(paste0(yr, '-01-01')),
as.Date(paste0(yr, '-12-31')), by = 'day')
data.frame(Date = dates,
Flow= runif(length(dates), 20, 60))
})
Then combine the list into a single data.frame with do.call:
dat1 <- do.call(rbind, lst1)
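
As a quick check on the lapply approach above, the following sketch wraps it in a small helper and confirms that leap years are handled. The name make_flow_data and its arguments lo and hi are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: one row per day for each requested year,
# with Flow drawn from a uniform distribution on [lo, hi].
make_flow_data <- function(years, lo = 20, hi = 60) {
  per_year <- lapply(years, function(yr) {
    dates <- seq(as.Date(paste0(yr, "-01-01")),
                 as.Date(paste0(yr, "-12-31")), by = "day")
    data.frame(Date = dates, Flow = runif(length(dates), lo, hi))
  })
  do.call(rbind, per_year)
}

set.seed(500)
dat <- make_flow_data(c(2009, 2012, 2015, 2019))

# Rows per year: 2012 is a leap year, so it contributes 366 days
table(format(dat$Date, "%Y"))
```

Because the number of draws is taken from length(dates), no year length has to be hard-coded.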

Here is a possible solution:
set.seed(123)
sample_size <- 1000
y <- sample(c(2009,2012,2015,2019),sample_size,replace=TRUE)
simulate_date <- function(year){
n_days <- ifelse(lubridate::leap_year(year),
366,365)
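# note: offsets 1..n_days from January 1 give January 2 through January 1 of the
# following year; use sample(n_days, 1) - 1 to stay within the year
as.Date(sample(1:n_days, 1), origin=paste0(year,"-01-01"))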
}
dates <- Reduce(`c`, purrr::map(y, simulate_date))
> head(dates)
[1] "2012-06-28" "2012-01-15" "2009-07-15" "2012-11-02" "2019-04-29"
[6] "2015-10-27"
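
The same idea can be sketched in vectorised base R, without Reduce or purrr. This is only an illustration: simulate_dates is a hypothetical name, and the offsets are zero-based so that every sampled date falls inside its own year.

```r
# Hypothetical vectorised helper: one random date per element of `years`.
simulate_dates <- function(years) {
  start  <- as.Date(paste0(years, "-01-01"))
  # days in each year, computed as the gap to the next January 1
  n_days <- as.integer(as.Date(paste0(years + 1, "-01-01")) - start)
  # zero-based offsets 0 .. n_days - 1, so December 31 is the latest possible date
  offset <- floor(runif(length(years)) * n_days)
  start + offset
}

set.seed(123)
y <- sample(c(2009, 2012, 2015, 2019), 1000, replace = TRUE)
dates <- simulate_dates(y)
head(dates)
```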

Related

How to iterate over a list of column names in purrr::map to create new columns in R

I want to perform an action on a list of columns in a data frame using map(), but I get an error which I can't understand. Can anyone help?
I want it to cycle through the column names in vec and subtract each of those columns from column d in the data frame df.
Update: an answer was provided with across() (which works); however, I need to do this with map(), not across().
library(tidyverse)
df <- tibble(a=seq.Date(from=ymd("2021-01-01"),to =ymd("2021-12-31"),by = "day"),
b=seq.Date(from=ymd("2020-01-01"),to =ymd("2020-12-31"),by = "day"),
c=seq.Date(from=ymd("2019-01-01"),to =ymd("2019-12-31"),by = "day"),
d=seq.Date(from=ymd("2018-01-01"),to =ymd("2018-12-31"),by = "day")
)
vec <- c("a","b","c")
map(vec,~transmute(df,d-.x))
You may try across().
I changed the data so the dates start in March instead of January: February has a different number of days in the leap year 2020, so the columns would have unequal lengths and the tibble could not be created.
data
df <- tibble(a=seq.Date(from=ymd("2021-03-01"),to =ymd("2021-12-31"),by = "day"),
b=seq.Date(from=ymd("2020-03-01"),to =ymd("2020-12-31"),by = "day"),
c=seq.Date(from=ymd("2019-03-01"),to =ymd("2019-12-31"),by = "day"),
d=seq.Date(from=ymd("2018-03-01"),to =ymd("2018-12-31"),by = "day")
)
vec <- c("a","b","c")
code
df %>% mutate(across(all_of(vec), ~ d-.x))
Created on 2023-01-28 with reprex v2.0.2
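
Since the question asks for map() specifically, here is one possible sketch. Inside map(), .x is only the string "a", "b" or "c", so d - .x most likely fails because it tries to subtract a character value from a Date; looking the column up by name avoids that. The d_minus_ prefix for the new columns is an assumption made for illustration.

```r
library(tibble)
library(purrr)

df <- tibble(
  a = seq(as.Date("2021-03-01"), as.Date("2021-12-31"), by = "day"),
  b = seq(as.Date("2020-03-01"), as.Date("2020-12-31"), by = "day"),
  c = seq(as.Date("2019-03-01"), as.Date("2019-12-31"), by = "day"),
  d = seq(as.Date("2018-03-01"), as.Date("2018-12-31"), by = "day")
)
vec <- c("a", "b", "c")

# Look each column up by its name instead of using the string itself
diffs <- map(vec, ~ df$d - df[[.x]])
names(diffs) <- paste0("d_minus_", vec)  # hypothetical column names

# Add the three difference columns to a copy of the original tibble
result <- df
result[names(diffs)] <- diffs
```

Each new column is a difftime in days, for example d minus c for the first row is one year apart.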

Loop in R using changing variable to write and name files

I am trying to create a loop in R that reads daily values of a netcdf file I have imported and converts them into annual sums, then creates a raster for each year. I have converted the netcdf into an array - this is named Biased_corrected.array in my code below. I am not sure how to include the variable 'year' in my file names as it changes with each iteration of the loop. I have tried using paste but this seems to be where it fails. Any suggestions?
# read in file specifying which days correspond to years
YearsDays <- read.csv("Data\\Years.csv") # a df with 49 obs. of 3 variables (year, start day, and end day
YearsDays[1,2:3] #returns 1 and 366 (the days for year 1972)
YearsDays[2,2:3] #returns 367 and 731 (the days for year 1973)
YearsDays[1,1] #returns 1972
YearsDays[2,1] #returns 1973
counter <- 1
startyear <- YearsDays[1,1]
year <- startyear
while(year < 2021){
#set variables to loop through
startday <- YearsDays[counter,2]
endday <- YearsDays[counter,3]
BC_rain.slice <- Biased_corrected.array[,,startday:endday]
paste(year, "_Annual_rain") <- apply(BC_rain.slice, c(1,2), sum)
#save data in a raster
paste(year, "_rain_r") <- raster(t(paste(year, "_Annual_Rain"), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84)
# move on to next year
counter <- counter + 1
year <- 1971 + counter
}
EDIT: The working code for anyone interested:
YearsDays <- read.csv("Data\\Years.csv") # a df with 49 obs. of 3 variables (year, start day, and end day
for (idx in seq(nrow(YearsDays))){
#set variables to loop through
year <- YearsDays[idx,1]
startday <- YearsDays[idx,2]
endday <- YearsDays[idx,3]
BC_rain.slice <- Biased_corrected.array[,,startday:endday]
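annual_rain <- apply(BC_rain.slice, c(1,2), sum)
# paste0 avoids the space that paste inserts between the year and the suffix
assign(paste0(year, "_Annual_rain"), annual_rain)
#save data in a raster
# note: names starting with a digit need backticks when used later, e.g. `1972_rain_r`
assign(paste0(year, "_rain_r"),raster(t(annual_rain), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84))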
}
You can't use paste() on the left-hand side of an assignment to create a variable name. You can wrap the name in assign() or use eval(), but it is usually easier to store the results in a single object. Below is an example of what I believe you're trying to achieve, using a list named by year, because a raster object does not fit into an ordinary (atomic) data frame column. I have also replaced your while loop and counter with a for loop iterating over the rows of YearsDays:
YearsDays <- read.csv("Data\\Years.csv") # a df with 49 obs. of 3 variables (year, start day, and end day)
output <- vector("list", nrow(YearsDays))
names(output) <- YearsDays[,1]
for (idx in seq_len(nrow(YearsDays))){
#set variables to loop through
year <- YearsDays[idx,1]
startday <- YearsDays[idx,2]
endday <- YearsDays[idx,3]
BC_rain.slice <- Biased_corrected.array[,,startday:endday]
annual_rain <- apply(BC_rain.slice, c(1,2), sum)
#save data in a raster, stored under the year as its name
output[[idx]] <- raster(t(annual_rain), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84)
}
How about replacing this part
paste(year, "_Annual_rain") <- apply(BC_rain.slice, c(1,2), sum)
#save data in a raster
paste(year, "_rain_r") <- raster(t(paste(year, "_Annual_Rain"), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84)
with
# names that start with a digit are not syntactic in R, so they are wrapped in backticks
txt <- paste0("`", year, "_Annual_rain` <- apply(BC_rain.slice, c(1,2), sum)")
eval(parse(text = txt))
# save data in a raster
txt <- paste0("`", year, "_rain_r` <- raster(t(`", year, "_Annual_rain`), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84)")
eval(parse(text = txt))
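
For comparison, here is a minimal sketch of the same loop that keeps the annual sums in a named list instead of generating variable names. The array and the day table are small stand-ins invented for illustration (a 2 x 2 grid with a constant value of 1 per day), and the raster step is left out so the example runs without extra packages.

```r
# Stand-in data (assumed for illustration): 2 x 2 grid, 731 daily layers, all equal to 1
daily <- array(1, dim = c(2, 2, 731))
years_days <- data.frame(year  = c(1972, 1973),
                         start = c(1, 367),
                         end   = c(366, 731))

annual <- vector("list", nrow(years_days))
names(annual) <- paste0("rain_", years_days$year)

for (idx in seq_len(nrow(years_days))) {
  slice <- daily[, , years_days$start[idx]:years_days$end[idx]]
  # sum over the third (time) dimension, leaving one value per grid cell
  annual[[idx]] <- apply(slice, c(1, 2), sum)
}

annual[["rain_1972"]]
```

Elements can then be looked up by name, and a further lapply over the list could build one raster per year.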

Transforming data frame to time series in r

I am currently trying to convert a data.frame to a time series. The data frame looks like this:
All I want to do is be able to plot the doc data as a function of time and run a statistical test on it.
Any help would be greatly appreciated!
This is what my code currently looks like:
x=aggregate( doc ~ mo + yr , B , mean )
x$Date <- as.yearmon(paste(x$yr, x$mo), "%Y %m")
df_ts <- xts(x, order.by = x$Date)
keeps <- "doc"
df_ts <- df_ts[ , keeps, drop = FALSE]
df_ts_1 <- as.ts(df_ts, start = head(index(df_ts), 1), end =
tail(index(df_ts), 1))
The issue I'm running into is that the months and years are not in sequential order, so when I try to apply an as.ts function, the data does not fill in correctly.
Using DF, defined reproducibly in the Note at the end, read the data frame into a zoo object, converting the Date column to yearmon class, and plot it using ggplot2. If there are duplicate dates (there are none in the example data), add the aggregate = mean argument to read.zoo.
library(ggplot2)
library(zoo)
z <- read.zoo(DF[c("Date", "doc")], FUN = as.yearmon, format = "%b %Y")
autoplot(z) + scale_x_yearmon()
This would also work:
tt <- as.ts(z)
plot(na.approx(tt), ylab = "tt")
Note
In the future please do not use images. I have retyped the first three rows this time.
DF <- data.frame(month = c("02", "10", "12"), year = c(1998, 2000, 2000),
doc = c(1.55, 2.2, 0.96), Date = c("Feb 1998", "Oct 2000", "Dec 2000"),
stringsAsFactors = FALSE)

Can I subset specific years and months directly from POSIXct datetimes?

I have time series data and I am trying to subset the following:
1) periods between specific years (beginning 12AM January 1 and ending 11pm December 31)
2) periods without specific months
These are two independent subsets I am trying to do.
Given the following dataframe:
test <- data.frame(seq(from = as.POSIXct("1983-03-09 01:00"), to = as.POSIXct("1985-01-08 00:00"), by = "hour"))
colnames(test) <- "DateTime"
test$Value<-sample(0:100,16104,rep=TRUE)
I can first create Year and Month columns and use these to subset:
# Add year column
test$Year <- as.numeric(format(test$DateTime, "%Y"))
# Add month column
test$Month <- as.numeric(format(test$DateTime, "%m"))
# Subset specific year (1984 in this case)
sub1 = subset(test, Year!="1983" & Year!="1985")
# Subset specific months (April and May in this case)
sub2 = subset(test, Month=="4" | Month=="5")
However, I am wondering if there is a better way to do this directly from the POSIXct datetimes (without having to first create the Year and Month columns). Any ideas?
# keep 1984 only, i.e. drop 1983 and 1985 as in the question's code
sub1 <- subset(test, !format(DateTime, "%Y") %in% c("1983" , "1985") )
sub2 <- subset(test, as.numeric(format(DateTime, "%m")) %in% 4:5)
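
A self-contained sketch of both subsets, working straight from the POSIXct column. The time zone is set to UTC here as an assumption, to keep the hourly sequence free of daylight-saving jumps; only_1984 and no_apr_may are illustrative names.

```r
test <- data.frame(
  DateTime = seq(as.POSIXct("1983-03-09 01:00", tz = "UTC"),
                 as.POSIXct("1985-01-08 00:00", tz = "UTC"),
                 by = "hour")
)
set.seed(1)
test$Value <- sample(0:100, nrow(test), replace = TRUE)

# Derive year and month on the fly, without adding columns to the data frame
yr <- format(test$DateTime, "%Y")
mo <- as.integer(format(test$DateTime, "%m"))

only_1984  <- test[yr == "1984", ]   # 1) everything between 1984-01-01 00:00 and 1984-12-31 23:00
no_apr_may <- test[!mo %in% 4:5, ]   # 2) everything except April and May
```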

Creating a sequence of columns in a data frame based on an index for loop or using plyr in r

I wish to create 24 hourly data frames, each containing the hourly demand for a product as one column, followed by 8 columns of hourly temperatures. For example, the data frame for 8AM will contain a column for demand at 8AM, then eight temperature columns ranging from the current hour back through the 7 previous hours. The additional complication is that for hours before 8AM, e.g. 4AM, I have to get yesterday's temperatures. I am banging my head against the wall trying to figure out how to do this with apply, plyr, or a vectorized function.
demand8AM Temp8AM Temp7AM Temp6AM...Temp1AM
Demand4AM Temp4AM Temp3AM Temp2AM Temp1AM Temp12AM Temp11pm(Lag) Temp10pm(Lag)
In my code Hours are numbers; 1 is 12AM etc.
Here is some simple code I created to create the dataset I am dealing with.
#Creating some Fake Data
require(plyr)
# setting up some fake data
set.seed(31)
foo <- function(myHour, myDate){
rlnorm(1, meanlog=0,sdlog=1)*(myHour) + (150*myDate)
}
Hour <- 1:24
Day <-1:90
dates <-seq(as.Date("2012-01-01"), as.Date("2012-3-30"), by = "day")
myData <- expand.grid( Day, Hour)
names(myData) <- c("Date","Hour")
myData$Temperature <- apply(myData, 1, function(x) foo(x[2], x[1]))
myData$Date <-dates
myData$Demand <-(rnorm(1,mean = 0, sd=1)+.75*myData$Temperature )
## ok, done with the fake data generation.
It looks as though you could benefit from utilizing a time series. Here's my interpretation of what you want (I used the "mean" function in rollapply), not what you asked for. I recommend you read over the xts and zoo packages.
library(xts)
#create dummy time vector
time_index <- seq(from = as.POSIXct("2012-05-15 07:00"),
to = as.POSIXct("2012-05-17 18:00"), by = "hour")
#create dummy demand and temp.C
info <- data.frame(demand = sample(1:length(time_index), replace = T),
temp.C = sample (1:10))
#turn demand + temp.C into time series
eventdata <- xts(info, order.by = time_index)
x2 <- eventdata$temp.C
for (i in 1:8) {x2 <- cbind(x2, lag(eventdata$temp.C, i))}
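
The lagging idea can also be sketched in base R with embed(), which builds the current value and its predecessors as columns in one step. This assumes the rows are ordered by time with no missing hours, so lags that reach back before midnight pick up the previous day automatically. The data and the column names (Temp, TempLag1, ...) are made up for illustration.

```r
set.seed(31)
n_days <- 3
hourly <- data.frame(
  Day  = rep(seq_len(n_days), each = 24),
  Hour = rep(1:24, times = n_days),
  Temperature = round(runif(n_days * 24, 0, 30), 1)
)
hourly$Demand <- 0.75 * hourly$Temperature

n_lags <- 8  # current hour plus the 7 previous hours

# Row i of embed() holds the temperature at time i + 7 followed by the 7 earlier values
temps <- embed(hourly$Temperature, n_lags)
colnames(temps) <- c("Temp", paste0("TempLag", 1:(n_lags - 1)))

# The first 7 rows have no complete history, so they are dropped
lagged <- cbind(hourly[n_lags:nrow(hourly), c("Day", "Hour", "Demand")], temps)

# One data frame per hour of the day
by_hour <- split(lagged, lagged$Hour)
```

by_hour[["8"]] then holds the demand at hour 8 together with the eight temperature columns for each day.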
