Calculating seasonal range in R for a number of years

I have a data frame of daily temperature measurements spanning 20 years. I would like to calculate the annual range in the data series for each year (i.e. end up with 20 values, representing the range for each year). Example data:
begin_date = as.POSIXlt("1990-01-01", tz = "GMT")
dat = data.frame(dt = begin_date + (0:(20*365)) * (86400))
dat = within(dat, {speed = runif(length(dt), 1, 10)})
I was thinking of writing a loop that goes through each year and calculates the range, but was hoping there was another solution.
I think the best way forward would be to have the maximum and minimum values for each year and then calculate the range from that. Can anyone suggest a method to do this without writing a loop to go through each year individually?

Try
library(dplyr)
library(lubridate)  # provides year()
dat %>%
  group_by(year = year(dt)) %>%
  summarise(Range = diff(range(speed)))
Or
library(data.table)
setDT(dat)[, list(Range=diff(range(speed))), year(dt)]
Or
aggregate(speed~cbind(year=year(dt)), dat, function(x) diff(range(x)))
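If you would rather avoid extra packages, the same grouping idea can be sketched in base R. This is only an illustration: it rebuilds the example data from the question and uses format(dt, "%Y") as the grouping key instead of year(); the names yr and annual_range are made up for the sketch.

```r
# Rebuild the example data from the question
begin_date <- as.POSIXct("1990-01-01", tz = "GMT")
dat <- data.frame(dt = begin_date + (0:(20 * 365)) * 86400)
dat$speed <- runif(nrow(dat), 1, 10)

# Year of each observation as a character key, e.g. "1990"
yr <- format(dat$dt, "%Y")

# One range (max minus min) per year, returned as a named vector
annual_range <- tapply(dat$speed, yr, function(x) diff(range(x)))
```

The result is a named vector with one entry per calendar year, so annual_range["1995"] gives the range for 1995.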

Related

Smoothing out missing values in R dataframe

I am using the dataset - https://data.ca.gov/dataset/covid-19-cases/resource/7e477adb-d7ab-4d4b-a198-dc4c6dc634c9 to look into covid cases and deaths in California.
As well as looking at cases/deaths by ethnicity, I have grouped the data to give total cases and deaths per day. I also used the lag function to give a daily case/death amount.
However, on 2 days in December (23rd and 30th) no increment was made to the cases or deaths columns, so the daily cases and deaths read 0. The following day the data is 'caught up', with an extra-large amount being added on, clearly the sum of the 2 days. (I suspect Christmas and New Year are the causes.)
Is there a way of fixing this data? E.g. splitting the double-day measurement in half and populating the cells with this, and then retrospectively altering the daily cases and daily deaths figures?
Hopefully the screenshots will clarify what I mean.
Here is the code I have used:
library(dplyr)
demog_eth <- read.csv("./Data/case_demographics_ethnicity.csv", header = TRUE, sep = ",")
demog_eth$date <- as.Date(demog_eth$date)
# Create a data frame with total daily information
total_stats <- data.frame(demog_eth$cases, demog_eth$deaths, demog_eth$date)
names(total_stats) <- c('cases', 'deaths', 'date')
total_stats <- total_stats %>% group_by(date) %>% summarise(cases = sum(cases), deaths = sum(deaths))
# Add daily cases and deaths by computing the daily difference in totals
# (lag() looks at the previous row)
total_stats <- total_stats %>%
  mutate(daily_cases = cases - lag(cases),
         daily_deaths = deaths - lag(deaths))
The top paragraph of text in the image says "cases" and "deaths"; it should say "Daily Cases" and "Daily Deaths". Apologies.
df <- data.frame(col = 1:100, col2 = seq(from = 1, to = 200, by = 2))
df[c(33, 2), ] <- 0  # simulate two days with no reported increment
zeros <- which(df$col == 0)  # detect rows with 0
for (i in zeros) {
  # split the next row's value evenly between the zero row and the next row
  df[i, "col"] <- 0.5 * df[i + 1, "col"]
  df[i + 1, "col"] <- 0.5 * df[i + 1, "col"]
}
Sorry for using my own simple example data, but the mechanism should work if adapted.
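As a rough sketch of how that mechanism might be adapted to the question's layout, the block below uses a small made-up stand-in for total_stats (the dates and counts are invented, not taken from the California dataset). It assumes a stalled day is followed by a catch-up day, splits the catch-up amount evenly, and moves the cumulative total for the stalled day up to match.

```r
# Hypothetical stand-in for total_stats: cumulative cases by date,
# with one day (row 3) on which the total was not updated
total_stats <- data.frame(
  date  = as.Date("2020-12-21") + 0:4,
  cases = c(100, 120, 120, 180, 200)
)
total_stats$daily_cases <- c(NA, diff(total_stats$cases))

# Rows where the daily figure is 0 and a following row exists
stalled <- which(total_stats$daily_cases == 0)
stalled <- stalled[stalled < nrow(total_stats)]

for (i in stalled) {
  # Half of the catch-up amount reported on the following day
  half <- total_stats$daily_cases[i + 1] / 2
  total_stats$daily_cases[c(i, i + 1)] <- half
  # Raise the cumulative total for the stalled day by the same amount
  total_stats$cases[i] <- total_stats$cases[i - 1] + half
}
```

The same loop could be repeated for deaths and daily_deaths. An even split is only one possible choice; it does not recover the true daily figures.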

Simulate a block of code n (say 1000) times while saving the result in a vector in R

I'm still relatively new to R so I'm struggling with repeating lines of code several times and saving the result for each repetition.
The aim is to randomly (equal probability) assign a number of events, in my case 100, over a 20 year period. Since days are irrelevant I use the number of months to define the period. Subsequently, I'm counting the events for every 24-month period within the 20 years. Lastly, extracting the maximum number of events occurring within a 24-month period.
Albeit messy and probably inefficient, the code works for the intended purpose. However, I want to repeat this process 1000 times to get a distribution of all the maximum number of events taking place over 24 months to compare to my real data.
Here is my code so far:
library(runner)
library(dplyr)
#First I set the period from the year 2000 to 2019 with one-month increments.
period <- seq(as.Date("2000/1/1"), by = "month", length.out = 240)
#I sample random observations assigned to different months over the entire period.
u <- sample(period, size=100, replace=T)
#Make a table in order to register the number of occurrences within each month.
u <- table(u)
#Create a data frame to ease information processing.
simulation <- data.frame(u)
#Change the date column to date format.
simulation$u <- as.Date(simulation$u)
#Compute number of events taking place within every 24-month period (730 = days in 24 months).
simulation <- simulation %>%
  mutate(
    Last_24_month_total = sum_run(
      x = simulation$Freq,
      k = 730,
      idx = simulation$u)
  )
#Extract the maximum number of occurrences within a 24-month period
max_events <- max(simulation$Last_24_month_total)
Could someone help me understand how to rewrite this process in order to facilitate a thousand repetitions while saving the max value for each repetition?
thanks
As @jogo suggested in the comments, you can use replicate.
I simplified your code.
library(runner)
library(dplyr)
seq_dates <- seq(as.Date("2000/1/1"), by = "month", length.out = 240)
replicate(1000,  # number of repetitions asked for in the question
seq_dates %>%
sample(100, replace = TRUE) %>%
table() %>%
sum_run(730, idx = as.Date(names(.))) %>%
max)
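For comparison, here is a base R sketch of the same replicate pattern with the result stored in a vector. Note the assumption: it counts events in windows of 24 consecutive calendar months using month indices 1 to 240, which is close to but not identical to the 730-day window used with sum_run. The helper max_in_window and the other names are hypothetical.

```r
set.seed(1)  # only to make the sketch reproducible

n_months <- 240  # 20 years in months
n_events <- 100
window   <- 24

# One simulation run: returns the largest event count in any 24-month window
max_in_window <- function() {
  # Assign each event to a month index 1..240 with equal probability
  months <- sample.int(n_months, n_events, replace = TRUE)
  counts <- tabulate(months, nbins = n_months)
  # Rolling 24-month totals computed from cumulative sums
  cs <- c(0, cumsum(counts))
  totals <- cs[(window + 1):(n_months + 1)] - cs[1:(n_months - window + 1)]
  max(totals)
}

# replicate() calls the function 1000 times and collects the results
max_events <- replicate(1000, max_in_window())
```

max_events is then a plain numeric vector of length 1000, so hist(max_events) or quantile(max_events, 0.95) can be compared against the observed value.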

Create date index and add to data frame in R

I am currently transitioning from Python to R. In Python, you can create a date range with pandas and add it to a data frame like so:
data = pd.read_csv('Data')
dates = pd.date_range('2006-01-01 00:00', periods=2920, freq='3H')
df = pd.DataFrame({'data' : data}, index = dates)
How can I do this in R?
Further, if I want to compare 2 datasets with different lengths but same time span, you can resample the dataset with lower frequency so it can be the same length as the higher frequency by placing 'NaNs' in the holes like so:
df2 = pd.read_csv('data2') #3 hour resolution = 2920 points of data
data2 = df2.resample('30Min').asfreq() #30 Min resolution = 17520 points
I guess I'm basically looking for a Pandas package equivalent for R. How can I code these in R?
The following is a way of getting your time-series data from one time interval (3 hours) to another (30 minutes):
Get the data:
starter_df <- data.frame(dates = seq(from = as.POSIXct("2006-01-01 00:00"),
                                     length.out = 2920,
                                     by = "3 hours"),
                         data = rnorm(2920))
Get the full sequence in 30 minute intervals and replace the NA's with the values from the starter_df data.frame:
full_dates <- seq(from = min(starter_df$dates), to = max(starter_df$dates), by = "30 min")
full_data <- data.frame(dates = full_dates, data = NA_real_)
# copy the 3-hourly values into the matching 30-minute slots
full_data[full_data$dates %in% starter_df$dates, ] <- starter_df[starter_df$dates %in% full_data$dates, ]
I hope it helps.
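Another way to get the same NA-filled result is a left join of the 30-minute grid against the 3-hourly data with merge(). The following is a minimal sketch on a shortened series (8 points instead of 2920, random placeholder values, and an explicit UTC time zone chosen for the illustration); the names low_freq, grid and resampled are made up.

```r
# Short 3-hourly series (values are random placeholders)
dates_3h <- seq(as.POSIXct("2006-01-01 00:00", tz = "UTC"),
                by = "3 hours", length.out = 8)
low_freq <- data.frame(dates = dates_3h, data = rnorm(8))

# Target 30-minute grid covering the same span
grid <- data.frame(dates = seq(min(dates_3h), max(dates_3h), by = "30 min"))

# Left join: grid rows without a matching 3-hourly value get NA
resampled <- merge(grid, low_freq, by = "dates", all.x = TRUE)
```

This roughly mirrors what resample('30Min').asfreq() does in pandas: the original values stay on their timestamps and the new slots are left missing.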

Convert a time series from minutes to Day period

I've got an R time series object that is measured in 1 hour intervals.
library(lubridate)
library(timeSeries)
set.seed(100)
c <- Sys.time()
d <- c + hours(1:200)
e <- rnorm(200)
f <- data.frame(d,e)
g <- as.timeSeries(f)
I would like to convert this to a daily time series. I am fine with using the average of the data column for this conversion.
The outcome would be a time series object with one entry per day whose value is the average of all the hourly values of that particular day.
How can this be done?
First, take advantage of lubridate package to calculate date:
library(lubridate)
f$date <- floor_date(f$d, "day")  # f$d is already a date-time, so no parsing is needed
Then, calculate average for given day with
library(dplyr)
dplyr::group_by(f, date) %>%
dplyr::summarise(avg = mean(e))
Then use the result to build the daily time series.
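The daily averaging step can also be sketched in base R without lubridate or dplyr. The example below is an assumption-laden illustration: it uses a fixed start time in UTC instead of Sys.time() so the day boundaries are predictable, and the names hourly and daily are invented for the sketch.

```r
set.seed(100)

# 200 hourly observations from a fixed start (chosen for reproducibility)
start  <- as.POSIXct("2024-01-01 00:00", tz = "UTC")
hourly <- data.frame(d = start + 3600 * (0:199), e = rnorm(200))

# Calendar day of each observation
hourly$date <- as.Date(hourly$d, tz = "UTC")

# One row per day holding the mean of that day's hourly values
daily <- aggregate(e ~ date, data = hourly, FUN = mean)
```

daily has one row per calendar day, which can then be passed on to whichever time series class you prefer.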

R repeat rows by vector and date

I have a data frame with 275 different stations, 43 years of seasonal data (October to the following April; May to September are not needed) and 6 variables. Here is a small example of the data frame with only one variable, called value:
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
What I need is to fill the gaps between observation days with daily dates (e.g. 1:8), with each row's value divided by the number of days it covers. It would look like:
data1 <- data.frame(station=rep(1,40), year=rep(1969,40), month=c(rep(10,31),rep(11,9)),day=c(1:31,1:9),value=rep(c(1/7,2/8,3/8,4/8,5/8,6/8),c(7,8,8,8,8,1)))
I wrote some poor code and searched around the site, but unfortunately couldn't work it out. Any help or better ideas would be appreciated.
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
for (i in 1:length(station.date)){
days <- as.numeric(station.date[i+1]-station.date[i]) #not working
data <- within(data, days <- c(days,1))
}
rows <- rep(1:nrow(data), times=data[ ,data$days])
rows <- ifelse(rows > 10, 0, rows) #get rid of month May to Sept
data <- data[rows, ]
data <- within(data, value1 <- value/days)
data <- within(data, dd <- ?) #don't know to change the repeated days to real days
I wrote some code that does the same thing as your example, but you will probably have to modify it to handle the whole data set. I wasn't sure what to do with the last observation, so in the end I made a special case for it. If it should be divided by a different number, you just need to replace the 8 inside values <- c(values, tail(data$value, 1) / 8)
with that number. Moreover, if you have all 275 stations in one data.frame, I think the best idea would be to split it, transform each piece separately, and then rbind the results.
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
# number of days between consecutive observations
d <- as.numeric(diff(station.date))
# total number of days covered (named n_days to avoid masking base::range)
n_days <- sum(d) + 1
# create dates
dates <- seq(station.date[1], by = "day", length.out = n_days)
# create values: each observation is spread evenly over the days up to the next one
values <- unlist(lapply(seq_along(d), function(i){
  rep(data$value[i] / d[i], d[i])
}))
# adding last observation
values <- c(values, tail(data$value, 1) / 8)
# create new data frame
data2 <- data.frame(station = rep(1, n_days),
                    year = as.numeric(format(dates, "%Y")),
                    month = as.numeric(format(dates, "%m")),
                    day = as.numeric(format(dates, "%d")),
                    value = values)
It could probably be optimised in some way, but I hope it helps. Note how I extract year, month and day from dates.
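To illustrate the split, transform and recombine idea for several stations, here is a minimal sketch. The helper expand_station and its last_div argument are hypothetical names, the second station's values are made up, and the sketch assumes each piece passed to the helper covers one continuous season (otherwise the summer gap would be filled as well).

```r
# Hypothetical helper wrapping the steps above for one station's rows.
# last_div is the divisor for the final observation (8 in the example).
expand_station <- function(df, last_div = 8) {
  obs_dates <- as.Date(with(df, paste(year, month, day, sep = "-")))
  gaps  <- as.numeric(diff(obs_dates))
  dates <- seq(obs_dates[1], by = "day", length.out = sum(gaps) + 1)
  n <- nrow(df)
  # Spread each observation evenly over the days up to the next one
  values <- c(rep(df$value[-n] / gaps, times = gaps), df$value[n] / last_div)
  data.frame(station = df$station[1],
             year  = as.numeric(format(dates, "%Y")),
             month = as.numeric(format(dates, "%m")),
             day   = as.numeric(format(dates, "%d")),
             value = values)
}

# Two made-up stations with the same observation days
data <- data.frame(station = rep(1:2, each = 6), year = 1969,
                   month = rep(c(10, 10, 10, 10, 11, 11), 2),
                   day = rep(c(1, 8, 16, 24, 1, 9), 2),
                   value = c(1:6, 11:16))

# Split by station, expand each piece, then stack the results by row
daily <- do.call(rbind, lapply(split(data, data$station), expand_station))
```

For 43 years of data, the split key would likely need to identify the season as well as the station, for example split(data, list(data$station, season_id)) with a season_id column you construct yourself.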
