R Trend of days over several years - r

I want to calculate different kinds of trends with this testframe.
library(lubridate)  # provides ymd()
date_list = seq(ymd('2000-01-15'),ymd('2010-09-18'),by='day')
testframe = data.frame(Date = date_list)
testframe$Day = substr(testframe$Date, start = 6, stop = 10)
testframe$V1 = runif(3900, 2.0, 35.0)
testframe$V2 = runif(3900, 5.0, 40.0)
testframe$V3 = runif(3900, -10.0, 10.0)
testframe$V4 = seq(from = 5, to = 45, length.out = 3900)
I have a Date column which contains the exact date. The Day column contains the extracted month and day, and then I have 4 columns with different values.
I want to do two different things:
Calculate the trend of the values of each day over the years. So in the end I only have a column with each single day (no years) and then the slope of the values V1-V4 for each day. The slope should be calculated for each day from the year 2000 to 2010.
The same as above, but this time I want to calculate the slope not only from each single day; I also want to take the mean of the values 15 days before and 15 days after each day. So for the slope of the values of e.g. 2000-02-01, I want to have the mean of all values from 2000-01-17 to 2000-02-16. Once I have the mean of these days, I want to do the same as above.
My only effort so far was to create the "Day" column to use it with the aggregate command, but it hasn't got me anywhere so far.
UPDATE: I found a nice package named TTR which contains a moving average function. That is what I need. I just haven't found out how to use it for several columns:
library(TTR)
mavg.15day = SMA(testframe$V1, n=15)
Unfortunately, it only uses the days before each date (a trailing window), not the days on both sides of it.
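One possible way to approach both tasks in base R is sketched below. It is not from the original thread: the helper name slope_by_day, the extra Year column, and the choice of stats::filter for a centred 31-day mean are all assumptions made for illustration. Only V1 is shown; the same calls could be repeated for V2 to V4.

```r
# Toy data with the same layout as testframe (column names assumed from the question)
dates <- seq(as.Date("2000-01-15"), as.Date("2010-09-18"), by = "day")
tf <- data.frame(Date = dates,
                 Day  = format(dates, "%m-%d"),
                 Year = as.numeric(format(dates, "%Y")),
                 V1   = seq(from = 5, to = 45, length.out = length(dates)))

# Hypothetical helper: slope of one value column against Year,
# fitted separately for every calendar day (one lm per Day)
slope_by_day <- function(d, value) {
  sapply(split(d, d$Day), function(g) {
    if (nrow(g) < 2) return(NA_real_)       # a slope needs at least two years
    unname(coef(lm(g[[value]] ~ g$Year))[2])
  })
}

# Task 1: trend of the raw daily values
slopes_raw <- slope_by_day(tf, "V1")

# Task 2: centred mean over 15 days before, the day itself and 15 days after,
# then the same per-day slope on the smoothed values
tf$V1_ma <- as.numeric(stats::filter(tf$V1, rep(1 / 31, 31), sides = 2))
slopes_ma <- slope_by_day(tf[!is.na(tf$V1_ma), ], "V1_ma")

head(slopes_raw)
```

The first and last 15 rows have no complete window, so their smoothed value is NA and they are dropped before fitting. If a package is acceptable, zoo::rollmean with align = "center" is another way to get a centred window.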

Related

Create moving-periods in a dataframe and calculate things (R studio)

I have a dataframe with Precipitation data for every day from January 1961 to December 2017 that looks like this:
DF=data.frame(Years,Month,Day,Precipitation Value)
I want to create periods of 30 days starting with the 1st of January 1961, so the first period will be 1st January to 30th January 1961, and I want R to calculate the number of days without rain (Precipitation Value=0). Then I want to do the same with the next day, the 2nd of January, so the period will be 2nd January to 31st January, etc. After that, I need R to create a data frame with all the results for the year 1961. So it should be a data frame with only one column of values (those values will be the number of days without rain in every period).
Then I need to do the same thing with all the years. That means I will end up with 57 data frames (1 for each year from 1961 to 2017), and after that I could make a matrix with all of them (putting each data frame as a row).
The thing is I DO NOT KNOW how to start. I have no idea how to write the loop. I know it should be really easy, but I am having trouble doing it. In particular, I do not know how to tell R to stop at every new year, start over, and make a NEW data frame/vector with values.
Please provide a reproducible subset of your data so others can help you more effectively. While I cannot teach you how to create a loop from scratch, here is some code that I think will help. This code simply calculates the moving 30-day average of precipitation using a simple for loop. You can use dplyr to filter these moving averages by year and create data frames that way. Note that I'm not counting the number of days without precipitation here, but you can easily modify the loop to do that if needed.
#sample one value per row (with replacement) instead of recycling 432 values
df<-data.frame(year = rep(1967:2002, each =12*30),
month = rep(rep(1:12, each = 30), 36),
day = rep(seq(1,30, by = 1), 432),
precipitation = sample(1:2000, 12*30*36, replace = TRUE))
df
#create a column that goes from 1 to however long your dataframe is
df$marker <- 1:nrow(df)
#'Now we create a simple loop to calculate the mean precipitation for
#'every 30 day window. You can modify this to count the number of days with
#'0 precipitation.
#'The new column movingprecip will tell you the mean precipitation for the
#'30 days ending at its position. So if you're on row 55, it will give
#'you the mean precipitation from row 26 to 55
df$movingprecip<-NA
for(i in 1:nrow(df)){
  end = i #the window ends at row i
  start = i - 29 #and starts 29 rows earlier, so it covers exactly 30 rows
  if(start < 1){
    #there are not yet 30 days of data up to this row, so the moving mean
    #stays NA. This happens for the first 29 rows of the dataset
    next
  }else{
    #Here I calculate the mean of the past 30 days of precip
    df$movingprecip[end] = mean(df$precipitation[start:end])
  }
}
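To count dry days instead of averaging, and to restart at every year, something along these lines might work. This is only a sketch: the data are simulated, the column names date, year and precip and the helper dry_days_30 are made up for the example, and it assumes that a window should never cross a year boundary (so each year only has windows that fit entirely inside it).

```r
# Simulated daily data for two years (hypothetical column names)
set.seed(42)
dates <- seq(as.Date("1961-01-01"), as.Date("1962-12-31"), by = "day")
n <- length(dates)
rain <- data.frame(date   = dates,
                   year   = as.integer(format(dates, "%Y")),
                   precip = ifelse(runif(n) < 0.6, 0, round(runif(n, 0.1, 30), 1)))

# Hypothetical helper: number of zero-precipitation days in every
# complete window of `width` consecutive days, using cumulative sums
dry_days_30 <- function(p, width = 30) {
  if (length(p) < width) return(integer(0))
  cs <- c(0, cumsum(p == 0))                 # cs[k + 1] = dry days among the first k days
  starts <- seq_len(length(p) - width + 1)   # first day of each window
  as.integer(cs[starts + width] - cs[starts])
}

# One vector of window counts per year; split() does the "start over" step
by_year <- lapply(split(rain$precip, rain$year), dry_days_30)

head(by_year[["1961"]])
```

by_year is a named list with one element per year, which avoids creating dozens of separate data frames by hand. Leap years give one more window than other years, so the vectors would need padding before they could be stacked into a matrix.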

R: Fill in missing values depending on hour and day

I use R and have a data table with 3 columns:
unixtime | average by hour| 15 seconds value
The data contains several days of a year and all hours of those days.
In 1 hour I have 1 value for "average by hour" which is at the top row of this hour.
Further, there are 240 values for "15 seconds value".
I created a for loop that would solve the problem, but it takes hours to run.
for (i in 2:nrow(merge_demand)){
if (is.na(merge_demand[i,2])) {
merge_demand[i,2] = merge_demand[i-1,2]
}
}
Is there a more efficient way to just fill those 239 missing values of "average by hour" with the one existing value depending on this hour on this day?
In total I have 1682761 rows.
I am kind of new to data tables so thanks for helping me out!
It's likely to be quicker to use an indexing approach. Here is an idea that you will need to incorporate into a loop
library(lubridate)  # provides ymd_hms()
# Generate sample data
my_data <- data.frame(unixtime = seq(from = ymd_hms('2000-01-01 00:00:15'),
by = '15 sec',
length.out = 240),
average_by_hour = c(5, rep(NA, 239)),
value_15_sec = c(rep(5/240, 240)))
#fill the first 240 values of average_by_hour with the first value
my_data$average_by_hour[1:240] <- my_data$average_by_hour[1]
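The same idea can be applied to the whole column at once, without any loop. The sketch below uses only base R; fill_down is a hypothetical helper name, and it assumes that every missing value should take the most recent non-missing value above it (which is what the question's loop does).

```r
# Hypothetical helper: carry the last non-NA value forward (vectorised)
fill_down <- function(x) {
  seen <- !is.na(x)
  idx  <- cumsum(seen)          # index of the latest non-NA value seen so far
  out  <- x
  # positions before the first non-NA value (idx == 0) are left as NA
  out[idx > 0] <- x[seen][idx[idx > 0]]
  out
}

# Two "hours" of 4 rows each, with the hourly average only in the first row
avg <- c(5, NA, NA, NA, 7, NA, NA, NA)
fill_down(avg)
# 5 5 5 5 7 7 7 7
```

Applied to the data in the question this would be something like merge_demand[[2]] <- fill_down(merge_demand[[2]]). Ready-made equivalents exist as well, for example zoo::na.locf or data.table::nafill with type = "locf".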

How to add individual timepoint standard error column to data frame with NaNs (SE of mean of week of timepoints for 24 hours)

I need to create a plot of the body temperatures of mice. I have data points collected every 15 minutes over the course of seven days. I also have the calculated mean temperature at each timepoint for the plot. The next step is calculating the standard error of each of these mean temperatures, taking into account all seven days' worth of temperature readings. This is an image of the extended data I am working from:
https://imgur.com/ukk0iOt
I also have a separate, condensed data frame that is the mean_temp from above averaged over seven days for every timepoint, so only one temperature reading each for 24 hours worth of timepoints. It is 96 rows and only contains columns for time and mean_temp24.
With the following code, I am only able to calculate a single standard error for all the timepoints (I know it's wrong but am having a heck of a time finding a solution). I am also unable to calculate standard error from the condensed 24-hour dataset since the full seven days' worth of temperatures are not present.
Adding column with mean temperatures (7 days) of three mice to data frame 'df'
df=cbind(df,"mean_temp"=rowMeans(df[,3:5],na.rm=TRUE))
Trying to calculate standard deviation for each timepoint, to start with
times = unique(df$time)
Function to achieve individual standard errors per row
for (current_time in times){
df$se=sd(df$mean_temp24, na.rm=T)/sqrt(3-1)
}
Ideally, I will end up with a data frame that is 96 lines (each a 15-minute interval timepoint) for 24 hours of temperature data, where the values are the means of the seven temperatures for each timepoint ("mean_temp" from the image of my data frame). I will also have an additional column for standard error, which takes into account the 7 temperature values used to calculate the mean temperature in the final, 24-hour dataset.
The actual output is a single, identical SE for every timepoint in the full dataset that is not condensed to 24 hours.
Use ddply from the plyr package. The function f is called for every unique combination of dt and time:
library(plyr)  # provides ddply() and .()
f = function(x) {
n3 = length(which(!is.na(x[,3])))
n4 = length(which(!is.na(x[,4])))
n5 = length(which(!is.na(x[,5])))
data.frame(
mean3 = mean(x[,3], na.rm=TRUE),
mean4 = mean(x[,4], na.rm=TRUE),
mean5 = mean(x[,5], na.rm=TRUE),
se3 = sd(x[,3], na.rm=TRUE)/sqrt(n3),
se4 = sd(x[,4], na.rm=TRUE)/sqrt(n4),
se5 = sd(x[,5], na.rm=TRUE)/sqrt(n5)
)
}
ddply(df, .(dt,time), f)
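If the goal is one row per 15-minute timepoint with the standard error taken across the seven days, grouping by time alone may be closer to what the question describes. Below is a base R sketch on simulated data; the data frame df_long, its columns day, time and mean_temp, and the helper se are assumptions, since the real data are only shown as an image.

```r
# Simulated stand-in for the extended data: 7 days x 96 timepoints
set.seed(7)
times <- sprintf("%02d:%02d", rep(0:23, each = 4), rep(c(0, 15, 30, 45), 24))
df_long <- data.frame(day       = rep(1:7, each = 96),
                      time      = rep(times, 7),
                      mean_temp = rnorm(7 * 96, mean = 36.5, sd = 0.4))

# Standard error of the mean, ignoring missing readings
se <- function(v) {
  v <- v[!is.na(v)]
  sd(v) / sqrt(length(v))
}

# One mean and one SE per timepoint, each computed over the 7 days
means <- tapply(df_long$mean_temp, df_long$time, mean, na.rm = TRUE)
ses   <- tapply(df_long$mean_temp, df_long$time, se)

summary24 <- data.frame(time        = names(means),
                        mean_temp24 = as.numeric(means),
                        se          = as.numeric(ses))
head(summary24)
```

summary24 has 96 rows, and each SE is based on the (up to) seven values that went into the corresponding mean.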

How to match dates in 2 data frames in R, then sum specific range of values up to that date?

I have two data frames: rainfall data collected daily and nitrate concentrations of water samples collected irregularly, approximately once a month. I would like to create a vector of values for each nitrate concentration that is the sum of the previous 5 days' rainfall. Basically, I need to match the nitrate date with the rain date, sum the previous 5 days' rainfall, then print the sum with the nitrate data.
I think I need to either make a function, a for loop, or use tapply to do this, but I don't know how. I'm not an expert at any of those, though I've used them in simple cases. I've searched for similar posts, but none get at this exactly. This one deals with summing by factor groups. This one deals with summing each possible pair of rows. This one deals with summing by aggregate.
Here are 2 example data frames:
# rainfall df
mm<- c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0)
date<- c(1:15)
rain <- data.frame(cbind(mm, date))
# b/c sums of rainfall depend on correct chronological order, make sure the data are in order by date.
rain <- rain[order(rain$date), ]
# nitrate df
nconc <- c(15, 12, 14, 20, 8.5) # nitrate concentration
ndate<- c(6,8,11,13,14)
nitrate <- data.frame(cbind(nconc, ndate))
I would like to have a way of finding the matching rainfall date for each nitrate measurement, such as:
match(nitrate$ndate[i], rain$date)
(Note: Will match work with as.Date dates?) And then sum the preceding 5 days' rainfall (not including the measurement date), such as:
sum(rain$mm[(j-5):(j-1)])
And prints the sum in a new column in nitrate
print(nitrate$mm_sum[i])
To make sure it's clear what result I'm looking for, here's how to do the calculation 'by hand'. The first nitrate concentration was collected on day 6, so the sum of rainfall on days 1-5 is 5mm.
Many thanks in advance.
You were more or less there!
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$ndate)) {
  day = nitrate$ndate[i]
  # rows (day-5) to (day-1) are the 5 days before the sample; (day-6) would sum 6 days
  nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
}
Step by step explanation:
Initialize empty result column:
nitrate$prev_five_rainfall = NA
For each line in the nitrate df: (i = 1,2,3,4,5)
for (i in 1:length(nitrate$ndate)) {
Grab the day we want the final result for:
day = nitrate$ndate[i]
Take the rainfall sum of the 5 preceding days and put it in the results column:
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
Close the for loop :)
}
Disclaimer: This answer is basic in that:
It will break if nitrate's ndate < 6
It will be incorrect if some dates are missing in the rain dataframe
It will be slow on larger data
As you get more experience with R, you might use data manipulation packages like dplyr or data.table for these types of manipulations.
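A variant that sidesteps the first two caveats is to select rainfall rows by comparing date values instead of indexing by row position. The sketch below uses the example data from the question; the helper name prev_rain and its n_days argument are made up for illustration.

```r
rain <- data.frame(date = 1:15,
                   mm   = c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0))
nitrate <- data.frame(ndate = c(6, 8, 11, 13, 14),
                      nconc = c(15, 12, 14, 20, 8.5))

# Hypothetical helper: total rain on the n_days days before `day`
# (the sampling day itself is excluded)
prev_rain <- function(day, rain, n_days = 5) {
  in_window <- rain$date >= day - n_days & rain$date <= day - 1
  sum(rain$mm[in_window])
}

nitrate$prev_five_rainfall <- sapply(nitrate$ndate, prev_rain, rain = rain)
nitrate$prev_five_rainfall
# 5 5 2 12 10
```

Because rows are picked by a logical condition, an early sample (ndate < 6) simply sums whatever days exist, and gaps or unsorted rows in rain do not shift the window.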
@nelsonauner's answer does all the heavy lifting. One thing to note, though: in my actual data the dates are not numeric as they are in the example above. They are listed as MM/DD/YYYY and converted with as.Date(nitrate$date, "%m/%d/%Y").
I found that the for loop above gave me all zeros for nitrate$prev_five_rainfall and I suspected it was a problem with the dates.
So I changed my dates in both data sets to numerical using the difference in number of days between a common start date and the recorded date, so that the for loop would look for a matching number of days in each data frame rather than a date. First, make a column of the start date using rep_len() and format it:
nitrate$startdate <- rep_len("01/01/1980", nrow(nitrate))
nitrate$startdate <- as.Date(nitrate$startdate, "%m/%d/%Y")
Then, calculate the difference using difftime():
nitrate$diffdays <- as.numeric(difftime(nitrate$date, nitrate$startdate, units="days"))
Do the same for the rain data frame. Finally, the for loop looks like this:
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$diffdays)) {
day = nitrate$diffdays[i]
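# select rain rows by their day number (rain$diffdays), not by row position
nitrate$prev_five_rainfall[i] = sum(rain$mm[rain$diffdays >= day-5 & rain$diffdays <= day-1]) # 5 days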
}
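As far as I can tell, the conversion to day numbers is not strictly necessary: Date objects can be compared and shifted by whole days directly. A small sketch with made-up dates in January 2015 (the dates and the column name date are assumptions; the rainfall values are the ones from the question):

```r
rain <- data.frame(date = as.Date("01/01/2015", "%m/%d/%Y") + 0:14,
                   mm   = c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0))
nitrate <- data.frame(date  = as.Date(c("01/06/2015", "01/08/2015", "01/11/2015",
                                        "01/13/2015", "01/14/2015"), "%m/%d/%Y"),
                      nconc = c(15, 12, 14, 20, 8.5))

# For each sample, sum the rain that fell on the 5 calendar days before it
nitrate$prev_five_rainfall <- vapply(seq_len(nrow(nitrate)), function(i) {
  d <- nitrate$date[i]
  sum(rain$mm[rain$date >= d - 5 & rain$date <= d - 1])
}, numeric(1))

nitrate$prev_five_rainfall
# 5 5 2 12 10
```

Both date columns need to be of class Date for the comparison to behave as intended; character strings in MM/DD/YYYY form would compare alphabetically.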

What does the ts function do in R

I have downloaded the historical prices between Jan-1-2010 and Dec-31-2014 for Twitter, Inc. (TWTR) -NYSE from YAHOO! FINANCE in a twitter.csv file.
I then loaded it into RStudio using:
x = read.csv("Z:/path/to/file/twitter.csv", header=T,stringsAsFactors=F)
Here is what table x looks like:
View(x)
Then I used ts function to get the time series of Adj.Close:
x.ts = ts(x$Adj.Close, frequency = 12, start=c(2010,1), end=c(2014,12))
x.ts
How were the previous results obtained? They are really different from the data in table x. Do they need any adjustments?
Your problem is the scale at which the data are read. With frequency = 12, start=c(2010,1), end=c(2014,12) you are telling the function that you have one number per month. If you have one number per day, as in your case, you should try:
x.ts = ts(x$Adj.Close, frequency = 365, start=c(2010,1), end=c(2014,365))
Firstly, frequency should be set to 365 if you deal with daily data, 12 if monthly etc.
Secondly, I think you need to arrange the data ascending chronologically before using the ts() function.
The function blindly follows exactly what you are telling it, e.g. the data from the chart starts with the first value 35.87 in 2014-12-31 but the start date in the code is 2010, January, meaning it will attribute that value to being associated with Jan-2010.
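library(dplyr)
x <- x %>%
dplyr::arrange(date)
# start must be c(year, position within the year), not a raw date; date is assumed to hold ISO dates such as 2014-12-31
first <- as.Date(min(x$date))
ts.x <- ts(x$Adj.Close, frequency = 365, start=c(as.numeric(format(first, "%Y")), as.numeric(format(first, "%j"))))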
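To see how ts() attaches time labels purely from start, end and frequency, without looking at any date column, a tiny example with made-up numbers may help. The values 1 to 24 below are placeholders, not stock prices.

```r
# 24 values declared as monthly data starting in January 2010
x_monthly <- ts(1:24, frequency = 12, start = c(2010, 1))
start(x_monthly)   # 2010 1
end(x_monthly)     # 2011 12

# If `end` implies fewer periods than there are values, ts() keeps only
# the first values and labels them with the periods it was told about
x_short <- ts(1:24, frequency = 12, start = c(2010, 1), end = c(2010, 6))
length(x_short)    # 6
as.numeric(x_short)
# 1 2 3 4 5 6
```

This is consistent with the explanation above: whatever value comes first in the vector is labelled with the start period, so the data have to be in chronological order and the frequency has to match how the values were actually sampled.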
