I have a data table with columns date, stock, daily return, start date, and end date.
I'd like to calculate the mean of daily return between start date and end date for each stock, where end date = date and start date = date - 1 year. The image shows a small part of my data table, which contains 5 different time brackets (2009-2010, 2010-2011...2014-2015).
pic1
Let's first create the dataset:
d1 <- data.frame(Date = seq.Date(as.Date("2009-04-07"), as.Date("2015-04-06"), by = "day"), stock = 60004)
d2 <- data.frame(Date = seq.Date(as.Date("2009-04-07"), as.Date("2015-04-06"), by = "day"), stock = 60005)
d3 <- data.frame(Date = seq.Date(as.Date("2009-04-07"), as.Date("2015-04-06"), by = "day"), stock = 60006)
d4 <- data.frame(Date = seq.Date(as.Date("2009-04-07"), as.Date("2015-04-06"), by = "day"), stock = 60007)
dat <- rbind(d1, d2, d3, d4)
dat$D <- rnorm(dim(dat)[1])
dat$stock <- as.factor(dat$stock)
library(zoo)  # provides rollmean()
dat$rollmean <- ave(dat$D, dat$stock, FUN = function(x) rollmean(x, k = 365, fill = 0, align = "right"))
For ave to work well, convert stock into a factor and use it as the grouping variable. In rollmean, k is the window size (365 rows, which approximates one year when there is one row per calendar day for each stock); fill is the value used where the window is incomplete, which would otherwise be NA; and align tells the function which side of the window the result belongs to (left is the same as the top and right is the same as the bottom of the dataset).
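To make the roles of k, fill, and align concrete, here is a minimal sketch on a toy vector. The vector x, the grouping factor g, and the window size k = 2 are assumptions chosen only for illustration; fill = NA is used instead of 0 so that incomplete windows are not mistaken for real means.

```r
library(zoo)

# Toy data (hypothetical): two groups of three observations each
x <- c(1, 2, 3, 4, 5, 6)
g <- factor(c("a", "a", "a", "b", "b", "b"))

# Rolling mean of the current and previous row, computed separately per group.
# align = "right" attaches each mean to the last row of its window, so the
# first row of every group has an incomplete window and receives `fill`.
res <- ave(x, g, FUN = function(v) rollmean(v, k = 2, fill = NA, align = "right"))
res
# NA 1.5 2.5 NA 4.5 5.5
```

With fill = 0, as in the answer above, the first k - 1 rows of each stock would hold 0 rather than NA.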
Goal :
Sort the columns of a pivot table in month order (1 to 12) so that the figures for the same month can be compared across years.
Desire data shape :
# 1 - sort as every Jan ~ Dec inside every year
setcolorder(d_c,
c("2016-1","2017-1","2018-1".....))
# 2 - finally add column to calculate the differences
d_c[,"dif":=format(`lastyear_samemonth_column`-`neweryear_samemonthcolumn`,big.mark = ",")]
Data :
library(data.table)
library(tibble)
set.seed(566684)
n = 100
d <-as.data.table(tibble(month = sample(1:12, n, replace = TRUE),
year = sample(2016:2018,n, replace = TRUE),
`year-month` = paste(year, month, sep = '-'),
value = rnorm(n),
c1 = sample(LETTERS,n,replace = TRUE)))
d_c <- dcast(d,c1 ~ `year-month`, value.var = "value" ,fun.aggregate = sum)
Problem :
The columns come out in ascending "year-month" order, but I do not know how to sort them by month or how to assign the column names dynamically to get the comparison result.
To order the data you can use -
library(data.table)
cols <- c(1, order(as.numeric(sub('.*-', '', names(d_c)[-1]))) + 1)
d_c[, ..cols]
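To see what the ordering step does, here is a small sketch on a plain vector of column names. The names in nms are hypothetical stand-ins for names(d_c), and sorting by year within each month is made explicit here rather than relying on the original column order.

```r
# Hypothetical column names, in the character-sorted order dcast() produces;
# "c1" is the id column and stays first.
nms <- c("c1", "2016-1", "2016-10", "2016-2", "2017-1", "2017-10", "2017-2")

month <- as.numeric(sub(".*-", "", nms[-1]))  # text after the last "-"
year  <- as.numeric(sub("-.*", "", nms[-1]))  # text before the first "-"

# Order by month first, then by year within each month; + 1 skips the id column
cols <- c(1, order(month, year) + 1)
nms[cols]
# "c1" "2016-1" "2017-1" "2016-2" "2017-2" "2016-10" "2017-10"
```

The same cols vector can then be used as d_c[, ..cols], as in the answer above.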
I have plots which were visited at irregular intervals to record the biomass of several species. I would like to record the change in each species' biomass, and the duration of the interval, at the beginning of the interval.
Sample data are
plot <- c(rep(1,4), rep(2,3))
sp <- c(rep(c('a','b'), 2), rep('a',3))
year <- c(1,1,3,3,2,5,13)
biom <- c(5,2,8,4,3,9,18)
DT <- data.table(plot=plot, sp=sp,year=year,biom=biom)
The desired output would look like
elapsed = c(2,2,NA,NA,3,8,NA)
dbiom = c(3,2,NA,NA,6,9,NA)
(e.g., the change in biomass of species a in plot 1, from the first survey in year 1 to the second survey in year 3, was +3, and the elapsed time was 2 years)
I have been using the 'shift' operator in data.table but I cannot get it to work
setkey(DT, plot, sp , year)
cols = c("year","biom")
anscols = paste("lead", cols, sep="_")
b4 <- b3[ , (anscols) := shift(.SD, 1, NA, type = "lead"),
.SDcols=cols, by = c(plot, sp)]
I keep getting 'Error in eval(bysub, x, parent.frame()) : object '1' not found'
Would something like this do the trick?
library(data.table)
plot <- c(rep(1,4), rep(2,3))
sp <- c(rep(c('a','b'), 2), rep('a',3))
year <- c(1,1,3,3,2,5,13)
biom <- c(5,2,8,4,3,9,18)
DT <- data.table(plot=plot, sp=sp,year=year,biom=biom)
elapsed = c(2,2,NA,NA,3,8,NA)
dbiom = c(3,2,NA,NA,6,9,NA)
DT[
order(plot, year),
.(year, biom,
elapsed = shift(year, type = "lead") - year,
dbiom = shift(biom, type = "lead") - biom
),
by = c("sp", "plot")
]
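The error in the question most likely comes from by = c(plot, sp), which passes the column values instead of the column names. Below is a minimal sketch of the question's own shift approach with the names quoted and type = "lead", so that each change is recorded at the beginning of its interval. It assumes the rows are already sorted by year within each plot and species, as they are in the sample data.

```r
library(data.table)

DT <- data.table(
  plot = c(1, 1, 1, 1, 2, 2, 2),
  sp   = c("a", "b", "a", "b", "a", "a", "a"),
  year = c(1, 1, 3, 3, 2, 5, 13),
  biom = c(5, 2, 8, 4, 3, 9, 18)
)

cols    <- c("year", "biom")
anscols <- paste("lead", cols, sep = "_")

# Grouping columns are given as a character vector of names.
# type = "lead" pulls the values of the next survey in the same group.
DT[, (anscols) := shift(.SD, 1, NA, type = "lead"),
   .SDcols = cols, by = c("plot", "sp")]

# Differences between the next survey and the current one
DT[, `:=`(elapsed = lead_year - year, dbiom = lead_biom - biom)]
DT
```

Because := adds columns by reference, the rows stay in their original order, so elapsed and dbiom line up with the desired vectors from the question.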
I'd like to be able to join a dataframe...
concerts that has:
a column of keys (venue)
a column of date-times (start_time)
The start_times represent the time at which the concert began at the venue.
...with a number of other dataframes that are effectively time series. For example, the dataframe...
temperatures has:
a column of keys (also venue)
a column of hourly date-times (datetime) that span the entire time frame (many days before the concert times and many days after).
a temperature column
What I want to have in the joined result is: the temperature at the venue at the start hour of the concert, but also at the end of the first, second, third, and fourth hours of the concert. Essentially a 4-hour 'window' of temperatures.
The only approach I can think of is to create lagged columns in temperatures (one for each of the 1st-4th hours of the concert), and then to join with concerts on the venue and start hours. But this is very slow when applied to my actual dataset, which has many more columns than just temperature.
Here is the example data I've cooked up.
library(lubridate)
library(tidyverse)
concerts <- tibble(venue = c("A", "A", "B", "B"),
start_time = ymd_hm(c("2019-08-09 08:05",
"2019-08-10 16:07",
"2019-08-09 09:30",
"2019-08-10 17:15"))
)
temperatures <- tibble(venue = c(rep("A", 50),
rep("B", 50)),
datetime = rep(seq(ymd_hm("2019-08-09 00:00"), by = "hour", length.out = 50), 2),
temperature = c(rnorm(50, 60, 5),
rnorm(50, 95, 5))
)
Here is my successful but expensive attempt, with the desired results in temperature_over_course_of_concerts.
temperatures_lagged <- temperatures %>%
  group_by(venue) %>%  # shift within each venue only
  mutate(temperature_1hr_in = lead(temperature, 1),  # lead() looks forward in time
         temperature_2hr_in = lead(temperature, 2),
         temperature_3hr_in = lead(temperature, 3),
         temperature_4hr_in = lead(temperature, 4)) %>%
  ungroup() %>%
  rename(temperature_start = temperature)
temperature_over_course_of_concerts <- concerts %>%
mutate(start_hour = floor_date(start_time, unit = "hour")) %>%
left_join(temperatures_lagged, by = c("venue" = "venue", "start_hour" = "datetime"))
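One possible alternative, sketched here under assumptions rather than offered as a definitive solution, is to expand concerts into one row per hour of the window and join that to temperatures, so the large time series table never needs extra shifted columns. The column names hours_in and temperature_hr_0 to temperature_hr_4 are made up for this illustration, and the temperatures are deterministic toy values so the result is easy to check.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy data (hypothetical values): hour i of the day has temperature i at
# venue A and 100 + i at venue B
concerts <- tibble(venue = c("A", "B"),
                   start_time = ymd_hm(c("2019-08-09 08:05", "2019-08-09 09:30")))
temperatures <- tibble(
  venue = rep(c("A", "B"), each = 24),
  datetime = rep(seq(ymd_hm("2019-08-09 00:00"), by = "hour", length.out = 24), 2),
  temperature = c(1:24, 101:124)
)

windowed <- concerts %>%
  mutate(start_hour = floor_date(start_time, unit = "hour")) %>%
  crossing(hours_in = 0:4) %>%                          # one row per hour of the window
  mutate(datetime = start_hour + hours(hours_in)) %>%   # timestamp to look up
  left_join(temperatures, by = c("venue", "datetime")) %>%
  select(venue, start_time, hours_in, temperature) %>%
  pivot_wider(names_from = hours_in, values_from = temperature,
              names_prefix = "temperature_hr_")
windowed
```

Other hourly series with the same venue and datetime keys could be joined in the same step before pivoting. Whether this is faster than the shifted-column approach on the real data would need to be measured.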
I've tried methods suggested by others for merging time series datasets. However, the time series column is missing from the result. Please see the captured screen.
Here is the example of my datasets.
df1 = data.frame(Time = round(seq(1, 200, length.out= 50)), Var1 = runif(50,1, 10))
df2 = data.frame(Time = round(seq(1, 200, length.out= 80)), Var2 = runif(80,1, 10))
df3 = data.frame(Time = round(seq(1, 200, length.out= 100)), Var3 = runif(100,1, 10))
Here is what I've tried.
a = read.zoo(df1,drop = FALSE)
b = read.zoo(df2,drop = FALSE)
c = read.zoo(df3,drop = FALSE)
abc = merge(a, b, c)
How can I add a first column listing the Time? Any comments on this task that I can learn from would be welcome.
Thanks.
This converts all three data frames to zoo and merges them into a combined zoo object.
z <- do.call("merge", lapply(list(df1, df2, df3), read.zoo, drop = FALSE))
Note that in zoo objects the time is stored in the index attribute. It is not a column. The statement shown above already includes the time as derived from the first columns of each of the data frames.
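If a plain data frame with Time as its first column is needed, the index can be pulled out explicitly. This is a minimal sketch on two small made-up data frames; index() and coredata() are the zoo accessors for the times and the data matrix.

```r
library(zoo)

# Small hypothetical inputs with partly overlapping times
df1 <- data.frame(Time = c(1, 5, 9), Var1 = c(0.1, 0.2, 0.3))
df2 <- data.frame(Time = c(1, 3, 9), Var2 = c(10, 20, 30))

z <- do.call("merge", lapply(list(df1, df2), read.zoo, drop = FALSE))

# Put the index back as an ordinary first column
out <- data.frame(Time = index(z), coredata(z))
out
```

Times that are missing from one of the inputs show up as NA in that input's column.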
I have a data.frame of a time series of data, I would like to thin the data by only keeping the entries that are measured on every even day number. For example:
set.seed(1)
RandData <- rnorm(100,sd=20)
Locations <- rep(c('England','Wales'),each=50)
today <- Sys.Date()
dseq <- (seq(today, by = "1 days", length = 100))
Date <- as.POSIXct(dseq, format = "%Y-%m-%d")
Final <- data.frame(Loc = Locations,
Doy = as.numeric(format(Date,format = "%j")),
Temp = RandData)
So, how would I reduce this data frame to contain only the entries measured on even-numbered days, i.e. Loc, Doy, and Temp on day 172, day 174, and so on?
What about:
Final[Final$Doy %% 2 == 0, ]