Select even-numbered days from a data frame in R

I have a data.frame containing time series data, and I would like to thin it by keeping only the entries measured on even-numbered days. For example:
set.seed(1)
RandData <- rnorm(100, sd = 20)
Locations <- rep(c('England', 'Wales'), each = 50)
today <- Sys.Date()
dseq <- seq(today, by = "1 days", length = 100)
Date <- as.POSIXct(dseq, format = "%Y-%m-%d")
Final <- data.frame(Loc  = Locations,
                    Doy  = as.numeric(format(Date, format = "%j")),
                    Temp = RandData)
So, how would I reduce this data frame to contain only the entries measured on even-numbered days, i.e. Loc, Doy, and Temp on day 172, day 174, and so on?

What about:
Final[Final$Doy %% 2 == 0, ]
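Here %% is R's modulo operator, so the condition keeps the rows whose day of year divides evenly by 2. A minimal sketch of both subsets, in case the odd-numbered days are ever needed as well:
evens <- Final[Final$Doy %% 2 == 0, ]  # even-numbered days of the year
odds  <- Final[Final$Doy %% 2 == 1, ]  # the complement: odd-numbered days
head(evens)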

Related

How to join a data.frame of keys and datetimes with a window of values from a time series?

I'd like to be able to join a dataframe, concerts, that has:
a column of keys (venue)
a column of date-times (start_time)
The start_times represent the time at which the concert began at the venue.
...with a number of other dataframes that are effectively time series. For example, the dataframe temperatures has:
a column of keys (also venue)
a column of hourly date-times (datetime) that span the entire time frame (many days before the concert times and many days after).
a temperature column
What I want to have in the joined result is: the temperature at the venue at the start hour of the concert, but also at the end of the first, second, third, and fourth hours of the concert. Essentially a 4-hour 'window' of temperatures.
The only approach I can think of is to create lagged columns in temperatures (one for each of the 1st-4th hours of the concert), and then to join with concerts on the venue and start hours. But this is very slow when applied to my actual dataset, which has many more columns than just temperature.
Here is the example data I've cooked up.
library(lubridate)
library(tidyverse)
concerts <- tibble(venue = c("A", "A", "B", "B"),
                   start_time = ymd_hm(c("2019-08-09 08:05",
                                         "2019-08-10 16:07",
                                         "2019-08-09 09:30",
                                         "2019-08-10 17:15")))
temperatures <- tibble(venue = c(rep("A", 50), rep("B", 50)),
                       datetime = rep(seq(ymd_hm("2019-08-09 00:00"),
                                          by = "hour", length.out = 50), 2),
                       temperature = c(rnorm(50, 60, 5),
                                       rnorm(50, 95, 5)))
Here is my successful but expensive attempt, with the desired results in temperature_over_course_of_concerts.
temperatures_lagged <- temperatures %>%
  mutate(temperature_1hr_in = lag(temperature, 1),
         temperature_2hr_in = lag(temperature, 2),
         temperature_3hr_in = lag(temperature, 3),
         temperature_4hr_in = lag(temperature, 4)) %>%
  rename(temperature_start = temperature)
temperature_over_course_of_concerts <- concerts %>%
  mutate(start_hour = floor_date(start_time, unit = "hour")) %>%
  left_join(temperatures_lagged, by = c("venue" = "venue", "start_hour" = "datetime"))
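One alternative worth trying, since pre-lagging every column is what makes the join expensive: expand each concert into the five hours of interest, join on venue and datetime once, then pivot the hour offsets back into columns. This is only a sketch, assuming the window runs forward from the start hour; the hours_in and temperature_hr_* names are made up for illustration:
library(dplyr)
library(tidyr)
library(lubridate)

concert_hours <- bind_rows(lapply(0:4, function(h) {
  concerts %>%
    mutate(start_hour = floor_date(start_time, unit = "hour"),
           hours_in   = h,                       # offset into the concert, in hours
           datetime   = start_hour + hours(h))   # the clock hour to look up
}))

temperature_windows <- concert_hours %>%
  left_join(temperatures, by = c("venue", "datetime")) %>%
  select(venue, start_time, hours_in, temperature) %>%
  pivot_wider(names_from   = hours_in,
              values_from  = temperature,
              names_prefix = "temperature_hr_")
Because the join happens once on the long format, any number of additional measurement columns in temperatures comes along for free before the pivot.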

Calculating mean between two dates

I have a data table with columns date, stock, daily return, start date, and end date.
I'd like to calculate the mean daily return for each stock between its start date and end date, where end date = date and start date = date - 1 year. My data table spans five yearly brackets (2009-2010, 2010-2011, ..., 2014-2015).
Let's first create the dataset:
d1 <- data.frame(Date = seq.Date(as.Date("2009-04-07"), as.Date("2015-04-06"), by = "day"), stock = 60004)
d2 <- data.frame(Date = seq.Date(as.Date("2009-04-07"), as.Date("2015-04-06"), by = "day"), stock = 60005)
d3 <- data.frame(Date = seq.Date(as.Date("2009-04-07"), as.Date("2015-04-06"), by = "day"), stock = 60006)
d4 <- data.frame(Date = seq.Date(as.Date("2009-04-07"), as.Date("2015-04-06"), by = "day"), stock = 60007)
dat <- rbind(d1, d2, d3, d4)
dat$D <- rnorm(dim(dat)[1])
dat$stock <- as.factor(dat$stock)
library(zoo)
dat$rollmean <- ave(dat$D, dat$stock,
                    FUN = function(x) rollmean(x, k = 365, fill = 0, align = "right"))
For ave() to work, convert stock into a factor and pass rollmean() as the FUN argument: k is the window size (365 for a one-year rolling mean of daily data), fill is the value used where a full window is not yet available (instead of NA), and align tells rollmean() which end of the window each result is anchored to (right aligns each mean with the most recent date, i.e. the bottom of the data; left with the earliest, i.e. the top).
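A quick sanity check, assuming the rows for each stock are in date order as constructed above: the first complete window ends at row 365, and its rolling value should equal a plain mean over those rows.
x <- dat[dat$stock == "60004", ]
all.equal(x$rollmean[365], mean(x$D[1:365]))  # should be TRUE; earlier rows hold the fill value 0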

Not able to append the return value of a function when reading CSV files in a for loop

The issue I am facing is that I get an individual result for each .csv I read, and nothing appends the results into a single data frame or list. I am very new to R. Please help me out.
I am getting output as
amazon.csv 10.07
facebook.csv 54.67
Whereas I am expecting all the values in one data frame with a company column and the CAGR values.
library(zoo)  # for na.locf()

preprocess <- function(x){
  ## flipping the data to suit time series analysis
  my.data <- x[nrow(x):1, ]
  ## setting up date as Date format
  my.data$date <- as.Date(my.data$date)
  ## creating a new data frame with the rows sorted by date
  sorted.data <- my.data[order(my.data$date), ]
  # removing the last row as it contains the stock price at the moment I downloaded the data
  # sorted.data <- sorted.data[-nrow(sorted.data), ]
  # calculating the length of the data frame
  data.length <- length(sorted.data$date)
  ## extracting the first date
  time.min <- sorted.data$date[1]
  ## extracting the last date
  time.max <- sorted.data$date[data.length]
  # creating a new data frame with all the dates in sequence
  all.dates <- seq(time.min, time.max, by = "day")
  all.dates.frame <- data.frame(date = all.dates)
  # merging the all-dates data frame with the sorted data frame; missing days get NA values
  merged.data <- merge(all.dates.frame, sorted.data, all = TRUE)
  ## replacing all NA values with the values from the previous day (last observation carried forward)
  final.data <- transform(merged.data,
                          close  = na.locf(close),
                          open   = na.locf(open),
                          volume = na.locf(volume),
                          high   = na.locf(high),
                          low    = na.locf(low))
  # write.csv(final.data, file = "C:/Users/rites/Downloads/stock prices", row.names = FALSE)
  # return(final.data)  # uncomment to check the cleaned data

  ################################################################
  ###### calculation of CAGR (Compound Annual Growth Rate) ######
  ####  {((latest price / oldest price)^(1/#ofyears)) - 1} * 100
  ################################################################
  ## extracting the closing price on the oldest date
  old_closing_price <- final.data$close[1]
  ## extracting the closing price on the latest date
  new_closing_price <- final.data$close[length(final.data$close)]
  ## extracting the starting year
  start_date <- final.data$date[1]
  start_year <- as.numeric(format(start_date, "%Y"))
  ## extracting the latest year
  end_date <- final.data$date[length(final.data$date)]
  end_year <- as.numeric(format(end_date, "%Y"))
  CAGR_1 <- new_closing_price / old_closing_price
  root <- 1 / (end_year - start_year)
  CAGR <- ((CAGR_1^root) - 1) * 100
  return(CAGR)
}
temp <- list.files(pattern = "*.csv")
for (i in 1:length(temp))
  assign(temp[i], preprocess(read.csv(temp[i])))
You need to create an empty data frame and append to it inside the loop. At the moment you are using assign(), which creates separate variables rather than rows in a data frame. Try something like:
df <- data.frame()
for (i in 1:length(temp)) {
  preproc <- preprocess(read.csv(temp[i]))
  df <- rbind(df, data.frame(company = temp[i],
                             value   = preproc))
}
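If you prefer to avoid growing a data frame inside a loop, the same result can be assembled in one step; a minimal sketch, assuming preprocess() returns a single CAGR value per file:
results <- do.call(rbind, lapply(temp, function(f) {
  data.frame(company = f, value = preprocess(read.csv(f)))  # one row per CSV file
}))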

R Subset rows by range of Time between several days

I have a huge dataset that I have to subset by a range of time and write the subsets into new data frames. My problem is subsetting the dataset between 12 PM on one day and 12 PM on the next day.
Here is a small dummy example that subsets by day.
dfrm <- data.frame(a = rnorm(240),
                   dtm = as.POSIXct("2007-03-27 05:00", tz = "GMT") + 3600 * (1:240))
dfrm
## Create list of dates in dfrm
date.start <- format(min(dfrm$dtm), "%Y-%m-%d")
date.end   <- format(max(dfrm$dtm), "%Y-%m-%d")
datum      <- seq(as.Date(date.start), as.Date(date.end), by = "days")
## Get Date and Time from dfrm
dfrm$day   <- as.POSIXlt(as.character(dfrm$dtm), format = "%Y-%m-%d")
dfrm$clock <- as.POSIXlt(as.character(dfrm$dtm))
dfrm$clock <- format(dfrm$clock, format = "%H:%M:%S")
## Write dfrm day-wise
j <- 1
while (j <= length(datum)) {
  name <- paste("day", datum[j], sep = "")
  assign(name, dfrm[which(dfrm$day == format(datum[j], "%Y-%m-%d")), ])
  j <- j + 1
}
Thank you for your help.
You can do
dfrm <- data.frame(a = rnorm(240), dtm = as.POSIXct("2007-03-27 05:00") + 3600 * (1:240))
lst <- split(dfrm,
             cut(dfrm$dtm,
                 breaks = seq(as.POSIXct(paste0(as.Date(min(dfrm$dtm)) - 1, " 12:00:00")),
                              as.POSIXct(paste0(as.Date(max(dfrm$dtm)) + 1, " 12:00:00")),
                              by = "1 day")))
Now, take e.g. a subset of a few days:
lst2 <- lst[as.character(seq(as.POSIXct("2007-04-04 12:00:00"), as.POSIXct("2007-04-06 12:00:00"), "1 day"))]
And create a separate data frame for each day:
list2env(lst2, envir = .GlobalEnv)
head(`2007-04-04 12:00:00`)
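If you would rather not write each day into the global environment, the list itself can be used directly; a small sketch, reusing the interval labels produced by cut() above:
names(lst)[1:3]                          # each element covers one 12 PM-to-12 PM window
one_day <- lst[["2007-04-04 12:00:00"]]  # pick a single window by its label
head(one_day)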

Creating a sequence of columns in a data frame based on an index for loop or using plyr in r

I wish to create 24 hourly data frames, one per hour of the day, in which each data.frame contains hourly demand for a product as one column and the next eight columns contain hourly temperatures. For example, the data.frame for 8 AM will contain a column for demand at 8 AM, followed by eight columns of temperature running from the current hour back through the seven previous hours. The additional complication is that for early hours such as 4 AM, I have to reach back into yesterday's temperatures. I am hitting my head against the wall trying to figure out how to do this with apply, plyr, or a vectorized function.
demand8AM Temp8AM Temp7AM Temp6AM...Temp1AM
Demand4AM Temp4AM Temp3AM Temp2AM Temp1AM Temp12AM Temp11pm(Lag) Temp10pm(Lag)
In my code, hours are numbers; 1 is 12 AM, and so on.
Here is some simple code that creates the kind of dataset I am dealing with.
# Creating some fake data
require(plyr)
set.seed(31)
foo <- function(myHour, myDate){
  rlnorm(1, meanlog = 0, sdlog = 1) * myHour + 150 * myDate
}
Hour  <- 1:24
Day   <- 1:90
dates <- seq(as.Date("2012-01-01"), as.Date("2012-03-30"), by = "day")
myData <- expand.grid(Day, Hour)
names(myData) <- c("Date", "Hour")
myData$Temperature <- apply(myData, 1, function(x) foo(x[2], x[1]))
myData$Date   <- dates
myData$Demand <- rnorm(1, mean = 0, sd = 1) + 0.75 * myData$Temperature
## ok, done with the fake data generation
It looks as though you could benefit from using a time series. Here is my interpretation of what you want rather than exactly what you asked for; I recommend you read over the xts and zoo packages.
library(xts)

# create a dummy time vector
time_index <- seq(from = as.POSIXct("2012-05-15 07:00"),
                  to   = as.POSIXct("2012-05-17 18:00"),
                  by   = "hour")
# create dummy demand and temp.C
info <- data.frame(demand = sample(1:length(time_index), replace = TRUE),
                   temp.C = sample(1:10))
# turn demand + temp.C into a time series
eventdata <- xts(info, order.by = time_index)
# bind the current temperature with its previous eight hourly values
x2 <- eventdata$temp.C
for (i in 1:8) {
  x2 <- cbind(x2, lag(eventdata$temp.C, i))
}
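A small follow-up sketch (the column names here are made up): give the lagged columns readable names and merge the demand series back in, so each row pairs an hour's demand with the current and previous temperatures.
colnames(x2) <- c("Temp_now", paste0("Temp_lag", 1:8))
result <- merge(eventdata$demand, x2)  # merge.xts aligns on the shared time index
head(result)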
