I have a huge dataset that I need to subset by time range, writing each subset into a new data frame. My problem is how to take the subsets from 12PM on one day to 12PM on the next day.
Small dummy subsetting by day.
dfrm <- data.frame(a=rnorm(240),dtm=as.POSIXct("2007-03-27 05:00", tz="GMT")+3600*(1:240))
dfrm
## Create list of dates in dfrm
date.start<-format(min(dfrm$dtm),"%Y-%m-%d")
date.end<-format(max(dfrm$dtm),"%Y-%m-%d")
datum<-seq(as.Date(date.start),as.Date(date.end),by="days")
## Get Date and Time from dfrm
dfrm$day<-as.POSIXlt(as.character(dfrm$dtm),format="%Y-%m-%d")
dfrm$clock<-as.POSIXlt(as.character(dfrm$dtm))
dfrm$clock<-format(dfrm$clock,format="%H:%M:%S")
## write dfrm daywise
j<-1
while (j<=length(datum))
{
name <- paste("day", datum[j], sep = "")
assign(name,dfrm[which(dfrm$day==format(datum[j],"%Y-%m-%d")),])
j<-j+1
}
Thank you for your help.
You can do
dfrm <- data.frame(a=rnorm(240),dtm=as.POSIXct("2007-03-27 05:00")+3600*(1:240));
lst <- split(dfrm, cut(dfrm$dtm, breaks = seq(as.POSIXct(paste0(as.Date(min(dfrm$dtm))-1, " 12:00:00")), as.POSIXct(paste0(as.Date(max(dfrm$dtm))+1, " 12:00:00")), by = "1 day")))
Now, take e.g. a subset of a few days:
lst2 <- lst[as.character(seq(as.POSIXct("2007-04-04 12:00:00"), as.POSIXct("2007-04-06 12:00:00"), "1 day"))]
And create a separate data frame for each day:
list2env(lst2, envir = .GlobalEnv)
head(`2007-04-04 12:00:00`)
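If it helps, here is a minimal sketch of the same noon-to-noon split with the time zone pinned to UTC, so the breaks do not depend on the machine's local settings. It swaps cut() for findInterval(), and the names first_noon, last_noon and windows are my own, not from the question.

```r
# Hourly dummy data, all in UTC
dfrm <- data.frame(a = rnorm(240),
                   dtm = as.POSIXct("2007-03-27 05:00", tz = "UTC") + 3600 * (1:240))

# Breaks at 12:00, from the day before the first row to the day after the last row
first_noon <- as.POSIXct(paste(as.Date(min(dfrm$dtm)) - 1, "12:00:00"), tz = "UTC")
last_noon  <- as.POSIXct(paste(as.Date(max(dfrm$dtm)) + 1, "12:00:00"), tz = "UTC")
breaks <- seq(first_noon, last_noon, by = "1 day")

# For each row, index of the window [break_i, break_i+1) it falls into
window_id <- findInterval(as.numeric(dfrm$dtm), as.numeric(breaks))

# One data frame per window, named by the window's start time
windows <- split(dfrm, format(breaks[window_id], "%Y-%m-%d %H:%M", tz = "UTC"))
```

With this layout, windows[["2007-04-04 12:00"]] holds the rows from noon on 4 April up to, but not including, noon on 5 April. Keeping the pieces in a list is usually easier to work with than pushing them into the global environment.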
I am trying to create a loop in R that reads daily values of a netcdf file I have imported and converts them into annual sums, then creates a raster for each year. I have converted the netcdf into an array - this is named Biased_corrected.array in my code below. I am not sure how to include the variable 'year' in my file names as it changes with each iteration of the loop. I have tried using paste but this seems to be where it fails. Any suggestions?
# read in file specifying which days correspond to years
YearsDays <- read.csv("Data\\Years.csv") # a df with 49 obs. of 3 variables (year, start day, and end day)
YearsDays[1,2:3] #returns 1 and 366 (the days for year 1972)
YearsDays[2,2:3] #returns 367 and 731 (the days for year 1973)
YearsDays[1,1] #returns 1972
YearsDays[2,1] #returns 1973
counter <- 1
startyear <- YearsDays[1,1]
year <- startyear
while(year < 2021){
#set variables to loop through
startday <- YearsDays[counter,2]
endday <- YearsDays[counter,3]
BC_rain.slice <- Biased_corrected.array[,,startday:endday]
paste(year, "_Annual_rain") <- apply(BC_rain.slice, c(1,2), sum)
#save data in a raster
paste(year, "_rain_r") <- raster(t(paste(year, "_Annual_Rain"), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84)
# move on to next year
counter <- counter + 1
year <- 1971 + counter
}
EDIT: The working code for anyone interested:
YearsDays <- read.csv("Data\\Years.csv") # a df with 49 obs. of 3 variables (year, start day, and end day)
for (idx in seq(nrow(YearsDays))){
#set variables to loop through
year <- YearsDays[idx,1]
startday <- YearsDays[idx,2]
endday <- YearsDays[idx,3]
BC_rain.slice <- Biased_corrected.array[,,startday:endday]
# compute the annual sum once and reuse it
annual_rain <- apply(BC_rain.slice, c(1,2), sum)
# paste0() avoids the space that paste() would put into the variable name
assign(paste0(year, "_Annual_rain"), annual_rain)
#save data in a raster
assign(paste0(year, "_rain_r"),raster(t(annual_rain), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84))
}
You can't use paste on the left-hand side of an assignment to create a variable name. You can wrap it in assign or eval, but it is usually easier to store your results in a list instead. Below is an example of what I believe you're trying to achieve. I have also replaced your while loop and counter with a for loop iterating over the rows of YearsDays:
YearsDays <- read.csv("Data\\Years.csv") # a df with 49 obs. of 3 variables (year, start day, and end day)
# a data frame column cannot hold raster objects, so collect them in a list named by year
output <- vector("list", nrow(YearsDays))
names(output) <- YearsDays[,1]
for (idx in seq(nrow(YearsDays))){
#set variables to loop through
startday <- YearsDays[idx,2]
endday <- YearsDays[idx,3]
BC_rain.slice <- Biased_corrected.array[,,startday:endday]
annual_rain <- apply(BC_rain.slice, c(1,2), sum)
#save data in a raster; note that the extent arguments belong to raster(), not t()
output[[idx]] <- raster(t(annual_rain), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84)
}
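To make the list idea concrete without the netCDF file or the raster package, here is a small sketch on made-up data. The array daily and the table years_days are stand-ins I invented for Biased_corrected.array and YearsDays, so treat the numbers as placeholders.

```r
# Hypothetical stand-in data: a 2 x 3 grid with 10 "days", every cell equal to 1
daily <- array(1, dim = c(2, 3, 10))

# Hypothetical lookup table: which day indices belong to which year
years_days <- data.frame(year = c(1972, 1973), start = c(1, 5), end = c(4, 10))

# One matrix of annual sums per row of the lookup table
annual <- lapply(seq_len(nrow(years_days)), function(i) {
  days <- years_days$start[i]:years_days$end[i]
  apply(daily[, , days, drop = FALSE], c(1, 2), sum)
})

# Name the list elements by year so they can be looked up directly
names(annual) <- years_days$year
```

After this, annual[["1972"]] is the 2 x 3 matrix of sums for 1972, and the whole set can be passed on with lapply() instead of hunting for dozens of separately named variables.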
How about replacing this part
paste(year, "_Annual_rain") <- apply(BC_rain.slice, c(1,2), sum)
#save data in a raster
paste(year, "_rain_r") <- raster(t(paste(year, "_Annual_Rain"), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84)
with
# names that start with a digit are not syntactic, so they must be wrapped in backticks
txt <- paste0("`", year, "_Annual_rain` <- apply(BC_rain.slice, c(1,2), sum)")
eval(parse(text = txt))
# save data in a raster (the name must match the one created above, including case)
txt <- paste0("`", year, "_rain_r` <- raster(t(`", year, "_Annual_rain`), xmn=min(x), xmx=max(x), ymn=min(y), ymx=max(y), crs=WGS84)")
eval(parse(text = txt))
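A side note on names such as 1972_Annual_rain: because they start with a digit they are not syntactic, so they have to be quoted with backticks or looked up with get(). A tiny sketch, with a made-up matrix standing in for the real annual sums:

```r
year <- 1972
nm <- paste0(year, "_Annual_rain")     # builds the string "1972_Annual_rain"

assign(nm, matrix(1:4, nrow = 2))      # create a variable with that name
by_name <- get(nm)                     # look it up again from the string
by_backtick <- `1972_Annual_rain`      # or refer to it directly with backticks
```

Both lookups return the same object. Having to quote every use is one reason a named list tends to be more comfortable than generated variable names.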
I've got the following time frame:
A <- c('2016-01-01', '2019-01-05')
B <- c('2017-05-05','2019-06-05')
X_Period <- interval("2015-01-01", "2019-12-31")
Y_Periods <- interval(A, B)
I'd like to find the parts of X_Period that do not overlap with Y_Periods, so that the result would be:
[1]'2015-01-01'--'2015-12-31'
[2]'2017-05-06'--'2019-01-04'
[3]'2019-06-06'--'2019-12-31'
I'm trying to use setdiff, but it does not work:
setdiff(X_Period, Y_Periods)
Here is an option:
library(lubridate)
seq_X <- as.Date(seq(int_start(X_Period), int_end(X_Period), by = "1 day"))
seq_Y <- as.Date(do.call("c", sapply(Y_Periods, function(x)
seq(int_start(x), int_end(x), by = "1 day"))))
unique_dates_X <- seq_X[!seq_X %in% seq_Y]
lst <- aggregate(
unique_dates_X,
by = list(cumsum(c(0, diff.Date(unique_dates_X) != 1))),
FUN = function(x) c(min(x), max(x)),
simplify = F)$x
lapply(lst, function(x) interval(x[1], x[2]))
#[[1]]
#[1] 2015-01-01 UTC--2015-12-31 UTC
#
#[[2]]
#[1] 2017-05-06 UTC--2019-01-04 UTC
#
#[[3]]
#[1] 2019-06-06 UTC--2019-12-31 UTC
The strategy is to convert the intervals to by-day sequences (one for X_Period and one for Y_Periods); then we find all days that are part of X_Period but not part of Y_Periods. We then aggregate to determine the first and last date in each sub-sequence of consecutive dates. The resulting lst is a list with those start/end dates. To convert to intervals, we simply loop through the list and turn each start/end pair into an interval.
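The same strategy can also be sketched with plain Date vectors, which avoids any dependence on how lubridate intervals are iterated. This is only an illustration under that assumption; the names x_days, y_days, free and gaps are mine.

```r
# The enclosing period, one element per day
x_days <- seq(as.Date("2015-01-01"), as.Date("2019-12-31"), by = "day")

# Start and end dates of the periods to cut out
starts <- as.Date(c("2016-01-01", "2019-01-05"))
ends   <- as.Date(c("2017-05-05", "2019-06-05"))

# All days covered by any of the cut-out periods
y_days <- do.call(c, lapply(seq_along(starts), function(i) {
  seq(starts[i], ends[i], by = "day")
}))

# Days of the enclosing period that are not covered
free <- x_days[!x_days %in% y_days]

# A run starts wherever the step from the previous free day is not exactly 1
is_start <- c(TRUE, diff(as.numeric(free)) != 1)
# A run ends just before the next run starts, and at the very last free day
is_end <- c(is_start[-1], TRUE)

gaps <- data.frame(start = free[is_start], end = free[is_end])
```

Each row of gaps is one non-overlapping stretch; if interval objects are needed, the start and end columns can be passed to lubridate's interval().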
The issue I am facing is that I get a separate object for each .csv file I read, instead of the results being appended to a single data frame or list. I am very new to R, so please help me out.
I am getting output like
amazon.csv 10.07
facebook.csv 54.67
whereas I am expecting all the values in one data frame with the columns company and CAGR.
library(zoo) # provides na.locf()
preprocess <- function(x){
##flipping data to suit time series analysis
my.data <- x[nrow(x):1,]
#sort(x,)
## setting up date as date format
my.data$date <- as.Date(my.data$date)
##creating a new data frame to sort the data.
sorted.data <- my.data[order(my.data$date),]
#removing the last row as it contains stocks price at moment when i downloaded data
#sorted.data <- sorted.data[-nrow(sorted.data),]
#calculating length of the data frame
data.length <- length(sorted.data$date)
## extracting the first date
time.min <- sorted.data$date[1]
##extracting the last date
time.max <- sorted.data$date[data.length]
# creating a new data frame with all the dates in sequence
all.dates <- seq(time.min, time.max, by="day")
all.dates.frame <- data.frame(list(date=all.dates))
#Merging all dates data frame and sorted data frame; all the empty cells are assigned NA values
merged.data <- merge(all.dates.frame, sorted.data, all=T)
##Replacing all NA values with the values of the rows of previous day
final.data <- transform(merged.data, close = na.locf(close), open = na.locf(open), volume = na.locf(volume), high = na.locf(high), low =na.locf(low))
# write.csv(final.data, file = "C:/Users/rites/Downloads/stock prices", row.names = FALSE)
#
#return(final.data) #--> ##{remove comment for Code Check}
################################################################
######calculation of CAGR(Compound Annual Growth Rate ) #######
#### {((latest price/oldest price)^(1/#ofyears)) - 1}*100 ######
################################################################
##Extracting closing price of the oldest date
old_closing_price <- final.data$close[1]
##extracting the closing price of the latest date
new_closing_price <- final.data$close[length(final.data$close)]
##extracting the starting year
start_date <- final.data$date[1]
start_year <- as.numeric(format(start_date, "%Y"))
##extracting the latest date
end_date <- final.data$date[length(final.data$date)]
end_year <- as.numeric(format(end_date, "%Y"))
CAGR_1 <- new_closing_price/old_closing_price
root <- 1/(end_year-start_year)
CAGR <- (((CAGR_1)^(root))-1)*100
return (CAGR)
}
temp = list.files(pattern="\\.csv$") # pattern is a regular expression, not a glob
for (i in 1:length(temp))
assign(temp[i], preprocess (read.csv(temp[i])))
You need to create an empty data frame and append to it in the loop. At the moment you're using assign, which creates separate variables rather than rows in a data frame. Try something like:
df<-data.frame()
for(i in seq_along(temp)){
# one CAGR value per file
preproc <- preprocess(read.csv(temp[i]))
df<-rbind(df,data.frame(company = temp[i],
value = preproc))
}
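As an alternative to growing the data frame inside a loop, the values can be computed in one pass and the data frame built at the end. The sketch below uses made-up prices instead of real files, and cagr() is a hypothetical helper that only mirrors the formula from the question.

```r
# Hypothetical helper: compound annual growth rate in percent
cagr <- function(first_price, last_price, n_years) {
  ((last_price / first_price)^(1 / n_years) - 1) * 100
}

# Made-up first and last closing prices, keyed by file name
prices <- list(amazon.csv = c(100, 121), facebook.csv = c(50, 200))

# One CAGR per entry, assuming a span of 2 years for both
values <- vapply(prices, function(p) cagr(p[1], p[2], n_years = 2), numeric(1))

# Assemble the result in a single step
result <- data.frame(company = names(prices), CAGR = unname(values))
```

With real files, the list of prices would be replaced by something like lapply(temp, function(f) preprocess(read.csv(f))), again collecting everything before calling data.frame() once.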
I wish to create 24 hourly data frames, each containing the hourly demand for a product as one column, followed by eight columns of hourly temperatures. For example, the data frame for 8AM will contain a column for demand at 8AM, then eight temperature columns running from the current hour back through the 7 previous hours. The additional complication is that for hours earlier than 8AM, e.g. 4AM, I have to reach back into yesterday's temperatures. I am hitting my head against the wall trying to figure out how to do this with apply, plyr, or a vectorized function.
demand8AM Temp8AM Temp7AM Temp6AM...Temp1AM
Demand4AM Temp4AM Temp3AM Temp2AM Temp1AM Temp12AM Temp11pm(Lag) Temp10pm(Lag)
In my code Hours are numbers; 1 is 12AM etc.
Here is some simple code I created to create the dataset I am dealing with.
#Creating some Fake Data
require(plyr)
# setting up some fake data
set.seed(31)
foo <- function(myHour, myDate){
rlnorm(1, meanlog=0,sdlog=1)*(myHour) + (150*myDate)
}
Hour <- 1:24
Day <-1:90
dates <-seq(as.Date("2012-01-01"), as.Date("2012-3-30"), by = "day")
myData <- expand.grid( Day, Hour)
names(myData) <- c("Date","Hour")
myData$Temperature <- apply(myData, 1, function(x) foo(x[2], x[1]))
myData$Date <-dates
myData$Demand <-(rnorm(1,mean = 0, sd=1)+.75*myData$Temperature )
## ok, done with the fake data generation.
It looks as though you could benefit from using a time series. Here's my interpretation of what you want (I used the "mean" function in rollapply), which may not be exactly what you asked for. I recommend you read up on the xts and zoo packages.
library(xts) # provides xts() and its lag method
#create dummy time vector
time_index <- seq(from = as.POSIXct("2012-05-15 07:00"),
to = as.POSIXct("2012-05-17 18:00"), by = "hour")
#create dummy demand and temp.C
info <- data.frame(demand = sample(1:length(time_index), replace = T),
temp.C = sample (1:10))
#turn demand + temp.C into time series
eventdata <- xts(info, order.by = time_index)
x2 <- eventdata$temp.C
for (i in 1:8) {x2 <- cbind(x2, lag(eventdata$temp.C, i))}
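For comparison, base R's embed() builds the same kind of lagged columns without a time-series package. This is a sketch on made-up temperatures, not the answer's xts approach, and it assumes the values are stored oldest first in one continuous hourly vector.

```r
# Made-up hourly temperatures, oldest first
temp <- c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
demand <- 0.75 * temp

n_lags <- 8  # current hour plus the 7 previous hours

# Each row holds the current value followed by the 7 values before it
lagged <- embed(temp, n_lags)
colnames(lagged) <- paste0("temp_lag", 0:(n_lags - 1))

# The first 7 hours have no complete history, so demand starts at hour 8
wide <- data.frame(demand = demand[n_lags:length(temp)], lagged)
```

Because the vector runs continuously across midnight, early-morning rows pick up the previous day's evening temperatures automatically. Splitting wide by hour of day afterwards would give the 24 per-hour data frames.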
I am an R newbie and am finding the conversion from matlab rather tricky, so apologies in advance for what could be a very simple question.
I am analyzing some time series data and the problem outlined below demonstrates the problem I am having in R:
Dat1 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 00:00","2012-05-03 02:00",
"2012-05-03 02:30","2012-05-03 05:00",
"2012-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
Dat2 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 01:00","2012-05-03 01:30",
"2012-05-03 02:30","2012-05-03 06:00",
"2012-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
Dat3 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 00:15","2012-05-03 02:20",
"2012-05-03 02:40","2012-05-03 06:25",
"2012-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
Dat4 <- data.frame(dateTime = as.POSIXct(c("2010-05-03 00:15","2010-05-03 02:20",
"2010-05-03 02:40","2010-05-03 06:25",
"2010-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
So, here I have four data frames where all of the data are measured at similar times. I am now trying to ensure that all of the data frames share identical time steps, i.e. that they only contain rows measured at the same times. I can do this for two data frames:
idx1 <- (Dat1[,1] %in% Dat2[,1])
which will tell me the index of the consistent times in these two data frames. I can then re-define the data frame by
newDat1 <- Dat1[idx1,]
to get the data desired.
My question now is, how do I apply this to all of the data frames, i.e. to more than two? I have tried:
idx1 <- (Dat1[,1] %in% (Dat2[,1] %in% (Dat3[,1] %in% Dat4[,1])))
but I can see that this is completely wrong. Any suggestions? Please keep in mind that I have many data frames (more than five), each containing different variables.
EDIT:
I may have found one way this can be done:
idx1 <- (Dat1[,1] %in% intersect(intersect(intersect(Dat1[,1],Dat2[,1]),Dat3[,1]),Dat4[,1]))
which will give the index, and can be used to define a new variable:
Dat1 <- Dat1[idx1,]
Dat2 <- Dat2[idx1,]
Dat3 <- Dat3[idx1,]
Dat4 <- Dat4[idx1,]
Although this works for this example, I was hoping to find a way of making it work for any number of data frames without having to repeat the nesting n times.
To identify timestamps that are common to all data frames, create a function that returns the intersection of multiple vectors:
intersectMulti <- function(x=list()){
for(i in 2:length(x)){
if(i==2) foo <- x[[i-1]]
foo <- intersect(foo,x[[i]]) #find intersection between ith and previous
}
return(x[[1]][match(foo, x[[1]])]) #get original to retain format
}
Note that there are no common timestamps among the four data frames in the question:
> intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1],Dat4[,1]))
character(0)
But there is one common timestamp in the first three data frames:
> intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1]))
[1] "2012-05-03 07:00:00 UTC"
Use the result from the function to subset rows of each dataframe with common timestamp:
m <- intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1]))
Dat1 <- Dat1[Dat1$dateTime %in% m,]
Dat2 <- Dat2[Dat2$dateTime %in% m,]
Dat3 <- Dat3[Dat3$dateTime %in% m,]
Dat4 <- Dat4[Dat4$dateTime %in% m,]
> Dat1
dateTime x1
5 2012-05-03 07:00:00 -0.1607363
> Dat2
dateTime x1
5 2012-05-03 07:00:00 -0.2662494
> Dat3
dateTime x1
5 2012-05-03 07:00:00 -0.1917905
If this works for you:
idx1 <- (Dat1[,1] %in% intersect(intersect(intersect(Dat1[,1],Dat2[,1]),Dat3[,1]),Dat4[,1]))
then try this; it works on any number of vectors and is more elegant:
idx1 <- Dat1[,1] %in% Reduce(intersect, list(Dat1[,1], Dat2[,1], Dat3[,1], Dat4[,1]))
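To take this one step further for many data frames, they can be kept in a list and filtered in one go. The sketch below is an assumption-laden illustration: mk() is a helper I made up to build small test frames, and timestamps are compared as plain numbers so the result does not depend on how intersect() treats the POSIXct class.

```r
# Hypothetical helper to build a small data frame from time strings
mk <- function(times) {
  data.frame(dateTime = as.POSIXct(times, tz = "UTC"), x1 = seq_along(times))
}

frames <- list(
  Dat1 = mk(c("2012-05-03 00:00", "2012-05-03 02:30", "2012-05-03 07:00")),
  Dat2 = mk(c("2012-05-03 01:00", "2012-05-03 02:30", "2012-05-03 07:00")),
  Dat3 = mk(c("2012-05-03 00:15", "2012-05-03 02:40", "2012-05-03 07:00"))
)

# Timestamps as seconds since the epoch, one vector per data frame
keys <- lapply(frames, function(d) as.numeric(d$dateTime))

# Timestamps present in every data frame
common <- Reduce(intersect, keys)

# Keep only the rows with a common timestamp, separately for each data frame
aligned <- lapply(frames, function(d) {
  d[as.numeric(d$dateTime) %in% common, , drop = FALSE]
})
```

Note that each data frame is filtered with its own index here. Reusing the index computed for Dat1 on the other data frames only works when the matching rows happen to sit in the same positions.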