Basic data summary - determine maximum value by date - r

This is my first time using R. I'm trying to do some basic data summarizing (finding the maximum) for plotting. I can do this in Excel, but it takes a while, and since I do the same thing over and over, developing an R script makes a lot of sense. I searched previous posts and found a similar problem, but can't figure out the correct R syntax. Again, I am an absolute beginner, so any help is greatly appreciated.
Problem description: I have a data frame with two columns: DATE/TIME (10 minute time stamp), and PRESSURE. I need to determine the maximum value for PRESSURE for each day.
DateAndTime Pressure
1 8/1/2011 0:06 0.06119370
2 8/1/2011 0:16 0.06003765
3 8/1/2011 0:26 0.06118049
I have tried modifying the code below from a previous post (tried deleting the "which.max" portion) but without success.
for (imonth in 1:12) {
month <- which(data[,2]==imonth)
monthly_max[imonth] <- max(data[month,3])
maxi[imonth] <- which.max(data[month,3])
}
tabela <- cbind(monthly_max, maxi)
write.table(tabela, col.names=TRUE, row.names=TRUE, append=FALSE, sep="\t")

#creating some data for demonstration purpose
time1 <- seq(from=as.POSIXct("2011-01-08 00:06:00"),to=as.POSIXct("2011-01-18 00:06:00"),by="10 min")
DateAndTime <- format(time1,"%d/%m/%Y %H:%M")
Pressure <- rnorm(length(DateAndTime),0.06,0.01)
DF <- data.frame(DateAndTime,Pressure)
#look at first lines
head(DF)
#convert character in datetime format
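#store the results as POSIXct: strptime() and trunc() return POSIXlt, which is awkward as a data frame column
DF$DateAndTime2 <- as.POSIXct(strptime(DF$DateAndTime,"%d/%m/%Y %H:%M",tz="GMT"))
DF$Days <- as.POSIXct(trunc(DF$DateAndTime2,"days"))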
#create the summary
require(plyr)
summaryDF <- ddply(DF,.(Days),summarise,max(Pressure))
names(summaryDF)<-c("Day","Maximum")
#write to CSV file, which can be read into Excel
write.table(summaryDF,file="output.csv",col.names=TRUE,row.names=FALSE,dec=".",sep=",")
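For comparison, the same daily maximum can probably be had without plyr, using aggregate() from base R. This is only a sketch: the four sample rows are made up in the question's layout, and it assumes the timestamps are month/day/year, which the question does not state.

```r
# Made-up rows in the question's layout (assumed to be month/day/year)
DF <- data.frame(
  DateAndTime = c("8/1/2011 0:06", "8/1/2011 0:16",
                  "8/2/2011 0:06", "8/2/2011 0:16"),
  Pressure = c(0.0612, 0.0600, 0.0590, 0.0625),
  stringsAsFactors = FALSE
)

# as.Date() reads the date part and ignores the trailing time of day
DF$Day <- as.Date(DF$DateAndTime, format = "%m/%d/%Y")

# one row per day, holding the largest Pressure seen on that day
daily_max <- aggregate(Pressure ~ Day, data = DF, FUN = max)
names(daily_max) <- c("Day", "Maximum")
print(daily_max)
```

If the real data is day/month/year instead, the format string would need to be "%d/%m/%Y".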

I'd recommend using a time-series class, like xts or zoo.
# create some data that looks like the OP's
NOW <- .POSIXct(1342460400)
d <- data.frame(DateAndTime=format(NOW+seq(0,3600*72,600), "%Y-%m-%d %H:%M"))
d$Pressure <- runif(NROW(d))/10
library(xts) # load the xts package
# create an xts object from the OP's data.frame
x <- xts(d["Pressure"], as.POSIXct(d$DateAndTime))
# apply the max function to each day
dx <- apply.daily(x, max)
# Pressure
# 2012-07-16 23:50:00 0.09872622
# 2012-07-17 23:50:00 0.09947256
# 2012-07-18 23:50:00 0.09932375
# 2012-07-19 12:40:00 0.09971159

This?
dat <- data.frame(date = rep(seq(1,50,2),2), value = rnorm(50))
head(dat)
require(plyr)
ddply(dat, .(date), summarise, max(value))

Related

add_months function in Spark R

I have a variable of the form "2020-09-01". I need to increase and decrease this by 3 months and 5 months and store the results in other variables. I need the syntax for this in SparkR, but any other method will also work. Thanks.
In plain R, the following code works fine:
y <- as.Date(load_date,"%Y-%m-%d") %m+% months(i)
The code below didn't work. The error says:
unable to find an inherited method for function ‘add_months’ for signature ‘"Date", "numeric"’
loaddate = 202009
year <- substr(loaddate,1,4)
month <- substr(loaddate,5,6)
load_date <- paste(year,month,"01",sep = "-")
y <- as.Date(load_date,"%Y%m%d")
y1 <- add_months(y,-3)
Expected Result - 2020-06-01
The lubridate package makes dealing with dates much easier. Here I have shuffled as.Date up a step, then simply subtract 3 months.
library(lubridate)
loaddate = 202009
year <- substr(loaddate,1,4)
month <- substr(loaddate,5,6)
load_date <- as.Date(paste(year,month,"01",sep = "-"))
new_date <- load_date - months(3)
new_date
Output:
Date[1:1], format: "2020-06-01"
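The error message in the question says there is no add_months method for a plain R Date combined with a number. If the date lives in an ordinary R variable, as it does here, base R can also do the shift without lubridate by stepping a one-element sequence. A minimal sketch; shift_months is a hypothetical helper name of mine, not a SparkR or lubridate function:

```r
# shift_months() is a hypothetical helper, not part of any package.
# It moves a Date by n calendar months using base R's seq().
shift_months <- function(date, n) {
  if (n == 0) return(date)
  # seq() accepts steps such as "-3 months" or "5 months"
  seq(date, by = paste(n, "months"), length.out = 2)[2]
}

load_date <- as.Date("2020-09-01")
minus3 <- shift_months(load_date, -3)
plus5 <- shift_months(load_date, 5)
print(minus3)
print(plus5)
```

I have only reasoned this through for first-of-month dates like the one in the question. Dates near the end of a month can roll over into the following month, so they would need separate handling.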

Separating time and date into different columns (format=2019-05-26T13:50:56.335288Z)

In my data frame 'cord', the date and time are stored together in a single column (class=factor). I want to separate the date and the time into two separate columns.
The data looks like this:
1 2019-05-26T13:50:56.335288Z
2 2019-05-26T17:55:45.348073Z
3 2019-05-26T18:12:00.882572Z
4 2019-05-26T18:26:49.577310Z
I have successfully extracted the date using:
cord$Date <- as.POSIXct(cord$Time)
I have however not been able to find a way to extract the time in format "H:M:S".
The output of dput(head(cord$Time)) returns a long list of timestamps: "2020-04-02T13:34:07.746777Z", "2020-04-02T13:41:11.095014Z",
"2020-04-02T14:08:05.508818Z", "2020-04-02T14:17:10.337101Z", and so on...
Extract H:M:S
library(lubridate)
format(as_datetime(cord$Time), "%H:%M:%S")
#> [1] "13:50:56" "17:55:45" "18:12:00" "18:26:49"
If you need fractional seconds too (%OS6 prints six decimal places, i.e. microseconds):
format(as_datetime(cord$Time), "%H:%M:%OS6")
#> [1] "13:50:56.335288" "17:55:45.348073" "18:12:00.882572" "18:26:49.577310"
where cord is:
cord <- read.table(text = " Time
1 2019-05-26T13:50:56.335288Z
2 2019-05-26T17:55:45.348073Z
3 2019-05-26T18:12:00.882572Z
4 2019-05-26T18:26:49.577310Z ", header = TRUE)
I typically use lubridate and data.table for my date manipulation work. This works for me when I copy in some of your raw dates as strings:
library(lubridate)
library(data.table)
x <- c("2019-05-26T13:50:56.335288Z", "2019-05-26T17:55:45.348073Z")
# lubridate to parse to date time
y <- parse_date_time(x, "ymd HMS")
# data.table to split in to dates and time
split_y <- tstrsplit(y, " ")
dt <- as.data.table(split_y)
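# name both columns; setnames(dt, "Date", "Time") would try to rename an existing column called "Date"
setnames(dt, c("Date", "Time"))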
dt[]
# if you use data.frames instead
df <- as.data.frame(dt)
df
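If you would rather avoid extra packages, something along these lines should also work in base R. It is a sketch on two of the timestamps from the question, and it assumes the trailing "Z" means the times are UTC:

```r
x <- c("2019-05-26T13:50:56.335288Z", "2019-05-26T17:55:45.348073Z")

# %OS reads seconds including the fractional part; the trailing "Z" is ignored
ts <- as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

# split into a Date column and an "H:M:S" character column
out <- data.frame(
  Date = as.Date(ts),
  Time = format(ts, "%H:%M:%S"),
  stringsAsFactors = FALSE
)
print(out)
```

If cord$Time is a factor, wrapping it in as.character() before parsing is the safer route.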

convert irregular 6hourly data to daily accumulated using R

I have the following data:
Date,Rain
1979_8_9_0,8.775
1979_8_9_6,8.775
1979_8_9_12,8.775
1979_8_9_18,8.775
1979_8_10_0,0
1979_8_10_6,0
1979_8_10_12,0
1979_8_10_18,0
1979_8_11_0,8.025
1979_8_12_12,0
1979_8_12_18,0
1979_8_13_0,8.025
The data is six-hourly, but some dates are incomplete. For example, 11 August 1979 has only one value, at 00H. I would like to get the daily accumulated rainfall from this kind of data using R. Any suggestions on how to do this easily in R?
I'll appreciate any help.
You can transform your data to dates very easily with:
dat$Date <- as.Date(strptime(dat$Date, '%Y_%m_%d_%H'))
After that you should aggregate with:
aggregate(Rain ~ Date, dat, sum)
The result:
Date Rain
1 1979-08-09 35.100
2 1979-08-10 0.000
3 1979-08-11 8.025
4 1979-08-12 0.000
5 1979-08-13 8.025
Based on the comment of Henrik, you can also transform to dates with:
dat$Date <- as.Date(dat$Date, '%Y_%m_%d')
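# split the "Date" variable into its four components (year, month, day, hour)
splitDate <- stringr::str_split_fixed(string = df$Date, pattern = "_", n = 4)
# build a year-month-day key; grouping on the day of month alone would merge different months
df$Day <- paste(splitDate[,1], splitDate[,2], splitDate[,3], sep = "-")
# split Rain by Day and sum each group
unlist(lapply(split(df$Rain, df$Day), sum))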
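Since some days have fewer than four readings, it may help to keep the number of readings next to each daily total, so that incomplete days can be flagged. A sketch using the question's data and base R only; the column names Day and N are my own choice:

```r
dat <- read.csv(text = "Date,Rain
1979_8_9_0,8.775
1979_8_9_6,8.775
1979_8_9_12,8.775
1979_8_9_18,8.775
1979_8_10_0,0
1979_8_10_6,0
1979_8_10_12,0
1979_8_10_18,0
1979_8_11_0,8.025
1979_8_12_12,0
1979_8_12_18,0
1979_8_13_0,8.025", stringsAsFactors = FALSE)

# keep only the calendar day; the trailing hour field is ignored
dat$Day <- as.Date(dat$Date, format = "%Y_%m_%d")

# daily total and number of six-hourly readings behind it
total <- tapply(dat$Rain, dat$Day, sum)
n <- tapply(dat$Rain, dat$Day, length)
daily <- data.frame(Day = as.Date(names(total)),
                    Rain = as.vector(total),
                    N = as.vector(n))
print(daily)
```

Days with N below 4 (11, 12 and 13 August here) are the incomplete ones. Whether their totals should be kept, scaled or dropped depends on what the data means.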

Convert continuous time-series data into daily-hourly representation using R

I have time-series data in xts representation as
library(xts)
xtime <-timeBasedSeq('2015-01-01/2015-01-30 23')
df <- xts(rnorm(length(xtime),30,4),xtime)
Now I want to calculate the correlation between different days, so I want to represent df as a matrix with one row per day and one column per hour.
To achieve this I used
p_mat= split(df,f="days",drop=FALSE,k=1)
Using this I get a list of days, but I am not able to arrange this list in matrix form. Also I used
p_mat<- df[.indexday(df) %in% c(1:30) & .indexhour(df) %in% c(1:24)]
With this I do not get any output.
Also I tried to use rollapply(), but was not able to arrange it properly.
How can I form this matrix using xts/zoo objects?
Maybe you could use something like this:
#convert to a data.frame with an hour column and a day column
df2 <- data.frame(value = df,
hour = format(index(df), '%H:%M:%S'),
day = format(index(df), '%Y:%m:%d'),
stringsAsFactors=FALSE)
#then use xtabs, which outputs a matrix in the format you need
tab <- xtabs(value ~ day + hour, df2)
Output:
hour
day 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00 10:00:00 11:00:00 12:00:00
2015:01:01 28.15342 35.72913 27.39721 29.17048 28.42877 28.72003 28.88355 31.97675 29.29068 27.97617 35.37216 29.14168 29.28177
2015:01:02 23.85420 28.79610 27.88688 27.39162 29.77241 22.34256 34.70633 23.34011 28.14588 25.53632 26.99672 38.34867 30.06958
2015:01:03 37.47716 31.70040 29.04541 34.23393 33.54569 27.52303 38.82441 28.97989 24.30202 29.42240 30.83015 39.23191 30.42321
2015:01:04 24.13100 32.08409 29.36498 35.85835 26.93567 28.27915 26.29556 29.29158 31.60805 27.07301 33.32149 25.16767 25.80806
2015:01:05 32.16531 29.94640 32.04043 29.34250 31.68278 28.39901 24.51917 33.95135 36.07898 28.76504 24.98684 32.56897 29.82116
2015:01:06 18.44432 27.43807 32.28203 29.76111 29.60729 32.24328 25.25417 34.38711 29.97862 32.82924 34.13643 30.89392 26.48517
2015:01:07 34.58491 20.38762 32.29096 31.49890 28.29893 33.80405 28.44305 28.86268 33.42964 36.87851 31.08022 28.31126 25.24355
2015:01:08 33.67921 31.59252 28.36989 35.29703 27.19507 27.67754 25.99571 27.32729 33.78074 31.73481 34.02064 28.43953 31.50548
2015:01:09 28.46547 36.61658 36.04885 30.33186 32.26888 25.90181 31.29203 34.17445 30.39631 28.18345 27.37687 29.85631 34.27665
2015:01:10 30.68196 26.54386 32.71692 28.69160 23.72367 28.53020 35.45774 28.66287 32.93100 33.78634 30.01759 28.59071 27.88122
2015:01:11 32.70907 31.51985 29.22881 36.31157 32.38494 25.30569 29.37743 22.32436 29.21896 19.63069 35.25601 27.45783 28.28008
2015:01:12 29.96676 30.51542 29.41650 29.34436 37.05421 33.05035 34.44572 26.30717 30.65737 34.61930 29.77391 21.48256 31.37938
2015:01:13 33.46089 34.29776 37.58262 27.58801 28.43653 28.33511 28.49737 28.53348 28.81729 35.76728 27.20985 28.44733 32.61015
2015:01:14 22.96213 32.27889 36.44939 23.45088 26.88173 27.43529 27.27547 21.86686 32.00385 23.87281 29.90001 32.37194 29.20722
2015:01:15 28.30359 30.94721 20.62911 33.84679 27.58230 26.98849 23.77755 24.18443 30.22533 32.03748 21.60847 25.98255 32.14309
2015:01:16 23.52449 29.56138 31.76356 35.40398 24.72556 31.45754 30.93400 34.77582 29.88836 28.57080 25.41274 27.93032 28.55150
2015:01:17 25.56436 31.23027 25.57242 31.39061 26.50694 30.30921 28.81253 25.26703 30.04517 33.96640 36.37587 24.50915 29.00156
...and so on
Here's one way to do it using a helper function that will account for days that do not have 24 observations.
library(xts)
xtime <- timeBasedSeq('2015-01-01/2015-01-30 23')
set.seed(21)
df <- xts(rnorm(length(xtime),30,4), xtime)
tHourly <- function(x) {
# initialize result matrix for all 24 hours
dnames <- list(format(index(x[1]), "%Y-%m-%d"),
paste0("H", 0:23))
res <- matrix(NA, 1, 24, dimnames = dnames)
# transpose day's rows and set colnames
tx <- t(x)
colnames(tx) <- paste0("H", .indexhour(x))
# update result object and return
res[,colnames(tx)] <- tx
res
}
# split on days, apply tHourly to each day, rbind results
p_mat <- split(df, f="days", drop=FALSE, k=1)
p_list <- lapply(p_mat, tHourly)
p_hmat <- do.call(rbind, p_list)
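Once the values are in a day-by-hour matrix, the correlation between days that the question asks about is cor() applied to the transpose. A sketch in base R with simulated data shaped like the question's; it does not use xts, and the object names are my own:

```r
set.seed(21)
# hourly timestamps for 30 days, as in the question
times <- seq(as.POSIXct("2015-01-01 00:00", tz = "UTC"),
             as.POSIXct("2015-01-30 23:00", tz = "UTC"), by = "hour")
values <- rnorm(length(times), 30, 4)

day <- format(times, "%Y-%m-%d")
hour <- format(times, "%H")

# rows = days, columns = hours; a missing hour would show up as NA
day_hour <- tapply(values, list(day, hour), mean)

# cor() correlates columns, so transpose to compare days with each other
day_cor <- cor(t(day_hour), use = "pairwise.complete.obs")
dim(day_cor)
```

The use = "pairwise.complete.obs" argument is there so that a day with a missing hour does not turn its whole row of correlations into NA.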

Calculate the average based on other columns

I want to calculate the
"average of the closing prices for the 5, 10 and 30 consecutive trading days immediately preceding and including the Announcement Day, excluding trading halt days (days on which trading volume is 0 or NA)".
For example, suppose 2014/5/7 is the Announcement Day.
then average of price for 5 consecutive days :
average of (price of 2014/5/7,2014/5/5, 2014/5/2, 2014/4/30,2014/4/29),
price of 2014/5/6 and 2014/5/1 was excluded due to 0 trading volume on those days.
EDIT on 11/9/2014
One thing to Note: the announcement day for each stock is different, and it's not last valid date in the data, so usage of tail when calculating average was not appropriate.
Date Price Volume
2014/5/9 1.42 668000
2014/5/8 1.4 2972000
2014/5/7 1.5 1180000
2014/5/6 1.59 0
2014/5/5 1.59 752000
2014/5/2 1.6 138000
2014/5/1 1.6 NA
2014/4/30 1.6 656000
2014/4/29 1.61 364000
2014/4/28 1.61 1786000
2014/4/25 1.64 1734000
2014/4/24 1.68 1130000
2014/4/23 1.68 506000
2014/4/22 1.67 354000
2014/4/21 1.7 0
2014/4/18 1.7 0
2014/4/17 1.7 1954000
2014/4/16 1.65 1788000
2014/4/15 1.71 1294000
2014/4/14 1.68 1462000
Reproducible Code:
require(quantmod)
require(data.table)
tickers <- c("0007.hk","1036.hk")
date_begin <- as.Date("2010-01-01")
date_end <- as.Date("2014-09-09")
# retrieve data for all stocks
prices <- getSymbols(tickers, from = date_begin, to = date_end, auto.assign = TRUE)
dataset <- merge(Cl(get(prices[1])),Vo(get(prices[1])))
for (i in 2:length(prices)){
dataset <- merge(dataset, Cl(get(prices[i])),Vo(get(prices[i])))
}
# Write First
write.zoo(dataset, file = "prices.csv", sep = ",", qmethod = "double")
# Read zoo
test <- fread("prices.csv")
setnames(test, "Index", "Date")
Then I got a data.table. The first Column is Date, then the price and volume for each stock.
Actually, the original data contains information for about 40 stocks. Column names follow the same pattern: "X" + ticker + ".Close" and "X" + ticker + ".Volume".
Last trading days for different stock were different.
The desired output :
days 0007.HK 1036.HK
5 1.1 1.1
10 1.1 1.1
30 1.1 1.1
The major issues:
.SD, lapply and .SDcols can be used to loop over the different stocks. .N can be used when calculating the last N consecutive days.
Due to the different announcement day, it becomes a little complicated.
Any suggestions on single stock using quantmod or multiple stocks using data.table are extremely welcomed!
Thanks GSee and pbible for the nice solutions, it was very useful. I'll update my code later incorporating different announcement day for each stocks, and consult you later.
Indeed, it's more a xts question than a data.table one. Anything about data.table will be very helpful. Thanks a lot!
Because the different stocks have different announcement days, I first tried to make a solution following @pbible's logic. Any suggestions will be extremely welcome.
library(quantmod)
tickers <- c("0007.hk","1036.hk")
date_begin <- as.Date("2010-01-01")
# Instead of making one specific date_end, different date_end is used for convenience of the following work.
date_end <- c(as.Date("2014-07-08"),as.Date("2014-05-15"))
for ( i in 1: length(date_end)) {
stocks <- getSymbols(tickers[i], from = date_begin, to = date_end[i], auto.assign = TRUE)
dataset <- cbind(Cl(get(stocks)),Vo(get(stocks)))
usable <- subset(dataset,dataset[,2] > 0 & !is.na(dataset[,2]))
sma.5 <- SMA(usable[,1],5)
sma.10 <- SMA(usable[,1],10)
sma.30 <- SMA(usable[,1],30)
col <- as.matrix(rbind(tail(sma.5,1), tail(sma.10,1), tail(sma.30,1)))
colnames(col) <- colnames(usable[,1])
rownames(col) <- c("5","10","30")
if (i == 1) {
matrix <- as.matrix(col)
}
else {matrix <- cbind(matrix,col)}
}
I got what I want, but the code is ugly. Any suggestions to make it more elegant are extremely welcome!
Well, here's a way to do it. I don't know why you want to get rid of the loop, and this does not get rid of it (in fact it has a loop nested inside another). One thing that you were doing is growing objects in memory with each iteration of your loop (i.e. the matrix <- cbind(matrix,col) part is inefficient). This Answer avoids that.
library(quantmod)
tickers <- c("0007.hk","1036.hk")
date_begin <- as.Date("2010-01-01")
myEnv <- new.env()
date_end <- c(as.Date("2014-07-08"),as.Date("2014-05-15"))
lookback <- c(5, 10, 30) # different number of days to look back for calculating mean.
symbols <- getSymbols(tickers, from=date_begin,
to=tail(sort(date_end), 1), env=myEnv) # to=last date
end.dates <- setNames(date_end, symbols)
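# loop over the symbol names so that both the name and its end date are available inside the function
out <- do.call(cbind, lapply(names(end.dates), function(sym) {
x <- end.dates[[sym]]
dat <- na.omit(get(sym, pos=myEnv))[paste0("/", x)]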
prc <- Cl(dat)[Vo(dat) > 0]
setNames(vapply(lookback, function(n) mean(tail(prc, n)), numeric(1)),
lookback)
}))
colnames(out) <- names(end.dates)
out
# 0007.HK 1036.HK
#5 1.080 8.344
#10 1.125 8.459
#30 1.186 8.805
Some commentary...
I created a new environment, myEnv, to hold your data so that it does not clutter your workspace.
I used the output of getSymbols (as you did in your attempt) because the input tickers are not uppercase.
I named the vector of end dates so that we can loop over that vector and know both the end date and the name of the stock.
The bulk of the code is an lapply loop (wrapped in do.call(cbind, ...)) that visits each entry of the named end.dates vector.
The first line gets the data from myEnv, removes NAs, and subsets it to only include data up to the relevant end date.
The next line extracts the close column and subsets it to only include rows where volume is greater than zero.
The vapply loops over a vector of different lookbacks and calculates the mean. That is wrapped in setNames so that each result is named based on which lookback was used to calculate it.
The lapply call returns a list of named vectors. do.call(cbind, LIST) is the same as calling cbind(LIST[[1]], LIST[[2]], LIST[[3]]) except LIST can be a list of any length.
At this point we have a matrix with row names, but no column names. So, I named the columns based on which stock they represent.
Hope this helps.
How about something like this using the subset and moving average (SMA). Here is the solution I put together.
library(quantmod)
tickers <- c("0007.hk","1036.hk","cvx")
date_begin <- as.Date("2010-01-01")
date_end <- as.Date("2014-09-09")
stocks <- getSymbols(tickers, from = date_begin, to = date_end, auto.assign = TRUE)
stock3Summary <- function(stock){
dataset <- cbind(Cl(get(stock)),Vo(get(stock)))
usable <- subset(dataset,dataset[,2] > 0 & !is.na(dataset[,2]))
sma.5 <- SMA(usable[,1],5)
sma.10 <- SMA(usable[,1],10)
sma.30 <- SMA(usable[,1],30)
col <- as.matrix(rbind(tail(sma.5,1), tail(sma.10,1), tail(sma.30,1)))
colnames(col) <- colnames(usable[,1])
rownames(col) <- c("5","10","30")
col
}
matrix <- as.matrix(stock3Summary(stocks[1]))
for( i in 2:length(stocks)){
matrix <- cbind(matrix,stock3Summary(stocks[i]))
}
The output:
> matrix
X0007.HK.Close X1036.HK.Close CVX.Close
5 1.082000 8.476000 126.6900
10 1.100000 8.412000 127.6080
30 1.094333 8.426333 127.6767
This should work with multiple stocks. It will use only the most recent valid date.
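To check the rule from the question on a single stock without downloading anything, the sample table can be used directly. A sketch in base R; avg_before is a hypothetical helper name of mine, and returning NA when fewer than n valid days exist is my assumption about the desired behaviour:

```r
px <- read.table(text = "Date Price Volume
2014/5/9 1.42 668000
2014/5/8 1.4 2972000
2014/5/7 1.5 1180000
2014/5/6 1.59 0
2014/5/5 1.59 752000
2014/5/2 1.6 138000
2014/5/1 1.6 NA
2014/4/30 1.6 656000
2014/4/29 1.61 364000
2014/4/28 1.61 1786000
2014/4/25 1.64 1734000
2014/4/24 1.68 1130000
2014/4/23 1.68 506000
2014/4/22 1.67 354000
2014/4/21 1.7 0
2014/4/18 1.7 0
2014/4/17 1.7 1954000
2014/4/16 1.65 1788000
2014/4/15 1.71 1294000
2014/4/14 1.68 1462000", header = TRUE, stringsAsFactors = FALSE)
px$Date <- as.Date(px$Date, format = "%Y/%m/%d")

# avg_before() is a hypothetical helper: mean price over the last n
# traded days (volume > 0 and not NA) up to and including `announce`
avg_before <- function(data, announce, n) {
  ok <- data[!is.na(data$Volume) & data$Volume > 0 & data$Date <= announce, ]
  ok <- ok[order(ok$Date, decreasing = TRUE), ]
  if (nrow(ok) < n) return(NA_real_)  # assumption: not enough history gives NA
  mean(ok$Price[seq_len(n)])
}

announce <- as.Date("2014-05-07")
res <- sapply(c("5" = 5, "10" = 10, "30" = 30),
              function(n) avg_before(px, announce, n))
print(res)
```

For the 5-day case this picks 2014/5/7, 5/5, 5/2, 4/30 and 4/29, the same days listed in the question. The 30-day value is NA here only because the sample table is too short.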

Resources