I have a question about how to select certain values from a table. I have a table with times and values, and I want to get the rows before and after a certain time.
Example-Data.Frame.
Time Value
02:51 0.08033405
05:30 0.43456738
09:45 0.36052075
14:02 0.45013807
18:55 0.05745870
....# and so on
Time is coded as character, but can be formatted.
Now I have, for example, the time "6:15" and want to get the values for the times before and after it from the table (0.43456738 and 0.36052075).
The database is in fact quite huge and I have a lot of time values.
Does anyone have a nice suggestion for how to accomplish this?
thanks
Curlew
# Note: == only finds times that appear in the table exactly (e.g. "09:45", not "6:15");
# the column names time and value are assumed here
value_before <- example_df[which(example_df$time=="09:45")-1, ]$value
value_after <- example_df[which(example_df$time=="09:45")+1, ]$value
# This could become a function
return_values <- function(df,cutoff) {
value_before <- df[which(df$time==cutoff)-1, ]$value
value_after <- df[which(df$time==cutoff)+1, ]$value
return(list(value_before, value_after))
}
return_values(example_df, "09:45")
# A solution for a large dataset.
library(data.table)
df <- data.frame(time = 1:1000000, value = rnorm(1000000))
# create a couple of offsets
df$pvalue <- c(NA, df$value[1:(dim(df)[1] - 1)])  # previous row's value
df$nvalue <- c(df$value[2:dim(df)[1]], NA)        # next row's value
new_df <- data.table(df)
setkey(new_df,"time")
new_df[time==10]
time value pvalue nvalue
[1,] 10 -0.8488881 -0.1281219 -0.5741059
> new_df[time==1234]
time value pvalue nvalue
[1,] 1234 -0.3045015 0.708884 -0.5049194
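Neither snippet above handles a time such as "6:15" that is not itself in the table, which is what the question asks about. Here is a minimal sketch of one way to do that, assuming the table is sorted by time and the columns are named time and value. The helpers to_minutes and neighbours are hypothetical names, not part of any package.

```r
# Sample data from the question; the column names 'time' and 'value' are assumed
example_df <- data.frame(
  time  = c("02:51", "05:30", "09:45", "14:02", "18:55"),
  value = c(0.08033405, 0.43456738, 0.36052075, 0.45013807, 0.05745870),
  stringsAsFactors = FALSE
)

# Convert "H:MM" or "HH:MM" strings to minutes since midnight (hypothetical helper)
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Value of the last row with time <= cutoff and of the first row with time > cutoff.
# Assumes df is sorted by time; returns NA where no such row exists.
neighbours <- function(df, cutoff) {
  mins   <- to_minutes(df$time)
  target <- to_minutes(cutoff)
  i <- findInterval(target, mins)   # index of the last row at or before the cutoff
  before <- if (i >= 1) df$value[i] else NA_real_
  after  <- if (i + 1 <= nrow(df)) df$value[i + 1] else NA_real_
  c(before = before, after = after)
}

res <- neighbours(example_df, "6:15")
res   # before = 0.43456738, after = 0.36052075
```

findInterval does a binary search, so this should stay fast on a large table, and it also accepts a whole vector of cutoffs if you adapt the function.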
Related
I have data with Order Id, Start Date & End Date. I have to split both the Start and End dates into intervals of 30 days, and derive two new variables “split start date” and “split end date”.
Example: The example below illustrates how split dates are created when the Start Date is “01/05/2017” and the End Date is “06/07/2017”.
Suppose an order has the start and end dates below (see the image for an example).
What is the code for this problem in R?
Here is a solution which should generalize to multiple order IDs. I have created sample data with two order IDs. The basic idea is to calculate the number of intervals between start_date and end_date. Then we repeat the row for each order ID by the number of intervals, and also create a sequence to determine which interval we are in. This is the purpose of the functions f and g and the use of Map.
The rest is just vector manipulation, where we define split_start_date and split_end_date. The last statement ensures that split_end_date does not exceed end_date.
df <- data.frame(
order_id = c(1, 2),
start_date = c(as.Date("2017-05-01"), as.Date("2017-08-01")),
end_date = c(as.Date("2017-07-06"), as.Date("2017-09-15"))
)
df$diff_days <- as.integer(df$end_date - df$start_date)
df$num_int <- ceiling(df$diff_days / 30)
f <- function(rowindex) {
rep(rowindex, each = df[rowindex, "num_int"])
}
g <- function(rowindex) {
1:df[rowindex, "num_int"]
}
rowindex_rep <- unlist(Map(f, 1:nrow(df)))
df2 <- df[rowindex_rep, ]
df2$seq <- unlist(Map(g, 1:nrow(df)))
df3 <- df2
df3$split_start_date <- df3$start_date + (df3$seq - 1) * 30
df3$split_end_date <- df3$split_start_date + 29
df3[which(df3$seq == df3$num_int), ]$split_end_date <-
df3[which(df3$seq == df3$num_int), ]$end_date
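For comparison, the same idea can be sketched more compactly with rep() and sequence() instead of the helper functions and Map. This is only an illustrative variant under the same 30-day assumption; the names orders, n_int, k and split_orders are made up for this sketch.

```r
# Same sample data as in the answer, under a different (hypothetical) name
orders <- data.frame(
  order_id   = c(1, 2),
  start_date = as.Date(c("2017-05-01", "2017-08-01")),
  end_date   = as.Date(c("2017-07-06", "2017-09-15"))
)

# Number of 30-day intervals per order
n_int <- ceiling(as.integer(orders$end_date - orders$start_date) / 30)

# Repeat each order row once per interval; k is the interval number within the order
split_orders <- orders[rep(seq_len(nrow(orders)), times = n_int), ]
k <- sequence(n_int)

split_orders$split_start_date <- split_orders$start_date + (k - 1) * 30
# pmin() caps every interval end at the order's end_date
split_orders$split_end_date <- pmin(split_orders$split_start_date + 29,
                                    split_orders$end_date)
rownames(split_orders) <- NULL
split_orders
```

Using pmin() caps all rows rather than only the last interval of each order, which gives the same result here because only the last interval can run past end_date.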
I want to calculate fiscal year returns and standard deviations from daily returns for a large number of firms. I am relatively new to R, having previously used SAS to calculate returns etc. However, I'd like to switch to R in the short/medium-term.
I have two files: 1) Containing a firm identifier, dates, daily returns(df.1) and 2) my sample (df.2) over which I'd like to aggregate the returns
firm date ret
1 01/01/1992 0.024
1 02/01/1992 0.010
. . .
. . .
1 31/12/2014 0.002
2 01/01/1992 0.004
2 02/01/1992 0.012
The file is very large, about 1M rows.
The second file looks like this:
firm fiscal_year_start fiscal_year_end
1 01/01/1992 31/12/1992
1 01/01/1993 31/12/1993
1 01/01/1994 31/12/1994
I want to calculate fiscal year returns and annualised standard deviation. Both .csv files are loaded into R as data frames. I am unsure on how to best treat the date variables and how to structure the for loop to loop through the daily return file.
Any help would be much appreciated.
EDIT1
I am able to subset the big data frame using this function:
myfunc <- function(x,y,z){df.1[df.1$date1 >= x & df.1$date1 < y & df.1$firm == z,]}
firm1 <- df.2$firm[1]
start_date <- df.2$StartDate[1]
end_date <- df.2$EndDate[1]
Test <- myfunc(start_date,end_date, firm1)
For this subset I can then get the fiscal-year return and std:
# return
fiscal_year_ret <- with(Test, sum(Test$ret))
# annualized variance
var <- with(Test, var(Test$ret))
annualized_var <- var*nrow(Test)  # number of daily observations; length(Test) would count columns
annualized_st.dev <- sqrt(annualized_var)
My big problem is embedding this into a loop that allows me to loop through the different firm identifiers and dates in df.2
EDIT2
So I have something like this
df.output <- data.frame(returns=as.numeric(),
std.deviation=as.numeric(),
stringsAsFactors=FALSE)
I would like to populate the above data frame with the results.
for (i in sample) {
myfunc <- function(x,y,z){df.1[df.1$date1 >= x & df.1$date1 < y & df.1$firm == firm1,]}
firm1 <- df.2$firm[i]
start_date <- df.2$StartDate[i]
end_date <- df.2$EndDate[i]
subset <- myfunc(start_date,end_date, firm1)
# return
fiscal_year_ret <- with(subset, sum(subset$ret))
df.output$returns <-fiscal_year_ret
# variance
var <- with(subset, var(subset$ret))
annualized_var <- var*nrow(subset)  # number of daily observations; length(subset) would count columns
annualized_st.dev <- sqrt(annualized_var)
}
Something like that.
Here is one way:
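library(dplyr)
library(lubridate)
# df.1 is the daily-return data frame from the question;
# this assumes fiscal years coincide with calendar years
df.1 %>%
mutate(year =
date %>%
dmy %>% # dates in the question are day/month/year
floor_date(unit = "year") ) %>%
group_by(firm, year) %>%
summarize(
mean_return = mean(ret),
sd_return = sd(ret))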
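If the fiscal years do not line up with calendar years, one option is to go through the rows of df.2 and pick out the matching daily returns for each. Below is a rough base-R sketch using the sum-of-returns and variance-times-observations definitions from the question. The toy numbers are made up for illustration, and the column names follow the tables shown in the question rather than the StartDate/EndDate names used in the edits.

```r
# Toy stand-ins for the two files (values are made up for illustration)
df.1 <- data.frame(
  firm = c(1, 1, 1, 1, 2, 2),
  date = c("01/01/1992", "02/01/1992", "04/01/1993", "05/01/1993",
           "01/01/1992", "02/01/1992"),
  ret  = c(0.024, 0.010, 0.005, -0.003, 0.004, 0.012)
)
df.2 <- data.frame(
  firm = c(1, 1, 2),
  fiscal_year_start = c("01/01/1992", "01/01/1993", "01/01/1992"),
  fiscal_year_end   = c("31/12/1992", "31/12/1993", "31/12/1992")
)

# Treat the dates as Date objects (day/month/year) so they can be compared
df.1$date <- as.Date(df.1$date, format = "%d/%m/%Y")
df.2$fiscal_year_start <- as.Date(df.2$fiscal_year_start, format = "%d/%m/%Y")
df.2$fiscal_year_end   <- as.Date(df.2$fiscal_year_end, format = "%d/%m/%Y")

# One small result data frame per firm-year row of df.2
stats <- lapply(seq_len(nrow(df.2)), function(i) {
  in_year <- df.1$firm == df.2$firm[i] &
    df.1$date >= df.2$fiscal_year_start[i] &
    df.1$date <= df.2$fiscal_year_end[i]
  r <- df.1$ret[in_year]
  data.frame(fiscal_year_ret = sum(r),
             annualized_sd   = sd(r) * sqrt(length(r)))
})

# Bind the results next to the firm-year definitions
df.output <- cbind(df.2, do.call(rbind, stats))
df.output
```

This scans df.1 once per firm-year, which is simple but may be slow for very many firm-years; collecting the pieces in a list and binding once at the end at least avoids growing a data frame inside a loop.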
I want to calculate
"average of the closing prices for the 5,10,30 consecutive trading days immediately preceding and including the Announcement Day, but excluding trading halt days (days on which trading volume is 0 or NA)".
For example, suppose 2014/5/7 is the Announcement Day.
Then the average price for 5 consecutive days is:
average of (price of 2014/5/7,2014/5/5, 2014/5/2, 2014/4/30,2014/4/29),
The prices for 2014/5/6 and 2014/5/1 are excluded because trading volume was 0 or NA on those days.
EDIT on 11/9/2014
One thing to note: the announcement day for each stock is different, and it is not the last valid date in the data, so using tail when calculating the average is not appropriate.
Date Price Volume
2014/5/9 1.42 668000
2014/5/8 1.4 2972000
2014/5/7 1.5 1180000
2014/5/6 1.59 0
2014/5/5 1.59 752000
2014/5/2 1.6 138000
2014/5/1 1.6 NA
2014/4/30 1.6 656000
2014/4/29 1.61 364000
2014/4/28 1.61 1786000
2014/4/25 1.64 1734000
2014/4/24 1.68 1130000
2014/4/23 1.68 506000
2014/4/22 1.67 354000
2014/4/21 1.7 0
2014/4/18 1.7 0
2014/4/17 1.7 1954000
2014/4/16 1.65 1788000
2014/4/15 1.71 1294000
2014/4/14 1.68 1462000
Reproducible Code:
require(quantmod)
require(data.table)
tickers <- c("0007.hk","1036.hk")
date_begin <- as.Date("2010-01-01")
date_end <- as.Date("2014-09-09")
# retrieve data of all stocks
prices <- getSymbols(tickers, from = date_begin, to = date_end, auto.assign = TRUE)
dataset <- merge(Cl(get(prices[1])),Vo(get(prices[1])))
for (i in 2:length(prices)){
dataset <- merge(dataset, Cl(get(prices[i])),Vo(get(prices[i])))
}
# Write First
write.zoo(dataset, file = "prices.csv", sep = ",", qmethod = "double")
# Read zoo
test <- fread("prices.csv")
setnames(test, "Index", "Date")
Then I got a data.table. The first column is Date, followed by the price and volume for each stock.
Actually, the original data contains information for about 40 stocks. Column names follow the same pattern: "X" + ticker.close, "X" + ticker.volume
The last trading day differs between stocks.
The desired output :
days 0007.HK 1036.HK
5 1.1 1.1
10 1.1 1.1
30 1.1 1.1
The major issues:
.SD, lapply and .SDcols can be used to loop over different stocks. .N can be used when calculating the last N consecutive days.
Because the announcement day differs between stocks, it becomes a little complicated.
Any suggestions for a single stock using quantmod or for multiple stocks using data.table are extremely welcome!
Thanks GSee and pbible for the nice solutions; they were very useful. I'll update my code later to incorporate a different announcement day for each stock, and consult you later.
Indeed, it's more an xts question than a data.table one. Anything about data.table would be very helpful. Thanks a lot!
Because the different stocks have different announcement days, I tried to make a solution first following @pbible's logic; any suggestions are extremely welcome.
library(quantmod)
tickers <- c("0007.hk","1036.hk")
date_begin <- as.Date("2010-01-01")
# Instead of making one specific date_end, different date_end is used for convenience of the following work.
date_end <- c(as.Date("2014-07-08"),as.Date("2014-05-15"))
for ( i in 1: length(date_end)) {
stocks <- getSymbols(tickers[i], from = date_begin, to = date_end[i], auto.assign = TRUE)
dataset <- cbind(Cl(get(stocks)),Vo(get(stocks)))
usable <- subset(dataset,dataset[,2] > 0 & !is.na(dataset[,2]))
sma.5 <- SMA(usable[,1],5)
sma.10 <- SMA(usable[,1],10)
sma.30 <- SMA(usable[,1],30)
col <- as.matrix(rbind(tail(sma.5,1), tail(sma.10,1), tail(sma.30,1)))
colnames(col) <- colnames(usable[,1])
rownames(col) <- c("5","10","30")
if (i == 1) {
matrix <- as.matrix(col)
}
else {matrix <- cbind(matrix,col)}
}
I got what I want, but the code is ugly. Any suggestions to make it more elegant are extremely welcome!
Well, here's a way to do it. I don't know why you want to get rid of the loop, and this does not get rid of it (in fact it has a loop nested inside another). One thing that you were doing is growing objects in memory with each iteration of your loop (i.e. the matrix <- cbind(matrix,col) part is inefficient). This Answer avoids that.
library(quantmod)
tickers <- c("0007.hk","1036.hk")
date_begin <- as.Date("2010-01-01")
myEnv <- new.env()
date_end <- c(as.Date("2014-07-08"),as.Date("2014-05-15"))
lookback <- c(5, 10, 30) # different number of days to look back for calculating mean.
symbols <- getSymbols(tickers, from=date_begin,
to=tail(sort(date_end), 1), env=myEnv) # to=last date
end.dates <- setNames(date_end, symbols)
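out <- do.call(cbind, lapply(seq_along(end.dates), function(i) {
x <- end.dates[i] # subset by index so x keeps its name; lapply() directly over a Date vector may drop element names
dat <- na.omit(get(names(x), pos=myEnv))[paste0("/", x)]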
prc <- Cl(dat)[Vo(dat) > 0]
setNames(vapply(lookback, function(n) mean(tail(prc, n)), numeric(1)),
lookback)
}))
colnames(out) <- names(end.dates)
out
# 0007.HK 1036.HK
#5 1.080 8.344
#10 1.125 8.459
#30 1.186 8.805
Some commentary...
I created a new environment, myEnv, to hold your data so that it does not clutter your workspace.
I used the output of getSymbols (as you did in your attempt) because the input tickers are not uppercase.
I named the vector of end dates so that we can loop over that vector and know both the end date and the name of the stock.
The bulk of the code is an lapply loop (wrapped in do.call(cbind, ...)). I'm looping over the named end.dates vector.
The first line gets the data from myEnv, removes NAs, and subsets it to only include data up to the relevant end date.
The next line extracts the close column and subsets it to only include rows where volume is greater than zero.
The vapply loops over a vector of different lookbacks and calculates the mean. That is wrapped in setNames so that each result is named based on which lookback was used to calculate it.
The lapply call returns a list of named vectors. do.call(cbind, LIST) is the same as calling cbind(LIST[[1]], LIST[[2]], LIST[[3]]) except LIST can be a list of any length.
At this point we have a matrix with row names, but no column names. So, I named the columns based on which stock they represent.
Hope this helps.
How about something like this, using subset and a simple moving average (SMA)? Here is the solution I put together.
library(quantmod)
tickers <- c("0007.hk","1036.hk","cvx")
date_begin <- as.Date("2010-01-01")
date_end <- as.Date("2014-09-09")
stocks <- getSymbols(tickers, from = date_begin, to = date_end, auto.assign = TRUE)
stock3Summary <- function(stock){
dataset <- cbind(Cl(get(stock)),Vo(get(stock)))
usable <- subset(dataset,dataset[,2] > 0 & !is.na(dataset[,2]))
sma.5 <- SMA(usable[,1],5)
sma.10 <- SMA(usable[,1],10)
sma.30 <- SMA(usable[,1],30)
col <- as.matrix(rbind(tail(sma.5,1), tail(sma.10,1), tail(sma.30,1)))
colnames(col) <- colnames(usable[,1])
rownames(col) <- c("5","10","30")
col
}
matrix <- as.matrix(stock3Summary(stocks[1]))
for( i in 2:length(stocks)){
matrix <- cbind(matrix,stock3Summary(stocks[i]))
}
The output:
> matrix
X0007.HK.Close X1036.HK.Close CVX.Close
5 1.082000 8.476000 126.6900
10 1.100000 8.412000 127.6080
30 1.094333 8.426333 127.6767
This should work with multiple stocks. It will use only the most recent valid date.
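For a single stock with an announcement day that is not the last row, the calculation can also be sketched in base R on the sample rows from the question, with no download needed. This is only an illustration under the assumption that the data sits in a plain data frame with Date, Price and Volume columns. The helper name avg_before is hypothetical.

```r
# A subset of the sample rows from the question
px <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Date Price Volume
2014/5/9 1.42 668000
2014/5/8 1.4 2972000
2014/5/7 1.5 1180000
2014/5/6 1.59 0
2014/5/5 1.59 752000
2014/5/2 1.6 138000
2014/5/1 1.6 NA
2014/4/30 1.6 656000
2014/4/29 1.61 364000
2014/4/28 1.61 1786000
2014/4/25 1.64 1734000
")
px$Date <- as.Date(px$Date, format = "%Y/%m/%d")

# Mean closing price over the n most recent valid trading days up to and
# including the announcement day; days with volume 0 or NA are skipped.
# If fewer than n valid days exist, the mean uses whatever is available.
avg_before <- function(df, announce, n) {
  ok <- df$Date <= announce & !is.na(df$Volume) & df$Volume > 0
  valid <- df[ok, ]
  valid <- valid[order(valid$Date, decreasing = TRUE), ]
  mean(head(valid$Price, n))
}

announce <- as.Date("2014-05-07")
avg_before(px, announce, 5)   # 1.58, the mean of 1.5, 1.59, 1.6, 1.6, 1.61
```

With several stocks, the same function could be applied per stock with that stock's own announcement date, for example via Map over a list of data frames and a vector of dates.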
I have two dataframes, one which contains a timestamp and air_temperature
air_temp time_stamp
85.1 1396335600
85.4 1396335860
And another, which contains startTime, endTime, location coordinates, and a canonical name.
startTime endTime location.lat location.lon name
1396334278 1396374621 37.77638 -122.4176 Work
1396375256 1396376369 37.78391 -122.4054 Work
For each row in the first data frame, I want to identify which time range in the second data frame it lies in, i.e., if the timestamp 1396335600 is between the startTime 1396334278 and the endTime 1396374621, add the location and name values to the row in the first data frame.
The start and end time in the second data frame don't overlap, and are linearly increasing. However they are not perfectly continuous, so if the timestamp falls between two time bands, I need to mark the location as NA. If it does fit between the start and end times, I want to add the location.lat, location.lon, and name columns to the first data frame.
Appreciate your help.
Try this. Not tested.
newdata <- data2[data1$time_stamp>=data2$startTime & data1$time_stamp<=data2$endTime ,3:5]
data1 <- cbind(data1[data1$time_stamp>=data2$startTime & data1$time_stamp<=data2$endTime,],newdata)
This won't return any rows where time_stamp isn't between startTime and endTime, so the returned dataset could be shorter than the original. To keep the two pieces the same length, I subset data1 with the same TRUE/FALSE vector as data2. Note that the comparison is elementwise, so this only works when row i of data1 is meant to be checked against row i of data2.
Interesting problem... Turned out to be more complicated than I originally thought!!
Step1: Set up the data!
DF1 <- read.table(text="air_temp time_stamp
85.1 1396335600
85.4 1396335860",header=TRUE)
DF2 <- read.table(text="startTime endTime location.lat location.lon name
1396334278 1396374621 37.77638 -122.4176 Work
1396375256 1396376369 37.78391 -122.4054 Work",header=TRUE)
Step2: For each time_stamp in DF1 compute appropriate index in DF2:
index <- sapply(DF1$time_stamp,
function(i) {
dec <- which(i >= DF2$startTime & i <= DF2$endTime)
ifelse(length(dec) == 0, NA, dec)
}
)
index
Step3: Merge the two data frames:
DF1 <- cbind(DF1,DF2[index,3:5])
row.names(DF1) <- 1:nrow(DF1)
DF1
Hope this helps!!
rowidx <- sapply(dfrm1$time_stamp, function(x)
which( dfrm2$startTime <= x & dfrm2$endTime >= x)[1]) # [1] yields NA when no range matches
cbind(dfrm1$time_stamp, dfrm2[ rowidx, c("location.lat","location.lon","name")])
Mine's not tested either and looks substantially similar to CCurtis's, so give him the check if it works.
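Because the question says the time bands are sorted and do not overlap, the lookup can also be done without a per-row loop using findInterval. This is a sketch under those two assumptions. The third row of DF1 is made up to show a timestamp that falls in the gap between the two bands.

```r
DF1 <- data.frame(
  air_temp   = c(85.1, 85.4, 90.0),
  time_stamp = c(1396335600, 1396335860, 1396375000)  # third value is made up
)
DF2 <- data.frame(
  startTime    = c(1396334278, 1396375256),
  endTime      = c(1396374621, 1396376369),
  location.lat = c(37.77638, 37.78391),
  location.lon = c(-122.4176, -122.4054),
  name         = c("Work", "Work"),
  stringsAsFactors = FALSE
)

# Index of the last band that starts at or before each timestamp (0 if none)
idx <- findInterval(DF1$time_stamp, DF2$startTime)
idx[idx == 0] <- NA

# Timestamps past the end of their candidate band fall in a gap
in_gap <- !is.na(idx) & DF1$time_stamp > DF2$endTime[idx]
idx[in_gap] <- NA

# Indexing with NA yields a row of NAs, which is what the question asks for
result <- cbind(DF1, DF2[idx, c("location.lat", "location.lon", "name")])
rownames(result) <- NULL
result
```

findInterval uses a binary search over startTime, so this should scale better than testing every band for every timestamp.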
I'm stuck on a problem calculating travel dates. I have a data frame of departure dates and return dates.
Departure Return
1 7/6/13 8/3/13
2 7/6/13 8/3/13
3 6/28/13 8/7/13
I want to create and pass a function that will take these dates and form a list of all the days away. I can do this individually by turning each column into dates.
## Turn the departure and return dates into a readable format
dept_dates <- as.Date(travelDates$Departure, format = "%m/%d/%y")
retn_dates <- as.Date(travelDates$Return, format = "%m/%d/%y")
travel_dates <- na.omit(data.frame(dept_dates,retn_dates))
seq(from = travel_dates[1,1], to = travel_dates[1,2], by = 1)
This gives me [1] "2013-07-06" "2013-07-07"... and so on. I want to scale to cover the whole data frame, but my attempts have failed.
Here's one that I thought might work.
days_abroad <- data.frame()
get_days <- function(x,y){
all_days <- seq(from = x, to = y, by =1)
c(days_abroad, all_days)
return(days_abroad)
}
get_days(travel_dates$dept_dates, travel_dates$retn_dates)
I get this error:
Error in seq.Date(from = x, to = y, by = 1) : 'from' must be of length 1
There's probably a lot wrong with this, but what I would really like help on is how to run multiple dates through seq().
Sorry if this is simple (I'm still learning to think in R), and sorry too for any breaches in etiquette. Thank you.
EDIT: updated as per OP comment.
How about this:
travel_dates[] <- lapply(travel_dates, as.Date, format="%m/%d/%y")
dts <- with(travel_dates, mapply(seq, Departure, Return, by="1 day"))
This produces a list with as many items as you had rows in your initial table. You can then summarize (this will be data.frame with the number of times a date showed up):
data.frame(count=sort(table(Reduce(append, dts)), decreasing=T))
# count
# 2013-07-06 3
# 2013-07-07 3
# 2013-07-08 3
# 2013-07-09 3
# ...
OLD CODE:
The following gets the number of days of each trip, rather than a list with the dates.
transform(travel_dates, days_away=Return - Departure + 1)
Which produces:
# Departure Return days_away
# 1 2013-07-06 2013-08-03 29 days
# 2 2013-07-06 2013-08-03 29 days
# 3 2013-06-28 2013-08-07 41 days
If you want to put days_away in a separate list, that is trivial, though it seems more useful to have it as an additional column to your data frame.
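On the original error ('from' must be of length 1): seq() builds one sequence at a time, so it has to be called once per row. Here is a small self-contained sketch of that, looping over row indices. The names days_by_trip, all_days and trip_lengths are just illustrative.

```r
# The three sample trips from the question
travel_dates <- data.frame(
  Departure = c("7/6/13", "7/6/13", "6/28/13"),
  Return    = c("8/3/13", "8/3/13", "8/7/13"),
  stringsAsFactors = FALSE
)
travel_dates[] <- lapply(travel_dates, as.Date, format = "%m/%d/%y")

# One call to seq() per row, each with a length-1 'from' and 'to'
days_by_trip <- lapply(seq_len(nrow(travel_dates)), function(i) {
  seq(travel_dates$Departure[i], travel_dates$Return[i], by = "day")
})

# Flatten into a single Date vector, and count the days per trip
all_days     <- do.call(c, days_by_trip)
trip_lengths <- lengths(days_by_trip)
trip_lengths   # 29 29 41
```

Indexing by row keeps each element a proper Date, and do.call(c, ...) combines the pieces without losing the Date class.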