How to logical compare two data frames in R - r

library(data.table)
library(QuantTools)
date_from <- '2018-11-01'
date_to <- '2018-11-30'
ticker <- 'SPFB.RTS'
# get days
dataDaily <- get_finam_data(ticker, date_from, date_to, 'day')
# get hours
dataHourly <- get_finam_data(ticker, date_from, date_to, 'hour')
# percent change of the day
dataDaily$pc <- ((dataDaily$close - dataDaily$open)/dataDaily$open)*100
# mark days with > 2 percent change
dataDaily$isBigCh <- dataDaily$pc[dataDaily$pc > 2]
So, I have a code above which downloads a daily/hourly OHLC data of the futures.
Questions:
1) How can I move the marks from dataDaily$isBigCh to dataHourly? It seems not easy because these data frames have different time formats and different lengths of rows.
dataHourly$time # has a format like this 2018-11-09 23:00:00
dataDaily$date # has a format like this 2018-11-09
2) How can I select the first bar of the day in dataHourly$time?

Slightly modified code for readability
# percent change of the day
dataDaily[, price_change := ( close / open - 1 ) * 100 ]
# mark days with > 2 percent change
dataDaily[, isBigCh := price_change > 2 ]
Question 1
# add date column to hourly data
# note that 00:00 time corresponds to 23:00-00:00 candle
dataHourly[, date := as.Date( time - as.difftime( '01:00:00' ) ) ]
# copy dataDaily isBigCh to dataHourly isBigChDaily
dataHourly[ dataDaily, isBigChDaily := isBigCh, on = 'date' ]
Question 2
# select first bar of the day
dataHourly[, .SD[1], by = date ]
Optionally
# remove date column from hourly data
dataHourly[, date := NULL ]
Note
library(data.table) not necessary as QuantTools loads it automatically
please read data.table manual it will save you lots of time trying to figure out simple manipulations similar to what you asked

Related

`DT[, function(.SD), by = ID]` behaves differently from `function(DT[ID %in% ID_GROUP])`

I am working with Geolife Trajectories 1.3 dataset (https://www.microsoft.com/en-us/download/confirmation.aspx?id=52367).
It contains bunch of folders, where each folder is separate user.
Each user have few separate .plt files with GPS coord and DATE-TIME info.
Some users have file with labels - time intervals, and transportation type
taken by user (airplane, car, etc)
I created two datasets, first contain all users ID's, DATE-TIMES's and other
info, irrelevant for now:
first dataset with users ID's and DATE's:
ID DATE
20 2007-04-29 08:34:32
... ...
100 2007-04-29 12:35:04
second contains all user ID's, StartTIME's, EndTime's and Transportation type:
ID Start.Time End.Time Transportation
1: 21 2007/04/29 12:34:24 2007/04/29 12:53:45 taxi
2: 21 2007/04/29 22:27:11 2007/04/30 04:28:00 car
...
From 'StartTIME, EndTime' columns of second dataset I created dataset with lubridate intervals:
2007-04-29 12:34:24 UTC--2007-04-29 12:53:45 UTC
...
2007-04-29 22:27:11 UTC--2007-04-30 04:28:00 UTC
Than I wrote 2 functions:
# function for single row label processing
# will search row's DATE in a subset of intervals for current ID
# if TRUE - will search for a label in a subset of labels for current ID
get_label <- function(id, date, labels_subset, interval_subset) {
# convert date to POSIX time
single_time <- as.POSIXct(date)
# search for current time in intervals subset and get label
result <- labels_subset[single_time %within% interval_subset]$Transportation
# check for result, if there is none -> return NA
if (identical(as.vector(result), character(0))) {
# "is type 'character' but expecting type 'logical'. Column types must be
# consistent for each group." will raise if `return(NA)` without `as.char`
return(as.character(NA))
} else {
return(as.character(result))
}
}
and
# function for ID subset label processing
# will create a subset of intervals for current ID
# will create a subset of labels for current ID
get_group <- function(tab) {
# grep ID
id <- tab$ID[1]
# create interval subset for ID
interval_subset <- intervals[labels_d$ID == id]
# create label subset for ID
labels_subset <- labels_d[labels_d$ID == id]
# pass all data for get_label function -- process `tab` by row
tab[, get_label(as.integer(ID), as.character(DATE), labels_subset, interval_subset), 1:nrow(tab)]
}
I want to get a vector with lables if DATE are in some lubridate interval and
NA if it is not in any lubridate interval for current ID.
And tmp <- get_group(dt[ID %in% c(21, 110)]) works:
> unique(tmp$V1)
[1] NA "car" "walk"
But tmp <- dt[, get_group(.SD), by = ID] does not work properly, it outputs only NA's (and dt have only two ID's -- 21 and 110):
> unique(tmp$V1)
[1] NA
Even if I create DT with only one ID, function(DT) works and DT[,function(.SD), by = ID] does not:
tmp<- DT[ID==21]
unique(tmp[, get_group(.SD), by = ID]$V1)
>[1] NA
unique(get_group(tmp)$V1)
>[1] NA "car" "walk"
Why, what I am doing wrong?
UPD:
I should have printed .SD earlier.
By default, R does not pass by= argument into .SD, so my function could not achieve an ID. Sadly, there is no standard warning about that.
.SDcols did the trick:
tmp[, get_group(.SD), by = ID, .SDcols=c('ID', 'DATE')]
You can do a data.table non-equi join as follows:
ds2[ds1, on=.(ID, Start.Time <= DATE, End.Time >= DATE)]

R how to avoid a loop. Counting weekends between two dates in a row for each row in a dataframe

I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates each of which are in their own column (5 & 7 in this example code). I need to repeat this process for each row of my dataframe. The end results will be one column that represents the number of Saturdays and Sundays within the date range defined by two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1,5],Date2[1,7],"days")) %in% c("Saturday",'Sunday')*1))
The answer to this will be 3. But, if I take out the "1" in the row position of date1 and date2 I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and have one vector that lists the number of Saturdays and Sundays between the two dates in column 5 and 7 without using a loop? Another issue is that I have 2 million rows and am looking for something with a little more speed than a loop.
Thank you!!
map2* functions from the purrr package will be a good way to go. They take two vector inputs (eg two date columns) and apply a function in parallel. They're pretty fast too (eg previous post)!
Here's an example. Note that the _int requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday",'Sunday'))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with to assign the columns in the dataframe directly.
Essentially, use a reference date, calculate the number of full weeks (by floor or ceiling). Then take the difference between the two. The code does not include cases in which the start date or end data fall on Saturday or Sunday.
# weekdays(as.Date(0,"1970-01-01")) -> "Friday"
require(dplyr)
startDate = as.Date(0,"1970-01-01") # this is a friday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start,"", format = "%Y-%m-%d", origin="1970-01-01")
df$end <- as.Date(df$end, format = "%Y-%m-%d","1970-01-01")
# you can use with to define the columns directly instead of %>%
df <- df %>%
mutate(originDate = startDate) %>%
mutate(startDayDiff = as.numeric(start-originDate), endDayDiff = as.numeric(end-originDate)) %>%
mutate(startWeekDiff = floor(startDayDiff/7),endWeekDiff = floor(endDayDiff/7)) %>%
mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7>=1,1,0),
NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7>=2,1,0),
NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 1,1,0),
NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2,1,0)
) %>%
mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
Dates are number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays or Sundays since that date
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
As follows
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2

Number of overlapping intervals over time

Let's say I have a set of, partly overlapping, intervals
require(lubridate)
date1 <- as.POSIXct("2000-03-08 01:59:59")
date2 <- as.POSIXct("2001-02-29 12:00:00")
date3 <- as.POSIXct("1999-03-08 01:59:59")
date4 <- as.POSIXct("2002-02-29 12:00:00")
date5 <- as.POSIXct("2000-03-08 01:59:59")
date6 <- as.POSIXct("2004-02-29 12:00:00")
int1 <- new_interval(date1, date2)
int2 <- new_interval(date3, date4)
int3 <- new_interval(date5, date6)
Does anyone have an idea how one could construct a time series plot that provides, for every point in time, the number of overlapping intervals at that point?
So, for instance, to take the above example: For a given date in January 2000, the function I'm looking for would return the value "1" (the date is only within int2) while for a date in January 2001, it would return "3" (since that date is within int1, int2 and int3). Etc.
Any ideas?
Here's one way using foverlaps() function using data.table package:
Please install the development version 1.9.5 by following the installation instructions as a bug that affects overlap joins on numeric types has been fixed there.
require(data.table) ## 1.9.5+
intervals = data.table(start = c(date1, date3, date5),
end = c(date2, date4, date6))
# assuming your query is:
query = as.POSIXct(c("2000-01-01 00:00:00", "2001-01-01 00:00:00"))
We'll construct the query data.table with both start and end intervals as well:
querydt = data.table(start=query, end=query) # identical start,end
Then we can use foverlaps() as follows:
setkeyv(intervals, c("start", "end"))
ans = foverlaps(querydt, intervals, which=TRUE, nomatch=0L, type="within")
# xid yid
# 1: 1 1
# 2: 2 1
# 3: 2 2
# 4: 2 3
We first set key - which sorts the data.table intervals by the columns provided, in increasing order, and marks those columns as the key columns on which we want to perform the overlap join.
Then we use foverlaps() to find which intervals in querydt overlaps (falls type=within) with intervals. In this case, querydt consists of just points as start and end points are identical. This returns all matching indices (nomatch=0L removes all rows with no matches and which=TRUE returns indices instead of merged result) for those rows in querydt that falls within intervals.
Now all we have to do is to aggregate by xid and count the number of observations to get the count:
ans[, .N, by=xid]
# xid N
# 1: 1 1
# 2: 2 3
Check ?foverlaps for more info.

Data frame of departure and return dates, how do I get a list of all dates away?

I'm stuck on a problem calculating travel dates. I have a data frame of departure dates and return dates.
Departure Return
1 7/6/13 8/3/13
2 7/6/13 8/3/13
3 6/28/13 8/7/13
I want to create and pass a function that will take these dates and form a list of all the days away. I can do this individually by turning each column into dates.
## Turn the departure and return dates into a readable format
Dept <- as.Date(travelDates$Dept, format = "%m/%d/%y")
Retn <- as.Date(travelDates$Retn, format = "%m/%d/%y")
travel_dates <- na.omit(data.frame(dept_dates,retn_dates))
seq(from = travel_dates[1,1], to = travel_dates[1,2], by = 1)
This gives me [1] "2013-07-06" "2013-07-07"... and so on. I want to scale to cover the whole data frame, but my attempts have failed.
Here's one that I thought might work.
days_abroad <- data.frame()
get_days <- function(x,y){
all_days <- seq(from = x, to = y, by =1)
c(days_abroad, all_days)
return(days_abroad)
}
get_days(travel_dates$dept_dates, travel_dates$retn_dates)
I get this error:
Error in seq.Date(from = x, to = y, by = 1) : 'from' must be of length 1
There's probably a lot wrong with this, but what I would really like help on is how to run multiple dates through seq().
Sorry, if this is simple (I'm still learning to think in r) and sorry too for any breaches in etiquette. Thank you.
EDIT: updated as per OP comment.
How about this:
travel_dates[] <- lapply(travel_dates, as.Date, format="%m/%d/%y")
dts <- with(travel_dates, mapply(seq, Departure, Return, by="1 day"))
This produces a list with as many items as you had rows in your initial table. You can then summarize (this will be data.frame with the number of times a date showed up):
data.frame(count=sort(table(Reduce(append, dts)), decreasing=T))
# count
# 2013-07-06 3
# 2013-07-07 3
# 2013-07-08 3
# 2013-07-09 3
# ...
OLD CODE:
The following gets the #days of each trip, rather than a list with the dates.
transform(travel_dates, days_away=Return - Departure + 1)
Which produces:
# Departure Return days_away
# 1 2013-07-06 2013-08-03 29 days
# 2 2013-07-06 2013-08-03 29 days
# 3 2013-06-28 2013-08-07 41 days
If you want to put days_away in a separate list, that is trivial, though it seems more useful to have it as an additional column to your data frame.

Problems adding a month to X using POSIXlt in R - need to reset value using as.Date(X)

This works for me in R:
# Setting up the first inner while-loop controller, the start of the next water year
NextH2OYear <- as.POSIXlt(firstDate)
NextH2OYear$year <- NextH2OYear$year + 1
NextH2OYear<-as.Date(NextH2OYear)
But this doesn't:
# Setting up the first inner while-loop controller, the start of the next water month
NextH2OMonth <- as.POSIXlt(firstDate)
NextH2OMonth$mon <- NextH2OMonth$mon + 1
NextH2OMonth <- as.Date(NextH2OMonth)
I get this error:
Error in as.Date.POSIXlt(NextH2OMonth) :
zero length component in non-empty POSIXlt structure
Any ideas why? I need to systematically add one year (for one loop) and one month (for another loop) and am comparing the resulting changed variables to values with a class of Date, which is why they are being converted back using as.Date().
Thanks,
Tom
Edit:
Below is the entire section of code. I am using RStudio (version 0.97.306). The code below represents a function that is passed an array of two columns (Date (CLass=Date) and Discharge Data (Class=Numeric) that are used to calculate the monthly averages. So, firstDate and lastDate are class Date and determined from the passed array. This code is adapted from successful code that calculates the yearly averages - there maybe one or two things I still need to change over, but I am prevented from error checking later parts due to the early errors I get in my use of POSIXlt. Here is the code:
MonthlyAvgDischarge<-function(values){
#determining the number of values - i.e. the number of rows
dataCount <- nrow(values)
# Determining first and last dates
firstDate <- (values[1,1])
lastDate <- (values[dataCount,1])
# Setting up vectors for results
WaterMonths <- numeric(0)
class(WaterMonths) <- "Date"
numDays <- numeric(0)
MonthlyAvg <- numeric(0)
# while loop variables
loopDate1 <- firstDate
loopDate2 <- firstDate
# Setting up the first inner while-loop controller, the start of the next water month
NextH2OMonth <- as.POSIXlt(firstDate)
NextH2OMonth$mon <- NextH2OMonth$mon + 1
NextH2OMonth <- as.Date(NextH2OMonth)
# Variables used in the loops
dayCounter <- 0
dischargeTotal <- 0
dischargeCounter <- 1
resultsCounter <- 1
loopCounter <- 0
skipcount <- 0
# Outer while-loop, controls the progression from one year to another
while(loopDate1 <= lastDate)
{
# Inner while-loop controls adding up the discharge for each water year
# and keeps track of day count
while(loopDate2 < NextH2OMonth)
{
if(is.na(values[resultsCounter,2]))
{
# Skip this date
loopDate2 <- loopDate2 + 1
# Skip this value
resultsCounter <- resultsCounter + 1
#Skipped counter
skipcount<-skipcount+1
} else{
# Adding up discharge
dischargeTotal <- dischargeTotal + values[resultsCounter,2]
}
# Adding a day
loopDate2 <- loopDate2 + 1
#Keeping track of days
dayCounter <- dayCounter + 1
# Keeping track of Dicharge position
resultsCounter <- resultsCounter + 1
}
# Adding the results/water years/number of days into the vectors
WaterMonths <- c(WaterMonths, as.Date(loopDate2, format="%mm/%Y"))
numDays <- c(numDays, dayCounter)
MonthlyAvg <- c(MonthlyAvg, round((dischargeTotal/dayCounter), digits=0))
# Resetting the left hand side variables of the while-loops
loopDate1 <- NextH2OMonth
loopDate2 <- NextH2OMonth
# Resetting the right hand side variable of the inner while-loop
# moving it one year forward in time to the next water year
NextH2OMonth <- as.POSIXlt(NextH2OMonth)
NextH2OMonth$year <- NextH2OMonth$Month + 1
NextH2OMonth<-as.Date(NextH2OMonth)
# Resettting vraiables that need to be reset
dayCounter <- 0
dischargeTotal <- 0
loopCounter <- loopCounter + 1
}
WaterMonths <- format(WaterMonthss, format="%mm/%Y")
# Uncomment the line below and return AvgAnnualDailyAvg if you want the water years also
# AvgAnnDailyAvg <- data.frame(WaterYears, numDays, YearlyDailyAvg)
return((MonthlyAvg))
}
Same error occurs in regular R. When doing it line by line, its not a problem, when running it as a script, it it.
Plain R
seq(Sys.Date(), length = 2, by = "month")[2]
seq(Sys.Date(), length = 2, by = "year")[2]
Note that this works with POSIXlt too, e.g.
seq(as.POSIXlt(Sys.Date()), length = 2, by = "month")[2]
mondate.
library(mondate)
now <- mondate(Sys.Date())
now + 1 # date in one month
now + 12 # date in 12 months
Mondate is bit smarter about things like mondate("2013-01-31")+ 1 which gives last day of February whereas seq(as.Date("2013-01-31"), length = 2, by = "month")[2] gives March 3rd.
yearmon If you don't really need the day part then yearmon may be preferable:
library(zoo)
now.ym <- yearmon(Sys.Date())
now.ym + 1/12 # add one month
now.ym + 1 # add one year
ADDED comment on POSIXlt and section on yearmon.
Here is you can add 1 month to a date in R, using package lubridate:
library(lubridate)
x <- as.POSIXlt("2010-01-31 01:00:00")
month(x) <- month(x) + 1
>x
[1] "2010-03-03 01:00:00 PST"
(note that it processed the addition correctly, as 31st of Feb doesn't exist).
Can you perhaps provide a reproducible example? What's in firstDate, and what version of R are you using? I do this kind of manipulation of POSIXlt dates quite often and it seems to work:
Sys.Date()
# [1] "2013-02-13"
date = as.POSIXlt(Sys.Date())
date$mon = date$mon + 1
as.Date(date)
# [1] "2013-03-13"

Resources