R - Relational Operators and Vectorization

I have a vector of times when people scanned a badge. I have another set of times that are 'measurement points'.
scans = structure(c(1388570120, 1388572119, 1388575229, 1388577402, 1388580457, 1388583364, 1388586817, 1388589929, 1388593054, 1388599025), class = c("POSIXct", "POSIXt"), tzone = "UTC")
points = as.POSIXct(9*3600,"UTC",origin="2014-01-01")+seq(0,10*3600,3600)
What I want to do is count how many scans are greater than (or equal to) each of the points.
sum(scans >= points[1])
#> [1] 10
This works one at a time and can easily be converted to a for loop or an lapply
lapply(points,function(x){sum(scans >= x)})
However, I cannot simply use scans >= points and get a list back where all of scans is compared to points element by element.
Is there a way in R to compare one entire vector to each element of another vector without using a looping construct (so the result is identical to the lapply example above, except possibly in structure)? What I actually have is a list of vectors of scans which I'm already going to be lapplying through, and I'm hoping there's a way to avoid nested looping in R.

You can do
colSums(outer(scans,points,'>='))
I can't guarantee that the intermediate matrix would fit into memory though.
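If memory is a concern, a minimal non-matrix sketch keeps one comparison per point and simplifies the result to a vector (essentially your lapply wrapped in vapply; the expected counts follow from the data.table output below):
# one count per element of points, no length(scans) x length(points) matrix
vapply(points, function(x) sum(scans >= x), integer(1))
#> [1] 10  9  8  6  5  4  3  2  1  0  0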

You can do the following with the development version of data.table (non-equi joins, available in data.table 1.9.8+):
library(data.table)
dt1 = data.table(scans)
dt2 = data.table(points)
dt1[dt2, on = .(scans >= points), .N, by = .EACHI]
# scans N
# 1: 2014-01-01 09:00:00 10
# 2: 2014-01-01 10:00:00 9
# 3: 2014-01-01 11:00:00 8
# 4: 2014-01-01 12:00:00 6
# 5: 2014-01-01 13:00:00 5
# 6: 2014-01-01 14:00:00 4
# 7: 2014-01-01 15:00:00 3
# 8: 2014-01-01 16:00:00 2
# 9: 2014-01-01 17:00:00 1
#10: 2014-01-01 18:00:00 0
#11: 2014-01-01 19:00:00 0
This should be much more memory-efficient than building the full outer product.
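If you only need the counts as a plain vector (matching the lapply result), you can extract the N column from the same join:
# same counts as the lapply, as a plain integer vector
dt1[dt2, on = .(scans >= points), .N, by = .EACHI]$N
# [1] 10  9  8  6  5  4  3  2  1  0  0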

Related

Convert timestamps in Google Finance stock data to proper datetime

I am trying to convert the timestamps in the stock data from Google Finance API to a more usable datetime format.
I have used data.table::fread to read the data here:
fread(<url>)
datetime open high low close volume
1: a1497619800 154.230 154.2300 154.2300 154.2300 500
2: 1 153.720 154.3200 153.7000 154.2500 1085946
3: 2 153.510 153.8000 153.2000 153.7700 34882
4: 3 153.239 153.4800 153.1400 153.4800 24343
5: 4 153.250 153.3000 152.9676 153.2700 20212
As you can see, the "datetime" format is rather strange. The format is described in this link:
The full timestamps are denoted by the leading 'a'. Like this: a1092945600. The number after the 'a' is a Unix timestamp. [...]
The numbers without a leading 'a' are "intervals". So, for example, the second row in the data set below has an interval of 1. You can multiply this number by our interval size [...] and add it to the last Unix Timestamp.
In my case, the "interval size" is 300 seconds (5 minutes). This format is restarted at the start of each new day and so trying to format it is quite difficult!
I can pull out the index positions of where each day starts by using grep and searching for "a":
newDay <- grep(df$V1, pattern = "a")
Then my idea was to split the dataframe into chunks based on those index positions, expand the Unix times within each day separately, and then combine the chunks back into a data.table before storing.
data.table::split looks like it will do the job, but I am unsure how to supply it the day breaks so that it splits by index position, or whether there is a more logical way to achieve the same result without having to break the data down day by day.
Thanks.
You may use grepl to search for "a" in "datetime", which results in a boolean vector. cumsum the boolean to create a grouping variable - for each "a" (TRUE), the counter will increase by one.
Within each group, convert the first element to POSIXct, using an appropriate format and origin (and timezone, tz?). Add multiples of the 'interval size' (300 sec), using zero for the first element and the "datetime" multiples for the others.
d[ , time := {
    t1 <- as.POSIXct(datetime[1], format = "a%s", origin = "1970-01-01")
    .(t1 + c(0, as.numeric(datetime[-1]) * 300))
  }
  , by = .(cumsum(grepl("^a", datetime)))]
d
# datetime time
# 1: a1497619800 2017-06-16 15:30:00
# 2: 1 2017-06-16 15:35:00
# 3: 2 2017-06-16 15:40:00
# 4: 3 2017-06-16 15:45:00
# 5: 4 2017-06-16 15:50:00
# 6: a1500000000 2017-07-14 04:40:00
# 7: 3 2017-07-14 04:55:00
# 8: 5 2017-07-14 05:05:00
# 9: 7 2017-07-14 05:15:00
Some toy data:
d <- fread(input = "datetime
a1497619800
1
2
3
4
a1500000000
3
5
7")
With:
DT[grep('^a', date), datetime := as.integer(gsub('\\D+','',date))
][, datetime := zoo::na.locf(datetime)
][nchar(date) < 4, datetime := datetime + (300 * as.integer(date))
][, datetime := as.POSIXct(datetime, origin = '1970-01-01', tz = 'America/New_York')][]
you get:
date close high low open volume datetime
1: a1500298200 153.57 153.7100 153.57 153.5900 1473 2017-07-17 09:30:00
2: 1 153.51 153.8700 153.33 153.7500 205057 2017-07-17 09:35:00
3: 2 153.49 153.7800 153.34 153.5800 70023 2017-07-17 09:40:00
4: 3 153.68 153.7300 153.42 153.5400 53050 2017-07-17 09:45:00
5: 4 153.06 153.7500 153.06 153.7200 120899 2017-07-17 09:50:00
---
2348: 937 143.94 144.0052 143.91 143.9917 36651 2017-08-25 15:40:00
2349: 938 143.90 143.9958 143.90 143.9400 40769 2017-08-25 15:45:00
2350: 939 143.94 143.9500 143.87 143.8900 56616 2017-08-25 15:50:00
2351: 940 143.97 143.9700 143.89 143.9400 56381 2017-08-25 15:55:00
2352: 941 143.74 143.9700 143.74 143.9655 179811 2017-08-25 16:00:00
Used data:
DT <- fread('https://www.google.com/finance/getprices?i=300&p=30d&f=d,t,o,h,l,c,v&df=cpct&q=IBM', skip = 7, header = FALSE)
setnames(DT, 1:6, c('date','close','high','low','open','volume'))

Alter values in one data frame based on comparison values in another in R

I am trying to subtract one hour from date/times within a POSIXct column that are earlier than or equal to a time stated in a different comparison dataframe for that particular ID.
For example:
#create sample data
Time<-as.POSIXct(c("2015-10-02 08:00:00","2015-11-02 11:00:00","2015-10-11 10:00:00","2015-11-11 09:00:00","2015-10-24 08:00:00","2015-10-27 08:00:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,01,02,02,03,03)
data<-data.frame(Time,ID)
Which produces this:
Time ID
1 2015-10-02 08:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 10:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 08:00:00 3
6 2015-10-27 08:00:00 3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the particular ID it is associated with. If the Time value in data is earlier than or equal to the comparison value one hour should be subtracted from the value in data:
#create sample comparison data
Comparison<-as.POSIXct(c("2015-10-29 08:00:00","2015-11-02 08:00:00","2015-10-26 08:30:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,02,03)
ComparisonData<-data.frame(Comparison,ID)
This should look like this:
Comparison ID
1 2015-10-29 08:00:00 1
2 2015-11-02 08:00:00 2
3 2015-10-26 08:30:00 3
In summary, the code should check all times of a certain ID to see if any are earlier than or equal to the value specified in ComparisonData and if they are, subtract one hour. This should give this data frame as an output:
Time ID
1 2015-10-02 07:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 09:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 07:00:00 3
6 2015-10-27 08:00:00 3
I have looked at similar solutions such as this, but I cannot understand how to also check the times against the correct comparison time for that particular ID.
ddply seems like a promising option, but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID and then just modify the Times that are less than or equal to Comparison:
library(data.table) # v1.9.6+
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
Alternatively, we could do this in one step while joining, using ifelse (though I'm not sure how efficient this is):
setDT(data)[ComparisonData,
            Time := ifelse(Time <= i.Comparison,
                           Time - 3600L, Time),
            on = "ID"]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
I am sure there is a better solution than this; however, I think this works.
for(i in 1:nrow(data)) {
  if(data$Time[i] <= ComparisonData[data$ID[i], 1]) {
    data$Time[i] <- data$Time[i] - 3600
  }
}
# Time ID
#1 2015-10-02 07:00:00 1
#2 2015-11-02 11:00:00 1
#3 2015-10-11 09:00:00 2
#4 2015-11-11 09:00:00 2
#5 2015-10-24 07:00:00 3
#6 2015-10-27 08:00:00 3
This iterates through every row in data.
ComparisonData[data$ID[i], 1] looks up the comparison time for the corresponding ID (which works here because the IDs match the row numbers of ComparisonData). If that comparison time is greater than or equal to the Time in data, the time is reduced by 1 hour.
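A loop-free base R sketch of the same idea (assuming each ID occurs exactly once in ComparisonData) looks up each row's comparison time with match and subtracts the hour only where the test is true:
# one comparison time per row of data, matched by ID
cmp <- ComparisonData$Comparison[match(data$ID, ComparisonData$ID)]
# TRUE counts as 1, so only qualifying rows lose 3600 seconds
data$Time <- data$Time - 3600 * (data$Time <= cmp)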

R: sequence of days between dates

I have the following dataframes:
AllDays
2012-01-01
2012-01-02
2012-01-03
...
2015-08-18
Leases
StartDate EndDate
2012-01-01 2013-01-01
2012-05-07 2013-05-06
2013-09-05 2013-12-01
What I want to do is, for each date in the allDays dataframe, calculate the number of leases that are in effect. e.g. if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in that dataframe.
I have the following code
for (i in 1:nrow(leases))
{
  occupied = seq(leases$StartDate[i], leases$EndDate[i], by = "days")
  occupied = occupied[occupied < dateOfInt]
  matching = match(occupied, allDays$Date)
  allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
}
which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time?
Date of interest is just the current date and is used simply to ensure that it doesn't count lease dates in the future.
Using seq is almost surely inefficient--imagine you had a lease in your data that's 10000 years long. seq will take forever and return 10000*365-1 days that don't matter to us. We then have to use %in% which also makes the same number of unnecessary comparisons.
I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution) but it gets closer to the heart of the problem.
Data
set.seed(102349)
days <- data.frame(AllDays = seq(as.Date("2012-01-01"),
                                 as.Date("2015-08-18"), "day"))
leases <- data.frame(StartDate = sample(days$AllDays, 5000L, T))
leases$EndDate <- leases$StartDate + round(rnorm(5000, mean = 365, sd = 100))
Approach
Use data.table and sapply:
library(data.table)
setDT(leases); setDT(days)
days[, lease_count :=
       sapply(AllDays, function(x)
         leases[StartDate <= x & EndDate >= x, .N])][]
AllDays lease_count
1: 2012-01-01 5
2: 2012-01-02 8
3: 2012-01-03 11
4: 2012-01-04 16
5: 2012-01-05 18
---
1322: 2015-08-14 1358
1323: 2015-08-15 1358
1324: 2015-08-16 1360
1325: 2015-08-17 1363
1326: 2015-08-18 1359
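For what it's worth, a fully vectorized sketch in the spirit hinted at above (a sketch only, assuming Date columns; the column name lease_count2 is just for illustration) counts, for each day, the leases started on or before that day minus those that had already ended before it:
# leases in effect on day d = #(StartDate <= d) - #(EndDate < d)
days[, lease_count2 := findInterval(AllDays, sort(leases$StartDate)) -
                       findInterval(AllDays - 1, sort(leases$EndDate))]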
This is exactly the problem where foverlaps shines: subsetting a data.frame based upon another data.frame (foverlaps seems to be tailored for that purpose).
Based on #MichaelChirico's data.
setkey(days[, AllDays1 := AllDays], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
#    user  system elapsed
#   0.114   0.018   0.136
# #MichaelChirico's approach
#    user  system elapsed
#   0.909   0.000   0.907
Here is a brief explanation of how it works by #Arun, which got me started with data.table.
Without your data, I can't test whether or not this is faster, but it gets the job done with less code:
for (i in 1:nrow(AllDays)) AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date & AllDays$AllDays[i] <= Leases$End.Date)
I used the following to test it; note that the relevant columns in both data frames are formatted as dates:
AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))
An alternative approach, but I'm not sure it's faster.
library(lubridate)
library(dplyr)
AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))
Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
                   end = c("2012-02-05","2012-04-15","2012-07-11"))
# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)
# create the range id
Lease$id = 1:nrow(Lease)
AllDays
# dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03
Lease
# start end id
# 1 2012-01-03 2012-02-05 1
# 2 2012-03-01 2012-04-15 2
# 3 2012-04-02 2012-07-11 3
data.frame(expand.grid(AllDays$dates, Lease$id)) %>% # create combinations of dates and ranges
  select(dates = Var1, id = Var2) %>%
  inner_join(Lease, by = "id") %>% # join information
  rowwise %>%
  do(data.frame(dates = .$dates,
                flag = ifelse(.$dates %in% seq(.$start, .$end, by = "1 day"), 1, 0))) %>% # create ranges and check if the date is in there
  ungroup %>%
  group_by(dates) %>%
  summarise(N = sum(flag))
# dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2
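A leaner sketch of the same join-and-count idea (an untested variant, assuming the Date columns created above) skips the rowwise do() and the per-row seq(), and just tests the range directly after the cross join:
expand.grid(dates = AllDays$dates, id = Lease$id) %>%
  inner_join(Lease, by = "id") %>%
  group_by(dates) %>%
  summarise(N = sum(dates >= start & dates <= end))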
Try the lubridate package. Create an interval for each lease. Then count the lease intervals which each date falls in.
# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
                     "EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)
x <- new_interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})
The Output
head(AllDays)
Days NumberInEffect
1 2012-01-01 1
2 2012-01-02 1
3 2012-01-03 1
4 2012-01-04 1
5 2012-01-05 1
6 2012-01-06 1
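One note: in recent lubridate releases new_interval() is deprecated in favour of interval(), so with a current version the equivalent line would be:
# interval() is the current constructor for the same interval object
x <- interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")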

Fastest way for filling-in missing dates for data.table

I am loading a data.table from a CSV file that has date, orders, amount, etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above, 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount, etc.), or carry the last value forward (e.g., 03-Jan will reuse the 02-Jan values, 06-Jan will reuse the 05-Jan values, etc.).
What is the best/optimal way to fill-in such gaps of missing dates data with such default values?
The answer here suggests using allow.cartesian = TRUE, and expand.grid for missing weekdays - it may work for weekdays (since there are just 7 of them) - but I am not sure if that would be the right way to go about dates as well, especially if we are dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
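If you would rather fill the gaps with zeros (the other option mentioned in the question) than carry values forward, a sketch on the same keyed table is a plain join followed by replacing the NA rows (assuming orders and guests are integer columns; drop the L suffix otherwise):
# without roll=, the missing dates come back as NA rows
filled <- NADayWiseOrders[J(all_dates)]
filled[is.na(orders), c("orders", "amount", "guests") := .(0L, 0, 0L)]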
Here is how you fill in the gaps within each subgroup:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)
na.locf(dt)

How to get NA returned from R aggregate over NA data?

I have a dataframe with a POSIXct datetime column and a column with a value.
The value column may contain periods of NA, and sometimes there are even gaps of several hours with no rows at all, e.g.:
t v
2014-01-01 20:00:00 1000
2014-01-01 20:15:00 2300
2014-01-01 20:30:00 1330
2014-01-01 20:45:00 NA
2014-01-01 21:00:00 NA
2014-01-01 22:15:00 NA
2014-01-01 22:30:00 1330
2014-01-01 22:45:00 3333
One can easily see that there is a period with simply no data written (21:00 to 22:15)
When I now apply
aggregate(data, list(t = cut(data$t, "1 hour")), FUN = sum)
it interprets anything missing as zero. When plotting it with ggplot2 and geom_line, the curve in that region will break down from 1000s to 10s.
I want that aggregate returns NA values for every hour that is not represented by the data (missing or NA itself), such that the values are not bent down to 0 and such that the line plot shows a gap in that period (disconnected data points).
Thanks to #JulienNavarre and #user20650, who both contributed parts of the solution, here is my final solution, which can additionally handle data at non-regular times and requires at least x valid values per hour for aggregation.
data$t <- as.POSIXct(strptime(data$t,"%Y-%m-%d %H:%M:%S"))
x <- 4 # data available x times per hour
h <- 1 # aggregate to every h hours
# aggregation puts NA if data has not x valid values per hour
dataagg <- aggregate(data$v, list(t = cut(data$t, paste(h, "hours"))),
                     function(z) ifelse(length(z) < x*h || any(is.na(z)), NA, sum(z, na.rm = T)))
dataagg$t <- as.POSIXct(strptime(dataagg$t, '%Y-%m-%d %H:%M:%S'))
# Now fill up missing datetimes with NA
a <- seq(min(dataagg$t), max(dataagg$t), by=paste(h,"hours"))
t <- a[seq(1, length(a), by=1)]
tdf <- as.data.frame(t)
tdf$t <- as.POSIXct(strptime(tdf$t, '%Y-%m-%d %H:%M:%S'))
dataaggfinal <- merge(dataagg, tdf, by="t", all.y=T)
What you want is not entirely clear, but maybe you are looking for a right join, which you can do with merge and all.y = TRUE.
Afterwards you can do your grouped sum with aggregate.
> data$t <- as.POSIXct(data$t)
>
> time.seq <- seq(min(as.POSIXct(data$t)), max(as.POSIXct(data$t)), by = "min")[seq(1, 166, by = 15)]
>
> merge(data, as.data.frame(time.seq), by.x = "t", by.y = "time.seq", all.y = T)
t v
1 2014-01-01 20:00:00 1000
2 2014-01-01 20:15:00 2300
3 2014-01-01 20:30:00 1330
4 2014-01-01 20:45:00 NA
5 2014-01-01 21:00:00 NA
6 2014-01-01 21:15:00 NA
7 2014-01-01 21:30:00 NA
8 2014-01-01 21:45:00 NA
9 2014-01-01 22:00:00 NA
10 2014-01-01 22:15:00 NA
11 2014-01-01 22:30:00 1330
12 2014-01-01 22:45:00 3333
And the x argument in aggregate should be, in this case, the variable you want to sum, i.e. "data$v", not "data".
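Putting the two steps together, a sketch of the grouped sum on the gap-filled data (sum() without na.rm returns NA for any hour that contains an NA, which keeps the plotted line disconnected there, as asked):
filled <- merge(data, as.data.frame(time.seq),
                by.x = "t", by.y = "time.seq", all.y = TRUE)
aggregate(filled$v, list(t = cut(filled$t, "1 hour")), FUN = sum)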
