For accident analysis, I have to check whether accidents logged in one system also exist in the logs of another system. The problem is that both systems are filled manually, so (small?) differences in location and time may occur.
For now, I've solved this with a function, which I call with:
sys1log.df["match_1_900"] <- apply(sys1log.df, 1, bestMatch, marginLocation = 1, marginTime = 900)
marginLocation is the margin I want to use for the location of an incident. In this case the margin is 1, so for an incident at, say, location 1 in sys1log.df, all incidents in sys2log.df logged between location 0 and 2 are possible candidates for a match.
The same goes for marginTime, in this example set to 900 seconds: all incidents from sys2log.df logged within a quarter of an hour before or after the incident from sys1log.df are possible matches.
The only thing I want to match 'hard' is the roadnumber.
The function bestMatch is:
bestMatch <- function (x, marginLocation, marginTime) {
  location   <- as.numeric( x[10] )
  roadnumber <- as.numeric( x[9] )
  time       <- as.POSIXct( strptime(x[4], "%Y-%m-%d %H:%M:%S") )
  require("dplyr")
  df <- sys2log.df %>%
    #filter rows that match criteria (within margins)
    filter(road == roadnumber,
           loc < location + marginLocation,
           loc > location - marginLocation,
           starttime < time + marginTime,
           starttime > time - marginTime) %>%
    #create column with absolute difference between time system1 and time system2
    mutate(timeDifference = abs( as.numeric(time) - as.numeric(starttime) )) %>%
    #sort on timeDifference
    arrange(timeDifference)
  #if a match is found, return the value in column 15 from the row with the smallest timeDifference
  if (length(df)) {
    return(df[1,15])
  } else {
    return(NA)
  }
}
This works fine, but the problem is that the logs contain >100,000 rows, so the apply call takes about 15-30 minutes to run. I'm using multiple combinations of location/time margins, so I would really like to speed things up.
I think this can be done (much) faster, using data.table's rolling joins. My "problem" is that I would like to join on three keys, of which two should contain a rolling window/margin. Data.table only lets you apply a rolling join on one (the last) key.
I'm sure there is a way to achieve my goal with data.table (or another package), but I'm lost. Who can point me in the right direction?
This is typically a situation where you shouldn't use apply: it converts your data.frame to a matrix, and then at each iteration every value has to be converted back.
Use purrr::pmap instead to iterate over the chosen columns.
Don't sort your data when you're only looking for a minimum value; use which.min (and keep only the first result in case of ties).
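For instance, a trivial illustration (not from the original answer):
x <- c(5, 2, 9, 2)
which.min(x)   # index of the first minimum
# [1] 2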
Your test on length(df) counts the columns of the data.frame, so it will never fail; I think you meant to test nrow(df). I just skipped it, as an empty result already yields NA below and you can test afterwards what object you received.
As you don't provide a reproducible example I can't guarantee that this works (I'm coding blind here :)), but it should point you towards the solution.
# I'm supposing that indices 10, 9 and 4 of sys1log.df are loc, road and starttime,
# and that in the original format those columns are already properly typed
library(dplyr)
library(purrr)

get_new_col <- function(marginLocation = 1, marginTime = 900){
  sys1log.df %>%
    select(location = loc, road_number = road, time = starttime) %>%
    pmap(function(location, road_number, time){
      candidates <- sys2log.df %>%
        filter(road == road_number,
               loc < location + marginLocation,
               loc > location - marginLocation,
               starttime < time + marginTime,
               starttime > time - marginTime)
      # value in column 15 of the row with the smallest absolute time difference;
      # an empty result yields NA here, so no explicit check is needed
      candidates[which.min(abs(as.numeric(time) - as.numeric(candidates$starttime)))[1], 15]
    }) %>%
    unlist()
}
sys1log.df["match_1_900"] <- get_new_col()
I'm attempting to read in a few hundred thousand JSON files and eventually get them into a dplyr object. But the JSON files are not a simple key-value parse and require a lot of pre-processing. The pre-processing is coded and is fairly efficient. The challenge I'm having is loading each record into a single object (data.table or dplyr object) efficiently.
This is very sparse data: I'll have over 2000 variables that will mostly be missing, and each record will have maybe a hundred variables set. The variables are a mix of character, logical and numeric, and I do know the mode of each variable.
I thought the best way to avoid R copying the object on every update (or adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. But doing this in a data frame is extremely slow; moving to a data.table or dplyr object is much better, but I'm still hoping to reduce it to minutes instead of hours. See my example below:
library(data.table)

timeMe <- function() {
  set.seed(1)
  names = paste0("A", seq(1:1200))
  # try with a data frame
  # outdf <- data.frame(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  # try with data table
  outdf <- data.table(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  for(i in seq(100)) {
    # generate 100 columns (real data is in json)
    sparse.cols <- sample(1200, 100)
    # Each record is coming in as a list
    # Each column is either a character, logical, or numeric
    sparse.val <- lapply(sparse.cols, function(i) {
      if(i < 401) {         # logical
        sample(c(TRUE, FALSE), 1)
      } else if (i < 801) { # numeric
        sample(seq(10), 1)
      } else {              # character
        sample(LETTERS, 1)
      }
    }) # now we have a list with values to populate
    names(sparse.val) <- paste0("A", sparse.cols)
    # and here is the challenge and what takes a long time.
    # want to assign the ith row and the named column with each value
    for(x in names(sparse.val)) {
      val = sparse.val[[x]]
      # this is where the bottleneck is.
      # for data frame
      # outdf[i, x] <- val
      # for data table
      outdf[i, x:=val]
    }
  }
  outdf
}
I thought the mode of each column might be getting set and reset with each update, but I have also tried pre-setting each column's type and that didn't help.
For me, running this example with a data.frame (commented out above) takes around 22 seconds, and converting to a data.table brings it to 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data table here.
I follow your code except for the part where you construct sparse.val. There are minor errors in the way you assign columns; don't forget to check that the answer is right while trying to optimise :).
First, the creation of data.table:
Since you say that you already know the types of the columns, it's important to generate the correct types up front. Otherwise, when you sub-assign, e.g. DT[i, LHS := RHS], and the type of RHS is not equal to that of LHS, RHS will be coerced to the type of LHS. In your case, all your numeric and character values will be converted to logical, as all the columns are of logical type. This is not what you want.
Creating a matrix therefore won't help (all columns will be of the same type), and it's also slow. Instead, I'd do it like this:
rows = 100L
cols = 1200L
outdf <- setDT(lapply(seq_len(cols), function(i) {
  if (i < 401L) rep(NA, rows)
  else if (i >= 402L & i < 801L) rep(NA_real_, rows)
  else rep(NA_character_, rows)
}))
Now we have the right types set. Next, I think the condition should be i >= 402L & i < 801L; otherwise you're assigning the first 401 columns as logical and then the first 801 columns as numeric, which, given that you know the types of the columns up front, doesn't make much sense, right?
Second, doing names(.) <-:
The line:
names(sparse.val) <- paste0("A", sparse.cols)
will create a copy and is not really necessary. Therefore we'll delete this line.
Third, the time consuming for-loop:
for(x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, x:=val]
}
is not actually doing what you think it's doing. It's not assigning the value of val to the column whose name is stored in x; instead it's (over)writing, on every iteration, a column literally named x. Check your output.
This is not part of the optimisation; it's just to let you know what you actually want to do here:
for(x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, (x) := val]
}
Note the parentheses around x. Now x will be evaluated, and the value contained in x is the name of the column to which val will be assigned. It's a bit subtle, I understand, but it's necessary because it leaves open the possibility of creating a column literally named x with DT[, x := val], when that is what you actually want.
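A tiny illustration of the difference (made up here, not from the original post):
library(data.table)
DT <- data.table(a = 1:3)
x  <- "b"
DT[, x := 10]     # creates a column literally named "x"
DT[, (x) := 20]   # evaluates x, so it creates/updates the column "b"
names(DT)
# [1] "a" "x" "b"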
Coming back to the optimisation, the good news is that your time-consuming for-loop is simply:
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
This is where data.table's sub-assign by reference feature comes in handy!
Putting it all together:
Your final function looks like this:
timeMe2 <- function() {
  set.seed(1L)
  rows = 100L
  cols = 1200L
  outdf <- as.data.table(lapply(seq_len(cols), function(i) {
    if (i < 401L) rep(NA, rows)
    else if (i >= 402L & i < 801L) rep(NA_real_, rows)
    else rep(NA_character_, rows)
  }))
  setnames(outdf, paste0("A", seq_len(cols)))
  for(i in seq(100)) {
    sparse.cols <- sample(1200L, 100L)
    sparse.val <- lapply(sparse.cols, function(i) {
      if(i < 401L) sample(c(TRUE, FALSE), 1)
      else if (i >= 402L & i < 801L) sample(seq(10), 1)
      else sample(LETTERS, 1)
    })
    set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
  }
  outdf
}
By doing this, your solution takes 9.84 seconds on my system whereas the function above takes 0.34 seconds, a ~29x improvement. I think this is the result you're looking for. Please verify it.
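For example, to compare the two on your own machine (timings will of course differ):
system.time(old <- timeMe())    # original version
system.time(new <- timeMe2())   # pre-typed columns + set()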
HTH
I am backtesting an investment strategy using R; I have the piece of script below:
set.seed(1)
output.df <- data.frame(action=sample(c("initial_buy","sell","buy"),
                                      10000, replace=TRUE),
                        stringsAsFactors=FALSE)
output.df[,"uid"] <- 1:nrow(output.df)
cutrow.fx <- function(output.df) {
  loop.del <- 2
  while (loop.del <= nrow(output.df)) {
    if ((output.df[loop.del,"action"]=="initial_buy" &
         output.df[loop.del-1,"action"]=="initial_buy")|
        (output.df[loop.del,"action"]=="sell" &
         output.df[loop.del-1,"action"]=="sell")|
        (output.df[loop.del,"action"]=="buy" &
         output.df[loop.del-1,"action"]=="sell")|
        (output.df[loop.del,"action"]=="initial_buy" &
         output.df[loop.del-1,"action"]=="buy")){
      output.df <- output.df[-loop.del,]
    } else {
      loop.del <- loop.del + 1
    }
  }
  output.df <<- output.df
}
print(system.time(cutrow.fx(output.df=output.df)))
The strategy determines: 1) when to start buying a stock; 2) when to make an additional contribution to the stock; and 3) when to sell all of the stock. I have a dataframe with the price of a stock for the past 10 years. I wrote 3 scripts to indicate on which dates I should buy/sell the stock, then combine the 3 results and order them.
I need to remove some of the "impossible actions", e.g. I cannot sell the same stock twice without buying new units in between, so I use the script above to delete those impossible actions. But the loop is kind of slow.
Any suggestion for speeding it up?
Update 01
I have updated cutrow.fx to the following, but it fails:
cutrow.fx <- function(output.df) {
  output.df[,"action_pre"] <- "NIL"
  output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]
  while (any(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy")|
         any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell")|
         any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy")|
         any(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy")) {
    output.df <- output.df[!(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy"),]
    output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell"),]
    output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy"),]
    output.df <- output.df[!(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy"),]
    output.df[,"action_pre"] <- "NIL"
    output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]
  }
  output.df[,"action_pre"] <- NULL
  output.df <<- output.df
}
I used vector comparison, somewhat inspired by John's answer (I say 'somewhat' because I'm not sure I understood exactly what he meant), with a while loop to repeat it. But the output is not the same.
Is the for-loop here inevitable?
It looks like all you're doing is checking the last action. This doesn't require a loop at all. All you have to do is shift the vector and do straight vector comparisons. Here's an artificial example.
x <- sample(1:11)
buysell <- sample(c('buy', 'sell'), 11, replace = TRUE)
So, I have 11 samples, x, and whether I've bought or sold them. I want to make a boolean that shows whether I bought or sold the last sample.
bought <- c(NA, buysell[1:10])
which( bought == 'buy' )
Examine the x and buysell variables and you'll see that the result here is the indices of the x items where a buy was made on the prior item.
Also, you might want to check out the function %in%.
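Applied to output.df, one pass of this shifted-vector idea might look like the sketch below. Note that it removes only one 'layer' of bad rows; as the other answers point out, a deletion can create a new bad adjacent pair, so it would have to be repeated until nothing changes.
prev <- c(NA, output.df$action[-nrow(output.df)])   # previous row's action
bad  <- (prev %in% c("initial_buy", "buy") & output.df$action == "initial_buy") |
        (prev == "sell" & output.df$action %in% c("buy", "sell"))
bad[is.na(bad)] <- FALSE                            # the first row is always kept
once <- output.df[!bad, ]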
I tried to do something clever with vectorization, but failed because previous iterations of the loop can change the data relationships for later iterations. So I couldn't lag the data by a set amount and compare lagged to actual results.
What I can do is minimize the copying operations involved. R is assign-by-copy, so when you write a statement like output.df <- output.df[-loop.del,], you are copying the entire data structure for each row that is deleted. Instead of changing (and copying) the data frame, I made changes to a logical vector. Some other attempts at speed-up include using the short-circuiting && instead of the vectorized &, using %in% to make fewer comparisons, and minimizing accesses to output.df.
To compare the two functions I slightly modified the OP's solution so that the original data frame is not overwritten. It looks like this can improve speed by a factor of 10, but it still takes a noticeable amount of time (>0.5 sec). I'd love to see any faster solutions.
OP's solution (slightly modified in return value and without global assign)
cutrow.fx <- function(output.df) {
  loop.del <- 2
  while (loop.del <= nrow(output.df)) {
    if ((output.df[loop.del,"action"]=="initial_buy" &
         output.df[loop.del-1,"action"]=="initial_buy")|
        (output.df[loop.del,"action"]=="sell" &
         output.df[loop.del-1,"action"]=="sell")|
        (output.df[loop.del,"action"]=="buy" &
         output.df[loop.del-1,"action"]=="sell")|
        (output.df[loop.del,"action"]=="initial_buy" &
         output.df[loop.del-1,"action"]=="buy")){
      output.df <- output.df[-loop.del,]
    } else {
      loop.del <- loop.del + 1
    }
  }
  return(output.df)
}
ans1 <- cutrow.fx(output.df)
ans1 <- cutrow.fx(output.df)
my solution
cutrow.fx2 <- function(output.df) {
  ## edge case if output.df has too few rows
  if (nrow(output.df) < 2) return(output.df)
  ## logical vector of indices of rows to keep
  idx <- c(TRUE, logical(nrow(output.df)-1))
  ## keeps track of the previous row
  prev.row <- 1
  prev.act <- output.df[prev.row,"action"]
  for (current.row in seq_len(nrow(output.df))[-1]) {
    ## access output.df only once per iteration
    current.act <- output.df[current.row,"action"]
    ## check to see if the current row is bad;
    ## if so, continue to the next row and leave the previous row as is
    if ( (prev.act %in% c("initial_buy","buy")) &&
         (current.act == "initial_buy") ) {
      next
    } else if ( (prev.act == "sell") &&
                (current.act %in% c("buy","sell")) ) {
      next
    }
    ## if the current row is good, mark it in idx and update the previous row
    idx[current.row] <- TRUE
    prev.row <- current.row
    prev.act <- current.act
  }
  return(output.df[idx,])
}
ans2 <- cutrow.fx2(output.df)
ans2 <- cutrow.fx2(output.df)
Check that the answers are the same:
identical(ans1,ans2)
## [1] TRUE
#benchmarking
require(microbenchmark)
mb <- microbenchmark(
  ans1 = cutrow.fx(output.df),
  ans2 = cutrow.fx2(output.df), times = 50)
print(mb)
# Unit: milliseconds
# expr min lq median uq max
# 1 ans1 9630.1671 9743.1102 9967.6442 10264.7000 12396.5822
# 2 ans2 481.8821 491.6699 500.6126 544.4222 645.9658
plot(mb)
require(ggplot2)
ggplot2::qplot(y=time, data=mb, colour=expr) + ggplot2::scale_y_log10()
Here is some code that is a bit simpler and much faster. It does not loop over all elements, but only loops between matches. It matches forward rather than backward.
First, modify your cutrow.fx function: remove the <<- assignment on the last line and simply return the result. Then you can run the two functions and compare the results.
cutrow.fx1 <- function(d) {
len <- length(d[,1])
o <- logical(len)
f <- function(a) {
switch(a,
initial_buy=c('buy', 'sell'),
buy=c('buy', 'sell'),
sell='initial_buy'
)
}
cur <- 1
o[cur] <- TRUE
while (cur < len) {
nxt <- match(f(d[cur,1]), d[(cur+1):len,1])
if (all(is.na(nxt))) {
break
} else {
cur <- cur + min(nxt, na.rm=TRUE);
o[cur] <- TRUE
}
}
d[o,]
}
Show that the results are correct:
identical(cutrow.fx1(output.df), cutrow.fx(output.df))
## [1] TRUE
And it is quite a bit faster. This is due to the partial vectorization of the problem, using match to find the next row to keep, rather than iterating to discard rows.
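To see what the match() call is doing: f() returns the actions that are allowed to follow the current one, and match() gives the position of the first occurrence of each of them in the remaining rows, so min(nxt, na.rm=TRUE) is the offset of the next row to keep. A toy example:
match(c("buy", "sell"), c("initial_buy", "sell", "buy", "sell"))
# [1] 3 2   # first "buy" at position 3, first "sell" at position 2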
print(system.time(cutrow.fx(output.df)))
## user system elapsed
## 5.688 0.000 5.720
print(system.time(cutrow.fx1(output.df)))
## user system elapsed
## 1.050 0.000 1.056
I've been trying to find a simple way of formatting the output from difftime as HH:MM:SS.ms. So far I haven't come across anything, which surprised me.
I did write the function below, which almost does it. The limitation is that the numbers are not zero-padded, e.g. 2 hr, 3 min, 4.5 sec becomes "2:3:4.5" instead of "02:03:04.5".
Does anyone have a better suggestion?
format.timediff <- function(start_time) {
  diff <- as.numeric(difftime(Sys.time(), start_time, units="mins"))
  hr  <- diff %/% 60
  min <- floor(diff - hr * 60)
  sec <- round(diff %% 1 * 60, digits=2)
  return(paste(hr, min, sec, sep=':'))
}
In addition to #GSee's comment, you could use a function like this:
f <- function(start_time) {
  start_time <- as.POSIXct(start_time)
  dt <- difftime(Sys.time(), start_time, units="secs")
  # Since you only want the H:M:S, we can ignore the date...
  # but you have to be careful about time-zone issues
  format(.POSIXct(dt, tz="GMT"), "%H:%M:%S")
}
f(Sys.Date())
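If you also want the fractional seconds the question asks for, the last line can use %OS with a digit count (a small tweak on my part, not part of the original answer):
# inside f(), instead of "%H:%M:%S":
format(.POSIXct(dt, tz = "GMT"), "%H:%M:%OS3")   # e.g. "01:02:03.456"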
Merge_Charge_Point$Duration<- difftime(Merge_Charge_Point$EndConnectionDateTime, Merge_Charge_Point$StartConnectionDateTime, units="secs")
This is the code, but it gives the difference in seconds, while the outcome should be a time string.
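Along the lines of the answer above, one way to turn those second counts into strings might be the following sketch (assuming the durations are under 24 hours; DurationString is a made-up column name):
secs <- as.numeric(Merge_Charge_Point$Duration)
Merge_Charge_Point$DurationString <- format(.POSIXct(secs, tz = "GMT"), "%H:%M:%S")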
(EDIT: one of the issues here is scale, namely that what works for one row will blow up or crash R on a 200,000 * 50 dataframe. For example, strptime must be applied column-wise, not row-wise, to avoid hanging.
I'm looking for working code solutions that you actually ran on 200,000 * 50, including your measured runtime, not just casual "this is easy" remarks. It's easy to get runtimes > 12 hrs if you pick the wrong function. Next, I also asked you to make my zero-time adjustment code faster; the job's not finished till that's done. No one has attempted that so far.)
I want to vectorize and accelerate the following multistep log-time conversion, with millisecond accuracy, involving converting time strings to a single numeric, followed by subtraction and then log(), on a large data frame (200,000 rows * 300 cols; the other, non-time columns are omitted).
Code below.
As well as making it vectorized and fast, an extra problem is that I'm not sure how best to represent the (higher-dimensional) intermediate values at each step (e.g. as the list from strsplit, a matrix, or a vector). I already tried apply, sapply, lapply, vapply, plyr::maply(), ... but the incompatibility of the intermediate format(s) keeps messing me up...
Each row has 50 columns time1..time50 (chr, format="HH:MM:SS.sss") representing time as a string with millisecond resolution. I need millisecond accuracy.
Within each row, columns time1..time50 are in non-decreasing order, and I want to convert them into the log of the time before time50. The conversion function parse_hhmmsecms() is at the bottom and needs serious vectorization and speeding up; you can see alternative versions commented out. What I have figured out so far: strsplit() is faster than (multiple) substr() calls; I then convert somehow to a list of three numerics (hh, mm, sec.ms), then to a vector, assuming the next step should be a vector multiplication with %*% c(3600,60,1) to convert to numeric seconds.
Here is pseudocode of what I do for each row, and each time-string; full code is at bottom:
for each row in dataframe { # vectorize this, loop_apply(), or whatever...
#for each time-column index i ('time1'..'time50') { # vectorize this...
hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
# Main computation
xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime]) ))
# Minor task: fix up all the 'zero-time' events to be evenly spaced between -3..0
#}
}
So there are five subproblems involved:
How to vectorize handling the list returned by strsplit()? Since it returns a list of 3 items when passed a 2D dataframe or a 1D row of time strings, we will get a 3D or 2D intermediate object. (Do we internally use a list of lists? A matrix of lists? An array of lists?)
How to vectorize the entire function parse_hhmmsecms()?
Then do the subtraction and the log.
Vectorize the zero-time fixup code as well (this is now the slowest part by far).
How to accelerate steps 1-4?
Code snippet below using ten example columns time41..50 (use random_hhmmsecms() if you want a bigger sample)
I did my best to follow these recommendations; this is as reproducible as I can get it after six hours' work:
# Each of 200,000 rows has 50 time strings (chr) like this...
xx <- structure(list(time41 = c("08:00:41.465", "08:00:50.573", "08:00:50.684"
), time42 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"),
time43 = c("08:00:41.465", "08:00:50.573", "08:00:50.759"
), time44 = c("08:00:41.465", "08:00:50.664", "08:00:50.759"
), time45 = c("08:00:41.465", "08:00:50.684", "08:00:50.759"
), time46 = c("08:00:42.496", "08:00:50.684", "08:00:50.759"
), time47 = c("08:00:42.564", "08:00:50.759", "08:00:51.373"
), time48 = c("08:00:48.370", "08:00:50.759", "08:00:51.373"
), time49 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
), time50 = c("08:00:50.573", "08:00:50.759", "08:00:54.452"
)), .Names = c("time41", "time42", "time43", "time44", "time45",
"time46", "time47", "time48", "time49", "time50"), row.names = 3:5, class = "data.frame")
# Handle millisecond timing and time conversion
options('digits.secs'=3)
# Parse "HH:MM:SS.sss" timestring into (numeric) number of seconds (Very slow)
parse_hhmmsecms <- function(t) {
  as.numeric(substr(t,1,2))*3600 + as.numeric(substr(t,4,5))*60 + as.numeric(substr(t,7,12)) # WORKS, V SLOW
  #c(3600,60,1) %*% sapply((strsplit(t[1,]$time1, ':')), as.numeric) # SLOW, NOT VECTOR
  #as.vector(as.numeric(unlist(strsplit(t,':',fixed=TRUE)))) %*% c(3600,60,1) # WANT TO VECTORIZE THIS
}
random_hhmmsecms <- function(n=1, min=8*3600, max=16*3600) {
  # Generate n random hhmmsecms objects between min and max (8am:4pm)
  xx <- runif(n,min,max)
  ss <- xx %% 60
  mm <- (xx %/% 60) %% 60
  hh <- xx %/% 3600
  sprintf("%02d:%02d:%05.3f", hh,mm,ss)
}
xx$logtime45 <- xx$logtime44 <- xx$logtime43 <- xx$logtime42 <- xx$logtime41 <- NA
xx$logtime50 <- xx$logtime49 <- xx$logtime48 <- xx$logtime47 <- xx$logtime46 <- NA
# (we pass index vectors as the dataframe column ordering may change)
Ctime <- which(colnames(xx)=='time41') : which(colnames(xx)=='time50')
Clogtime <- which(colnames(xx)=='logtime41') : which(colnames(xx)=='logtime50')
for (i in 40:nrow(xx)) {
  #if (i%%100==0) { print(paste('... row',i)) }
  hhmmsecms_50 <- parse_hhmmsecms(xx$time50[i])
  xx[i,Clogtime] <- -10*log10(1000*(hhmmsecms_50 - parse_hhmmsecms(xx[i,Ctime]) ))
  # Now fix up all the 'zero-time' events to be evenly spaced between -3..0
  # (which() gives positions within Clogtime, so map them back to actual column indices)
  Czerotime.p <- which(xx[i,Clogtime]==Inf | xx[i,Clogtime]>-1e-9)
  xx[i, Clogtime[Czerotime.p]] <- seq(-3,0,length.out=length(Czerotime.p))
}
You may be overcomplicating things.
Start with the base classes, which do milliseconds very well (and on appropriate operating systems even microseconds), but note that
you need to set options("digits.secs"=7) (that's the max that can be displayed) to see them displayed, and
you need an additional parsing character for strptime et al;
all of which is in the docs, and there are countless examples here on SO.
Quick examples:
R> someTime <- ISOdatetime(2011, 12, 27, 2, 3, 4.567)
R> someTime
[1] "2011-12-27 02:03:04.567 CST"
R> now <- Sys.time()
R> now
[1] "2011-12-27 16:48:20.247298 CST" # microsecond display on Linux
R>
R> txt <- "2001-02-03 04:05:06.789123"
R> strptime(txt, "%Y-%m-%d %H:%M:%OS") # note the %0S for sub-seconds
[1] "2001-02-03 04:05:06.789123"
R>
And key functions such as strptime or as.POSIXct are all vectorised and you can throw entire columns at them.
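Applied to the example data in the question, that might look roughly like the sketch below (assuming the xx, Ctime and Clogtime setup from the question has been run; it prepends a dummy date so there is something for the format string to parse, and leaves the zero-time fix-up step as it is in the original loop):
options(digits.secs = 3)

# one vectorised parse per column: prepend a dummy date, use %OS for fractional seconds
secs <- vapply(xx[, Ctime],
               function(col) as.numeric(as.POSIXct(paste("1970-01-01", col),
                                                   format = "%Y-%m-%d %H:%M:%OS",
                                                   tz = "GMT")),
               numeric(nrow(xx)))

# log-time relative to time50, computed on the whole matrix at once
xx[, Clogtime] <- -10 * log10(1000 * (secs[, "time50"] - secs))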
The following function does work, but the last as.Date part was more or less a result of trial and error that I do not fully understand.
### This function creates a real date column out of year / period values that are saved
### in separate columns, plus it handles a 13th period in case of overlapping period
### terminology. Turns quarters into months.
realDate <- function (table, year="year_col", period="period_col"){
  if (is.character(table) == TRUE)
  {
    dframe <- get(table)
  }
  else{
    dframe <- table
  }
  x <- expression({resDate <- with(dframe,
                                   as.Date(paste(get(year),"-",
                                                 ifelse(get(period) > 9, get(period),
                                                        paste("0", get(period), sep = "")),
                                                 "-01", sep = "")))
  })
  y <- expression({resDate <- with(dframe, as.Date(paste(get(year) + 1,"-","01","-01",sep="")))})
  #### I do not get this? Why do I have to do this?
  a <- ifelse(get(period) == 13, eval(y), eval(x))
  a <- as.Date(a, origin="1970-01-01")
  return(a)
}
Instead I tried to do it like this (because it was more intuitive to me):
{ ....
ifelse(get(period) == 13,eval(y),eval(x))
return(resDate)
}
This returned the correct values whenever the condition was FALSE (no) but returned NA if the condition was TRUE (yes). Why is that? And if I use the function above, why do I have to define the origin again? Why do I even have to call as.Date again?
EDIT:
a <- rep(2002:2010,2)
b <- rep(1:13,2)
d<-cbind(a,b[1:length(a)])
names(d) <- c("year_col","period_col")
P.S.:
I found this thread on vectorized ifelse.
Your construct is "interesting", to say the least. To start with, neither x nor y gives output. I wonder why you use an assignment inside your eval(): this gives you a resDate vector that is exactly whatever the last call produced. And that is not dependent on the condition; it's the last one written (eval(x) in your case). They get executed before the ifelse clause is evaluated.
Plus, the output you get is the numeric representation of your data, not the date object; that is in resDate. I guess that ifelse cannot determine the class of the output vector because you use eval() inside. I'm surprised you get output at all; in fact, you're effectively relying on something that could be called a "bug" in R (Microsoft would call it a feature :-) ).
Your mistake is in your ifelse: get(period) doesn't exist; it should be get(period, dframe). Then it works. The only reason it works on your computer is presumably because you have a period object in your workspace. Classic problem when debugging.
In any case, I'd make it:
realDate <- function (table, year="year_col", period="period_col"){
  if (is.character(table)){   # is.character(table) returns a boolean already
    dframe <- get(table)
  } else {
    dframe <- table
  }
  year   <- get(year, dframe)
  period <- get(period, dframe)
  year[period==13]   <- year[period==13] + 1
  period[period==13] <- 1
  as.Date(paste(year, "-", period, "-01", sep=""))
}
This is quite a bit faster than your own, has fewer pitfalls and conversions, and is more the R way of doing it. You could replace the year[...] and period[...] assignments with ifelse constructs (see the sketch below), but using indices is generally faster.
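For comparison, the ifelse equivalent of those two indexed assignments would be something like this (my sketch, not part of the original answer; note that year must be updated before period is reset):
year   <- ifelse(period == 13, year + 1, year)
period <- ifelse(period == 13, 1, period)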
EDIT :
This is easier for the data generation:
dframe <- data.frame(
  year_col   = rep(2006:2007, each=13),
  period_col = rep(1:13, 2)
)
realDate(dframe)
[1] "2006-01-01" "2006-02-01" "2006-03-01" "2006-04-01" "2006-05-01"
"2006-06-01" "2006-07-01" "2006-08-01" "2006-09-01"
[10] "2006-10-01" "2006-11-01" "2006-12-01" "2007-01-01" "2007-01-01"
"2007-02-01" "2007-03-01" "2007-04-01" "2007-05-01"
[19] "2007-06-01" "2007-07-01" "2007-08-01" "2007-09-01"
"2007-10-01" "2007-11-01" "2007-12-01" "2008-01-01"