Avoid for-loop: Define blocks of actions within a time range - r

I need to define blocks of actions - so I want to group together all actions for a single id that take place less than 30 days since the last action. If it's more than 30 days since the last action, then I'd increment the label by one (so label 2, 3, 4...). Every new id would start at 1 again.
Here's the data:
dat = data.frame(cbind(
id = c(rep(1,2), rep(16,3), rep(17,24)),
##day_id is the action date in %Y%m%d format - I keep it as numeric but could potentially turn to a date.
day_id = c(20130702, 20130121, 20131028, 20131028, 20130531, 20140513, 20140509,
20140430, 20140417, 20140411, 20140410, 20140404,
20140320, 20140313, 20140305, 20140224, 20140213, 20140131, 20140114,
20130827, 20130820, 20130806, 20130730, 20130723,
20130719, 20130716, 20130620, 20130620, 20130614 ),
###diff is the # of days between actions/day_ids
diff =c(NA,162,NA,0,150,NA,4,9,13,6,1,6,15,7,8,9,11,13,17,140,7,14,
7,7,4,3,26,0,6),
###Just a flag to say whether it's a new id
new_id = c(1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
))
I've done it with a for loop and managed to avoid loops within loops (see below) but can't seem to get rid of that outer loop. Of course, it gets extremely slow with thousands of ids. In the example below, 'call_block' is what I'm trying to reproduce but without the for loop. Can anyone help me get this out of a loop??
max_days = 30
r = NULL
for(i in unique(dat$id)){
d = dat$diff[dat$id==i]
w = c(1,which(d>=max_days) , length(d)+1)
w2 = diff(w)
r = c(r,rep(1:(length(w)-1), w2))
}
dat$call_block = r
Thank you!

Posting @alexis_laz's answer here to close out the question:
library(data.table)
f = function(x, max_days = 30){
# the first row of each id is block 1; every later gap >= max_days starts a new block
c(1, cumsum((x >= max_days)[-1]) + 1)
}
df = data.table(dat)
df2 = df[,list(call_block= f(diff)), by = id]
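For readers without data.table, the same grouped cumsum idea can be sketched in base R with ave. This is only an illustration on a shortened copy of the example data (ids 1 and 16), not part of the original answer; dat_small is a made-up name and max_days mirrors the variable in the question.

```r
# Shortened copy of the example data: ids 1 and 16 only
dat_small <- data.frame(
  id   = c(1, 1, 16, 16, 16),
  diff = c(NA, 162, NA, 0, 150)
)
max_days <- 30

# Within each id: the first row is block 1, and every later gap of
# max_days or more starts a new block
dat_small$call_block <- ave(
  dat_small$diff, dat_small$id,
  FUN = function(x) c(1, cumsum(x[-1] >= max_days) + 1)
)
dat_small$call_block
```

On these five rows the labels match what the for loop in the question produces (1, 2 for id 1 and 1, 1, 2 for id 16). The sketch assumes, as in the example, that NA appears in diff only on the first row of each id.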

Related

How can I get specific data out of a list (with lists) under a certain condition

I have lists in lists and would like to take the activities that occur just within the first 600 seconds (journey time < 600). The "journey time" starts with 0 and adds the time of corresponding activity "code" on top.
homepage1[["customer_data"]][["activity_list"]][[i]][["journey_time"]]
homepage1[["customer_data"]][["activity_list"]][[i]][["code"]]
So for example [["journey_time"]] could look like this 0, 46.7, 79.4, ...., 1800.
[["code"]] looks like StartPage, ClickItem1, ScrollItem1, ..., ClosePage.
"i" are the customers here.
I tried it for each customer alone, but I, of course, would prefer an iterative process with loops.
Thank you in advance! Appreciate it much!
Marius
Your data structure looks like a JSON -> R conversion, so in case you have the original JSON you may not need to convert it to JSON first.
Having said this, you can use fromJSON with flatten = TRUE to get the relevant data in a nice data.frame format, which makes the processing much easier:
Data
homepage1 <- list(
customer_data = list(
activity_list = list(
list(
list(journey_time = 0, code = "StartPage"),
list(journey_time = 46.7, code = "ClickItem1"),
list(journey_time = 79.4, code = "ScrollItem1"),
list(journey_time = 1800, code = "ClosePage")
)
)
)
)
library(jsonlite)
(mdat <- fromJSON(toJSON(homepage1), flatten = TRUE))
# $customer_data
# $customer_data$activity_list
# $customer_data$activity_list[[1]]
# journey_time code
# 1 0 StartPage
# 2 46.7 ClickItem1
# 3 79.4 ScrollItem1
# 4 1800 ClosePage
So all you need to do is apply cumsum to the journey_time column to get cumulative timings and use that as a filter. This assumes that each timing measures the time spent on the element since the previous activity rather than since the beginning; if the latter is true, you do not need cumsum:
idx <- cumsum(mdat$customer_data$activity_list[[1]]$journey_time) < 600
unlist(mdat$customer_data$activity_list[[1]]$code[idx])
# [1] "StartPage" "ClickItem1" "ScrollItem1"
Now it is easy to loop over all customers like this:
lapply(mdat$customer_data$activity_list, function(al) {
idx <- cumsum(al$journey_time) < 600
unlist(al$code[idx])
})
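If journey_time is already cumulative (which is how the question describes it: a clock that starts at 0), the cumsum step can be dropped. Here is a minimal base R sketch that works directly on the nested list, without the JSON round trip. The helper name codes_before is made up for illustration, and the list layout is assumed to match the Data block above.

```r
homepage1 <- list(customer_data = list(activity_list = list(
  list(
    list(journey_time = 0,    code = "StartPage"),
    list(journey_time = 46.7, code = "ClickItem1"),
    list(journey_time = 79.4, code = "ScrollItem1"),
    list(journey_time = 1800, code = "ClosePage")
  )
)))

# Hypothetical helper: codes of one customer's activities whose
# (already cumulative) journey_time is below the limit
codes_before <- function(activities, limit = 600) {
  times <- vapply(activities, function(a) a$journey_time, numeric(1))
  codes <- vapply(activities, function(a) a$code, character(1))
  codes[times < limit]
}

# One character vector of codes per customer
first_600 <- lapply(homepage1$customer_data$activity_list, codes_before)
```

Using vapply with an explicit type makes the helper fail loudly if an activity is missing its journey_time or code, instead of silently returning a ragged result.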

Loop in R through variable names with values as endings and create new variables from the result

I have 24 variables called empl_1 -empl_24 (e.g. empl_2; empl_3..)
I would like to write a loop in R that takes these values 1-24 and puts them in the respective places, so that the corresponding variables are either called or created with i = 1-24. The sample below shows what I would like to have within the loop (e.g. ye1 - ye24, ipw_atet_1 - ipw_atet_14, and so on).
ye1_ipw <- empl$empl_1[insample==1]
ipw_atet_1 <- treatweight(y=ye1_ipw, d=treat_ipw, x=x1_ipw, ATET =TRUE, trim=0.05, boot = 2)
ipw_atet_1
ipw_atet_1$se
ye2_ipw <- empl$empl_2[insample==1]
ipw_atet_2 <- treatweight(y=ye2_ipw, d=treat_ipw, x=x1_ipw, ATET =TRUE, trim=0.05, boot = 2)
ipw_atet_2
ipw_atet_2$se
ye3_ipw <- empl$empl_3[insample==1]
ipw_atet_3 <- treatweight(y=ye3_ipw, d=treat_ipw, x=x1_ipw, ATET =TRUE, trim=0.05, boot = 2)
ipw_atet_3
ipw_atet_3$se
Coming from a Stata environment, I tried:
for (i in seq_anlong(empl_list)){
ye[i]_ipw <- empl$empl_[i][insample==1]
ipw_atet_[i]<-treatweight(y=ye[i]_ipw, d=treat_ipw, x=x1_ipw, ATET=TRUE, trim=0.05, boot =2
}
However this does not work at all. Do you have any idea how to approach this problem by writing a nice loop? Thank you so much for your help =)
You can try with lapply:
result <- lapply(empl[paste0('empl_', 1:24)], function(x)
treatweight(y = x[insample==1], d = treat_ipw,
x = x1_ipw, ATET = TRUE, trim = 0.05, boot = 2))
result is a list that stores the output for all 24 variables in a single object, which is easier to manage and process than 24 separate objects.
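To show how such a list can be used afterwards, here is a small self-contained sketch. Since treatweight and the original data are not available here, toy_estimator is a hypothetical stand-in that just returns a list with effect and se elements, and empl and insample are simulated. Only the lapply/sapply pattern is the point.

```r
set.seed(1)
# Simulated stand-ins for the question's objects (three variables, not 24)
empl <- data.frame(empl_1 = rnorm(10), empl_2 = rnorm(10), empl_3 = rnorm(10))
insample <- rep(c(1, 0), 5)

# Hypothetical placeholder for treatweight(): returns a list with $effect and $se
toy_estimator <- function(y) {
  list(effect = mean(y), se = sd(y) / sqrt(length(y)))
}

# One result per empl_* column, stored in a single list
result <- lapply(empl[paste0("empl_", 1:3)],
                 function(x) toy_estimator(x[insample == 1]))
names(result) <- paste0("ipw_atet_", 1:3)

# Access one result by name, or pull the same element out of all of them
result$ipw_atet_2$se
se_all <- sapply(result, function(r) r$se)
```

Naming the list elements gives the ipw_atet_1, ipw_atet_2, ... labels from the question without creating separate variables.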

make loaded data.table modifiable by a function

I am loading some data.tables and want to create some new columns in them.
There is a closely related question on this topic, but it is predicated on manually entering the name of each data.table. Here's my example:
library(data.table)
library(magrittr)
perf_attr = data.table(
ID = 1:2,
perf_date = as.IDate("2015-12-18") + 0:1,
metro_pop = 1e4*(1:2)
)
##### this part causes trouble ######
save(perf_attr, file = "tmp.rdata")
rm(perf_attr)
load("tmp.rdata")
add_vars = function(DT = data.table(), vars = list()){
if (length(vars)) DT[, names(vars) := lapply(vars, . %>% `[[`(2) %>% eval)][]
DT
}
vars = list(
perf_attr = list(
const = ~1,
lpop = ~log(metro_pop),
dum_weekend = ~weekdays(perf_date) %in% c("Friday", "Saturday")
)
)
for (DTnm in names(vars)) add_vars(get(DTnm), vars[[DTnm]])
##### new columns should appear here, but don't ######
perf_attr
# ID perf_date metro_pop
# 1: 1 2015-12-18 10000
# 2: 2 2015-12-19 20000
Comments
The get doesn't seem to be central to the problem, since add_vars(perf_attr, vars$perf_attr) also fails.
If you skip the save/load part, it seems to work fine (with perf_attr modified by reference). It also works if I don't use a function, like:
perf_attr[, names(vars$perf_attr) := lapply(vars$perf_attr, . %>% `[[`(2) %>% eval)]
I'm used to having internal selfref pointers messed up for loaded data.tables, but am not sure how to repair them to make this work. I tried various lapply(mget(tables()$NAME), f) and for (DTnm in tables()$NAME){stuff} hacks after the load line, but to no good effect.
Well, looking again at the linked answer, I came up with this loop to insert after load:
for (DTnm in tables()$NAME){
assign(DTnm, alloc.col(get(DTnm)))
}
Of course, this tweaks all data.tables in memory.

Using ifelse to create a running tally in R

I am trying to do some quantitative modeling in R. I'm not getting an error message, but the results are not what I actually need.
I am a newbie, but here is my complete code sample.
library(quantmod)
#Building the data frame and xts to show dividends, splits and technical indicators
getSymbols(c("AMZN"))
Playground <- data.frame(AMZN)
Playground$date <- as.Date(row.names(Playground))
Playground$wday <- as.POSIXlt(Playground$date)$wday #day of the week
Playground$yday <- as.POSIXlt(Playground$date)$mday #day of the month
Playground$mon <- as.POSIXlt(Playground$date)$mon #month of the year
Playground$RSI <- RSI(Playground$AMZN.Adjusted, n = 5, maType="EMA") #can add Moving Average Type with maType =
Playground$MACD <- MACD(AMZN, nFast = 12, nSlow = 26, nSig = 9)
Playground$Div <- getDividends('AMZN', from = "2007-01-01", to = Sys.Date(), src = "google", auto.assign = FALSE)
Playground$Split <- getSplits('AMZN', from = "2007-01-01", to = Sys.Date(), src = "google", auto.assign = FALSE)
Playground$BuySignal <- ifelse(Playground$RSI < 30 & Playground$MACD < 0, "Buy", "Hold")
All is well up until this point when I start using some logical conditions to come up with decision points.
Playground$boughts <- ifelse(Playground$BuySignal == "Buy", lag(Playground$boughts) + 1000, lag(Playground$boughts))
It will execute but the result will be nothing but NA. I suppose this is because you are trying to add NA to a number, but I'm not 100% sure. How do you tell the computer I want you to keep a running tally of how much you have bought?
Thanks so much for the help.
So we want to buy 1000 shares every time a buy signal is generated?
Your problem stems from the MACD indicator. It actually generates two columns, macd and signal. You have to decide which one you want to keep.
Playground$MACD <- MACD(AMZN, nFast = 12, nSlow = 26, nSig = 9)$signal
This should solve the problem at hand.
Also, please check the reference for ifelse. The class of the return value can be tricky at times, so the approach suggested by Floo0 is preferable.
Also, I'd advocate using 1 and 0 instead of buy and sell to show whether you are holding. It makes the math much easier.
And I'd strongly suggest reading a beginner tutorial on backtesting with PerformanceAnalytics. It makes the going much easier.
BTW, you missed this line in the code:
Playground$boughts<- 0
Hope it helps.
EDIT: And I forgot to mention the obvious: discard the first few rows, where MACD will be NA.
Something like:
Playground<- Playground[-c(1:26),]
Whenever you want an ifelse of the form "if ..., do something; else stay the same", do not use ifelse.
Try this instead:
ind <- which(Playground$BuySignal == "Buy")
Playground$boughts[ind] <- lag(Playground$boughts)[ind] + 1000
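For the running tally itself, a cumulative sum over the buy signals avoids lag entirely. A minimal sketch on a made-up signal vector: buy_signal stands in for Playground$BuySignal, and the NA mimics rows where the indicators are still undefined.

```r
buy_signal <- c("Hold", "Buy", "Hold", "Buy", "Buy", NA, "Hold")

# Treat NA signals as "no buy" so they do not turn the rest of the tally into NA
is_buy <- !is.na(buy_signal) & buy_signal == "Buy"

# 1000 shares per buy signal, accumulated over time
boughts <- cumsum(is_buy) * 1000
boughts
```

Each element of boughts is the total bought up to and including that row, which is what the question describes as a running tally.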

Lag in R dataframe

I have the following sample dataset (below and/or as CSVs here: http://goo.gl/wK57T) which I want to transform as follows. For each person in a household I want to create two new variables OrigTAZ and DestTAZ. It should take the value in TripendTAZ and put that in DestTAZ. For OrigTAZ it should put value of TripendTAZ from the previous row. For the first trip of every person in a household (Tripid = 1) the OrigTAZ = hometaz. For each person in a household, from the second trip OrigTAZ = TripendTAZ_(n-1) and DestTAZ = TripEndTAZ. The sample input and output data are shown below. I tried the suggestions shown here: Basic lag in R vector/dataframe but have not had luck. I am used to doing something like this in SAS.
Any help is appreciated.
TIA,
Krishnan
SAS Code Sample
if Houseid = lag(Houseid) then do;
if Personid = lag(Personid) then do;
DestTAZ = TripendTAZ;
if Tripid = 1 then OrigTAZ = hometaz
else
OrigTAZ = lag(TripendTAZ);
end;
end;
INPUT DATA
Houseid,Personid,Tripid,hometaz,TripendTAZ
1,1,1,45,4
1,1,2,45,7
1,1,3,45,87
1,1,4,45,34
1,1,5,45,45
2,1,1,8,96
2,1,2,8,4
2,1,3,8,2
2,1,4,8,1
2,1,5,8,8
2,2,1,8,58
2,2,2,8,67
2,2,3,8,9
2,2,4,8,10
2,2,5,8,8
3,1,1,7,89
3,1,2,7,35
3,1,3,7,32
3,1,4,7,56
3,1,5,7,7
OUTPUT DATA
Houseid,Personid,Tripid,hometaz,TripendTAZ,OrigTAZ,DestTAZ
1,1,1,45,4,45,4
1,1,2,45,7,4,7
1,1,3,45,87,7,87
1,1,4,45,34,87,34
1,1,5,45,45,34,45
2,1,1,8,96,8,96
2,1,2,8,4,96,4
2,1,3,8,2,4,2
2,1,4,8,1,2,1
2,1,5,8,8,1,8
2,2,1,8,58,8,58
2,2,2,8,67,58,67
2,2,3,8,9,67,9
2,2,4,8,10,9,10
2,2,5,8,8,10,8
3,1,1,7,89,7,89
3,1,2,7,35,89,35
3,1,3,7,32,35,32
3,1,4,7,56,32,56
3,1,5,7,7,56,7
Just proceed through the steps you outlined step-by-step and it isn't so bad.
First I'll read in your data by copying it:
df <- read.csv(file('clipboard'))
Then I'll sort to make sure the data frame is ordered by houseid, then personid, then tripid:
# first sort so that it's ordered by Houseid, then Personid, then Tripid:
df <- with(df, df[order(Houseid,Personid,Tripid),])
Then follow the steps you specified:
# take value in TripendTAZ and put it in DestTAZ
df$DestTAZ <- df$TripendTAZ
# Set OrigTAZ = value from previous row
df$OrigTAZ <- c(NA,df$TripendTAZ[-nrow(df)])
# For the first trip of every person in a household (Tripid = 1),
# OrigTAZ = hometaz.
df$OrigTAZ[ df$Tripid==1 ] <- df$hometaz[ df$Tripid==1 ]
You'll notice that df is then what you're after.
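A variant of the same steps can compute the lag within each person instead of across the whole data frame, so it does not depend on Tripid being exactly 1 on a person's first row. This is a sketch on a shortened copy of the input data; the grouping key grp and the helper column prev are names introduced here for illustration.

```r
# Shortened copy of the INPUT DATA
df <- read.csv(text = "Houseid,Personid,Tripid,hometaz,TripendTAZ
1,1,1,45,4
1,1,2,45,7
2,1,1,8,96
2,1,2,8,4
2,2,1,8,58")

df <- df[order(df$Houseid, df$Personid, df$Tripid), ]
df$DestTAZ <- df$TripendTAZ

# One key per person within a household
grp <- paste(df$Houseid, df$Personid)

# Previous TripendTAZ within each person; NA marks the person's first trip
prev <- ave(df$TripendTAZ, grp, FUN = function(x) c(NA, head(x, -1)))

# First trip starts from home, later trips start where the previous one ended
df$OrigTAZ <- ifelse(is.na(prev), df$hometaz, prev)
df
```

This assumes TripendTAZ itself has no missing values, since NA is used here to detect a person's first trip.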
