Create recency variable using previous observation in data.table - r

I would like to create a new variable called recency - how recent the customer's transaction is - which is useful for RFM analysis. The definition is as follows: we observe the transaction log of each customer weekly and assign a dummy variable called "trans" if the customer made a transaction. The recency variable will equal the week number if she made a transaction that week; otherwise recency will equal the previous recency value. To make it clearer, I have also created a demo data.table for you.
library(data.table)
demo <- data.table(cust = rep(1:3, 3))
demo[, week := seq(1, 3, 1), by = cust]
demo[, trans := c(1, 1, 1, 0, 1, 0, 1, 1, 0)]
demo[, rec := c(1, 1, 1, 1, 2, 1, 3, 3, 1)]
I need to calculate the "rec" variable, which I entered manually in the demo data.table. Please also consider that I can handle it with a loop, but that takes a lot of time. Therefore I would be grateful if you could help me with a data.table way. Thanks in advance.

This works for the example:
demo[, v := cummax(week*trans), by=cust]
   cust week trans rec v
1:    1    1     1   1 1
2:    2    1     1   1 1
3:    3    1     1   1 1
4:    1    2     0   1 1
5:    2    2     1   2 2
6:    3    2     0   1 1
7:    1    3     1   3 3
8:    2    3     1   3 3
9:    3    3     0   1 1
To restate the requirement: the recency variable equals the week number whenever a transaction is made that week; otherwise it keeps the previous recency value.
This means taking the cumulative maximum of the week number, ignoring weeks where there is no transaction. Since week numbers are positive, we can treat the no-transaction weeks as zero: week*trans equals the week number when trans is 1 and zero otherwise, so cummax() carries the most recent transaction week forward.
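An equivalent way to think about it is last-observation-carried-forward: blank out the no-transaction weeks and fill them downward with data.table's nafill() (available from data.table 1.12.4). A minimal sketch, assuming as in the demo that every customer transacts in week 1, so there are no leading NAs to fill:
# Same result as cummax(week*trans): blank out non-transaction weeks,
# then carry the last transaction week forward within each customer.
demo[, rec2 := nafill(replace(week, trans == 0, NA), type = "locf"), by = cust]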

Related

Subdividing and counting how many values in particular columns under certain conditions in r

I am new to R and data analysis. I have a database similar to the one below, just a lot bigger, and I am trying to find a general way to count, for each country, how many actions there are and how many subquestions with value 1, value 2 and so on there are. For each action there are multiple questions, subquestions and subsubquestions, but I would love to find a way to count:
1: how many actions there are per country, excluding subquestions
2: how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn.
id country questionn subquestion value actionn
06 NIE     1         1           1     1
05 NIG     1         1           1     1
07 TAN     1         1           1     1
08 BEN     1         1           1     1
03 TOG     1         1           2     1
45 MOZ     1         1           2     1
40 ZIM     1         1           1     1
56 COD     1         1           1     1
87 BFA     1         1           1     1
09 IVC     1         1           2     1
08 SOA     1         1           2     1
02 MAL     1         1           2     1
78 MAI     1         1           2     1
35 GUB     1         1           2     1
87 RWA     1         1           2     1
41 ETH     1         1           1     1
06 NIE     1         2           2     1
05 NIG     1         2           1     1
87 BFA     1         2           1     2
I have tried to create subsets of the data frame and count everything for each country one at a time, but that is going to take forever, and I was wondering if there is a general way to do it.
For the first question I have done this:
df1 <- df %>% group_by(country) %>% summarise(countries = country)
unique(df1)
count(df1)
For the second question I was thinking of individually selecting and counting each row which has questionn = 1, subquestion = 1, value = 1 and actionn = 1, then selecting and counting how many per country have questionn = 1, subquestion = 2, value = 1, actionn = 1, etc. Value refers to whether the answer to the question is 1 = yes or 2 = no.
I would be grateful for any help, thank you so much :)
For the first question you can try something like this:
df %>%
  filter(subquestion != 2) %>%
  group_by(country) %>%
  summarise(num_actions = n())
This will return the number of actions per country, removing rows where the subquestion column is 2. Note that n() in the summarise call counts the number of observations in each group (in this case each country).
I am not sure I fully understand the second question, but my suggestion would be to make a new label for the particular observation you want to count (how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn):
df %>%
  mutate(country_question_code = paste(country, actionn, questionn, sep = "_")) %>%
  group_by(country_question_code) %>%
  summarise(num_subquestion = n())
For question 1, a possible solution (assuming country names are not unique and actionn can be 0, 1, 2 or more).
For just the total count:
df %>%
  group_by(country) %>%
  summarise(Count_actions = sum(actionn, na.rm = TRUE))  # ignores all other columns
If you want to count how many times a country appears, use n() in place of sum(actionn, na.rm = TRUE). This may not be what you want, but sometimes the simple solution is the best (it just counts the frequency of each country).
Or df %>% group_by(country, actionn) %>% summarise(count_actions = n()) will give a country-wise count for each type (say 1, 2 or more actions).
The data.table version: dt[, .N, by = .(country, actionn)]
For question 2: group by the "for each" part of your question after filtering the data as required. Here, filter to subquestion 1 or 2 with value 1, for each country, questionn and actionn:
df %>%
  filter(subquestion <= 2 & value == 1) %>%
  group_by(country, questionn, actionn) %>%
  summarise(counts_desired = n(), sums_desired = sum(actionn, na.rm = TRUE))
Hope this works. I am also learning and applying it to similar data. I have not tested it and have made certain assumptions about your data (numerical and clean). (Also written on mobile while traveling! Cheers!!)
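For completeness, here is a sketch of the same question-2 count in data.table, under the same untested assumptions about the columns:
library(data.table)
dt <- as.data.table(df)

# Rows with subquestion 1 or 2 and value 1, counted per
# country / questionn / actionn combination.
dt[subquestion <= 2 & value == 1, .N, by = .(country, questionn, actionn)]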

If (condition), add 1 to previous value, else, subtract 1

I'm tracking Meals and satiety in a dataframe. I would like to have R add 1 to the previous value in the satiety column when a meal is eaten, and subtract 1 when no meal is eaten (meal=NA).
I'm trying to accomplish this with a for loop nested in an ifelse statement but it is not working.
My current attempt:
ifelse(Meals=="NA",for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]-1+i)}, for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]+1+i)}
Error: Error in ans[test & ok] <- rep(yes, length.out = length(ans))
[test & ok] :
replacement has length zero
In addition: Warning message:
In rep(yes, length.out = length(ans)) :
'x' is NULL so the result will be NULL
I'm not sure how to create a table on here but I will do my best to make sense.
Time: 9:30 AM 10:00 AM 10:30 AM ETC
Meals: NA NA Breakfast NA NA Snack NA NA NA ETC
Satiety: Range from 0-10.
My current satiety data is just a vector I created, but I would like it to start at 0 and increase by 1 after every meal, while decreasing by 1 after every 30-minute timeframe where there is no meal (where Meals = NA).
I'm sure there is a much better way to do this.
Thank you.
Here's some sample data and a potential solution.
set.seed(123)
meals <- sample(c(1, 1, 1, NA), 20, replace = TRUE)
df <- data.frame(meals = meals)
head(df)
# meals
# 1 1
# 2 NA
# 3 1
# 4 NA
# 5 NA
# 6 1
df$meals[is.na(df$meals)] <- -1
df$satiety <- cumsum(df$meals)
head(df)
# meals satiety
# 1 1 1
# 2 -1 0
# 3 1 1
# 4 -1 0
# 5 -1 -1
# 6 1 0
tail(df)
# meals satiety
# 15 1 5
# 16 -1 4
# 17 1 5
# 18 1 6
# 19 1 7
# 20 -1 6
I would suggest not coding the absence of a meal (or a skipped meal) as NA, which means "I don't know". If you're using NA to mean the meal was skipped, then you do actually know, and you should give it something that represents a skipped meal. Here, since your model interprets a skipped meal as having a negative impact on satiety (not a neutral impact), -1 actually makes quite a lot of sense. If that's how you use it in your model, then code it that way.
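One caveat, since the question says satiety ranges from 0 to 10: clamping the finished cumsum() is not enough, because a long run of skipped meals would first have to climb back out of negative territory. A stepwise fold keeps every intermediate value in bounds; this is a sketch assuming a starting satiety of 0:
# Bounded running total: clamp each step to [0, 10] before the next
# increment is applied. The starting value of 0 is an assumption.
df$satiety_bounded <- Reduce(
  function(s, x) min(max(s + x, 0), 10),
  df$meals,
  init = 0,
  accumulate = TRUE
)[-1]  # drop the initial 0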
A couple of things here.
Unless the data literally includes the string "NA", you should use is.na(x) to check whether values are NA. It's hard to tell, however, without sample data.
Generally speaking, in R you will want to use vectorised solutions. In many cases, if you're using a for loop, it's incorrect.
You've stated that Meals is in a data frame. As such, you will need to refer to Meals as a component of that data frame: if the data frame is data, the expression should be data$Meals.
Summarising all of this, I'd probably do something like the following:
Day$Meals.na <- is.na(Day$Meals)
print(Day$Fullness + (-1)^Day$Meals.na)
This uses a nice trick: under the hood, TRUE and FALSE are stored as 1 and 0 respectively, so (-1)^Day$Meals.na gives -1 where Meals is NA and +1 otherwise.
Hopefully this helps. If not, we'd really need sample data and expected outputs to be able to be of more use.
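Tying the two answers together, the running total the question describes can be written as one vectorised line (a sketch using the question's Day data frame and Meals column, where NA means no meal):
# +1 for every row with a meal, -1 for every NA row, accumulated over time.
Day$Satiety <- cumsum(ifelse(is.na(Day$Meals), -1, 1))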

Updating a table with the rolling average of previous rows in R?

So I have a table where every row represents a given user in a specific event. Each row contains two types of information: the outcomes of the event, as well as data about that user specifically. Multiple users can take part in the same event.
For clarity, here is a simplified example of such a table:
EventID Date     Revenue Time(s) UserID X Y Z
1       1/1/2017 $10     120     1      3 2 2
1       1/1/2017 $15     150     2      2 1 2
2       2/1/2017 $50     60      1      1 5 1
2       2/1/2017 $45     100     4      3 5 2
3       3/1/2017 $25     75      1      2 3 1
3       3/1/2017 $20     210     2      5 5 1
3       3/1/2017 $25     120     3      1 0 4
3       3/1/2017 $15     100     4      3 1 1
4       4/1/2017 $75     25      4      0 2 1
My goal is to build a model that can, given a specific user's performance history (in the example attributes X, Y and Z), predict a given revenue and time for an event.
What I am after now is a way to format my data in order to train and test such a model. More specifically, I want to transform the table so that each row keeps the event-specific information while presenting the moving average of each user's attributes up until the previous event. An example of the thought process could be: up until an event, a user shows averages of 2, 3.5 and 1.5 in attributes X, Y and Z respectively, and the revenue and time outcomes of that event were $25 and 75; I will now use this as an input for my training.
Once again for clarity, here is an example of the output I would expect from applying this logic to the original table:
EventID Date     Revenue Time(s) UserID X Y   Z
1       1/1/2017 $10     120     1      0 0   0
1       1/1/2017 $15     150     2      0 0   0
2       2/1/2017 $50     60      1      3 2   2
2       2/1/2017 $45     100     4      0 0   0
3       3/1/2017 $25     75      1      2 3.5 1.5
3       3/1/2017 $20     210     2      2 1   2
3       3/1/2017 $25     120     3      0 0   0
3       3/1/2017 $15     100     4      3 5   2
4       4/1/2017 $75     25      4      3 3   1.5
Notice that in each user's first appearance all attributes are 0, since we still know nothing about them. Also, in a user's second appearance, all we know is the result of their first appearance. In rows 5 and 9, the third appearances of users 1 and 4 start to show the rolling mean of their previous performances.
If I were dealing with only a single user, I would tackle this problem by simply calculating the moving average of their attributes and then shifting only the attribute columns down one row. My questions are:
Is there a way to perform such a shift filtered by UserID when dealing with a table with multiple users?
Or is there a better way in R to calculate the rolling mean directly from the original table, always placing the result in each user's next appearance?
It can be assumed that all rows are already sorted by date. Any other tips or references related to this problem are also welcome.
Also, it wasn't obvious how to summarize my question in a one-line title, so I'm open to suggestions from any R experts who might think of a better way of describing it.
We can achieve your desired output using the dplyr package.
library(dplyr)
tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate_at(c("X", "Y", "Z"), cummean) %>%
  mutate_at(c("X", "Y", "Z"), lag) %>%
  mutate_at(c("X", "Y", "Z"), funs(ifelse(is.na(.), 0, .))) %>%
  arrange(EventID, UserID) %>%
  ungroup()
We arrange the data, group it by user, and then apply the desired transformations: a cumulative mean (cummean), a one-row shift (lag), and an ifelse that replaces the resulting NAs with 0.
Once this is done, we rearrange the data back to its original order and ungroup it.
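mutate_at() and funs() have since been superseded in dplyr; here is a sketch of the same pipeline with across() (dplyr 1.0 or later):
library(dplyr)

tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  # lagged cumulative mean per user, with 0 for each first appearance
  mutate(across(c(X, Y, Z), ~ coalesce(lag(cummean(.x)), 0))) %>%
  arrange(EventID, UserID) %>%
  ungroup()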

how to add variable in longitudinal data using SPSS or R?

I have a file with repeated-measures data and another file with single observations for the same persons (e.g., in one file subjects have repeated assessments, and the other file just says whether subjects are male or female). When I merge the files I get something like this:
ID time gender
1  1    0
1  2
1  3
2  1    1
2  2
3  1    0
3  2
3  3
3  4
but I want the variable that was measured once (e.g., male/female) to be repeated across time (in each row) for each subject. So I would like to have:
1 1 0
1 2 0
1 3 0
2 1 1
2 2 1
and not do it manually, since I have thousands of cases...
How can I do this in SPSS (preferably), or in R?
You should have used MATCH FILES, with one "file" (multiple records per ID) and one "table" (no duplicate IDs).
But you can probably still fix it by running:
sort cases by ID.
if missing(gender) and ID = lag(ID) gender = lag(gender).
Wherever there's no value for gender, it will be filled in with the gender of the previous case, if that case has the same ID as the current one.
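Since the question also asks about R: a sketch of the same fix with tidyr's fill(), assuming the merged data is in a data frame df with columns ID, time and gender:
library(dplyr)
library(tidyr)

df <- df %>%
  arrange(ID, time) %>%
  group_by(ID) %>%
  fill(gender, .direction = "down") %>%  # carry gender down within each ID
  ungroup()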

New calculation loop

I want a loop that will perform a calculation for me and export the result (along with identifying information) into a new data frame.
My data look like this. Each unique sampling point (UNIQUE) has 4 data points associated with it (they differ by WAVE):
  WAVE  REFLECT REFEREN PLOT LOCAT COMCOMP DATE     UNIQUE
1 679.9 119     0       1    1     1       11.16.12 1
2 799.9 119     0       1    1     1       11.16.12 1
3 899.8 117     0       1    1     1       11.16.12 1
4 970.3 113     0       1    1     1       11.16.12 1
5 679.9 914     31504   1    2     1       11.16.12 2
6 799.9 1693    25194   1    2     1       11.16.12 2
And I want to create a new data frame that will look like this: for each unique sampling point, I want to calculate "WBI" from 2 specific "WAVE" measurements.
WBI                     PLOT .... UNIQUE
(WAVE==899.8/WAVE==970) 1         1
(WAVE==899.8/WAVE==970) 1         2
(WAVE==899.8/WAVE==970) 1         3
Depending on the size of your input data.frame there could be better solutions in terms of efficiency, but the following should work for small or medium data sets, and is fairly simple:
out.unique <- unique(input$UNIQUE)
out.plot <- sapply(out.unique, function(uq) {
  # Assuming that PLOT is simply the first PLOT of those belonging to
  # that unique number. If not, you should change this.
  subset(input, subset = UNIQUE == uq)$PLOT[1]
})
out.wbi <- sapply(out.unique, function(uq) {
  # Not sure how you compose WBI, but I assume it uses the last two
  # records with that unique number, so it matches the first row of
  # your example output.
  uq.subset <- subset(input, subset = UNIQUE == uq)
  uq.nrow <- nrow(uq.subset)
  paste("(WAVE=", uq.subset$WAVE[uq.nrow - 1], "/WAVE=", uq.subset$WAVE[uq.nrow], ")", sep = "")
})
output <- data.frame(WBI = out.wbi, PLOT = out.plot, UNIQUE = out.unique)
If the input data is big, however, you may want to exploit the fact that the records seem to be sorted by UNIQUE; repeated data.frame subsetting is costly. Both sapply calls could also be combined into one, but that makes the code a bit more cumbersome, so I have left it like this.
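If WBI is actually the numeric ratio of the REFLECT readings at the 899.8 and 970.3 wavelengths (an assumption; the question only sketches the formula as a string), a grouped data.table one-liner avoids the repeated subsetting entirely:
library(data.table)
dt <- as.data.table(input)

# Assumes each UNIQUE point has exactly one row at WAVE 899.8 and one
# at WAVE 970.3, and that WBI is their REFLECT ratio (hypothetical formula).
out <- dt[, .(WBI  = REFLECT[WAVE == 899.8] / REFLECT[WAVE == 970.3],
              PLOT = first(PLOT)),
          by = UNIQUE]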
