Updating a table with the rolling average of previous rows in R? - r

So I have a table where every row represents a given user in a specific event. Each row contains two types of information: the outcomes of the event, and data about the user specifically. Multiple users can take part in the same event.
For clarity, here is a simplified example of such a table:
EventID Date Revenue Time(s) UserID X Y Z
1 1/1/2017 $10 120 1 3 2 2
1 1/1/2017 $15 150 2 2 1 2
2 2/1/2017 $50 60 1 1 5 1
2 2/1/2017 $45 100 4 3 5 2
3 3/1/2017 $25 75 1 2 3 1
3 3/1/2017 $20 210 2 5 5 1
3 3/1/2017 $25 120 3 1 0 4
3 3/1/2017 $15 100 4 3 1 1
4 4/1/2017 $75 25 4 0 2 1
My goal is to build a model that can, given a specific user's performance history (in the example attributes X, Y and Z), predict a given revenue and time for an event.
What I am after now is a way to format my data in order to train and test such a model. More specifically, I want to transform the table so that each row keeps the event-specific information, while presenting the moving average of each user's attributes up until the previous event. An example of the thought process could be: a user, up until an event, presents averages of 2, 3.5, and 1.5 in attributes X, Y and Z respectively, and the revenue and time outcomes of that event were $25 and 75, so I will use this row as an input for my training.
Once again for clarity, here is an example of the output I would expect applying such logic on the original table:
EventID Date Revenue Time(s) UserID X Y Z
1 1/1/2017 $10 120 1 0 0 0
1 1/1/2017 $15 150 2 0 0 0
2 2/1/2017 $50 60 1 3 2 2
2 2/1/2017 $45 100 4 0 0 0
3 3/1/2017 $25 75 1 2 3.5 1.5
3 3/1/2017 $20 210 2 2 1 2
3 3/1/2017 $25 120 3 0 0 0
3 3/1/2017 $15 100 4 3 5 2
4 4/1/2017 $75 25 4 3 3 1.5
Notice that in each user's first appearance all attributes are 0, since we still know nothing about them. Also, in a user's second appearance, all we know is the result of their first appearance. In rows 5 and 9, the third appearances of users 1 and 4 start to show the rolling mean of their previous performances.
If I were dealing with only a single user, I would tackle this problem by simply calculating the moving average of his attributes, and then shifting only the data in the attribute columns down one row. My questions are:
Is there a way to perform such shift filtered by UserID, when dealing with a table with multiple users?
Or is there a better way in R to calculate the rolling mean directly from the original table by always placing a result in each user's next appearance?
It can be assumed that all rows are already sorted by date. Any other tips or references related to this problem are also welcome.
Also, it wasn't obvious how to summarize my question in a one-line title, so I'm open to suggestions from any R experts who can think of a better way of describing it.

We can achieve your desired output using the dplyr package.
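library(dplyr)
tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  # running mean of each attribute within a user
  mutate_at(c("X", "Y", "Z"), cummean) %>%
  # shift down one row so each row only sees earlier events
  mutate_at(c("X", "Y", "Z"), lag) %>%
  # a user's first appearance has no history: replace NA with 0
  mutate_at(c("X", "Y", "Z"), ~ ifelse(is.na(.), 0, .)) %>%
  arrange(EventID, UserID) %>%
  ungroup()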
We arrange the data, group it, and then apply the desired transformations (the dplyr functions cummean, lag, and replacing NA with 0 using an ifelse).
Once this is done, we rearrange the data to its original state, and ungroup it.
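For anyone who wants to check the logic without dplyr, here is a minimal base R sketch of the same idea, run on the sample table from the question. The helper name prev_mean is made up for illustration, Revenue is stored as a plain number instead of a "$" string, and the rows are assumed to be sorted by date already, as the question states.

```r
# Sample data from the question (Date and Time(s) left out for brevity)
tablinka <- data.frame(
  EventID = c(1, 1, 2, 2, 3, 3, 3, 3, 4),
  Revenue = c(10, 15, 50, 45, 25, 20, 25, 15, 75),
  UserID  = c(1, 2, 1, 4, 1, 2, 3, 4, 4),
  X = c(3, 2, 1, 3, 2, 5, 1, 3, 0),
  Y = c(2, 1, 5, 5, 3, 5, 0, 1, 2),
  Z = c(2, 2, 1, 2, 1, 1, 4, 1, 1)
)

# Hypothetical helper: mean of all *previous* values, 0 when there are none.
# cumsum(v) / seq_along(v) is the running mean; dropping its last element
# and prepending 0 shifts it down by one position.
prev_mean <- function(v) c(0, head(cumsum(v) / seq_along(v), -1))

result <- tablinka
for (col in c("X", "Y", "Z")) {
  # ave() applies the helper within each UserID and keeps the row order
  result[[col]] <- ave(tablinka[[col]], tablinka$UserID, FUN = prev_mean)
}

print(result)
```

Because ave() returns values in the original row order, no re-sorting is needed afterwards. The X, Y and Z columns of result should match the expected output table in the question.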

Related

Delete rows when certain factor is present more than 200 times

I have a dataset with over 400,000 cows. These cows are (unevenly) spread over 2355 herds. Some herds are only present once in the data, while one herd is present 2033 times, meaning that 2033 cows belong to this herd. I want to delete herds from my data that occur fewer than 200 times.
Using plyr and subset, I can obtain a list of the herds that occur fewer than 200 times, but I cannot figure out how to apply this selection to the full dataset.
For example, my current data looks a little like:
cow herd
1 1
2 1
3 1
4 2
5 3
6 4
7 4
8 4
With function count() I can obtain the following:
x freq
1 3
2 1
3 1
4 3
Say I want to delete the data belonging to herds that occur less than 3 times, I want my data to look like this eventually:
cow herd
1 1
2 1
3 1
6 4
7 4
8 4
I do know how to tell R to delete data herd by herd. However, since over 1000 herds occur fewer than 200 times in my real dataset, I would have to type every herd number into my script one by one. I am sure there is an easier and quicker way of asking R to delete data above or below a certain number of occurrences.
I hope my explanation is clear and someone can help me, thanks in advance!
Use n() together with group_by():
library(dplyr)
your_data %>%
  group_by(herd) %>%
  filter(n() >= 3)
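As a quick check, here is the same approach run on the small example from the question. The variable name min_size is made up for illustration; for the real data it would be set to 200.

```r
library(dplyr)

# Example data from the question
your_data <- data.frame(
  cow  = 1:8,
  herd = c(1, 1, 1, 2, 3, 4, 4, 4)
)

min_size <- 3  # hypothetical threshold name; use 200 for the real dataset

kept <- your_data %>%
  group_by(herd) %>%
  filter(n() >= min_size) %>%  # n() is the number of rows in the current herd
  ungroup()

print(kept)
```

Herds 2 and 3 have a single cow each, so only the rows for herds 1 and 4 remain.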

Get rate of change from messy data

I have a database that looks like this:
Id session q1 q2 q3 ...
1 1 4 5 5
1 2 4 5 6
1 3 5 5 6
2 1 4 4 5
2 2 5 4 5
2 3 5 5 6
Basically, different subjects with 3 different measurements of the same questions. What I want to do is measure the rate of change, and check whether every observation improved over time or whether there were observations that got worse results in session 3 than in session 1 or 2.
The only thing I have managed to do is make it a bit tidier with pivot_wider, like this:
pivot_wider(id_cols = Id, names_from = session, values_from = c(q1:q4))
The problem is that I have more than 70 questions, and I haven't figured out a way to automate this instead of writing hundreds of lines with mutate in the form of:
mutate(q1change = q1_3 - q1_1)
I was wondering if anyone could come up with a better and simpler solution so I can check this rate of change for each variable.
Ideally I would also like to plot it once I have the rate of change values, so I can show graphically whether there were observations that got worse.
Thanks
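One possible way to avoid writing a mutate line per question is to reshape to long format first and compute the change per Id and question. This is only a sketch under some assumptions: every question column starts with "q", "change" means the last session minus the first, and "got worse" means the last session is below the best earlier score. The names changes, change and got_worse are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  Id      = c(1, 1, 1, 2, 2, 2),
  session = c(1, 2, 3, 1, 2, 3),
  q1 = c(4, 4, 5, 4, 5, 5),
  q2 = c(5, 5, 5, 4, 4, 5),
  q3 = c(5, 6, 6, 5, 5, 6)
)

changes <- df %>%
  arrange(Id, session) %>%
  # one row per Id, session and question (assumes question columns start with "q")
  pivot_longer(starts_with("q"), names_to = "question", values_to = "score") %>%
  group_by(Id, question) %>%
  summarise(
    change    = last(score) - first(score),  # session 3 minus session 1
    got_worse = last(score) < max(score),    # session 3 below an earlier session
    .groups = "drop"
  )

print(changes)
```

The long table has one row per Id and question, which should also be a convenient shape for plotting the change values.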

Reverse cumsum with breaks with non-sequential numbers

Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix for what I want to accomplish. The first column is the data, the second column is what I want. You will see that column 2 is updated to reflect the number of items that are left. When there are 0's the previous number must be carried through.
update <- matrix(c(rep(0,4),rep(1,2),2,rep(0,2),1,3,
rep(10,4), 9,8,6, rep(6,2), 5, 2),ncol=2)
I have tried multiple ways to create a sequence or a loop, using numerous packages (e.g. zoo). What makes it difficult is that the numbers in column 1 can take any value 0, 1, ..., X, as long as they are less than column 2.
Any help or tips would be appreciated
EDIT: Column 2 starts with a given value, which can represent any starting value (e.g. inventory at the beginning of a month). Column 1 then represents the "purchases" made; thus, column 2 should reflect the total number of remaining items available.
The following will report the purchase and inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases=c(rep(0,4),rep(1,2),2,rep(0,2),1,3))
df$cum_purchases <- cumsum(df$purchases)
df$remaining_inventory <- starting_inventory - df$cum_purchases
Result:
purchases cum_purchases remaining_inventory
1 0 0 100
2 0 0 100
3 0 0 100
4 0 0 100
5 1 1 99
6 1 2 98
7 2 4 96
8 0 4 96
9 0 4 96
10 1 5 95
11 3 8 92
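To tie this back to the matrix in the question, the same calculation can be checked against its second column. This is a small sketch that assumes a starting inventory of 10, which is the value the sample matrix begins with.

```r
# Sample matrix from the question: column 1 = purchases, column 2 = desired result
update <- matrix(c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3,
                   rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2), ncol = 2)

starting_inventory <- 10  # assumed from the first value of column 2

# Remaining inventory = starting value minus everything purchased so far
remaining <- starting_inventory - cumsum(update[, 1])

print(cbind(purchases = update[, 1], wanted = update[, 2], remaining = remaining))
print(all(remaining == update[, 2]))
```

Rows with zero purchases carry the previous balance forward automatically, because the cumulative sum does not change on those rows.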

Subdividing and counting how many values in particular columns under certain conditions in r

I am new to R and data analysis. I have a database similar to the one below, just a lot bigger, and I am trying to find a general way to count, for each country, how many actions there are and how many subquestions with value 1, value 2 and so on there are. For each action there are multiple questions, subquestions and subsubquestions, but I would love to find a way to count:
1: how many actions there are per country, excluding subquestions
2: a way to find out how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn.
id country questionn subquestion value actionn
06 NIE 1 1 1 1
05 NIG 1 1 1 1
07 TAN 1 1 1 1
08 BEN 1 1 1 1
03 TOG 1 1 2 1
45 MOZ 1 1 2 1
40 ZIM 1 1 1 1
56 COD 1 1 1 1
87 BFA 1 1 1 1
09 IVC 1 1 2 1
08 SOA 1 1 2 1
02 MAL 1 1 2 1
78 MAI 1 1 2 1
35 GUB 1 1 2 1
87 RWA 1 1 2 1
41 ETH 1 1 1 1
06 NIE 1 2 2 1
05 NIG 1 2 1 1
87 BFA 1 2 1 2
I have tried to create subsets of the data frame and count everything for each country once at a time but it is going to take forever and I was wondering if there was a general way to do it.
For the first question I have done this
df1<-df %>% group_by (country) %>% summarise (countries=county)
unique(df1)
count(df1)
For the second question I was thinking of individually selecting and counting the rows which have questionn=1, subquestion=1, value=1 and actionn=1, then selecting and counting how many there are per country with questionn=1, subquestion=2, value=1, actionn=1, etc. Value refers to whether the answer to the question is 1=yes or 2=no.
I would be grateful for any help, thank you so much :)
For the first question you can try to do something like this:
df %>%
  filter(subquestion != 2) %>%
  group_by(country) %>%
  summarise(num_actions = n())
This will return the number of rows per country after removing the rows where subquestion is 2. Note that n() inside summarise() counts the number of observations in each group (in this case, each country).
I am not sure I fully understand the second question, but my suggestion would be to make a new label for the particular observation you want to know (how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn):
df %>%
  mutate(country_question_code = paste(country, actionn, questionn, sep = "_")) %>%
  group_by(country_question_code) %>%
  summarize(num_subquestion = n())
For question 1, a possible solution (assuming country names are not unique and actionn can be 0, 1, 2 or more):
For just the total count:
df %>%
  group_by(country) %>%
  summarise("Count_actions" = sum(actionn))  # ignores all other columns
If you want to count how many times a country appears, use n() in place of sum(actionn). This may not be what you want, but sometimes the simple solution is the best
(it just counts the frequency of each country).
Or df %>% group_by(country, actionn) %>% summarise("count_actions" = n()) will give a country-wise count for each type of action (say 1, 2 or more).
data.table version: dt[, .(.N), by = .(country, actionn)]
For question 2: filter the data as required first, then group by whatever follows "for each" in your question. Here, keep subquestions 1 or 2 with value 1, and count for each country, questionn and actionn:
df %>% filter(subquestion <= 2 & value == 1) %>% group_by(country, questionn, actionn) %>% summarise("counts_desired" = n(), "sums_desired" = sum(actionn, na.rm = TRUE))
Hope this works. I am also learning and applying it to similar data.
I have not tested it, and I made certain assumptions about your data (numerical and clean). (Also written on mobile while traveling! Cheers!!)
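Here is a small runnable sketch of both counts on a few rows taken from the sample table. It rests on two assumptions that may not match what the asker meant: "how many actions per country" is read as the number of distinct actionn values, and question 2 is read as counting rows with subquestion 1 or 2 and value 1. The names actions_per_country and value1_counts are made up for illustration.

```r
library(dplyr)

# A few rows from the question's table
df <- data.frame(
  country     = c("NIE", "NIG", "TOG", "BFA", "NIE", "NIG", "BFA"),
  questionn   = c(1, 1, 1, 1, 1, 1, 1),
  subquestion = c(1, 1, 1, 1, 2, 2, 2),
  value       = c(1, 1, 2, 1, 2, 1, 1),
  actionn     = c(1, 1, 1, 1, 1, 1, 2)
)

# Question 1 (assumed meaning): distinct actions per country,
# so repeated subquestion rows of the same action are not double counted
actions_per_country <- df %>%
  group_by(country) %>%
  summarise(num_actions = n_distinct(actionn))

# Question 2 (assumed meaning): rows with subquestion 1 or 2 and value 1,
# counted for each country, actionn and questionn
value1_counts <- df %>%
  filter(subquestion %in% c(1, 2), value == 1) %>%
  count(country, actionn, questionn)

print(actions_per_country)
print(value1_counts)
```

Adding subquestion to the count() call would give separate counts for subquestion 1 and subquestion 2 instead of a combined one.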

Grouping and building intervals of data in R and useful visualization

I have some data extracted via Hive. In the end we are talking about a CSV with around 500,000 rows. I want to plot the data after grouping it into intervals.
Besides the grouping, it's not clear to me how to visualize the data. Since we are talking about low spends and sometimes a high frequency, I'm not sure how to handle this problem.
Here is an overview via head(data):
userid64 spend freq
575033023245123 0.00924205 489
12588968125440467 0.00037 2
13830962861053825 0.00168 1
18983461971805285 0.001500366 333
25159368164208149 0.00215 1
32284253673482883 0.001721303 222
33221593608613197 0.00298 709
39590145306822865 0.001785281 11
45831636009567401 0.00397 654
71526649454205197 0.000949978 1
78782620614743930 0.00552 5
I want to group the data into intervals, so I want an extra column indicating the group. The first group should contain all rows with a frequency (called freq) between 1 and 100. The second group should contain all rows with a frequency between 101 and 200, and so on.
The result should look like
userid64 spend freq group
575033023245123 0.00924205 489 5
12588968125440467 0.00037 2 1
13830962861053825 0.00168 1 1
18983461971805285 0.001500366 333 4
25159368164208149 0.00215 1 1
32284253673482883 0.001721303 222 3
33221593608613197 0.00298 709 8
39590145306822865 0.001785281 11 1
45831636009567401 0.00397 654 7
71526649454205197 0.000949978 1 1
78782620614743930 0.00552 5 1
Is there a nice and gentle way to get this? I need this grouping for upcoming plots. I want to visualize all intervals to get an overview of the spend. If you have any ideas for the visualization, please let me know. I thought I should work with boxplots.
If you want to group freq into bins of 100 units, you can try the ceiling function in base R:
ceiling(df$freq / 100)
#[1] 5 1 1 4 1 3 8 1 7 1 1
where df is your dataframe.
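Since the question also asks about boxplots, here is a rough base R sketch that stores the group as a column and plots spend per group. The userid64 column is left out for brevity, the interval column and the upper break of 800 are choices made for this sample only, and the log scale on the y axis is just one option for handling very small spends.

```r
# Sample data from the question (userid64 omitted)
df <- data.frame(
  spend = c(0.00924205, 0.00037, 0.00168, 0.001500366, 0.00215, 0.001721303,
            0.00298, 0.001785281, 0.00397, 0.000949978, 0.00552),
  freq  = c(489, 2, 1, 333, 1, 222, 709, 11, 654, 1, 5)
)

# Numeric group index: 1 for 1-100, 2 for 101-200, and so on
df$group <- ceiling(df$freq / 100)

# Same binning with readable labels such as "(0,100]"; breaks must cover max(freq)
df$interval <- cut(df$freq, breaks = seq(0, 800, by = 100))

print(df)

# One box per group; log = "y" spreads out the very small spend values
boxplot(spend ~ group, data = df, log = "y",
        xlab = "freq group (bins of 100)", ylab = "spend (log scale)")
```

With around 500,000 rows, groups with only a handful of users will produce very thin boxes, so it may be worth checking table(df$group) before plotting.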
