Creating a summary table of user event data - r

Edit 2: I realized I can use dcast() to do what I want. However, I do not want to count all of the events in the Event Data, only those that happened before a date specified in another data set. I can't seem to figure out how to use the subset argument in dcast(). So far I've tried:
dcast(dt.events, Email ~ EventType, fun.aggregate = length, subset = as.Date(Date) <=
as.Date(dt.users$CreatedDate[dt.users$Email = dt.events$Email]))
However, this doesn't work. I could add the CreatedDate column from dt.users to dt.events and then subset using:
dcast(dt.events, Email ~ EventType, fun.aggregate = length, subset = as.Date(Date) <=
      as.Date(CreatedDate))
I was wondering if it was possible to do this without having to add the extra column?
Edit: I just calculated that it'll probably take about 37 hours to complete the way I'm currently doing it, so if anyone has any tips to make this faster, please let me know :)
I'm new to R. I've figured out a way to do what I want to do, but it's extremely inefficient and takes hours to complete.
I have the following:
Event data:
UserID Email EventType Date
User1 User1#*.com Type2 2016-01-02
User1 User1#*.com Type6 2016-01-02
User1 User1#*.com Type1 2016-01-02
User1 User1#*.com Type3 2016-01-02
User2 User2#*.com Type1 2016-01-02
User2 User2#*.com Type1 2016-01-02
User2 User2#*.com Type2 2016-01-02
User3 User3#*.com Type1 2016-01-02
User3 User3#*.com Type3 2016-01-02
User1 User1#*.com Type2 2016-01-04
User1 User1#*.com Type2 2016-01-04
User2 User2#*.com Type5 2016-01-04
User3 User3#*.com Type1 2016-01-04
User3 User3#*.com Type4 2016-01-04
Every time a user does something, an event is recorded with an event type and a timestamp.
User list from different database:
UserID Email CreatedDate
DxUs1 User1#*.com 2016-01-01
DxUs2 User2#*.com 2016-01-03
DxUs3 User3#*.com 2016-01-03
I want to get the following:
A summary table that counts the number of each event type in the Event Data for each user in the User List. However, events should only be counted if the "CreatedDate" in the user list is before or equal to the "Date" in the Event Data.
So for the above data I would eventually want to get:
Email Type1 Type2 Type3 Type4 Type5 Type6
User1#*.com 1 3 1 0 0 1
User2#*.com 0 0 1 0 1 0
User3#*.com 1 0 0 1 0 0
How I've managed to do it so far
I've been able to do this by first creating a 'dt.master' data.table that includes a column for every event type and the list of Emails, which looks like this:
Email Type1 Type2 Type3 Type4 Type5 Type6
User1#*.com 0 0 0 0 0 0
User2#*.com 0 0 0 0 0 0
User3#*.com 0 0 0 0 0 0
And then filling out this table using the while loop below:
# The data sets
dt.events # event data
dt.users  # user list
dt.master # blank master table
# Loop that fills master table
counter_limit = group_size(dt.master)
index = 1
while (index <= counter_limit) {
  # Get events of user at current index
  dt.events.temp = filter(dt.events, dt.events$Email %in% dt.users$Email[index],
                          as.Date(dt.events$Date) <= as.Date(dt.users$CreatedDate[index]))
  # Count all the different events
  dt.event.counter = as.data.table(t(as.data.table(table(dt.events.temp$EventType))))
  # Clean the counter by 1: Rename columns to event names, 2: Remove event names row
  names(dt.event.counter) = as.character(unlist(dt.event.counter[1, ]))
  dt.event.counter = dt.event.counter[-1]
  # Fill the current index in on the blank master table
  set(dt.master, index, names(dt.event.counter), dt.event.counter)
  index = index + 1
}
The Problem
This does work... However, I am dealing with 9+ million events, 250k+ users and 150+ event types, so the above while loop takes HOURS to process. I tested it with a small batch of 500 users, which had the following processing time:
user system elapsed
179.33 62.92 242.60
I'm still waiting for the full batch to be processed, haha. I've read somewhere that loops should be avoided, as they take a lot of time. However, I am completely new to R and to programming in general, and I've been learning through trial and error and Googling whatever I've needed, which clearly leads to some messy code. I was wondering if anyone could point me in the direction of something that might be faster/more efficient?
Thanks!
TL;DR: My event aggregation/summarization code takes several hours to process my data (it's still not done). Is there a faster way to do it?

Assuming your data is already in a data.table, you could use the fun.aggregate parameter in dcast:
dcast(dat, Email ~ EventType, fun.aggregate = length)
gives:
Email Type1 Type2 Type3 Type4 Type5 Type6
1: User1#*.com 1 2 1 0 0 1
2: User2#*.com 4 1 0 0 1 0
3: User3#*.com 0 1 1 1 0 0
In response to the comments and the updated question: you can get the desired result by using a non-equi join inside the dcast() call:
dcast(dt.events[dt.users, on = .(Email, Date >= CreatedDate)],
      Email ~ EventType, fun.aggregate = length)
which gives:
Email Type1 Type2 Type3 Type4 Type5 Type6
1: User1#*.com 1 2 1 0 0 1
2: User2#*.com 1 0 0 0 1 0
3: User3#*.com 0 1 0 1 0 0
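If Date and CreatedDate are stored as character, the non-equi join will fail, so convert them to Date/IDate first. Here is a minimal self-contained sketch with made-up toy rows (the nomatch = 0L is an optional addition that simply drops users without any qualifying events):
library(data.table)

# Toy data in the shape described in the question; dates converted to IDate
# so the non-equi operator can compare them.
dt.events <- data.table(
  Email     = c("User1#*.com", "User1#*.com", "User2#*.com", "User2#*.com"),
  EventType = c("Type1", "Type2", "Type1", "Type5"),
  Date      = as.IDate(c("2016-01-02", "2016-01-04", "2016-01-02", "2016-01-04"))
)
dt.users <- data.table(
  Email       = c("User1#*.com", "User2#*.com"),
  CreatedDate = as.IDate(c("2016-01-01", "2016-01-03"))
)

# Keep only events on or after each user's CreatedDate, then count the rows
# per Email/EventType combination.
dcast(dt.events[dt.users, on = .(Email, Date >= CreatedDate), nomatch = 0L],
      Email ~ EventType, fun.aggregate = length)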

untested
library(dplyr)
library(tidyr)
your.dataset %>%
  count(Email, EventType) %>%
  spread(EventType, n)
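An untested sketch of how the CreatedDate restriction from the updated question could be folded into this approach, using the dt.events/dt.users names from the question: join the user table on first, filter, then count and reshape. pivot_wider() is the newer replacement for spread(); values_fill assumes tidyr >= 1.1.
library(dplyr)
library(tidyr)

dt.events %>%
  inner_join(select(dt.users, Email, CreatedDate), by = "Email") %>%
  filter(as.Date(Date) >= as.Date(CreatedDate)) %>%   # keep events on/after CreatedDate
  count(Email, EventType) %>%
  pivot_wider(names_from = EventType, values_from = n, values_fill = 0)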

Related

Problem finding number of elements in a dataframe in R

I have downloaded the data frame casos_hosp_uci_def_sexo_edad_provres_60_mas.csv from this webpage; it describes the number of people infected with Covid-19 in Spain, classified by province, age, gender, and so on. I read and represent the dataframe as:
db<-read.csv(file = 'casos_hosp_uci_def_sexo_edad_provres.csv')
The first five rows are shown below:
provincia_iso sexo grupo_edad fecha num_casos num_hosp num_uci num_def
1 A H 0-9 2020-01-01 0 0 0 0
2 A H 10-19 2020-01-01 0 0 0 0
3 A H 20-29 2020-01-01 0 0 0 0
4 A H 30-39 2020-01-01 0 0 0 0
5 A H 40-49 2020-01-01 0 0 0 0
The first four columns of the data frame show the name of the province, the gender, the age group and the date; the last four columns show the number of people who got ill, were hospitalized, were admitted to the ICU or died.
I want to use R to find the day with the highest number of contagions. To do that, I have to sum the elements of the fifth column, num_casos, for each different value of the column fecha.
I have already been able to calculate the number of sick males as hombresEnfermos=sum(db[which(db$sexo=="H"), 5]). However, I think there has to be a better way to find the days with higher contagion than counting manually, but I cannot figure out how.
Can someone please help me?
Using dplyr to get the total by date:
library(dplyr)
db %>% group_by(fecha) %>% summarise(total = sum(num_casos))
Two alternatives in base R:
data.frame(fecha = sort(unique(db$fecha)),
           total = sapply(split(db, f = db$fecha), function(x) {sum(x[['num_casos']])}))
Or more simply,
aggregate(db$num_casos, list(db$fecha), FUN=sum)
An alternative in data.table:
library(data.table)
db <- as.data.table(db)
db[, list(total=sum(num_casos)), by = fecha]
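To then answer the original question (the day with the most new cases), the day with the maximum total can be pulled out of either result; a short sketch in data.table syntax, continuing from the code above:
totals <- db[, list(total = sum(num_casos)), by = fecha]
totals[which.max(total)]   # the date (fecha) with the highest daily total

# dplyr equivalent:
# db %>% group_by(fecha) %>% summarise(total = sum(num_casos)) %>% slice_max(total, n = 1)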

How to take data elements in a dataset and turn them into rows and columns?

Before I start off, I should say that I had never used R until yesterday; I only know some Python and am very much a beginner. I'm using R because I cannot figure out how to do anything in Excel, and I have already made more progress with R.
So I have a seemingly unique problem I'm trying to solve. I have a data set that looks similar to this:
ID Contaminant
1 123 Lead
2 123 Copper
3 456 Lead
4 678 Iron
5 456 Lead
6 111 Iron
7 222 Arsenic
I want to take this data and create a new xlsx or csv file from it for data analysis. I want to see how many times each ID has had a contaminant and what that contaminant is. I think I have figured out how to count how many IDs are associated with each Contaminant type and how many IDs have a unique Contaminant associated with them, if that makes sense.
I want the new data sheet to look something like this:
ID Lead Copper Iron Arsenic
123 1 2 0 0
456 2 0 0 0
678 0 0 1 0
111 0 0 1 0
222 0 0 0 0
So far I have figured out how to take my original data sheet, which contains a lot of variables, and turn it into the first data set listed above, which only contains IDs and Contaminants.
My code is a bit rough, as I'm mimicking others' work, but what I have so far is:
library(dplyr)

violations <- tbl_df(`2015_Violations`)
new_violations <- select(violations, -TypeofViolation, -"CODETypeof Vio")
with(unique(violations[c("ID", "Contaminant")]), table(Contaminant))
with(unique(violations[c("ID", "Contaminant")]), table(ID))
write.csv(new_violations, file = "C:/r_stuff/new_violations.csv",
          row.names = F)
This spits out the unique counts for Contaminants and IDs into some tables.
I am then using a different .R file to test this new .csv file. It simply contains this:
mydata <- read.csv("C:/r_stuff/new_violations.csv")
View(mydata)
So my question is: how can I take my data in the first table and turn it into a new file with the structure of the second? I assume this isn't a very easy task, but doing it by hand would be impossible as there are thousands of entries in the original data file.
I propose to do it like this:
library(tidyverse)
data %>%
  group_by(ID, Contaminant) %>%
  mutate(Count = n()) %>%
  distinct() %>%
  pivot_wider(names_from = Contaminant, values_from = Count) %>%
  ungroup() %>%
  mutate_all(replace_na, 0)
where:
data <-
  tibble(
    ID = c(123, 123, 456, 678, 456, 111, 222),
    Contaminant = c("Lead", "Copper", "Lead", "Iron", "Lead", "Iron", "Arsenic")
  )
It gives:
# A tibble: 5 x 5
ID Lead Copper Iron Arsenic
<dbl> <dbl> <dbl> <dbl> <dbl>
1 123 1 1 0 0
2 456 2 0 0 0
3 678 0 0 1 0
4 111 0 0 1 0
5 222 0 0 0 1
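A slightly shorter, equivalent (untested) variant: count() collapses the data to one row per ID/Contaminant pair, and values_fill fills the missing combinations with 0 (assumes tidyr >= 1.1):
library(tidyverse)

data %>%
  count(ID, Contaminant) %>%
  pivot_wider(names_from = Contaminant, values_from = n, values_fill = 0)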

Create recency variable using previous observation in data.table

I want to create a new variable called recency (how recent the customer's transaction is), which is useful for RFM analysis. The definition is as follows: we observe the transaction log of each customer weekly and assign a dummy variable called "trans" if the customer made a transaction. The recency variable equals the number of the week if she made a transaction that week; otherwise recency equals the previous recency value. To make it clearer, I have also created a demo data.table for you.
demo <- data.table(cust = rep(c(1:3), 3))
demo[, week := seq(1, 3, 1), by = cust]
demo[, trans := c(1, 1, 1, 0, 1, 0, 1, 1, 0)]
demo[, rec := c(1, 1, 1, 1, 2, 1, 3, 3, 1)]
I need to calculate the "rec" variable, which I entered manually in the demo data.table. Please also consider that I can handle it with a loop, but that takes a lot of time. Therefore, I would be grateful if you could help me with a data.table way. Thanks in advance.
This works for the example:
demo[, v := cummax(week*trans), by=cust]
cust week trans rec v
1: 1 1 1 1 1
2: 2 1 1 1 1
3: 3 1 1 1 1
4: 1 2 0 1 1
5: 2 2 1 2 2
6: 3 2 0 1 1
7: 1 3 1 3 3
8: 2 3 1 3 3
9: 3 3 0 1 1
We observe the transaction log of each customer weekly and assign a dummy variable called "trans" if the customer made a transaction. The recency variable equals the number of the week if she made a transaction that week; otherwise recency equals the previous recency value.
This means taking the cumulative max week, ignoring weeks where there is no transaction. Since weeks are positive numbers, we can treat the no-transaction weeks as zero.
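One caveat, as a small sketch rather than part of the original answer: cummax() walks down the rows in whatever order they currently have, so if the real data is not already sorted chronologically within each customer, sort it first. The column name rec2 below is just a name chosen here so the result can be compared against the manually entered rec:
library(data.table)

setorder(demo, cust, week)                       # ensure chronological order per customer
demo[, rec2 := cummax(week * trans), by = cust]  # same idea as the answer above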

Updating a table with the rolling average of previous rows in R?

So I have a table where every row represents a given user in a specific event. Each row contains two types of information: the outcomes of the event, as well as data about that specific user. Multiple users can take part in the same event.
For clarity, here is a simplified example of such a table:
EventID Date Revenue Time(s) UserID X Y Z
1 1/1/2017 $10 120 1 3 2 2
1 1/1/2017 $15 150 2 2 1 2
2 2/1/2017 $50 60 1 1 5 1
2 2/1/2017 $45 100 4 3 5 2
3 3/1/2017 $25 75 1 2 3 1
3 3/1/2017 $20 210 2 5 5 1
3 3/1/2017 $25 120 3 1 0 4
3 3/1/2017 $15 100 4 3 1 1
4 4/1/2017 $75 25 4 0 2 1
My goal is to build a model that can, given a specific user's performance history (in the example, attributes X, Y and Z), predict the revenue and time for an event.
What I am after now is a way to format my data in order to train and test such a model. More specifically, I want to transform the table so that each row keeps the event-specific information while presenting the moving average of each user's attributes up until the previous event. An example of the thought process: up until an event, a user presents averages of 2, 3.5 and 1.5 in attributes X, Y and Z respectively, and the revenue and time outcomes of that event were $25 and 75; I will now use this as an input for my training.
Once again for clarity, here is an example of the output I would expect applying such logic on the original table:
EventID Date Revenue Time(s) UserID X Y Z
1 1/1/2017 $10 120 1 0 0 0
1 1/1/2017 $15 150 2 0 0 0
2 2/1/2017 $50 60 1 3 2 2
2 2/1/2017 $45 100 4 0 0 0
3 3/1/2017 $25 75 1 2 3.5 1.5
3 3/1/2017 $20 210 2 2 1 2
3 3/1/2017 $25 120 3 0 0 0
3 3/1/2017 $15 100 4 3 5 2
4 4/1/2017 $75 25 4 3 3 1.5
Notice that in each user's first appearance all attributes are 0, since we still know nothing about them. Also, in a user's second appearance, all we know is the result of their first appearance. In lines 5 and 9, the third appearances of users 1 and 4 start to show the rolling mean of their previous performances.
If I were dealing with only a single user, I would tackle this problem by simply calculating the moving average of his attributes, and then shifting only the data in the attribute columns down one row. My questions are:
Is there a way to perform such shift filtered by UserID, when dealing with a table with multiple users?
Or is there a better way in R to calculate the rolling mean directly from the original table by always placing a result in each user's next appearance?
It can be assumed that all rows are already sorted by date. Any other tips or references related to this problem are also welcome.
Also, it wasn't obvious how to summarize my question in a one-line title, so I'm open to suggestions from any R experts who might think of a better way of describing it.
We can achieve your desired output using the dplyr package.
library(dplyr)
tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate_at(c("X", "Y", "Z"), cummean) %>%
  mutate_at(c("X", "Y", "Z"), lag) %>%
  mutate_at(c("X", "Y", "Z"), funs(ifelse(is.na(.), 0, .))) %>%
  arrange(EventID, UserID) %>%
  ungroup()
We arrange the data, group it, and then apply the desired transformations (the dplyr functions cummean, lag, and replacing NA with 0 using an ifelse).
Once this is done, we rearrange the data to its original state, and ungroup it.
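In newer dplyr versions funs() is deprecated; an equivalent (untested) rewrite of the same pipeline using across() might look like:
library(dplyr)

tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate(across(c(X, Y, Z), cummean)) %>%                     # running mean including current row
  mutate(across(c(X, Y, Z), lag)) %>%                         # shift down: only past events count
  mutate(across(c(X, Y, Z), ~ ifelse(is.na(.x), 0, .x))) %>%  # first appearance becomes 0
  arrange(EventID, UserID) %>%
  ungroup()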

Contingency table based on third variable (numeric)

Some time ago I asked a question about creating market basket data. Now I would like to create a similar data.frame, but based on a third variable. Unfortunately, I ran into problems trying. Previous question: Efficient way to create market basket matrix in R
#shadow and #SimonO101 gave me good answers, but I was not able to alter their answers correctly. I have the following data:
Customer <- as.factor(c(1000001,1000001,1000001,1000001,1000001,1000001,1000002,1000002,1000002,1000003,1000003,1000003))
Product <- as.factor(c(100001,100001,100001,100004,100004,100002,100003,100003,100003,100002,100003,100008))
input <- data.frame(Customer,Product)
I can create a contingency table now the following way:
input_df <- as.data.frame.matrix(table(input))
However I have a third (numeric) variable which I want as output in the table.
Number <- c(3,1,-4,1,1,1,1,1,1,1,1,1)
input <- data.frame(Customer,Product,Number)
Now the code does not work anymore (of course, as there are now 3 variables). The result I am looking for has the unique Customer values as row names and the unique Product values as column names, with Number as the value (or 0 if not present); this number could be calculated by:
input_agg <- aggregate( Number ~ Customer + Product, data = input, sum)
Hope my question is clear, please comment if something is not clear.
You can use xtabs for that :
R> xtabs(Number~Customer+Product, data=input)
Product
Customer 100001 100002 100003 100004 100008
1000001 0 1 0 2 0
1000002 0 0 3 0 0
1000003 0 1 1 0 1
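If a plain data.frame (with Customer as row names, as in the earlier as.data.frame.matrix(table(input)) approach) is preferred over the contingency-table class that xtabs returns, the result can be converted; a short sketch:
tab <- xtabs(Number ~ Customer + Product, data = input)
as.data.frame.matrix(tab)   # rows = Customer, columns = Product, 0 where absent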
This class of problem is what reshape2::dcast is designed for...
require( reshape2 )
# Too many rows so change to a data.table.
dcast( input , Customer ~ Product , fun = sum , value.var = "Number" )
# Customer 100001 100002 100003 100004 100008
#1 1000001 0 1 0 2 0
#2 1000002 0 0 3 0 0
#3 1000003 0 1 1 0 1
Recently, the method for using dcast with a data.table object was implemented by #Arun in response to FR #2627. Great stuff. You will have to use the development version 1.8.11. Also, at the moment, it should be used as dcast.data.table. This is because dcast is not an S3 generic yet in the reshape2 package. That is, you can do:
require(reshape2)
require(data.table)
input <- data.table(input)
dcast.data.table(input , Customer ~ Product , fun = sum , value.var = "Number")
# Customer 100001 100002 100003 100004 100008
# 1: 1000001 0 1 0 2 0
# 2: 1000002 0 0 3 0 0
# 3: 1000003 0 1 1 0 1
This should work quite well on bigger data and should be much faster than reshape2:::dcast as well.
Alternatively, you can try the reshape:::cast version which may or may not crash... Try it!!
require(reshape)
input <- data.table( input )
cast( input , Customer ~ Product , fun = sum , value = .(Number) )
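Note (not part of the original answers): in current data.table releases dcast() dispatches on data.table objects directly, so neither the development version nor the dcast.data.table() spelling is needed any more; a minimal sketch:
library(data.table)

input <- as.data.table(input)
dcast(input, Customer ~ Product, fun.aggregate = sum, value.var = "Number")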

Resources