Summarize a data.table with unreliable data

I have a data.table of events recording, say, user ID, country of residence, and event.
E.g.,
library(data.table)
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")
As you can see, the data is somewhat corrupted: event 5 reports user 3 as being in country 2 (or maybe he traveled - it does not matter to me here).
So when I try to summarize the data:
dt[, country[.N], by = user]
   user V1
1:    3  2
2:    4  2
I get the wrong country for user 3.
Ideally, I would like to get the most common country for a user and the percentage of time he spent there:
   user country support
1:    3       1     0.8
2:    4       2     1.0
How do I do that?
The actual data has ~10^7 rows, so the solution has to scale (this is why I am using data.table rather than data.frame in the first place).

Another way:
dt.out <- dt[, .N, by = list(user, country)][, list(country[which.max(N)],
                                                    max(N) / sum(N)), by = user]
setnames(dt.out, c("V1", "V2"), c("country", "support"))
#    user country support
# 1:    3       1     0.8
# 2:    4       2     1.0

Using plyr's count function:
library(plyr)
dt[, count(country), by = user][order(-freq),
                                list(country = x[1],
                                     support = freq[1] / sum(freq)),
                                by = user]
#    user country support
# 1:    4       2     1.0
# 2:    3       1     0.8
The idea is to count the countries per user, order by descending frequency, and then pick out the values you need.
A smarter answer thanks to @mnel, that doesn't use extra functions:
dt[, list(freq = .N),
   by = list(user, country)][order(-freq),
                             list(country = country[1],
                                  support = freq[1] / sum(freq)),
                             by = user]
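A close variant, added here purely for illustration (it is not from the original answers): setorder() sorts the count table in place before picking the top row per user.
counts <- dt[, .(freq = .N), by = .(user, country)]
setorder(counts, user, -freq)                  # sort by user, then by descending count
counts[, .(country = country[1],               # most frequent country per user
           support = freq[1] / sum(freq)), by = user]
#    user country support
# 1:    3       1     0.8
# 2:    4       2     1.0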

Related

How to get each customer's entry and exit to the shop?

I have a dataset of each customer being captured by sensors at different times. A customer entering the shop is captured by sensor_id 1, and the customer can also enter through sensor_id 2, but can only exit through sensor_id 3. The dataset looks like below:
library(data.table)
library(lubridate)
DT1 <- data.table(
  customer_id = c(1, 1, 1, 2, 2, 2, 1, 1),
  sensor_id = c(1, 2, 3, 1, 2, 3, 2, 3),
  in_time = c("2017-01-01 00:00:05", "2017-01-01 00:06:35", "2017-01-01 00:23:44",
              "2017-01-02 22:00:20", "2017-01-02 22:01:09", "2017-01-02 22:28:02",
              "2017-01-03 22:23:01", "2017-01-03 22:50:52")
)
DT1[, in_time := ymd_hms(in_time)]
So from this, I wanted to get the data table:
result <- data.table(
  customer_id = c(1, 2, 1),
  entry_sensor_id = c(1, 1, 2),
  entry_time = c("2017-01-01 00:00:05", "2017-01-02 22:00:20", "2017-01-03 22:23:01"),
  exit_sensor_id = c(3, 3, 3),
  exit_time = c("2017-01-01 00:23:44", "2017-01-02 22:28:02", "2017-01-03 22:50:52")
)
So I tried the below:
DT1[, spotted_group := rleid(cumsum(difftime(in_time,
    shift(in_time, fill = first(in_time)), units = "mins") > 120)), by = customer_id]
DT1Stretch <- DT1[DT1[order(in_time), .I[c(1L, .N)], by = list(customer_id, spotted_group)]$V1]
DT1Stretch[, c(.SD[1, ], .SD[2, ]), by = c("customer_id", "spotted_group")]
But this approach does not work if the customer returns to the shop within 2 hours, since I label spotted_group based on a 120-minute gap, which is not ideal.
Not sure which is the right way to deal with my problem. Any help is appreciated.
Here is an option using unique by exit sensor & time after a rolling join:
unique(
  DT1[sensor_id == 3L][DT1[sensor_id != 3L], on = .(customer_id, in_time), roll = -Inf,
      .(customer_id, entry_sensor_id = i.sensor_id, entry_time = i.in_time,
        exit_sensor_id = 3L, exit_time = x.in_time)],
  by = c("customer_id", "exit_sensor_id", "exit_time"))
Does this answer your question?
library(data.table)
library(lubridate)
DT <- data.table(
  customer_id = c(1, 1, 1, 2, 2, 2, 1, 1),
  sensor_id = c(1, 2, 3, 1, 2, 3, 2, 3),
  in_time = c("2017-01-01 00:00:05", "2017-01-01 00:06:35", "2017-01-01 00:23:44",
              "2017-01-02 22:00:20", "2017-01-02 22:01:09", "2017-01-02 22:28:02",
              "2017-01-03 22:23:01", "2017-01-03 22:50:52")
)
DT[, in_time := lubridate::ymd_hms(in_time)]
# For both sensors 1 & 2 the customer is in
DT[, customer_in := sensor_id %in% c(1, 2)]
# Aggregate sensors 1 & 2, find the first entry time of each stay
inout <- DT[order(customer_id, in_time)][
  , .(in_time = min(in_time)), by = .(customer_id, rleid(customer_in), customer_in)]
# Separate entry & exit
entry <- inout[customer_in == TRUE]
exit <- inout[customer_in == FALSE]
# Join results: each exit rolls back to the latest preceding entry
entry[exit, .(customer_id, in_time = x.in_time, out_time = i.in_time), roll = Inf, on = .(customer_id, in_time)]
   customer_id             in_time            out_time
1:           1 2017-01-01 00:00:05 2017-01-01 00:23:44
2:           2 2017-01-02 22:00:20 2017-01-02 22:28:02
3:           1 2017-01-03 22:23:01 2017-01-03 22:50:52
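A mirror-image toy sketch for the roll = Inf used here (again with hypothetical tables, as a rough check rather than part of the original answer): each exit time matches the latest entry at or before it, i.e. the last observation is carried forward.
library(data.table)
entry <- data.table(id = 1L, t = c(3L, 12L))
exit  <- data.table(id = 1L, t = c(10L, 20L))
entry[exit, on = .(id, t), roll = Inf, .(id, entry_t = x.t, exit_t = i.t)]
#    id entry_t exit_t
# 1:  1       3     10
# 2:  1      12     20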

R: pivoting & subtotals in data.table?

Pivoting and subtotals are common auxiliary steps in spreadsheets and SQL.
Assume a data.table with the fields date, myCategory, and revenue. Assume that you want to know the proportion of day revenue to all revenue, and the proportion of day revenue within different subgroups, such that
b[, {
  # first auxiliary variable: all revenue
  totalRev = sum(revenue)  # SUBGROUP OF ALL REV
  # second auxiliary variable: revenue by date. Syntax wrong! How to do this?
  {totalRev_date = sum(revenue), by = list(date)}  # DIFFERENT SUBGROUP, by DATE's rev
  # within the subgroup by date and myCategory, we will use the 1st & 2nd auxiliary vars
  .SD[, .(Revenue_prop_of_TOT = revenue / totalRev,
          Revenue_prop_of_DAY = revenue / totalRev_date), by = list(myCategory, date)]
}, ]
where we need to compute the auxiliary sums: all revenue of a specific day, and all revenue over the whole history.
The end result should look like this:
date        myCategory  Revenue_prop_of_TOT  Revenue_prop_of_DAY
2019-01-01  Cat1        0.002                0.2
...
where you see that the auxiliary variables are only helpers.
How can you pivot and compute subtotals within R data.table?
Another option using data.table::cube:
cb <- cube(DT, sum(value), by = c("date", "category"), id = TRUE)
cb[grouping == 0L, .(date, category,
                     PropByDate = V1 / cb[grouping == 1L][.SD, on = "date", x.V1],
                     PropByCategory = V1 / cb[grouping == 2L][.SD, on = "category", x.V1],
                     PropByTotal = V1 / cb[grouping == 3L, V1]
)]
output:
   date category PropByDate PropByCategory PropByTotal
1:    1        1  0.3333333      0.2500000         0.1
2:    1        2  0.6666667      0.3333333         0.2
3:    2        1  0.4285714      0.7500000         0.3
4:    2        2  0.5714286      0.6666667         0.4
data:
DT <- data.table(date = c(1, 1, 2, 2), category = c(1, 2, 1, 2), value = 1:4)
#    date category value
# 1:    1        1     1
# 2:    1        2     2
# 3:    2        1     3
# 4:    2        2     4
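To see what the grouping id encodes, it helps to print the cube itself. As I read the bit mask (worth double-checking against ?cube for your version), 0 marks rows grouped by both columns, 1 rows where category has been totalled out (grouped by date only), 2 rows where date has been totalled out, and 3 the grand total:
cube(DT, sum(value), by = c("date", "category"), id = TRUE)
# for the DT above this yields sums such as:
#   grouping 0: the four date x category cells (V1 = 1, 2, 3, 4)
#   grouping 1: per-date totals (date 1: V1 = 3; date 2: V1 = 7)
#   grouping 2: per-category totals (category 1: V1 = 4; category 2: V1 = 6)
#   grouping 3: the grand total (V1 = 10)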
Hopefully I'm understanding correctly what you intend, but please let me know in the comments if you need a different output.
b = data.table(date = rep(seq.Date(Sys.Date() - 99, Sys.Date(), "days"), each = 2),
               myCategory = c("a", "b"),
               revenue = rnorm(200, 200))
# global total, just create a constant
totalRev = b[, sum(revenue)]
# total revenue at myCategory and date level / total revenue
b[, Revenue_prop_of_TOT := sum(revenue) / totalRev, by = .(myCategory, date)]
# you can calculate totalRev_date independently
b[, totalRev_date := sum(revenue), by = date]
# if these are all the columns you have, each (myCategory, date) group is a single
# row, so you don't strictly need the sum(revenue) and by calls
b[, Revenue_prop_of_DAY := sum(revenue) / totalRev_date, by = .(myCategory, date)]
Finally, I would wrap it in a function:
revenue_total <- function(b) {
  totalRev = b[, sum(revenue)]
  b[, Revenue_prop_of_TOT := sum(revenue) / totalRev, by = .(myCategory, date)]
  b[, totalRev_date := sum(revenue), by = date]
  b[, Revenue_prop_of_DAY := sum(revenue) / totalRev_date, by = .(myCategory, date)]
  b
}
b = revenue_total(b)
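A quick sanity check (my addition, not part of the original answer): within each date, the day proportions over all categories should sum to 1.
b[, sum(Revenue_prop_of_DAY), by = date][, all.equal(V1, rep(1, .N))]
# [1] TRUE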
Options for pivoting and subtotals in R data.table:
cube, as answered here
groupingsets, as commented by marbel here; a minimal sketch follows below
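For reference, a minimal groupingsets() sketch of my own (assuming a data.table version that exports groupingsets(); it reuses the DT from the cube answer above):
library(data.table)
DT <- data.table(date = c(1, 1, 2, 2), category = c(1, 2, 1, 2), value = 1:4)
# subtotals by date, by category, plus the grand total, in one call
groupingsets(DT, j = sum(value), by = c("date", "category"),
             sets = list("date", "category", character(0)), id = TRUE)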

Totalize component products from table

I need my warehouse to be able to know how many items of each component we need per day. Basically, I have bundled items made of single products, and I want the warehouse to know how many of those single items they should provide in any given date.
I currently have data like this:
date        bundle_name  totbund  prod1  totprod1  prod2  totprod2
06/01/2019  a_bund       1        a      1         b      1
06/01/2019  a            1
06/01/2019  b            2
07/01/2019  b_bund       1        b      2
07/01/2019  b_bund       2        b      4
07/01/2019  b            2
My expected output is this:
date        all_item  total
06/01/2019  a         2
06/01/2019  b         3
07/01/2019  b         8
Please notice that the bundle_name column can hold bundled items or single items, so it is mixed.
Something like this could work (I use 'a' as an example):
library(dplyr)
dat = dat %>%
  group_by(date) %>%
  summarize(a_bund = sum(totbund[bundle_name == 'a']),
            a_prod1 = sum(totprod1[prod1 == 'a']),
            a_prod2 = sum(totprod2[prod2 == 'a'])) %>%
  mutate(a = a_bund + a_prod1 + a_prod2)
I wouldn't use the bundled notation; it sounds overly complicated. If you have everything in row format, you can use the group_by / summarize functionality of dplyr.
Assuming the data is called 'df':
library(dplyr)
df <- df %>%
  select(date, prod = prod1, totprod = totprod1) %>%
  filter(prod != "") %>%
  bind_rows(df %>% select(date, prod = prod2, totprod = totprod2) %>% filter(prod != "")) %>%
  group_by(date, prod) %>%
  summarize(totprod = sum(totprod))
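For comparison, the same stack-then-aggregate idea in base R, as a rough sketch of my own (it assumes the same df, with empty strings marking unused product slots; the bundle columns would still need separate handling):
long <- rbind(
  data.frame(date = df$date, prod = df$prod1, totprod = df$totprod1),
  data.frame(date = df$date, prod = df$prod2, totprod = df$totprod2)
)
long <- long[long$prod != "", ]       # drop empty product slots
aggregate(totprod ~ date + prod, data = long, FUN = sum)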
As I said in the comments, you need a better approach to this problem.
I suggest you consider this from a structured database perspective. In such a view, your data (and thus your world) is made of tables with differing and complementary information, and when you need information to solve your problem, you join data from different tables. If you have used Excel, then you'll know the idea as VLOOKUP.
How I'd approach your problem:
Table of components:
First, I'll have a table of components. This would be a very simple table of 3 columns: name of the product, component from which it is made and amount of component needed.
For your example, I'll have
library(data.table)
components <- structure(list(name = c("a", "b", "a_bund", "a_bund", "b_bund"),
                             component = c("a", "b", "a", "b", "b"),
                             amount = c(1, 1, 1, 1, 2)),
                        row.names = c(NA, -5L),
                        class = c("data.table", "data.frame"))
Which will produce:
components
     name component amount
1:      a         a      1
2:      b         b      1
3: a_bund         a      1
4: a_bund         b      1
5: b_bund         b      2
Notice that the information contained here is just the same information that you have in your table in columns 4 to 7 (by the way, your table is "wide" formatted, while mine is "long" formatted; long is much better for machine processing, and it's considered "tidy").
Table of Requests
Now that you have a table for the components, you'll need a table to put how many units of product x your clients need by y date. Do you notice that I separated the information content in both tables? There's one with components and nothing else; and there's one with requests and nothing else. Each thingy in its own basket!
This table I called requests and it's composed of three columns: dates with the date of the request, name with the name of the product requested by the client, and qty with the quantity the client expects of the product. That would be what you have in columns one to three in your data.
requests <- structure(list(dates = structure(c(17902, 17902, 17902, 17903, 17903, 17903), class = "Date"),
                           name = c("a_bund", "a", "b", "b_bund", "b_bund", "b"),
                           qty = c(1, 1, 2, 1, 2, 2)),
                      row.names = c(NA, -6L),
                      class = c("data.table", "data.frame"))
Which produces:
requests
        dates   name qty
1: 2019-01-06 a_bund   1
2: 2019-01-06      a   1
3: 2019-01-06      b   2
4: 2019-01-07 b_bund   1
5: 2019-01-07 b_bund   2
6: 2019-01-07      b   2
Joining the tables
With these two tables, you now need to know how many of each component you'll need in any given date. To solve this I'll use the data.table package, please see ?data.table for details.
requests[components, on = "name"][, sum(qty * amount), by = .(dates, component)]
What is in there?
requests[components, on = "name"] joins the table requests with components by matching elements with the same name. In other words, it brings the component and amount (from components, of course) for each name in requests. Paste the command and see what the result is.
data.table syntax allows "chaining", that is, passing an intermediate result to a new operation. That's what happens with the ][ sequence: I joined the tables and now I'm feeding that result into a new operation.
That new operation is sum(qty * amount). It multiplies (you weren't wrong initially) the number of requested units qty by the amount of each component needed to produce it, and sums it (aggregates it) by = .(dates, component), which seems pretty self-explanatory. (If you come from the Excel world, just think of a pivot table.)
That produces your expected output:
requests[components, on = "name"][, sum(qty * amount), by = .(dates, component)]
        dates component V1
1: 2019-01-06         a  2
2: 2019-01-06         b  3
3: 2019-01-07         b  8
While the result is the same other answers already provided, I hope you see the difference in approaches, and the enhanced usability of this one. If not, just imagine that k_bundle is made of 19 different components ;)
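As a supplementary sketch of my own: if your raw data really arrives in the wide shape shown in the question, data.table's melt() with its patterns() helper can reshape it toward the long components format (the wide table below is a hypothetical one-row example):
library(data.table)
wide <- data.table(name = "a_bund",
                   prod1 = "a", totprod1 = 1,
                   prod2 = "b", totprod2 = 1)
# melt the prod*/totprod* column pairs into long component/amount rows
melt(wide, id.vars = "name",
     measure.vars = patterns("^prod", "^totprod"),
     value.name = c("component", "amount"))
#      name variable component amount
# 1: a_bund        1         a      1
# 2: a_bund        2         b      1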

R count and subtract events from a data frame

I am trying to calculate family sizes from a data frame, which also contains two types of events: family members who died, and those who left the family. I would like to take these two parameters into account in order to compute the actual family size.
Here is a reproducible example of my problem, with 3 families only:
family <- factor(rep(c("001", "002", "003"), c(10, 8, 15)),
                 levels = c("001", "002", "003"), ordered = TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0, 0,0,0,0,0,0,0,1, 0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1, 1,0,0,0,1,1,0,0, 0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left); DF
I could count N = total family members (in each family) in a second dataframe DF2, by simply using table():
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N"; DF2
  family  N
1    001 10
2    002  8
3    003 15
But I cannot find a proper way to get the actual number of people (for example, creating a new variable N2 in DF2), calculated by subtracting from N the number of members who died or left the family. I suppose I have to relate the two dataframes DF and DF2 in some way; I have looked for other related questions on this site but could not find the right answer...
If anyone has a good idea, it would be great!
Thank you in advance..
Deni
Logic: first we want to group_by(family) and then calculate two numbers: i) the total number of observations in each group, and ii) that total minus (sum(dead) + sum(left)).
In the dplyr package, n() gives us the total number of observations in each group.
In data.table, .N does the same job.
library(dplyr)
DF %>% group_by(family) %>% summarise(total = n(), current = n() - sum(dead, left, na.rm = TRUE))
#  family total current
#  (fctr) (int)   (dbl)
#1    001    10       6
#2    002     8       4
#3    003    15       7
library(data.table)
# setDT() is preferred if your data is a data.frame; otherwise just use DF
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
#   family total current
#1:    001    10       6
#2:    002     8       4
#3:    003    15       7
Here is a base R option:
do.call(data.frame, aggregate(dl ~ family, transform(DF, dl = dead + left),
        FUN = function(x) c(total = length(x), current = length(x) - sum(x))))
Or a modified version:
transform(aggregate(. ~ family, transform(DF, total = 1,
          current = dead + left)[c(1, 4:5)], FUN = sum), current = total - current)
#  family total current
#1    001    10       6
#2    002     8       4
#3    003    15       7
I finally found another solution which works fine (from another post), computing everything from the original DF table. It uses the ddply function from plyr:
library(plyr)
DF <- ddply(DF, .(family), transform, total = length(family))
DF <- ddply(DF, .(family), transform, actual = length(family) - sum(dead == "1") - sum(left == "1"))
DF
Thanks a lot to everyone who helped! Deni

Combining all immediately previous rows that have the same value as last row in R

I have a dataframe that looks similar to this:
ID Description
1 "Low Blood Sugar, High Temperature"
1 "No Appetite"
2 "Blood Test Taken"
2 "D4556 Applied, No Obvious Reaction"
3 "At Rest"
1 "Lower Temperature, Improving"
1 "Walked 50m"
1 "Sedated"
What I want to do is return, as an array, the last patient ID and all comments that occurred for the same patient immediately previously. That is, not anything that happened earlier. For example, I want to return:
ID Description
1 "Lower Temperature, Improving, Walked 50m, Sedated"
I can work this out with for loops, but I have a big dataframe and want something more efficient. I'm already subsetting over different variables using ddply; this is why I only need the last observations for the last ID.
Another data.table approach (using rleid):
library(data.table)  # 1.9.6
res <- setDT(df)[, list(ID[1L], toString(Description)), by = rleid(ID)]
You can remove the rleid column later on using:
res[, rleid := NULL]
If you only want to return the last ID in aggregated form, you could do the following, which should be pretty quick:
idx <- df[, rleid(ID)]
df[idx == max(idx), list(ID[1L], toString(Description))]
#   V1                                                 V2
#1:  1 Lower Temperature, Improving, Walked 50m, Sedated
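For context (my illustration, not part of the original answer), rleid() assigns a new group id every time the value changes, so consecutive runs of the same ID share a group:
library(data.table)
rleid(c(1, 1, 2, 2, 3, 1, 1, 1))
# [1] 1 1 2 2 3 4 4 4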
I think this might be a solution:
# use rle to get the number of rows for the last id
idrows <- rle(df$ID)$lengths[length(rle(df$ID)$lengths)]
# use rle in the same way to get the actual id value
id <- rle(df$ID)$values[length(rle(df$ID)$values)]
# combine the above two variables to calculate the needed values,
# i.e. the id and the pasted-together descriptions
data.frame(id = id, Description = paste(df[(nrow(df) - idrows + 1):nrow(df), 'Description'],
                                        collapse = ','))
  id                                      Description
1  1 Lower Temperature, Improving,Walked 50m,Sedated
Just to make it a bit clearer: df[(nrow(df) - idrows + 1):nrow(df), 'Description'] selects the last rows for the needed id, and paste pastes those descriptions together.
First we create a unique group id (column id) based on runs of ID:
df$id <- cumsum(c(1, diff(df$ID) != 0))
  ID                        Description id
1  1  Low Blood Sugar, High Temperature  1
2  1                        No Appetite  1
3  2                   Blood Test Taken  2
4  2 D4556 Applied, No Obvious Reaction  2
5  3                            At Rest  3
6  1       Lower Temperature, Improving  4
7  1                         Walked 50m  4
8  1                            Sedated  4
Then we select the last rows meeting the condition and create the vector:
c(t(df[df$id == max(df$id), ][2]))
Output:
[1] "Lower Temperature, Improving" "Walked 50m"                   "Sedated"
With data.table you can try:
require(data.table)
setDT(df)
df[, list(ID = ID[1], Description = paste(Description, collapse = ", ")),
   by = cumsum(c(TRUE, df$ID[2:nrow(df)] != df$ID[1:(nrow(df) - 1)]))][,
   list(ID, Description)]
#   ID                                          Description
#1:  1       Low Blood Sugar, High Temperature, No Appetite
#2:  2 Blood Test Taken, D4556 Applied, No Obvious Reaction
#3:  3                                              At Rest
#4:  1    Lower Temperature, Improving, Walked 50m, Sedated
Although this old question already has an accepted answer, I felt challenged to add another data.table solution which uses rleid() but differs from this answer:
library(data.table)  # CRAN version 1.10.4 used
setDT(DF)[, .(ID, Description = toString(Description)), by = rleid(ID)][, .SD[.N]]
#   rleid ID                                       Description
#1:     4  1 Lower Temperature, Improving, Walked 50m, Sedated
This returns the aggregated last entries concerning the very last patient ID as requested by the OP.
As already mentioned in the other answer, rleid can be removed by chaining [, rleid := NULL].
Just to mention, we can retrieve the aggregated last entries for each patient ID with a slight modification:
setDT(DF)[, .(Description = toString(Description)), .(ID, rleid(ID))][, .SD[.N], ID]
#   ID rleid                                          Description
#1:  1     4    Lower Temperature, Improving, Walked 50m, Sedated
#2:  2     2 Blood Test Taken, D4556 Applied, No Obvious Reaction
#3:  3     3                                              At Rest
Note that it is not required to name the by parameter as long as it is the third unnamed parameter.
Data
library(data.table)
DF <- fread(
'ID Description
1 "Low Blood Sugar, High Temperature"
1 "No Appetite"
2 "Blood Test Taken"
2 "D4556 Applied, No Obvious Reaction"
3 "At Rest"
1 "Lower Temperature, Improving"
1 "Walked 50m"
1 "Sedated"'
, data.table = FALSE)
Note that the parameter data.table = FALSE instructs fread() to return a data.frame to be in line with OP's specifications.
