Indexing customer transactions in R [duplicate]

This question already has answers here:
Create counter with multiple variables [duplicate]
(6 answers)
Closed 9 years ago.
I'd like to index customer transactions in an R dataframe so that I can easily identify, say, the third transaction that a particular customer has made. For example, if I have the following data frame (ordered by customer and transaction date):
transactions = data.frame(CUST.ID = c(1, 1, 2, 2, 2, 2, 3, 3, 3),
                          DATE = as.Date(c("2009-07-02", "2013-08-15", "2010-01-02",
                                           "2004-03-05", "2006-02-03", "2007-01-01",
                                           "2004-03-05", "2006-02-03", "2007-01-01")),
                          AMOUNT = c(5, 9, 21, 34, 76, 1, 100, 23, 10))
> transactions
CUST.ID DATE AMOUNT
1 1 2009-07-02 5
2 1 2013-08-15 9
3 2 2010-01-02 21
4 2 2004-03-05 34
5 2 2006-02-03 76
6 2 2007-01-01 1
7 3 2004-03-05 100
8 3 2006-02-03 23
9 3 2007-01-01 10
I can clearly see that customer 1 has made 2 transactions, customer 2 has made 4, etc.
What I would like is to index these transactions by customer, creating a new column in my dataframe. The following code achieves what I want:
transactions$COUNTER = 1
transactions$CUSTOMER.TRANS.NO = unlist(aggregate(COUNTER ~ CUST.ID,
data = transactions,
function(x) {rank(x, ties.method = "first")})[, 2])
transactions$COUNTER = NULL
> transactions
CUST.ID DATE AMOUNT CUSTOMER.TRANS.NO
1 1 2009-07-02 5 1
2 1 2013-08-15 9 2
3 2 2010-01-02 21 1
4 2 2004-03-05 34 2
5 2 2006-02-03 76 3
6 2 2007-01-01 1 4
7 3 2004-03-05 100 1
8 3 2006-02-03 23 2
9 3 2007-01-01 10 3
Now the first transaction for each customer is labelled 1, the second 2, etc.
So I've got what I want, but it's an ugly piece of code: creating a dummy counter column, aggregating, then unlisting. Is anyone with more experience than me able to come up with a better solution?

Because you've taken the effort to post the sample code you tried (making your question a better Stack Overflow question than the duplicate I've linked to), I'll summarize the options here:
ave
within(transactions, { Trans.No <- ave(CUST.ID, CUST.ID, FUN = seq_along) })
getanID
library(splitstackshape)
getanID(transactions, "CUST.ID")
rle
## Depends on your data being sorted
transactions$Trans.No <- sequence(rle(transactions$CUST.ID)$lengths)
data.table
library(data.table)
DT <- data.table(transactions)
DT[, .id := sequence(.N), by = "CUST.ID"]
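dplyr
Not among the original options: a dplyr sketch of the same per-customer counter (assumes the dplyr package is available; row_number() numbers rows within each group).
library(dplyr)
transactions %>%
  group_by(CUST.ID) %>%
  mutate(CUSTOMER.TRANS.NO = row_number()) %>%  # row number within each customer
  ungroup()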

library(plyr)
ddply(transactions, .(CUST.ID), transform, CUSTOMER.TRANS.NO = seq(1, length(CUST.ID), 1))
CUST.ID DATE AMOUNT CUSTOMER.TRANS.NO
1 1 2009-07-02 5 1
2 1 2013-08-15 9 2
3 2 2010-01-02 21 1
4 2 2004-03-05 34 2
5 2 2006-02-03 76 3
6 2 2007-01-01 1 4
7 3 2004-03-05 100 1
8 3 2006-02-03 23 2
9 3 2007-01-01 10 3

Related

Is there a quick way to transform intervals (Start and End) into a list of the numbers in each interval in R

I have a file of 50M lines with interval values such as this:
>data
start_pos end_pos
1 1 10
2 3 6
3 5 9
4 6 11
And I would like to have a table of position occurrences so that I can compute the coverage on each position in the interval file such as this:
>occurence
position coverage
1 1
2 1
3 2
4 2
5 3
6 4
7 3
8 3
9 3
10 2
11 1
Is there a fast, memory-efficient way to complete this task in R?
My plan was to loop through the data, concatenate the sequence of positions in each interval into a vector, and convert the final vector into a table.
count <- c()
for (row in 1:nrow(data)) {
  count <- c(count, data[row, ]$start_pos:data[row, ]$end_pos)
}
occurence <- table(count)
The problem is that my file is huge, and this takes way too much time and memory.
The Bioconductor IRanges package does this quickly and efficiently:
library(IRanges)
ir = IRanges(start = c(1, 3, 5, 6), end = c(10, 6, 9, 11))
coverage(ir)
which, converted to a data frame, matches your desired output:
> coverage(ir) |> as.data.frame()
value
1 1
2 1
3 2
4 2
5 3
6 4
7 3
8 3
9 3
10 2
11 1
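If installing Bioconductor is not an option, the same idea can be sketched in base R: mark +1 at each interval start and -1 just past each end, then take a cumulative sum (a sketch, assuming integer positions starting at 1; tabulate handles duplicated starts and ends correctly):
start_pos <- c(1, 3, 5, 6)
end_pos <- c(10, 6, 9, 11)
n <- max(end_pos)
# +1 at every start, -1 one past every end; cumsum yields coverage per position
delta <- tabulate(start_pos, nbins = n + 1) - tabulate(end_pos + 1, nbins = n + 1)
data.frame(position = seq_len(n), coverage = cumsum(delta)[seq_len(n)])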

Pair-wise manipulating rows in data.frame

I have data on several thousand US basketball players over multiple years.
Each basketball player has a unique ID. We know which team they play for and which position they play in a given year, much like the mock data df below:
df <- data.frame(id = c(rep(1:4, times = 2), 1),
                 year = c(1, 1, 2, 2, 3, 4, 4, 4, 5),
                 team = c(1, 2, 3, 4, 2, 2, 4, 4, 2),
                 position = c(1, 2, 3, 4, 1, 1, 4, 4, 4))
> df
id year team position
1 1 1 1 1
2 2 1 2 2
3 3 2 3 3
4 4 2 4 4
5 1 3 2 1
6 2 4 2 1
7 3 4 4 4
8 4 4 4 4
9 1 5 2 4
What is an efficient way to manipulate df into new_df below?
> new_df
id move time position.1 position.2 year.1 year.2
1 1 0 2 1 1 1 3
2 2 1 3 2 1 1 4
3 3 0 2 3 4 2 4
4 4 1 2 4 4 2 4
5 1 0 2 1 4 3 5
In new_df the first occurrence of each basketball player is compared to the second occurrence, recording whether the player switched teams and how long it took the player to make the switch.
Note:
In the real data some basketball players occur more than twice and can play for multiple teams and on multiple positions.
In such a case a new row in new_df is added that compares each additional occurrence of a player with only the previous occurrence.
Edit: I don't think this is a simple reshape exercise, for the reasons mentioned in the previous two sentences. To clarify this, I've added an additional occurrence of player ID 1 to the mock data.
Any help is most welcome and appreciated!
## Note: this assumes the original eight-row df (exactly two occurrences
## per player, in id order); it errors on the edited nine-row mock data.
s = table(df$id)
df$time = rep(1:max(s), each = length(s))
df1 = reshape(df, idvar = "id", dir = "wide")
transform(df1, move = +(team.1 == team.2), time = year.2 - year.1)
id year.1 team.1 position.1 year.2 team.2 position.2 move time
1 1 1 1 1 3 2 1 0 2
2 2 1 2 2 4 2 1 1 3
3 3 2 3 3 4 4 4 0 2
4 4 2 4 4 4 4 4 1 2
The code below should get you to the point where the data is transposed; you'll still have to create the move and time variables yourself (see the sketch after the code).
df <- data.frame(id = rep(1:4, times=2),
year = c(1, 1, 2, 2, 3, 4, 4, 4),
team = c(1, 2, 3, 4, 2, 2, 4, 4),
position = c(1, 2, 3, 4, 1, 1, 4, 4))
library(reshape2)
library(data.table)
setDT(df)  # convert to data.table
df[, rno := rank(year, ties.method = "min"), by = .(id)]  # occurrence number per player
# creating the transposed dataset
Dcast_DT <- dcast(df, id ~ rno, value.var = c("year", "team", "position"))
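The remaining step might look like this (a sketch: the column names assume data.table's default "_" separator in dcast, and move follows the question's expected output, where move is 1 when the team is unchanged, matching +(team.1 == team.2) above):
Dcast_DT[, time := year_2 - year_1]               # years between occurrences
Dcast_DT[, move := as.integer(team_1 == team_2)]  # 1 if the team stayed the same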
This piece of code did the trick, using data.table:
# transform to data.table
dt <- as.data.table(df)
# sort on year
setorder(dt, year, na.last = TRUE)
# indicate the names of the new columns
new_cols <- c("time", "move", "prev_team", "prev_year", "prev_position")
# set up the new variables
dt[, (new_cols) := list(year - shift(year), team != shift(team),
                        shift(team), shift(year), shift(position)), by = id]
# select only repeating occurrences (each player's first row has no previous one)
dt <- dt[!is.na(time), ]
# outcome
dt
id year team position time move prev_team prev_year prev_position
1: 1 3 2 1 2 TRUE 1 1 1
2: 2 4 2 1 3 FALSE 2 1 2
3: 3 4 4 4 2 TRUE 3 2 3
4: 4 4 4 4 2 FALSE 4 2 4
5: 1 5 2 4 2 FALSE 2 3 1

Rolling sum in specified range

For df I want to take the rolling sum of the Value column over the last 10 seconds, with Time given in seconds. The data frame is very large (millions of data points at millisecond resolution), so using tidyr::complete is not an option. I'd prefer a dplyr solution, but I think it may be possible with a data.table join; I just can't make it work.
df = data.frame(Row = c(1, 2, 3, 4, 5, 6, 7),
                Value = c(4, 7, 2, 6, 3, 8, 3),
                Time = c(10021, 10023, 10027, 10035, 10055, 10058, 10092))
Solution would add a column (Sum.10S) that takes the rolling sum of past 10 seconds:
df$Sum.10S=c(4,11,13,8,3,11,3)
Define a function sum10 which sums the values from the last 10 seconds and use it with rollapplyr. It avoids explicit looping and, on the data in the question, runs about 10x faster than an explicit loop. (The window width of 10 rows assumes no more than 10 observations ever fall within a 10-second span.)
library(zoo)
sum10 <- function(x) {
  if (is.null(dim(x))) x <- t(x)  # a single-row partial window arrives as a vector
  tt <- x[, "Time"]
  sum(x[tt >= tail(tt, 1) - 10, "Value"])  # keep only rows from the last 10 seconds
}
transform(df, S10 = rollapplyr(df, 10, sum10, by.column = FALSE, partial = TRUE))
giving:
Row Value Time S10
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3
Well I wasn't fast enough to get the first answer in. But this solution is simpler, and doesn't require an external library.
df = data.frame(Row = c(1, 2, 3, 4, 5, 6, 7),
                Value = c(4, 7, 2, 6, 3, 8, 3),
                Time = c(10021, 10023, 10027, 10035, 10055, 10058, 10092))
df$SumR <- NA
for (i in 1:nrow(df)) {
  df$SumR[i] <- sum(df$Value[which(df$Time <= df$Time[i] & df$Time >= df$Time[i] - 10)])
}
Row Value Time SumR
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3
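Since the question mentions a data.table join, here is a non-equi self-join sketch (an assumption, not part of the original answers; lookup is a helper table introduced here):
library(data.table)
dt <- as.data.table(df)
# one lookup row per observation: the window [Time - 10, Time]
lookup <- dt[, .(t_hi = Time, t_lo = Time - 10)]
# for each window, sum Value over the rows whose Time falls inside it
dt[, Sum.10S := dt[lookup, on = .(Time <= t_hi, Time >= t_lo),
                   sum(Value), by = .EACHI]$V1]
dt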

Adding NA's where data is missing [duplicate]

This question already has an answer here:
Insert missing time rows into a dataframe
(1 answer)
Closed 5 years ago.
I have a dataset that looks like the following:
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
So basically there is a variable called id that identifies the sample, a variable called cycle that identifies the timepoint, and a variable called value that gives the value at that timepoint.
As you can see, sample 3 does not have cycle 2 data and sample 4 is missing cycle 1 and 3 data. What I want to know is whether there is a way, without a loop, to place NA's where there is no data. I would like my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements, but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large, so I need something generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging to your original data (and using all.x = TRUE, which is like a left join in SQL), the combinations missing from dat are filled in with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4, cycle = 1:3)
# or you could do (HT #jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
# cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = T)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the tidyverse:
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

Double match in R

I have a huge data set in R with one row per individual. One of my columns shows a family identifier (note: sex == 1 is male, sex == 2 is female).
ind sex income hw family.id
1 1 10 6 fam.1
2 2 8 7 fam.1
3 2 15 8 fam.2
4 1 7 4 fam.3
5 2 9 5 fam.3
How can I do a "double matching" so that I can match couples in the data set on the variables I'm interested in? For example, individual 2 (female), married to individual 1 (male), should receive a new column with his income (and the same goes for hw):
ind sex income hw family.id income.male hw.male
1 1 10 6 fam.1 10 6
2 2 8 7 fam.1 8 6
3 2 15 8 fam.2 - -
4 1 7 4 fam.3 7 7
5 2 9 5 fam.3 9 7
I've said "double matching" in the title because I don't just need to match on family.id; I also need to find the male that matches that family.id. The reason I am doing this is that later all males will be dropped from the data set, and I will be left only with rows for females.
I am sorry I can't show any of the code I've worked on. I've tried many approaches using match, ifelse, lapply and even unlist, but it isn't worth adding here as I couldn't make any of it work.
Does anyone have a clue? I can work with either data.frames or data.tables.
You should go with the data.table package. Here is an example:
library(data.table)
dt <- data.table(ind = c(1, 2, 3, 4, 5),
                 sex = c(1, 2, 2, 1, 2),
                 income = c(10, 8, 15, 7, 9),
                 hw = c(6, 7, 8, 4, 5),
                 family.id = c('fam.1', 'fam.1', 'fam.2', 'fam.3', 'fam.3'))
setkeyv(dt, 'family.id')
dt2 <- dt[dt[sex == 1, list(family.id, income, hw)]]
It will take the income and hw of the males (dt[sex == 1, list(family.id, income, hw)]) and match all individuals on family.id. As a result you obtain:
ind sex income hw family.id i.income i.hw
1: 1 1 10 6 fam.1 10 6
2: 2 2 8 7 fam.1 10 6
3: 4 1 7 4 fam.3 7 4
4: 5 2 9 5 fam.3 7 4
The columns with the i. prefix contain the male's values for every family. Note that if no male is present for a family, you will not get rows for it. If you still need those rows you can do:
dt2 <- merge(dt, dt[sex == 1, list(family.id, income, hw)], by = 'family.id', suffixes = c('', '.i'), all = TRUE)
to receive
family.id ind sex income hw income.i hw.i
1: fam.1 1 1 10 6 10 6
2: fam.1 2 2 8 7 10 6
3: fam.2 3 2 15 8 NA NA
4: fam.3 4 1 7 4 7 4
5: fam.3 5 2 9 5 7 4
Later when you need to drop male data you do:
dt2[sex == 2]
Let's assume the data frame is named 'dat'. You can merge the males and females by family.id with the merge function. Your proposed answer didn't make sense to me or to the other commenters, but you can reassign "income" or "hw" within this new object.
> merge( dat[ dat$sex==1, ], dat[dat$sex==2,] , by="family.id")
family.id ind.x sex.x income.x hw.x ind.y sex.y income.y hw.y
1 fam.1 1 1 10 6 2 2 8 7
2 fam.3 4 1 7 4 5 2 9 5
To follow up on my comment:
require(data.table)
dt[dt[sex == 1L], c("i.m", "hw.m") := .(i.income, i.hw), on = "family.id"][]
For each family.id this finds the male rows (sex == 1L) and adds two columns by reference with the corresponding income and hw values.
where dt is:
dt = fread('ind sex income hw family.id
1 1 10 6 fam.1
2 2 8 7 fam.1
3 2 15 8 fam.2
4 1 7 4 fam.3
5 2 9 5 fam.3')
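For completeness, a base-R sketch using match() (not from the original answers; it assumes at most one male per family.id and leaves NA where a family has no male):
dat <- as.data.frame(dt)                      # reuse the dt defined above
males <- dat[dat$sex == 1, ]
idx <- match(dat$family.id, males$family.id)  # first (assumed only) male per family
dat$income.male <- males$income[idx]          # NA for families without a male
dat$hw.male <- males$hw[idx]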
