Counting observations for a given ID according to date in R [duplicate] - r

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 6 years ago.
First of all:
Sorry if this question might sound stupid...or I have formatted it incorrectly, I'm new to this.
I have the following data:
Value Date IDnum
1 230 2010-02-01 1
2 254 2011-07-07 2
3 300 2011-12-14 1
4 700 2011-01-23 3
5 150 2010-08-31 3
6 100 2010-05-06 1
Created using the following code:
Value <- c(230, 254, 300, 700, 150, 100)
Date <- as.Date(c("01/02/2010", "07/07/2011", "14/12/2011", "23/01/2011", "31/08/2010", "06/05/2010")
, "%d/%m/%Y")
IDnum <- c(001, 002, 001, 003, 003, 001)
MyData <- data.frame(Value, Date, IDnum)
I need R to create a column that counts and enumerates each row according to whether it is the first, second, etc. observation for the given IDnum, ordered by date, giving me something like this:
Value Date IDnum Obs
1 230 2010-02-01 1 1
2 254 2011-07-07 2 1
3 300 2011-12-14 1 3
4 700 2011-01-23 3 2
5 150 2010-08-31 3 1
6 100 2010-05-06 1 2
Thanks

library(data.table)
# assuming df is the data frame built above (MyData); the as.Date() call is only
# needed if Date is stored as character
df$Date <- as.Date(df$Date, format = "%Y-%m-%d")
# rank each Date within its IDnum group (a rank, not order(), numbers the rows correctly)
setDT(df)[, Obs := frank(Date), by = .(IDnum)]
library(dplyr)
df %>% group_by(IDnum) %>% mutate(Obs = rank(Date))
# Value Date IDnum Obs
#1: 230 2010-02-01 1 1
#2: 254 2011-07-07 2 1
#3: 300 2011-12-14 1 3
#4: 700 2011-01-23 3 2
#5: 150 2010-08-31 3 1
#6: 100 2010-05-06 1 2
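For reference, a base R equivalent not in the original answer; it assumes MyData from the question and avoids extra packages:
# rank each Date within its IDnum group; ties broken by order of appearance
MyData$Obs <- ave(as.numeric(MyData$Date), MyData$IDnum,
                  FUN = function(x) rank(x, ties.method = "first"))
MyData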

Related

Create a "flag" column in a dataset based on a another table in R

I have two datasets: dataset1 and dataset2.
zz <- "id_customer id_order order_date
1 1 2018-10
1 2 2018-11
2 3 2019-05
3 4 2019-06"
dataset1 <- read.table(text=zz, header=TRUE)
yy <- "id_customer order_date
1 2018-10
3 2019-06"
dataset2 <- read.table(text=yy, header=TRUE)
dataset2 is the result of a query where I have two columns: id_customer and date (format YYYY-mm).
Those correspond to customers that have a different status from the others in the source dataset (dataset1) for a specified month.
dataset1 is a list of transactions where I have id_customer, id_order and date (format YYYY-mm as well).
I want to enrich dataset1 with a "flag" column, set to 1 on each line whose customer id appears in dataset2 during the corresponding month.
I have tried something as follows:
dataset1$flag <- ifelse(dataset1$id_customer %in% dataset2$id_customer &
                        dataset1$order_date == dataset2$order_date,
                        "1", "0")
But I get a warning message that says 'longer object length is not a multiple of shorter object length'.
I understand that but cannot come up with a solution. Could someone please help?
You can add a flag to dataset2 and then use merge(), keeping all rows from dataset1. Borrowing Chris' data (dt1 and dt2, defined in the answer below):
dt2$flag <- 1
merge(dt1, dt2, all.x = TRUE)
ID Date flag
1 1 2018-12 NA
2 1 2019-11 NA
3 2 2018-13 NA
4 2 2019-10 NA
5 2 2019-11 1
6 2 2019-12 NA
7 2 2019-12 NA
8 3 2018-12 1
9 3 2018-12 1
10 4 2018-13 1
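If the 0/1 coding asked for in the question is needed instead of NA, one extra step (not in the original answer) fills the unmatched rows, assuming the merge result is kept in a variable:
flagged <- merge(dt1, dt2, all.x = TRUE)
# rows with no match in dt2 get NA from the merge; turn those into 0
flagged$flag[is.na(flagged$flag)] <- 0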
EDIT:
This seems to work:
Illustrative data:
set.seed(100)
dt1 <- data.frame(
ID = sample(1:4, 10, replace = T),
Date = paste0(sample(2018:2019, 10, replace = T),"-", sample(10:13, 10, replace = T))
)
dt1
ID Date
1 2 2019-12
2 2 2019-12
3 3 2018-12
4 1 2018-12
5 2 2019-11
6 2 2019-10
7 4 2018-13
8 2 2018-13
9 3 2018-12
10 1 2019-11
dt2 <- data.frame(
ID = sample(1:4, 5, replace = T),
Date = paste0(sample(2018:2019, 5, replace = T),"-", sample(10:13, 5, replace = T))
)
dt2
ID Date
1 2 2019-11
2 4 2018-13
3 2 2019-13
4 4 2019-13
5 3 2018-12
SOLUTION:
The solution uses ifelse to define a condition upon which to set the flag to 1 (as specified in the OP). That condition implies a match between dt1 and dt2; thus we use match. A complicating factor is that the condition requires a double match between two columns in each data frame. Therefore, we use apply to paste the rows of the two columns together with paste0 and search for matches among these compound strings:
dt1$flag <- ifelse(is.na(match(apply(dt1[, 1:2], 1, paste0, collapse = " "),
                               apply(dt2[, 1:2], 1, paste0, collapse = " "))),
                   NA, 1)
RESULT:
dt1
ID Date flag
1 2 2019-12 NA
2 2 2019-12 NA
3 3 2018-12 1
4 1 2018-12 NA
5 2 2019-11 1
6 2 2019-10 NA
7 4 2018-13 1
8 2 2018-13 NA
9 3 2018-12 1
10 1 2019-11 NA
To check the results we can compare them with the results obtained from merge:
flagged_only <- merge(dt1, dt2)
flagged_only
ID Date
1 2 2019-11
2 3 2018-12
3 3 2018-12
4 4 2018-13
The data frame flagged_only contains exactly the same four rows as the ones flagged 1 in dt1 -- voilà!
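A side note not in the original answer: the same compound-key idea can be written with %in%, which yields a logical vector directly (a sketch assuming a 1/0 flag is acceptable instead of 1/NA):
# build "ID Date" keys for both data frames and flag rows of dt1 that occur in dt2
key1 <- paste(dt1$ID, dt1$Date)
key2 <- paste(dt2$ID, dt2$Date)
dt1$flag <- as.integer(key1 %in% key2)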
It is very easy to add a corresponding flag in a data.table way:
# Load library
library(data.table)
# Convert created tables to data.table object
setDT(dataset1)
setDT(dataset2)
# Add {0, 1} to dataset1 if the row can be found in dataset2
dataset1[, flag := 0][dataset2, flag := 1, on = .(id_customer, order_date)]
The result looks as follows:
> dataset1
id_customer id_order order_date flag
1: 1 1 2018-10 1
2: 1 2 2018-11 0
3: 2 3 2019-05 0
4: 3 4 2019-06 1
A bit more manipulation would be needed if you had the full date/time in the datasets.
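As a sketch of that extra manipulation (not part of the original answer, and assuming order_date would then hold full Date values), one could derive a year-month key before joining:
# hypothetical: truncate full dates to a year-month key, then join on that key
dataset1[, month_key := format(order_date, "%Y-%m")]
dataset2[, month_key := format(order_date, "%Y-%m")]
dataset1[, flag := 0][dataset2, flag := 1, on = .(id_customer, month_key)]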

Count number of observations in one data frame based on values from another data frame

I have two very large data frames (50 million and 1.5 million rows) where some of the variables are the same in both. I need to compare them and add another column to one data frame which gives the count of matching observations in the other data frame.
For example: DF1 and DF2 both contain id, date, age_grp and gender variables. I want to add another column (match_count) to DF1 which shows the count of rows where DF1.id = DF2.id and DF1.date = DF2.date and DF1.age_grp = DF2.age_grp and DF1.gender = DF2.gender.
Note
DF1
id date age_grp gender val
101 20140110 1 1 666
102 20150310 2 2 777
103 20160901 3 1 444
104 20160903 4 1 555
105 20010910 5 1 888
DF2
id date age_grp gender state
101 20140110 1 1 10
101 20140110 1 1 12
101 20140110 1 2 22
102 20150310 2 2 33
In the above example the combination "id = 101, date = 20140110, age_grp = 1, gender = 1" appears twice in DF2, hence the count 2, and the combination "id = 102, date = 20150310, age_grp = 2, gender = 2" appears once, hence the count 1.
Below is the resultant data frame I am looking for
Result
id date age_grp gender val match_count
101 20140110 1 1 666 2
102 20150310 2 2 777 1
103 20160901 3 1 444 0
104 20160903 4 1 555 0
105 20010910 5 1 888 0
Here is what I am doing at the moment; it works perfectly well for small data but does not scale to large data. In this instance it did not return any results even after several hours.
Note: I have gone through this thread and it does not address the scale issue
with(DF1,
     mapply(
       function(arg_id, arg_agegrp, arg_gender, arg_date) {
         # count rows of DF2 matching this row of DF1 on all four keys
         sum(arg_id == DF2$id
             & arg_agegrp == DF2$age_grp
             & arg_gender == DF2$gender
             & arg_date == DF2$date)
       },
       id, age_grp, gender, date)
)
UPDATE
The id column is not unique, hence there could be two observations where id, date, age_grp and gender are the same and only the val column differs.
Here is how I would solve this problem using dplyr (df1/df2 correspond to DF1/DF2 from the question):
library(dplyr)
df2$state <- NULL                     # the state column is not needed for the count
Name <- names(df2)
df2 <- df2 %>% group_by_(.dots = Name) %>% dplyr::summarise(match_count = n())
Target <- merge(df1, df2, by.x = Name, by.y = Name, all.x = TRUE)
Target[is.na(Target)] <- 0
Target
id date age_grp gender val match_count
1 101 20140110 1 1 666 2
2 102 20150310 2 2 777 1
3 103 20160901 3 1 444 0
4 104 20160903 4 1 555 0
5 105 20010910 5 1 888 0
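One caveat not in the original answer: group_by_() is deprecated in current dplyr; under that assumption, the grouping step can instead be written as:
library(dplyr)
df2 <- df2 %>%
  group_by(across(everything())) %>%      # group by every remaining column of df2
  summarise(match_count = n(), .groups = "drop")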
data.table might be helpful here too. Aggregate DF2 by the variables specified, then join this back to DF1.
library(data.table)
setDT(DF1)
setDT(DF2)
vars <- c("id","date","age_grp","gender")
DF1[DF2[, .N, by=vars], count := N, on=vars]
DF1
# id date age_grp gender val count
#1: 101 20140110 1 1 666 2
#2: 102 20150310 2 2 777 1
#3: 103 20160901 3 1 444 NA
#4: 104 20160903 4 1 555 NA
#5: 105 20010910 5 1 888 NA
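The update join leaves NA where DF1 has no match in DF2; a small follow-up step (an assumption based on the 0s shown in the desired result) fills those in:
# rows of DF1 without a match keep NA from the join; set them to 0
DF1[is.na(count), count := 0]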

transform a dataframe from long to wide in r, but needs date transformation

I have a dataframe like this (each "NUMBER" indicates a student):
NUMBER Gender Grade Date.Tested WI WR WZ
1 F 4 2014-02-18 6 9 10
1 F 3 2014-05-30 9 8 2
2 M 5 2013-05-02 7 9 15
2 M 4 2009-05-21 5 7 2
2 M 5 2010-04-29 9 1 4
I know I can use:
cook <- reshape(data, timevar = "?", idvar = c("NUMBER", "Gender"), direction = "wide")
to change it into a wide format. However, instead of spreading by Date.Tested I want to spread by test occasion (1st time, 2nd time, etc.) and keep the grade for each occasion.
What I want at the end is like this:
NUMBER Gender Grade1 Grade2 Grade3 WI1 WR1 WZ1 WI2 WR2 WZ2 WI3 WR3 WZ3
1 F 3 4 NA 9 8 2 6 9 10 NA NA NA
and for the rest "NUMBER"s.
I have searched a lot but did not find an answer. Can someone help me with it?
Thank you very much!
Try
# number the test occasions within each student (NUMBER)
data$id <- with(data, ave(seq_along(NUMBER), NUMBER, FUN = seq_along))
reshape(data, idvar = c('NUMBER', 'Gender'), timevar = 'id', direction = 'wide')
If you want the Date.Tested variable to be included in the 'idvar' and you need only the first value per group ('NUMBER' or 'Gender'):
data$Date.Tested <- with(data, ave(Date.Tested, NUMBER,
                                   FUN = function(x) head(x, 1)))
reshape(data, idvar = c('NUMBER', 'Gender', 'Date.Tested'),
        timevar = 'id', direction = 'wide')
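Not part of the original answer: with current dplyr and tidyr available (an assumption), the same reshaping can be sketched with pivot_wider:
library(dplyr)
library(tidyr)
data %>%
  group_by(NUMBER) %>%
  mutate(id = row_number()) %>%                 # occasion number within each student
  ungroup() %>%
  pivot_wider(id_cols = c(NUMBER, Gender),
              names_from = id,
              values_from = c(Grade, WI, WR, WZ))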

Indexing customer transactions in R [duplicate]

This question already has answers here:
Create counter with multiple variables [duplicate]
(6 answers)
Closed 9 years ago.
I'd like to index customer transactions in an R dataframe so that I can easily identify, say, the third transaction that a particular customer has made. For example, if I have the following data frame (ordered by customer and transaction date):
transactions = data.frame(CUST.ID = c(1, 1, 2, 2, 2, 2, 3, 3, 3),
                          DATE = as.Date(c("2009-07-02", "2013-08-15", "2010-01-02",
                                           "2004-03-05", "2006-02-03", "2007-01-01",
                                           "2004-03-05", "2006-02-03", "2007-01-01")),
                          AMOUNT = c(5, 9, 21, 34, 76, 1, 100, 23, 10))
> transactions
CUST.ID DATE AMOUNT
1 1 2009-07-02 5
2 1 2013-08-15 9
3 2 2010-01-02 21
4 2 2004-03-05 34
5 2 2006-02-03 76
6 2 2007-01-01 1
7 3 2004-03-05 100
8 3 2006-02-03 23
9 3 2007-01-01 10
I can clearly see that customer 1 has made 2 transactions, customer 2 has made 4, etc.
What I would like is to index these transactions by customer, creating a new column in my dataframe. The following code achieves what I want:
transactions$COUNTER = 1
transactions$CUSTOMER.TRANS.NO = unlist(aggregate(COUNTER ~ CUST.ID,
data = transactions,
function(x) {rank(x, ties.method = "first")})[, 2])
transactions$COUNTER = NULL
> transactions
CUST.ID DATE AMOUNT CUSTOMER.TRANS.NO
1 1 2009-07-02 5 1
2 1 2013-08-15 9 2
3 2 2010-01-02 21 1
4 2 2004-03-05 34 2
5 2 2006-02-03 76 3
6 2 2007-01-01 1 4
7 3 2004-03-05 100 1
8 3 2006-02-03 23 2
9 3 2007-01-01 10 3
Now the first transaction for each customer is labelled 1, the second 2, etc.
So I've got what I want but it's such a horrible piece of code, creating a list and separating, it's just so ugly. Is anyone with more experience than me able to come up with a better solution?
Because you've taken the effort to post the sample code you tried (making your question a better Stack Overflow question than the duplicate I've linked to), I'll summarize the options here:
ave
within(transactions, { Trans.No <- ave(CUST.ID, CUST.ID, FUN = seq_along) })
getanID
library(splitstackshape)
getanID(transactions, "CUST.ID")
rle
## Depends on your data being sorted
transactions$Trans.No <- sequence(rle(transactions$CUST.ID)$lengths)
data.table
library(data.table)
DT <- data.table(transactions)
DT[, .id := sequence(.N), by = "CUST.ID"]
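dplyr (not listed in the original answer; assumes dplyr is installed)
library(dplyr)
transactions %>%
  group_by(CUST.ID) %>%
  mutate(CUSTOMER.TRANS.NO = row_number()) %>%  # 1, 2, ... within each customer
  ungroup()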
library(plyr)
ddply(transactions, .(CUST.ID), transform,
      CUSTOMER.TRANS.NO = seq(1, length(CUST.ID), 1))
CUST.ID DATE AMOUNT CUSTOMER.TRANS.NO
1 1 2009-07-02 5 1
2 1 2013-08-15 9 2
3 2 2010-01-02 21 1
4 2 2004-03-05 34 2
5 2 2006-02-03 76 3
6 2 2007-01-01 1 4
7 3 2004-03-05 100 1
8 3 2006-02-03 23 2
9 3 2007-01-01 10 3

getting a sample of a data.frame in R

I have the following data frame in R:
id<-c(1,2,3,4,10,2,4,5,6,8,2,1,5,7,7)
date<-c(19970807,19970902,19971010,19970715,19991212,19961212,19980909,19990910,19980707,19991111,19970203,19990302,19970605,19990808,19990706)
spent<-c(1997,19,199,134,654,37,876,890,873,234,643,567,23,25,576)
df<-data.frame(id,date,spent)
I need to take a random sample of 3 customers (based on id) in such a way that all observations for the sampled customers are extracted.
You want to use %in% and unique
df[df$id %in% sample(unique(df$id),3),]
## id date spent
## 4 4 19970715 134
## 7 4 19980909 876
## 8 5 19990910 890
## 10 8 19991111 234
## 13 5 19970605 23
Using data.table to avoid $ referencing
library(data.table)
DT <- data.table(df)
DT[id %in% sample(unique(id),3)]
## id date spent
## 1: 1 19970807 1997
## 2: 4 19970715 134
## 3: 4 19980909 876
## 4: 1 19990302 567
## 5: 7 19990808 25
## 6: 7 19990706 576
This ensures that you are always evaluating the expressions within the data.table.
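A practical note not in the original answer: sample() draws randomly, so fixing a seed first makes the selection reproducible:
set.seed(42)                                   # any fixed seed reproduces the same draw
df[df$id %in% sample(unique(df$id), 3), ]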
Use something like:
df[sample(df$id, 3), ]
# id date spent
# 1 1 19970807 1997
# 5 10 19991212 654
# 8 5 19990910 890
Of course, your samples would be different.
Update
If you want unique customers, you can aggregate first.
df2 = aggregate(list(date = df$date, spent = df$spent), list(id = df$id), c)
df2[sample(df2$id, 3), ]
# id date spent
# 4 4 19970715, 19980909 134, 876
# 5 5 19990910, 19970605 890, 23
# 8 8 19991111 234
Or, an option without aggregate:
df[df$id %in% sample(unique(df$id), 3), ]
# id date spent
# 1 1 19970807 1997
# 3 3 19971010 199
# 12 1 19990302 567
# 14 7 19990808 25
# 15 7 19990706 576
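A dplyr variant not in the original answers (assuming dplyr is installed) that keeps every row for the sampled customers:
library(dplyr)
sampled_ids <- sample(unique(df$id), 3)        # draw three distinct customer ids
df %>% filter(id %in% sampled_ids)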
