I'm trying to determine the number of unique customers per week per store.
I have a piece of code that accomplishes this task but the tabulation is not what I am looking for.
I have the following table:
store week customer_ID
1 1 1
1 1 1
1 1 2
1 2 1
1 2 2
1 2 3
2 1 1
2 1 1
2 1 2
2 2 2
2 2 3
2 2 3
So for every week I need to count how many unique customers there were.
For example, if customer 1 visited store 1 in week 1 and then revisited in week 2, the second visit would not count as a unique visit.
If that same customer visited store 2 in week 1 or any other week, that would count as a unique visit for store 2.
The outcome would look like the following:
store week unique Customers
1 1 2
1 2 1
2 1 2
2 2 1
I used the following, but it's not correct:
agg <- aggregate(data=df, customer_ID~ week+store, function(x) length(unique(x)))
structure(list(store = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), week = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), customer_ID = c(1L, 1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L,
2L, 3L, 3L)), .Names = c("store", "week", "customer_ID"), class = "data.frame", row.names = c(NA,
-12L))
Here is a base R method. The idea is to split the data into a list of data.frames, one for each store. Assuming the observations are ordered by week, dropping duplicated customer IDs within each store keeps only each customer's first visit. The subset data.frame is then aggregated with length, which counts the remaining rows per week. Finally, do.call and rbind put the results into a single data.frame:
do.call(rbind, lapply(split(df, df$store),
function(i) aggregate(data=i[!duplicated(i$customer_ID),],
customer_ID ~ week+store, length)))
week store customer_ID
1.1 1 1 2
1.2 2 1 1
2.1 1 2 2
2.2 2 2 1
To make sure that your data.frame is ordered properly before attempting this, you could use order:
df <- df[order(df$store, df$week), ]
In case it is of interest, I put together a data.table solution as well.
library(data.table)
setDT(df)  # convert df to a data.table by reference
df[df[, !duplicated(customer_ID), by=store]$V1,
   .(newCust=length(customer_ID)), by=.(store, week)]
store week newCust
1: 1 1 2
2: 1 2 1
3: 2 1 2
4: 2 2 1
This method uses the logical vector df[, !duplicated(customer_ID), by=store]$V1 to subset the data to each customer's first appearance within a store, and then counts the new customers for each store-week.
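The same first-visit idea can also be sketched in a single pass in base R by checking duplicates on the (store, customer_ID) pair. This is only a sketch under the same assumption as above (rows ordered by week within store); the names first_visit, new_cust and new_customers are made up for illustration.

```r
# Example data retyped from the question.
df <- data.frame(
  store       = rep(1:2, each = 6),
  week        = rep(rep(1:2, each = 3), times = 2),
  customer_ID = c(1, 1, 2, 1, 2, 3, 1, 1, 2, 2, 3, 3)
)

# Order by store and week so the first row per customer is the earliest visit.
df <- df[order(df$store, df$week), ]

# TRUE for the first time a customer appears in a given store.
first_visit <- !duplicated(df[c("store", "customer_ID")])

# Count first visits per store-week.
new_cust <- aggregate(customer_ID ~ store + week, data = df[first_visit, ],
                      FUN = length)
names(new_cust)[3] <- "new_customers"
new_cust <- new_cust[order(new_cust$store, new_cust$week), ]
new_cust
```

Note that a store-week with no new customers would simply be absent from the result rather than appear with a count of 0.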
I have data structured as below:
ID Day Desired Output
1 1 1
1 1 1
1 1 1
1 2 2
1 2 2
1 3 3
2 4 1
2 4 1
2 5 2
3 6 1
3 6 1
Is it possible to create a sequence for the desired output without using a loop? The dataset is quite large, so a loop would be too slow. Is it possible to do this with the dplyr package, or maybe with a combination of cumsum/diff?
An option is to group by 'ID' and then match 'Day' against the unique values of the 'Day' column within each group:
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(desired = match(Day, unique(Day)))
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L)), row.names = c(NA,
-11L), class = "data.frame")
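For readers without dplyr, the same match-against-unique idea can be sketched in base R with ave, which applies a function within each ID and returns the results in the original row order. The column name desired is just the name used in the answer above.

```r
# Example data retyped from the question.
df1 <- data.frame(
  ID  = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L)
)

# Within each ID, unique(d) lists the days in order of first appearance,
# and match() returns each day's position in that list.
df1$desired <- ave(df1$Day, df1$ID, FUN = function(d) match(d, unique(d)))
df1
```

Because unique keeps values in order of first appearance, the numbering follows the row order within each ID rather than the sorted order of Day.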
Suppose I have:
family person loop. mode
1 1 1 car
1 1 1 walk
1 1 1 car
1 1 2 walk
1 1 2 bus
1 2 1 bus
1 2 1 walk
1 2 2 bus
2 1 1 car
2 1 1 car
2 1 2 car
2 2 1 bus
I want this:
Each person in each family has some loops. For each loop of each person in each family, if the mode in the first row is car, I want to keep only that first row and remove all other rows in that loop (whether they are car, bus, or walk). If the first row of the loop is not car, I don't remove anything.
family person loop. mode
1 1 1 car
1 1 2 walk
1 1 2 bus
1 2 1 bus
1 2 1 walk
1 2 2 bus
2 1 1 car
2 1 2 car
2 2 1 bus
In the first family, the first person has car mode in the first row of his first loop, so I removed all other trips in that loop and kept just the first one. His second loop doesn't start with car mode, so I kept all of it. The second person's loops don't start with car mode either, so I kept all of them. In the second family, the first person has car mode in the first row of his first loop, so I kept the first row and removed the rest of the loop. His second loop has only one row, so I kept it, and the second person doesn't have car mode, so I kept that row too.
An option is to group by 'family', 'person', and 'loop.', and slice only the first row if the first element of 'mode' is 'car', or else return all rows of the group:
library(dplyr)
df1 %>%
  group_by(family, person, loop.) %>%
  slice(if(first(mode) == 'car') 1 else row_number())
# A tibble: 9 x 4
# Groups: family, person, loop. [7]
# family person loop. mode
# <int> <int> <int> <chr>
#1 1 1 1 car
#2 1 1 2 walk
#3 1 1 2 bus
#4 1 2 1 bus
#5 1 2 1 walk
#6 1 2 2 bus
#7 2 1 1 car
#8 2 1 2 car
#9 2 2 1 bus
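The same rule can be sketched in base R by building a logical keep vector instead of slicing per group. This is only an illustration; the helper names grp, first_mode, pos and keep are made up here and are not part of the answer above.

```r
# Example data retyped from the question.
df1 <- data.frame(
  family = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  person = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L),
  loop.  = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L),
  mode   = c("car", "walk", "car", "walk", "bus", "bus", "walk",
             "bus", "car", "car", "car", "bus")
)

# One grouping factor per family/person/loop combination.
grp <- interaction(df1$family, df1$person, df1$loop., drop = TRUE)

# Mode of the first row of each group, repeated on every row of the group.
first_mode <- ave(df1$mode, grp, FUN = function(m) rep(m[1], length(m)))

# Position of each row within its group.
pos <- ave(seq_along(grp), grp, FUN = seq_along)

# Keep everything unless the group starts with "car"; then keep row 1 only.
keep <- first_mode != "car" | pos == 1
df1[keep, ]
```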
df1 <- structure(list(family = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), person = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L,
1L, 2L), loop. = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L,
1L), mode = c("car", "walk", "car", "walk", "bus", "bus", "walk",
"bus", "car", "car", "car", "bus")), class = "data.frame", row.names = c(NA,
-12L))
I have two data frames: one with ID, DATE, and the name of the drug (DRUG), and another with ID and the date of an event (date.event).
1. Expected column prev_drug:
How can I count the number of different drugs prior to the current date? For example, for ID=1, prev_drug for row 4 is 2, because there are two drugs (A, B) different from drug C prior to the DATE of row 4.
2. Expected column event.30d.prior:
For each ID and each DATE in the first data frame, how many events happened during the 30 days prior to the DATE? E.g., for row 2, the event for ID=1 on 1/20/2001 falls within the 30 days prior to 2/1/2001.
ID DATE DRUG prev_drug event.30d.prior
1 1/1/2001 A 0 0
1 2/1/2001 A 0 1
1 3/15/2001 B 1 0
1 4/20/2001 C 2 1
1 5/29/2001 A 2 0
1 5/2/2001 B 2 0
2 3/2/2001 A 0 1
2 3/23/2001 C 1 1
2 4/4/2001 D 2 0
2 5/5/2001 B 3 0
ID date.event
1 1/20/2001
1 4/11/2001
2 3/1/2001
Here is a solution in base R with a few dplyr functions. It is not the cleanest solution, but it should solve your problem.
df<-structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
DATE = structure(c(11323, 11354, 11396, 11432, 11471, 11444,
11383, 11404, 11416, 11447), class = "Date"), DRUG = structure(c(1L,
1L, 2L, 3L, 1L, 2L, 1L, 3L, 4L, 2L), .Label = c("A", "B",
"C", "D"), class = "factor")), row.names = c(NA, -10L), class = "data.frame")
#Note DATE was converted to a Date object with the following line
#df$DATE<-as.Date(df$DATE, "%m/%d/%Y")
date.event<-read.table(header=TRUE, text="ID date.event
1 1/20/2001
1 4/11/2001
2 3/1/2001")
date.event$date.event<-as.Date(date.event$date.event, "%m/%d/%Y")
library(dplyr)
#calculate prev_drug by counting the number of unique drugs seen so far within each ID
df<-df %>% group_by(ID) %>% mutate(prev_drug= (cumsum(!duplicated(DRUG)))-1)
#loop through each row, filtering the events by ID
event.30d.prior<-sapply(1:nrow(df), function(i){
  events<-date.event[date.event$ID==df$ID[i], "date.event"]
  sum(between(events, df$DATE[i]-30, df$DATE[i]))
})
finalanswer<-cbind(df, event.30d.prior=unlist(event.30d.prior))
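As a cross-check, here is a base R sketch of the same two steps that does not depend on dplyr. It assumes the window is inclusive at both ends (DATE - 30 through DATE), as in the code above, and that prev_drug follows row order within each ID; the data frame name events is made up for this example.

```r
# Example data retyped from the question; dates parsed as month/day/year.
df <- data.frame(
  ID   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  DATE = as.Date(c("1/1/2001", "2/1/2001", "3/15/2001", "4/20/2001",
                   "5/29/2001", "5/2/2001", "3/2/2001", "3/23/2001",
                   "4/4/2001", "5/5/2001"), format = "%m/%d/%Y"),
  DRUG = c("A", "A", "B", "C", "A", "B", "A", "C", "D", "B")
)
events <- data.frame(
  ID         = c(1, 1, 2),
  date.event = as.Date(c("1/20/2001", "4/11/2001", "3/1/2001"),
                       format = "%m/%d/%Y")
)

# Running count of distinct drugs within each ID (in row order), minus one.
df$prev_drug <- ave(seq_len(nrow(df)), df$ID,
                    FUN = function(i) cumsum(!duplicated(df$DRUG[i])) - 1)

# For each row, count that ID's events in the window [DATE - 30, DATE].
df$event.30d.prior <- vapply(seq_len(nrow(df)), function(i) {
  ev <- events$date.event[events$ID == df$ID[i]]
  sum(ev >= df$DATE[i] - 30 & ev <= df$DATE[i])
}, integer(1))
df
```

With an inclusive 30-day window, the row dated 5/2/2001 gets a count of 1, because the 4/11/2001 event is 21 days earlier; the expected table in the question lists 0 for that row, so the intended window may differ slightly from this reading.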
I have 100 simulated data sets, for example a single set is shown below
pid time status
1 2 1
1 6 0
1 4 1
2 3 0
2 1 1
2 7 1
3 8 1
3 11 1
3 2 0
pid denotes the patient ID. Each patient has three records in the time and status columns.
I want to write R code that does the following for each patient: if the patient's first observation has status 0, keep only that first row and delete the patient's remaining rows (including those with status 1); otherwise, delete every row with status 0. The output should look like:
pid time status
1 2 1
1 4 1
2 3 0
3 8 1
3 11 1
As there are 100 simulated data sets, the positions of the 0s and 1s in the status column are not the same for all of them. Could anyone provide R code that can perform this task?
Thank you in advance.
The dplyr package can help. I added a pid to your example data that has multiple 0 values.
Group by pid and use first() to get the first status value in each group; because of the grouping, that value is available on every row of the pid. Then keep a row if either the first status is 0 and the row is the first one (row_number() == 1, which guards against later rows with status 0, see pid 4), or the first status is 1 and the row's own status is 1.
df %>%
group_by(pid) %>%
filter((first(status) == 0 & row_number() == 1) | (first(status) == 1 & status == 1))
# A tibble: 6 x 3
# Groups: pid [4]
pid time status
<int> <int> <int>
1 1 2 1
2 1 4 1
3 2 3 0
4 3 8 1
5 3 11 1
6 4 3 0
df <-
  structure(
    list(
      pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
      time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L, 3L, 6L, 8L),
      status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
    ),
    .Names = c("pid", "time", "status"),
    class = "data.frame",
    row.names = c(NA,-12L)
  )
This question is more appropriate on https://stackoverflow.com.
Here is an attempt using tapply() (it's a little verbose):
dat <- structure(list(pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L)),
.Names = c("pid", "time", "status"), class = "data.frame",
row.names = c(NA, -9L))
ind <- unlist(tapply(dat$status, dat$pid, function(x) {
  # browser()
  y <- (rep(FALSE, length(x)))
  if (x[1] == 1) {
    y[x != 0] <- TRUE
  } else {
    y[1] <- TRUE
  }
  y
}))
dat[ind, ]
#> pid time status
#> 1 1 2 1
#> 3 1 4 1
#> 4 2 3 0
#> 7 3 8 1
#> 8 3 11 1
ind is a vector of TRUEs and FALSEs, which will indicate whether a row of dat should be kept or not according to your rules.
I use tapply(X, INDEX, FUN) to apply a function to subsets of a vector (here X = dat$status), which are defined by a grouping factor (here INDEX = dat$pid).
Here, I used an anonymous function (i.e., FUN = function(x){}) to do something with each subset of X.
In particular, I first define y, which I will return later, to be a vector of FALSEs.
If the first status is 1 for a subgroup, I turn all elements that are non-zero (i.e., y[x != 0]) into TRUE.
Otherwise, I turn only the first element (i.e., y[1]) into TRUE.
You may uncomment the browser() statement and see at the console what the function does by typing n (for next) or x or y (to see what they are).
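One caveat with unlist(tapply(...)): the results come back grouped by pid level, so ind lines up with the rows of dat only when dat is already sorted by pid, as it is here. A sketch of a variant using ave, which returns values in the original row order, is shown below; the names first_status, pos and keep are made up for illustration.

```r
# Example data retyped from the question.
dat <- data.frame(
  pid    = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  time   = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L),
  status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L)
)

# Status of each patient's first record, repeated on all of that patient's rows.
first_status <- ave(dat$status, dat$pid,
                    FUN = function(s) rep(s[1], length(s)))

# Position of each row within its patient.
pos <- ave(seq_along(dat$pid), dat$pid, FUN = seq_along)

# First status 1: keep the status-1 rows. First status 0: keep row 1 only.
keep <- (first_status == 1 & dat$status == 1) | (first_status == 0 & pos == 1)
dat[keep, ]
```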
I'm new to R and I don't know how to get R to calculate the means of subgroups of means, which are themselves the means of a subgroup. I'll try to explain more clearly.
I have a data frame like this:
GROUP WORD WLN
1 1 4
1 1 3
1 1 3
1 2 2
1 2 2
1 2 3
2 3 1
2 3 1
2 3 2
2 4 1
2 4 1
2 4 1
... ... ...
But the real one has a total of 5 groups and 25 words (5 words in each group; every word has been assigned a number from 1 to 4 by 5 subjects).
I need the mean of WLN for every word. I can do that easily with a loop and save the results in a vector, but then I need a vector with the means of these means according to the group the words belong to. So I need the mean of the word means for group 1, then for group 2, and so on.
How can I get this without doing it one group at a time?
With base R, using aggregate:
> aggregate(WLN~GROUP+WORD, mean, data=df)
  GROUP WORD      WLN
1 1 1 3.333333
2 1 2 2.333333
3 2 3 1.333333
4 2 4 1.000000
where df is @Metrics' data.
Another alternative is summaryBy from the doBy package:
> library(doBy)
> summaryBy(WLN~GROUP+WORD, FUN=mean, data=df)
  GROUP WORD WLN.mean
1 1 1 3.333333
2 1 2 2.333333
3 2 3 1.333333
4 2 4 1.000000
Assume df is your dataframe:
df<-structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), WORD = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), WLN = c(4L, 3L, 3L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 1L,
1L)), .Names = c("GROUP", "WORD", "WLN"), class = "data.frame", row.names = c(NA,
-12L))
plyr solution:
library(plyr)
ddply(df,.(GROUP,WORD),summarize, meanwln=mean(WLN))
GROUP WORD meanwln
1 1 1 3.333333
2 1 2 2.333333
3 2 3 1.333333
4 2 4 1.000000
Data.table solution:
GROUP WORD meanwln
1: 1 1 3.333333
2: 1 2 2.333333
3: 2 3 1.333333
4: 2 4 1.000000
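The call that produced the data.table output above is not shown. A sketch that would give the same columns, assuming the data.table package is installed and df is the data frame from the question (the name dt is made up here), might look like this:

```r
library(data.table)

# Example data retyped from the question.
df <- data.frame(
  GROUP = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  WORD  = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  WLN   = c(4L, 3L, 3L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 1L, 1L)
)

# Copy into a data.table, then take the mean of WLN per GROUP and WORD.
dt <- as.data.table(df)
res <- dt[, .(meanwln = mean(WLN)), by = .(GROUP, WORD)]
res
```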
with base:
If you also want row- and colmeans for the table above, you could do something like this:
x <- with(df,tapply(WLN,list(GROUP,WORD),mean))
addmargins(x, margin = seq_along(dim(x)), FUN = mean, quiet = TRUE)
And now dplyr is even better:
library(dplyr)
tmp <- group_by(df, WORD)
df1 <- summarise(tmp,
                 count = n(),
                 mWLN = mean(WLN, na.rm = TRUE))
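The question also asks for the mean of the word means within each group. One way to sketch that in base R is to call aggregate twice: first per word, then per group on the word-level result. The names word_means and group_means are made up for this example.

```r
# Example data retyped from the question.
df <- data.frame(
  GROUP = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  WORD  = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  WLN   = c(4L, 3L, 3L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 1L, 1L)
)

# Step 1: mean WLN for each word (keeping GROUP so it can be reused).
word_means <- aggregate(WLN ~ GROUP + WORD, data = df, FUN = mean)

# Step 2: mean of the word means within each group.
group_means <- aggregate(WLN ~ GROUP, data = word_means, FUN = mean)
group_means
```

When every word has the same number of ratings, as in this example, the mean of the word means equals the plain group mean; the two differ only if words have unequal numbers of observations.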