I have a dataset that consists of customers and accounts where a customer can have multiple accounts. The dataset has several 'flags' on each account.
I'm trying to get a count of 'unique' hits on these flags per customer, i.e. if 3 accounts have flag1 I want this to count as 1 hit, but if just one of the accounts has flag2 as well I want this to count as 2. Essentially, I want to see how many flags each customer hits across all of their accounts.
Example Input data frame:
cust acct flag1 flag2 flag3
a 123 0 1 0
a 456 1 1 0
b 789 1 1 1
c 428 0 1 0
c 247 0 1 0
c 483 0 1 1
Example Output dataframe:
cust acct flag1 flag2 flag3 UniqueSum
a 123 0 1 0 2
a 456 1 1 0 2
b 789 1 1 1 3
c 428 0 1 0 2
c 247 0 1 0 2
c 483 0 1 1 2
I've tried to use the following:
fSumData <- ddply(fData, "cust", numcolwise(sum))
but this sums the acct column too giving me one row per customer where I'd like to have the same amount of rows as the customer has accounts.
Using data.table:
require(data.table) # v1.9.6
dt[, un := sum(sapply(.SD, max)), by = cust, .SDcols = flag1:flag3]
We group by cust, and for each group's subset of the columns flag1, flag2 and flag3 (selected with .SD and .SDcols), we take each column's max; summing those maxima gives the number of flags hit at least once.
We update the original table with these values by reference using the LHS := RHS notation (see Reference Semantics vignette).
where dt is:
dt = fread('cust acct flag1 flag2 flag3
a 123 0 1 0
a 456 1 1 0
b 789 1 1 1
c 428 0 1 0
c 247 0 1 0
c 483 0 1 1')
One way that comes to mind is to take column sums for each cust and check which are greater than 0. For example,
> tab
cust acct flag1 flag2 flag3
1 a 123 0 1 0
2 a 456 1 1 0
3 b 789 1 1 1
4 c 428 0 1 0
5 c 247 0 1 0
6 c 483 0 1 1
> uniqueSums <- sapply(tab$cust, function(cust) length(which(colSums(tab[tab$cust == cust,3:5]) > 0)))
> cbind(tab, uniqueSums = uniqueSums)
cust acct flag1 flag2 flag3 uniqueSums
1 a 123 0 1 0 2
2 a 456 1 1 0 2
3 b 789 1 1 1 3
4 c 428 0 1 0 2
5 c 247 0 1 0 2
6 c 483 0 1 1 2
For each value of cust, the function in sapply finds the rows, does a vectorized sum and checks for values that are greater than 0.
Here's an approach using library(dplyr):
df %>%
group_by(cust) %>%
summarise_each(funs(max), -acct) %>%
mutate(UniqueSum = rowSums(.[-1])) %>%
select(-starts_with("flag")) %>%
right_join(df, "cust")
#Source: local data frame [6 x 6]
#
# cust UniqueSum acct flag1 flag2 flag3
# (fctr) (dbl) (int) (int) (int) (int)
#1 a 2 123 0 1 0
#2 a 2 456 1 1 0
#3 b 3 789 1 1 1
#4 c 2 428 0 1 0
#5 c 2 247 0 1 0
#6 c 2 483 0 1 1
I was able to answer my own question after reading Roman's post. I did something like this, where fData is my dataset.
fSumData <- ddply(fData, "cust", numcolwise(sum))
fSumData$UniqueHits <- ifelse(fSumData$flag1 >= 1, 1, 0) + ifelse(fSumData$flag2 >= 1, 1, 0) + ifelse(fSumData$flag3 >= 1, 1, 0)
I found this to be a bit faster than Roman's solution when running against my dataset, but am unsure whether it's the optimal solution. Thank you all for your input; this helped a ton!
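If you also need one row per account again (as in the example output at the top of the question), one option is to merge the per-customer totals back onto the original data. A minimal sketch, assuming fData and fSumData are as above:

fData <- merge(fData, fSumData[, c("cust", "UniqueHits")],
               by = "cust", all.x = TRUE)  # attach UniqueHits to every account row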
The underused rowsum could also be of use:
rowSums(rowsum(DF[-(1:2)], DF$cust) > 0)[DF$cust]
#a a b c c c
#2 2 3 2 2 2
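If you want this as a column on the original data frame (a small sketch, assuming DF is the data shown in the question), the same expression can be assigned directly:

DF$UniqueSum <- rowSums(rowsum(DF[-(1:2)], DF$cust) > 0)[DF$cust]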
I am trying to create a function, applied to a variable in a dataframe, that looks at a window of 3 days forward from the current observation and determines whether the price decreases and then returns to the original price. The dataframe looks like this:
VarA VarB Date Price Diff VarD
1 1 2007-04-09 50 NA 0
1 1 2007-04-10 50 0 0
1 1 2007-04-11 48 -2 1
1 1 2007-04-12 48 0 1
1 1 2007-04-13 50 2 0
1 1 2007-04-14 50 0 0
1 1 2007-04-15 45 -5 1
1 1 2007-04-16 50 5 0
1 1 2007-04-17 45 -5 0
1 1 2007-04-18 48 3 0
1 1 2007-04-19 48 0 0
1 1 2007-04-20 50 2 0
Where VarA and VarB are grouping variables (in this example they do not change), Price is the variable for which I want to detect a decrease followed by an increase back to the starting level, and Diff is the lagged price difference (if that is of any help).
VarD shows the result of applying the function I am trying to work out. There are two conditions for VarD to take the value 1: 1) the price decreases from a level and then, within a window of the two following days, returns to the original level (i.e., 50 to 48 and back to 50 in rows 2 to 5, or 50 to 45 and back to 50 in rows 6 to 8); 2) there is a maximum of two days for the price to increase back to the starting level. Otherwise, VarD should take the value 0.
I do not have any clue how to start.
The dataframe db is:
db <- read.table(header = TRUE, sep = ",", text = "VarA,VarB,Date,Price,Diff
1,1,2007-04-09,50,NA
1,1,2007-04-10,50,0
1,1,2007-04-11,48,-2
1,1,2007-04-12,48,0
1,1,2007-04-13,50,2
1,1,2007-04-14,50,0
1,1,2007-04-15,45,-5
1,1,2007-04-16,50,5
1,1,2007-04-17,45,-5
1,1,2007-04-18,48,3
1,1,2007-04-19,48,0
1,1,2007-04-20,50,2")
Thanks in advance.
Hope I understood your requirements correctly:
library(dplyr)
db %>%
#create Diff.2 as helper variable: increase in price from current day to 2 days later
mutate(Diff.2 = diff(c(Price,NA,NA), lag = 2)) %>%
mutate(Var.D = ifelse(
Diff.2 + lag(Diff.2, 2) == 0 & #condition 1: price increase from current day to 2 days later
#is cancelled out by price decrease from 2 days ago to current day
Diff.2 > 0, #condition 2: price increases from current day to 2 days later
1, 0)) %>%
mutate(Var.D = ifelse(is.na(Var.D), 0, Var.D)) %>%
select(-Diff.2)
VarA VarB Date Price Diff Var.D
1 1 1 2007-04-09 50 NA 0
2 1 1 2007-04-10 50 0 0
3 1 1 2007-04-11 48 -2 1
4 1 1 2007-04-12 48 0 1
5 1 1 2007-04-13 50 2 0
6 1 1 2007-04-14 50 0 0
7 1 1 2007-04-15 48 -2 0
8 1 1 2007-04-16 49 1 0
9 1 1 2007-04-17 45 -4 0
10 1 1 2007-04-18 45 0 0
11 1 1 2007-04-19 45 0 0
12 1 1 2007-04-20 50 0 0
I think I found the solution, if it is of interest. I used inputs from @G. Grothendieck, so he deserves most of the credit (but not the blame for errors). The solution is in four steps:
Step 1: create a dummy variable equal to 1 when the price decreases and for every subsequent day it stays at that lower level.
db$Tmp1 <- 0
for (n in 1:length(db$Date)) {
  db$Tmp1[n] <- ifelse(db$Diff[n] < 0, 1,
                       ifelse(db$Tmp1[max(n - 1, 1)] == 1 && db$Diff[n] == 0, 1, 0))
}
The first part of the ifelse says that if the price at date [n] decreases, or if the previous value of Tmp1 is equal to 1 and the price does not change, the value 1 is assigned; otherwise 0.
Step 2: restrict the number of days the price can stay lower in Step 1 to two days (thanks to @G. Grothendieck).
library(zoo)  # rollapply() comes from the zoo package

loop <- function(x) if (all(x[-1] == 1)) 0 else x[1]
db$Tmp2 <- ifelse(db$Diff < 0,
                  rollapply(db$Tmp1, 3, loop, partial = TRUE, align = "left"),
                  ifelse(db$Diff == 0 & lag(db$Tmp2) == 1, 1, 0))
loop is a function that returns 0 if all values except the current date's are equal to 1, and otherwise takes the value of Tmp1. Then, if the price decreases (db$Diff < 0), loop is applied to 3 values of Tmp1 looking forward; if instead the price does not change and the previous value of Tmp2 is 1, the value 1 is assigned; otherwise 0.
Step 3: determine whether the price just before the decrease reappears within a three-day window of the original price.
loop2 <- function(x) if (any(x[-1] == x[1])) 1 else 0
db$Tmp3 <- rollapply(db$Price, 4, loop2, partial = TRUE, align = "left")
The function loop2 checks whether any price is repeated within 3 days of the current date (hence the 4 in the rollapply call). Tmp3 then applies loop2 to the Price vector (following this thread: Ifelse statement with dataframe subset using date).
Step 4: Multiply Tmp2 and Tmp3 to obtain the result (and delete the auxiliary variables).
db$Sale <- db$Tmp2 * db$Tmp3
db$Tmp1 <- db$Tmp2 <- db$Tmp3 <- NULL
Now, Sale is just the product of Tmp2 and Tmp3: the first restricts the flag to a 3-day window, and the second shows whether the original price from the start of the decrease reappears within a 3-day window.
Hope it's useful to anyone. If anyone has corrections or suggestions, they are very welcome. Lastly, each piece of code should be applied per VarA and VarB group, so each step should go inside the following:
db <-
db %>% group_by(VarA, VarB) %>%
mutate(
code
)
The output is:
VarA VarB Date Price Diff Tmp1 Tmp2 Tmp3 Sale
<int> <int> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2007-04-09 50 0 0 NA 1 NA
2 1 1 2007-04-10 50 0 0 0 1 0
3 1 1 2007-04-11 48 -2 1 1 1 1
4 1 1 2007-04-12 48 0 1 1 1 1
5 1 1 2007-04-13 50 2 0 0 1 0
6 1 1 2007-04-14 50 0 0 0 0 0
7 1 1 2007-04-15 48 -2 1 1 0 0
8 1 1 2007-04-16 49 1 0 0 0 0
9 1 1 2007-04-17 45 -4 1 0 1 0
10 1 1 2007-04-18 45 0 1 0 1 0
11 1 1 2007-04-19 45 0 1 0 0 0
12 1 1 2007-04-20 50 5 0 0 0 0
Thanks a lot.
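For comparison, here is a more compact base R sketch built on rle() that reproduces the expected VarD column for the example data in the question: it flags every row of a "low" run when the run lasts at most two days and the price immediately after the run returns to the price observed just before the run started. flag_return and VarD_chk are invented names, and the check assumes the price returns to exactly the same level.

flag_return <- function(price, max_days = 2) {
  out <- integer(length(price))
  r <- rle(price)
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  for (k in seq_along(r$values)) {
    if (k == 1 || k == length(r$values)) next   # need a level before and after the run
    below_prev   <- r$values[k] < r$values[k - 1]
    returns      <- r$values[k + 1] == r$values[k - 1]
    short_enough <- r$lengths[k] <= max_days
    if (below_prev && returns && short_enough) out[starts[k]:ends[k]] <- 1L
  }
  out
}

# apply per VarA/VarB group, as suggested above
db$VarD_chk <- ave(db$Price, db$VarA, db$VarB, FUN = flag_return)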
This question already has answers here:
Is there a dplyr equivalent to data.table::rleid?
(6 answers)
Closed 5 years ago.
I'm trying to count # consecutive days inactive (consecDaysInactive), per ID.
I have already created an indicator variable inactive that is 1 on days where id is inactive and 0 when active. I also have an id variable, and a date variable. My analysis dataset will have hundreds of thousands of rows, so efficiency will be important.
The logic I'm trying to create is as follows:
per id, if user is active, consecDaysInactive = 0
per id, if user is inactive, and was active on previous day, consecDaysInactive = 1
per id, if user is inactive on previous day, consecDaysInactive = 1 + # previous consecutive inactive days
consecDaysInactive should reset to 0 for new values of id.
I've been able to create a cumulative sum, but I have been unable to get it to reset to 0 after one or more rows of inactive == 0.
I've illustrated below the result that I want (consecDaysInactive), as well as the result that I was able to achieve programmatically (bad_consecDaysInactive).
library(dplyr)
d <- data.frame(id = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2), date=as.Date(c('2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08','2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08')), inactive=c(0,0,0,1,1,1,0,1,0,1,1,1,1,0,0,1), consecDaysInactive=c(0,0,0,1,2,3,0,1,0,1,2,3,4,0,0,1))
d <- d %>%
group_by(id) %>%
arrange(id, date) %>%
do( data.frame(., bad_consecDaysInactive = cumsum(ifelse(.$inactive==1, 1,0))
)
)
d
where consecDaysInactive increments by 1 for each consecutive inactive day, resets to 0 on each day the user is active, and resets to 0 for new values of id. As the output below shows, I'm unable to get bad_consecDaysInactive to reset to 0 -- e.g. row 7:
id date inactive consecDaysInactive bad_consecDaysInactive
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2017-01-01 0 0 0
2 1 2017-01-02 0 0 0
3 1 2017-01-03 0 0 0
4 1 2017-01-04 1 1 1
5 1 2017-01-05 1 2 2
6 1 2017-01-06 1 3 3
7 1 2017-01-07 0 0 3
8 1 2017-01-08 1 1 4
9 2 2017-01-01 0 0 0
10 2 2017-01-02 1 1 1
11 2 2017-01-03 1 2 2
12 2 2017-01-04 1 3 3
13 2 2017-01-05 1 4 4
14 2 2017-01-06 0 0 4
15 2 2017-01-07 0 0 4
16 2 2017-01-08 1 1 5
I also considered (and tried) incrementing a variable within group_by() & do(), but since do() isn't iterative, I can't get my counter to get past 2:
d2 <- d %>%
group_by(id) %>%
do( data.frame(., bad_consecDaysInactive2 = ifelse(.$inactive == 0, 0, ifelse(.$inactive==1,.$inactive+lag(.$inactive), .$inactive))))
d2
which yielded, as described above:
id date inactive consecDaysInactive bad_consecDaysInactive bad_consecDaysInactive2
<dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 1 2017-01-01 0 0 0 0
2 1 2017-01-02 0 0 0 0
3 1 2017-01-03 0 0 0 0
4 1 2017-01-04 1 1 1 1
5 1 2017-01-05 1 2 2 2
6 1 2017-01-06 1 3 3 2
7 1 2017-01-07 0 0 3 0
8 1 2017-01-08 1 1 4 1
9 2 2017-01-01 0 0 0 0
10 2 2017-01-02 1 1 1 1
11 2 2017-01-03 1 2 2 2
12 2 2017-01-04 1 3 3 2
13 2 2017-01-05 1 4 4 2
14 2 2017-01-06 0 0 4 0
15 2 2017-01-07 0 0 4 0
16 2 2017-01-08 1 1 5 1
As you can see, my iterator bad_consecDaysInactive2 resets to 0, but doesn't increment past 2! If there's a data.table solution, I'd be happy to hear it as well.
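(For reference, the cumsum/run-id idea from the linked duplicate looks roughly like this in dplyr; streak and consecDaysInactive2 are invented names, and this is a sketch rather than tested production code.)

library(dplyr)

d %>%
  arrange(id, date) %>%
  group_by(id) %>%
  mutate(streak = cumsum(inactive == 0)) %>%          # bumps to a new value on every active day
  group_by(id, streak) %>%
  mutate(consecDaysInactive2 = cumsum(inactive)) %>%  # counts up within each inactive run
  ungroup() %>%
  select(-streak)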
Here's a cute way to do it with a for-loop:
a <- c(1,1,1,1,0,0,1,0,1,1,1,0,0)
b <- rep(NA, length(a))
b[1] <- a[1]
for(i in 2:length(a)){
b[i] <- a[i]*(a[i]+b[i-1])
}
a
b
It may not be the most efficient way to do this, but it will be pretty darn fast. 11.7 seconds for ten million rows on my computer.
a <- round(runif(10000000,0,1))
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for(i in 2:length(a)){
b[i] <- a[i]*(a[i]+b[i-1])
}
b
Sys.time()-t
Time difference of 11.73612 secs
But this doesn't account for the need to do things per id. That's easy to fix, at a minimal efficiency penalty. Your example dataframe is sorted by id; if your actual data are not already sorted, sort them first. Then:
a <- round(runif(10000000,0,1))
id <- round(runif(10000000,1,1000))
id <- id[order(id)]
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for(i in 2:length(a)){
b[i] <- a[i]*(a[i]+b[i-1])
if(id[i] != id[i-1]){
b[i] <- a[i]
}
}
b
Sys.time()-t
Time difference of 13.54373 secs
If we include the time that it took to sort id, then the time difference is closer to 19 seconds. Still not too bad!
How much of an efficiency savings can we get using Frank's answer in the comments on the OP?
d <- data.frame(inactive=a, id=id)
t2 <- Sys.time()
b <- setDT(d)[, v := if (inactive[1]) seq.int(.N) else 0L, by=rleid(inactive)]
Sys.time()-t2
Time difference of 2.233547 secs
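For the original per-id problem, the grouping would presumably also need the id column; a sketch of that one-liner adapted accordingly (consecDaysInactive2 is an invented name):

library(data.table)
# count up within each run of inactive days, per id; active runs stay at 0
setDT(d)[, consecDaysInactive2 := if (inactive[1]) seq.int(.N) else 0L,
         by = .(id, rleid(inactive))]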
I'm using R and am trying to create a new dataframe of averaged results from another dataframe based on the values in Column A. To demonstrate my goal here is some data:
set.seed(1981)
df <- data.frame(A = sample(c(0,1), replace=TRUE, size=100),
B=round(runif(100), digits=4),
C=sample(1:1000, 100, replace=TRUE))
head(df, 30)
A B C
0 0.6739 459
1 0.5466 178
0 0.154 193
0 0.41 206
1 0.7526 791
1 0.3104 679
1 0.739 434
1 0.421 171
0 0.3653 577
1 0.4035 739
0 0.8796 147
0 0.9138 37
0 0.7257 350
1 0.2125 779
0 0.1502 495
1 0.2972 504
0 0.2406 245
1 0.0325 613
0 0.8642 539
1 0.1096 630
1 0.2113 363
1 0.277 974
0 0.0485 755
1 0.0553 412
0 0.509 24
0 0.2934 795
0 0.0725 413
0 0.8723 606
0 0.3192 591
1 0.5557 177
I need to reduce the size of the data by calculating the average of columns B and C over consecutive rows where the value in column A stays the same, up to a maximum of 3 rows. If the value of A remains either 1 or 0 for more than 3 rows, the excess rolls over into the next row of the new dataframe, as you can see below.
The new dataframe requires the following columns:
Value of A B.Av C.Av No. of rows used
0 0.6739 459 1
1 0.5466 178 1
0 0.282 199.5 2
1 0.600666667 634.6666667 3
1 0.421 171 1
0 0.3653 577 1
1 0.4035 739 1
0 0.8397 178 3
1 0.2125 779 1
0 0.1502 495 1
1 0.2972 504 1
0 0.2406 245 1
1 0.0325 613 1
0 0.8642 539 1
1 0.1993 655.6666667 3
0 0.0485 755 1
1 0.0553 412 1
0 0.291633333 410.6666667 3
0 0.59575 598.5 2
1 0.5557 177 1
I haven't managed to find another similar scenario to mine whilst searching Stack Overflow so any help would be really appreciated.
Here is a base-R solution:
## define a function to split the run-length if greater than 3
split.3 <- function(l,v) {
o <- c(values=v,lengths=min(l,3))
while(l > 3) {
l <- l - 3
o <- rbind(o,c(values=v,lengths=min(l,3)))
}
return(o)
}
## compute the run-length encoding of column A
rl <- rle(df$A)
## apply split.3 to the run-length encoding
## the first column of vl are the values of column A
## the second column of vl are the corresponding run-length limited to 3
vl <- do.call(rbind,mapply(split.3,rl$lengths,rl$values))
## compute the begin and end row indices of df for each value of A to average
fin <- cumsum(vl[,2])
beg <- fin - vl[,2] + 1
## compute the averages
out <- do.call(rbind,lapply(1:length(beg), function(i) data.frame(`Value of A`=vl[i,1],
B.Av=mean(df$B[beg[i]:fin[i]]),
C.Av=mean(df$C[beg[i]:fin[i]]),
`No. of rows used`=fin[i]-beg[i]+1)))
## Value.of.A B.Av C.Av No..of.rows.used
##1 0 0.6739000 459.0000 1
##2 1 0.5466000 178.0000 1
##3 0 0.2820000 199.5000 2
##4 1 0.6006667 634.6667 3
##5 1 0.4210000 171.0000 1
##6 0 0.3653000 577.0000 1
##7 1 0.4035000 739.0000 1
##8 0 0.8397000 178.0000 3
##9 1 0.2125000 779.0000 1
##10 0 0.1502000 495.0000 1
##11 1 0.2972000 504.0000 1
##12 0 0.2406000 245.0000 1
##13 1 0.0325000 613.0000 1
##14 0 0.8642000 539.0000 1
##15 1 0.1993000 655.6667 3
##16 0 0.0485000 755.0000 1
##17 1 0.0553000 412.0000 1
##18 0 0.2916333 410.6667 3
##19 0 0.5957500 598.5000 2
##20 1 0.5557000 177.0000 1
Here is a data.table solution:
library(data.table)
setDT(df)
# create two group variables, consecutive A and for each consecutive A every three rows
(df[,rleid := rleid(A)][, threeWindow := ((1:.N) - 1) %/% 3, rleid]
# calculate the average of the columns grouped by the above two variables
[, c(.N, lapply(.SD, mean)), .(rleid, threeWindow)]
# drop group variables
[, `:=`(rleid = NULL, threeWindow = NULL)][])
# N A B C
#1: 1 0 0.6739000 459.0000
#2: 1 1 0.5466000 178.0000
#3: 2 0 0.2820000 199.5000
#4: 3 1 0.6006667 634.6667
#5: 1 1 0.4210000 171.0000
#6: 1 0 0.3653000 577.0000
#7: 1 1 0.4035000 739.0000
#8: 3 0 0.8397000 178.0000
#9: 1 1 0.2125000 779.0000
#10: 1 0 0.1502000 495.0000
#11: 1 1 0.2972000 504.0000
#12: 1 0 0.2406000 245.0000
#13: 1 1 0.0325000 613.0000
#14: 1 0 0.8642000 539.0000
#15: 3 1 0.1993000 655.6667
#16: 1 0 0.0485000 755.0000
#17: 1 1 0.0553000 412.0000
#18: 3 0 0.2916333 410.6667
#19: 2 0 0.5957500 598.5000
#20: 1 1 0.5557000 177.0000
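If the result of the chain above is stored, say as res (an assumed name), the columns can be renamed to match the headings asked for in the question:

# sketch; assumes the data.table chain above was assigned to `res`
setnames(res, c("N", "A", "B", "C"),
         c("No. of rows used", "Value of A", "B.Av", "C.Av"))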
I have a data set similar to this, with around 80 variables (flags) and 80,000 rows:
Acc_Nbr flag1 flag2 flag3 flag4 Exposure
ab 1 0 1 0 1000
bc 0 1 1 0 2000
cd 1 1 0 1 3000
ef 1 0 1 1 4000
Expected Output
Variable Count_Acct_Number Sum_Exposure Total_Acct_Number Total_Expo
flag1 3 8000 4 10000
flag2 2 5000 4 10000
flag3 3 7000 4 10000
flag4 2 7000 4 10000
Basically, I want the output to show, for each flag variable, the count of account numbers and the sum of exposure for accounts where the flag is 1, and next to them the total count of account numbers and the total exposure.
Please help.
We can convert the 'data.frame' to a 'data.table' with setDT(df1) and reshape it to 'long' format with melt. Grouped by 'variable', we take the sum of 'value1', the sum of 'Exposure' where 'value1' is 1, the number of rows (.N), and the sum of all the values in 'Exposure', which gives the expected output.
library(data.table)
melt(setDT(df1), measure=patterns("^flag"))[,
list(Count_Acct_Number= sum(value1),
Sum_Exposure= sum(Exposure[value1==1]),
Total_Acct_Number = .N,
TotalExposure=sum(Exposure)),
by = variable]
# variable Count_Acct_Number Sum_Exposure Total_Acct_Number TotalExposure
#1: flag1 3 8000 4 10000
#2: flag2 2 5000 4 10000
#3: flag3 3 7000 4 10000
#4: flag4 2 7000 4 10000
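A quick base R cross-check of the same numbers (a sketch, assuming df1 is the data.frame shown in the question):

flags <- grep("^flag", names(df1), value = TRUE)
data.frame(variable          = flags,
           Count_Acct_Number = colSums(df1[flags]),
           Sum_Exposure      = sapply(flags, function(f) sum(df1$Exposure[df1[[f]] == 1])),
           Total_Acct_Number = nrow(df1),
           TotalExposure     = sum(df1$Exposure))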
A straightforward way is to use the doBy package
library(doBy)
df <- data.frame(account=LETTERS[1:10], exposure=1:10*3.14, mark=round(runif(10)))
res <- as.data.frame(summaryBy(exposure~mark+account, df, FUN=sum))
subset(res, mark==0)
Starting with the base data (note that the sample data is random, so your mark values may differ)
> df
account exposure mark
1 A 3.14 1
2 B 6.28 1
3 C 9.42 0
4 D 12.56 0
5 E 15.70 1
6 F 18.84 0
7 G 21.98 1
8 H 25.12 0
9 I 28.26 1
10 J 31.40 0
gives an intermediate result grouped by mark (in this case there is no actual summing, since each account appears once, but it would work the same way if there were)
> res
mark account exposure.sum
1 0 A 3.14
2 0 D 12.56
3 0 F 18.84
4 0 H 25.12
5 1 B 6.28
6 1 C 9.42
7 1 E 15.70
8 1 G 21.98
9 1 I 28.26
10 1 J 31.40
The final result can be selected with
> subset(res, mark==0)
mark account exposure.sum
1 0 A 3.14
2 0 D 12.56
3 0 F 18.84
4 0 H 25.12
I have a dataframe with person_id, study_id columns like below:
person_id study_id
10 1
11 2
10 3
10 4
11 5
I want to get the count of persons (unique by person_id) with 1 study, 2 studies, and so on - so not the count for a particular value of study_id, but:
2 persons with 1 study
3 persons with 2 studies
1 person with 3 studies
etc
How can I do this? I could probably count with a loop, but I wonder if there is a package that makes it easier?
To get a sample data set that better matches your expected output, I'll use this:
dd <- data.frame(
person_id = c(10, 11, 15, 12, 10, 13, 10, 11, 12, 14, 15),
study_id = 1:11
)
Now I can count the number of people with a given number of studies using:
table(rowSums(with(dd, table(person_id, study_id))>0))
# 1 2 3
# 2 3 1
where the top line is the number of studies and the bottom line is the number of people with that number of studies.
This works because
with(dd, table(person_id, study_id))
returns
study_id
person_id 1 2 3 4 5 6 7 8 9 10 11
10 1 0 0 0 1 0 1 0 0 0 0
11 0 1 0 0 0 0 0 1 0 0 0
12 0 0 0 1 0 0 0 0 1 0 0
13 0 0 0 0 0 1 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 1 0
15 0 0 1 0 0 0 0 0 0 0 1
and then we use >0 and rowSums to get a count of unique studies for each person. Then we use table again to summarize the results.
If creating the table for your data takes up too much RAM, you can try
table(with(dd, tapply(study_id, person_id, function(x) length(unique(x)))))
which is a slightly different way to get at the same thing.
You can use the aggregate function to get counts per user.
Then use it again to get counts per counts
i.e. assume your data is called "test"
person_id study_id
10 1
11 2
10 3
10 4
11 5
12 NA
You can set your NA to be a number such as zero so they are not ignored i.e.
test$study_id[is.na(test$study_id)] = 0
Then you can run the same function but with a condition that the study_id has to be greater than zero
stg=setNames(
aggregate(
study_id~person_id,
data=test,function(x){sum(x>0)}),
c("person_id","num_studies"))
Output:
stg
person_id num_studies
10 3
11 2
12 0
Then do the same to get counts of counts
setNames(
aggregate(
person_id~num_studies,
data=stg,length),
c("num_studies","num_users"))
Output:
num_studies num_users
0 1
2 1
3 1
Here's a solution using dplyr
library(dplyr)
tmp <- df %>%
group_by(person_id) %>%
summarise(num.studies = n()) %>%
group_by(num.studies) %>%
summarise(num.persons = n())
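A shorter equivalent uses dplyr::count twice on the same df; because the first count already creates a column named n, the second count names its tally nn:

library(dplyr)
# persons per number of studies
df %>% count(person_id) %>% count(n)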
> dat <- read.table(h=T, text = "person_id study_id
10 1
11 2
10 3
10 4
11 5
12 6")
I think you can just use xtabs for this. I may have misunderstood the question, but it seems like that's what you want.
> table(xtabs(dat))
# 10 11 12
# 3 2 1
df <- data.frame(
person_id = c(10,11,10,10,11,11,11),
study_id = c(1,2,3,4,5,5,1))
# remove replicated rows
df <- unique(df)
# number of studies each person has been in:
summary(as.factor(df$person_id))
#10 11
# 3 4
# number of people in each study
summary(as.factor(df$study_id))
# 1 2 3 4 5
# 2 1 1 1 2
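To turn those per-person study counts into the asked-for "number of persons per number of studies", one more table() call is enough (a small sketch on the deduplicated df above):

# e.g. one person took part in 3 studies and one in 4
table(table(df$person_id))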