Cumulative count of each value [duplicate] - r

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 2 years ago.
I want to create a cumulative counter of the number of times each value appears.
e.g. say I have the column:
id
1
2
3
2
2
1
2
3
This would become:
id count
1 1
2 1
3 1
2 2
2 3
1 2
2 4
3 2
etc...

The ave function computes a function by group.
> id <- c(1,2,3,2,2,1,2,3)
> data.frame(id,count=ave(id==id, id, FUN=cumsum))
id count
1 1 1
2 2 1
3 3 1
4 2 2
5 2 3
6 1 2
7 2 4
8 3 2
I use id==id to create a vector of all TRUE values, which get converted to numeric when passed to cumsum. You could replace id==id with rep(1,length(id)).

Here is a way to get the counts:
id <- c(1,2,3,2,2,1,2,3)
sapply(1:length(id),function(i)sum(id[i]==id[1:i]))
Which gives you:
[1] 1 1 1 2 3 2 4 2

The dplyr way:
library(dplyr)
foo <- data.frame(id=c(1, 2, 3, 2, 2, 1, 2, 3))
foo <- foo %>% group_by(id) %>% mutate(count=row_number())
foo
# A tibble: 8 x 2
# Groups: id [3]
id count
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 2 2
5 2 3
6 1 2
7 2 4
8 3 2
That ends up grouped by id. If you want it not grouped, add %>% ungroup().

For completeness, adding a data.table way:
library(data.table)
DT <- data.table(id = c(1, 2, 3, 2, 2, 1, 2, 3))
DT[, count := seq(.N), by = id][]
Output:
id count
1: 1 1
2: 2 1
3: 3 1
4: 2 2
5: 2 3
6: 1 2
7: 2 4
8: 3 2

The dataframe I had was too large and the accepted answer kept crashing. This worked for me:
library(plyr)
df$ones <- 1
df <- ddply(df, .(id), transform, cumulative_count = cumsum(ones))
df$ones <- NULL

Function to get the cumulative count of any array, including a non-numeric array:
cumcount <- function(x){
cumcount <- numeric(length(x))
names(cumcount) <- x
for(i in 1:length(x)){
cumcount[i] <- sum(x[1:i]==x[i])
}
return(cumcount)
}

Related

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see syntax/logic is quite similar between these two solutions:
Find event's that are equal to x
Extract their Sequence
Subset table according to specified Sequence
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)

Summary of values across rows and columns in R

I have a dataset that looks like:
Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4
I want the summary of this dataset across rows and columns for the count of each value as below:
Group 4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
The count of 4 in the dataset for group XYZ across all rows and columns is 1, for 2 and 1 its 2, for 3 its 1. I can do this by creating 4 new columns 4,3,2,1 and getting the count row wise and then column wise, but this is not efficient and scalable. I am sure there is a better way to get this done.
Using reshape2 package we can melt and dcast as follows,
library(reshape2)
dcast(na.omit(melt(df, id.vars = 'Group')), Group ~ value, fun.aggregate = length)
# Group 1 2 3 4
#1 DEF 3 1 3 1
#2 PQR 2 2 1 1
#3 XYZ 2 2 1 1
This uses no packages and is just one line. Here DF$Group[row(DF[-1])] is a Group labels vector such that each element corresponds to the unravelled numeric vector unlist(DF[-1]).
table(DF$Group[row(DF[-1])], unlist(DF[-1]))
giving:
1 2 3 4
DEF 3 1 3 1
PQR 2 2 1 1
XYZ 2 2 1 1
If the order of rows and columns shown in the question is important then to we can create factors from each of the two table arguments with the factor levels being defined in the orders desired. In this case we use the following line instead of the line of code above:
table(Group = factor(DF$Group[row(DF[-1])], unique(DF$Group)), factor(unlist(DF[-1]), 4:1))
giving:
Group 4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
The above produces an object of class "table". This is a particularly suitable class for tabulated frequencies. For example, once in this form ftable can be used to easily rearrange it further as in ftable(tab, row.vars = 2) or ftable(tab, row.vars = 1:2) where tab is the above computed table.
If a data.frame were preferred then convert it like this:
cbind(Group = rownames(tab), as.data.frame.matrix(tab))
The input data.frame DF is defined reproducibly in Note 2 at the end.
Alternatives
Although the above seems the most direct here are some other alternatives that also use no packages:
1) by For each set of rows having the same Group value the anonymous function creates a data.frame identifying the Group, converting the columns other than the first to a factor with the indicated levels and running table to get the counts. The "by" list that is returned is sorted back to the original order and we rbind everything back together.
do.call("rbind",
by(DF, DF$Group, function(x) {
data.frame(Group = x[1,1],
as.list(table(factor(unlist(x[, -1]), levels = 4:1))),
check.names = FALSE)
})[unique(DF$Group)])
giving:
Group 4 3 2 1
XYZ XYZ 1 1 2 2
DEF DEF 1 3 1 3
PQR PQR 1 1 2 2
1a) This slightly shorter variation would also work. It returns a matrix identifying the groups using row names.
kount <- function(x) table(factor(unlist(x), levels = 4:1))
m <- do.call("rbind", by(DF[, -1], DF$Group, kount)[unique(DF$Group)])
giving:
> m
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
2) outer
gps <- unique(DF$Group)
levs <- 4:1
kount2 <- function(g, lv) sum(subset(DF, Group == g)[-1] == lv, na.rm = TRUE)
m <- outer(gps, levs, Vectorize(kount2))
dimnames(m) <- list(gps, levs))
giving this matrix:
> m
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
3) sapply
kount3 <- function(g) table(factor(unlist(DF[DF$Group == g, -1]), levels = 4:1))
gps <- as.character(unique(DF$Group))
do.call("rbind", sapply(gps, kount3, simplify = FALSE))
giving:
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
4) aggregate
aggregate(1:nrow(DF), DF["Group"], function(ix)
table(factor(unlist(DF[ix, -1]), levels = 4:1)))[unique(DF$Group), ]
giving:
Group x.4 x.3 x.2 x.1
3 XYZ 1 1 2 2
1 DEF 1 3 1 3
2 PQR 1 1 2 2
5) tapply
do.call("rbind", tapply(1:nrow(DF), DF$Group, function(ix)
table(factor(unlist(DF[ix, -1]), levels = 4:1))))[unique(DF$Group), ]
6) reshape
with(reshape(DF, dir = "long", varying = list(2:5)),
table(factor(Group, unique(DF$Group)), factor(A, 4:1)))
giving:
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
Note 1: (1a), (2), (3), (5) and (6) produce a matrix or table result with groups as row names. If you prefer a data frame with Groups as a column then supposing that m is the matrix, add this:
data.frame(Group = rownames(m), m, check.names = FALSE)
Note 2: The input DF in reproducible form is:
Lines <- "Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4"
DF <- read.table(text = Lines, header = TRUE, na.strings = "Na")
We can use dplyr/tidyr
library(dplyr)
library(tidyr)
df1 %>%
mutate_each(funs(replace(., .=="Na", NA))) %>%
gather(Var, Val, A:D, na.rm=TRUE) %>%
group_by(Group, Val) %>%
tally() %>%
spread(Val, n)
# Group `1` `2` `3` `4`
#* <chr> <int> <int> <int> <int>
#1 DEF 3 1 3 1
#2 PQR 2 2 1 1
#3 XYZ 2 2 1 1

Finding unique rows in data.frame [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 4 years ago.
I have a data frame like the following example
a = c(1, 1, 1, 2, 2, 3, 4, 4)
b = c(3.5, 3.5, 2.5, 2, 2, 1, 2.2, 7)
df <-data.frame(a,b)
I can remove duplicated rows from R data frame by the following code, but how can I find how many times each duplicated rows repeated? I need the result as a vector.
unique(df)
or
df[!duplicated(df), ]
Here is solution using function ddply() from library plyr
library(plyr)
ddply(df,.(a,b),nrow)
a b V1
1 1 2.5 1
2 1 3.5 2
3 2 2.0 2
4 3 1.0 1
5 4 2.2 1
6 4 7.0 1
You could always kill two birds with the one stone:
aggregate(list(numdup=rep(1,nrow(df))), df, length)
# or even:
aggregate(numdup ~., data=transform(df,numdup=1), length)
# or even:
aggregate(cbind(df[0],numdup=1), df, length)
a b numdup
1 3 1.0 1
2 2 2.0 2
3 4 2.2 1
4 1 2.5 1
5 1 3.5 2
6 4 7.0 1
Here are two approaches.
# a example data set that is not sorted
DF <-data.frame(replicate(sequence(1:3),n=2))
# example using similar idea to duplicated.data.frame
count.duplicates <- function(DF){
x <- do.call('paste', c(DF, sep = '\r'))
ox <- order(x)
rl <- rle(x[ox])
cbind(DF[ox[cumsum(rl$lengths)],,drop=FALSE],count = rl$lengths)
}
count.duplicates(DF)
# X1 X2 count
# 4 1 1 3
# 5 2 2 2
# 6 3 3 1
# a far simpler `data.table` approach
library(data.table)
count.dups <- function(DF){
DT <- data.table(DF)
DT[,.N, by = names(DT)]
}
count.dups(DF)
# X1 X2 N
# 1: 1 1 3
# 2: 2 2 2
# 3: 3 3 1
Using dplyr:
summarise(group_by(df,a,b),length(b))
or
group_size(group_by(df,a,b))
#[1] 1 2 2 1 1 1

Add a column for counting unique tuples in the data frame [duplicate]

This question already has answers here:
How to get frequencies then add it as a variable in an array?
(3 answers)
Closed 8 years ago.
Suppose I have the following data frame:
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
df
# userID A B
# 1 1 2 2
# 2 1 3 3
# 3 3 2 1
# 4 5 1 0
# 5 3 2 1
# 6 5 1 0
I would like to create a data frame with the same columns but with an added final column that counts up the number of unique tuples / combinations of the other columns. The output should look like the following:
userID A B count
1 2 2 1
1 3 3 1
3 2 1 2
5 1 0 2
The meaning is the the tuple / combination of (1, 2, 2) occurs with count=1, while the tuple of (3, 2, 1) occurs twice so has count=2. I would prefer not to use any external packages.
1) aggregate
ag <- aggregate(count ~ ., cbind(count = 1, df), length)
ag[do.call("order", ag), ] # sort the rows
giving:
userID A B count
3 1 2 2 1
4 1 3 3 1
2 3 2 1 2
1 5 1 0 2
The last line of code which sorts the rows could be omitted if the order of the rows is unimportant.
The remaining solutions use the indicated packages:
2) sqldf
library(sqldf)
Names <- toString(names(df))
fn$sqldf("select *, count(*) count from df group by $Names order by $Names")
giving:
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
The order by clause could be omitted if the order is unimportant.
3) dplyr
library(dplyr)
df %>% regroup(as.list(names(df))) %>% summarise(count = n())
giving:
Source: local data frame [4 x 4]
Groups: userID, A
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 5 1 0 2
4) data.table
library(data.table)
data.table(df)[, list(count = .N), by = names(df)]
giving:
userID A B count
1: 1 2 2 1
2: 1 3 3 1
3: 3 2 1 2
4: 5 1 0 2
ADDED additional solutions. Also some small improvements.
Here's a fairly straightforward way (ave to the rescue!):
unique(cbind(df,
count = ave(rep(1, nrow(df)),
do.call(paste, df),
FUN = length)))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
Here's a variation of the above:
unique(within(df, {
counter <- rep(1, nrow(df))
count <- ave(counter, df, FUN = length)
rm(counter)
}))
# userID A B count
# 1 1 2 2 1
# 2 1 3 3 1
# 3 3 2 1 2
# 4 5 1 0 2
userID <- c(1, 1, 3, 5, 3, 5)
A <- c(2, 3, 2, 1, 2, 1)
B <- c(2, 3, 1, 0, 1, 0)
df <- data.frame(userID, A, B)
Make a quick factor of the tuples:
df$AB <- as.factor(paste(df$userID,df$A,df$B, sep=""))
No external packages just taking advantage of summary() and storing it as a DF then merging the counts on the original data:
df2 <- as.data.frame(summary(df$AB))
df2 <- data.frame(x=row.names(df2), y=df2[1])
names(df2) <- c("AB", "count")
df <- merge(df, df2, by="AB", all.x=TRUE)
df$AB <- NULL
Almost final output, just has dupes:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
4 3 2 1 2
5 5 1 0 2
6 5 1 0 2
Lastly, clean up dupes:
df <- df[!duplicated(df), ]
Here you go:
df
userID A B count
1 1 2 2 1
2 1 3 3 1
3 3 2 1 2
5 5 1 0 2
Been a while not doing that with sql or plyr. if you can use dplyr or a package later on do it. Bioconductor has a lot of great sequencing packages if it starts to get more complex.
Hope this helps.
This should do the trick, even if it is a little bit ugly:
vec <- table(apply(df,1,paste,collapse=""))
df2 <- data.frame(do.call(rbind,strsplit(names(vec),"")))
names(df2) <- names(df)
df2$count <- vec
# userID A B count
#1 1 2 2 1
#2 1 3 3 1
#3 3 2 1 2
#4 5 1 0 2

Create counter with multiple variables [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 6 years ago.
I have my data that looks like below:
CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013
I need to create a counter variable, which will be like below:
CustomerID TripDate TripCounter
1 1/3/2013 1
1 1/4/2013 2
1 1/9/2013 3
2 2/1/2013 1
2 2/4/2013 2
3 1/2/2013 1
Tripcounter will be for each customer.
Use ave. Assuming your data.frame is called "mydf":
mydf$counter <- with(mydf, ave(CustomerID, CustomerID, FUN = seq_along))
mydf
# CustomerID TripDate counter
# 1 1 1/3/2013 1
# 2 1 1/4/2013 2
# 3 1 1/9/2013 3
# 4 2 2/1/2013 1
# 5 2 2/4/2013 2
# 6 3 1/2/2013 1
For what it's worth, I also implemented a version of this approach in a function included in my "splitstackshape" package. The function is called getanID:
mydf <- data.frame(IDA = c("a", "a", "a", "b", "b", "b", "b"),
IDB = c(1, 2, 1, 1, 2, 2, 2), values = 1:7)
mydf
# install.packages("splitstackshape")
library(splitstackshape)
# getanID(mydf, id.vars = c("IDA", "IDB"))
getanID(mydf, id.vars = 1:2)
# IDA IDB values .id
# 1 a 1 1 1
# 2 a 2 2 1
# 3 a 1 3 2
# 4 b 1 4 1
# 5 b 2 5 1
# 6 b 2 6 2
# 7 b 2 7 3
As you can see from the example above, I've written the function in such a way that you can specify one or more columns that should be treated as ID columns. It checks to see if any of the id.vars are duplicated, and if they are, then it generates a new ID variable for you.
You can also use plyr for this (using #AnadaMahto's example data):
> ddply(mydf, .(IDA), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 2 2 2
3 a 1 3 3
4 b 1 4 1
5 b 2 5 2
6 b 2 6 3
7 b 2 7 4
or even:
> ddply(mydf, .(IDA, IDB), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 1 3 2
3 a 2 2 1
4 b 1 4 1
5 b 2 5 1
6 b 2 6 2
7 b 2 7 3
Note that plyr does not have a reputation for being the quickest solution, for that you need to take a look at data.table.
Here's a data.table approach:
library(data.table)
DT <- data.table(mydf)
DT[, .id := sequence(.N), by = "IDA,IDB"]
DT
# IDA IDB values .id
# 1: a 1 1 1
# 2: a 2 2 1
# 3: a 1 3 2
# 4: b 1 4 1
# 5: b 2 5 1
# 6: b 2 6 2
# 7: b 2 7 3
meanwhile, you can also use dplyr. if your data.frame is called mydata
library(dplyr)
mydata %>% group_by(CustomerID) %>% mutate(TripCounter = row_number())
I need to do this often, and wrote a function that accomplishes it differently than the previous answers. I am not sure which solution is most efficient.
idCounter <- function(x) {
unlist(lapply(rle(x)$lengths, seq_len))
}
mydf$TripCounter <- idCounter(mydf$CustomerID)
Here's the procedure styled code. I dont believe in things like if you are using loop in R then you are probably doing something wrong
x <- dataframe$CustomerID
dataframe$counter <- 0
y <- dataframe$counter
count <- 1
for (i in 1:length(x)) {
ifelse (x[i] == x[i-1], count <- count + 1, count <- 1 )
y[i] <- count
}
dataframe$counter <- y
This isn't the right answer but showing some interesting things comparing to for loops, vectorization is fast does not care about sequential updating.
a<-read.table(textConnection(
"CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013 "), header=TRUE)
a <- a %>%
group_by(CustomerID,TripDate) # must in order
res <- rep(1, nrow(a)) #base # 1
res[2:6] <-sapply(2:6, function(i)if(a$CustomerID[i]== a$CustomerID[i - 1]) {res[i] = res[i-1]+1} else {res[i]= res[i]})
a$TripeCounter <- res

Resources