Finding unique rows in data.frame [duplicate] - r

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 4 years ago.
I have a data frame like the following example
a = c(1, 1, 1, 2, 2, 3, 4, 4)
b = c(3.5, 3.5, 2.5, 2, 2, 1, 2.2, 7)
df <-data.frame(a,b)
I can remove duplicated rows from an R data frame with the following code, but how can I find how many times each duplicated row is repeated? I need the result as a vector.
unique(df)
or
df[!duplicated(df), ]

Here is a solution using the ddply() function from the plyr package:
library(plyr)
ddply(df,.(a,b),nrow)
a b V1
1 1 2.5 1
2 1 3.5 2
3 2 2.0 2
4 3 1.0 1
5 4 2.2 1
6 4 7.0 1
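If only the counts are needed as a plain vector (as the question asks), one option is to pull the V1 column out of that result; a minimal sketch, reusing the ddply() call above:
library(plyr)
dup_counts <- ddply(df, .(a, b), nrow)$V1  # counts per unique (a, b) row
dup_counts
# [1] 1 2 2 1 1 1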

You could always kill two birds with the one stone:
aggregate(list(numdup=rep(1,nrow(df))), df, length)
# or even:
aggregate(numdup ~., data=transform(df,numdup=1), length)
# or even:
aggregate(cbind(df[0],numdup=1), df, length)
a b numdup
1 3 1.0 1
2 2 2.0 2
3 4 2.2 1
4 1 2.5 1
5 1 3.5 2
6 4 7.0 1

Here are two approaches.
# an example data set that is not sorted
DF <- data.frame(replicate(sequence(1:3), n = 2))
# example using similar idea to duplicated.data.frame
count.duplicates <- function(DF){
  # collapse each row into a single string so whole rows can be compared
  x <- do.call('paste', c(DF, sep = '\r'))
  ox <- order(x)
  # run-length encode the sorted strings: each run is one distinct row
  rl <- rle(x[ox])
  cbind(DF[ox[cumsum(rl$lengths)], , drop = FALSE], count = rl$lengths)
}
count.duplicates(DF)
# X1 X2 count
# 4 1 1 3
# 5 2 2 2
# 6 3 3 1
# a far simpler `data.table` approach
library(data.table)
count.dups <- function(DF){
  DT <- data.table(DF)
  DT[, .N, by = names(DT)]
}
count.dups(DF)
# X1 X2 N
# 1: 1 1 3
# 2: 2 2 2
# 3: 3 3 1

Using dplyr:
summarise(group_by(df,a,b),length(b))
or
group_size(group_by(df,a,b))
#[1] 1 2 2 1 1 1
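A more recent dplyr idiom for the same counts is count(); a minimal sketch, assuming a dplyr version that provides count() (0.7 or later):
library(dplyr)
count(df, a, b)      # one row per unique (a, b) combination, with the count in column n
count(df, a, b)$n    # just the counts as a vector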

Related

Extract data based on another list

I am trying to extract rows of a dataset based on a list of time points nested within individuals. I have repeated time points (and therefore exactly the same variable values), but I still want to keep the duplicated rows. How can I achieve that in base R?
Here is the original dataset:
xx <- data.frame(id=rep(1:3, each=3), time=1:3, y=rep(1:3, each=3))
Here is the list of time-point vectors, one per id:
lst <- list(`1` = c(1, 1, 2), `2` = c(1, 3, 3), `3` = c(2, 2, 3))
Desired outcome:
id time y
1 1 1
1 1 1 #this is the duplicated row
1 2 1
2 1 2
2 3 2
2 3 2 #this is the duplicated row
3 2 3
3 2 3 #this is the duplicated row
3 3 3
The code do.call(rbind, Map(function(p, q) subset(xx, id == q & time %in% p), lst, names(lst))) did not work for me because subset removes duplicated rows
The issue is that %in% does not repeat rows for values that appear more than once in p. To get those repeats, we need to also iterate (lapply) over p internally. I'll wrap your inner subset in another do.call(rbind, lapply(p, ...)) to get what you expect:
do.call(rbind, Map(function(p, q) {
  do.call(rbind, lapply(p, function(p0) subset(xx, id == q & time %in% p0)))
}, lst, names(lst)))
# id time y
# 1.1 1 1 1
# 1.2 1 1 1
# 1.21 1 2 1
# 2.4 2 1 2
# 2.6 2 3 2
# 2.61 2 3 2
# 3.8 3 2 3
# 3.81 3 2 3
# 3.9 3 3 3
(Row names are a distraction here ...)
An alternative would be to convert your lst into a frame of id and time, and then left-join on it:
frm <- do.call(rbind, Map(function(x, nm) data.frame(id = nm, time = x), lst, names(lst)))
frm
# id time
# 1.1 1 1
# 1.2 1 1
# 1.3 1 2
# 2.1 2 1
# 2.2 2 3
# 2.3 2 3
# 3.1 3 2
# 3.2 3 2
# 3.3 3 3
merge(frm, xx, by = c("id", "time"), all.x = TRUE)
# id time y
# 1 1 1 1
# 2 1 1 1
# 3 1 2 1
# 4 2 1 2
# 5 2 3 2
# 6 2 3 2
# 7 3 2 3
# 8 3 2 3
# 9 3 3 3
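If a dplyr solution is acceptable (the question asks for base R, so this is only an aside), a left join gives the same result. A minimal sketch; note that frm$id is character (it comes from names(lst)) and must be converted to match xx$id before joining:
library(dplyr)
frm$id <- as.integer(frm$id)   # names(lst) are character; match the integer type of xx$id
left_join(frm, xx, by = c("id", "time"))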
Two good resources for learning about merges/joins:
How to join (merge) data frames (inner, outer, left, right)
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?

multiplication with dataframes in r

I have two dataframes. One is called data, like
data <- data.frame(ID = c(1, 1, 2, 2),
                   Number = c(1, 2, 1, 2),
                   Answer = c(1, 2, 3, 2))
The other is called weights, like
weights <- data.frame(Number = c(1, 2),
                      weight1 = c(0.5, 1),
                      weight2 = c(1, 1))
What I want is to multiply data$Answer by the weight columns in weights, matching rows on Number (which appears in both data frames). The final result should look like:
ID Number Answer Answer*Weights1 Answer*Weights2
1 1 1 1 1*0.5 1*1
2 1 2 2 2*1 2*1
3 2 1 3 3*0.5 3*1
4 2 2 2 2*1 2*1
How can I achieve this? Your input will be deeply appreciated. Thanks.
data <- merge(data, weights, by = "Number")
data <- transform(data,
                  A1 = Answer * weight1,
                  A2 = Answer * weight2)
# Number ID Answer weight1 weight2 A1 A2
#1 1 1 1 0.5 1 0.5 1
#2 1 2 3 0.5 1 1.5 3
#3 2 1 2 1.0 1 2.0 2
#4 2 2 2 1.0 1 2.0 2
You could also do
library(dplyr)
left_join(data, weights, by = "Number") %>%
  select(ID:Answer, Answer_weight1 = weight1, Answer_weight2 = weight2) %>%
  mutate_each(funs(Answer * .), contains("weight"))
# ID Number Answer Answer_weight1 Answer_weight2
# 1 1 1 1 0.5 1
# 2 1 2 2 2.0 2
# 3 2 1 3 1.5 3
# 4 2 2 2 2.0 2
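mutate_each() and funs() have since been deprecated in dplyr; assuming dplyr 1.0 or later, a roughly equivalent pipeline with across() would be:
library(dplyr)
left_join(data, weights, by = "Number") %>%
  select(ID:Answer, Answer_weight1 = weight1, Answer_weight2 = weight2) %>%
  mutate(across(contains("weight"), ~ Answer * .x))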
Here's how you could do this using data.table:
require(data.table) ## 1.9.2
setDT(data) ## convert data.frame to data.table by reference
setDT(weights)
setkey(data, Number) ## set the key columns to join by
data[weights, c("Answer1", "Answer2") :=
list(Answer * weight1, Answer * weight2)]
We perform a join but create the required columns directly, without materialising the intermediate columns (weight1, weight2), which makes it quite memory efficient. It modifies data in place, by reference.
Just in case you want those entries in the Answers*Weights1 and Answers*Weights2 columns to be strings and not actually multiplication, like you have in your original post:
data <- cbind(data,
              paste(data[, 3], weights[, 2], sep = "*"),
              paste(data[, 3], weights[, 3], sep = "*"))
names(data)[4:5] <- c("Answer*Weights1", "Answer*Weights2")
# ID Number Answer Answer*Weights1 Answer*Weights2
# 1 1 1 1 1*0.5 1*1
# 2 1 2 2 2*1 2*1
# 3 2 1 3 3*0.5 3*1
# 4 2 2 2 2*1 2*1
Or if you want numbers instead of strings
data[, 4] <- data[, 3] * weights[, 2]
data[, 5] <- data[, 3] * weights[, 3]
names(data)[4:5] <- c("Answer*Weights1", "Answer*Weights2")
# ID Number Answer Answer*Weights1 Answer*Weights2
# 1 1 1 1 0.5 1
# 2 1 2 2 2.0 2
# 3 2 1 3 1.5 3
# 4 2 2 2 2.0 2

Cumulative count of each value [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 2 years ago.
I want to create a cumulative counter of the number of times each value appears.
e.g. say I have the column:
id
1
2
3
2
2
1
2
3
This would become:
id count
1 1
2 1
3 1
2 2
2 3
1 2
2 4
3 2
etc...
The ave function applies a function within each group.
> id <- c(1,2,3,2,2,1,2,3)
> data.frame(id,count=ave(id==id, id, FUN=cumsum))
id count
1 1 1
2 2 1
3 3 1
4 2 2
5 2 3
6 1 2
7 2 4
8 3 2
I use id==id to create a vector of all TRUE values, which get converted to numeric when passed to cumsum. You could replace id==id with rep(1,length(id)).
Here is a way to get the counts:
id <- c(1,2,3,2,2,1,2,3)
sapply(1:length(id),function(i)sum(id[i]==id[1:i]))
Which gives you:
[1] 1 1 1 2 3 2 4 2
The dplyr way:
library(dplyr)
foo <- data.frame(id=c(1, 2, 3, 2, 2, 1, 2, 3))
foo <- foo %>% group_by(id) %>% mutate(count=row_number())
foo
# A tibble: 8 x 2
# Groups: id [3]
id count
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 2 2
5 2 3
6 1 2
7 2 4
8 3 2
That ends up grouped by id. If you want it not grouped, add %>% ungroup().
For completeness, adding a data.table way:
library(data.table)
DT <- data.table(id = c(1, 2, 3, 2, 2, 1, 2, 3))
DT[, count := seq(.N), by = id][]
Output:
id count
1: 1 1
2: 2 1
3: 3 1
4: 2 2
5: 2 3
6: 1 2
7: 2 4
8: 3 2
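data.table also provides rowid(), which produces the same per-group running index directly; a minimal sketch, assuming data.table 1.9.8 or later (where rowid() was introduced):
library(data.table)
DT <- data.table(id = c(1, 2, 3, 2, 2, 1, 2, 3))
DT[, count := rowid(id)][]   # same result as seq(.N) by id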
The dataframe I had was too large and the accepted answer kept crashing. This worked for me:
library(plyr)
df$ones <- 1
df <- ddply(df, .(id), transform, cumulative_count = cumsum(ones))
df$ones <- NULL
A function to get the cumulative count of any vector, including a non-numeric one:
cumcount <- function(x){
  cumcount <- numeric(length(x))
  names(cumcount) <- x
  for(i in 1:length(x)){
    cumcount[i] <- sum(x[1:i] == x[i])
  }
  return(cumcount)
}
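For example, with a small character vector (a hypothetical input, not from the question):
cumcount(c("a", "b", "a", "a", "b"))
# a b a a b
# 1 1 2 3 2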
