R - Count duplicated rows keeping index of their first occurrences

I have been looking for an efficient way of counting and removing duplicate rows in a data frame while keeping the index of their first occurrences.
For example, if I have a data frame:
df<-data.frame(x=c(9.3,5.1,0.6,0.6,8.5,1.3,1.3,10.8),y=c(2.4,7.1,4.2,4.2,3.2,8.1,8.1,5.9))
ddply(df,names(df),nrow)
gives me
x y V1
1 0.6 4.2 2
2 1.3 8.1 2
3 5.1 7.1 1
4 8.5 3.2 1
5 9.3 2.4 1
6 10.8 5.9 1
But I want to keep the original indices (the row names) of the first occurrence of each duplicated row, like this:
x y V1
1 9.3 2.4 1
2 5.1 7.1 1
3 0.6 4.2 2
5 8.5 3.2 1
6 1.3 8.1 2
8 10.8 5.9 1
"duplicated" lets me recover the original row names (here {1 2 3 5 6 8}) but doesn't count the number of occurrences. I tried writing functions of my own, but none of them is efficient enough to handle big data. My data frame can have up to a couple of million rows (though usually only 5 to 10 columns).

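If you want to keep the index:
library(data.table)
# Deduplicate the grouped result itself, so the logical index lines up with
# its rows even when duplicate rows are not adjacent in df
res <- setDT(df)[, .(.I, .N), by = names(df)]
res[!duplicated(res, by = names(df))]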
# x y I N
#1: 9.3 2.4 1 1
#2: 5.1 7.1 2 1
#3: 0.6 4.2 3 2
#4: 8.5 3.2 5 1
#5: 1.3 8.1 6 2
#6: 10.8 5.9 8 1
Or using data.table's unique method:
unique(setDT(df)[,.(.I, .N), by = names(df)], by = names(df))

We can try with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by the 'x' and 'y' columns, get the number of rows in each group (.N).
library(data.table)
setDT(df)[, list(V1=.N), by = .(x,y)]
# x y V1
#1: 9.3 2.4 1
#2: 5.1 7.1 1
#3: 0.6 4.2 2
#4: 8.5 3.2 1
#5: 1.3 8.1 2
#6: 10.8 5.9 1
If we need the row ids,
setDT(df)[, list(V1= .N, rn=.I[1L]), by = .(x,y)]
# x y V1 rn
#1: 9.3 2.4 1 1
#2: 5.1 7.1 1 2
#3: 0.6 4.2 2 3
#4: 8.5 3.2 1 5
#5: 1.3 8.1 2 6
#6: 10.8 5.9 1 8
Or
# keep.rownames=TRUE only adds the 'rn' column when df is still a plain data.frame
setDT(df, keep.rownames=TRUE)[, list(V1=.N, rn=rn[1L]), .(x,y)]
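For comparison, here is a minimal base R sketch of the same idea, with no packages. It is not taken from the answers above. It assumes that pasting the columns together gives a unique key per distinct row, and the "\r" separator is an arbitrary choice assumed not to occur in the data.

```r
df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))

# One string key per row (assumption: "\r" never appears in the data)
key <- do.call(paste, c(df, sep = "\r"))

# TRUE for the first occurrence of each distinct row
first <- !duplicated(key)

# Map every row to the position of its key among the first occurrences
grp <- match(key, key[first])

# Keep the first occurrences (row names are preserved) and attach the counts
res <- df[first, , drop = FALSE]
res$V1 <- tabulate(grp)
res
```

Because the key is built with paste, numeric columns are compared through their character representation, so values that differ only far beyond the printed precision may be treated as equal.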

Related

Create a new row from the average of specific rows from all columns [duplicate]

Let's say I have a dataset.
w=c(5,6,7,8)
x=c(1,2,3,4)
y=c(1,2,3,5)
length(y)=4
z=data.frame(w,x,y)
This will return
w x y
1 5 1 1
2 6 2 2
3 7 3 3
4 8 4 5
I would like to have a 5th row that averages only row 2 and 3.
w x y
1 5 1 1
2 6 2 2
3 7 3 3
4 8 4 5
5 6.5 2.5 2.5
How would I approach this?
There are a lot of examples with rowMeans, but I'm looking to average all columns, and from only specific rows.
You can use colMeans:
rows <- c(2, 3)
rbind(z, colMeans(z[rows,]))
# w x y
#1 5.0 1.0 1.0
#2 6.0 2.0 2.0
#3 7.0 3.0 3.0
#4 8.0 4.0 5.0
#5 6.5 2.5 2.5
Does this work:
library(dplyr)
z %>% bind_rows(sapply(z[2:3,], mean))
w x y
1 5.0 1.0 1.0
2 6.0 2.0 2.0
3 7.0 3.0 3.0
4 8.0 4.0 5.0
5 6.5 2.5 2.5
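The colMeans approach can be wrapped in a small helper so the rows to average are a parameter. This is only a sketch: the function name append_row_mean is made up for illustration, and it assumes every column is numeric.

```r
z <- data.frame(w = c(5, 6, 7, 8), x = c(1, 2, 3, 4), y = c(1, 2, 3, 5))

# Hypothetical helper: append one row holding the column means of `rows`
append_row_mean <- function(data, rows) {
  # drop = FALSE keeps a data frame even if a single row is selected
  avg <- colMeans(data[rows, , drop = FALSE])
  # Turn the named vector into a one-row data frame with matching column names
  rbind(data, as.data.frame(as.list(avg)))
}

out <- append_row_mean(z, c(2, 3))
out
```

Converting the means to a one-row data frame before rbind keeps the column matching by name rather than by position.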

Select the row conditioning on the first occurrence of a fixed value using R

Here is my repeated measurements dataframe
subject StartTime_month StopTime_month ...
1 0.0 0.5
1 0.5 1.0
1 1.0 3.0
1 3.0 6.0
1 6.0 9.6
1 9.6 12.1
2 0.0 0.5
2 0.5 1.0
2 1.0 1.9
2 1.9 3.2
2 3.2 6.2
2 6.2 8.2
I would like to select, for each subject, the first row with StopTime_month > 6.0.
We can try with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'subject', get the row index of the first instance where 'StopTime_month' is greater than 6, and use that to subset the rows
library(data.table)
setDT(df1)[df1[, .I[which(StopTime_month > 6)[1]], by = subject]$V1]
# subject StartTime_month StopTime_month
#1: 1 6.0 9.6
#2: 2 3.2 6.2
If instead we need all the rows up to and including the first instance of 'StopTime_month' greater than 6:
setDT(df1)[, .SD[cumsum(StopTime_month > 6)<2], by = subject]
# subject StartTime_month StopTime_month
# 1: 1 0.0 0.5
# 2: 1 0.5 1.0
# 3: 1 1.0 3.0
# 4: 1 3.0 6.0
# 5: 1 6.0 9.6
# 6: 2 0.0 0.5
# 7: 2 0.5 1.0
# 8: 2 1.0 1.9
# 9: 2 1.9 3.2
#10: 2 3.2 6.2
Or using dplyr
library(dplyr)
df1 %>%
  filter(StopTime_month > 6) %>%
  group_by(subject) %>%
  slice(1L)
# subject StartTime_month StopTime_month
# <int> <dbl> <dbl>
#1 1 6.0 9.6
#2 2 3.2 6.2
With base R aggregate
aggregate(.~subject, df1[df1$StopTime_month > 6, ], function(x) x[1])
# subject StartTime_month StopTime_month
#1 1 6.0 9.6
#2 2 3.2 6.2
A base R solution:
For subject 1:
df[df$subject==1 & df$StopTime_month > 6,][1,]
For subject 2:
df[df$subject==2 & df$StopTime_month > 6,][1,]
(where df is your dataframe)
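The per-subject subsetting above can be done for all subjects at once. The following base R sketch rebuilds the example data under the name df1 (only the three columns shown in the question) and assumes the rows are already ordered by time within each subject.

```r
df1 <- data.frame(
  subject         = rep(1:2, each = 6),
  StartTime_month = c(0, 0.5, 1, 3, 6, 9.6, 0, 0.5, 1, 1.9, 3.2, 6.2),
  StopTime_month  = c(0.5, 1, 3, 6, 9.6, 12.1, 0.5, 1, 1.9, 3.2, 6.2, 8.2)
)

# Keep only the rows that satisfy the condition
hits <- df1[df1$StopTime_month > 6, ]

# Within the remaining rows, keep the first one for each subject
first_hits <- hits[!duplicated(hits$subject), ]
first_hits
```

Subjects that never exceed the threshold simply do not appear in the result.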

remove duplicate id based on certain criteria [duplicate]

id <- c(1,1,2,3,4,4,5,6,7,7,7,8,9)
age <- c(10,10.6,11,11.3,10.9,11.4,10.7,11,10.5,11.1,12.3,10.3,10.7)
ageto11 <- abs(age-11)
df <- as.data.frame(cbind(id,age,ageto11))
df
id age ageto11
1 1 10.0 1.0
2 1 10.6 0.4
3 2 11.0 0.0
4 3 11.3 0.3
5 4 10.9 0.1
6 4 11.4 0.4
7 5 10.7 0.3
8 6 11.0 0.0
9 7 10.5 0.5
10 7 11.1 0.1
11 7 12.3 1.3
12 8 10.3 0.7
13 9 10.7 0.3
I am trying to remove the duplicated id in the above data frame, based on the criteria of selecting the smallest distance to age 11 (i.e. the smallest value of ageto11)
For example, when id=1, I would like to remove the first row, in which ageto11 is larger.
When id=7, I would like to keep the 10th row, in which ageto11 is the smallest.
The desired result should be like
id age ageto11
2 1 10.6 0.4
3 2 11.0 0.0
4 3 11.3 0.3
5 4 10.9 0.1
7 5 10.7 0.3
8 6 11.0 0.0
10 7 11.1 0.1
12 8 10.3 0.7
13 9 10.7 0.3
We convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by 'id', take the difference of 'age' from 11, find the index of the minimum absolute value (which.min(abs(age-11))), and use it to subset the rows of each group (.SD).
library(data.table)
setDT(df)[,.SD[which.min(abs(age-11))] , id]
# id age ageto11
#1: 1 10.6 0.4
#2: 2 11.0 0.0
#3: 3 11.3 0.3
#4: 4 10.9 0.1
#5: 5 10.7 0.3
#6: 6 11.0 0.0
#7: 7 11.1 0.1
#8: 8 10.3 0.7
#9: 9 10.7 0.3
EDIT: Just notified by @Pascal that the distance is already calculated in 'ageto11'. In that case
setDT(df)[, .SD[which.min(ageto11)], id]
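The same selection can be sketched in base R by sorting on the distance and keeping the first row per id. This is an alternative technique, not the data.table method above. When two rows of an id tie on the distance, the earlier row wins.

```r
id  <- c(1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 8, 9)
age <- c(10, 10.6, 11, 11.3, 10.9, 11.4, 10.7, 11, 10.5, 11.1, 12.3, 10.3, 10.7)
df  <- data.frame(id, age, ageto11 = abs(age - 11))

# Sort by id, then by distance to 11, so the closest row comes first within each id
ord <- df[order(df$id, df$ageto11), ]

# Keep the first (closest) row for each id; original row names are preserved
res <- ord[!duplicated(ord$id), ]
res
```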

Computing Colwise Means on a Given Interval

I have a data frame in R that can be approximated as:
df <- data.frame(x = rep(1:5, each = 4), y = rep(2:6, each = 4), z = rep(3:7, each = 4))
> df
x y z
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 2 3 4
6 2 3 4
7 2 3 4
8 2 3 4
9 3 4 5
10 3 4 5
11 3 4 5
12 3 4 5
13 4 5 6
14 4 5 6
15 4 5 6
16 4 5 6
17 5 6 7
18 5 6 7
19 5 6 7
20 5 6 7
I'd like to compute colwise means at intervals of 5, and then collapse these means into a new data frame. For example, I'd like to compute the colwise means of df[1:5,], df[6:10,], df[11:15,], and df[16:20,], and return a df that looks as follows:
[,1] [,2] [,3]
[1,] 1.2 2.2 3.2
[2,] 2.4 3.4 4.4
[3,] 3.6 4.6 5.6
[4,] 4.8 5.8 6.8
I'm currently using a for-loop as such (where temp.coeff would correspond to the "5" specified above):
my.means <- NULL
for (j in 1:baseFreq) {
  # Column means of the j-th block of temp.coeff rows
  temp.mean <- colMeans(temp.df[(temp.coeff*(j-1)+1):(temp.coeff*j),])
  my.means <- rbind(my.means, temp.mean)
}
my.means <- t(my.means)
collapsed.df <- t(data.frame(colMeans(my.means)))
...but I feel like there's an apply statement that could do the job a lot more efficiently. In addition, while the above data frame only has 20 rows, the ones on which I'll be working will have several thousand. Thoughts?
Many thanks in advance SO.
aggregate can do this if you aggregate against an appropriate running index. You do end up with another column in the result (which can be removed).
aggregate(. ~ rep(seq(nrow(df)/5), each=5), data=df, FUN=mean)
## rep(seq(nrow(df)/5), each = 5) x y z
## 1 1 1.2 2.2 3.2
## 2 2 2.4 3.4 4.4
## 3 3 3.6 4.6 5.6
## 4 4 4.8 5.8 6.8
I really think data.table works great for situations like this. It is fast and easy.
require("data.table")
dt <- data.table(df)
dt[,row.num:=.I]
dt[,lapply(.SD,mean),by=list(interval=cut(row.num,seq(0,nrow(dt),by=5)))]
# interval x y z
# 1: (0,5] 1.2 2.2 3.2
# 2: (5,10] 2.4 3.4 4.4
# 3: (10,15] 3.6 4.6 5.6
# 4: (15,20] 4.8 5.8 6.8
This is a possible solution with a combination of apply and sapply:
apply(df, 2, function(x) sapply(seq(1,nrow(df),5), function(y) mean(x[y:(y+4)])))
# x y z
#[1,] 1.2 2.2 3.2
#[2,] 2.4 3.4 4.4
#[3,] 3.6 4.6 5.6
#[4,] 4.8 5.8 6.8
Edit after comment by @jbaums: depending on the desired behavior, you might want to add na.rm=TRUE to the mean calculation:
apply(df, 2, function(x) sapply(seq(1,nrow(df),5), function(y) mean(x[y:(y+4)], na.rm = TRUE)))
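The solutions above rely on the row count being a multiple of 5. As a hedged sketch of one way around that, a block index built with ceiling also copes with a shorter final block. The names block and means are illustrative only.

```r
df <- data.frame(x = rep(1:5, each = 4), y = rep(2:6, each = 4), z = rep(3:7, each = 4))

# Block number for each row: rows 1-5 -> 1, rows 6-10 -> 2, and so on;
# a final block with fewer than 5 rows still gets its own number
block <- ceiling(seq_len(nrow(df)) / 5)

# Column sums per block, divided by the number of rows in each block
means <- rowsum(as.matrix(df), block) / as.vector(table(block))
means
```

rowsum returns one row per block, and dividing by the block sizes recycles down the rows, so each block sum is divided by its own row count.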

R create new column of the rank of factors

In R I have a dataframe with two columns: one is a value and the other is the group that each value is assigned to:
my_group my_value
A 1.2
B 5.4
C 9.2
A 1.1
B 5.2
C 9.8
A 1.3
B 5.1
C 9.2
A 1.0
B 5.7
C 9.1
I want to create a third column that uses the average of my_value by group to rank the groups and enters that rank in each row:
my_group my_value my_group_rank
A 1.2 3
B 5.4 2
C 9.2 1
A 1.1 3
B 5.2 2
C 9.8 1
A 1.3 3
B 5.1 2
C 9.2 1
A 1.0 3
B 5.7 2
C 9.1 1
The following code adds the group ranks to your data, but in the opposite order (the group with the lowest average gets rank 1); perhaps you can still use it. I use the dplyr package for this, and I assume your data is in a data.frame called test.
require(dplyr)
test <- test %>%
  group_by(my_group) %>%
  mutate(avg = mean(my_value)) %>%
  ungroup() %>%
  mutate(my_group_rank = dense_rank(avg)) %>%
  select(-avg)
# my_group my_value my_group_rank
#1 A 1.2 1
#2 B 5.4 2
#3 C 9.2 3
#4 A 1.1 1
#5 B 5.2 2
#6 C 9.8 3
#7 A 1.3 1
#8 B 5.1 2
#9 C 9.2 3
#10 A 1.0 1
#11 B 5.7 2
#12 C 9.1 3
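To get the order the question asks for (highest average gets rank 1), the group averages can be ranked in descending order. Here is a minimal base R sketch, assuming the data frame is called test as above. The intermediate names avg and grp_rank are illustrative.

```r
test <- data.frame(
  my_group = rep(c("A", "B", "C"), times = 4),
  my_value = c(1.2, 5.4, 9.2, 1.1, 5.2, 9.8, 1.3, 5.1, 9.2, 1.0, 5.7, 9.1)
)

# Average value per group, as a named numeric vector
avg <- vapply(split(test$my_value, test$my_group), mean, numeric(1))

# Dense rank in descending order: the largest average gets rank 1
grp_rank <- setNames(match(avg, sort(unique(avg), decreasing = TRUE)), names(avg))

# Look up each row's group rank by group name
test$my_group_rank <- unname(grp_rank[as.character(test$my_group)])
test
```

In dplyr, the equivalent change to the pipeline above would be dense_rank(desc(avg)).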
