Removing all duplicates with R [duplicate] - r

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Remove all duplicate rows including the "reference" row [duplicate]
(3 answers)
Closed 7 years ago.
For example, I have two columns:
Var1 Var2
1 12
1 65
2 68
2 98
3 49
3 24
4 8
5 67
6 12
And I need to keep only the rows whose Var1 value occurs exactly once:
Var1 Var2
4 8
5 67
6 12
I can do it like this:
mydata=mydata[!unique(mydata$Var1),]
But when I use the same formula for my large data set with about 1 million observations, nothing happens - the sample size is still the same. Could you please explain to me why?
Thank you!

With data.table (as the question seems to be tagged with it) I would do
indx <- setDT(DT)[, .I[.N == 1], by = Var1]$V1
DT[indx]
# Var1 Var2
# 1: 4 8
# 2: 5 67
# 3: 6 12
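For reference, .N is the per-group row count and .I holds the row numbers in the original table, so .I[.N == 1] returns a group's row number only when that group has a single member. A quick look at the intermediate result, assuming DT holds the sample data:
setDT(DT)[, .I[.N == 1], by = Var1]
#    Var1 V1
# 1:    4  7
# 2:    5  8
# 3:    6  9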
Or... as @eddi reminded me, you can simply do
DT[, if(.N == 1) .SD, by = Var1]
Or (per the mentioned duplicates) with v >= 1.9.5 you could also do something like
setDT(DT, key = "Var1")[!(duplicated(DT) | duplicated(DT, fromLast = TRUE))]
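To see why the duplicated() trick works: the two calls together flag every row belonging to a Var1 group with more than one member, so negating their union keeps only the singletons. A minimal sketch, assuming DT holds the sample data (by = "Var1" is spelled out here; with the key set, older versions used the key columns by default):
d1 <- duplicated(DT, by = "Var1")
d2 <- duplicated(DT, by = "Var1", fromLast = TRUE)
d1 | d2
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE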

You can use this:
df <- data.frame(Var1=c(1,1,2,2,3,3,4,5,6), Var2=c(12,65,68,98,49,24,8,67,12));
df[ave(1:nrow(df),df$Var1,FUN=length)==1,];
## Var1 Var2
## 7 4 8
## 8 5 67
## 9 6 12
This will work even if the Var1 column is not ordered, because ave() does the necessary work to collect groups of equal elements (even if they are non-consecutive in the grouping vector) and map the result of the function call (length() in this case) back to each element that was a member of the group.
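To see the grouping step in isolation, here is what the ave() call returns for the sample data - the size of each row's group:
ave(1:nrow(df), df$Var1, FUN=length);
## [1] 2 2 2 2 2 2 1 1 1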
Regarding your code, it doesn't work because this is what unique() and its negation returns:
unique(df$Var1);
## [1] 1 2 3 4 5 6
!unique(df$Var1);
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
As you can see, unique() returns the actual unique values from the argument vector; negating it yields TRUE for zero and FALSE for everything else.
Thus, you end up row-indexing using a short logical vector (it will be short if there were any duplicates removed by unique()) consisting of TRUE where there were zeroes, and FALSE otherwise.
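With the sample df above, Var1 contains no zeroes, so the recycled logical vector is all FALSE and the subset comes back empty rather than unchanged (on other data the result depends on which values happen to be zero):
df[!unique(df$Var1),];
## [1] Var1 Var2
## <0 rows> (or 0-length row.names)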

Find the place where the variable in a dataframe changes its value [duplicate]

This question already has answers here:
Finding the index of first changes in the elements of a vector
(5 answers)
Closed 4 years ago.
I have a lot of data frames in R which look like this:
A B
1 0
2 0
3 0
4 1
5 1
6 1
So between A = 3 and A = 4, B changes value from 0 to 1. What is the most idiomatic R way of returning the value of A where B changes value?
In the data, B changes value only once, and A is sorted (from 1 to n).
Here is a possible way. Use diff to find where column b changes, but be careful: the first value of b, by definition of change, hasn't changed, and diff returns a vector with one element fewer, so we pad with FALSE at the front.
inx <- c(FALSE, diff(data$b) != 0)
data[inx, ]
# a b
#4 4 1
After seeing the OP's comment on another post: the following code shows that this method also works when b starts with any value, not just zero.
data2 <- data.frame(a=c(1,2,3,4,5,6),b=c(1,1,1,0,0,0))
inx <- c(FALSE, diff(data2$b) != 0)
data2[inx, ]
# a b
#4 4 0
As OP mentioned,
In the data B changes the value only once
We can use cumsum with duplicated and which.max
which.max(cumsum(!duplicated(df$B)))
#[1] 4
If the value changes multiple times, this will give the index of the last change instead.
If we need to subset the row, then we can do
df[which.max(cumsum(!duplicated(df$B))), ]
# A B
#4 4 1
To break it down further, for better understanding:
!duplicated(df$B)
#[1] TRUE FALSE FALSE TRUE FALSE FALSE
cumsum(!duplicated(df$B))
#[1] 1 1 1 2 2 2
which.max(cumsum(!duplicated(df$B)))
#[1] 4
In order to identify a change in a sequence, one may use diff, like in the following code:
my_df <- data.frame(A = 1:6, B = c(0,0,0,1,1,1))
which(diff(my_df$B)==1)+1
[1] 4
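Note that == 1 only catches an increase of exactly one; if B could change by any amount, or in either direction, a slightly more general sketch is:
which(diff(my_df$B) != 0) + 1
[1] 4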

Return subset of rows based on aggregate of keyed rows [duplicate]

This question already has an answer here:
Subset rows corresponding to max value by group using data.table
(1 answer)
Closed 6 years ago.
I would like to subset a data.table in R within each group, based on an aggregate function computed over that group's rows. For example, for each key, return all rows whose value exceeds the mean of a field calculated only over the rows in that group. Example:
library(data.table)
dt = data.table(Group = rep(1:5, each = 5), Detail = 1:25)
setkey(dt, 'Group')
library(foreach)
library(dplyr)
ret = foreach(grp = dt[, unique(Group)], .combine = bind_rows, .multicombine = TRUE) %do%
  dt[Group == grp & Detail > dt[Group == grp, mean(Detail)], ]
# Group Detail
# 1: 1 4
# 2: 1 5
# 3: 2 9
# 4: 2 10
# 5: 3 14
# 6: 3 15
# 7: 4 19
# 8: 4 20
# 9: 5 24
#10: 5 25
The question is, is it possible to succinctly code the last two lines using data.table features? Sorry if this is a repeat; I am also struggling to phrase the exact goal so that Google/StackOverflow can find it.
Using the .SD special symbol works. I was not aware of it, thanks:
dt[, .SD[Detail > mean(Detail)], by = Group]
Also works, with some performance gains:
indx <- dt[, .I[Detail > mean(Detail)], by = Group]$V1 ; dt[indx]

Removing duplicates based on two columns in R [duplicate]

This question already has answers here:
Remove duplicate column pairs, sort rows based on 2 columns [duplicate]
(3 answers)
Closed 7 years ago.
Suppose my data is as follows,
X Y
26 14
26 14
26 15
26 15
27 15
27 15
28 16
28 16
I want to remove the rows of duplicates. I am able to remove the duplicate rows based on one column by this command,
dat[c(T, diff(dat$X) != 0), ] or dat[c(T, diff(dat$Y) != 0), ]
But I want to remove a row only when both columns match the previous row's values. I can't use unique here because the same pair can occur again later; I only want to compare each row with the one immediately before it.
My sample output is,
x y
26 14
26 15
27 15
28 16
How can we do this in R?
Thanks
Ijaz
Using data.table v1.9.5+:
require(data.table) # v1.9.5+
df[!duplicated(rleidv(df, cols = c("X", "Y"))), ]
rleidv() is best understood with examples:
rleidv(c(1,1,1,2,2,3,1,1))
# [1] 1 1 1 2 2 3 4 4
A unique index is generated for each consecutive run of values.
And the same can be accomplished on a list, data.frame, or data.table, restricted to a specific set of columns as well. For example:
df = data.frame(a = c(1,1,2,2,1), b = c(2,3,4,4,2))
rleidv(df) # computes on both columns 'a,b'
# [1] 1 2 3 3 4
rleidv(df, cols = "a") # only looks at 'a'
# [1] 1 1 2 2 3
The rest should be fairly obvious. We just check for duplicated() values, and return the non-duplicated ones.
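Applied to the data in the question (a quick check, with dat holding those rows):
dat <- data.frame(X = c(26, 26, 26, 26, 27, 27, 28, 28),
                  Y = c(14, 14, 15, 15, 15, 15, 16, 16))
dat[!duplicated(rleidv(dat, cols = c("X", "Y"))), ]
#    X  Y
# 1 26 14
# 3 26 15
# 5 27 15
# 7 28 16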
Using dplyr:
library(dplyr)
z %>% filter(X != lag(X) | Y != lag(Y) | row_number() == 1)
We need to include row_number() == 1 or we lose the first row: lag() returns NA there, and filter() drops rows whose condition evaluates to NA.
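A quick illustration of that NA behaviour:
lag(c(26, 26, 27))
# [1] NA 26 26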

Removing same values in columns in some rows of a file in R

I have a file like this.
1 3
1 2
1 10
1 5
**5 5**
6 7
8 9
4 6
1 2
**10 10**
......
The file contains thousands of rows. I wanted to know: how can I remove the rows which contain the same value in both columns in R (the row containing 5 5 and the row containing 10 10)? I know how to remove duplicate columns or duplicate rows, but how do I go about selectively removing them? Thanks. :)
I would do this with indexing; an example with a small data frame:
myDf <- data.frame(a=c(3,5,8,6,9,4,3), b=c(3,3,5,8,9,6,4))
myDf <- myDf[myDf$a != myDf$b,]
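For reference, the rows kept by that subset (row 1 with 3 3 and row 5 with 9 9 are dropped):
myDf
#   a b
# 2 5 3
# 3 8 5
# 4 6 8
# 6 4 6
# 7 3 4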
I would consider writing a helper function like this:
indicator <- function(indf) {
  rowSums(vapply(indf, function(x) x == indf[, 1],
                 logical(nrow(indf)))) == ncol(indf)
}
Basically, the function compares each column in the data.frame with the first column, then checks which row sums equal the number of columns in the data.frame.
This basically creates a logical vector that can be used to subset your data.frame.
Example:
mydf <- data.frame(a=c(3,5,8,6,9,4,3),
b=c(3,3,5,8,9,6,4),
c=c(3,4,5,6,9,7,2))
indicator(mydf)
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE
mydf[!indicator(mydf), ]
# a b c
# 2 5 3 4
# 3 8 5 5
# 4 6 8 6
# 6 4 6 7
# 7 3 4 2

Return df with a column's values that occur more than once [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I have a data frame df, and I am trying to subset all rows whose value in column B occurs more than once in the dataset.
I tried using table to do it, but am having trouble subsetting from the table:
t<-table(df$B)
Then I try subsetting it using:
subset(df, table(df$B)>1)
And I get the error
"Error in x[subset & !is.na(subset)] :
object of type 'closure' is not subsettable"
How can I subset my data frame using table counts?
Here is a dplyr solution (using mrFlick's data.frame)
library(dplyr)
newd <- dd %>% group_by(b) %>% filter(n() > 1)
newd
# a b
# 1 1 1
# 2 2 1
# 3 5 4
# 4 6 4
# 5 7 4
# 6 9 6
# 7 10 6
Or, using data.table
setDT(dd)[, if (.N > 1) .SD, by = b]
Or using base R
dd[dd$b %in% unique(dd$b[duplicated(dd$b)]),]
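Broken down, using the dd defined in the ave() answer below: duplicated() flags the second and later occurrences, so the inner expression collects every b value seen more than once:
duplicated(dd$b)
#  [1] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
unique(dd$b[duplicated(dd$b)])
# [1] 1 4 6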
May I suggest an alternative, faster way to do this with data.table?
require(data.table) ## 1.9.2
setDT(df)[, .N, by=B][N > 1L]$B
Or you can couple .I (another special symbol - see ?data.table), which gives the corresponding row numbers in df, with .N as follows:
setDT(df)[df[, .I[.N > 1L], by=B]$V1]
Or have a look at @mnel's answer for another variation (using yet another special symbol, .SD).
Using table() isn't the best because then you have to rejoin it to the original rows of the data.frame. The ave function makes it easier to calculate row-level values for different groups. For example
dd<-data.frame(
a=1:10,
b=c(1,1,2,3,4,4,4,5,6, 6)
)
dd[with(dd, ave(b,b,FUN=length))>1, ]
#subset(dd, ave(b,b,FUN=length)>1) #same thing
a b
1 1 1
2 2 1
5 5 4
6 6 4
7 7 4
9 9 6
10 10 6
Here, for each level of b, it computes the length of b within that group, which is really just the group's row count, and maps that count back to each row of the group. Then we use that to subset.
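To make that concrete, the intermediate vector of group sizes looks like this:
with(dd, ave(b, b, FUN=length))
# [1] 2 2 1 1 3 3 3 1 2 2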
