Eliminate duplicates in R [duplicate]

If I have a df like this
data <- data.frame(id = c(1, 1, 3, 4), n = c("x", "y", "e", "w"))
data
  id n
1  1 x
2  1 y
3  3 e
4  4 w
I want to get a new df like this:
data
  id n
3  3 e
4  4 w
That is, I want to remove every row whose id occurs more than once. I've tried functions like distinct from dplyr, but it always keeps one of the duplicated rows.

Another subset option with ave:
subset(
  data,
  ave(n, id, FUN = length) == 1
)
gives
  id n
3  3 e
4  4 w

We may use duplicated:
subset(data, !(duplicated(id) | duplicated(id, fromLast = TRUE)))
  id n
3  3 e
4  4 w
or use table:
subset(data, id %in% names(which(table(id) == 1)))
  id n
3  3 e
4  4 w

Just adding to the already useful answers with a dplyr solution.
library(dplyr)
data %>%
  filter(!(duplicated(id, fromLast = FALSE) | duplicated(id, fromLast = TRUE)))
distinct won't work for you: it keeps one row for each distinct id, and in your case id 1 is always part of the result.
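To illustrate that point, here is a quick sketch (my own, not part of the original answer) of what distinct does with this data:
library(dplyr)
# distinct() keeps the first occurrence of each id,
# so the duplicated id 1 still survives as a single row
data %>% distinct(id, .keep_all = TRUE)
#   id n
# 1  1 x
# 2  3 e
# 3  4 w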

Although more verbose, you can also use base R.
data[!(duplicated(data["id"]) | duplicated(data["id"], fromLast = TRUE)), ]
Output
  id n
3  3 e
4  4 w
Or use dplyr.
library(dplyr)
data %>%
  dplyr::group_by(id) %>%
  dplyr::filter(n() == 1) %>%
  dplyr::ungroup()
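For completeness, a data.table equivalent (my addition, not from the thread) keeps only the groups of size one:
library(data.table)
# groups with exactly one row return their .SD; larger groups return NULL and are dropped
as.data.table(data)[, if (.N == 1L) .SD, by = id]
#    id n
# 1:  3 e
# 2:  4 w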

Is there a way to count values by presence per row in R?

I want a way to count values in a data frame based on their presence by row
a = data.frame(c('a','b','c','d','f'),
               c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and the third row, for a total of two appearances. I've written this code to count one value at a time, checking whether it is present in each row, but I want it to do this automatically for every value present in the data frame:
# for counting the value "a" and attributing the count to the b data frame
b = data.frame(unique(unique(unlist(a))))
b$count = 0
for (i in 1:nrow(a)) {
  if (TRUE %in% apply(a[i, ], 2, function(x) x %in% 'a')) {
    b$count[1] = b$count[1] + 1
  }
}
b$count[1]
[1] 2
The problem is that I have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values separately for each column, unlisting to a single vector, and getting the frequency count with table. If needed, convert the table object to a two-column data.frame with stack:
stack(table(unlist(lapply(a, unique))))[2:1]
Output:
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it should be based on rows, use apply with MARGIN = 1:
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique values and count them with table:
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse:
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
Does this work:
library(dplyr)
library(tidyr)
a %>%
  pivot_longer(cols = everything()) %>%
  distinct() %>%
  count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
  let let2
1   a    a
2   b    b
3   c    a
4   d    b
5   f    d
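As one more variant (my addition, not from the original answers), the same count can be computed by melting to long form with data.table and counting the distinct rows in which each value appears:
library(data.table)
# melt to long form, then count how many distinct rows contain each value
long <- melt(as.data.table(a, keep.rownames = "row"),
             id.vars = "row", variable.name = "col")
long[, .(n = uniqueN(row)), by = value]
#    value n
# 1:     a 2
# 2:     b 2
# 3:     c 1
# 4:     d 2
# 5:     f 1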

Expand each row with specific value in tidyr [duplicate]

I have a dataset with grouped observations per row. I would like to expand each row from a single observation per replicate to a set number of observations (20 each, in this case).
Each replicate is a row, so "wellA" for "LS x SB" expands into 20 identical rows. As a bonus, I would also like to make a new column called "Replicate2" that lists 1 to 20 to index these 20 new rows per replicate.
The idea would then be to add the survival status per individual (reflected in the new columns "Status" and "Event").
I think the expand function in tidyr has potential, but I can't figure out how to add a fixed number of rows per replicate. Using the "Alive" column adds a variable number of observations.
expand <- DF %>% expand(nesting(Date, Time, Cumulative.hrs, Timepoint, Treatment, Boat, Parentage, Well, Mom, Dad, Cone, NumParents, Parents), Alive)
Any help appreciated!
In base R, we can use rep to repeat rows and transform to add new columns
n <- 20
transform(df[rep(seq_len(nrow(df)), each = n), ], Replicate2 = 1:n, row.names = NULL)
Using a reproducible example with n = 3:
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
n <- 3
transform(df[rep(seq_len(nrow(df)), each = n), ], Replicate2 = 1:n, row.names = NULL)
# a b c Replicate2
#1 1 4 7 1
#2 1 4 7 2
#3 1 4 7 3
#4 2 5 8 1
#5 2 5 8 2
#6 2 5 8 3
#7 3 6 9 1
#8 3 6 9 2
#9 3 6 9 3
Using dplyr, we can use slice to repeat rows and mutate to add the new column.
library(dplyr)
df %>%
  slice(rep(seq_len(n()), each = n)) %>%
  mutate(Replicate2 = rep(seq_len(n), length.out = n()))
Do a cross join between your existing data and the numbers 1:20.
tidyr::crossing(DF, replicate2 = 1:20)
If you want to add additional columns, use mutate:
... %>% mutate(status = 1, event = FALSE)
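Another option worth knowing (my addition, not from the original answers) is tidyr's uncount(), which repeats each row a fixed number of times and can create the index column in one step:
library(tidyr)
# repeat every row 20 times; .id adds a 1:20 counter per original row
DF %>% uncount(20, .id = "Replicate2")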

Remove Duplicates from Col X based on condition in Col Y

I have a data frame in R that has duplicates in one of the columns; however, I only want to remove a duplicate when a condition in another column is met.
For Example:
DF:
X J Y
1 2 3
2 3 1
1 3 2
I want to remove the rows where X is a duplicate and Y equals 3.
DF:
X J Y
2 3 1
1 3 2
I have tried reading up on dplyr, but have so far been unable to get the desired result.
We can create the condition with duplicated and the equality operator:
subset(df1, !((duplicated(X)|duplicated(X, fromLast = TRUE)) & Y == 3))
# X J Y
#2 2 3 1
#3 1 3 2
If we need to remove the whole group of rows for an 'X' whenever any value of 'Y' is 3, then
library(dplyr)
df1 %>%
  group_by(X) %>%
  filter(!3 %in% Y)
  # or filter(all(Y != 3))
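A base R equivalent of that grouped filter (my sketch, not part of the original answer) uses ave to flag every X group that contains a 3:
# drop all rows of any X group in which some Y equals 3
subset(df1, !ave(Y == 3, X, FUN = any))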

R - Multiple criteria sum across different length data frames

First post, long time user.
I'm trying to efficiently sum a column based on 2 criteria for every ID in another data frame of a different length. Below is an example:
  ID
1  A
2  B
3  C
ID Color Type Price
A  Green    1     5
A  Blue     2     6
B  Green    3     7
B  Blue     2     2
C  Green    2     4
C  Blue     4     5
For each ID, I'd like to sum the price if the color is Blue and the type is 2. The result would hopefully be as below:
  ID Price
1  A     6
2  B     2
3  C     0
This seems like an easy task, but I can't figure it out for some reason. Also, I'll need to perform this operation on 2 large data sets (>1,000,000 rows each). I've created a function and used it in a loop for prior problems like this, but that solution doesn't work because of the amount of data. I feel that a function from the apply family would probably be best, but I can't get one to work.
I changed your example data a bit so that it accounts for the fact that not all IDs are in the first data frame, and that somewhere there are two values to sum:
df1 <- data.frame(ID = c("A","B","C"))
df2 <- read.table(text = "
ID Color Type Price
A Green 1 5
A Blue 2 6
A Blue 2 4
B Green 3 7
B Blue 2 2
C Green 2 4
C Blue 4 5
D Green 2 2
D Blue 4 8
",header = T)
The two main packages for doing this fast on a big data.frame are dplyr and data.table. They are roughly equivalent (see data.table vs dplyr: can one do something well the other can't or does poorly?). Here are the two solutions:
library(data.table)
setDT(df2)[ID %in% unique(df1$ID), .(sum = sum(Price[ Type == 2 & Color == "Blue"])),by = ID]
ID sum
1: A 10
2: B 2
3: C 0
You could do
setDT(df2)[ID %in% unique(df1$ID) & Type == 2 & Color == "Blue", .(sum = sum(Price)),by = ID]
but you will discard C, as the entire condition for the row selection is never met for it:
ID sum
1: A 10
2: B 2
and with dplyr:
library(dplyr)
df2 %>%
  filter(ID %in% unique(df1$ID)) %>%
  group_by(ID) %>%
  summarize(sum = sum(Price[Type == 2 & Color == "Blue"]))
# A tibble: 3 x 2
ID sum
<fct> <int>
1 A 10
2 B 2
3 C 0
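If df1 can contain IDs that never occur in df2, a join-based variant (my addition, not in the original answer) guarantees one output row per df1 ID, with 0 for the missing ones:
library(dplyr)
df1 %>%
  left_join(
    df2 %>%
      filter(Color == "Blue", Type == 2) %>%
      group_by(ID) %>%
      summarize(sum = sum(Price)),
    by = "ID"
  ) %>%
  mutate(sum = coalesce(sum, 0L))  # IDs absent from df2 get NA from the join; replace with 0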
A sapply version. There may be more elegant ways to write it, but if you have big tables as you said, you can easily parallelize it.
Using the data as proposed by #denis:
df1 <- data.frame(ID = c("A","B","C"))
df2 <- read.table(text = "
ID Color Type Price
A Green 1 5
A Blue 2 6
A Blue 2 4
B Green 3 7
B Blue 2 2
C Green 2 4
C Blue 4 5
D Green 2 2
D Blue 4 8
",header = T)
Here is a simple function that does what you want with sapply:
getPrices <- function(tableid = df1, tablevalues = df2, color = "Blue", type = 2) {
  filteredtablevalues <- droplevels(tablevalues[tablevalues$Color == color &
                                                tablevalues$Type == type &
                                                tablevalues$ID %in% tableid$ID, ])
  # droplevels could be skipped by using unique(as.character(filteredtablevalues$ID))
  # in the sapply; not sure which would be quicker
  sapply(levels(filteredtablevalues$ID), function(id, tabval) {
    sum(tabval$Price[tabval$ID == id])
  }, tabval = filteredtablevalues)
}
As you can see, I added two parameters that let you select the color/type pair. And you can add this:
tmp <- getPrices(df1, df2)
finaltable <- cbind.data.frame(ID = names(tmp), Price = tmp)
If you absolutely need a data frame with a column ID and a column Price.
I will try some benchmarks when I have time, but written this way you should be able to easily parallelize it with library(parallel) and library(Rmpi), which can save your life if you have very, very big datasets.
EDIT :
Benchmark:
I was not able to reproduce the dplyr example proposed by @denis, but I could compare the data.table version:
# Create a bigger dataset
nt <- 10000  # nt as big as you want
df2 <- rbind.data.frame(df2,
                        list(ID    = sample(c("A", "B", "C"), nt, replace = T),
                             Color = sample(c("Blue", "Green"), nt, replace = T),
                             Type  = sample.int(5, nt, replace = T),
                             Price = sample.int(5, nt, replace = T)))
You can benchmark using library(microbenchmark):
library(microbenchmark)
microbenchmark(
  sply  = getPrices(df1, df2),
  dtbl  = setDT(df2)[ID %in% unique(df1$ID),
                     .(sum = sum(Price[Type == 2 & Color == "Blue"])), by = ID],
  dplyr = df2 %>%
    filter(ID %in% unique(df1$ID)) %>%
    group_by(ID) %>%
    summarize(sum = sum(Price[Type == 2 & Color == "Blue"]))
)
On my computer it gives:
Unit: milliseconds
expr min lq mean median uq max neval
sply 78.37484 83.89856 97.75373 89.17033 118.96890 131.3226 100
dtbl 75.67642 83.44380 93.16893 85.65810 91.98584 137.2851 100
dplyr 90.67084 97.58653 114.24094 102.60008 136.34742 150.6235 100
Edit 2:
In an earlier version of this benchmark, sapply appeared slightly quicker than the data.table approach, though not significantly. With the numbers above, the data.table approach seems to be the quickest. Still, the advantage of sapply is that you can easily parallelize it, although given how I wrote getPrices, that will only pay off if your ID table is huge.
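As a rough illustration of that parallelization (my sketch, not from the original answer; parallel::mclapply forks and therefore runs serially on Windows):
library(parallel)
# filter once, then sum each ID's prices on a separate core
filtered <- df2[df2$Color == "Blue" & df2$Type == 2 & df2$ID %in% df1$ID, ]
prices <- mclapply(split(filtered$Price, as.character(filtered$ID)), sum, mc.cores = 4)
unlist(prices)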

Unique Data Frame Based On Three Column Values

I have a data frame of 6449x743 in which a few rows are repeated twice with the same column_X and column_Y values, but with a higher column_Z value on the second occurrence. I want to keep only the row with the higher column_Z.
I tried the following, but it doesn't get rid of the duplicates and still gives me an output of 6449x743.
output <- unique(Data[,c('column_X', 'column_Y', max('column_Z'))])
Ideally, the output should be (6449 - N)x743: the number of rows will be smaller, but the number of columns will stay the same, since column_X and column_Y become unique after filtering the data on column_Z.
If anyone has suggestions, please let me know. Thanks.
You can use not-duplicated (!duplicated) with option fromLast = TRUE on specific columns, like this:
df <- data.frame(a = c(1, 1, 2, 3, 4), b = c(2, 2, 3, 4, 5), c = 1:5)
df <- df[order(df$c), ]  # make sure the data is sorted
a b c
1 1 2 1
2 1 2 2
3 2 3 3
4 3 4 4
5 4 5 5
df[!duplicated(df$a, fromLast = TRUE) & !duplicated(df$b, fromLast = TRUE), ]
a b c
2 1 2 2
3 2 3 3
4 3 4 4
5 4 5 5
Try
library(dplyr)
Data %>%
  group_by(column_X, column_Y) %>%
  filter(column_Z == max(column_Z))
It works with the sample data
set.seed(13)
df <- data_frame(a = sample(1:4, 50, rep = T),
                 b = sample(1:3, 50, rep = T),
                 x = runif(50), y = rnorm(50))
df %>% group_by(a, b) %>% filter(x == max(x))
Probably the easiest way would be to order the whole thing by column_Z and then remove the duplicates:
output <- Data[order(Data$column_Z, decreasing=TRUE),]
output <- output[!duplicated(paste(output$column_X, output$column_Y)),]
assuming I understood you correctly.
Here's an older answer which may be trying to accomplish the same thing that you are:
How to make a unique in R by column A and keep the row with maximum value in column B
Editing with relevant code:
A solution using package data.table:
set.seed(42)
dat <- data.frame(A=c('a','a','a','b','b'),B=c(1,2,3,5,200),C=rnorm(5))
library(data.table)
dat <- as.data.table(dat)
dat[,.SD[which.max(B)],by=A]
A B C
1: a 3 0.3631284
2: b 200 0.4042683
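A modern dplyr equivalent of that data.table idiom (my addition; slice_max needs dplyr >= 1.0.0):
library(dplyr)
# keep the row with the largest B within each A
dat %>%
  group_by(A) %>%
  slice_max(B, n = 1) %>%
  ungroup()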
