How to delete all the duplicate rows based on two columns? - r

I have a data frame from which I want to delete duplicate rows, but only when another column has the same value for all of those rows. (To be clearer: I want to delete the duplicated "Name" rows that share the same "Number" value across all their rows.)
Here is an example of my data frame:
df <- data.frame("Name" = c("a", "a", "b", "b", "b", "c", "c", "c"),
"Number" = c(1, 1, 1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
And the result I expect is:
result <- data.frame("Name" = c("b", "b", "b", "c", "c", "c"),
"Number" = c(1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)

We can group_by Name and remove the groups which have more than one row but only one distinct Number.
library(dplyr)
df %>%
group_by(Name) %>%
filter(!(n_distinct(Number) == 1 & n() > 1))
# Name Number
# <chr> <dbl>
#1 b 1
#2 b 2
#3 b 3
#4 c 4
#5 c 5
#6 c 5
and using base R ave, the same logic can be written as
df[with(df, !as.logical(ave(Number, Name, FUN = function(x)
length(unique(x)) == 1 & length(x) > 1))), ]

Here is a solution with data.table
library("data.table")
df <- data.table("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3))
df[, if (uniqueN(Number)!=1 || .N==1) .SD, Name]
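With this smaller df, group a (one repeated Number) is dropped and only group b remains:
#   Name Number
#1:    b      2
#2:    b      2
#3:    b      3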
and here is a solution with base R:
df <- data.frame("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3), stringsAsFactors = FALSE)
df[as.logical(ave(df$Number, df$Name, FUN=function(x) length(unique(x))!=1 || length(x)==1)),]

We can use data.table methods
library(data.table)
setDT(df)[, .SD[uniqueN(Number) > 1], Name]
# Name Number
#1: b 1
#2: b 2
#3: b 3
#4: c 4
#5: c 5
#6: c 5
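Note that .SD[uniqueN(Number) > 1] also drops single-row groups; there are none in this df, but if they should be kept (matching the logic above), include the .N condition, e.g.:
setDT(df)[, .SD[uniqueN(Number) > 1 | .N == 1], Name]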

How to pivot_wider only a single condition using a single command in R

Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to the reference condition (0) in a faceted plot.
I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This also pivots conditions 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?
This can also be a strategy. (It will work even if there are unequal numbers of conditions per group.)
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0)
Revised df used for demonstration:
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of the above syntax:
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You may also fill the NA cond if desired.
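For instance, a sketch with tidyr (assuming the unmatched reference row, drug D here, should show cond = 0):
library(tidyr)
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0) %>%
replace_na(list(cond = 0)) # D's NA cond becomes 0; its result stays NA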
You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
df_wider0 <- df %>%
mutate(result0 = result[match(drug, drug)]) %>% # match(drug, drug) gives each drug's first row, which here is its cond == 0 row
filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data:
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")

Find row of value zero and add a row count before and after it

Based on the data below:
library(tidyverse)
limit <- c(7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5)
group <- c("a", "a", "a", "a", "a", "a", "a", "a", "a","b", "b", "b", "b", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c")
df <- data.frame(limit, group)
df
I'd like to create a new column (NewCol) as follows:
If there is a row where limit equals the within-group row number (Id in my attempt below), NewCol should be 0 there. All the rows before that 0 should then count back in reverse order to the first row of the group, and all the rows after it should count up until the end of the group.
So, for example, for group "a" it should look like
-6, -5, -4, -3, -2, -1, 0, 1, 2, where -6 is the first row and 2 is the 9th row for that group.
This is what I've tried, but I'm still not getting what I need:
df %>% group_by(group) %>% mutate(Id = seq(1:length(limit))) %>%
mutate(NewCol = ifelse(limit == Id, 0, NA)) %>%
mutate(nn=ifelse(is.na(NewCol),
zoo::na.locf(NewCol) + cumsum(is.na(NewCol))*1,
NewCol))
Thank you
It is just the difference between row_number() and 'limit' after grouping by 'group': for group "a", limit is 7, so rows 1 through 9 give 1 - 7 = -6 up through 9 - 7 = 2.
library(dplyr)
df %>%
group_by(group) %>%
mutate(NewCol = row_number() - limit)
Or using data.table
library(data.table)
setDT(df)[, NewCol := seq_len(.N) - limit]
Or with base R
df$NewCol <- with(df, ave(seq_along(limit), group, FUN = seq_along) - limit)
In base R, we can use ave:
df$NewCol <- with(df, ave(limit, group, FUN = seq_along) - limit)
# limit group NewCol
#1 7 a -6
#2 7 a -5
#3 7 a -4
#4 7 a -3
#5 7 a -2
#6 7 a -1
#7 7 a 0
#8 7 a 1
#9 7 a 2
#10 4 b -3
#11 4 b -2
#12 4 b -1
#13 4 b 0
#...
Or using data.table:
library(data.table)
setDT(df)[, NewCol := seq_along(limit) - limit, group]
#Or
#setDT(df)[, NewCol := seq_len(.N) - limit, group]

Conditional search, match and replace values between data frames

I have two data frames, shown below. I would like to replace text (cells) in data frame 1 with corresponding values taken from data frame 2 when there is a match. I have tried to give a simple example below.
I have some limited experience with R but can't think of an easy solution right away. Any help/suggestions will be much appreciated.
input_1 = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c("A", "B", "C", "D"),
col3 = c("B", "E", "F", "D"))
input_2 = data.frame(colx = c("A", "B", "C", "D", "E", "F"),
coly = c(1, 2, 3, 4, 5, 6))
output = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c(1, 2, 3, 4),
col3 = c(2, 5, 6, 4))
Here's a tidyverse solution:
library(tidyverse)
mutate_at(input_1, -1, ~deframe(input_2)[as.character(.)])
# col1 col2 col3
# 1 ex1 1 2
# 2 ex2 2 5
# 3 ex3 3 6
# 4 ex4 4 4
deframe builds a named vector from a data frame, which is more convenient in this case. as.character is necessary because you have factor columns.
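For illustration, deframe(input_2) yields a named vector (names from colx, values from coly):
library(tibble)
deframe(input_2)
# A B C D E F
# 1 2 3 4 5 6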
Example using the tidyverse. My solution involves joining input_1 to input_2 twice, matching on a different column each time. The last pipe step cleans the data frame and renames the columns.
library(tidyverse)
input_1 = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c("A", "B", "C", "D"),
col3 = c("B", "E", "F", "D"))
input_2 = data.frame(colx = c("A", "B", "C", "D", "E", "F"),
coly = c(1, 2, 3, 4, 5, 6))
output = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c(1, 2, 3, 4),
col3 = c(2, 5, 6, 4))
input_1 %>% inner_join(input_2, by = c("col2" = "colx")) %>%
inner_join(input_2, by = c("col3" = "colx")) %>%
select(col1, coly.x, coly.y) %>%
magrittr::set_colnames(c("col1", "col2", "col3"))
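which reproduces the desired output:
# col1 col2 col3
#1 ex1 1 2
#2 ex2 2 5
#3 ex3 3 6
#4 ex4 4 4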
One approach using base R would be to loop over columns where we want to change values using lapply, match the values with input_2$colx and get the corresponding coly value.
input_1[-1] <- lapply(input_1[-1], function(x) input_2$coly[match(x, input_2$colx)])
input_1
# col1 col2 col3
#1 ex1 1 2
#2 ex2 2 5
#3 ex3 3 6
#4 ex4 4 4
Actually, you can get away without lapply: directly unlist the values and match
input_1[-1] <- input_2$coly[match(unlist(input_1[-1]), input_2$colx)]

Find rows in a data frame where certain columns are duplicated, then combine the elements in other columns [duplicate]

I have a data frame in which I want to find the rows where both columns A and B are duplicated, and then combine those rows by joining the elements of column C together.
My example:
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
My expected result:
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
Thanks a lot
Without packages:
DF <- aggregate(C ~ A + B, FUN = function(x) paste(x, collapse = "; "), data = DF)
Output:
A B C
1 1 a M
2 2 a X
3 1 b N
4 3 c M; N
Or with data.table:
library(data.table)
setDT(DF)[, .(C = paste(C, collapse = "; ")), by = .(A, B)]
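Starting from the original DF, this returns:
#    A B    C
#1:  1 a    M
#2:  1 b    N
#3:  2 a    X
#4:  3 c M; N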
This is a tidyverse-based solution where you can use paste with collapse after grouping.
library(dplyr)
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
DF %>%
group_by(A,B) %>%
summarise(C = paste(C, collapse = ";"))
#> # A tibble: 4 x 3
#> # Groups: A [3]
#> A B C
#> <dbl> <fct> <chr>
#> 1 1 a M
#> 2 1 b N
#> 3 2 a X
#> 4 3 c M;N
Created on 2019-03-19 by the reprex package (v0.2.1)

How to remove pair-wise duplicate rows [duplicate]

Consider the following dataframe:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
How can I remove the rows with duplicate unordered pairs (V1, V2) so that the output looks like:
# V1 V2 n
#1 A B 1
#2 A C 3
#3 B C 2
I tried unique() and duplicated() to no avail.
Not sure if this is the simplest way of doing it (transposing can be computationally expensive), but it works with your data frame:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
First, sort the data frame row-wise, so your value-pairs become true duplicates.
df <- data.frame(t(apply(df, 1, sort)))
Then you can just apply the unique function.
df <- unique(df)
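At this point the data frame looks like this; row-wise sorting puts the numeric value first, since digits sort before letters:
#  X1 X2 X3
#1  1  A  B
#2  3  A  C
#4  2  B  C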
If your column names and order are important, you'll have to re-establish those.
names(df) <- c("n", "V1", "V2")
df <- df[, c("V1", "V2", "n")]
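One caveat: apply() coerces the mixed columns to a character matrix, so n comes back as character (or factor in older R); convert it back if you need numbers:
df$n <- as.numeric(as.character(df$n)) # undo the character coercion from apply()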
Another option would be to reshape the dataset ('df') to wide format with xtabs(n ~ V1 + V2), set the lower triangle of the matrix to 0, and remove the rows with "Freq" equal to 0.
m1 <- xtabs(n~V1+V2, df)
m1[lower.tri(m1)] <- 0
subset(as.data.frame(m1), Freq!=0)
# V1 V2 Freq
#4 A B 1
#7 A C 3
#8 B C 2
