Subsetting a dataframe conditionally by comparing it with a different nrow dataframe - r

I am trying to extract the rows of a dataframe that share some data with the rows of another dataframe with a different number of rows:
df1:
A B C D
a t 4 9
s p 3 7
w d 1 10
df2:
A B C D
a t 3 7
m r 5 8
p m 1 3
g u 5 2
s p 2 6
I am trying to get the rows of df1 that satisfy these conditions:
1. The A and B variables must be equal in both dataframes
2. df1$C must belong to the interval (df2$C - 5, df2$C + 5), so the absolute value of the difference between both values must be less than 5.
new_df<-df1[df1$A == df2$A && df1$B == df2$B && (df1$C > (df2$C - 5) && df1$C < (df2$C + 5)), ]
But I am getting this error, because the two dataframes have different numbers of rows:
longer object length is not a multiple of shorter object length
I have also tried to use which but I am getting the same error. How can I solve this?
My expected output would be:
new_df
A B C D
a t 4 9
s p 3 7
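For context, here is a small base R illustration of where that message comes from. `==` compares vectors position by position and recycles the shorter one, so comparing a 3-row column with a 5-row column triggers the recycling warning instead of testing whether a value occurs in the other table. The vector names and the pasted keys are illustrative only, and `%in%` on pasted keys is just one possible way to test whether an (A, B) pair occurs in df2; it assumes the values contain no spaces that could make two different pairs collide.

```r
# Columns A and B of df1 and df2 from the question, as plain vectors
a1 <- c("a", "s", "w")
b1 <- c("t", "p", "d")
a2 <- c("a", "m", "p", "g", "s")
b2 <- c("t", "r", "m", "u", "p")

# Elementwise comparison of length 3 vs length 5: capture the warning text
msg <- tryCatch(a1 == a2, warning = function(w) conditionMessage(w))
print(msg)

# Membership test on combined (A, B) keys instead of positional comparison
key1 <- paste(a1, b1)
key2 <- paste(a2, b2)
pair_in_df2 <- key1 %in% key2
print(pair_in_df2)
```

The membership test only covers condition 1; the interval condition on C still needs the matching df2 row, which is why the answers join the two tables first.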

This is one possible way (I deliberately used more intermediate variables here; it can be shortened). My logic was that the A and B matching can be done by joining the data frames (step 1, resulting in the data frame s1), and the numeric conditions can then be applied as a filter (step 2, resulting in the data frame s2):
df1 <- tibble::tribble(
~A, ~B, ~C, ~D,
"a", "t", 4, 9,
"s", "p" , 3, 7,
"w", "d", 1, 10
)
df2 <- tibble::tribble(
~A, ~B, ~C, ~D,
"a", "t", 3 , 7,
"m", "r", 5, 8,
"p", "m", 1 , 3,
"g", "u", 5, 2,
"s", "p", 2 , 6)
library(dplyr)
s1 <- inner_join(df1, df2, by = (c("A", "B")), suffix = c(".from1", ".from2"))
s2 <- s1 %>%
mutate(condition1 = C.from1 > C.from2 - 5,
condition2 = C.from1 < C.from2 + 5) %>%
filter(condition1, condition2) %>%
select(-starts_with("condition"))

Here is a base R solution:
Merging the two data frames by A and B guarantees that these variables match; the result is assigned to a new data frame.
On this new data frame, apply the remaining two conditions and drop the last two columns, which came from df2 in the merge.
df1 <- tibble::tribble(
~A, ~B, ~C, ~D,
"a", "t", 4, 9,
"s", "p" , 3, 7,
"w", "d", 1, 10
)
df2 <- tibble::tribble(
~A, ~B, ~C, ~D,
"a", "t", 3 , 7,
"m", "r", 5, 8,
"p", "m", 1 , 3,
"g", "u", 5, 2,
"s", "p", 2 , 6)
merge(df1, df2, by = c('A', 'B')) -> df3
df3[(df3$C.x > df3$C.y - 5) & (df3$C.x < df3$C.y + 5), ][, -c(5, 6)]
#> A B C.x D.x
#> 1 a t 4 9
#> 2 s p 3 7
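As a compact variant of the merge approach, the two bounds can be folded into the absolute-difference form the question itself mentions. This is a sketch under a few assumptions: plain data frames instead of tibbles, the default `.x`/`.y` suffixes of `merge()`, and the name `keep` chosen here for illustration. The single comparison is vectorised over all merged rows, which avoids `&&` entirely (`&&` expects single TRUE/FALSE values, and recent R versions raise an error for longer inputs).

```r
df1 <- data.frame(A = c("a", "s", "w"), B = c("t", "p", "d"),
                  C = c(4, 3, 1), D = c(9, 7, 10))
df2 <- data.frame(A = c("a", "m", "p", "g", "s"),
                  B = c("t", "r", "m", "u", "p"),
                  C = c(3, 5, 1, 5, 2), D = c(7, 8, 3, 2, 6))

# Inner join on A and B; clashing columns become C.x/D.x (df1) and C.y/D.y (df2)
df3 <- merge(df1, df2, by = c("A", "B"))

# One vectorised test covers both bounds of the open interval
keep <- abs(df3$C.x - df3$C.y) < 5

# Keep only df1's columns and restore their original names
new_df <- df3[keep, c("A", "B", "C.x", "D.x")]
names(new_df) <- names(df1)
print(new_df)
```

If df2 contained several rows with the same (A, B) pair, a df1 row could appear more than once in the result; wrapping the result in `unique()` would be one way to handle that.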

Related

Copy rows based on the condition in R

I have a data set as I've shown below:
data <- tribble(
~book_name, ~clicks, ~type,
"A", 10, "X",
"B", 20, "Y",
"C", 30, "Y",
"A", 10, "Z",
"A", 10, "X",
)
Now, I want to copy and paste the rows if the type is "X". So, my desired data set is something like this:
desired_data <- tribble(
~book_name, ~clicks, ~type,
"A", 10, "X",
"B", 20, "Y",
"C", 30, "Y",
"A", 10, "Z",
"A", 10, "X",
"A", 10, "X",
"A", 10, "X",
)
How to do this?
Filter and bind rows
data_x <- data %>% filter(type == 'X')
desired_data <- bind_rows(data,data_x)
A base R solution. The idea is to prepare the row indices for the desired output. 1:nrow(data) selects all rows. which(data$type == "X") selects the rows you would like to duplicate. By combining these two parts, we get the desired output.
data[c(1:nrow(data), which(data$type == "X")), ]
# # A tibble: 7 x 3
# book_name clicks type
# <chr> <dbl> <chr>
# 1 A 10 X
# 2 B 20 Y
# 3 C 30 Y
# 4 A 10 Z
# 5 A 10 X
# 6 A 10 X
# 7 A 10 X
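A variation on the same index idea, in case the copies should sit next to their originals rather than at the end: repeat each row index a per-row number of times with `rep()`. This is only a sketch; it uses a plain data frame instead of a tibble, and `times`, `idx` and `dup_in_place` are names made up for this example.

```r
data <- data.frame(book_name = c("A", "B", "C", "A", "A"),
                   clicks = c(10, 20, 30, 10, 10),
                   type = c("X", "Y", "Y", "Z", "X"),
                   stringsAsFactors = FALSE)

# Rows of type "X" are emitted twice, all other rows once
times <- ifelse(data$type == "X", 2, 1)
idx <- rep(seq_len(nrow(data)), times = times)

dup_in_place <- data[idx, ]
rownames(dup_in_place) <- NULL  # drop the 1, 1.1, 2, ... row names
print(dup_in_place)
```

The row order differs from the desired_data shown in the question (copies are adjacent instead of appended), so this only fits when order is either irrelevant or should stay grouped.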

How to delete all the duplicates row based on two columns?

I have a data frame from which I want to delete duplicate rows, but only if the value in another column is the same for all the rows. (To be clearer: I want to delete the duplicate rows for a Name when every row for that Name has the same "Number" value.)
Here is an example of my data frame:
df <- data.frame("Name" = c("a", "a", "b", "b", "b", "c", "c", "c"),
"Number" = c(1, 1, 1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
And the result I expect is :
result <- data.frame("Name" = c("b", "b", "b", "c", "c", "c"),
"Number" = c(1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
We can group_by Name and remove groups which have more than 1 row and have only one distinct value.
library(dplyr)
df %>%
group_by(Name) %>%
filter(!(n_distinct(Number) == 1 & n() > 1))
# Name Number
# <chr> <dbl>
#1 b 1
#2 b 2
#3 b 3
#4 c 4
#5 c 5
#6 c 5
and using base R ave, the same logic can be written as
df[with(df, !as.logical(ave(Number, Name, FUN = function(x)
length(unique(x)) == 1 & length(x) > 1))), ]
Here is a solution with data.table
library("data.table")
df <- data.table("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3))
df[, if (uniqueN(Number)!=1 || .N==1) .SD, Name]
and here is a solution with base R:
df <- data.frame("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3), stringsAsFactors = FALSE)
df[as.logical(ave(df$Number, df$Name, FUN=function(x) length(unique(x))!=1 || length(x)==1)),]
We can use data.table methods
library(data.table)
setDT(df)[, .SD[uniqueN(Number) > 1] , Name]
# Name Number
#1: b 1
#2: b 2
#3: b 3
#4: c 4
#5: c 5
#6: c 5
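To make the shared rule behind these answers explicit (drop a Name when it has more than one row and all its Number values are identical), here is a step-by-step base R sketch on the question's data. The helper names `n_rows`, `n_values` and `drop` are illustrative, not taken from any of the answers.

```r
df <- data.frame(Name = c("a", "a", "b", "b", "b", "c", "c", "c"),
                 Number = c(1, 1, 1, 2, 3, 4, 5, 5),
                 stringsAsFactors = FALSE)

# Per-row group size and per-row count of distinct Number values in the group
n_rows <- ave(df$Number, df$Name, FUN = length)
n_values <- ave(df$Number, df$Name, FUN = function(x) length(unique(x)))

# A group is dropped only if it is repeated AND constant
drop <- n_rows > 1 & n_values == 1

result <- df[!drop, ]
rownames(result) <- NULL
print(result)
```

Note that a Name with a single row is kept under this rule, which matches the `|| .N==1` condition in the first data.table answer but not the `uniqueN(Number) > 1` filter in the last one.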

Find rows in data frame where certain columns are duplicated, then combine the elements in other columns [duplicate]

I have a data frame in which I want to find the rows where both columns A and B are duplicated, and then merge those rows by combining the elements of the C column.
My example:
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
My expected result:
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
Thanks a lot
Without packages:
DF <- aggregate(C ~ A + B, FUN = function(x) paste(x, collapse = "; "), data = DF)
Output:
A B C
1 1 a M
2 2 a X
3 1 b N
4 3 c M; N
Or with data.table:
library(data.table)
setDT(DF)[, .(C = paste(C, collapse = "; ")), by = .(A, B)]
This is a tidyverse based solution where you can use paste with collapse after grouping it.
library(dplyr)
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
DF %>%
group_by(A,B) %>%
summarise(C = paste(C, collapse = ";"))
#> # A tibble: 4 x 3
#> # Groups: A [3]
#> A B C
#> <dbl> <fct> <chr>
#> 1 1 a M
#> 2 1 b N
#> 3 2 a X
#> 4 3 c M;N
Created on 2019-03-19 by the reprex package (v0.2.1)
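The aggregate() output above comes back sorted by B first, so it does not line up row by row with the expected DFE. A small sketch of one way to restore the A, B order and keep the "; " separator from the question, assuming plain character columns (the name `out` is illustrative):

```r
DF <- data.frame(A = c(1, 1, 2, 3, 3),
                 B = c("a", "b", "a", "c", "c"),
                 C = c("M", "N", "X", "M", "N"),
                 stringsAsFactors = FALSE)

# Extra arguments after FUN (here collapse) are passed on to paste()
out <- aggregate(C ~ A + B, data = DF, FUN = paste, collapse = "; ")

# Reorder by A, then B, to match the layout of the expected result
out <- out[order(out$A, out$B), ]
rownames(out) <- NULL
print(out)
```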

Create column identifying minimum character from within a group and label ties

I have paired data for 10 subjects (with some missing and some ties). My goal is to select the eye with the best disc_grade (A > B > C) and label ties accordingly from the data frame below.
I'm stuck on how to use R code to select the rows with the best disc_grade for each subject.
df <- structure(list(patientID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
6, 7, 7, 8, 8, 9, 9, 10, 10), eye = c("R", "L", "R", "L", "R",
"L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L",
"R", "L"), disc_grade = c(NA, "B", "C", "B", "B", "C", "B", "C",
"B", "A", "B", "B", "C", "B", NA, NA, "B", "C", "B", "C")), .Names = c("patientID", "eye", "disc_grade"), class = c("tbl_df", "data.frame"), row.names = c(NA, -20L))
The desired output is:
patientID eye disc_grade
2 1 L B
4 2 L B
5 3 R B
7 4 R B
10 5 L A
11 6 Tie B
14 7 L B
17 9 R B
19 10 R B
This seems to work:
df %>%
group_by(patientID) %>%
filter(disc_grade == min(disc_grade, na.rm=TRUE)) %>%
summarise(eye = if (n()==1) eye else "Tie", disc_grade = first(disc_grade))
patientID eye disc_grade
(dbl) (chr) (chr)
1 1 L B
2 2 L B
3 3 R B
4 4 R B
5 5 L A
6 6 Tie B
7 7 L B
8 9 R B
9 10 R B
There is a warning for group 8, but we get the desired result thanks to how filter works on NAs.
With data.table:
setDT(df)[,
.SD[ disc_grade == min(disc_grade, na.rm=TRUE) ][,
.( eye = if (.N==1) eye else "Tie", disc_grade = disc_grade[1] )
]
, by=patientID]
Again, there's a warning, but now we do get a row for group 8, since [ does not ignore NAs. To get around this, you could filter the NAs before or after the operation (as in other answers). My best idea for doing it during the main operation is pretty convoluted:
setDT(df)[,
.SD[ which(disc_grade == min(disc_grade, na.rm=TRUE)) ][,
if (.N >= 1) list( eye = if (.N==1) eye else "Tie", disc_grade = disc_grade[1] )
]
, by=patientID]
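The role of which() here can be seen on a bare vector. This is a base R illustration only (dplyr and data.table apply their own rules to NA in filters, as described above): a logical subscript containing NA yields an NA element, while which() keeps only the positions that are TRUE.

```r
grade <- c(NA, "B", "C")

# Comparison with NA yields NA, not FALSE
hit <- grade == "B"
print(hit)

# Logical subscript: the NA position produces an NA element
kept_by_bracket <- grade[hit]
print(kept_by_bracket)

# which() returns only the TRUE positions, so the NA disappears
kept_by_which <- grade[which(hit)]
print(kept_by_which)
```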
One option with data.table
library(data.table)
na.omit(setDT(df))[, eye:=if(uniqueN(disc_grade)==1 &
.N >1) 'Tie' else eye, patientID
][order(factor(disc_grade, levels=c('A', 'B', 'C'))),
.SD[1L] ,patientID][order(patientID)]
# patientID eye disc_grade
#1: 1 L B
#2: 2 L B
#3: 3 R B
#4: 4 R B
#5: 5 L A
#6: 6 Tie B
#7: 7 L B
#8: 9 R B
#9: 10 R B
library(dplyr)
df <- structure(list(patientID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
6, 7, 7, 8, 8, 9, 9, 10, 10), eye = c("R", "L", "R", "L", "R",
"L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L", "R", "L",
"R", "L"), disc_grade = c(NA, "B", "C", "B", "B", "C", "B", "C",
"B", "A", "B", "B", "C", "B", NA, NA, "B", "C", "B", "C")), .Names = c("patientID", "eye", "disc_grade"), class = c("tbl_df", "data.frame"), row.names = c(NA, -20L))
df %>%
filter(!is.na(disc_grade)) %>% ## remove rows with NAs
group_by(patientID) %>% ## for each patient
filter(disc_grade == min(disc_grade)) %>% ## keep the row (his eye) that has the best score
mutate(eye_upd = ifelse(n() > 1, "tie", eye)) %>% ## if you kept both eyes you have a tie
select(patientID,eye_upd,disc_grade) %>%
distinct()
# patientID eye_upd disc_grade
# (dbl) (chr) (fctr)
# 1 1 L B
# 2 2 L B
# 3 3 R B
# 4 4 R B
# 5 5 L A
# 6 6 tie B
# 7 7 L B
# 8 9 R B
# 9 10 R B
There's certainly a better way to do this, but this gets the job done...need more coffee...
df_orig <- df
library(dplyr)
df %>%
filter(!is.na(disc_grade)) %>%
group_by(patientID) %>%
summarise(best = min(disc_grade)) %>%
left_join(., df_orig, by = c("patientID" = "patientID",
"best" = "disc_grade")) %>%
group_by(patientID) %>%
mutate(eye = ifelse(n() > 1, "tie", eye)) %>%
distinct(patientID, .keep_all = TRUE) %>% ## keep the other columns; current dplyr drops them otherwise
select(patientID, eye, best)
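Note: I am able to get away with min(disc_grade) because the grades are single letters that sort alphabetically in the same order as their quality ("A" < "B" < "C"). For an explicit numeric ranking, consider looking at as.numeric(as.factor(df$disc_grade)).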
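If the grade labels did not happen to sort alphabetically, the ordering could be stated explicitly. Below is a base R sketch that does this with a factor whose levels run from best to worst; it rebuilds the question's data as a plain data frame, and the names `grade`, `ok`, `best` and `result` are illustrative rather than taken from any answer.

```r
df <- data.frame(
  patientID = rep(1:10, each = 2),
  eye = rep(c("R", "L"), times = 10),
  disc_grade = c(NA, "B", "C", "B", "B", "C", "B", "C", "B", "A",
                 "B", "B", "C", "B", NA, NA, "B", "C", "B", "C"),
  stringsAsFactors = FALSE
)

# Explicit ranking: level 1 ("A") is the best grade
grade <- factor(df$disc_grade, levels = c("A", "B", "C"))

# Drop missing grades, then store each row's rank as an integer
ok <- df[!is.na(grade), ]
ok$grade_rank <- as.integer(grade[!is.na(grade)])

# Keep the row(s) with the best (lowest) rank per patient
best <- ok[ok$grade_rank == ave(ok$grade_rank, ok$patientID, FUN = min), ]

# Two surviving rows for one patient means both eyes share the best grade
n_best <- ave(best$grade_rank, best$patientID, FUN = length)
best$eye[n_best > 1] <- "Tie"

result <- unique(best[, c("patientID", "eye", "disc_grade")])
rownames(result) <- NULL
print(result)
```

Patient 8 has no graded eye and therefore does not appear in the result, as in the desired output.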

How to remove duplicate pair-wise columns [duplicate]

Consider the following dataframe:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
How can I remove duplicate pair-wise columns so that the output looks like:
# V1 V2 n
#1 A B 1
#2 A C 3
#3 B C 2
I tried unique() and duplicated() to no avail.
Not sure if this is the simplest way of doing it (transposing can be computationally expensive) but this would work with your data frame:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
First, sort the data frame row-wise, so your value-pairs become true duplicates.
df <- data.frame(t(apply(df, 1, sort)))
Then you can just apply the unique function.
df <- unique(df)
If your column names and order are important, you'll have to re-establish those.
names(df) <- c("n", "V1", "V2")
df <- df[, c("V1", "V2", "n")]
Another option would be to reshape (xtabs(n~..)) the dataset ('df') to wide format, set the lower triangular matrix to 0, and remove the rows with "Freq" equal to 0.
m1 <- xtabs(n~V1+V2, df)
m1[lower.tri(m1)] <- 0
subset(as.data.frame(m1), Freq!=0)
# V1 V2 Freq
#4 A B 1
#7 A C 3
#8 B C 2
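A further possibility that avoids both the transpose and the reshape, sketched here under the assumption that V1 and V2 are character columns: build an order-independent key for each pair with pmin() and pmax(), then drop rows whose key has already been seen. The names `lo`, `hi` and `dedup` are made up for this example.

```r
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
                 V2 = c("B", "C", "A", "C", "A", "B"),
                 n = c(1, 3, 1, 2, 3, 2),
                 stringsAsFactors = FALSE)

# Alphabetically smaller and larger member of each pair, row by row
lo <- pmin(df$V1, df$V2)
hi <- pmax(df$V1, df$V2)

# A-B and B-A now share the same (lo, hi) key; keep the first occurrence
dedup <- df[!duplicated(data.frame(lo, hi)), ]
print(dedup)
```

This keeps the original columns and their types untouched, so n stays numeric and no renaming is needed afterwards.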
