Compare values between rows for specific columns - r

I have a rather curious question, but I hope I can find an answer. Unfortunately, the search function on Stack Overflow didn't help me with this one.
I have the following dataset structure:
my.df <- data.frame(prs_id = c(1234, 1255, 1556, 3173),
                    vrs_id = c(3145, 3145, 3333, 3333),
                    V1_2017 = c(12, 14, 12, 35),
                    V2_2017 = c("A", "B", "C", "D"),
                    V1_2018 = c(13, 16, 13, 34),
                    V2_2018 = c("A", "B", "C", "D"),
                    V1_2019 = c(15, 17, 17, 45),
                    V2_2019 = c("A", "B", "C", "D"),
                    V1_2020 = c(17, 17, 22, 45),
                    V2_2020 = c("A", "B", "C", "D"))
As you might see, I filtered duplicates (in "vrs_id") from a larger dataset. The duplicates are not supposed to be there, and the dataset is currently in wide format. I need a way to decide which "vrs_id" row to keep and which to drop. To do that, the function must compare the corresponding values of V1_2020 down to V1_2017 within each "vrs_id". V2 is just there to illustrate that there are more variables (actually 13) between the V1 variables.
E.g. "vrs_id" == 3333 requires checking which V1_2020 (45 or 22) is larger. If neither is larger (see 17 vs 17 for "vrs_id" == 3145), the function should move to the next variable, V1_2019, and compare again. If, at the end, there is no difference between the duplicates, the first row (by row number in the data frame) should be chosen.
The subset contains only the duplicates and their corresponding originals, so a potential function need not compare more than two values across rows. I tried to include pmax, but when grouping the data frame by vrs_id, it automatically chose vrs_id as the maximum in each row; excluding vrs_id from the subset, in turn, gave an error because the grouping variable was missing.
Is there anybody with an idea how to compute these comparisons?
Any help would be appreciated!
Edit:
The expected output should look like this:
new.df <- data.frame(prs_id = c(1255, 3173),
                     vrs_id = c(3145, 3333),
                     V1_2017 = c(14, 35),
                     V2_2017 = c("B", "D"),
                     V1_2018 = c(16, 34),
                     V2_2018 = c("B", "D"),
                     V1_2019 = c(17, 45),
                     V2_2019 = c("B", "D"),
                     V1_2020 = c(17, 45),
                     V2_2020 = c("B", "D"))
Rows 2 and 4 (prs_id 1255 and 3173) should be kept: for row 2, V1_2019 shows 17 > 15 (even though V1_2020 is tied at 17 = 17), and for row 4, V1_2020 shows 45 > 22, so row 3 (prs_id 1556) is discarded.
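Since the rule described above is just a lexicographic comparison on (V1_2020, V1_2019, V1_2018, V1_2017), here is a minimal sketch encoding it directly (assuming dplyr; arrange() sorts stably, so on a complete tie the first row by row number wins):
library(dplyr)
my.df %>%
  group_by(vrs_id) %>%
  # sort each duplicate group by the V1 columns, newest year first
  arrange(desc(V1_2020), desc(V1_2019), desc(V1_2018), desc(V1_2017),
          .by_group = TRUE) %>%
  slice(1) %>% # keep the winning row per vrs_id
  ungroup()
On the example data this keeps prs_id 1255 and 3173, matching new.df above.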

This may be an option, using tidyverse packages.
Note this only works when, within each vrs_id, one prs_id's V1 values are all greater than or equal to those of its duplicate.
library(dplyr)
library(tidyr)
df1 <-
  my.df %>%
  # reshape the V1 columns into long format
  pivot_longer(starts_with("V1")) %>%
  # within each vrs_id and year, flag values that are not the maximum
  group_by(vrs_id, name) %>%
  mutate(max_val = if_else(value == max(value), 0, 1)) %>%
  ungroup() %>%
  # a prs_id is kept only if it was never flagged
  group_by(prs_id) %>%
  mutate(prs_discard = sum(max_val)) %>%
  filter(prs_discard == 0) %>%
  select(-c(max_val, prs_discard)) %>%
  # back to wide format
  pivot_wider(names_from = name, values_from = value)
df1
#> # A tibble: 2 x 10
#> # Groups:   prs_id [2]
#>   prs_id vrs_id V2_2017 V2_2018 V2_2019 V2_2020 V1_2017 V1_2018 V1_2019 V1_2020
#>    <dbl>  <dbl> <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>
#> 1   1255   3145 B       B       B       B            14      16      17      17
#> 2   3173   3333 D       D       D       D            35      34      45      45
Created on 2021-11-23 by the reprex package (v2.0.1)

Related

In R, subset a dataframe on rows whose ID appears more than once [duplicate]

This question already has answers here: Finding ALL duplicate rows, including "elements with smaller subscripts" (9 answers).
Background
I have a dataframe d with ~10,000 rows and n columns, one of which is an ID variable. Most IDs appear once, but some appear more than once. Say that it looks like this:
d <- data.frame(ID = c("a", "b", "b", "c", "d", "d"),
                gender = c("f", "f", "m", "f", "f", "m"),
                zip = c(1, NA, 2, 3, NA, 4))
Problem
I'd like a new dataframe d_sub which only contains IDs that appear more than once in d, something like this:
  ID gender zip
   b      f  NA
   b      m   2
   d      f  NA
   d      m   4
What I've tried
I've tried something like this:
d_sub <- subset(d, duplicated(d$ID))
But that only gets me one entry each for IDs b and d:
  ID gender zip
   b      m   2
   d      m   4
and I want each of their respective rows.
Any thoughts?
We need to combine duplicated with a fromLast = TRUE pass using |, since duplicated by itself returns FALSE for the first occurrence of each 'ID':
d_sub <- subset(d, duplicated(ID)|duplicated(ID, fromLast = TRUE))
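To see the difference, for illustration, compare the two on a small vector:
ID <- c("a", "b", "b", "c", "d", "d")
duplicated(ID)
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
duplicated(ID) | duplicated(ID, fromLast = TRUE)
# [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE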
We could use add_count, then filter on n:
library(dplyr)
df %>%
  add_count(ID) %>%
  filter(n != 1) %>%
  select(-n)
Example:
library(dplyr)
df <- tribble(
  ~ID, ~gender, ~zip,
  "a", "f", 1,
  "b", "f", NA,
  "b", "m", 2,
  "c", "f", 3,
  "d", "f", NA,
  "d", "m", 4)
df %>%
  add_count(ID) %>%
  filter(n != 1) %>%
  select(-n)
Output:
# A tibble: 4 x 3
  ID    gender   zip
  <chr> <chr>  <dbl>
1 b     f         NA
2 b     m          2
3 d     f         NA
4 d     m          4
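For completeness, the same filter can be written in base R with %in%, reusing the duplicated() idea from the first answer (a one-line sketch on the same df):
df[df$ID %in% df$ID[duplicated(df$ID)], ]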

Compare each row in two dataframes in R

I have 2 data frames with account numbers and amounts, plus some other irrelevant columns. I would like to compare them and output a Y or N depending on whether they match.
I need to compare the account number in row 1 of dataframe A to the account number in row 1 of dataframe B and, if they match, put a Y in a column, or an N if they don't. I've managed to get the code to check whether there is a match anywhere in the dataframe, but I need to check each row individually.
E.g.
df1
  account.num x1 x2 x3
          100  a  b  c
          101  a  b  c
          102  a  b  c
          103  a  b  c
df2
  account.num x1 x2 x3
          100  a  b  c
          102  a  b  c
          101  a  b  c
          103  a  b  c
output
  account.num x1 x2 x3 match
          100  a  b  c     Y
          101  a  b  c     N
          102  a  b  c     N
          103  a  b  c     Y
So row 1 matches because the account numbers are the same, but row 2 doesn't because they differ. The other data in the dataframes doesn't matter, just that column. Can I do this without merging the data frames?
You can use == to check whether account.num is equal row by row, and use the resulting logical vector to index into c("N", "Y"):
df1$match <- c("N", "Y")[1 + (df1[[1]] == df2[[1]])]
df1
#  account.num x1 x2 x3 match
#1         100  a  b  c     Y
#2         101  a  b  c     N
#3         102  a  b  c     N
#4         103  a  b  c     Y
Data:
df1 <- data.frame(account.num=100:103, x1="a", x2="b", x3="c")
df2 <- data.frame(account.num=c(100,102,101,103), x1="a", x2="b", x3="c")
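The indexing works because TRUE/FALSE coerce to 1/0, so 1 + (df1[[1]] == df2[[1]]) yields index 1 ("N") where the account numbers differ and 2 ("Y") where they agree. An arguably more readable equivalent on the same data is:
df1$match <- ifelse(df1$account.num == df2$account.num, "Y", "N")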
If you want a base R solution, here is a quick sketch. Assuming both dataframes have the same number of rows, it should work with your data.
# example dataframes
a <- data.frame(A = c(1, 2, 3), B = c("one", "two", "three"))
b <- data.frame(A = c(3, 2, 1), B = c("three", "two", "one"))
res <- logical(nrow(a)) # initialise result vector
for (rownum in seq_len(nrow(a))) {
  # iterate over all row numbers and compare whole rows
  res[rownum] <- all(a[rownum, ] == b[rownum, ])
}
res # result vector
# [1] FALSE  TRUE FALSE
# you can put it in frame a like this; example colname is "equalB"
a$equalB <- res
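The loop can also be vectorized: assuming both frames have the same dimensions and contain no NAs, == compares them element-wise and rowSums counts the matches per row (a sketch on the a/b frames above):
res <- rowSums(a == b) == ncol(a) # TRUE where every column matches
res
# [1] FALSE  TRUE FALSE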
If you want a tidyverse solution, you can use left_join.
The principle here is to try to match the data from df2 to the data from df1. Where it matches, a match column is set to TRUE; the code then replaces the remaining NA values with FALSE.
I'm also adding code to create the data frames from the example.
library(tidyverse)
df1 <-
  tribble(~account_num, ~x1, ~x2, ~x3,
          100, "a", "b", "c",
          101, "a", "b", "c",
          102, "a", "b", "c",
          103, "a", "b", "c") %>%
  # because position in the df is important information, hardcode it in the df
  rowid_to_column()
df2 <-
  tribble(~account_num, ~x1, ~x2, ~x3,
          100, "a", "b", "c",
          102, "a", "b", "c",
          101, "a", "b", "c",
          103, "a", "b", "c") %>%
  rowid_to_column()
# take df1
df1 %>%
  # try to match it with a version of df2 that carries a new column `match` = TRUE,
  # joining by `rowid`, `account_num`, `x1`, `x2`, and `x3`
  left_join(df2 %>%
              tibble::add_column(match = TRUE),
            by = c("rowid", "account_num", "x1", "x2", "x3")) %>%
  # replace the NA in `match` with FALSE
  replace_na(list(match = FALSE))
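On the example data this flags rowids 1 and 4 as TRUE and rowids 2 and 3 as FALSE, mirroring the Y/N column in the question; joining on rowid alongside the data columns is what makes the comparison positional rather than set-based.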

Merging of dataframes based on substrings in a column

I have two data frames: one (df_protein) contains experimentally measured data from protein fragments carrying a modification; in the other (df_modificaton) I have a database of the "names" of all modifications. Now I am trying to merge these together.
Both have a column with the modified sequence (the amino acid which is modified has an asterisk). But in df_protein the sequence of the whole fragment (!) is stored (starting and ending with "_"), while in df_modificaton only the 7 amino acids before and after the modification are given (if the modification is at the start or the end of the protein, the remaining places are filled with "_").
For better illustration, here is a MWE:
df_protein <- data_frame(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250)
)
df_modificaton <- data_frame(
  Protein = c("A", "A", "A", "B", "B", "B"),
  Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG", "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3")
)
# How can I merge the above to get the following result:
df_merged <- data_frame(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250),
  Modification = c("Y77", "S125", "S127", "T456", "S3")
)
I am using tidyverse but I am also fine with other packages. Thanks.
One approach is to use the fuzzyjoin package to perform a stringdist join:
library(dplyr)
library(fuzzyjoin)
stringdist_inner_join(df_protein, df_modificaton,
                      by = "Sequence", method = "jw", distance_col = "distance") %>%
  group_by(Sequence.x) %>%
  slice_min(distance)
# A tibble: 5 x 7
# Groups:   Sequence.x [5]
  Protein.x Sequence.x              Counts Protein.y Sequence.y       Modification distance
  <chr>     <chr>                    <dbl> <chr>     <chr>            <chr>           <dbl>
1 A         _EPTPSIASDIY*LPIATQELR_   3.46 A         PSIASDIY*LPIATQ  Y77             0.260
2 A         _S*SSSLLASPGHISVK_        6.13 A         PEQRLSSS*SLLASPG S127            0.294
3 B         _SMS*VDLSHIPLK_           7.25 B         _____SMS*VDLSHIP S3              0.15
4 A         _SSS*SLLASPGHISVK_       10.0  A         PEQRLSSS*SLLASPG S127            0.294
5 B         _TQDPVPPET*PSDSDHK_       0    B         DPVPPET*PSDSDHK  T456            0.137
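If relying on Jaro-Winkler distances feels fragile, an exact-key alternative for this MWE is to join on a short context window around the asterisk. This sketch assumes that one residue before and up to three after the "*" uniquely identify each site (true for the example data, but worth verifying on the full set); the helper mod_key is hypothetical:
library(dplyr)
library(stringr)
# extract one character before the asterisk plus up to three after it
mod_key <- function(s) str_extract(s, ".\\*.{1,3}")
df_protein %>%
  mutate(key = mod_key(Sequence)) %>%
  inner_join(df_modificaton %>% mutate(key = mod_key(Sequence)),
             by = c("Protein", "key"), suffix = c("", ".window")) %>%
  select(-key, -Sequence.window)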

Remove rows if the same combination of values exists in different columns

I have 410 DNA sequences that I have compared against each other to get their similarity.
Now, to trim the database, I need to get rid of rows that pair the same two sequences in swapped columns, because of course every comparison appears twice.
To make myself clear, I have something like
tribble(
  ~seq01, ~seq02, ~similarity,
  "a", "b", 100.000,
  "b", "a", 100.000,
  "c", "d", 99.000,
  "d", "c", 99.000
)
Comparing a-b and b-a is the same thing, so I want to get rid of the doubled value.
What I want to end up with is
tribble(
  ~seq01, ~seq02, ~similarity,
  "a", "b", 100.000,
  "c", "d", 99.000
)
I am not sure how to proceed; all the ways I thought of are kinda hacky. I checked other answers, but they don't really satisfy me.
Any input would be greatly appreciated (but tidy inputs are even more appreciated!)
We can use pmin and pmax to sort the values within each row and then use distinct to select unique rows.
library(dplyr)
df %>%
  mutate(col1 = pmin(seq01, seq02),
         col2 = pmax(seq01, seq02), .before = 1) %>%
  distinct(col1, col2, similarity)
#  col1  col2  similarity
#  <chr> <chr>      <dbl>
#1 a     b            100
#2 c     d             99
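Note that pmin and pmax work element-wise even on character vectors, which is what puts each pair into a canonical sorted order before distinct() drops the repeats, e.g.:
pmin(c("b", "d"), c("a", "c")) # "a" "c" -- element-wise minimum
pmax(c("b", "d"), c("a", "c")) # "b" "d" -- element-wise maximum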
Another, base R, approach:
df$add1 <- apply(df[,1:2], 1, min) # find rowwise minimum values
df$add2 <- apply(df[,1:2], 1, max) # find rowwise maximum values
df <- df[!duplicated(df[,4:5]),] # remove rows with identical values in new col's
df[,4:5] <- NULL # remove auxiliary col's
Result:
df
# A tibble: 2 x 3
  seq01 seq02 similarity
  <chr> <chr>      <dbl>
1 a     b            100
2 c     d             99

How to separate the data based on names present in a column?

I have a data frame that looks like this.
df <- data.frame(names = c("Ram", "Siddhharth", "Nithin", "Aashrit", "Bragadish", "Sridhar"),
                 house = c("A", "B", "A", "B", "A", "B"))
I want to create a new data frame which gets re-arranged based on the house they are in.
Expected Output
  house_A   house_B
1 Ram       Siddhharth
2 Nithin    Aashrit
3 Bragadish Sridhar
How can I achieve this? Many thanks in advance.
You could use tidyr:
df %>%
  pivot_wider(names_from = "house", names_prefix = "house_",
              values_from = "names", values_fn = list) %>%
  unnest(cols = everything())
This returns
# A tibble: 3 x 2
  house_A   house_B
  <chr>     <chr>
1 Ram       Siddhharth
2 Nithin    Aashrit
3 Bragadish Sridhar
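A base R alternative, assuming each house contains the same number of names (required so the columns line up), is to split the names by house:
out <- as.data.frame(split(df$names, df$house)) # one column per house
names(out) <- paste0("house_", names(out))
out
#     house_A    house_B
# 1       Ram Siddhharth
# 2    Nithin    Aashrit
# 3 Bragadish    Sridhar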
