Merging of dataframes based off substrings in a column - r

I have two data frames. One (df_protein) contains experimentally measured data from protein fragments carrying a modification; the other (df_modificaton) is a database of the "names" of all modifications. I am trying to merge the two.
Both have a column with the modified sequence (the amino acid which is modified is marked with an asterisk). But in df_protein the sequence of the whole fragment (!) is stored (starting and ending with "_"), while in df_modificaton only the 7 amino acids before and after the modification are given (if the modification is near the start or the end of the protein, the remaining positions are filled with "_").
For better illustration, here is an MWE:
df_protein <- data_frame(
Protein = c("A", "A", "A", "B", "B"),
Sequence = c("_EPTPSIASDIY*LPIATQELR_" , "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_") ,
Counts = c(3.456, 6.126, 10.023 ,0.000, 7.250)
)
df_modificaton <- data_frame(
Protein = c("A", "A", "A", "B", "B", "B"),
Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG", "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL","_____SMS*VDLSHIP"),
Modification = c("S125", "Y77", "S127", "T456", "S44", "S3")
)
# How can I merge the above to the following result:
df_merged <- data_frame(
Protein = c("A", "A", "A", "B", "B"),
Sequence = c("_EPTPSIASDIY*LPIATQELR_" , "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_") ,
Counts = c(3.456, 6.126, 10.023 ,0.000, 7.250),
Modification = c("Y77", "S125", "S127", "T456", "S3")
)
I am using tidyverse but I am also fine with other packages. Thanks.

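One approach is to use the fuzzyjoin package to perform a stringdist join. Note that the matching is approximate: in the output below, _S*SSSLLASPGHISVK_ is paired with S127 rather than the S125 given in the expected result.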
library(dplyr)
library(fuzzyjoin)
stringdist_inner_join(df_protein, df_modificaton,
by = "Sequence", method = "jw", distance_col = "distance") %>%
group_by(Sequence.x) %>%
slice_min(distance)
# A tibble: 5 x 7
# Groups: Sequence.x [5]
Protein.x Sequence.x Counts Protein.y Sequence.y Modification distance
<chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 A _EPTPSIASDIY*LPIATQELR_ 3.46 A PSIASDIY*LPIATQ Y77 0.260
2 A _S*SSSLLASPGHISVK_ 6.13 A PEQRLSSS*SLLASPG S127 0.294
3 B _SMS*VDLSHIPLK_ 7.25 B _____SMS*VDLSHIP S3 0.15
4 A _SSS*SLLASPGHISVK_ 10.0 A PEQRLSSS*SLLASPG S127 0.294
5 B _TQDPVPPET*PSDSDHK_ 0 B DPVPPET*PSDSDHK T456 0.137
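
As an alternative to fuzzy matching, an exact rule can be sketched in base R. It assumes that, once the "_" padding is removed, the text before the asterisk in one sequence must be a suffix of the text before the asterisk in the other, and the text after the asterisk must be a prefix. The helpers before_mod() and after_mod() and the column name Window are made up for this illustration, and the rule has only been checked against the example data from the question.

```r
# Example data from the question (database column renamed to Window, an assumption)
df_protein <- data.frame(
  Protein  = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_",
               "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts   = c(3.456, 6.126, 10.023, 0.000, 7.250),
  stringsAsFactors = FALSE
)
df_modificaton <- data.frame(
  Protein      = c("A", "A", "A", "B", "B", "B"),
  Window       = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG",
                   "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3"),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: drop "_" padding, then keep the text before / after "*"
before_mod <- function(x) sub("\\*.*$", "", gsub("_", "", x, fixed = TRUE))
after_mod  <- function(x) sub("^.*\\*", "", gsub("_", "", x, fixed = TRUE))

# All fragment/window pairs that belong to the same protein
pairs <- merge(df_protein, df_modificaton, by = "Protein")

# The shorter side must be a suffix (before "*") or a prefix (after "*") of the longer
pre_ok  <- endsWith(before_mod(pairs$Sequence), before_mod(pairs$Window)) |
           endsWith(before_mod(pairs$Window), before_mod(pairs$Sequence))
post_ok <- startsWith(after_mod(pairs$Sequence), after_mod(pairs$Window)) |
           startsWith(after_mod(pairs$Window), after_mod(pairs$Sequence))

df_merged <- pairs[pre_ok & post_ok, c("Protein", "Sequence", "Counts", "Modification")]
df_merged
```

Joining on Protein first keeps the number of candidate pairs small and avoids matching a fragment to a window from a different protein, which the distance-based join does not guard against.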

Related

Compare values between rows for specific columns

I have a rather curious question, but I hope I can find an answer. Unfortunately, the search function in stackoverflow didn't help me with this one.
I have the following dataset structure:
my.df <- data.frame(prs_id = c(1234, 1255, 1556, 3173),
vrs_id = c(3145, 3145, 3333, 3333),
V1_2017 = c(12,14,12,35),
V2_2017 = c("A", "B", "C", "D"),
V1_2018 = c(13,16,13,34),
V2_2018 = c("A", "B", "C", "D"),
V1_2019 = c(15,17,17,45),
V2_2019 = c("A", "B", "C", "D"),
V1_2020 = c(17,17,22,45),
V2_2020 = c("A", "B", "C", "D"))
As you might see, I filtered duplicates from a larger dataset (duplicates in "vrs_id"). The duplicates are not supposed to be there, and the dataset is currently in wide format. I need a way to decide which row of each "vrs_id" to keep and which to drop. The function must therefore compare the corresponding values from V1_2020 down to V1_2017 within each "vrs_id". V2 is only there to show that there are more (actually 13) variables between the V1 variables.
E.g. "vrs_id" == 3333 requires checking which V1_2020 (22 and 45) is larger. If neither is larger (see 17 vs 17 for vrs_id == 3145), the function should move on to the next variable, V1_2019, and do the same. If, in the end, the duplicates do not differ, the first one (by row number in the data frame) should be chosen.
The subset contains only the duplicates and their corresponding originals, so a potential function does not need to be able to compare more than two rows. I tried to use pmax, but when grouping the data frame by vrs_id, it automatically chose vrs_id as the maximum in the row. Excluding vrs_id from the subset, in turn, gave an error because the grouping variable was missing.
Does anybody have an idea how to compute these comparisons?
Any help would be appreciated!
Edit:
The expected output should look like this:
new.df <- data.frame(
prs_id = c(1255, 3173),
vrs_id = c(3145, 3333),
V1_2017 = c(14,35),
V2_2017 = c("B", "D"),
V1_2018 = c(16,34),
V2_2018 = c("B", "D"),
V1_2019 = c(17,45),
V2_2019 = c("B", "D"),
V1_2020 = c(17,45),
V2_2020 = c("B", "D"))
Rows 2 and 4 (prs_id 1255 and 3173) should be kept: for row 2, V1_2019 shows 17 > 15 (although V1_2020 is tied at 17 = 17), and for row 4, V1_2020 shows 45 > 22, so row 3 (prs_id 1556) is discarded.
This may be an option, using tidyverse packages.
Note this only works where all values for one prs_id are greater than or equal to the variables in duplicate prs_ids.
library(dplyr)
library(tidyr)
library(stringr)
df1 <-
my.df %>%
pivot_longer(starts_with("V1")) %>%
group_by(vrs_id, name) %>%
mutate(max_val = if_else(value == max(value), 0, 1)) %>%
ungroup() %>%
group_by(prs_id) %>%
mutate(prs_discard = sum(max_val)) %>%
filter(prs_discard == 0) %>%
select(-c(max_val, prs_discard)) %>%
pivot_wider(names_from = name, values_from = value)
df1
#> # A tibble: 2 x 10
#> # Groups: prs_id [2]
#> prs_id vrs_id V2_2017 V2_2018 V2_2019 V2_2020 V1_2017 V1_2018 V1_2019 V1_2020
#> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1255 3145 B B B B 14 16 17 17
#> 2 3173 3333 D D D D 35 34 45 45
Created on 2021-11-23 by the reprex package (v2.0.1)
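
The year-by-year tie-breaking described in the question can also be sketched directly in base R: sort within each vrs_id by V1_2020, then V1_2019, and so on in decreasing order, and keep the first row per vrs_id. This is a sketch under the assumption that the V1 columns are numeric and contain no missing values. It relies on order() leaving unresolved ties in their original order, so fully tied duplicates keep the row that came first.

```r
my.df <- data.frame(prs_id = c(1234, 1255, 1556, 3173),
                    vrs_id = c(3145, 3145, 3333, 3333),
                    V1_2017 = c(12, 14, 12, 35),
                    V2_2017 = c("A", "B", "C", "D"),
                    V1_2018 = c(13, 16, 13, 34),
                    V2_2018 = c("A", "B", "C", "D"),
                    V1_2019 = c(15, 17, 17, 45),
                    V2_2019 = c("A", "B", "C", "D"),
                    V1_2020 = c(17, 17, 22, 45),
                    V2_2020 = c("A", "B", "C", "D"))

# Sort by vrs_id, then by the V1 columns from the latest year backwards,
# largest value first (negating a numeric column reverses its sort order)
ord <- order(my.df$vrs_id,
             -my.df$V1_2020, -my.df$V1_2019, -my.df$V1_2018, -my.df$V1_2017)
sorted <- my.df[ord, ]

# Keep only the first (winning) row of each vrs_id
new.df <- sorted[!duplicated(sorted$vrs_id), ]
new.df
```

Unlike the pivot-based approach, this does not require one row to be greater than or equal to its duplicate in every year; only the first year in which the two rows differ decides.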

remove rows if values exists with the same combination in different columns

I have 410 DNA sequences that I have compared with each other to get their similarity.
Now, to trim the database, I should get rid of the rows that hold the same pair of values in the 2 columns in swapped order, because of course every comparison appears twice.
To make myself clear, I have something like
tribble(
~seq01, ~seq02, ~ similarity,
"a", "b", 100.000,
"b", "a", 100.000,
"c", "d", 99.000,
"d", "c", 99.000,
)
comparing a-b and b-a is the same thing, so I'd want to get rid of the double value
What I want to end up with is
tribble(
~seq01, ~seq02, ~ similarity,
"a", "b", 100.000,
"c", "d", 99.000
)
I am not sure how to proceed; all the ways I thought of are kinda hacky. I checked other answers, but they don't really satisfy me.
Any input would be greatly appreciated (but tidy inputs are even more appreciated!)
We can use pmin and pmax to sort the values and then use distinct to select unique rows.
library(dplyr)
df %>%
mutate(col1 = pmin(seq01, seq02),
col2 = pmax(seq01, seq02), .before = 1) %>%
distinct(col1, col2, similarity)
# col1 col2 similarity
# <chr> <chr> <dbl>
#1 a b 100
#2 c d 99
Another, base R, approach:
df$add1 <- apply(df[,1:2], 1, min) # find rowwise minimum values
df$add2 <- apply(df[,1:2], 1, max) # find rowwise maximum values
df <- df[!duplicated(df[,4:5]),] # remove rows with identical values in new col's
df[,4:5] <- NULL # remove auxiliary col's
Result:
df
# A tibble: 2 x 3
seq01 seq02 similarity
<chr> <chr> <dbl>
1 a b 100
2 c d 99

Add a row to a dataframe that repeats a row and replaces 2 entries

I want to add rows to a dataframe (or tibble) as part of a data entry project. I need to:
Find one row that holds a specific value in one column (obsid)
Duplicate that row. However, replace the value in column "word".
Append the new row to the dataframe
I want to write a function that makes it easy. When I write the function, it won't add the new rows. I can print out the answer. But it won't alter the basic dataframe
If I do it without a function it works as well.
Why won't the function add the row?
df <- tibble(obsid = c("a","b" , "c", "d"), b=c("a", "a", "b", "b"), word= c("what", "is", "the", "answer"))
df$main <- 1
addrow <- function(id, newword) {
rowtoadd <- df %>%
filter(obsid== id & main==1) %>%
mutate(word=replace(word, main==1, newword)) %>%
mutate(main=replace(main, word==newword, 0))
df <- bind_rows(df, rowtoadd)
print(rowtoadd)
print(filter(df, df$obsid== id))}
addrow("a", "xxx")
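R functions do not modify objects outside their own scope: the assignment to df inside addrow() only changes a local copy. You need to return the modified copy of the data frame, e.g. by wrapping it in return(), and assign the result back to df when you call the function.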
Change your function to:
df <- tibble(obsid = c("a","b" , "c", "d"), b=c("a", "a", "b", "b"), word= c("what", "is", "the", "answer"))
df$main <- 1
addrow <- function(id, newword) {
rowtoadd <- df %>%
filter(obsid== id & main==1) %>%
mutate(word=replace(word, main==1, newword)) %>%
mutate(main=replace(main, word==newword, 0))
df <- bind_rows(df, rowtoadd)
return(df)
}
> addrow("a", "xxx")
# A tibble: 5 x 4
obsid b word main
<chr> <chr> <chr> <dbl>
1 a a what 1
2 b a is 1
3 c b the 1
4 d b answer 1
5 a a xxx 0
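
To make the copy semantics concrete, here is a minimal base R sketch. It uses a hypothetical variant of addrow() that takes the data frame as an argument (the name data is an assumption, not part of the original code) and shows that the original object only changes once the returned value is assigned back.

```r
df <- data.frame(obsid = c("a", "b", "c", "d"),
                 b     = c("a", "a", "b", "b"),
                 word  = c("what", "is", "the", "answer"),
                 main  = 1,
                 stringsAsFactors = FALSE)

# Hypothetical variant: works on its argument and returns the enlarged copy
addrow <- function(data, id, newword) {
  rowtoadd <- data[data$obsid == id & data$main == 1, ]
  rowtoadd$word <- newword   # replace the word in the duplicated row
  rowtoadd$main <- 0         # mark the new row as not the main entry
  rbind(data, rowtoadd)      # the last expression is the return value
}

addrow(df, "a", "xxx")        # prints a 5-row result, but df is untouched
n_before <- nrow(df)          # still the original 4 rows

df <- addrow(df, "a", "xxx")  # assigning the result is what updates df
n_after <- nrow(df)
```

Passing the data frame in as an argument, instead of reading df from the global environment, also makes the function reusable for other data frames.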

How to associate a list of character vectors with your data frame in R

The shape of my data is fairly simple:
set.seed(1337)
id <- c(1:4)
values <- runif(0, 1, n=4)
df <- data.frame(id, values)
df
id values
1 1 0.57632155
2 2 0.56474213
3 3 0.07399023
4 4 0.45386562
What isn't simple: I have a list of character-value arrays that match up to each row, where each list item can be empty, or it can contain up to 5 separate tags, like...
tags <- list(
c("A"),
NA,
c("A", "B", "C"),
c("B", "C")
)
I will be asked various questions using the tags as classifiers, for instance, "what is the average value of all rows with a B tag?" Or "how many rows contain both tag A and tag C?"
What way would you choose to store the tags so that I can do this? My real-life data file is quite large, which makes experimenting with unlist or other commands difficult.
Here are a couple of options to get the expected output. Add 'tags' to the dataset as a list column and unnest it (as already suggested in the comments), then summarise the number of 'A' or 'C' tags by taking the sum of a logical vector. Similarly, take the mean of 'values' where 'tag' is 'B'.
library(tidyverse)
df %>%
mutate(tag = tags) %>%
unnest(tag) %>%
summarise(nAC = sum(tag %in% c("A", "C")),
meanB = mean(values[tag == "B"], na.rm = TRUE))
That is not very hard. You just need to assign your list to your df as a new column named tags and then unnest it. I have listed solutions for the questions you mentioned.
library(tidyr)
library(dplyr)
df$tags=list(
c("A"),
NA,
c("A", "B", "C"),
c("B", "C")
)
Newdf=df%>%tidyr::unnest(tags)
Q1.
Newdf%>%group_by(tags)%>%summarise(Mean=mean(values))%>%filter(tags=='B')
tags Mean
<chr> <dbl>
1 B 0.263927925960161
Q2.
Newdf%>%group_by(id)%>%dplyr::summarise(Count=any(tags=='A')&any(tags=='C'))
# A tibble: 4 x 2
id Count
<int> <lgl>
1 1 FALSE
2 2 NA
3 3 TRUE
4 4 FALSE
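
If unnesting a large file is a concern, the two example questions can also be answered with the tags left as a list, one element per row. The following base R sketch uses a hypothetical helper has_tag() and hard-codes the values printed in the question so that it does not depend on the random seed.

```r
# Values copied from the printed data frame in the question
df <- data.frame(id = 1:4,
                 values = c(0.57632155, 0.56474213, 0.07399023, 0.45386562))
tags <- list(
  c("A"),
  NA,
  c("A", "B", "C"),
  c("B", "C")
)

# Hypothetical helper: TRUE for every row whose tag vector contains `tag`
# (rows with NA simply yield FALSE)
has_tag <- function(tag_list, tag) {
  vapply(tag_list, function(t) tag %in% t, logical(1))
}

# Average value of all rows with a B tag
mean_b <- mean(df$values[has_tag(tags, "B")])

# Number of rows that contain both tag A and tag C
n_a_and_c <- sum(has_tag(tags, "A") & has_tag(tags, "C"))
```

Because each call returns one logical per row, conditions can be combined with & and | without reshaping the data first.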

How to change the df column name within a list

I have a list of dfs. The dfs all have the same column names. I would like to:
(1) Change one of the column names to the name of the df within the list
(2) full_join all the dfs after name change
Example of my list:
my_list <- list(one = data.frame(Type = c(1,2,3), Class = c("a", "a", "b")),
two = data.frame(Type = c(1,2,3), Class = c("a", "a", "b")))
Output that I want:
data.frame(Type = c(1,2,3),
one = c("a", "a", "b"),
two = c("a", "a", "b"))
Type one two
1 a a
2 a a
3 b b
You could possibly use dplyr::bind_rows combined with tidyr::spread to achieve the same result (if you are happy to consider alternative approaches). For example:
library(tidyverse)
my_list %>% bind_rows(.id = "groups") %>% spread(groups, Class)
#> Type one two
#> 1 1 a a
#> 2 2 a a
#> 3 3 b b
The first step can be tricky, but it's simple if you iterate over names(my_list).
transformed <- sapply(names(my_list), function(name) {
df <- my_list[[name]]
colnames(df)[colnames(df) == 'Class'] <- name
df
}, simplify = FALSE, USE.NAMES = TRUE)
With purrr::reduce and dplyr::full_join the result can be obtained:
purrr::reduce(transformed, dplyr::full_join)
# Type one two
# 1 1 a a
# 2 2 a a
# 3 3 b b
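
For completeness, the same two steps (rename, then join) can be sketched without extra packages. This assumes every data frame in the list has a Type key column and a Class column, as in the example; Map() pairs each data frame with its list name, and Reduce() with merge(all = TRUE) plays the role of the full join.

```r
my_list <- list(one = data.frame(Type = c(1, 2, 3), Class = c("a", "a", "b")),
                two = data.frame(Type = c(1, 2, 3), Class = c("a", "a", "b")))

# Step 1: rename the Class column of each data frame to its name in the list
renamed <- Map(function(df, nm) {
  names(df)[names(df) == "Class"] <- nm
  df
}, my_list, names(my_list))

# Step 2: full join of all data frames on Type
joined <- Reduce(function(x, y) merge(x, y, by = "Type", all = TRUE), renamed)
joined
```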
