fuzzy matching in DNA seqs - r

For the purposes of the reprex I've generated a tibble called random_DNA_tbl containing 10 random DNA sequences (100 bases each). I have a separate tibble, subseq_tbl, with 3 shorter sequences that each match 100% to part of one of the sequences in random_DNA_tbl, but I'd also like to fuzzy-match sequences from subseq_tbl against the other sequences in random_DNA_tbl. I was hoping to use the stringdist_XX_join functions from the fuzzyjoin package, but these don't seem to work, even though the subseq sequences are perfect matches and do work with other matching functions, e.g. regex_XX_join.
library(tidyverse)
library(fuzzyjoin)
random_DNA_tbl <- structure(list(random_name = c("random_seq_1", "random_seq_2",
"random_seq_3", "random_seq_4", "random_seq_5", "random_seq_6",
"random_seq_7", "random_seq_8", "random_seq_9", "random_seq_10"
), random_seq = c("CTCCAGTATTAGTCAATGATAAGGGCGAAGGAGCAGTTCTGATATCTCTGTGAAGTAGCATGCGTCTGACTCTCGGGCGCGGCGGAAGACCGAGGAGCGC",
"TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA",
"GAATAGTGTACCAGGTCTTATAGTATGTTCATTCGTACAAAAGGATCCAAAACCAATAGGAACCGCTTCTCCCAACAAGCCTGCTCCTTGCAGAGTGAGT",
"GTGACGCCAGATTCTTGACCTGAACCCAGTTCTACCCCCCCAAAACGATCTGGCTTCCGCTCTCTAATGACAGCTATATTGCTTGATAGAGATCGGTAGG",
"ACCGCCTTCCGTAGGTGAACAACCAGCCTCCTGCGGCCAGGGAAGAAGTCGTGGCCTTGGTTAATTTTGGGTTACTAAACGGACACCCACCGTGGCTCAC",
"ACGACTATCAAGACAACTTGTCTCAGAGCTTCACGCACCAACCCCTAACCCAGCAACTCCAGGGCATTGCCACTCTATGATTCGGCGCGGGTGCGCCCTC",
"GGTAGCACTGAGATCAGCCACTATCAAGGTGCTCCTCACTTCTGGTTCTCAGGTTGCGGGCCGATCATTTTTCTCCGAATTAGCGGTCTTTCACGTCAGA",
"CACTGAATAGTCAGCGTAAAGGCGTCAATCTGTCAGCTCGACGGCAGAAGATGTCCAGCGTGCAGTTTCATAGGCGCCCCGGGGAACCTTCTGTGAGAAT",
"GCCTCTTAATTCTTGAACCGCGAGAGGACACAGTGAGATCTGTTCCATTTCCCCCGTTGCCCGCATGGATCGCCCAGACTCTAGACTTAGTGTGACCTTT",
"CGGTATCGGATTGGTCTACGAATCCGCGACCCTCAAGGTTATTTCTGGATGGAGTTCCGTGCTCGCCTGGATGCACTGCCCAAGCAATTAGGACGAAGTA"
)), .Names = c("random_name", "random_seq"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
subseq_tbl <- structure(list(subseq_name = c("subseq1", "subseq2", "subseq3"
), subseq = c("TCAACCATCAAGATAAC", "TAGCGGTCTTTCACGT", "AAGGATCC"
)), .Names = c("subseq_name", "subseq"), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
Doesn't work:
stringdist_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
Does work:
regex_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
I've tried tweaking the max_dist parameter in stringdist but to no avail. Can anyone shed any light on the problem please?
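One possible explanation, offered as a sketch rather than a confirmed diagnosis: edit-distance methods such as the ones behind stringdist_XX_join compare whole strings, so a 17-base subsequence and a 100-base sequence are at least 83 edits apart even when one contains the other. The example below uses base R's adist() with partial = TRUE, which scores a pattern against the best-matching substring instead. The sequences and the names seqs, pattern and mutated are made up for illustration and are not from the reprex.

```r
# Toy data (hypothetical, much shorter than the reprex sequences)
seqs    <- c(a = "ACGTACGTTTGACCA", b = "GGGGCCCCAAAATTTT")
pattern <- "TTGACC"   # exact substring of seqs["a"]
mutated <- "TTGTCC"   # same pattern with one substitution

# Whole-string edit distance: dominated by the difference in length
whole <- adist(pattern, seqs)

# Approximate substring distance: pattern vs. best-matching substring
part_exact   <- adist(pattern, seqs, partial = TRUE)
part_mutated <- adist(mutated, seqs, partial = TRUE)

# Keep pairs within a chosen tolerance (1 edit is an arbitrary choice here)
max_dist <- 1
hits <- part_mutated <= max_dist
print(whole)
print(part_exact)
print(part_mutated)
print(hits)
```

With these toy inputs the whole-string distance to "a" is 9 (the length difference), while the partial distance is 0 for the exact pattern and 1 for the mutated one. The same idea could be applied row by row to random_DNA_tbl and subseq_tbl, with the tolerance chosen to suit the data.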

Related

How do I get the names inside specific strings

I have the following vector:
a <- c("teste3/Nova pasta3/texto33.txt", "teste3/texto3.txt", "teste3/Nova pasta3",
"teste3")
In some cases I have a data frame rather than a vector:
structure(list(filename = c("teste1/", "teste1/Nova pasta1/",
"teste1/Nova pasta1/texto11.txt", "teste1/texto1.txt", "teste1/New Folder/"
)), class = "data.frame", row.names = c(NA, -5L))
I would like to get the names that sit between slashes (/*/).
In this case that is just the name (Nova pasta3) for the vector and the name (Nova pasta1) for the dataframe.
Thanks
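A minimal sketch of one way to do this, assuming "between slashes" means a path component with a slash on both sides. The helper name between_slashes is hypothetical. It uses a Perl-style regex with lookarounds, so the slashes themselves are not part of the match.

```r
# Hypothetical helper: first path component enclosed by "/" on both sides
between_slashes <- function(x) {
  m <- regexpr("(?<=/)[^/]+(?=/)", x, perl = TRUE)
  unique(regmatches(x, m))  # elements without a match are dropped
}

a <- c("teste3/Nova pasta3/texto33.txt", "teste3/texto3.txt",
       "teste3/Nova pasta3", "teste3")
from_vector <- between_slashes(a)

d <- data.frame(filename = c("teste1/", "teste1/Nova pasta1/",
                             "teste1/Nova pasta1/texto11.txt",
                             "teste1/texto1.txt", "teste1/New Folder/"))
from_frame <- between_slashes(d$filename)

print(from_vector)
print(from_frame)
```

For the vector this gives "Nova pasta3". For the data frame it gives both "Nova pasta1" and "New Folder", since both sit between slashes; if only "Nova pasta1" is wanted, some further rule would be needed, and the question does not say what that rule is.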

How do I find common characters in a list of dataframes?

I have about 70 dataframes in a list, each of them has a column named SNP. I want to find the common SNPs that exist in all dataframes. This is the code I used:
setwd("~")
library(data.table)
library(purrr)  # provides map() and reduce(), and re-exports %>%
files <- list.files()
dflist <- list()
for (i in seq_along(files)) {
  dflist[[i]] <- fread(files[i])
}
map(dflist, ~ .$SNP) %>%
  reduce(intersect)
However, this returns an empty result:
character(0)
Here is a sample of dflist (dput output):
list(structure(list(`10:103391446` = c("10:115562764:TTTC_",
"10:115562765:TTC_T", "10:14188623_CCTGA_C", "10:15988900:G_GGT"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"
)), structure(list(SNP = c("rs34394051",
"rs11121177", "rs10799615", "rs590013")), row.names = c(NA, -4L
), class = c("data.table", "data.frame")),
structure(list(SNP = c("rs34394051", "rs11121177", "rs10799615",
"rs590013")), row.names = c(NA, -4L), class = c("data.table",
"data.frame")))
Can you help please?
Your problems appear to be two-fold:
One of your frames is missing SNP as a column name. That will often cause problems:
setdiff(mtcars$QUUX, mtcars$cyl)
# NULL
This is not hard to fix (names(dflist[[1]]) <- "SNP"), but does not resolve all of the problems.
Your first frame has completely different-looking data. When I skip the first frame, it works.
map(dflist[-1], ~.$SNP) %>%
reduce(intersect)
# [1] "rs34394051" "rs11121177" "rs10799615" "rs590013"
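With about 70 frames it may be easier to detect the odd ones out programmatically than to skip them by position. The following is a sketch in base R on a made-up three-frame list that mirrors the sample; the names has_snp and common are illustrative only.

```r
# Toy list mirroring the sample: the first frame lacks an SNP column
dflist <- list(
  data.frame(`10:103391446` = c("10:115562764:TTTC_", "10:15988900:G_GGT"),
             check.names = FALSE),
  data.frame(SNP = c("rs34394051", "rs11121177", "rs10799615")),
  data.frame(SNP = c("rs11121177", "rs10799615", "rs590013"))
)

# Flag which frames actually have a column named SNP
has_snp <- vapply(dflist, function(d) "SNP" %in% names(d), logical(1))
missing_idx <- which(!has_snp)

# Intersect the SNP columns of the frames that have one
common <- Reduce(intersect, lapply(dflist[has_snp], `[[`, "SNP"))

print(missing_idx)
print(common)
```

Here missing_idx is 1 and common holds the two SNPs shared by the remaining frames. Whether a flagged frame should be renamed or dropped depends on what its data actually contains.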

how to merge two data.frame and mark matched found or not

I have two data frames, df1 and df2.
df1 and df2 can be built using this code:
df1<-structure(list(Var = c("SEX", "SEXSP", "FEMCBP", "FEMCBPSP",
"RACE", "RACESP", "ETHNIC", "INITVER", "IFCDT", "STDYPART"),
Label = c("Gender:", "If other, please specify:", "If female, please select one of the following:",
"If other, please specify:", "Race:", "If other, please specify:",
"Ethnicity:", "Version of protocol the subject consented to when subject started the study:",
"Date Informed Consent was signed by subject to start the study (DD MMM YYYY):",
"Study Arm:")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
df2<- structure(list(Var2 = c("RACE", "RACESP", "ETHNIC", "IFCDT",
"STDYPART"), Label2 = c("Race:", "If other, please specify:",
"Ethnicity:", "Date Informed Consent was signed by subject to start the study (DD MMM YYYY):",
"Study Arm:")), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to merge the two and mark, for each row of df1, whether a match was found in df2.
What should I do?
df3<-merge(df1, df2, by.x=var, by.y=var2)
and?
After defining your data frames, run the code below. all.x = TRUE means that, after matching on the keys given by by.x and by.y, every row of the left table (x) is kept; rows with no match in df2 get NA in the columns that come from df2.
df <- merge(df1,df2,by.x = "Var",by.y = "Var2",all.x = TRUE)
Create a column which shows if there was a match
df$Matched <- ifelse(!is.na(df$Label2),"Y","N")
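As a small self-contained illustration of the two steps above, here is a sketch on a three-row subset of the data. It also shows an alternative using %in%, which is not part of the original answer; that variant keeps the row order of df1 and does not rely on Label2 being non-missing in df2.

```r
# Reduced versions of df1 and df2 (subset of the rows in the question)
df1 <- data.frame(Var = c("SEX", "RACE", "ETHNIC"),
                  Label = c("Gender:", "Race:", "Ethnicity:"))
df2 <- data.frame(Var2 = c("RACE", "ETHNIC"),
                  Label2 = c("Race:", "Ethnicity:"))

# Left join, then flag rows that found a partner in df2
df <- merge(df1, df2, by.x = "Var", by.y = "Var2", all.x = TRUE)
df$Matched <- ifelse(!is.na(df$Label2), "Y", "N")

# Alternative without a join: test key membership directly
df1$Matched <- ifelse(df1$Var %in% df2$Var2, "Y", "N")

print(df)
print(df1)
```

Note that merge() sorts the result by the key columns by default, so the rows of df may come back in a different order from df1.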

How to remove additional numbers in each cell in a dataframe

I am doing some data analysis in R. I read in a CSV file and would like to remove the 000 groups (e.g. 000,000,000) from each cell. How can I get rid of only the 000 groups? I tried to use grep(), but it dropped rows.
The data frame was posted as a screenshot.
You can try this. I have included dummy data based on your screenshot (and please pay attention to the comment by @andrew_reece):
#Code
df$NewVar <- trimws(gsub('000','',df$VIOLATIONS_RAW),whitespace = ',')
Output:
   VIOLATIONS_RAW      NewVar
1 202,403,506,000 202,403,506
2     213,145,123 213,145,123
3         212,000         212
4 123,000,000,000         123
Some data used:
#Data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")
We could also do this in a more general way, removing a run of 0's of any length:
df$VIOLATIONS_RAW <- trimws(gsub("(?<=,)0+(?=(,|$))", "",
df$VIOLATIONS_RAW, perl = TRUE), whitespace=",")
df$VIOLATIONS_RAW
#[1] "202,403,506" "213,145,123" "212" "123"
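Both approaches above clean up trailing zero groups. As a hedged alternative, here is a sketch that splits each cell on commas and drops every token made up only of zeros, which also covers a zero group in the middle of a cell. The helper name drop_zero_groups and the extra input "202,000,403" are assumptions for illustration and are not from the question.

```r
# Hypothetical helper: remove comma-separated tokens consisting only of zeros
drop_zero_groups <- function(x) {
  parts_list <- strsplit(x, ",", fixed = TRUE)
  vapply(parts_list, function(parts) {
    keep <- !grepl("^0+$", parts)       # TRUE for tokens that are not all zeros
    paste(parts[keep], collapse = ",")
  }, character(1))
}

x <- c("202,403,506,000", "213,145,123", "212,000",
       "123,000,000,000", "202,000,403")  # last value is an invented test case
cleaned <- drop_zero_groups(x)
print(cleaned)
```

Working on tokens rather than on the raw string avoids leaving doubled commas behind and never touches zeros inside a larger code.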
data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")

row.name using `structure` function as c(NA, *integer*)

Does anyone know why when I run this:
row.names(structure(list(speed = c(4, 7), dist = c(2, 22)),
row.names = c(NA, 2L), class = "data.frame"))
I get this:
# "1" "2"
and not c(NA, 2L)? I mean, what exactly does structure do with the row.names argument?
I came across this when I tried to use dput to see the structure of some dataframes. e.g.
dput(cars)
And I noticed the row.names argument in it, which is c(NA, -50L).
c(NA, n) is how data frames internally store the row names in the common case of 1:n, so as to save space and processing time. This internal form is not supposed to be accessible to the user, who should regard the row names as "1", "2", ..., so the accessor functions translate it.
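To see the difference between the stored form and what the accessors report, here is a short sketch. It uses the internal helper .row_names_info() from base R, which is meant for inspection rather than everyday use, and the negative form c(NA, -2L) that dput() emits.

```r
# Build a data frame with the compact row-name form that dput() emits
df <- structure(list(speed = c(4, 7), dist = c(2, 22)),
                row.names = c(NA, -2L), class = "data.frame")

user_view     <- row.names(df)                   # translated for the user
internal_view <- .row_names_info(df, type = 0L)  # raw stored attribute
n_rows        <- .row_names_info(df, type = 2L)  # number of rows implied

print(user_view)
print(internal_view)
print(n_rows)
```

The accessor returns "1" "2", while the raw attribute is still the two-element integer vector NA, -2.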
