For the purposes of the reprex I've generated a tibble called random_DNA_tbl containing 10 random DNA sequences of 100 bases each. I also have a separate tibble called subseq_tbl with 3 shorter sequences, each of which matches 100% to part of one of the sequences in random_DNA_tbl. I'd also like to fuzzy-match sequences from subseq_tbl against other sequences in random_DNA_tbl. I was hoping to use the stringdist_XX_join functions from the fuzzyjoin package, but these don't seem to work, even though the subseq sequences are perfect matches and do work with other matching functions, e.g. regex_XX_join.
library(tidyverse)
library(fuzzyjoin)
random_DNA_tbl <- structure(list(random_name = c("random_seq_1", "random_seq_2",
"random_seq_3", "random_seq_4", "random_seq_5", "random_seq_6",
"random_seq_7", "random_seq_8", "random_seq_9", "random_seq_10"
), random_seq = c("CTCCAGTATTAGTCAATGATAAGGGCGAAGGAGCAGTTCTGATATCTCTGTGAAGTAGCATGCGTCTGACTCTCGGGCGCGGCGGAAGACCGAGGAGCGC",
"TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA",
"GAATAGTGTACCAGGTCTTATAGTATGTTCATTCGTACAAAAGGATCCAAAACCAATAGGAACCGCTTCTCCCAACAAGCCTGCTCCTTGCAGAGTGAGT",
"GTGACGCCAGATTCTTGACCTGAACCCAGTTCTACCCCCCCAAAACGATCTGGCTTCCGCTCTCTAATGACAGCTATATTGCTTGATAGAGATCGGTAGG",
"ACCGCCTTCCGTAGGTGAACAACCAGCCTCCTGCGGCCAGGGAAGAAGTCGTGGCCTTGGTTAATTTTGGGTTACTAAACGGACACCCACCGTGGCTCAC",
"ACGACTATCAAGACAACTTGTCTCAGAGCTTCACGCACCAACCCCTAACCCAGCAACTCCAGGGCATTGCCACTCTATGATTCGGCGCGGGTGCGCCCTC",
"GGTAGCACTGAGATCAGCCACTATCAAGGTGCTCCTCACTTCTGGTTCTCAGGTTGCGGGCCGATCATTTTTCTCCGAATTAGCGGTCTTTCACGTCAGA",
"CACTGAATAGTCAGCGTAAAGGCGTCAATCTGTCAGCTCGACGGCAGAAGATGTCCAGCGTGCAGTTTCATAGGCGCCCCGGGGAACCTTCTGTGAGAAT",
"GCCTCTTAATTCTTGAACCGCGAGAGGACACAGTGAGATCTGTTCCATTTCCCCCGTTGCCCGCATGGATCGCCCAGACTCTAGACTTAGTGTGACCTTT",
"CGGTATCGGATTGGTCTACGAATCCGCGACCCTCAAGGTTATTTCTGGATGGAGTTCCGTGCTCGCCTGGATGCACTGCCCAAGCAATTAGGACGAAGTA"
)), .Names = c("random_name", "random_seq"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
subseq_tbl <- structure(list(subseq_name = c("subseq1", "subseq2", "subseq3"
), subseq = c("TCAACCATCAAGATAAC", "TAGCGGTCTTTCACGT", "AAGGATCC"
)), .Names = c("subseq_name", "subseq"), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
Doesn't work:
stringdist_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
Does work:
regex_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
I've tried tweaking the max_dist parameter in stringdist but to no avail. Can anyone shed any light on the problem please?
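One thing worth checking, offered as a possible explanation rather than a confirmed diagnosis: the stringdist joins compare whole strings and, as far as I know, default to max_dist = 2, whereas a 17-base subsequence is dozens of edits away from a 100-base sequence that contains it. The sketch below uses base R's adist() (not fuzzyjoin itself) on subseq1 and random_seq_2 from the reprex to put the whole-string distance next to the substring distance.

```r
# Values copied from the reprex: subseq1 and random_seq_2
query  <- "TCAACCATCAAGATAAC"
target <- "TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA"

# Whole-string Levenshtein distance: every base of target outside the
# matched region has to be deleted, so each one counts as an edit
full_dist <- as.numeric(adist(query, target))

# Substring distance: best match of query anywhere inside target
partial_dist <- as.numeric(adist(query, target, partial = TRUE))

full_dist     # equals nchar(target) - nchar(query) for an exact substring
partial_dist  # 0 when query occurs verbatim inside target
```

If this is what is happening, max_dist would have to be roughly as large as the length difference before whole-string matching finds anything, which would explain why a substring-aware approach such as the regex join behaves differently.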
I have the following vector:
a <- c("teste3/Nova pasta3/texto33.txt", "teste3/texto3.txt", "teste3/Nova pasta3",
"teste3")
In some cases I have a data frame rather than a vector:
structure(list(filename = c("teste1/", "teste1/Nova pasta1/",
"teste1/Nova pasta1/texto11.txt", "teste1/texto1.txt", "teste1/New Folder/"
)), class = "data.frame", row.names = c(NA, -5L))
I would like to get the names that lie between two slashes (/*/).
In this case that is just Nova pasta3 for the vector and Nova pasta1 for the data frame.
Thanks
I have about 70 dataframes in a list, each of them has a column named SNP. I want to find the common SNPs that exist in all dataframes. This is the code I used:
setwd("~")
library(data.table)
library(purrr)  # provides map(), reduce() and %>%
files <- list.files()
dflist <- list()
for (i in seq_along(files)) {
  dflist[[i]] <- fread(files[i])
}
map(dflist, ~ .$SNP) %>%
  reduce(intersect)
However, this returns an empty result:
character(0)
Here is a sample of dflist:
list(structure(list(`10:103391446` = c("10:115562764:TTTC_",
"10:115562765:TTC_T", "10:14188623_CCTGA_C", "10:15988900:G_GGT"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"
)), structure(list(SNP = c("rs34394051",
"rs11121177", "rs10799615", "rs590013")), row.names = c(NA, -4L
), class = c("data.table", "data.frame")),
structure(list(SNP = c("rs34394051", "rs11121177", "rs10799615",
"rs590013")), row.names = c(NA, -4L), class = c("data.table",
"data.frame")))
Can you help please?
Your problems appear to be two-fold:
One of your frames has no column named SNP, so .$SNP returns NULL for it. That will often cause problems, because set operations on NULL give empty results without any warning:
setdiff(mtcars$QUUX, mtcars$cyl)
# NULL
This is not hard to fix (names(dflist[[1]]) <- "SNP"), but does not resolve all of the problems.
Your first frame has completely different-looking data. When I skip the first frame, it works.
map(dflist[-1], ~.$SNP) %>%
reduce(intersect)
# [1] "rs34394051" "rs11121177" "rs10799615" "rs590013"
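To find such frames in a list of about 70 without inspecting each one by hand, something along these lines may help. It is a minimal sketch in base R; dflist_demo is a hypothetical stand-in for the real dflist, with a first frame that lacks the SNP column.

```r
# Hypothetical stand-in for dflist: the first frame has no SNP column
dflist_demo <- list(
  data.frame(`10:103391446` = c("10:115562764:TTTC_", "10:15988900:G_GGT"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(SNP = c("rs34394051", "rs11121177"), stringsAsFactors = FALSE),
  data.frame(SNP = c("rs34394051", "rs11121177"), stringsAsFactors = FALSE)
)

# Flag the frames that have a column called SNP
has_snp <- vapply(dflist_demo, function(d) "SNP" %in% names(d), logical(1))
which(!has_snp)  # positions of the frames that need attention

# Intersect SNPs across only the frames that do have the column
common <- Reduce(intersect, lapply(dflist_demo[has_snp], function(d) d$SNP))
common
```

Whether the flagged frames should be renamed or dropped depends on what they actually contain, so the filter here is only a way to locate them.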
I have two data frames, df1 and df2, which can be built with the following code:
df1<-structure(list(Var = c("SEX", "SEXSP", "FEMCBP", "FEMCBPSP",
"RACE", "RACESP", "ETHNIC", "INITVER", "IFCDT", "STDYPART"),
Label = c("Gender:", "If other, please specify:", "If female, please select one of the following:",
"If other, please specify:", "Race:", "If other, please specify:",
"Ethnicity:", "Version of protocol the subject consented to when subject started the study:",
"Date Informed Consent was signed by subject to start the study (DD MMM YYYY):",
"Study Arm:")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
df2<- structure(list(Var2 = c("RACE", "RACESP", "ETHNIC", "IFCDT",
"STDYPART"), Label2 = c("Race:", "If other, please specify:",
"Ethnicity:", "Date Informed Consent was signed by subject to start the study (DD MMM YYYY):",
"Study Arm:")), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to merge the two and see which entries of df1 can be found in df2. I would like to get something that looks like this:
What should I do? So far I have:
df3<-merge(df1, df2, by.x=var, by.y=var2)
What comes next?
After defining your data frames, run the code below. Setting all.x = TRUE means that, after matching on the keys given by by.x and by.y, every record from the left table (x) is kept, whether or not it has a match in y:
df <- merge(df1,df2,by.x = "Var",by.y = "Var2",all.x = TRUE)
Then create a column that shows whether there was a match:
df$Matched <- ifelse(!is.na(df$Label2),"Y","N")
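To make the effect of all.x concrete, here is a small sketch on hypothetical miniature versions of the two tables (x and y are made-up names, not the df1 and df2 above). It compares the default merge with the all.x = TRUE merge.

```r
# Hypothetical miniature versions of df1 and df2
x <- data.frame(Var = c("SEX", "RACE", "ETHNIC"),
                Label = c("Gender:", "Race:", "Ethnicity:"),
                stringsAsFactors = FALSE)
y <- data.frame(Var2 = c("RACE", "ETHNIC"),
                Label2 = c("Race:", "Ethnicity:"),
                stringsAsFactors = FALSE)

# Default merge keeps only keys present in both tables
inner <- merge(x, y, by.x = "Var", by.y = "Var2")

# all.x = TRUE keeps every row of x; unmatched rows get NA in y's columns
left <- merge(x, y, by.x = "Var", by.y = "Var2", all.x = TRUE)
left$Matched <- ifelse(!is.na(left$Label2), "Y", "N")

nrow(inner)  # rows with a match only
left         # all rows of x, with the Matched flag
```

The NA that all.x leaves in Label2 is what the Matched column relies on, so this approach assumes Label2 is never NA in the original df2.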
I am doing some data analysis with R. I read in a CSV file and would like to eliminate the 000 entries (e.g. 000,000,000) from each cell. How can I get rid of only the 000 values? I tried to use grep(), but it dropped rows.
This is the dataframe:
You can try this. I have included dummy data based on your screenshot (and please pay attention to the comment from @andrew_reece):
#Code
df$NewVar <- trimws(gsub('000','',df$VIOLATIONS_RAW),whitespace = ',')
Output:
   VIOLATIONS_RAW      NewVar
1 202,403,506,000 202,403,506
2     213,145,123 213,145,123
3         212,000         212
4 123,000,000,000         123
Some data used:
#Data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")
We could also do this in a more general way that removes any zero-only field (any number of 0's) that follows a comma:
df$VIOLATIONS_RAW <- trimws(gsub("(?<=,)0+(?=(,|$))", "",
df$VIOLATIONS_RAW, perl = TRUE), whitespace=",")
df$VIOLATIONS_RAW
#[1] "202,403,506" "213,145,123" "212" "123"
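The difference between the plain gsub and the lookaround pattern only shows up when 000 occurs inside a real number. The sketch below uses a hypothetical input (the value 1000 is not in the original data) to compare the two. It assumes R 3.6.0 or later, where trimws() has the whitespace argument.

```r
# Hypothetical input: the first value contains 000 inside a real number (1000)
x <- c("1000,202,000", "123,000,000,000")

# Plain gsub removes every "000", including the one inside 1000
naive <- trimws(gsub("000", "", x), whitespace = ",")

# Lookaround version only removes all-zero fields that follow a comma
anchored <- trimws(gsub("(?<=,)0+(?=(,|$))", "", x, perl = TRUE),
                   whitespace = ",")

naive     # "1,202" "123"
anchored  # "1000,202" "123"
```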
data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")
Does anyone know why when I run this:
row.names(structure(list(speed = c(4, 7), dist = c(2, 22)),
row.names = c(NA, 2L), class = "data.frame"))
I get this:
# "1" "2"
and not c(NA, 2L)? In other words, what exactly does the row.names argument in structure() do with the value it is given?
I came across this when I tried to use dput to see the structure of some dataframes. e.g.
dput(cars)
And I noticed the row.names argument in it, which is c(NA, -50L).
c(NA, n) is how data frames internally store the row names in the common case of 1:n, so as to save space and processing time. This internal form is not supposed to be visible to the user, who should regard the row names as "1", "2", ..., so the accessor functions translate it.
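To see both views side by side, one option is the base R helper .row_names_info(), which can return the stored form without translating it. This is an illustration added here, not part of the explanation above. It reuses the data frame from the question and the built-in cars data set.

```r
df <- structure(list(speed = c(4, 7), dist = c(2, 22)),
                row.names = c(NA, 2L), class = "data.frame")

# Accessor: always returns the translated, user-facing row names
row.names(df)

# Internal storage of a built-in data set, as also shown by dput(cars)
.row_names_info(cars, type = 0L)

# Number of rows implied by the compact form
.row_names_info(cars, type = 2L)
```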