How do I extract names from specific strings in R?

I have the following vector:
a <- c("teste3/Nova pasta3/texto33.txt", "teste3/texto3.txt", "teste3/Nova pasta3",
"teste3")
In some cases I have a data frame instead of a vector:
structure(list(filename = c("teste1/", "teste1/Nova pasta1/",
"teste1/Nova pasta1/texto11.txt", "teste1/texto1.txt", "teste1/New Folder/"
)), class = "data.frame", row.names = c(NA, -5L))
I would like to get the names that sit between slashes (/*/).
In this case, that is just the name Nova pasta3 for the vector and the name Nova pasta1 for the data frame.
Thanks
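One possible approach, offered as a minimal sketch rather than a definitive answer: match every run of non-slash characters that has a slash on both sides, using lookarounds so the slashes themselves are not consumed. The helper name between_slashes is hypothetical. Note that with the sample data frame this also returns New Folder, which the question does not list, so an extra filter may be needed depending on what is really wanted.

```r
# Hypothetical helper: unique path components enclosed by "/" on both sides
between_slashes <- function(x) {
  # (?<=/) and (?=/) require a slash before and after without consuming it
  hits <- regmatches(x, gregexpr("(?<=/)[^/]+(?=/)", x, perl = TRUE))
  unique(unlist(hits))
}

a <- c("teste3/Nova pasta3/texto33.txt", "teste3/texto3.txt",
       "teste3/Nova pasta3", "teste3")
df <- data.frame(
  filename = c("teste1/", "teste1/Nova pasta1/",
               "teste1/Nova pasta1/texto11.txt", "teste1/texto1.txt",
               "teste1/New Folder/"),
  stringsAsFactors = FALSE
)

res_vec <- between_slashes(a)
res_df <- between_slashes(df$filename)
print(res_vec)
print(res_df)
```

The same function works on a vector and on a data frame column because both are plain character vectors.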

Related

How do I find common characters in a list of dataframes?

I have about 70 dataframes in a list, each of them has a column named SNP. I want to find the common SNPs that exist in all dataframes. This is the code I used:
setwd("~")
library(data.table)
library(purrr)  # provides map(), reduce() and re-exports %>%
files <- list.files()
dflist <- list()
for(i in 1:length(files)){
dflist[[i]] <- fread(files[i])
}
map(dflist, ~.$SNP) %>%
reduce(intersect)
However, this returns an empty result:
character(0)
This is a sample of dflist, as produced by dput():
list(structure(list(`10:103391446` = c("10:115562764:TTTC_",
"10:115562765:TTC_T", "10:14188623_CCTGA_C", "10:15988900:G_GGT"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"
)), structure(list(SNP = c("rs34394051",
"rs11121177", "rs10799615", "rs590013")), row.names = c(NA, -4L
), class = c("data.table", "data.frame")),
structure(list(SNP = c("rs34394051", "rs11121177", "rs10799615",
"rs590013")), row.names = c(NA, -4L), class = c("data.table",
"data.frame")))
Can you help please?
Your problems appear to be two-fold:
1. One of your frames is missing SNP as a column name. That causes problems because $ returns NULL for a column that does not exist:
setdiff(mtcars$QUUX, mtcars$cyl)
# NULL
   This is not hard to fix (names(dflist[[1]]) <- "SNP"), but does not resolve all of the problems.
2. Your first frame has completely different-looking data. When I skip the first frame, it works.
map(dflist[-1], ~.$SNP) %>%
reduce(intersect)
# [1] "rs34394051" "rs11121177" "rs10799615" "rs590013"
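Instead of dropping the first frame by position, the frames that actually have an SNP column can be selected by name. The following is a minimal sketch in base R using a cut-down copy of the sample data; the variable names has_snp and common are illustrative, not from the original post.

```r
# Cut-down copy of the sample list: the first frame lacks an SNP column
dflist <- list(
  data.frame(`10:103391446` = c("10:115562764:TTTC_", "10:115562765:TTC_T"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(SNP = c("rs34394051", "rs11121177", "rs10799615", "rs590013"),
             stringsAsFactors = FALSE),
  data.frame(SNP = c("rs34394051", "rs11121177", "rs10799615", "rs590013"),
             stringsAsFactors = FALSE)
)

# Flag the frames that really contain a column named SNP
has_snp <- vapply(dflist, function(d) "SNP" %in% names(d), logical(1))

# Intersect the SNP columns of the valid frames only
common <- Reduce(intersect, lapply(dflist[has_snp], `[[`, "SNP"))
print(has_snp)
print(common)
```

Printing which(!has_snp) would point to the files whose headers need checking.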

How do I use the list name as part of column name using tidyr?

I have a nested JSON file with a pretty simple structure. The list name is flavor, and there is a nested df one level below. One of the columns is nested further. How can I use the name of the list, "flavor", as a prefix to the column names when I unnest? I would be looking for column names like flavor.id, flavor.name, etc.
I don't have a great reprex, but I'd be looking to use some form of tidyr or purrr. I tried to use purrr::flatten() to no avail.
Sample Reprex
sample <- list(
flavor = structure(list(nested_col = list(structure(list(column = 0L,id = "B30D41F4-5684-11E1-8E9A-8F09EE5110CB"), class = "data.frame", row.names = 1L),
structure(list(id = "B30B5B28-5684-11E1-8E9A-8F09EE5110CB", column = 0L), class = "data.frame", row.names = 1L)),
short_name = c("Bi", "Br"), abbr = c("RR", "CHOC"), long_abbr = c("BXB","BK"), id = c("13", "11"), name = c("Rock n Road","Chocolate")), class = "data.frame", row.names = c(NA,2L)))
I would be looking to extract the list into a tbl_df. The corresponding columns would look something like flavor_nested_col, flavor_short_name, etc.
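As a starting point, the prefixing step on its own can be sketched in base R. This is an illustration under assumptions, not a full answer: the list flavors is a simplified stand-in for the reprex without the nested column, and the helper prefix_with_list_name is hypothetical.

```r
# Simplified stand-in for the reprex: a named list holding one data frame
flavors <- list(
  flavor = data.frame(id = c("13", "11"),
                      name = c("Rock n Road", "Chocolate"),
                      stringsAsFactors = FALSE)
)

# Hypothetical helper: prefix each data frame's columns with its list name
prefix_with_list_name <- function(lst, sep = ".") {
  Map(function(df, nm) setNames(df, paste(nm, names(df), sep = sep)),
      lst, names(lst))
}

prefixed <- prefix_with_list_name(flavors)
print(names(prefixed$flavor))
```

Passing sep = "_" gives the flavor_short_name style instead. The result can then be converted with tibble::as_tibble(), and a still-nested column could be expanded separately, for example with tidyr::unnest() and its names_sep argument.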

How to remove additional numbers in each cell in a dataframe

I am doing some data analysis with R. I read in a csv file. I would like to eliminate 000,000,000 from each cell. How can I get rid of only the 000 groups? I tried to use grep(), but it dropped rows.
This is the dataframe:
You can try this. I have included dummy data based on your screenshot (and please pay attention to the comment from @andrew_reece):
#Code
df$NewVar <- trimws(gsub('000','',df$VIOLATIONS_RAW),whitespace = ',')
Output:
   VIOLATIONS_RAW      NewVar
1 202,403,506,000 202,403,506
2     213,145,123 213,145,123
3         212,000         212
4 123,000,000,000         123
Some data used:
#Data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")
We could also do this in a more general way, removing any run of 0s that makes up a whole comma-separated field:
df$VIOLATIONS_RAW <- trimws(gsub("(?<=,)0+(?=(,|$))", "",
df$VIOLATIONS_RAW, perl = TRUE), whitespace=",")
df$VIOLATIONS_RAW
#[1] "202,403,506" "213,145,123" "212" "123"
data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")
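Both answers above rely on the all-zero fields sitting at the end of the string, since trimws() only strips leading and trailing commas. An alternative sketch that avoids this assumption is to split each cell on commas, drop the fields made up entirely of zeros, and paste the rest back together. The helper name drop_zero_fields is hypothetical.

```r
# Hypothetical helper: remove comma-separated fields that consist only of zeros
drop_zero_fields <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    paste(parts[!grepl("^0+$", parts)], collapse = ",")
  }, character(1))
}

df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")

cleaned <- drop_zero_fields(df$VIOLATIONS_RAW)
# A zero field in the middle, next to a value that merely contains zeros
middle <- drop_zero_fields("1000,000,202")
print(cleaned)
print(middle)
```

Because the test is applied per field, a value such as 1000 is left intact and no doubled commas remain when a zero field appears in the middle.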

row.names set via the `structure` function as c(NA, *integer*)

Does anyone know why when I run this:
row.names(structure(list(speed = c(4, 7), dist = c(2, 22)),
row.names = c(NA, 2L), class = "data.frame"))
I get this:
# "1" "2"
and not c(NA, 2L)? I mean, what exactly does the row.names argument in structure do with the value it is given?
I came across this when I tried to use dput to see the structure of some dataframes. e.g.
dput(cars)
And I noticed the row.names argument in it, which is: c(NA, -50L).
c(NA, n) is how data frames internally store the row names in the common case of 1:n, which saves space and processing time. This compact form is not meant to be visible to users, who should regard the row names as "1", "2", ..., so the accessor functions translate it.
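The translation can be observed directly. The sketch below rebuilds the data frame from the question and compares the accessors with .row_names_info(), a base R function that, with type 0L, is documented to return the stored attribute without expanding it.

```r
df <- structure(list(speed = c(4, 7), dist = c(2, 22)),
                row.names = c(NA, 2L), class = "data.frame")

rn <- row.names(df)               # accessor: expanded to character labels
rn_attr <- attr(df, "row.names")  # attr() also expands the compact form
print(rn)
print(rn_attr)

# Internal representation, not expanded
print(.row_names_info(df, 0L))
print(.row_names_info(cars, 0L))
```

This is why dput() shows the compact form while row.names() never does.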

Fuzzy matching in DNA sequences

For the purposes of the reprex I've generated a tibble called random_DNA_tbl containing a random selection of 10 DNA sequences (of 100 bases each). I've got a separate tibble called subseq_tbl, with 3 shorter sequences that match 100% to 3 of the sequences in random_DNA_tbl, but I'd also like to fuzzy-match sequences from subseq_tbl against other sequences in random_DNA_tbl. I was hoping to be able to use the stringdist_XX_join functions from the fuzzyjoin package; however, these don't seem to work, even though the subseq sequences are actually perfect matches and do work with other matching functions, e.g. regex_XX_join.
library(tidyverse)
library(fuzzyjoin)
random_DNA_tbl <- structure(list(random_name = c("random_seq_1", "random_seq_2",
"random_seq_3", "random_seq_4", "random_seq_5", "random_seq_6",
"random_seq_7", "random_seq_8", "random_seq_9", "random_seq_10"
), random_seq = c("CTCCAGTATTAGTCAATGATAAGGGCGAAGGAGCAGTTCTGATATCTCTGTGAAGTAGCATGCGTCTGACTCTCGGGCGCGGCGGAAGACCGAGGAGCGC",
"TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA",
"GAATAGTGTACCAGGTCTTATAGTATGTTCATTCGTACAAAAGGATCCAAAACCAATAGGAACCGCTTCTCCCAACAAGCCTGCTCCTTGCAGAGTGAGT",
"GTGACGCCAGATTCTTGACCTGAACCCAGTTCTACCCCCCCAAAACGATCTGGCTTCCGCTCTCTAATGACAGCTATATTGCTTGATAGAGATCGGTAGG",
"ACCGCCTTCCGTAGGTGAACAACCAGCCTCCTGCGGCCAGGGAAGAAGTCGTGGCCTTGGTTAATTTTGGGTTACTAAACGGACACCCACCGTGGCTCAC",
"ACGACTATCAAGACAACTTGTCTCAGAGCTTCACGCACCAACCCCTAACCCAGCAACTCCAGGGCATTGCCACTCTATGATTCGGCGCGGGTGCGCCCTC",
"GGTAGCACTGAGATCAGCCACTATCAAGGTGCTCCTCACTTCTGGTTCTCAGGTTGCGGGCCGATCATTTTTCTCCGAATTAGCGGTCTTTCACGTCAGA",
"CACTGAATAGTCAGCGTAAAGGCGTCAATCTGTCAGCTCGACGGCAGAAGATGTCCAGCGTGCAGTTTCATAGGCGCCCCGGGGAACCTTCTGTGAGAAT",
"GCCTCTTAATTCTTGAACCGCGAGAGGACACAGTGAGATCTGTTCCATTTCCCCCGTTGCCCGCATGGATCGCCCAGACTCTAGACTTAGTGTGACCTTT",
"CGGTATCGGATTGGTCTACGAATCCGCGACCCTCAAGGTTATTTCTGGATGGAGTTCCGTGCTCGCCTGGATGCACTGCCCAAGCAATTAGGACGAAGTA"
)), .Names = c("random_name", "random_seq"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
subseq_tbl <- structure(list(subseq_name = c("subseq1", "subseq2", "subseq3"
), subseq = c("TCAACCATCAAGATAAC", "TAGCGGTCTTTCACGT", "AAGGATCC"
)), .Names = c("subseq_name", "subseq"), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
Doesn't work:
stringdist_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
Does work:
regex_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
I've tried tweaking the max_dist parameter in stringdist but to no avail. Can anyone shed any light on the problem please?
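A likely explanation, offered as an assumption rather than a confirmed diagnosis: string-distance measures compare whole strings, so a short subsequence is many edits away from a 100-base sequence even when it occurs inside it verbatim, while a regex match only needs the pattern to occur somewhere. The sketch below illustrates the difference with base R's adist() on the third random sequence and subseq3. It does not use fuzzyjoin itself.

```r
random_seq <- "GAATAGTGTACCAGGTCTTATAGTATGTTCATTCGTACAAAAGGATCCAAAACCAATAGGAACCGCTTCTCCCAACAAGCCTGCTCCTTGCAGAGTGAGT"
subseq <- "AAGGATCC"

# Whole-string edit distance: every base outside the subsequence counts as an edit
whole <- adist(subseq, random_seq)[1, 1]

# Partial (substring) edit distance: the best match anywhere inside random_seq
partial <- adist(subseq, random_seq, partial = TRUE)[1, 1]

print(whole)
print(partial)
```

If this is indeed the cause, max_dist would have to be raised to roughly the length difference before the whole-string join finds anything, at which point nearly everything matches. A substring-aware distance such as the partial mode shown here, or the approximate matching in agrep(), may be a better fit for this task.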

Resources