R: Remove everything between two delimiter characters [duplicate]

I have a data frame with this kind of expression in column C:
GT_rs9628326:N_rs9628326
GT_rs1111:N_rs1111
GT_rs8374:N_rs8374
Using R, I want to remove everything between the first "T" and ":", as well as everything after the "N". I know this can be done with gsub. I would get:
GT:N
GT:N
GT:N

Maybe you can try
gsub("_\\w+","",s)
giving
[1] "GT:N" "GT:N" "GT:N"
Data
s <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
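Since the question mentions a data frame column, the same call can be applied to that column directly. A minimal sketch, assuming a data frame named df (a hypothetical name) whose column C holds the strings:

```r
# Hypothetical data frame; only the column name C comes from the question
df <- data.frame(
  C = c("GT_rs9628326:N_rs9628326", "GT_rs1111:N_rs1111", "GT_rs8374:N_rs8374"),
  stringsAsFactors = FALSE
)

# \w matches letters, digits and "_", so each match starts at "_"
# and stops at ":" or at the end of the string
df$C <- gsub("_\\w+", "", df$C)
df$C
```

This relies on the parts to keep (GT, N) never containing an underscore themselves.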

Another option is to split the strings on :, remove the unnecessary text from each piece, and then paste the pieces back together with the same separator (I have used @ThomasIsCoding's data, thanks):
#Data
v1 <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")
#Code
unlist(lapply(lapply(strsplit(v1, split = ':'),
                     function(x) sub("_[^_]+$", "", x)),
              function(x) paste0(x, collapse = ':')))
Output:
[1] "GT:N" "GT:N" "GT:N"

Using str_remove_all from stringr
library(stringr)
str_remove_all(s, "_\\w+")
#[1] "GT:N" "GT:N" "GT:N"
data
s <- c("GT_rs9628326:N_rs9628326","GT_rs1111:N_rs1111","GT_rs8374:N_rs8374")

Remove the word characters that follow either "T" or "N", using a lookbehind (which requires perl = TRUE). Using @ThomasIsCoding's data.
gsub('(?<=T|N)\\w+', '', s, perl = TRUE)
#[1] "GT:N" "GT:N" "GT:N"

Related

Remove period and text after in list [duplicate]

I have a list
t <- list('mcd.norm_1','mcc.norm_1', 'mcr.norm_1')
How can I convert the list to remove the period and everything after it, so that the list is just
'mcd' 'mcc' 'mcr'
You may try
library(stringr)
lapply(t, function(x) str_split(x, "\\.", simplify = TRUE)[1])
Another possible solution:
library(tidyverse)
t <- list('mcd.norm_1','mcc.norm_1', 'mcr.norm_1')
t %>%
str_remove("\\..*")
#> [1] "mcd" "mcc" "mcr"
This could be another option:
unlist(sapply(t, \(x) regmatches(x, regexec(".*(?=\\.)", x, perl = TRUE))))
[1] "mcd" "mcc" "mcr"
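The dot has to be escaped in all of the patterns above because an unescaped . matches any character. A small base R sketch of the difference, using the same values (the name lst is only for illustration, since t is also the name of the base transpose function):

```r
lst <- list("mcd.norm_1", "mcc.norm_1", "mcr.norm_1")
x <- unlist(lst)

# Unescaped: "." matches any character, so only the first character is removed
unescaped <- sub(".", "", x)

# Escaped: remove the literal period and everything after it
escaped <- sub("\\..*", "", x)
escaped
```

If a list is needed rather than a character vector, as.list(escaped) converts the result back.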

Newbie to R and to programming - merge all values of a vector into a single value

It might be a dumb question, but I'm very new to programming and I'm currently working with R.
I want to transform a vector A = "AT" "GCT" "TCA" into A ="ATGCTTCA".
Can someone help me please?
We can use str_c from stringr
library(stringr)
str_c(A, collapse="")
#[1] "ATGCTTCA"
Or with paste from base R
paste(A, collapse="")
#[1] "ATGCTTCA"
data
A <- c("AT", "GCT", "TCA")
Try this using paste0() over your vector:
#Data
A = c("AT","GCT","TCA")
#Collapse
Anew <- paste0(A,collapse = '')
Anew
Output:
[1] "ATGCTTCA"
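The collapse argument is what joins the elements. The sep argument only separates the different arguments passed to paste, so with a single vector it leaves every element unchanged. A short sketch of the difference (the names with_sep and with_collapse are only for illustration):

```r
A <- c("AT", "GCT", "TCA")

# sep works across arguments: the result is still a vector of length 3
with_sep <- paste(A, sep = "")

# collapse joins the elements of the result into one string
with_collapse <- paste(A, collapse = "")
with_collapse
```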

Get data out of "c(\"a\", \"b\")" format

I have a string "c(\"AV\", \"IM\")", which I'm trying to transform into the string "AV IM".
My issue is that I can't unlist() or flatten() this, as it's a character, and neither paste() nor stringr::str_c() work, since it's technically still 1 character value.
Any ideas how I can do this?
Tidyverse solutions preferred, if possible.
EDIT: I know this can be solved via regex, but I feel like this is more a "fundamental" problem to be solved string-level than it is a regex problem, if that makes any sense.
Not sure how you got here, but this as presented would be an eval/parse situation. However, as noted in many other answers on this site, there's almost always a better way of preparing your data so you end up in a more R-friendly form. See, for starters, What specifically are the dangers of eval(parse(...))?.
> a <- "c(\"AV\", \"IM\")"
> (b <- eval(parse(text=a)))
[1] "AV" "IM"
> paste(b, collapse=" ")
[1] "AV IM"
You can also consider using a regular expression to remove all punctuation and the leading c.
s <- "c(\"AV\", \"IM\")"
s_vec <- strsplit(s, split = ",")[[1]]
gsub("[[:punct:]]|^c", "", s_vec)
# [1] "AV" " IM"
It is not quite clear how you got here. You can use eval-parse, though it is not vectorized and it is also slow. A regular expression is therefore a better fit:
a <- "c(\"AV\", \"IM\")"
stringr::str_extract_all(a,"\\w+(?!\\()")
[[1]]
[1] "AV" "IM"
Other answers output a vector. My understanding is you want a space-delimited list of your strings.
library(dplyr)
a <- "c(\"AV\", \"IM\")"
a %>%
gsub("c(", "", ., fixed=TRUE) %>%
gsub("\"", "", ., fixed=TRUE) %>%
gsub(",", "", ., fixed=TRUE) %>%
gsub(")", "", ., fixed=TRUE)
Output
"AV IM"
EDIT Or simply (from @www's answer):
a %>%
gsub("[[:punct:]]|^c", "", .)
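If there are several such strings, for example in a column, the extraction can be done for all of them without eval(parse(...)). A base R sketch, assuming every value is wrapped in double quotes and contains no escaped quotes (the vector x and its second element are made up for illustration):

```r
x <- c("c(\"AV\", \"IM\")", "c(\"XY\")")

# Pull out every double-quoted token from each string
tokens <- regmatches(x, gregexpr("\"[^\"]*\"", x))

# Drop the quotes and join the tokens of each string with a space
out <- vapply(
  tokens,
  function(tk) paste(gsub("\"", "", tk, fixed = TRUE), collapse = " "),
  character(1)
)
out
```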

Convert a string with dot and comma to numeric [duplicate]

Is there a way to convert a string to a numeric type?
For instance:
> as.numeric("1.560,65")
[1] NA
Warning message:
NAs introduced by coercion
I get NA with the above warning.
I need the thousands to be separated by a dot (i.e. 1.560) and the decimals to be separated by a comma.
> as.numeric("1.560")
[1] 1.56
> as.numeric("1.560")>2
[1] FALSE
In the above example, I want R to read 1.560 as one thousand five hundred sixty, but it converts it to 1.56, which is lower than 2, and thus my computations are wrong.
Any help is much appreciated.
You have to use a regular expression to rewrite your strings into a format that R understands, and then convert the result to numeric:
as.numeric(gsub(",", "\\.", gsub("\\.","", "1.560,65")))
[1] 1560.65
For numeric formatting see formatC
formatC(1560.65, format = "f", big.mark = ".", decimal.mark = ",")
[1] "1.560,6500"
String pattern matching and replacement can be done with the gsub function. Here is an example for your case:
str_numbers <- c("1.560,65", "134,2","123","0,32")
as.numeric(gsub(",", "\\.", gsub("\\.", "", str_numbers)))
The inner call removes every . and the outer call replaces the , with a . Step by step:
> (tmp <- gsub("\\.", "", str_numbers))
[1] "1560,65" "134,2" "123" "0,32"
> gsub(",", "\\.", tmp)
[1] "1560.65" "134.2" "123" "0.32"
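The two replacements can be wrapped in a small helper. This is only a sketch: the function name to_numeric_eu is made up here, and it assumes that . is always a thousands separator and , is always the decimal mark. Using fixed = TRUE avoids having to escape the dot at all:

```r
# Hypothetical helper: convert strings like "1.560,65" to numeric
to_numeric_eu <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)   # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)   # decimal comma -> decimal point
  as.numeric(x)
}

str_numbers <- c("1.560,65", "134,2", "123", "0,32")
nums <- to_numeric_eu(str_numbers)
nums

# The comparison from the question now behaves as intended
to_numeric_eu("1.560") > 2
```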

R: truncate strings to a word

I'm new to R, and trying to use it to shorten each entry in the headers of a spreadsheet to a single word. For example:
Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100);
Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);
So I would like to shorten the taxon to a single word without the numbers: like Clostridia and Mollicutes. I think it can be done, but can't figure how.
Thanks.
We can use sub
sub("\\(.*", "", "Firmicutes(100)")
Suppose we read the data into R using read.csv/read.table with check.names=FALSE; then we can apply the same code to the column names
colnames(data) <- sub("\\(.*", "", colnames(data))
If it is a single string
library(stringr)
str1 <- "Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100)"
str_extract_all(str1, "[^()0-9;]+")[[1]]
#[1] "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Lachnospiraceae"
#[6] "unclassified"
Update
Suppose we need to extract the third word, i.e. "Clostridia"
sub("^([^(]+[(][^;]+;){2}(\\w+).*", "\\2", str1)
#[1] "Clostridia"
Using only base commands, the names can be extracted with this code:
nam <- c("Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);")
nam <- strsplit(nam, ";")[[1]]
nam <- unname(sapply(nam, FUN=function(x) sub("\\(.*", "", x)))
nam
[1] "Bacteria" "Tenericutes" "Mollicutes" "Mollicutes_RF9" "unclassified" "unclassified"
Is this what you need? Or did I completely misunderstand?
gsub('\\(.*\\)', '', unlist(strsplit(x, ';'))[3])
#[1] "Clostridia"
where x is your column name
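The split-then-strip idea from the answers above can be applied to several headers at once. A minimal base R sketch, where nth_taxon is a made-up helper name and the taxonomic level wanted is assumed to be given by its position n in the ;-separated string:

```r
# Hypothetical helper: take the n-th ";"-separated field of each header
# and drop the "(...)" suffix
nth_taxon <- function(x, n) {
  parts <- strsplit(x, ";", fixed = TRUE)
  vapply(parts, function(p) sub("\\(.*", "", p[n]), character(1))
}

headers <- c(
  "Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100);",
  "Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);"
)
third <- nth_taxon(headers, 3)
third
```

With real data, something like colnames(data) <- nth_taxon(colnames(data), 3) would rename the columns, though duplicated names are possible if several headers share the same taxon.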

Resources