Replacing multiple punctuation marks in a column of data - r

Column in a df:
chr10:123453:A:C
chr10:2345543:TTTG:CG
chr10:3454757:G:C
chr10:4567875765:C:G
Desired output:
chr10:123453_A/C
chr10:2345543_TTTG/CG
chr10:3454757_G/C
chr10:4567875765_C/G
I think I could use strsplit, but I wanted to try to do it all in an R one-liner. Any ideas would be greatly welcome!

Try this:
gsub(":([A-Z]+):([A-Z]+)$", "_\\1/\\2", x, perl = TRUE)
[1] "chr10:123453_A/C" "chr10:2345543_TTTG/CG"
Here we use backreferences twice: \\1 recalls what's between the penultimate and the last :, whereas \\2 recalls what's after the last :.
Data:
x <- c("chr10:123453:A:C","chr10:2345543:TTTG:CG")
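To apply this to a data frame column rather than a bare vector, something like the following should work. It is only a sketch: the data frame name df and the column name id are made up for illustration, since the question does not give them.

```r
# Hypothetical data frame and column name (not given in the question)
df <- data.frame(
  id = c("chr10:123453:A:C", "chr10:2345543:TTTG:CG",
         "chr10:3454757:G:C", "chr10:4567875765:C:G"),
  stringsAsFactors = FALSE
)

# Rewrite only the last two ":"-separated fields; the first ":" is left alone
df$id <- sub(":([A-Z]+):([A-Z]+)$", "_\\1/\\2", df$id)
print(df$id)
```

sub is enough here because the pattern is anchored with $ and so can match at most once per string.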

Related

Remove first "." from values in R

I have a dataset with different values in R. Some values are like 11.474 and others like 1.034.496 in the same column. I would like to change the values with two dots from 1.034.496 to 1034.496. Is there anyone who could help me please?
Thanks for the help!
Use gsub with Perl regexes:
df <- data.frame(a = c('11.474', '1.034.496', '1.234.034.496'))
df$a = gsub('[.](?=.*[.])', '', df$a, perl = TRUE)
print(df)
## a
## 1 11.474
## 2 1034.496
## 3 1234034.496
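Here, [.](?=.*[.]) matches a literal dot (which has to be escaped like so: \. or put into a character class like so: [.]) only when another literal dot occurs somewhere later in the string. That condition is checked by the positive lookahead (?=PATTERN), which does not consume any characters. The last dot is therefore never matched and is kept.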
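To see which dots the lookahead actually selects, one can inspect the match positions before doing the replacement. This is a small sketch; the variable names v, pat, hits, cleaned and nums are chosen here for illustration.

```r
v <- c("11.474", "1.034.496", "1.234.034.496")
pat <- "[.](?=.*[.])"

# Positions of the dots that the pattern matches in the last string:
# the final dot is not matched because no further dot follows it
hits <- as.integer(gregexpr(pat, v[3], perl = TRUE)[[1]])
print(hits)

# Remove the matched dots, then convert the cleaned strings to numbers
cleaned <- gsub(pat, "", v, perl = TRUE)
nums <- as.numeric(cleaned)
print(cleaned)
```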
There are probably smarter regex approaches than the one below, but here is my attempt:
> ifelse(lengths(gregexpr("\\.",v))>1,sub("\\.","",v),v)
[1] "11.474" "1034.496"
where
v <- c("11.474","1.034.496")

removing numeric character from data

I want to remove the number 0 from all the names in my data.frame.
I have tried to do it myself, however, working with strings is a first time for me.
I have tried:
gsub('\0', '', df )
reproducible code:
df <- c("y2016.09", "y2010.05", "y2010.06", "y2010.07", "y2010.08",
"y2010.09")
expected output
y2016.9
y2010.5
y2010.6
y2010.7
y2010.8
y2010.9
We can match the literal dot (. is a regex metacharacter that matches any character, so it is escaped as \\. to match it literally) followed by zero or more 0s (0*). The replacement is ., which puts back the dot that was consumed by the match:
sub("\\.0*", ".", df)
#[1] "y2016.9" "y2010.5" "y2010.6" "y2010.7" "y2010.8" "y2010.9"
Here is another regex solution using lookarounds, though it is not as simple as the one by @akrun:
> gsub("(?<=\\.)0+","",df,perl = TRUE)
[1] "y2016.9" "y2010.5" "y2010.6" "y2010.7" "y2010.8" "y2010.9"
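Since the question asks about the names of a data frame, the same substitution can be applied to names() directly. The sketch below uses a made-up data frame d; it also includes a name ending in .10 to show that a zero which does not come right after the dot is left alone.

```r
# Hypothetical data frame whose column names carry zero-padded months
d <- data.frame(y2016.09 = 1, y2010.05 = 2, y2010.10 = 3)

# Drop zeros that directly follow the dot; "10" is untouched because
# its zero does not come right after the dot
names(d) <- sub("\\.0*", ".", names(d))
print(names(d))
```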

Extract text with gsub

I am setting up an automated data analysis procedure and, more or less at the end of the procedure, I would like to extract automatically the name of the file that has been analysed. I have a data frame with a column containing names, with the following style:
Baseline/Cell_Line_2_KB_1813_B_Baseline
Dose 0001/Cell_Line_3_KB1720_1_0001
Dose 0010/Cell_Line_1_KB1810 mat_0010
I would like to extract just the following parts into a separate column: "KB_1813_B", "KB1720_1" and "KB1810 mat".
I used gsub with the following command:
df$column.with.names <- gsub(".*KB|_.*", "KB", df$column.with.new.names)
I could easily remove the first part of the problem, but I am stuck trying to remove the second part. Is there some command in gsub to remove everything, starting from the end of the name, until you encounter a special character ( "_" in my case)?
Thank you :)
We can use str_extract
library(stringr)
str_extract(df$column.with.new.names, "KB_*\\d+[_ ]*[^_]*")
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
Or the same pattern can be captured as a group with sub
sub(".*(KB_*\\d+[_ ]*[^_]*).*", "\\1", df$column.with.new.names)
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
data
df <- data.frame(column.with.new.names = c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010"), stringsAsFactors = FALSE)
The way to do this is using regex groups:
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
"Dose 0001/Cell_Line_3_KB1720_1_0001",
"Dose 0010/Cell_Line_1_KB1810 mat_0010")
gsub("^.+Cell_Line_._(.+)_.+$", "\\1", x)
[1] "KB_1813_B" "KB1720_1" "KB1810 mat"
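For completeness, the str_extract pattern above can also be used without stringr, via base R's regexpr and regmatches. This is a sketch of that variant; the names m and ids are chosen here for illustration.

```r
x <- c("Baseline/Cell_Line_2_KB_1813_B_Baseline",
       "Dose 0001/Cell_Line_3_KB1720_1_0001",
       "Dose 0010/Cell_Line_1_KB1810 mat_0010")

# regexpr() locates the first match in each string,
# regmatches() pulls out the matched text
m <- regexpr("KB_*\\d+[_ ]*[^_]*", x, perl = TRUE)
ids <- regmatches(x, m)
print(ids)
```

Note that regmatches silently drops elements without a match, so the result can be shorter than the input if some names do not contain the pattern.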

Using gsub or sub function to only get part of a string?

Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456
I have a column which has 75 rows of variables such as the col above. I am not quite sure how to use gsub or sub in order to get up until the integers after the first colon.
Expected output:
Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456
I tried this but it doesn't seem to work:
gsub("*..:","", df$col)
The following may help you here.
sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
Output will be as follows.
> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456b"
Where Input for data frame is as follows.
dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)
Explanation: the call is split across lines below purely for annotation; it is not meant to be run in this form.
sub(" ## sub() replaces only the first match in each element (gsub() is the global variant).
([^:]*) ## The parentheses capture what they match as group 1, which can be reused later: everything up to the first colon.
: ## A literal colon.
([^:]*) ## Capture group 2: everything after that colon, up to the next colon.
.*" ## .* matches whatever is left in the string, from the second colon onwards.
,"\\1:\\2" ## The replacement: group 1, a colon, then group 2.
,df$dat) ## The input: the dat column of the data frame df.
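The two capture groups can be made visible with regexec and regmatches, which return the full match followed by each group. This is just an illustrative sketch on two of the input values; the names parts and res are chosen here.

```r
dat <- c("WBU-ARGU*06:03:04", "WBU-ARFU*03:456b")

# For each element: full match first, then group 1, then group 2
parts <- regmatches(dat, regexec("([^:]*):([^:]*)", dat))
print(parts[[1]])

# The substitution keeps group 1, a colon and group 2, and drops the rest
res <- sub("([^:]*):([^:]*).*", "\\1:\\2", dat)
print(res)
```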
You may use
df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)
Details
(\\d:\\d+) - Capturing group 1 (its value will be accessible via \1 in the replacement pattern): a digit, a colon and 1+ digits.
: - a colon
\\d+ - 1+ digits
$ - end of string.
R Demo:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternative approach:
df$col <- sub("^(.*?:\\d+).*", "\\1", df$col)
Here,
^ - start of string
(.*?:\\d+) - Group 1: any 0+ chars, as few as possible (due to the lazy *? quantifier), then : and 1+ digits
.* - the rest of the string.
However, this pattern should be used with the PCRE regex engine, so pass perl=TRUE:
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Alternatively match what you want (instead of subbing out what you don't want) with stringi:
stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")
Slightly more concise stringr:
stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")

Regexpr not working as expected

For the following string <10.16;13.05) I want to match only the first number (sometimes the first number does not exist, i.e. <;13.05)). I used the following regular expression:
grep("[0-9]+\\.*[0-9]*(?=;)","<10.16;13.05)",value=T,perl=T)
However, the result is not "10.16" but "<10.16;13.05)". Could anyone please help me with this one? Thanks.
You could also use strsplit here with minimum regex, i.e.
x <- '<10.16;13.05)'
as.numeric(gsub('<(.*)', '\\1', unlist(strsplit(x, ';', fixed = TRUE))[1]))
#[1] 10.16
x <- '<;13.05)'
as.numeric(gsub('<(.*)', '\\1', unlist(strsplit(x, ';', fixed = TRUE))[1]))
#[1] NA
I believe you are using the wrong regex function. grep only reports which elements contain a match (or, with value = TRUE, returns those whole elements); it does not extract the matched part.
Try instead
regmatches("<10.16;13.05)", regexpr("\\d*\\.\\d*", "<10.16;13.05)"))
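The lookahead from the question can be combined with regexpr and regmatches so that only a number directly before the ; is extracted, and a missing first number yields NA instead of picking up the second number. The following is a sketch under that reading of the question; the pattern is a slight variation of the asker's, and the names x, m and out are chosen here.

```r
x <- c("<10.16;13.05)", "<;13.05)")

# Digits, optionally followed by a dot and more digits, directly before ";"
m <- regexpr("[0-9]+\\.?[0-9]*(?=;)", x, perl = TRUE)

# regmatches() returns only the matches, so fill them into an NA vector
# to keep the result aligned with the input
out <- rep(NA_character_, length(x))
out[m > 0] <- regmatches(x, m)
print(out)

first_num <- as.numeric(out)
```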
