How to extract specific characters using str_extract() in R

Context
I have a character vector a.
I want to extract the text between the last slash (/) and the .nc using the str_extract() function.
I have tried like this: str_extract(a, "(?=/).*(?=.nc)"), but failed.
Question
How can I get the text between the last slash and .nc in character vector a?
Reproducible code
a = c(
'data/temp/air/pm2.5/pm2.5_year_2014.nc',
'data/temp/air/pm10/pm10_year_2014.nc',
'efcv/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe.nc'
)
# My solution (failed)
str_extract(a, "(?=/).*(?=.nc)")
# [1] "/temp/air/pm2.5/pm2.5_year_2014"
# [2] "/temp/air/pm10/pm10_year_2014"
# [3] "/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe"
# The expected output should look like this:
# [1] "pm2.5_year_2014"
# [2] "pm10_year_2014"
# [3] "ss_fef_10233_dfdfe"

Here is a regex replacement approach:
a = c(
'data/temp/air/pm2.5/pm2.5_year_2014.nc',
'data/temp/air/pm10/pm10_year_2014.nc',
'efcv/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe.nc'
)
output <- gsub(".*/|\\.[^.]+$", "", a)
output
[1] "pm2.5_year_2014" "pm10_year_2014" "ss_fef_10233_dfdfe"
Here is the regex logic:
.*/ match everything from the start of the string until the last /
| OR
\.[^.]+$ match everything from final dot until the end of the string
Then we replace these matches by empty string to remove them, leaving behind the filenames.
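The original attempt fails for two reasons: (?=/) is satisfied at the first slash and the greedy .* then runs across all later slashes, and the unescaped . in .nc matches any character. As a minimal sketch of an extraction-style alternative (base R here; the same pattern should also work in str_extract(), though that is not tested in this sketch), [^/]+ keeps the match inside the last path component:

```r
a <- c(
  'data/temp/air/pm2.5/pm2.5_year_2014.nc',
  'data/temp/air/pm10/pm10_year_2014.nc',
  'efcv/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe.nc'
)

# [^/]+ cannot cross a slash, so the match stays within one path component.
# (?=\\.nc$) requires a literal ".nc" at the end of the string without
# including it in the match; lookaheads need perl = TRUE in base R.
res <- regmatches(a, regexpr("[^/]+(?=\\.nc$)", a, perl = TRUE))
res
```

Note that regmatches() silently drops elements that do not match, so this assumes every path ends in .nc.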

Related

How do I use regular expressions to match a substring?

I want to change the rownames of cov_stats so that they contain only a substring of the FileName column values. I only want to retain the string that begins with "SRR" followed by 8 digits (e.g., SRR18826803).
cov_list <- list.files(path="./stats/", full.names=T)
cov_stats <- rbindlist(sapply(cov_list, fread, simplify=F), use.names=T, idcol="FileName")
rownames(cov_stats) <- gsub("^\.\/\SRR*_\stats.\txt", "SRR*", cov_stats[["FileName"]])
Second attempt
rownames(cov_stats) <- gsub("^SRR[:digit:]*", "", cov_stats[["FileName"]])
Original strings
> cov_stats[["FileName"]]
[1] "./stats/SRR18826803_stats.txt" "./stats/SRR18826804_stats.txt"
[3] "./stats/SRR18826805_stats.txt" "./stats/SRR18826806_stats.txt"
[5] "./stats/SRR18826807_stats.txt" "./stats/SRR18826808_stats.txt"
Desired substring output
[1] "SRR18826803" "SRR18826804"
[3] "SRR18826805" "SRR18826806"
[5] "SRR18826807" "SRR18826808"
Would this work for you?
library(stringr)
stringr::str_extract(cov_stats[["FileName"]], "SRR.{0,8}")
You can use
rownames(cov_stats) <- sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats[["FileName"]])
See the regex demo. Details:
^ - start of string
\./stats/ - ./stats/ string
(SRR\d{8}) - Group 1 (\1): SRR string and then eight digits
.* - the rest of the string till its end.
Note that sub is used (not gsub) because there is only one expected replacement operation in the input string (since the regex matches the whole string).
See the R demo:
cov_stats <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt", "./stats/SRR18826805_stats.txt", "./stats/SRR18826806_stats.txt", "./stats/SRR18826807_stats.txt")
sub("^\\./stats/(SRR\\d{8}).*", "\\1", cov_stats)
## => [1] "SRR18826803" "SRR18826804" "SRR18826805" "SRR18826806" "SRR18826807"
An equivalent extraction stringr approach:
library(stringr)
rownames(cov_stats) <- str_extract(cov_stats[["FileName"]], "SRR\\d{8}")
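For comparison, here is a minimal base R sketch of the same extraction without stringr. The vector files is a stand-in for cov_stats[["FileName"]], and [0-9]{8} is used in place of \\d{8}; both are choices made for this illustration.

```r
# Stand-in for cov_stats[["FileName"]] (values taken from the question)
files <- c("./stats/SRR18826803_stats.txt", "./stats/SRR18826804_stats.txt")

# regexpr() finds the first match in each string; regmatches() extracts it.
ids <- regmatches(files, regexpr("SRR[0-9]{8}", files))
ids
```

One difference to keep in mind: str_extract() returns NA for strings that do not match, whereas regmatches() drops them, so the result can be shorter than the input.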

Removing brackets in a string without the content

I would like to rearrange my data. It consists only of names, but some of them contain brackets. I would like to get rid of the brackets while keeping their content, and end up with two names for each such entry.
For example
df <- c ("Do(i)lfal", "Do(i)lferl", "Steff(l)", "Steffe", "Steffi")
I would like to have at the end
df <- c( "Doilfal", "Dolfal", "Doilferl", "Dolferl", "Steff", "Steffl", "Steffe", "Steffi")
I tried
sub("(.*)(\\([a-z]\\))(.*)$", "\\1\\2, \\1\\2\\3", df)
But it does not quite work.
Thank you very much
df = gsub("[\\(\\)]", "", df)
You made two small mistakes:
For the first name you want \1\2\3, because you want all the letters. It is for the second name that you want \1\3 (skipping the bracketed letter).
You placed the parentheses themselves, (i), inside the capture group, so they are captured too. Place the capture group only around the letter inside the parentheses.
A small change to your regex does it:
sub("(.*)\\(([a-z])\\)(.*)$", "\\1\\2\\3, \\1\\3", df)
You can use
df <- c ("Do(i)lfal", "Do(i)lferl", "Steff(l)", "Steffe", "Steffi")
unlist(strsplit( paste(sub("(.*?)\\(([a-z])\\)(.*)", "\\1\\2\\3, \\1\\3", df), collapse=","), "\\s*,\\s*"))
# => [1] "Doilfal"
# [2] "Dolfal"
# [3] "Doilferl"
# [4] "Dolferl"
# [5] "Steffl"
# [6] "Steff"
# [7] "Steffe"
# [8] "Steffi"
See the online R demo and the first regex demo. Details:
First, the sub is executed with the first regex, (.*?)\(([a-z])\)(.*) that matches
(.*?) - any zero or more chars as few as possible, captured into Group 1 (\1)
\( - a ( char
([a-z]) - Group 2 (\2): any ASCII lowercase letter
\) - a ) char
(.*) - any zero or more chars as many as possible, captured into Group 3 (\3)
Then, the results are pasted together with a , char as the collapsing char
Then, the resulting char vector is split with the \s*,\s* regex that matches a comma enclosed with zero or more whitespace chars.
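The three steps described above can be run one at a time to inspect the intermediate values. This is only a sketch; the names expanded, joined and result are introduced here for illustration.

```r
df <- c("Do(i)lfal", "Do(i)lferl", "Steff(l)", "Steffe", "Steffi")

# Step 1: turn each bracketed name into "with letter, without letter";
# names without brackets are left unchanged.
expanded <- sub("(.*?)\\(([a-z])\\)(.*)", "\\1\\2\\3, \\1\\3", df)

# Step 2: join everything into one comma-separated string.
joined <- paste(expanded, collapse = ",")

# Step 3: split on commas, ignoring any surrounding whitespace.
result <- unlist(strsplit(joined, "\\s*,\\s*"))
result
```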

How can I paste a comma (,) in a string of numbers in R Statistics?

I'm quite new to R. I have a vector containing several numbers, and I want to put a comma between the first and second digit of each number.
x gives this result:
[8] -8196110 -7681989 -8042092 -8196660 -7606310 -7217828 -7634887
[15] -7401244 -7211947 -7636932 -7606444 -7598894 -7398965
My question is how to automatically put a comma between the first and second digits of every number. The desired output would be:
[1] -8,385772 -7,390682 -8,019960 -8,300000 -8,069984 -8,786782 -7,414995
[8] -8,196110 -7,681989 -8,042092 -8,196660 -7,606310 -7,217828 -7,634887
[15] -7,401244 -7,211947 -7,636932 -7,606444 -7,598894 -7,398965
We can use sub to capture the optional minus sign and the first digit at the start (^) of the string, and replace them with the backreference (\\1) followed by a comma:
sub("^(-?\\d)", "\\1,", x)
-output
[1] "-8,196110" "-7,681989" "-8,042092" "-8,196660" "-7,606310" "-7,217828" "-7,634887" "-7,401244" "-7,211947" "-7,636932" "-7,606444" "-7,598894" "-7,398965"
data
x <- c(-8196110, -7681989, -8042092, -8196660, -7606310, -7217828,
-7634887, -7401244, -7211947, -7636932, -7606444, -7598894, -7398965
)
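One caveat with the sub() approach: it relies on R's implicit numeric-to-character conversion, which can switch to scientific notation for some values (for example, as.character(100000) gives "1e+05"). A minimal sketch, assuming plain digits are wanted, is to format the numbers explicitly first. The vector y is made up for this illustration.

```r
# Hypothetical values; 100000 would otherwise be converted to "1e+05"
y <- c(-8196110, 100000)

# Force fixed notation and drop the padding that format() adds by default.
y_chr <- format(y, scientific = FALSE, trim = TRUE)

# Same replacement as above, now applied to the formatted strings.
with_comma <- sub("^(-?\\d)", "\\1,", y_chr)
with_comma
```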
We can use strsplit to split each number into its individual characters, then pass that list into an sapply call that inserts a comma after the first digit (skipping a leading minus sign, if there is one):
x_split = strsplit(as.character(x), split = '')
# i is the position of the first digit: 2 when there is a leading minus sign, otherwise 1
sapply(x_split, function(k){i = 1 + (k[1] == '-'); paste0(c(k[1:i], ',', k[(i+1):length(k)]), collapse = '')})

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
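As a small sketch of how the lookbehind result can be used further, the two pieces of each element can be collected into separate vectors. The names parts, codes and labels are introduced here for illustration.

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# (?<=\\d) asserts that a digit precedes the space without consuming it,
# so only the space itself is removed by the split.
parts <- strsplit(sample.text, split = "(?<=\\d) ", perl = TRUE)

# Pull the first and second piece of every element into their own vectors.
codes  <- vapply(parts, `[`, character(1), 1)
labels <- vapply(parts, `[`, character(1), 2)
```

This assumes every element contains exactly one space that follows a digit, as in the sample data.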
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
     [,1]     [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116"   "Soybeans"

Regex expression starting from a certain character

Example: "example._AL(5)._._4500_GRE/Jan_2018"
I am trying to extract text from the above string, which contains parentheses. I want to extract everything starting from AL.
Output should look like: "AL(5)._._4500_GRE/Jan_2018"
It is not entirely clear what we can assume is known, so here are a few variations that make different assumptions.
1) word( This removes everything prior to the first word followed by a parenthesis.
"^" matches the start of string
".*?" is the shortest match of anything provided we still match rest of regex
"\\w+" matches a word
"\\(" matches a left paren
(...) forms a capture group which the replacement string can refer to as "\\1"
Code
x <- "example.AL(5)._._4500_GRE/Jan_2018"
sub("^.*?(\\w+\\()", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
1a) or matching a word followed by ( followed by anything and extracting that:
library(gsubfn)
strapplyc(x, "\\w+\\(.*", simplify = TRUE)
## [1] "AL(5)._._4500_GRE/Jan_2018"
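If adding the gsubfn dependency is not desirable, the same extraction can be sketched in base R with regexpr() and regmatches(). This is an alternative to 1a), not part of the original answer.

```r
x <- "example.AL(5)._._4500_GRE/Jan_2018"

# Match a word followed by "(" and everything after it, then extract the match.
out <- regmatches(x, regexpr("\\w+\\(.*", x))
out
```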
2) AL( or if we know that the word is AL then:
sub("^.*?(AL\\(.*)", "\\1", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
3) remove up to 1st dot or if we know that the part to be removed is the part before and including the first dot:
sub("^.*?\\.", "", x)
## [1] "AL(5)._._4500_GRE/Jan_2018"
4) dot separated fields If the format of the input is dot-separated fields we can parse them all out at once like this:
read.table(text = x, sep = ".", as.is = TRUE)
##        V1    V2 V3                 V4
## 1 example AL(5)  _ _4500_GRE/Jan_2018

Resources