How to add a character to a string in R [duplicate]

This question already has answers here:
Insert a character at a specific location in a string
(8 answers)
Closed 6 years ago.
I have something like this:
text <- "abcdefg"
and I want something like this:
"abcde.fg"
How can I achieve this without assigning a new string to the vector text, but instead by changing the element of the vector itself? Ultimately, I would like to insert the character at a random position, and the inserted character would not be a dot but an element of a character vector.

We can use sub to capture the first 5 characters as a group ((.{5})), followed by zero or more characters in another capture group ((.*)), and then replace the match with the backreference to the first group (\\1), followed by a ., followed by the second backreference (\\2).
sub("(.{5})(.*)", "\\1.\\2", text)
#[1] "abcde.fg"
NOTE: This solution is direct and doesn't need to paste anything together.

Also, substring with paste will work:
paste(substring(text, c(1,6), c(5,7)), collapse=".")
[1] "abcde.fg"
The substring function accepts vector start-stop arguments and "splits" the string at the desired locations. We can then paste these elements together using the collapse argument.
Without relying on the vector arguments, we could use the newer and recommended substr function:
paste(c(substr(text, 1, 5), substr(text, 6,7)), collapse=".")
[1] "abcde.fg"
Note that as mentioned by konrad-rudolph, this will create a copy of the vector.
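The question also asks about inserting a character taken from a vector at a random position, which the answers above do not show. A minimal sketch, assuming a hypothetical helper named insert_at and an illustrative vector chars (neither is from the original answers); like the other approaches, it builds a new string that is then assigned back:

```r
# Hypothetical helper: insert `ch` after the first `pos` characters of `x`.
insert_at <- function(x, ch, pos) {
  paste0(substr(x, 1, pos), ch, substr(x, pos + 1, nchar(x)))
}

text <- "abcdefg"
text <- insert_at(text, ".", 5)  # assign back to update the element
text
# [1] "abcde.fg"

# Random variant: pick a position (0 = before the first character)
# and a character from a vector of candidates.
chars <- c("-", "_", "+")        # illustrative candidates, an assumption
pos <- sample(0:nchar("abcdefg"), 1)
insert_at("abcdefg", sample(chars, 1), pos)
```

The last call returns a different string on each run unless set.seed is called first.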

Related

String split to remove everything after _ [duplicate]

This question already has answers here:
How to extract everything until first occurrence of pattern
(4 answers)
Closed 1 year ago.
I have a list of file names and want to extract just the part of each name before the _.
I tried using the following but was unsuccessful.
condition <- strsplit(count_files, "_*")
also tried
condition <- strsplit(count_files, "_*.[c,t]sv")
Any suggestions?
Just use trimws from base R (the whitespace argument is available from R 3.6.0):
trimws(count_files, whitespace = "_.*")
[1] "Fibroblast" "Fibroblast"
The output from strsplit is a list, so it may need to be unlisted. Also, the regex _* matches zero or more _ characters. Instead, it should be _.*, i.e. _ followed by zero or more other characters (.*):
unlist(strsplit(count_files, "_.*"))
data
count_files <- c("Fibroblast_1.csv", "Fibroblast_2.csv")
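As a further base R option (not one of the answers above), sub can delete everything from the first underscore onward. A short sketch using the same sample data; the third file name is an added illustration of what happens with more than one underscore:

```r
count_files <- c("Fibroblast_1.csv", "Fibroblast_2.csv")

# "_.*" matches the first underscore and everything after it;
# replacing that match with "" keeps only the leading part.
condition <- sub("_.*", "", count_files)
condition
# [1] "Fibroblast" "Fibroblast"

# With several underscores, only the text before the first one is kept.
sub("_.*", "", "Fibroblast_rep_1.csv")
# [1] "Fibroblast"
```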

Split string after last underscore in R [duplicate]

This question already has answers here:
Separate string after last underscore
(2 answers)
Closed 2 years ago.
I have a string like "ABC_Something_Filename". How can I split it into "ABC_Something" and "Filename" in R?
I do not want to remove anything. I want both components - before and after last underscore.
Edit: I tried using what's mentioned for column separation, but that is too extensive for my use case. Hence, I am looking for a regex alternative to simply split a string.
One option would be to use strsplit with a negative lookahead which asserts that the underscore on which to split is the final one in the input:
input <- "ABC_Something_Filename"
parts <- strsplit(input, "_(?!.*_)", perl=TRUE)[[1]]
parts
[1] "ABC_Something" "Filename"
You can use str_match and capture data in two groups.
x <- 'ABC_Something_Filename'
stringr::str_match(x, '(.*)_(.*)')[, -1]
#[1] "ABC_Something" "Filename"
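The same two-group idea can be sketched in base R with sub. Because the first (.*) is greedy, it consumes as much as possible, so the underscore that separates the groups is the last one in the string. The variable names here are illustrative:

```r
x <- "ABC_Something_Filename"
pattern <- "^(.*)_(.*)$"

# Greedy first group: everything up to the last underscore.
before <- sub(pattern, "\\1", x)
# Second group: everything after the last underscore.
after <- sub(pattern, "\\2", x)

c(before, after)
# [1] "ABC_Something" "Filename"
```

If the input contains no underscore, the pattern does not match and sub returns the input unchanged for both parts.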

Replace a string with first few characters [duplicate]

This question already has answers here:
Regex group capture in R with multiple capture-groups
(9 answers)
Closed 2 years ago.
Let's say I have a string like this:
Str = "#sometext_any_character_including_&**(_etc_blabla\\s"
Now I want to replace the above text with
"#some\\s"
i.e. I just want to retain the leading #, the first 4 characters after it, and the trailing \\s. Is there a way to do this in R?
Any pointers would be highly appreciated.
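I would extract the parts using a regex. The example below is in Python rather than R: it captures the leading # with the next four letters, and the literal \s, then joins the matches:
import re
# Example input; the last two characters are a literal backslash followed by "s"
my_string = r"#sometext_any_character_including_&**(_etc_blabla\s"
# Match either "#" plus four lowercase letters, or a literal backslash followed by "s"
pattern = re.compile(r"(#[a-z]{4}|\\s)")
my_match = "".join(pattern.findall(my_string))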
An option with sub
sub("^(#.{4}).*(\\\\s)$", "\\1\\2", Str)
#[1] "#some\\s"
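The doubled backslashes can be confusing: "\\s" in R source code is the two-character sequence backslash plus s, and print shows it in escaped form. A small sketch of the sub call above that compares the printed form with the raw characters:

```r
Str <- "#sometext_any_character_including_&**(_etc_blabla\\s"

# "\\\\s" in R source is the regex \\s, i.e. a literal backslash followed by "s".
res <- sub("^(#.{4}).*(\\\\s)$", "\\1\\2", Str)

print(res)       # escaped form: "#some\\s"
writeLines(res)  # raw characters: #some\s
nchar(res)       # 7: "#", "s", "o", "m", "e", the backslash, "s"
```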
You can use
str_replace(string, pattern, replacement)
or
str_replace_all(string, pattern, replacement)

Modify a substring matched by regex and place it back [duplicate]

This question already has answers here:
Applying a function to a backreference within gsub in R
(4 answers)
Closed 3 years ago.
I have an arbitrary string, say "1a2 2a1 3a2 10a5". I want to apply an arbitrary mathematical operation, say doubling, to some of the numbers, say any number followed by an "a".
I can extract the numbers I want with relative ease
string = "1a2 2a1 3a2 10a5"
numbers = stringr::str_extract_all(string,'[0-9]+(?=a)')[[1]]
and obviously, doubling them is trivial
numbers = 2*(as.integer(numbers))
But I am having problems placing the new results back in their old positions to get the output "2a2 4a1 6a2 20a5". I feel like there should be a single function that accomplishes this, but all I can think of is recording the original indexes of the matches using gregexpr and manually placing the new results at those coordinates.
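We can use str_replace_all from stringr to match the numbers followed by "a" and multiply them by 2 in a callback function. Recent stringr versions require the function to return a character vector, hence the as.character call.
stringr::str_replace_all(string, "\\d+(?=a)", function(m) as.character(as.integer(m) * 2))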
#[1] "2a2 4a1 6a2 20a5"
gsubfn is like gsub except that the second argument can be a string, function (possibly expressed in formula notation), a list or a proto object. If it is a function the capture groups are input into it and the match is replaced with the output of the function.
library(gsubfn)
string <- "1a2 2a1 3a2 10a5"
gsubfn("(\\d+)(?=a)", ~ 2L * as.integer(..1), string)
## [1] "2a2 4a1 6a2 20a5"
This variation also works. backref = 0 says to input the whole match into the function rather than the capture groups.
gsubfn("\\d+(?=a)", ~ 2 * as.integer(x), string, backref = 0)
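The approach the question describes, locating the matches with gregexpr and writing the new values back, can also be done in base R without tracking indexes by hand, since regmatches has a replacement form. A minimal sketch (the variable names m and out are illustrative):

```r
string <- "1a2 2a1 3a2 10a5"

# Locate every run of digits that is followed by "a" (lookahead needs perl = TRUE).
m <- gregexpr("[0-9]+(?=a)", string, perl = TRUE)

# Work on a copy so the original string is kept for comparison.
out <- string

# Replace each match with its doubled value; regmatches<- accepts
# replacements whose length differs from the matched text.
regmatches(out, m) <- lapply(
  regmatches(out, m),
  function(x) as.character(2L * as.integer(x))
)
out
# [1] "2a2 4a1 6a2 20a5"
```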

Extracting substring from string if pattern varies [duplicate]

This question already has answers here:
Function to extract domain name from URL in R
(5 answers)
Closed 4 years ago.
I would like to extract names from weblinks using substr(). My problem is that the patterns vary slightly, so I am not sure how to account for the variances. Here is a sample:
INPUT:
list <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
What I want as Output
c("gatcoin", "appcoins", "pareto", "betbox", "aidcoin")
In my understanding I need to specify the start and end of the string to be extracted, but sometimes start would be "https://", while other times it would be "https://www."
How could I solve this?
You can do this easily with stringr...
library(stringr)
str_match(list, "\\/(www\\.)*(\\w+)\\.")[,3]
[1] "gatcoin" "appcoins" "pareto" "betbox" "aidcoin"
The regex extracts the first sequence of word characters (letters, digits, or underscores) between a slash, optionally followed by www., and the next dot.
The equivalent in base R is slightly messier...
sub(".+?\\/(?:www\\.)*(\\w+)\\..+", "\\1", list)
This adds the start and end of the string as well, replacing the whole lot with just the capture group you want. It sets the optional www. as a non-capturing group, as sub and str_match behave differently if the first group is not found.
list <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
pattern <- c("https://", "www.", "http://")
for (p in pattern) list <- gsub(p, "", list, fixed = TRUE)  # fixed = TRUE treats "." literally
unlist(lapply(strsplit(list, "[.]"), function(x) x[1]))
[1] "gatcoin" "appcoins" "pareto" "betbox" "aidcoin"
You could use Regular Expressions. However, that is reinventing the wheel. People have thought about how to split URLs before so use an already existing function.
For example parse_url of the httr package. Or google "R parse URL" for alternatives.
urls <- list("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
"https://appcoins.io/pdf/appcoins_whitepaper.pdf",
"https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
"http://betbox.ai/BetBoxBizWhitepaper.pdf",
"https://www.aidcoin.co/assets/documents/whitepaper.pdf")
Use lapply to use parse_url for every element of urls
parsed <- lapply(urls, httr::parse_url)
Now you have a list of lists. Each element of the list parsed has multiple elements itself, which contain the parts of the URL.
Extract all the elements parsed[[...]]$hostname:
hostname <- sapply(parsed, function(e) e$hostname)
Split those by the dot and take the second-to-last element:
domain <- strsplit(hostname, "\\.")
domain <- sapply(domain, function(d) d[length(d)-1])
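For comparison, the same steps (isolate the host, split on dots, take the second-to-last part) can be sketched in base R without httr. This is an illustration under the assumption that every URL has the form scheme://host/...; it is not a general URL parser, and like the steps above it would return the wrong part for hosts ending in a two-part suffix such as .co.uk.

```r
urls <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
          "https://appcoins.io/pdf/appcoins_whitepaper.pdf",
          "https://pareto.network/download/Pareto-Technical-White-Paper.pdf",
          "http://betbox.ai/BetBoxBizWhitepaper.pdf",
          "https://www.aidcoin.co/assets/documents/whitepaper.pdf")

# Keep only the host: the text between "://" and the next "/".
host <- sub("^[a-z]+://([^/]+).*$", "\\1", urls)

# Split each host on literal dots and take the second-to-last piece.
parts <- strsplit(host, ".", fixed = TRUE)
domain <- vapply(parts, function(d) d[length(d) - 1], character(1))
domain
# [1] "gatcoin"  "appcoins" "pareto"   "betbox"   "aidcoin"
```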
This regex captures the word after the :// and an optional www. (with the dot escaped so that it only matches a literal dot):
(?::\/\/(?:www\.)?)(\w+)
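One way to apply such a pattern in R is regexec with regmatches, which returns the full match followed by each capture group. A sketch, assuming the dot after www is escaped and using two of the sample URLs; the variable names are illustrative:

```r
urls <- c("https://www.gatcoin.io/wp-content/uploads/2017/08/GATCOIN-Whitepaper_ENG-1.pdf",
          "http://betbox.ai/BetBoxBizWhitepaper.pdf")

# "://" followed by an optional "www." and then the captured word.
m <- regexec("://(?:www\\.)?(\\w+)", urls, perl = TRUE)

# Each element holds the full match first and the capture group second.
hits <- regmatches(urls, m)
names_only <- vapply(hits, function(h) h[2], character(1))
names_only
# [1] "gatcoin" "betbox"
```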
