Removing everything after the first 'backslash' in a string - R

I have a vector like below
vec <- c("abc\edw\www", "nmn\ggg", "rer\qqq\fdf"......)
I want to remove everything from the first backslash onward, like below
newvec <- c("abc","nmn","rer")
Thank you.
My original vector is as below (only the head)
[1] "peoria ave\nste \npeoria"
[2] "wood dr\nphoenix"
[3] "central ave\nphoenix"
[4] "southern ave\nphoenix"
[5] "happy valley rd\nste \nglendaleaz "
[6] "the americana at brand\n americana way\nglendale"
The problem here is that my original csv file does not contain backslashes, but when I read it, backslashes appear. The original csv file is as below
[1] "peoria ave [2] "wood dr
nste nphoenix"
npeoria"
As you can see, the parts are actually separated by line breaks ("ENTER"), but when I read the file into R using read.csv() they show up as backslashes.

Another solution:
sub("\\\\.*", "", x)

vec <- c("abc\\edw\\www", "nmn\\ggg", "rer\\qqq\\fdf")
sub("([^\\\\])\\\\.*","\\1", vec)
[1] "abc" "nmn" "rer"

strsplit(vec, "\\\\") should do the job.
It returns a list with one element per string. To select the first piece of the first string use [[1]][1]; for the second piece use [[1]][2].
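To get the first piece of every element at once, rather than indexing one list entry at a time, something like the following may work. It is only a sketch: first_piece, addr and first_line are illustrative names, and the second half assumes (based on the printed output in the question's edit) that the real data contains newline characters, which R displays as \n, rather than literal backslashes.

```r
# Simplified example: each "\\" in the source code is one literal backslash.
vec <- c("abc\\edw\\www", "nmn\\ggg", "rer\\qqq\\fdf")
# fixed = TRUE treats the separator as plain text, so no regex escaping is needed.
first_piece <- vapply(strsplit(vec, "\\", fixed = TRUE), `[`, character(1), 1)
print(first_piece)

# Assumed real data: "\n" here is a single newline character, not a backslash.
addr <- c("peoria ave\nste \npeoria", "wood dr\nphoenix")
first_line <- vapply(strsplit(addr, "\n", fixed = TRUE), `[`, character(1), 1)
print(first_line)
```

If the addresses really do contain newlines, patterns that look for a backslash (such as "\\\\.*") would find nothing to match and leave the strings unchanged.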

Related

How to remove some characters of different length from the end of a string in R?

I want to remove everything after the second '_' and convert this vector
vec
[1] "HSC_bdl_HSC" "HSC_oil_HSC" "EC_oil_EC"
[4] "Chol (Sox9+)_ccl4_Chol (Sox9+)"
to
vec
[1] "HSC_bdl" "HSC_oil" "EC_oil"
[4] "Chol (Sox9+)_ccl4"
but I can't do that with gsub() or substr().
Thanks for any help.
You may use sub here with the pattern _[^_]*$:
sub("_[^_]*$", "", x)
[1] "HSC_bdl" "HSC_oil" "EC_oil"
[4] "Chol (Sox9+)_ccl4"
Data:
x <- c("HSC_bdl_HSC", "HSC_oil_HSC", "EC_oil_EC", "Chol (Sox9+)_ccl4_Chol (Sox9+)")
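Note that _[^_]*$ removes the last underscore and whatever follows it, which coincides with "everything after the second underscore" only when each string has exactly two underscores, as in the sample data. If some strings could contain more, a pattern anchored at the start is one possible alternative (a sketch; keep_two is a hypothetical helper name):

```r
x <- c("HSC_bdl_HSC", "HSC_oil_HSC", "EC_oil_EC", "Chol (Sox9+)_ccl4_Chol (Sox9+)")

# Keep the first two underscore-separated fields and drop the rest.
# Strings with fewer than two underscores do not match and are returned unchanged.
keep_two <- function(s) sub("^([^_]*_[^_]*)_.*$", "\\1", s)

print(keep_two(x))
print(keep_two("a_b_c_d"))
```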

How to extract several substrings with a for loop in R

I have the following 100 strings:
[3] "Department_Complementary_Demand_Converted_Sum"
[4] "Department_Home_Demand_Converted_Sum"
[5] "Department_Store A_Demand_Converted_Sum"
[6] "Department_Store B_Demand_Converted_Sum"
...
[100] "Department_Unisex_Demand_Converted_Sum"
Obviously, for every string I can use substr() with different start and end values for the string indices. But as one can see, all the strings start with Department_ and end with _Demand_Converted_Sum. I want to extract only what's in between. If there were a way to always start at index 11 from the left and end at index 21 from the right, I could just run a for loop over all 100 strings above.
Example
Given input: Department_Unisex_Demand_Converted_Sum
Expected output: Unisex
Looks like a classic case for lookarounds:
library(stringr)
str_extract(str, "(?<=Department_)[^_]+(?=_)")
[1] "Complementary" "Home" "Store A"
Data:
str <- c("Department_Complementary_Demand_Converted_Sum",
"Department_Home_Demand_Converted_Sum",
"Department_Store A_Demand_Converted_Sum")
Using strsplit(),
sapply(strsplit(str, '_'), '[', 2)
# [1] "Complementary" "Home" "Store A"
or stringi::stri_sub_all().
unlist(stringi::stri_sub_all(str, 12, -22))
# [1] "Complementary" "Home" "Store A"
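Since the prefix and suffix have fixed lengths (11 and 21 characters), base R can also do this without a loop, because substring() and sub() are vectorised. A minimal sketch, with strs, prefix and suffix as illustrative variable names:

```r
strs <- c("Department_Complementary_Demand_Converted_Sum",
          "Department_Home_Demand_Converted_Sum",
          "Department_Store A_Demand_Converted_Sum")

prefix <- "Department_"
suffix <- "_Demand_Converted_Sum"

# Position-based: start right after the prefix, stop right before the suffix.
by_position <- substring(strs, nchar(prefix) + 1, nchar(strs) - nchar(suffix))

# Pattern-based: capture whatever sits between the fixed prefix and suffix.
by_pattern <- sub("^Department_(.*)_Demand_Converted_Sum$", "\\1", strs)

print(by_position)
print(by_pattern)
```

The pattern-based form leaves a string unchanged if it does not have the expected prefix and suffix, whereas the position-based form cuts by length regardless of content.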

How to match everything except for digits followed by a space and ONLY digits followed by a space?

The problem
What the header says, basically. Given a string, I need to extract from it everything that is not a leading number followed by a space. So, given this string
"420 species of grass"
I would like to get
"species of grass"
But, given a string with a number not in the beginning, like so
"The clock says it is 420"
or a string with a number not followed by a space, like so
"It is 420 already"
I would like to get back the same string, with the number preserved
"The clock says it is 420"
"It is 420 already"
What I have tried
Matching a leading number followed by a space works as expected:
library(stringr)
> str_extract_all("420 species of grass", "^\\d+(?=\\s)")
[[1]]
[1] "420"
> str_extract_all("The clock says it is 420", "^\\d+(?=\\s)")
[[1]]
character(0)
> str_extract_all("It is 420 already", "^\\d+(?=\\s)")
[[1]]
character(0)
But, when I try to match anything but a leading number followed by a space, it doesn't:
> str_extract_all("420 species of grass", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "species" "of" "grass"
> str_extract_all("The clock says it is 420", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "The" "clock" "says" "it" "is"
> str_extract_all("It is 420 already", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "It" "is" "already"
It seems this regex matches anything but digits AND spaces instead.
How do I fix this?
I think @Douglas's answer is more concise. However, I guess your actual case is more complicated, so you may want to check ?regexpr, which can identify the starting position of your specific pattern.
A method using a for loop is below:
list <- list("420 species of grass",
             "The clock says it is 420",
             "It is 420 already")
extract <- function(x) {
  y <- vector('list', length(x))
  for (i in seq_along(x)) {
    if (regexpr("420", x[[i]])[[1]] > 1) {
      y[[i]] <- x[[i]]
    } else {
      y[[i]] <- substr(x[[i]], (regexpr(" ", x[[i]])[[1]] + 1), nchar(x[[i]]))
    }
  }
  return(y)
}
> extract(list)
[[1]]
[1] "species of grass"
[[2]]
[1] "The clock says it is 420"
[[3]]
[1] "It is 420 already"
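The loop above hard-codes "420". A sketch of a more general variant of the same regexpr() idea, which works on a plain character vector and for any leading number, could look like this (the function name strip_leading_number is made up for illustration):

```r
strip_leading_number <- function(x) {
  # Anchored at the start: one or more digits followed by one whitespace character.
  m <- regexpr("^[0-9]+[[:space:]]", x)
  len <- attr(m, "match.length")   # -1 where there is no match
  hit <- len > 0
  out <- x
  # Drop the matched prefix only where the pattern was found.
  out[hit] <- substring(x[hit], len[hit] + 1L)
  out
}

strings <- c("420 species of grass", "The clock says it is 420", "It is 420 already")
print(strip_leading_number(strings))
```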
I think the easiest way to do this is by removing the numbers instead of extracting the desired pattern:
library(stringr)
strings <- c("420 species of grass", "The clock says it is 420", "It is 420 already")
str_remove(strings, pattern = "^\\d+\\s")
[1] "species of grass" "The clock says it is 420" "It is 420 already"
An easy way out is to replace any digits followed by spaces that occur right from the start of string using this regex,
^\d+\s+
with empty string.
Sample R code using sub:
sub("^\\d+\\s+", "", "420 species of grass")
sub("^\\d+\\s+", "", "The clock says it is 420")
sub("^\\d+\\s+", "", "It is 420 already")
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
Alternatively, to achieve the same using matching, you can use the following regex and capture the contents of group 1,
^(?:\d+\s+)?(.*)$
Also, most regex syntax loses its special meaning once you place it inside a character set. The positive lookahead inside [^(^\\d+(?=\\s))]+ simply behaves as a run of literal characters, which is why your regex gives the wrong result.
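To illustrate the point about character sets: inside [...] the parentheses, +, ?, = and the inner ^ are ordinary characters, so the pattern from the question is just a negated set of those characters plus digits and whitespace. A small base R check, run with PCRE (perl = TRUE) rather than stringr, so engine differences are possible; tokens is a hypothetical helper:

```r
# Return all matches of `pattern` in a single string, using PCRE.
tokens <- function(x, pattern) regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

s <- "420 species of grass"
original   <- tokens(s, "[^(^\\d+(?=\\s))]+")  # pattern from the question
simplified <- tokens(s, "[^()^+?=\\d\\s]+")    # same set, written without the fake lookahead

print(original)
print(simplified)
```

Both calls return the same pieces, which matches the behaviour described in the question: runs of characters that are neither digits nor whitespace.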
Edit:
Although the solution using sub is better, in case you want a match-based solution in R, you need to use str_match instead of str_extract_all, and to access the contents of group 1 you need to use [,2].
R code using str_match:
library(stringr)
print(str_match("420 species of grass", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("The clock says it is 420", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("It is 420 already", "^(?:\\d+\\s+)?(.*)$")[,2])
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"

How to put a space in between a list of strings?

This is my current dataset:
c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
I want to insert a space between the words in each airline name.
For this I tried the following code:
airlines$airline <- gsub("([[:lower:]]) ([[:upper:]])", "\\1 \\2", airlines$airline)
But I got the text in the same format as before.
My desired output is as below:
txt <- c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
You need two different sorts of rules: one for the spaces before the case changes and the other for recurring words ("all", "designated", "services") or symbols ("-"). You could start with a pattern that identifies a lowercase character followed by an uppercase character (each identified with a character class like "[A-Z]") and then insert a space between those two characters using two capture groups (created with flanking parentheses around a section of a pattern). See the ?regex Details section for a quick description of character classes and capture groups:
gsub("([a-z])([A-Z])", "\\1 \\2", txt)
You then pass that result to a second gsub() call, which adds a space before any of the recurring words in your text that you also want separated:
gsub("(-|all|designated|services)", " \\1", # second pattern and sub for "specials"
gsub("([a-z])([A-Z])", "\\1 \\2", txt)) #first pattern and sub for case changes
[1] "Jetstar"
[2] "Qantas"
[3] "Qantas Link"
[4] "Regional Express"
[5] "Tigerair Australia"
[6] "Virgin Australia"
[7] "Virgin Australia Regional Airlines"
[8] "All Airlines"
[9] "Qantas - all QF designated services"
[10] "Virgin Australia - all VA designated services"
I see that someone upvoted my earlier answer to Splitting CamelCase in R which was similar, but this one had a few more wrinkles to iron out.
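As for why the attempt in the question left the text unchanged: its pattern contains a literal space between the two groups, so it only matches a lowercase letter, then a space, then an uppercase letter, and the data has no such spaces. A quick check on a few of the names (sample_names, with_space and without_space are illustrative variables):

```r
sample_names <- c("QantasLink", "RegionalExpress", "VirginAustraliaRegionalAirlines")

# Pattern from the question: the space between the groups must be present in the input.
with_space <- gsub("([[:lower:]]) ([[:upper:]])", "\\1 \\2", sample_names)

# Same pattern without the space: matches every lowercase-to-uppercase boundary.
without_space <- gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", sample_names)

print(with_space)     # identical to the input
print(without_space)
```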
This could (almost) do the trick
gsub("([A-Z])", " \\1", airlines)
Borrowed from: splitting-camelcase-in-r
Of course, names like Qantas-allQFd… will still pose a problem because of the two consecutive uppercase letters ("QF") in the second part of the string.
I have tried to figure it out and I have come up with something:
library(stringr)
data_vec<- c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia",
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines",
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")
str_trim(gsub("(?<=[A-Z]{2})([a-z]{1})", " \\1",gsub("([A-Z]{1,2})", " \\1", data_vec)))
I hope this helps.

R keep double quotes from list to String

I have a list that looks like this
[1] "SCOPUS_ID:84942789431" "SCOPUS_ID:84928151617" "SCOPUS_ID:84939229259" "SCOPUS_ID:84946407175"
[5] "SCOPUS_ID:84933039513" "SCOPUS_ID:84942789431" "SCOPUS_ID:84942607254" "SCOPUS_ID:84948165954"
[9] "SCOPUS_ID:84926379258" "SCOPUS_ID:84946771354" "SCOPUS_ID:84944223683" "SCOPUS_ID:84942789431"
[13] "SCOPUS_ID:84939169499" "SCOPUS_ID:84947104346" "SCOPUS_ID:84948764343" "SCOPUS_ID:84938075139"
[17] "SCOPUS_ID:84946196118" "SCOPUS_ID:84930820238" "SCOPUS_ID:84947785321" "SCOPUS_ID:84933496680"
[21] "SCOPUS_ID:84942789431"
I want to use the function toString but keep the double quotes, so that the result looks like this
[1] " \"SCOPUS_ID:84942789431\", \"SCOPUS_ID:84928151617\", ... "
I'll admit that I'm fairly confused by what you're asking for, but I think this is what you want:
x <- c("SCOPUS_ID:84942789431", "SCOPUS_ID:84928151617", "SCOPUS_ID:84939229259")
paste('"', x, '"', sep = "", collapse = ", ")
# [1] "\"SCOPUS_ID:84942789431\", \"SCOPUS_ID:84928151617\", \"SCOPUS_ID:84939229259\""
I know you said you didn't want to use paste because it'll take 2-3 seconds, but I can't think of an alternative that gives you what you want right now. I'm sure others will have suggestions.
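If the goal is specifically to keep using toString(), one option is to add the quotes first and then collapse. A sketch using only base R; it should give the same string as the paste() call above, and no claim is made here about its speed (quoted is an illustrative name):

```r
x <- c("SCOPUS_ID:84942789431", "SCOPUS_ID:84928151617", "SCOPUS_ID:84939229259")

# Wrap each element in literal double quotes, then let toString() join with ", ".
quoted <- toString(sprintf('"%s"', x))

# cat() shows the actual characters; print() would display the quotes escaped as \".
cat(quoted, "\n")
```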
