Replace characters after occurrence of a specific character in R

I have a character vector like this:
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
I am trying to get an output like this using base R.
NM020506, NM, NM00
i.e., ignore everything from the first "_" onward.
I tried something like this, but it is clearly not correct:
a
[1] "NM020506_1" "NM_020519_1" "NM00_1030297.2"
> substr(a,1,unlist(gregexpr(pattern ='_',a))-1)
[1] "NM020506" "NM" "NM00_1030"
>

You can use the sub function to replace the first _ and everything after it with an empty string.
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
sub("_.*","",a)
[1] "NM020506" "NM" "NM00"
There is no need for gregexpr: it returns the positions of every match, but you only need the first _. Use regexpr instead, which returns only the first match in each string:
substr(a,1,regexpr(pattern ='_',a)-1)
[1] "NM020506" "NM" "NM00"
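To see why the gregexpr attempt in the question returned "NM00_1030" for the third element, it can help to compare the positions the two functions return. This is a minimal sketch; the names all_pos and first_pos are illustrative and not from the original posts.

```r
a <- c("NM020506_1", "NM_020519_1", "NM00_1030297.2")

# gregexpr() yields ALL match positions per string; unlist() flattens them,
# so the second string contributes two positions (3 and 10).
all_pos <- unlist(gregexpr("_", a, fixed = TRUE))
all_pos    # positions 9, 3, 10, 5: four values for three strings

# substr() pairs the first three stops with the three strings,
# so the third string is cut at 10 - 1 = 9 instead of 5 - 1 = 4.
wrong <- substr(a, 1, all_pos - 1)

# regexpr() yields exactly one position per string (the first match).
first_pos <- as.integer(regexpr("_", a, fixed = TRUE))
right <- substr(a, 1, first_pos - 1)
```

Note that a string containing no "_" at all gives a position of -1, so substr would return an empty string for it; sub("_.*", "", a) leaves such strings unchanged instead.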

You can use strsplit as:
#data
a <- c("NM020506_1","NM_020519_1","NM00_1030297.2")
sapply(strsplit(a,"_"),function(x)x[1])
#[1] "NM020506" "NM" "NM00"

Related

extract part of a string using a before and after pattern in R

I'd like to extract the characters 120497 and 120542 from the vector below so that I have something like this c("120497","120542"). I think I could perform this task by extracting everything after "-t" and before ".html"
data<-c("mies-are-going-straight-to-hell-t120497.html?sid=0e4851bc16db",
"oss-on-wall-street-cryptocurrency-t120542.html?sid=1c1328efb1e39b40123679e173f184a1")
Thanks!
library(stringr)
# the dot is escaped so that it matches a literal "." before "html"
str_extract(data, "\\d+(?=\\.html)")
[1] "120497" "120542"
If the number you want is always the first run of digits in each string, then:
sub(".*?(\\d+).*", "\\1", data)
[1] "120497" "120542"
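The question describes the target as whatever sits between "-t" and ".html", and that can also be written directly in base R. This is a sketch under the assumption that every string contains exactly that shape; the name ids is illustrative.

```r
data <- c("mies-are-going-straight-to-hell-t120497.html?sid=0e4851bc16db",
          "oss-on-wall-street-cryptocurrency-t120542.html?sid=1c1328efb1e39b40123679e173f184a1")

# Capture the digits between "-t" and a literal ".html",
# then replace the whole string with that capture group.
ids <- sub(".*-t([0-9]+)\\.html.*", "\\1", data)
ids
```

Strings that do not match the pattern come back unchanged, so it may be worth checking them with grepl first.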

Extracting numbers (in decimal and </> form) from strings in R

I have a dataset in which a column (the result variable) contains data in both numeric and character form [e.g. positive, negative, <0.1, 600, >1000 etc].
I want to extract only the numeric data in this column (i.e. <0.1, 600, >1000). Ideally without the use of any external packages.
I tried the following:
x<-gsub('\\D','', x)
But it removes the decimals or less than/more than sign (e.g. 1.56 became 156, <1.0 became 10)
I then tried the following:
x<-as.numeric(gsub("(\\D)\\.","", x))
This time round it keeps the decimal but coerced other values such as <0.1, >100 to become NAs instead.
So my question is: is there any way I can modify the function so that it keeps values containing '<' or '>' as they are, without replacement?
Meaning from
x = c("negative","positive","1.22","<1.0",">200")
I will be able to get back
x = c("","","1.22","<1.0",">200")
I would really appreciate if someone can teach me how to resolve this issue thanks!
Do you need this?
> gsub("[^0-9.<>]", "", x)
[1] "" "" "1.22" "<1.0" ">200"
Does this work for you? Using grep we can find which elements of the vector contain digits; adding value=TRUE returns those elements themselves. Another way is to use grepl, which returns a logical vector for the match. Also, \\D does not work in your case because it matches every non-digit, including the dot and the less-than/greater-than signs.
grep('\\d+', x, value=TRUE)
would yield : [1] "1.22" "<1.0" ">200"
grepl('\\d+', x)
would yield: [1] FALSE FALSE TRUE TRUE TRUE
You may also try gsub using:
> gsub('[a-zA-Z]+', '', x)
[1] "" "" "1.22" "<1.0" ">200"
Using str_remove
library(stringr)
str_remove_all(x, "[A-Za-z]+")
#[1] "" "" "1.22" "<1.0" ">200"
What about something like this? Find the elements that contain letters and set them to an empty string.
x[grepl('[a-zA-Z]', x)] <- ""
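The answers above all remove letters. An alternative is to state what a valid value looks like and blank out everything else. This is a sketch assuming a valid value is an optional "<" or ">" followed by digits with an optional decimal part; the names is_numeric_like and cleaned are illustrative.

```r
x <- c("negative", "positive", "1.22", "<1.0", ">200")

# Optional "<" or ">", one or more digits, optional decimal part
is_numeric_like <- grepl("^[<>]?[0-9]+(\\.[0-9]+)?$", x)

# Keep the entries that match, replace the rest with an empty string
cleaned <- ifelse(is_numeric_like, x, "")
cleaned
```

Unlike stripping letters, this would also blank out mixed entries such as "approx 5", which may or may not be what you want.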

Extract substring in R using grepl

I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
Please note:
I have to use gsub rather than sub here, because two separate parts are removed: first ^.*Work, then \\.csv$.
A class such as [\\s\\S] or [\\d\\D] does not work with sub or gsub:
https://regex101.com/r/wFgkgG/1
It does work with akrun's approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = T))
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=T)
gsub(".","m",str1,perl=T)
gsub(".","m",str1,perl=F)
With the default (non-Perl) engine, . also matches \n; with perl = TRUE it does not.
Here is an option using regmatches/regexpr from base R. Using a regex lookaround to match all characters that are not a . after the string 'Work', extract with regmatches
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
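The answers above differ mainly in what happens to strings that do not match. A small helper can make that explicit by returning NA for them. This is only a sketch: extract_between is a hypothetical name, not a function from base R or stringr, and its left and right arguments are assumed to be regular expressions.

```r
# Hypothetical helper: text between two regex markers, NA when absent
extract_between <- function(x, left, right) {
  pattern <- paste0(left, "(.*)", right)
  # regexec() keeps capture groups; regmatches() returns, per string,
  # either c(full_match, group_1) or character(0) when nothing matched
  m <- regmatches(x, regexec(pattern, x))
  vapply(m,
         function(parts) if (length(parts) >= 2) parts[2] else NA_character_,
         character(1))
}

v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")
res <- extract_between(v1, "Work", "\\.csv$")
res
```

The result has the same length as the input, which keeps it aligned with the rows of the original table.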

How to extract everything until first occurrence of pattern

I'm trying to use the stringr package in R to extract everything from a string up until the first occurrence of an underscore.
What I've tried
str_extract("L0_123_abc", ".+?(?<=_)")
> "L0_"
Close but no cigar. How do I get this one? Also, ideally I'd like something that's easy to extend, so that I can get the information between the 1st and 2nd underscore and the information after the 2nd underscore.
To get L0, you may use
> library(stringr)
> str_extract("L0_123_abc", "[^_]+")
[1] "L0"
The [^_]+ matches 1 or more chars other than _.
Also, you may split the string with _:
x <- str_split("L0_123_abc", fixed("_"))
> x
[[1]]
[1] "L0" "123" "abc"
This way, you will have all the substrings you need.
The same can be achieved with
> str_extract_all("L0_123_abc", "[^_]+")
[[1]]
[1] "L0" "123" "abc"
The regex lookaround should be
str_extract("L0_123_abc", ".+?(?=_)")
#[1] "L0"
Using gsub...
gsub("(.+?)(\\_.*)", "\\1", "L0_123_abc")
You can use sub from base R with the pattern _.*, which matches the first _ and everything after it.
sub("_.*", "", "L0_123_abc")
#[1] "L0"
Or using [^_], which matches any character except _.
sub("([^_]*).*", "\\1", "L0_123_abc")
#[1] "L0"
or using substr with regexpr.
substr("L0_123_abc", 1, regexpr("_", "L0_123_abc")-1)
#substr("L0_123_abc", 1, regexpr("_", "L0_123_abc", fixed=TRUE)-1) #More performant alternative
#[1] "L0"
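The question also asks for something that extends to the text between the 1st and 2nd underscore and after the 2nd. Splitting once and indexing covers all three cases in base R. This is a sketch; nth_field is a hypothetical helper name.

```r
# Hypothetical helper: the n-th "_"-separated field of each string,
# or NA when a string has fewer than n fields
nth_field <- function(x, n, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) if (length(p) >= n) p[n] else NA_character_,
         character(1))
}

s <- c("L0_123_abc", "L1_456")
first  <- nth_field(s, 1)   # text before the first underscore
second <- nth_field(s, 2)   # text between the 1st and 2nd underscore
third  <- nth_field(s, 3)   # text after the 2nd underscore, NA if missing
```

With fixed = TRUE the separator is treated literally, so the same helper should also work for separators such as "." that are special in a regex.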

Retrieving a specific part of a string in R

I have the following vector of strings:
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of each string that comes after the equal sign, meaning I would like to get a vector like this:
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector (it will be named by the input strings):
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package which provides a more powerful regex engine than base R (in particular, supporting look-behind).
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any alphanumeric character, so you could use "(?<==)\\w+$" (the \ must be escaped so you end up with \\w).
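For comparison, the same look-behind pattern can also be used from base R by passing perl = TRUE to regexpr, as an earlier answer on this page does for the "Work" example. This is a sketch using the stricter "\\w+$" variant; the names m and ids are illustrative.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# perl = TRUE switches to the PCRE engine, which supports (?<=...)
m <- regexpr("(?<==)\\w+$", str1, perl = TRUE)

# regmatches() returns only the matched substrings
ids <- regmatches(str1, m)
ids
```

One thing to keep in mind is that regmatches drops strings with no match, so the result can be shorter than the input.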
