Regexpr not working as expected

For the string <10.16;13.05) I want to match only the first number (sometimes the first number is missing, e.g. <;13.05)). I used the following regular expression:
grep("[0-9]+\\.*[0-9]*(?=;)","<10.16;13.05)",value=T,perl=T)
However, the result is not "10.16" but "<10.16;13.05)". Could anyone please help me with this one? Thanks.

You could also use strsplit here with minimal regex, i.e.
x <- '<10.16;13.05)'
as.numeric(gsub('<(.*)', '\\1', unlist(strsplit(x, ';', fixed = TRUE))[1]))
#[1] 10.16
x <- '<;13.05)'
as.numeric(gsub('<(.*)', '\\1', unlist(strsplit(x, ';', fixed = TRUE))[1]))
#[1] NA

I believe you are using the wrong regex function. grep only reports which elements contain the pattern (with value=T it returns those whole elements); it does not extract the matched substring.
Try instead
regmatches("<10.16;13.05)", regexpr("\\d*\\.\\d*", "<10.16;13.05)"))
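Note that the pattern above would pick up 13.05 when the first number is missing. A minimal sketch that keeps the asker's lookahead idea (a number only counts if a ; follows it) and returns NA for the missing case; the variable names x, m and first_num are made up for illustration:

```r
# Inputs from the question; the second one has no first number.
x <- c("<10.16;13.05)", "<;13.05)")

# Digits, an optional decimal part, then a lookahead requiring a following ";".
# Lookahead needs perl = TRUE in base R.
m <- regexpr("[0-9]+\\.?[0-9]*(?=;)", x, perl = TRUE)

# regmatches() silently drops elements without a match,
# so start from a vector of NAs and fill in only the matched positions.
first_num <- rep(NA_character_, length(x))
first_num[m != -1] <- regmatches(x, m)

as.numeric(first_num)
```

Pre-filling with NA keeps the result the same length as the input, which matters when the strings come from a data frame column.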

Related

Remove first "." from values in R

I have a dataset with different values in R. Some values are like 11.474 and others like 1.034.496 in the same column. I would like to change the values with two dots from 1.034.496 to 1034.496. Is there anyone who could help me please?
Thanks for the help!

Use gsub with Perl regexes:
df <- data.frame(a = c('11.474', '1.034.496', '1.234.034.496'))
df$a = gsub('[.](?=.*[.])', '', df$a, perl = TRUE)
print(df)
## a
## 1 11.474
## 2 1034.496
## 3 1234034.496
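Here, [.](?=.*[.]) matches a literal dot (which has to be escaped like so \. or put into a character class like so: [.]) only if another literal dot appears somewhere later in the string. That condition is checked with a positive lookahead, (?=PATTERN), so the later dot is not consumed and the last dot in each value is kept.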
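If the cleaned values are needed as numbers, the same pattern can be applied to a plain vector and the result converted. This is a small sketch on the sample values only; v, cleaned and nums are illustrative names:

```r
# Sample values from the answer above.
v <- c("11.474", "1.034.496", "1.234.034.496")

# Remove every dot that still has another dot somewhere after it,
# which leaves only the last dot as the decimal separator.
cleaned <- gsub("[.](?=.*[.])", "", v, perl = TRUE)

# Convert to numeric once only one dot remains per value.
nums <- as.numeric(cleaned)
print(nums)
```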

There are probably smarter regex approaches than the one below, but here is my attempt (it drops only the first dot, so it covers values with exactly two dots):
> ifelse(lengths(gregexpr("\\.",v))>1,sub("\\.","",v),v)
[1] "11.474" "1034.496"
where
v <- c("11.474","1.034.496")

Remove part of a string until a character is found R

I have a regex problem or somewhat regex related problem...
I have strings that look like this:
"..........))))..)))))))"
"....))))))))...)).))))..))"
"......))))...)))...)))))"
I want to remove the initial dot sequence, so that I only get the string starting at the first occurrence of the ")" symbol. The output would be something like:
"))))..)))))))"
"))))))))...)).))))..))"
"))))...)))...)))))"
I assume it would be somewhat similar to a lookahead regex but cannot figure out the correct one...
Any help?
Thanks

We match for 0 or more dots (\\.*) from the start (^) of the string and replace it with blank
sub("^\\.*", "", v1)
#[1] "))))..)))))))" "))))))))...)).))))..))" "))))...)))...)))))"
If it needs to start from ), then as above match 0 or more dots till the first ) and replace with the )
sub("^\\.*\\)", ")", v1)
#[1] "))))..)))))))" "))))))))...)).))))..))" "))))...)))...)))))"
data
v1 <- c("..........))))..)))))))", "....))))))))...)).))))..))", "......))))...)))...)))))")

You can simply remove dots from the beginning of the line (marked in the regex by ^) until you reach a non-dot character:
a <- "..........))))..)))))))"
b <- "....))))))))...)).))))..))"
c <- "......))))...)))...)))))"
sub("^\\.*", "", a) # "))))..)))))))"
sub("^\\.*", "", b) # "))))))))...)).))))..))"
sub("^\\.*", "", c) # "))))...)))...)))))"

The way your question is worded, the goal isn't to remove just . from the beginning, but any symbol until the first ) is encountered. So this answer is a more general solution.
stringr::str_extract("..........))))..)))))))","\\).*$")
Alternatively, if you want to stick with base R, you could use sub/gsub like this:
gsub("[^\\)]*(\\).*$)","\\1","..........))))..)))))))")
sub("[^\\)]*","","..........))))..)))))))")

Find first matching substring in a long string in R

I'm trying to find the first matching string from a vector in a long string. For example, I have example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow' and matching_vector <- c('Turtle','Dog'). I want it to return 'Dog', as this is the first substring from matching_vector that appears in the example string: LionabcdBear1231DogextKittyisananimalTurtleisslow
I already tried pmatch(example_string,matching_vector) but it doesn't work. Obviously as it doesn't work with substrings...
Thanks!
Tim

Is the following solution working for you?
example_string <- 'LionabcdBear1231DogextKittyisananimalTurtleisslow'
matching_vector<- c('Turtle','Dog')
match_ids <- sapply(matching_vector, function(x) regexpr(x, example_string, fixed = TRUE))
match_ids[match_ids < 0] <- NA  # regexpr() returns -1 when there is no match
result <- names(match_ids)[which.min(match_ids)]
> result
[1] "Dog"

We can use stri_match_first from stringi
library(stringi)
stri_match_first(example_string, regex = paste(matching_vector, collapse="|"))
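The same alternation idea can be sketched in base R, assuming the candidate strings contain no regex metacharacters (they would need escaping otherwise). The names pattern and first_hit are illustrative:

```r
example_string <- "LionabcdBear1231DogextKittyisananimalTurtleisslow"
matching_vector <- c("Turtle", "Dog")

# Join the candidates into one alternation: "Turtle|Dog".
pattern <- paste(matching_vector, collapse = "|")

# regexpr() finds the leftmost match in the string,
# and regmatches() pulls out the matched text.
first_hit <- regmatches(example_string, regexpr(pattern, example_string))
print(first_hit)
```

If none of the candidates occurs in the string, first_hit is an empty character vector rather than NA.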

Retrieving a specific part of a string in R

I have the following vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.

We can use sub to match zero or more characters that are not an = ([^=]*) followed by an = and replace the match with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")

Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"

Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos' comment, to get the results as a vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})

Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, whose ICU regex engine supports look-behind out of the box (base R's default engine does not; there it requires perl = TRUE).
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=TRUE)
(?<==) is a look-behind: the regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise class if you are confident about the format of what you match. For example, '\w' matches any word character (letter, digit or underscore), so you could use "(?<==)\\w+$" (the \ must be escaped, so you end up with \\w).

str_replace (package stringr) cannot replace brackets in r?

I have a string, say
fruit <- "()goodapple"
I want to remove the brackets in the string. I decided to use the stringr package because it can usually handle this kind of issue. I used:
str_replace(fruit,"()","")
But nothing is replaced, and the following is returned:
[1] "()good"
If I only want to replace the right half bracket, it works:
str_replace(fruit,")","")
[1] "(good"
However, the left half bracket does not work:
str_replace(fruit,"(","")
and the following error is shown:
Error in sub("(", "", "()good", fixed = FALSE, ignore.case = FALSE, perl = FALSE) :
invalid regular expression '(', reason 'Missing ')''
Does anyone have an idea why this happens? How can I remove the "()" in the string, then?

Escaping the parentheses does it...
str_replace(fruit,"\\(\\)","")
# [1] "goodapple"
You may also want to consider exploring the "stringi" package, which has a similar approach to "stringr" but has more flexible functions. For instance, there is stri_replace_all_fixed, which would be useful here since your search string is a fixed pattern, not a regex pattern:
library(stringi)
stri_replace_all_fixed(fruit, "()", "")
# [1] "goodapple"
Of course, basic gsub handles this just fine too:
gsub("()", "", fruit, fixed=TRUE)
# [1] "goodapple"

The accepted answer works for your exact problem, but not for the more general problem:
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace(my_fruits,"\\(\\)","")
## "goodapple" "(bad)apple" "(funnyapple"
This is because the regex exactly matches a "(" followed by a ")".
Assuming you care only about bracket pairs, this is a stronger solution:
str_replace(my_fruits, "\\([^()]{0,}\\)", "")
## "goodapple" "apple" "(funnyapple"

Building off of MJH's answer, this removes all ( or ):
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")
str_replace_all(my_fruits, "[()]", "")
[1] "goodapple" "badapple" "funnyapple"
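Both variants from these answers can also be sketched in base R with gsub, without loading stringr. Inside a character class the parentheses need no escaping; no_brackets and no_pairs are illustrative names:

```r
my_fruits <- c("()goodapple", "(bad)apple", "(funnyapple")

# Remove every "(" or ")" wherever it appears.
no_brackets <- gsub("[()]", "", my_fruits)

# Remove only complete bracket pairs together with their contents;
# an unmatched "(" is left alone.
no_pairs <- gsub("\\([^()]*\\)", "", my_fruits)

print(no_brackets)
print(no_pairs)
```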
