Count with how many spaces a string starts - r

I want to know with how many spaces a string starts. Here are some examples:
string.1 <- "    starts with 4 spaces"
string.2 <- "  starts with only 2 spaces"
My attempt was the following but this leads to 1 in both cases and I understand why this is the case.
stringr::str_count(string.1, "^ ")
stringr::str_count(string.2, "^ ")
I'd prefer if there was a solution completely like this but with another regex.

The "^ " pattern matches a single space at the start of the string, which is why both test cases return 1.
To match consecutive spaces at the start of the string, you may use
stringr::str_count(string.1, "\\G ")
Or, to count any whitespaces,
stringr::str_count(string.1, "\\G\\s")
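The "\G " pattern matches a space at the start of the string and then each space that immediately follows the previous successful match, because \G anchors at the start of the string or at the end of the previous match.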
Another approach: count the length of ^\s+ matches (1 or more whitespace chars at the start of the string):
strings <- c("    starts with 4 spaces", "  starts with only 2 spaces")
matches <- regmatches(strings, regexpr("^\\s+", strings))
sapply(matches, nchar)
# => 4 2
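Note that regmatches() drops elements that have no match, so the result above is shorter than the input whenever a string has no leading whitespace. A minimal base R sketch that keeps one count per element reads the match length of "^ *" directly. The third string is a hypothetical extra case added here for illustration, not part of the question:

```r
# Example inputs; the third string is an added, hypothetical case
# with no leading space at all
strings <- c("    starts with 4 spaces",
             "  starts with only 2 spaces",
             "starts with none")

# "^ *" always matches (possibly with length 0), so every element
# gets a count, including strings without leading spaces
m <- regexpr("^ *", strings)
leading <- attr(m, "match.length")
leading
```

Replacing the space in the pattern with \\s (and adding perl = TRUE) would count any leading whitespace instead of spaces only.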

One approach might be to take the nchar of the input string, with all content from the first non whitespace character until the end stripped.
string.1 <- "    starts with 4 spaces"
nchar(sub("\\S.*$", "", string.1))

Related

Remove one number at position n of the number in a string of numbers separated by slashes

I have a character column with this configuration:
data <- data.frame(
id = 1:3,
codes = c("08001301001", "08002401002 / 08002601003 / 17134604034", "08004701005 / 08005101001"))
I want to remove the 6th digit of any number within the string. The numbers are always 10 characters long.
My code works. However I believe it might be done easier using RegEx, but I couldn't figure it out.
library(stringr)
remove_6_digit <- function(x){
  idxs <- str_locate_all(x,"/")[[1]][,1]
  for (idx in c(rev(idxs+7), 6)){
    str_sub(x, idx, idx) <- ""
  }
  return(x)
}
result <- sapply(data$codes, remove_6_digit, USE.NAMES = F)
You can use
gsub("\\b(\\d{5})\\d", "\\1", data$codes)
This removes the sixth digit of every digit sequence that starts at a word boundary.
Details:
\b - word boundary
(\d{5}) - Capturing group 1 (\1): five digits
\d - a digit.
While word boundary looks enough for the current scenario, a digit boundary is also an option in case the numbers are glued to word chars:
gsub("(?<!\\d)(\\d{5})\\d", "\\1", data$codes, perl=TRUE)
where perl=TRUE enables the PCRE regex syntax and (?<!\d) is a negative lookbehind that fails the match if there is a digit immediately to the left of the current location.
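To see where the two boundaries differ, here is a small sketch on a hypothetical value in which the number is glued to letters (the "ID" prefix is an assumption made up for this illustration). There is no word boundary between "D" and "0", so the \b version leaves the string alone, while the lookbehind version still finds the number:

```r
# Hypothetical input: a code glued directly to letters
glued <- "ID08001301001"

# \b needs a word/non-word transition; "D" and "0" are both word chars
word_boundary <- gsub("\\b(\\d{5})\\d", "\\1", glued, perl = TRUE)

# (?<!\d) only requires that the previous character is not a digit
digit_boundary <- gsub("(?<!\\d)(\\d{5})\\d", "\\1", glued, perl = TRUE)

word_boundary   # unchanged
digit_boundary  # sixth digit removed
```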
And if you must only change numeric char sequences of 10 digits (no shorter and no longer) you can use
gsub("\\b(\\d{5})\\d(\\d{4})\\b", "\\1\\2", data$codes)
gsub("(?<!\\d)(\\d{5})\\d(?=\\d{4}(?!\\d))", "\\1", data$codes, perl=TRUE)
One remark though: your numbers consist of 11 digits, not 10, so you need to replace \\d{4} with \\d{5}.
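As a sketch of that adjustment, assuming the codes really are 11 digits long, the strict variant could look as follows. The last element is a hypothetical 12-digit value added only to show that longer sequences stay untouched:

```r
codes <- c("08001301001",
           "08002401002 / 08002601003 / 17134604034",
           "08004701005 / 08005101001",
           "080013010012")  # hypothetical 12-digit value, should not change

# Exactly 11 digits between word boundaries: keep digits 1-5 and 7-11
fixed <- gsub("\\b(\\d{5})\\d(\\d{5})\\b", "\\1\\2", codes, perl = TRUE)
fixed
```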
Another possible solution, using stringr::str_replace_all and lookaround :
library(tidyverse)
data %>%
  mutate(codes = str_replace_all(codes, "(?<=\\d{5})\\d(?=\\d{5})", ""))
#>   id                                codes
#> 1  1                           0800101001
#> 2  2 0800201002 / 0800201003 / 1713404034
#> 3  3              0800401005 / 0800501001

Remove all words in string containing punctuation (R)

How (in R) would I remove any word in a string containing punctuation, keeping words without?
test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"
desired <- "I a see works not"
Here is an approach using gsub which seems to work:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", test.string)
[1] "I a see works not"
This approach is to use the following regex pattern:
[A-Za-z]* match a leading letter zero or more times
[^A-Za-z ] then match a symbol once (not a space character or a letter)
\\S* followed by zero or more non-whitespace characters
\\s* followed by any amount of whitespace
Then, we just replace with empty string, to remove the words having one or more symbols in them.
You can use this regex
(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)
(?<=\\s|^) - positive lookbehind: the match must be preceded by whitespace or the start of the string.
[A-Za-z0-9]+ - matches letters and digits one or more times (a lowercase-only class would miss the word "I").
(?=\\s|$) - positive lookahead: the match must be followed by whitespace or the end of the string.
Tim's edit:
This answer uses a whitelist approach, namely identify all words which the OP does want to retain in his output. We can try matching using the regex pattern given above, and then connect the vector of matches using paste:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\\r not"
result <- regmatches(test.string,gregexpr("(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)",test.string, perl=TRUE))[[1]]
paste(result, collapse=" ")
[1] "I a see works not"
Here are a couple more approaches
First approach:
library(stringr)  # str_split(), str_detect()
library(magrittr) # %>%
str_split(test.string, " ", n=Inf) %>% # splitting the line into words
unlist %>%
.[!str_detect(., "\\W|\r")] %>% # detect words without punctuation or \r
paste(.,collapse=" ") # collapse the words to get the line
Second approach:
str_extract_all(test.string, "^\\w+|\\s\\w+\\s|\\w+$") %>%
unlist %>%
trimws() %>%
paste(., collapse=" ")
^\\w+ - words having only [a-zA-Z0-9_] and also is start of the string
\\s\\w+\\s - words with [a-zA-Z0-9_] and having space before and after the word
\\w+$ - words having [a-zA-Z0-9_] and also is the end of the string
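The same split-and-filter idea also works without any packages. The following is a minimal base R sketch, assuming that a "word" is a space-separated token and that only ASCII letters and digits count as clean characters:

```r
test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"

# Split on single spaces, then keep tokens that contain no character
# outside A-Z, a-z, 0-9 (the carriage return in "o\r" counts as one)
words <- strsplit(test.string, " ", fixed = TRUE)[[1]]
clean <- words[!grepl("[^A-Za-z0-9]", words)]
result <- paste(clean, collapse = " ")
result
```

Filtering on "contains a disallowed character" avoids anchors altogether, so it does not matter how the regex engine treats the end of a token.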

Extracting string between punctuation, when present

I'm trying to extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
The current code is able to do between : and ; OR after : but cannot handle ; alone or : alone.
Also, gsub(|(OF 100); SEE NOTE) isn't working, and I'm not sure why the initial : isn't being excluded and needs the gsub at all.
test<-c("Score (ABC): 2 (of 100); see note","Amount of ABC; 30%","Presence of ABC: negative","ABC not tested")
#works for :/;
toupper((regmatches(toupper(test), gregexpr(":\\s* \\K.*?(?=;)", toupper(test), perl=TRUE))))
#works for :
test<-toupper((regmatches(toupper(test), gregexpr(":\\s* (.*)", toupper(test), perl=TRUE))))
#removes extra characters:
test<-gsub(": |(OF 100); SEE NOTE|%|; ","",test)
#Negative to numeric:
test[grepl("NEGATIVE|<1",test)]<-0
test
Expected result: 2 30 0
Here are some solutions.
The first two are base. The first only uses very simple regular expressions. The second is shorter and the regular expression is only a bit more complicated. In both cases we return NA if there is no match but you can replace NAs with 0 (using ifelse(is.na(x), 0, x) where x is the answer with NAs) afterwards if that is important to you.
The third is almost the same as the second but uses strapply in gsubfn. It returns 0 instead of NA.
1) read.table Replace all colons with semicolons and read it in as semicolon-separated fields. Pick off the second such field and remove the first non-digit and everything after it. Then convert what is left to numeric.
DF <- read.table(text = gsub(":", ";", test),
as.is = TRUE, fill = TRUE, sep = ";", strip.white = TRUE)
as.numeric(sub("\\D.*", "", DF$V2))
##[1] 2 30 NA
2) strcapture Match from the start characters which are not colon or semicolon and then match a colon or semicolon and then match a space and finally capture digits. Return the captured digits converted to numeric.
strcapture("^[^:;]+[;:] (\\d+)", test, list(num = numeric(0)))$num
##[1] 2 30 NA
3) strapply Using the same pattern as in (2) convert the match to numeric and return 0 if the match is empty.
library(gsubfn)
strapply(test, "^[^:;]+[;:] (\\d+)", as.numeric, simplify = TRUE, empty = 0)
## [1] 2 30 0
Another approach:
out <- gsub('(^.+?[;:][^0-9]+)(\\d+)(.*$)|^.+', '\\2', test)
out[out == ''] <- 0
as.numeric(out)
## [1] 2 30 0
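For readers who prefer to see the rule from the question applied one step at a time, here is a base R sketch. It uses the four-element test vector from the question, and it assumes that a string without any delimiter should give NA and that "negative" should map to 0; the helper names has_delim, after, field and digits are made up for this illustration:

```r
test <- c("Score (ABC): 2 (of 100); see note",
          "Amount of ABC; 30%",
          "Presence of ABC: negative",
          "ABC not tested")

has_delim <- grepl("[:;]", test)

# Step 1: drop everything up to the first ":" or ";" and following spaces
after <- sub("^[^:;]*[:;] *", "", test)

# Step 2: drop a second ";" and everything after it
field <- sub(";.*$", "", after)

# Step 3: keep leading digits, but only where a delimiter was present
digits <- ifelse(has_delim & grepl("^[0-9]+", field),
                 sub("^([0-9]+).*$", "\\1", field),
                 NA_character_)
num <- as.numeric(digits)

# Step 4: treat "negative" after a delimiter as 0
num[has_delim & grepl("^negative", field, ignore.case = TRUE)] <- 0
num
```

The fourth element stays NA here because it contains neither ":" nor ";"; replace NA with 0 afterwards if that is the desired convention.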
Per the OP's description (italics are mine):
extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
I think some of the other suggestions may miss that italicized criterion. So here is the OP's test set with one extra condition at the end to test that:
test<-c( "Score (ABC): 2 (of 100); see note",
"Amount of ABC; 30%",
"Presence of ABC: negative",
"...and before a ; if the second punctuation is present, then remove everything after a ; if present [so 666 should not be returned]")
One-liner to return results as requested:
sub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='\\1',
x=test, perl=TRUE)
Results matching OP's request:
[1] "2" "30" "" ""
If the OP really wants an integer with zeros where no match is found, set replacement = '0\\1' and wrap the call with as.integer() as follows:
as.integer( gsub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='0\\1',
x=test, perl=TRUE) )
Result:
[1] 2 30 0 0
Fully working online R (R 3.3.2) example:
https://ideone.com/TTuKzG
Regexp explanation
OP wants to find just one match in a string so the sub() function works just fine.
Technique for using sub() is to make a pattern that matches all strings, but use a capture group in the middle to capture zero or more digits if conditions around it are met.
The pattern .+?[:;]\\D*?[^;](\\d*).*?;*.* is read as follows
.+? Match any character (except for line terminators) + between one and unlimited times ? as few times as possible, expanding as needed (lazy)
[:;] Match a single character in the list between the square brackets, in this case : or ;
\\D Match any character that's NOT a digit (equal to [^0-9])
*? Quantifier * Matches between zero and unlimited times ? as few times as possible, expanding as needed (lazy)
[^;] The ^ hat as first character between square brackets means: Match a single character NOT present in the list between the square brackets, in this case match any character NOT ;
(\\d*) Everything between parentheses is a capturing group - this is the 1st capturing group: \\d* matches a digit (equal to [0-9]) between zero and unlimited times, as many times as possible (greedy)
;* Match the ; character * between zero and unlimited times [so ; does not have to be present but is matched if it is there: This is the key to excluding anything after the second delimiter as the OP requested]
.* Match any character * between zero and unlimited times, as many times as possible (greedy) [so picks up everything to the end of the line]
The replacement = \\1 refers to the 1st capture group in our pattern. We replace everything that was matched by the pattern with what we found in the capture group. \\d* can match no digits, so will return an empty string if there is no number found where we are expecting it.

Extract a year number from a string that is surrounded by special characters

What's a good way to extract only the number 2007 from the following string:
some_string <- "1_2_start_2007_3_end"
The pattern to detect the year number in my case would be:
4 digits
surrounded by "_"
I am quite new to using regular expressions. I tried the following:
regexp <- "_+[0-9]+_"
names <- str_extract(files, regexp)
But this does not take into account that there are always 4 digits and outputs the underlines as well.
You may use a sub option, too:
some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)
Details
.* - any 0+ chars, as many as possible
_ - a _ char
(\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
_.* - a _ and then any 0+ chars up to the end of string.
NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my solution to return the leftmost one, use a lazy quantifier with the first .*: sub(".*?_(\\d{4})_.*", "\\1", some_string).
R test:
some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
We can use regex lookbehind to specify the _ and extract the 4 digits that follow
library(stringr)
str_extract(some_string, "(?<=_)\\d{4}")
#[1] "2007"
If the _ must appear both before and after the 4 digits, use a regex lookahead as well
str_extract(some_string, "(?<=_)\\d{4}(?=_)")
#[1] "2007"
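str_extract() returns only the first match. If a string may contain several candidates, a base R sketch with gregexpr() collects all of them, so that either the leftmost or the rightmost can be picked afterwards. The input below reuses the two-year test string from the other answer:

```r
some_string <- "1_2018_start_2007_3_end"

# All 4-digit runs that have "_" directly before and after them
years <- regmatches(some_string,
                    gregexpr("(?<=_)\\d{4}(?=_)", some_string, perl = TRUE))[[1]]

years[1]              # leftmost
years[length(years)]  # rightmost
```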
Just to get a non-regex approach out there, in which we split on _ and convert to numeric. All non-numbers will be coerced to NA, so we use !is.na to eliminate them. We then use nchar to count the characters, and pull the one with 4.
i1 <- as.numeric(strsplit(some_string, '_')[[1]])
i1 <- i1[!is.na(i1)]
i1[nchar(i1) == 4]
#[1] 2007
This is the quickest regex I could come up with:
\S.*_(\d{4})_\S.*
It means:
one non-space character followed by any number of characters (\S.*),
then _
followed by four digits (\d{4});
the parentheses () capture those four digits, which are your year,
another _
and finally a non-space character followed by any remaining characters (\S.*).
Since you mentioned you're new, please test this and all other answers at https://regex101.com/. It is a good place to learn regex, and it explains in depth what your regex is actually doing.
If you only care about the year, the regex below is enough:
_(\d{4})_
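The pattern above is written in plain regex notation. One way to apply it in R, shown here as a sketch rather than as the answerer's own code, is regexec() together with regmatches(), which returns the full match followed by each capture group:

```r
some_string <- "1_2_start_2007_3_end"

# Element 1 is the full match ("_2007_"), element 2 is capture group 1
m <- regmatches(some_string, regexec("_(\\d{4})_", some_string))[[1]]
year <- as.integer(m[2])
year
```

Note the doubled backslash: inside an R string literal, \d has to be written as \\d.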

R retrieving strings with sub: Why this does not work?

I would like to extract parts of strings. The string is:
> (x <- 'ab/cd efgh "xyz xyz"')
[1] "ab/cd efgh \"xyz xyz\""
Now, I would like first to extract the first part:
> # get "ab/cd efgh"
> sub(" \"[/A-Za-z ]+\"","",x)
[1] "ab/cd efgh"
But I don't succeed in extracting the second part:
> # get "xyz xyz"
> sub("(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "ab/cd efgh \"xyz xyz\""
What is wrong with this code?
Thanks for help.
Your last snippet does not work because you reinsert the whole match back into the result: (\"[A-Za-z ]+\")$ matches and captures ", 1+ letters and spaces, " into Group 1 and \1 in the replacement puts it back.
You may actually get the last part inside quotes by removing all chars other than " at the start of the string:
x <- 'ab/cd efgh "xyz xyz"'
sub('^[^"]+', "", x)
The sub here will find and replace just once, and it will match the string start (with ^) followed with 1+ chars other than " with [^"]+ negated character class.
To get this to work with sub, you have to match the whole string. The help file says
For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).
So to get this to work with your regex, pre-pend the sometimes risky catchall ".*"
sub(".*(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "\"xyz xyz\""
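Since sub() replaces rather than extracts, another option is to extract the match directly. A minimal sketch with regexpr() and regmatches(), assuming the quoted part contains no embedded double quotes:

```r
x <- 'ab/cd efgh "xyz xyz"'

# Extract the first double-quoted run, quotes included
quoted <- regmatches(x, regexpr('"[^"]*"', x))

# Strip the quotes if only the inner text is wanted
inner <- gsub('"', "", quoted, fixed = TRUE)

quoted
inner
```

With this approach the pattern only has to describe the part you want, not the whole string.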

Resources