How to extract numbers between multiple underscores - r

I have files named like this:
"C:/FolderInf/infoAnalysis/und_2017_01_28_12.csv"
I would like to extract the digits (actually a date) from the file names using gsub() or sub(), i.e. base R without any packages.
Failed attempts so far:
gsub("\\D", "", lfs) also returns the trailing 12, but I need only the first 8 digits.
gsub("(.+?)(\\_.*)", "\\2", lfs) makes it complicated to remove the remaining unwanted characters.

You can specify this extraction exactly:
gsub(".*(\\d{4}_\\d{2}_\\d{2}).*", "\\1", x)
## [1] "2017_01_28"

We can use basename to drop the directory part of the path, then remove two pieces with a single pattern: one or more non-underscore characters ([^_]+) from the start (^) of the string followed by a _, or (|) a _ followed by one or more non-underscore characters ([^_]+) up to the end ($) of the string. Both are replaced with an empty string ("").
gsub("^[^_]+_|_[^_]+$", "", basename(str1))
#[1] "2017_01_28"
data
str1 <- "C:/FolderInf/infoAnalysis/und_2017_01_28_12.csv"
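Since the question says the digits are a date, the extracted string can be turned into a Date object. This is a minimal sketch that reuses the pattern above; the format string "%Y_%m_%d" and the variable names date_str and file_date are assumptions for illustration, not part of the original answer.

```r
str1 <- "C:/FolderInf/infoAnalysis/und_2017_01_28_12.csv"

# Strip the leading "und_" and the trailing "_12.csv" from the file name
date_str <- gsub("^[^_]+_|_[^_]+$", "", basename(str1))

# Parse the underscore-separated year, month and day (assumed order)
file_date <- as.Date(date_str, format = "%Y_%m_%d")
```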

You could try (but there are probably more elegant solutions available):
substr(gsub("[^0-9]","",test),1,8)
[1] "20170128"

Using R base function strsplit (No package needed)
paste(as.character(unlist(strsplit(as.character(string),"[_.]"))[c(2,3,4)]),collapse=" ")

Since you say they are dates, you can use the following regexp, for example:
^[^_]*_([0-9]*)_([0-9]*)_([0-9]*)_([0-9]*)
Beware that you cannot abbreviate this expression as _([0-9]*){4}: you would end up with a single capture group (instead of the four that the expression above has) and would not get the four numbers. If you need only the date, you can cut the regexp before the last group.
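One way to read the four groups in base R is regexec together with regmatches. This is only a sketch: the variable names (x, pattern, m, groups, date_part) are illustrative, and it assumes the directory part of the path contains no underscore, as in the question's example.

```r
x <- "C:/FolderInf/infoAnalysis/und_2017_01_28_12.csv"
pattern <- "^[^_]*_([0-9]*)_([0-9]*)_([0-9]*)_([0-9]*)"

# regexec records the full match and each capture group;
# regmatches turns those positions into substrings
m <- regmatches(x, regexec(pattern, x))[[1]]

# m[1] is the full match, the remaining elements are the four groups
groups <- m[-1]

# Keep only the first three groups (year, month, day) for the date
date_part <- paste(groups[1:3], collapse = "_")
```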

Related

Extract a certain pattern string from the text by R

I have a column of text that looks like this:
str1 = "ABCID 123456789 is what I'm looking for, could you help me to check this Item's status?"
I want to use the gsub function in R to extract "ABCID 123456789" from it. The number may vary, but ABCID is constant. Does anyone know a solution for this? Thanks very much!
We can use str_extract to select the fixed word followed by space and one or more numbers (\\d+)
library(stringr)
str_extract(df1$col1, "ABCID \\d+")
If there are multiple instances, use str_extract_all
str_extract_all(df1$col1, "ABCID \\d+")
NOTE: The OP states that the goal is to extract "ABCID 123456789" from the text.
If the number has a constant length (9), you could use a positive lookbehind:
sub("(?<=ABCID \\d{9}).*", "", str1, perl = TRUE)
# [1] "ABCID 123456789"
Match the beginning of the string (^), the leading letters (ABCID), a space, digits (\d+) and everything else (.*), and replace it all with the captured portion, i.e. the portion within parentheses. Note that we want to use sub, not gsub, here because there is only one substitution.
sub("^(ABCID \\d+).*", "\\1", str1)
## [1] "ABCID 123456789"
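A base R alternative that extracts the match directly, rather than deleting everything around it, is regexpr combined with regmatches. This is a sketch, not one of the original answers, and the variable name id is illustrative. Note that regmatches drops elements that have no match, so on a whole column the result can be shorter than the input.

```r
str1 <- "ABCID 123456789 is what I'm looking for, could you help me to check this Item's status?"

# regexpr locates the first match; regmatches returns the matched text
id <- regmatches(str1, regexpr("ABCID \\d+", str1))
```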

How to remove starting(suffix) special character("_") from column names [duplicate]

After I collapse my rows and separate using a semicolon, I'd like to delete the semicolons at the front and back of my string. Multiple semicolons represent blanks in a cell. For example an observation may look as follows after the collapse:
;TX;PA;CA;;;;;;;
I'd like the cell to look like this:
TX;PA;CA
Here is my collapse code:
new_df <- group_by(old_df, unique_id) %>% summarize_each(funs(paste(., collapse = ';')))
If I try to gsub for semicolon it removes all of them. If I remove the end character it just removes one of the semicolons. Any ideas on how to remove all at the beginning and end, but leave the ones in between the observations? Thanks.
Use the regular expression ^;+|;+$:
x <- ";TX;PA;CA;;;;;;;"
gsub("^;+|;+$", "", x)
The ^ anchors the match at the start of the string, the + means one or more repetitions, and $ anchors at the end of the string. The | means "OR". Combined, the pattern matches one or more ; at the start of the string OR one or more ; at the end, and gsub replaces them with an empty string.
The stringi package allows you to specify patterns which you wish to preserve and trim everything else. If you only have letters there (though you could specify other pattern too), you could simply do
stringi::stri_trim_both(";TX;PA;CA;;;;;;;", "\\p{L}")
## [1] "TX;PA;CA"
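To see that only leading and trailing semicolons are affected, here is a small sketch applying the gsub pattern to a vector. The second and third inputs are made-up test cases, not data from the question.

```r
# First element is from the question; the other two are hypothetical
x <- c(";TX;PA;CA;;;;;;;", "NY;;NJ;", ";;;")

# Remove runs of ';' at the start or end; interior ';' are untouched
trimmed <- gsub("^;+|;+$", "", x)
```

The interior ;; in the second element is kept because it touches neither end of the string, and a cell made up only of semicolons becomes an empty string.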

regex - define boundary using characters & delimiters

I realize this is a rather simple question and I have searched throughout this site, but just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First have the regex to pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
I'd like to get this:
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keep others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations, and substitutions are being made, but I'm not getting my desired output. I'm also guessing one regex could work for both? Thanks for your help everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)
Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look behind. It says that any match must be preceded by either a semi-colon or the beginning of the string, but what it matches is not included in the part to be replaced. The " *" will match 0 or more spaces (since your example string had one case where there was a space between the semi-colon and the lab). It then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is greedy by default, this will match everything up to, but not including, the next semi-colon or the end of the string. You could also include a positive look ahead (?=;|$) to make sure it goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful to match the delimiter on only one side (the leading one in my example), because the match consumes the delimiter (unlike a look-ahead or look-behind). If you consume both delimiters, the next field's leading delimiter has already been used up, so only every other field will be considered for replacement.
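The pitfall can be reproduced with a small sketch. The two-sided pattern below is a hypothetical variant written only to show the problem; it is not a recommended solution, and the names one_side and both_sides are illustrative.

```r
tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
         "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")

# Consumes only the leading delimiter: every matching field is replaced
one_side <- gsub("(;|^) *lab[^;]*", "\\1dog", tmp)

# Hypothetical variant that also consumes the trailing delimiter:
# a field directly after a replaced field loses its leading ';'
# to the previous match and is skipped
both_sides <- gsub("(;|^) *lab[^;]*(;|$)", "\\1dog\\2", tmp)

# TRUE where the two-sided variant left a "lab..." field unreplaced
leftover <- grepl("lab", both_sides)
```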
I'd recommend doing this in three steps:
1. Split the string by the delimiters
2. Do the replacements
3. (optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = TRUE))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = TRUE))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probably a better way to do the replacements than regular expressions (using switch(), maybe?)
Use paste(fixedApts, collapse = ";") to collapse the vector back into a single semicolon-delimited string at the end, if that's what you need to do.
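Putting the split, replace and re-join steps together, one regex-free option is a named lookup vector. This is a sketch under assumptions: the lookup table and the names parts, lookup, hit and result are illustrative, and only the two abbreviations from the question are mapped.

```r
myString <- "Apt;House;Condo;Apts;"

# 1. Split on the literal delimiter (a trailing empty field is dropped)
parts <- strsplit(myString, ";", fixed = TRUE)[[1]]

# 2. Replace whole fields via a lookup table (hypothetical mapping)
lookup <- c(Apt = "Apartment", Apts = "Apartment")
hit <- parts %in% names(lookup)
parts[hit] <- unname(lookup[parts[hit]])

# 3. Join the fields back together with the same delimiter
result <- paste(parts, collapse = ";")
```

Because whole fields are compared rather than substrings, "Apts" cannot be partially matched by a rule meant for "Apt".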

str_extract in R to extract number from a string

I have a table like the one below, from which I would like to extract the number following the underscore:
description desired_output
desc_lvl1_id_1 1
desc_lvl1_id_2 2
The solution that I have come up with is split into two parts: first get the underscore and the number that I want, then take out the underscore: gsub("_", "", str_extract(description, "_[0-9]")). I'm hoping, though, that this can be done in one step.
We can use a positive lookbehind ((?<=_)) and match the numbers that follow the _ as the pattern in str_extract.
library(stringr)
df1$desired_output <- as.numeric(str_extract(df1$description, '(?<=_)\\d+'))
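If stringr is not available, a base R sketch with sub and a capture group gives the same numbers. It assumes the number sits at the very end of the string after the last underscore; the third input is a made-up case to show that multi-digit numbers work.

```r
# First two values are from the question; the third is hypothetical
description <- c("desc_lvl1_id_1", "desc_lvl1_id_2", "desc_lvl1_id_10")

# Keep only the digits that follow the last underscore at the end
desired_output <- as.numeric(sub(".*_(\\d+)$", "\\1", description))
```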
