I am using an R script in the Power BI Query Editor to find a six-digit numeric string in a description column and add it as a new column to the table. It works EXCEPT where the number string is preceded by a "_" (underscore character).
# 'dataset' holds the input data for this script ##
library(stringr)
# assign regex to variable #
pattern <- "(?:^|\\D)(\\d{6})(?!\\d)"
# define function to use pattern ##
isNewSiteNum = function(x) substr(str_extract(x,pattern),1,6)
# output statement - within adds new column to dataset ##
output <- within(dataset,{NewSiteNum=isNewSiteNum(dataset$LineItemComment)})
The number string can be at the start, end, or middle of the description text. When the number string is preceded by an underscore (_123456 for example), the regex returns _12345 instead of 123456. I am not sure how to tell it to skip the underscore but still grab the six digits (without breaking the cases with no leading underscore, which currently work).
regex101.com shows the full match as '_123456' and group 1 as '123456', but my result column has '_12345'. For the case with a leading space the full match is ' 123456', yet my result column is correct. I seem to be missing something, since the full match has 7 characters and the desired group 1 has 6.
The problem was with str_extract, which I could not get to work. However, by using str_match and selecting the capture group, I get what I am looking for.
# 'dataset' holds input data
library(stringr)
pattern<-"(?:^|\\D)(\\d{6})(?!\\d)"
SiteNum = function(x) str_match(x, pattern)[,2]
output<-within(dataset,{R_SiteNum2=SiteNum(dataset$ReqComments)})
This does not pick up non-numeric initial characters.
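A small sketch of what seems to be going on, in base R with perl = TRUE rather than stringr: the full match of the original pattern includes the non-digit consumed by (?:^|\D), so taking the first six characters keeps the underscore and drops the last digit. The sample strings and the lookbehind pattern below are assumptions for illustration, not part of the original script; the lookbehind is one possible way to make the full match equal to the six digits.

```r
# Made-up sample comments; the real data lives in the description column.
x <- c("_123456 pump", "site 123456", "123456", "no number here", "1234567")

# Original pattern: the full match includes the non-digit consumed by (?:^|\D).
old_pattern <- "(?:^|\\D)(\\d{6})(?!\\d)"
old_hit <- regexpr(old_pattern, x[1], perl = TRUE)
full_match <- regmatches(x[1], old_hit)   # "_123456"
truncated <- substr(full_match, 1, 6)     # "_12345"

# Lookbehind variant (an assumption): nothing before the digits is consumed,
# so the full match is exactly the six digits.
new_pattern <- "(?<!\\d)\\d{6}(?!\\d)"
hit <- regexpr(new_pattern, x, perl = TRUE)
site_num <- rep(NA_character_, length(x))
site_num[hit != -1] <- regmatches(x, hit)
site_num
# "123456" "123456" "123456" NA NA
```

The seven-digit string yields NA because the lookarounds reject any six digits that sit next to another digit.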
I have a single column of words that I am trying to clean. Some of the words have characters in them that I would like replaced with a space.
I know how to replace a single character in a string:
df2 <- data.frame(gsub("-"," ",data$string_column))
This example replaces the '-' character with a space.
How do I apply this procedure to an array of characters? I have tried the following:
df2 <- data.frame(gsub(c("-","&")," ",data$string_column))
This code runs, but it only replaces the first character, not the second.
Any ideas on how to define a list of characters to be replaced by a space?
Thank you
You need
data$string_column <- gsub("[-&]", " ", data$string_column)
This way, all - and & chars in the string_column of the data dataframe will get replaced with a space char.
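As a quick check of the character-class approach, here is a minimal sketch on a made-up vector (the strings are assumptions, not the asker's data). The hyphen is placed first inside the brackets so that it is read as a literal character rather than as a range.

```r
# Made-up input standing in for data$string_column.
x <- c("rock-n-roll", "R&D", "plain")

# One pattern covers every character listed inside the brackets.
cleaned <- gsub("[-&]", " ", x)
cleaned
# "rock n roll" "R D" "plain"
```

More characters can be added inside the brackets as needed, for example "[-&/]".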
I have a vector with some codes. However, for an unknown reason, some of the codes start with x# (# being a digit 0-9). If a vector item does start with x#, I need to remove its first two characters.
Examples:
codes <- c('x0fa319-432f39-4fre78', '23weq0-4fsf198-417203', 'x2431-5435-1242-qewf')
expectedResult <- c('fa319-432f39-4fre78', '23weq0-4fsf198-417203', '431-5435-1242-qewf')
I tried using str_replace and gsub, but I couldn't get it right:
gsub("X\\d", "", codes)
but this would remove the x# even if it was in the middle of the string.
Any ideas?
You can use
codes <- c('x0fa319-432f39-4fre78', '23weq0-4fsf198-417203', 'x2431-5435-1242-qewf')
sub("^x\\d", "", codes, ignore.case=TRUE)
See the R demo.
The ^x\d pattern matches x and any digit at the start of a string.
sub replaces the first occurrence only.
ignore.case=TRUE enables case insensitive matching.
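To see the effect of the ^ anchor, here is a short sketch that compares the anchored sub call with an unanchored gsub. The fourth code, with x9 in the middle, is a made-up value added for illustration.

```r
# The last element is a hypothetical code with "x9" in the middle.
codes <- c("x0fa319-432f39-4fre78", "23weq0-4fsf198-417203",
           "X2431-5435-1242-qewf", "ab-x9cd")

# Anchored: only a leading x/X followed by a digit is removed.
anchored <- sub("^x\\d", "", codes, ignore.case = TRUE)
anchored
# "fa319-432f39-4fre78" "23weq0-4fsf198-417203" "431-5435-1242-qewf" "ab-x9cd"

# Unanchored: the "x9" in the middle of the last code is removed too.
unanchored <- gsub("x\\d", "", codes, ignore.case = TRUE)
unanchored[4]
# "ab-cd"
```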
How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the whole string (the data in its initial form) when it matches, and NA for the shorter strings. Thoughts?
Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021
See the R demo online
I have no experience with R but I do know regular expressions. Looking at the pattern you're specifying, "([[:digit:]]+)", I assume [[:digit:]] stands for [0-9], so you're capturing one group of digits.
It seems to me you're missing a + to make it capture multiple groups of digits.
I'd think you'd need to use "([[:digit:]]+)+".
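A note of caution with a sketch: a quantifier on the group still matches only one contiguous run of digits, because the "." and "-" separators interrupt it. One way to collect every run, assuming base R rather than stringr, is to ask for all matches and paste them together. The two input strings are taken from the question.

```r
x <- c("1010.1-1", "1030-1")

# gregexpr finds every run of digits; regmatches returns them per element.
runs <- regmatches(x, gregexpr("[0-9]+", x))

# Glue the runs of each element together and convert to numeric.
combined <- as.numeric(vapply(runs, paste, character(1), collapse = ""))
combined
# 101011 10301
```

This gives the same numbers as removing all non-digit characters; it is only a different route to the same result.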
After I collapse my rows and separate using a semicolon, I'd like to delete the semicolons at the front and back of my string. Multiple semicolons represent blanks in a cell. For example an observation may look as follows after the collapse:
;TX;PA;CA;;;;;;;
I'd like the cell to look like this:
TX;PA;CA
Here is my collapse code:
new_df <- group_by(old_df, unique_id) %>% summarize_each(funs(paste(., collapse = ';')))
If I try to gsub for semicolon it removes all of them. If I remove the end character it just removes one of the semicolons. Any ideas on how to remove all at the beginning and end, but leave the ones in between the observations? Thanks.
Use the regular expression ^;+|;+$
x <- ";TX;PA;CA;;;;;;;"
gsub("^;+|;+$", "", x)
The ^ indicates the start of the string, the + indicates one or more repetitions of the preceding character, and $ indicates the end of the string. The | means "OR". So, combined, it searches for any number of ; at the start of the string OR any number of ; at the end of the string, and gsub replaces those with an empty string.
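A short sketch of how the pattern behaves on a few inputs. The second and third strings are made up for illustration. It also shows why gsub is needed here: sub stops after the first match, so only the leading semicolons would be removed.

```r
# First value is from the question; the other two are hypothetical.
x <- c(";TX;PA;CA;;;;;;;", "NY;;NJ", ";;;")

# gsub strips both the leading and the trailing run; inner ones stay.
trimmed <- gsub("^;+|;+$", "", x)
trimmed
# "TX;PA;CA" "NY;;NJ" ""

# sub replaces only the first match, i.e. the leading semicolon.
leading_only <- sub("^;+|;+$", "", x[1])
```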
The stringi package allows you to specify patterns which you wish to preserve and trim everything else. If you only have letters there (though you could specify other pattern too), you could simply do
stringi::stri_trim_both(";TX;PA;CA;;;;;;;", "\\p{L}")
## [1] "TX;PA;CA"
I have a variable in a data frame that contains raw JSON text. Some observations have a 14-digit number that I want to extract and some don't. If the observation has the information, it is in this format:
{"blur": "10010010010010"
I want to extract the 14 digits after {"blur": " if there is a match for this left-hand side part of the string. I tried str_extract, but my regex syntax is not the best. Any suggestions?
If it's fully formed JSON you could use a JSON parser, but assuming
- it's just fragments as shown in the question, or it is fully formed and you prefer to use regular expressions anyway
- each input has 0 or 1 occurrences of the digit string
- if 0 occurrences then use NA
then try this.
The second argument to strapply is the regular expression. It returns the portion matched to the capture group, i.e. the part of the regular expression within parentheses. The empty=NA argument tells it what to return if no occurrences are found.
library(gsubfn)
s <- c('{"blur": "10010010010010"', 'abc') # test input
strapply(s, '{"blur": "(\\d+)"', empty = NA, simplify = TRUE)
## [1] "10010010010010" NA
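If adding the gsubfn dependency is not an option, a rough base R equivalent could look like the sketch below. It is an alternative, not the method above: it uses regexec and regmatches, and it tightens the pattern to exactly 14 digits as described in the question, which is an assumption about the data.

```r
s <- c('{"blur": "10010010010010"', 'abc')  # same test input as above

# [{] matches a literal brace; the parentheses capture exactly 14 digits.
pattern <- '[{]"blur": "([0-9]{14})"'
m <- regmatches(s, regexec(pattern, s))

# Each element of m is c(full match, group 1), or empty when nothing matched.
blur <- vapply(m, function(hit) {
  if (length(hit) >= 2) hit[2] else NA_character_
}, character(1))
blur
# "10010010010010" NA
```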