Add parentheses to a string with a given condition - r

I want to add parentheses to the strings below under a condition. Each number consists of two parts, "Id-subId", and I want to wrap the subIds in parentheses when an Id has multiple subIds.
sample_string1 = "376-12~23, 28, 32, 35, 37,376-1"
sample_string2 = "391-1~8, 391-22~23"
sample_string3 = "391-10~21, 391-24, 27, 29"
This is my desired output:
desire_string1 = "376-(12~23, 28, 32, 35, 37),376-1"
desire_string2 = "391-(1~8), 391-(22~23)"
desire_string3 = "391-(10~21), 391-(24, 27, 29)"
How can I do this? Thanks in advance

This is a fairly complicated regex problem. I would honestly recommend that, instead of using this solution, you split the Id and subId values into separate variables and keep the data tidy.
However, you asked this question, so here's a regex answer. I've used the stringr package because I find it easier and more readable than base R's grep family.
The regex breaks down like this:
(?<=-) - Positive lookbehind: the match must be preceded by a -, but the - is not part of the match.
(\\d+[\\~\\,] ?[^\\-]*)+ - Capture one or more digits followed by either a ~ or a , then an optional space, then zero or more characters that aren't a -. The group can repeat one or more times.
((?=, *\\d+-)|$) - Either a positive lookahead for a , then optional spaces, one or more digits and a - (the start of the next Id), or the end of the string.
replacement= "(\\1)" - Replace the match with ( then the first captured group then ).
library(stringr)
sample_string1 = "376-12~23, 28, 32, 35, 37,376-1"
sample_string2 = "391-1~8, 391-22~23"
sample_string3 = "391-10~21, 391-24, 27, 29"
# (?!u)
ss1 <- str_replace_all(sample_string1,
"(?<=-)(\\d+[\\~\\,] ?[^\\-]*)+((?=, *\\d+-)|$)",
replacement= "(\\1)")
ss1
# "376-(12~23, 28, 32, 35, 37),376-1"
ss2 <- str_replace_all(sample_string2,
"(?<=-)(\\d+[\\~\\,] ?[^\\-]*)+((?=, *\\d+-)|$)",
replacement= "(\\1)")
ss2
# "391-(1~8), 391-(22~23)"
ss3 <- str_replace_all(sample_string3,
"(?<=-)(\\d+[\\~\\,] ?[^\\-]*)+((?=, *\\d+-)|$)",
replacement= "(\\1)")
ss3
# "391-(10~21), 391-(24, 27, 29)"
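Since the same pattern is applied three times above, the calls can probably be collapsed into one vectorised call. The following is a minimal sketch in base R rather than stringr, assuming `gsub()` with `perl = TRUE` treats this pattern the same way as `str_replace_all()`; the names `samples`, `pattern` and `result` are just illustrative.

```r
# One vectorised call instead of three separate ones.
# perl = TRUE is required because the pattern uses lookbehind and lookahead.
samples <- c(
  "376-12~23, 28, 32, 35, 37,376-1",
  "391-1~8, 391-22~23",
  "391-10~21, 391-24, 27, 29"
)

# Same pattern as in the answer; ~ , and - need no escaping inside [ ] here.
pattern <- "(?<=-)(\\d+[~,] ?[^-]*)+((?=, *\\d+-)|$)"

result <- gsub(pattern, "(\\1)", samples, perl = TRUE)
print(result)
```

If the strings live in a data frame column, the same call can be applied to the column directly.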

A regex that produces the correct output is:
(?:(\d+-)((?:\d+~\d+|(?:,?\s*\d+){2,})+)(?=,\s*\d+-|\"))
Demo: https://regex101.com/r/QHDCMd/1/
(\d+-) matches the ID and the dash
\d+~\d+ matches a subid range, or ...
(?:,?\s*\d+){2,} matches at least two subids
(?=,\s*\d+-|\") is a positive lookahead for the next ID or the closing quote
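The regex above is written for the demo, where the closing quote is part of the text. Inside R the quotes are not part of the string, so one way to adapt it is to look ahead for the end of the string instead. This is a sketch under that assumption; the replacement `"\\1(\\2)"` and the helper name `add_parens` are my own additions, not part of the answer.

```r
# Hypothetical helper: wraps multiple subids in parentheses.
# The \" in the lookahead is replaced by $ (end of string), because an R
# string does not contain its surrounding quotes.
add_parens <- function(x) {
  pattern <- "(\\d+-)((?:\\d+~\\d+|(?:,?\\s*\\d+){2,})+)(?=,\\s*\\d+-|$)"
  # Group 1 is the ID with its dash, group 2 is the run of subids.
  gsub(pattern, "\\1(\\2)", x, perl = TRUE)
}

samples <- c(
  "376-12~23, 28, 32, 35, 37,376-1",
  "391-1~8, 391-22~23",
  "391-10~21, 391-24, 27, 29"
)
print(add_parens(samples))
```

A lone subid such as the trailing 376-1 is left alone because neither alternative in group 2 can match a single number.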

R - changing date's order

I have a dataframe (dat), with a "date" variable, which is in the format of dd.mm.yyyy (example: 31.12.2022)
I would like to know how could I reverse it to yyyy.mm.dd?
I also tried to separate d, m, and y, so that I could re-merge them in a different column, but I am facing problems with this.
dat2 <- separate(data = dat,
col = "date",
sep = ".",
into = c("session_day","session_month", "session_year"))
which is giving this message
Warning message: Expected 3 pieces.
Additional pieces discarded in 1566 rows [1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
I appreciate any suggestions.
Thank you.
I like what you already tried and think you can continue with that. Casting the column to a date first and using format to rearrange it as you wish, as mentioned in the comments, is definitely the best way to approach this problem, but I would like to explain why you are getting that warning when trying it your way:
You are getting the warning because the sep argument in tidyr::separate is interpreted as a regular expression. You are using sep = "." right now, but . is a special character in regular expressions that matches any character.
If you want to match a literal dot, you need to escape it as \\.
This should work for you, and you can move on from there.
dat2 <- separate(data = dat,
col = "date",
sep = "\\.",
into = c("session_day","session_month", "session_year"))
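For the date-based route mentioned at the start of this answer, a minimal base R sketch could look like this. The data frame here is made up for illustration, and `date_ymd` is a hypothetical column name.

```r
# Made-up example data in dd.mm.yyyy format.
dat <- data.frame(date = c("31.12.2022", "01.02.2023"),
                  stringsAsFactors = FALSE)

# Parse the text into Date objects, telling R the input layout.
parsed <- as.Date(dat$date, format = "%d.%m.%Y")

# Write the dates back out as yyyy.mm.dd text.
dat$date_ymd <- format(parsed, "%Y.%m.%d")
print(dat)
```

Keeping the parsed Date column around is usually worthwhile, since it sorts and subtracts correctly, while the formatted column is plain text.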

Parsing a character string of varying lengths

I'm trying to parse a string of estimated salaries to create a new field called "Salary.Min" which should be a numeric value. It seems straightforward and I can handle this in SQL with a quick case statement but I'm having trouble translating into R.
Do I need to use a for loop here or is there a more efficient/simple way? Generally I'm looking to do something akin to "if 4th character in string = K then return characters 2:3, otherwise return characters 2:4"
This code seemed to be okay at first but after validating I've realized it's eliminating all records where the 4th character = K (ie minimum salaries of $100k+)
ifelse(
substr(data_public$Salary.Estimate, 4,4) == "K",
data_public$Salary.Min<- substr(data_public$Salary.Estimate, 2, 3),
data_public$Salary.Min<- substr(data_public$Salary.Estimate, 2, 4))
I have a wide range of Salary.Estimate values, a few for example:
a) $105K - $115K
b) $89K - $95K
c) $78K - $85K
We could make this shorter with trimws and substr. Here, we take the substring from the 2nd to the 4th character and pass whitespace = "K" to trimws, where which = 'right' means only a trailing K is stripped (the whitespace argument needs R 3.6.0 or later)
data_public$Salary.Min <- trimws(substr( data_public$Salary.Estimate, 2, 4),
which = 'right', whitespace = "K")
Or we could use sub to capture the digits between the leading $ and the first K and drop everything else
sub("^\\$([0-9]+)K.*", "\\1", data_public$Salary.Estimate)
In the ifelse code, the assignment should be outside the ifelse
data_public$Salary.Min<- with(data_public,
ifelse(substr(Salary.Estimate, 4, 4) == "K",
substr(Salary.Estimate, 2, 3), substr(Salary.Estimate,2, 4)))
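The question asks for a numeric Salary.Min, while substr and ifelse return character values. A small sketch of the conversion on the three example values from the question, assuming every value starts with $ followed by digits and a K; `salary` and `salary_min` are illustrative names.

```r
# The three example values from the question.
salary <- c("$105K - $115K", "$89K - $95K", "$78K - $85K")

# Keep only the digits between the leading "$" and the first "K",
# then convert the resulting text to numbers.
salary_min <- as.numeric(sub("^\\$([0-9]+)K.*", "\\1", salary))
print(salary_min)
```

Because the digits are captured by pattern rather than by position, this does not depend on whether the minimum has two or three digits.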

Insert a separator between a number with Regex

I have a 20-digit number. Something like: 000001001520081000000.
But I have to turn this into 00000100-15.2008.1.00.0000. After seven numbers, I have to insert a -. Then, after 2, I insert a dot. Then, again after four, one and two numbers.
I was trying to find the number this way: \d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d
and then convert it to \d\d\d\d\d\d\d-\d\d.\d.\d\d\d\d.\d\d\d\d , but it was not working.
I really do not know how to do this. I am using R and I tried with grep.
(\d{7})(\d{2})(\d{4})(\d)(\d{2})(\d{4})
By placing capture groups around each of your intervals, you can use gsub to insert values between the matches.
gsub(
"(\\d{7})(\\d{2})(\\d{4})(\\d)(\\d{2})(\\d{4})",
"\\1-\\2,\\3,\\4,\\5,\\6",
"000001001520081000000",
perl=TRUE
)
[1] "0000010-01,5200,8,10,00000"
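The output above follows the 7-2-4-1-2-4 split with commas, which leaves one digit over because the sample string actually has 21 digits. To reproduce the exact target from the question, 00000100-15.2008.1.00.0000, the first group seems to need eight digits and the separators need to be dots. A sketch under that reading of the question:

```r
x <- "000001001520081000000"

# Groups of 8, 2, 4, 1, 2 and 4 digits; ^ and $ make sure the whole
# string is consumed, so nothing is left over after the last group.
pattern <- "^(\\d{8})(\\d{2})(\\d{4})(\\d)(\\d{2})(\\d{4})$"

# A hyphen after the first group, dots between the remaining groups.
formatted <- sub(pattern, "\\1-\\2.\\3.\\4.\\5.\\6", x, perl = TRUE)
print(formatted)
```

With the anchors in place, a string of any other length is returned unchanged instead of being partly reformatted.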
tmp <- as.character("000001001520081000000")
tmp2 <- paste0(substr(tmp, 1, 8),
"-",
substr(tmp, 9, 10),
".",
substr(tmp, 11, 14),
".",
substr(tmp, 15, 15),
".",
substr(tmp, 16, 17),
".",
substr(tmp, 18, nchar(tmp)))
tmp2
Output:
[1] "00000100-15.2008.1.00.0000"

R "return()" returning an NA value at the end of a function

I am having the following problem in R. This function takes in a vector of numbers that have "[", and "]" at the beginning and end. The goal of the function is to remove the starting and trailing square brackets and return a vector of the numbers. A sample input is "[23, 54, 12, 54, 32, 45, 74, 29]" and the output should be, "23, 54, 12, 54, 32, 45, 74, 29", as a numeric object. Everything works until I try to return the value. The "return(thing)" statement returns NA instead of the vector. I must be missing something. Any thoughts.
split_bmi <- function(thing) {
thing <- as.character(thing)
thing <- strsplit(thing, "")
thing <- unlist(thing)
thing <- thing[c(-1, -length(thing))]
thing <- capture.output(cat(thing, sep = ""))
thing <- list(strsplit(thing, ","))
thing <- as.numeric(thing)
return(thing)
}
thing is a nested list when you pass it to as.numeric: strsplit already returns a list, and list() wraps that in another one. as.numeric is not smart enough to look through the elements of a list, so it cannot reach the numbers inside. Try as.numeric(unlist(thing)).
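Applied to the function from the question, the fix might look like the sketch below. It also swaps the character-by-character bracket removal for substr, which is my own simplification and assumes the input always starts with "[" and ends with "]".

```r
split_bmi <- function(thing) {
  thing <- as.character(thing)
  # Drop the first and last character, i.e. the "[" and "]".
  thing <- substr(thing, 2, nchar(thing) - 1)
  # strsplit() returns a list holding one character vector.
  thing <- strsplit(thing, ",")
  # Flatten the list first, then convert the pieces to numbers.
  as.numeric(unlist(thing))
}

print(split_bmi("[23, 54, 12, 54, 32, 45, 74, 29]"))
```

The spaces after the commas are harmless here, since as.numeric ignores leading and trailing whitespace.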
@joran's solution is a very good one.
Another solution, using the stringr package.
library(stringr)
split_bmi <- function(x) {
x <- str_replace(x, "\\[" , "") %>%
str_replace("\\]", "") %>%
str_split(pattern = ",") %>%
unlist() %>%
as.numeric()
return(x)
}

convert str(time) to date format with other data included

I have data in the format:
['12,Dec,2014, 02,15,28,31,37,04,06', '9,Dec,2014, 01,03,31,42,46,04,11',...]
I am trying to convert the date component of each string into date format using:
new_data =''
for line in date_data:
    line = datetime.datetime.strptime(str(line), "%d,%b,%Y")
    new_data = new_data + line
print(new_data)
At least the routine recognises the date part, but it can do nothing with the numbers. How could I overcome this problem, please? I have tried using % for as many characters as follow the date, without success. I have never used the time module before.
What I want to achieve is to associate each number with the date it appears on. I am trying to teach myself parsing of text files, by the way.
If the date is separated from the numbers by a comma followed by a space, then you could use line.split(', ', 1) to split the line into two parts.
Then you could call datetime.datetime.strptime to parse the date.
import datetime as DT
date_data = ['12,Dec,2014, 02,15,28,31,37,04,06', '9,Dec,2014, 01,03,31,42,46,04,11']
for line in date_data:
    # Split once on ', ' so the date and the numbers end up in separate parts
    part = line.split(', ', 1)
    date = DT.datetime.strptime(part[0], '%d,%b,%Y').date()
    # Build a list so the numbers print directly (map() is lazy in Python 3)
    numbers = [int(n) for n in part[1].split(',')]
    print(date, numbers)
yields
2014-12-12 [2, 15, 28, 31, 37, 4, 6]
2014-12-09 [1, 3, 31, 42, 46, 4, 11]
