Insert a separator between a number with Regex

I have a 21-digit number, something like 000001001520081000000.
I have to turn it into 00000100-15.2008.1.00.0000. After the first eight digits I have to insert a -. Then, after the next two, I insert a dot, and then again after the next four, one, and two digits.
I tried to match the number this way: \d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d
and then convert it to \d\d\d\d\d\d\d-\d\d.\d.\d\d\d\d.\d\d\d\d, but it did not work.
I do not know how to proceed. I am using R and I tried grep.

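(\d{8})(\d{2})(\d{4})(\d)(\d{2})(\d{4})
By placing capture groups around each of your intervals, you can use gsub to insert the separators between the matches. Note that the sample string has 21 digits and the desired output starts with a group of eight digits, so the first group is \d{8} rather than \d{7}.
gsub(
  "(\\d{8})(\\d{2})(\\d{4})(\\d)(\\d{2})(\\d{4})",
  "\\1-\\2.\\3.\\4.\\5.\\6",
  "000001001520081000000",
  perl = TRUE
)
[1] "00000100-15.2008.1.00.0000"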

tmp <- "000001001520081000000"
tmp2 <- paste0(substr(tmp, 1, 8),
               "-",
               substr(tmp, 9, 10),
               ".",
               substr(tmp, 11, 14),
               ".",
               substr(tmp, 15, 15),
               ".",
               substr(tmp, 16, 17),
               ".",
               substr(tmp, 18, nchar(tmp)))
tmp2
Output:
[1] "00000100-15.2008.1.00.0000"
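The same fixed-width idea can be written more generally. The sketch below is only an illustration: the helper name format_case_number and its widths and seps parameters are hypothetical, and the defaults assume the 8-2-4-1-2-4 grouping seen in the desired output.

```r
# Hypothetical helper (not from the answers above): split a digit string
# into fixed-width groups and join the groups with the given separators.
format_case_number <- function(x,
                               widths = c(8, 2, 4, 1, 2, 4),
                               seps = c("-", ".", ".", ".", ".")) {
  ends <- cumsum(widths)        # last position of each group
  starts <- ends - widths + 1   # first position of each group
  vapply(x, function(s) {
    parts <- substring(s, starts, ends)
    # Pair each group with the separator that follows it (none after the last)
    paste0(parts, c(seps, ""), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

format_case_number("000001001520081000000")
# [1] "00000100-15.2008.1.00.0000"
```

Because the widths are a parameter, the same function could handle a different grouping without rewriting a regular expression.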

Related

R - changing date's order

I have a dataframe (dat), with a "date" variable, which is in the format of dd.mm.yyyy (example: 31.12.2022)
I would like to know how could I reverse it to yyyy.mm.dd?
I also tried to separate d, m, and y, so that I could re-merge them in a different column, but I am facing problems with this.
dat2 <- separate(data = dat,
                 col = "date",
                 sep = ".",
                 into = c("session_day", "session_month", "session_year"))
which is giving this message
Warning message: Expected 3 pieces.
Additional pieces discarded in 1566 rows [1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
I appreciate any suggestions.
Thank you.

I like what you already tried and think you can continue with that. Casting the column to a date first and then using format to rearrange it, as mentioned in the comments, is definitely the best way to approach this problem, but I would like to explain why you are getting that warning when trying it your way:
You are getting the warning because the sep argument in tidyr::separate is interpreted as a regular expression. You are using sep = "." right now, but . is a special character in regular expressions that matches any character.
If you want to match a literal dot, you need to escape it as \\.
This should work for you, and you can move on from there:
dat2 <- separate(data = dat,
                 col = "date",
                 sep = "\\.",
                 into = c("session_day", "session_month", "session_year"))
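The date-casting approach mentioned in the comments could look roughly like this in base R. This is a minimal sketch: the two-row dat below is made-up sample data, and date_ymd is a hypothetical column name.

```r
# Made-up sample data in dd.mm.yyyy format
dat <- data.frame(date = c("31.12.2022", "01.02.2023"),
                  stringsAsFactors = FALSE)

# Parse the strings as dates, then print them back in yyyy.mm.dd order
parsed <- as.Date(dat$date, format = "%d.%m.%Y")
dat$date_ymd <- format(parsed, "%Y.%m.%d")

dat$date_ymd
# [1] "2022.12.31" "2023.02.01"
```

Keeping the parsed Date vector around is usually more useful than the reformatted string, since dates sort and subtract correctly.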

Parsing a character string of varying lengths

I'm trying to parse a string of estimated salaries to create a new field called "Salary.Min" which should be a numeric value. It seems straightforward and I can handle this in SQL with a quick case statement but I'm having trouble translating into R.
Do I need to use a for loop here or is there a more efficient/simple way? Generally I'm looking to do something akin to "if 4th character in string = K then return characters 2:3, otherwise return characters 2:4"
This code seemed to be okay at first, but after validating it I realized it is eliminating all records where the 4th character is K (i.e. minimum salaries of $100K+):
ifelse(
  substr(data_public$Salary.Estimate, 4, 4) == "K",
  data_public$Salary.Min <- substr(data_public$Salary.Estimate, 2, 3),
  data_public$Salary.Min <- substr(data_public$Salary.Estimate, 2, 4))
I have a wide range of Salary.Estimate values, a few for example:
a) $105K - $115K
b) $89K - $95K
c) $78K - $85K

We could make this shorter with trimws and substr. Here, we take the substring from the 2nd to the 4th character and pass "K" as the whitespace argument of trimws, where which = 'right' restricts the trimming to trailing characters:
data_public$Salary.Min <- trimws(substr(data_public$Salary.Estimate, 2, 4),
                                 which = 'right', whitespace = "K")
Or we could use sub to capture the digits between the leading $ and the first K:
sub("^\\$([0-9]+)K.*", "\\1", data_public$Salary.Estimate)
Both return character values, so wrap the result in as.numeric if a numeric column is needed.
In the ifelse code, the assignment should be outside the ifelse call, so that the vector it returns is assigned once:
data_public$Salary.Min <- with(data_public,
    ifelse(substr(Salary.Estimate, 4, 4) == "K",
           substr(Salary.Estimate, 2, 3),
           substr(Salary.Estimate, 2, 4)))
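As a quick check, the regex route can be tried on the three sample values from the question. This is a sketch only: the vectors salary and salary_min are stand-ins for the data_public columns, and the pattern assumes every value starts with $ followed by digits and a K.

```r
# Sample values taken from the question
salary <- c("$105K - $115K", "$89K - $95K", "$78K - $85K")

# Capture the digits between the leading "$" and the first "K",
# drop the rest of the string, and convert to numeric
salary_min <- as.numeric(sub("^\\$([0-9]+)K.*", "\\1", salary))

salary_min
# [1] 105  89  78
```

Values that do not fit the assumed pattern are left unchanged by sub, so as.numeric would turn them into NA with a warning, which makes unexpected formats easy to spot.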

How do I add characters to the beginning or end of result of paste0?

I am trying to create a character string that looks something like
"c,1,2,3,4,5,6,7,8"
I am able to get the number part of the string by doing:
paste0(1:200, collapse = ",")
How would I add "c," to the beginning of the result of paste0? Alternatively, how could I append ",c" to the end of the result?

We can wrap the result in another paste0 call, which gives a string of the form "c(1,2,...)":
paste0("c(", paste0(1:200, collapse = ","), ")")
Or do the same with sprintf:
sprintf("c(%s)", paste0(1:200, collapse = ","))
If we need only the 'c,' prefix, then use
sprintf("c,%s", paste0(1:200, collapse = ","))

You just want the string "c", not the function c(), right?
paste0(c("c", 1:8), collapse = ",")

Create a vector made up of c and 1:8 and then use toString:
toString(c("c", 1:8))
## [1] "c, 1, 2, 3, 4, 5, 6, 7, 8"
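For completeness, here is a small sketch covering both halves of the question, the prefix and the suffix, on 1:8 instead of 1:200. The variable names nums, prefixed, and suffixed are only for illustration.

```r
nums <- 1:8

# "c," at the beginning: build the vector first, then collapse once
prefixed <- paste0(c("c", nums), collapse = ",")

# ",c" at the end: collapse the numbers, then append the suffix
suffixed <- paste0(paste0(nums, collapse = ","), ",c")

prefixed
# [1] "c,1,2,3,4,5,6,7,8"
suffixed
# [1] "1,2,3,4,5,6,7,8,c"
```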

convert str(time) to date format with other data included

I have data in the format:
['12,Dec,2014, 02,15,28,31,37,04,06', '9,Dec,2014, 01,03,31,42,46,04,11',...]
I am trying to convert the str(date component) into date format using:
new_data = ''
for line in date_data:
    line = datetime.datetime.strptime(str(line), "%d,%b,%Y")
    new_data = new_data + line
print(new_data)
At least the routine recognises the date part, but it can do nothing with the numbers. How could I overcome this problem, please? I have tried adding a % directive for each of the characters that follow the date, without success. I have never used the time module before.
What I want to achieve is to associate each number with the date on which it appears. By the way, I am trying to teach myself how to parse text files.

If the date is separated from the numbers by a comma followed by a space, then you could use line.split(', ', 1) to split the line into two parts.
Then you could call datetime.datetime.strptime to parse the date.
import datetime as DT

date_data = ['12,Dec,2014, 02,15,28,31,37,04,06', '9,Dec,2014, 01,03,31,42,46,04,11']
for line in date_data:
    part = line.split(', ', 1)
    date = DT.datetime.strptime(part[0], '%d,%b,%Y').date()
    # list() is needed in Python 3, where map returns a lazy iterator
    numbers = list(map(int, part[1].split(',')))
    print(date, numbers)
yields (in Python 3)
2014-12-12 [2, 15, 28, 31, 37, 4, 6]
2014-12-09 [1, 3, 31, 42, 46, 4, 11]

Converting Factor to Date in R

I have a dataset imported from a large group of .csv files. The date imports as a factor, and the data is in the following format:
, 11, 4480, - 4570,NE, 12525,LB, , 10, , , , 0, 7:26A,26OC11,
, 11, 7090, - 7290,NE, 5250,LB, , 9, , , , 0, 7:28A,26OC11,
, 11, 5050, - 5065,NE, 50,LB, , 7, , , , 0, 7:31A,26OC11,
, 12, 5440, - 5530,NE, 13225,LB, , 6, , , , 0, 8:10A,26OC11,
, 12, 1020, - 1220,NE, 12020,LB, , 14, , , , 0, 8:12A,26OC11,
, 12, 50, - 25,NE, 12040,LB, , 15, , , , 0, 8:13A,26OC11,
For example, 26OC11 would be 26 Oct 2011. How would I convert these factors to a date, and the time to a time? I need to be able to use the time to generate a time interval between records.

Are you sure there are only two letters for the month? That does not make sense: how do you tell JUNE from JULY? If you can get three letters, you could do something simple like this:
as.Date(as.character(mydata$mydate), format = '%d%b%y')
You could also use levels()[] instead of as.character(), but this should be simpler for now.
If you also want the time, you can put it all together with this command:
as.POSIXct(strptime(paste(as.character(mydata$mydate), paste(as.character(mydata$mytime), "M", sep = "")), "%d%b%y %I:%M%p"))
You have to be especially careful with the format. You can see a list of what %I, %d, and so on mean here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html

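a <- c("26OC11", "01JA12")
# Two-letter, upper-case versions of R's built-in month abbreviations
month.abb.2 <- toupper(substr(month.abb, 1, 2))
# Replace each two-letter code with the matching three-letter abbreviation
for (i in seq_along(month.abb.2))
  a <- sub(month.abb.2[i], month.abb[i], a)
as.Date(a, format = "%d%b%y")
# [1] "2011-10-26" "2012-01-01"
However, it would be interesting to see how Jul and Jun differ when you have only 2 characters for the month name. With this loop, codes shared by two months (JU, MA) are always mapped to the earlier one. It looks unusual.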

As mentioned, it is unusual to get two letters for a month, but you can add the missing letter using regular expressions and then use dmy from lubridate to convert the dates. Here I am using gsubfn.
library(lubridate)
library(gsubfn)
dmy(gsubfn("OC|JA",list(OC="OCT",JA="JAN"), ## You can extend here for other months
c("26OC11","26JA12")))
[1] "2011-10-26 UTC" "2012-01-26 UTC"

This is how I ended up creating the date I needed:
Day<-substring(Date,1,2)
Month<-substring(Date,3,4)
Year<-substring(Date,5,6)
Month<-replace(Month,Month=="AU",8)
Month<-replace(Month,Month=="JA",1)
Month<-replace(Month,Month=="FE",2)
Month<-replace(Month,Month=="MR",3)
Month<-replace(Month,Month=="AP",4)
Month<-replace(Month,Month=="MY",5)
Month<-replace(Month,Month=="JN",6)
Month<-replace(Month,Month=="JL",7)
Month<-replace(Month,Month=="SE",9)
Month<-replace(Month,Month=="OC",10)
Month<-replace(Month,Month=="NO",11)
Month<-replace(Month,Month=="DE",12)
Date2 <- as.Date( paste( Month , Day , Year, sep = "." ) , format = "%m.%d.%y" )
dataset$Day<-Day
dataset$Month<-Month
dataset$Year<-Year
dataset$Date2<-Date2
Weekday<-weekdays(Date2)
dataset$Weekday<-as.factor(Weekday)
Thanks for all the help
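The chain of replace calls above can also be expressed as a single lookup. The following is a sketch rather than the poster's method: parse_short_date and month_codes are hypothetical names, the codes are the ones listed in the answer above, and the code assumes every two-digit year belongs to the 2000s.

```r
# Two-letter month codes from the answer above, in calendar order
month_codes <- c("JA", "FE", "MR", "AP", "MY", "JN",
                 "JL", "AU", "SE", "OC", "NO", "DE")

# Hypothetical helper: turn strings such as "26OC11" into Date values
parse_short_date <- function(x) {
  x <- as.character(x)                          # factors become plain strings
  day <- as.integer(substr(x, 1, 2))
  month <- match(substr(x, 3, 4), month_codes)  # position = month number, NA if unknown
  year <- 2000 + as.integer(substr(x, 5, 6))    # assumption: all years are 20xx
  as.Date(ISOdate(year, month, day))
}

parse_short_date(factor(c("26OC11", "01JA12")))
# [1] "2011-10-26" "2012-01-01"
```

Unknown month codes come back as NA instead of silently staying as text, which makes bad rows easier to find.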
