I have data in the format:
['12,Dec,2014, 02,15,28,31,37,04,06', '9,Dec,2014, 01,03,31,42,46,04,11',...]
I am trying to convert the str(date component) into date format using:
new_data =''
for line in date_data:
    line = datetime.datetime.strptime(str(line), "%d,%b,%Y")
    new_data = new_data + line
print(new_data)
At least the routine recognises the date part, but it can do nothing with the numbers. How could I overcome this problem, please? I have tried using % for as many characters as follow the date, without success. I have never used the time module before.
What I want to achieve is to associate each number with the date on which it appears. I am trying to teach myself to parse text files, by the way.
If the date is separated from the numbers by a comma followed by a space, then you could use line.split(', ', 1) to split the line into two parts.
Then you could call datetime.datetime.strptime to parse the date.
import datetime as DT
date_data = ['12,Dec,2014, 02,15,28,31,37,04,06', '9,Dec,2014, 01,03,31,42,46,04,11']
for line in date_data:
    # split once, at the ", " that separates the date from the numbers
    part = line.split(', ', 1)
    date = DT.datetime.strptime(part[0], '%d,%b,%Y').date()
    # list(...) so that the numbers print as a list on Python 3
    numbers = list(map(int, part[1].split(',')))
    print(date, numbers)
yields (on Python 3)
2014-12-12 [2, 15, 28, 31, 37, 4, 6]
2014-12-09 [1, 3, 31, 42, 46, 4, 11]
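To get from there to the stated goal of associating each number with the dates on which it appears, one option is to collect the dates in a dictionary keyed by number. This is only a sketch: the helper name parse_line and the dictionary name dates_by_number are made up for illustration, and it assumes every line follows the "date, numbers" layout shown in the question and that month abbreviations are English.

```python
import datetime as dt
from collections import defaultdict

date_data = [
    "12,Dec,2014, 02,15,28,31,37,04,06",
    "9,Dec,2014, 01,03,31,42,46,04,11",
]


def parse_line(line):
    """Split one line into a date object and a list of ints (hypothetical helper)."""
    date_part, number_part = line.split(", ", 1)
    date = dt.datetime.strptime(date_part, "%d,%b,%Y").date()
    numbers = [int(n) for n in number_part.split(",")]
    return date, numbers


# Map each number to the list of dates on which it appears.
dates_by_number = defaultdict(list)
for line in date_data:
    date, numbers = parse_line(line)
    for number in numbers:
        dates_by_number[number].append(date)

# 31 appears on both dates, 2 only on the first one.
print(dates_by_number[31])
print(dates_by_number[2])
```

A defaultdict(list) avoids checking whether a number has been seen before; a plain dict with setdefault would work just as well.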
I have a dataframe (dat), with a "date" variable, which is in the format of dd.mm.yyyy (example: 31.12.2022)
I would like to know how could I reverse it to yyyy.mm.dd?
I also tried to separate d, m, and y, so that I could re-merge them in a different column, but I am facing problems with this.
dat2 <- separate(data = dat,
col = "date",
sep = ".",
into = c("session_day","session_month", "session_year"))
which is giving this message
Warning message: Expected 3 pieces.
Additional pieces discarded in 1566 rows [1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
I appreciate any suggestions.
Thank you.
I like what you already tried and think you can continue with that. Casting the column to a date first and using format to rearrange it as you wish, as mentioned in the comments, is definitely the best way to approach this problem, but I would like to explain why you are getting that warning when trying it your way:
You are getting the warning because the sep argument in tidyr::separate is interpreted as a regular expression. You are using sep = "." right now, but . is a special character in regular expressions, meaning any character. Every character of the date is therefore treated as a separator, so each value splits into far more than 3 (empty) pieces, and the extra ones are discarded.
If you want to match a literal dot you will need to escape it using \\.
This should work for you, and you can move on from there.
dat2 <- separate(data = dat,
col = "date",
sep = "\\.",
into = c("session_day","session_month", "session_year"))
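The same pitfall exists in most regex engines, not just in R. As an illustration only (this is Python, not tidyr, and the variable names are made up), the sketch below shows what an unescaped dot does to a dd.mm.yyyy string, what the escaped version does, and the "parse as a date, then reformat" route recommended above.

```python
import re
from datetime import datetime

raw = "31.12.2022"

# Unescaped: "." matches every character, so all ten characters act as
# separators and only empty strings are left between them.
unescaped = re.split(".", raw)

# Escaped: only the literal dots act as separators.
escaped = re.split(r"\.", raw)

# Parsing as a real date and formatting it again reverses the order
# without any manual splitting.
reversed_date = datetime.strptime(raw, "%d.%m.%Y").strftime("%Y.%m.%d")

print(unescaped)      # eleven empty strings
print(escaped)        # ['31', '12', '2022']
print(reversed_date)  # 2022.12.31
```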
I have a csv that contains "Period", which are quarters, and "Percent". After reading the data into R, the "Period" column is "chr" and "Percent" column is "num". I want to change the quarter values to dates, so:
for (i in 1:length(sloos_tighten$Period)) {
sloos_tighten$Period[i] <- paste("Q", substring(sloos_tighten$Period[i], 6), "/", substring(sloos_tighten$Period[i], 1, 4), sep = "")
sloos_tighten$Period[i] <- as.Date(as.yearqtr(sloos_tighten$Period[i], format = "Q%q/%Y"))
}
where the first line in the for-loop changes the format of the quarter to be readable by as.yearqtr, and the second line changes the quarter to a date. The first line works as intended, but the second line converts the date to a four-digit number. I think this is because "Period" is of type "chr", but I don't know how to change it to date. I have tried to create a new column with type date, but I cannot find any resource online that explains it. Any help is appreciated. Thanks in advance.
> dput(head(sloos_tighten, 10))
structure(list(Period = c("1990:2", "1990:3", "1990:4", "1991:1",
"1991:2", "1991:3", "1991:4", "1992:1", "1992:2", "1992:3"),
`Large and medium` = c(54.4, 46.7, 54.2, 38.6, 20, 18.6,
16.7, 10, 3.5, -3.4), Small = c(52.7, 33.9, 40.7, 31.6, 6.9,
8.8, 7, 0, -7.1, -1.7)), row.names = c(NA, 10L), class = "data.frame")
^What the data looks like after import
The literal for loop is fine in a sense, but unfortunately there are two problems here:
1. There is a class problem: if $Period is a character vector, then when you reassign one of its values with something of class Date, the date is coerced back to a string (of its underlying day count, which is the number you are seeing). This is because in R data.frames, with few exceptions, all values in a column must be the same type. That's because a column is (almost always) a vector, and R treats vectors as homogeneous.
You can get around this by pre-allocating a vector of class Date and assigning it piecemeal:
newdate <- rep(Sys.Date()[NA], nrow(sloos_tighten)) # just to get the class right
for (i in 1:length(sloos_tighten$Period)) {
  tmp <- paste("Q", substring(sloos_tighten$Period[i], 6), "/", substring(sloos_tighten$Period[i], 1, 4), sep = "")
  newdate[i] <- as.Date(as.yearqtr(tmp, format = "Q%q/%Y"))
}
(But please don't use this code; look at #2 below first.)
2. Not a problem per se, but an inefficiency: R is good at operating on whole vectors. If you reassign all of $Period in one step, everything is faster.
sloos_tighten$Period <-
  as.Date(
    as.yearqtr(
      paste0(substring(sloos_tighten$Period, 6),
             "/", substring(sloos_tighten$Period, 1, 4)),
      format = "%q/%Y"))
This switches from paste(.., sep="") to paste0, a convenience function. Then it drops the leading "Q", since we don't keep it around anyway, so why add it (other than perhaps for declarative code)? Last, it converts the whole vector of strings at once. The as.yearqtr step (from zoo, as in your loop) is still needed, because as.Date on its own does not understand %q.
(This is taking the data sight-unseen, so untested.)
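For readers who want to see the arithmetic behind the quarter-to-date conversion spelled out, here is a small sketch in Python rather than R. It assumes the "YYYY:Q" layout from the dput output above and maps each quarter to its first day; the function name quarter_start is made up for illustration.

```python
from datetime import date


def quarter_start(period):
    """Convert a 'YYYY:Q' string such as '1990:2' to the first day of that quarter."""
    year_text, quarter_text = period.split(":")
    quarter = int(quarter_text)
    if not 1 <= quarter <= 4:
        raise ValueError(f"quarter must be 1-4, got {quarter_text!r}")
    # Quarters 1, 2, 3, 4 start in months 1, 4, 7, 10.
    first_month = 3 * (quarter - 1) + 1
    return date(int(year_text), first_month, 1)


periods = ["1990:2", "1990:3", "1990:4", "1991:1"]
quarter_dates = [quarter_start(p) for p in periods]
print(quarter_dates[0])  # 1990-04-01
```

Building a new list of date objects, rather than overwriting the strings one by one, sidesteps the type-coercion issue described in point 1.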
I wanted to add parentheses to the below strings under a condition. The numbers consist of two parts: "Id - subId", and I wanted to put parenthesis when there are multiple subId.
sample_string1 = "376-12~23, 28, 32, 35, 37,376-1"
sample_string2 = "391-1~8, 391-22~23"
sample_string3 = "391-10~21, 391-24, 27, 29"
These are my desirable outcome.
desire_string1 = "376-(12~23, 28, 32, 35, 37),376-1"
desire_string2 = "391-(1~8), 391-(22~23)"
desire_string3 = "391-(10~21), 391-(24, 27, 29)"
How can I do this? Thanks in advance
This is a pretty complicated Regex problem. I would honestly recommend that instead of using this solution, you instead separate out the variable that you want and make them tidy.
However, you asked this question, so here's a regex answer. I've used the stringr package because I find it easier and more readable than grep.
The regex breaks down like this:
(?<=-) - Positive lookbehind to find a - but don't capture it
(\\d+[\\~\\,] ?[^\\-]*)+ - Capture one or more digits followed by either a ~ or a , then an optional space, then zero or more characters that aren't a -. The group may repeat one or more times.
((?=, *\\d+-)|$) - Either a lookahead for a comma, optional spaces, and one or more digits followed by a -, or the end of the line.
replacement= "(\\1)" - Replace the result that you captured with ( then the first group you captured then )
library(stringr)
sample_string1 = "376-12~23, 28, 32, 35, 37,376-1"
sample_string2 = "391-1~8, 391-22~23"
sample_string3 = "391-10~21, 391-24, 27, 29"
# (?!u)
ss1 <- str_replace_all(sample_string1,
"(?<=-)(\\d+[\\~\\,] ?[^\\-]*)+((?=, *\\d+-)|$)",
replacement= "(\\1)")
ss1
# "376-(12~23, 28, 32, 35, 37),376-1"
ss2 <- str_replace_all(sample_string2,
"(?<=-)(\\d+[\\~\\,] ?[^\\-]*)+((?=, *\\d+-)|$)",
replacement= "(\\1)")
ss2
# "391-(1~8), 391-(22~23)"
ss3 <- str_replace_all(sample_string3,
"(?<=-)(\\d+[\\~\\,] ?[^\\-]*)+((?=, *\\d+-)|$)",
replacement= "(\\1)")
ss3
# "391-(10~21), 391-(24, 27, 29)"
A regex that produces the correct output is:
(?:(\d+-)((?:\d+~\d+|(?:,?\s*\d+){2,})+)(?=,\s*\d+-|\"))
Demo: https://regex101.com/r/QHDCMd/1/
(\d+-) match the ID and the dash
\d+~\d+ match a subid range or ...
(?:,?\s*\d+){2,} at least two subids
(?=,\s*\d+-|\") positive look-ahead for next ID or closing quotes
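An alternative that keeps the regex simple is to let the pattern only find each "Id-" prefix and the subId text that follows it, and decide in ordinary code whether parentheses are needed. The sketch below is in Python and is not either answer's method; it assumes that a new Id always looks like "digits followed by a dash" and that "multiple subIds" means the subId text contains a ~ or a comma. The name add_parentheses is made up for illustration.

```python
import re

# An Id and its dash, then (lazily) everything up to the next ", Id-" or the end.
ID_BLOCK = re.compile(r"(\d+-)(.*?)(?=,\s*\d+-|$)")


def add_parentheses(text):
    def wrap(match):
        prefix, sub_ids = match.group(1), match.group(2)
        # A range ("~") or a list (",") means more than one subId.
        if "~" in sub_ids or "," in sub_ids:
            return f"{prefix}({sub_ids})"
        return match.group(0)

    return ID_BLOCK.sub(wrap, text)


print(add_parentheses("376-12~23, 28, 32, 35, 37,376-1"))
print(add_parentheses("391-1~8, 391-22~23"))
print(add_parentheses("391-10~21, 391-24, 27, 29"))
```

Moving the "multiple subIds" decision into the callback keeps the pattern short enough to read at a glance.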
I have a 21-digit number, something like: 000001001520081000000.
But I have to turn this into 00000100-15.2008.1.00.0000. After the first eight digits, I have to insert a -. Then, after the next two, I insert a dot. Then again after the next four, one, and two digits.
I was trying to find the number this way: \d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d
and then convert it to \d\d\d\d\d\d\d-\d\d.\d.\d\d\d\d.\d\d\d\d , but it was not working.
I really do not know how to do this. I am using R and I tried with grep.
(\d{8})(\d{2})(\d{4})(\d)(\d{2})(\d{4})
By placing capture groups around each of your intervals, you can use gsub to insert the separators between the matches. Note that the sample has 21 digits and the target puts eight of them before the dash, so the first group is \d{8}.
gsub(
  "(\\d{8})(\\d{2})(\\d{4})(\\d)(\\d{2})(\\d{4})",
  "\\1-\\2.\\3.\\4.\\5.\\6",
  "000001001520081000000",
  perl=TRUE
)
[1] "00000100-15.2008.1.00.0000"
tmp <- as.character("000001001520081000000")
tmp2 <- paste0(substr(tmp, 1, 8),
"-",
substr(tmp, 9, 10),
".",
substr(tmp, 11, 14),
".",
substr(tmp, 15, 15),
".",
substr(tmp, 16, 17),
".",
substr(tmp, 18, nchar(tmp)))
tmp2
Output:
[1] "00000100-15.2008.1.00.0000"
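The substr approach generalises if the group widths are kept in one place instead of being spelled out as start and end positions. Here is a sketch of that idea in Python; the widths (8, 2, 4, 1, 2, 4) are read off the desired output in the question, and the function name format_number is made up for illustration.

```python
from itertools import accumulate


def format_number(digits, widths=(8, 2, 4, 1, 2, 4)):
    """Cut a digit string into fixed-width groups: first group, '-', rest joined by '.'."""
    if not digits.isdigit() or len(digits) != sum(widths):
        raise ValueError(f"expected {sum(widths)} digits, got {digits!r}")
    ends = list(accumulate(widths))   # running totals: 8, 10, 14, 15, 17, 21
    starts = [0] + ends[:-1]          # 0, 8, 10, 14, 15, 17
    parts = [digits[s:e] for s, e in zip(starts, ends)]
    return parts[0] + "-" + ".".join(parts[1:])


formatted = format_number("000001001520081000000")
print(formatted)  # 00000100-15.2008.1.00.0000
```

The length check makes a miscounted input fail loudly instead of silently producing a shifted result.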
I have a dataset imported from a large group of .csv files. The date imports as a factor, but the data is in the following format:
, 11, 4480, - 4570,NE, 12525,LB, , 10, , , , 0, 7:26A,26OC11,
, 11, 7090, - 7290,NE, 5250,LB, , 9, , , , 0, 7:28A,26OC11,
, 11, 5050, - 5065,NE, 50,LB, , 7, , , , 0, 7:31A,26OC11,
, 12, 5440, - 5530,NE, 13225,LB, , 6, , , , 0, 8:10A,26OC11,
, 12, 1020, - 1220,NE, 12020,LB, , 14, , , , 0, 8:12A,26OC11,
, 12, 50, - 25,NE, 12040,LB, , 15, , , , 0, 8:13A,26OC11,
For example, 26OC11 would be 26 Oct 2011. How would I convert these factors to a date, and the time to a time? I need to be able to use the time to generate a time interval between records.
Are you sure there are only two letters for the month? That doesn't make any sense! How do you tell JUNE from JULY? If you can get three letters, you could do something simple like this:
as.Date(as.character(mydata$mydate), format = '%d%b%y')
You could also use levels()[] instead of as.character(), but this should be simpler for now.
Now, if you also want the time, you can put it all together with this command:
as.POSIXct(strptime(paste(as.character(mydata$mydate), paste(as.character(mydata$mytime), "M", sep = "")), "%d%b%y %I:%M%p"))
You have to be especially careful with the format. You can see a list of what %I, %d and so on mean here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html
a <- c("26OC11", "01JA12")
month.abb.2 <- toupper(substr(month.abb, 0, 2))
for (i in seq_along(month.abb.2))
a <- sub(month.abb.2[i], month.abb[i], a)
as.Date(a, format="%d%b%y")
# [1] "2011-10-26" "2012-01-01"
However, it would be interesting to see how Jul and Jun differ when you have only 2 characters for the month name. That looks unusual.
As mentioned, it is unusual to get 2 letters for a month, but you can add the missing letter using a regular expression. Then you can use dmy from lubridate to convert the dates. Here I am using gsubfn.
library(lubridate)
library(gsubfn)
dmy(gsubfn("OC|JA",list(OC="OCT",JA="JAN"), ## You can extend here for other months
c("26OC11","26JA12")))
[1] "2011-10-26 UTC" "2012-01-26 UTC"
This is how I ended up creating the date I needed:
Day<-substring(Date,1,2)
Month<-substring(Date,3,4)
Year<-substring(Date,5,6)
Month<-replace(Month,Month=="AU",8)
Month<-replace(Month,Month=="JA",1)
Month<-replace(Month,Month=="FE",2)
Month<-replace(Month,Month=="MR",3)
Month<-replace(Month,Month=="AP",4)
Month<-replace(Month,Month=="MY",5)
Month<-replace(Month,Month=="JN",6)
Month<-replace(Month,Month=="JL",7)
Month<-replace(Month,Month=="SE",9)
Month<-replace(Month,Month=="OC",10)
Month<-replace(Month,Month=="NO",11)
Month<-replace(Month,Month=="DE",12)
Date2 <- as.Date( paste( Month , Day , Year, sep = "." ) , format = "%m.%d.%y" )
dataset$Day<-Day
dataset$Month<-Month
dataset$Year<-Year
dataset$Date2<-Date2
Weekday<-weekdays(Date2)
dataset$Weekday<-as.factor(Weekday)
Thanks for all the help
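For comparison, the same lookup-table idea can be sketched in Python, including the time column and the interval between records that the question asked for. The two-letter month codes are the ones listed in the solution above; everything else is an assumption made for illustration: the function name parse_record, that two-digit years belong to the 2000s, and that times end in a single A or P as in the sample rows.

```python
from datetime import datetime

# Two-letter month codes taken from the solution above.
MONTH_CODES = {
    "JA": 1, "FE": 2, "MR": 3, "AP": 4, "MY": 5, "JN": 6,
    "JL": 7, "AU": 8, "SE": 9, "OC": 10, "NO": 11, "DE": 12,
}


def parse_record(date_code, time_code):
    """Turn e.g. ('26OC11', ' 7:26A') into a datetime (hypothetical helper)."""
    date_code = date_code.strip()
    time_code = time_code.strip()
    day = int(date_code[:2])
    month = MONTH_CODES[date_code[2:4]]
    year = 2000 + int(date_code[4:6])  # assumption: all years are 20xx
    meridiem = time_code[-1]
    if meridiem not in "AP":
        raise ValueError(f"time must end in A or P: {time_code!r}")
    hour, minute = (int(x) for x in time_code[:-1].split(":"))
    # 12-hour clock to 24-hour clock: 12A is 0, 12P is 12.
    hour = hour % 12 + (12 if meridiem == "P" else 0)
    return datetime(year, month, day, hour, minute)


first = parse_record("26OC11", " 7:26A")
second = parse_record("26OC11", " 7:28A")
interval = second - first
print(first, interval)
```

Subtracting two datetime objects gives a timedelta, so interval.total_seconds() can be used directly for the time between records.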