Parsing complicated date text in R [closed]

I would like to extract all the dates from some text content. The content consists of date text like this:
21, 17, 16, 12, 10, 6, 5, 3 June 2019, 30 and 28, 27 May 2019
I expect to keep all the dates in a list() like this:
c("2019-06-21", "2019-06-17", "2019-06-16", "2019-06-12", "2019-06-10", "2019-06-06", "2019-06-05", "2019-06-03", "2019-05-30", "2019-05-28", "2019-05-27")
Is it possible to do that? Thanks.

To complement @Oliver's answer, here is a solution that uses the stringr and lubridate packages and fairly simple regular expressions.
First of all, find the month-year blocks (like "June 2019"):
library(stringr)
# date_string holds the text to parse
mny_loc_list <- str_locate_all(date_string,
  paste0("\\b(", paste(month.name, collapse = "|"), ")", "\\s*\\d{4}"))
print(mny_loc_list)
> mny_loc_list
[[1]]
start end
[1,] 29 38
[2,] 55 62
[3,] 72 81
Please note that the built-in month.name vector must match the month names used in your date string. Inconsistencies can be fixed by setting the locale appropriately or by supplying a vector of month names manually.
Then, create a function to transform the dates corresponding to each month-year block to the calendar dates:
ExtractForMonth <- function(list_entry, string_entry) {
  # define the end of a previous month-year block
  if (string_entry == 1) {
    block_begin <- 1
  } else {
    # take the end of a previous entry if it is not the first block
    block_begin <- list_entry[(string_entry - 1), 2] + 1
  }
  n_day <- str_sub(date_string, block_begin, list_entry[string_entry, 1] - 1)
  month_year <- str_sub(date_string,
                        list_entry[string_entry, 1], list_entry[string_entry, 2])
  day_date <- str_extract_all(n_day, "\\b\\d+?\\b")
  date_final <- paste0(unlist(day_date), " ", month_year)
  return(lubridate::dmy(date_final))
}
Finally, apply this function to each pair of the month-year block locations:
dates_list <- lapply(
  seq_len(nrow(mny_loc_list[[1]])),
  function(i) ExtractForMonth(list_entry = mny_loc_list[[1]],
                              string_entry = i))
print(dates_list)
[[1]]
[1] "2019-06-21" "2019-06-17" "2019-06-16" "2019-06-12" "2019-06-10"
[6] "2019-06-06" "2019-06-05" "2019-06-03"
[[2]]
[1] "2019-05-30" "2019-05-28" "2019-05-27"
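
For comparison, the same block-by-block idea can be sketched in base R alone. This is only an illustration under assumptions: the input is the string from the question, the month names are the English ones in month.name, and the variable names (date_string, label_pattern, labels, chunks, days, iso, all_dates) are made up for this example. The dates are assembled numerically, so the result should not depend on the locale.

```r
# Input string taken from the question
date_string <- "21, 17, 16, 12, 10, 6, 5, 3 June 2019, 30 and 28, 27 May 2019"

# Pattern for a "<Month> <yyyy>" label, built from the English month names
label_pattern <- paste0("(", paste(month.name, collapse = "|"),
                        ")[[:space:]]+[0-9]{4}")

# Month-year labels in order of appearance
labels <- regmatches(date_string, gregexpr(label_pattern, date_string))[[1]]

# The text in front of each label carries the day numbers that belong to it
chunks <- strsplit(date_string, label_pattern)[[1]][seq_along(labels)]
days <- regmatches(chunks, gregexpr("[0-9]+", chunks))

# Build ISO date strings from the numeric parts of each block
iso <- unlist(Map(function(d, lab) {
  mon <- match(sub("[[:space:]]+[0-9]{4}$", "", lab), month.name)
  yr <- sub("^.*[[:space:]]", "", lab)
  sprintf("%s-%02d-%02d", yr, mon, as.integer(d))
}, days, labels), use.names = FALSE)

all_dates <- as.Date(iso)
format(all_dates)
```

Unlike the list returned above, this gives one flat Date vector in the order the days appear in the text.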

As pointed out in the comment, the simple answer is to clean the data into a format which R understands. If the data is imported from another software, it is often (if not always) easier to do the cleaning in that software rather than in R.
That said, it is always possible to translate the text in R, although for tasks like this it has to be done manually. Below is an illustration of how this could be achieved using only the base package.
dates <- '21, 17, 16, 12, 10, 6, 5, 3 June 2019, 30 and 28, 27 May 2019'
#split on ', ' and ' and '
split_dates <- strsplit(dates, ", | and ", perl = TRUE)[[1]]
#Find the entries which contain month and year
long_dates <- which(nchar(split_dates) > 2)
#Function to format dates
make_dates <- function(string){
string <- unlist(strsplit(string, " "))
nString <- length(string)
year <- string[nString]
month <- string[nString - 1]
as.Date(paste0(year, month, string[seq(nString - 2)]), format = '%Y%B%d')
}
#Date vector for output
output_Dates <- integer(length(split_dates))
class(output_Dates) <- "Date"
j <- 1
for(i in long_dates){
output_Dates[j:i] <- make_dates(split_dates[j:i])
j <- i + 1
}
output_Dates
[1] "2019-06-21" "2019-06-17" "2019-06-16" "2019-06-12" "2019-06-10" "2019-06-06" "2019-06-05" "2019-06-03" "2019-05-30" "2019-05-28" "2019-05-27"
Note that you seem to be lacking 2019-05-30 in your expected output for it to be consistent.

Related

Find pattern and replace

A similar question was probably asked before, but here goes:
Suppose I have the following erroneous dates in my df (in numeric format, yyyymmdd): 20169904, 20179999, 20161099. These dates are from my date column, where many dates are wrong: there is no such thing as day = 99 or month = 99.
Now I wish to change ONLY the 99 in dd to 01. In other words, I need to find only the dates that are yyyymm99 and change them to yyyymm01. I have no trouble with str_sub(df$date,7,8) <- 01. However, this changes every dd in the column to 01. I only need to change those that are yyyymm99.
Pipes or multi-step solutions are both OK with me.
Thanks in advance!

Here is a solution with gsub():
gsub("99$", "01", df$date)
The $ in regular expressions means "end of line" or "end of string". With "99$", gsub() only matches "99" at the end of the string.
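
As a quick check, here is a minimal sketch applied to the three numeric dates from the question. The vector names dates and fixed are made up for this example, and sub() is used instead of gsub() because an end-anchored pattern can match at most once per string.

```r
# Example values from the question (numeric yyyymmdd)
dates <- c(20169904, 20179999, 20161099)

# Replace a trailing 99 (the dd part) with 01; the mm part is left untouched
fixed <- as.integer(sub("99$", "01", dates))
fixed
```

Note that 20179999 becomes 20179901: only the day is changed, so a month of 99 still has to be handled separately.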

Here's a base R solution that will replace 99s in the mm part of the string if you need that as well, and will work if there are 99s in both the mm and dd portions.
df <- data.frame(date = c("19990104", "20160399", "19901003", "20199904", "20169999"), stringsAsFactors = FALSE)
df$new_date <- sapply(df$date, function(x) {
if(!is.na(as.Date(x, format = "%Y%m%d"))) {
return(x)
}
new_date <- x
if(grepl(".99$", x)) {
new_date <- paste0(substr(x, 1, 6), "01")
}
if(grepl("^\\d{4}99\\d{2}", new_date)) {
new_date <- paste0(substr(new_date, 1, 4), "01", substr(new_date, 7, 8))
}
return(new_date)
})
And here's the result.
date new_date
1 19990104 19990104
2 20160399 20160301
3 19901003 19901003
4 20199904 20190104
5 20169999 20160101

How to tidy my weekyear variable in the dataset

I have a dataset with a weekyear variable.
For example:
Weekyear
12016
22016
32016
...
422016
432016
442016
As you can see, this creates some difficulties: treating this variable as an integer does not let me sort it in descending order.
Therefore, I want to change the variable from 12016 to 201601 to allow descending ordering. This would be easy if all my values had the same number of characters, but they don't (for example 12016 and 432016).
Does anyone know how to treat this variable? Thanks in advance!
Diederik

You could use stringr::str_sub to get the format you want:
# Getting the year
years <- stringr::str_sub(text, -4)
# Getting the weeks
weeks <- stringr::str_sub(text, end = nchar(text) - 4)
weeks <- ifelse(nchar(weeks) == 1, paste0(0, weeks), weeks)
as.integer(paste0(years, weeks))
[1] 201601 201602 201603 201642 201643 201644
Data:
text <- c(12016, 22016, 32016, 422016, 432016, 442016)
EDIT:
Or, you can use a combo of str_pad and str_sub:
library(stringr)
text_paded <- str_pad(text, 6, "left", "0")
as.integer(paste0(str_sub(text_paded, start = -4), str_sub(text_paded, end = 2)))
[1] 201601 201602 201603 201642 201643 201644

You can extract the year and week using modulo arithmetic and integer division.
x <- 432016
year <- x %% 10000
week <- x %/% 10000
week <- sprintf("%02d", week) # make sure single digits have leading zeros
new_x <- paste0(year, week)
new_x <- as.integer(new_x)
new_x
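
The same arithmetic works on a whole vector at once, and the result can be kept numeric so that it sorts directly. A minimal sketch, assuming weekyear values of the form shown in the question (the names weekyear and yearweek are made up for this example):

```r
# Example values in week-then-year form, deliberately unsorted
weekyear <- c(12016, 432016, 22016, 442016)

year <- weekyear %% 10000    # last four digits
week <- weekyear %/% 10000   # whatever is in front of the year

# Recombine as yyyyww; multiplying by 100 leaves room for a two-digit week
yearweek <- year * 100 + week

sort(yearweek, decreasing = TRUE)
```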

Here is a very short approach using regex. No packages needed.
To better understand it, I split it in 2 steps but you can nest the calls.
text <- c(12016, 22016, 32016, 422016, 432016, 442016)
# first add a zero to weeks with one digit
text1 <- gsub("(\\b\\d{5}\\b)", "0\\1", text)
# then change position of first two and last four digits
gsub("([0-9]{2})([0-9]{4})", "\\2\\1", text1)

How to Vectorize splitting a Date in R into Multiple Columns [duplicate]

I have a dataset which looks like:
mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11
I need to extract the constituent elements (day,month,year) from date of birth and put them in corresponding columns to look like:
mother_id,dateOfBirth,dayOfBirth,monthOfBirth,yearOfBirth
1,1962-09-24,24,09,1962
2,1991-02-19,19,02,1991
3,1978-11-11,11,11,1978
Currently, I have it coded as a loop:
data <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/matching file dss nacp.csv",stringsAsFactors = F)
dss_individuals <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/Data/dssIndividuals.csv", stringsAsFactors = F)
lookup <- data[,c("patientid","extId")]
# remove duplicates
lookup <- lookup[!(duplicated(lookup$patientid)),]
dss_individuals$dateOfBirth <- as.character.Date(dss_individuals$dob)
dss_individuals$dayOfBirth <- 0
dss_individuals$monthOfBirth <- 0
dss_individuals$yearOfBirth <- 0
# Loop starts here
for(i in 1:nrow(dss_individuals)){ #nrow(dss_individuals)
split_list <- unlist(strsplit(dss_individuals[i,]$dateOfBirth,'[- ]'))
dss_individuals[i,]["dayOfBirth"] <- split_list[3]
dss_individuals[i,]["monthOfBirth"] <- split_list[2]
dss_individuals[i,]["yearOfBirth"] <- split_list[1]
}
This seems to work, but is horrendously slow as I have 400 000 rows. Is there a way I can get this done more efficiently?

I compared the speed of substr, format, and lubridate. It seems that lubridate and format are much faster than substr if the variable is stored as a Date. However, substr is fastest if the variable is stored as a character vector. The results of a single run are shown.
library(lubridate)  # provides year(), month() and day() used below
x <- sample(
  seq(as.Date('1000/01/01'), as.Date('2000/01/01'), by = "day"),
  400000, replace = TRUE)
system.time({
y <- substr(x, 1, 4)
m <- substr(x, 6, 7)
d <- substr(x, 9, 10)
})
# user system elapsed
# 3.775 0.004 3.779
system.time({
y <- format(x,"%Y")
m <- format(x,"%m")
d <- format(x,"%d")
})
# user system elapsed
# 1.118 0.000 1.118
system.time({
y <- year(x)
m <- month(x)
d <- day(x)
})
# user system elapsed
# 0.951 0.000 0.951
x1 <- as.character(x)
system.time({
y <- substr(x1, 1, 4)
m <- substr(x1, 6, 7)
d <- substr(x1, 9, 10)
})
# user system elapsed
# 0.082 0.000 0.082

Not sure if this will solve your speed issues, but here is a nicer way of doing it using dplyr and lubridate. In general, when it comes to manipulating data frames I personally recommend either data.table or dplyr. data.table is supposed to be faster, but dplyr is more verbose, which I prefer because I find it easier to pick up my code after not having read it for months.
library(dplyr)
library(lubridate)
dat <- data.frame( mother_id = c(1,2,3),
dateOfBirth = ymd(c( "1962-09-24" ,"1991-02-19" ,"1978-11-11"))
)
dat %>% mutate( year = year(dateOfBirth) ,
month = month(dateOfBirth),
day = day(dateOfBirth) )
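Or you can use the mutate_each function to avoid writing the variable name multiple times (though you get less control over the names of the output variables). Note that mutate_each and funs are deprecated in current dplyr releases, where across() is the replacement.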
dat %>% mutate_each( funs(year , month , day) , dateOfBirth)

Here are some solutions. These solutions each (i) use 1 or 2 lines of code and (ii) return numeric year, month and day columns. In addition, the first two solutions use no packages -- the third uses chron's month.day.year function.
1) POSIXlt Convert to "POSIXlt" class and pick off the parts.
lt <- as.POSIXlt(DF$dateOfBirth, origin = "1970-01-01")
transform(DF, year = lt$year + 1900, month = lt$mon + 1, day = lt$mday)
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
2) read.table
cbind(DF, read.table(text = format(DF$dateOfBirth), sep = "-",
col.names = c("year", "month", "day")))
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
3) chron::month.day.year
library(chron)
cbind(DF, month.day.year(DF$dateOfBirth))
giving:
mother_id dateOfBirth month day year
1 1 1962-09-24 9 24 1962
2 2 1991-02-19 2 19 1991
3 3 1978-11-11 11 11 1978
Note 1: Often when year, month and day are added to data it is not really necessary and in fact they could be generated on the fly when needed using format, substr or as.POSIXlt so you might critically examine whether you actually need to do this.
Note 2: The input data frame, DF in reproducible form, was assumed to be:
Lines <- "mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11"
DF <- read.csv(text = Lines)

Use format once for each part (format needs a Date, so convert first if the column is stored as character):
dob <- as.Date(dss_individuals$dateOfBirth)
dss_individuals$dayOfBirth <- format(dob, "%d")
dss_individuals$monthOfBirth <- format(dob, "%m")
dss_individuals$yearOfBirth <- format(dob, "%Y")

Check the substr function from the base package (or other functions from the nice stringr package) to extract different parts of a string. This approach assumes that day, month and year are always in the same place and have the same length.
The strsplit function is vectorized, so you can use rbind.data.frame to convert the resulting list to a data frame:
do.call(rbind.data.frame, strsplit(df$dateOfBirth, split = '-'))
Results need to be transposed in order to be used: you can do it using do.call or the t function.
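
A closely related variant, shown here only as a sketch, binds the split pieces into a character matrix with plain rbind, which gives one row per date and needs no transposing. The data frame DF below is an assumption that mirrors the three rows from the question; the column names follow the question's desired output.

```r
# Small stand-in for the real data, with dates stored as character
DF <- data.frame(mother_id = 1:3,
                 dateOfBirth = c("1962-09-24", "1991-02-19", "1978-11-11"),
                 stringsAsFactors = FALSE)

# strsplit is vectorized: one call splits every date into year, month, day
parts <- do.call(rbind, strsplit(DF$dateOfBirth, "-", fixed = TRUE))

# parts is a character matrix with one row per date and three columns
DF$dayOfBirth   <- parts[, 3]
DF$monthOfBirth <- parts[, 2]
DF$yearOfBirth  <- parts[, 1]
DF
```

The new columns stay character, so leading zeros such as "09" are preserved; wrap them in as.integer() if numbers are wanted.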

R Programming 30 day Months

I'm currently writing a script in the R Programming Language and I've hit a snag.
I have time series data organized in a way where there are 30 days in each month for 12 months in 1 year. However, I need the data organized in a proper 365 days in a year calendar, as in 30 days in a month, 31 days in a month, etc.
Is there a simple way for R to recognize there are 30 days in a month and to operate within that parameter? At the moment I have my script converting the number of days from the source in UNIX time and it counts up.
For example:
startingdate <- "20060101"
endingdate <- "20121230"
date <- seq(from = as.Date(startingdate, "%Y%m%d"), to = as.Date(endingdate, "%Y%m%d"), by = "days")
This would generate an array of dates with each month having 29 days/30 days/31 days etc. However, my data is currently organized as 30 days per month, regardless of 29 days or 31 days present.
Thanks.

The first 4 solutions are basically variations of the same theme using expand.grid. (3) uses magrittr and the others use no packages. The last two work by creating long sequence of numbers and then picking out the ones that have month and day in range.
1) apply This gives a series of yyyymmdd numbers such that there are 30 days in each month. Note that the line defining yrs in this case is the same as yrs <- 2006:2012 so if the years are handy we could shorten that line. Omit as.numeric in the line defining s if you want character string output instead. Also, s and d are the same because we have whole years so we could omit the line defining d and use s as the answer in this case and also in general if we are always dealing with whole years.
startingdate <- "20060101"
endingdate <- "20121230"
yrs <- seq(as.numeric(substr(startingdate, 1, 4)), as.numeric(substr(endingdate, 1, 4)))
g <- expand.grid(yrs, sprintf("%02d", 1:12), sprintf("%02d", 1:30))
s <- sort(as.numeric(apply(g, 1, paste, collapse = "")))
d <- s[ s >= as.numeric(startingdate) & s <= as.numeric(endingdate) ] # optional if whole years
Run some checks.
head(d)
## [1] 20060101 20060102 20060103 20060104 20060105 20060106
tail(d)
## [1] 20121225 20121226 20121227 20121228 20121229 20121230
length(d) == length(2006:2012) * 12 * 30
## [1] TRUE
2) no apply An alternative variation would be this. In this and the following solutions we are using yrs as calculated in (1) so we omit it to avoid redundancy. Also, in this and the following solutions, the corresponding line to the one setting d is omitted, again, to avoid redundancy -- if you don't have whole years then add the line defining d in (1) replacing s in that line with s2.
g2 <- expand.grid(yr = yrs, mon = sprintf("%02d", 1:12), day = sprintf("%02d", 1:30))
s2 <- with(g2, sort(as.numeric(paste0(yr, mon, day))))
3) magrittr This could also be written using magrittr like this:
library(magrittr)
expand.grid(yr = yrs, mon = sprintf("%02d", 1:12), day = sprintf("%02d", 1:30)) %>%
with(paste0(yr, mon, day)) %>%
as.numeric %>%
sort -> s3
4) do.call Another variation.
g4 <- expand.grid(yrs, 1:12, 1:30)
s4 <- sort(as.numeric(do.call("sprintf", c("%d%02d%02d", g4))))
5) subset sequence Create a sequence of numbers from the starting date to the ending date and if each number is of the form yyyymmdd pick out those for which mm and dd are in range.
seq5 <- seq(as.numeric(startingdate), as.numeric(endingdate))
d5 <- seq5[ seq5 %/% 100 %% 100 %in% 1:12 & seq5 %% 100 %in% 1:30]
6) grep Using seq5 from (5)
d6 <- as.numeric(grep("(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|30)$", seq5, value = TRUE))
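
To make the integer arithmetic in (5) easier to follow, here is a small worked sketch. It first decomposes a single yyyymmdd number and then applies the same filter to the whole range from the question. The names x, yr, mon, dy and keep are made up for this illustration.

```r
# Decompose one yyyymmdd number with integer division and modulo
x <- 20121230
yr  <- x %/% 10000        # drop the last four digits
mon <- x %/% 100 %% 100   # drop the day, then keep the last two digits
dy  <- x %% 100           # keep the last two digits

# Same filter as in (5): keep numbers whose month is 1-12 and day is 1-30
seq5 <- seq(20060101, 20121230)
keep <- (seq5 %/% 100 %% 100) %in% 1:12 & (seq5 %% 100) %in% 1:30
d5 <- seq5[keep]

# Expect 7 years x 12 months x 30 days
length(d5) == length(2006:2012) * 12 * 30
```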

Here's an alternative (assuming startingdate and endingdate are Date objects, so that unclass() returns day counts):
date <- unclass(startingdate):unclass(endingdate) %% 30L
month <- rep(1:12, each = 30, length.out = NN <- length(date))
year <- rep(1:(NN %/% 360 + 1), each = 360, length.out = NN)
(of course, we can easily adjust by adding constants to taste if you want a specific day to be 0, or a specific month, etc.)

Converting Vector into Dates in R

I have a vector of dates of the form BW01.68, BW02.68, ... , BW26.10. BW stands for "bi-week", so for example, "BW01.68" represents the first bi-week of the year 1968, and "BW26.10" represents the 26th (and final) bi-week of the year 2010. Using R, how could I convert this vector into actual dates, say, of the form 01-01-1968, 01-15-1968, ... , 12-16-2010? Is there a way for R to know exactly which dates correspond to each bi-week? Thanks for any help!

An alternative solution.
biwks <- c("BW01.68", "BW02.68", "BW26.10")
bw <- substr(biwks,3,4)
yr <- substr(biwks,6,7)
yr <- paste0(ifelse(as.numeric(yr) > 15,"19","20"),yr)
# the %j in the date format is the number of days into the year
as.Date(paste(((as.numeric(bw)-1) * 14) + 1,yr,sep="-"),format="%j-%Y")
#[1] "1968-01-01" "1968-01-15" "2010-12-17"
Though I will note that a 'bi-week' seems a strange measure and I can't be sure that just using 14 day blocks is what is intended in your work.
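
The same 14-day arithmetic can also be written as an offset from 1 January, which avoids the %j format. The helper below is a hypothetical sketch, not part of the answer above: the name biweek_start and its cutoff argument are made up here, and it assumes codes of the exact form BWww.yy with two-digit years above the cutoff belonging to the 1900s.

```r
# Hypothetical helper: first day of each bi-week, assuming 14-day blocks
# that start on 1 January of the given year
biweek_start <- function(code, cutoff = 15) {
  bw <- as.integer(substr(code, 3, 4))   # bi-week number
  yy <- as.integer(substr(code, 6, 7))   # two-digit year
  year <- ifelse(yy > cutoff, 1900L + yy, 2000L + yy)
  as.Date(sprintf("%d-01-01", year)) + (bw - 1L) * 14L
}

biweek_start(c("BW01.68", "BW02.68", "BW26.10"))
```

For these three codes it gives the same dates as the %j version.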

You can make this code a lot shorter. I have spaced out each step to help understanding but you could finish it off in one (long) line of code.
bw <- c('BW01.68', 'BW02.68','BW26.10','BW22.13')
# the gsub will ensure that bw01.1 the same as bw01.01, bw1.01, or bw1.1
#isolating year no
yearno <- as.numeric(
gsub(
x = bw,
pattern = "BW.*\\.",
replacement = ""
)
)
#isolating and converting bw to no of days
dayno <- 14 * as.numeric(
gsub(
x = bw,
pattern = "BW|\\.[[:digit:]]{1,2}",
replacement = ""
)
)
#cutoff year chosen as 15
yearno <- yearno + 1900
yearno[yearno < 1915] <- yearno[yearno < 1915] + 100
# identifying dates
dates <- as.Date(paste0('01/01/',yearno),"%d/%m/%Y") + dayno
# specifically identifying the Monday of that week
mondaydates <- dates - as.numeric(strftime(dates,'%w')) + 1
Output -
> bw
[1] "BW01.68" "BW02.68" "BW26.10" "BW22.13"
> dates
[1] "1968-01-15" "1968-01-29" "2010-12-31" "2013-11-05"
> mondaydates
[1] "1968-01-15" "1968-01-29" "2010-12-27" "2013-11-04"
PS: Just be careful that you're aligned with how bw is measured in your data and whether you're translating it correctly. You should be able to manipulate this to get it to work, for instance you might encounter a bw 27.