Find pattern and replace - r

A similar question has probably been asked, but here goes:
Suppose I have the following erroneous dates in my df (in numeric format, yyyymmdd): 20169904, 20179999, 20161099. These come from my date column, where many dates are wrong: there is no such thing as day = 99 or month = 99.
Now I wish to change ONLY the 99 in dd to 01. In other words, I need to find only the dates that look like yyyymm99 and change them to yyyymm01. I have no trouble with str_sub(df$date,7,8) <- 01, but that changes every dd in the column to 01. I only need to change those that are yyyymm99.
Using pipes or multi-step solutions are both ok with me.
Thanks in advance!

Here is a solution with gsub():
gsub("99$", "01", df$date)
The $ in regular expressions means "end of line" or "end of string". With "99$", gsub() only matches "99" at the end of the string.
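As a quick check of the anchor, here is a minimal sketch on made-up values in the question's yyyymmdd format (the vector name dates is only for illustration). sub() is enough here because a string has only one end, and sprintf() is used, as an assumption on my part, to make sure the numbers never turn into scientific notation when they become text.

```r
# Hypothetical sample values; the last one is already a valid date
dates <- c(20169904, 20179999, 20161099, 20160115)

# Convert to text explicitly, fix a trailing "99", then convert back to numeric
as_text <- sprintf("%.0f", dates)
fixed <- as.numeric(sub("99$", "01", as_text))

# The 99 in the month position of 20169904 is left alone,
# because "99$" can only match the last two characters
fixed
```

If the column should stay character, drop the as.numeric() call.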

Here's a base R solution that also replaces a 99 in the mm part of the string, if you need that, and works when there are 99s in both the mm and dd portions.
df <- data.frame(date = c("19990104", "20160399", "19901003", "20199904", "20169999"), stringsAsFactors = FALSE)
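df$new_date <- sapply(df$date, function(x) {
  # Strings that already parse as valid dates are returned unchanged
  if (!is.na(as.Date(x, format = "%Y%m%d"))) {
    return(x)
  }
  new_date <- x
  # dd is 99: replace the last two characters with "01"
  if (grepl(".99$", x)) {
    new_date <- paste0(substr(x, 1, 6), "01")
  }
  # mm is 99: replace characters 5-6 with "01"
  if (grepl("^\\d{4}99\\d{2}", new_date)) {
    new_date <- paste0(substr(new_date, 1, 4), "01", substr(new_date, 7, 8))
  }
  return(new_date)
})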
And here's the result.
date new_date
1 19990104 19990104
2 20160399 20160301
3 19901003 19901003
4 20199904 20190104
5 20169999 20160101
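The same result can probably be had without sapply(), since sub() is vectorised. This is only a sketch of that idea, not the answer's method: it relies on the two anchored patterns instead of the as.Date() check, on the assumption that every string is exactly eight digits. A valid date can never have 99 in positions 5-6 or 7-8, so valid dates are not touched.

```r
dates <- c("19990104", "20160399", "19901003", "20199904", "20169999")

# Day part: a "99" at the very end of the string becomes "01"
step1 <- sub("99$", "01", dates)

# Month part: a "99" right after the four-digit year becomes "01".
# "\\1" re-inserts the captured year; R reads only one digit after the
# backslash, so "\\101" means group 1 followed by the literal "01".
new_date <- sub("^(\\d{4})99", "\\101", step1)

data.frame(date = dates, new_date = new_date)
```

Note that "19990104" survives intact, which a plain gsub("99", "01", ...) would not guarantee.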

Related

Change date format from YYYYQQ or YYYY to mm/dd/yyyy

I have a column of data with two different formats: yyyyqq and yyyy. I want to reformat the column to mmddyyyy.
Whenever I use the following command as.Date(as.character(x), format = "%y") the output is yyyy-12-03. I cannot get any other combination of as.Date to work.
I'm sure this is a simple fix, but how do I do this?
Using the following assumptions:
2021 <- 2021-01-01
2021Q1 <- 2021-01-01
2021Q2 <- 2021-04-01
2021Q3 <- 2021-07-01
2021Q4 <- 2021-10-01
You can use the following:
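# pmax() works element by element (max() would collapse the whole column to one value);
# a plain year has no quarter digit, gives NA, and falls back to quarter 1
as.Date(paste(substr(x, 1, 4), 3 * pmax(as.numeric(substr(x, 6, 6)), 1, na.rm = TRUE) - 2, "1", sep = "-"))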
Edit: You can wrap this in format(..., "%m%d%Y"), but as already said in the comments, I would not recommend it.
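Since the one-liner is dense, here is the same arithmetic spelled out step by step. Treat it as a sketch: quarter_start is a hypothetical helper name, and it assumes the inputs look like "2021" or "2021Q3", as in the assumptions listed above.

```r
# Hypothetical helper: first day of the year or quarter for inputs such as
# "2021" or "2021Q3" (the two layouts assumed above)
quarter_start <- function(x) {
  x <- as.character(x)
  year <- substr(x, 1, 4)
  quarter <- as.numeric(substr(x, 6, 6))  # "" becomes NA for plain years
  quarter[is.na(quarter)] <- 1            # a plain year maps to Q1
  month <- 3 * quarter - 2                # Q1 -> 1, Q2 -> 4, Q3 -> 7, Q4 -> 10
  as.Date(sprintf("%s-%02.0f-01", year, month))
}

quarter_start(c("2021", "2021Q1", "2021Q2", "2021Q3", "2021Q4"))
```

The result is a proper Date vector, so it sorts and plots correctly before any final formatting.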
Here is a function which translates to the first (if frac=0) or last (if frac=1) date of the period. First append a 01 (first of the period) or 04 (last of the period) to the end of the input. That puts them all in yyyyqq format possibly with junk at the end. Then yearqtr will convert to a yearqtr object ignoring any junk. Then convert that to a Date object. as.Date.yearqtr uses the same meaning for frac. Finally format it as a character string in mm/dd/yyyy format.
(One alternative is to replace the format(...) line with chron::as.chron() in which case it will render in the same manner, since the format specified is the default for chron, but be a chron dates object which can be manipulated more conveniently, e.g. it sorts chronologically, than a character string.)
library(zoo)
to_date <- function(x, frac = 1) x |>
paste0(if (frac == 1) "04" else "01") |>
as.yearqtr("%Y%q") |>
as.Date(frac = frac) |>
format("%m/%d/%Y")
# test data
dd <- data.frame(x = c(2001, 2003, 200202, 200503))
transform(dd, first = to_date(x, frac = 0), last = to_date(x, frac = 1))
giving:
x first last
1 2001 01/01/2001 12/31/2001
2 2003 01/01/2003 12/31/2003
3 200202 04/01/2002 06/30/2002
4 200503 07/01/2005 09/30/2005

How to tidy my weekyear variable in the dataset

I have a dataset with a weekyear variable.
For example:
Weekyear
12016
22016
32016
...
422016
432016
442016
As you might understand, this creates some difficulties, because treating this variable as an integer does not let me sort it in descending order.
Therefore, I want to change the variable from 12016 to 201601 to allow descending ordering. This would have been easy if my values all had the same number of characters, but they don't (for example 12016 and 432016).
Does anyone know how to treat this variable? Thanks in advance!
Diederik
You could use stringr::str_sub to get the format you want:
# Getting the year
years <- stringr::str_sub(text, -4)
# Getting the weeks
weeks <- stringr::str_sub(text, end = nchar(text) - 4)
weeks <- ifelse(nchar(weeks) == 1, paste0(0, weeks), weeks)
as.integer(paste0(years, weeks))
[1] 201601 201602 201603 201642 201643 201644
Data:
text <- c(12016, 22016, 32016, 422016, 432016, 442016)
EDIT:
Or, you can use a combo of str_pad and str_sub:
library(stringr)
text_paded <- str_pad(text, 6, "left", 0)
as.integer(paste0(str_sub(text_paded, start = -4), str_sub(text_paded, end = 2)))
[1] 201601 201602 201603 201642 201643 201644
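If you would rather not load stringr, the same pad-then-swap idea can be sketched in base R. This assumes the values are whole numbers with at most six digits, as in the sample data; the names padded and yearweek are just for illustration.

```r
text <- c(12016, 22016, 32016, 422016, 432016, 442016)

# Left-pad to six digits so the week always occupies two characters
padded <- sprintf("%06.0f", text)

# Move the four-digit year in front of the two-digit week
yearweek <- as.integer(paste0(substr(padded, 3, 6), substr(padded, 1, 2)))
yearweek
```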
You can extract the year and week using modulo arithmetic and integer division.
x <- 432016
year <- x %% 10000
week <- x %/% 10000
week <- sprintf("%02d", week) # make sure single digits have leading zeros
new_x <- paste0(year, week)
new_x <- as.integer(new_x)
new_x
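The same arithmetic works on the whole column at once, and it can stay numeric throughout, so no padding is needed. A small sketch on the question's sample values (the name weekyear is my own):

```r
weekyear <- c(12016, 22016, 32016, 422016, 432016, 442016)

year <- weekyear %% 10000    # last four digits
week <- weekyear %/% 10000   # whatever is in front of the year

# Shift the year two places to the left and add the week
yearweek <- year * 100 + week

# The new values sort chronologically
sort(yearweek, decreasing = TRUE)
```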
Here is a very short approach using regex. No packages needed.
To better understand it, I split it in 2 steps but you can nest the calls.
text <- c(12016, 22016, 32016, 422016, 432016, 442016)
# first add a zero to weeks with one digit
text1 <- gsub("(\\b\\d{5}\\b)", "0\\1", text)
# then change position of first two and last four digits
gsub("([0-9]{2})([0-9]{4})", "\\2\\1", text1)

Converting filenames to date in year + weeks returns Error in charToDate (x): character string is not in a standard unambiguous format

For a time series analysis of over 1000 raster in a raster stack I need the date. The data is almost weekly in the structure of the files
"... 1981036 .... tif"
The zero separates year and week
I need something like: "1981-36"
but always get the error
Error in charToDate (x): character string is not in a standard unambiguous format
library(sp)
library(lubridate)
library(raster)
library(zoo)
raster_path <- ".../AVHRR_All"
all_raster <- list.files(raster_path,full.names = TRUE,pattern = ".tif$")
all_raster
brings me:
all_raster
".../VHP.G04.C07.NC.P1981036.SM.SMN.Andes.tif"
".../VHP.G04.C07.NC.P1981037.SM.SMN.Andes.tif"
".../VHP.G04.C07.NC.P1981038.SM.SMN.Andes.tif"
…
To get the year and the associated week, I have used the following code:
timeline <- data.frame(
year= as.numeric(substr(basename(all_raster), start = 17, stop = 17+3)),
week= as.numeric(substr(basename(all_raster), 21, 21+2))
)
timeline
brings me:
timeline
year week
1 1981 35
2 1981 36
3 1981 37
4 1981 38
…
But I need something like = "1981-35" to be able to plot my time series later
I tried that:
timeline$week <- as.Date(paste0(timeline$year, "%Y")) + week(timeline$week -1, "%U")
and get the error:Error in charToDate(x) : character string is not in a standard unambiguous format
or I tried that
fileDates <- as.POSIXct(substr((all_raster),17,23), format="%y0%U")
and get the same error
Until someone posts a better way to do this, you could try:
x <- c(".../VHP.G04.C07.NC.P1981036.SM.SMN.Andes.tif", ".../VHP.G04.C07.NC.P1981037.SM.SMN.Andes.tif",
".../VHP.G04.C07.NC.P1981038.SM.SMN.Andes.tif")
xx <- substr(x, 21, 27)
library(lubridate)
dates <- strsplit(xx,"0")
dates <- sapply(dates, function(x) {
  year_week <- unlist(x)
  year <- year_week[1]
  week <- as.numeric(year_week[2])  # weeks() is documented for numeric input
  start_date <- as.Date(paste0(year, '-01-01'))
  date <- start_date + weeks(week)
  # Note: the OP asked for the beginning of the week.
  # There is some ambiguity here; the line above gives the end of the week.
  # Uncomment the line below for the beginning of the week (6 days earlier).
  # I think this might yield inconsistent results, especially at year boundaries,
  # hence the suggestion to use the end of the week. See below for a possible solution.
  # date <- start_date + weeks(week) - days(6)
  return(as.character(date))
})
newdates <- as.POSIXct(dates)
format(newdates, "%Y-%W")
Thanks to @Soren, who posted this answer here: Get the month from the week of the year
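One caveat with splitting on "0": a week number that itself contains a zero (week 40, say) gets cut short, since strsplit("1981040", "0") returns "1981" and "4". Below is a sketch that reads fixed positions instead. It assumes every file name carries ".P" followed by a four-digit year, one separator digit and a two-digit week; the week-40 file name is made up for the example.

```r
# The first name is from the question; the week-40 name is a made-up example
files <- c(".../VHP.G04.C07.NC.P1981036.SM.SMN.Andes.tif",
           ".../VHP.G04.C07.NC.P1981040.SM.SMN.Andes.tif")

# Pull out the seven digits that follow ".P" in the file name
stamp <- sub(".*\\.P(\\d{7})\\..*", "\\1", basename(files))

year <- substr(stamp, 1, 4)   # characters 1-4
week <- substr(stamp, 6, 7)   # characters 6-7, skipping the separator at 5
year_week <- paste(year, week, sep = "-")
year_week
```

This gives the "1981-36" style label directly; convert to a Date afterwards only if the plotting code needs one.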
You can do it if you specify that Monday is weekday 1 with %u:
w <- c(35,36,37,38)
y <- c(1981,1981,1981,1981)
s <- c(1,1,1,1)
df <- data.frame(y,w,s)
df$d <- paste(as.character(df$y), as.character(df$w),as.character(df$s), sep=".")
df$date <- as.Date(df$d, "%Y.%U.%u")
# So here we have variable date as date if you need that for later.
class(df$date)
#[1] "Date"
# If you want it to look like Y-W, you can do the final formatting:
df$date <- format(df$date, "%Y-%U")
# y w s d date
# 1 1981 35 1 1981.35.1 1981-35
# 2 1981 36 1 1981.36.1 1981-36
# 3 1981 37 1 1981.37.1 1981-37
# 4 1981 38 1 1981.38.1 1981-38
# NB: though it looks correct, the resulting df$date is actually a character:
class(df$date)
#[1] "character"
Alternatively, you could do the same by setting the Sunday as 0 with %w.
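A sketch of that %w variant, reusing the years and weeks from above. The only change is that the appended weekday is 0 (Sunday under %w) rather than 1 (Monday under %u); the variable names d and date are illustrative.

```r
y <- c(1981, 1981, 1981, 1981)
w <- c(35, 36, 37, 38)

# Append weekday 0 (Sunday under %w) so each year-week names one specific day
d <- paste(y, w, 0, sep = ".")
date <- as.Date(d, "%Y.%U.%w")

# Still a Date at this point; format back to year-week only for display
format(date, "%Y-%U")
```

Because %U counts weeks from Sunday, picking Sunday as the day keeps the parsed date at the start of the same %U week.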

Get the month from the week of the year

Let's say we have this:
ex <- c('2012-41')
This represents week 41 of the year 2012. How would I get the month from this?
Since a week can be between two months, I will be interested to get the month when that week started (here October).
Not duplicate to How to extract Month from date in R (do not have a standard date format like %Y-%m-%d).
You could try:
ex <- c('2019-10')
splitDate <- strsplit(ex, "-")
dateNew <- as.Date(paste(splitDate[[1]][1], splitDate[[1]][2], 1, sep="-"), "%Y-%U-%u")
monthSelected <- lubridate::month(dateNew)
monthSelected
# [1] 3
I hope this helps!
This depends on the definition of week. See the discussion of %V and %W in ?strptime for two possible definitions of week. We use %V below, but the function allows one to specify the other if desired. The function performs an sapply over the elements of x, and for each such element it extracts the year into yr and forms a sequence of all dates for that year in sq. It then converts those dates to year-week strings and finds the first occurrence of the current component of x in that sequence, finally extracting the match's month.
yw2m <- function(x, fmt = "%Y-%V") {
sapply(x, function(x) {
yr <- as.numeric(substr(x, 1, 4))
sq <- seq(as.Date(paste0(yr, "-01-01")), as.Date(paste0(yr, "-12-31")), "day")
as.numeric(format(sq[which.max(format(sq, fmt) == x)], "%m"))
})
}
yw2m('2012-41')
## [1] 10
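To see why the definition matters, here is a small sketch that formats one date under three week conventions from ?strptime. The date is chosen because 1 January 2012 fell on a Sunday, which is where the conventions disagree; the vector name week_labels is my own.

```r
# 1 January 2012 fell on a Sunday, so the three conventions disagree
d <- as.Date("2012-01-01")

week_labels <- c(
  iso    = format(d, "%V"),  # ISO 8601: still the last week of 2011
  monday = format(d, "%W"),  # weeks start Monday; days before the first Monday are week 00
  sunday = format(d, "%U")   # weeks start Sunday; the first Sunday opens week 01
)
week_labels
```

So the same year-week string can point at different calendar weeks, and possibly a different month, depending on which format code produced it.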
The following converts year-week strings to dates by adding the week number, in weeks, to January 1 of the given year, and returns the dates as a character vector. With lubridate's weeks() function, the result corresponds to the end of the relevant week. Note, for example, that I've added a week-52 case to your 'ex' variable, and it returns December 31.
library(lubridate)
ex <- c('2012-41','2016-4','2018-52')
dates <- strsplit(ex,"-")
dates <- sapply(dates, function(x) {
  year_week <- unlist(x)
  year <- year_week[1]
  week <- as.numeric(year_week[2])  # weeks() is documented for numeric input
  start_date <- as.Date(paste0(year, '-01-01'))
  date <- start_date + weeks(week)
  # Note: the OP asked for the beginning of the week.
  # There is some ambiguity here; the line above gives the end of the week.
  # Uncomment the line below for the beginning of the week (6 days earlier).
  # I think this might yield inconsistent results, especially at year boundaries,
  # hence the suggestion to use the end of the week. See below for a possible solution.
  # date <- start_date + weeks(week) - days(6)
  return(as.character(date))
})
Yields:
> dates
[1] "2012-10-14" "2016-01-29" "2018-12-31"
And to simply get the month from these full dates:
month(dates)
Yields:
> month(dates)
[1] 10 1 12

Delete part of a value for the whole column

If I have a vector such as the following:
REF_YEAR
1994-01-01
1995-01-01
1996-01-01
how can I delete the part "-01-01", so that I only get the year for the whole column?
If your vector is formatted as Dates, you can do:
x <- as.Date("2001-01-01")
format(x, "%Y")
#[1] "2001"
And for your example data:
# Your sample data:
df <- read.table(header=TRUE, text = "REF_YEAR
1994-01-01
1995-01-01
1996-01-01", stringsAsFactors = FALSE)
Convert your data to Date format:
df$REF_YEAR <- as.Date(df$REF_YEAR) # skip this step if it's already formatted as Date
Now convert to year format:
df$REF_YEAR <- format(df$REF_YEAR, "%Y")
Or
transform(df, REF_YEAR = format(REF_YEAR, "%Y"))
Result in both cases:
df
# REF_YEAR
#1 1994
#2 1995
#3 1996
You only need to make sure your data is in Date format (use as.Date() for conversion).
This can also be done with plain string functions. You can either keep the first four characters or drop the last six. Here is the second option, as you asked:
ref_year = as.character("1994-01-01")
ref_year_only = substr(ref_year, 1, nchar(ref_year) - 6) ; ref_year_only
Also, please show some effort while asking questions on stack.
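For the first option mentioned above (keeping the first four characters), and for a whole column rather than a single value, a short base R sketch could look like this. The sub() line is an extra variant of my own that drops everything from the first hyphen onward.

```r
ref_year <- c("1994-01-01", "1995-01-01", "1996-01-01")

# Option 1: keep the first four characters
first_four <- substr(ref_year, 1, 4)

# Option 2: drop everything from the first hyphen onward
year_only <- sub("-.*$", "", ref_year)

first_four
year_only
```

Both return character vectors; wrap them in as.integer() if you need numbers.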
Without converting to Date, you could also try:
library(stringr)
df$YEAR <- str_extract(df$REF_YEAR, '\\d+(?=-)')  # current stringr handles the lookahead without perl()
df$YEAR
#[1] "1994" "1995" "1996"

Resources