as.Date function gives different result in a for loop - r

I have a slight problem: my as.Date call gives a different result when I put it in a for loop. I'm looking in a folder with subfolders (one per date) that contain images. I build date_list to organize all the dates (for plotting at a later stage). The Julian day counts from the first of January of each year, so because I have 4 years of data, the year must be flexible.
# Set up list with 4 columns and counter Q. jan is used to set all dates to the first of january
date_list <- outer(1:52, 1:4)
q = 1
jan <- "-01-01"
for (scene in folders){
year <- as.numeric(substr(scene, start=10, stop=13))
day <- as.numeric(substr(scene, start=14, stop=16))
datum <- paste(year, day, sep='_')
date_list[q, 1] <- datum
date_list[q, 2] <- year
date_list[q, 3] <- day
date_list[q, 4] <- as.Date(day, origin = as.Date(paste(year,jan, sep="")))
q = q+1
}
Output final row:
[52,] "2016_267" "2016" "267" "17068"
What am I missing in date_list[q, 4] that keeps my integer from being converted to a date?
Running the following code on its own does work, but because of the large number of scenes and folders I would like to automate this:
as.Date(day, origin = as.Date(paste(year,jan, sep="")))
Thank you for your time!

Well, I assume this would answer your first question:
date_list[q, 4] <- as.character(as.Date(datum,format="%Y_%j"))
as.Date accepts a format argument (the %Y and %j codes are documented in ?strptime); %j is the day of the year (001-366), often called the Julian day. This is a little easier to read than using origin and multiple paste calls.
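To make the difference concrete, here is a small sketch using the values from the last output row in the question (year 2016, day 267). It compares parsing with %j against adding the day number to a 1 January origin. As far as I can tell, the origin approach lands one day late, because it effectively counts 1 January as day 0 while %j counts it as day 001.

```r
# Day 267 of 2016, as in the last row of the question's output.
year <- 2016
day  <- 267

# Parse year and day-of-year directly; %j treats 1 January as day 001.
via_format <- as.Date(paste(year, day, sep = "_"), format = "%Y_%j")

# Add the day number to 1 January; this counts 1 January as day 0.
via_origin <- as.Date(day, origin = as.Date(paste0(year, "-01-01")))

print(via_format)               # 2016-09-23
print(via_origin)               # 2016-09-24
print(via_origin - via_format)  # a difference of 1 day
```

If the origin approach is preferred anyway, subtracting 1 from day before adding it should give the same date as %j.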
Your problem is actually linked to what a Date object is:
> dput(as.Date("2016-01-10"))
structure(16810, class = "Date")
When entered into a matrix (your date_list) it is coerced to character without any special treatment, like this:
> d<-as.Date("2016-01-10")
> class(d)<-"character"
> d
[1] "16810"
Hence you get only the number of days since 1970-01-01. When you ask for the character representation of the date with as.character, it gives the correct value because the Date class has an as.character method, which first computes the date in human-readable format before returning a character value.
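A minimal sketch of that coercion, and of one way around it. The matrix m and the data frame df are hypothetical names for illustration; the point is only that a matrix holds a single atomic type, while a data frame keeps the class of each column.

```r
d <- as.Date("2016-09-23")

# A matrix holds one atomic type, so the class attribute is dropped
# and only the underlying day count survives as text.
m <- matrix("", nrow = 1, ncol = 2)
m[1, 1] <- d                 # stored as the day count
m[1, 2] <- as.character(d)   # formatted first, then stored
print(m)

# A data frame keeps each column's class, so the Date stays a Date.
df <- data.frame(day = 267L, date = d)
print(class(df$date))
```

So if the dates are needed later for plotting, collecting them in a data frame (or a separate Date vector) is probably more convenient than a character matrix.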
Now, if I understood your problem correctly, I would go this way:
First create a function to work on one string:
name_to_list <- function(name) {
dpart <- substr(name, start=10, stop=16)
date <- as.POSIXlt(dpart, format="%Y%j")
c("datum"=paste(date$year+1900,date$yday,sep="_"), "year"=date$year+1900, "julian_day"=date$yday, "date"=as.character(date) )
}
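This function just takes your substring and converts it to the POSIXlt class, which gives us the Julian day, year and date in one pass. As the year is stored as an integer counted from 1900 (it could be negative), we have to add 1900 when storing the year in the fields. Note that yday is zero-based (0 to 365), so the julian_day value returned is one less than the day number in the folder name; add 1 if you need them to match.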
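As an aside, the same parsing can also be done on a whole vector at once, without a loop or lapply. This is only a sketch: the folder names are assumed to follow the pattern described in the question (year at positions 10-13, day of year at 14-16), and date_df is a hypothetical name. It uses format() with %j, which is one-based, so the day number matches the folder name.

```r
# Example folder names, assumed to carry YYYYDDD at positions 10 to 16.
folders <- c("LC81730382016267LGN00",
             "LC81730382016287LGN00",
             "LC81730382016167LGN00")

# substr() and as.Date() are both vectorised.
dates <- as.Date(substr(folders, 10, 16), format = "%Y%j")

date_df <- data.frame(
  datum      = format(dates, "%Y_%j"),
  year       = as.integer(format(dates, "%Y")),
  julian_day = as.integer(format(dates, "%j")),  # one-based day of year
  date       = dates,                            # stays a Date column
  stringsAsFactors = FALSE
)
print(date_df)
```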
Then, if your folders variable is a vector of strings:
lapply(folders,name_to_list)
which for folders=c("LC81730382016267LGN00","LC81730382016287LGN00","LC81730382016167LGN00") gives:
[[1]]
datum year julian_day date
"2016_266" "2016" "266" "2016-09-23"
[[2]]
datum year julian_day date
"2016_286" "2016" "286" "2016-10-13"
[[3]]
datum year julian_day date
"2016_166" "2016" "166" "2016-06-15"

Do you mean to output your day as 3 numbers? Should it not be 2 numbers?
day <- as.numeric(substr(scene, start=15, stop=16))
or
day <- as.numeric(substr(scene, start=14, stop=15))
That could at least be part of the issue. Providing an example of what typical values of "scene" are would be helpful here.

Related

data frame with mixed date format

I would like to convert all the mixed date formats into a single format, for example d-m-y.
Here is the data frame:
x <- data.frame("Name" = c("A","B","C","D","E"), "Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
I have tried the code below, but it gives NAs:
newdateformat <- as.Date(x$Birthdate,
format = "%m%d%y", origin = "2020-6-25")
newdateformat
Then I tried parse_date_time from lubridate, but it also gives NAs, which means it failed to parse:
require(lubridate)
parse_date_time(my_data$Birthdate, orders = c("ymd", "mdy"))
[1] NA NA "2001-09-12 UTC" NA
[5] "2005-02-18 UTC"
I also couldn't find out what the format of the first date in the data frame, "36085.0", is.
I did find this code, but I still couldn't understand what the number means or what "origin" means:
dates <- c(30829, 38540)
betterDates <- as.Date(dates,
origin = "1899-12-30")
P.S. I'm quite new to R, so I would appreciate a simple explanation. Thank you!
You should parse each format separately. For each format, select the relevant rows with a regular expression and transform only those rows, then move on to the next format. I'll give the answer with data.table instead of data.frame because I've forgotten how to use data.frame.
library(lubridate)
library(data.table)
x = data.table("Name" = c("A","B","C","D","E"),
"Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
# or use setDT(x) to convert an existing data.frame to a data.table
# handle dates like "2001-sep-12" and "2020-6-25"
# this regex matches strings beginning with four numbers and then a dash
x[grepl('^[0-9]{4}-',Birthdate),Birthdate1:=ymd(Birthdate)]
# handle dates like "36085.0": days since 1904 (or 1900)
# see https://learn.microsoft.com/en-us/office/troubleshoot/excel/1900-and-1904-date-system
# this regex matches strings that only have numeric characters and .
x[grepl('^[0-9\\.]+$',Birthdate),Birthdate1:=as.Date(as.numeric(Birthdate),origin='1904-01-01')]
# assume the rest are like "Feb-18-2005" and "05/27/84" and handle those
x[is.na(Birthdate1),Birthdate1:=mdy(Birthdate)]
# result
> x
Name Birthdate Birthdate1
1: A 36085.0 2002-10-18
2: B 2001-sep-12 2001-09-12
3: C Feb-18-2005 2005-02-18
4: D 05/27/84 1984-05-27
5: E 2020-6-25 2020-06-25
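On what "origin" means: a number like 36085 is a count of days, and origin is the date that counts as day zero. The sketch below (base R only) shows how the same number turns into different dates depending on the origin. Which origin is right depends on the program that produced the number, so treat the choice here as an assumption to check against your source file.

```r
serial <- as.numeric("36085.0")

# Interpreted with the origin from the snippet in the question.
as_1900 <- as.Date(serial, origin = "1899-12-30")
# Interpreted with the origin used in the answer above.
as_1904 <- as.Date(serial, origin = "1904-01-01")

print(as_1900)  # 1998-10-17
print(as_1904)  # 2002-10-18

# Once the value is a Date, d-m-y output is just a format() call.
print(format(as_1904, "%d-%m-%Y"))  # "18-10-2002"
```

Note that format() returns character strings, so it is best applied as the last step, after all parsing and sorting is done.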

Find and extract year within sentence for each cell in R

I have a large dataframe of 22641 obs. and 12 variables.
The first column "year" includes extracted values from satellite images in the format below.
1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc
From this format, I only want to keep the date which in this case is 19870517 and format it as date (so two different things). Usually, I use the regex to extract the words that I want, but here the date is different for each cell and I have no idea how to replace the above text with only the date. Maybe the way to do this is to search by position within the sentence but I do not know how.
Any ideas?
Thanks.
It's not clear what the "date is different in each cell" means but if it means that the value of the date is different and it is always the 7th field then either of (1) or (2) will work. If it either means that it consists of 8 consecutive digits anywhere in the text or 8 consecutive digits surrounded by _ anywhere in the text then see (3).
1) Assuming the input DF shown in reproducible form in the Note at the end use read.table to read year, pick out the 7th field and then convert it to Date class. No packages are used.
transform(read.table(text = DF$year, sep = "_")[7],
year = as.Date(as.character(V7), "%Y%m%d"), V7 = NULL)
## year
## 1 1987-05-17
2) Another alternative is separate in tidyr. 0.8.2 or later is needed.
library(dplyr)
library(tidyr)
DF %>%
separate(year, c(rep(NA, 6), "year"), extra = "drop") %>%
mutate(year = as.Date(as.character(year), "%Y%m%d"))
## year
## 1 1987-05-17
3) This assumes that the date is the only sequence of 8 digits in the year field. If we know it is surrounded by _ delimiters, then the regular expression "_(\\d{8})_" can be used instead.
library(gsubfn)
transform(DF,
year = do.call("c", strapply(DF$year, "\\d{8}", ~ as.Date(x, "%Y%m%d"))))
## year
## 1 1987-05-17
Note
DF <- data.frame(year = "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
stringsAsFactors = FALSE)
Not sure if this will generalize to your whole data but maybe:
gsub(
'(^(?:.*?[^0-9])?)(\\d{8})((?:[^0-9].*)?$)',
'\\2',
'1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc',
perl = TRUE
)
## [1] "19870517"
This uses group capturing and throws away anything but bounded 8 digit strings.
You can use sub to extract the date string and as.Date to convert it into R's date format:
as.Date(sub(".+?([0-9]+)_[^_]+$", "\\1", txt), "%Y%m%d")
# [1] "1987-05-17"
where txt <- "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc"
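Since the question mentions searching by position, here is a base R sketch of that idea: split on the underscore and take the 7th piece. It assumes the date is always the 7th field, as in the example string; the name date_str is just for illustration.

```r
txt <- c("1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc")

# Split each string on "_" (fixed = TRUE: a literal underscore, not a regex).
fields <- strsplit(txt, "_", fixed = TRUE)

# Take the 7th field of every element; vapply() guarantees a character vector.
date_str <- vapply(fields, `[`, character(1), 7)

# Convert the 8-digit string to a Date.
dates <- as.Date(date_str, format = "%Y%m%d")
print(date_str)  # "19870517"
print(dates)     # 1987-05-17
```

This works on a whole column at once, for example by passing DF$year instead of txt.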

R - character string with week-Year: week is lost when converting to Date format

I have a character string of the date in Year-week format as such:
weeks.strings <- c("2002-26", "2002-27", "2002-28", "2002-29", "2002-30", "2002-31")
However, converting this character to Date class results in a loss of week identifier:
> as.Date(weeks.strings, format="%Y-%U")
[1] "2002-08-28" "2002-08-28" "2002-08-28" "2002-08-28" "2002-08-28"
[6] "2002-08-28"
As shown above, the result is the year concatenated with today's month and day, so any information about the original week is lost (for example, when using the format function or strptime to try to coerce back into the original format).
One solution I found in a help group is to specify the day of the week:
as.Date(weeks.strings, format="%Y-%u %U")
[1] "2002-02-12" "2002-02-19" "2002-02-26" "2002-03-05" "2002-01-02"
[6] "2002-01-09"
But it looks like this results in incorrect week numbering (doesn't match the original string).
Any guidance would be appreciated.
You just need to add a weekday to your weeks.strings in order to make the dates unambiguous (adapted from Jim Holtman's answer on R-help).
as.Date(paste(weeks.strings,1),"%Y-%U %u")
As pointed out in the comments, the Date class is not appropriate if the dates span a long horizon because--at some point--the chosen weekday will not exist in the first/last week of the year. In that case you could use a numeric vector where the whole portion is the year and the decimal portion is the fraction of weeks/year. For example:
wkstr <- sprintf("%d-%02d", rep(2000:2012,each=53), 0:52)
yrwk <- lapply(strsplit(wkstr, "-"), as.numeric)
yrwk <- sapply(yrwk, function(x) x[1]+x[2]/53)
Obviously, there's no unique solution, since each week could be represented by any of up to 7 different dates. That said, here's one idea:
weeks.strings <- c("2002-26", "2002-27", "2002-28", "2002-29",
"2002-30", "2002-31")
x <- as.Date("2002-1-1", format="%Y-%m-%d") + (0:52*7)
x[match(weeks.strings, format(x, "%Y-%U"))]
# [1] "2002-07-02" "2002-07-09" "2002-07-16" "2002-07-23"
# [5] "2002-07-30" "2002-08-06"
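A quick way to check any of these approaches is to format the resulting dates back with "%Y-%U" and compare them with the input. The sketch below does that for the "append a weekday" idea, using Monday (%u value 1) as an arbitrary choice of representative day.

```r
weeks.strings <- c("2002-26", "2002-27", "2002-28",
                   "2002-29", "2002-30", "2002-31")

# Append weekday 1 (Monday) so that each week maps to one specific date.
dates <- as.Date(paste(weeks.strings, 1), format = "%Y-%U %u")

# Round trip: the week labels recovered from the dates should match the input.
recovered <- format(dates, "%Y-%U")
print(dates)
print(identical(recovered, weeks.strings))
```

If the round trip fails for some entries, those are likely the edge cases mentioned above, where the chosen weekday does not exist in the first or last week of the year.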

Disambiguating day of the week in R

I have a certain start time and a specified day of the week.
start = as.POSIXct(1234567, origin = "1970-1-1")
format(start, format = "%A %c")
target1 = "TUE"
target2 = "Wednesday"
What I want is to find the first day, following start, that matches the corresponding day of the week. (And hopefully is somewhat flexible as to how the user might input the day of the week target) Any idea? I imagine a string lookup table might work, but there's gotta be a neater way.
Bonus points if the solution can be made to vectorise....
I haven't tried vectorizing this yet (not sure if I can), but here's an attempt:
find_day <- function(start,target){
target <- tolower(target)
next_week <- as.Date(start) + 1:7
next_week[match(target,substr(tolower(weekdays(next_week)),1,nchar(target)))]
}
It should accept an abbreviation of a day of any length and in any capitalization. How to use it:
> find_day(start,"TUE")
[1] "1970-01-20"
> find_day(start,"friday")
[1] "1970-01-16"
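For the vectorised bonus, one possible sketch is to do the arithmetic on weekday numbers instead of matching names against the next seven days. The function next_weekday and its internals are hypothetical names, not part of the answer above. It uses a fixed English name table with pmatch, so it does not depend on the locale, and it returns NA for ambiguous abbreviations such as "t".

```r
next_weekday <- function(start, target) {
  day_names <- c("sunday", "monday", "tuesday", "wednesday",
                 "thursday", "friday", "saturday")
  # Partial matching on the lower-cased input; 0 = Sunday ... 6 = Saturday.
  target_idx <- pmatch(tolower(target), day_names, duplicates.ok = TRUE) - 1L
  start_date <- as.Date(start)
  start_idx  <- as.POSIXlt(start_date)$wday
  # Days to add: always between 1 and 7, so the result is strictly after start.
  offset <- (target_idx - start_idx - 1L) %% 7L + 1L
  start_date + offset
}

# 1970-01-15 is the date of the question's start value in UTC (a Thursday).
start <- as.Date("1970-01-15")
result <- next_weekday(start, c("TUE", "friday", "Wednesday"))
print(result)  # 1970-01-20, 1970-01-16, 1970-01-21
```

Both arguments are recycled in the usual R way, so a vector of start dates with a single target also works.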

How to import data in the right monthly order using list.files

I have 100 years of monthly data where each month is a file and the file name ends with the year and month of the data.
e.g. "cru_ts_3_10.1901.2009.pet_1901_1.asc" is the file for year 1901, month 1 (January).
The problem is when I list my files the order of the files changes, the months 10, 11 and 12 come after 1:
files <- list.files(pattern=".asc")
head(files)
[1] "cru_ts_3_10.1901.2009.pet_1901_1.asc" "cru_ts_3_10.1901.2009.pet_1901_10.asc" "cru_ts_3_10.1901.2009.pet_1901_11.asc"
[4] "cru_ts_3_10.1901.2009.pet_1901_12.asc" "cru_ts_3_10.1901.2009.pet_1901_2.asc" "cru_ts_3_10.1901.2009.pet_1901_3.asc"
I can see why that happens, but how can I import my data in the right monthly order?
Another regex based solution. It works by extracting the year and month from a filename to construct a real date, and then uses the sort order to print the file list.
pat <- "^.*pet_([0-9]{1,})_([0-9]{1,}).asc$"
ord_files <- as.Date(gsub(pat, sprintf("%s-%s-01", "\\1", "\\2"), files))
files[order(ord_files)]
EXPLANATION
We use regular expressions to match the year and month in the file name: \\1 refers to the captured year and \\2 to the captured month. We still need to convert the result to a date. The call sprintf("%s-%s-01", "\\1", "\\2") builds the replacement string "\\1-\\2-01", so gsub turns each file name into a string such as "1901-1-01". as.Date is then required to convert that string into a date.
files <- c("cru_ts_3_10.1901.2009.pet_1901_1.asc",
"cru_ts_3_10.1901.2009.pet_1901_10.asc",
"cru_ts_3_10.1901.2009.pet_1901_11.asc",
"cru_ts_3_10.1901.2009.pet_1901_12.asc",
"cru_ts_3_10.1901.2009.pet_1901_2.asc",
"cru_ts_3_10.1901.2009.pet_1901_3.asc",
"cru_ts_3_10.1901.2009.pet_1902_1.asc",
"cru_ts_3_10.1901.2009.pet_1902_10.asc",
"cru_ts_3_10.1901.2009.pet_1902_11.asc")
This splits the names on underscores and selects the last part (e.g. "1.asc"), then removes the ".asc" using sub. It converts what is left into a number and uses sprintf on that number to get a 2-character (digit) string. Finally it combines the year and the padded month into a single number and orders based on that.
files[order(sapply(strsplit(files, "_"), function(x) {
m <- sprintf("%02d", as.numeric(sub(".asc", "", tail(x, 1)))) # turns "1.asc" into "01"; tail(x, 1) is the last piece in base R
as.numeric(paste(x[length(x) - 1], m, sep=""))
}))]
Returns:
[1] "cru_ts_3_10.1901.2009.pet_1901_1.asc"
[2] "cru_ts_3_10.1901.2009.pet_1901_2.asc"
[3] "cru_ts_3_10.1901.2009.pet_1901_3.asc"
[4] "cru_ts_3_10.1901.2009.pet_1901_10.asc"
[5] "cru_ts_3_10.1901.2009.pet_1901_11.asc"
[6] "cru_ts_3_10.1901.2009.pet_1901_12.asc"
[7] "cru_ts_3_10.1901.2009.pet_1902_1.asc"
[8] "cru_ts_3_10.1901.2009.pet_1902_10.asc"
[9] "cru_ts_3_10.1901.2009.pet_1902_11.asc"
Look at the mixedsort function in the gtools package.
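If you would rather stay in base R, another sketch is to pull the year and month out as integers and pass both to order(). The pattern assumes every file name ends in _YYYY_M.asc or _YYYY_MM.asc, as in the examples above.

```r
files <- c("cru_ts_3_10.1901.2009.pet_1901_1.asc",
           "cru_ts_3_10.1901.2009.pet_1901_10.asc",
           "cru_ts_3_10.1901.2009.pet_1901_2.asc",
           "cru_ts_3_10.1901.2009.pet_1902_1.asc")

# Capture the year and the month at the end of each file name.
parts <- regmatches(files, regexec("_([0-9]{4})_([0-9]{1,2})\\.asc$", files))

# Element 1 of each match is the full match; 2 and 3 are the capture groups.
year  <- as.integer(vapply(parts, `[`, character(1), 2))
month <- as.integer(vapply(parts, `[`, character(1), 3))

# Sort by year first, then by month, both compared as numbers.
sorted_files <- files[order(year, month)]
print(sorted_files)
```

The sorted vector can then be passed to whatever function reads the files, for example inside lapply().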
