R: Why is class Date lost upon subsetting

Here is an easy example. I have a data frame with three dates in it:
Data <- as.data.frame(as.Date(c('1970/01/01', '1970/01/02', '1970/01/03')))
names(Data) <- "date"
Now I add a column consisting of the same entries:
for(i in 1:3){
Data[i, "date2"] <- Data[i, "date"]
}
Output looks like this:
date date2
1 1970-01-01 0
2 1970-01-02 1
3 1970-01-03 2
For unknown reasons the class of column date2 is numeric instead of Date, which was the class of date. Curiously, if you tell R explicitly to use the Date format:
for(i in 1:3){
Data[i, "date3"] <- as.Date(Data[i, "date"])
}
it doesn't make any difference.
date date2 date3
1 1970-01-01 0 0
2 1970-01-02 1 1
3 1970-01-03 2 2
The problem seems to lie in the use of subsetting with []. The same thing happens in more interesting examples, where you have two columns of dates and want to create a third one that picks a date from one of the two other columns depending on some factor.
Of course we can fix everything in retrospect by doing something like:
Data$date4 <- as.Date(Data$date2, origin = "1970-01-01")
but I'm still wondering: why? Why is this happening? Why can't my dates just stay dates when being transferred to another column??

This is not a final solution, but I think it can help you understand what is going on.
Here is your data:
Data <- data.frame(date =
as.Date(c('2000/01/01', '2012/01/02', '2013/01/03')))
Take these two vectors, one typed by default as numeric and the second as Date:
vv <- vector("numeric",3)
vv.Date <- vector("numeric",3)
class(vv.Date) <- 'Date'
vv
[1] 0 0 0
> vv.Date
[1] "1970-01-01" "1970-01-01" "1970-01-01" ## a Date vector of zeros prints as the origin, 1970-01-01
Now if I try to assign the first element of each vector as you do in the first step of your loop:
vv[1] <- Data$date[1]
vv.Date[1] <- Data$date[1]
vv
[1] 10957 0 0
> vv.Date
[1] "2000-01-01" "1970-01-01" "1970-01-01"
As you can see, the typed vector is filled correctly. What happens is that when you assign a scalar value into an existing vector, R internally converts the value to the type of that vector. To return to your example: when you do this,
you are in effect creating a numeric vector (like vv) and then trying to assign dates to it:
for(i in 1:3){
Data[i, "date3"] <- as.Date(Data[i, "date"])
}
If you first give date3 the Date type, for example:
Data$date3 <- vv.Date
and then try again:
for(i in 1:3){
Data[i, "date3"] <- as.Date(Data[i, "date"])
}
You will get a good result:
date date3
1 2000-01-01 2000-01-01
2 2012-01-02 2012-01-02
3 2013-01-03 2013-01-03
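The same idea can be sketched without the helper vector. This is only a minimal illustration, not the one true fix: the column names date2, date3, alt and picked and the use_alt condition are made up for the example. It shows three ways to keep the Date class: assigning the whole column at once, creating the new column as a Date before filling it row by row, and picking between two Date columns by starting from one and overwriting selected elements.

```r
Data <- data.frame(date = as.Date(c("2000-01-01", "2012-01-02", "2013-01-03")))

# 1) Whole-column assignment copies the vector together with its class.
Data$date2 <- Data$date

# 2) Create the column as a Date (all NA) first, then fill it element by element.
Data$date3 <- as.Date(rep(NA, nrow(Data)))
for (i in seq_len(nrow(Data))) {
  Data[i, "date3"] <- Data[i, "date"]
}

# 3) Pick from one of two Date columns depending on a condition
#    (alt and use_alt are hypothetical, for illustration only).
Data$alt <- as.Date(c("1999-12-31", "2011-12-31", "2012-12-31"))
use_alt  <- c(TRUE, FALSE, TRUE)
picked   <- Data$date                  # start from a Date vector
picked[use_alt] <- Data$alt[use_alt]   # replacing elements keeps the class
Data$picked <- picked

sapply(Data, class)
```

In each case the target is already a Date vector when elements are written into it, so the values are not reduced to bare day counts.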

Related

How to change specific dates in POSIXct/POSIXt format to NA

I have imported an SPSS file, which contains several date/time variables of the following class:
[1] "POSIXct" "POSIXt"
The user-defined missing value for these variables is 8888-08-08 00:00:00. How can I convert this value to NA for the set of relevant date/time variables in R?
I tried running df$datetime[df$datetime == "8888-08-08"] <- NA as well as df$datetime[df$datetime == as.Date("8888-08-08")] <- NA to no avail.
Since these columns are POSIXct, compare against a value of the same class and assign NA to the matches:
df$datetime[df$datetime == as.POSIXct("8888-08-08 00:00:00")] <- NA
Example data:
set.seed(24)
df <- data.frame(datetime = sample(c(Sys.time(), Sys.time() + 1:5,
as.POSIXct("8888-08-08 00:00:00")), 20, replace =TRUE))
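To apply this to a whole set of date/time variables, a loop over the POSIXct columns is one option. The sketch below is hedged: the column names start, end and n are invented, and tz = "UTC" is an assumption (the sentinel must be built in the same time zone the imported values were interpreted in, otherwise the comparison may not match).

```r
# Toy data with two date/time columns and one ordinary column (names are made up).
df <- data.frame(
  start = as.POSIXct(c("2020-01-01 10:00:00", "8888-08-08 00:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("8888-08-08 00:00:00", "2021-06-15 12:30:00"), tz = "UTC"),
  n     = c(1, 2)
)

# The user-defined missing value, as a POSIXct in the same (assumed) time zone.
sentinel <- as.POSIXct("8888-08-08 00:00:00", tz = "UTC")

# Find the POSIXct columns and blank out the sentinel in each of them.
is_dt <- vapply(df, inherits, logical(1), what = "POSIXct")
for (col in names(df)[is_dt]) {
  hit <- which(df[[col]] == sentinel)  # which() skips values that are already NA
  df[[col]][hit] <- NA
}
df
```

The columns stay POSIXct after the replacement; only the sentinel entries become NA.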

How to change syntax of column in R?

I have df1:
ID Time
1 16:00:00
2 14:30:00
3 9:23:00
4 10:00:00
5 23:59:00
and would like to change the current 'character' column 'Time' into an 'integer' as below:
ID Time
1 1600
2 1430
3 923
4 1000
5 2359
We could replace the :'s, make numeric, divide by 100, and convert to integer like this:
df1$Time = as.integer(as.numeric(gsub(':', '', df1$Time))/100)
You want to use as.POSIXct().
Functions to manipulate objects of classes "POSIXlt" and "POSIXct" representing calendar dates and times.
R Documents as.POSIXct()
So in the case of row 1: as.POSIXct("16:00:00", format = "%H:%M:%S")
Then format the result with format(x, "%H%M") and wrap it in as.integer() if you need it to truly be an int. Calling as.numeric() directly on a POSIXct value returns seconds since 1970-01-01, not 1600.
df1 <- data.frame(Time = "16:00:00")
df1[, "Time"] <- as.numeric(paste0(substr(df1[, "Time"], 1, 2), substr(df1[, "Time"], 4, 5)))
print(df1)
# Time
# 1 1600
There are many ways to process this, but here's one example:
library(dplyr)
df1 <- mutate(df1, Time = gsub(":", "", Time)) # remove the colons
df1 <- mutate(df1, Time = as.numeric(Time)/100) # coerce to numeric type, divide by 100
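For completeness, here is a small base R sketch run on the full example from the question, including the single-digit hour in row 3. It assumes every value has the form H:MM:SS or HH:MM:SS; anything else would be left unchanged by sub() and turn into NA (with a warning) in as.integer().

```r
df1 <- data.frame(
  ID   = 1:5,
  Time = c("16:00:00", "14:30:00", "9:23:00", "10:00:00", "23:59:00"),
  stringsAsFactors = FALSE
)

# Keep the hour and minute digits, drop the colons and the seconds.
hhmm <- sub("^(\\d{1,2}):(\\d{2}):\\d{2}$", "\\1\\2", df1$Time)
df1$Time <- as.integer(hhmm)

df1$Time
# [1] 1600 1430  923 1000 2359
```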

How to convert list to dataframe without type conversion on date

I'm trying to write a script which can take a file, look up some metadata relating to the file, and convert certain columns based on that metadata. For example, suppose my data looks like the output of the following:
test_data <- data.frame(date1 = c("03/02/2018","04/25/2018"),date2 = c("9/14/17","9/27/17"))
and suppose that, based on a metadata lookup I found that the columns date1 and date2 of the input file have, respectively, the formats
date_formats <- c("%m/%d/%Y","%m/%d/%y")
So my script would then proceed to define index as a boolean vector which contains the value TRUE where I have a date column and FALSE otherwise, and then attempt to convert all such columns to a standardized R date format:
test_data[,index] <- as.data.frame(
lapply(test_data[,index],as.Date,
format = date_formats[index],
origin ="1970-01-01"))
But this produces some bizarre output:
date1 date2
1 2018-03-02 0017-09-14
2 2020-04-25 2017-09-27
Notice that the years for the (1,2) and (2,1) entries are off. I don't understand why the other values were properly converted. That is mystery #1.
The other mystery is that, if I try to convert only one column, say
as.data.frame(lapply(test_data[,1],as.Date,format = c("%m/%d/%Y")))
then I get undesirable output:
structure.17592..class....Date.. structure.17646..class....Date..
1 2018-03-02 2018-04-25
and if I first wrap this with cbind a la
as.data.frame( cbind(lapply(test_data[,1],as.Date,format = c("%m/%d/%Y"))))
then what I get are the raw, unformatted date values because of the behaviour of cbind:
V1
1 17592
2 17646
So how can I write this generic method which can handle an arbitrary number of columns, with different formats, and convert them all to the same formatted date type in a dataframe?
Try this:
test_data <- data.frame(date1 = c("03/02/2018","04/25/2018"),date2 = c("9/14/17","9/27/17"))
date_formats <- c("%m/%d/%Y","%m/%d/%y")
index <- c(TRUE,TRUE)
test_data[,index] <-
as.data.frame(
lapply(which(index),function(i)
as.Date(test_data[[i]],
format = date_formats[i],
origin ="1970-01-01")))
# date1 date2
# 1 2018-03-02 2017-09-14
# 2 2018-04-25 2017-09-27
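Your lapply call passed the whole length-2 vector date_formats[index] to every as.Date call; lapply didn't loop over it. as.Date then recycled the two formats across the rows of each column, which is why the (1,2) and (2,1) entries came out wrong. We need to convert your boolean index to numeric positions and loop over those, so that each column gets its own format.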
Here is cleaner code to achieve what you want:
test_data[,index] <-
Map(as.Date,test_data[index],date_formats[index],origin ="1970-01-01")
# date1 date2
# 1 2018-03-02 2017-09-14
# 2 2018-04-25 2017-09-27
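To illustrate how this generalizes when only some columns are dates, here is a hedged sketch with an extra non-date column. The id column and the NA placeholder in date_formats are assumptions made for the example, not part of the original data.

```r
test_data <- data.frame(
  id    = 1:2,
  date1 = c("03/02/2018", "04/25/2018"),
  date2 = c("9/14/17", "9/27/17"),
  stringsAsFactors = FALSE
)

# One entry per column: TRUE/FALSE for "is a date", and the matching format.
index        <- c(FALSE, TRUE, TRUE)
date_formats <- c(NA, "%m/%d/%Y", "%m/%d/%y")

# Map pairs each selected column with its own format, one column at a time.
test_data[index] <- Map(
  function(col, fmt) as.Date(col, format = fmt),
  test_data[index],
  date_formats[index]
)

str(test_data)
```

Because each date column is replaced as a whole by a Date vector, the class survives the assignment and the non-date column is left alone.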
It seems the OP's intention is to read the data from a file, certain columns of which contain dates in different formats. @Moody_Mudskipper has already provided a nice solution to convert the data once it has been read from the file.
Another option is to use the colClasses argument of the read functions (e.g. read.table, read.csv etc.) itself and get the date columns converted while reading.
# Test data to be read from file. I have added one more column ID in data from OP
textData <- "
ID date1 date2
1 03/02/2018 9/14/17
2 04/25/2018 9/27/17"
setClass("dateformat1")
setClass("dateformat2")
setAs("character", "dateformat1", function(from)as.Date(from, format = "%m/%d/%Y"))
setAs("character", "dateformat2", function(from)as.Date(from, format = "%m/%d/%y"))
read.table(text = textData, header = TRUE, stringsAsFactors = FALSE,
colClasses = c("numeric", "dateformat1","dateformat2"))
# ID date1 date2
# 1 1 2018-03-02 2017-09-14
# 2 2 2018-04-25 2017-09-27

R - How to format the date of several columns in a datatable/dataframe

I want to format several columns in datatable/dataframe using lubridate and column indexing.
Suppose that there is a very large data set which has several unformatted date columns. The question is how can I identify those columns (most likely through indexing) and then format them at the same time in one script using lubridate.
library(data.table)
library (lubridate)
> dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
> dt
date1 var1 date2
1 14.01.2009 2.919293 09.01.2009
2 9/2/2005 2.390123 23/8/2005
3 24/1/2010 0.878209 17.01.2000
4 28.01.2014 2.224461 04.01.2005
dt <- setDT(dt)
I tried these :
> dmy(dt$date1,dt$date2) # this does not generate two columns
[1] "2009-01-14" "2005-02-09" "2010-01-24" "2014-01-28" "2009-01-09" "2005-08-23"
[7] "2000-01-17" "2005-01-04"
> as.data.frame(dmy(dt$date1,dt$date2)) # this does not generate two columns either
dmy(dt$date1, dt$date2)
1 2009-01-14
2 2005-02-09
3 2010-01-24
4 2014-01-28
5 2009-01-09
6 2005-08-23
7 2000-01-17
8 2005-01-04
dmy(dt[,.SD, .SD =c(1,3)])
[1] NA NA
> sapply(dmy(dt$date1,dt$date2),dmy)
[1] NA NA NA NA NA NA NA NA
Warning messages:
1: All formats failed to parse. No formats found.
Any help is highly appreciated.
How about:
dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
for(i in c(1,3)){
dt[,i] <- dmy(dt[,i])
}
Here's a data.table way. Suppose you have k columns named dateX:
k = 2
date_cols = paste0('date', 1:k)
for (col in date_cols) {
set(dt, j=col, value=dmy(dt[[col]]))
}
You can avoid the loop, but apparently the loop may be faster; see this answer
dt[,(date_cols) := lapply(.SD, dmy), .SDcols=date_cols]
EDIT
If you have arbitrary column names, you can pick out the date columns by their contents instead, assuming the data looks as in the OP:
is_date = sapply(dt, function(x) !is.numeric(x) && all(grepl("^\\d{1,2}(\\.|/)\\d{1,2}(\\.|/)\\d{4}$", x)))
date_cols = names(dt)[is_date]
You can extend the regular expression if there are more delimiters than . or /.
Far from perfect, this is a solution which should be more general:
The only assumption here is, that the date columns contain digits separated by either . , / or -. If there's other separators, they may be added. But if you have another variable which is similar, but not a date, this won't work well.
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[,j]))) dt[,j] <- dmy(dt[,j])
This loops through the columns and checks if a date could be present using regular expressions. If so, it will convert it to a date and overwrite the column.
Using data.table:
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[[j]]))) set(dt,j = j, value = dmy(dt[[j]]))
You could also replace all with any with the idea that if you have any match in the column, you could assume all of the values in that column are dates which can be read by dmy.
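If lubridate is not available, the same detect-then-convert idea can be sketched in base R. This is a swapped-in technique rather than what the answers above use, and it rests on stated assumptions: day-month-year order, a four-digit year, and only ., / or - as separators. The helper names looks_like_date and parse_dmy are made up for the example.

```r
dt <- data.frame(
  date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),
  var1  = c(2.9, 2.4, 0.9, 2.2),
  date2 = c("09.01.2009", "23/8/2005", "17.01.2000", "04.01.2005"),
  stringsAsFactors = FALSE
)

# A column counts as a date column if it is character and every value
# looks like d.m.yyyy with ., / or - as the separator.
looks_like_date <- function(x) {
  is.character(x) && all(grepl("^\\d{1,2}[./-]\\d{1,2}[./-]\\d{4}$", x))
}

# Normalise the separators to "." and parse as day.month.year.
parse_dmy <- function(x) as.Date(gsub("[/-]", ".", x), format = "%d.%m.%Y")

date_cols <- names(dt)[vapply(dt, looks_like_date, logical(1))]
dt[date_cols] <- lapply(dt[date_cols], parse_dmy)

str(dt)
```

As with the regex approach above, a non-date column that happens to match the pattern would be converted too, so the check should be tightened for real data.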

Conditional subset of data from list base on date R

I have several .csv files containing hourly data. Each file represents data from a point in space. The start and end date is different in each file.
The data can be read into R using:
lstf1<- list.files(pattern=".csv")
lst2<- lapply(lstf1,function(x) read.csv(x,header = TRUE,stringsAsFactors=FALSE,sep = ",",fill=TRUE, dec = ".",quote = "\""))
head(lst2[[800]])
datetime precip code
1 2003-12-30 00:00:00 NA M
2 2003-12-30 01:00:00 NA M
3 2003-12-30 02:00:00 NA M
4 2003-12-30 03:00:00 NA M
5 2003-12-30 04:00:00 NA M
6 2003-12-30 05:00:00 NA M
datetime is YYYY-MM-DD HH:MM:SS, precip is the data value, code can be ignored.
For each dataframe (df) in lst2 I want to select data for the period 2015-04-01 to 2015-11-30 based on the following conditions:
1) If precip in a df contains all NAs within this period, delete it (do not select)
2) If precip is not all NAs select it.
The desired output (lst3) contains the sub-setted data for the period 2015-04-01 to 2015-11-30.
All dataframes in lst3 should have equal length, with days and hours without precip denoted as NA.
Then I can write the files in lst3 to my directory using something like:
sapply(names(lst2),function (x) write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
The link to a sample file can be found here (~200 KB)
It's a little hard to understand exactly what you are trying to do, but this example (using dplyr, which has nice filter syntax) on the file you provided should get you close:
library(dplyr)
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
# Get the required date range and delete the NAs
df.sub <- filter(df, !is.na(precip),
datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Check if the subset has any rows left (it will be empty if it was full of NA for precip)
if (nrow(df.sub) > 0) {
df.result <- filter(df, datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Then add df.result to your list of data frames...
} # else, don't add it to your list
I think you are saying that you want to retain NAs in the data frame if there are also valid precip values--you only want to discard if there are NAs for the entire period. If you just want to strip all NAs, then just use the first filter statement and you are done. You obviously don't need to use POSIXct if you've already got your dates encoded correctly another way.
EDIT: w/ function wrapper so you can use lapply:
library(dplyr)
# Get some example data
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
dfnull <- df
dfnull$precip <- NA
# list of 3 input data frames to test, 2nd one has precip all NA
df.list <- list(df, dfnull, df)
# Function to do the filtering; returns list of data frames to keep or null
filterprecip <- function(d) {
if (nrow(filter(d, !is.na(precip), datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01"))) >
0) {
return(filter(d, datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01")))
}
}
# Function to remove NULLS in returned list
# (Credit to Hadley Wickham: http://tolstoy.newcastle.edu.au/R/e8/help/09/12/8102.html)
compact <- function(x) Filter(Negate(is.null), x)
# Filter the list
results <- compact(lapply(df.list, filterprecip))
# Check that you got a list of 2 data frames in the right date range
str(results)
Based on what you've written, it sounds like you're just interested in subsetting your list of files if data exists in the precip column for this specific date range.
> valuesExist <- function(df,start="2015-04-01 00:00:00",end="2015-11-30 23:59:59"){
+ sub.df <- df[df$datetime>=start & df$datetime<=end,]
+ if(sum(is.na(sub.df$precip))==nrow(sub.df)){return(FALSE)}else{return(TRUE)}
+ }
> lst2.bool <- sapply(lst2, valuesExist)
> lst2 <- lst2[lst2.bool]
> lst3 <- lapply(lst2, function(x) {x[x$datetime>="2015-04-01 00:00:00" & x$datetime<="2015-11-30 23:59:59",]})
> sapply(names(lst2), function (x) write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
If you want to have a dynamic start and end time, pass variables with these values into the valuesExist function and replace the string timestamps in the lst3 assignment with those same variables.
If you wanted to combine the two lapply loops into one, be my guest, but I prefer having a boolean variable when I'm subsetting.
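For what it's worth, the two steps can be combined in base R roughly as follows. This is a sketch on toy data, not a drop-in solution: it assumes datetime is stored as text in the form "YYYY-MM-DD HH:MM:SS" (as in the head() output of the question), treats times as UTC, and does not pad the data frames to equal length. The names subset_period and make_df are made up.

```r
# Return the rows inside [start, end), or NULL if precip is all NA there.
subset_period <- function(d, start, end) {
  when <- as.POSIXct(d$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  in_period <- !is.na(when) & when >= start & when < end
  sub <- d[in_period, , drop = FALSE]
  if (nrow(sub) == 0 || all(is.na(sub$precip))) NULL else sub
}

start <- as.POSIXct("2015-04-01", tz = "UTC")
end   <- as.POSIXct("2015-12-01", tz = "UTC")

# Toy stand-in for the list of files: "a" has data in the period, "b" has none.
make_df <- function(precip) {
  data.frame(
    datetime = c("2015-03-31 23:00:00", "2015-04-01 00:00:00", "2015-04-01 01:00:00"),
    precip   = precip,
    code     = "M",
    stringsAsFactors = FALSE
  )
}
lst2 <- list(a = make_df(c(1, NA, 0.5)), b = make_df(c(2, NA, NA)))

# Subset every data frame, then drop the ones that came back as NULL.
lst3 <- Filter(Negate(is.null), lapply(lst2, subset_period, start = start, end = end))
names(lst3)
```

Comparing POSIXct values rather than strings avoids surprises when the hour is written with or without a leading zero.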
