I have imported an SPSS file, which contains several date/time variables of the following class:
[1] "POSIXct" "POSIXt"
The user-defined missing value for these variables is 8888-08-08 00:00:00. How can I convert this value to NA for the set of relevant date/time variables in R?
I tried running df$datetime[df$datetime == "8888-08-08"] <- NA as well as df$datetime[df$datetime == as.Date("8888-08-08")] <- NA to no avail.
As these are POSIXct values, use the same class for the comparison value and then assign NA:
df$datetime[df$datetime == as.POSIXct("8888-08-08 00:00:00")] <- NA
data
set.seed(24)
df <- data.frame(datetime = sample(c(Sys.time(), Sys.time() + 1:5,
                                     as.POSIXct("8888-08-08 00:00:00")), 20, replace = TRUE))
I have a large POSIXct of around 70,000 elements.
resolutionDate <- c(as.POSIXct(data$Resolution.Date, format = '%b %d, %Y'))
The code above changes the values from Jun 5, 2018 3:21 PM to 2018-06-05.
However, some values are NA and I would like to replace all NA's with Sys.time(), for today's date.
I tried using the replace() function, like so:
replace(resolutionDate, if(resolutionData == "NA"), Sys.time())
But it did not work.
How can I do this?
Something like this?
# generate time vector
a <- as.POSIXct(1:70000,origin="1970-01-01")
# replace the 5th with a NA value and show first 10 elements
a[5] <- NA
a[1:10]
# replace all na values with the current system time
a[is.na(a)] <- Sys.time()
# show result
a[1:10]
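Applied to the vector from the question, the same pattern would be (a sketch; data$Resolution.Date comes from the question and isn't reproduced here):
resolutionDate <- as.POSIXct(data$Resolution.Date, format = '%b %d, %Y')
# replace the NA entries with the current system time
resolutionDate[is.na(resolutionDate)] <- Sys.time()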
> df <- read.csv("C:\\Users\\Vikas Kumar Dwivedi\\Desktop\\Yahoo.csv")
> df
Date Open High Low Close Adj.Close Volume
1 01-03-2013 null null null null null null
2 01-04-2013 1569.180054 1597.569946 1536.030029 1597.569946 1597.569946 77098000000
3 01-05-2013 1597.550049 1687.180054 1581.280029 1630.73999 1630.73999 76447250000
> df$Date <- as.Date(df$Date, format("%m/%d/%Y"))
> df <- df[order(df$Date), ]
> df<- as.xts(df[, 2], order.by = df$Date)
Error in UseMethod("as.xts") :
no applicable method for 'as.xts' applied to an object of class "factor"
I am not able to convert the data frame into an xts object. Could you please help me?
The problem is that the columns in your CSV contain both numbers and the character string "null", so read.csv() interprets them as factors. You need to do what quantmod::getSymbols.yahoo() does and set na.strings = "null". That tells read.csv() to treat the character string "null" as an NA value.
csv <- "Date,Open,High,Low,Close,Adj.Close,Volume
01-03-2013,null,null,null,null,null,null
01-04-2013,1569.180054,1597.569946,1536.030029,1597.569946,1597.569946,77098000000
01-05-2013,1597.550049,1687.180054,1581.280029,1630.73999,1630.73999,76447250000"
library(xts)  # attaches zoo as well, for read.csv.zoo() below
d <- read.csv(text = csv, na.strings = "null")
# also note that your date format was wrong, and there is no need to wrap a character
# string in `format()`
d$Date <- as.Date(d$Date, format = "%m-%d-%Y")
#d <- d[order(d$Date), ] # this isn't necessary, xts() will do it for you
(x <- xts(d[, 2], order.by = d$Date))
# [,1]
# 2013-01-03 NA
# 2013-01-04 1569.18
# 2013-01-05 1597.55
Or you can do all of this with a call to read.csv.zoo() and wrap it in as.xts() if you prefer an xts object.
(x <- as.xts(read.csv.zoo(text = csv, format = "%m-%d-%Y", na.strings = "null")))
# Open High Low Close Adj.Close Volume
# 2013-01-03 NA NA NA NA NA NA
# 2013-01-04 1569.18 1597.57 1536.03 1597.57 1597.57 77098000000
# 2013-01-05 1597.55 1687.18 1581.28 1630.74 1630.74 76447250000
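For completeness, the same na.strings fix applied directly to the file from the question might look like the sketch below (untested here, since only the three rows above are available):
library(xts)
df <- read.csv("C:\\Users\\Vikas Kumar Dwivedi\\Desktop\\Yahoo.csv", na.strings = "null")
df$Date <- as.Date(df$Date, format = "%m-%d-%Y")
x <- xts(df[, 2], order.by = df$Date)  # the Open column, as in the answer above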
I want to format several columns in a data.table/data frame using lubridate and column indexing.
Suppose there is a very large data set which has several unformatted date columns. The question is: how can I identify those columns (most likely through indexing) and then format them all at once in one script using lubridate?
library(data.table)
library(lubridate)
> dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
> dt
date1 var1 date2
1 14.01.2009 2.919293 09.01.2009
2 9/2/2005 2.390123 23/8/2005
3 24/1/2010 0.878209 17.01.2000
4 28.01.2014 2.224461 04.01.2005
dt <- setDT(dt)
I tried these:
> dmy(dt$date1,dt$date2) # this does not generate two columns
[1] "2009-01-14" "2005-02-09" "2010-01-24" "2014-01-28" "2009-01-09" "2005-08-23"
[7] "2000-01-17" "2005-01-04"
> as.data.frame(dmy(dt$date1,dt$date2))
dmy(dt$date1, dt$date2) # this does not generate two columns either
1 2009-01-14
2 2005-02-09
3 2010-01-24
4 2014-01-28
5 2009-01-09
6 2005-08-23
7 2000-01-17
8 2005-01-04
dmy(dt[,.SD, .SD =c(1,3)])
[1] NA NA
> sapply(dmy(dt$date1,dt$date2),dmy)
[1] NA NA NA NA NA NA NA NA
Warning messages:
1: All formats failed to parse. No formats found.
Any help is highly appreciated.
How about:
dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
for(i in c(1,3)){
dt[,i] <- dmy(dt[,i])
}
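If you'd rather not write out the loop, an equivalent sketch on a plain data frame (assuming the date columns sit in positions 1 and 3, as above):
library(lubridate)
date_idx <- c(1, 3)                        # positions of the date columns
dt[date_idx] <- lapply(dt[date_idx], dmy)  # convert both columns in one step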
Here's a data.table way. Suppose you have k columns named dateX:
k = 2
date_cols = paste0('date', 1:k)
for (col in date_cols) {
  set(dt, j = col, value = dmy(dt[[col]]))
}
You can avoid the loop, but apparently the loop may be faster; see this answer:
dt[,(date_cols) := lapply(.SD, dmy), .SDcols=date_cols]
EDIT
If you have arbitrary column names, assuming the data look as in the OP, you can identify the date columns from their values rather than their names:
first_vals = sapply(dt, function(x) as.character(x[1]))
date_cols = names(dt)[grep("^\\d{4}(\\.|/)", first_vals)]                # year first, e.g. "2009/01/14"
date_cols = c(date_cols, names(dt)[grep("(\\.|/)\\d{4}$", first_vals)])  # year last, e.g. "14.01.2009"
You can extend the regular expressions if the delimiters are something other than . or /, and you can combine this into a single grep, but this is clearer to me.
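For example, the single-grep version could use one alternation (equivalent to the two calls above):
date_cols = names(dt)[grep("^\\d{4}(\\.|/)|(\\.|/)\\d{4}$", first_vals)]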
Far from perfect, this is a solution that should be more general:
The only assumption here is that the date columns contain digits separated by either ., / or -. If there are other separators, they can be added. But if you have another variable that looks similar but is not a date, this won't work well.
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[,j]))) dt[,j] <- dmy(dt[,j])
This loops through the columns and checks if a date could be present using regular expressions. If so, it will convert it to a date and overwrite the column.
Using data.table:
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+', dt[[j]]))) set(dt, j = j, value = dmy(dt[[j]]))
You could also replace all with any, the idea being that if any value in a column matches, you can assume all of the values in that column are dates which can be read by dmy.
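With any, the data.frame loop would become, for instance:
for (j in seq_along(dt)) if (any(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+', dt[,j]))) dt[,j] <- dmy(dt[,j])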
I have several .csv files containing hourly data. Each file represents data from a point in space. The start and end date is different in each file.
The data can be read into R using:
lstf1<- list.files(pattern=".csv")
lst2<- lapply(lstf1,function(x) read.csv(x,header = TRUE,stringsAsFactors=FALSE,sep = ",",fill=TRUE, dec = ".",quote = "\""))
head(lst2[[800]])
datetime precip code
1 2003-12-30 00:00:00 NA M
2 2003-12-30 01:00:00 NA M
3 2003-12-30 02:00:00 NA M
4 2003-12-30 03:00:00 NA M
5 2003-12-30 04:00:00 NA M
6 2003-12-30 05:00:00 NA M
datetime is YYYY-MM-DD HH:MM:SS, precip is the data value, and code can be ignored.
For each dataframe (df) in lst2 I want to select data for the period 2015-04-01 to 2015-11-30 based on the following conditions:
1) If precip in a df contains all NAs within this period, delete it (do not select it).
2) If precip is not all NAs, select it.
The desired output (lst3) contains the sub-setted data for the period 2015-04-01 to 2015-11-30.
All data frames in lst3 should have equal length, with days and hours without precip denoted as NA.
Then I can write the files in lst3 to my directory using something like:
sapply(names(lst2),function (x) write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
The link to a sample file can be found here (~200 KB)
It's a little hard to understand exactly what you are trying to do, but this example (using dplyr, which has nice filter syntax) on the file you provided should get you close:
library(dplyr)
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
# Get the required date range and delete the NAs
df.sub <- filter(df, !is.na(precip),
datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Check if the subset has any rows left (it will be empty if it was full of NA for precip)
if (nrow(df.sub) > 0) {
df.result <- filter(df, datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Then add df.result to your list of data frames...
} # else, don't add it to your list
I think you are saying that you want to retain NAs in the data frame if there are also valid precip values; you only want to discard a frame if precip is NA for the entire period. If you just want to strip all NAs, then just use the first filter statement and you are done. You obviously don't need to use POSIXct if you've already got your dates encoded correctly another way.
EDIT: with a function wrapper so you can use lapply:
library(dplyr)
# Get some example data
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
dfnull <- df
dfnull$precip <- NA
# list of 3 input data frames to test, 2nd one has precip all NA
df.list <- list(df, dfnull, df)
# Function to do the filtering; returns a data frame to keep, or NULL
filterprecip <- function(d) {
  if (nrow(filter(d, !is.na(precip), datetime >= as.POSIXct("2015-04-01"),
                  datetime < as.POSIXct("2015-12-01"))) > 0) {
    return(filter(d, datetime >= as.POSIXct("2015-04-01"),
                  datetime < as.POSIXct("2015-12-01")))
  }
}
# Function to remove NULLS in returned list
# (Credit to Hadley Wickham: http://tolstoy.newcastle.edu.au/R/e8/help/09/12/8102.html)
compact <- function(x) Filter(Negate(is.null), x)
# Filter the list
results <- compact(lapply(df.list, filterprecip))
# Check that you got a list of 2 data frames in the right date range
str(results)
Based on what you've written, it sounds like you're just interested in subsetting your list of data frames to those that have data in the precip column for this specific date range.
valuesExist <- function(df, start = "2015-04-01 00:00:00", end = "2015-11-30 23:59:59"){
  sub.df <- df[df$datetime >= start & df$datetime <= end, ]
  if(sum(is.na(sub.df$precip)) == nrow(sub.df)){return(FALSE)}else{return(TRUE)}
}
lst2.bool <- sapply(lst2, valuesExist)
lst2 <- lst2[lst2.bool]
lst3 <- lapply(lst2, function(x) {x[x$datetime >= "2015-04-01 00:00:00" & x$datetime <= "2015-11-30 23:59:59", ]})
sapply(names(lst2), function(x) write.csv(lst3[[x]], file = paste0(names(lst2[x]), ".csv"), row.names = FALSE))
If you want a dynamic start and end time, pass variables with those values into the valuesExist function and replace the string timestamps in the lst3 assignment with the same variables.
If you wanted to combine the two apply calls into one, be my guest, but I prefer having a boolean variable when I'm subsetting.
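A sketch of that parameterised version (the same window is shown here as a placeholder; the variable names are illustrative):
start <- "2015-04-01 00:00:00"
end   <- "2015-11-30 23:59:59"
lst2.bool <- sapply(lst2, valuesExist, start = start, end = end)
lst2 <- lst2[lst2.bool]
lst3 <- lapply(lst2, function(x) {x[x$datetime >= start & x$datetime <= end, ]})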