Conditional subset of data from a list based on date in R

I have several .csv files containing hourly data. Each file represents data from a point in space. The start and end date is different in each file.
The data can be read into R using:
lstf1<- list.files(pattern=".csv")
lst2<- lapply(lstf1,function(x) read.csv(x,header = TRUE,stringsAsFactors=FALSE,sep = ",",fill=TRUE, dec = ".",quote = "\""))
head(lst2[[800]])
datetime precip code
1 2003-12-30 00:00:00 NA M
2 2003-12-30 01:00:00 NA M
3 2003-12-30 02:00:00 NA M
4 2003-12-30 03:00:00 NA M
5 2003-12-30 04:00:00 NA M
6 2003-12-30 05:00:00 NA M
datetime is in YYYY-MM-DD HH:MM:SS format, precip is the data value, and code can be ignored.
For each dataframe (df) in lst2 I want to select data for the period 2015-04-01 to 2015-11-30 based on the following conditions:
1) If precip in a df contains all NAs within this period, delete it (do not select).
2) If precip is not all NAs, select it.
The desired output (lst3) contains the subsetted data for the period 2015-04-01 to 2015-11-30.
All dataframes in lst3 should have equal length, with days and hours without precip denoted as NA.
Then I can write the files in lst3 to my directory using something like:
sapply(names(lst2),function (x) write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
The link to a sample file can be found here (~200 KB)

It's a little hard to understand exactly what you are trying to do, but this example (using dplyr, which has nice filter syntax) on the file you provided should get you close:
library(dplyr)
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
# Get the required date range and delete the NAs
df.sub <- filter(df, !is.na(precip),
datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Check if the subset has any rows left (it will be empty if it was full of NA for precip)
if (nrow(df.sub) > 0) {
df.result <- filter(df, datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Then add df.result to your list of data frames...
} # else, don't add it to your list
I think you are saying that you want to retain NAs in the data frame if there are also valid precip values--you only want to discard if there are NAs for the entire period. If you just want to strip all NAs, then just use the first filter statement and you are done. You obviously don't need to use POSIXct if you've already got your dates encoded correctly another way.
EDIT: w/ function wrapper so you can use lapply:
library(dplyr)
# Get some example data
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
dfnull <- df
dfnull$precip <- NA
# list of 3 input data frames to test, 2nd one has precip all NA
df.list <- list(df, dfnull, df)
# Function to do the filtering; returns list of data frames to keep or null
filterprecip <- function(d) {
if (nrow(filter(d, !is.na(precip), datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01"))) >
0) {
return(filter(d, datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01")))
}
}
# Function to remove NULLS in returned list
# (Credit to Hadley Wickham: http://tolstoy.newcastle.edu.au/R/e8/help/09/12/8102.html)
compact <- function(x) Filter(Negate(is.null), x)
# Filter the list
results <- compact(lapply(df.list, filterprecip))
# Check that you got a list of 2 data frames in the right date range
str(results)

Based on what you've written, it sounds like you're just interested in subsetting your list of files if data exists in the precip column for this specific date range.
valuesExist <- function(df,start="2015-04-01 00:00:00",end="2015-11-30 23:59:59"){
# keep rows inside the period (datetime is compared as a YYYY-MM-DD HH:MM:SS string)
sub.df <- df[df$datetime>=start & df$datetime<=end,]
# FALSE when the period has no rows or every precip value in it is NA
if(sum(is.na(sub.df$precip))==nrow(sub.df)){return(FALSE)}else{return(TRUE)}
}
# sapply returns a logical vector, which can be used to index the list
lst2.bool <- sapply(lst2, valuesExist)
lst2 <- lst2[lst2.bool]
lst3 <- lapply(lst2, function(x) {x[x$datetime>="2015-04-01 00:00:00" & x$datetime<="2015-11-30 23:59:59",]})
sapply(names(lst2), function (x) write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
If you want to have a dynamic start and end time, toss a variable with these values into the valuesExist function and replace the string timestamp in the lst3 assignment with that same variable.
If you wanted to combine the two lapply loops into one, be my guest, but I prefer having a boolean variable when I'm subsetting.
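Neither answer above pads the kept data frames to equal length, which the question also asks for. A minimal base-R sketch of one way to do that follows. It assumes the datetime strings are in YYYY-MM-DD HH:MM:SS form and can be treated as UTC; the helper name subset_period and the toy list are hypothetical, not from the original post.

```r
# Hypothetical helper: returns NULL if precip is all NA in the period,
# otherwise the period's rows joined onto a complete hourly grid.
subset_period <- function(df,
                          start = "2015-04-01 00:00:00",
                          end   = "2015-11-30 23:00:00",
                          tz    = "UTC") {
  # Parse timestamps explicitly so comparisons are on times, not strings
  df$datetime <- as.POSIXct(df$datetime, format = "%Y-%m-%d %H:%M:%S", tz = tz)
  grid <- data.frame(
    datetime = seq(as.POSIXct(start, tz = tz), as.POSIXct(end, tz = tz), by = "hour")
  )
  # Left join onto the grid: hours missing from df get NA for precip
  out <- merge(grid, df, by = "datetime", all.x = TRUE)
  if (all(is.na(out$precip))) return(NULL)
  out
}

# Toy input (assumption): three days of hourly stamps, one series all NA
hours <- seq(as.POSIXct("2015-03-31 00:00:00", tz = "UTC"),
             as.POSIXct("2015-04-02 23:00:00", tz = "UTC"), by = "hour")
stamp <- format(hours, "%Y-%m-%d %H:%M:%S")
lst2 <- list(
  a = data.frame(datetime = stamp, precip = 0.1, code = "M", stringsAsFactors = FALSE),
  b = data.frame(datetime = stamp, precip = NA_real_, code = "M", stringsAsFactors = FALSE)
)

# Apply to every data frame and drop the NULLs (the all-NA series)
lst3 <- Filter(Negate(is.null), lapply(lst2, subset_period))
```

Because every surviving data frame is joined onto the same hourly grid, they all end up with the same number of rows, and hours with no observation show up as NA.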

Related

How can I remove datasets from a list based on conditon using lapply?

I have list of data frames for which I have to perform the same operations for each one. I currently do this with a for loop but it is too slow. I would like to use lapply instead. The operations which I need to perform is to check how many of the values in a date column in each dataset that are missing from a vector of dates.
The data have the following structure.
# the dates which are of interest
dates <- seq(as.Date("2020-02-01"), as.Date("2020-02-09"), by = "days")
# the list of data frames
df_1 <- data.frame(seq(as.Date("2020-02-01"), as.Date("2020-02-09"), by = "days"), 1:9)
names(df_1) <- c("date", "value")
df_2 <- data.frame(seq(as.Date("2020-02-01"), as.Date("2020-02-07"), by = "days"), 1:7)
names(df_2) <- c("date", "value")
df_list <- list(df_1, df_2)
The loop which is working but is too slow looks like this.
for (i in 1:length(df_list)) {
# get range of dates in the data frame
df_date_range <- unique(df_list[[i]][["date"]])
# get range of dates that occur from the point of the beginning of the data frame
dates_reduced <- dates[dates >= min(df_date_range)]
# get the share of dates missing
missing <- mean(!(dates_reduced %in% df_date_range))
# remove data frames where the share of missing values are above 1 %
if (missing > 0.1) {
df_list[[i]] <- NULL
}
}
I tried the following lapply approach.
# write function to use in lapply
clean <- function(data, date_range) {
# get range of dates in the data frame
df_date_range <- unique(data$date)
# get range of dates that occur from the point of the beginning of the data frame
dates_reduced <- date_range[date_range >= min(df_date_range)]
# get the share of dates missing
missing <- mean(!(dates_reduced %in% df_date_range))
# remove data frames where the share of missing values are above 1 %
if (missing > 0.1) {
data <- NULL
}
}
# apply the function to the list of data frames
new_df_list <- lapply(df_list, clean, date_range = dates)
This however only yields a list of NULLs. Any help on what I'm doing wrong would be greatly appreciated.
While we don't have to explicitly specify return values in R, it is often clearer to do so, and your problem illustrates the point. (An R function implicitly returns the value of the last evaluated expression, but that is not always what one would expect.)
Consider the following function:
no_explicit_return_value <- function() {
some_non_NULL_value <- 10000
}
If we run:
test_value <- no_explicit_return_value()
test_value
We get back:
[1] 10000
Since the last expression returned 10000... so far all good!
Now, consider this function:
no_explicit_return_value <- function() {
some_non_NULL_value <- 10000
if (1000 < 4) {
x <- NULL
}
}
If we run:
test_value <- no_explicit_return_value()
test_value
We get back:
NULL
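Not because the condition of the if clause evaluated to TRUE, but because an if statement whose condition is FALSE and which has no else branch evaluates to NULL (invisibly), and here the if statement is the last expression in the function. The same happens in your clean() function: the if statement is the last expression, so the function returns the value of data <- NULL when the condition is TRUE and NULL when it is FALSE. Every call therefore yields NULL.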
The Solution:
clean <- function(data, date_range) {
# get range of dates in the data frame
df_date_range <- unique(data$date)
# get range of dates that occur from the point of the beginning of the data frame
dates_reduced <- date_range[date_range >= min(df_date_range)]
# get the share of dates missing
missing <- mean(!(dates_reduced %in% df_date_range))
# remove data frames where the share of missing values are above 1 %
if (missing > 0.1) {
data <- NULL
}
return(data)
}
# apply the function to the list of data frames
new_df_list <- lapply(df_list, clean, date_range = dates)
new_df_list
Returns:
[[1]]
date value
1 2020-02-01 1
2 2020-02-02 2
3 2020-02-03 3
4 2020-02-04 4
5 2020-02-05 5
6 2020-02-06 6
7 2020-02-07 7
8 2020-02-08 8
9 2020-02-09 9
[[2]]
NULL
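Note that the result above still has a NULL placeholder for the rejected data frame. If the goal is to remove such data frames from the list entirely, one option is to have the function return TRUE/FALSE and let Filter() do the dropping. This is a sketch rather than the asker's code; keep_df and its max_missing argument are hypothetical names.

```r
dates <- seq(as.Date("2020-02-01"), as.Date("2020-02-09"), by = "days")
df_list <- list(
  data.frame(date = dates, value = 1:9),
  data.frame(date = dates[1:7], value = 1:7)
)

# Hypothetical predicate: TRUE if the share of missing dates is acceptable
keep_df <- function(data, date_range, max_missing = 0.1) {
  dates_reduced <- date_range[date_range >= min(data$date)]
  mean(!(dates_reduced %in% data$date)) <= max_missing
}

# Filter() keeps only the elements for which the predicate is TRUE
new_df_list <- Filter(function(d) keep_df(d, dates), df_list)
length(new_df_list)
```

This also sidesteps a pitfall in the original for loop: setting df_list[[i]] <- NULL shortens the list while the loop is still indexing into it.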

R - How to format the date of several columns in a datatable/dataframe

I want to format several columns in datatable/dataframe using lubridate and column indexing.
Suppose that there is a very large data set which has several unformatted date columns. The question is how can I identify those columns (most likely through indexing) and then format them at the same time in one script using lubridate.
library(data.table)
library (lubridate)
> dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
> dt
date1 var1 date2
1 14.01.2009 2.919293 09.01.2009
2 9/2/2005 2.390123 23/8/2005
3 24/1/2010 0.878209 17.01.2000
4 28.01.2014 2.224461 04.01.2005
dt <- setDT(dt)
I tried these :
> dmy(dt$date1,dt$date2) # this does not generate two columns
[1] "2009-01-14" "2005-02-09" "2010-01-24" "2014-01-28" "2009-01-09" "2005-08-23"
[7] "2000-01-17" "2005-01-04"
> as.data.frame(dmy(dt$date1,dt$date2))
dmy(dt$date1, dt$date2) # this does not generate two columns either
1 2009-01-14
2 2005-02-09
3 2010-01-24
4 2014-01-28
5 2009-01-09
6 2005-08-23
7 2000-01-17
8 2005-01-04
dmy(dt[,.SD, .SD =c(1,3)])
[1] NA NA
> sapply(dmy(dt$date1,dt$date2),dmy)
[1] NA NA NA NA NA NA NA NA
Warning messages:
1: All formats failed to parse. No formats found.
Any help is highly appreciated.
How about:
dt <- data.frame(date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
for(i in c(1,3)){
dt[,i] <- dmy(dt[,i])
}
Here's a data.table way. Suppose you have k columns named dateX:
k = 2
date_cols = paste0('date', 1:k)
for (col in date_cols) {
set(dt, j=col, value=dmy(dt[[col]]))
}
You can avoid the loop, but apparently the loop may be faster; see this answer
dt[,(date_cols) := lapply(.SD, dmy), .SDcols=date_cols]
EDIT
If you have arbitrary column names, assuming the data looks as in the OP:
date_cols = names(dt)[grep("^\\d{4}(\\.|/)", names(dt))]
date_cols = c(date_cols, names(dt)[grep("(\\.|/)\\d{4}", names(dt))])
You can add regular expressions if there are more delimiters than . or /, and you can combine this into a single grep but this is clearer to me.
Far from perfect, this is a solution which should be more general:
The only assumption here is, that the date columns contain digits separated by either . , / or -. If there's other separators, they may be added. But if you have another variable which is similar, but not a date, this won't work well.
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[,j]))) dt[,j] <- dmy(dt[,j])
This loops through the columns and checks if a date could be present using regular expressions. If so, it will convert it to a date and overwrite the column.
Using data.table:
for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[[j]]))) set(dt,j = j, value = dmy(dt[[j]]))
You could also replace all with any with the idea that if you have any match in the column, you could assume all of the values in that column are dates which can be read by dmy.
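For comparison, the same detect-then-convert idea can be sketched in base R without lubridate. This assumes every date is day-month-year with a four-digit year and one of the separators . / or -; the helpers looks_like_dmy and parse_dmy are hypothetical names, and var1 is hard-coded here instead of using rnorm.

```r
# Hypothetical helper: TRUE if every value looks like day-month-year
looks_like_dmy <- function(x) {
  all(grepl("^\\d{1,2}[./-]\\d{1,2}[./-]\\d{4}$", as.character(x)))
}

# Hypothetical helper: unify the separators, then parse as day.month.year
parse_dmy <- function(x) {
  as.Date(gsub("[/-]", ".", as.character(x)), format = "%d.%m.%Y")
}

df <- data.frame(
  date1 = c("14.01.2009", "9/2/2005", "24/1/2010", "28.01.2014"),
  var1  = c(2.9, 2.4, 0.9, 2.2),
  date2 = c("09.01.2009", "23/8/2005", "17.01.2000", "04.01.2005"),
  stringsAsFactors = FALSE
)

# Find the date-like columns once, then convert them all in one assignment
is_date_col <- vapply(df, looks_like_dmy, logical(1))
df[is_date_col] <- lapply(df[is_date_col], parse_dmy)
str(df)
```

Anchoring the pattern with ^ and $ makes it a little less likely that a numeric column such as var1 is mistaken for a date.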

how to use loop to subset 1000 csv files and do co-integration test in R?

data looks like sample1
sample2
I have 1000 csv files, each with two columns: the first is the date and the second is the price. The files cover different time periods (some start in 1995, some in 2000). Since I need to run a co-integration test, all series in the test must cover exactly the same time points.
I need to extract the same time period from all 1000 csv files, for example starting from 1998-04-20.
It works if I do it individually using:
newdata208 <- subset(data208, Date >= "1998-04-20")
but when I try to do this in a loop, I get an error. Could anyone help me fix it?
v1 <- list()
for (i in 1:length(datasets)) {
v1[i] <- subset(datasets[i], Date >= "1998-04-20")
}
Error in subset.default(datasets[i], Date >= "1998-04-20") :
object 'Date' not found
For the original problem, adding get() inside the loop fixes it, but I do not know why. Could anyone explain?
v1 <- list()
for (i in 1:length(datasets)) {
data <- get(datasets[i])
v1[[i]] <- subset(data, Date >= "1998-04-20")}
With that change, the loop runs.
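A likely explanation, assuming datasets is a character vector of object names (as in the code further down): subset(datasets[i], ...) is handed a string, not a data frame, so there is no Date column to find. get() looks up the object that the string names. A small sketch with two toy data frames (hypothetical stand-ins for the real files) shows the same thing with mget() and lapply():

```r
# Two toy data frames standing in for the files that were read in
file001 <- data.frame(Date = as.Date(c("1998-04-19", "1998-04-20", "1998-04-21")),
                      C = c(1, 2, 3))
file002 <- data.frame(Date = as.Date(c("1998-04-20", "1998-04-21")),
                      C = c(10, 11))
datasets <- c("file001", "file002")   # names only, not the data

# mget() fetches all objects by name at once and returns a named list
data_list <- mget(datasets)

# Subset every data frame; each list element keeps its whole data frame
v1 <- lapply(data_list, function(d) subset(d, Date >= as.Date("1998-04-20")))
sapply(v1, nrow)
```

Keeping the data frames in a list from the start (for example via lapply over the file names) avoids assign() and get() altogether.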
Updating with my full code:
# set working directory
setwd("F:/xxx/folder")
dataset <- list.files(pattern = "*.CSV")
datasets <- c()
for (i in seq_along(dataset)) {
datasets[i] <- substr(dataset[i], 1, (nchar(dataset[i])-4))
}
# we only need closing price column and date column
setClass("myDate")
setAs("character", "myDate", function(from) as.Date(from, format = "%m/%d/%Y"))
# read date column and closing price column
for (i in seq_along(datasets)) {
# create one object per file, named after the file (without extension)
assign(datasets[i], read.csv(dataset[i],
colClasses = c("myDate", rep("NULL", 4),
"numeric",
rep("NULL", 2)), stringsAsFactors = FALSE, header = TRUE))
}
# extract same time period
v1 <- list()
for (i in datasets) {
data <- get(i)
# [[ ]] stores the whole data frame as one list element
v1[[i]] <- subset(data, Date >= "1998-04-20", select = C)
}
# lengths are different, file505 has short time period
index <- subset(file505, Date >= "1998-04-20")
indexs <- index$Date
# try use index to extract data
selectdate <- which(file001$Date %in% indexs)
file001CLOSE <- file001[selectdate, "C"]
#redo loop to get same period
v2 <- c()
for (i in datasets) {
data2 <- get(i)
v2[[i]] <- data2[selectdate, "C"]
}
v2table <- do.call(cbind,v2)
# right now, data is wonderful, let's begin do time series.
# test co-integration
install.packages("urca")
library("urca")
comb <- combn(1000, 2)
pairs <- c()
for (i in 1:499500) {
pairs[[i]] <- v2table[, comb[, i]]
}
# test:FF <- pairs[[88]], it is working wonderful, display all details in result
# do ca.jo
testresults <- list()
for (i in 1:499500) {
testdata <- pairs[[i]]
testresults[[i]] <- ca.jo(testdata, ecdet = "const", type = "eigen", K = 1)
}
This creates a huge list containing all the test results. I need to pull the test statistic and the critical values out of each result and find all pairs that are co-integrated.
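One way to keep this manageable is to store only the numbers needed for each pair instead of the pair data and the full test object. The sketch below shows the pattern on toy data with a stand-in statistic (a plain correlation); pair_stat is a hypothetical placeholder, and the real version would run the co-integration test there and return its test statistic.

```r
set.seed(1)
# Toy price matrix: 50 time points, 4 series (stand-in for v2table)
prices <- apply(matrix(rnorm(200), nrow = 50), 2, cumsum)
colnames(prices) <- paste0("file", 1:4)

# Hypothetical stand-in for the real test; returns a single number per pair
pair_stat <- function(x) cor(x[, 1], x[, 2])

comb <- combn(ncol(prices), 2)          # 2 x choose(4, 2) index matrix
stat <- apply(comb, 2, function(idx) pair_stat(prices[, idx]))

# One row per pair: which series were compared, and the summary value
summary_tbl <- data.frame(
  first  = colnames(prices)[comb[1, ]],
  second = colnames(prices)[comb[2, ]],
  stat   = stat
)
```

With one row per pair, finding the pairs of interest becomes an ordinary subset on summary_tbl.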
I've got a two-step solution for you:
library(data.table)
# full.names = TRUE so fread gets a usable path even outside the working directory
x <- list.files(path = "your directory", pattern = "\\.csv$", full.names = TRUE)
y <- lapply(x,fread)
data <- rbindlist(y)
This reads in all the csvs in your directory as data tables, then binds them together.
After that I'd just subset as follows:
data <- data[Date >= "your dates",]
EDIT*
I get the feeling you want to bring in your data and bind it column wise/merge. As it stands that would be incorrect as you have the same "C" value in each column. After looking at your samples, if you were to stack those row wise you'd just be getting a very long C column. I wonder whether each C column represents the same or a different variable. If it is a different variable, I've written up some code which would truncate your data appropriately.
I've used the first 6 rows of your sample data 1 and sample data 2
files <- list.files(path = dir, pattern = ".csv")
data_mock <- lapply(files,fread)
data_mock[[1]][, Date := data_mock[[2]][,Date]]
#I change the dates here because your sample dates are too far apart to test for date truncation to work
for (i in 1:length(data_mock)){
data_mock[[i]]$Date <- as.Date(data_mock[[i]]$Date, format = "%Y-%m-%d")
}
for (i in 1:length(data_mock)){
setnames(data_mock[[i]], old = names(data_mock[[i]]), new = c("Date", paste0("C",i)))
}
#I change the variable names here because I'm not sure whether you want to stack Cs ontop of one another or whether each C is a different variable.
#I've assumed each C is different.
start_finish <- function(data, start, finish){
data[Date >= start & Date <= finish,]
}
results <- list()
for (i in 1:length(data_mock)){
results[[i]] <- start_finish(data_mock[[i]], "1987-01-15", "1987-01-17")
}
This is what the original data looked like:
[[1]]
Date C
1: 1998-04-20 12.667
2: 1998-04-21 12.587
3: 1998-04-22 12.625
4: 1998-04-23 12.601
5: 1998-04-24 12.584
6: 1998-04-25 12.624
[[2]]
Date C
1: 1987-01-14 95.89
2: 1987-01-15 97.72
3: 1987-01-16 98.10
4: 1987-01-17 97.07
5: 1987-01-18 98.86
6: 1987-01-19 99.95
This is what it looks like once you run a loop over the "start_finish" function I wrote:
[[1]]
Date C
1: 1987-01-15 12.587
2: 1987-01-16 12.625
3: 1987-01-17 12.601
[[2]]
Date C2
1: 1987-01-15 97.72
2: 1987-01-16 98.10
3: 1987-01-17 97.07
I believe you wanted your data to start and end at the same time? You merely need to alter the "start" and "finish" dates in the pretty simple function I wrote.
Is that what you're after?
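If the aim is one table in which every series covers exactly the same dates, an inner join on Date across the list is another option. This is a base-R sketch on made-up numbers; the series names and values are assumptions for illustration only.

```r
a <- data.frame(Date = as.Date("1998-04-18") + 0:4, C = c(12.6, 12.5, 12.7, 12.6, 12.6))
b <- data.frame(Date = as.Date("1998-04-20") + 0:4, C = c(95.9, 97.7, 98.1, 97.1, 98.9))
series <- list(file001 = a, file002 = b)

# Give each price column the name of its series so the columns stay distinct
renamed <- Map(function(d, nm) setNames(d, c("Date", nm)), series, names(series))

# Inner join on Date across the whole list: only shared dates survive
aligned <- Reduce(function(x, y) merge(x, y, by = "Date"), renamed)

# Optional lower bound on the common period
aligned <- aligned[aligned$Date >= as.Date("1998-04-20"), ]
```

The result has one Date column plus one price column per file, which is the shape a pairwise test over columns needs.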

Matrix Error- data length doesn't match, even though it should

I am trying to write a function that will return the number of business days between two dates (not just excluding weekends, but holidays as well). I'm approaching it by building a matrix with rownames corresponding to days of the week with the elements of the matrix either a 1 or a 0: a 0 if it is a holiday or the extra couple elements to fill the matrix.
I've checked the length of each vector in the code. It checks out. I've run the code manually in the console, one line at a time, and it works perfectly. BUT if I run the function, it displays this error message:
Warning message:
In matrix(da, nrow = 7, dimnames = list(n)) :
data length [132] is not a sub-multiple or multiple of the number of rows [7]
I'm using R 3.1.1, mostly working in Rstudio. The cal mentioned in the code can be found here.
Here's the code:
dte <- function(date) {
#Input a date and it tells you the number of business (not including holidays)
#days until that date
#Take the target date and turn it into a date
d <- strptime(date,format="%Y-%m-%d")
#Obtain current date
c <- strptime(Sys.Date(), format="%Y-%m-%d")
#Calculate the difference in days
diff <- d-c
#Extract the actual number difference
f <- diff[[1]]
#Get the list of holidays
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- Sys.Date()+0:f
#Find which days in the range are holidays
if(any(b %in% cal)) {
bt <- b[b %in% cal]
#Return the position of the holidays within the range
bn <- which(b %in% bt)
} else {
#Set holidays present to 0
bn <- 0
}
#Build a vector of the weekdays starting with the current weekday
n <- weekdays(Sys.Date()+0:6)
#Create a vector as long as the difference with a 1 in each place
v <- rep(1,f)
#Set each holiday to 0
v[bn] <- v[bn]-1
#Extra steps to make sure that the matrix is full but only with 1s where we want them.
g <- ((trunc(f/7)+1)*7)-f
u <- rep(0,g)
da <- c(v,u)
#Create the matrix
m <- matrix(da,nrow=7,dimnames=list(n))
#Extract all of the workweeks and add them up
ww <- m[c("Monday","Tuesday","Wednesday","Thursday","Friday"),]
r <- sum(ww)
r
}
The problem is that your strptime calls return POSIXt objects, which have time components and are therefore affected by daylight saving time. Observe
(d1<-strptime("2014-08-24",format="%Y-%m-%d"))
# [1] "2014-08-24 EDT"
(d2<-strptime("2014-12-31",format="%Y-%m-%d"))
# [1] "2014-12-31 EST"
d2-d1
# Time difference of 129.0417 days
So there is not a whole number of days between the two values, which causes complications later in your code. If you use as.Date rather than strptime, you won't have this problem because Date objects don't care about time.
But I'm not really sure why you're even bothering with the matrix at all. I think a simpler implementation would look like
dte <- function(date) {
d <- as.Date(date,format="%Y-%m-%d")
c <- Sys.Date()
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- seq(c, d, by="1 day")
return(sum(as.POSIXlt(b)$wday %in% 1:5 & (!b %in% cal)))
}
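To try the idea without the cal file or today's date, here is a self-contained variant in which both dates and the holidays are passed in. The name business_days and its arguments are hypothetical; like the version above, it counts both end points.

```r
# Hypothetical variant of dte(): dates and holidays are passed in explicitly
business_days <- function(from, to, holidays = as.Date(character(0))) {
  days <- seq(as.Date(from), as.Date(to), by = "1 day")
  # %u gives the ISO weekday number (1 = Monday ... 7 = Sunday), independent of locale
  is_weekday <- as.integer(format(days, "%u")) <= 5
  sum(is_weekday & !(days %in% holidays))
}

# Monday to Sunday of Christmas week 2014, with 25 December as a holiday
business_days("2014-12-22", "2014-12-28", holidays = as.Date("2014-12-25"))
```

Because everything stays in Date objects, daylight saving time never enters the calculation.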

Conditional subsetting of data frame based on HH:MM:SS formatted column

So I have a large df with a column called "session" that is in the format
HH:MM:SS (e.g. 0:35:24 for 35 mins and 24 secs).
I want to create a subset of the df based on a condition like > 2 mins or < 90 mins on the "session" column.
I tried to first convert the column format into Date:
df$session <- as.Date(df$session, "%h/%m/%s")
I was going to then use the subset() to create my conditional subset but the above code generates a column of NAs.
subset.morethan2min <-subset(df, CONDITION)
where CONDITION is df$session >2 mins?
How should I manipulate the "session" column in order to be able to subset on a condition as described?
Sorry very new to R so welcome any suggestions.
Thanks!
UPDATE:
I converted the session column to POSIXct, then used the minute() function from the lubridate package to get numerical values for the minute component. Not a neat solution, but it seems to work for my needs right now. I would still welcome a neater solution though.
df$sessionPOSIX <- as.POSIXct(strptime(df$session, "%H:%M:%S"))
df$minute <- minute(df$sessionPOSIX)
subset.morethan2min <- subset(df, minute > 2)
A date is not the same as a period. The easiest way to handle periods is to use the lubridate package:
library(lubridate)
df$session <- hms(df$session)
df.morethan2min <- subset(df, df$session > period(2, 'minute'))
hms() converts your duration stamps into period objects, and period() creates a period object of the specified length for comparison.
As an aside, there are numerous other ways to subset data frames, including the [ operator and functions like filter() in the dplyr package, but that's beyond what you need for your current purposes.
Probably simpler ways to do this, but here's one solution:
set.seed(1234)
tDF <- data.frame(
Val = rnorm(100),
Session = paste0(
sample(0:23,100,replace=TRUE),
":",
sample(0:59,100,replace=TRUE),
":",
sample(0:59,100,replace=TRUE),
sep="",collapse=NULL),
stringsAsFactors=FALSE
)
##
toSec <- function(hms){
Long <- as.POSIXct(
paste0(
"2013-01-01 ",
hms),
format="%Y-%m-%d %H:%M:%S",
tz="America/New_York")
3600*as.numeric(substr(Long,12,13))+
60*as.numeric(substr(Long,15,16))+
as.numeric(substr(Long,18,19))
}
##
tDF <- cbind(
tDF,
Seconds = toSec(tDF$Session),
Minutes = toSec(tDF$Session)/60
)
##
> head(tDF)
Val Session Seconds Minutes
1 -1.2070657 15:21:41 55301 921.6833
2 0.2774292 12:58:24 46704 778.4000
3 1.0844412 7:32:45 27165 452.7500
4 -2.3456977 18:26:46 66406 1106.7667
5 0.4291247 12:56:34 46594 776.5667
6 0.5060559 17:27:11 62831 1047.1833
Then you can just subset your data easily by doing subset(tDF, Minutes > some_number).
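A shorter base-R route to the same Minutes column is to split each string on ":" and weight the pieces. This is a sketch that assumes every value has exactly three parts (hours, minutes, seconds); hms_to_seconds and the toy data frame are hypothetical.

```r
# Hypothetical helper: "H:MM:SS" strings to seconds, using only base R
hms_to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

df <- data.frame(id = 1:3,
                 session = c("0:35:24", "0:01:30", "1:45:00"),
                 stringsAsFactors = FALSE)
df$minutes <- hms_to_seconds(df$session) / 60

# Sessions longer than 2 minutes and shorter than 90 minutes
kept <- subset(df, minutes > 2 & minutes < 90)
```

Unlike taking only the minute component, this counts the hours too, so a session of 1:45:00 is treated as 105 minutes.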

Resources