Extract and save as csv files based on specified timeframes in R

Below is an example of my dataset, saved as a csv file. Is it possible to extract the records and save them as several csv files based on specified timeframes?
For example:
The specified timeframes are:
daytime: 07:30 (same date) to 20:30 (same date)
nighttime: 21:30 (same date) to 06:30 (next date).
After the extraction, the datasets are saved as csv files with this filename format:
daytime: "date"-day
nighttime: "date"-night
"date" is the date from the timestamp.
Thanks for your help.
timestamp c3.1 c3.2 c3.3 c3.4 c3.5 c3.6 c3.7 c3.8 c3.9 c3.10 c3.11 c3.12
8/13/15 15:43 1979.84 1939.6 2005.21 1970 1955.55 1959.82 1989 2001.12 2004.38 1955.75 1958.75 1986.53
8/13/15 15:44 1979.57 1939.64 2005.14 1970.4 1956.43 1958.56 1989.7 2000.78 2004.53 1954.9 1959.76 1986.18
8/13/15 15:45 1979.32 1938.92 2004.52 1970.21 1955.75 1960.12 1989.07 2001.47 2003.7 1955.32 1958.94 1985.79
8/13/15 15:46 1979.33 1939.7 2004.66 1971.25 1955.89 1958.27 1989.24 2000.86 2003.92 1955.29 1959.25 1985.49
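For reference, a minimal sketch of reading such a file into dat before running the code below (the filename "mydata.csv" is a placeholder; adjust sep to match how the file is actually delimited):
dat <- read.csv("mydata.csv", stringsAsFactors = FALSE)
str(dat) ## the timestamp column should come in as character, e.g. "8/13/15 15:43"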

Assuming that dat is your data:
## The date-time format in the data set
format <- "%m/%d/%y %H:%M"
## Convert date-time to POSIXct
timestamp <- as.POSIXct(dat$timestamp, format = format)
## First and last dates in the data
first <- as.Date(min(timestamp))
last <- as.Date(max(timestamp))
## The start of day and night timeframes
start.day <- paste(first, "07:30")
start.night <- paste(first - 1, "20:30") ## first night timeframe starts the day before
end <- paste(last + 1, "20:30")
## The breakpoints, assuming that day is 7:30-20:30 and night 20:31-7:29 (i.e. no missing records)
breaks <- sort.POSIXlt(c(seq.POSIXt(as.POSIXct(start.day), as.POSIXct(end), by = "day"),
                         seq.POSIXt(as.POSIXct(start.night), as.POSIXct(end), by = "day")))
## The corresponding labels (format() keeps the break dates in local time)
labels <- head(paste0(format(breaks, "%Y-%m-%d"), c("-night", "-day")), -1)
## Add column with timeframe
dat$timeframe <- cut.POSIXt(timestamp, breaks = breaks, labels = labels)
## Save csv files
for (x in levels(dat$timeframe)) {
  subset <- dat[dat$timeframe == x, ]
  subset$timeframe <- NULL ## Remove the timeframe column
  if (nrow(subset) > 0) write.csv(subset, file = paste0(x, ".csv"))
}

Related

Converting NetCDF file dates to dates in R

I have a netcdf file with a timeseries and the time variable has the following typical metadata:
double time(time) ;
time:standard_name = "time" ;
time:bounds = "time_bnds" ;
time:units = "days since 1979-1-1 00:00:00" ;
time:calendar = "standard" ;
time:axis = "T" ;
Inside R I want to convert the time into an R date object. I achieve this at the moment in a hardwired way by reading the units attribute and splitting the string and using the third entry as my origin (thus assuming the spacing is "days" and the time is 00:00 etc):
require("ncdf4")
f1<-nc_open("file.nc")
time<-ncvar_get(f1,"time")
tunits<-ncatt_get(f1,"time",attname="units")
tustr<-strsplit(tunits$value, " ")
dates<-as.Date(time,origin=unlist(tustr)[3])
This hardwired solution works for my specific example, but I was hoping that there might be a package in R that handles the UNIDATA netCDF date conventions for time units and converts them safely to an R date object?
I have just discovered (two years after posting the question!) that there is a package called ncdf.tools which has the function:
convertDateNcdf2R
which
converts a time vector from a netCDF file or a vector of Julian days
(or seconds, minutes, hours) since a specified origin into a POSIXct R
vector.
Usage:
convertDateNcdf2R(time.source, units = "days", origin = as.POSIXct("1800-01-01",
tz = "UTC"), time.format = c("%Y-%m-%d", "%Y-%m-%d %H:%M:%S",
"%Y-%m-%d %H:%M", "%Y-%m-%d %Z %H:%M", "%Y-%m-%d %Z %H:%M:%S"))
Arguments:
time.source
numeric vector or netCDF connection: either a number of time units since origin or a netCDF file connection. In the latter case, the time vector is extracted from the netCDF file. This file, and especially the time variable, has to follow the CF netCDF conventions.
units
character string: units of the time source. If the source is a netCDF file, this value is ignored and is read from that file.
origin
POSIXct object: Origin or day/hour zero of the time source. If the source is a netCDF file, this value is ignored and is read from that file.
Thus it is enough to simply pass the netCDF connection as the first argument and the function handles the rest. Caveat: this will only work if the netCDF file follows CF conventions (e.g. it will fail if your units are "years since" instead of "seconds since" or "days since").
More details on the function are available here:
https://rdrr.io/cran/ncdf.tools/man/convertDateNcdf2R.html
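Tying this back to the question's own snippet, a minimal sketch using the documented numeric-vector form (it reuses time and tustr from the question's code above; treating the origin as UTC is an assumption):
library(ncdf.tools)
## time is the numeric vector, tustr the split units string from the question
dates <- convertDateNcdf2R(time, units = unlist(tustr)[1],
                           origin = as.POSIXct(unlist(tustr)[3], tz = "UTC"))
head(dates) ## POSIXct vector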
Not that I know of. I have this handy function using lubridate, which is basically identical to yours.
getNcTime <- function(nc) {
    require(ncdf4)
    require(lubridate)
    ncdims <- names(nc$dim) #get netcdf dimensions
    timevar <- ncdims[which(ncdims %in% c("time", "Time", "datetime", "Datetime", "date", "Date"))[1]] #find time variable
    if (is.na(timevar)) stop("ERROR! Could not identify the correct time variable")
    times <- ncvar_get(nc, timevar)
    timeatt <- ncatt_get(nc, timevar) #get attributes
    timedef <- strsplit(timeatt$units, " ")[[1]]
    timeunit <- timedef[1]
    tz <- timedef[5]
    timestart <- suppressWarnings(as.numeric(strsplit(timedef[4], ":")[[1]]))
    if (length(timestart) != 3 || anyNA(timestart) || timestart[1] > 24 ||
        timestart[2] > 60 || timestart[3] > 60 || any(timestart < 0)) {
        warning(paste(timedef[4], "is not a valid start time. Assuming 00:00:00"))
        timedef[4] <- "00:00:00"
    }
    if (!tz %in% OlsonNames()) {
        warning(paste(tz, "is not a valid timezone. Assuming UTC"))
        tz <- "UTC"
    }
    timestart <- ymd_hms(paste(timedef[3], timedef[4]), tz = tz)
    f <- switch(tolower(timeunit), #find the lubridate period function matching the unit
        seconds = seconds, second = seconds, sec = seconds,
        minutes = minutes, minute = minutes, min = minutes,
        hours = hours, hour = hours, h = hours,
        days = days, day = days, d = days,
        months = months, month = months, m = months,
        years = years, year = years, yr = years,
        NA)
    suppressWarnings(if (is.na(f)) stop("Could not understand the time unit format"))
    timestart + f(times)
}
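Usage would then look something like this ("file.nc" being a placeholder for a CF-style netCDF file):
library(ncdf4)
nc <- nc_open("file.nc")
getNcTime(nc) ## POSIXct vector built from the file's time units
nc_close(nc)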
EDIT: One might also want to take a look at ncdf4.helpers::nc.get.time.series
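A hedged sketch of that route (assuming the file has a single, CF-compliant time dimension; the result is a PCICt vector, which respects the file's calendar attribute):
library(ncdf4)
library(ncdf4.helpers)
nc <- nc_open("file.nc")
times <- nc.get.time.series(nc)
nc_close(nc)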
EDIT2: note that the newly proposed and currently in development awesome stars package will handle dates automatically; see the first blog post for an example.
EDIT3: another way is to use the units package directly, which is what stars uses. One could do something like this (it still does not handle the calendar correctly; I'm not sure units can):
getNcTime <- function(nc) { ##NEW VERSION, with the units package
    require(units)
    require(ncdf4)
    options(warn = 1) #show warnings by default
    if (is.character(nc)) nc <- nc_open(nc)
    ncdims <- names(nc$dim) #get netcdf dimensions
    timevar <- ncdims[which(ncdims %in% c("time", "Time", "datetime", "Datetime", "date", "Date"))] #find (first) time variable
    if (length(timevar) > 1) {
        warning(paste("Found more than one time var. Using the first:", timevar[1]))
        timevar <- timevar[1]
    }
    if (length(timevar) != 1) stop("ERROR! Could not identify the correct time variable")
    times <- ncvar_get(nc, timevar) #get time data
    timeatt <- ncatt_get(nc, timevar) #get attributes
    timeunit <- timeatt$units
    units(times) <- make_unit(timeunit)
    as.POSIXct(times)
}
I couldn't get @AF7's function to work with my files, so I wrote my own. The function below creates a POSIXct vector of dates, for which the start date, time interval, unit and length are read from the nc file. It works with nc files of many (but probably not all) shapes and forms.
ncdate <- function(nc) {
    ncdims <- names(nc$dim) # Extract dimension names
    timevar <- ncdims[which(ncdims %in% c("time", "Time", "datetime", "Datetime",
                                          "date", "Date"))[1]] # Pick the time dimension
    ntstep <- nc$dim[[timevar]]$len
    tm <- ncvar_get(nc, timevar) # Extract the timestep count
    tunits <- ncatt_get(nc, timevar, "units") # Extract the long name of units
    tspace <- tm[2] - tm[1] # Calculate time period between two timesteps, for the "by" argument
    tstr <- strsplit(tunits$value, " ") # Extract string components of the time unit
    a <- unlist(tstr[1]) # Isolate the unit, i.e. seconds, hours, days etc.
    uname <- a[which(a %in% c("seconds", "hours", "days"))[1]] # Check unit
    startd <- as.POSIXct(gsub(paste(uname, 'since '), '', tunits$value),
                         format = "%Y-%m-%d %H:%M:%S") ## Extract the start / origin date
    tmulti <- 3600 # Declare hourly multiplier for date
    if (uname == "days") tmulti <- 86400 # Declare daily multiplier for date
    ## Rename "seconds" to "secs" for the "by" argument and change the multiplier.
    if (uname == "seconds") {
        uname <- "secs"
        tmulti <- 1
    }
    byt <- paste(tspace, uname) # Define the "by" argument
    if (byt == "0.0416666679084301 days") { ## If the unit is "days" but the "by" interval is in hours
        byt <- "1 hour" ## R won't understand "by < 1" so change by and unit to hour.
        uname <- "hours"
    }
    datev <- seq(from = as.POSIXct(startd + tm[1] * tmulti), by = byt, length.out = ntstep)
    datev # Return the date vector
}
Edit
To address the flaw highlighted by @AF7's comment that the above code would only work for regularly spaced files, datev could be calculated as
datev <- as.POSIXct(tm * tmulti, origin = startd)

Read in times from Excel in R

I have data saved in Excel that includes time data.
When reading it in with read.xlsx in R, it adds "1899-12-30" to the time column, presumably because it tries to attach a date to a time value that has none.
library(xlsx)
times<-read.xlsx("times.xlsx", sheetName = "Sheet1")
times
Time
1 1899-12-30 20:13:24
2 1899-12-30 08:13:54
3 1899-12-30 08:14:24
4 1899-12-30 08:14:54
5 1899-12-30 08:15:24
I tried
times<-read.xlsx("times.xlsx", sheetName = "Sheet1", colClasses('POSIXct'))
and
times<-read.xlsx("times.xlsx", sheetName = "Sheet1", colClasses('POSIXct(format='%H:%M:%S')'))
but the first doesn't do anything and the second gives me an error.
Note that read.xlsx() recognizes TIME as %H:%M:%S, and converts it into the dummy POSIXct/POSIXt object, i.e. 1899-12-31 08:00:00 and 1899-12-31 20:00:00
#use readxl
library(readxl)
df <- read_excel('test.xlsx')
OR use format (this needs dplyr loaded for %>% and mutate())
library(dplyr)
read.xlsx("myfile.xlsx") %>%
  mutate(
    TIME = format(TIME, "%I:%M %p")
  )
OR, after reading df, convert it into a time using
as.POSIXct(df$Time, format="%H:%M:%S", tz="CET")
EDIT:
I don't have data to reproduce the problem you are facing, so I have made some according to that date format:
df = data.frame(Time = c("1899-12-30 20:13:24","1899-12-30 08:13:54","1899-12-30 08:14:24","1899-12-30 08:14:54","1899-12-30 08:15:24"))
df <- as.POSIXct(df$Time, format = "%Y-%m-%d %H:%M:%S") # parse the full timestamp into a POSIXct object
# use strftime() to pull out the clock time, then chron::times() to create a time-of-day object
library(chron)
time <- times(strftime(df, format = "%H:%M:%S"))
This method should work; hopefully it shows that there are many ways to achieve this.
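Putting the pieces together, an end-to-end sketch under the same assumptions (a file "times.xlsx" whose Time column arrives as the dummy POSIXct shown above; readxl stores these as UTC, so format in UTC to avoid a timezone shift):
library(readxl)
library(chron)
df <- read_excel("times.xlsx")                            ## Time comes in as POSIXct on 1899-12-30/31
df$Time <- times(format(df$Time, "%H:%M:%S", tz = "UTC")) ## keep only the clock time as a chron "times" object
df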

Changing Dates in R from webscraper but not able to convert

I am trying to complete a problem that pulls from two data sets that need to be combined into one data set. To get to this point, I need to rbind both data sets by the year-month information. Unfortunately, the first data set needs to be tallied by year-month info, and I can't seem to figure out how to change the date so I can have month-year info rather than month-day-year info.
This is data on avalanches, and I need to write code tallying the number of avalanches each month for the snow season, defined as Dec-Mar. How do I do that?
I keep trying to convert the format of the date to month-year, but after I change it with
as.Date(avalancheslc$Date, format="%y-%m")
all the values for Date turn to NAs. Help!
# write the webscraper
library(XML)
library(RCurl)
avalanche<-data.frame()
avalanche.url<-"https://utahavalanchecenter.org/observations?page="
all.pages<-0:202
for (page in all.pages) {
  this.url <- paste(avalanche.url, page, sep=" ")
  this.webpage <- htmlParse(getURL(this.url))
  thispage.avalanche <- readHTMLTable(this.webpage, which=1, header=T)
  avalanche <- rbind(avalanche, thispage.avalanche)
}
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
str(avalancheslc)
avalancheslc$monthyear<-format(as.Date(avalancheslc$Date),"%Y-%m")
# How can I tally the number of avalanches?
The final output of my dataset should be something like:
date avalanches
2000-1 18
2000-2 4
2000-3 10
2000-12 12
2001-1 52
This should work (I tried it on only 1 page, not all 203). Note the use of the option stringsAsFactors = F in the readHTMLTable function, and the need to add names because 1 column does not automatically get one.
library(XML)
library(RCurl)
library(dplyr)
library(lubridate) # for year() and month() below
avalanche <- data.frame()
avalanche.url <- "https://utahavalanchecenter.org/observations?page="
all.pages <- 0:202
for (page in all.pages) {
  this.url <- paste(avalanche.url, page, sep=" ")
  this.webpage <- htmlParse(getURL(this.url))
  thispage.avalanche <- readHTMLTable(this.webpage, which = 1, header = T,
                                      stringsAsFactors = F)
  names(thispage.avalanche) <- c('Date','Region','Location','Observer')
  avalanche <- rbind(avalanche, thispage.avalanche)
}
avalancheslc <- subset(avalanche, Region == "Salt Lake")
str(avalancheslc)
avalancheslc <- mutate(avalancheslc, Date = as.Date(Date, format = "%m/%d/%Y"),
                       monthyear = paste(year(Date), month(Date), sep = "-"))
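To get the tally the question asks for, a short sketch continuing from avalancheslc above (assuming one row per avalanche observation; the filter keeps only the Dec-Mar snow season):
library(dplyr)
library(lubridate)
snow.season <- avalancheslc %>%
  filter(month(Date) %in% c(12, 1, 2, 3)) %>% # keep Dec-Mar only
  group_by(monthyear) %>%
  summarise(avalanches = n())                 # count observations per year-month
snow.season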

Subsetting times out of time series in R

I downloaded stock market data from Yahoo (code below) - for context, at first I tried with getSymbols(^DJI) but I got error messages possibly related to Yahoo... different issue.
The point is that once downloaded, and imported into R, I massaged it into a format close enough to a time series to be able to run chartSeries(DJI):
require(RCurl)
require(foreign)
require(quantmod) # for chartSeries()
x <- getURL("https://raw.githubusercontent.com/RInterested/datasets/gh-pages/%5EDJI.csv")
DJI <- read.csv(text = x, sep =",")
DJI$Date <- as.Date(DJI$Date, format = "%m/%d/%Y") # Formatting Date as.Date
rownames(DJI) <- DJI$Date # Assigning Date to row names
DJI$Date <- NULL # Removing the Date column
chartSeries(DJI, type="auto", theme=chartTheme('white'))
even if the dataset is not really a time series:
> is.ts(DJI)
[1] FALSE
The problem comes about when I try to find out the date of, for instance, the minimum closing value of the Dow. I can do something like
> DJI[DJI$Close == min(DJI$Close),]
Open High Low Close Adj.Close Volume
1985-05-01 1257.18 1262.81 1239.07 1242.05 1242.05 10050000
yielding the entire row, including the row name (1985-05-01), which is the only part I want. However, if I insist on just getting the actual date, I have to juggle a second dataset containing the dates in one of the columns:
require(RCurl)
require(foreign)
x <- getURL("https://raw.githubusercontent.com/RInterested/datasets/gh-pages/%5EDJI.csv")
DJI <- read.csv(text = x, sep =",")
DJI$Date <- as.Date(DJI$Date, format = "%m/%d/%Y") # Formatting Date as.Date
rownames(DJI) <- DJI$Date # Assigning Date to row names
DJI.raw <- DJI # Second dataset for future subsetting
DJI$Date <- NULL # Removing the Date column
which does allow me to run
> DJI.raw$Date[DJI.raw$Close == min(DJI.raw$Close)]
[1] "1985-05-01"
Further, I don't think that turning the dataset into an .xts file would help.
I'm not clear on what you want, but it sounds like you just want the date? You mention xts is not an option (although it would have worked):
library(xts)
time(as.xts(DJI))[which.min(DJI$Close)] # POSIXct format
# [1] "1985-05-01 EDT"
Otherwise a simple rownames + which.min would get the date for you?
as.Date(rownames(DJI)[which.min(DJI$Close)]) # Date format
# [1] "1985-05-01"

Subsetting data table in R by date

I need to subset my data by a date range, below is the code.
I read in two .csv files (data2010, data2), change the date format to exclude the timestamp, rename the headers so they are the same for both files, then merge them (data2011).
The files seem to actually merge but when I subset by the date range, no observations are created.
However, the dates sort like 01/01/10 01/01/11 01/02/10 01/02/11, i.e. same month/same day/different year pairings.
data2010 <- read.csv(file="2010final.csv")
data2 <- read.csv(file="2011final.csv")
#change format of timestamp to date with mm/dd/yyyy for 2011
data2$newdate <-strptime(as.character(data2$Date), "%m/%d/%y")
data2$Date <- format(data2$newdate, "%m/%d/%y")
data2$newdate <- NULL
#rename and format 2010
names(data2010) <- c("Region", "District", "Age", "Gender", "Marital Status", "Date", "Reason")
data2010$newdate <-strptime(as.character(data2010$Date), "%m/%d/%y %H")
data2010$Date <- format(data2010$newdate, "%m/%d/%y")
data2010$newdate <- NULL
#merge
data2011 <- rbind(data2010, data2)
summary(data2011)
str(data2011)
#I see from the above commands that the files have merged
jan6Before <- subset(data2011, Date >= "12/22/10" & Date <= "01/06/11")
summary(jan6Before)
str(jan6Before)
#But this does not produce any observations
I suspect it's because your Date variable is a character, not a Date, being compared to another character constant, i.e. "12/22/10". Character comparison is alphabetical, so a range like "12/22/10" to "01/06/11" matches nothing.
I suggest you have a look at the lubridate package. You can then easily convert the character dates (month/day/year in this case) before comparing, e.g. mdy(Date) >= mdy("12/22/10").
Merge on your variable newdate, and use that for subsetting also.
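A minimal sketch of that suggestion, applied to the merged data2011 from the question (the parsed column name, DateParsed, is just illustrative):
library(lubridate)
data2011$DateParsed <- mdy(data2011$Date) # real Date objects instead of character strings
jan6Before <- subset(data2011, DateParsed >= mdy("12/22/10") & DateParsed <= mdy("01/06/11"))
str(jan6Before)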
