Functions and plots with dates in R

First time caller, longtime listener.
I am trying to solve two problems:
1. My function does not perform as anticipated.
2. I cannot figure out how to make a plot from date data.
I have tried to approach my function problem from multiple angles but I am only making things harder than they need to be. The issue that I cannot overcome is that the date sequence I have created for the date range of the data set is not equal to the length of the data set columns.
For the y-axis of my plot, I want:
f(dates[x])= number of data set entries on or before dates[x],
Where dates[x] refers to a given date in the data set date range
I'm sure there is an easy solution but I cannot figure it out.
Note: I used to have a basic understanding of R but I am relearning after a long break, so please use the simplest terms possible.
# import data
data <- read.csv("https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv")
#
# coerce date column into date class
data$date <- as.POSIXlt.date(data$date)
#
# sequence of dates for date range of data set
dates <- seq(data$date[1], data$date[length(data$date)], by = "days")
#
# numeric vector for the number of days in the date range of data set
xx <- c(1:length(dates))
#
# function meant to return a numeric vector of the count of entries in the data set that occurred on or before a given date
# within the data set date range.
fun <- function(x){
sum(dates[x]<=data$date)
}
# This function returns a single value and not a vector as I'd expected.
# This plot is the objective. x = number of days in data set date range, y = number of entries in data set on or before date(x)
plot(xx,y=fun(xx))

Working with dates is a loaded topic. It is extremely powerful, but it pays to be careful. Here is my take:
data <- read.csv(paste0("https://raw.githubusercontent.com/washingtonpost/", # wrapped
"data-police-shootings/master/fatal-police-shootings-data.csv"))
library(anytime) ## helper package
data$date <- anydate(data$date) ## helper function not requiring format
Now we have a date type and you can do
data[ data$date <= anydate(20150110), ]
If you use the date on the x-axis it all works out correctly too.
That said, I tend to do all this inside of data.table objects, but that is more learning for you. Another day :) Keep it in mind: the grouping, aggregation, and filtering are absolutely worth it, and it is the fastest tool around.
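For the cumulative count described in the question, note that sum() collapses whatever it is given into one number, which is why fun(xx) returned a single value. One possible approach is to evaluate the comparison once per day. The sketch below uses a small made-up vector instead of the downloaded data; event_dates, days and counts are illustrative names, not part of the original code.

```r
# Hypothetical toy data standing in for the data set's date column
event_dates <- as.Date(c("2015-01-02", "2015-01-02", "2015-01-04", "2015-01-07"))

# Every calendar day in the observed range
days <- seq(min(event_dates), max(event_dates), by = "day")

# For each day, count the entries on or before that day.
# vapply calls the function once per day, so the result has one count per day.
counts <- vapply(days, function(d) sum(event_dates <= d), integer(1))

# Dates on the x-axis, cumulative count on the y-axis
plot(days, counts, type = "s",
     xlab = "Date", ylab = "Entries on or before date")
```

Because days and counts have the same length by construction, the length mismatch from the question cannot occur here.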

Related

How to filter via a logical expression that filters via a variable [duplicate]

I have a question about the use of a logical expression in combination with a variable.
Imagine that I have a data frame with multiple rows that each contain a date saved as 2021-09-25T06:04:35:689Z.
I also have a variable that contains the date of yesterday as '2021-09-24' - yesterday <- Sys.Date()-1.
How do I filter the rows in my data frame based on the date of yesterday which is stored in the variable 'yesterday'?
To solve my problem, I have looked at multiple posts, for example:
Using grep to help subset a data frame
I am well aware that this question might be a duplicate. However, the existing questions do not provide me with the help that I need. I hope that one of you can help me.
As an initial matter, it looks like you have a vector instead of a data frame (only one column). If you really do have a data frame and only ran str() on one column, the very similar technique at the end will work for you.
The first thing to know is that your dates are stored as character strings, while your yesterday object is in the Date format. Comparing objects of different types is unreliable, so you should convert at least one of the two so that both have the same type.
I suggest converting both to the POSIXct format so that you do not lose any information in your dates column but can still compare it to yesterday. Make sure to set the timezone to the same as your system time (mine is "America/New_York").
Dates <- c("2021-09-09T06:04:35.689Z", "2021-09-09T06:04:35.690Z", "2021-09-09T06:04:35.260Z", "2021-09-24T06:04:35.260Z")
Dates <- gsub("T", " ", Dates)
Dates <- gsub("Z", "", Dates)
Dates <- as.POSIXct(Dates, format = '%Y-%m-%d %H:%M:%OS', tz = "America/New_York")
yesterday <- Sys.time()-86400 #the number of seconds in one day
Now you can tell R to ignore the time and only compare the dates.
trunc(Dates, units = c("days")) == trunc(yesterday, units = c("days"))
The other part of your question was about filtering. The easiest way to filter is subsetting. You first ask R for the indices of the matching values in your vector (or column) by wrapping your comparison in the which() function.
Indices <- which(trunc(Dates, units = c("days")) == trunc(yesterday, units = c("days")))
None of the dates in your str() results match yesterday, so I added one at the end that matches. Calling which() returns a 4 to tell you that the fourth item in your vector matches yesterday's date. If more dates matched, it would have more values. I saved the results in "Indices"
We can then use the Indices from which() to subset your vector or dataframe.
Filtered_Dates <- Dates[Indices]
Filtered_Dataframe <- df[Indices,] #note the comma, which indicates that we are filtering rows instead of columns.
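If the time of day does not matter at all, a shorter variant is to keep only the date part of each string and compare Date with Date. This is a sketch under assumptions: the data frame and its column names (id, stamp, day) are made up for illustration, and the calendar day is taken as written in the string, without any time zone conversion.

```r
yesterday <- Sys.Date() - 1

# Hypothetical data frame; two of the three rows are stamped with yesterday's date
df <- data.frame(
  id = 1:3,
  stamp = c("2021-09-09T06:04:35.689Z",
            paste0(format(yesterday), "T06:04:35.260Z"),
            paste0(format(yesterday), "T18:30:00.000Z")),
  stringsAsFactors = FALSE
)

# The first 10 characters are the yyyy-mm-dd part of each timestamp
df$day <- as.Date(substr(df$stamp, 1, 10))

# Keep the rows whose calendar day equals yesterday
filtered <- df[df$day == yesterday, ]
```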

R apply function returns numeric value on date variables

I have an R dataframe which has a sequence of dates. I want to create a dataframe from the existing one which consists of the dates one month prior.
For example let x be the initial dataframe
x = data.frame(dt = c("28/02/2000","29/02/2000","1/03/2000","02/03/2000"))
My required dataframe y would be
y = c("28/01/2000","29/01/2000","1/02/2000","02/02/2000")
The list is quite big so I don't want to loop. I have created an inline function which works fine when I give it individual dates.
datefun <- function(x) seq(as.Date(strptime(x,format = "%d/%m/%Y")), length =2, by = "-1 month")[2]
datefun("28/02/2000") gives "28/01/2000" as an output
But when I use it inside R's apply it gives seemingly random numerical values.
apply(x,1,function(x) datefun(x))
The output for this is
[1] 10984 10985 10988 10989
I don't know where these numbers come from. Am I missing something?
You should not use apply here. apply first coerces the data frame to a character matrix, and when it simplifies the results into a vector the Date class is dropped, leaving only the underlying numbers. Use lapply instead. This returns a list of results, which can be combined with Reduce and c to create a Date vector.
Reduce(c, lapply(x$dt, datefun))
# [1] "2000-01-28" "2000-01-29" "2000-02-01" "2000-02-02"
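As a self-contained sketch of the lapply approach, the block below rebuilds the example data and combines the list with do.call(c, ...), which is an alternative to Reduce that should give the same Date vector. The argument name d and the variables res_list and y are illustrative choices.

```r
x <- data.frame(dt = c("28/02/2000", "29/02/2000", "1/03/2000", "02/03/2000"),
                stringsAsFactors = FALSE)

# Step one month back from a single date given as a dd/mm/yyyy string
datefun <- function(d) {
  seq(as.Date(d, format = "%d/%m/%Y"), length.out = 2, by = "-1 month")[2]
}

# lapply keeps each result as a Date object inside a list
res_list <- lapply(x$dt, datefun)

# c() on Date objects returns a Date vector, so the class survives
y <- do.call(c, res_list)

# Format back to the dd/mm/yyyy layout used in the question
y_text <- format(y, "%d/%m/%Y")
```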
R stores a Date internally as the number of days elapsed since the UNIX epoch, which is January 1, 1970. You can easily view your updated dates as readable strings using as.Date with an appropriate origin, e.g.
y <- apply(x,1,function(x) datefun(x))
as.Date(y, origin='1970-01-01')
[1] "2000-01-28" "2000-01-29" "2000-02-01" "2000-02-02"
The gist here is that the numerical output you saw perhaps misled you into thinking that your date information was somehow lost. To the contrary, the dates are stored in a numerical format, and it is up to you to tell R how you want to view that information as dates.
You could also skip your function with lubridate:
require(lubridate)
format(dmy(x$dt) %m+% months(-1),"%d/%m/%Y")

How to avoid date formatted values getting converted to numeric when assigned to a matrix or data frame?

I have run into an issue I do not understand, and I have not been able to find an answer to this issue on this website (I keep running into answers about how to convert dates to numeric or vice versa, but that is exactly what I do not want to know).
The issue is that R converts values that are formatted as a date (for instance "20-09-1992") to numeric values when you assign them to a matrix or data frame.
For example, we have "20-09-1992" with a date format, we have checked this using class().
as.Date("20-09-1992", format = "%d-%m-%Y")
class(as.Date("20-09-1992", format = "%d-%m-%Y"))
We now assign this value to a matrix, imaginatively called Matrix:
Matrix <- matrix(NA,1,1)
Matrix[1,1] <- as.Date("20-09-1992", format = "%d-%m-%Y")
Matrix[1,1]
class(Matrix[1,1])
Suddenly the previously date formatted "20-09-1992" has become a numeric with the value 8298. I don't want a numeric with the value 8298, I want a date that looks like "20-09-1992" in date format.
So I was wondering whether this is simply how R works, and we are not allowed to assign dates to matrices and data frames (somehow I have managed to have dates in other matrices/data frames, but it beats me why those other times were different)? Is there a special method to assigning dates to data frames and matrices that I have missed and have failed to deduce from previous (somehow successful) attempts at assigning dates to data frames/matrices?
I don't think you can store dates in a matrix. Use a data frame or data table. If you must store dates in a matrix, you can use a matrix of lists.
Matrix <- matrix(NA,1,1)
Matrix[1,1] <- as.list(as.Date("20-09-1992", format = "%d-%m-%Y"))
Matrix[1,1]
[[1]]
[1] "1992-09-20"
Edited: I also just re-read you had this issue with data frame. I'm not sure why.
mydate<-as.Date("20-09-1992", format = "%d-%m-%Y")
mydf<-data.frame(mydate)
mydf
mydate
1 1992-09-20
Edited: This has been a learning experience for me with R and dates. Apparently the date you supplied was converted to the number of days since the origin, which is January 1, 1970. To convert it back to a date at some point:
Matrix
[,1]
[1,] 8298
as.Date(Matrix, origin ="1970-01-01")
[1] "1992-09-20"
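The behaviour discussed above can be checked in a few lines. This is a minimal sketch using the date from the question; the names d, m, restored and df are illustrative.

```r
d <- as.Date("20-09-1992", format = "%d-%m-%Y")

# A plain matrix holds one basic type and no per-element class,
# so only the underlying day count is stored
m <- matrix(NA, 1, 1)
m[1, 1] <- d
stored_class <- class(m[1, 1])

# The day count can be turned back into a Date by supplying the origin
restored <- as.Date(m[1, 1], origin = "1970-01-01")

# A data frame column keeps the Date class
df <- data.frame(when = d)
column_class <- class(df$when)
```

If the dates are only needed for display, format(restored, "%d-%m-%Y") gives the text layout from the question.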
Try the following: first specify your date vector, and then use
rownames(mat) <- as.character(date_vector)
The dates will then appear as text.
This mostly happens when loading an Excel workbook.
You need to add detectDates = TRUE to the read.xlsx call (the argument is provided by the openxlsx package):
DataFrame <- read.xlsx("File_Name", sheet = 3, detectDates = TRUE)

How to calculate daily means, medians, from weather variables data collected hourly in R?

I have this dataframe, "Data", containing one full year of data collected about every half-hour, but for some days only a few hours of data were collected.
Dates are in the format: 31.01.2010 00:30 (all in one cell)
Variables are: Temperature, humidity, PM10, windspeed, etc.
First question: How can I calculate the daily means, medians, max, min, values of these variables, so I can test each of them in further analysis such as survival analysis with GAM),instead of the hourly/half-hourly data?
Obviously, the calculated daily average/median should be assigned to its corresponding date.
Second question: the DATES column contains both date and time together, separated by one space in the same cell.
in R, its type is 'Factor' and I cannot do any calculations, because the error "dates" is missing, appears.
My guess is that I need to convert it first from Factor into date/time so it can be recognized and then to calculate means/medians. But how do I do this?
Can you please indicate what would be the arguments/functions to use?
I think that I have solved the conversion of date from 'Factor' to POSIXlt: I used the function strptime (Data$DATES, format="%d.%m.%Y %H:%M") and now $DATES are recognized as POSIXlt, format "2010-01-01 00:00:00" ....
But I still need to find the function that calculates daily means or averages or medians or whatever.
First, convert your time series into an xts object.
Then compute the statistics you want using xts functions such as apply.daily().
See the xts vignette.
I feel that the following snippet should work:
# Load library xts
require(xts)
# Create example dataframe
datetime <- c('31.01.2010 00:30', '31.01.2010 00:31', '31.01.2010 10:32', '01.02.2010 10:00', '01.02.2010 11:03', '01.03.2011 08:09', '01.03.2011 21:00', '01.03.2011 22:00')
value <- c(1.5, 2, 2.5, 7, 3.5, 9, 4.5, 7.5)
df <- data.frame(datetime, value)
# Create xts object
df.xts <- as.xts(df[,2], order.by=as.Date(df[,1], format='%d.%m.%Y %H:%M'))
# Daily mean
d.mean <- apply.daily(df.xts, mean)
# Daily median
d.median <- apply.daily(df.xts, median)
# Daily min
d.min <- apply.daily(df.xts, min)
# Daily max
d.max <- apply.daily(df.xts, max)
(alternatively, see RFiddle)
There are several parts to the problem. Before calculating the summary statistics, you need to massage the dataframe so that it has the appropriate types.
For these explanations I'm going to assume you have a dataframe named dt.
Part 1: Converting the datatypes of the dataframe
date factor to datetime StackOverflow
datetime POSIXct conversion StackOverflow
First you need to convert the Date column from the factor type to the datetime type.
dt$Date <- as.POSIXct(strptime(x = as.character(dt$Date),
format = "%d.%m.%Y %H:%M")) # POSIXct rather than POSIXlt, to allow use with ddply
Then, since I'm assuming you want the summary statistics by day-month-year, not including the time, we'll need to extract that info. You'll want to put it in a new column to preserve the times.
dt$date_alt <- as.Date(format(dt$Date, "%Y-%m-%d")) # calendar day without the time
Part 2: Calculating summary statistics grouped by a particular field
Now that the dataframe looks the way we want, you can calculate the summary statistics grouped by day-month-year, which in our case is the date_alt column.
The plyr package provides a really nice function for doing this: ddply
library(plyr) # need this library for the plyr call
summ <- ddply(dt, .(date_alt), summarize,
med_temp = median(Temperature, na.rm = TRUE), # na.rm = TRUE skips missing values
mean_temp = mean(Temperature, na.rm = TRUE), # you can also calc mean if you want
med_humidity = median(humidity, na.rm = TRUE),
med_windspeed = median(windspeed, na.rm = TRUE)
# etc for the rest of your vars
)
Breaking down the ddply call:
ddply cookbook explanation
ddply is essentially a function which acts over a dataframe. Here's a breakdown of the arguments to the function call:
dt -- the name of the dataframe you want to iterate over
.(date_alt) -- the names of the columns you want to group by.
Conceptually, this splits the dataframe up into a bunch of subdataframes whose rows consist of rows from the original dataframe which share the same values in the columns listed in the parentheses.
summarize -- this tells the ddply call that you want to calculate aggregate statistics on the subdataframes
med_temp = median(Temperature) and all similar lines -- each defines a column in the result data frame. It says that you want the new data frame to have a column called med_temp that contains the median(Temperature) result for each sub-dataframe.
Keep in mind that instead of median you can use whatever function you want for aggregate values.
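For readers who prefer not to install extra packages, the same split-by-day idea can be sketched with base R's aggregate(), used here in place of xts or plyr. The toy data and the names dt, stamp, day, daily_mean and daily_median are assumptions for illustration; the formula interface drops rows with a missing Temperature by default.

```r
# Hypothetical half-hourly style readings, one of them missing
datetime <- c("31.01.2010 00:30", "31.01.2010 10:32",
              "01.02.2010 10:00", "01.02.2010 11:03")
Temperature <- c(1.5, 2.5, 7, NA)
dt <- data.frame(datetime, Temperature, stringsAsFactors = FALSE)

# Parse the text into date-times, fixing the time zone so results are reproducible
dt$stamp <- as.POSIXct(dt$datetime, format = "%d.%m.%Y %H:%M", tz = "UTC")

# Calendar day of each reading, used as the grouping key
dt$day <- as.Date(dt$stamp, tz = "UTC")

# One row per day; other statistics work the same way (min, max, ...)
daily_mean <- aggregate(Temperature ~ day, data = dt, FUN = mean)
daily_median <- aggregate(Temperature ~ day, data = dt, FUN = median)
```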

Having difficulty with the start argument for ts( ). Losing date formatting

First, new to programming.
I built a table with 3 columns and I want to evaluate it as a time series, so I'm playing around with the ts() function. The first column of my table is DATE, created with as.Date in the format "yyyy-mm-dd". I have one observation per variable per day. I've applied ts() to the table with start=1 (first observation?) and checked head(df), and the DATE column comes back as a sequence of numbers that I can't identify (12591, 12592, 12593, 12594, 12597, 12598).
Could it be that as.Date is messing things up?
The line I use is:
ts(dy2, start=1, frequency= 1)
I've also been playing with the deltat argument. In the help file it suggests 1/12 for monthly data. Naturally, I tried 1/365 (for daily data), but have yet to be successful.
As suggested by G. Grothendieck you can use the zoo package. Try this:
require(zoo)
dates <- as.Date(dy2[,1], format = "%Y-%m-%d")
x1 <- zoo(dy2[,2], dates)
plot(x1)
x2 <- zoo(dy2[,3], dates)
plot(x2)
If this does not work, please provide further details about your data as requested by MrFlick. For example, print the output of dput(dy2) or at least head(dy2).
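The numbers reported in the question are consistent with Date values that have lost their class and are shown as day counts since 1970-01-01. As a hedged sketch, assuming that is indeed what happened, they can be turned back into dates in base R; day_counts and dates are illustrative names.

```r
# Values copied from the question
day_counts <- c(12591, 12592, 12593, 12594, 12597, 12598)

# Interpret the numbers as days since the epoch
dates <- as.Date(day_counts, origin = "1970-01-01")

# Readable text in the yyyy-mm-dd layout used in the original table
dates_text <- format(dates, "%Y-%m-%d")
```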
