Counting Observations Within Date Range R - r

This probably has a really simple solution. I have two data sets. One is a vector of POSIXct tweet timestamps and the second is a vector of POSIXct ADL HEAT Map timestamps.
I'm looking to build a function that lets me take the dates from the tweets vector and for each one count the number of timestamps in the ADL HEAT Map vector that fall within a specified range from the tweet.
My aim is to build the function such that I can put in the tweets vector, the ADL vector, the number of days from the tweets vector to start counting, and the number of days from the tweets vector to stop counting, and return a vector of counts the same length as the tweets data.
I already tried the solution here, and it didn't work: Count number of occurences in date range in R
Here's an example of what I'm trying to do. Here's a smaller version of the data sets I'm using:
tweets <- c("2016-12-12 14:34:00 GMT", "2016-12-5 17:20:06 GMT")
ADLData <- c("2016-12-11 16:30:00 GMT", "2016-12-7 18:00:00 GMT", "2016-12-2 09:10:00 GMT")
I want to create a function, let's call it countingfunction that lets me input the first data set, the second one, and call a number of days to look back. In this example, I chose 7 days:
countingfunction(tweets, ADLData, 7)
Ideally this would return a vector of the length of tweets or in this case 2 with counts for each of how many events in ADLData occurred within the past 7 days from the date in tweets. In this case, c(2,1).

So, if I have understood you correctly you have that kind of data:
tweets <- c(as.POSIXct("2020-08-16", tz = ""), as.POSIXct("2020-08-15", tz = ""), as.POSIXct("2020-08-14", tz = ""), as.POSIXct("2020-08-13", tz = ""))
ADL <- c(as.POSIXct("2020-08-15", tz = ""), as.POSIXct("2020-08-14", tz = ""))
And what you want to do is to say whether a tweet timestamp appears in the ADL data or not. Note that %in% only finds exact matches, not timestamps that fall within a range. That could be accomplished doing this:
ifelse(tweets %in% ADL, "it's in", "it's not")
You can easily assign the result to another vector, which then states for each tweet whether it is in or not.

You can write countingfunction with the help of outer and calculate the difference in time between every value of two vectors using difftime.
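countingfunction <- function(x1, x2, n) {
  # matrix of time differences in days: rows are x1, columns are x2
  mat <- outer(x1, x2, difftime, units = 'days')
  # per row, count the x2 values that lie up to n days before the x1 value
  rowSums(mat > 0 & mat <= n)
}
Assuming you have vectors of class POSIXct like these: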
tweets <- as.POSIXct(c("2016-12-12 14:34:00", "2016-12-5 17:20:06"), tz = 'GMT')
ADLData <- as.POSIXct(c("2016-12-11 16:30:00","2016-12-7 18:00:00",
"2016-12-2 09:10:00"), tz = 'GMT')
n <- 7
You can pass them as:
countingfunction(tweets, ADLData, n)
#[1] 2 1
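The outer() approach builds a full matrix with one cell per tweet/ADL pair, which can get large for long vectors. As a rough sketch of an alternative, the loop below handles one tweet at a time and also takes separate start and stop offsets, as the question asks. The name count_in_window and its start/stop arguments are made up for this illustration; the sample data mirror the question.

```r
# Hypothetical helper (not from the original answer): for each element of x1,
# count the elements of x2 that lie more than `start` and at most `stop`
# days before it.
count_in_window <- function(x1, x2, start = 0, stop = 7) {
  vapply(seq_along(x1), function(i) {
    # positive values mean the x2 timestamp is earlier than x1[i]
    days_back <- as.numeric(difftime(x1[i], x2, units = "days"))
    sum(days_back > start & days_back <= stop)
  }, integer(1))
}

tweets <- as.POSIXct(c("2016-12-12 14:34:00", "2016-12-05 17:20:06"), tz = "GMT")
ADLData <- as.POSIXct(c("2016-12-11 16:30:00", "2016-12-07 18:00:00",
                        "2016-12-02 09:10:00"), tz = "GMT")

last_week <- count_in_window(tweets, ADLData, start = 0, stop = 7)
last_week
#[1] 2 1
```

This trades the single vectorised call for a loop that only ever holds one row of differences in memory.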

Related

Formatting 24-hour time variable to capture observations in different ranges

I currently have a data frame with a column for Start.Time (imported from a *.csv file), and the format is in 24 hour format (e.g., 20:00:00 equals 8pm). My goal is to capture observations with a start time in various intervals (e.g., between 9:00:00 and 10:00:00), which also meet other criteria. However, it seems that R sorts this 'character' variable in a way that does not align with how our day goes (e.g., 14:00:00 is considered a lower value than 9:00:00).
For example, below is a line of code that works as intended, where I am capturing observations on two different trail segments, which had a start time between 8:00:00 and 9:00:00.
RLLtoMist8.9<-sum((dataset1$Trail.Segment==52|dataset1$Trail.Segment==55) &
(dataset1$Start.Time>="8:00" & dataset1$Start.Time < "9:00"),
na.rm=TRUE)
RLLtoMist8.9
But, this code below does not work as intended, as R is 'valuing' 9:00:00 as greater than 10:00:00.
RLLtoMist9.10 <-
sum((dataset1$Trail.Segment==52|dataset1$Trail.Segment==55) &
(dataset1$Start.Time>="9:00:00 AM" & dataset1$Start.Time < "10:00:00 AM"),
na.rm=TRUE)
It's certainly true that character types are sorted so that "14:00" is less than "9:00". However, R has a datetime class which sorts times correctly once a character representation has been parsed.
a <- as.POSIXct("14:00", format="%H:%M")
b <- as.POSIXct("8:00", format="%H:%M")
# test
> a < b
[1] FALSE
You would be able to convert an entire column with:
dataset1$Start.Time <- as.POSIXct(dataset1$Start.Time, format="%H:%M")
The dates of a and b were the system date at the time of conversion, so if you printed them you would see dates and times in the default format. There are packages, such as chron, that let you use just times, but POSIXt objects necessarily have both dates and times. See ?DateTimeClasses. The lubridate package also has an 'interval' class, and there is a difftime function in base R.
There are also the seq.POSIXt and cut.POSIXt functions, either of which could be used to create multiple time or date boundaries for categorical transformations of datetimes.
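As an illustration of the cut.POSIXt idea, here is a minimal sketch. The sample times are invented, and pinning every time to the dummy date 1970-01-01 in UTC is an assumption made here so that the result does not depend on the day the code is run.

```r
# Made-up start times as character strings, as they might come from a csv
start_chr <- c("8:15:00", "9:30:00", "9:59:00", "14:00:00")

# Attach a fixed dummy date so only the time of day matters
start_time <- as.POSIXct(paste("1970-01-01", start_chr),
                         format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# One factor level per hour between the earliest and latest time
hour_bin <- cut(start_time, breaks = "hour")
table(hour_bin)

# A single interval can also be tested directly, here [09:00, 10:00)
lower <- as.POSIXct("1970-01-01 09:00:00", tz = "UTC")
upper <- as.POSIXct("1970-01-01 10:00:00", tz = "UTC")
n_9_to_10 <- sum(start_time >= lower & start_time < upper)
n_9_to_10
#[1] 2
```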
Using the data.table library:
library(data.table)
# convert to a data.table
dataset1 <- data.table(dataset1)
# parse the character column into a date-time rather than character
dataset1[, Start.Time := as.POSIXct(Start.Time, format="%H:%M:%S")]
# now do your filtering (between() includes both bounds by default)
dataset1[between(Start.Time, as.POSIXct("09:00:00", format="%H:%M:%S"), as.POSIXct("10:00:00", format="%H:%M:%S")) & (Trail.Segment==52 | Trail.Segment==55)]

Why does as.Date applied to an xts index produce a different result?

Why do x and y (below) produce different results when filtering the xts object? Both x and y appear to store unique dates, one as characters and the other as dates. ob[x] returns all records. ob[y] returns 1 record per date (only if a record matches to midnight, 00:00:00 ).
library(xts)
seq1 <- seq(as.POSIXct("2015-09-01"), as.POSIXct("2015-09-14"), by = "30 mins")
ob <- xts(data.frame(closingPrice = 1:(length(seq1))), seq1)
x = unique(format(index(ob), format = "%Y-%m-%d"))
y = as.Date(unique(format(index(ob), format = "%Y-%m-%d")))
ob[x]
ob[y]
x is a character vector, y is a Date vector. When you subset an xts object by a date-time object, you only get exact matches (in this case, midnight of each day).
When you subset by a character vector, you use xts' ISO-8601-based subsetting (see ?"[.xts"). One feature of that type of subsetting is that you get all observations that match up to the lowest specified component.
You specified year, month, and day, so you'll get all index observations that occur on that specific day. For another example: specify everything up to an hour, and you'll get all observations for that hour.
> ob[paste(x[1],"12")]
closingPrice
2015-09-01 12:00:00 25
2015-09-01 12:30:00 26
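A small sketch of the same character-based subsetting, rebuilt on its own so it can be run in isolation. Fixing the time zone to UTC is an assumption added here to keep the row counts independent of the local time zone; it is not part of the original question.

```r
library(xts)

# Same half-hourly series as in the question, but pinned to UTC
seq1 <- seq(as.POSIXct("2015-09-01", tz = "UTC"),
            as.POSIXct("2015-09-14", tz = "UTC"), by = "30 mins")
ob <- xts(data.frame(closingPrice = seq_along(seq1)), seq1)

one_day   <- ob["2015-09-01"]             # every observation on that day
two_days  <- ob["2015-09-01/2015-09-02"]  # ISO-8601 range, both days included
noon_hour <- ob["2015-09-01 12"]          # every observation in the 12:00 hour

# A Date vector can be turned into character to get whole-day matching
y <- as.Date(c("2015-09-01", "2015-09-02"))
same_two_days <- ob[as.character(y)]

c(nrow(one_day), nrow(two_days), nrow(noon_hour), nrow(same_two_days))
```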

Recall in date POSIXct

I couldn't find a solution to my problem with the POSIXct format. I have monthly data. This is a snippet of my code:
Data <- as.POSIXct(as.character(czerwiec$Data), format = "%Y-%m-%d %H:%M:%S")
get.rows <- Data >= as.POSIXct(as.character("2013-06-03 00:00:01")) & Data <= as.POSIXct(as.character("2013-06-09 23:59:59"))
czerwiec <- czerwiec[get.rows,]
Data <- Data[get.rows]
I chose one whole week of June, from the 3rd to the 9th, and want to compute the sum of column X (czerwiec$X) for every hour. As you can see, I could narrow the time window by hand, but it would be silly to do it like this:
get.rows <- Data >= as.POSIXct(as.character("2013-06-03 00:00:01")) &
Data <= as.POSIXct(as.character("2013-06-03 00:59:59"))
then
get.rows <- Data >= as.POSIXct(as.character("2013-06-04 00:00:01")) &
Data <= as.POSIXct(as.character("2013-06-04 00:59:59"))
At the end of these operations I could compute the sum for that hour, and so on.
Do you have any idea how I can refer to all rows with a date between 2013-06-03 and 2013-06-09 and a time between 00:00:01 and 00:59:59?
About the data frame "czerwiec": it has three columns, the first called "ID", the second "Price" and the third "Data" (which means Date).
Thanks for the help :)
This might help. I've used the lubridate package, which doesn't really do anything you can't do in base R, but it makes handling dates much easier.
# Set up Data as a string vector
Data <- c("2013-06-01 05:05:05", "2013-06-06 05:05:05", "2013-06-06 08:10:05", "2013-07-07 05:05:05")
require(lubridate)
# Set up the data frame with fake data. This makes a reproducible example
set.seed(4) #For reproducibility, always set the seed when using random numbers
# Create a data frame with Data and price
czerwiec <- data.frame(price=runif(4))
# Use lubridate to turn the Data string into a vector of POSIXct objects
czerwiec$Data <- ymd_hms(Data)
# Determine the 'yearday' -i.e. yearday of Jan 1 is 1; yearday of Dec 31 is 365 (or 366 in a leap year)
czerwiec$yday <- yday(czerwiec$Data)
# in.range is true if the date is in the desired date range
czerwiec$in.range <- czerwiec$yday >= yday(ymd("2013-06-03")) &
czerwiec$yday <= yday(ymd("2013-06-09"))
# Pick out the dates that have the range that you want
selected_dates <- subset(czerwiec, in.range==TRUE)
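The question also asks for a sum per hour, which the code above does not reach. One possible way to get there in base R is sketched below: filter the week once, build an hour label with format(), and let aggregate() do the grouping. The sample rows, the Price values and the UTC time zone are invented for this illustration.

```r
# Made-up data in the shape described in the question
czerwiec <- data.frame(
  Data = as.POSIXct(c("2013-06-03 00:10:00", "2013-06-03 00:45:00",
                      "2013-06-03 01:05:00", "2013-06-04 00:30:00",
                      "2013-06-10 00:15:00"), tz = "UTC"),
  Price = c(1, 2, 3, 4, 5)
)

# Keep the week from 3 June up to (but not including) 10 June
in_week <- czerwiec$Data >= as.POSIXct("2013-06-03", tz = "UTC") &
           czerwiec$Data <  as.POSIXct("2013-06-10", tz = "UTC")
week <- czerwiec[in_week, ]

# Sum for every individual hour of every day
week$day_hour <- format(week$Data, "%Y-%m-%d %H")
hourly <- aggregate(Price ~ day_hour, data = week, FUN = sum)

# Sum for each hour of the day, pooled over the whole week
week$hour <- format(week$Data, "%H")
by_hour <- aggregate(Price ~ hour, data = week, FUN = sum)

hourly
by_hour
```

With these sample rows, hourly has one row per date and hour (sums 3, 3 and 4), while by_hour pools hour 00 across days (sum 7) and hour 01 (sum 3).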

How convert Time and Date

I have one question: how do I convert a date and time in the format 20110711201023 to a number of hours? This is the output of software that I use for image analysis, and I can't change it. It is very important to be able to define a starting date and time.
Format: 2011 year, 07 month, 11 day, 20 hour, 10 minute, 23 second.
Example:
Starting Date and Time - 20110709201023,
First Date and Time - 20110711214020
Result = 49,5h.
I have 10000 values in this format so I don't want to do this manually.
I would be very grateful for any advice.
Best is to first make it a real R time object using strptime:
time_obj = strptime("20110711201023", format = "%Y%m%d%H%M%S")
If you do this with both the start and the end date, you can simply say:
end_time - start_time
to get the time difference as a difftime object. Its units are chosen automatically, so use difftime(end_time, start_time, units = "hours") to get the number of hours directly. To convert a whole list of these time strings, simply do:
time_vector = strptime(dat$time_string, format = "%Y%m%d%H%M%S")
where dat is the data.frame with the data, and time_string the column containing the time strings. Note that strptime works also on a vector (it is vectorized). You can also make the new time vector part of dat:
dat$time = strptime(dat$time_string, format = "%Y%m%d%H%M%S")
or more elegantly (at least if you hate $ as much as me :)):
dat = within(dat, { time = strptime(time_string, format = "%Y%m%d%H%M%S") })
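Putting the pieces together for the values in the question, a minimal sketch might look like this. It uses as.POSIXct with the same format string instead of strptime, and fixes the time zone to UTC so that daylight saving changes cannot shift the result; both choices are assumptions made for this example.

```r
fmt <- "%Y%m%d%H%M%S"

# Starting point and a vector of later time stamps (values from the question)
start_time <- as.POSIXct("20110709201023", format = fmt, tz = "UTC")
time_strings <- c("20110709201023", "20110711214020")
times <- as.POSIXct(time_strings, format = fmt, tz = "UTC")

# Elapsed hours since the start, as a plain numeric vector
hours <- as.numeric(difftime(times, start_time, units = "hours"))
round(hours, 1)
#[1]  0.0 49.5
```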

R: adding time/duration in the form of hh:mm:ss in excess of 24 hours

I'm trying to work with time data regarding the duration of several experiments. The data are in the form hh:mm:ss.
exp1 <- "34:03:07"
exp2 <- "00:01:10"
exp3 <- "01:13:41"
The first experiment is given as the total duration and the subsequent experiments are listed as time in excess of exp1. The total time result I'm looking for of exp2, for example, would then be 34:04:17 and for exp3 would be 35:16:48. What is the best method to add the times to exp1, when packages such as chron only work in hh:mm:ss up to 24:00:00?
I often prefer to avoid Date/Time classes for durations like this and will simply parse the strings myself and convert them to a numeric value in minutes or seconds.
But as always, there's typically a package with similar functionality. In this case, lubridate has a duration class that may be useful:
# duration() in current lubridate; older versions called this new_duration()
d <- duration(hour = 34, minute = 23, second = 23)
d1 <- duration(hour = 12, minute = 12, second = 23)
> d+d1
[1] 167746s (1.94d)
So you can do arithmetic with them and you can also create a duration object by passing numeric values (in seconds) to as.duration.
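The "parse the string myself" route mentioned above can be sketched in base R as follows. The helper names to_seconds and to_hms are made up for this example, and the sketch assumes every string has exactly three colon-separated fields.

```r
# Hypothetical helpers: "hh:mm:ss" (hours may exceed 24) <-> seconds
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

to_hms <- function(s) {
  sprintf("%02d:%02d:%02d",
          as.integer(s %/% 3600),           # whole hours, not capped at 24
          as.integer((s %% 3600) %/% 60),   # remaining minutes
          as.integer(s %% 60))              # remaining seconds
}

exp1 <- "34:03:07"
exp2 <- "00:01:10"
exp3 <- "01:13:41"

# exp2 and exp3 are offsets on top of exp1
totals <- to_hms(to_seconds(exp1) + to_seconds(c(exp2, exp3)))
totals
#[1] "34:04:17" "35:16:48"
```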
