I am trying to filter a large dataset down to records that occur on the hour. The data looks like this:
I want to filter the Date_Time field to only the records that are on the hour, i.e. "yyyy-mm-dd XX:00:00", or within 10 minutes of the hour. So, for example, this dataset would reduce down to rows 1 and 5. Does anyone have a suggestion?
You can extract the minute value from the datetime and select the rows where it is within 10 minutes of the hour.
result <- subset(df, as.integer(format(UTC_datetime, '%M')) <= 10)
Or with dplyr and lubridate -
library(dplyr)
library(lubridate)
result <- df %>% filter(minute(UTC_datetime) <= 10)
Using data.table
library(data.table)
setDT(df)[minute(UTC_datetime) <= 10]
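If "within 10 min of the hour" should also catch times shortly *before* the hour (e.g. XX:54), a symmetric condition works. A minimal sketch with made-up data, since the original df isn't shown:

```r
library(dplyr)
library(lubridate)

# Hypothetical data with the assumed UTC_datetime column
df <- data.frame(
  UTC_datetime = as.POSIXct(c("2021-05-01 10:00:00", "2021-05-01 10:07:30",
                              "2021-05-01 10:25:00", "2021-05-01 10:54:00"),
                            tz = "UTC")
)

# Keep rows within 10 minutes on either side of an hour
result <- df %>%
  filter(minute(UTC_datetime) <= 10 | minute(UTC_datetime) >= 50)
```

Here the 10:25 row is dropped and the other three are kept.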
Related
I want to filter the dataframe below, to where only certain rows are kept.
total.Date = date of event
total.start = start time of event
total.TotalTime = duration of event (minutes)
total.ISSUE_DATE = date of item ordered
total.ISSUE_TIME = time of item ordered
In this specific subsetted dataset, I believe all rows will be excluded. However when I perform this on the entire dataset, some rows are expected to remain.
First, paste together the surgery and order dates and times to form proper datetimes, then convert the integer minutes into a "period" object (in lubridate terminology).
Then it's straightforward to filter: order time greater than the start time minus 30 minutes AND less than the start time plus the length of the surgery.
library(dplyr)
library(lubridate)
your_df %>%
  mutate(
    surgery_start = mdy_hms(paste(total.Date, total.PTIN)),
    order_time = mdy_hms(paste(total.ISSUE_DATE, total.ISSUE_TIME)),
    surgery_duration = minutes(total.TotalORTime)
  ) %>%
  filter(
    order_time > surgery_start - minutes(30),
    order_time < surgery_start + surgery_duration
  )
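To see the filter in action, here is a self-contained sketch with hypothetical rows mirroring the columns the answer assumes (the real column values aren't shown in the question):

```r
library(dplyr)
library(lubridate)

# Hypothetical data: one surgery starting 08:00 lasting 120 minutes,
# with one order inside the window and one after it
your_df <- data.frame(
  total.Date        = "01/15/2020",
  total.PTIN        = "08:00:00",
  total.ISSUE_DATE  = c("01/15/2020", "01/15/2020"),
  total.ISSUE_TIME  = c("07:45:00", "11:00:00"),
  total.TotalORTime = 120
)

result <- your_df %>%
  mutate(
    surgery_start = mdy_hms(paste(total.Date, total.PTIN)),
    order_time = mdy_hms(paste(total.ISSUE_DATE, total.ISSUE_TIME)),
    surgery_duration = minutes(total.TotalORTime)
  ) %>%
  filter(
    order_time > surgery_start - minutes(30),
    order_time < surgery_start + surgery_duration
  )
```

The 07:45 order falls inside the (07:30, 10:00) window and is kept; the 11:00 order is dropped.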
I want to add a column which is a subtraction of Store_Entry_Time from Store_Exit_Time.
For example, the result for row 1 should be (2014-12-02 18:49:05.402863 - 2014-12-02 16:56:32.394052) = approximately 1 hour 53 minutes. (I want this result in hours only.)
I entered class(Store_Entry_Time) and it says "character".
How do I perform the subtraction and put the result into a new column called "Time Spent"?
You can use ymd_hms from lubridate to convert the columns to POSIXct format and then use difftime to calculate the difference in time.
library(dplyr)
df <- df %>%
  mutate(across(c(Store_Entry_Time, Store_Exit_Time), lubridate::ymd_hms),
         Time_Spent = as.numeric(difftime(Store_Exit_Time,
                                          Store_Entry_Time, units = 'hours')))
For a base R option here, we can try using as.POSIXct:
df$Time_Spent <- as.numeric(as.POSIXct(df$Store_Exit_Time) -
                              as.POSIXct(df$Store_Entry_Time))
The above gives the difference in time. Note that plain subtraction chooses its units automatically (hours here, but minutes for short gaps), so use difftime(..., units = "hours") if you need hours guaranteed.
Example:
Store_Exit_Time <- "2014-12-02 18:49:05.402863"
Store_Entry_Time <- "2014-12-02 16:56:32.394052"
Time_Spent <- as.numeric(as.POSIXct(Store_Exit_Time) - as.POSIXct(Store_Entry_Time))
Time_Spent
[1] 1.875836
I would like to calculate the mean of Mean.Temp.c. before a certain date, such as 1963-03-23, as shown in the date2 column in this example. This is when peak snowmelt runoff occurred in 1963 in my area. I want the mean temperature for the 10 days before this date (i.e., 1963-03-23). How do I do it? I have 50 years of data, and the peak snowmelt date is different each year.
example data
You can try:
library(dplyr)
df %>%
  mutate(date2 = as.Date(as.character(date2)),
         ten_day_mean = mean(Mean.Temp.c[between(date2, as.Date("1963-03-14"), as.Date("1963-03-23"))]))
In this case the desired mean would populate the whole column.
Or with data.table:
library(data.table)
setDT(df)[between(as.Date(as.character(date2)), "1963-03-14", "1963-03-23"), ten_day_mean := mean(Mean.Temp.c)]
In the latter case you'd get NA for those days that are not relevant for your date range.
Supposing date2 is a Date field and your data.frame is called x:
start_date <- as.Date("1963-03-23")-10
end_date <- as.Date("1963-03-23")
mean(x$Mean.Temp.c.[x$date2 >= start_date & x$date2 <= end_date])
Now, if you have multiple years of interest, you could wrap this code in a for loop (or sapply/lapply), taking elements from a vector of dates.
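That looping idea can be sketched like this, assuming one peak-snowmelt date per year is stored in a vector (the dates and temperatures below are made up for illustration):

```r
# Hypothetical daily data covering two years
set.seed(1)
x <- data.frame(
  date2 = seq(as.Date("1963-01-01"), as.Date("1964-12-31"), by = "day")
)
x$Mean.Temp.c. <- rnorm(nrow(x))

# One assumed peak-snowmelt date per year
peak_dates <- as.Date(c("1963-03-23", "1964-04-02"))

# Mean temperature over the 10 days up to (and including) each peak date
ten_day_means <- sapply(peak_dates, function(d) {
  mean(x$Mean.Temp.c.[x$date2 >= d - 10 & x$date2 <= d])
})
```

This returns one mean per year, in the same order as peak_dates.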
I have a data frame df that simply looks like this:
month values
2012M01 99904
2012M02 99616
2012M03 99530
2012M04 99500
2012M05 99380
2012M06 99103
2013M01 98533
2013M02 97600
2013M03 96431
2013M04 95369
2013M05 94527
2013M06 93783
where the month is written in the form "M01", "M02", and so on.
Now I want to convert this column to date format, is there a way to do it in R with lubridate?
I also want to select columns that contain one certain month from each year, like only March columns from all these years, what is the best way to do it?
The short answer is that dates require a year, month and day, so you cannot convert directly to a date format. You have 2 options.
Option 1: convert to a year-month format using zoo::as.yearmon.
library(zoo)
df$yearmon <- as.yearmon(df$month, "%YM%m")
# you can get e.g. month from that
months(df$yearmon[1])
# [1] "January"
Option 2: convert to a date by assuming that the day is always the first day of the month.
df$date <- as.Date(paste(df$month, "01", sep = "-"), "%YM%m-%d")
For selection (and I think you mean select rows, not columns), you already have everything you need. For example, to select only March 2013:
library(dplyr)
df %>% filter(month == "2013M03")
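Building on Option 1, one way to pull the same month from every year is to format the yearmon column down to its month part. A sketch with a small sample of the question's data (assumes zoo is installed):

```r
library(zoo)

# Sample of the question's data
df <- data.frame(month = c("2012M01", "2012M03", "2013M03", "2013M05"),
                 values = c(99904, 99530, 96431, 94527))
df$yearmon <- as.yearmon(df$month, "%YM%m")

# All March rows, regardless of year
march <- df[format(df$yearmon, "%m") == "03", ]
```

This keeps 2012M03 and 2013M03 while dropping the other months.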
Something like this will get it:
raw <- "2012M01"
dt <- strptime(raw,format = "%YM%m")
dt will be in POSIXlt format. The strptime function will assign a '1' as the default day of month to make it a complete date.
The below is an example of the data I have.
date time size filename day.of.week
1 2015-01-16 5:36:12 1577 01162015053400.xml Friday
2 2015-01-16 5:38:09 2900 01162015053600.xml Friday
3 2015-01-16 5:40:09 3130 01162015053800.xml Friday
What I would like to do is sum up the size of the files for each hour.
I would like a resulting data table that looks like:
date hour size
2015-01-16 5 7607
2015-01-16 6 10000
So forth and so on.
But I can't quite seem to get the output I need.
I've tried ddply and aggregate, but I'm summing up the entire day, I'm not sure how to break it down by the hour in the time column.
And I've got multiple days worth of data. So it's not only for that one day. It's from that day, almost every day until yesterday.
Thanks!
The following should do the trick, assuming your example data are stored in a data frame called "test":
library(lubridate) # for hms and hour functions
test$time <- hms(test$time)
test$hour <- factor(hour(test$time))
library(dplyr)
test %>%
  select(-time) %>% # drop the time column; dplyr can't summarise over its Period (S4) class
  group_by(date, hour) %>%
  summarise(size = sum(size))
You can use data.table
library(data.table)
dt <- as.data.table(df)
# Define a time stamp column.
dt[, timestamp := as.POSIXct(paste(date, time), format = "%Y-%m-%d %H:%M:%S")]
# Aggregate by hour: total size per hour
dt[, .(size = sum(size)), by = .(hour = as.POSIXct(round(timestamp, "hour")))]
Benefit is that data.table is blazing fast!
Use a compound group_by(date, hour). That will do it.
If you convert your date and time columns into a single POSIXct column when (similar to a previous answer, i.e. df$when <- as.POSIXct(strptime(paste(df$date, df$time), format = "%Y-%m-%d %H:%M:%S"))), you could use:
aggregate(df[c("size")], FUN=sum, by=list(d=as.POSIXct(trunc(df$when, "hour"))))
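Putting that together on the three sample rows from the question (a minimal self-contained sketch; single-digit hours parse fine with "%H"):

```r
# The three sample rows from the question
df <- data.frame(
  date = "2015-01-16",
  time = c("5:36:12", "5:38:09", "5:40:09"),
  size = c(1577, 2900, 3130)
)

# Build the combined POSIXct column, then sum size per truncated hour
df$when <- as.POSIXct(paste(df$date, df$time), format = "%Y-%m-%d %H:%M:%S")
res <- aggregate(df[c("size")], FUN = sum,
                 by = list(d = as.POSIXct(trunc(df$when, "hour"))))
```

All three rows fall in the 5 o'clock hour, so res has a single row with size 1577 + 2900 + 3130 = 7607, matching the desired output.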