I have a column of times that have been entered as raw text. An example is below (code for data input at the bottom of the post):
#> id time
#> 1 NA <NA>
#> 2 1 7:50 pm
#> 3 2 7:20 pm
#> 4 3 3:20 pm
I would like to add indicator variables that indicate, for example, whether the time is:
before 7.30pm
between 7pm and 7.30pm
So my desired output would look like this:
#> id time before_1930 between_1900_1930
#> 1 NA <NA> NA NA
#> 2 1 7:50 pm 0 0
#> 3 2 7:20 pm 1 1
#> 4 3 3:20 pm 1 0
So far, I have tried reading in the times with parse_date_time, but this adds on a date:
library(dplyr)
library(lubridate)
df <- df %>% mutate(time = lubridate::parse_date_time(time, '%I:%M %p'))
df
#> id time
#> 1 NA <NA>
#> 2 1 0000-01-01 19:50:00
#> 3 2 0000-01-01 19:20:00
#> 4 3 0000-01-01 15:20:00
Is there an easy way to work directly with the hours and minutes, and then create the dummy variables I mentioned?
Code for data input
df <- data.frame(
id = c(NA, 1, 2, 3),
time = c(NA, "7:50 pm", "7:20 pm", "3:20 pm")
)
Try this one:
library(dplyr)
library(lubridate)
data.frame(
id = c(NA, 1, 2, 3),
time = c(NA, "7:50 pm", "7:20 pm", "3:20 pm")
) %>%
  mutate(real_time = lubridate::parse_date_time(time, '%I:%M %p'),
         is_before = case_when(
           is.na(real_time) ~ NA_character_,  # keep missing times missing
           hour(real_time) < 19 ~ "Before 19",
           hour(real_time) == 19 & minute(real_time) < 30 ~ "19:00 - 19:30",
           TRUE ~ "After 19:30"
         ))
id time real_time is_before
1 NA <NA> <NA> <NA>
2 1 7:50 pm 0000-01-01 19:50:00 After 19:30
3 2 7:20 pm 0000-01-01 19:20:00 19:00 - 19:30
4 3 3:20 pm 0000-01-01 15:20:00 Before 19
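On the side issue of the placeholder date that parsing adds: if the date is only a nuisance for display, one option is to format the parsed value back to a clock string. The following is a minimal base R sketch, not part of the answer above; the names parsed and clock are made up for illustration, and it assumes a locale in which %p recognises "am"/"pm".

```r
# Parse the 12-hour text; base R fills in today's date, which is dropped again below.
parsed <- as.POSIXct(c(NA, "7:50 pm", "7:20 pm", "3:20 pm"),
                     format = "%I:%M %p", tz = "UTC")

# Keep only the 24-hour clock part as text; missing inputs stay NA.
clock <- format(parsed, "%H:%M")
clock
```

The result is a character vector, so it is suitable for display but not for arithmetic; for comparisons it is easier to work with hours or minutes since midnight.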
Rather than trying to deal with it as a date/time, use your output from parse_date_time to calculate the number of hours since midnight on 0000-01-01.
df <- data.frame(
id = c(NA, 1, 2, 3),
time = c(NA, "7:50 pm", "7:20 pm", "3:20 pm")
)
library(dplyr)
library(lubridate)
df <- df %>% mutate(time = lubridate::parse_date_time(time, '%I:%M %p'),
time = difftime(time,
as.POSIXct("0000-01-01", tz = "UTC"),
units = "hours"),
before_1930 = as.numeric(time < 19.5),
between_1900_1930 = as.numeric(time > 19 & time < 19.5))
df
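A variation on the same idea, sketched in base R under the assumption that the locale recognises "am"/"pm" for %p: instead of subtracting a reference date, read the hour and minute fields directly and compare minutes since midnight. The names parsed and mins are illustrative, and treating 7:00 pm itself as inside the 7pm to 7.30pm window is an assumption about the intended boundary.

```r
df <- data.frame(
  id = c(NA, 1, 2, 3),
  time = c(NA, "7:50 pm", "7:20 pm", "3:20 pm")
)

# strptime() returns POSIXlt, whose hour and min fields can be read directly.
parsed <- strptime(df$time, format = "%I:%M %p", tz = "UTC")

# Minutes since midnight, e.g. "7:20 pm" -> 19 * 60 + 20 = 1160.
mins <- parsed$hour * 60 + parsed$min

# Missing times propagate as NA in both indicators.
df$before_1930 <- as.integer(mins < 19 * 60 + 30)
df$between_1900_1930 <- as.integer(mins >= 19 * 60 & mins < 19 * 60 + 30)
df
```

Working in whole minutes avoids floating-point comparisons against values such as 19.5.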
Related
I have a data frame with a date/time series and I am trying to find the monthly amount of time that values were above x (for the purpose of this question, let's say > 5).
Here is a sample dataframe
# Create a, b, c variables
a <- c("06-25-20 08:00:00 AM","06-25-20 08:15:00 AM",
"06-25-20 08:30:00 AM","06-25-20 08:45:00 AM",
"07-25-20 08:45:00 AM", "07-25-20 08:45:00 AM",
"08-25-20 08:45:00 AM", "08-25-20 08:45:00 AM",
"09-25-20 08:45:00 AM","09-25-20 08:45:00 AM")
b <- c(4,5,8, "N/A", 4,5,"N/A",7,7,6)
c <- c(6,10,8, "N/A", 8,5,"N/A",8,7,2)
# Join the variables to create a data frame
df <- data.frame(a,b,c)
df$a = as.POSIXct(df$a, format = "%m-%d-%y %I:%M:%S %p", tz = 'EST')
I started by separating the date and time
# Put date and time into separate columns
df$Day <- as.Date(df$a)
df$Time <- format(df$a, "%H:%M:%S")
There are two problems I'm left with. First, the Time column is of class character, and when I use the code
df$Time = as.POSIXct(df$Time, format = "%H:%M:%S", tz = 'EST')
the dates are added back to the Time column.
My second issue is that I don't know how to calculate the monthly amount of time that the values in each column were > 5. Can anyone help?
The lubridate package is very helpful when working with dates and times.
I’ll also use dplyr for calculating the time per month of b or c being above 5.
library(lubridate)
library(dplyr)
# Create a, b, c variables
a <- c(
"06-25-20 08:00:00 AM",
"06-25-20 08:15:00 AM",
"06-25-20 08:30:00 AM",
"06-25-20 08:45:00 AM",
"07-25-20 08:45:00 AM",
"07-25-20 08:45:00 AM",
"08-25-20 08:45:00 AM",
"08-25-20 08:45:00 AM",
"09-25-20 08:45:00 AM",
"09-25-20 08:45:00 AM"
)
When defining missing values in your data make sure to use NA not "N/A"!
b <- c(4, 5, 8, NA, 4, 5, NA, 7, 7, 6)
c <- c(6, 10, 8, NA, 8, 5, NA, 8, 7, 2)
tibble() instead of data.frame() makes it easier to see the class of the columns.
df <-
tibble(a, b, c)
df$a = as.POSIXct(df$a, format = "%m-%d-%y %I:%M:%S %p", tz = 'EST')
df$month <- month(df$a)
Total time per month during which b was greater than 5
df %>%
group_by(month) %>%
mutate(prev_a = lag(a, 1),
diff_time = a - prev_a) %>%
filter(b > 5) %>%
summarise(sum_diff_time = sum(diff_time, na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#> month sum_diff_time
#> <dbl> <drtn>
#> 1 6 900 secs
#> 2 8 0 secs
#> 3 9 0 secs
And the same for c
df %>%
group_by(month) %>%
mutate(prev_a = lag(a, 1),
diff_time = a - prev_a) %>%
filter(c > 5) %>%
summarise(sum_diff_time = sum(diff_time, na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 4 x 2
#> month sum_diff_time
#> <dbl> <drtn>
#> 1 6 1800 secs
#> 2 7 0 secs
#> 3 8 0 secs
#> 4 9 0 secs
Note: This assumes that b and c held the same value from the previous
timestamp in a up to the current one. I guess you are looking for a result that
is somewhat different from that, but this should point you in the right direction.
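To make that assumption concrete, here is a small base R sketch that reproduces the 900 secs reported for b in June. It uses only the four June readings from the sample data and attributes each 15-minute gap to the reading at its end; the names elapsed and secs_above are illustrative, and the format string assumes a locale in which %p recognises "AM"/"PM".

```r
# Four readings from June in the sample data, 15 minutes apart.
a <- as.POSIXct(
  c("06-25-20 08:00:00 AM", "06-25-20 08:15:00 AM",
    "06-25-20 08:30:00 AM", "06-25-20 08:45:00 AM"),
  format = "%m-%d-%y %I:%M:%S %p", tz = "EST"
)
b <- c(4, 5, 8, NA)

# Seconds elapsed since the previous reading (NA for the first one).
elapsed <- c(NA, as.numeric(diff(a), units = "secs"))

# Sum the elapsed time of the readings where b > 5; which() drops NA comparisons.
secs_above <- sum(elapsed[which(b > 5)])
secs_above
```

Only the 08:30 reading has b > 5, so only the gap from 08:15 to 08:30 is counted. If the gap following a reading should be counted instead, the elapsed vector would need to be shifted accordingly.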
I need to filter a large dataset (100K + observations) in R so that it only includes data from 2014-present. The raw data contain observations from 2001-present. Here is the sample data to work from:
df <- data.frame(student = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), GPA = c(4,3.7,2.0,1.3,2.9,2.4,4.0,3.0,2.0,3.3),
Failed_Course = c(1,0,1,1,1,1,1,1,1,0),
Exam_date = c ("01/06/2010 06:55:00 AM", "03/30/2020 11:55:00 PM","12/30/2014 12:55:00 AM","04/20/2016 11:55:00 PM","09/28/2014 11:12:00 PM","07/30/2017 11:55:00 PM", "4/3/2005 09:55:00 PM",
"8/20/2004 11:55:00 PM","8/20/2015 11:22:00 AM","6/22/2001 08:55:00 PM"))
Using dplyr and lubridate
library(lubridate)
library(dplyr)
# Convert Exam_date from text to a date-time (month/day/year hours:minutes:seconds)
df$Exam_date <- mdy_hms(df$Exam_date)
# Create a helper variable date_year that contains only the year,
# keep rows from 2014 onwards,
# and drop the helper variable again
df <- df %>%
mutate(date_year = year(Exam_date)) %>%
filter(date_year >= 2014) %>%
select(-date_year)
Here is a base R approach.
df$Exam_date <- as.POSIXct(df$Exam_date, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
df[df$Exam_date >= as.POSIXct("2014-01-01 00:00:00", tz = "UTC"), ]
# student GPA Failed_Course Exam_date
#2 2 3.7 0 2020-03-30 23:55:00
#3 3 2.0 1 2014-12-30 00:55:00
#4 4 1.3 1 2016-04-20 23:55:00
#5 5 2.9 1 2014-09-28 23:12:00
#6 6 2.4 1 2017-07-30 23:55:00
#9 9 2.0 1 2015-08-20 11:22:00
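Two details matter when comparing against a cutoff like this: the cutoff should be built in the same time zone as the parsed column, and the comparison should be inclusive if an exam at exactly midnight on 1 January 2014 is meant to be kept. The sketch below illustrates both; the third timestamp is a made-up boundary case that is not in the sample data, and the names exam, cutoff and keep are illustrative.

```r
exam <- c("01/06/2010 06:55:00 AM",   # from the sample data, before the cutoff
          "12/30/2014 12:55:00 AM",   # from the sample data, after the cutoff
          "01/01/2014 12:00:00 AM")   # hypothetical boundary case

parsed <- as.POSIXct(exam, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Build the cutoff in the same time zone as the parsed values.
cutoff <- as.POSIXct("2014-01-01", tz = "UTC")

# >= keeps a timestamp that falls exactly on the cutoff.
keep <- parsed >= cutoff
keep
```

With a strict > comparison the boundary case would be dropped, which may or may not be what "2014-present" is meant to cover.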
I have two columns in a data frame: the first is water consumption and the second is date + hour. For example:
Value Time
12.2 1/1/2016 1:00
11.2 1/1/2016 2:00
10.2 1/1/2016 3:00
The data cover 4 years, and I want to create separate columns for month, day, year and hour.
I would appreciate any help.
We can convert 'Time' to a date-time and then extract the components. We assume the format of the 'Time' column is 'dd/mm/yyyy H:M' (if it is 'mm/dd/yyyy H:M' instead, change dmy_hm to mdy_hm).
library(dplyr)
library(lubridate)
df1 %>%
mutate(Time = dmy_hm(Time), month = month(Time),
year = year(Time), hour = hour(Time))
# Value Time month year hour
#1 12.2 2016-01-01 01:00:00 1 2016 1
#2 11.2 2016-01-01 02:00:00 1 2016 2
#3 10.2 2016-01-01 03:00:00 1 2016 3
In base R, we can either use strptime or as.POSIXct and then use either format or extract components
df1$Time <- strptime(df1$Time, "%d/%m/%Y %H:%M")
transform(df1, month = Time$mon+1, year = Time$year + 1900, hour = Time$hour)
# Value Time month year hour
#1 12.2 2016-01-01 01:00:00 1 2016 1
#2 11.2 2016-01-01 02:00:00 1 2016 2
#3 10.2 2016-01-01 03:00:00 1 2016 3
data
df1 <- structure(list(Value = c(12.2, 11.2, 10.2), Time = c("1/1/2016 1:00",
"1/1/2016 2:00", "1/1/2016 3:00")), class = "data.frame", row.names = c(NA,
-3L))
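The question also asks for the day of the month, which the examples above leave out. A minimal base R sketch that adds it alongside the other components is shown below; it keeps the same assumption that the text is in 'dd/mm/yyyy H:M' order, and the name tm is illustrative.

```r
df1 <- data.frame(
  Value = c(12.2, 11.2, 10.2),
  Time = c("1/1/2016 1:00", "1/1/2016 2:00", "1/1/2016 3:00")
)

# POSIXlt stores the components as fields; keep it outside the data frame.
tm <- as.POSIXlt(df1$Time, format = "%d/%m/%Y %H:%M", tz = "UTC")

df1$day <- tm$mday          # day of the month, 1-31
df1$month <- tm$mon + 1     # mon is 0-based
df1$year <- tm$year + 1900  # year is stored as years since 1900
df1$hour <- tm$hour
df1
```

Keeping the POSIXlt object in a separate variable avoids storing a list-based date-time class as a data frame column.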
I have the following data frame in R.
DF
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
Using this data frame, I want to break down the count and sum by month and by time bucket, based on Datetime.
Required Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 05:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15
You can bin the hours of the day by using hour from the lubridate package and then cut from base R, before summarizing with dplyr.
Here, I am assuming that your Datetime column is actually in a date-time format and not just a character string or factor. If it is a character string or factor, convert it first with DF$Datetime <- as.POSIXct(as.character(DF$Datetime)).
library(tidyverse)
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 5.99, 11.99, 14.99, 19.99, 24))
levels(DF$bins) <- c("00:00 to 05:59", "06:00 to 11:59", "12:00 to 14:59",
"15:00 to 19:59", "20:00 to 23:59")
newDF <- DF %>%
group_by(bins, .drop = FALSE) %>%
summarise(Count = length(Value), Total = sum(Value))
This gives the following result:
newDF
#> # A tibble: 5 x 3
#> bins Count Total
#> <fct> <int> <dbl>
#> 1 00:00 to 05:59 2 45
#> 2 06:00 to 11:59 0 0
#> 3 12:00 to 14:59 1 20
#> 4 15:00 to 19:59 3 35
#> 5 20:00 to 23:59 1 15
And if you want to add January as a first row (though I'm not sure how much sense this makes in this context) you could do:
newDF %>%
summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>% bind_rows(newDF)
#> # A tibble: 6 x 3
#> bins Count Total
#> <chr> <int> <dbl>
#> 1 January 7 115
#> 2 00:00 to 05:59 2 45
#> 3 06:00 to 11:59 0 0
#> 4 12:00 to 14:59 1 20
#> 5 15:00 to 19:59 3 35
#> 6 20:00 to 23:59 1 15
Incidentally, the reproducible version of the data I used for this was:
structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = ""), Value = c(10,
20, 25, 20, 10, 15, 15)), class = "data.frame", row.names = c(NA,
-7L))
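Since the hour is a whole number, the fractional break points (5.99, 11.99, ...) can be replaced by integer breaks with left-closed intervals. The sketch below is one way to do that in base R; the hours are typed in from the Datetime column as displayed in the question rather than computed from the POSIXct values (so the result does not depend on the local time zone), and the names hours, value, counts and totals are illustrative.

```r
# Hours of the day read off the Datetime column shown in the question.
hours <- c(15, 0, 3, 14, 18, 20, 15)
value <- c(10, 20, 25, 20, 10, 15, 15)

# right = FALSE makes each bin closed on the left: [0, 6), [6, 12), ...
bins <- cut(
  hours,
  breaks = c(0, 6, 12, 15, 20, 24),
  right = FALSE,
  labels = c("00:00 to 05:59", "06:00 to 11:59", "12:00 to 14:59",
             "15:00 to 19:59", "20:00 to 23:59")
)

counts <- table(bins)          # rows per bin, empty bins included as 0
totals <- xtabs(value ~ bins)  # sum of value per bin, empty bins as 0
counts
totals
```

Both table() and xtabs() keep factor levels with no observations, which plays the same role as .drop = FALSE in the dplyr version.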
I have a data frame with a datetime column. I want to know the number of rows by hour of the day. However, I care only about the rows between 8 AM and 10 PM.
The lubridate package requires us to filter hours of the day using the 24-hour convention.
library(tidyverse)
library(lubridate)
### Fake Data with Date-time ----
x <- seq.POSIXt(as.POSIXct('1999-01-01'), as.POSIXct('1999-02-01'), length.out=1000)
df <- data.frame(myDateTime = x)
### Get all rows between 8 AM and 10 PM (inclusive)
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= 8, myHour <= 22) %>% ## between 8 AM and 10 PM (both inclusive)
count(myHour) ## number of rows
Is there a way for me to use 10:00 PM rather than the integer 22?
You can use the ymd_hm and hour functions to do 12-hour to 24-hour conversions.
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= hour(ymd_hm("2000-01-01 8:00 AM")), ## hour() ignores year, month, date
myHour <= hour(ymd_hm("2000-01-01 10:00 PM"))) %>% ## between 8 AM and 10 PM (both inclusive)
count(myHour)
A more elegant solution.
## custom function to convert 12 hour time to 24 hour time
hourOfDay_12to24 <- function(time12hrFmt){
out <- paste("2000-01-01", time12hrFmt)
out <- hour(ymd_hm(out))
out
}
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= hourOfDay_12to24("8:00 AM"),
myHour <= hourOfDay_12to24("10:00 PM")) %>% ## between 8 AM and 10 PM (both inclusive)
count(myHour)
You can also use base R to do this
#Extract the hour
df$hour_day <- as.numeric(format(df$myDateTime, "%H"))
#Subset data between 08:00 AM and 10:00 PM
from_hr <- as.integer(format(as.POSIXct("08:00 AM", format = "%I:%M %p"), "%H"))
to_hr <- as.integer(format(as.POSIXct("10:00 PM", format = "%I:%M %p"), "%H"))
new_df <- df[df$hour_day >= from_hr & df$hour_day <= to_hr, ]
#Count the frequency
stack(table(new_df$hour_day))
# values ind
#1 42 8
#2 42 9
#3 41 10
#4 42 11
#5 42 12
#6 41 13
#7 42 14
#8 41 15
#9 42 16
#10 42 17
#11 41 18
#12 42 19
#13 42 20
#14 41 21
#15 42 22
This gives the same output as the tidyverse/lubridate approach
library(tidyverse)
library(lubridate)
df %>%
mutate(myHour = hour(myDateTime)) %>%
filter(myHour >= hour(ymd_hm("2000-01-01 8:00 AM")),
myHour <= hour(ymd_hm("2000-01-01 10:00 PM"))) %>%
count(myHour)
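For reference, the 12-hour to 24-hour conversion itself can be checked in isolation, including the two edge cases around noon and midnight. The helper below is a hypothetical base R sketch (hour_12_to_24 is not a lubridate or base function), and it assumes a locale in which %p recognises "AM"/"PM".

```r
# Hypothetical helper: convert 12-hour clock text to the hour on a 24-hour clock.
hour_12_to_24 <- function(x) {
  parsed <- strptime(x, format = "%I:%M %p", tz = "UTC")
  parsed$hour
}

# 12:00 AM is hour 0 and 12:00 PM is hour 12.
hour_12_to_24(c("8:00 AM", "10:00 PM", "12:00 AM", "12:00 PM"))
```

The helper is vectorised, so it can be used directly inside filter() in place of the literal 8 and 22.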